DataSHIELD Training Part 3: Assign functions and tables
Prerequisites
It is recommended that you familiarise yourself with R first by sitting our Introduction to R tutorial.
You also need the DataSHIELD training environment installed on your machine; see our Installation Instructions for Linux, Windows, or Mac.
Help
DataSHIELD support is freely available from the DataSHIELD community in the DataSHIELD forum. Please use this as the first port of call for any problems you may be having; it is monitored closely for new threads.
Bespoke DataSHIELD user support and user training classes are offered on a fee-paying basis. Please enquire at datashield@newcastle.ac.uk for current prices.
Introduction
This is the third in a 6-part DataSHIELD tutorial series.
The other parts in this DataSHIELD tutorial series are:
5: Subsetting
6: Modelling
Quick reminder for logging in:
Descriptive statistics: assigning variables
So far all the functions in the tutorial have returned something to the screen. Some functions (assign functions) create new objects in the server-side R session that are required for analysis but do not return anything to the client screen. For example, in analysis the log values of a variable may be required.
- By default the function ds.log computes the natural logarithm. It is possible to compute a different logarithm by setting the argument base to a different value. Nothing is returned to the client screen apart from progress messages:
ds.log(x='DST$LAB_HDL', datasources = connections)
Aggregated (exists("DST")) [=============================================================] 100% / 0s
Aggregated (classDS("DST$LAB_HDL")) [====================================================] 100% / 1s
Assigned expr. (log.newobj <- log(DST$LAB_HDL,2.71828182845905)) [=======================] 100% / 0s
Aggregated (exists("log.newobj")) [====================================================] 100% / 0s
- In the above example the name of the new object was not specified. As the progress messages show, the new object was then created under the default name log.newobj.
- It is possible to customise the name of the new object by using the newobj argument:
ds.log(x='DST$LAB_HDL', newobj='LAB_HDL_log', datasources = connections)
- The new object is not attached to the data frame of assigned variables (named DST in this tutorial). We can check the size of the new LAB_HDL_log vector we generated above; the command should return the same figure as the number of rows in the data frame DST.
ds.length(x='LAB_HDL_log', datasources = connections)
Aggregated (lengthDS("LAB_HDL_log")) [=================================================] 100% / 0s
$`length of LAB_HDL_log in study1`
[1] 2163

$`length of LAB_HDL_log in study2`
[1] 3088

$`total length of LAB_HDL_log in all studies combined`
[1] 5251
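The effect of ds.log can be mimicked locally in base R, which may help when deciding which base to request. The following is a minimal sketch only: the vector lab_hdl and its values are made up for illustration and stand in for the server-side variable, which is never visible to the client.

```r
# Made-up stand-in for the server-side variable DST$LAB_HDL (illustration only)
lab_hdl <- c(1.2, 1.8, NA, 0.9, 2.1)

# Natural logarithm, the default of base R's log() and of ds.log
lab_hdl_log <- log(lab_hdl)

# A different base, analogous to setting the base argument of ds.log
lab_hdl_log10 <- log(lab_hdl, base = 10)

# The transformed vector has one entry per input entry (missing values stay
# missing), which is why its length matches the number of rows of the data frame
length(lab_hdl_log) == length(lab_hdl)
```

On the server, the same choice of base would be made through the base argument of ds.log described above.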
ds.assign
The ds.assign function enables the creation of new objects in the server-side R session to be used in later analysis. ds.assign can be used to evaluate simple expressions passed on to its argument toAssign and assign the output of the evaluation to a new object.
- Using ds.assign we subtract the pooled mean calculated earlier from LAB_HDL (mean centring) and assign the output to a new variable called LAB_HDL.c. The function returns no output to the client screen; the newly created variable is stored server-side.
ds.assign(toAssign='DST$LAB_HDL-1.562', newobj='LAB_HDL.c', datasources = connections)
Further DataSHIELD functions can now be run on this new mean-centred variable LAB_HDL.c. The example below calculates the mean of the new variable LAB_HDL.c, which should be approximately 0.
ds.mean(x='LAB_HDL.c', datasources = connections)
Aggregated (meanDS(LAB_HDL.c)) [=======================================================] 100% / 0s
$Mean.by.Study
       EstimatedMean Nmissing Nvalid Ntotal
study1   0.007416316      360   1803   2163
study2  -0.005352231      555   2533   3088

$Nstudies
[1] 2

$ValidityMessage
       ValidityMessage
study1 "VALID ANALYSIS"
study2 "VALID ANALYSIS"
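The per-study means above are each slightly off zero because the value subtracted was the pooled mean, not each study's own mean. As a rough check, the pooled mean of the centred variable can be recovered from the reported figures by weighting each study mean by its number of valid observations. This is a sketch in base R using only the numbers printed above; the object names are chosen for illustration.

```r
# Per-study estimates copied from the ds.mean output above
est_mean <- c(study1 = 0.007416316, study2 = -0.005352231)
n_valid  <- c(study1 = 1803, study2 = 2533)

# Pooled mean: study means weighted by their number of valid observations
pooled_mean <- sum(est_mean * n_valid) / sum(n_valid)
pooled_mean
```

The result is much closer to 0 than either study mean, which is what mean centring on the pooled value should give.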
Contingency tables
The function ds.table creates contingency tables of categorical variables. The default is set to run on pooled data from all studies; to obtain output for each study, set the argument type='split'.
- The example below calculates a one-dimensional table for the variable GENDER. The function returns the counts and the row and column proportions per category, as well as information about the validity of the variable in each study dataset:
ds.table(rvar="DST$GENDER")
Aggregated (asFactorDS1("DST$GENDER")) [=================================================] 100% / 0s
Aggregated (tableDS(rvar.transmit = "DST$GENDER", cvar.transmit = NULL, stvar.transmit = NULL, ) ...

Data in all studies were valid

Study 1 : No errors reported from this study
Study 2 : No errors reported from this study
$output.list
$output.list$TABLE_rvar.by.study_row.props
          study
DST$GENDER         1         2
         0 0.4079193 0.5920807
         1 0.4160839 0.5839161

$output.list$TABLE_rvar.by.study_col.props
          study
DST$GENDER         1         2
         0 0.5048544 0.5132772
         1 0.4951456 0.4867228

$output.list$TABLE_rvar.by.study_counts
          study
DST$GENDER    1    2
         0 1092 1585
         1 1071 1503

$output.list$TABLES.COMBINED_all.sources_proportions
DST$GENDER
   0    1
0.51 0.49

$output.list$TABLES.COMBINED_all.sources_counts
DST$GENDER
   0    1
2677 2574

$validity.message
[1] "Data in all studies were valid"
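To see how the row and column proportions relate to the counts, the GENDER by study counts reported above can be typed into a local matrix and passed to base R's prop.table. This is only an illustrative sketch on the client side; it uses the aggregate counts already returned and touches no server-side data.

```r
# Counts copied from TABLE_rvar.by.study_counts above (rows: GENDER, columns: study)
counts <- matrix(c(1092, 1071, 1585, 1503), nrow = 2,
                 dimnames = list(GENDER = c("0", "1"), study = c("1", "2")))

# Row proportions: how each GENDER category is split across the studies
row_props <- prop.table(counts, margin = 1)

# Column proportions: the GENDER split within each study
col_props <- prop.table(counts, margin = 2)

# Combined counts and proportions across both studies
combined_counts <- rowSums(counts)
combined_props  <- round(combined_counts / sum(counts), 2)

row_props
col_props
combined_props
```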
In DataSHIELD tabulated data are flagged as invalid if one or more cells have a count of between 1 and the minimal cell count allowed by the data providers. For example, data providers may only allow cell counts ≥ 3.
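The rule can be sketched in a few lines of base R. The helper table_is_valid and its argument min_cell_count are hypothetical names introduced here for illustration; this is not the check that DataSHIELD itself runs on the server, only a restatement of the rule described above.

```r
# Hypothetical helper (not part of any DataSHIELD package): returns TRUE when
# no cell is non-empty yet below the minimum cell count
table_is_valid <- function(counts, min_cell_count = 3) {
  # A cell is potentially disclosive when its count is at least 1
  # but smaller than the minimum allowed by the data provider
  disclosive <- counts > 0 & counts < min_cell_count
  !any(disclosive)
}

table_is_valid(c(1092, 1071))  # all cells far above the threshold
table_is_valid(c(120, 2, 0))   # the cell with count 2 falls in the flagged range
```

Note that empty cells (count 0) are not flagged under this rule; only small non-zero counts are.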
The function ds.table also creates two-dimensional contingency tables of two categorical variables. The example below constructs a two-dimensional table comprising the cross-tabulation of the variables DIS_DIAB (diabetes status) and GENDER.
ds.table(rvar='DST$DIS_DIAB', cvar='DST$GENDER', datasources = connections)
Aggregated (asFactorDS1("DST$DIS_DIAB")) [===============================================] 100% / 0s
Aggregated (asFactorDS1("DST$GENDER")) [=================================================] 100% / 0s
Aggregated (tableDS(rvar.transmit = "DST$DIS_DIAB", cvar.transmit = "DST$GENDER", ) [======] 100% / 0s

Data in all studies were valid

Study 1 : No errors reported from this study
Study 2 : No errors reported from this study
$output.list
$output.list$TABLE.STUDY.1_row.props
            DST$GENDER
DST$DIS_DIAB     0     1
           0 0.502 0.498
           1 0.700 0.300

$output.list$TABLE.STUDY.1_col.props
            DST$GENDER
DST$DIS_DIAB      0      1
           0 0.9810 0.9920
           1 0.0192 0.0084

$output.list$TABLE.STUDY.2_row.props
            DST$GENDER
DST$DIS_DIAB     0     1
           0 0.511 0.489
           1 0.660 0.340

$output.list$TABLE.STUDY.2_col.props
            DST$GENDER
DST$DIS_DIAB      0      1
           0 0.9800 0.9890
           1 0.0196 0.0106

$output.list$TABLES.COMBINED_all.sources_row.props
            DST$GENDER
DST$DIS_DIAB     0     1
           0 0.507 0.493
           1 0.675 0.325

$output.list$TABLES.COMBINED_all.sources_col.props
            DST$GENDER
DST$DIS_DIAB      0       1
           0 0.9810 0.99000
           1 0.0194 0.00971

$output.list$TABLE_STUDY.1_counts
            DST$GENDER
DST$DIS_DIAB    0    1
           0 1071 1062
           1   21    9

$output.list$TABLE_STUDY.2_counts
            DST$GENDER
DST$DIS_DIAB    0    1
           0 1554 1487
           1   31   16

$output.list$TABLES.COMBINED_all.sources_counts
            DST$GENDER
DST$DIS_DIAB    0    1
           0 2625 2549
           1   52   25

$validity.message
[1] "Data in all studies were valid"
The function can additionally compute a chi-squared test for homogeneity on (nc-1)*(nr-1) degrees of freedom (where nc is the number of columns and nr is the number of rows):
ds.table(rvar='DST$DIS_DIAB', cvar='DST$GENDER', datasources = connections, report.chisq.tests = TRUE)
The output below omits the first section, which exactly duplicates the output above; only the chi-squared test reports are shown:
$chisq.tests
$chisq.tests$chisq.test_TABLE.STUDY.1_counts

Pearson's Chi-squared test with Yates' continuity correction

data: input.array.source.specific
X-squared = 3.8767, df = 1, p-value = 0.04896

$chisq.tests$chisq.test_TABLE.STUDY.2_counts

Pearson's Chi-squared test with Yates' continuity correction

data: input.array.source.specific
X-squared = 3.5158, df = 1, p-value = 0.06079

$chisq.tests$chisq.test_TABLES.COMBINED_all.sources_counts

Pearson's Chi-squared test with Yates' continuity correction

data: combine.array.all.sources
X-squared = 7.9078, df = 1, p-value = 0.004922

$validity.message
[1] "Data in all studies were valid"
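Because the combined counts are non-disclosive aggregates that have already been returned to the client, the combined test can be cross-checked locally with base R's chisq.test. This is a sketch for illustration; the matrix below is simply the TABLES.COMBINED_all.sources_counts table typed in by hand.

```r
# Combined counts copied from the ds.table output (rows: DIS_DIAB, columns: GENDER)
combined <- matrix(c(2625, 52, 2549, 25), nrow = 2,
                   dimnames = list(DIS_DIAB = c("0", "1"), GENDER = c("0", "1")))

# Degrees of freedom: (number of columns - 1) * (number of rows - 1)
df_expected <- (ncol(combined) - 1) * (nrow(combined) - 1)

# chisq.test applies Yates' continuity correction to 2x2 tables by default
res <- chisq.test(combined)

res$statistic  # compare with X-squared reported for the combined table
res$parameter  # degrees of freedom, should equal df_expected
res$p.value    # compare with the p-value reported for the combined table
```

For a 2 by 2 table the formula gives a single degree of freedom, matching the df = 1 shown in each report above.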
Conclusion
The other parts in this DataSHIELD tutorial series are:
5: Subsetting
6: Modelling
Also remember you can:
- get a function list for any DataSHIELD package
- view the manual help page for individual functions
- print analyses to file (.csv, .txt, .pdf, .png) in the DataSHIELD test environment
- take a look at our FAQ page for solutions to common problems such as Changing variable class to use in a specific DataSHIELD function
- get support from our DataSHIELD forum
DataSHIELD Wiki by DataSHIELD is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.datashield.ac.uk/wiki