Before starting, we recommend that you familiarise yourself with R by working through our Introduction to R tutorial. This tutorial also requires the DataSHIELD training environment to be installed on your machine; see our Installation Instructions for Linux, Windows, or Mac.
Free DataSHIELD support is available from the DataSHIELD community in the DataSHIELD forum. Please use the forum as the first port of call for any problems you may be having; it is monitored closely for new threads. Bespoke DataSHIELD user support and user training classes are offered on a fee-paying basis. Please enquire at datashield@newcastle.ac.uk for current prices.
This is the third in a 6-part DataSHIELD tutorial series.
The other parts in this DataSHIELD tutorial series are:
5: Sub-setting
6: Modelling
Recall from the installation instructions, the Opal web interface is a simple check to tell if the VMs have started. Load the following URLs, waiting at least 1 minute after starting the training VMs.
Start R/RStudio
Load Packages
Build your login dataframe
So far, all the functions in the tutorial have returned something to the screen. Some functions (assign functions) create new objects in the server-side R session that are required for analysis but do not return anything to the client screen. For example, an analysis may require the log values of a variable.
The function ds.log computes the natural logarithm by default. A different logarithm can be computed by setting the argument base to a different value. Apart from progress messages, nothing is returned to the screen:

    ds.log(x='D$LAB_HDL', datasources = connections)

    Aggregated (exists("D")) [=============================================================] 100% / 0s
    Aggregated (classDS("D$LAB_HDL")) [====================================================] 100% / 1s
    Assigned expr. (log.newobj <- log(D$LAB_HDL,2.71828182845905)) [=======================] 100% / 0s
    Aggregated (exists("log.newobj")) [====================================================] 100% / 0s

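The "Assigned expr." progress line suggests that the server evaluates R's own log(x, base). As a purely local sketch (plain R on made-up values, not DataSHIELD code; the vector hdl is hypothetical), the following shows how the default natural logarithm relates to a logarithm in another base, which is the effect a different base value would have:

```r
# Hypothetical HDL-like values; these are NOT taken from the study data.
hdl <- c(0.8, 1.2, 1.56, 2.1)

# Default behaviour: natural logarithm (base e).
log_natural <- log(hdl, base = exp(1))

# A different base, for example 10.
log_base10 <- log(hdl, base = 10)

# Change-of-base identity: log10(x) = ln(x) / ln(10).
stopifnot(isTRUE(all.equal(log_base10, log_natural / log(10))))
```

Because the two results differ only by a constant factor, the choice of base changes the scale of the transformed variable but not its shape.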
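By default the new object is named log.newobj, as the progress messages show. A different name (here 'LAB_HDL_log') can be set with the newobj argument:

    ds.log(x='D$LAB_HDL', newobj='LAB_HDL_log', datasources = connections)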
The new object is not attached to the data frame (default name "D"). We can check the size of the new LAB_HDL_log vector we generated above; the command should return the same figure as the number of rows in the data frame 'D'.

    ds.length(x='LAB_HDL_log', datasources = connections)

    Aggregated (lengthDS("LAB_HDL_log")) [=================================================] 100% / 0s
    $`length of LAB_HDL_log in study1`
    [1] 2163

    $`length of LAB_HDL_log in study2`
    [1] 3088

    $`total length of LAB_HDL_log in all studies combined`
    [1] 5251

Using ds.assign, we subtract the pooled mean calculated earlier (1.562) from LAB_HDL (mean centring) and assign the output to a new variable called LAB_HDL.c. The function returns no output to the client screen; the newly created variable is stored server-side.

    ds.assign(toAssign='D$LAB_HDL-1.562', newobj='LAB_HDL.c', datasources = connections)
Further DataSHIELD functions can now be run on this new mean-centred variable LAB_HDL.c. The example below calculates the mean of the new variable LAB_HDL.c, which should be approximately 0.

    ds.mean(x='LAB_HDL.c', datasources = connections)

    Aggregated (meanDS(LAB_HDL.c)) [=======================================================] 100% / 0s
    $Mean.by.Study
           EstimatedMean Nmissing Nvalid Ntotal
    study1   0.007416316      360   1803   2163
    study2  -0.005352231      555   2533   3088

    $Nstudies
    [1] 2

    $ValidityMessage
           ValidityMessage
    study1 "VALID ANALYSIS"
    study2 "VALID ANALYSIS"

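The per-study means above are each slightly away from zero because the variable was centred on the pooled mean, not on each study's own mean. As a local sketch in plain R (not a DataSHIELD call), the pooled mean of LAB_HDL.c can be recovered from the printed figures by weighting each study mean by its number of valid observations; the variable names below are illustrative:

```r
# Per-study estimates copied from the ds.mean output above.
study_means <- c(study1 = 0.007416316, study2 = -0.005352231)
n_valid     <- c(study1 = 1803, study2 = 2533)

# Pooled mean = study means weighted by the number of valid observations.
pooled_mean <- weighted.mean(study_means, w = n_valid)

# Very close to zero, as expected after centring on the pooled mean.
stopifnot(abs(pooled_mean) < 1e-3)
print(pooled_mean)
```

The small remaining difference from zero comes from the pooled mean having been rounded to 1.562 before it was subtracted.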
The function ds.table creates contingency tables of categorical variables. By default it runs on pooled data from all studies; to obtain output for each study, set the argument type='split'.
The example below creates a one-dimensional table of the variable GENDER. The function returns the counts and the column and row proportions per category, as well as information about the validity of the variable in each study dataset:

    ds.table(rvar="D$GENDER")

    Aggregated (asFactorDS1("D$GENDER")) [=================================================] 100% / 0s
    Aggregated (tableDS(rvar.transmit = "D$GENDER", cvar.transmit = NULL, stvar.transmit = NULL, ) ...

    Data in all studies were valid

    Study 1 : No errors reported from this study
    Study 2 : No errors reported from this study

    $output.list
    $output.list$TABLE_rvar.by.study_row.props
            study
    D$GENDER         1         2
           0 0.4079193 0.5920807
           1 0.4160839 0.5839161

    $output.list$TABLE_rvar.by.study_col.props
            study
    D$GENDER         1         2
           0 0.5048544 0.5132772
           1 0.4951456 0.4867228

    $output.list$TABLE_rvar.by.study_counts
            study
    D$GENDER    1    2
           0 1092 1585
           1 1071 1503

    $output.list$TABLES.COMBINED_all.sources_proportions
    D$GENDER
       0    1
    0.51 0.49

    $output.list$TABLES.COMBINED_all.sources_counts
    D$GENDER
       0    1
    2677 2574

    $validity.message
    [1] "Data in all studies were valid"

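To make the relationship between the counts and the proportion tables explicit, the sketch below rebuilds them locally in plain R from the by-study counts printed above. It is an illustration only, not how DataSHIELD computes them server-side, and the object names are made up for this example:

```r
# By-study counts copied from the ds.table output above
# (rows: GENDER category, columns: study).
counts <- matrix(c(1092, 1071, 1585, 1503), nrow = 2,
                 dimnames = list(GENDER = c("0", "1"), study = c("1", "2")))

# Row proportions: how each GENDER category is split across studies.
row_props <- prop.table(counts, margin = 1)

# Column proportions: the GENDER split within each study.
col_props <- prop.table(counts, margin = 2)

# Combined counts and proportions across both studies.
combined_counts <- rowSums(counts)
combined_props  <- round(combined_counts / sum(counts), 2)

print(row_props)
print(col_props)
print(combined_counts)
print(combined_props)
```

Each row of row_props sums to 1 and each column of col_props sums to 1, which is a quick way to tell the two tables apart in the output.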
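In DataSHIELD, tabulated data are flagged as invalid if one or more cells contain a count below the minimum cell count permitted by the data provider.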
The function ds.table also creates two-dimensional contingency tables of two categorical variables. The example below constructs a two-dimensional table comprising a cross-tabulation of the variables DIS_DIAB (diabetes status) and GENDER.

    ds.table(rvar='D$DIS_DIAB', cvar='D$GENDER', datasources = connections)

    Aggregated (asFactorDS1("D$DIS_DIAB")) [===============================================] 100% / 0s
    Aggregated (asFactorDS1("D$GENDER")) [=================================================] 100% / 0s
    Aggregated (tableDS(rvar.transmit = "D$DIS_DIAB", cvar.transmit = "D$GENDER", ) [======] 100% / 0s

    Data in all studies were valid

    Study 1 : No errors reported from this study
    Study 2 : No errors reported from this study

    $output.list
    $output.list$TABLE.STUDY.1_row.props
              D$GENDER
    D$DIS_DIAB     0     1
             0 0.502 0.498
             1 0.700 0.300

    $output.list$TABLE.STUDY.1_col.props
              D$GENDER
    D$DIS_DIAB      0      1
             0 0.9810 0.9920
             1 0.0192 0.0084

    $output.list$TABLE.STUDY.2_row.props
              D$GENDER
    D$DIS_DIAB     0     1
             0 0.511 0.489
             1 0.660 0.340

    $output.list$TABLE.STUDY.2_col.props
              D$GENDER
    D$DIS_DIAB      0      1
             0 0.9800 0.9890
             1 0.0196 0.0106

    $output.list$TABLES.COMBINED_all.sources_row.props
              D$GENDER
    D$DIS_DIAB     0     1
             0 0.507 0.493
             1 0.675 0.325

    $output.list$TABLES.COMBINED_all.sources_col.props
              D$GENDER
    D$DIS_DIAB      0       1
             0 0.9810 0.99000
             1 0.0194 0.00971

    $output.list$TABLE_STUDY.1_counts
              D$GENDER
    D$DIS_DIAB    0    1
             0 1071 1062
             1   21    9

    $output.list$TABLE_STUDY.2_counts
              D$GENDER
    D$DIS_DIAB    0    1
             0 1554 1487
             1   31   16

    $output.list$TABLES.COMBINED_all.sources_counts
              D$GENDER
    D$DIS_DIAB    0    1
             0 2625 2549
             1   52   25

    $validity.message
    [1] "Data in all studies were valid"

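The combined tables in this output are consistent with adding the per-study count tables cell by cell. The following local sketch in plain R (illustrative object names, not DataSHIELD code) rebuilds the combined counts and the combined row proportions from the two study tables printed above:

```r
# Per-study counts copied from the ds.table output above
# (rows: DIS_DIAB, columns: GENDER).
study1 <- matrix(c(1071, 21, 1062, 9), nrow = 2,
                 dimnames = list(DIS_DIAB = c("0", "1"), GENDER = c("0", "1")))
study2 <- matrix(c(1554, 31, 1487, 16), nrow = 2,
                 dimnames = dimnames(study1))

# Pooling the two studies amounts to adding the tables cell by cell.
combined <- study1 + study2

# Row proportions of the pooled table, rounded as in the printed output.
combined_row_props <- round(prop.table(combined, margin = 1), 3)

print(combined)
print(combined_row_props)
```

Only cell counts are needed for this step, which is why pooled tables can be produced without any individual-level data leaving the studies.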
The function can additionally compute a chi-squared test for homogeneity on (nc-1)*(nr-1) degrees of freedom (where nc is the number of columns and nr is the number of rows).
The output below omits the first section, which exactly duplicates the tables above; only the chi-squared reports are shown:

    $chisq.tests
    $chisq.tests$chisq.test_TABLE.STUDY.1_counts

        Pearson's Chi-squared test with Yates' continuity correction

    data: input.array.source.specific
    X-squared = 3.8767, df = 1, p-value = 0.04896

    $chisq.tests$chisq.test_TABLE.STUDY.2_counts

        Pearson's Chi-squared test with Yates' continuity correction

    data: input.array.source.specific
    X-squared = 3.5158, df = 1, p-value = 0.06079

    $chisq.tests$chisq.test_TABLES.COMBINED_all.sources_counts

        Pearson's Chi-squared test with Yates' continuity correction

    data: combine.array.all.sources
    X-squared = 7.9078, df = 1, p-value = 0.004922

    $validity.message
    [1] "Data in all studies were valid"

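These statistics can be checked locally by passing the printed count tables to base R's chisq.test, which applies Yates' continuity correction to 2x2 tables by default. This is a sketch for checking the arithmetic, not the DataSHIELD call that produced the output, and the object names are illustrative:

```r
# Count tables copied from the two-dimensional ds.table output
# (rows: DIS_DIAB, columns: GENDER).
study1 <- matrix(c(1071, 21, 1062, 9), nrow = 2,
                 dimnames = list(DIS_DIAB = c("0", "1"), GENDER = c("0", "1")))
study2 <- matrix(c(1554, 31, 1487, 16), nrow = 2,
                 dimnames = dimnames(study1))
combined <- study1 + study2

# Degrees of freedom: (nc - 1) * (nr - 1), which is 1 for a 2x2 table.
df_expected <- (ncol(study1) - 1) * (nrow(study1) - 1)

# chisq.test() uses Yates' continuity correction for 2x2 tables by default.
test_study1   <- chisq.test(study1)
test_study2   <- chisq.test(study2)
test_combined <- chisq.test(combined)

print(test_study1)
print(test_study2)
print(test_combined)
```

Note that the combined test is run on the summed table, so its statistic is not the sum of the two per-study statistics.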