DataSHIELD Training Part 3: Assign functions and tables


Prerequisites

It is recommended that you familiarise yourself with R first by sitting our Introduction to R tutorial.

This tutorial also requires the DataSHIELD training environment to be installed on your machine; see our Installation Instructions for Linux, Windows, or Mac.

Help

DataSHIELD support is freely available from the DataSHIELD community in the DataSHIELD forum. Please use this as the first port of call for any problems you may be having; it is monitored closely for new threads.

DataSHIELD bespoke user support and also user training classes are offered on a fee-paying basis. Please enquire at datashield@newcastle.ac.uk for current prices. 

Introduction

This is the third in a 6-part DataSHIELD tutorial series. 

The other parts in this DataSHIELD tutorial series are:

Quick reminder for logging in:

Start R/RStudio

Load Packages

# load libraries
library(DSI)
library(DSOpal)
library(dsBaseClient)

Build your login dataframe 

# create a login builder and add one entry per server
builder <- DSI::newDSLoginBuilder()
builder$append(server = "server1", url = "https://opal-demo.obiba.org/", user = "dsuser", password = "P@ssw0rd", driver = "OpalDriver", options='list(ssl_verifyhost=0, ssl_verifypeer=0)')
builder$append(server = "server2", url = "https://opal-demo.obiba.org/", user = "dsuser", password = "P@ssw0rd", driver = "OpalDriver", options='list(ssl_verifyhost=0, ssl_verifypeer=0)')
logindata <- builder$build()
# log in and assign the tables to the server-side symbol "DST"
connections <- DSI::datashield.login(logins = logindata, assign = TRUE)
DSI::datashield.assign.table(conns = connections, symbol = "DST", table = c("CNSIM.CNSIM1","CNSIM.CNSIM2"))
  • Command to logout:

DSI::datashield.logout(connections)

Descriptive statistics: assigning variables

So far all the functions in the tutorial have returned something to the screen. Some functions (assign functions) create new objects in the server-side R session that are required for analysis but do not return anything to the client screen. For example, an analysis may require the log values of a variable.

  • By default the function ds.log computes the natural logarithm. It is possible to compute a different logarithm by setting the argument base to a different value. There is no output to screen:

ds.log(x='DST$LAB_HDL', datasources = connections)
Aggregated (exists("DST")) [=============================================================] 100% / 0s
Aggregated (classDS("DST$LAB_HDL")) [====================================================] 100% / 1s
Assigned expr. (log.newobj <- log(DST$LAB_HDL,2.71828182845905)) [=======================] 100% / 0s
Aggregated (exists("log.newobj")) [====================================================] 100% / 0s
  • In the above example the name of the new object was not specified. By default the new object is named 'log.newobj', as the server-side expression in the output above shows.

  • It is possible to customise the name of the new object by using the newobj argument:

ds.log(x='DST$LAB_HDL', newobj='LAB_HDL_log', datasources = connections)
  • The new object is not attached to the data frame of assigned variables (named 'DST' in this tutorial; the default name is 'D'). We can check the length of the new LAB_HDL_log vector generated above; the command should return the same figure as the number of rows in the data frame 'DST'.

ds.length(x='LAB_HDL_log', datasources = connections)
Aggregated (lengthDS("LAB_HDL_log")) [=================================================] 100% / 0s
$`length of LAB_HDL_log in study1`
[1] 2163

$`length of LAB_HDL_log in study2`
[1] 3088

$`total length of LAB_HDL_log in all studies combined`
[1] 5251
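As a rough local analogue of the steps above, the following base R sketch (plain R run on the client, not DataSHIELD) illustrates what changing the base of the logarithm does and why the derived vector has the same length as the number of rows in the source data frame. The data frame toy and its values are made up purely for illustration.

```r
# Hypothetical toy data standing in for the server-side data frame
toy <- data.frame(LAB_HDL = c(1.2, 1.6, NA, 2.1))

# Natural logarithm: base e, as in the server-side expression shown above
hdl_log <- log(toy$LAB_HDL, base = exp(1))

# A different base, analogous to setting the base argument of ds.log
hdl_log10 <- log(toy$LAB_HDL, base = 10)

# The new vector is a separate object; the data frame still has one column
names(toy)

# One value per row, and missing values stay missing
length(hdl_log) == nrow(toy)
is.na(hdl_log)
```

The same reasoning explains the ds.length check: the log is computed element by element, so the new vector has one entry per row, including rows where the input is missing.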

ds.assign

The ds.assign function enables the creation of new objects in the server-side R session to be used in later analysis. ds.assign can be used to evaluate simple expressions passed on to its argument toAssign and assign the output of the evaluation to a new object.

  • Using ds.assign we subtract the pooled mean calculated earlier from LAB_HDL (mean centring) and assign the output to a new variable called LAB_HDL.c. The function returns no output to the client screen; the newly created variable is stored server-side.

ds.assign(toAssign='DST$LAB_HDL-1.562', newobj='LAB_HDL.c', datasources = connections)

Further DataSHIELD functions can now be run on this new mean-centred variable LAB_HDL.c. The example below calculates the mean of the new variable LAB_HDL.c which should be approximately 0.

ds.mean(x='LAB_HDL.c', datasources = connections)
Aggregated (meanDS(LAB_HDL.c)) [=======================================================] 100% / 0s
$Mean.by.Study
       EstimatedMean Nmissing Nvalid Ntotal
study1   0.007416316      360   1803   2163
study2  -0.005352231      555   2533   3088

$Nstudies
[1] 2

$ValidityMessage
       ValidityMessage
study1 "VALID ANALYSIS"
study2 "VALID ANALYSIS"
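The per-study means above are not exactly zero because the value subtracted was the pooled mean, not each study's own mean. A minimal client-side sketch in base R, using only the figures printed above, combines the two study means weighted by their number of valid observations. This is a hand calculation for illustration and is not necessarily how DataSHIELD combines estimates internally; the variable names are chosen for this example.

```r
# Per-study estimates copied from the ds.mean output above
study_mean   <- c(study1 = 0.007416316, study2 = -0.005352231)
study_nvalid <- c(study1 = 1803, study2 = 2533)

# Pooled mean: study means weighted by the number of valid observations
pooled_mean <- sum(study_mean * study_nvalid) / sum(study_nvalid)

# The result should be very close to zero for a mean-centred variable
pooled_mean
```

The small remainder comes from rounding the pooled mean to 1.562 before subtracting it.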

Contingency tables

The function ds.table creates contingency tables of categorical variables. As the example output below shows, it reports results both for each study and for the pooled data from all studies.

  • The example below calculates a one-dimensional table for the variable GENDER. The function returns the counts and the row and column proportions per category, as well as information about the validity of the variable in each study dataset:

ds.table(rvar="DST$GENDER")
Aggregated (asFactorDS1("DST$GENDER")) [=================================================] 100% / 0s
Aggregated (tableDS(rvar.transmit = "DST$GENDER", cvar.transmit = NULL, stvar.transmit = NULL, ) ...

Data in all studies were valid

Study 1 : No errors reported from this study
Study 2 : No errors reported from this study
$output.list
$output.list$TABLE_rvar.by.study_row.props
          study
DST$GENDER         1         2
         0 0.4079193 0.5920807
         1 0.4160839 0.5839161

$output.list$TABLE_rvar.by.study_col.props
          study
DST$GENDER         1         2
         0 0.5048544 0.5132772
         1 0.4951456 0.4867228

$output.list$TABLE_rvar.by.study_counts
          study
DST$GENDER    1    2
         0 1092 1585
         1 1071 1503

$output.list$TABLES.COMBINED_all.sources_proportions
DST$GENDER
   0    1
0.51 0.49

$output.list$TABLES.COMBINED_all.sources_counts
DST$GENDER
   0    1
2677 2574

$validity.message
[1] "Data in all studies were valid"
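To see how the row and column proportions relate to the counts, the sketch below rebuilds them in base R from the counts printed above. It runs on the client with ordinary R functions and is only an illustration of the arithmetic, not the code DataSHIELD runs; the object names are chosen for this example.

```r
# Counts copied from TABLE_rvar.by.study_counts above (rows: GENDER, columns: study)
counts <- matrix(c(1092, 1071, 1585, 1503), nrow = 2,
                 dimnames = list(GENDER = c("0", "1"), study = c("1", "2")))

# Row proportions: share of each GENDER category found in each study
row_props <- prop.table(counts, margin = 1)

# Column proportions: share of each GENDER category within a study
col_props <- prop.table(counts, margin = 2)

# Pooled counts and proportions across both studies
pooled_counts <- rowSums(counts)
pooled_props  <- round(pooled_counts / sum(pooled_counts), 2)

row_props
col_props
pooled_counts
pooled_props
```

Row proportions sum to 1 across studies and column proportions sum to 1 within a study, which is a quick way to tell the two tables apart in the output.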

In DataSHIELD, tabulated data are flagged as invalid if one or more cells have a count between 1 and the minimal cell count allowed by the data providers. For example, data providers may only allow cell counts ≥ 3.
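The rule above can be sketched in base R as follows. The helper has_small_cells and its min_count argument are hypothetical names invented for this illustration; in DataSHIELD the check is performed server-side with a threshold set by the data providers, and this sketch does not reproduce that implementation.

```r
# Hypothetical helper (not part of DataSHIELD): TRUE if any cell count is
# non-zero but below the minimum cell count allowed
has_small_cells <- function(counts, min_count = 3) {
  any(counts > 0 & counts < min_count)
}

# GENDER counts for study 1 from the output above: every cell is large enough
valid_counts <- c(1092, 1071)

# Made-up counts with one cell of size 2, which is below the example minimum of 3
risky_counts <- c(120, 2, 0)

has_small_cells(valid_counts)
has_small_cells(risky_counts)
```

Empty cells are not flagged in this sketch because the rule as stated only concerns counts of at least 1.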

The function ds.table also creates two-dimensional contingency tables of categorical variables. The example below constructs a two-dimensional table comprising the cross-tabulation of the variables DIS_DIAB (diabetes status) and GENDER.

 

ds.table(rvar='DST$DIS_DIAB', cvar='DST$GENDER', datasources = connections)
Aggregated (asFactorDS1("DST$DIS_DIAB")) [===============================================] 100% / 0s
Aggregated (asFactorDS1("DST$GENDER")) [=================================================] 100% / 0s
Aggregated (tableDS(rvar.transmit = "DST$DIS_DIAB", cvar.transmit = "DST$GENDER", ) [======] 100% / 0s

Data in all studies were valid

Study 1 : No errors reported from this study
Study 2 : No errors reported from this study
$output.list
$output.list$TABLE.STUDY.1_row.props
            DST$GENDER
DST$DIS_DIAB     0     1
           0 0.502 0.498
           1 0.700 0.300

$output.list$TABLE.STUDY.1_col.props
            DST$GENDER
DST$DIS_DIAB      0      1
           0 0.9810 0.9920
           1 0.0192 0.0084

$output.list$TABLE.STUDY.2_row.props
            DST$GENDER
DST$DIS_DIAB     0     1
           0 0.511 0.489
           1 0.660 0.340

$output.list$TABLE.STUDY.2_col.props
            DST$GENDER
DST$DIS_DIAB      0      1
           0 0.9800 0.9890
           1 0.0196 0.0106

$output.list$TABLES.COMBINED_all.sources_row.props
            DST$GENDER
DST$DIS_DIAB     0     1
           0 0.507 0.493
           1 0.675 0.325

$output.list$TABLES.COMBINED_all.sources_col.props
            DST$GENDER
DST$DIS_DIAB      0       1
           0 0.9810 0.99000
           1 0.0194 0.00971

$output.list$TABLE_STUDY.1_counts
            DST$GENDER
DST$DIS_DIAB    0    1
           0 1071 1062
           1   21    9

$output.list$TABLE_STUDY.2_counts
            DST$GENDER
DST$DIS_DIAB    0    1
           0 1554 1487
           1   31   16

$output.list$TABLES.COMBINED_all.sources_counts
            DST$GENDER
DST$DIS_DIAB    0    1
           0 2625 2549
           1   52   25

$validity.message
[1] "Data in all studies were valid"
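The combined table in this output is the cell-by-cell sum of the two study tables. The base R sketch below re-derives it on the client from the printed counts; it is meant only to show the arithmetic behind the combined counts and proportions, and the object names are chosen for this example.

```r
# Counts copied from the per-study tables above (rows: DIS_DIAB, columns: GENDER)
dims   <- list(DIS_DIAB = c("0", "1"), GENDER = c("0", "1"))
study1 <- matrix(c(1071, 21, 1062, 9), nrow = 2, dimnames = dims)
study2 <- matrix(c(1554, 31, 1487, 16), nrow = 2, dimnames = dims)

# Pooling the studies: add the tables cell by cell
combined <- study1 + study2

# Row and column proportions of the combined table
combined_row_props <- prop.table(combined, margin = 1)
combined_col_props <- prop.table(combined, margin = 2)

combined
round(combined_row_props, 3)
signif(combined_col_props, 3)
```

Only counts cross the network in this calculation, which is why pooled tables can be produced without individual-level data leaving the servers.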

The function can additionally compute a chi-squared test for homogeneity on (nc-1)*(nr-1) degrees of freedom (where nc is the number of columns and nr is the number of rows):

 

ds.table(rvar='DST$DIS_DIAB', cvar='DST$GENDER', datasources = connections, report.chisq.tests = TRUE)

The output below omits the first section, which is identical to the output above; only the chi-squared test results are shown:

$chisq.tests
$chisq.tests$chisq.test_TABLE.STUDY.1_counts

Pearson's Chi-squared test with Yates' continuity correction

data: input.array.source.specific
X-squared = 3.8767, df = 1, p-value = 0.04896

$chisq.tests$chisq.test_TABLE.STUDY.2_counts

Pearson's Chi-squared test with Yates' continuity correction

data: input.array.source.specific
X-squared = 3.5158, df = 1, p-value = 0.06079

$chisq.tests$chisq.test_TABLES.COMBINED_all.sources_counts

Pearson's Chi-squared test with Yates' continuity correction

data: combine.array.all.sources
X-squared = 7.9078, df = 1, p-value = 0.004922

$validity.message
[1] "Data in all studies were valid"
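As a cross-check of the degrees of freedom and the test statistic, the sketch below applies base R's chisq.test to the Study 1 counts reported earlier. It runs entirely on the client using the printed counts and is an illustration of the calculation, not a description of DataSHIELD internals. For a 2 x 2 table chisq.test applies Yates' continuity correction by default, which matches the test named in the output.

```r
# Study 1 counts from the two-dimensional table (rows: DIS_DIAB, columns: GENDER)
study1 <- matrix(c(1071, 21, 1062, 9), nrow = 2,
                 dimnames = list(DIS_DIAB = c("0", "1"), GENDER = c("0", "1")))

# Degrees of freedom: (nr - 1) * (nc - 1)
df_expected <- (nrow(study1) - 1) * (ncol(study1) - 1)

# Pearson's chi-squared test; correct = TRUE (the default) applies Yates' correction
test <- chisq.test(study1, correct = TRUE)

df_expected
test$statistic
test$parameter
test$p.value
```

The statistic, degrees of freedom and p-value should agree with the Study 1 entry in the output above, up to rounding.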

Conclusion

The other parts in this DataSHIELD tutorial series are:

Also remember you can: