DataSHIELD Training Part 3: Assign functions and tables



Prerequisites

It is recommended that you first familiarise yourself with R by working through our Introduction to R tutorial.

This tutorial also requires the DataSHIELD training environment to be installed on your machine; see our Installation Instructions for Linux, Windows, or Mac.

Help

DataSHIELD support is freely available from the DataSHIELD community in the DataSHIELD forum. Please use the forum as the first port of call for any problems you may be having; it is monitored closely for new threads.

Bespoke DataSHIELD user support and user training classes are offered on a fee-paying basis. Please enquire at datashield@newcastle.ac.uk for current prices.

Introduction

This is the third in a 6-part DataSHIELD tutorial series. 

The other parts in this DataSHIELD tutorial series are:

Quick reminder for logging in:


Recall from the installation instructions that the Opal web interface offers a simple check of whether the VMs have started. Load the following URLs, waiting at least 1 minute after starting the training VMs.

Start R/RStudio

Load Packages

#load libraries
library(DSI)
library(DSOpal)
library(dsBaseClient)

Build your login dataframe 

builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1",  url = "http://192.168.56.100:8080/",
               user = "administrator", password = "datashield_test&",
               table = "CNSIM.CNSIM1", driver = "OpalDriver")
builder$append(server = "study2", url = "http://192.168.56.101:8080/",
               user = "administrator", password = "datashield_test&",
               table = "CNSIM.CNSIM2", driver = "OpalDriver")

logindata <- builder$build()

connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
  • Command to logout:
DSI::datashield.logout(connections)


Descriptive statistics: assigning variables

So far, all the functions in the tutorial have returned something to the screen. Some functions (assign functions) create new objects in the server-side R session that are required for analysis but do not return anything to the client screen. For example, an analysis may require the log values of a variable.

  • By default the function ds.log computes the natural logarithm. A different logarithm can be computed by setting the argument base to a different value. No result is printed to the screen, only progress messages:
ds.log(x='D$LAB_HDL', datasources = connections)
  Aggregated (exists("D")) [=============================================================] 100% / 0s
  Aggregated (classDS("D$LAB_HDL")) [====================================================] 100% / 1s
  Assigned expr. (log.newobj <- log(D$LAB_HDL,2.71828182845905)) [=======================] 100% / 0s
  Aggregated (exists("log.newobj")) [====================================================] 100% / 0s
  • In the above example the name of the new object was not specified. As the progress messages show, the new object is then given the default name 'log.newobj'.
  • It is possible to customise the name of the new object by using the newobj argument:
ds.log(x='D$LAB_HDL', newobj='LAB_HDL_log', datasources = connections)
  • The new object is not attached to the data frame of assigned variables (default name "D"). We can check the size of the new LAB_HDL_log vector generated above; the command should return the same figure as the number of rows in the data frame 'D'.
ds.length(x='LAB_HDL_log', datasources = connections)
Aggregated (lengthDS("LAB_HDL_log")) [=================================================] 100% / 0s
$`length of LAB_HDL_log in study1`
[1] 2163

$`length of LAB_HDL_log in study2`
[1] 3088

$`total length of LAB_HDL_log in all studies combined`
[1] 5251
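
The effect of the base argument can be illustrated locally in plain R. The sketch below is not DataSHIELD code: the vector lab_hdl and its values are hypothetical stand-ins for the server-side D$LAB_HDL, and base R's log is assumed to mirror the server-side expression shown in the progress messages above.

```r
# Hypothetical local values standing in for the server-side D$LAB_HDL
lab_hdl <- c(1.2, 1.8, NA, 2.1)

# Natural logarithm, mirroring the server-side call log(D$LAB_HDL, 2.71828182845905)
log_natural <- log(lab_hdl, base = exp(1))

# A different base, analogous to changing the base argument of ds.log
log_base10 <- log(lab_hdl, base = 10)

# The transformed vector keeps the length of its input; missing values stay NA
same_length <- length(log_natural) == length(lab_hdl)
```

As with LAB_HDL_log on the servers, the transformed vector has the same length as its input because missing values are carried through as NA rather than dropped.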

ds.assign

The ds.assign function enables the creation of new objects in the server-side R session to be used in later analysis. ds.assign can be used to evaluate simple expressions passed on to its argument toAssign and assign the output of the evaluation to a new object.

  • Using ds.assign we subtract the pooled mean calculated earlier from LAB_HDL (mean centring) and assign the output to a new variable called LAB_HDL.c. The function returns no output to the client screen; the newly created variable is stored server-side.
ds.assign(toAssign='D$LAB_HDL-1.562', newobj='LAB_HDL.c', datasources = connections)

Further DataSHIELD functions can now be run on this new mean-centred variable LAB_HDL.c. The example below calculates the mean of the new variable LAB_HDL.c which should be approximately 0.

ds.mean(x='LAB_HDL.c', datasources = connections)
  Aggregated (meanDS(LAB_HDL.c)) [=======================================================] 100% / 0s
$Mean.by.Study
       EstimatedMean Nmissing Nvalid Ntotal
study1   0.007416316      360   1803   2163
study2  -0.005352231      555   2533   3088

$Nstudies
[1] 2

$ValidityMessage
       ValidityMessage 
study1 "VALID ANALYSIS"
study2 "VALID ANALYSIS"
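
The per-study means above can be pooled by weighting each one by its number of valid observations. A minimal sketch in plain R, using only the figures printed above (the variable names are illustrative, and this is not necessarily how DataSHIELD itself computes a combined mean):

```r
# Per-study estimates copied by hand from the ds.mean output above
est_mean  <- c(study1 = 0.007416316, study2 = -0.005352231)
n_valid   <- c(study1 = 1803, study2 = 2533)
n_missing <- c(study1 = 360, study2 = 555)

# Sanity check: valid plus missing observations give the total rows per study
n_total <- n_valid + n_missing

# Pooled mean of the centred variable, weighted by valid observations
pooled_centred_mean <- sum(est_mean * n_valid) / sum(n_valid)
```

The pooled value is very close to 0 but not exactly 0, presumably because the pooled mean was rounded to 1.562 before it was subtracted.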

Contingency tables

The function ds.table creates contingency tables of categorical variables. As the output below shows, it reports tables for each study as well as tables combined across all studies.

  • The example below calculates a one-dimensional table for the variable GENDER. The function returns the counts and the row and column proportions per category, as well as information about the validity of the variable in each study dataset:
ds.table(rvar="D$GENDER")
  Aggregated (asFactorDS1("D$GENDER")) [=================================================] 100% / 0s
  Aggregated (tableDS(rvar.transmit = "D$GENDER", cvar.transmit = NULL, stvar.transmit = NULL, ) ...

 Data in all studies were valid 

Study 1 :  No errors reported from this study
Study 2 :  No errors reported from this study

$output.list
$output.list$TABLE_rvar.by.study_row.props
        study
D$GENDER         1         2
       0 0.4079193 0.5920807
       1 0.4160839 0.5839161

$output.list$TABLE_rvar.by.study_col.props
        study
D$GENDER         1         2
       0 0.5048544 0.5132772
       1 0.4951456 0.4867228

$output.list$TABLE_rvar.by.study_counts
        study
D$GENDER    1    2
       0 1092 1585
       1 1071 1503

$output.list$TABLES.COMBINED_all.sources_proportions
D$GENDER
   0    1 
0.51 0.49 

$output.list$TABLES.COMBINED_all.sources_counts
D$GENDER
   0    1 
2677 2574 


$validity.message
[1] "Data in all studies were valid"
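
The proportions in this output can be reproduced locally from the count table. The following plain R sketch re-enters the counts printed above by hand; the object names are illustrative and the code runs on the client only.

```r
# Counts of GENDER by study, copied from TABLE_rvar.by.study_counts above
counts <- matrix(c(1092, 1071, 1585, 1503), nrow = 2,
                 dimnames = list(GENDER = c("0", "1"), study = c("1", "2")))

# Row proportions: how each GENDER category is split across the studies
row_props <- prop.table(counts, margin = 1)

# Column proportions: the GENDER distribution within each study
col_props <- prop.table(counts, margin = 2)

# Combined counts and proportions across all studies
combined_counts <- rowSums(counts)
combined_props  <- round(combined_counts / sum(combined_counts), 2)
```

Row proportions sum to 1 across studies, while column proportions sum to 1 within each study, which is why the two tables in the output differ.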

In DataSHIELD, tabulated data are flagged as invalid if one or more cells have a count between 1 and the minimal cell count allowed by the data providers. For example, data providers may only allow cell counts ≥ 3.
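
The rule described above can be sketched as a small local function. This is only an illustration of the stated rule, not DataSHIELD's server-side implementation; the function name is_table_valid is hypothetical, and the default threshold of 3 is taken from the example in the text.

```r
# Hypothetical helper: a table is valid when no cell has a non-zero count
# below the minimal cell count (assumed here to be 3)
is_table_valid <- function(counts, min_cell_count = 3) {
  !any(counts > 0 & counts < min_cell_count)
}

# Every cell is at least 3, so the table passes the check
valid_example <- is_table_valid(c(1092, 1071))

# One cell has a count of 2, so the table is flagged as invalid
invalid_example <- is_table_valid(c(1092, 2))

# Empty cells are not flagged under this reading of the rule
empty_cell_example <- is_table_valid(c(1092, 0))
```

Treating empty cells as acceptable follows the wording "between 1 and the minimal cell count"; a data provider could choose a stricter rule.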

The function ds.table also creates two-dimensional contingency tables of two categorical variables. The example below constructs a two-dimensional table comprising a cross-tabulation of the variables DIS_DIAB (diabetes status) and GENDER.


ds.table(rvar='D$DIS_DIAB', cvar='D$GENDER', datasources = connections)
  Aggregated (asFactorDS1("D$DIS_DIAB")) [===============================================] 100% / 0s
  Aggregated (asFactorDS1("D$GENDER")) [=================================================] 100% / 0s
  Aggregated (tableDS(rvar.transmit = "D$DIS_DIAB", cvar.transmit = "D$GENDER", ) [======] 100% / 0s

 Data in all studies were valid 

Study 1 :  No errors reported from this study
Study 2 :  No errors reported from this study

$output.list
$output.list$TABLE.STUDY.1_row.props
          D$GENDER
D$DIS_DIAB     0     1
         0 0.502 0.498
         1 0.700 0.300

$output.list$TABLE.STUDY.1_col.props
          D$GENDER
D$DIS_DIAB      0      1
         0 0.9810 0.9920
         1 0.0192 0.0084

$output.list$TABLE.STUDY.2_row.props
          D$GENDER
D$DIS_DIAB     0     1
         0 0.511 0.489
         1 0.660 0.340

$output.list$TABLE.STUDY.2_col.props
          D$GENDER
D$DIS_DIAB      0      1
         0 0.9800 0.9890
         1 0.0196 0.0106

$output.list$TABLES.COMBINED_all.sources_row.props
          D$GENDER
D$DIS_DIAB     0     1
         0 0.507 0.493
         1 0.675 0.325

$output.list$TABLES.COMBINED_all.sources_col.props
          D$GENDER
D$DIS_DIAB      0       1
         0 0.9810 0.99000
         1 0.0194 0.00971

$output.list$TABLE_STUDY.1_counts
          D$GENDER
D$DIS_DIAB    0    1
         0 1071 1062
         1   21    9

$output.list$TABLE_STUDY.2_counts
          D$GENDER
D$DIS_DIAB    0    1
         0 1554 1487
         1   31   16

$output.list$TABLES.COMBINED_all.sources_counts
          D$GENDER
D$DIS_DIAB    0    1
         0 2625 2549
         1   52   25


$validity.message
[1] "Data in all studies were valid"

The function can additionally compute a chi-squared test for homogeneity on (nc-1)*(nr-1) degrees of freedom (where nc is the number of columns and nr is the number of rows):


ds.table(rvar='D$DIS_DIAB', cvar='D$GENDER', datasources = connections, report.chisq.tests = TRUE)

The output below omits the first section, which is an exact duplicate of the output above; only the chi-squared test reports are shown:

$chisq.tests
$chisq.tests$chisq.test_TABLE.STUDY.1_counts

	Pearson's Chi-squared test with Yates' continuity correction

data:  input.array.source.specific
X-squared = 3.8767, df = 1, p-value = 0.04896


$chisq.tests$chisq.test_TABLE.STUDY.2_counts

	Pearson's Chi-squared test with Yates' continuity correction

data:  input.array.source.specific
X-squared = 3.5158, df = 1, p-value = 0.06079


$chisq.tests$chisq.test_TABLES.COMBINED_all.sources_counts

	Pearson's Chi-squared test with Yates' continuity correction

data:  combine.array.all.sources
X-squared = 7.9078, df = 1, p-value = 0.004922



$validity.message
[1] "Data in all studies were valid"
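
For 2-by-2 tables such as these, the reported statistics can be checked locally with base R's chisq.test, which applies Yates' continuity correction by default. The sketch below re-enters the counts printed earlier by hand; the object names are illustrative.

```r
# Counts of DIS_DIAB by GENDER, copied from the ds.table output above
dims <- list(DIS_DIAB = c("0", "1"), GENDER = c("0", "1"))
study1   <- matrix(c(1071, 21, 1062, 9), nrow = 2, dimnames = dims)
combined <- matrix(c(2625, 52, 2549, 25), nrow = 2, dimnames = dims)

# Pearson's chi-squared test with Yates' continuity correction (default for 2x2)
test_study1   <- chisq.test(study1)
test_combined <- chisq.test(combined)

# Degrees of freedom from the formula (nc - 1) * (nr - 1)
df_expected <- (ncol(study1) - 1) * (nrow(study1) - 1)
```

Printing test_study1 or test_combined gives a report in the same format as the entries under $chisq.tests.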

Conclusion

The other parts in this DataSHIELD tutorial series are:

Also remember you can: