
Prerequisites

It is recommended that you first familiarise yourself with R by working through our Introduction to R tutorial.

This tutorial also requires the DataSHIELD training environment to be installed on your machine; see our Installation Instructions for Linux, Windows, or Mac.

Help

Free DataSHIELD support is provided by the DataSHIELD community in the DataSHIELD forum. Please use the forum as the first port of call for any problems you may be having; it is monitored closely for new threads.

Bespoke DataSHIELD user support and user training classes are offered on a fee-paying basis. Please enquire at datashield@newcastle.ac.uk for current prices.

Introduction

This is the third in a 6-part DataSHIELD tutorial series.

The other parts in this DataSHIELD tutorial series are:

1: Introduction and logging in
2: Basic statistics and data manipulations
3: Assign functions and tables
4: Plotting graphs
5: Subsetting
6: Modelling

Quick reminder for logging in:


Start R/RStudio

Load Packages

#load libraries
library(DSI)
library(DSOpal)
library(dsBaseClient)

Build your login dataframe


builder <- DSI::newDSLoginBuilder()
builder$append(server = "server1", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd", driver = "OpalDriver", options='list(ssl_verifyhost=0, ssl_verifypeer=0)')
builder$append(server = "server2", url = "https://opal-demo.obiba.org/",
user = "dsuser", password = "P@ssw0rd", driver = "OpalDriver", options='list(ssl_verifyhost=0, ssl_verifypeer=0)')

logindata <- builder$build()

connections <- DSI::datashield.login(logins = logindata, assign = TRUE)
DSI::datashield.assign.table(conns = connections, symbol = "DST", table = c("CNSIM.CNSIM1","CNSIM.CNSIM2"))

Command to logout:

DSI::datashield.logout(connections)

Descriptive statistics: assigning variables

So far all the functions in the tutorial have returned something to the screen. Some functions (assign functions) create new objects in the server-side R session that are required for analysis but do not return anything to the client screen. For example, an analysis may require the log values of a variable.

By default the function ds.log computes the natural logarithm. A different logarithm can be computed by setting the argument base to a different value. No result is returned to the screen; only progress messages are shown:

ds.log(x='DST$LAB_HDL', datasources = connections)

  Aggregated (exists("DST")) [=============================================================] 100% / 0s
  Aggregated (classDS("DST$LAB_HDL")) [====================================================] 100% / 1s
  Assigned expr. (log.newobj <- log(DST$LAB_HDL,2.71828182845905)) [=======================] 100% / 0s
  Aggregated (exists("log.newobj")) [====================================================] 100% / 0s

In the above example the name of the new object was not specified, so the default name was used. As the progress output shows, the default name is 'log.newobj'.
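The effect of the base argument can be previewed in plain R on local data. The sketch below is an illustration only: the vector hdl holds made-up values rather than CNSIM data, and the object name in the final commented call (LAB_HDL_log10) is chosen here purely as an example.

```r
# Made-up local values standing in for a server-side variable (assumption).
hdl <- c(1.2, 1.6, NA, 2.1)

# Natural logarithm, which is what ds.log computes by default.
log_natural <- log(hdl)

# Base-10 logarithm, corresponding to setting base = 10.
log_base10 <- log(hdl, base = 10)

# Changing the base only rescales the natural-log values by 1/log(10);
# missing values stay missing in both vectors.
all.equal(log_base10, log_natural / log(10))

# Server-side counterpart (needs an active login, so it is not run here):
# ds.log(x='DST$LAB_HDL', base=10, newobj='LAB_HDL_log10', datasources = connections)
```

Because the two logarithms differ only by a constant factor, the choice of base is usually a matter of interpretability rather than of statistical substance.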

It is possible to customise the name of the new object by using the newobj argument:

ds.log(x='DST$LAB_HDL', newobj='LAB_HDL_log', datasources = connections)

The new object is not attached to the assigned data frame (named "DST" in this tutorial; the default name is "D"). We can check the length of the new LAB_HDL_log vector generated above; the command should return the same figure as the number of rows in the data frame 'DST'.

ds.length(x='LAB_HDL_log', datasources = connections)

Aggregated (lengthDS("LAB_HDL_log")) [=================================================] 100% / 0s
$`length of LAB_HDL_log in study1`
[1] 2163

$`length of LAB_HDL_log in study2`
[1] 3088

$`total length of LAB_HDL_log in all studies combined`
[1] 5251

ds.assign

The ds.assign function enables the creation of new objects in the server-side R session to be used in later analysis. ds.assign can be used to evaluate simple expressions passed on to its argument toAssign and assign the output of the evaluation to a new object.

Using ds.assign, we subtract the pooled mean calculated earlier from LAB_HDL (mean centring) and assign the output to a new variable called LAB_HDL.c. The function returns no output to the client screen; the newly created variable is stored server-side.

ds.assign(toAssign='DST$LAB_HDL-1.562', newobj='LAB_HDL.c', datasources = connections)

Further DataSHIELD functions can now be run on this new mean-centred variable LAB_HDL.c. The example below calculates the mean of LAB_HDL.c in each study, which should be approximately 0.

ds.mean(x='LAB_HDL.c', datasources = connections)

  Aggregated (meanDS(LAB_HDL.c)) [=======================================================] 100% / 0s
$Mean.by.Study
       EstimatedMean Nmissing Nvalid Ntotal
study1   0.007416316      360   1803   2163
study2  -0.005352231      555   2533   3088

$Nstudies
[1] 2

$ValidityMessage
       ValidityMessage 
study1 "VALID ANALYSIS"
study2 "VALID ANALYSIS"
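The per-study means are not exactly 0 because the same pooled mean (1.562) was subtracted in both studies. As a rough client-side check, the pooled mean of LAB_HDL.c can be recovered from the figures printed above by weighting each study mean by its number of valid observations. This is plain R arithmetic on the reported output; the variable names are chosen here for illustration.

```r
# Per-study means and valid counts copied from the ds.mean output above.
study_means <- c(study1 = 0.007416316, study2 = -0.005352231)
n_valid     <- c(study1 = 1803, study2 = 2533)

# Pooled mean = sum of (study mean * valid N) divided by the total valid N.
pooled_mean <- sum(study_means * n_valid) / sum(n_valid)
pooled_mean
```

The result is very close to 0; the small remainder is consistent with the pooled mean having been rounded to three decimal places before centring.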

Contingency tables

The function ds.table creates contingency tables of categorical variables. By default it runs on pooled data from all studies; to obtain output for each study separately, set the argument type='split'.

The example below calculates a one-dimensional table for the variable GENDER. The function returns the counts and the row and column proportions per category, as well as information about the validity of the variable in each study dataset:

ds.table(rvar="DST$GENDER")

  Aggregated (asFactorDS1("DST$GENDER")) [=================================================] 100% / 0s
  Aggregated (tableDS(rvar.transmit = "DST$GENDER", cvar.transmit = NULL, stvar.transmit = NULL, ) ...

 Data in all studies were valid 

Study 1 :  No errors reported from this study
Study 2 :  No errors reported from this study

$output.list
$output.list$TABLE_rvar.by.study_row.props
        study
DST$GENDER         1         2
       0 0.4079193 0.5920807
       1 0.4160839 0.5839161

$output.list$TABLE_rvar.by.study_col.props
        study
DST$GENDER         1         2
       0 0.5048544 0.5132772
       1 0.4951456 0.4867228

$output.list$TABLE_rvar.by.study_counts
        study
DST$GENDER    1    2
       0 1092 1585
       1 1071 1503

$output.list$TABLES.COMBINED_all.sources_proportions
DST$GENDER
   0    1 
0.51 0.49 

$output.list$TABLES.COMBINED_all.sources_counts
DST$GENDER
   0    1 
2677 2574 


$validity.message
[1] "Data in all studies were valid"
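To see how the proportion tables relate to the counts, the figures can be reproduced in plain R from the counts table printed above. This is a local sketch on the reported numbers only, not a DataSHIELD call; the object names are chosen here for illustration.

```r
# Counts of GENDER by study, copied from TABLE_rvar.by.study_counts above.
counts <- matrix(c(1092, 1071, 1585, 1503), nrow = 2,
                 dimnames = list(GENDER = c("0", "1"), study = c("1", "2")))

# Row proportions: how each GENDER category is split across the two studies.
row_props <- prop.table(counts, margin = 1)

# Column proportions: how each study is split across the GENDER categories.
col_props <- prop.table(counts, margin = 2)

# Combined counts and proportions across both studies.
combined_counts <- rowSums(counts)
combined_props  <- round(combined_counts / sum(counts), 2)

row_props
col_props
combined_counts
combined_props
```

Each row of row_props and each column of col_props sums to 1, which is a quick way to tell the two tables apart when reading the output.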

In DataSHIELD, tabulated data are flagged as invalid if one or more cells have a count between 1 and the minimal cell count allowed by the data providers. For example, data providers may only allow cell counts ≥ 3.

The function ds.table also creates two-dimensional contingency tables of two categorical variables. The example below constructs a two-dimensional table comprising a cross-tabulation of the variables DIS_DIAB (diabetes status) and GENDER.
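The rule can be sketched in plain R as follows. This is an illustration of the idea only, not DataSHIELD's server-side implementation: cells_are_valid is a hypothetical helper, the threshold of 3 is taken from the example in the text, and the sketch assumes that empty cells (count 0) are allowed while counts from 1 up to one below the threshold are not.

```r
# Hypothetical helper: TRUE if no cell has a count from 1 to min_count - 1.
cells_are_valid <- function(counts, min_count = 3) {
  !any(counts >= 1 & counts < min_count)
}

# Study 1 counts of DIS_DIAB by GENDER, as reported later in this tutorial.
diab_by_gender <- matrix(c(1071, 21, 1062, 9), nrow = 2)
cells_are_valid(diab_by_gender)   # TRUE: the smallest cell holds 9

# A made-up table with one sparse cell (count 2) fails the check.
sparse_table <- matrix(c(1071, 2, 1062, 9), nrow = 2)
cells_are_valid(sparse_table)     # FALSE
```

In practice the threshold is set by each data provider, so a table that is valid in one study may be flagged as invalid in another.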

ds.table(rvar='DST$DIS_DIAB', cvar='DST$GENDER', datasources = connections)

  Aggregated (asFactorDS1("DST$DIS_DIAB")) [===============================================] 100% / 0s
  Aggregated (asFactorDS1("DST$GENDER")) [=================================================] 100% / 0s
  Aggregated (tableDS(rvar.transmit = "DST$DIS_DIAB", cvar.transmit = "DST$GENDER", ) [======] 100% / 0s

 Data in all studies were valid 

Study 1 :  No errors reported from this study
Study 2 :  No errors reported from this study

$output.list
$output.list$TABLE.STUDY.1_row.props
          DST$GENDER
DST$DIS_DIAB     0     1
         0 0.502 0.498
         1 0.700 0.300

$output.list$TABLE.STUDY.1_col.props
          DST$GENDER
DST$DIS_DIAB      0      1
         0 0.9810 0.9920
         1 0.0192 0.0084

$output.list$TABLE.STUDY.2_row.props
          DST$GENDER
DST$DIS_DIAB     0     1
         0 0.511 0.489
         1 0.660 0.340

$output.list$TABLE.STUDY.2_col.props
          DST$GENDER
DST$DIS_DIAB      0      1
         0 0.9800 0.9890
         1 0.0196 0.0106

$output.list$TABLES.COMBINED_all.sources_row.props
          DST$GENDER
DST$DIS_DIAB     0     1
         0 0.507 0.493
         1 0.675 0.325

$output.list$TABLES.COMBINED_all.sources_col.props
          DST$GENDER
DST$DIS_DIAB      0       1
         0 0.9810 0.99000
         1 0.0194 0.00971

$output.list$TABLE_STUDY.1_counts
          DST$GENDER
DST$DIS_DIAB    0    1
         0 1071 1062
         1   21    9

$output.list$TABLE_STUDY.2_counts
          DST$GENDER
DST$DIS_DIAB    0    1
         0 1554 1487
         1   31   16

$output.list$TABLES.COMBINED_all.sources_counts
          DST$GENDER
DST$DIS_DIAB    0    1
         0 2625 2549
         1   52   25


$validity.message
[1] "Data in all studies were valid"

The function can additionally compute a chi-squared test for homogeneity on (nc-1)*(nr-1) degrees of freedom (where nc is the number of columns and nr is the number of rows):

ds.table(rvar='DST$DIS_DIAB', cvar='DST$GENDER', datasources = connections, report.chisq.tests = TRUE)

The output below omits the first section, which is identical to the output above; only the chi-squared test reports are shown:

$chisq.tests
$chisq.tests$chisq.test_TABLE.STUDY.1_counts

	Pearson's Chi-squared test with Yates' continuity correction

data:  input.array.source.specific
X-squared = 3.8767, df = 1, p-value = 0.04896


$chisq.tests$chisq.test_TABLE.STUDY.2_counts

	Pearson's Chi-squared test with Yates' continuity correction

data:  input.array.source.specific
X-squared = 3.5158, df = 1, p-value = 0.06079


$chisq.tests$chisq.test_TABLES.COMBINED_all.sources_counts

	Pearson's Chi-squared test with Yates' continuity correction

data:  combine.array.all.sources
X-squared = 7.9078, df = 1, p-value = 0.004922



$validity.message
[1] "Data in all studies were valid"
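As a cross-check, the combined test can be reproduced on the client in plain R from the combined counts table printed earlier. This is a local sketch using base R's chisq.test on the reported numbers, not a DataSHIELD call; the object names are chosen here for illustration.

```r
# Combined counts of DIS_DIAB by GENDER, copied from
# TABLES.COMBINED_all.sources_counts above.
combined <- matrix(c(2625, 52, 2549, 25), nrow = 2,
                   dimnames = list(DIS_DIAB = c("0", "1"),
                                   GENDER   = c("0", "1")))

# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
df <- (nrow(combined) - 1) * (ncol(combined) - 1)

# For a 2x2 table chisq.test applies Yates' continuity correction by default.
result <- chisq.test(combined)

df
result$statistic
result$p.value
```

The statistic and p-value agree with the combined report above (X-squared = 7.9078, df = 1, p-value = 0.004922).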

Conclusion