DataSHIELD Training Part 2: Basic statistics and data manipulations



Prerequisites

It is recommended that you first familiarise yourself with R by working through our Introduction to R tutorial.

This tutorial also requires the DataSHIELD training environment to be installed on your machine; see our Installation Instructions for Linux, Windows, or Mac.

Help

DataSHIELD support is freely available from the DataSHIELD community in the DataSHIELD forum. Please use this as the first port of call for any problems you may be having; it is monitored closely for new threads.

Bespoke DataSHIELD user support and user training classes are offered on a fee-paying basis. Please enquire at datashield@newcastle.ac.uk for current prices.

Introduction

This is the second in a 6-part DataSHIELD tutorial series.

The other parts in this DataSHIELD tutorial series are:

Quick reminder for logging in:


Recall from the installation instructions that the Opal web interface offers a simple check that the VMs have started. Load the following URLs, waiting at least 1 minute after starting the training VMs.

Start R/RStudio

Load Packages

#load libraries
library(DSI)
library(DSOpal)
library(dsBaseClient)

Build your login dataframe 

builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1",  url = "http://192.168.56.100:8080/",
               user = "administrator", password = "datashield_test&",
               table = "CNSIM.CNSIM1", driver = "OpalDriver")
builder$append(server = "study2", url = "http://192.168.56.101:8080/",
               user = "administrator", password = "datashield_test&",
               table = "CNSIM.CNSIM2", driver = "OpalDriver")

logindata <- builder$build()

connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
  • Command to logout:
DSI::datashield.logout(connections)


Basic statistics and data manipulations

Descriptive statistics: variable dimensions and class

It is possible to get some descriptive or exploratory statistics about the assigned variables held in the server-side R session, such as the number of participants at each data provider, the number of participants across all data providers, and the number of variables. Identifying these parameters of the data will facilitate your analysis.

Note that we have gone back to using the default symbol "D" for the assigned data frame. This will be the case for the rest of the tutorial. Also, the DSI::datashield.login() function has an auto-logout feature built into its start, so logging out from the previous session can be omitted.


connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
ds.dim(x = 'D')

The output of the command is shown below. It shows that in study 1 there are 2163 individuals with 11 variables and in study 2 there are 3088 individuals with 11 variables, and that in both studies together there are in total 5251 individuals with 11 variables:

Aggregated (dimDS("D")) [==============================================================] 100% / 0s
$`dimensions of D in study1`
[1] 2163   11

$`dimensions of D in study2`
[1] 3088   11

$`dimensions of D in combined studies`
[1] 5251   11


Almost all functions in DataSHIELD can display split results (results separated for each study) or pooled results (results for all the studies combined). This is controlled with the argument type='split' or type='combine' in each function. The majority of DataSHIELD functions default to type='combine'. The default for each function can be checked in the function help page. Some newer versions of functions include the option type='both', which returns both the split and the pooled results.

  • So far, the dimensions of the assigned data frame D have been found using the ds.dim function, for which type='both' is the default.
  • Now use the type='combine' argument in the ds.dim function to identify the number of individuals (5251) and variables (11) pooled across all studies:
ds.dim(x='D', type='combine', datasources = connections)
  Aggregated (dimDS("D")) [==============================================================] 100% / 0s
$`dimensions of D in combined studies`
[1] 5251   11
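
The relationship between the split and pooled dimensions can be checked locally in plain R. The sketch below is illustrative only: it does not call DataSHIELD, and the list `study_dims` is a hypothetical name holding the per-study values copied from the ds.dim output above. It assumes pooling simply adds the row counts and requires the column counts to agree.

```r
# Per-study dimensions copied from the ds.dim output (rows, columns).
# `study_dims` is an illustrative name, not a DataSHIELD object.
study_dims <- list(
  study1 = c(2163, 11),
  study2 = c(3088, 11)
)

# Pooling only makes sense if every study has the same number of variables.
n_cols <- unique(sapply(study_dims, `[`, 2))
stopifnot(length(n_cols) == 1)

# Individuals add up across studies; the variable count is shared.
pooled_dims <- c(sum(sapply(study_dims, `[`, 1)), n_cols)
print(pooled_dims)  # 5251 11
```

The result matches the "combined studies" line reported by ds.dim.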


The argument "datasources=" is routinely specified in this tutorial for the purpose of clarity; however, it can be omitted in general DataSHIELD practice. If the datasources argument is not specified, the default set of connections will be used.

  • To check the variables in each study are identical (as is required for pooled data analysis), use the ds.colnames function on the assigned data frame D:
ds.colnames(x='D', datasources = connections)
  Aggregated (exists("D")) [=============================================================] 100% / 1s
  Aggregated (classDS("D")) [============================================================] 100% / 1s
  Aggregated (colnamesDS("D")) [=========================================================] 100% / 0s
$study1
 [1] "LAB_TSC"            "LAB_TRIG"           "LAB_HDL"            "LAB_GLUC_ADJUSTED"  "PM_BMI_CONTINUOUS" 
 [6] "DIS_CVA"            "MEDI_LPD"           "DIS_DIAB"           "DIS_AMI"            "GENDER"            
[11] "PM_BMI_CATEGORICAL"

$study2
 [1] "LAB_TSC"            "LAB_TRIG"           "LAB_HDL"            "LAB_GLUC_ADJUSTED"  "PM_BMI_CONTINUOUS" 
 [6] "DIS_CVA"            "MEDI_LPD"           "DIS_DIAB"           "DIS_AMI"            "GENDER"            
[11] "PM_BMI_CATEGORICAL"
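
Comparing the two name vectors by eye becomes error-prone as the number of studies or variables grows. A minimal sketch of a programmatic check is given here, assuming the result of ds.colnames is a named list with one character vector per study (as the output above suggests). To keep the example runnable without the training servers, the list `study_cols` is a hypothetical stand-in built from the printed output.

```r
# Column names copied from the ds.colnames output above.
cnsim_cols <- c("LAB_TSC", "LAB_TRIG", "LAB_HDL", "LAB_GLUC_ADJUSTED",
                "PM_BMI_CONTINUOUS", "DIS_CVA", "MEDI_LPD", "DIS_DIAB",
                "DIS_AMI", "GENDER", "PM_BMI_CATEGORICAL")

# Stand-in for the list returned by ds.colnames (assumed structure:
# one character vector per study, named by study).
study_cols <- list(study1 = cnsim_cols, study2 = cnsim_cols)

# Compare every study against the first one; identical() also checks order.
reference <- study_cols[[1]]
all_match <- all(vapply(study_cols, identical, logical(1), y = reference))

# Names present in some study but absent from the reference study.
unexpected <- setdiff(unlist(study_cols), reference)

print(all_match)
```

In a live session, `study_cols` could be replaced by the object returned from ds.colnames(x='D').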
  • Use the ds.class function to identify the class (type) of a variable, for example whether it is an integer, character, factor etc. The class determines which analyses you can run on the variable. The example below returns the class of the variable LAB_HDL held in the assigned data frame D, denoted by the argument x='D$LAB_HDL'.
ds.class(x='D$LAB_HDL', datasources = connections)
  Aggregated (exists("D")) [=============================================================] 100% / 0s
  Aggregated (classDS("D$LAB_HDL")) [====================================================] 100% / 1s
$study1
[1] "numeric"

$study2
[1] "numeric"

Descriptive statistics: quantiles and mean

As LAB_HDL is a numeric variable the distribution of the data can be explored.

  • The function ds.quantileMean returns the quantiles and the statistical mean.

It does not return minimum and maximum values, as these are potentially disclosive (e.g. they can reveal the presence of an outlier). By default type='combine' in this function, so the results reflect the quantiles and mean pooled across all studies. Specifying the argument type='split' will give the quantiles and mean for each study:


ds.quantileMean(x='D$LAB_HDL', datasources = connections)
  Aggregated (exists("D")) [=============================================================] 100% / 0s
  Aggregated (classDS("D$LAB_HDL")) [====================================================] 100% / 1s
  Aggregated (quantileMeanDS(D$LAB_HDL)) [===============================================] 100% / 0s
  Aggregated (lengthDS("D$LAB_HDL")) [===================================================] 100% / 0s
  Aggregated (numNaDS(D$LAB_HDL)) [======================================================] 100% / 0s
 Quantiles of the pooled data
       5%       10%       25%       50%       75%       90%       95%      Mean 
0.8606589 1.0385205 1.2964949 1.5704848 1.8418712 2.0824057 2.2191369 1.5619572 
  • To get the statistical mean alone, use the function ds.mean. The type argument selects between split and pooled results; the call below returns the mean for each study:
ds.mean(x='D$LAB_HDL', datasources = connections)
  Aggregated (meanDS(D$LAB_HDL)) [=======================================================] 100% / 0s
$Mean.by.Study
       EstimatedMean Nmissing Nvalid Ntotal
study1      1.569416      360   1803   2163
study2      1.556648      555   2533   3088

$Nstudies
[1] 2

$ValidityMessage
       ValidityMessage 
study1 "VALID ANALYSIS"
study2 "VALID ANALYSIS"
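
The split means can be related back to the pooled mean reported earlier by ds.quantileMean (1.5619572). The sketch below is a local, plain-R illustration that does not call DataSHIELD: the data frame `by_study` is a hypothetical name holding values copied from the $Mean.by.Study table above, and it assumes the pooled mean is the per-study means weighted by the number of valid (non-missing) observations.

```r
# Values copied from the $Mean.by.Study table above.
by_study <- data.frame(
  EstimatedMean = c(1.569416, 1.556648),
  Nmissing      = c(360, 555),
  Nvalid        = c(1803, 2533),
  Ntotal        = c(2163, 3088),
  row.names     = c("study1", "study2")
)

# Sanity check: missing plus valid observations account for every row.
stopifnot(all(by_study$Nmissing + by_study$Nvalid == by_study$Ntotal))

# Weight each study's mean by its number of valid observations.
pooled_mean <- weighted.mean(by_study$EstimatedMean, w = by_study$Nvalid)
print(round(pooled_mean, 4))  # 1.562
```

The weighted value agrees with the pooled mean shown in the ds.quantileMean output to the precision of the printed inputs. Note that an unweighted average of the two study means would differ slightly, because study2 contributes more valid observations.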

Conclusion

The other parts in this DataSHIELD tutorial series are:

Also remember you can: