It is recommended that you first familiarise yourself with R by working through our Introduction to R tutorial. This tutorial also requires the DataSHIELD training environment to be installed on your machine; see our Installation Instructions for Linux, Windows, or Mac.
Free DataSHIELD support is available from the DataSHIELD community in the DataSHIELD forum. Please use the forum as the first port of call for any problems you may be having; it is monitored closely for new threads. Bespoke DataSHIELD user support and user training classes are offered on a fee-paying basis. Please enquire at datashield@newcastle.ac.uk for current prices.
This is the second in a 6-part DataSHIELD tutorial series.
The other parts in this DataSHIELD tutorial series are:
5: Subsetting
6: Modelling
Start R/RStudio
Load Packages
Build your login dataframe
DataSHIELD commands call functions that range from carrying out prerequisite tasks, such as logging in to the data sources, to generating basic descriptive statistics, plots and tabulations. More advanced functions allow users to fit generalized linear models and generalized estimating equation models. R can list all functions available in DataSHIELD.

This section explains the functions we will call during this tutorial. Although this knowledge is not required to run DataSHIELD analyses, it helps in understanding the output of the commands. It can explain why some commands call functions that return nothing to the user, but instead store the output on the data provider's server for use in a second function.

In DataSHIELD the person running an analysis (the client) uses client-side functions to issue commands (instructions). These commands initiate the execution (running) of server-side functions that run the analysis server-side (behind the firewall of the data provider). There are two types of server-side function: assign functions and aggregate functions.

Assign functions create new objects and store them server-side, either because the objects are potentially disclosive or because they consist of individual-level data which, in DataSHIELD, is never seen by the analyst. They return no output to the client except error messages or useful messages about the object stored server-side.

Aggregate functions analyse the data server-side and return an output in the form of aggregate data (non-disclosive summary statistics) to the client. The help page for each function tells us what is returned and when not to expect an output client-side.
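The distinction between assign and aggregate functions can be illustrated with a toy model in base R. This is only a sketch of the idea, not how DataSHIELD is implemented: a local environment stands in for the server-side R session, the data values are made up, and the names `server_env`, `toy_assign_log` and `toy_aggregate_mean` are hypothetical.

```r
# Toy model only: a local environment stands in for the server-side R session.
# The data values below are invented for illustration.
server_env <- new.env()
assign("DST", data.frame(LAB_HDL = c(1.2, 1.5, 1.9, NA)), envir = server_env)

# Hypothetical "assign" function: creates a new server-side object and
# returns nothing useful to the client.
toy_assign_log <- function(newobj, column) {
  values <- get("DST", envir = server_env)[[column]]
  assign(newobj, log(values), envir = server_env)
  invisible(NULL)
}

# Hypothetical "aggregate" function: returns only a summary statistic,
# never the individual-level values.
toy_aggregate_mean <- function(obj) {
  mean(get(obj, envir = server_env), na.rm = TRUE)
}

res <- toy_assign_log("HDL_log", "LAB_HDL")  # nothing comes back to the client
m   <- toy_aggregate_mean("HDL_log")         # a single summary value comes back
print(is.null(res))
print(m)
```

In this sketch the log-transformed variable exists only inside `server_env`; the client sees it indirectly, through the summary returned by the second call.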
It is possible to get some descriptive or exploratory statistics about the assigned variables held in the server-side R session, such as the number of participants at each data provider, the number of participants across all data providers and the number of variables. Identifying these parameters of the data will facilitate your analysis.
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "DST")
ds.dim(x = 'DST')
The output of the command is shown below. It shows that in study 1 there are 2163 individuals with 11 variables and in study 2 there are 3088 individuals with 11 variables, and that in both studies together there are in total 5251 individuals with 11 variables:
Aggregated (dimDS("DST")) [==============================================================] 100% / 0s
$`dimensions of DST in study1`
[1] 2163   11

$`dimensions of DST in study2`
[1] 3088   11

$`dimensions of DST in combined studies`
[1] 5251   11
Almost all functions in DataSHIELD can display split results (results separated for each study) or pooled results (results for all the studies combined). This can be done using the argument type.

The dimensions of the assigned data frame DST have been found using the ds.dim command, in which type='both' is the default argument.

Use the type='combine' argument in the ds.dim function to identify the number of individuals (5251) and variables (11) pooled across all studies:

ds.dim(x='DST', type='combine', datasources = connections)
Aggregated (dimDS("DST")) [==============================================================] 100% / 0s
$`dimensions of DST in combined studies`
[1] 5251   11
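As a quick sanity check, the combined dimensions can be reproduced from the per-study dimensions reported by ds.dim. The sketch below uses base R only and hard-codes the numbers from the output in this tutorial; it assumes, as is the case here, that every study holds the same set of variables. The names `study_dims` and `combined_dims` are illustrative.

```r
# Per-study dimensions (rows, columns), copied from the ds.dim output
# shown in this tutorial.
study_dims <- list(
  study1 = c(2163, 11),
  study2 = c(3088, 11)
)

rows <- sapply(study_dims, `[`, 1)
cols <- sapply(study_dims, `[`, 2)

# Pooling stacks the studies' rows; this only makes sense when every
# study has the same number of variables.
stopifnot(length(unique(cols)) == 1)

combined_dims <- c(sum(rows), cols[[1]])
print(combined_dims)  # 5251 individuals, 11 variables
```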
The argument "datasources=" is routinely specified in this tutorial for the purpose of clarity; however, it can be omitted in general DataSHIELD practice: if the datasources argument is not specified, the default set of connections will be used.
Use the ds.colnames function on the assigned data frame DST to list the names of its variables:

ds.colnames(x='DST', datasources = connections)
Aggregated (exists("DST")) [=============================================================] 100% / 1s
Aggregated (classDS("DST")) [============================================================] 100% / 1s
Aggregated (colnamesDS("DST")) [=========================================================] 100% / 0s
$study1
 [1] "LAB_TSC" "LAB_TRIG" "LAB_HDL" "LAB_GLUC_ADJUSTED" "PM_BMI_CONTINUOUS"
 [6] "DIS_CVA" "MEDI_LPD" "DIS_DIAB" "DIS_AMI" "GENDER"
[11] "PM_BMI_CATEGORICAL"

$study2
 [1] "LAB_TSC" "LAB_TRIG" "LAB_HDL" "LAB_GLUC_ADJUSTED" "PM_BMI_CONTINUOUS"
 [6] "DIS_CVA" "MEDI_LPD" "DIS_DIAB" "DIS_AMI" "GENDER"
[11] "PM_BMI_CATEGORICAL"
Use the ds.class function to identify the class (type) of a variable, for example whether it is an integer, character, factor etc. The class determines which analyses you can run with the variable. The example below returns the class of the variable LAB_HDL held in the assigned data frame DST, denoted by the argument x='DST$LAB_HDL':

ds.class(x='DST$LAB_HDL', datasources = connections)
Aggregated (exists("DST")) [=============================================================] 100% / 0s
Aggregated (classDS("DST$LAB_HDL")) [====================================================] 100% / 1s
$study1
[1] "numeric"

$study2
[1] "numeric"
As LAB_HDL is a numeric variable, the distribution of the data can be explored.

The function ds.quantileMean returns the quantiles and the statistical mean. It does not return minimum and maximum values, as these are potentially disclosive (e.g. they can reveal the presence of an outlier). By default, the function returns results pooled across all studies:

ds.quantileMean(x='DST$LAB_HDL', datasources = connections)
Aggregated (exists("DST")) [=============================================================] 100% / 0s
Aggregated (classDS("DST$LAB_HDL")) [====================================================] 100% / 1s
Aggregated (quantileMeanDS(DST$LAB_HDL)) [===============================================] 100% / 0s
Aggregated (lengthDS("DST$LAB_HDL")) [===================================================] 100% / 0s
Aggregated (numNaDS(DST$LAB_HDL)) [======================================================] 100% / 0s
 Quantiles of the pooled data
       5%       10%       25%       50%       75%       90%       95%      Mean
0.8606589 1.0385205 1.2964949 1.5704848 1.8418712 2.0824057 2.2191369 1.5619572
The ds.mean function returns the statistical mean; use the argument type to request split results:

ds.mean(x='DST$LAB_HDL', type='split', datasources = connections)
Aggregated (meanDS(DST$LAB_HDL)) [=======================================================] 100% / 0s
$Mean.by.Study
       EstimatedMean Nmissing Nvalid Ntotal
study1      1.569416      360   1803   2163
study2      1.556648      555   2533   3088

$Nstudies
[1] 2

$ValidityMessage
       ValidityMessage
study1 "VALID ANALYSIS"
study2 "VALID ANALYSIS"
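The split means relate to the pooled mean reported by ds.quantileMean through a weighted average, with each study weighted by its number of valid (non-missing) observations. The base R sketch below checks this arithmetic using the values printed in this tutorial; it illustrates the relationship only and makes no claim about how DataSHIELD computes the pooled mean internally. The variable names are illustrative.

```r
# Values copied from the ds.mean output ($Mean.by.Study) in this tutorial.
study_mean <- c(study1 = 1.569416, study2 = 1.556648)
n_valid    <- c(study1 = 1803,     study2 = 2533)

# Weighted average: each study contributes in proportion to its
# number of valid observations.
pooled_mean <- sum(study_mean * n_valid) / sum(n_valid)

print(round(pooled_mean, 5))  # 1.56196, in line with the pooled Mean of 1.5619572
```

Note that the weights are Nvalid rather than Ntotal, because missing values do not contribute to either study's mean.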