DataSHIELD Training: All parts
(Written for release of v6.0, June 2020)
Prerequisites
It is recommended that you familiarise yourself with R first by sitting our Introduction to R tutorial.
It also requires that you have the DataSHIELD training environment installed on your machine, see our Installation Instructions for Linux, Windows, or Mac.
Help
DataSHIELD support is freely available in the DataSHIELD forum by the DataSHIELD community. Please use this as the first port of call for any problems you may be having, it is monitored closely for new threads.
DataSHIELD bespoke user support and also user training classes are offered on a fee-paying basis. Please enquire at datashield@newcastle.ac.uk for current prices.
Introduction
This tutorial introduces users to DataSHIELD commands and syntax. Throughout this document we refer to R, but all commands are run in the same way in Rstudio. This tutorial contains a limited number of examples; further examples are available in each DataSHIELD function manual page that can be accessed via the help function.
The DataSHIELD approach: aggregate and assign functions
How assign and aggregate functions work
DataSHIELD commands call functions that range from carrying out pre-requisite tasks such as login to the datasources, to generating basic descriptive statistics, plots and tabulations. More advance functions allow for users to fit generalized linear models and generalized estimating equations models. R can list all functions available in DataSHIELD.
This section explains the functions we will call during this tutorial. Although this knowledge is not required to run DataSHIELD analyses it helps to understand the output of the commands. It can explain why some commands call functions that return nothing to the user, but rather store the output on the server of the data provider for use in a second function.
In DataSHIELD the person running an analysis (the client) uses client-side functions to issue commands (instructions). These commands initiate the execution (running) of server-side functions that run the analysis server-side (behind the firewall of the data provider). There are two types of server-side function: assign functions and aggregate functions.
Assign functions do not return an output to the client, with the exception of error or status messages. Assign functions create new objects and store them server-side either because the objects are potentially disclosive, or because they consist of the individual-level data which, in DataSHIELD, is never seen by the analyst. These new objects can include:
- new transformed variables (e.g. mean centred or log transformed variables)
- a new variable of a modified class (e.g. a variable of class numeric may be converted into a factor which R can then model as having discrete categorical levels)
- a subset object (e.g. a dataframe including gender as a variable may be split into males and females).
Assign functions return no output to the client except to indicate an error or useful messages about the object store on server-side.
Aggregate functions analyse the data server-side and return an output in the form of aggregate data (summary statistics that are not disclosive) to the client. The help page for each function tells us what is returned and when not to expect an output on client-side.
Start your training Virtual Machines
Please follow instructions to Start the Opal VMs.
Recall from the installation instructions, the Opal web interface:
is a simple check to tell if the VMs have started.
Start R/RStudio
Load Packages
- The following relevant R packages are required for analysis:
DSI to login and logout.
DSOpal used by DSI to access the Opal server.
dsBaseClient containing all DataSHIELD functions referred to in this tutorial.
- To load the R packages, type the
library
function into the command line as given in the example below:
#load libraries library(DSI) library(DSOpal) library(dsBaseClient)
Build your login dataframe
DataSHIELD cloud IP addresses
The DataSHIELD cloud training environment does not use fixed IP addresses, the client and opal training server addresses change each training session. As part of the user tutorial you learn how to build a DataSHIELD login dataframe. In a real world instance of DataSHIELD this is populated with secure certificates not text based usernames and passwords.
Login Dataframe
- Build login dataframe.
builder <- DSI::newDSLoginBuilder() builder$append(server = "study1", url = "http://192.168.56.100:8080/", user = "administrator", password = "datashield_test&", table = "CNSIM.CNSIM1", driver = "OpalDriver") builder$append(server = "study2", url = "http://192.168.56.101:8080/", user = "administrator", password = "datashield_test&", table = "CNSIM.CNSIM2", driver = "OpalDriver") logindata <- builder$build()
Log onto the remote Opal training servers
Assign to "connections" the DSI
function to log into the desired Opal servers. In the DataSHIELD test environmentlogindata
is our login dataframe for the Opal training servers.
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
The output below indicates that each of the two Opal training servers dstutorial-100 and dstutorial-101 contain the same 11 variables listed in capital letters under
Variables assigned
:
Logging into the collaborating servers Logged in all servers [================================================================] 100% /14s No variables have been specified. All the variables in the table (the whole dataset) will be assigned to R! Assigning table data... Assigned all tables [==================================================================] 100% /13s Variables assigned: study1 -- LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL study2 -- LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL
- Command to logout:
DSI::datashield.logout(connections)
In Horizontal DataSHIELD pooled analysis, the data are harmonized and the variables given the same names across the studies, as agreed by all data providers.
How datashield.login works
The datashield.login
function from the R package opal
allows users to login and assign data to analyse from the Opal server in a server-side R session created behind the firewall of the data provider.
All the commands sent after login are processed within the server-side R instance only allows a specific set of commands to run (see the details of a typical horizontal DataSHIELD process). The server-side R session is wiped after logging out.
If we do not specify individual variables to assign to the server-side R session, all variables held in the Opal servers are assigned. Assigned data are kept in a data frame named D
by default. Each column of that data frame represents one variable and the rows are the individual records.
Assign individual variables on login
Users can specify individual variables to assign to the server-side R session. It is best practice to first create a list of the Opal variables you want to analyse.
- The example below creates a new variable
myvar
that lists the Opal variables required for analysis:LAB_HDL
andGENDER
- The
variables
argument in the functiondatashield.login
usesmyvar
, which then will call only this list.
myvar <- list('LAB_HDL', 'GENDER') connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D", variables=myvar)
Logging into the collaborating servers Logged in all servers [================================================================] 100% / 4s Assigning table data... Assigned all tables [==================================================================] 100% / 7s Variables assigned: study1 -- LAB_HDL, GENDER study2 -- LAB_HDL, GENDER
The format of assigned data frames
Assigned data are kept in a data frame (table) named D
by default. Each row of the data frame are the individual records and each column is a separate variable.
- The example below uses the argument
symbol
in thedatashield.login
function to change the name of the data frame fromD
tomytable
myvar <- list('LAB_HDL', 'GENDER') connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol='mytable', variables=myvar)
Logging into the collaborating servers Logged in all servers [================================================================] 100% / 4s Assigning table data... Assigned all tables [==================================================================] 100% / 6s Variables assigned: study1 -- LAB_HDL, GENDER study2 -- LAB_HDL, GENDER
Basic statistics and data manipulations
Descriptive statistics: variable dimensions and class
It is possible to get some descriptive or exploratory statistics about the assigned variables held in the server-side R session such as number of participants at each data provider, number of participants across all data providers and number of variables. Identifying parameters of the data will facilitate your analysis.
Note, we have gone back to using the default symbol for connections, "D". This will be the case for the rest of the tutorial. Also, the DSI::datashield.login() function has an auto logout feature built into the start of it, so logging out from the previous session can be omitted.
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D") ds.dim(x = 'D')
The output of the command is shown below. It shows that in study 1 there are 2163 individuals with 11 variables and in study 2 there are 3088 individuals with 11 variables, and that in both studies together there are in total 5251 individuals with 11 variables:
Aggregated (dimDS("D")) [==============================================================] 100% / 0s $`dimensions of D in study1` [1] 2163 11 $`dimensions of D in study2` [1] 3088 11 $`dimensions of D in combined studies` [1] 5251 11
Almost all functions in DataSHIELD can display split results (results separated for each study) or pooled results (results for all the studies combined). This can be done using the argument type='split'
or type='combine'
in each function. The majority of DataSHIELD functions default to type='combine'
. The default for each function can be checked in the function help page. Some of the new versions of functions include the option type='both'
which returns both the split and the pooled results.
- Up to here, the dimensions of the assigned data frame
D
have been found using theds.dim
command in whichtype='both'
is the default argument. - Now use the
type='combine'
argument in theds.dim
function to identify the number of individuals (5251) and variables (11) pooled across all studies:
ds.dim(x='D', type='combine', datasources = connections)
Aggregated (dimDS("D")) [==============================================================] 100% / 0s $`dimensions of D in combined studies` [1] 5251 11
The argument "datasources=" is routinely specified in this tutorial for the purpose of clarity; however it can be omitted in general DataSHIELD practice- if the datasources argument is not specified the default set of connections will be used.
- To check the variables in each study are identical (as is required for pooled data analysis), use the
ds.colnames
function on the assigned data frameD
:
ds.colnames(x='D', datasources = connections)
Aggregated (exists("D")) [=============================================================] 100% / 1s Aggregated (classDS("D")) [============================================================] 100% / 1s Aggregated (colnamesDS("D")) [=========================================================] 100% / 0s $study1 [1] "LAB_TSC" "LAB_TRIG" "LAB_HDL" "LAB_GLUC_ADJUSTED" "PM_BMI_CONTINUOUS" [6] "DIS_CVA" "MEDI_LPD" "DIS_DIAB" "DIS_AMI" "GENDER" [11] "PM_BMI_CATEGORICAL" $study2 [1] "LAB_TSC" "LAB_TRIG" "LAB_HDL" "LAB_GLUC_ADJUSTED" "PM_BMI_CONTINUOUS" [6] "DIS_CVA" "MEDI_LPD" "DIS_DIAB" "DIS_AMI" "GENDER" [11] "PM_BMI_CATEGORICAL"
- Use the
ds.class
function to identify the class (type) of a variable - for example if it is an integer, character, factor etc. This will determine what analysis you can run using this variable class. The example below defines the class of the variableLAB_HDL
held in the assigned data frameD
, denoted by the argumentx='D$LAB_HDL'
.
ds.class(x='D$LAB_HDL', datasources = connections)
Aggregated (exists("D")) [=============================================================] 100% / 0s Aggregated (classDS("D$LAB_HDL")) [====================================================] 100% / 1s $study1 [1] "numeric" $study2 [1] "numeric"
Descriptive statistics: quantiles and mean
As LAB_HDL
is a numeric variable the distribution of the data can be explored.
- The function
ds.quantileMean
returns the quantiles and the statistical mean.
It does not return minimum and maximum values as these values are potentially disclosive (e.g. the presence of an outlier). By default type='combined'
in this function - the results reflect the quantiles and mean pooled for all studies. Specifying the argument type='split'
will give the quantiles and mean for each study:
ds.quantileMean(x='D$LAB_HDL', datasources = connections)
Aggregated (exists("D")) [=============================================================] 100% / 0s Aggregated (classDS("D$LAB_HDL")) [====================================================] 100% / 1s Aggregated (quantileMeanDS(D$LAB_HDL)) [===============================================] 100% / 0s Aggregated (lengthDS("D$LAB_HDL")) [===================================================] 100% / 0s Aggregated (numNaDS(D$LAB_HDL)) [======================================================] 100% / 0s Quantiles of the pooled data 5% 10% 25% 50% 75% 90% 95% Mean 0.8606589 1.0385205 1.2964949 1.5704848 1.8418712 2.0824057 2.2191369 1.5619572
- To get the statistical mean alone, use the function
ds.mean
use the argumenttype
to request split results:
ds.mean(x='D$LAB_HDL', datasources = connections)
Aggregated (meanDS(D$LAB_HDL)) [=======================================================] 100% / 0s $Mean.by.Study EstimatedMean Nmissing Nvalid Ntotal study1 1.569416 360 1803 2163 study2 1.556648 555 2533 3088 $Nstudies [1] 2 $ValidityMessage ValidityMessage study1 "VALID ANALYSIS" study2 "VALID ANALYSIS"
Descriptive statistics: assigning variables
So far all the functions in this section have returned something to the screen. Some functions (assign functions) create new objects in the server-side R session that are required for analysis but do not return an anything to the client screen. For example, in analysis the log values of a variable may be required.
- By default the function
ds.log
computes the natural logarithm. It is possible to compute a different logarithm by setting the argumentbase
to a different value. There is no output to screen:
ds.log(x='D$LAB_HDL', datasources = connections)
Aggregated (exists("D")) [=============================================================] 100% / 0s Aggregated (classDS("D$LAB_HDL")) [====================================================] 100% / 1s Assigned expr. (log.newobj <- log(D$LAB_HDL,2.71828182845905)) [=======================] 100% / 0s Aggregated (exists("log.newobj")) [====================================================] 100% / 0s
- In the above example the name of the new object was not specified. By default the name of the new variable is set to the input vector followed by the suffix '_log' (i.e. '
LAB_HDL_log'
)
- It is possible to customise the name of the new object by using the
newobj
argument:
ds.log(x='D$LAB_HDL', newobj='LAB_HDL_log', datasources = connections)
- The new object is not attached to assigned variables data frame (default name "
D
"). We can check the size of the new LAB_HDL_log vector we generated above; the command should return the same figure as the number of rows in the data frame 'D'.
ds.length(x='LAB_HDL_log', datasources = connections)
Aggregated (lengthDS("LAB_HDL_log")) [=================================================] 100% / 0s $`length of LAB_HDL_log in study1` [1] 2163 $`length of LAB_HDL_log in study2` [1] 3088 $`total length of LAB_HDL_log in all studies combined` [1] 5251
ds.assign
The ds.assign
function enables the creation of new objects in the server-side R session to be used in later analysis. ds.assign
can be used to evaluate simple expressions passed on to its argument toAssign
and assign the output of the evaluation to a new object.
- Using
ds.assign
we subtract the pooled mean calculated earlier from LAB_HDL (mean centring) and assign the output to a new variable calledLAB_HDL.c
. The function returns no output to the client screen, the newly created variable is stored server-side.
ds.assign(toAssign='D$LAB_HDL-1.562', newobj='LAB_HDL.c', datasources = connections)
Further DataSHIELD functions can now be run on this new mean-centred variable
LAB_HDL.c
. The example below calculates the mean of the new variable
LAB_HDL.c
which should be approximately 0.
ds.mean(x='LAB_HDL.c', datasources = connections)
Aggregated (meanDS(LAB_HDL.c)) [=======================================================] 100% / 0s $Mean.by.Study EstimatedMean Nmissing Nvalid Ntotal study1 0.007416316 360 1803 2163 study2 -0.005352231 555 2533 3088 $Nstudies [1] 2 $ValidityMessage ValidityMessage study1 "VALID ANALYSIS" study2 "VALID ANALYSIS"
Generating contingency tables
The function
ds.table
creates contingency tables of a categorical variables. The default is set to run on pooled data from all studies, to obtain an output of each study set the argument
type='split'
.
- The example below calculates a one-dimensional table for the variable
GENDER
. The function returns the counts and the column and row percent per category, as well as information about the validity of the variable in each study dataset:
ds.table(rvar="D$GENDER")
Aggregated (asFactorDS1("D$GENDER")) [=================================================] 100% / 0s Aggregated (tableDS(rvar.transmit = "D$GENDER", cvar.transmit = NULL, stvar.transmit = NULL, ) ... Data in all studies were valid Study 1 : No errors reported from this study Study 2 : No errors reported from this study $output.list $output.list$TABLE_rvar.by.study_row.props study D$GENDER 1 2 0 0.4079193 0.5920807 1 0.4160839 0.5839161 $output.list$TABLE_rvar.by.study_col.props study D$GENDER 1 2 0 0.5048544 0.5132772 1 0.4951456 0.4867228 $output.list$TABLE_rvar.by.study_counts study D$GENDER 1 2 0 1092 1585 1 1071 1503 $output.list$TABLES.COMBINED_all.sources_proportions D$GENDER 0 1 0.51 0.49 $output.list$TABLES.COMBINED_all.sources_counts D$GENDER 0 1 2677 2574 $validity.message [1] "Data in all studies were valid"
In DataSHIELD tabulated data are flagged as invalid
if one or more cells have a count of between 1 and the minimal cell count allowed by the data providers. For example data providers may only allow cell counts ≥ 3.
The function ds.table also creates two-dimensional contingency tables of a categorical variable. The example below constructs a two-dimensional table comprising cross-tabulation of the variables
DIS_DIAB
(diabetes status) and
GENDER
.
ds.table(rvar='D$DIS_DIAB', cvar='D$GENDER', datasources = connections)
Aggregated (asFactorDS1("D$DIS_DIAB")) [===============================================] 100% / 0s Aggregated (asFactorDS1("D$GENDER")) [=================================================] 100% / 0s Aggregated (tableDS(rvar.transmit = "D$DIS_DIAB", cvar.transmit = "D$GENDER", ) [======] 100% / 0s Data in all studies were valid Study 1 : No errors reported from this study Study 2 : No errors reported from this study $output.list $output.list$TABLE.STUDY.1_row.props D$GENDER D$DIS_DIAB 0 1 0 0.502 0.498 1 0.700 0.300 $output.list$TABLE.STUDY.1_col.props D$GENDER D$DIS_DIAB 0 1 0 0.9810 0.9920 1 0.0192 0.0084 $output.list$TABLE.STUDY.2_row.props D$GENDER D$DIS_DIAB 0 1 0 0.511 0.489 1 0.660 0.340 $output.list$TABLE.STUDY.2_col.props D$GENDER D$DIS_DIAB 0 1 0 0.9800 0.9890 1 0.0196 0.0106 $output.list$TABLES.COMBINED_all.sources_row.props D$GENDER D$DIS_DIAB 0 1 0 0.507 0.493 1 0.675 0.325 $output.list$TABLES.COMBINED_all.sources_col.props D$GENDER D$DIS_DIAB 0 1 0 0.9810 0.99000 1 0.0194 0.00971 $output.list$TABLE_STUDY.1_counts D$GENDER D$DIS_DIAB 0 1 0 1071 1062 1 21 9 $output.list$TABLE_STUDY.2_counts D$GENDER D$DIS_DIAB 0 1 0 1554 1487 1 31 16 $output.list$TABLES.COMBINED_all.sources_counts D$GENDER D$DIS_DIAB 0 1 0 2625 2549 1 52 25 $validity.message [1] "Data in all studies were valid"
The function can additionally compute a chi-squared test for homogeneity on (nc-1)*(nr-1) degrees of freedom (where nc is the number of columns and nr is the number of rows):
Below code omits the first section of output which is an exact duplicate of above, only chisquare reports shown:
$chisq.tests $chisq.tests$chisq.test_TABLE.STUDY.1_counts Pearson's Chi-squared test with Yates' continuity correction data: input.array.source.specific X-squared = 3.8767, df = 1, p-value = 0.04896 $chisq.tests$chisq.test_TABLE.STUDY.2_counts Pearson's Chi-squared test with Yates' continuity correction data: input.array.source.specific X-squared = 3.5158, df = 1, p-value = 0.06079 $chisq.tests$chisq.test_TABLES.COMBINED_all.sources_counts Pearson's Chi-squared test with Yates' continuity correction data: combine.array.all.sources X-squared = 7.9078, df = 1, p-value = 0.004922 $validity.message [1] "Data in all studies were valid"
Generating graphs
It is currently possible to produce 4 types of graphs in DataSHIELD: histograms, contour plots, heatmap plots and scatter plots.
Histograms
Histograms
In the default method of generating a DataSHIELD histogram outliers are not shown as these are potentially disclosive. The text summary of the function printed to the client screen informs the user of the presence of classes (bins) with a count smaller than the minimal cell count set by data providers.
Note that a small random number is added to the minimum and maximum values of the range of the input variable. Therefore each user should expect slightly different printed results from those shown below:
- The
ds.histogram
function can be used to create a histogram ofLAB_HDL
that is in the assigned variable dataframe (named "D", by default). The default analysis is set to run on separate data from all studies. Note that Study 1 contains 2 invalid cells (bins) - those bins contain counts less than the data provider minimal cell count.
ds.histogram(x='D$LAB_HDL', datasources = connections)
Aggregated (exists("D")) [=============================================================] 100% / 0s Aggregated (classDS("D$LAB_HDL")) [====================================================] 100% / 0s Aggregated (histogramDS1(D$LAB_HDL,1,3,0.25)) [========================================] 100% / 0s Aggregated (histogramDS2(D$LAB_HDL,10,-0.153421749557465,3.0579610811785,1,3,0.25)) [==] 100% / 0s Warning: study1: 2 invalid cells Warning: study2: 0 invalid cells [[1]] $breaks [1] -0.1534217 0.1677165 0.4888548 0.8099931 1.1311314 1.4522697 1.7734079 2.0945462 2.4156845 2.7368228 [11] 3.0579611 $counts [1] 0 18 51 172 443 550 387 153 25 0 $density [1] 0.00000000 0.03108742 0.08808103 0.29705758 0.76509598 0.94989343 0.66837956 0.26424308 0.04317697 0.00000000 $mids [1] 0.007147392 0.328285675 0.649423958 0.970562241 1.291700524 1.612838807 1.933977090 2.255115373 2.576253657 [10] 2.897391940 $xname [1] "xvect" $equidist [1] TRUE attr(,"class") [1] "histogram" [[2]] $breaks [1] -0.1534217 0.1677165 0.4888548 0.8099931 1.1311314 1.4522697 1.7734079 2.0945462 2.4156845 2.7368228 [11] 3.0579611 $counts [1] 9 19 83 275 604 769 545 182 42 5 $density [1] 0.01106408 0.02335750 0.10203539 0.33806906 0.74252258 0.94536402 0.66999140 0.22374025 0.05163237 0.00614671 $mids [1] 0.007147392 0.328285675 0.649423958 0.970562241 1.291700524 1.612838807 1.933977090 2.255115373 2.576253657 [10] 2.897391940 $xname [1] "xvect" $equidist [1] TRUE attr(,"class") [1] "histogram"
- To produce histograms from a pooled study, the argument
type='combine'
is used.
ds.histogram(x='D$LAB_HDL', type = 'combine', datasources = connections)
Aggregated (exists("D")) [=============================================================] 100% / 0s Aggregated (classDS("D$LAB_HDL")) [====================================================] 100% / 0s Aggregated (histogramDS1(D$LAB_HDL,1,3,0.25)) [========================================] 100% / 0s Aggregated (histogramDS2(D$LAB_HDL,10,-0.153421749557465,3.0579610811785,1,3,0.25)) [==] 100% / 0s $breaks [1] -0.1534217 0.1677165 0.4888548 0.8099931 1.1311314 1.4522697 1.7734079 2.0945462 2.4156845 2.7368228 [11] 3.0579611 $counts [1] 9 37 134 447 1047 1319 932 335 67 5 $density [1] 0.003688026 0.018148307 0.063372138 0.211708879 0.502539521 0.631752481 0.446123653 0.162661110 0.031603113 [10] 0.002048903 $mids [1] 0.007147392 0.328285675 0.649423958 0.970562241 1.291700524 1.612838807 1.933977090 2.255115373 2.576253657 [10] 2.897391940 $xname [1] "xvect" $equidist [1] TRUE $intensities [1] 0.003688026 0.018148307 0.063372138 0.211708879 0.502539521 0.631752481 0.446123653 0.162661110 0.031603113 [10] 0.002048903 attr(,"class") [1] "histogram"
Contour plots
Contour plots are used to visualize a correlation pattern.
- The function
ds.contourPlot
is used to visualise the correlation between the variablesLAB_TSC
(total serum cholesterol andLAB_HDL
(HDL cholesterol). The default istype='combined'
- the results represent a contour plot on pooled data across all studies:
ds.contourPlot(x='D$LAB_TSC', y='D$LAB_HDL', datasources = connections)
Aggregated (exists("D")) [=============================================================] 100% / 0s Aggregated (exists("D")) [=============================================================] 100% / 0s Aggregated (classDS("D$LAB_TSC")) [====================================================] 100% / 0s Aggregated (classDS("D$LAB_HDL")) [====================================================] 100% / 0s Aggregated (rangeDS(D$LAB_TSC)) [======================================================] 100% / 0s Aggregated (rangeDS(D$LAB_HDL)) [======================================================] 100% / 0s Aggregated (densityGridDS(D$LAB_TSC,D$LAB_HDL,TRUE,1.03336178741064,10.5673103958328,-0.1460271... study1: Number of invalid cells (cells with counts >0 and < nfilter.tab ) is 63 study2: Number of invalid cells (cells with counts >0 and < nfilter.tab ) is 74
Heat map plots
An alternative way to visualise correlation between variables is via a heat map plot.
- The function
ds.heatmapPlot
is applied to the variablesLAB_TSC
andLAB_HDL
:
ds.heatmapPlot(x='D$LAB_TSC', y='D$LAB_HDL', datasources = connections)
Aggregated (exists("D")) [=============================================================] 100% / 0s Aggregated (exists("D")) [=============================================================] 100% / 0s Aggregated (classDS("D$LAB_TSC")) [====================================================] 100% / 0s Aggregated (classDS("D$LAB_HDL")) [====================================================] 100% / 0s Aggregated (rangeDS( D$LAB_TSC )) [====================================================] 100% / 0s Aggregated (rangeDS( D$LAB_HDL )) [====================================================] 100% / 0s Aggregated (densityGridDS(D$LAB_TSC,D$LAB_HDL,TRUE,1.03336178741064,10.5673103958328,-0.1460271... study1: Number of invalid cells (cells with counts >0 and < nfilter.tab ) is 63 study2: Number of invalid cells (cells with counts >0 and < nfilter.tab ) is 74
The functions ds.contourPlot
and ds.heatmapPlot
use the range (minimum and maximum values) of the x and y vectors in the process of generating the graph. But in DataSHIELD the minimum and maximum values cannot be returned because they are potentially disclosive; hence what is actually returned for these plots is the 'obscured' minimum and maximum.
Saving Graphs / Plots in R Studio
- Any plots will appear in the bottom right window in R Studio, within the
plot
tab - Select
export
>save as image
- Name the file and select / create a folder to store the image in on the DataSHIELD Client server.
- You can also edit the width and height of the graph
- The plot will now be accessible from your Home folder directory structure.
Sub-setting
Limitations on subsetting
Sub-setting is particularly useful in statistical analyses to break down variables or tables of variables into groups for analysis. Repeated sub-setting, however, can lead to thinning of the data to individual-level records that are disclosive (e.g. the statistical mean of a single value point is the value itself). Therefore, DataSHIELD does not subset an object below the minimal subset length set by the data providers (typically this is ≤ 4 observations).
In DataSHIELD there are currently 3 functions that allow us to generate subset data:
- ds.subsetByClass
- ds.dataFrameSubset
- ds.subset (WARNING: this function will be deprecated in the release of 6.1, all functionallity has been added to ds.dataFrameSubset which will become the one-stop replacement).
Sub-setting using ds.subsetByClass
- The
ds.subsetByClass
function generates subsets for each level of acategorical
variable. If the input is a data frame it produces a subset of that data frame for each class of each categorical variable held in the data frame. - Best practice is to state the categorical variable(s) to subset using the
variables
argument, and the name of the subset data using thesubsets
argument. - The example subsets
GENDER
from our assigned data frameD
, the subset data is namedGenderTables
:
ds.subsetByClass(x = 'D', subsets = "GenderTables", variables = 'GENDER', datasources = connections)
- The output of
ds.subsetByClass
is held in alist
object stored server-side, as the subset data contain individual-level records. If no name is specified in thesubsets
argument, the default name "subClasses"
is used.
Running ds.subsetByClass
on a data frame without specifying the categorical variable in the argument
variables
will create a subset of all categorical variables. If the data frame holds many categorical variables the number of subsets produces might be too large - many of which may not be of interest for the analysis.
In the previous example, the
GENDER
variable in assigned data frame
D
had females coded as 0 and males coded as 1. When
GENDER
was subset using the ds.subsetByClass function, two subset tables were generated for each study dataset; one for females and one for males.
- The
ds.names
function obtains the names of these subset data:
ds.names('GenderTables', datasources = connections)
Aggregated (exists("GenderTables")) [==================================================] 100% / 0s Aggregated (classDS("GenderTables")) [=================================================] 100% / 0s Aggregated (namesDS(GenderTables)) [===================================================] 100% / 0s $study1 [1] "GENDER.level_0" "GENDER.level_1" $study2 [1] "GENDER.level_0" "GENDER.level_1"
Sub-setting using ds.subset
This function is soon to be deprecated. Its replacement will be ds.dataFrameSubset().
ds.dataFrameSubset() uses very different arguments to ds.subset()
Changes will be coming soon to this page. Use function help to investigate how ds.dataFrameSubset() works similarly.
The function
ds.subset
allows general sub-setting of different data types e.g. categorical, numeric, character, data frame, matrix. It is also possible to subset rows (the individual records). No output is returned to the client screen, the generated subsets are stored in the server-side R session.
- The example below uses the function
ds.subset
to subset the assigned data frameD
by rows (individual records) that have no missing values (missing values are denoted withNA
) given by the argumentcompleteCases=TRUE
. The output subset is named "D_without_NA"
:
ds.subset(x='D', subset='D_without_NA', completeCases=TRUE, datasources = connections)
The ds.subset
function prints an invalid
message to the client screen to inform if missing values exist in a subset.
#In order to indicate that a generated subset dataframe or vector is invalid all values within it are set to NA!
An invalid
message also denotes subsets that contain less than the minimum cell count determined by data providers.
- The second example creates a subset of the assigned data frame
D
with BMI values ≥ 25 using the argumentlogicalOperator
. The subset object is namedBMI25plus
using thesubset
argument and is not printed to client screen but is stored in the server-side R session:
ds.subset(x='D', subset='BMI25plus', logicalOperator='PM_BMI_CONTINUOUS>=', threshold=25, datasources = opals)
The subset of data retains the same variables names i.e. column names
- To verify the subset above is correct (holds only observations with BMI ≥ 25) the function
ds.quantileMean
with the argumenttype='split'
will confirm the BMI results for each study are ≥ 25.
ds.quantileMean('BMI25plus$PM_BMI_CONTINUOUS', type='split', datasources = opals) $`dstesting-100` 5% 10% 25% 50% 75% 90% 95% Mean 25.3500 25.7100 27.1500 29.2000 32.0600 34.6560 36.4980 29.9019 $`dstesting-101` 5% 10% 25% 50% 75% 90% 95% Mean 25.46900 25.91800 27.19000 29.27000 32.20500 34.76200 36.24300 29.92606
- Also a histogram of the variable BMI of the new subset data frame could be created for each study separately:
ds.histogram('BMI25plus$PM_BMI_CONTINUOUS', datasources = opals) Warning: dstesting-100: 2 invalid cells Warning: dstesting-101: 1 invalid cells [[1]] $breaks [1] 23.93659 27.17016 30.40373 33.63731 36.87088 40.10445 43.33803 46.57160 49.80518 53.03875 56.27232 $counts [1] 365 511 331 150 49 15 0 0 0 0 $density [1] 0.079212771 0.110897880 0.071834047 0.032553194 0.010634043 0.003255319 0.000000000 0.000000000 0.000000000 0.000000000 $mids [1] 25.55337 28.78695 32.02052 35.25409 38.48767 41.72124 44.95482 48.18839 51.42196 54.65554 $xname [1] "xvect" $equidist [1] TRUE attr(,"class") [1] "histogram" [[2]] $breaks [1] 23.93659 27.17016 30.40373 33.63731 36.87088 40.10445 43.33803 46.57160 49.80518 53.03875 56.27232 $counts [1] 506 750 476 229 62 11 4 0 0 0 $density [1] 0.0767450721 0.1137525773 0.0721949690 0.0347324536 0.0094035464 0.0016683711 0.0006066804 0.0000000000 0.0000000000 0.0000000000 $mids [1] 25.55337 28.78695 32.02052 35.25409 38.48767 41.72124 44.95482 48.18839 51.42196 54.65554 $xname [1] "xvect" $equidist [1] TRUE attr(,"class") [1] "histogram
Modelling
Horizontal DataSHIELD allows the fitting of generalised linear models (GLM). In the GLM function the outcome can be modelled as continuous, or categorical (binomial or discrete). The error to use in the model can follow a range of distribution including gaussian, binomial, Gamma and poisson. In this section only one example will be shown, for more examples please see the manual help page for the function.
Generalised linear models
- The function
ds.glm
is used to analyse the outcome variableDIS_DIAB
(diabetes status) and the covariatesPM_BMI_CONTINUOUS
(continuous BMI),LAB_HDL
(HDL cholesterol) andGENDER
(gender), with an interaction between the latter two. In R this model is represented as:
D$DIS_DIAB ~ D$PM_BMI_CONTINUOUS+D$LAB_HDL*D$GENDER
- By default intermediate results are not printed to the client screen. It is possible to display the intermediate results, to show the coefficients after each iteration, by using the argument
display=TRUE
.
ds.glm(formula=D$DIS_DIAB~D$PM_BMI_CONTINUOUS+D$LAB_HDL*D$GENDER, family='binomial')
Aggregated (glmDS1(D$DIS_DIAB ~ D$PM_BMI_CONTINUOUS + D$LAB_HDL * D$GENDER, ) [========] 100% / 0s Iteration 1... Aggregated (glmDS2(D$DIS_DIAB ~ D$PM_BMI_CONTINUOUS + D$LAB_HDL * D$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 5772.52971970323 Iteration 2... Aggregated (glmDS2(D$DIS_DIAB ~ D$PM_BMI_CONTINUOUS + D$LAB_HDL * D$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 1316.46958534192 Iteration 3... Aggregated (glmDS2(D$DIS_DIAB ~ D$PM_BMI_CONTINUOUS + D$LAB_HDL * D$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 725.530885478793 Iteration 4... Aggregated (glmDS2(D$DIS_DIAB ~ D$PM_BMI_CONTINUOUS + D$LAB_HDL * D$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 574.04091649261 Iteration 5... Aggregated (glmDS2(D$DIS_DIAB ~ D$PM_BMI_CONTINUOUS + D$LAB_HDL * D$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 539.36484363672 Iteration 6... Aggregated (glmDS2(D$DIS_DIAB ~ D$PM_BMI_CONTINUOUS + D$LAB_HDL * D$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 534.529367269197 Iteration 7... Aggregated (glmDS2(D$DIS_DIAB ~ D$PM_BMI_CONTINUOUS + D$LAB_HDL * D$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 534.369349024873 Iteration 8... Aggregated (glmDS2(D$DIS_DIAB ~ D$PM_BMI_CONTINUOUS + D$LAB_HDL * D$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 534.369101005548 Iteration 9... Aggregated (glmDS2(D$DIS_DIAB ~ D$PM_BMI_CONTINUOUS + D$LAB_HDL * D$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 534.369101004809 SUMMARY OF MODEL STATE after iteration 9 Current deviance 534.369101004809 on 4159 degrees of freedom Convergence criterion TRUE (1.38325209879068e-12) beta: -6.90641416691308 0.142256253558181 -0.967440739752358 -1.40945273343359 0.646007073975 Information matrix overall: (Intercept) D$PM_BMI_CONTINUOUS D$LAB_HDL D$GENDER1 D$LAB_HDL:D$GENDER1 (Intercept) 52.47803 1624.8945 71.78440 16.77192 25.91140 D$PM_BMI_CONTINUOUS 1624.89450 51515.6450 2204.90085 503.20484 774.18028 D$LAB_HDL 71.78440 2204.9008 109.48855 25.91140 42.87626 D$GENDER1 16.77192 503.2048 25.91140 16.77192 25.91140 D$LAB_HDL:D$GENDER1 25.91140 774.1803 42.87626 25.91140 42.87626 Score vector overall: [,1] (Intercept) -3.618013e-10 D$PM_BMI_CONTINUOUS -9.890691e-09 D$LAB_HDL -6.317578e-10 D$GENDER1 -2.913176e-10 D$LAB_HDL:D$GENDER1 -5.343783e-10 Current deviance: 534.369101004809 $Nvalid [1] 4164 $Nmissing [1] 1087 $Ntotal [1] 5251 $disclosure.risk RISK OF DISCLOSURE study1 0 study2 0 $errorMessage ERROR MESSAGES study1 "No errors" study2 "No errors" $nsubs [1] 4164 $iter [1] 9 $family Family: binomial Link function: logit $formula [1] "D$DIS_DIAB ~ D$PM_BMI_CONTINUOUS + D$LAB_HDL * D$GENDER" $coefficients Estimate Std. Error z-value p-value low0.95CI.LP high0.95CI.LP P_OR (Intercept) -6.9064142 1.08980103 -6.3373166 2.338013e-10 -9.04238494 -4.7704434 0.00100034 D$PM_BMI_CONTINUOUS 0.1422563 0.02932171 4.8515676 1.224894e-06 0.08478676 0.1997257 1.15287204 D$LAB_HDL -0.9674407 0.36306348 -2.6646601 7.706618e-03 -1.67903208 -0.2558494 0.38005445 D$GENDER1 -1.4094527 1.06921103 -1.3182175 1.874308e-01 -3.50506784 0.6861624 0.24427693 D$LAB_HDL:D$GENDER1 0.6460071 0.69410419 0.9307062 3.520056e-01 -0.71441214 2.0064263 1.90790747 low0.95CI.P_OR high0.95CI.P_OR (Intercept) 0.0001182744 0.008405372 D$PM_BMI_CONTINUOUS 1.0884849336 1.221067831 D$LAB_HDL 0.1865544594 0.774258560 D$GENDER1 0.0300447352 1.986079051 D$LAB_HDL:D$GENDER1 0.4894797709 7.436693232 $dev [1] 534.3691 $df [1] 4159 $output.information [1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"
How ds.glm works
After every iteration in the glm, each study returns non disclosive summaries (a score vector and an information matrix) that are combined on the client R session. The model is fitted again with the updated beta coefficients, this iterative process continues until convergence or the maximum number of iterations is reached. The output of ds.glm
returns the final pooled score vector and information along with some information about the convergence and the final pooled beta coefficients.
Conclusion
You have now completed our basic DataSHIELD training.
Some extended practicals will be coming soon.
Also remember you can:
- get a function list for any DataSHIELD package and
- view the manual help page individual functions
- in the DataSHIELD test environment it is possible to print analyses to file (.csv, .txt, .pdf, .png)
- take a look at our FAQ page for solutions to common problems such as Changing variable class to use in a specific DataSHIELD function.
- Get support from our DataSHIELD forum.
DataSHIELD Wiki by DataSHIELD is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.datashield.ac.uk/wiki