Quick link to this page for presentations etc.: http://bit.ly/intro-DS
Tip |
---|
|
Free DataSHIELD support is provided by the DataSHIELD community in the DataSHIELD forum. Bespoke DataSHIELD user support and user training classes are offered for a fee. Please enquire at datashield@newcastle.ac.uk for current prices. |
Introduction
This tutorial introduces users to DataSHIELD commands and syntax. Throughout this document we refer to R, but all commands are run in the same way in RStudio. This tutorial contains a limited number of examples; further examples are available in the manual page of each DataSHIELD function, which can be accessed via the help function.
The DataSHIELD approach: aggregate and assign functions
DataSHIELD commands call functions that range from carrying out prerequisite tasks, such as logging in to the data sources, to generating basic descriptive statistics, plots and tabulations. More advanced functions allow users to fit generalized linear models and generalized estimating equation models. R can list all functions available in DataSHIELD.
This section explains the functions we will call during this tutorial. Although this knowledge is not required to run DataSHIELD analyses, it helps to understand the output of the commands. It explains why some commands call functions that return nothing to the user, but instead store their output on the data provider's server for use in a second function.
...
Info |
---|
title | How assign and aggregate functions work |
---|
|
Assign functions do not return an output to the client, with the exception of error messages or status messages about the object stored server-side. Assign functions create new objects and store them server-side either because the objects are potentially disclosive, or because they consist of the individual-level data which, in DataSHIELD, is never seen by the analyst. These new objects can include:
- new transformed variables (e.g. mean centred or log transformed variables)
- a new variable of a modified class (e.g. a variable of class numeric may be converted into a factor which R can then model as having discrete categorical levels)
- a subset object (e.g. a dataframe including gender as a variable may be split into males and females).
Aggregate functions analyse the data server-side and return an output in the form of aggregate data (summary statistics that are not disclosive) to the client. The help page for each function tells us what is returned and when not to expect an output on the client side. |
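The difference between the two kinds of function can be pictured with a toy model in plain R. This is only a sketch: a local environment stands in for the server-side R session, and the helper names (toy_assign_log, toy_aggregate_mean) and data values are made up for illustration and are not DataSHIELD functions.

```r
# Toy model only: a local environment stands in for the server-side R session.
# None of these helper names are DataSHIELD functions.
server_session <- new.env()
server_session$D <- data.frame(LAB_HDL = c(1.2, 1.8, 1.5, NA, 1.6))

# "Assign"-style helper: stores a new object server-side and returns nothing.
toy_assign_log <- function(newobj, column) {
  values <- log(server_session$D[[column]])
  assign(newobj, values, envir = server_session)
  invisible(NULL)
}

# "Aggregate"-style helper: returns only a summary statistic to the caller.
toy_aggregate_mean <- function(objname) {
  mean(get(objname, envir = server_session), na.rm = TRUE)
}

assign_result <- toy_assign_log("LAB_HDL.log", "LAB_HDL")  # NULL: nothing comes back
mean_of_logs  <- toy_aggregate_mean("LAB_HDL.log")         # a single number
```

The individual log values stay inside server_session; only the mean reaches the caller, which mirrors the two-step pattern of an assign function followed by an aggregate function.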
Starting and Logging onto the Opal Training Servers - Cloud Training Environment
If you are attending one of our training sessions you will be using our DataSHIELD Cloud Training Environment. If you are running this tutorial on a DataSHIELD VM installed on your own machine, please skip to these instructions.
Start the Opal Servers and Login
- Your trainer will have started your Opal training servers in the cloud for you.
- Your trainer will give you the IP address of the DataSHIELD client portal (RStudio) ending :8787
- They will also provide you with a username and password to log in with.
...
Build your login dataframe
Info |
---|
title | DataSHIELD cloud IP addresses |
---|
|
The DataSHIELD cloud training environment does not use fixed IP addresses; the client and Opal training server addresses change with each training session. As part of the user tutorial you learn how to build a DataSHIELD login dataframe. In a real-world instance of DataSHIELD this is populated with secure certificates, not text-based usernames and passwords. |
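A login dataframe of this kind can be sketched in plain R as below. The column names (server, url, user, password, table) are an assumption about what the login function expects, and every URL, credential and table name is a placeholder to be replaced with the details your trainer provides. Only the two study names are taken from the output shown later in this tutorial.

```r
# Placeholder values only: replace with the details your trainer provides.
server   <- c("dstesting-100", "dstesting-101")   # study names used in this tutorial's output
url      <- c("http://XXX.XXX.XXX.XXX:8080",      # hypothetical Opal server addresses
              "http://YYY.YYY.YYY.YYY:8080")
user     <- "your-username"                        # hypothetical credentials
password <- "your-password"
table    <- c("PROJECT.TABLE1", "PROJECT.TABLE2")  # hypothetical Opal table names

# One row per server; the single user and password values are recycled to both rows.
my_logindata <- data.frame(server, url, user, password, table,
                           stringsAsFactors = FALSE)
```

Building the dataframe from named vectors keeps each server's address and table on the same row, so a change of IP address between sessions only requires editing the url vector.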
...
Start R/RStudio and load packages
- The following relevant R packages are required for analysis
...
Code Block |
---|
|
#load libraries
library(opal)
Loading required package: RCurl
Loading required package: bitops
Loading required package: rjson
Loading required package: mime
library(dsBaseClient)
|
Log onto the remote Opal training servers
- Create a variable called opals that calls the datashield.login function to log into the desired Opal servers. In the DataSHIELD test environment my_logindata is our login dataframe for the Opal training servers.
...
Assign individual variables on login
Users can specify individual variables to assign to the server-side R session. It is best practice to first create a list of the Opal variables you want to analyse.
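As a minimal sketch, the list of variables can be built as an ordinary R list. The two variable names are ones used elsewhere in this tutorial. The commented login call is an assumption about how the list is passed on (via a variables argument) and needs the training servers to run, so it is not executed here.

```r
# Variables named elsewhere in this tutorial; adjust to the variables you need.
myvar <- list("LAB_HDL", "GENDER")

# Assumed usage (requires the training servers, so it is left commented out):
# opals <- datashield.login(logins = my_logindata, assign = TRUE, variables = myvar)
```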
...
Basic statistics and data manipulations
Descriptive statistics: variable dimensions and class
Tip |
---|
Almost all functions in DataSHIELD can display split results (results separated for each study) or pooled results (results for all the studies combined). This can be done using the type='split' and type='combine' arguments in each function. The majority of DataSHIELD functions have a default of type='combine'. The default for each function can be checked in the function help page. Some of the newer versions of functions include the option type='both', which returns both the split and the pooled results. |
...
Code Block |
---|
|
ds.class(x='D$LAB_HDL', datasources = opals)
$`dstesting-100`
[1] "numeric"
$`dstesting-101`
[1] "numeric"
|
Descriptive statistics: quantiles and mean
As LAB_HDL is a numeric variable, the distribution of the data can be explored.
...
Code Block |
---|
|
ds.mean(x='D$LAB_HDL', datasources = opals)
$Mean.by.Study
EstimatedMean Nmissing Nvalid Ntotal
dstesting-100 1.569416 360 1803 2163
dstesting-101 1.556648 555 2533 3088
$Nstudies
[1] 2
$ValidityMessage
ValidityMessage
dstesting-100 "VALID ANALYSIS"
dstesting-101 "VALID ANALYSIS"
|
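To see how split results relate to a pooled result, the study-level means above can be combined by hand. This sketch assumes the pooled mean is the average of the study means weighted by the number of valid observations; it is plain R arithmetic on the printed values, not a DataSHIELD call.

```r
# Values copied from the ds.mean output above.
study_means  <- c("dstesting-100" = 1.569416, "dstesting-101" = 1.556648)
study_nvalid <- c("dstesting-100" = 1803,     "dstesting-101" = 2533)

# Pooled mean: each study mean weighted by its number of valid observations.
pooled_mean <- sum(study_means * study_nvalid) / sum(study_nvalid)
round(pooled_mean, 3)
```

The result, about 1.562, lies between the two study means and closer to dstesting-101, which contributes more valid observations.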
Descriptive statistics: assigning variables
So far all the functions in this section have returned something to the screen. Some functions (assign functions) create new objects in the server-side R session that are required for analysis but do not return anything to the client screen. For example, an analysis may require the log values of a variable.
...
Code Block |
---|
|
ds.mean(x='LAB_HDL.c', datasources = opals)
$Mean.by.Study
EstimatedMean Nmissing Nvalid Ntotal
dstesting-100 0.007416316 360 1803 2163
dstesting-101 -0.005352231 555 2533 3088
$Nstudies
[1] 2
$ValidityMessage
ValidityMessage
dstesting-100 "VALID ANALYSIS"
dstesting-101 "VALID ANALYSIS"
|
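The step that creates LAB_HDL.c is not shown here. Assuming it is LAB_HDL with a single constant subtracted (for example a pooled mean), the constant can be recovered from the two sets of study-level means printed in this section. The sketch below is plain R arithmetic on those printed values.

```r
# Study-level means of D$LAB_HDL and of LAB_HDL.c, copied from the outputs in this section.
raw_means     <- c("dstesting-100" = 1.569416,    "dstesting-101" = 1.556648)
centred_means <- c("dstesting-100" = 0.007416316, "dstesting-101" = -0.005352231)

# If LAB_HDL.c is LAB_HDL minus a constant, this difference is that constant in both studies.
implied_constant <- raw_means - centred_means
round(implied_constant, 3)
```

Both studies give approximately 1.562, which is consistent with the same constant having been subtracted in each study.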
Generating contingency tables
The function ds.table1D creates a one-dimensional contingency table of a categorical variable. The default is to run on pooled data from all studies; to obtain an output for each study, set the argument type='split'.
...
Code Block |
---|
|
$chi2Test.all.studies
$chi2Test.all.studies$`pooled-D$DIS_DIAB(row)|D$GENDER(col)`
Pearson's Chi-squared test with Yates' continuity correction
data: pooledContingencyTable
X-squared = 7.9078, df = 1, p-value = 0.004922
$counts
$counts$`dstesting-100-D$DIS_DIAB(row)|D$GENDER(col)`
0 1 Total
0 1071 1062 2133
1 21 9 30
Total 1092 1071 2163
$counts$`dstesting-101-D$DIS_DIAB(row)|D$GENDER(col)`
0 1 Total
0 1554 1487 3041
1 31 16 47
Total 1585 1503 3088
$counts.all.studies
$counts.all.studies$`pooled-D$DIS_DIAB(row)|D$GENDER(col)`
0 1 Total
0 2625 2549 5174
1 52 25 77
Total 2677 2574 5251
$validity
[1] "All tables are valid!"
|
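Because the pooled counts are non-disclosive aggregate data returned to the client, the pooled chi-squared test can be checked locally. The sketch below uses base R's chisq.test on the counts printed under $counts.all.studies; it is an independent check, not the code DataSHIELD runs.

```r
# Pooled counts copied from $counts.all.studies above (rows: DIS_DIAB, columns: GENDER).
pooled_counts <- matrix(c(2625, 52,    # GENDER = 0: DIS_DIAB 0, 1
                          2549, 25),   # GENDER = 1: DIS_DIAB 0, 1
                        nrow = 2,
                        dimnames = list(DIS_DIAB = c("0", "1"),
                                        GENDER   = c("0", "1")))

# chisq.test applies Yates' continuity correction to 2x2 tables by default.
pooled_test <- chisq.test(pooled_counts)
pooled_test$statistic   # X-squared, approximately 7.9078
pooled_test$p.value     # approximately 0.004922
```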
Generating graphs
It is currently possible to produce four types of graph in DataSHIELD: histograms, contour plots, heat map plots and scatter plots.
Histograms
Info |
---|
|
In the default method of generating a DataSHIELD histogram outliers are not shown as these are potentially disclosive. The text summary of the function printed to the client screen informs the user of the presence of classes (bins) with a count smaller than the minimal cell count set by data providers. |
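The idea of a minimal cell count can be sketched in plain R. The threshold and the bin counts below are hypothetical, and the choice to report invalid bins as zero is an assumption made for illustration; real thresholds are set by each data provider.

```r
# Hypothetical values: real thresholds are set by each data provider.
min_cell_count <- 5
bin_counts <- c(365, 511, 331, 150, 49, 15, 3, 1, 0, 0)

# A bin is treated as invalid when it is non-empty but below the threshold.
invalid <- bin_counts > 0 & bin_counts < min_cell_count

# Assumption: invalid bins are reported as zero rather than with their true count.
reported_counts <- ifelse(invalid, 0, bin_counts)
sum(invalid)   # number of invalid cells to warn about
```

Here two bins fall below the threshold, so the client would see a warning about two invalid cells and zeros in place of the small counts.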
...
Contour plots
Contour plots are used to visualize a correlation pattern.
...
Heat map plots
An alternative way to visualise correlation between variables is via a heat map plot.
...
Note |
---|
The functions ds.contourPlot and ds.heatmapPlot use the range (minimum and maximum values) of the x and y vectors in the process of generating the graph. But in DataSHIELD the minimum and maximum values cannot be returned because they are potentially disclosive; hence what is actually returned for these plots is the 'obscured' minimum and maximum. |
Saving Graphs / Plots in RStudio
- Any plots will appear in the bottom right window in RStudio, within the plot tab
- Select export > save as image
...
Sub-setting
Info |
---|
title | Limitations on subsetting |
---|
|
Sub-setting is particularly useful in statistical analyses to break down variables or tables of variables into groups for analysis. Repeated sub-setting, however, can lead to thinning of the data to individual-level records that are disclosive (e.g. the statistical mean of a single value point is the value itself). Therefore, DataSHIELD does not subset an object below the minimal subset length set by the data providers (typically this is ≤ 4 observations). |
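The effect of a minimal subset length can be illustrated with a small guard function in plain R. The threshold, the function name safe_subset and the toy data are all hypothetical; this is a sketch of the principle, not how DataSHIELD implements the check.

```r
# Hypothetical threshold; the real value is configured by each data provider.
min_subset_length <- 5

# Returns the subset only if it is large enough, otherwise NULL with a message.
safe_subset <- function(df, keep) {
  result <- df[keep, , drop = FALSE]
  if (nrow(result) < min_subset_length) {
    message("Subset refused: fewer than ", min_subset_length, " rows")
    return(NULL)
  }
  result
}

# Made-up data for illustration only.
toy <- data.frame(GENDER = c(0, 1, 1, 0, 1, 1, 1, 0),
                  PM_BMI_CONTINUOUS = c(22.1, 27.4, 31.0, 24.8, 29.3, 26.2, 33.5, 21.7))

gender1  <- safe_subset(toy, toy$GENDER == 1)                               # 5 rows: allowed
too_thin <- safe_subset(toy, toy$GENDER == 1 & toy$PM_BMI_CONTINUOUS > 30)  # 2 rows: refused
```

The second call shows how repeated sub-setting thins the data: the first subset passes the check, but subsetting it further leaves too few records and nothing is returned.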
...
- ds.subsetByClass
- ds.subset
- ds.dataFrameSubset
Sub-setting using ds.subsetByClass
- The ds.subsetByClass function generates subsets for each level of a categorical variable. If the input is a data frame it produces a subset of that data frame for each class of each categorical variable held in the data frame.
- Best practice is to state the categorical variable(s) to subset using the variables argument, and the name of the subset data using the subsets argument.
- The example subsets GENDER from our assigned data frame D; the subset data is named GenderTables:
...
Sub-setting using ds.subset
The function ds.subset allows general sub-setting of different data types, e.g. categorical, numeric, character, data frame, matrix. It is also possible to subset rows (the individual records). No output is returned to the client screen; the generated subsets are stored in the server-side R session.
...
Code Block |
---|
|
ds.histogram('BMI25plus$PM_BMI_CONTINUOUS', datasources = opals)
Warning: dstesting-100: 2 invalid cells
Warning: dstesting-101: 1 invalid cells
[[1]]
$breaks
[1] 23.93659 27.17016 30.40373 33.63731 36.87088 40.10445 43.33803 46.57160 49.80518 53.03875 56.27232
$counts
[1] 365 511 331 150 49 15 0 0 0 0
$density
[1] 0.079212771 0.110897880 0.071834047 0.032553194 0.010634043 0.003255319 0.000000000 0.000000000 0.000000000 0.000000000
$mids
[1] 25.55337 28.78695 32.02052 35.25409 38.48767 41.72124 44.95482 48.18839 51.42196 54.65554
$xname
[1] "xvect"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
[[2]]
$breaks
[1] 23.93659 27.17016 30.40373 33.63731 36.87088 40.10445 43.33803 46.57160 49.80518 53.03875 56.27232
$counts
[1] 506 750 476 229 62 11 4 0 0 0
$density
[1] 0.0767450721 0.1137525773 0.0721949690 0.0347324536 0.0094035464 0.0016683711 0.0006066804 0.0000000000 0.0000000000 0.0000000000
$mids
[1] 25.55337 28.78695 32.02052 35.25409 38.48767 41.72124 44.95482 48.18839 51.42196 54.65554
$xname
[1] "xvect"
$equidist
[1] TRUE
attr(,"class")
[1] "histogram"
|

Modelling
Horizontal DataSHIELD allows the fitting of generalised linear models (GLM). In the GLM function the outcome can be modelled as continuous or categorical (binomial or discrete). The error structure used in the model can follow a range of distributions, including gaussian, binomial, Gamma and poisson. In this section only one example will be shown; for more examples please see the manual help page for the function.
Generalised linear models
- The function ds.glm is used to analyse the outcome variable DIS_DIAB (diabetes status) and the covariates PM_BMI_CONTINUOUS (continuous BMI), LAB_HDL (HDL cholesterol) and GENDER (gender), with an interaction between the latter two. In R this model is represented as:
...
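The model described above can be written as a standard R formula and expanded locally to check which terms it contains. The formula below is reconstructed from the description (main effects for BMI, HDL and gender plus an HDL-by-gender interaction) and is an assumption about the exact form used. How the variables are referenced on the server side and the call to ds.glm itself are not reproduced here.

```r
# Formula assumed from the description above: main effects for BMI, HDL and gender,
# plus an HDL-by-gender interaction (LAB_HDL * GENDER expands to both main effects
# and their interaction).
model_formula <- DIS_DIAB ~ PM_BMI_CONTINUOUS + LAB_HDL * GENDER

# Expand the formula to see which terms R will fit.
model_terms <- attr(terms(model_formula), "term.labels")
model_terms
```

The expansion lists the three main effects followed by the LAB_HDL:GENDER interaction term, which is a quick way to confirm that a formula matches the intended model before running it.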
Tip |
---|
You have now sat our basic DataSHIELD training. If you would like to practice further please sit our Also remember you can: |
...