This tutorial assumes you have already installed the DataSHIELD training environment (installation takes around half an hour) or that you have a login for a DataSHIELD cloud training server. We recommend that you first familiarise yourself with R by working through our Session 1: Introduction to R tutorial.
Start arrow (or double-click on the Opal test server name). You can check whether the Opal test servers are ready by typing the following into your navigation bar:
|
The following R packages are required: opal, to log in and log out, and dsBaseClient, dsStatsClient, dsGraphicsClient and dsModellingClient, containing all DataSHIELD functions referred to in this tutorial. Load them by typing the library function into the command line as given in the example below:
#update DataSHIELD packages
update.packages(repos='http://cran.obiba.org')
#load libraries
library(opal)
library(dsBaseClient)
library(dsStatsClient)
library(dsGraphicsClient)
library(dsModellingClient)
The output in R/RStudio will look as follows:
library(opal)
#Loading required package: RCurl
#Loading required package: bitops
#Loading required package: rjson
library(dsBaseClient)
#Loading required package: fields
#Loading required package: spam
#Loading required package: grid
library(dsStatsClient)
library(dsGraphicsClient)
library(dsModellingClient)
You might see the following status message that you can ignore. The message refers to the blocking of functions within the package. |
A Horizontal-DataSHIELD process starts with a login to one or more Opal servers that hold the data behind the data provider firewall. The login details must be formatted as a table before you can log into the Opal servers:
A login template, logindata, is built into the DataSHIELD test environment. Load it with the data function. Calling logindata allows users to view the login data on the screen:
data(logindata)
logindata
# server  url                  user           password  table
# study1  192.168.56.100:8080  administrator  password  CNSIM.CNSIM
# study2  192.168.56.101:8080  administrator  password  CNSIM.CNSIM
You can create a login table to use with /wiki/spaces/DSDEV/pages/12943489 or live research data.
The login details for live research data will have the same format as the login template except:
|
If you are not using your own data, information for the login table is obtained from the data provider. Please follow the appropriate procedures to gain clearance to analyse their data. |
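As a minimal sketch, assuming a login table is an ordinary R data frame with the same five columns as logindata (server, url, user, password, table), one could be built as follows. The server names, URLs, credentials and table name are hypothetical placeholders. The real values come from the data provider.

```r
# Hypothetical login table for two live studies.
# Every value below is a placeholder, not a real server or credential.
my_logindata <- data.frame(
  server   = c("studyA", "studyB"),
  url      = c("https://opal-a.example.org", "https://opal-b.example.org"),
  user     = c("my_username", "my_username"),
  password = c("my_password", "my_password"),
  table    = c("PROJECT.TABLE", "PROJECT.TABLE"),
  stringsAsFactors = FALSE
)

# Confirm the column layout matches the login template shown above.
expected_columns <- c("server", "url", "user", "password", "table")
stopifnot(identical(names(my_logindata), expected_columns))

print(my_logindata)
```

Keeping the column names and order identical to the template makes it easier to swap the table into the same login call used for the test servers.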
Your login details must be loaded via the |
Create a variable opals that calls the datashield.login function to log into the desired Opal servers. In the DataSHIELD test environment, logindata is our login template for the test Opal servers.
opals <- datashield.login(logins=logindata,assign=TRUE)
The output shows that study1 and study2 contain the same 11 variables, listed in capital letters under Variables assigned:
> opals <- datashield.login(logins=logindata,assign=TRUE)
Logging into the collaborating servers
No variables have been specified.
All the variables in the opal table (the whole dataset) will be assigned to R!
Assigining data:
study1...
study2...
Variables assigned:
study1--LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL
study2--LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL
In Horizontal DataSHIELD pooled analysis the data are harmonized and the variables given the same names across the studies, as agreed by all data providers. |
Users can specify individual variables to assign to the server-side R session. It is best practice to first create a list of the Opal variables you want to analyse.
The example below creates a new variable myvar that lists the Opal variables required for analysis: LAB_HDL and GENDER. The variables argument in the function datashield.login uses myvar, so that only the variables in this list are assigned.
myvar <- list('LAB_HDL', 'GENDER')
opals <- datashield.login(logins=logindata,assign=TRUE,variables=myvar)
#Logging into the collaborating servers
#Assigining data:
#study1...
#study2...
#Variables assigned:
#study1--LAB_HDL, GENDER
#study2--LAB_HDL, GENDER
Assigned data are kept in a data frame (table) named |
Use the symbol argument in the datashield.login function to change the name of the data frame from D to mytable.
myvar <- list('LAB_HDL', 'GENDER')
opals <- datashield.login(logins=logindata,assign=TRUE,variables=myvar, symbol='mytable')
#Logging into the collaborating servers
#Assigining data:
#study1...
#study2...
#Variables assigned:
#study1--LAB_HDL, GENDER
#study2--LAB_HDL, GENDER
Only DataSHIELD developers will need to change the default value of the last argument, |
Almost all functions in DataSHIELD can display split results (results separated for each study) or pooled results (results for all the studies combined). This can be done using the |
It is possible to get some descriptive or exploratory statistics about the assigned variables held in the server-side R session such as number of participants at each data provider, number of participants across all data providers and number of variables. Identifying parameters of the data will facilitate your analysis.
The dimensions of the assigned data frame D can be found using the ds.dim command, in which type='split' is the default argument:
opals <- datashield.login(logins=logindata,assign=TRUE)
ds.dim(x='D')
The output of the command is shown below. It shows that in study1 there are 2163 individuals with 11 variables and in study2 there are 3088 individuals with 11 variables:
> opals <- datashield.login(logins=logindata,assign=TRUE)
Logging into the collaborating servers
No variables have been specified.
All the variables in the opal table (the whole dataset) will be assigned to R!
Assigning data:
study1...
study2...
Variables assigned:
study1--LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL
study2--LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL
> ds.dim(x='D')
$study1
[1] 2163 11
$study2
[1] 3088 11
Use the type='combine' argument in the ds.dim function to identify the number of individuals (5251) and variables (11) pooled across all studies:
ds.dim('D', type='combine')
#$pooled.dimension
#[1] 5251 11
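The relationship between the split and pooled dimensions can be checked by hand. The sketch below uses plain R on the client side and simply copies the per-study figures reported in this tutorial. It does not contact any server, and the variable names split_dims and pooled_dim are illustrative only.

```r
# Per-study dimensions (rows, columns) as reported by the split output above.
split_dims <- list(study1 = c(2163, 11), study2 = c(3088, 11))

# Pooled rows: individuals are summed across the studies.
pooled_rows <- sum(sapply(split_dims, function(d) d[1]))

# Pooled columns: harmonised studies share the same variables,
# so every study must report the same column count.
n_cols <- unique(sapply(split_dims, function(d) d[2]))
stopifnot(length(n_cols) == 1)

pooled_dim <- c(pooled_rows, n_cols)
print(pooled_dim)
```

The result agrees with the pooled figures quoted above: 5251 individuals and 11 variables.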
The variable names can be listed using the ds.colnames function on the assigned data frame D:
ds.colnames(x='D')
#$study1
# [1] "LAB_TSC" "LAB_TRIG" "LAB_HDL" "LAB_GLUC_ADJUSTED" "PM_BMI_CONTINUOUS" "DIS_CVA"
# [7] "MEDI_LPD" "DIS_DIAB" "DIS_AMI" "GENDER" "PM_BMI_CATEGORICAL"
#$study2
# [1] "LAB_TSC" "LAB_TRIG" "LAB_HDL" "LAB_GLUC_ADJUSTED" "PM_BMI_CONTINUOUS" "DIS_CVA"
# [7] "MEDI_LPD" "DIS_DIAB" "DIS_AMI" "GENDER" "PM_BMI_CATEGORICAL"
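Because pooled analysis relies on harmonised variable names, it can be useful to confirm that every study reports the same columns. The sketch below is plain client-side R and assumes the column names have been stored in a named list with one character vector per study, which is how they are displayed in this tutorial. Here the list is typed in by hand from that output, and the names col_names and harmonised are illustrative only.

```r
# Column names copied from the tutorial output, one vector per study.
cnsim_columns <- c("LAB_TSC", "LAB_TRIG", "LAB_HDL", "LAB_GLUC_ADJUSTED",
                   "PM_BMI_CONTINUOUS", "DIS_CVA", "MEDI_LPD", "DIS_DIAB",
                   "DIS_AMI", "GENDER", "PM_BMI_CATEGORICAL")
col_names <- list(study1 = cnsim_columns, study2 = cnsim_columns)

# Compare every study against the first one, ignoring column order.
reference <- col_names[[1]]
harmonised <- all(sapply(col_names, function(x) setequal(x, reference)))

# Report any variables missing from a study relative to the reference.
missing_by_study <- lapply(col_names, function(x) setdiff(reference, x))

print(harmonised)
print(missing_by_study)
```

If harmonised is FALSE, missing_by_study shows which variables each study lacks relative to the first study.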
Use the ds.class function to identify the class (type) of a variable, for example whether it is an integer, character, factor etc. The class determines which analyses you can run on the variable. The example below returns the class of the variable LAB_HDL held in the assigned data frame D, denoted by the argument x='D$LAB_HDL'.
ds.class(x='D$LAB_HDL')
#$study1
#[1] "numeric"
#$study2
#[1] "numeric"
You can now continue with the remainder of the DataSHIELD training tutorial from Descriptive Statistics: Quantiles and Mean as it is for the DataSHIELD Cloud training environment.
====================
If you have not installed the DataSHIELD test environment, you can log in to our cloud Opal test servers using the alternative details below. This requires a good internet connection. Please note that this service is not reliable and will be discontinued soon.
The login template for the cloud Opal test servers can be called using:
|