This tutorial assumes you have already installed the DataSHIELD training environment (installation takes around half an hour) or that you have a login for a DataSHIELD cloud training server.

It is recommended that you first familiarise yourself with R by working through our Session 1: Introduction to R tutorial.

Getting ready

Start the Opal Servers

You can check whether the Opal test servers are ready by entering the following URLs in your browser's address bar:

http://192.168.56.100:8080 and http://192.168.56.101:8080 or

https://192.168.56.100:8443 and https://192.168.56.101:8443

Start R/RStudio and load packages

#update DataSHIELD packages
update.packages(repos='http://cran.obiba.org') 
 
#load libraries
library(opal)
library(dsBaseClient)
library(dsStatsClient)
library(dsGraphicsClient)
library(dsModellingClient)

The output in R/RStudio will look as follows:

library(opal)
#Loading required package: RCurl
#Loading required package: bitops
#Loading required package: rjson
library(dsBaseClient)
#Loading required package: fields
#Loading required package: spam
#Loading required package: grid
library(dsStatsClient)
library(dsGraphicsClient)
library(dsModellingClient)

You might see the following status message, which you can ignore. It means that a function in the newly loaded package has the same name as, and therefore masks, a function in a package that was loaded earlier.
The following objects are masked from ‘package:xxxx’
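
The same masking behaviour can be reproduced in plain R without any DataSHIELD package. The sketch below is only an illustration: it defines a function in the workspace whose name clashes with sd() from the stats package (the replacement function is hypothetical), and shows that the original remains reachable through the package::function syntax.

```r
# Define a workspace function that shares its name with stats::sd().
# This replacement is purely illustrative.
sd <- function(x) "locally defined sd"

# A plain call now finds the workspace definition first (stats::sd is masked).
masked_result <- sd(c(1, 2, 3))

# The original function is still available when the package is named explicitly.
original_result <- stats::sd(c(1, 2, 3))

print(masked_result)
print(original_result)

# Remove the workspace definition so that sd() refers to stats::sd() again.
rm(sd)
```

Masking therefore never deletes anything; it only changes which definition R finds first.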

Login template

A Horizontal DataSHIELD process starts with a login to one or more Opal servers that hold the data behind the data provider's firewall. The login details must be supplied as a table in a specific format, and a template is provided:

data(logindata)
logindata

# server url                 user          password table
# study1 192.168.56.100:8080 administrator password CNSIM.CNSIM
# study2 192.168.56.101:8080 administrator password CNSIM.CNSIM

You can create a login table to use with /wiki/spaces/DSDEV/pages/12943489 or live research data.

path<-"/FILEPATH/login.txt"
my_login<-read.table(path, sep="", header=TRUE)

The login details for live research data will have the same format as the login template except:

  • user must contain the file path to an SSL certificate
  • password must contain the file path to an SSL key
  • server must be the formal name of the study
  • url is the URL of the remote server or the test environment
  • table indicates the location of the dataset in Opal, e.g. project_name.table_name

If you are not using your own data, information for the login table is obtained from the data provider. Please follow the appropriate procedures to gain clearance to analyse their data.
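
As a minimal sketch of what such a login file could look like, the example below builds a login table in R, writes it to a temporary text file and reads it back with the same read.table call used above. Every value (study names, URLs, certificate and key paths, table names) is a placeholder assumption; the real values come from the data provider.

```r
# Placeholder login details: all values below are hypothetical.
login_details <- data.frame(
  server   = c("study1", "study2"),
  url      = c("https://opal.study1.example.org:8443",
               "https://opal.study2.example.org:8443"),
  user     = c("/path/to/study1_certificate.pem",
               "/path/to/study2_certificate.pem"),
  password = c("/path/to/study1_key.pem",
               "/path/to/study2_key.pem"),
  table    = c("project_name.table_name",
               "project_name.table_name"),
  stringsAsFactors = FALSE
)

# Write a whitespace-separated text file with a header row.
login_file <- tempfile(fileext = ".txt")
write.table(login_details, file = login_file, quote = FALSE, row.names = FALSE)

# Read it back exactly as in the tutorial: sep="" splits on any whitespace.
my_login <- read.table(login_file, sep = "", header = TRUE,
                       stringsAsFactors = FALSE)
print(my_login)
```

Because sep="" splits on whitespace, none of the values in the file may contain spaces.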

Log in to the remote servers

Your login details must be loaded via the data() function or read into the R session first.

opals <- datashield.login(logins=logindata,assign=TRUE)
> opals <- datashield.login(logins=logindata,assign=TRUE)
Logging into the collaborating servers

  No variables have been specified.
  All the variables in the opal table
  (the whole dataset) will be assigned to R!

Assigining data:
study1...
study2...

Variables assigned:
study1--LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL
study2--LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL

In a Horizontal DataSHIELD pooled analysis the data are harmonised and the variables are given the same names across the studies, as agreed by all data providers.

Assign individual variables on login

Users can specify individual variables to assign to the server-side R session. It is best practice to first create a list of the Opal variables you want to analyse.

myvar <- list('LAB_HDL', 'GENDER')
opals <- datashield.login(logins=logindata,assign=TRUE,variables=myvar)

#Logging into the collaborating servers

#Assigining data:
#study1...
#study2...

#Variables assigned:
#study1--LAB_HDL, GENDER
#study2--LAB_HDL, GENDER
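
A small pre-login check, sketched here in plain R, can catch misspelled variable names before they are sent to the servers. The vector of available names is copied from the login output shown earlier; the check itself is an illustrative assumption and not part of the DataSHIELD API.

```r
# Variable names of the CNSIM table, copied from the login output.
available_vars <- c("LAB_TSC", "LAB_TRIG", "LAB_HDL", "LAB_GLUC_ADJUSTED",
                    "PM_BMI_CONTINUOUS", "DIS_CVA", "MEDI_LPD", "DIS_DIAB",
                    "DIS_AMI", "GENDER", "PM_BMI_CATEGORICAL")

# The variables we intend to assign on login.
myvar <- list("LAB_HDL", "GENDER")

# Any requested name that is not in the table is reported.
unknown_vars <- setdiff(unlist(myvar), available_vars)
if (length(unknown_vars) > 0) {
  stop("Unknown variable(s): ", paste(unknown_vars, collapse = ", "))
}
print(unlist(myvar))
```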

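Assigned data are kept in a data frame (table) named D by default. Each row of the data frame is an individual record and each column is a separate variable. A different name can be given with the symbol argument; the example below names the data frame mytable.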

myvar <- list('LAB_HDL', 'GENDER')
opals <- datashield.login(logins=logindata,assign=TRUE,variables=myvar, symbol='mytable')

#Logging into the collaborating servers

#Assigining data:
#study1...
#study2...

#Variables assigned:
#study1--LAB_HDL, GENDER
#study2--LAB_HDL, GENDER

Only DataSHIELD developers will need to change the default value of the last argument, directory, of the datashield.login function.

Basic statistics and data manipulations

Descriptive statistics: variable dimensions and class

Almost all functions in DataSHIELD can display split results (results separated for each study) or pooled results (results for all the studies combined). This is controlled with the type argument, set to type='split' or type='combine'. The majority of DataSHIELD functions default to type='combine'. The default for each function can be checked in its help page.

It is possible to get some descriptive or exploratory statistics about the assigned variables held in the server-side R session such as number of participants at each data provider, number of participants across all data providers and number of variables. Identifying parameters of the data will facilitate your analysis.

opals <- datashield.login(logins=logindata,assign=TRUE)
ds.dim(x='D')

The output of the command is shown below. It shows that in study1 there are 2163 individuals with 11 variables and in study2 there are 3088 individuals with 11 variables:

> opals <- datashield.login(logins=logindata,assign=TRUE)
Logging into the collaborating servers
  No variables have been specified. 
  All the variables in the opal table 
  (the whole dataset) will be assigned to R!
Assigning data:
study1...
study2...
Variables assigned:
study1--LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL
study2--LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL


> ds.dim(x='D')
$study1
[1] 2163   11

$study2
[1] 3088   11

To obtain the pooled dimensions across both studies, use type='combine':

ds.dim('D', type='combine')
#$pooled.dimension
#[1] 5251   11

The column names of the assigned data frame are returned by ds.colnames:

ds.colnames(x='D')
#$study1
# [1] "LAB_TSC"            "LAB_TRIG"           "LAB_HDL"            "LAB_GLUC_ADJUSTED"  "PM_BMI_CONTINUOUS"  "DIS_CVA"
# [7] "MEDI_LPD"           "DIS_DIAB"           "DIS_AMI"            "GENDER"             "PM_BMI_CATEGORICAL"

#$study2
# [1] "LAB_TSC"            "LAB_TRIG"           "LAB_HDL"            "LAB_GLUC_ADJUSTED"  "PM_BMI_CONTINUOUS"  "DIS_CVA"
# [7] "MEDI_LPD"           "DIS_DIAB"           "DIS_AMI"            "GENDER"             "PM_BMI_CATEGORICAL"

The class of an individual variable is returned by ds.class:

ds.class(x='D$LAB_HDL')
#$study1
#[1] "numeric"

#$study2
#[1] "numeric"
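
As a plain-R illustration of how the split and pooled dimensions reported above relate, the sketch below recombines the per-study dimensions locally. It runs entirely in the client session on the numbers printed by ds.dim and makes no server calls; the object names are assumptions chosen for this example.

```r
# Per-study dimensions as reported by ds.dim(x='D'): rows, then columns.
split_dims <- list(study1 = c(2163, 11), study2 = c(3088, 11))

# Pooled rows are the sum of the study rows.
n_rows <- sapply(split_dims, function(d) d[1])

# Harmonised studies share the same number of columns.
n_cols <- unique(sapply(split_dims, function(d) d[2]))
stopifnot(length(n_cols) == 1)

pooled_dim <- c(sum(n_rows), n_cols)
print(pooled_dim)
```

The result matches the pooled dimension of 5251 rows and 11 columns returned by ds.dim('D', type='combine').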

You can now continue with the remainder of the DataSHIELD training tutorial from Descriptive Statistics: Quantiles and Mean, following it as written for the DataSHIELD Cloud training environment.

====================

If you have not installed the DataSHIELD test environment, you can log in to our cloud Opal test servers using the alternative details below. This requires a good internet connection. Please note that this service is not reliable and will be discontinued soon.

The login template for the cloud Opal test servers must be loaded first, using:

data(login_remoteServer)

You can then log in with:

opals <- datashield.login(logins=login_remoteServer,assign=TRUE)