This tutorial assumes you have already installed the DataSHIELD training environment (installation takes around half an hour) or that you have a login for a DataSHIELD cloud training server.
We recommend that you first familiarise yourself with R by working through our Session 1: Introduction to R tutorial.
Getting ready
Start the Opal Servers
- Open VirtualBox. Left click on an Opal test server and select the green Start arrow (or double click on the Opal test server name).
- The Opal test servers take a few minutes to boot. Leave them running for a few minutes to ensure Opal has started. You do not need to log in.
You can check whether the Opal test servers are ready by typing the following into your browser's address bar:
http://192.168.56.100:8080
and http://192.168.56.101:8080
Start R/RStudio and load packages
- Start R (or RStudio, if using).
- Check for DataSHIELD package updates.
- The following R packages are required for analysis:
  - opal, to log in and log out
  - dsBaseClient, dsStatsClient, dsGraphicsClient and dsModellingClient, containing all DataSHIELD functions referred to in this tutorial
- To load the R packages, type the library function into the command line as given in the example below:
#update dataSHIELD packages
update.packages(repos='http://cran.obiba.org')
#load libraries
library(opal)
library(dsBaseClient)
library(dsStatsClient)
library(dsGraphicsClient)
library(dsModellingClient)
The output in R/RStudio will look as follows:
library(opal)
#Loading required package: RCurl
#Loading required package: bitops
#Loading required package: rjson
library(dsBaseClient)
#Loading required package: fields
#Loading required package: spam
#Loading required package: grid
library(dsStatsClient)
library(dsGraphicsClient)
library(dsModellingClient)
You might see the following status message, which you can ignore. It indicates that some functions in an already loaded package are masked by functions of the same name in the package being loaded.
The following objects are masked from ‘package:xxxx’
Login template
A Horizontal-DataSHIELD process starts with a login to one or more Opal servers that hold the data behind the data provider firewall. The login details must be formatted as follows to log into the Opal servers:
- Login details are held in a data frame (table). The example below shows the format of the login data frame for the DataSHIELD training VMs, which consist of two datasets held by two Opal servers. This login data frame, logindata, is built into the DataSHIELD test environment.
- The DataSHIELD test environment login details are loaded using the data function. Calling logindata displays the login data on the screen:
data(logindata)
logindata
# server url user password table
# study1 192.168.56.100:8080 administrator password CNSIM.CNSIM
# study2 192.168.56.101:8080 administrator password CNSIM.CNSIM
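To make the structure of the login data frame concrete without the test environment, the same table can be rebuilt in base R. This is only a sketch: the object name example_logindata is hypothetical, the values are copied from the printed output above, and it is not how the built-in logindata is actually produced.

```r
# Hypothetical stand-in for the built-in logindata, built with base R only.
# Values are copied from the printed test-environment table.
example_logindata <- data.frame(
  server   = c("study1", "study2"),
  url      = c("192.168.56.100:8080", "192.168.56.101:8080"),
  user     = c("administrator", "administrator"),
  password = c("password", "password"),
  table    = c("CNSIM.CNSIM", "CNSIM.CNSIM"),
  stringsAsFactors = FALSE
)

# One row per Opal server, one column per login field.
str(example_logindata)
```

Each row describes one Opal server, so adding a study means adding a row with the same five fields.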
You can create a login table to use with /wiki/spaces/DSDEV/pages/12943489 or live research data.
- Right click to download and edit our login template login.txt
- Create a variable called path that contains the file path to the login table:
path<-"/FILEPATH/login.txt"
- Load the login table into a variable called my_login using read.table, then follow the standard login instructions, remembering to set the logins argument to my_login:
my_login<-read.table(path, sep="", header=TRUE)
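The read.table step can be tried without downloading anything. The sketch below assumes that login.txt is a whitespace-separated text file with a header row, which is what sep="" and header=TRUE imply; the file contents and the names tmp_path and tmp_login are illustrative placeholders, not the actual template.

```r
# Write a small whitespace-separated login table to a temporary file.
# The contents are placeholders modelled on the test-environment table.
tmp_path <- tempfile(fileext = ".txt")
writeLines(c(
  "server url user password table",
  "study1 192.168.56.100:8080 administrator password CNSIM.CNSIM",
  "study2 192.168.56.101:8080 administrator password CNSIM.CNSIM"
), tmp_path)

# sep = "" splits fields on any run of whitespace;
# header = TRUE takes the column names from the first line.
tmp_login <- read.table(tmp_path, sep = "", header = TRUE,
                        stringsAsFactors = FALSE)
print(tmp_login)
```

Because fields are split on whitespace, values in the file (for example file paths) must not contain spaces.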
The login details for live research data will have the same format as the login template except:
- user must contain the file path to an SSL certificate
- password must contain the file path to an SSL key
- server must be the formal name of the study
- url is the URL of the remote server or the test environment
- table indicates the location of the dataset in Opal, e.g. project_name.table_name
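Before logging in, it can help to check that a login table has the expected columns. The sketch below uses a hypothetical helper, check_login_table, which is not part of any DataSHIELD package; the study name, URL and file paths are invented placeholders and must be replaced with the details supplied by the data provider.

```r
# Hypothetical helper (not a DataSHIELD function): checks that a login table
# has the five expected columns and contains no missing values.
check_login_table <- function(login) {
  required <- c("server", "url", "user", "password", "table")
  missing_cols <- setdiff(required, names(login))
  if (length(missing_cols) > 0) {
    stop("login table is missing column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  if (any(is.na(login[required]))) {
    stop("login table contains missing values")
  }
  invisible(TRUE)
}

# Placeholder values only: study name, URL and paths are invented.
live_login <- data.frame(
  server   = "my_study",
  url      = "https://opal.example.org:8443",
  user     = "/path/to/certificate.pem",
  password = "/path/to/private_key.pem",
  table    = "project_name.table_name",
  stringsAsFactors = FALSE
)
check_login_table(live_login)
```

The check only covers the table's shape; it does not verify that the certificate, key or server actually exist.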
If you are not using your own data, information for the login table is obtained from the data provider. Please follow the appropriate procedures to gain clearance to analyse their data.
Log in to the remote servers
Your login details must be loaded via the data() function or read into the R session first.
- Create a variable called opals that calls the datashield.login function to log into the desired Opal servers. In the DataSHIELD test environment, logindata is our login template for the test Opal servers.
opals <- datashield.login(logins=logindata,assign=TRUE)
- The output below indicates that each of the two test Opal servers, study1 and study2, contains the same 11 variables, listed in capital letters under Variables assigned:
> opals <- datashield.login(logins=logindata,assign=TRUE)
Logging into the collaborating servers
No variables have been specified. All the variables in the opal table (the whole dataset) will be assigned to R!
Assigining data:
study1...
study2...
Variables assigned:
study1--LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL
study2--LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL
In Horizontal DataSHIELD pooled analysis the data are harmonized and the variables given the same names across the studies, as agreed by all data providers.
Assign individual variables on login
Users can specify individual variables to assign to the server-side R session. It is best practice to first create a list of the Opal variables you want to analyse.
- The example below creates a new variable myvar that lists the Opal variables required for analysis: LAB_HDL and GENDER.
- The variables argument of the datashield.login function is set to myvar, so that only the variables in this list are assigned.
myvar <- list('LAB_HDL', 'GENDER')
opals <- datashield.login(logins=logindata,assign=TRUE,variables=myvar)
#Logging into the collaborating servers
#Assigining data:
#study1...
#study2...
#Variables assigned:
#study1--LAB_HDL, GENDER
#study2--LAB_HDL, GENDER
The format of assigned data frames
Assigned data are kept in a data frame (table) named D by default. Each row of the data frame is an individual record and each column is a separate variable.
- The example below uses the symbol argument of the datashield.login function to change the name of the data frame from D to mytable.
myvar <- list('LAB_HDL', 'GENDER')
opals <- datashield.login(logins=logindata,assign=TRUE,variables=myvar, symbol='mytable')
#Logging into the collaborating servers
#Assigining data:
#study1...
#study2...
#Variables assigned:
#study1--LAB_HDL, GENDER
#study2--LAB_HDL, GENDER
Only DataSHIELD developers will need to change the default value of the last argument, directory, of the datashield.login function.
====================
If you have not installed the DataSHIELD test environment, you can log in to our cloud Opal test servers using the alternative details below. This requires a good internet connection. Please note that this service is not reliable and will be discontinued soon.
opals <- datashield.login(logins=login_remoteServer,assign=TRUE)
The login template for the cloud Opal test servers can be called using:
data(login_remoteServer)