Note
titlePrerequisites

It is recommended that you familiarise yourself with R first by sitting our Introduction to R tutorial.

You will also need an up-to-date R installation on your machine; RStudio is an optional but useful extra.


Tip
titleHelp

DataSHIELD support is freely available from the DataSHIELD community in the DataSHIELD forum. Please use this as the first port of call for any problems you may be having; it is monitored closely for new threads.

Bespoke DataSHIELD user support and user training classes are offered on a fee-paying basis. Please enquire at datashield@liverpool.ac.uk for current prices.

Introduction

This tutorial introduces users to DataSHIELD commands and syntax. Throughout this document we refer to R, but all commands are run in the same way in RStudio. This tutorial contains a limited number of examples; further examples are available in each DataSHIELD function manual page that can be accessed via the help function.

The DataSHIELD approach: aggregate and assign functions

Tip
titleHow assign and aggregate functions work

DataSHIELD commands call functions that range from carrying out prerequisite tasks, such as logging in to the data sources, to generating basic descriptive statistics, plots and tabulations. More advanced functions allow users to fit generalized linear models and generalized estimating equation models. R can list all functions available in DataSHIELD.
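As a small illustration of listing the available functions, the sketch below asks R for the names exported by the dsBaseClient package. It assumes dsBaseClient is installed and falls back to an empty vector otherwise; the variable name ds_functions is only illustrative.

```r
# List the client-side functions exported by dsBaseClient, if it is installed.
# getNamespaceExports() is base R and does not require the package to be attached.
ds_functions <- if (requireNamespace("dsBaseClient", quietly = TRUE)) {
  sort(getNamespaceExports("dsBaseClient"))
} else {
  character(0)  # package not installed yet: nothing to list
}

# Show the first few names; use ?name (for example ?ds.mean) to read a help page.
head(ds_functions)
```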

This section explains the functions we will call during this tutorial. Although this knowledge is not required to run DataSHIELD analyses, it helps to understand the output of the commands. In particular, it explains why some commands call functions that return nothing to the user, but instead store their output on the data provider's server for use by a second function.

In DataSHIELD the person running an analysis (the client) uses client-side functions to issue commands (instructions). These commands initiate the execution (running) of server-side functions that run the analysis server-side (behind the firewall of the data provider). There are two types of server-side function: assign functions and aggregate functions.

Assign functions do not return an output to the client, with the exception of error or status messages. Assign functions create new objects and store them server-side either because the objects are potentially disclosive, or because they consist of the individual-level data which, in DataSHIELD, is never seen by the analyst. These new objects can include:

  • new transformed variables (e.g. mean centred or log transformed variables)
  • a new variable of a modified class (e.g. a variable of class numeric may be converted into a factor which R can then model as having discrete categorical levels)
  • a subset object (e.g. a dataframe including gender as a variable may be split into males and females).

Any messages that assign functions do return to the client are limited to errors or useful information about the object stored server-side.

Aggregate functions analyse the data server-side and return an output in the form of aggregate data (summary statistics that are not disclosive) to the client. The help page for each function tells us what is returned and when not to expect an output on the client side.
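The distinction between assign and aggregate functions can be sketched with a toy, single-machine example in base R. This is not DataSHIELD code: the environment called server merely stands in for the R session behind the data provider's firewall, and the function names toy_assign_log and toy_aggregate_mean, as well as the data values, are made up for illustration.

```r
# Toy illustration only: `server` is a local environment that plays the role
# of the server-side R session. Nothing here contacts a real DataSHIELD server.
server <- new.env()
server$DST <- data.frame(LAB_HDL = c(1.2, 1.5, 0.9, 1.8, 1.1))  # made-up values

# "Assign" style: create a new object server-side, return nothing to the client.
toy_assign_log <- function(newobj, column) {
  assign(newobj, log(server$DST[[column]]), envir = server)
  invisible(NULL)
}

# "Aggregate" style: compute server-side, return only a summary statistic.
toy_aggregate_mean <- function(objname) {
  mean(get(objname, envir = server))
}

client_view   <- toy_assign_log("LAB_HDL_log", "LAB_HDL")  # NULL: nothing comes back
summary_value <- toy_aggregate_mean("LAB_HDL_log")         # one number comes back
```

In dsBaseClient, functions such as ds.log and ds.mean appear to follow this same split: the first creates a server-side object, the second returns a non-disclosive summary.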

...

Please follow instructions to Start the Opal VMs.

Recall from the installation instructions that the Opal web interface is a simple check to tell if the VMs have started.

...

Start R/RStudio

Start R, RGui, or RStudio, which you will be using for this analysis training exercise.

Install Packages

The following relevant R packages are required for analysis:

  • DSI to login and logout.
  • DSOpal used by DSI to access the Opal server.
  • dsBaseClient containing all DataSHIELD functions referred to in this tutorial.


Code Block
install.packages('DSI')
install.packages('DSOpal', dependencies=TRUE)
install.packages('dsBaseClient', repos=c(getOption('repos'), 'http://cran.datashield.org'), dependencies=TRUE)
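After installing, it can be useful to confirm that all three packages are present before going further. The following is a minimal sketch using only base R; the variable names required_pkgs and missing_pkgs are illustrative.

```r
# Compare the packages this tutorial needs against what is installed locally.
required_pkgs <- c("DSI", "DSOpal", "dsBaseClient")
missing_pkgs  <- setdiff(required_pkgs, rownames(installed.packages()))

if (length(missing_pkgs) == 0) {
  message("All required packages are installed.")
} else {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
}
```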

Load Packages

To load the R packages, type the library function into the command line as given in the example below:

Code Block
#load libraries
library(DSI)
library(DSOpal)
library(dsBaseClient)

Build your login dataframe 

Tip
titleDataSHIELD cloud IP addresses

The DataSHIELD cloud training environment does not use fixed IP addresses; the client and Opal training server addresses change each training session. As part of the user tutorial you learn how to build a DataSHIELD login dataframe. In a real-world instance of DataSHIELD this is populated with secure certificates, not text-based usernames and passwords.

Login Dataframe

...

The login dataframe is an R object that stores all of the login information needed to access a DataSHIELD server. Saving the commands that build it as an R script means you do not have to gather the information again for future logins. It is built using functions from the DSI package and assigned to a local object, called "logindata" in the example below, which is then passed to the function for logging in to servers.

Code Block
languagexml
titleBuild your login dataframe
# Build your login dataframe

builder <- DSI::newDSLoginBuilder()
builder$append(server = "server1", url = "https://opal-demo.obiba.org/",
               user = "dsuser", password = "P@ssw0rd",
               driver = "OpalDriver", options='list(ssl_verifyhost=0, ssl_verifypeer=0)')
builder$append(server = "server2", url = "https://opal-demo.obiba.org/",
               user = "dsuser", password = "P@ssw0rd",
               driver = "OpalDriver", options='list(ssl_verifyhost=0, ssl_verifypeer=0)')

logindata <- builder$build()
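Before logging in, a quick sanity check of the login dataframe can catch simple mistakes. The helper below is a hypothetical sketch, not part of DSI: the name check_logindata is made up, and it assumes server names are held in a column called "server".

```r
# Hypothetical helper: basic checks on a login dataframe before it is used.
check_logindata <- function(logindata) {
  stopifnot(is.data.frame(logindata), nrow(logindata) > 0)
  # Assumption: server names live in a column called "server". Duplicate names
  # would make it ambiguous which server a result belongs to.
  if ("server" %in% names(logindata) && anyDuplicated(logindata$server) > 0) {
    stop("Each server name should appear only once in the login dataframe.")
  }
  message("Login dataframe describes ", nrow(logindata), " server(s).")
  invisible(logindata)
}

# Usage, once logindata has been built:
# check_logindata(logindata)
```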

...

Login Command

Assign to a local object called "connections" the result of the DSI function that logs in to the desired Opal servers. In the DataSHIELD test environment, logindata is our login dataframe for the Opal training servers.

Code Block
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")

The output below indicates that each of the two Opal training servers, "server1" and "server2", contains the same 11 variables, listed in capital letters under Variables assigned:

Code Block
languagexml
themeRDark
Logging into the collaborating servers
  Logged in all servers [================================================================] 100% /14s

  No variables have been specified. 
  All the variables in the table 
  (the whole dataset) will be assigned to R!

Assigning table data...
  Assigned all tables [==================================================================] 100% /13s

Variables assigned:
study1 -- LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL
study2 -- LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL

Note
In Horizontal DataSHIELD pooled analysis, the data are harmonized and the variables given the same names across the studies, as agreed by all data providers.


Tip
titleHow datashield.login works

The datashield.login function from the R package "DSI" allows users to log in and assign data to analyse from the Opal server in a server-side R session created behind the firewall of the data provider.

All the commands sent after login are processed within the server-side R instance, which only allows a specific set of commands to run (see the details of a typical horizontal DataSHIELD process). The server-side R session is wiped after logging out.

Assign tables command

Finally, after successfully making a connection with the server, you must specify which studies, stored in tables, you wish to load into the session. This is done with another of the DSI package functions, "datashield.assign.table". 

Code Block
DSI::datashield.assign.table(conns = connections, symbol = "DST", table = c("CNSIM.CNSIM1","CNSIM.CNSIM2"))
  • The "conns" argument takes the DSConnection-class object(s) returned by datashield.login (here "connections"), which identify the servers that statistical commands will refer to.
  • The "symbol" argument creates a name by which to refer to the dataframes in each study. Here we have opted for "DST", an initialism for "DataSHIELD Table". (You may have seen "D" for "Dataframe" being used historically, but we are now phasing this out as it sometimes causes problems with another function called "D".)
  • The "table" argument specifies the names of the tables you wish to connect to, as they appear on the servers you are using. The structure "AAAA.BBBB", "AAAA.CCCC" means that within project AAAA there are tables BBBB and CCCC; we connect to both by listing them in an R vector.
Tip
titleHow datashield.login works

If we do not specify individual variables to assign to the server-side R session, as in this case, all variables held in the Opal servers are assigned. You can add an argument ("variables") to datashield.assign.table to limit the columns that are loaded from the server data frame (for usage see the help materials for the function: ?datashield.assign.table ).
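A call using the "variables" argument might look like the sketch below. It is wrapped in a function so that defining it does not contact any server; the wrapper name assign_selected is hypothetical, and calling it assumes the DSI package is installed and that a live connections object has already been returned by datashield.login.

```r
# Hypothetical wrapper: assign only selected columns to the server-side
# data frame "DST". Nothing is sent to a server until the function is called.
assign_selected <- function(connections, variables = list("LAB_HDL", "GENDER")) {
  DSI::datashield.assign.table(
    conns     = connections,
    symbol    = "DST",
    table     = c("CNSIM.CNSIM1", "CNSIM.CNSIM2"),
    variables = variables
  )
}

# Usage, after a successful login:
# assign_selected(connections)
```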

Assigned data are kept in a data frame named DST by convention. Each column of that data frame represents one variable and the rows are the individual records.

An example of the printout after the login process has finished:

Code Block
languagexml
themeRDark
Assigned all table (DST <- ...) [======================================================] 100% /25s


Tip
titleHow datashield.login works

At this point, you are logged in and ready to proceed!

However, let's quickly review some other tips and tricks about using the login dataframe.

Command to logout:

You should get into the habit of putting this command at the end of your scripts, and running it after you are finished. It is particularly important to do so when connecting to shared DataSHIELD servers, to save resources for the analyses of others.

Code Block
languagebash
DSI::datashield.logout(connections)
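One way to make sure the logout always happens, even if an analysis step fails, is to register it with on.exit() inside a small wrapper function. The sketch below is only one possible pattern: the name run_with_logout is hypothetical, and the login and logout arguments default to the DSI functions used in this tutorial but can be swapped for stand-ins when experimenting without a server.

```r
# Hypothetical wrapper: log in, run an analysis, and always log out afterwards.
run_with_logout <- function(logindata, analysis,
                            login  = DSI::datashield.login,
                            logout = DSI::datashield.logout) {
  connections <- login(logins = logindata)
  # on.exit() runs when the function exits, whether normally or through an error.
  on.exit(logout(connections), add = TRUE)
  analysis(connections)
}

# Usage with a real server (requires DSI, DSOpal and a valid logindata):
# run_with_logout(logindata, function(conns) DSI::datashield.symbols(conns))
```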

In a later tutorial in this series, you will find the option of saving your workspace before logging out, to be able to log in another day and have all your variables intact and ready to go without having to run everything again!

Expand
titleUnsure of how to make this compatible with new login method...

Assign individual variables on login

Users can specify individual variables to assign to the server-side R session. It is best practice to first create a list of the Opal variables you want to analyse.

  • The example below creates a new variable myvar that lists the Opal variables required for analysis: LAB_HDL and GENDER
  • The variables argument in the function datashield.login uses myvar, which then will call only this list.
Code Block
languagexml
myvar <- list('LAB_HDL', 'GENDER')
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "DST", variables=myvar)


Code Block
languagexml
themeRDark
Logging into the collaborating servers
  Logged in all servers [================================================================] 100% / 4s

Assigning table data...
  Assigned all tables [==================================================================] 100% / 7s

Variables assigned:
study1 -- LAB_HDL, GENDER
study2 -- LAB_HDL, GENDER


Tip
titleThe format of assigned data frames

Assigned data are kept in a data frame (table) named D by default. Each row of the data frame is an individual record and each column is a separate variable.

  • The example below uses the argument symbol in the datashield.login function to change the name of the data frame from D to mytable
Code Block
languagexml
myvar <- list('LAB_HDL', 'GENDER')
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol='mytable', variables=myvar)


Code Block
languagexml
themeRDark
Logging into the collaborating servers
  Logged in all servers [================================================================] 100% / 4s

Assigning table data...
  Assigned all tables [==================================================================] 100% / 6s

Variables assigned:
study1 -- LAB_HDL, GENDER
study2 -- LAB_HDL, GENDER



Conclusion

The other parts in this DataSHIELD tutorial series are:

Tip

Also remember you can:

...