DataSHIELD Training Part 1: Introduction and logging in
Prerequisites
It is recommended that you familiarise yourself with R first by sitting our Introduction to R tutorial.
You will also need an up-to-date R installation on your machine. RStudio is an optional but useful extra.
Help
DataSHIELD support is freely available from the DataSHIELD community in the DataSHIELD forum. Please use this as the first port of call for any problems you may be having; it is monitored closely for new threads.
DataSHIELD bespoke user support and also user training classes are offered on a fee-paying basis. Please enquire at datashield@liverpool.ac.uk for current prices.
Introduction
This tutorial introduces users to DataSHIELD commands and syntax. Throughout this document we refer to R, but all commands are run in the same way in RStudio. This tutorial contains a limited number of examples; further examples are available in each DataSHIELD function manual page, which can be accessed via the help function.
Start R/RStudio
Start R, RGui, or RStudio, which you will be using for this analysis training exercise.
Install Packages
The following relevant R packages are required for analysis:
- DSI to login and logout.
- DSOpal used by DSI to access the Opal server.
- dsBaseClient containing all DataSHIELD functions referred to in this tutorial.
install.packages('DSI')
install.packages('DSOpal', dependencies=TRUE)
install.packages('dsBaseClient', repos=c(getOption('repos'), 'http://cran.datashield.org'), dependencies=TRUE)
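Before moving on, it can help to confirm that the three packages listed above were actually installed. The following is a minimal sketch using only base R; the variable names `required`, `installed` and `missing_pkgs` are illustrative and not part of DataSHIELD.

```r
# Packages this tutorial relies on (taken from the list above)
required <- c("DSI", "DSOpal", "dsBaseClient")

# requireNamespace() returns FALSE, rather than raising an error,
# when a package is not installed
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Report anything that still needs installing
missing_pkgs <- required[!installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are installed.")
}
```

If any package is reported as not installed, rerun the corresponding install.packages command above.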
Load Packages
To load the R packages, type the library function into the command line as given in the example below:
# load libraries
library(DSI)
library(DSOpal)
library(dsBaseClient)
Build your login dataframe
DataSHIELD cloud IP addresses
The DataSHIELD cloud training environment does not use fixed IP addresses; the client and Opal training server addresses change each training session. As part of the user tutorial you learn how to build a DataSHIELD login dataframe. In a real-world instance of DataSHIELD this is populated with secure certificates, not text-based usernames and passwords.
The login dataframe is an R object that is created to store all of the login information necessary to access a DataSHIELD server, and save it (as an R script) for future logins, without having to gather the information each time. It is done by using DataSHIELD functions from the DSI package. It is then assigned to a local object, in the case below called "logindata", to be passed into the function for logging in to servers.
# Build your login dataframe
builder <- DSI::newDSLoginBuilder()
builder$append(server = "server1", url = "https://opal-demo.obiba.org/",
               user = "dsuser", password = "P@ssw0rd", driver = "OpalDriver",
               options='list(ssl_verifyhost=0, ssl_verifypeer=0)')
builder$append(server = "server2", url = "https://opal-demo.obiba.org/",
               user = "dsuser", password = "P@ssw0rd", driver = "OpalDriver",
               options='list(ssl_verifyhost=0, ssl_verifypeer=0)')
logindata <- builder$build()
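When several servers share the same credentials, as in the training setup above, the repeated append calls can be written as a loop. This is only a sketch: the named vector `server_urls` is a hypothetical convenience, not something DSI requires, and the URL and credentials are the training values used above.

```r
library(DSI)

# Hypothetical named vector: names become the server labels,
# values are the Opal URLs (training values from this tutorial)
server_urls <- c(server1 = "https://opal-demo.obiba.org/",
                 server2 = "https://opal-demo.obiba.org/")

builder <- DSI::newDSLoginBuilder()
for (name in names(server_urls)) {
  # One append() call per server, same arguments as in the example above
  builder$append(server = name, url = server_urls[[name]],
                 user = "dsuser", password = "P@ssw0rd",
                 driver = "OpalDriver",
                 options = "list(ssl_verifyhost=0, ssl_verifypeer=0)")
}
logindata <- builder$build()

# The result is an ordinary data frame with one row per server
print(nrow(logindata))
print(logindata$server)
```

Because the result is a plain data frame, it can be inspected with the usual R tools before it is passed to the login function.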
Login Command
Use the DSI function datashield.login to log in to the desired Opal servers, and assign its result to a local object called "connections". In the DataSHIELD test environment, logindata is our login dataframe for the Opal training servers.
connections <- DSI::datashield.login(logins = logindata, assign = TRUE)
The output below indicates that each of the two Opal training servers, "server1" and "server2", contains the same 11 variables, listed in capital letters under Variables assigned:
Logging into the collaborating servers
Logged in all servers [================================================================] 100% /24s
How datashield.login works
The datashield.login function from the R package "DSI" allows users to log in to the Opal server and work in a server-side R session created behind the firewall of the data provider.
All commands sent after login are processed within the server-side R instance, which only allows a specific set of commands to run (see the details of a typical horizontal DataSHIELD process). The server-side R session is wiped after logging out.
Assign tables command
Finally, after successfully making a connection with the server, you must specify which studies, stored in tables, you wish to load into the session. This is done with another of the DSI package functions, "datashield.assign.table".
DSI::datashield.assign.table(conns = connections, symbol = "DST", table = c("CNSIM.CNSIM1","CNSIM.CNSIM2"))
- The "conns" argument takes the DSConnection-class object(s) returned by datashield.login (here, "connections"), which statistical commands use to refer to particular studies.
- The "symbol" argument creates a name by which to refer to the dataframes in each study. Here we have opted for "DST", an initialism for "DataSHIELD Table" (you may have seen "D" for "Dataframe" being used historically, but we are now phasing this out as it sometimes causes problems with another function called "D").
- The "table" argument specifies the names of the tables you wish to connect to, as they appear on the servers you are using. The structure "AAAA.BBBB", "AAAA.CCCC" means that within project AAAA there are tables BBBB and CCCC; we connect to both by listing them in an R vector.
How datashield.assign.table works
If we do not specify individual variables to assign to the server-side R session, as in this case, all variables held in the Opal servers are assigned. You can add an argument ("variables") to datashield.assign.table to limit the columns that are loaded from the server data frame (for usage see the help materials for the function: ?datashield.assign.table).
Assigned data are kept in a data frame named DST by convention. Each column of that data frame represents one variable and the rows are the individual records.
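The "variables" argument mentioned above could be used along the following lines. This is a hedged sketch rather than a prescribed workflow: it needs network access to the training server, and the column names "LAB_TSC" and "LAB_HDL" are assumptions about the contents of the CNSIM tables, so substitute variables that exist in your own tables.

```r
library(DSI)
library(DSOpal)
library(dsBaseClient)

# Login dataframe for a single training server (values from this tutorial)
builder <- DSI::newDSLoginBuilder()
builder$append(server = "server1", url = "https://opal-demo.obiba.org/",
               user = "dsuser", password = "P@ssw0rd", driver = "OpalDriver",
               options = "list(ssl_verifyhost=0, ssl_verifypeer=0)")
logindata <- builder$build()

# assign = FALSE: open the connection without loading any table yet
connections <- DSI::datashield.login(logins = logindata, assign = FALSE)

# Load only two columns; the variable names here are ASSUMED, not taken
# from this tutorial
DSI::datashield.assign.table(conns = connections, symbol = "DST",
                             table = "CNSIM.CNSIM1",
                             variables = c("LAB_TSC", "LAB_HDL"))

# Ask each server which columns DST now holds
cols <- dsBaseClient::ds.colnames(x = "DST", datasources = connections)
print(cols)

DSI::datashield.logout(connections)
```

Restricting the assigned columns keeps the server-side session smaller, which may matter on shared servers.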
An example of the printout after the table assignment has finished:
Assigned all table (DST <- ...) [======================================================] 100% /25s
At this point, you are logged in and ready to proceed!
However, let's quickly review some other tips and tricks about using the login dataframe.
Command to logout:
You should get into the habit of putting this command at the end of your scripts, and running it after you are finished. It is particularly important to do so when connecting to shared DataSHIELD servers, to save resources for the analyses of others.
DSI::datashield.logout(connections)
In a later tutorial in this series, you will find the option of saving your workspace before logging out, to be able to log in another day and have all your variables intact and ready to go without having to run everything again!
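One way to make sure the logout command always runs, even when an analysis step fails part-way through, is to wrap the session in a small helper. The function `with_datashield` below is hypothetical and not part of DSI; it is a sketch that relies on base R's on.exit, with the login and logout functions passed in as arguments so that the defaults are the DSI functions used in this tutorial.

```r
# Hypothetical helper (not part of DSI): log in, run `analysis` on the
# connections, and always log out afterwards, even if `analysis` errors.
with_datashield <- function(logindata, analysis,
                            login = DSI::datashield.login,
                            logout = DSI::datashield.logout) {
  connections <- login(logins = logindata, assign = TRUE)
  # on.exit() registers the logout so it runs however the function exits
  on.exit(logout(connections), add = TRUE)
  analysis(connections)
}

# Example use (requires a reachable server and a login dataframe):
# with_datashield(logindata, function(conns) DSI::datashield.symbols(conns))
```

The explicit DSI::datashield.logout(connections) call shown above remains the simplest option for interactive work; a wrapper like this is mainly useful in longer scripts.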
Conclusion
The other parts in this DataSHIELD tutorial series are:
5: Subsetting
6: Modelling
Also remember you can:
- get a function list for any DataSHIELD package
- view the manual help page for individual functions
- in the DataSHIELD test environment it is possible to print analyses to file (.csv, .txt, .pdf, .png)
- take a look at our FAQ page for solutions to common problems such as Changing variable class to use in a specific DataSHIELD function.
- Get support from our DataSHIELD forum.
DataSHIELD Wiki by DataSHIELD is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.datashield.ac.uk/wiki