...
Quick reminder for logging in:
Expand |
---|
Recall from the installation instructions that the Opal web interface is a simple check of whether the VMs have started. Load the following URLs, waiting at least 1 minute after starting the training VMs.
Start R/RStudio
Load Packages
Code Block |
---|
|
#load libraries
library(DSI)
library(DSOpal)
library(dsBaseClient)
|
Build your login dataframe
Code Block |
---|
|
# build the login details for each server
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal-demo.obiba.org/",
               user = "dsuser", password = "P@ssw0rd", driver = "OpalDriver",
               options = 'list(ssl_verifyhost=0, ssl_verifypeer=0)')
builder$append(server = "study2", url = "https://opal-demo.obiba.org/",
               user = "dsuser", password = "P@ssw0rd", driver = "OpalDriver",
               options = 'list(ssl_verifyhost=0, ssl_verifypeer=0)')
logindata <- builder$build()
# log in, then assign one CNSIM table per server to the symbol "DST"
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
DSI::datashield.assign.table(conns = connections, symbol = "DST", table = c("CNSIM.CNSIM1","CNSIM.CNSIM2")) |
Code Block |
---|
|
DSI::datashield.logout(connections) |
...
The DataSHIELD approach: aggregate and assign functions
Tip |
---|
title | How assign and aggregate functions work |
---|
|
DataSHIELD commands call functions that range from carrying out prerequisite tasks, such as logging in to the data sources, to generating basic descriptive statistics, plots and tabulations. More advanced functions allow users to fit generalized linear models and generalized estimating equation models. R can list all functions available in DataSHIELD.
This section explains the functions we will call during this tutorial. Although this knowledge is not required to run DataSHIELD analyses, it helps in understanding the output of the commands. It explains why some commands call functions that return nothing to the user, but instead store their output on the data provider's server for use by a second function.
In DataSHIELD the person running an analysis (the client) uses client-side functions to issue commands (instructions). These commands initiate the execution (running) of server-side functions that run the analysis server-side (behind the firewall of the data provider). There are two types of server-side function: assign functions and aggregate functions.
Assign functions create new objects and store them server-side, either because the objects are potentially disclosive or because they consist of the individual-level data which, in DataSHIELD, is never seen by the analyst. They return no output to the client except error messages or status messages about the object stored server-side. The new objects can include:
- new transformed variables (e.g. mean-centred or log-transformed variables)
- a new variable of a modified class (e.g. a variable of class numeric may be converted into a factor, which R can then model as having discrete categorical levels)
- a subset object (e.g. a dataframe including gender as a variable may be split into males and females).
Aggregate functions analyse the data server-side and return an output in the form of aggregate data (summary statistics that are not disclosive) to the client. The help page for each function tells us what is returned and when not to expect an output on the client side. |
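The distinction between the two kinds of function can be sketched with one call of each. This is a minimal illustration, not part of the original tutorial: the helper name `log_then_mean` and the server-side object name `LAB_HDL_log` are made up for the example, and it assumes that `dsBaseClient` is installed and that the table has been assigned to the symbol `DST`, as in the login reminder above.

```r
# Hypothetical helper: run one assign function, then one aggregate function.
# Assumes `conns` is the object returned by DSI::datashield.login() and that
# the table has already been assigned server-side to the symbol "DST".
log_then_mean <- function(conns) {
  # Assign function: creates LAB_HDL_log on each server; no data comes back
  dsBaseClient::ds.log(x = "DST$LAB_HDL", newobj = "LAB_HDL_log", datasources = conns)
  # Aggregate function: returns only non-disclosive summary statistics
  dsBaseClient::ds.mean(x = "LAB_HDL_log", datasources = conns)
}
```

With an open session, it would be called as `log_then_mean(connections)`. Only the second call returns a result to the client; the log-transformed variable itself stays on the servers.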
Basic statistics and data manipulations
Descriptive statistics: variable dimensions and class
You can obtain descriptive or exploratory statistics about the assigned variables held in the server-side R session, such as the number of participants at each data provider, the number of participants across all data providers, and the number of variables. Knowing these parameters of the data will facilitate your analysis.
Code Block |
---|
|
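# log in, then make sure the CNSIM tables are assigned to the symbol "DST"
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "DST")
DSI::datashield.assign.table(conns = connections, symbol = "DST", table = c("CNSIM.CNSIM1","CNSIM.CNSIM2"))
ds.dim(x = 'DST')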
|
The output of the command is shown below. It shows that in study 1 there are 2163 individuals with 11 variables and in study 2 there are 3088 individuals with 11 variables, and that in both studies together there are in total 5251 individuals with 11 variables:
Code Block |
---|
|
Aggregated (dimDS("DST")) [==============================================================] 100% / 0s
$`dimensions of DST in study1`
[1] 2163 11
$`dimensions of DST in study2`
[1] 3088 11
$`dimensions of DST in combined studies`
[1] 5251 11
|
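As a quick sanity check on this output, the combined dimensions should be the sum of the per-study row counts with an unchanged column count. The sketch below reproduces that arithmetic locally, using the figures copied from the output above; the variable names are illustrative only and no server connection is needed.

```r
# Per-study dimensions copied from the ds.dim output above (rows, columns)
dims <- list(study1 = c(2163, 11), study2 = c(3088, 11))

# Participants add up across studies; the variable count must be identical
rows <- sum(sapply(dims, `[`, 1))
cols <- unique(sapply(dims, `[`, 2))
stopifnot(length(cols) == 1)

combined <- c(rows, cols)
print(combined)  # [1] 5251   11
```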
...
Code Block |
---|
|
Aggregated (dimDS("DST")) [==============================================================] 100% / 0s
$`dimensions of DST in combined studies`
[1] 5251 11 |
...
Code Block |
---|
|
Aggregated (exists("DST")) [=============================================================] 100% / 1s
Aggregated (classDS("DST")) [============================================================] 100% / 1s
Aggregated (colnamesDS("DST")) [=========================================================] 100% / 0s
$study1
[1] "LAB_TSC" "LAB_TRIG" "LAB_HDL" "LAB_GLUC_ADJUSTED" "PM_BMI_CONTINUOUS"
[6] "DIS_CVA" "MEDI_LPD" "DIS_DIAB" "DIS_AMI" "GENDER"
[11] "PM_BMI_CATEGORICAL"
$study2
[1] "LAB_TSC" "LAB_TRIG" "LAB_HDL" "LAB_GLUC_ADJUSTED" "PM_BMI_CONTINUOUS"
[6] "DIS_CVA" "MEDI_LPD" "DIS_DIAB" "DIS_AMI" "GENDER"
[11] "PM_BMI_CATEGORICAL" |
...
Code Block |
---|
|
ds.class(x='DST$LAB_HDL', datasources = connections)
|
...
Code Block |
---|
|
Aggregated (exists("DST")) [=============================================================] 100% / 0s
Aggregated (classDS("DST$LAB_HDL")) [====================================================] 100% / 1s
$study1
[1] "numeric"
$study2
[1] "numeric" |
...
Code Block |
---|
|
Aggregated (exists("DST")) [=============================================================] 100% / 0s
Aggregated (classDS("DST$LAB_HDL")) [====================================================] 100% / 1s
Aggregated (quantileMeanDS(DST$LAB_HDL)) [===============================================] 100% / 0s
Aggregated (lengthDS("DST$LAB_HDL")) [===================================================] 100% / 0s
Aggregated (numNaDS(DST$LAB_HDL)) [======================================================] 100% / 0s
Quantiles of the pooled data
       5%       10%       25%       50%       75%       90%       95%      Mean
0.8606589 1.0385205 1.2964949 1.5704848 1.8418712 2.0824057 2.2191369 1.5619572 |
...
Code Block |
---|
|
ds.mean(x='DST$LAB_HDL', datasources = connections)
|
...
Code Block |
---|
|
Aggregated (meanDS(DST$LAB_HDL)) [=======================================================] 100% / 0s
$Mean.by.Study
       EstimatedMean Nmissing Nvalid Ntotal
study1      1.569416      360   1803   2163
study2      1.556648      555   2533   3088
$Nstudies
[1] 2
$ValidityMessage
ValidityMessage
study1 "VALID ANALYSIS"
study2 "VALID ANALYSIS" |
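The per-study means above can be combined into a single pooled estimate by weighting each study's mean by its number of valid observations. The sketch below does this locally with the figures copied from the output; the data frame and variable names are illustrative, and no server connection is needed.

```r
# Per-study estimates copied from the ds.mean output above
est <- data.frame(
  study  = c("study1", "study2"),
  mean   = c(1.569416, 1.556648),
  nvalid = c(1803, 2533)
)

# Pooled mean = sum(mean_i * Nvalid_i) / sum(Nvalid_i)
pooled_mean <- weighted.mean(est$mean, w = est$nvalid)
print(round(pooled_mean, 5))  # [1] 1.56196
```

The result agrees, to rounding, with the `Mean` column of the pooled quantile output shown earlier (1.5619572).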
...
Tip |
---|
Also remember you can: |
...