Page Comparison

...

Note

title	Prerequisites

It is recommended that you familiarise yourself with R first by sitting our Introduction to R tutorial.

It also requires that you have the DataSHIELD training environment installed on your machine, see our Installation Instructions for Linux, Windows, or Mac.

Tip

title	Help

DataSHIELD support is freely available in the DataSHIELD forum by the DataSHIELD community. Please use this as the first port of call for any problems you may be having, it is monitored closely for new threads.

DataSHIELD bespoke user support and also user training classes are offered on a fee-paying basis. Please enquire at datashield@newcastle.ac.uk for current prices.

...

The other parts in this DataSHIELD tutorial series are:

1: Introduction and logging in
2: Basic statistics and data manipulations
3: Assign functions and tables
4: Plotting graphs
5: Subsetting
6: Modelling

Quick reminder for logging in:

xml

Start R/RStudio

Load Packages

Section

border	true

Expand

Follow instructions to Start the Opal VMs.

Recall from the installation instructions, the Opal web interface is a simple check to tell if the VMs have started. Load the following urls, waiting at least 1 minute after starting the training VMs.

Start R/RStudio

Load Packages

Code Block

xml

Code Block

	xml
	xml

#load libraries
library(DSI)
library(DSOpal)
library(dsBaseClient)

Build your login dataframe

DSI::datashield.logout(connections

, options='list(ssl_verifyhost=0, ssl_verifypeer=0)')

logindata <- builder$build()

logindata <- builder$build()connections <- DSI::datashield.login(logins = logindata, assign = TRUE)
DSI::datashield.assign.table(conns = connections, symbol = "DST", table = c("CNSIM.CNSIM1","CNSIM.CNSIM2"))

Code Block

language	xml
title	Build your login dataframe

builder <- DSI::newDSLoginBuilder()
builder <- DSI::newDSLoginBuilder()
builder$append(server = "

study1

server1",

 url = "

http

https://

192

opal-demo.

168.56.100:8080/",

obiba.org/",
user = "

administrator

dsuser", password = "

datashield_test&

P@ssw0rd", driver = "OpalDriver", options='list(ssl_verifyhost=0, ssl_verifypeer=0)')
builder$append(server = "server2", url

table

= "https://opal-demo.obiba.org/",
user = "dsuser", password = "

CNSIM.CNSIM1

P@ssw0rd", driver = "OpalDriver"

) builder$append(server = "study2", url = "http://192.168.56.101:8080/", user = "administrator", password = "datashield_test&", table = "CNSIM.CNSIM2", driver = "OpalDriver") logindata <- builder$build() connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")

Command to logout:

Code Block

language	bash

Command to logout:

Code Block

language	bash

DSI::datashield.logout(connections)

Modelling

Horizontal DataSHIELD allows the fitting of generalised linear models (GLM). In Generalised Linear Modelling, the outcome can be modelled as continuous, or categorical (binomial or discrete). The error to use in the model can follow a range of distributions including Gaussian, binomial, gamma and Poisson. In this section only one example will be shown, for more examples please see the manual help page for the function.

This section will make more sense with an understanding of Generalised Linear Modelling theory and techniques. More information can always be found online, for example this Colorado University publication.

Basic 1-covariate Gaussian GLM

We want to examine the relationship between BMI (a continuous variable) and Triglycerides (another continuous variable). Because the response variable here, BMI, is continuous, this indicates that there should be a Gaussian underlying distribution.

A correlation command will establish how closely linked these two variables might be:

Code Block
ds.cor(x='D$PMDST$PM_BMI_CONTINUOUS', y='D$LABDST$LAB_TRIG')

Let's visualise with a scatterplot:

Code Block
ds.scatterPlot(x='D$PMDST$PM_BMI_CONTINUOUS', y='D$LABDST$LAB_TRIG')

Regress Triglycerides with BMI using the Individual Partition Data (IPD) approach:

...

Code Block
ds.glm(formula = "D$LABDST$LAB_TRIG~D$PMTRIG~DST$PM_BMI_CONTINUOUS", family="gaussian", datasources = connections)

Regress Triglycerides with BMI using the Study-Level Meta-Analysis (SLMA) approach:

ds.glmSLMA(formula = "D$LABDST$LAB_TRIG~DST$PM_TRIG~D$PM_BMI_CONTINUOUS", family="gaussian", newobj = "workshop.obj", datasources = connectionsBMI_CONTINUOUS", family="gaussian", newobj = "workshop.obj", datasources = connections)

For the SLMA approach we can assign the predicted values at each study:

Code Block

ds.glmPredict(glmname = "workshop.obj", newobj = "workshop.prediction.obj", datasources = connections)
ds.length("workshop.prediction.obj$fit", datasources=connections)
ds.length("DST$LAB_TRIG", datasources=connections)

Filter out the incomplete cases, using "ds.completeCases()" sub-setting command:

Code Block
ds.cbind(c('DST$LAB_TRIG', 'DST$PM_BMI_CONTINUOUS'), newobj='vars') ds.completeCases('vars', newobj='vars.complete') ds.dim('vars.complete')

Then plot the resultant data as a best linear fit on a scatter plot:

Code Block

df1 <- ds.scatterPlot('DST$PM_BMI_CONTINUOUS', "DST$LAB_TRIG", datasources = connections, return.coords = TRUE)
df2 <- ds.scatterPlot('vars.complete$PM_BMI_CONTINUOUS', "workshop.prediction.obj$fit", datasources = connections, return.coords = TRUE)
# then in native R
par(mfrow=c(2,2))
plot(as.data.frame(df1[[1]][[1]])$x,as.data.frame(df1[[1]][[1]])$y, xlab='Body Mass Index', ylab='Triglycerides', main='Study 1')
lines(as.data.frame(df2[[1]][[1]])$x,as.data.frame(df2[[1]][[1]])$y, col='red')
plot(as.data.frame(df1[[1]][[2]])$x,as.data.frame(df1[[1]][[2]])$y, xlab='Body Mass Index', ylab='Triglycerides', main='Study 2')
lines(as.data.frame(df2[[1]][[2]])$x,as.data.frame(df2[[1]][[2]])$y, col='red')
plot(as.data.frame(df1[[1]][[3]])$x,as.data.frame(df1[[1]][[3]])$y, xlab='Body Mass Index', ylab='Triglycerides', main='Study 3')
lines(as.data.frame(df2[[1]][[3]])$x,as.data.frame(df2[[1]][[3]])$y, col='red')

1-covariate Binomial GLM

Say we want to regress Cholesterol against Diabetes status. This is not as simple a matter as above, where a simple (& familiar!) Gaussian linear model is fitted. The response data takes values of 0 and 1, the binary measure of whether a person does (=1) or doesn't (=0) have the diabetes diagnosis. Because of this, we want a different type of linear model, to make sense.

Here we will use the ds.glmSLMA command as we are about to calculate predicted values again, which only work with the SLMA version of the function. We will only use the 3rd connected study as we are interested in conducting an analysis and producing the graph for that study site alone.

Code Block
ds.glmSLMA(formula = "DST$DIS_DIAB~DST$LAB_HDL", family="binomial", newobj = "workshop.obj", datasources = connections[3])

So, with this formula, family and new object established, this is the rest of the code:

Code Block

ds.length("workshop.prediction.obj$fit", datasources = connections[3])
ds.length('DST$LAB_HDL', datasources = connections[3])
ds.numNA('DST$LAB_HDL', datasources = connections[3])
ds.completeCases('DST$LAB_HDL', 'hdl.complete', datasources = connections[3])  # FAILED
ds.cbind(c('DST$LAB_HDL','DST$LAB_HDL'), newobj='D2', datasources = connections[3])
ds.completeCases('D2', 'D2.complete', datasources = connections[3]) #doesnt' fail because input object is dataframe
ds.dim('D2', datasources = connections[3])
ds.dim('D2.complete', datasources = connections[3])

ds.asNumeric('DST$DIS_DIAB', newobj='DIAB.n', datasources = connections[3])
df1 <- ds.scatterPlot('DST$LAB_HDL', "DIAB.n", datasources = connections[3], return.coords = TRUE)
df2 <- ds.scatterPlot('D2.complete$LAB_HDL', "workshop.prediction.obj$fit", datasources = connections[3], return.coords = TRUE)

plot(as.data.frame(df1[[1]][[1]])$x,as.data.frame(df1[[1]][[1]])$y, xlab='LAB_HDL', ylab='DIS_DIAB')
lines(as.data.frame(df2[[1]][[1]])$x,as.data.frame(df2[[1]][[1]])$y,col='red')

mod <- ds.glm(formula = "DST$DIS_DIAB~DST$LAB_HDL", family="binomial", datasources = connections[3])
mod$coefficients
modSLMA$output.summary$study1$coefficients


X <- seq(from=-0.5, to=3, by=0.01)
Y <- (exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X))/(1+exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X))
plot(X,Y)

X <- seq(from=-10, to=3, by=0.01)
Y <- (exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X))/(1+exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X))
plot(X,Y)

plot(as.data.frame(df1[[1]][[1]])$x,as.data.frame(df1[[1]][[1]])$y, xlab='LAB_HDL', ylab='DIS_DIAB', xlim=c(-10,3))
lines(X,Y, col='red')

Multi-covariate GLM

The function ds.glm is used to analyse the outcome variable DIS_DIAB (diabetes status) and the covariates PM_BMI_CONTINUOUS (continuous BMI), LAB_HDL (HDL cholesterol) and GENDER (gender), with an interaction between the latter two. In R this model is represented as:

Code Block
D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS+D$LABDST$LAB_HDL*D$GENDERDST$GENDER

Since v6.0, the intermediate results are printed by default, (in red when viewing in RStudio):

...

Code Block

language	xml

ds.glm(formula=D$DISDST$DIS_DIAB~D$PMDIAB~DST$PM_BMI_CONTINUOUS+D$LABDST$LAB_HDL*D$GENDERDST$GENDER, family='binomial')

Code Block

language	xml
theme	RDark

  Aggregated (glmDS1(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s
Iteration 1...
  Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s
CURRENT DEVIANCE:      5772.52971970323
Iteration 2...
  Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s
CURRENT DEVIANCE:      1316.46958534192
Iteration 3...
  Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s
CURRENT DEVIANCE:      725.530885478793
Iteration 4...
  Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s
CURRENT DEVIANCE:      574.04091649261
Iteration 5...
  Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s
CURRENT DEVIANCE:      539.36484363672
Iteration 6...
  Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s
CURRENT DEVIANCE:      534.529367269197
Iteration 7...
  Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s
CURRENT DEVIANCE:      534.369349024873
Iteration 8...
  Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s
CURRENT DEVIANCE:      534.369101005548
Iteration 9...
  Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s
CURRENT DEVIANCE:      534.369101004809
SUMMARY OF MODEL STATE after iteration 9
Current deviance 534.369101004809 on 4159 degrees of freedom
Convergence criterion TRUE (1.38325209879068e-12)

beta: -6.90641416691308 0.142256253558181 -0.967440739752358 -1.40945273343359 0.646007073975

Information matrix overall:
                    (Intercept) D$PMDST$PM_BMI_CONTINUOUS  D$LABDST$LAB_HDL D$GENDER1DST$GENDER1 D$LABDST$LAB_HDL:D$GENDER1DSTDST$GENDER1
(Intercept)            52.47803           1624.8945   71.78440  16.77192            25.91140
D$PMDST$PM_BMI_CONTINUOUS  1624.89450          51515.6450 2204.90085 503.20484           774.18028
D$LABDST$LAB_HDL              71.78440           2204.9008  109.48855  25.91140            42.87626
D$GENDER1DST$GENDER1              16.77192            503.2048   25.91140  16.77192            25.91140
D$LABDST$LAB_HDL:D$GENDER1DST$GENDER1    25.91140            774.1803   42.87626  25.91140            42.87626

Score vector overall:
                             [,1]
(Intercept)         -3.618013e-10
D$PMDST$PM_BMI_CONTINUOUS -9.890691e-09
D$LABDST$LAB_HDL           -6.317578e-10
D$GENDER1DST$GENDER1           -2.913176e-10
D$LABDST$LAB_HDL:D$GENDER1DST$GENDER1 -5.343783e-10

Current deviance: 534.369101004809

...

Code Block

language	py

$Nvalid
[1] 4164

$Nmissing
[1] 1087

$Ntotal
[1] 5251

$disclosure.risk
       RISK OF DISCLOSURE
study1                  0
study2                  0

$errorMessage
       ERROR MESSAGES
study1 "No errors"   
study2 "No errors"   

$nsubs
[1] 4164

$iter
[1] 9

$family

Family: binomial 
Link function: logit 


$formula
[1] "D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER"

$coefficients
                      Estimate Std. Error    z-value      p-value low0.95CI.LP high0.95CI.LP       P_OR
(Intercept)         -6.9064142 1.08980103 -6.3373166 2.338013e-10  -9.04238494    -4.7704434 0.00100034
D$PMDST$PM_BMI_CONTINUOUS  0.1422563 0.02932171  4.8515676 1.224894e-06   0.08478676     0.1997257 1.15287204
D$LABDST$LAB_HDL           -0.9674407 0.36306348 -2.6646601 7.706618e-03  -1.67903208    -0.2558494 0.38005445
D$GENDER1DST$GENDER1           -1.4094527 1.06921103 -1.3182175 1.874308e-01  -3.50506784     0.6861624 0.24427693
D$LABDST$LAB_HDL:D$GENDER1DST$GENDER1  0.6460071 0.69410419  0.9307062 3.520056e-01  -0.71441214     2.0064263 1.90790747
                    low0.95CI.P_OR high0.95CI.P_OR
(Intercept)           0.0001182744     0.008405372
D$PMDST$PM_BMI_CONTINUOUS   1.0884849336     1.221067831
D$LABDST$LAB_HDL             0.1865544594     0.774258560
D$GENDER1DST$GENDER1             0.0300447352     1.986079051
D$LABDST$LAB_HDL:D$GENDER1DST$GENDER1   0.4894797709     7.436693232

$dev
[1] 534.3691

$df
[1] 4159

$output.information
[1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"

...

You can get back to the training homepage by clicking here.

Versions Compared

Old Version 9

New Version Current

Key

Quick reminder for logging in:

Start R/RStudio

Load Packages

Start R/RStudio

Load Packages

Build your login dataframe

Modelling

Basic 1-covariate Gaussian GLM

1-covariate Binomial GLM

Multi-covariate GLM