DataSHIELD Training Part 6: Modelling
Prerequisites
It is recommended that you familiarise yourself with R first by sitting our Introduction to R tutorial.
It also requires that you have the DataSHIELD training environment installed on your machine, see our Installation Instructions for Linux, Windows, or Mac.
Help
DataSHIELD support is freely available in the DataSHIELD forum by the DataSHIELD community. Please use this as the first port of call for any problems you may be having, it is monitored closely for new threads.
DataSHIELD bespoke user support and also user training classes are offered on a fee-paying basis. Please enquire at datashield@newcastle.ac.uk for current prices.
Introduction
This is the sixth and final page of a 6-part DataSHIELD tutorial series. Well done!
The other parts in this DataSHIELD tutorial series are:
5: Subsetting
6: Modelling
Quick reminder for logging in:
Modelling
Horizontal DataSHIELD allows the fitting of generalised linear models (GLM). In Generalised Linear Modelling, the outcome can be modelled as continuous, or categorical (binomial or discrete). The error to use in the model can follow a range of distributions including Gaussian, binomial, gamma and Poisson. In this section only one example will be shown, for more examples please see the manual help page for the function.
This section will make more sense with an understanding of Generalised Linear Modelling theory and techniques. More information can always be found online, for example this Colorado University publication.
Basic 1-covariate Gaussian GLM
We want to examine the relationship between BMI (a continuous variable) and Triglycerides (another continuous variable). Because the response variable here, BMI, is continuous, this indicates that there should be a Gaussian underlying distribution.
A correlation command will establish how closely linked these two variables might be:
ds.cor(x='DST$PM_BMI_CONTINUOUS', y='DST$LAB_TRIG')
Let's visualise with a scatterplot:
ds.scatterPlot(x='DST$PM_BMI_CONTINUOUS', y='DST$LAB_TRIG')
Regress Triglycerides with BMI using the Individual Partition Data (IPD) approach:
The method for this (ds.glm) is a "pooled analysis"- equivalent to placing the individual-level data from all sources in one warehouse.
Important to note that the link function is by default the canonical link function for each family. So binomial <-> logistic link, poisson <-> log link, gaussian <-> identity link.
ds.glm(formula = "DST$LAB_TRIG~DST$PM_BMI_CONTINUOUS", family="gaussian", datasources = connections)
Regress Triglycerides with BMI using the Study-Level Meta-Analysis (SLMA) approach:
ds.glmSLMA(formula = "DST$LAB_TRIG~DST$PM_BMI_CONTINUOUS", family="gaussian", newobj = "workshop.obj", datasources = connections)
For the SLMA approach we can assign the predicted values at each study:
ds.glmPredict(glmname = "workshop.obj", newobj = "workshop.prediction.obj", datasources = connections) ds.length("workshop.prediction.obj$fit", datasources=connections) ds.length("DST$LAB_TRIG", datasources=connections)
Filter out the incomplete cases, using "ds.completeCases()" sub-setting command:
ds.cbind(c('DST$LAB_TRIG', 'DST$PM_BMI_CONTINUOUS'), newobj='vars') ds.completeCases('vars', newobj='vars.complete') ds.dim('vars.complete')
Then plot the resultant data as a best linear fit on a scatter plot:
df1 <- ds.scatterPlot('DST$PM_BMI_CONTINUOUS', "DST$LAB_TRIG", datasources = connections, return.coords = TRUE) df2 <- ds.scatterPlot('vars.complete$PM_BMI_CONTINUOUS', "workshop.prediction.obj$fit", datasources = connections, return.coords = TRUE) # then in native R par(mfrow=c(2,2)) plot(as.data.frame(df1[[1]][[1]])$x,as.data.frame(df1[[1]][[1]])$y, xlab='Body Mass Index', ylab='Triglycerides', main='Study 1') lines(as.data.frame(df2[[1]][[1]])$x,as.data.frame(df2[[1]][[1]])$y, col='red') plot(as.data.frame(df1[[1]][[2]])$x,as.data.frame(df1[[1]][[2]])$y, xlab='Body Mass Index', ylab='Triglycerides', main='Study 2') lines(as.data.frame(df2[[1]][[2]])$x,as.data.frame(df2[[1]][[2]])$y, col='red') plot(as.data.frame(df1[[1]][[3]])$x,as.data.frame(df1[[1]][[3]])$y, xlab='Body Mass Index', ylab='Triglycerides', main='Study 3') lines(as.data.frame(df2[[1]][[3]])$x,as.data.frame(df2[[1]][[3]])$y, col='red')
1-covariate Binomial GLM
Say we want to regress Cholesterol against Diabetes status. This is not as simple a matter as above, where a simple (& familiar!) Gaussian linear model is fitted. The response data takes values of 0 and 1, the binary measure of whether a person does (=1) or doesn't (=0) have the diabetes diagnosis. Because of this, we want a different type of linear model, to make sense.
Here we will use the ds.glmSLMA command as we are about to calculate predicted values again, which only work with the SLMA version of the function. We will only use the 3rd connected study as we are interested in conducting an analysis and producing the graph for that study site alone.
ds.glmSLMA(formula = "DST$DIS_DIAB~DST$LAB_HDL", family="binomial", newobj = "workshop.obj", datasources = connections[3])
So, with this formula, family and new object established, this is the rest of the code:
ds.length("workshop.prediction.obj$fit", datasources = connections[3]) ds.length('DST$LAB_HDL', datasources = connections[3]) ds.numNA('DST$LAB_HDL', datasources = connections[3]) ds.completeCases('DST$LAB_HDL', 'hdl.complete', datasources = connections[3]) # FAILED ds.cbind(c('DST$LAB_HDL','DST$LAB_HDL'), newobj='D2', datasources = connections[3]) ds.completeCases('D2', 'D2.complete', datasources = connections[3]) #doesnt' fail because input object is dataframe ds.dim('D2', datasources = connections[3]) ds.dim('D2.complete', datasources = connections[3]) ds.asNumeric('DST$DIS_DIAB', newobj='DIAB.n', datasources = connections[3]) df1 <- ds.scatterPlot('DST$LAB_HDL', "DIAB.n", datasources = connections[3], return.coords = TRUE) df2 <- ds.scatterPlot('D2.complete$LAB_HDL', "workshop.prediction.obj$fit", datasources = connections[3], return.coords = TRUE) plot(as.data.frame(df1[[1]][[1]])$x,as.data.frame(df1[[1]][[1]])$y, xlab='LAB_HDL', ylab='DIS_DIAB') lines(as.data.frame(df2[[1]][[1]])$x,as.data.frame(df2[[1]][[1]])$y,col='red') mod <- ds.glm(formula = "DST$DIS_DIAB~DST$LAB_HDL", family="binomial", datasources = connections[3]) mod$coefficients modSLMA$output.summary$study1$coefficients X <- seq(from=-0.5, to=3, by=0.01) Y <- (exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X))/(1+exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X)) plot(X,Y) X <- seq(from=-10, to=3, by=0.01) Y <- (exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X))/(1+exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X)) plot(X,Y) plot(as.data.frame(df1[[1]][[1]])$x,as.data.frame(df1[[1]][[1]])$y, xlab='LAB_HDL', ylab='DIS_DIAB', xlim=c(-10,3)) lines(X,Y, col='red')
Multi-covariate GLM
- The function
ds.glm
is used to analyse the outcome variableDIS_DIAB
(diabetes status) and the covariatesPM_BMI_CONTINUOUS
(continuous BMI),LAB_HDL
(HDL cholesterol) andGENDER
(gender), with an interaction between the latter two. In R this model is represented as:
DST$DIS_DIAB ~ DST$PM_BMI_CONTINUOUS+DST$LAB_HDL*DST$GENDER
- Since v6.0, the intermediate results are printed by default, (in red when viewing in RStudio):
ds.glm(formula=DST$DIS_DIAB~DST$PM_BMI_CONTINUOUS+DST$LAB_HDL*DST$GENDER, family='binomial')
Aggregated (glmDS1(DST$DIS_DIAB ~ DST$PM_BMI_CONTINUOUS + DST$LAB_HDL * DST$GENDER, ) [========] 100% / 0s Iteration 1... Aggregated (glmDS2(DST$DIS_DIAB ~ DST$PM_BMI_CONTINUOUS + DST$LAB_HDL * DST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 5772.52971970323 Iteration 2... Aggregated (glmDS2(DST$DIS_DIAB ~ DST$PM_BMI_CONTINUOUS + DST$LAB_HDL * DST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 1316.46958534192 Iteration 3... Aggregated (glmDS2(DST$DIS_DIAB ~ DST$PM_BMI_CONTINUOUS + DST$LAB_HDL * DST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 725.530885478793 Iteration 4... Aggregated (glmDS2(DST$DIS_DIAB ~ DST$PM_BMI_CONTINUOUS + DST$LAB_HDL * DST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 574.04091649261 Iteration 5... Aggregated (glmDS2(DST$DIS_DIAB ~ DST$PM_BMI_CONTINUOUS + DST$LAB_HDL * DST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 539.36484363672 Iteration 6... Aggregated (glmDS2(DST$DIS_DIAB ~ DST$PM_BMI_CONTINUOUS + DST$LAB_HDL * DST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 534.529367269197 Iteration 7... Aggregated (glmDS2(DST$DIS_DIAB ~ DST$PM_BMI_CONTINUOUS + DST$LAB_HDL * DST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 534.369349024873 Iteration 8... Aggregated (glmDS2(DST$DIS_DIAB ~ DST$PM_BMI_CONTINUOUS + DST$LAB_HDL * DST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 534.369101005548 Iteration 9... Aggregated (glmDS2(DST$DIS_DIAB ~ DST$PM_BMI_CONTINUOUS + DST$LAB_HDL * DST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 534.369101004809 SUMMARY OF MODEL STATE after iteration 9 Current deviance 534.369101004809 on 4159 degrees of freedom Convergence criterion TRUE (1.38325209879068e-12) beta: -6.90641416691308 0.142256253558181 -0.967440739752358 -1.40945273343359 0.646007073975 Information matrix overall: (Intercept) DST$PM_BMI_CONTINUOUS DST$LAB_HDL DST$GENDER1 DST$LAB_HDL:DSTDST$GENDER1 (Intercept) 52.47803 1624.8945 71.78440 16.77192 25.91140 DST$PM_BMI_CONTINUOUS 1624.89450 51515.6450 2204.90085 503.20484 774.18028 DST$LAB_HDL 71.78440 2204.9008 109.48855 25.91140 42.87626 DST$GENDER1 16.77192 503.2048 25.91140 16.77192 25.91140 DST$LAB_HDL:DST$GENDER1 25.91140 774.1803 42.87626 25.91140 42.87626 Score vector overall: [,1] (Intercept) -3.618013e-10 DST$PM_BMI_CONTINUOUS -9.890691e-09 DST$LAB_HDL -6.317578e-10 DST$GENDER1 -2.913176e-10 DST$LAB_HDL:DST$GENDER1 -5.343783e-10 Current deviance: 534.369101004809
- Then, the rest of the results are printed in black:
$Nvalid [1] 4164 $Nmissing [1] 1087 $Ntotal [1] 5251 $disclosure.risk RISK OF DISCLOSURE study1 0 study2 0 $errorMessage ERROR MESSAGES study1 "No errors" study2 "No errors" $nsubs [1] 4164 $iter [1] 9 $family Family: binomial Link function: logit $formula [1] "DST$DIS_DIAB ~ DST$PM_BMI_CONTINUOUS + DST$LAB_HDL * DST$GENDER" $coefficients Estimate Std. Error z-value p-value low0.95CI.LP high0.95CI.LP P_OR (Intercept) -6.9064142 1.08980103 -6.3373166 2.338013e-10 -9.04238494 -4.7704434 0.00100034 DST$PM_BMI_CONTINUOUS 0.1422563 0.02932171 4.8515676 1.224894e-06 0.08478676 0.1997257 1.15287204 DST$LAB_HDL -0.9674407 0.36306348 -2.6646601 7.706618e-03 -1.67903208 -0.2558494 0.38005445 DST$GENDER1 -1.4094527 1.06921103 -1.3182175 1.874308e-01 -3.50506784 0.6861624 0.24427693 DST$LAB_HDL:DST$GENDER1 0.6460071 0.69410419 0.9307062 3.520056e-01 -0.71441214 2.0064263 1.90790747 low0.95CI.P_OR high0.95CI.P_OR (Intercept) 0.0001182744 0.008405372 DST$PM_BMI_CONTINUOUS 1.0884849336 1.221067831 DST$LAB_HDL 0.1865544594 0.774258560 DST$GENDER1 0.0300447352 1.986079051 DST$LAB_HDL:DST$GENDER1 0.4894797709 7.436693232 $dev [1] 534.3691 $df [1] 4159 $output.information [1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES"
How ds.glm works
After every iteration in the glm, each study returns non disclosive summaries (a score vector and an information matrix) that are combined on the client R session. The model is fitted again with the updated beta coefficients, this iterative process continues until convergence or the maximum number of iterations is reached. The output of ds.glm
returns the final pooled score vector and information along with some information about the convergence and the final pooled beta coefficients.
Conclusion
You have now completed our basic DataSHIELD training.
Some extended practicals will be coming soon.
Also remember you can:
- get a function list for any DataSHIELD package and
- view the manual help page individual functions
- in the DataSHIELD test environment it is possible to print analyses to file (.csv, .txt, .pdf, .png)
- take a look at our FAQ page for solutions to common problems such as Changing variable class to use in a specific DataSHIELD function.
- Get support from our DataSHIELD forum.
You can get back to the training homepage by clicking here.
DataSHIELD Wiki by DataSHIELD is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.datashield.ac.uk/wiki