...
Note | ||
---|---|---|
| ||
It is recommended that you familiarise yourself with R first by sitting our Introduction to R tutorial. It also requires that you have the DataSHIELD training environment installed on your machine, see our Installation Instructions for Linux, Windows, or Mac. |
Tip | ||
---|---|---|
| ||
DataSHIELD support is freely available in the DataSHIELD forum by the DataSHIELD community. Please use this as the first port of call for any problems you may be having, it is monitored closely for new threads. DataSHIELD bespoke user support and also user training classes are offered on a fee-paying basis. Please enquire at datashield@newcastle.ac.uk for current prices. |
...
5: Subsetting
6: Modelling
Quick reminder for logging in:
Section | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||||||
Expand | |||||||||||||||||||
Follow instructions to Startthe Opal VMs.Recall from the installation instructions, the Opal web interface is a simple check to tell if the VMs have started. Load the following urls, waiting at least 1 minute after starting the training VMs. StartR/RStudioLoad Packages
Build your login dataframe
|
Modelling
Horizontal DataSHIELD allows the fitting of generalised linear models (GLM). In Generalised Linear Modelling, the outcome can be modelled as continuous, or categorical (binomial or discrete). The error to use in the model can follow a range of distributions including Gaussian, binomial, gamma and Poisson. In this section only one example will be shown, for more examples please see the manual help page for the function.
...
A correlation command will establish how closely linked these two variables might be:
Code Block |
---|
ds.cor(x='D$PMDST$PM_BMI_CONTINUOUS', y='D$LABDST$LAB_TRIG') |
Let's visualise with a scatterplot:
Code Block |
---|
ds.scatterPlot(x='D$PMDST$PM_BMI_CONTINUOUS', y='D$LABDST$LAB_TRIG') |
Regress Triglycerides with BMI using the Individual Partition Data (IPD) approach:
...
Code Block |
---|
ds.glm(formula = "D$LABDST$LAB_TRIG~D$PMTRIG~DST$PM_BMI_CONTINUOUS", family="gaussian", datasources = connections) |
Regress Triglycerides with BMI using the Study-Level Meta-Analysis (SLMA) approach:
ds.glmSLMA(formula = "D$LABDST$LAB_TRIG~D$PMTRIG~DST$PM_BMI_CONTINUOUS", family="gaussian", newobj = "workshop.obj", datasources = connections)
...
Code Block |
---|
ds.glmPredict(glmname = "workshop.obj", newobj = "workshop.prediction.obj", datasources = connections) ds.length("workshop.prediction.obj$fit", datasources=connections) ds.length("D$LABDST$LAB_TRIG", datasources=connections) |
Filter out the incomplete cases, using "ds.completeCases()" sub-setting command:
Code Block |
---|
ds.cbind(c('D$LABDST$LAB_TRIG', 'D$PMDST$PM_BMI_CONTINUOUS'), newobj='vars') ds.completeCases('vars', newobj='vars.complete') ds.dim('vars.complete') |
...
Code Block |
---|
df1 <- ds.scatterPlot('D$PMDST$PM_BMI_CONTINUOUS', "D$LABDST$LAB_TRIG", datasources = connections, return.coords = TRUE) df2 <- ds.scatterPlot('vars.complete$PM_BMI_CONTINUOUS', "workshop.prediction.obj$fit", datasources = connections, return.coords = TRUE) # then in native R par(mfrow=c(2,2)) plot(as.data.frame(df1[[1]][[1]])$x,as.data.frame(df1[[1]][[1]])$y, xlab='Body Mass Index', ylab='Triglycerides', main='Study 1') lines(as.data.frame(df2[[1]][[1]])$x,as.data.frame(df2[[1]][[1]])$y, col='red') plot(as.data.frame(df1[[1]][[2]])$x,as.data.frame(df1[[1]][[2]])$y, xlab='Body Mass Index', ylab='Triglycerides', main='Study 2') lines(as.data.frame(df2[[1]][[2]])$x,as.data.frame(df2[[1]][[2]])$y, col='red') plot(as.data.frame(df1[[1]][[3]])$x,as.data.frame(df1[[1]][[3]])$y, xlab='Body Mass Index', ylab='Triglycerides', main='Study 3') lines(as.data.frame(df2[[1]][[3]])$x,as.data.frame(df2[[1]][[3]])$y, col='red') |
...
Code Block |
---|
ds.glmSLMA(formula = "D$DISDST$DIS_DIAB~D$LABDIAB~DST$LAB_HDL", family="binomial", newobj = "workshop.obj", datasources = connections[3]) |
...
Code Block |
---|
ds.length("workshop.prediction.obj$fit", datasources = connections[3]) ds.length('D$LABDST$LAB_HDL', datasources = connections[3]) ds.numNA('D$LABDST$LAB_HDL', datasources = connections[3]) ds.completeCases('D$LABDST$LAB_HDL', 'hdl.complete', datasources = connections[3]) # FAILED ds.cbind(c('D$LABDST$LAB_HDL','D$LABDST$LAB_HDL'), newobj='D2', datasources = connections[3]) ds.completeCases('D2', 'D2.complete', datasources = connections[3]) #doesnt' fail because input object is dataframe ds.dim('D2', datasources = connections[3]) ds.dim('D2.complete', datasources = connections[3]) ds.asNumeric('D$DISDST$DIS_DIAB', newobj='DIAB.n', datasources = connections[3]) df1 <- ds.scatterPlot('D$LABDST$LAB_HDL', "DIAB.n", datasources = connections[3], return.coords = TRUE) df2 <- ds.scatterPlot('D2.complete$LAB_HDL', "workshop.prediction.obj$fit", datasources = connections[3], return.coords = TRUE) plot(as.data.frame(df1[[1]][[1]])$x,as.data.frame(df1[[1]][[1]])$y, xlab='LAB_HDL', ylab='DIS_DIAB') lines(as.data.frame(df2[[1]][[1]])$x,as.data.frame(df2[[1]][[1]])$y,col='red') mod <- ds.glm(formula = "D$DISDST$DIS_DIAB~D$LABDIAB~DST$LAB_HDL", family="binomial", datasources = connections[3]) mod$coefficients modSLMA$output.summary$study1$coefficients X <- seq(from=-0.5, to=3, by=0.01) Y <- (exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X))/(1+exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X)) plot(X,Y) X <- seq(from=-10, to=3, by=0.01) Y <- (exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X))/(1+exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X)) plot(X,Y) plot(as.data.frame(df1[[1]][[1]])$x,as.data.frame(df1[[1]][[1]])$y, xlab='LAB_HDL', ylab='DIS_DIAB', xlim=c(-10,3)) lines(X,Y, col='red') |
...
- The function
ds.glm
is used to analyse the outcome variableDIS_DIAB
(diabetes status) and the covariatesPM_BMI_CONTINUOUS
(continuous BMI),LAB_HDL
(HDL cholesterol) andGENDER
(gender), with an interaction between the latter two. In R this model is represented as:
Code Block |
---|
D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS+D$LABDST$LAB_HDL*D$GENDERDST$GENDER |
- Since v6.0, the intermediate results are printed by default, (in red when viewing in RStudio):
...
Code Block | ||
---|---|---|
| ||
ds.glm(formula=D$DISDST$DIS_DIAB~D$PMDIAB~DST$PM_BMI_CONTINUOUS+D$LABDST$LAB_HDL*D$GENDERDST$GENDER, family='binomial') |
Code Block | ||||
---|---|---|---|---|
| ||||
Aggregated (glmDS1(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s Iteration 1... Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 5772.52971970323 Iteration 2... Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 1316.46958534192 Iteration 3... Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 725.530885478793 Iteration 4... Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 574.04091649261 Iteration 5... Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 539.36484363672 Iteration 6... Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 534.529367269197 Iteration 7... Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 534.369349024873 Iteration 8... Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 534.369101005548 Iteration 9... Aggregated (glmDS2(D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER, ) [========] 100% / 0s CURRENT DEVIANCE: 534.369101004809 SUMMARY OF MODEL STATE after iteration 9 Current deviance 534.369101004809 on 4159 degrees of freedom Convergence criterion TRUE (1.38325209879068e-12) beta: -6.90641416691308 0.142256253558181 -0.967440739752358 -1.40945273343359 0.646007073975 Information matrix overall: (Intercept) D$PMDST$PM_BMI_CONTINUOUS D$LABDST$LAB_HDL D$GENDER1DST$GENDER1 D$LABDST$LAB_HDL:D$GENDER1DSTDST$GENDER1 (Intercept) 52.47803 1624.8945 71.78440 16.77192 25.91140 D$PMDST$PM_BMI_CONTINUOUS 1624.89450 51515.6450 2204.90085 503.20484 774.18028 D$LABDST$LAB_HDL 71.78440 2204.9008 109.48855 25.91140 42.87626 D$GENDER1DST$GENDER1 16.77192 503.2048 25.91140 16.77192 25.91140 D$LABDST$LAB_HDL:D$GENDER1DST$GENDER1 25.91140 774.1803 42.87626 25.91140 42.87626 Score vector overall: [,1] (Intercept) -3.618013e-10 D$PMDST$PM_BMI_CONTINUOUS -9.890691e-09 D$LABDST$LAB_HDL -6.317578e-10 D$GENDER1DST$GENDER1 -2.913176e-10 D$LABDST$LAB_HDL:D$GENDER1DST$GENDER1 -5.343783e-10 Current deviance: 534.369101004809 |
...
Code Block | ||
---|---|---|
| ||
$Nvalid [1] 4164 $Nmissing [1] 1087 $Ntotal [1] 5251 $disclosure.risk RISK OF DISCLOSURE study1 0 study2 0 $errorMessage ERROR MESSAGES study1 "No errors" study2 "No errors" $nsubs [1] 4164 $iter [1] 9 $family Family: binomial Link function: logit $formula [1] "D$DISDST$DIS_DIAB ~ D$PMDST$PM_BMI_CONTINUOUS + D$LABDST$LAB_HDL * D$GENDERDST$GENDER" $coefficients Estimate Std. Error z-value p-value low0.95CI.LP high0.95CI.LP P_OR (Intercept) -6.9064142 1.08980103 -6.3373166 2.338013e-10 -9.04238494 -4.7704434 0.00100034 D$PMDST$PM_BMI_CONTINUOUS 0.1422563 0.02932171 4.8515676 1.224894e-06 0.08478676 0.1997257 1.15287204 D$LABDST$LAB_HDL -0.9674407 0.36306348 -2.6646601 7.706618e-03 -1.67903208 -0.2558494 0.38005445 D$GENDER1DST$GENDER1 -1.4094527 1.06921103 -1.3182175 1.874308e-01 -3.50506784 0.6861624 0.24427693 D$LABDST$LAB_HDL:D$GENDER1DST$GENDER1 0.6460071 0.69410419 0.9307062 3.520056e-01 -0.71441214 2.0064263 1.90790747 low0.95CI.P_OR high0.95CI.P_OR (Intercept) 0.0001182744 0.008405372 D$PMDST$PM_BMI_CONTINUOUS 1.0884849336 1.221067831 D$LABDST$LAB_HDL 0.1865544594 0.774258560 D$GENDER1DST$GENDER1 0.0300447352 1.986079051 D$LABDST$LAB_HDL:D$GENDER1DST$GENDER1 0.4894797709 7.436693232 $dev [1] 534.3691 $df [1] 4159 $output.information [1] "SEE TOP OF OUTPUT FOR INFORMATION ON MISSING DATA AND ERROR MESSAGES" |
...