DataSHIELD Training Part 6: Modelling


Introduction

This is the sixth and final page of a 6-part DataSHIELD tutorial series. Well done!

The other parts in this DataSHIELD tutorial series are:

Quick reminder for logging in:



Modelling

Horizontal DataSHIELD allows the fitting of generalised linear models (GLMs). In generalised linear modelling, the outcome can be modelled as continuous or as categorical (binomial or discrete). The error distribution can be chosen from a range of families, including Gaussian, binomial, gamma and Poisson. Only a few examples are shown in this section; for more, see the help page for the relevant function.

This section will make more sense with an understanding of Generalised Linear Modelling theory and techniques. More information can always be found online, for example this Colorado University publication.


Basic 1-covariate Gaussian GLM

We want to examine the relationship between BMI (a continuous variable) and Triglycerides (another continuous variable). Because the response variable here, Triglycerides, is continuous, a Gaussian family is the natural choice for the error distribution.

A correlation command will establish how closely linked these two variables might be:
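A minimal sketch of such a command, assuming the login step has produced a `connections` object and assigned each study's table to the server-side symbol `D` (the same names used in the ds.glmSLMA call further down this page):

```r
library(dsBaseClient)

# Assumes `connections` exists from the login step and each study's data
# frame has been assigned server-side to the symbol D.
# type = "split" returns one result per study;
# type = "combine" pools across the studies instead.
ds.cor(x = "D$PM_BMI_CONTINUOUS",
       y = "D$LAB_TRIG",
       type = "split",
       datasources = connections)
```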

Let's visualise with a scatterplot:
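One way to draw it, again assuming `connections` and the server-side data frame `D` from the login step:

```r
library(dsBaseClient)

# Assumes `connections` and the server-side data frame D from the login step.
# The default method ("deterministic") plots non-disclosive approximations
# of the points rather than the raw individual-level observations.
ds.scatterPlot(x = "D$PM_BMI_CONTINUOUS",
               y = "D$LAB_TRIG",
               datasources = connections)
```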

Regress Triglycerides on BMI using the Individual Participant Data (IPD) approach:
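A sketch of the pooled fit, reusing the formula and family from the ds.glmSLMA call shown on this page. The local object name `mod` is only illustrative:

```r
library(dsBaseClient)

# Assumes `connections` and the server-side data frame D from the login step.
# ds.glm returns the pooled fit to the client, so it is kept in a local object
# ("mod" is a hypothetical name).
mod <- ds.glm(formula = "D$LAB_TRIG~D$PM_BMI_CONTINUOUS",
              family = "gaussian",
              datasources = connections)

# Pooled coefficient table
mod$coefficients
```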

The method for this (ds.glm) is a "pooled analysis": it is equivalent to placing the individual-level data from all sources in one warehouse.

Note that, by default, the link function is the canonical link for each family: binomial uses the logit (logistic) link, Poisson the log link, and Gaussian the identity link.
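The same pairings can be checked locally in base R, whose family objects default to these canonical links. This runs on the client without any DataSHIELD connection; it illustrates the pairings only and does not inspect ds.glm itself:

```r
# Base R family objects store their default (canonical) link as a string.
links <- c(gaussian = gaussian()$link,
           binomial = binomial()$link,
           poisson  = poisson()$link)
print(links)
```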

Regress Triglycerides on BMI using the Study-Level Meta-Analysis (SLMA) approach:

ds.glmSLMA(formula = "D$LAB_TRIG~D$PM_BMI_CONTINUOUS", family="gaussian", newobj = "workshop.obj", datasources = connections)

For the SLMA approach we can assign the predicted values at each study:
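A possible sketch, assuming the ds.glmSLMA call above has saved the fitted model at each study as "workshop.obj". The name "workshop.prediction.obj" is hypothetical, and the `$fit` component is assumed to hold the predicted values:

```r
library(dsBaseClient)

# Assumes the SLMA model was saved server-side as "workshop.obj".
# "workshop.prediction.obj" is a hypothetical name for the prediction object.
ds.glmPredict(glmname = "workshop.obj",
              newobj = "workshop.prediction.obj",
              datasources = connections)

# Comparing lengths shows whether rows with missing values were dropped
# when the predictions were produced.
ds.length("workshop.prediction.obj$fit", datasources = connections)
ds.length("D$LAB_TRIG", datasources = connections)
```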

Filter out the incomplete cases using the ds.completeCases() sub-setting command:
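A sketch of that filtering step. The object names "vars" and "vars.complete" are hypothetical:

```r
library(dsBaseClient)

# Bind the two model variables into one server-side object, then drop every
# row that contains a missing value. Object names are hypothetical.
ds.cbind(x = c("D$LAB_TRIG", "D$PM_BMI_CONTINUOUS"),
         newobj = "vars",
         datasources = connections)
ds.completeCases(x1 = "vars",
                 newobj = "vars.complete",
                 datasources = connections)

# Check the dimensions of the filtered object at each study
ds.dim("vars.complete", datasources = connections)
```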

Then plot the resultant data as a best linear fit on a scatter plot:
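A simplified sketch that draws the observed data and the fitted values as two separate plots. It assumes two server-side objects with hypothetical names: "vars.complete" (the complete cases, with a column named PM_BMI_CONTINUOUS) and "workshop.prediction.obj" (the predictions, with a `$fit` component). Overlaying the fitted line on the observed points would need the coordinates returned to the client (return.coords = TRUE) and local plotting, which is not shown here:

```r
library(dsBaseClient)

# Observed data: Triglycerides against BMI
ds.scatterPlot(x = "D$PM_BMI_CONTINUOUS",
               y = "D$LAB_TRIG",
               datasources = connections)

# Fitted values against BMI (assumed object and column names, see above).
# For a Gaussian model with one covariate these points should lie close to
# a straight line.
ds.scatterPlot(x = "vars.complete$PM_BMI_CONTINUOUS",
               y = "workshop.prediction.obj$fit",
               datasources = connections)
```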

1-covariate Binomial GLM

Say we want to regress Diabetes status on Cholesterol. This is not as simple as the example above, where a simple (and familiar!) Gaussian linear model was fitted. The response takes values of 0 and 1, a binary measure of whether a person does (=1) or does not (=0) have a diabetes diagnosis. For a response of this kind, a binomial GLM, which by default uses the logistic link, is the appropriate model.

Here we will use the ds.glmSLMA command because we are about to calculate predicted values again, and prediction only works with the SLMA version of the function. We will use only the 3rd connected study, as we want to run the analysis and produce the graph for that study site alone.
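A sketch of such a call. Two assumptions are made: the cholesterol variable is stored as LAB_TSC (substitute the column used in your data), and "diab.obj" is a hypothetical name for the saved model:

```r
library(dsBaseClient)

# Assumptions: cholesterol is held in D$LAB_TSC, and "diab.obj" is a
# hypothetical name for the model object saved at the study.
# connections[3] restricts the analysis to the third connected study.
ds.glmSLMA(formula = "D$DIS_DIAB~D$LAB_TSC",
           family = "binomial",
           newobj = "diab.obj",
           datasources = connections[3])
```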

So, with this formula, family and new object established, this is the rest of the code:
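A sketch of the prediction step, assuming a binomial ds.glmSLMA model has been saved at the third study under the hypothetical name "diab.obj". The output name is hypothetical too:

```r
library(dsBaseClient)

# Assumes a binomial SLMA model saved server-side as "diab.obj".
# output.type = "response" gives predicted probabilities of diabetes rather
# than values on the log-odds scale.
ds.glmPredict(glmname = "diab.obj",
              output.type = "response",
              newobj = "diab.prediction.obj",
              datasources = connections[3])

# Number of predicted values held at the third study
ds.length("diab.prediction.obj$fit", datasources = connections[3])
```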

Multi-covariate GLM

  • The function ds.glm is used to analyse the outcome variable DIS_DIAB (diabetes status) and the covariates PM_BMI_CONTINUOUS (continuous BMI), LAB_HDL (HDL cholesterol) and GENDER (gender), with an interaction between the latter two. In R this model is represented as:
  • Since v6.0, the intermediate results are printed by default (in red when viewing in RStudio):


  • Then, the rest of the results are printed in black:
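The model described in the first bullet can be sketched as follows. The binomial family is an assumption based on DIS_DIAB being a binary outcome, and `mod.multi` is a hypothetical object name:

```r
library(dsBaseClient)

# DIS_DIAB is binary, so a binomial family is assumed here.
# D$LAB_HDL*D$GENDER expands to both main effects plus their interaction.
mod.multi <- ds.glm(formula = "D$DIS_DIAB ~ D$PM_BMI_CONTINUOUS + D$LAB_HDL*D$GENDER",
                    family = "binomial",
                    datasources = connections)

# Pooled coefficient table
mod.multi$coefficients
```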


Conclusion

You can get back to the training homepage by clicking here.