...
This section assumes some familiarity with generalised linear model (GLM) theory and techniques. Background material is widely available online, for example this Colorado University publication.
Basic 1-covariate Gaussian GLM
We want to examine the relationship between BMI (a continuous variable) and triglycerides (another continuous variable). Because the response variable here, triglycerides (LAB_TRIG), is continuous, a Gaussian family is the natural starting point for the model.
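As a point of reference, a Gaussian GLM with the default identity link is the same model as ordinary least-squares regression. The sketch below shows this in native R on simulated data; the variables `bmi` and `trig` and their values are hypothetical stand-ins, not the workshop data.

```r
# Simulated stand-ins for BMI and triglycerides (hypothetical values,
# not the workshop data).
set.seed(1)
bmi  <- rnorm(200, mean = 27, sd = 4)
trig <- 0.5 + 0.05 * bmi + rnorm(200, sd = 0.3)

# A Gaussian GLM with the default identity link ...
fit_glm <- glm(trig ~ bmi, family = gaussian)
# ... estimates the same coefficients as ordinary least squares.
fit_lm <- lm(trig ~ bmi)

coef(fit_glm)
coef(fit_lm)
```

The two sets of coefficients agree up to numerical tolerance, which is why the Gaussian case is often described as the familiar linear model.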
...
ds.glmSLMA(formula = "D$LAB_TRIG~D$PM_BMI_CONTINUOUS", family="gaussian", newobj = "workshop.obj", datasources = connections)
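To give a feel for what a study-level meta-analysis (SLMA) does, here is a minimal native R sketch: the same model is fitted separately in each study and the per-study estimates are then combined. The three simulated data frames, the helper `make_study` and the fixed-effect inverse-variance pooling are all assumptions made for illustration; ds.glmSLMA may pool its estimates differently.

```r
set.seed(42)

# Hypothetical helper that simulates one study (not workshop data).
make_study <- function(n) {
  bmi  <- rnorm(n, mean = 27, sd = 4)
  trig <- 0.5 + 0.05 * bmi + rnorm(n, sd = 0.3)
  data.frame(bmi = bmi, trig = trig)
}
studies <- list(study1 = make_study(300),
                study2 = make_study(500),
                study3 = make_study(400))

# Step 1: fit the same Gaussian GLM separately in every study.
fits <- lapply(studies, function(d) glm(trig ~ bmi, family = gaussian, data = d))

# Step 2: collect the BMI slope and its standard error from each fit.
est <- sapply(fits, function(f) summary(f)$coefficients["bmi", "Estimate"])
se  <- sapply(fits, function(f) summary(f)$coefficients["bmi", "Std. Error"])

# Step 3: fixed-effect inverse-variance pooling of the study estimates.
w         <- 1 / se^2
pooled    <- sum(w * est) / sum(w)
pooled_se <- sqrt(1 / sum(w))

round(c(pooled = pooled, se = pooled_se), 4)
```

Only the per-study coefficients and standard errors are needed for the pooling step, which is the reason this approach suits a federated setting.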
For the SLMA approach we can assign the predicted values at each study:
Code Block |
---|
ds.glmPredict(glmname = "workshop.obj", newobj = "workshop.prediction.obj", datasources = connections)
ds.length("workshop.prediction.obj$fit", datasources=connections)
ds.length("D$LAB_TRIG", datasources=connections) |
Filter out the incomplete cases using the ds.completeCases() sub-setting command:
Code Block |
---|
ds.cbind(c('D$LAB_TRIG', 'D$PM_BMI_CONTINUOUS'), newobj='vars')
ds.completeCases('vars', newobj='vars.complete')
ds.dim('vars.complete')
|
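The length comparison and the complete-case filter above can be understood from how native R behaves: with the default missing-value handling, fitted values exist only for rows where every model variable is observed. The sketch below illustrates this on a hypothetical data frame `d`; it assumes the server-side prediction behaves analogously, which is what the length check above is meant to confirm.

```r
set.seed(7)

# Hypothetical data frame with missing values in both columns.
d <- data.frame(bmi = rnorm(50, mean = 27, sd = 4))
d$trig <- 0.5 + 0.05 * d$bmi + rnorm(50, sd = 0.3)
d$bmi[c(3, 10)]   <- NA
d$trig[c(10, 25)] <- NA

fit <- glm(trig ~ bmi, family = gaussian, data = d)

# Fitted values exist only for rows with no missing model variables ...
n_fitted <- length(predict(fit))

# ... so the covariate must be restricted to complete cases to line up.
d_complete <- d[complete.cases(d), ]
n_complete <- nrow(d_complete)

c(fitted = n_fitted, complete = n_complete, all_rows = nrow(d))
```

Plotting the fitted values against the unfiltered covariate would fail or misalign, because the two vectors have different lengths.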
Then plot the fitted values as a line of best fit over a scatter plot of the observed data:
Code Block |
---|
df1 <- ds.scatterPlot('D$PM_BMI_CONTINUOUS', "D$LAB_TRIG", datasources = connections, return.coords = TRUE)
df2 <- ds.scatterPlot('vars.complete$PM_BMI_CONTINUOUS', "workshop.prediction.obj$fit", datasources = connections, return.coords = TRUE)
# then in native R
par(mfrow=c(2,2))
plot(as.data.frame(df1[[1]][[1]])$x,as.data.frame(df1[[1]][[1]])$y, xlab='Body Mass Index', ylab='Triglycerides', main='Study 1')
lines(as.data.frame(df2[[1]][[1]])$x,as.data.frame(df2[[1]][[1]])$y, col='red')
plot(as.data.frame(df1[[1]][[2]])$x,as.data.frame(df1[[1]][[2]])$y, xlab='Body Mass Index', ylab='Triglycerides', main='Study 2')
lines(as.data.frame(df2[[1]][[2]])$x,as.data.frame(df2[[1]][[2]])$y, col='red')
plot(as.data.frame(df1[[1]][[3]])$x,as.data.frame(df1[[1]][[3]])$y, xlab='Body Mass Index', ylab='Triglycerides', main='Study 3')
lines(as.data.frame(df2[[1]][[3]])$x,as.data.frame(df2[[1]][[3]])$y, col='red') |
1-covariate Binomial GLM
Say we want to regress diabetes status on HDL cholesterol. This is less straightforward than the example above, where a simple (and familiar!) Gaussian linear model was fitted. The response takes the values 0 and 1, a binary indicator of whether a person does (=1) or does not (=0) have a diabetes diagnosis. Because of this, a different family is needed: a binomial GLM, which by default uses a logit link and so models the probability of a diagnosis.
Here we use the ds.glmSLMA command because we are about to calculate predicted values again, which only works with the SLMA version of the function. We use only the third connected study, as we want to conduct the analysis and produce the graph for that study site alone.
Code Block |
---|
# Store the client-side summary as modSLMA so its coefficients can be inspected later
modSLMA <- ds.glmSLMA(formula = "D$DIS_DIAB~D$LAB_HDL", family="binomial", newobj = "workshop.obj", datasources = connections[3]) |
So, with this formula, family and new object established, this is the rest of the code:
Code Block |
---|
# Assumption: predictions are regenerated from the binomial model, mirroring the Gaussian example
ds.glmPredict(glmname = "workshop.obj", newobj = "workshop.prediction.obj", datasources = connections[3])
ds.length("workshop.prediction.obj$fit", datasources = connections[3])
ds.length('D$LAB_HDL', datasources = connections[3])
ds.numNA('D$LAB_HDL', datasources = connections[3])
# The next call FAILED because the input is a single vector, so it is left commented out:
# ds.completeCases('D$LAB_HDL', 'hdl.complete', datasources = connections[3])
# Binding the column into a data frame first doesn't fail, because the input object is a data frame
ds.cbind(c('D$LAB_HDL','D$LAB_HDL'), newobj='D2', datasources = connections[3])
ds.completeCases('D2', 'D2.complete', datasources = connections[3])
ds.dim('D2', datasources = connections[3])
ds.dim('D2.complete', datasources = connections[3])
# Convert the outcome to numeric so it can be plotted on the 0/1 scale
ds.asNumeric('D$DIS_DIAB', newobj='DIAB.n', datasources = connections[3])
df1 <- ds.scatterPlot('D$LAB_HDL', "DIAB.n", datasources = connections[3], return.coords = TRUE)
df2 <- ds.scatterPlot('D2.complete$LAB_HDL', "workshop.prediction.obj$fit", datasources = connections[3], return.coords = TRUE)
# then in native R
plot(as.data.frame(df1[[1]][[1]])$x,as.data.frame(df1[[1]][[1]])$y, xlab='LAB_HDL', ylab='DIS_DIAB')
lines(as.data.frame(df2[[1]][[1]])$x,as.data.frame(df2[[1]][[1]])$y,col='red')
# Refit with ds.glm and compare the coefficients with the SLMA fit
mod <- ds.glm(formula = "D$DIS_DIAB~D$LAB_HDL", family="binomial", datasources = connections[3])
mod$coefficients
modSLMA$output.summary$study1$coefficients # assumes the ds.glmSLMA result was stored as modSLMA
# Fitted probability curve (inverse logit of the linear predictor) over the observed HDL range
X <- seq(from=-0.5, to=3, by=0.01)
Y <- (exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X))/(1+exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X))
plot(X,Y)
# The same curve over a wider range, to show the full S-shape
X <- seq(from=-10, to=3, by=0.01)
Y <- (exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X))/(1+exp(mod$coefficients[1,1]+mod$coefficients[2,1]*X))
plot(X,Y)
plot(as.data.frame(df1[[1]][[1]])$x,as.data.frame(df1[[1]][[1]])$y, xlab='LAB_HDL', ylab='DIS_DIAB', xlim=c(-10,3))
lines(X,Y, col='red') |
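The hand-written expression for Y above is the inverse logit of the linear predictor. The following native R sketch, on simulated data with hypothetical variables `hdl` and `diab`, shows that the same curve can be obtained from the built-in `plogis()` function or from `predict()` on the response scale.

```r
set.seed(3)

# Simulated binary outcome and continuous covariate (hypothetical data).
hdl  <- rnorm(500, mean = 1.5, sd = 0.4)
diab <- rbinom(500, size = 1, prob = plogis(1 - 2 * hdl))

fit <- glm(diab ~ hdl, family = binomial)
b   <- coef(fit)

# Inverse logit written out by hand, as in the code above ...
x      <- seq(from = 0, to = 3, by = 0.01)
eta    <- b[1] + b[2] * x
p_hand <- exp(eta) / (1 + exp(eta))

# ... agrees with plogis() and with predict() on the response scale.
p_plogis  <- plogis(eta)
p_predict <- predict(fit, newdata = data.frame(hdl = x), type = "response")

# Observed 0/1 outcomes with the fitted probability curve on top.
plot(hdl, diab, xlab = "HDL cholesterol", ylab = "Diabetes status")
lines(x, p_hand, col = "red")
```

Using `plogis()` avoids repeating the linear predictor twice in one expression, which makes the calculation easier to read and less prone to typing errors.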
Multi-covariate GLM
- The function ds.glm is used to analyse the outcome variable DIS_DIAB (diabetes status) and the covariates PM_BMI_CONTINUOUS (continuous BMI), LAB_HDL (HDL cholesterol) and GENDER (gender), with an interaction between the latter two. In R this model is represented as:
...