Extended R Practical: Analysing simulated ALSPAC data

Reading the data 

 DataSHIELD Cloud Training Environment

We have simulated some data from the ALSPAC study (see Avraam, Wilson and Burton, 2018, for data synthesis details). The file is located in your work folder as alspac-simulated.csv. The data dictionary for this simulated dataset is in the panel below.

 DataSHIELD Training VMs

We have simulated some data from the ALSPAC study (see Avraam, Wilson and Burton, 2018, for data synthesis details).

The data dictionary for this simulated dataset is in the panel below. 



Information: the data dictionary

The names of all other variables end in either .7 or .11 (depending on whether they were measured at the age 7 clinic or the age 11 clinic).

male codes sex: 1=male, 0=female

age.yrs is the age (in decimal years) on the day of the clinic at age 7 or 11

ht is height in cm

ht.sit is sitting height in cm

ws is waist circumference in cm

hp is hip circumference in cm

wt is weight in kg

sbp is systolic blood pressure (the top of the blood pressure fluctuation) measured (as is conventional) in mm of Hg (mercury)

dbp is diastolic blood pressure (the bottom of the blood pressure fluctuation) measured (as is conventional) in mm of Hg (mercury)

pulse is pulse rate measured in beats per minute

BMI is body mass index derived as wt/(ht/100)^2. The height variable is divided by 100 to express it in metres rather than centimetres.
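
The BMI derivation can be checked by hand. The following is a minimal sketch using made-up height and weight values (they are not taken from the simulated dataset):

```r
# Made-up example values, not taken from the simulated ALSPAC data
wt <- 25    # weight in kg
ht <- 125   # height in cm

# Convert height from cm to metres, then divide weight by height squared
bmi <- wt / (ht / 100)^2
print(bmi)
```

Note that the brackets around ht / 100 matter: the conversion to metres has to happen before squaring.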

  • start a new script and save it as a .R file in an appropriate location
  • comment in some header information: what is the script for? who is it written by? what data set is being used? etc
  • read the dataset into R using the read.csv function and assign it to the variable sim.alspac
  • look up the colnames function in the help file and apply it to sim.alspac to list all the column headings in the data.
  • look up the dim function in the help file and apply it to sim.alspac to get the dimensions of the dataset.  The number of columns is the number of variables and the number of rows is the number of participants.
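
The reading and inspection steps can be tried out on a small stand-in file first. The sketch below is only an illustration: the data frame toy, its values and its column names (male, ht.7, wt.7) are assumptions made up for the example, and the CSV is written to a temporary file so the code runs anywhere.

```r
# Toy stand-in for the real file: three participants, hypothetical columns
toy <- data.frame(male = c(1, 0, 1),
                  ht.7 = c(125.1, 122.4, 127.9),
                  wt.7 = c(25.3, 23.8, 27.0))

# Write the toy data to a temporary CSV so that read.csv has something to read
csv.path <- tempfile(fileext = ".csv")
write.csv(toy, csv.path, row.names = FALSE)

# In the practical, replace csv.path with the path to alspac-simulated.csv
sim.toy <- read.csv(csv.path)

colnames(sim.toy)  # column headings, one per variable
dim(sim.toy)       # number of rows (participants), then number of columns (variables)
```

Typing ?colnames or ?dim at the R prompt opens the corresponding help page.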

Selecting and subsetting

Selecting variables can be done in a number of ways, including selection by column number or by column name.  It is best practice to use the column name, as the column number may vary between datasets.

select.1 <- dataframe[, x]    # assign column number x of dataframe to select.1
select.2 <- dataframe[, "x"]  # assign the column named x in dataframe to select.2
select.3 <- dataframe$x       # assign column x of dataframe to select.3 using the $ operator

It is also possible to use comparison operators to subset the data by a value or a range of values.  See the help file for the subset function for further explanation.

subset.4 <- subset(dataframe, x < 5)   # rows of dataframe where x is less than 5
subset.5 <- subset(dataframe, x == 5)  # rows of dataframe where x equals 5
  • create a subset of sim.alspac for males called subset.male and for females called subset.female
  • How many participants are female and how many are male? HINT: Use dim to check the dimensions of subset.male and subset.female.  
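
To see how selection and subsetting behave before applying them to sim.alspac, here is a small sketch on a made-up data frame. The object names (toy, toy.male, toy.female), the values and the column name bmi.7 are assumptions for illustration only; male follows the coding given in the data dictionary.

```r
# Toy data with hypothetical values; male is coded 1 = male, 0 = female
toy <- data.frame(male  = c(1, 0, 1, 0, 1),
                  bmi.7 = c(16.2, 15.8, 17.1, 15.1, 16.6))

# Three equivalent ways to select the same column
by.number <- toy[, 2]
by.name   <- toy[, "bmi.7"]
by.dollar <- toy$bmi.7
identical(by.number, by.name)   # TRUE
identical(by.name, by.dollar)   # TRUE

# Row subsets based on a logical condition
toy.male   <- subset(toy, male == 1)
toy.female <- subset(toy, male == 0)

dim(toy.male)     # 3 rows, 2 columns
dim(toy.female)   # 2 rows, 2 columns
```

The first element returned by dim is the number of rows, which here is the number of participants in each subset.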

Exploring the data

  • Get object summary statistics by using the summary function on subset.male and subset.female
  • Use the boxplot function to plot BMI at age 7 against gender. HINT: You will only need to use the arguments formula= and data=
  • Output your boxplot as a .png file using the png function. 
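
As a rough sketch of the plotting and export steps, the example below draws a boxplot from randomly generated toy data and writes it to a .png file in the temporary directory. The data, the column name bmi.7 and the file name toy-boxplot.png are assumptions for illustration; in the practical you would use sim.alspac and a file location of your choice.

```r
# Randomly generated toy data (hypothetical values), 20 females and 20 males
set.seed(1)
toy <- data.frame(male  = rep(c(0, 1), each = 20),
                  bmi.7 = rnorm(40, mean = 16, sd = 2))

# Open a png device, draw the plot, then close the device to write the file
png.path <- file.path(tempdir(), "toy-boxplot.png")
png(filename = png.path)
boxplot(formula = bmi.7 ~ male, data = toy,
        xlab = "male (1 = male, 0 = female)", ylab = "BMI at age 7")
dev.off()

file.exists(png.path)
```

The file is only completed once dev.off() has been called, so the plotting commands must sit between png and dev.off.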

  • Use the hist function to plot histograms of BMI at age 7 for females and males.  HINT: You can layer graphs over one another by using the argument add=TRUE in the second histogram.  The line colour of the histogram can be set using the border argument, e.g. border="red"
  • Make the plot more readable by using the legend function to add an appropriate key.
  • Output your histogram as a .png file using the png function. 
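
The layering of two histograms can be sketched as follows, again on randomly generated toy values rather than the real data. The vector names, colours and file name are assumptions for illustration. Using the same breaks for both histograms is a choice made here so that the bars line up; it is not required by hist.

```r
# Randomly generated toy BMI values (hypothetical), one vector per sex
set.seed(2)
bmi.female <- rnorm(50, mean = 16, sd = 2)
bmi.male   <- rnorm(50, mean = 17, sd = 2)

# Shared bin edges covering the full range of both vectors
breaks <- seq(floor(min(bmi.female, bmi.male)),
              ceiling(max(bmi.female, bmi.male)), by = 1)

png.path <- file.path(tempdir(), "toy-histogram.png")
png(filename = png.path)
hist(bmi.female, breaks = breaks, border = "red",
     main = "BMI at age 7 (toy data)", xlab = "BMI")
hist(bmi.male, breaks = breaks, border = "blue", add = TRUE)  # drawn on top
legend("topright", legend = c("female", "male"),
       lty = 1, col = c("red", "blue"))
dev.off()

file.exists(png.path)
```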

  • Use the plot function to create a scatter plot of height and weight age 7 for males. 
  • Use the lm function to generate a linear model called lm1 for the two variables.  HINT: R uses formula notation in the formula argument, e.g. formula=y~x
  • Use the summary function on lm1 to get the coefficients.
  • You can add your regression line to the scatter plot by running the abline function on lm1 after your plot function.
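
The scatter plot, linear model and regression line fit together as in the sketch below. The heights and weights are randomly generated with a built-in linear relationship, and the names toy.male, ht.7 and wt.7 are assumptions for illustration, so the coefficients say nothing about the real data.

```r
# Randomly generated toy data: weight depends linearly on height plus noise
set.seed(3)
ht <- rnorm(30, mean = 125, sd = 5)
wt <- 0.4 * ht - 25 + rnorm(30, sd = 1.5)
toy.male <- data.frame(ht.7 = ht, wt.7 = wt)

# Fit weight as a function of height using formula notation (y ~ x)
lm1 <- lm(formula = wt.7 ~ ht.7, data = toy.male)
summary(lm1)$coefficients   # intercept and slope with standard errors

# Draw the scatter plot, then add the fitted line to the same plot
png.path <- file.path(tempdir(), "toy-scatter.png")
png(filename = png.path)
plot(wt.7 ~ ht.7, data = toy.male,
     xlab = "Height at age 7 (cm)", ylab = "Weight at age 7 (kg)")
abline(lm1)
dev.off()
```

abline only works on an open plot, which is why it has to come after the call to plot.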

Modelling

  • Apply a generalised linear model (glm) using the glm function to investigate the relationships between the variables
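
One possible shape for such a model is sketched below on randomly generated toy data. The choice of outcome (wt.7), covariates (ht.7 and male) and family (gaussian) is an assumption made for illustration; the practical leaves the choice of variables open.

```r
# Randomly generated toy data with hypothetical column names
set.seed(4)
n <- 100
toy <- data.frame(male = rbinom(n, 1, 0.5),
                  ht.7 = rnorm(n, mean = 125, sd = 5))
toy$wt.7 <- 0.4 * toy$ht.7 - 25 + 0.5 * toy$male + rnorm(n, sd = 1.5)

# Gaussian GLM of weight on height and sex
glm1 <- glm(formula = wt.7 ~ ht.7 + male, family = gaussian, data = toy)
summary(glm1)$coefficients   # one row per term: intercept, ht.7, male
```

For a binary outcome, family = binomial would be used in place of gaussian.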


Practical completed

Your R script should be similar to the example R answer script. Try uploading your own dataset and repeating the practical.






DataSHIELD Wiki by DataSHIELD is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.datashield.ac.uk/wiki