Explore data: the basics

Reading in data

 DataSHIELD Cloud Training Environment

In your working folder there will be a file called ukgas.csv. UKgas is a dataset of quarterly UK gas consumption from 1960 Q1 to 1986 Q4, in millions of therms. It ships with R as a built-in dataset, but here we will learn how to read it in from a file.

 DataSHIELD Training VMs (local machine)

UKgas is a dataset of quarterly UK gas consumption from 1960 Q1 to 1986 Q4, in millions of therms. It ships with R as a built-in dataset, but here we will learn how to read it in from a file.

  • In R, set your working directory. This is where all your work will be saved
#R
setwd("C:/FILE_PATH_TO_DATA_AND_RSCRIPTS")
  • Download ukgas.csv and save it to the working directory set above



Tip: Comments

Remember to comment your code e.g.

#this is a comment

or even create code sections e.g.

###################################
#NEW SECTION
################################### 

  • look up the read.csv function in the help file to learn how to read a .csv data file into R (web manual page).
?read.csv
  • read the data in and assign the data a name e.g. file1
file1<-read.csv("ukgas.csv")
  • you will see the newly created file1 variable in your top right Environment window
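
A quick way to confirm that a file was read as expected is to look at its structure straight after read.csv. The sketch below is self-contained, so it does not use your ukgas.csv: it rebuilds a similar table from R's built-in UKgas series and writes it to a temporary file first. The layout (one row per year with columns year and qtr1 to qtr4) and the names qtrs, gas, tmp and demo are assumptions made for illustration.

```r
# ASSUMPTION: the csv holds one row per year with columns year, qtr1..qtr4.
# Rebuild a table of that shape from the built-in quarterly UKgas series.
qtrs <- matrix(as.numeric(UKgas), ncol = 4, byrow = TRUE,
               dimnames = list(NULL, c("qtr1", "qtr2", "qtr3", "qtr4")))
gas <- data.frame(year = 1960:1986, qtrs)

# Write it to a temporary file (hypothetical file name) and read it back in
tmp <- file.path(tempdir(), "ukgas_demo.csv")
write.csv(gas, tmp, row.names = FALSE)
demo <- read.csv(tmp)

# Three quick checks on what was read
str(demo)    # column names, types and the first few values
dim(demo)    # number of rows and columns
names(demo)  # column names only
```

Running the same three checks on file1 should tell you whether your own file has the columns used in the rest of this exercise.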

Exploring the data

  • In your script, comment what the following commands do:

file1 # this displays all data in file1
head(file1) # this displays xxxxxxxxx
tail(file1) # this displays xxxxxxxxxxxxx
file1[1,] # this displays xxxxxxxxxxx
file1[1:5,] # this displays xxxxxxxxxxx
file1[,1] # this displays xxxxxxxxxxxxxxxxxx
file1[,1:5] # this displays xxxxxxxxxxxx
file1[,'year'] # this displays xxxxxx
file1[,'qtr1'] # this displays xxxxxx
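
If the square bracket notation is new to you, the general pattern is data[rows, columns]: whatever comes before the comma selects rows, whatever comes after it selects columns, and leaving one side empty means "all". Below is a small sketch on a made-up data frame (toy, id and score are hypothetical names, not part of ukgas.csv) that you can run before working through the commands above.

```r
# A made-up data frame, used only to illustrate [rows, columns] indexing
toy <- data.frame(id = 1:4, score = c(2.5, 3.0, 4.5, 1.0))

one_row   <- toy[2, ]        # row 2, all columns (still a data frame)
some_rows <- toy[2:3, ]      # rows 2 to 3, all columns
one_col   <- toy[, "score"]  # all rows of the column named score (a vector)
by_pos    <- toy[, 2]        # the same column, selected by position

one_row
some_rows
one_col
```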


<- or =

You can assign your selected data to a new variable name using either <- or =

year <- file1[,'year'] 
qtr1 = file1[,'qtr1']

It is best practice to use <- when assigning a value to a new variable x, rather than =, which reads as "x equals the value". = is typically used to denote arguments within functions.
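
The difference matters most inside a function call. The sketch below uses made-up vectors x and y (the names m1, m2, x_after_eq and x_after_arrow are also hypothetical) to show that = inside a call only names an argument, whereas <- inside a call performs an assignment in the calling environment.

```r
x <- c(1, 2, 3)
y = c(10, 20, 30)      # at the top level, = also assigns

m1 <- mean(x = y)      # here = names the argument x of mean(); our x is untouched
x_after_eq <- x        # copy of x taken after the call above

m2 <- mean(x <- y)     # here <- first overwrites our x with y, then passes it on
x_after_arrow <- x     # copy of x taken after the call above

x_after_eq
x_after_arrow
```

Both calls return the mean of y, but only the second one changes x. Keeping <- for assignment and = for arguments makes this kind of side effect easier to spot.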

Plotting data

  • In R open up the manual page for the plot function:
?plot

  • make a basic plot. This will automatically appear in the plot window in the bottom right quadrant of RStudio.
plot(x = year, y = qtr1)
  • use the type argument to change the plot to lines
  • search the manual for the par function - this allows you to set additional parameters in graphs. In the manual page search (using Ctrl F) for the word color. Find and implement the argument for plotting colour - make your plotting colour ‘red’.

Writing the data

  • printing to file takes three steps: open a graphics device driver (e.g. pdf, png, jpeg), draw the plot, and then close the device driver once you have finished printing to file.
  • Open up the manual page for the png function to find out how to apply the function and then write the plot to file. 
?png 
 
png(filename = "plot1.png") # opens the png printing driver, output to appear in plot1.png
plot(x = year, y = qtr1, type = 'b', col = 'red') # this plot is printed
dev.off() # print driver closed
  • A .png will appear in your working directory, viewable in the bottom right quadrant of RStudio

Tip: Printing a .pdf

A pdf output can be printed using:

pdf(file = "plot1.pdf", height = 7, width = 12)
# where height and width are given in inches
plot(x = year, y = qtr1, type = 'l', col = 'red')
dev.off()
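
A common slip is forgetting dev.off(), which can leave the output file empty or incomplete. The sketch below writes a plot to a temporary pdf and then checks that the file exists and is not empty. To stay self-contained it takes year and qtr1 from R's built-in UKgas series rather than from file1, and the names out and written are hypothetical.

```r
# Stand-ins for the year and qtr1 variables, taken from the built-in UKgas
# series (quarterly, 1960 Q1 to 1986 Q4), so every 4th value is a first quarter
year <- 1960:1986
qtr1 <- as.numeric(UKgas)[seq(1, length(UKgas), by = 4)]

out <- file.path(tempdir(), "plot_demo.pdf")   # hypothetical output file
pdf(file = out, height = 7, width = 12)        # open the pdf device
plot(x = year, y = qtr1, type = 'l', col = 'red')
dev.off()                                      # close it so the file is completed

# Check that the file was written and is not empty
written <- file.exists(out) && file.size(out) > 0
written
```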

Beautifying plots

R plots in layers. You start with a base plot using the plot function and then can add layers of extra data, regression lines, legends, text equations etc on top of it using:

  • points()
  • text()
  • lines()
  • legend()
  • abline()

  • Open up the manual page for points to learn how to add additional data to a plot.  
  • Use points to add qtr2 data points to the qtr1 graph. Remember you will need to assign your qtr2 data to a new variable before you can plot it. 

#look up points in help file
?points
 
#assign qtr2 data to variable called qtr2
qtr2 <- file1[,'qtr2']
 
#plot qtr 1
plot(x = year, y = qtr1, type = 'l', col = 'red')
 
#add qtr2 data
points(x = year, y = qtr2, type = 'l', col = 'black')

Tip: points()

The arguments that can be applied in the points function are very similar to the plot function.

  • Open up the manual page for the legend function.  Use the information to add a legend to the plot.

#look up legend in help file
?legend


#assign qtr2 data to variable called qtr2
qtr2 <- file1[,'qtr2']

#plot qtr 1
plot(x = year, y = qtr1, type = 'l', col = 'red')

#add qtr2 data
points(x = year, y = qtr2, type = 'l', col = 'black')
 
#add legend
legend(x = 'topleft', y = NULL, legend = c('qtr1', 'qtr2'), col = c('red', 'black'), lty = 1)
 
  • Add qtr3 and qtr4 data to the plot. Print the final plot to file. 

Tip: Full plot

You will see that the qtr3 data plots off the bottom of the y axis. Remember to assign the qtr3 and qtr4 data to new variables first. Then use the min and max functions to find the minimum of qtr3 and the maximum of qtr1, and pass them to the ylim argument in plot() to set the limits of the y axis, as in the example below (where i is the minimum and j is the maximum).

plot(x = year, y = qtr1, type = 'l', col = 'red', ylim=c(i,j))
points(x = year, y = qtr2, type = 'l', col = 'black')
points(x = year, y = qtr3, type = 'l', col = 'blue')
points(x = year, y = qtr4, type = 'l', col = 'green')
legend(x = 'topleft', y = NULL, legend = c('qtr1', 'qtr2','qtr3', 'qtr4'), col = c('red', 'black', 'blue', 'green'), lty = 1)
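
Putting the tip together, one way to find i and j and print the finished plot to file is sketched below. It is self-contained, so it rebuilds the four quarterly series from R's built-in UKgas dataset instead of file1; in your own script you would take them from file1 as before. The output file name plot_final_demo.pdf and the helper name qtrs are hypothetical.

```r
# Stand-ins for the variables taken from file1, rebuilt from the built-in
# UKgas series: one row per year, one column per quarter
year <- 1960:1986
qtrs <- matrix(as.numeric(UKgas), ncol = 4, byrow = TRUE)
qtr1 <- qtrs[, 1]
qtr2 <- qtrs[, 2]
qtr3 <- qtrs[, 3]
qtr4 <- qtrs[, 4]

# y axis limits as described in the tip
i <- min(qtr3)
j <- max(qtr1)

# Print the full plot to a temporary pdf
out <- file.path(tempdir(), "plot_final_demo.pdf")
pdf(file = out, height = 7, width = 12)
plot(x = year, y = qtr1, type = 'l', col = 'red', ylim = c(i, j))
points(x = year, y = qtr2, type = 'l', col = 'black')
points(x = year, y = qtr3, type = 'l', col = 'blue')
points(x = year, y = qtr4, type = 'l', col = 'green')
legend(x = 'topleft', y = NULL, legend = c('qtr1', 'qtr2', 'qtr3', 'qtr4'),
       col = c('red', 'black', 'blue', 'green'), lty = 1)
dev.off()
```

If you are not sure which quarter holds the extremes, range(c(qtr1, qtr2, qtr3, qtr4)) returns both limits in one call and can be passed straight to ylim.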



info: Answer script

You have now completed Explore data: the basics. Make sure you:

  • comment your script appropriately
  • save your script somewhere sensible
  • check that your script is similar to the example answer script.

DataSHIELD Wiki by DataSHIELD is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.datashield.ac.uk/wiki