DataSHIELD Training Part 2: Basic statistics and data manipulations
Introduction
This is the second in a 6-part DataSHIELD tutorial series.
The other parts in this DataSHIELD tutorial series are:
5: Subsetting
6: Modelling
Quick reminder for logging in:
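The login code is not reproduced here. As a reminder, a minimal sketch using the DSI and DSOpal client packages might look like the following. The server names, URL, credentials and table names are hypothetical placeholders rather than values from this tutorial, and the object name `connections` is an assumption; older DataSHIELD installations use a different login stack, so adapt this to your own setup.

```r
# Minimal login sketch; every connection detail below is a hypothetical placeholder.
library(DSI)
library(DSOpal)
library(dsBaseClient)

# Describe one entry per data provider
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "https://opal.example.org",
               user = "your_user", password = "your_password",
               table = "PROJECT.TABLE1", driver = "OpalDriver")
builder$append(server = "study2", url = "https://opal.example.org",
               user = "your_user", password = "your_password",
               table = "PROJECT.TABLE2", driver = "OpalDriver")
logindata <- builder$build()

# Log in to every server and assign each table to the server-side symbol D
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
```

The later commands in this part assume that the data have been assigned to the data frame D in this way.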
Basic statistics and data manipulations
Descriptive statistics: variable dimensions and class
You can obtain descriptive or exploratory statistics about the assigned variables held in the server-side R session, such as the number of participants at each data provider, the number of participants across all data providers, and the number of variables. Knowing these parameters of the data up front will facilitate your analysis.
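A minimal sketch of the call that returns these dimensions, assuming dsBaseClient is installed and that `connections` (a hypothetical name) holds the object returned at login, with the data assigned to the data frame D:

```r
library(dsBaseClient)

# `connections` is assumed to be the login object; D is the assigned data frame.
# With type = 'both' the dimensions are reported for each study and for the pooled data.
ds.dim(x = 'D', type = 'both', datasources = connections)
```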
The output of the ds.dim command on the assigned data frame D shows that in study 1 there are 2163 individuals with 11 variables, in study 2 there are 3088 individuals with 11 variables, and in both studies together there are 5251 individuals with 11 variables.
- Up to here, the dimensions of the assigned data frame D have been found using the ds.dim command, in which type='both' is the default argument.
- Now use the type='combine' argument in the ds.dim function to identify the number of individuals (5251) and variables (11) pooled across all studies.
- To check that the variables in each study are identical (as is required for pooled data analysis), use the ds.colnames function on the assigned data frame D.
- Use the ds.class function to identify the class (type) of a variable, for example whether it is an integer, character, factor, etc. The class determines which analyses you can run with that variable. To find the class of the variable LAB_HDL held in the assigned data frame D, pass the argument x='D$LAB_HDL'.
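The three steps in the list above can be sketched as follows. This assumes dsBaseClient is installed and that `connections` (a hypothetical name) is the object returned at login, with the data assigned to D.

```r
library(dsBaseClient)

# Pooled dimensions only: individuals and variables across all studies
ds.dim(x = 'D', type = 'combine', datasources = connections)

# Column names of D in each study, to confirm that the variables match
ds.colnames(x = 'D', datasources = connections)

# Class of the variable LAB_HDL in each study
ds.class(x = 'D$LAB_HDL', datasources = connections)
```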
Descriptive statistics: quantiles and mean
As LAB_HDL is a numeric variable, the distribution of the data can be explored.
- The function ds.quantileMean returns the quantiles and the statistical mean.
- To get the statistical mean alone, use the function ds.mean. Use the argument type to request split results (one mean per study).
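A minimal sketch of these calls, again assuming dsBaseClient is installed and that `connections` (a hypothetical name) is the login object with the data assigned to D. The default value of type may differ between dsBaseClient versions, so it is set explicitly in each call.

```r
library(dsBaseClient)

# Quantiles and mean of LAB_HDL, pooled across studies
ds.quantileMean(x = 'D$LAB_HDL', type = 'combine', datasources = connections)

# Mean of LAB_HDL pooled across studies
ds.mean(x = 'D$LAB_HDL', type = 'combine', datasources = connections)

# Mean of LAB_HDL reported separately for each study
ds.mean(x = 'D$LAB_HDL', type = 'split', datasources = connections)
```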