DataSHIELD Training Part 5: Sub-setting
Introduction
This is the fifth in a 6-part DataSHIELD tutorial series.
The other parts in this DataSHIELD tutorial series are:
5: Sub-setting
6: Modelling
Quick reminder for logging in:
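The login code is not reproduced here, so the following is only a minimal sketch of a typical DSI/DSOpal login, not necessarily the exact code used in the earlier parts. The server URL, the environment variable names and the table names are placeholders (assumptions); substitute the details of your own servers. The sketch assigns the data to the symbol D and stores the connections in an object called connections, the names used throughout this part.

```r
library(DSI)
library(DSOpal)
library(dsBaseClient)

# Placeholder values (assumptions): replace with your own server details.
opal_url <- "https://opal.example.org"   # hypothetical URL
ds_user  <- Sys.getenv("DS_USER")        # hypothetical environment variable
ds_pass  <- Sys.getenv("DS_PASSWORD")    # hypothetical environment variable

# Describe each study server; the table names are assumed, not confirmed.
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = opal_url,
               user = ds_user, password = ds_pass,
               table = "CNSIM.CNSIM1", driver = "OpalDriver")
builder$append(server = "study2", url = opal_url,
               user = ds_user, password = ds_pass,
               table = "CNSIM.CNSIM2", driver = "OpalDriver")
logindata <- builder$build()

# Log in and assign each table to the serverside symbol "D".
connections <- DSI::datashield.login(logins = logindata,
                                     assign = TRUE, symbol = "D")
```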
Sub-setting
In DataSHIELD, the main function for sub-setting data is ds.dataFrameSubset.
You may wish to use it to:
- Subset a column of data by its "Class"
- Subset a dataframe to remove any "NA"s
- Subset a numeric column of a dataframe using a Boolean inequality
Sub-setting by class
You may wish to generate subsets for each level of a categorical variable. To do this we must think about which levels of that categorical variable are available, then use boolean operators to isolate them:
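One way to see which levels are available is sketched below. It assumes that connections is the object returned when you logged in, and that GENDER is held as a factor in the data frame D on each server.

```r
library(dsBaseClient)

# Assumes `connections` comes from DSI::datashield.login() and that the
# data frame was assigned to the symbol "D" on each server.

# Check how GENDER is stored on each server (expected here: a factor).
ds.class(x = "D$GENDER", datasources = connections)

# List the levels of GENDER available in each study.
ds.levels(x = "D$GENDER", datasources = connections)
```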
At this stage, we want to work out which arguments the DataSHIELD function accepts, so we call up its help page with ?ds.dataFrameSubset.
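The help lookup can be run from the R console as in this short sketch; it needs only the dsBaseClient package to be installed, not a server connection.

```r
library(dsBaseClient)

# Open the help page for the sub-setting function.
help("ds.dataFrameSubset")

# Print just the argument list, without opening the help page.
args(ds.dataFrameSubset)
```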
So what we have learnt from this is that we must specify:
- the data frame we are working with throughout this tutorial ("D"), as the df.name argument
- the column we wish to split by class ("D$GENDER"), as the V1.name argument
- the value we want to compare the column with, in this case a number ("0"), as the V2.name argument
- the boolean operator we want to use to compare V1.name with V2.name, as the Boolean.operator argument
- the specific name we want to call the new object, in string form, with the newobj argument
- as always, the server connections to use, as the datasources argument (datasources = connections)
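Putting the arguments listed above together, the two calls might look like the sketch below. It assumes GENDER is coded 0 and 1; which code corresponds to males and which to females is an assumption here, so check it against the data dictionary before relying on the object names.

```r
library(dsBaseClient)

# Assumes `connections` comes from DSI::datashield.login() and that the
# data frame was assigned to the symbol "D" on each server.

# ASSUMPTION: GENDER == 0 codes males. Verify against the data dictionary.
ds.dataFrameSubset(df.name = "D",
                   V1.name = "D$GENDER",
                   V2.name = "0",
                   Boolean.operator = "==",
                   newobj = "CNSIM.subset.Males",
                   datasources = connections)

# ASSUMPTION: GENDER == 1 codes females. Verify against the data dictionary.
ds.dataFrameSubset(df.name = "D",
                   V1.name = "D$GENDER",
                   V2.name = "1",
                   Boolean.operator = "==",
                   newobj = "CNSIM.subset.Females",
                   datasources = connections)
```

Note that V2.name is passed as a string ("0"), not as a bare number.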
Now there are two serverside objects which have split GENDER by class, to which we have assigned the names "CNSIM.subset.Males" and "CNSIM.subset.Females".
Sub-setting to remove NAs
- The example below uses the function ds.completeCases to subset the assigned data frame D, removing any rows that contain missing values (NA).
The output subset is named "D_without_NA":
A subsequent check using ds.dim() will confirm that the new object "D_without_NA" has fewer rows than the original object "D".
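A minimal sketch of the two steps just described, assuming connections is the object returned when you logged in:

```r
library(dsBaseClient)

# Assumes `connections` comes from DSI::datashield.login() and that the
# data frame was assigned to the symbol "D" on each server.

# Keep only the rows of D that have no missing values in any column.
ds.completeCases(x1 = "D",
                 newobj = "D_without_NA",
                 datasources = connections)

# Compare dimensions: the new object should have fewer rows than D
# and the same number of columns.
ds.dim(x = "D", datasources = connections)
ds.dim(x = "D_without_NA", datasources = connections)
```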
Sub-set by inequality
Say we wanted a subset where BMI values are ≥ 25, and to call it subset.BMI.25.plus.
The V1.name argument should then specify the column name for BMI, which is PM_BMI_CONTINUOUS (remember, this can always be checked with the command ds.colnames(x="D")), and the V2.name argument should specify the value to compare the column to, namely 25, using the boolean operator >=. In the DataSHIELD syntax this looks like the following:
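A sketch of that call, built only from the argument values named above and assuming connections is the object returned when you logged in:

```r
library(dsBaseClient)

# Assumes `connections` comes from DSI::datashield.login() and that the
# data frame was assigned to the symbol "D" on each server.

# Keep the rows of D where PM_BMI_CONTINUOUS is greater than or equal to 25.
ds.dataFrameSubset(df.name = "D",
                   V1.name = "D$PM_BMI_CONTINUOUS",
                   V2.name = "25",
                   Boolean.operator = ">=",
                   newobj = "subset.BMI.25.plus",
                   datasources = connections)
```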
The output is:
The subset of data retains the same variable names, i.e. column names. Note that we are addressing our newly-named object on the serverside, not accessing a column of the original dataframe "D$...." as before:
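One way to check this is to ask for the column names of the new object directly, as in this sketch (again assuming connections is the object returned when you logged in):

```r
library(dsBaseClient)

# Assumes `connections` comes from DSI::datashield.login() and that
# "subset.BMI.25.plus" has already been created on each server.

# The argument is the name of the new serverside object, not "D".
ds.colnames(x = "subset.BMI.25.plus", datasources = connections)
```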
Outputs:
To verify that the subset above is correct (that it holds only observations with BMI ≥ 25), run the function ds.quantileMean with the argument type='split'; the BMI quantiles reported for each study should all be ≥ 25.
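The check can be sketched as follows, assuming connections is the object returned when you logged in and that the subset has already been created:

```r
library(dsBaseClient)

# Assumes `connections` comes from DSI::datashield.login() and that
# "subset.BMI.25.plus" exists on each server.

# type = "split" returns the quantiles and mean separately for each study.
ds.quantileMean(x = "subset.BMI.25.plus$PM_BMI_CONTINUOUS",
                type = "split",
                datasources = connections)
```

The quantiles are summaries, not the minimum itself, so this is a sanity check, not a strict proof that every value is ≥ 25.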
Outputs:
Finally we can create a histogram of these results to confirm them visually:
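A sketch of the histogram call, assuming connections is the object returned when you logged in and that the subset has already been created:

```r
library(dsBaseClient)

# Assumes `connections` comes from DSI::datashield.login() and that
# "subset.BMI.25.plus" exists on each server.

# By default a separate histogram is drawn for each study.
ds.histogram(x = "subset.BMI.25.plus$PM_BMI_CONTINUOUS",
             datasources = connections)
```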
Gives code output:
And graph: