DataSHIELD Training Part 5: Sub-setting


Introduction

This is the fifth in a 6-part DataSHIELD tutorial series.

The other parts in this DataSHIELD tutorial series are:

Quick reminder for logging in:
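A minimal login sketch is given here as a reminder. The server name, URL, user name, password and table name are placeholders (assumptions, not real details); substitute the values for your own DataSHIELD servers. The result is a connections object and a serverside data frame "D", as used in the rest of this tutorial.

```r
# Load the DataSHIELD client packages
library(DSI)
library(DSOpal)
library(dsBaseClient)

# Describe each server to connect to.
# NOTE: url, user, password and table are placeholder values (assumptions).
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1",
               url = "https://opal.example.org",
               user = "your_username",
               password = "your_password",
               table = "CNSIM.CNSIM1",
               driver = "OpalDriver")
# Repeat builder$append(...) for each additional study server.
logindata <- builder$build()

# Log in and assign the tables to a serverside data frame called "D"
connections <- DSI::datashield.login(logins = logindata,
                                     assign = TRUE,
                                     symbol = "D")
```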


Sub-setting

In DataSHIELD, sub-setting is done mainly with one function, ds.dataFrameSubset.

You may wish to:

  • Subset a column of data by its "Class"
  • Subset a dataframe to remove any "NA"s
  • Subset a numeric column of a dataframe using a Boolean inequality

Sub-setting by class

You may wish to generate subsets for each level of a categorical variable. To do this, first check which levels of that variable are available, then use Boolean operators to isolate each one.
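As a sketch of that first step, the levels of the GENDER column can be listed with ds.levels. This assumes an active login with the data frame assigned as "D" and the connections object named connections.

```r
# List the levels of the categorical variable GENDER in each study
ds.levels(x = "D$GENDER", datasources = connections)
```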

At this stage we want to know which arguments the DataSHIELD function accepts, so we call up the function's help page.
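The help page can be opened from the R console in the usual way, assuming the dsBaseClient package is loaded:

```r
# Open the help page for the sub-setting function
help("ds.dataFrameSubset")

# Or print just the argument list
args(ds.dataFrameSubset)
```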

From the help page we learn that we must specify:

  • the data frame we are working with throughout this tutorial ("D"), as the df.name argument
  • the column we wish to split by class ("D$GENDER"), as the V1.name argument
  • the value we want to compare the column with, in this case a number ("0"), as the V2.name argument
  • the Boolean operator used to compare V1.name with V2.name (here "=="), as the Boolean.operator argument
  • the name we want to give the new object, in string form, as the newobj argument
  • as always, the connections to use, as datasources = connections
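Putting these arguments together, the two calls might look like the sketch below. Which numeric code denotes which gender (here 0 for males and 1 for females) is an assumption and should be checked against the data dictionary for your study.

```r
# Subset of rows where GENDER equals 0
# (assumed here to code for males; check the data dictionary)
ds.dataFrameSubset(df.name = "D",
                   V1.name = "D$GENDER",
                   V2.name = "0",
                   Boolean.operator = "==",
                   newobj = "CNSIM.subset.Males",
                   datasources = connections)

# Subset of rows where GENDER equals 1
# (assumed here to code for females; check the data dictionary)
ds.dataFrameSubset(df.name = "D",
                   V1.name = "D$GENDER",
                   V2.name = "1",
                   Boolean.operator = "==",
                   newobj = "CNSIM.subset.Females",
                   datasources = connections)
```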

Now there are two serverside objects which have split GENDER by class, to which we have assigned the names "CNSIM.subset.Males" and "CNSIM.subset.Females".

Sub-setting to remove NAs

The example below uses the function ds.completeCases to subset the assigned data frame D, keeping only the rows (individual records) that have no missing values (missing values are denoted by NA). The output subset is named "D_without_NA".
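A sketch of the call, assuming the data frame "D" and the connections object from the login step:

```r
# Keep only the rows of D that contain no NA values
ds.completeCases(x1 = "D",
                 newobj = "D_without_NA",
                 datasources = connections)
```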

A subsequent check using ds.dim() will confirm that the new object "D_without_NA" has fewer rows than the original object "D".
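The check described above could be run as follows, assuming "D_without_NA" has already been created on the serverside:

```r
# Dimensions (rows, columns) of the original data frame
ds.dim(x = "D", datasources = connections)

# Dimensions of the subset with incomplete rows removed
ds.dim(x = "D_without_NA", datasources = connections)
```

The column count should be unchanged; only the number of rows should differ.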

Sub-set by inequality

Say we want a subset in which BMI values are ≥ 25, and we want to call it subset.BMI.25.plus.

Then the V1.name argument should specify the column holding BMI, which is PM_BMI_CONTINUOUS (remember, column names can always be checked with the command ds.colnames(x="D")), and the V2.name argument should specify the value to compare the column to, namely 25, using the Boolean operator >=.
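In DataSHIELD syntax, this might look like the sketch below. The column is referenced as "D$PM_BMI_CONTINUOUS", by analogy with "D$GENDER" in the earlier example.

```r
# Subset of rows where BMI is greater than or equal to 25
ds.dataFrameSubset(df.name = "D",
                   V1.name = "D$PM_BMI_CONTINUOUS",
                   V2.name = "25",
                   Boolean.operator = ">=",
                   newobj = "subset.BMI.25.plus",
                   datasources = connections)
```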

The output confirms that the new object subset.BMI.25.plus has been created in each data source.

The subset retains the same variable names, i.e. column names, as the original data frame. Note that we now address our newly named object on the serverside, rather than a column of the original dataframe ("D$....") as before.
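For example, the column names of the new object can be listed with ds.colnames, assuming subset.BMI.25.plus has been created as described above:

```r
# Column names of the serverside subset (not of the original "D")
ds.colnames(x = "subset.BMI.25.plus", datasources = connections)
```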

The output lists the column names of the subset for each study.

To verify that the subset above is correct (holds only observations with BMI ≥ 25), the function ds.quantileMean with the argument type='split' will show whether the BMI results for each study are ≥ 25.
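A sketch of that check, assuming the subset object is named subset.BMI.25.plus:

```r
# Quantiles and mean of BMI in the subset, reported separately per study
ds.quantileMean(x = "subset.BMI.25.plus$PM_BMI_CONTINUOUS",
                type = "split",
                datasources = connections)
```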

The output gives the BMI quantiles and mean separately for each study.

Finally we can create a histogram of these results to confirm them visually:
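One way to draw the histogram is with ds.histogram, sketched here with its default settings and assuming the subset object subset.BMI.25.plus exists on the serverside:

```r
# Histogram of BMI in the subset (with the defaults, one plot per study)
ds.histogram(x = "subset.BMI.25.plus$PM_BMI_CONTINUOUS",
             datasources = connections)
```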

This gives the histogram details as code output, together with a graph of the histogram.

