...

Note

Only DataSHIELD developers will need to change the default value of the last argument, directory, of the datashield.login function.

Basic statistics and data manipulations

Descriptive statistics: variable dimensions and class

Tip

Almost all functions in DataSHIELD can display split results (results separated for each study) or pooled results (results for all the studies combined). To choose between them, set the type argument to type='split' or type='combine'. The majority of DataSHIELD functions default to type='combine'; the default for each function can be checked in its help page.

It is possible to get some descriptive or exploratory statistics about the assigned variables held in the server-side R session, such as the number of participants at each data provider, the number of participants across all data providers, and the number of variables. Knowing these basic properties of the data will help you plan your analysis.

  • The dimensions of the assigned data frame D can be found using the ds.dim command in which type='split' is the default argument:
Code Block
opals <- datashield.login(logins=logindata,assign=TRUE)
ds.dim(x='D')

The output of the command is shown below. It shows that in study1 there are 2163 individuals with 11 variables and in study2 there are 3088 individuals with 11 variables:

Code Block
> opals <- datashield.login(logins=logindata,assign=TRUE)
Logging into the collaborating servers
  No variables have been specified. 
  All the variables in the opal table 
  (the whole dataset) will be assigned to R!
Assigning data:
study1...
study2...
Variables assigned:
study1--LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL
study2--LAB_TSC, LAB_TRIG, LAB_HDL, LAB_GLUC_ADJUSTED, PM_BMI_CONTINUOUS, DIS_CVA, MEDI_LPD, DIS_DIAB, DIS_AMI, GENDER, PM_BMI_CATEGORICAL


> ds.dim(x='D')
$study1
[1] 2163   11

$study2
[1] 3088   11
  • Use the type='combine' argument in the ds.dim function to identify the number of individuals (5251) and variables (11) pooled across all studies:
Code Block
ds.dim('D', type='combine')
#$pooled.dimension
#[1] 5251   11
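The relationship between the split and pooled dimensions can be checked by hand. The sketch below is a minimal illustration in base R: it hard-codes the per-study dimensions from the ds.dim(x='D') output shown earlier (so it runs without a DataSHIELD server) and recomputes the pooled figures. The variable names split_dims, pooled_rows, n_cols and pooled_dim are illustrative only and are not part of DataSHIELD.

```r
# Per-study dimensions copied from the ds.dim(x='D') output shown earlier.
# Hard-coded here (an assumption for illustration) so no server is needed.
split_dims <- list(study1 = c(2163, 11), study2 = c(3088, 11))

# Individuals (rows) add up across studies.
pooled_rows <- sum(sapply(split_dims, function(d) d[1]))

# Variables (columns) must be the same count in every study for pooling.
n_cols <- unique(sapply(split_dims, function(d) d[2]))
stopifnot(length(n_cols) == 1)

# Pooled dimension: total rows, shared number of columns.
pooled_dim <- c(pooled_rows, n_cols)
print(pooled_dim)
```

The two values agree with the pooled dimension reported by ds.dim('D', type='combine'): 5251 individuals and 11 variables.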
  • To check that the variables in each study are identical (as is required for pooled data analysis), use the ds.colnames function on the assigned data frame D:
Code Block
ds.colnames(x='D')
#$study1
# [1] "LAB_TSC"            "LAB_TRIG"           "LAB_HDL"            "LAB_GLUC_ADJUSTED"  "PM_BMI_CONTINUOUS"  "DIS_CVA"
# [7] "MEDI_LPD"           "DIS_DIAB"           "DIS_AMI"            "GENDER"             "PM_BMI_CATEGORICAL"

#$study2
# [1] "LAB_TSC"            "LAB_TRIG"           "LAB_HDL"            "LAB_GLUC_ADJUSTED"  "PM_BMI_CONTINUOUS"  "DIS_CVA"
# [7] "MEDI_LPD"           "DIS_DIAB"           "DIS_AMI"            "GENDER"             "PM_BMI_CATEGORICAL"
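Comparing the column names by eye becomes error-prone with more studies or more variables. As a minimal sketch, the comparison can be done programmatically in base R. The list below hard-codes the names from the output above so that the example runs without a server; in a live session one would presumably start from cols <- ds.colnames(x='D'), assuming it returns a list with one character vector per study, as the printed output suggests. The names vars, cols and same_cols are illustrative.

```r
# Column names copied from the ds.colnames(x='D') output shown above.
# Hard-coded (an assumption for illustration) so no server is needed.
vars <- c("LAB_TSC", "LAB_TRIG", "LAB_HDL", "LAB_GLUC_ADJUSTED",
          "PM_BMI_CONTINUOUS", "DIS_CVA", "MEDI_LPD", "DIS_DIAB",
          "DIS_AMI", "GENDER", "PM_BMI_CATEGORICAL")
cols <- list(study1 = vars, study2 = vars)

# TRUE only if every study has the same names, in the same order,
# as the first study.
same_cols <- all(sapply(cols, function(x) identical(x, cols[[1]])))
print(same_cols)
```

If same_cols is FALSE, setdiff() on the per-study vectors is one way to see which variables differ.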
  • Use the ds.class function to identify the class (type) of a variable, for example whether it is an integer, character, factor, etc. The class determines which analyses you can run on the variable. The example below returns the class of the variable LAB_HDL held in the assigned data frame D, denoted by the argument x='D$LAB_HDL':
Code Block
ds.class(x='D$LAB_HDL')
#$study1
#[1] "numeric"

#$study2
#[1] "numeric"
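Before running a pooled analysis it can be useful to confirm that a variable has the same class in every study. The following is a minimal base R sketch of that check. It hard-codes the classes from the ds.class output above rather than querying a server, and the names classes, distinct_classes and consistent are illustrative.

```r
# Classes copied from the ds.class(x='D$LAB_HDL') output shown above.
# Hard-coded (an assumption for illustration) so no server is needed.
classes <- list(study1 = "numeric", study2 = "numeric")

# Collapse to the set of distinct classes seen across studies.
distinct_classes <- unique(unlist(classes))

# The variable is consistently typed if exactly one class appears.
consistent <- length(distinct_classes) == 1
print(consistent)
```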
Tip
Continuing the Tutorial
You can now continue with the remainder of the DataSHIELD training tutorial, starting from Descriptive Statistics: Quantiles and Mean, as it is written for the DataSHIELD Cloud training environment.

...