Updated functions in v5

This page highlights the key differences in the functions that have changed since the previous iteration (v4) of DataSHIELD.

It is assumed that the following packages are loaded in your work environment and that any necessary server connections are established before you start using any DataSHIELD functions. Note also that all the data used is stored in a data table named 'D' by default, so 'D$' must be added before any variable name.

Library to be loaded

# Packages to be loaded
library(opal)
library(dsBaseClient)

server <- c("dstesting-100") #The VM names
url <- c("http://192.168.56.100:8080") # The fixed IP addresses of the training VMs
user <- "administrator"
password <- "datashield_test&"
table <- c("CNSIM.CNSIM1") # The data tables used in the tutorial
my_logindata <- data.frame(server,url,user,password,table)
myvar <- list('GENDER', 'LAB_HDL', 'LAB_TRIG','DIS_DIAB', 'LAB_TSC', 'PM_BMI_CONTINUOUS')
opals <- datashield.login(logins=my_logindata,assign=TRUE,variables=myvar)
data(logindata)
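
Because every variable has to carry the 'D$' prefix, it can be convenient to build the prefixed names programmatically. The following is a minimal base R sketch; the helper name with_prefix is hypothetical and is not part of dsBaseClient.

```r
# Hypothetical helper (not part of dsBaseClient): prefix variable names
# with the name of the server-side data table, 'D' by default.
with_prefix <- function(vars, table = "D") {
  paste0(table, "$", vars)
}

myvar <- c("GENDER", "LAB_HDL", "LAB_TRIG")
prefixed <- with_prefix(myvar)
print(prefixed)
```

The resulting strings can then be passed to the x or x.name arguments used throughout this page.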

Functions that have changed

ds.asCharacter

ds.asCharacter turns a vector into a character type. The previous and current iterations differ little: the 'x' argument has been renamed 'x.name', and the data source is now required.

ds.asCharacter

############# ds.asCharacter (v4) #############

ds.asCharacter(x='D$GENDER', newobj = 'gender_char')

############# ds.asCharacter (v5) #############

ds.asCharacter(x.name = 'D$GENDER', newobj = 'gender_char_o', datasources = opals)

Upon correctly doing so, you should receive a message in the console section of R.

ds.asCharacter_v5 message

$is.object.created
[1] "A data object <gender_char_o> has been created in all specified data sources"

$validity.check
[1] "<gender_char_o> appears valid in all sources"

ds.asFactor

ds.asFactor is a function that turns a numeric vector into a factor type. There are a few key changes between the v4 and v5 iterations. In v4 you first had to turn the data into a numeric form, then convert it to a factor under a sensible new name (possibly ending in '_fact' to easily differentiate between the factor and numerical types), and then run another function, ds.levels, to see what you had requested. The latest iteration requires none of this. The argument changes are that 'x' is now 'input.var.name', 'newobj' is now renamed as 'variable.names', and the data source must be added.

ds.asFactor

############# ds.asFactor (v4) #############

ds.asNumeric(x='D$LAB_HDL', newobj='lab_hdl_num')
ds.asFactor(x='lab_hdl_num', newobj='lab_hdl_num_fact')
ds.levels(x='lab_hdl_num_fact')

############# ds.asFactor (v5) #############

ds.asFactor(input.var.name='D$LAB_HDL', variable.names('lab.hdl.fact'), datasources = opals)
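
To see what the factor conversion and the level listing amount to, the sketch below does the equivalent steps locally in base R. It is only an illustration with made-up values; the DataSHIELD functions run on the server and apply their own disclosure checks, which this sketch does not reproduce.

```r
# Toy numeric vector standing in for a server-side variable (made-up values).
x <- c(1.2, 0.8, 1.2, 1.5, 0.8)

# Converting to a factor stores each distinct value as a level.
x_fact <- as.factor(x)

print(levels(x_fact))   # the distinct values, as character strings
print(nlevels(x_fact))  # how many levels there are
```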

ds.asList

ds.asList attempts to construct an object of type list, but only for data frames and matrices. The previous and current iterations differ little: the 'x' argument has been renamed 'x.name', and the data source is now required.

ds.asList

############# ds.asList (v4) #############
 
ds.asList(x='D', newobj = 'alist', datasources = opals)
 
############# ds.asList (v5) #############
 
ds.asList(x.name = 'D', newobj = 'thelist', datasources = opals)

ds.asMatrix

ds.asMatrix attempts to turn its argument into a matrix. The previous and current iterations differ little: the 'x' argument has been renamed 'x.name', and the data source is now required.

ds.asMatrix

############# ds.asMatrix (v4) #############

ds.asMatrix(x='D$GENDER', newobj = 'gender_mat')

############# ds.asMatrix (v5) #############

ds.asMatrix(x.name = 'D$GENDER', newobj = 'gender_tab', datasources = opals)

Upon correctly doing so, you should receive a message in the console section of R.

ds.asMatrix_v5 message

$is.object.created
[1] "A data object <gender_tab> has been created in all specified data sources"

$validity.check
[1] "<gender_tab> appears valid in all sources"
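
As a local point of comparison, base R's as.matrix turns a plain vector into a one-column matrix. The sketch below uses made-up values and is not the server-side implementation.

```r
# Made-up values standing in for a server-side vector.
gender <- c(0, 1, 1, 0)

# A plain vector becomes a matrix with one column and one row per element.
gender_mat <- as.matrix(gender)

print(is.matrix(gender_mat))
print(dim(gender_mat))
```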

ds.asNumeric

ds.asNumeric turns a vector into a numerical type. There are a few changes from v4: you no longer need to list the variable you want to use, and you no longer need to change it into a character in order to change it to a numeric type. In v5, all that needs to be supplied is the variable name, the name you wish to give the new object and the data source you wish to extract it from.

ds.asNumeric

############# ds.asNumeric (v4) #############

myvariable <- list("GENDER") 
opals <- datashield.login(logins=my_logindata, assign=TRUE, variables = myvariable)
ds.asCharacter(x='D$GENDER', newobj="gender_ch")
ds.asNumeric(x='gender_ch', newobj="gender_num")

############# ds.asNumeric (v5) #############

ds.asNumeric(x.name = 'D$GENDER', newobj = 'gender_num_o', datasources = opals)

Upon correctly doing so, you should receive a message in the console section of R.

ds.asNumeric_v5 message

$is.object.created
[1] "A data object <gender_num_o> has been created in all specified data sources"

$validity.check
[1] "<gender_num_o> appears valid in all sources"
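
The v4 pattern of converting to character before converting to numeric mirrors a well-known base R behaviour: calling as.numeric directly on a factor returns the internal level codes rather than the values. Whether this was the reason for the v4 workflow is an assumption; the sketch below only shows the base R behaviour on made-up data.

```r
# A factor whose labels look like numbers (made-up values).
f <- factor(c("10", "20", "10", "30"))

# Direct conversion returns the internal level codes, not the labels.
codes <- as.numeric(f)

# Going through character first recovers the labelled values.
values <- as.numeric(as.character(f))

print(codes)
print(values)
```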

ds.cbind

ds.cbind attempts to combine objects by columns. There are a few changes here. First, the DataSHIELD.checks argument has been added to check that all the input objects (which in this case have to be vectors or tables) exist and are of an appropriate class. Second, force.colnames assigns new names to the columns of the output. As before, the data source has to be added. Lastly, notify.of.progress, which notifies the user if there is an issue with what they are doing on the server, is set to FALSE by default.

ds.cbind

############# ds.cbind (v4) #############
 
ds.assign(toAssign='log(D$LAB_TSC)', newobj='labtsc')
ds.assign(toAssign='log(D$LAB_HDL)', newobj='labhdl')
ds.cbind(x = c('labtsc', 'labhdl'), newobj = "newCbindObject", datasources = opals)
 
############# ds.cbind (v5) #############
 
ds.cbind(x = c('D$LAB_TSC', 'D$LAB_HDL'), DataSHIELD.checks = FALSE, force.colnames = c("col1","col2" ), newobj = "cbind.out", datasources = opals, notify.of.progress = FALSE)

Upon correctly doing so, you should receive a message in the console section of R.

ds.cbind_v5 message

$is.object.created
[1] "A data object <cbind.out> has been created in all specified data sources"

$validity.check
[1] "<cbind.out> appears valid in all sources"
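
The effect of combining by columns and then forcing new column names can be sketched locally in base R as follows. The values are made up, and this is only an analogue of what force.colnames is described as doing above.

```r
# Made-up vectors standing in for D$LAB_TSC and D$LAB_HDL.
lab_tsc <- c(5.1, 6.3, 4.8)
lab_hdl <- c(1.4, 1.1, 1.7)

# Combine by columns; the column names default to the object names.
out <- cbind(lab_tsc, lab_hdl)

# Overwrite the column names, as force.colnames is described as doing.
colnames(out) <- c("col1", "col2")

print(out)
```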

ds.cor

ds.cor attempts to calculate the correlation between two variables or the correlation matrix for the variables of an input data frame. Again, fairly little has changed regarding how the function actually works. The v4 version required naAction to be one of the strings "everything", "all.obs", "complete.obs", "na.or.complete" or "pairwise.complete.obs", with the default set to "pairwise.complete.obs". The v5 iteration instead requires one of the strings "casewise.complete" or "pairwise.complete", with the default set to the latter. A new addition to this function is the argument type, which represents the type of analysis to carry out. There are two types: "split", the default option, and "combine". A key difference between the v4 and v5 iterations is that the v5 output ends with an error message element; if the function has run successfully, this element should be NA.

ds.cor

############# ds.cor (v4) #############

ds.cor(x='D$LAB_HDL', y = 'D$LAB_TRIG', naAction = "pairwise.complete.obs", datasources = opals)

############# ds.cor (v5) #############

ds.cor(x='D$LAB_HDL', y = 'D$LAB_TRIG', naAction = "pairwise.complete", type="split", datasources = opals )

Upon correctly doing so, you should receive a message in the console section of R. Note this is with type="split".

ds.cor_v5 message

[[1]]
[[1]]$`Number of missing values in each variable`
       x   y
[1,] 360 362

[[1]]$`Number of missing values pairwise`
    x   y
x 360 366
y 366 362

[[1]]$`Correlation Matrix`
           [,1]       [,2]
[1,]  1.0000000 -0.2215122
[2,] -0.2215122  1.0000000

[[1]]$`Number of complete cases used`
     x    y
x 1803 1797
y 1797 1801

[[1]]$`Error message`
[1] NA
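
The difference between pairwise and casewise handling of missing values can be seen locally with base R's cor. The sketch assumes that "pairwise.complete" and "casewise.complete" correspond to base R's "pairwise.complete.obs" and "complete.obs"; the data are made up.

```r
# Made-up data with missing values in different rows of each column.
df <- data.frame(
  u = c(1, 2, 3, 4, 5, NA),
  v = c(2, 1, 4, 3, NA, 6),
  w = c(NA, 2, 3, 5, 4, 6)
)

# Pairwise: each pair of columns uses every row where both are present.
pairwise <- cor(df, use = "pairwise.complete.obs")

# Casewise: only rows with no missing value in any column are used.
casewise <- cor(df, use = "complete.obs")

print(pairwise["u", "v"])
print(casewise["u", "v"])
```

With more than two variables the two options can give different results for the same pair, because the casewise option discards rows that are incomplete in any column.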

ds.cov

ds.cov computes the covariance of the input arguments. It also produces a table of the number of complete cases and a table of the number of missing values, so that the user can judge the relevance of the covariance calculation.

On the face of it, not much has changed between the v4 and v5 versions. Both still require a vector, character, matrix or data frame argument in x and y, and both require a specific naAction, which in v5 can be either the default "pairwise.complete" or "casewise.complete". A new addition to this function is the argument type, which represents the type of analysis to carry out. There are two types: "split", the default option, and "combine".

Whereas the v4 function simply returned the covariance and the number of complete cases used, the v5 function returns the number of missing values in each variable, the number of missing values pairwise (if "pairwise.complete" is selected) or casewise (if "casewise.complete" is selected), the variance-covariance matrix, the number of complete cases used and an error message. Note that in both iterations you can simply set x = 'D', the default data frame, to run covariance calculations for all the variables within the data frame. This will use the default naAction and type.

ds.cov

############# ds.cov (v4) #############

ds.cov(x = 'D$GENDER', y='D$LAB_HDL', naAction = "pairwise.complete.obs", datasources = opals)

############# ds.cov (v5) #############

ds.cov(x = 'D$GENDER', y='D$LAB_HDL', naAction = "pairwise.complete", type = "split", datasources = opals)

Upon correctly doing so, you should receive a message in the console section of R. Note this is with naAction="pairwise.complete" and type="split".

ds.cov_v5 message

[[1]]
[[1]]$`Number of missing values in each variable`
     x   y
[1,] 0 360

[[1]]$`Number of missing values pairwise`
    x   y
x   0 360
y 360 360

[[1]]$`Variance-Covariance Matrix`
           x          y
x 0.25009206 0.02634621
y 0.02634621 0.17079588

[[1]]$`Number of complete cases used`
     x    y
x 2163 1803
y 1803 1803

[[1]]$`Error message`
[1] NA
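
The missing-value and complete-case tables in this output have a simple structure that can be reproduced locally. The sketch below is a base R illustration on made-up data, not the server-side code: it counts, for every pair of columns, the rows in which both values are present.

```r
# Made-up data: x is fully observed, y has two missing values.
x <- c(0, 1, 1, 0, 1, 0)
y <- c(1.2, NA, 0.9, 1.5, NA, 1.1)
m <- cbind(x = x, y = y)

# Number of missing values in each variable.
n_missing <- colSums(is.na(m))

# 1 where a value is present, 0 where it is missing.
present <- ifelse(is.na(m), 0, 1)

# Entry [i, j] counts the rows where both column i and column j are present.
complete_pairwise <- crossprod(present)

# Missing pairwise is the total row count minus the complete pairwise count.
missing_pairwise <- nrow(m) - complete_pairwise

# Variance-covariance matrix using pairwise complete observations.
vcov_pairwise <- cov(m, use = "pairwise.complete.obs")

print(n_missing)
print(complete_pairwise)
print(missing_pairwise)
```

The same relation holds in the output above: 2163 rows in total, 1803 complete pairs and 360 missing pairs for x and y.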

ds.dataFrame

ds.dataFrame takes one or more vectors and generates a data frame structure. On the face of it, not much has changed between the two iterations. A few arguments have been added, such as DataSHIELD.checks and notify.of.progress. The argument DataSHIELD.checks is set to FALSE by default, as it would be too time consuming to run all the DS checks. The argument notify.of.progress, which notifies the user if there is an issue with what they are doing on the server, is also set to FALSE by default. While both functions behave similarly, the new iteration returns a message when the data frame is created on the server, and it runs a validity.check to make sure that the object created doesn't break any DS protocols.

ds.dataFrame

############# ds.dataFrame (v4) #############

myvectors <- c('D$GENDER', 'D$LAB_HDL')
ds.dataframe(x = myvectors, newobj = NULL, row.names = NULL, check.rows = FALSE,
             check.names = TRUE, stringsAsFactors = TRUE, completeCases = FALSE, datasources = opals)

############# ds.dataFrame (v5) #############

myvectors <- c('D$GENDER', 'D$LAB_HDL')
ds.dataFrame(x = myvectors, row.names = NULL, check.rows = FALSE,
               check.names = TRUE, stringsAsFactors = TRUE, completeCases = FALSE,
               DataSHIELD.checks = FALSE, newobj = "df_new", datasources = opals,
               notify.of.progress = FALSE)

Upon correctly doing so, you should receive a message in the console section of R. Note this is with row.names, check.rows, check.names, stringsAsFactors, completeCases, DataSHIELD.checks, newobj and notify.of.progress all set to their default options.

ds.dataFrame_v5 message

$is.object.created
[1] "A data object <df_new> has been created in all specified data sources"

$validity.check
[1] "<df_new> appears valid in all sources"
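
For orientation, the sketch below builds a data frame from two vectors locally in base R and then keeps only the complete rows. It assumes that completeCases = TRUE would correspond to dropping rows that contain a missing value; the values are made up.

```r
# Made-up vectors standing in for D$GENDER and D$LAB_HDL.
gender <- c(0, 1, NA, 1)
lab_hdl <- c(1.4, NA, 1.1, 1.7)

# Build the data frame from the vectors.
df_new <- data.frame(gender, lab_hdl, stringsAsFactors = TRUE)

# Keep only rows without any missing value (assumed analogue of completeCases = TRUE).
df_complete <- df_new[complete.cases(df_new), ]

print(dim(df_new))
print(dim(df_complete))
```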

ds.dim

ds.dim calculates the dimension (size) of the input argument, which can be a character, matrix, array or data frame. There isn't much difference between the way the v4 and v5 functions work or what they display. In v5 there is a new argument named checks. This is a Boolean indicator of whether to undertake optional checks of model components, which defaults to checks = FALSE to save time. Note that checks should only really be used if the function fails. Another difference is that v5 also displays a message stating which dimensions are being shown, depending on the type selected (either type="split" or type="combined"), rather than just the dimensions of the argument alone.

ds.dim

############# ds.dim (v4) #############

ds.dim(x='D', type="split", datasources = opals)

############# ds.dim (v5) #############

ds.dim(x = 'D', type = "split", checks = FALSE, datasources = opals)

Upon correctly doing so, you should receive a message in the console section of R.

ds.dim_v5 message

$`dimensions of D in dstesting-100`
[1] 2163    6

$`dimensions of D in combined studies`
[1] 2163    6
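
With a single server the split and combined dimensions are identical, as in the output above. The sketch below illustrates how the two could differ with more than one study, under the assumption that the combined dimension sums the rows across studies that share the same columns. The study names and data are made up.

```r
# Two made-up studies with the same columns but different numbers of rows.
study1 <- data.frame(a = 1:3, b = 4:6)
study2 <- data.frame(a = 1:5, b = 6:10)
studies <- list(study1 = study1, study2 = study2)

# "split": one dimension vector per study.
split_dims <- lapply(studies, dim)

# "combined" (assumed): rows are summed, the number of columns is shared.
combined_dim <- c(sum(sapply(studies, nrow)), ncol(study1))

print(split_dims)
print(combined_dim)
```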

ds.histogram

This function plots a non-disclosive histogram. There are a number of key differences between the v4 and v5 versions. First, the number of breaks (num.breaks) can now be controlled; by default it is set to 10. This is very different from the previous iteration, where the number of breaks was set to 33 by default, so it gives you a lot more control. Second, you can now choose between different methods. The default method is 'smallCellsRule', which removes bins with low counts. Another method is 'deterministic', which takes the histogram of the scaled centroids of each k nearest neighbours of the original variable, where the value of k is set by the user. The final method is 'probabilistic', where the histogram shows the original distribution disturbed by the addition of random stochastic noise. This added noise follows a normal distribution with mean zero and variance equal to a percentage of the initial variance of the input variable. This percentage can be adjusted by the user through the argument noise.

ds.histogram

############# ds.histogram (v4) #############

ds.histogram(x = 'D$LAB_TSC', type = "combine", datasources = opals)

############# ds.histogram (v5) #############

ds.histogram(x='D$LAB_TSC', type = "combine", num.breaks = 10, method = "smallCellsRule", k=3, noise=0.25, vertical.axis = "Frequency", datasources = opals)

Upon correctly doing so one should receive a familiar message (no real changes from v4) in the console section of R.

ds.histogram_v5 message

$breaks
 [1]  2.301554  3.104198  3.906842  4.709487  5.512131  6.314775  7.117420  7.920064  8.722708  9.525353 10.327997

$counts
 [1]   6  55 193 439 509 373 167  47  14   4

$density
 [1] 0.0013789506 0.0126403806 0.0443562448 0.1008932200 0.1169809772 0.0857247633 0.0383807921 0.0108017798 0.0032175514 0.0009193004

$mids
 [1] 2.702876 3.505520 4.308165 5.110809 5.913453 6.716097 7.518742 8.321386 9.124030 9.926675

$xname
[1] "xvect"

$equidist
[1] TRUE

$intensities
 [1] 0.0013789506 0.0126403806 0.0443562448 0.1008932200 0.1169809772 0.0857247633 0.0383807921 0.0108017798 0.0032175514 0.0009193004

attr(,"class")
[1] "histogram"
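
The ideas behind the 'smallCellsRule' and 'probabilistic' methods can be sketched locally in base R. This is a rough illustration on simulated data, not the DataSHIELD implementation: the threshold min_count is an assumed value (the real one is configured on the server), and blanking a bin by setting its count to zero is one possible reading of "removes bins with low counts".

```r
set.seed(1)
x <- rnorm(500, mean = 6, sd = 1)  # simulated data, not real study data

# Plain histogram, computed without plotting.
h <- hist(x, breaks = 10, plot = FALSE)

# Small cells rule (sketch): blank any bin whose count is below a threshold.
min_count <- 3  # assumed threshold for illustration only
safe_counts <- ifelse(h$counts < min_count, 0, h$counts)

# Probabilistic method (sketch): add zero-mean normal noise whose variance
# is a fraction 'noise' of the variance of the input variable.
noise <- 0.25
x_noisy <- x + rnorm(length(x), mean = 0, sd = sqrt(noise * var(x)))
h_noisy <- hist(x_noisy, breaks = 10, plot = FALSE)

print(safe_counts)
print(h_noisy$counts)
```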

ds.length

ds.length returns the pooled length, or the length of a vector or a list for each study. This is another function that hasn't changed much, apart from the addition of an argument called checks. It is a Boolean indicator of whether to undertake optional checks of model components, and by default it is set to FALSE (to save time).

ds.length

############# ds.length (v4) #############

ds.length(x='D$LAB_TSC', type = "split", datasources = opals)

############# ds.length (v5) #############

ds.length(x='D$LAB_TSC', type = "split", checks = FALSE, datasources = opals)

ds.mean

ds.mean computes the statistical mean of a given vector. The key differences are two new arguments. The first is checks, a Boolean indicator of whether to undertake optional checks of model components, which is set to FALSE by default (to save time). The second is save.mean.Nvalid, also a Boolean indicator set to FALSE by default, which indicates whether the user wishes to save the generated values of the mean and of the number of valid observations into the R environment at each of the data servers.

ds.mean

############# ds.mean (v4) #############

ds.mean(x='D$LAB_TSC', type = "split", datasources = opals)

############# ds.mean (v5) #############

ds.mean(x='D$LAB_TSC', type = "split", checks=FALSE, save.mean.Nvalid = FALSE, datasources = opals)

Upon correctly doing so one should receive a message in the console section of R.

ds.mean_v5 message

$Mean.by.Study
              EstimatedMean Nmissing Nvalid Ntotal
dstesting-100      5.872113      356   1807   2163

$Nstudies
[1] 1

$ValidityMessage
              ValidityMessage 
dstesting-100 "VALID ANALYSIS"
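
The per-study columns EstimatedMean and Nvalid are enough to form a pooled mean across several studies. The sketch below assumes the pooled estimate is the Nvalid-weighted average of the study means; the two studies and their values are made up for illustration.

```r
# Made-up per-study summaries in the same shape as $Mean.by.Study.
by_study <- data.frame(
  study = c("study1", "study2"),
  EstimatedMean = c(5.8, 6.1),
  Nvalid = c(1800, 900)
)

# Pooled mean (assumed): study means weighted by their valid observations.
pooled_mean <- sum(by_study$EstimatedMean * by_study$Nvalid) / sum(by_study$Nvalid)

print(pooled_mean)
```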

ds.table2D

The function ds.table2D is a client-side wrapper function. It calls the server-side function 'ds.table2DDS', which generates a 2-dimensional contingency table for each data source. The main differences lie in the output rather than in the function call itself. In calling the function, there are two key differences. The first is the argument 'type', which is set to "both" by default but can also be set to "split" or "combine". The argument 'type' is a character string that represents the type of table to output: if set to 'combine', a pooled 2-dimensional table is returned; if set to 'split', a 2-dimensional table is returned for each data source; and if set to 'both', a pooled 2-dimensional table plus a 2-dimensional table for each data source is returned. The second is the argument 'warningMessage', which is set to TRUE by default; all it does is return an error message if the table requested is invalid.

ds.table2D

############# ds.table2D (v4) #############

ds.table2D(x='D$DIS_DIAB', y='D$GENDER', type='split', datasources = opals)

############# ds.table2D (v5) #############

ds.table2D(x = 'D$DIS_DIAB', y = 'D$GENDER', type = "both", warningMessage = TRUE, datasources = opals)

Upon correctly doing so one should receive a message in the console section of R.

ds.table2D_v5 message

$colPercent
$colPercent$`dstesting-100-D$DIS_DIAB(row)|D$GENDER(col)`
           0      1  Total
0      98.08  99.16  98.61
1       1.92   0.84   1.39
Total 100.00 100.00 100.00


$colPercent.all.studies
$colPercent.all.studies$`pooled-D$DIS_DIAB(row)|D$GENDER(col)`
           0      1  Total
0      98.08  99.16  98.61
1       1.92   0.84   1.39
Total 100.00 100.00 100.00


$rowPercent
$rowPercent$`dstesting-100-D$DIS_DIAB(row)|D$GENDER(col)`
          0     1 Total
0     50.21 49.79   100
1     70.00 30.00   100
Total 50.49 49.51   100


$rowPercent.all.studies
$rowPercent.all.studies$`pooled-D$DIS_DIAB(row)|D$GENDER(col)`
          0     1 Total
0     50.21 49.79   100
1     70.00 30.00   100
Total 50.49 49.51   100


$chi2Test
$chi2Test$`dstesting-100-D$DIS_DIAB(row)|D$GENDER(col)`

	Pearson's Chi-squared test with Yates' continuity correction

data:  contingencyTable
X-squared = 3.8767, df = 1, p-value = 0.04896



$chi2Test.all.studies
$chi2Test.all.studies$`pooled-D$DIS_DIAB(row)|D$GENDER(col)`

	Pearson's Chi-squared test with Yates' continuity correction

data:  pooledContingencyTable
X-squared = 3.8767, df = 1, p-value = 0.04896



$counts
$counts$`dstesting-100-D$DIS_DIAB(row)|D$GENDER(col)`
         0    1 Total
0     1071 1062  2133
1       21    9    30
Total 1092 1071  2163


$counts.all.studies
$counts.all.studies$`pooled-D$DIS_DIAB(row)|D$GENDER(col)`
         0    1 Total
0     1071 1062  2133
1       21    9    30
Total 1092 1071  2163


$validity
[1] "All tables are valid!"
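
The percentages and the chi-squared statistic in this output can be reproduced locally from the $counts table using base R. The sketch below re-enters the counts shown above by hand; it is a cross-check of the arithmetic, not the server-side code.

```r
# Counts copied from the $counts table above (rows: DIS_DIAB, columns: GENDER).
counts <- matrix(c(1071, 21, 1062, 9), nrow = 2,
                 dimnames = list(DIS_DIAB = c("0", "1"), GENDER = c("0", "1")))

# Column and row percentages, rounded to two decimals as in the output.
col_pct <- round(100 * prop.table(counts, margin = 2), 2)
row_pct <- round(100 * prop.table(counts, margin = 1), 2)

# Pearson's chi-squared test; Yates' correction is the default for 2x2 tables.
chi2 <- chisq.test(counts)

print(col_pct)
print(row_pct)
print(chi2)
```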

ds.var

ds.var computes the variance of a given vector. The main difference between the v4 and v5 versions is that the console output is a lot more detailed in v5. Not only does it report the variance of the selected vector, it also reports the number of missing values, the number of valid values and the total number of values. It also prints the number of studies used and the validity of the analysis.

ds.var

############# ds.var (v4) #############

ds.var(x='D$LAB_TSC', type="combine", datasources = opals)

############# ds.var (v5) #############

ds.var(x = 'D$LAB_TSC', type = "split", checks = FALSE, datasources = opals)

Upon correctly doing so one should receive a message in the console section of R.

ds.var_v5 message

$Variance.by.Study
              EstimatedVar Nmissing Nvalid Ntotal
dstesting-100     1.229163      356   1807   2163

$Nstudies
[1] 1

$ValidityMessage
              ValidityMessage 
dstesting-100 "VALID ANALYSIS"
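
A variance can be obtained from aggregate quantities alone (the sum, the sum of squares and the number of valid values), which is the kind of summary that can be shared without exposing individual records. Whether ds.var is implemented this way is an assumption; the sketch below simply shows the arithmetic on made-up data and builds a summary row in the same shape as $Variance.by.Study.

```r
# Made-up values with two missing entries.
x <- c(5.1, 6.3, NA, 4.8, 7.0, NA, 5.6)

# Aggregates over the valid (non-missing) values.
valid <- x[!is.na(x)]
n_valid <- length(valid)
sum_x <- sum(valid)
sum_x2 <- sum(valid^2)

# Sample variance from the aggregates.
est_var <- (sum_x2 - sum_x^2 / n_valid) / (n_valid - 1)

summary_row <- data.frame(
  EstimatedVar = est_var,
  Nmissing = sum(is.na(x)),
  Nvalid = n_valid,
  Ntotal = length(x)
)
print(summary_row)
```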