...

Code Block
languagebash
titleLibrary to be loaded
# Packages to be loaded
library(opal)
library(dsBaseClient)
library(dsStatsClient)
library(dsGraphicsClient)
library(dsModellingClient)
library(dsBetaTestClient)

server <- c("dstesting-100") #The VM names
url <- c("http://192.168.56.100:8080") # The fixed IP addresses of the training VMs
user <- "administrator"
password <- "datashield_test&"
table <- c("CNSIM.CNSIM1") # The data tables used in the tutorial
my_logindata <- data.frame(server,url,user,password,table)
myvar <- list('GENDER', 'LAB_HDL', 'LAB_TRIG','DIS_DIAB', 'LAB_TSC', 'PM_BMI_CONTINUOUS')
opals <- datashield.login(logins=my_logindata,assign=TRUE,variables=myvar)
data(logindata)

Functions that have changed


ds.asCharacter

ds.asCharacter converts a vector to character type. The previous and current versions differ little: the 'x' argument has been renamed to 'x.name', and the data source is now required.

...

ds.asFactor

ds.asFactor turns a numeric vector into a factor type. There are a few key changes between the v4 and v5 versions. The latest version no longer requires you to first convert the data into numeric form and then relabel it to something sensible (possibly ending in '_fact', to easily distinguish the factor from the numeric type). You would then need to run another function, ds.levels, to see what you have requested. In terms of arguments, 'x' is now 'input.var.name', 'newobj' has been renamed to 'variable.names', and the data source must be supplied.

...

ds.asList

ds.asList attempts to construct an object of type list, but only from data frames and matrices. The previous and current versions differ little: the 'x' argument has been renamed to 'x.name', and the data source is now required.

Code Block
languagebash
titleds.asList
linenumberstrue
############# ds.asList (v4) #############
 
ds.asList(x='D', newobj = 'alist', datasources = opals)
 
############# ds.asList (v5) #############
 
ds.asList(x.name = 'D', newobj = 'thelist', datasources = opals)

ds.asMatrix

ds.asMatrix attempts to turn its argument into a matrix. The previous and current versions differ little: the 'x' argument has been renamed to 'x.name', and the data source is now required.

...

ds.asNumeric

ds.asNumeric converts a vector to numeric type. There are a few changes in the new version: you no longer need to list the variable you want to use, nor convert it to a character before converting it to a numeric type. All that needs to be supplied is the variable name, the name you wish to give the new object, and the data source you wish to extract it from.

...

ds.cbind

ds.cbind attempts to combine objects by columns. There are a few changes here. First, the DataSHIELD.checks argument has been added to check that all the input objects, which in this case must be vectors or tables, exist and are of an appropriate class. Second, force.columns assigns new column names to the columns of the matrices. As before, the data source has to be added. Lastly, notify.of.progress, which is FALSE by default, controls whether the user is notified if there is an issue with what they are doing on the server.

...

Code Block
languagebash
titleds.cbind_v5 message
$is.object.created
[1] "A data object <cbind.out> has been created in all specified data sources"

$validity.check
[1] "<cbind.out> appears valid in all sources"

ds.cor

ds.cor attempts to calculate the correlation between two variables, or the correlation matrix for the variables of an input data frame. Again, little has changed in how the function actually works. The v4 version required naAction to be one of the strings "everything", "all.obs", "complete.obs", "na.or.complete" or "pairwise.complete.obs", with "pairwise.complete.obs" as the default. The v5 version instead accepts "casewise.complete" or "pairwise.complete", with the latter as the default. New in v5 is the argument type, which sets the type of analysis to carry out: either "split" (the default) or "combine". A key difference between v4 and v5 is that the latest version displays an error message entry at the end of its output; if the function runs successfully, this entry should be 'NA'.
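
To illustrate the difference between casewise and pairwise handling of missing values, here is a minimal sketch in base R. It uses the base function cor() with its own use strings (the same strings the v4 naAction accepted) on a made-up matrix; the data and variable names are illustrative only, and it is an assumption that the server-side calculation behaves the same way.

```r
# Toy data with missing values in different rows (illustrative only).
m <- cbind(
  x = c(1, 2, 3, 4, NA, 6),
  y = c(2, 1, 4, 3, 6, NA),
  z = c(NA, 2, 3, 5, 5, 7)
)

# Casewise deletion: drop every row that has any NA, then correlate.
casewise <- cor(m, use = "complete.obs")

# Pairwise deletion: each pair of columns uses all rows complete for that pair.
pairwise <- cor(m, use = "pairwise.complete.obs")

# Number of rows available to each approach.
n_casewise <- sum(complete.cases(m))                     # rows complete in x, y and z
n_pairwise_xy <- sum(complete.cases(m[, c("x", "y")]))   # rows complete in x and y
```

Pairwise deletion keeps more rows for each pair, at the cost of each entry of the matrix being based on a different subset of the data.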

...

Code Block
languagebash
titleds.cor_v5 message
[[1]]
[[1]]$`Number of missing values in each variable`
       x   y
[1,] 360 362

[[1]]$`Number of missing values pairwise`
    x   y
x 360 366
y 366 362

[[1]]$`Correlation Matrix`
           [,1]       [,2]
[1,]  1.0000000 -0.2215122
[2,] -0.2215122  1.0000000

[[1]]$`Number of complete cases used`
     x    y
x 1803 1797
y 1797 1801

[[1]]$`Error message`
[1] NA

ds.cov

ds.cov attempts to compute the covariance of the input arguments supplied by the user. It produces a table of the number of complete cases and a table of the number of missing values, so that the user can judge the relevance of the covariance calculation. On the face of it, not much has changed between the v4 and v5 versions. Both still require a vector, character, matrix or data frame argument in x and y, and both require a specific naAction, which in v5 can be either the default "pairwise.complete" or "casewise.complete". New in v5 is the argument type, which sets the type of analysis to carry out: either "split" (the default) or "combine". Whereas the v4 version simply returned the covariance and the number of complete cases used, the v5 version returns the number of missing values in each variable, the number of missing values pairwise (if "pairwise.complete" is selected) or casewise (if "casewise.complete" is selected), the variance-covariance matrix, the number of complete cases used and an error message. Note that in both versions you can set the x argument to 'D', which here is the default data frame, to run the covariance calculation for all the items within the data frame. The function will then use the default naAction and type.
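
The missing-value and complete-case tables described above can be reproduced locally to see how they relate to each other. The following is a rough sketch in base R on a made-up data frame; the object names are hypothetical and this is not the DataSHIELD implementation.

```r
# Toy data frame (illustrative only); NA marks a missing value.
d <- data.frame(
  x = c(0, 1, 1, 0, 1, 0),
  y = c(1.2, NA, 0.8, 1.5, NA, 1.1)
)

# Indicator matrix: 1 where a value is observed, 0 where it is missing.
observed <- !is.na(as.matrix(d))
storage.mode(observed) <- "integer"

# Number of missing values in each variable.
n_missing <- colSums(is.na(d))

# Number of complete cases for each pair of variables.
n_complete_pairwise <- crossprod(observed)

# Number of rows lost to missing values for each pair of variables.
n_missing_pairwise <- nrow(d) - n_complete_pairwise

# Variance-covariance matrix using pairwise deletion.
vcov_pairwise <- cov(d, use = "pairwise.complete.obs")
```

For each pair of variables, the number of complete cases plus the pairwise number of missing values equals the total number of rows.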

...

Code Block
languagebash
titleds.cov_v5message
[[1]]
[[1]]$`Number of missing values in each variable`
     x   y
[1,] 0 360

[[1]]$`Number of missing values pairwise`
    x   y
x   0 360
y 360 360

[[1]]$`Variance-Covariance Matrix`
           x          y
x 0.25009206 0.02634621
y 0.02634621 0.17079588

[[1]]$`Number of complete cases used`
     x    y
x 2163 1803
y 1803 1803

[[1]]$`Error message`
[1] NA

ds.dataFrame

ds.dataFrame takes one or more vectors and generates a data frame structure. On the face of it, not much has changed between the two versions. A few arguments have been added, such as DataSHIELD.checks and notify.of.progress. DataSHIELD.checks is set to FALSE by default, as it would be too time-consuming to run all the DS checks. notify.of.progress is also FALSE by default; it controls whether the user is notified if there is an issue with what they are doing on the server. While both versions behave similarly, the new one returns a message when the data frame is created on the server, and it runs a validity.check to make sure that the created object does not break any DS protocols.

...

ds.dim

ds.dim attempts to calculate the dimensions (size) of its input argument, which can be a character, matrix, array or data frame. There is little difference between how the v4 and v5 functions work and what they display. v5 adds a new argument named checks, a Boolean indicator of whether to undertake optional checks of model components; it defaults to checks=FALSE to save time. Note that checks should only really be used if the function fails. Another difference is that v5 also displays a message giving the dimensions of the input according to the type selected, either type="split" or type="combined", rather than just the dimensions alone.
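
As a rough local illustration of split versus combined dimensions, the sketch below uses two made-up data frames standing in for two studies. It assumes that the combined result sums the rows across studies while the columns are shared; the object names are hypothetical.

```r
# Two toy "studies" with the same columns (illustrative data only).
studies <- list(
  study1 = data.frame(a = 1:5, b = 6:10),
  study2 = data.frame(a = 1:3, b = 4:6)
)

# Split view: one dimension vector per study.
split_dims <- lapply(studies, dim)

# Combined view (assumed): rows are summed across studies, columns are shared.
combined_dim <- c(sum(sapply(studies, nrow)), ncol(studies[[1]]))
```

With a single study, as in this tutorial, the split and combined dimensions are identical.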

...

Code Block
languagebash
titleds.dim_v5 message
$`dimensions of D in dstesting-100`
[1] 2163    6

$`dimensions of D in combined studies`
[1] 2163    6

ds.histogram

This function plots a non-disclosive histogram. There are a number of key differences between the v4 and v5 versions. First, the number of breaks (num.breaks) can now be controlled; by default it is set to 10. This differs from the previous version, where the number of breaks was set to 33 by default, so you now have much more control. Second, you can now choose between different methods. The default method, 'smallCellsRule', removes bins with low counts. The 'deterministic' method takes the histogram of the scaled centroids of each k nearest neighbours of the original variable, where the value of k is set by the user. The final method, 'probabilistic', shows the original distribution disturbed by the addition of random stochastic noise. This noise follows a normal distribution with mean zero and variance equal to a percentage of the initial variance of the input variable. This percentage can be adjusted by the user through the argument noise.
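
The ideas behind the 'probabilistic' and 'smallCellsRule' methods can be sketched in base R as follows. This is only an illustration under stated assumptions, not the DataSHIELD implementation: the data are simulated, the value 0.25 for noise and the threshold min_count are hypothetical choices, and the number of breaks passed to hist() is treated by R as a suggestion.

```r
set.seed(42)                          # reproducible toy example
x <- rnorm(1000, mean = 25, sd = 4)   # simulated stand-in for a continuous variable

# Probabilistic idea: add normal noise with mean zero and variance
# equal to a fraction `noise` of the variance of x.
noise <- 0.25                         # assumed fraction, chosen for illustration
x_noisy <- x + rnorm(length(x), mean = 0, sd = sqrt(noise * var(x)))

# Histogram of the perturbed values with about 10 breaks, computed without plotting.
h <- hist(x_noisy, breaks = 10, plot = FALSE)

# Small-cells idea: suppress bins with fewer than `min_count` observations.
min_count <- 3                        # hypothetical threshold
counts_safe <- ifelse(h$counts < min_count, 0, h$counts)
```

A larger value of noise hides individual values more strongly but also flattens the shape of the histogram.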

...


ds.length

ds.length returns the pooled length, or the length of a vector or a list for each study. This is another function that has not changed much, apart from the addition of an argument called checks. It is a Boolean indicator of whether to undertake optional checks of model components, and is set to FALSE by default (to save time).

Code Block
languagebash
titleds.length
linenumberstrue
############# ds.length (v4) #############

ds.length(x='D$LAB_TSC', type = "split", datasources = opals)

############# ds.length (v5) #############

ds.length(x='D$LAB_TSC', type = "split", checks = FALSE, datasources = opals)

ds.mean

ds.mean computes the statistical mean of a given vector. The key differences here are two new arguments. The first is checks, a Boolean indicator of whether to undertake optional checks of model components, which is set to FALSE by default (to save time). The second is save.mean.Nvalid, also a Boolean indicator set to FALSE by default. It indicates whether the user wishes to save the generated values of the mean and of the number of valid observations into the R environment at each of the data servers.
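
The columns reported by ds.mean in v5 (EstimatedMean, Nmissing, Nvalid and Ntotal) can be reproduced locally on a toy vector to see how they relate. This is a minimal base R sketch with made-up values, not the server-side code.

```r
x <- c(5.1, NA, 6.3, 5.8, NA, 6.0)   # toy vector; NA marks a missing value

summary_row <- data.frame(
  EstimatedMean = mean(x, na.rm = TRUE),  # mean of the valid observations only
  Nmissing = sum(is.na(x)),               # number of missing observations
  Nvalid = sum(!is.na(x)),                # number of valid observations
  Ntotal = length(x)                      # Nmissing + Nvalid
)
```

Nmissing and Nvalid always add up to Ntotal, which is a quick sanity check on the output.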

...

Code Block
languagebash
titleds.mean_v5 message
$Mean.by.Study
              EstimatedMean Nmissing Nvalid Ntotal
dstesting-100      5.872113      356   1807   2163

$Nstudies
[1] 1

$ValidityMessage
              ValidityMessage 
dstesting-100 "VALID ANALYSIS"

ds.table2D

ds.table2D is a client-side wrapper function. It calls the server-side function 'ds.table2DDS', which generates a 2-dimensional contingency table for each data source. The main differences lie in the output rather than in the function call itself. In calling the function, there are two key differences. The first is the argument 'type', which is set to "both" by default but can also be set to "split" or "combine". This is where most of the differences lie. It is a character string that represents the type of table to output: if set to 'combine', a pooled 2-dimensional table is returned; if set to 'split', a 2-dimensional table is returned for each data source; if set to 'both', a pooled 2-dimensional table plus a 2-dimensional table for each data source is returned. The second is the argument 'warningMessage', which is set to 'TRUE' by default and returns an error message if the table requested is invalid.
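
The three values of 'type' can be pictured with a small base R sketch in which two made-up data frames stand in for two data sources. The helper tab() and all object names are hypothetical, and the pooled table is assumed to be the cell-wise sum of the per-source tables.

```r
# Two toy "studies" (illustrative data only).
study1 <- data.frame(GENDER = c(0, 1, 1, 0, 1), DIS_DIAB = c(0, 0, 1, 0, 0))
study2 <- data.frame(GENDER = c(1, 0, 0, 1), DIS_DIAB = c(1, 0, 0, 0))

# Hypothetical helper: fix the factor levels so every table has the same shape.
tab <- function(d) {
  table(factor(d$GENDER, levels = 0:1),
        factor(d$DIS_DIAB, levels = 0:1),
        dnn = c("GENDER", "DIS_DIAB"))
}

# One table per data source, similar in spirit to type = "split".
split_tables <- list(study1 = tab(study1), study2 = tab(study2))

# Pooled table as the cell-wise sum, similar in spirit to type = "combine".
combined_table <- Reduce(`+`, split_tables)

# Per-source tables plus the pooled table, similar in spirit to type = "both".
both <- c(split_tables, list(combined = combined_table))
```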

...

ds.var

ds.var computes the variance of a given vector. The main difference between the v4 and v5 versions is that the console output is much more detailed in v5. As well as the global variance of the selected vector, it reports the number of missing values, the number of valid values and the total number of values. It also prints the number of studies used and the validity of the analysis.

...