Version 6.0.0

Focus of Release

The major focuses of the v6.0 release of DataSHIELD are the addition of new analytical functions and integration with the DataSHIELD Interface (DSI).

Changes from DataSHIELD v5.1 to v6.0

DataSHIELD Interface (DSI)

DataSHIELD’s dsBaseClient package now uses the DataSHIELD Interface (DSI) to communicate with the Opal server; this replaces the legacy opal R package. This is a breaking change for code written for DataSHIELD v4 and v5. The main impact on end users of DataSHIELD is the new procedure for logging in to and out of the server, for example:

for logging in:

library('DSOpal')
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "http://192.168.56.100:8080/", user = "administrator", password = "datashield_test&", table = "CNSIM.CNSIM1", driver = "OpalDriver")
builder$append(server = "study2", url = "http://192.168.56.100:8080/", user = "administrator", password = "datashield_test&", table = "CNSIM.CNSIM2", driver = "OpalDriver")
builder$append(server = "study3", url = "http://192.168.56.100:8080/", user = "administrator", password = "datashield_test&", table = "CNSIM.CNSIM3", driver = "OpalDriver")
logindata <- builder$build()
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")

for logging out:

DSI::datashield.logout(connections)

The motivation for this change is to give DataSHIELD the ability to connect to other types of server in the future. More information about DSI can be found on its GitHub page at https://github.com/datashield/DSI

New Analytical Functions

The functions ds.completeCases, ds.glmerSLMA, ds.lmerSLMA, ds.rep, ds.sample and ds.table have been added to the suite of analytical functions provided by DataSHIELD.

  • ds.completeCases: constructs a modified data frame, matrix or vector that contains no missing values

  • ds.glmerSLMA: fits a Generalized Linear Mixed-Effects Model (GLME) on data from one or multiple sources with pooling via SLMA (Study-Level Meta-Analysis)

  • ds.lmerSLMA: fits a Linear Mixed-Effects Model (LME), which can include both fixed and random effects, on data from one or multiple sources with pooling via SLMA

  • ds.rep: creates a repetitive sequence by repeating the specified scalar number, vector or list in each data source

  • ds.sample: draws a pseudorandom sample from a vector, dataframe or matrix on the serverside

  • ds.table: creates 1-dimensional, 2-dimensional and 3-dimensional tables using the table function in native R
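As a rough illustration of how two of these functions might be called, the sketch below assumes a `connections` object returned by DSI::datashield.login with the data assigned to the symbol "D". The wrapper name `try_new_functions` and the object name `D.complete` are hypothetical. The column names are assumed from the CNSIM test data, so adjust them to your own tables.

```r
# Hypothetical helper: assumes `connections` comes from DSI::datashield.login()
# and that the data were assigned to the symbol "D" at login.
try_new_functions <- function(connections) {
  # Build a server-side copy of D without rows containing missing values,
  # stored under the (hypothetical) name "D.complete"
  dsBaseClient::ds.completeCases(x1 = "D", newobj = "D.complete",
                                 datasources = connections)

  # Two-dimensional table of diabetes status by gender
  # (column names assumed from the CNSIM test data)
  dsBaseClient::ds.table(rvar = "D$DIS_DIAB", cvar = "D$GENDER",
                         datasources = connections)
}
```

Calling `try_new_functions(connections)` after logging in should return the ds.table output to the client, subject to the usual disclosure checks on the server.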

Changed Functions

The functions ds.dim, ds.length, ds.colnames, ds.ls and ds.levels have been reimplemented so that they no longer use the server-side aliases dim, length, colnames, ls and levels (which have now been removed); they now call the dedicated DataSHIELD server-side functions dimDS, lengthDS, colnamesDS, lsDS and levelsDS. These changes should not affect the behaviour of the functions. They reduce the reliance on non-DataSHIELD functions on the server, which makes it more secure and reliable.
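Client-side calls to these functions should look the same as before. A minimal sketch, assuming a `connections` object from DSI::datashield.login with data assigned to "D" (the wrapper name `describe_data` is hypothetical and the column names are assumed from the CNSIM test data):

```r
# Hypothetical helper collecting basic structural summaries of "D".
# Assumes `connections` comes from DSI::datashield.login().
describe_data <- function(connections) {
  list(
    # Dimensions of the data frame in each study (and combined)
    dim      = dsBaseClient::ds.dim(x = "D", datasources = connections),
    # Column names of the data frame
    colnames = dsBaseClient::ds.colnames(x = "D", datasources = connections),
    # Length of a single column (column name is an assumption)
    length   = dsBaseClient::ds.length(x = "D$LAB_TSC", datasources = connections),
    # Levels of a factor column (column name is an assumption)
    levels   = dsBaseClient::ds.levels(x = "D$GENDER", datasources = connections),
    # Objects currently available on each server
    objects  = dsBaseClient::ds.ls(datasources = connections)
  )
}
```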

The functions ds.cbind and ds.dataFrame have been modified to remove any “DATAFRAME.NAME$” strings from the column names of the assigned data frames. In addition, the new version of ds.cbind generates data frames instead of matrices. We have also fixed a related bug in how the two functions defined the column names of the assigned data frames when the order of the input components differs between studies.

An additional disclosure control has been added to the ds.cov and ds.cor functions. It checks that the number of input variables does not exceed a pre-specified proportion of the number of individual-level records. The maximum allowed proportion is set by the same filter that ds.glm uses to check that a regression model is not oversaturated. This filter defaults to 0.33, which means, for example, that for a data frame of 100 rows (i.e. individual-level records) the variance-covariance or correlation matrix of at most 33 variables can be returned.
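The arithmetic behind the 100-row example can be sketched in plain R. This is only an illustration of the rule as described here, not the server-side implementation. The function name `vars_allowed` and the exact comparison (less than or equal to) are assumptions chosen to match the "up to 33 variables" example.

```r
# Illustrative check only; the real test runs on the server inside dsBase.
# `prop` mirrors the default filter value of 0.33 quoted in the text.
vars_allowed <- function(n_vars, n_rows, prop = 0.33) {
  # Allowed when the variable count does not exceed prop * number of rows
  n_vars <= prop * n_rows
}

vars_allowed(33, 100)  # TRUE: 33 variables fit within 0.33 * 100
vars_allowed(34, 100)  # FALSE: one variable too many
```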

Deprecated Functions

There are a number of functions in DataSHIELD v6.0 which should be regarded as deprecated - i.e. they are still there, but we strongly recommend you stop using them as they will be removed in v6.1. The functions which are deprecated are shown below, along with their replacements which should be used as soon as is practicable.

  • ds.setDefaultOpal: replaced by datashield.connections_default

  • ds.listOpals: replaced by datashield.connections

  • ds.table1DS: replaced by ds.table

  • ds.table2DS: replaced by ds.table

  • ds.look

  • ds.meanByClass: replaced by ds.meanSdGp

  • ds.message

  • ds.recodeLevels: replaced by ds.recodeValues

  • ds.subset: replaced by ds.dataFrameSubset

  • ds.subsetByClass: replaced by ds.dataFrameSubset

  • ds.vectorCalc: replaced by ds.make

It should be noted that the use of [ and ] should be avoided when performing analysis, especially in conjunction with ds.dataFrameSubset.
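As an example of moving from ds.subset to ds.dataFrameSubset, the sketch below selects rows by comparing a column against a value rather than indexing with [ and ]. It assumes a `connections` object from DSI::datashield.login with data assigned to "D". The wrapper name, the new object name `D.bmi25plus` and the column name (assumed from the CNSIM test data) are illustrative only.

```r
# Hypothetical helper: keep rows of D where BMI is at least 25.
# Assumes `connections` comes from DSI::datashield.login().
subset_bmi_25_plus <- function(connections) {
  dsBaseClient::ds.dataFrameSubset(
    df.name = "D",                    # data frame to subset
    V1.name = "D$PM_BMI_CONTINUOUS",  # left-hand side of the comparison
    V2.name = "25",                   # right-hand side, passed as a string
    Boolean.operator = ">=",          # comparison applied row by row
    newobj = "D.bmi25plus",           # hypothetical name for the new server-side object
    datasources = connections
  )
}
```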

Deprecated Aliases

There are a number of server-side aliases in DataSHIELD v6.0 which should be regarded as deprecated, so should not be used as they will be removed in v6.1. The aliases which are deprecated are:

  • is.character (aggregate alias)

  • is.factor (aggregate alias)

  • is.list (aggregate alias)

  • is.null (aggregate alias)

  • is.numeric (aggregate alias)

  • NROW (aggregate alias)

  • t.test (aggregate alias)

  • as.character (assign alias)

  • as.null (assign alias)

  • as.numeric (assign alias)

  • attach (assign alias)

  • complete.cases (assign alias)

  • rep (assign alias)

  • unlist (assign alias)

In addition to the deprecated functions and aliases, it should be noted that ds.meanSdGp is planned to be renamed ds.meanSDByClass in DataSHIELD v6.1.

Function documentation

The documentation of all DataSHIELD functions has been updated.
The new documentation follows the same format for every function, and each example shows logging in as required by version 6.0, using the function, and logging out from the server.

Continuous integration

We have continued to develop our continuous integration (CI), and now have 6310 tests which are run every day and on every proposed code change.

How to upgrade

Update DataSHIELD server-side package

If you have a suitable version of the Opal server and would like to upgrade the DataSHIELD server package (dsBase), this can be done via the Opal Web Portal. Go to the “DataSHIELD” page within the “Administration” section of the Opal Web Portal and remove the old “dsBase”; then use the “+Add Package” button to install the new version of “dsBase”. Select “Install all DataSHIELD packages” then press the “Install” button.

Update DataSHIELD client-side package

If you have installed the DataSHIELD client package (dsBaseClient) using the function install.packages and specifying the Obiba repository, then you can update the client package as follows:

update.packages(repos='http://cran.obiba.org')

If you do not have the “DSI” and “DSOpal” packages installed, it is enough to install ‘DSOpal’, as installing ‘DSOpal’ will cause the installation of 'DSI'.
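A minimal sketch of that install step, assuming DSOpal is available from the CRAN mirror configured in your R session (the repository is an assumption; these notes do not state it):

```r
# Installing DSOpal also installs DSI as one of its dependencies.
# Assumption: the package is available from your configured CRAN mirror.
install.packages("DSOpal")
```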

Supported versions

DataSHIELD v6.0 is supported on R 3.5, R 3.6 and R 4.0, and would be expected to work with intermediate versions. At present the DataSHIELD client-side package is known to work on Ubuntu 16.04, Ubuntu 18.04 and Windows 10. The DataSHIELD server-side package is known to work when deployed to Opal 2.16.0 running on Ubuntu 16.04.

Code availability

As ever, you can see the code at a variety of places: https://cran.obiba.org/, or https://github.com/datashield/dsBase/tree/6.0.0 and https://github.com/datashield/dsBaseClient/tree/6.0.0

DataSHIELD Wiki by DataSHIELD is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.datashield.ac.uk/wiki