Disclosure control

PAGE ARCHIVED

This page has been archived and is no longer being updated. Please see the corresponding content on the new wiki page.



DataSHIELD server-side functions contain disclosure traps, which block analyses that could return disclosive information and perform real-time disclosure checks during analysis. Disclosure traps are mapped to current best practice for disclosure checking (Welpton, Richard (2019): SDC Handbook. figshare. Book. https://doi.org/10.6084/m9.figshare.9958520.v1) and are configurable by data custodians in Opal to align with their governance needs and the spectrum of data sensitivity. From DataSHIELD v5 onwards there are several disclosure traps that can be deployed in server-side functions, listed below. A summary of the disclosure checks used in each function is available at: Disclosure checks.

nfilter.tab

The minimum non-zero count allowed in any cell if a contingency table is to be returned. This applies to one-dimensional and two-dimensional tables of counts tabulated across one or two factors, and to tables of the mean of a quantitative variable tabulated across a factor. The default is usually 3, but a value of 1 (no limit) may be necessary, particularly when low cell counts are highly probable, such as when working with rare diseases. Five is also a justifiable choice, replicating the most common threshold rule imposed by data releasers worldwide, but it should be recognised that many census providers are moving to ten.
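
As an illustration of the rule described above, the following is a minimal Python sketch, not DataSHIELD's own server-side R code. The function name table_is_safe and the example tables are hypothetical; the sketch assumes that a cell passes when its count is either zero or at least nfilter.tab.

```python
def table_is_safe(counts, nfilter_tab=3):
    """Return True if no cell holds a non-zero count below nfilter_tab.

    `counts` is a list of rows; a one-dimensional table is a single row.
    Zero cells are allowed because the filter applies to non-zero counts only.
    """
    cells = [c for row in counts for c in row]
    return all(c == 0 or c >= nfilter_tab for c in cells)


# Hypothetical example tables.
safe_table = [[12, 0], [7, 5]]    # every non-zero cell is >= 3
risky_table = [[12, 2], [7, 5]]   # one cell holds a count of 2

print(table_is_safe(safe_table))
print(table_is_safe(risky_table))
print(table_is_safe(risky_table, nfilter_tab=1))  # 1 means "no limit"
```

With a threshold of 1 every non-zero count passes, which is why the text describes that setting as imposing no limit.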

nfilter.subset

The minimum non-zero count of observational units (typically individuals) in a subset. The default is typically 3.

nfilter.glm

The maximum number of parameters in a regression model, expressed as a proportion of the sample size in a study. If a particular analysis uses 1000 observational units (typically individuals) and nfilter.glm is set to 0.33 (its default value), the maximum allowable number of parameters in a model fitted to those data is 330. This disclosure filter protects against fitting overly saturated models, which can be disclosive.
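
The arithmetic above can be sketched in Python as follows. This is an illustrative check under stated assumptions rather than DataSHIELD's implementation: the function name glm_is_allowed is hypothetical, and the boundary is treated as inclusive because the text gives 330 as an allowable number of parameters.

```python
def glm_is_allowed(n_params, n_obs, nfilter_glm=0.33):
    """Return True if the parameter count is within nfilter_glm * sample size."""
    return n_params <= nfilter_glm * n_obs


# The worked example from the text: 1000 observational units, default filter.
print(glm_is_allowed(330, 1000))   # at the limit
print(glm_is_allowed(331, 1000))   # one parameter too many
```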

nfilter.string

The maximum length of a string argument if that argument is to be subject to testing of its length. Default value = 80. The aim of this nfilter is to make it difficult for hackers to find a way to embed malicious code in a valid string argument that is actively interpreted.

nfilter.stringShort

The same as nfilter.string, but with the limit set to 20 characters.
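
A length check of this kind could look like the following Python sketch. The function name string_is_allowed is hypothetical, and the sketch assumes that a string exactly at the limit is accepted; it does not reproduce how DataSHIELD decides which arguments are subject to the test.

```python
NFILTER_STRING = 80        # default for nfilter.string
NFILTER_STRING_SHORT = 20  # default for nfilter.stringShort


def string_is_allowed(arg, max_len=NFILTER_STRING):
    """Return True if a string argument is no longer than max_len characters."""
    return len(arg) <= max_len


print(string_is_allowed("D$age"))
print(string_is_allowed("x" * 81))
print(string_is_allowed("x" * 21, max_len=NFILTER_STRING_SHORT))
```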

nfilter.kNN

The minimum value allowed for k in the k-nearest neighbours method, which is used mainly by some of the graphical functions. Default value = 3.

nfilter.levels 

The maximum proportion of unique levels of a categorical variable, relative to the variable's length, that may be returned to the client. If nfilter.levels is set to 0.33 (its default value) and a categorical variable (i.e. factor) has X distinct categories, then the categories (i.e. levels) are not returned to the client if X is greater than 33% of the variable's length. This disclosure filter protects against the disclosure of all the unique values in a numerical variable when it is converted to a factor variable. This option has been deprecated.

nfilter.levels.density

The maximum proportion of unique levels of a categorical variable, relative to the length of that variable, that is regarded as non-disclosive. For example, if the resulting factor contains 1000 levels derived from 4000 rows, the proportion is 0.25 (25%), which is below the default and so is regarded as non-disclosive under this filter. Default value is 0.33.

nfilter.levels.max

The maximum number of unique levels of a categorical variable that is regarded as non-disclosive. Default value is 40.
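
The two level filters can be compared side by side in a short Python sketch. The function name level_checks is hypothetical, and the sketch deliberately reports each check separately because this page does not state how DataSHIELD combines them; both boundaries are assumed to be inclusive.

```python
def level_checks(n_levels, n_rows, density=0.33, max_levels=40):
    """Evaluate nfilter.levels.density and nfilter.levels.max independently."""
    return {
        "density_ok": n_levels / n_rows <= density,  # proportion of unique levels
        "max_ok": n_levels <= max_levels,            # absolute number of levels
    }


# The example from the text: 1000 levels derived from 4000 rows.
print(level_checks(1000, 4000))
# A small factor: 4 levels across 200 rows.
print(level_checks(4, 200))
```

In the first call the density is 0.25, which satisfies the density filter, while 1000 levels exceeds the default maximum of 40, so the two filters can disagree on the same variable.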

nfilter.noise

The minimum level of noise that can be added to a server-side vector. The "noisy" vector can then be returned to the client. This value specifies the variance of the added noise as a proportion of the vector's true variance. If nfilter.noise is set to 0.25 (its default value), then noise following a distribution (usually Gaussian) with zero mean and variance equal to 25% of the true variance of the vector of interest is added to each individual value of that vector.
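
The noise rule can be illustrated with a minimal Python sketch. It is not DataSHIELD's implementation: the function name add_noise and the seed argument are hypothetical, Gaussian noise is assumed, and the "true variance" is taken to be the sample variance of the vector.

```python
import random
import statistics


def add_noise(values, nfilter_noise=0.25, seed=None):
    """Add zero-mean Gaussian noise whose variance is nfilter_noise * var(values)."""
    rng = random.Random(seed)
    # Standard deviation of the noise: square root of the scaled sample variance.
    noise_sd = (nfilter_noise * statistics.variance(values)) ** 0.5
    return [v + rng.gauss(0.0, noise_sd) for v in values]


# Hypothetical server-side vector.
heights = [150.0, 162.5, 171.0, 158.2, 180.4]
noisy = add_noise(heights, seed=42)
print(noisy)
```

Each returned value differs from the original by an independent random draw, so individual values are masked while the overall shape of the data is broadly preserved.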

datashield.privacyControlLevel

Permits server administrators to run servers with a predefined subset of the standard methods available. If the value of this option is not the string "permissive", the following server-side methods will be blocked from use: dataFrameSubsetDS1, levelsDS, BooleDS, cDS, cbindDS, dataFrameDS, dataFrameSortDS, dataFrameSubsetDS2, dmtC2SDS, rbindDS, recodeLevelsDS, recodeValuesDS, repDS, reShapeDS, seqDS, subsetByClassDS and subsetDS. Default value is "permissive". The option was introduced in DataSHIELD version 6.2.
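
The gating behaviour described above can be sketched in Python as follows. The function name method_is_allowed and the example value "non-permissive" are hypothetical; the sketch only encodes the rule stated in the text, namely that any value other than "permissive" blocks the listed methods.

```python
# Methods listed in the text as blocked unless the option is "permissive".
BLOCKED_UNLESS_PERMISSIVE = {
    "dataFrameSubsetDS1", "levelsDS", "BooleDS", "cDS", "cbindDS",
    "dataFrameDS", "dataFrameSortDS", "dataFrameSubsetDS2", "dmtC2SDS",
    "rbindDS", "recodeLevelsDS", "recodeValuesDS", "repDS", "reShapeDS",
    "seqDS", "subsetByClassDS", "subsetDS",
}


def method_is_allowed(method, privacy_control_level="permissive"):
    """Return True if a server-side method may run under the given option value."""
    if privacy_control_level == "permissive":
        return True
    return method not in BLOCKED_UNLESS_PERMISSIVE


print(method_is_allowed("levelsDS"))
print(method_is_allowed("levelsDS", "non-permissive"))
print(method_is_allowed("someOtherDS", "non-permissive"))  # not in the blocked list
```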

datashield.privacyLevel

This is the old filter that was used in DataSHIELD v4. This option has been deprecated and has been replaced by the nine filters described above.

DataSHIELD Wiki by DataSHIELD is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.datashield.ac.uk/wiki