Disclosure control



From DataSHIELD v5 onwards there are several mechanisms to ensure that an analysis is non-disclosive. Each mechanism has a configuration setting in Opal that data providers can set, and server-side functions are expected to implement the corresponding checks. The list of disclosure control settings implemented in each function is available at: Disclosure checks

nfilter.tab

The minimum non-zero cell count allowed in any cell if a contingency table is to be returned. This applies to one-dimensional and two-dimensional tables of counts tabulated across one or two factors, and to tables of the mean of a quantitative variable tabulated across a factor. The default is usually 3, but a value of 1 (no limit) may be necessary, particularly if low cell counts are highly probable, such as when working with rare diseases. Five is also a justifiable choice, as it replicates the most common threshold rule imposed by data releasers worldwide, but it should be recognised that many census providers are moving to ten.
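
The rule above can be sketched as a small check. This is an illustration only: DataSHIELD itself is written in R, Python is used here purely for readability, and the function name table_is_disclosive is hypothetical rather than part of any DataSHIELD package.

```python
def table_is_disclosive(counts, nfilter_tab=3):
    """Return True if any non-zero cell count is below nfilter_tab.

    counts is a list of rows, each a list of integer cell counts.
    Empty cells (count 0) are ignored, since the rule applies only
    to non-zero counts.
    """
    return any(0 < cell < nfilter_tab for row in counts for cell in row)


# A 2x2 table with one cell of 2 fails the default threshold of 3.
print(table_is_disclosive([[10, 2], [5, 0]]))
# Setting the threshold to 1 removes the limit: no positive count is below 1.
print(table_is_disclosive([[10, 2], [5, 0]], nfilter_tab=1))
```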

nfilter.subset

The minimum non-zero count of observational units (typically individuals) in a subset. The default is typically 3.

nfilter.glm

The maximum number of parameters in a regression model, expressed as a proportion of the sample size in a study. For example, if a study has 1000 observational units (typically individuals) in a particular analysis and nfilter.glm is set to 0.33 (its default value), the maximum allowable number of parameters in a model fitted to those data is 330. This disclosure filter protects against fitting overly saturated models, which can be disclosive.
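
The arithmetic in this example can be written out as a short sketch. It is an illustration under assumptions, not DataSHIELD's own code (which is in R): the function name glm_within_limit is hypothetical, and the exact form of the comparison used on the server may differ.

```python
def glm_within_limit(n_params, n_obs, nfilter_glm=0.33):
    """Return True if the model has no more parameters than
    nfilter_glm times the number of observational units."""
    return n_params <= nfilter_glm * n_obs


# With 1000 observational units and the default of 0.33,
# 330 parameters are allowed and 331 are not.
print(glm_within_limit(330, 1000))
print(glm_within_limit(331, 1000))
```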

nfilter.string

The maximum length of a string argument if that argument is to be subject to testing of its length. Default value = 80. The aim of this nfilter is to make it difficult for hackers to find a way to embed malicious code in a valid string argument that is actively interpreted.

nfilter.stringShort

The same as nfilter.string, but with a shorter limit. Default value = 20 characters.
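
Both string filters amount to a length check on an argument before it is interpreted. A minimal sketch follows, in Python for illustration only (DataSHIELD is written in R). The function name string_within_limit is hypothetical, and treating a string of exactly the limit as acceptable is an assumption.

```python
def string_within_limit(arg, max_length=80):
    """Return True if the string argument is no longer than max_length.

    max_length stands in for nfilter.string (80) or
    nfilter.stringShort (20).
    """
    return len(arg) <= max_length


print(string_within_limit("age ~ sex + bmi"))            # short formula-like string
print(string_within_limit("x" * 81))                     # exceeds the 80 limit
print(string_within_limit("x" * 21, max_length=20))      # exceeds the short limit
```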

nfilter.kNN

The minimum value allowed for k in the k-nearest neighbours method, which is used mainly by some of the graphical functions. Default value = 3.

nfilter.levels

The maximum number of unique levels of a categorical variable that may be returned to the client, expressed as a proportion of the variable's length. If nfilter.levels is set to 0.33 (its default value) and a categorical variable (i.e. a factor) has X distinct categories, then the categories (i.e. levels) are not returned to the client when X is greater than 33% of the variable's length. This disclosure filter protects against disclosing all the unique values of a numerical variable when it is converted to a factor.
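
The comparison described above can be sketched as follows. This is a hedged illustration in Python rather than DataSHIELD's R code; the function name levels_if_allowed and the choice to return None when the check fails are assumptions made for the example.

```python
def levels_if_allowed(values, nfilter_levels=0.33):
    """Return the sorted unique levels of values, or None if there are
    more levels than nfilter_levels times the length of the variable."""
    levels = sorted(set(values))
    if len(levels) > nfilter_levels * len(values):
        return None  # too many levels relative to length: do not return them
    return levels


# 3 levels in a variable of length 10: 3 is not greater than 3.3, so allowed.
print(levels_if_allowed(["a", "b", "c"] + ["a"] * 7))
# A numeric variable whose 10 values are all distinct: 10 > 3.3, so blocked.
print(levels_if_allowed(list(range(10))))
```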

nfilter.noise

The minimum level of noise that can be added to a server-side vector before the "noisy" vector is returned to the client. The value specifies the variance of the added noise as a proportion of the variance of the vector. If nfilter.noise is set to 0.25 (its default value), then noise drawn from a distribution (usually Gaussian) with zero mean and variance equal to 25% of the true variance of the vector of interest is added to each individual value of that vector.
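
A minimal sketch of this noise step is given below, assuming Gaussian noise and the sample variance of the vector. It is written in Python for illustration only (DataSHIELD is implemented in R), and the function name add_noise and its rng parameter are hypothetical.

```python
import math
import random
import statistics


def add_noise(values, nfilter_noise=0.25, rng=None):
    """Return a copy of values with zero-mean Gaussian noise added.

    The noise variance is nfilter_noise times the sample variance of
    values, so the noise standard deviation is the square root of that.
    values must contain at least two numbers.
    """
    rng = rng or random.Random()
    noise_sd = math.sqrt(nfilter_noise * statistics.variance(values))
    return [v + rng.gauss(0.0, noise_sd) for v in values]


# Example: each value is perturbed independently; the output has the
# same length as the input.
noisy = add_noise([1.0, 4.0, 2.5, 7.0, 3.0], rng=random.Random(42))
print(len(noisy))
```

Passing a seeded generator, as in the usage line, only makes the illustration reproducible; it is not something the original text prescribes.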

datashield.privacyLevel

This is the old filter used in DataSHIELD v4. It will be replaced by the eight filters described above when the existing and BetaTest functions are merged into one main package.