
Does my statistical method return disclosive data?

In DataSHIELD, a method is considered disclosive if its output contains the record(s) of one or more study participants, or if its output can be used to infer the values of one or more study participants.

This is really a reflection on the statistical method we want to implement. For some functions the answer is obvious. Take ds.mean: it is known that individual-level data (the value for one study participant) cannot be inferred from a mean, unless the data being processed holds only one or a very small number of observations. That case does not arise in DataSHIELD because the 'privacy level' functionality blocks the processing of data that does not contain a certain minimum number of observations.
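The small-sample case can be illustrated in plain R. The sketch below is not how DataSHIELD implements its privacy level; `safe_mean` and its `min_obs` threshold are hypothetical names chosen for illustration, and the value 5 is an arbitrary assumption.

```r
# Hypothetical helper, not part of DataSHIELD: refuse to return a mean
# when fewer than `min_obs` non-missing values are available.
safe_mean <- function(x, min_obs = 5) {
  n_valid <- sum(!is.na(x))
  if (n_valid < min_obs) {
    stop("too few observations to return a non-disclosive mean")
  }
  mean(x, na.rm = TRUE)
}

# With a single observation the "mean" is the participant's own value.
single_value <- c(72.4)
unguarded <- mean(single_value)

# The guarded version blocks that case but still works on larger vectors.
blocked <- inherits(try(safe_mean(single_value), silent = TRUE), "try-error")
ok_mean <- safe_mean(c(70, 72, 68, 75, 71, NA))
```

Here `unguarded` is exactly the single participant's value, which is why a minimum count has to be enforced before any summary is returned.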
Answering the question requires a good understanding of the statistical method. For example, in Cox models the baseline hazard is disclosive, so implementing Cox regression as is would be potentially disclosive. The same can be said of the residuals of a generalized linear model, since a residual combined with the corresponding fitted value gives back the observed outcome.
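The point about residuals can be checked in standard R. The following is a minimal sketch on simulated data (the variable names and the simulated model are assumptions made for illustration): adding the response residuals to the fitted values of a glm reproduces each participant's outcome.

```r
# Simulated data standing in for individual-level study data.
set.seed(1)
n <- 50
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 * x))

fit <- glm(y ~ x, family = binomial())

# Response residuals are defined as observed minus fitted, so
# fitted values plus residuals give back the individual outcomes.
recovered <- fitted(fit) + residuals(fit, type = "response")
outcomes_recovered <- isTRUE(all.equal(unname(recovered), as.numeric(y)))
```

If a function returned both quantities to the user, `outcomes_recovered` shows that the individual outcomes would be fully recoverable.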
The purpose of this section is not to list statistical methods and state whether some of their outputs are disclosive; rather, it is to make developers aware of the risk of implementing a statistical method without careful scrutiny of its output.

Is the output of the function correct?

Because in DataSHIELD the user cannot see the data being processed, it is important to ensure that the results of a computation are correct.

The best way to ensure a function produces the right results is to compare its output to that of a similar function in standard R. This is done in three steps:

Load the test data you are using for the development into R. This test data must be realistic. It should at least contain missing values, and the developer should consider including some extreme values or combinations. These checks often fail to capture errors because the data is too 'flat'.
Use established R functions that carry out the task you want to do in DataSHIELD, run an analysis with the data you loaded and store the results.
Now use your DataSHIELD function to run the same analysis and compare the DataSHIELD results to those you obtained in standard R.

If the two sets of results are not similar, go back to your code and try to understand the reason for the difference. You should not proceed until you have similar results.
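The three steps above can be sketched in plain R. Since a real DataSHIELD function needs a server to run against, `pooled_mean` below is a hypothetical stand-in for the function under test: it only uses per-study sums and counts. The test vectors are made up and deliberately include missing values and one extreme value.

```r
# Step 1: test data with missing and extreme values, split across two "studies".
study1 <- c(5.1, 4.8, NA, 6.2, 150.0)
study2 <- c(5.5, NA, 4.9, 5.0, 5.3, 4.7)

# Step 2: reference result from an established R function on the combined data.
reference <- mean(c(study1, study2), na.rm = TRUE)

# Step 3: hypothetical stand-in for the function under test, which
# combines per-study sums and counts instead of seeing the raw data.
pooled_mean <- function(studies) {
  sums <- vapply(studies, function(s) sum(s, na.rm = TRUE), numeric(1))
  counts <- vapply(studies, function(s) sum(!is.na(s)), integer(1))
  sum(sums) / sum(counts)
}
candidate <- pooled_mean(list(study1, study2))

# all.equal() compares within a numeric tolerance rather than exactly.
results_match <- isTRUE(all.equal(candidate, reference))
```

Comparing with `all.equal()` rather than `==` avoids false alarms from floating-point rounding while still flagging genuine differences.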

Once you have completed your checks you can finalize your package and transfer it to GitHub. Go to this page and follow the guide to complete your development.