In DataSHIELD, a method is considered disclosive if its output contains the record(s) of one or more study participants, or if its output can be used to infer the values of one or more study participants.
Deciding whether a function is disclosive is really a reflection on the statistical method we want to implement. For some functions the answer is obvious, as for our ds.mean: it is known that individual-level data (the value for one study participant) cannot be inferred from a mean, unless the data being processed hold only one or a very small number of observations. That situation does not arise in DataSHIELD because the 'privacy level' functionality blocks the processing of data that do not contain a minimum number of observations.
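The minimum-count rule described above can be illustrated with a small sketch. It is written in Python purely as a language-neutral illustration; it is not DataSHIELD code, and the threshold value and the name `safe_mean` are assumptions made for the example.

```python
from statistics import fmean

# Hypothetical threshold, chosen for illustration only; in DataSHIELD the
# minimum count is governed by the 'privacy level' setting.
MIN_OBSERVATIONS = 5


def safe_mean(values, min_obs=MIN_OBSERVATIONS):
    """Return the mean only when enough non-missing observations exist."""
    observed = [v for v in values if v is not None]
    if len(observed) < min_obs:
        # With one or very few observations the mean would reveal
        # individual values, so nothing is returned.
        raise ValueError("too few observations to return a mean")
    return fmean(observed)
```

With five or more observations the mean is returned; with a single observation the call is refused, because the mean would equal that participant's value.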
Answering the question requires a good understanding of the statistical method. For example, in Cox models the baseline hazards are disclosive, so implementing Cox regression as is is potentially disclosive. The same can be said of the residuals of a generalized linear model.
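Why residuals are a problem can be seen in a minimal sketch. It uses an ordinary least-squares line, which is the Gaussian, identity-link special case of a generalized linear model, and made-up data. It is written in Python for illustration only, and `fit_line` is a hypothetical helper, not a DataSHIELD function.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up individual-level data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Anyone holding the fitted values (or the coefficients and covariates)
# together with the residuals recovers each participant's outcome.
recovered = [fi + ri for fi, ri in zip(fitted, residuals)]
```

Each entry of `recovered` matches the corresponding observed outcome up to floating-point rounding, which is why returning residuals alongside the model coefficients amounts to returning individual-level data.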
The purpose of this section is not to list statistical methods and state which of their outputs are disclosive; rather, it is to make developers aware of the risk of implementing a statistical method without careful scrutiny of its output.
Because the user in DataSHIELD cannot see the data being processed, it is important to ensure that the results of a computation are correct.
The best way to ensure that a function produces the right results is to compare its output with that of a similar function in standard R. This is done in three steps:
If the two sets of results are not similar, go back to your code and try to understand the reason for the difference. You should not proceed until you have similar results.
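The idea of checking a function against a trusted reference can be sketched as follows. This is an illustration in Python rather than R, with made-up test data that the developer can see in full. The names `study_summary` and `pooled_mean` are hypothetical, and the sketch does not reproduce the exact procedure of this guide.

```python
import math
from statistics import fmean


def study_summary(values):
    """Non-disclosive summary computed inside one study: (sum, count)."""
    return sum(values), len(values)


def pooled_mean(summaries):
    """Combine per-study (sum, count) pairs into one overall mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


# Made-up test data standing in for two studies.
study_a = [4.0, 5.5, 6.1, 7.2, 8.0]
study_b = [3.3, 4.8, 5.9, 6.4, 7.7, 9.1]

# Result of the function under test, built from summaries only.
federated = pooled_mean([study_summary(study_a), study_summary(study_b)])

# Reference result from a standard function applied to the combined data.
reference = fmean(study_a + study_b)

results_agree = math.isclose(federated, reference, rel_tol=1e-9)
```

If `results_agree` were false, the discrepancy would need to be traced back to the code under test before going any further.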
Once you have completed your checks you can finalize your package and transfer it to GitHub. Go to this page and follow the guide to complete your development.