First and foremost, let us emphasize the importance of writing 'good code'. The box below gives some advice on writing code that will not be a pain to debug, for the developer themselves but, most importantly, for other developers, who will not necessarily have the same style of programming. It is hence essential to stick to good practice to make your future life, and the life of fellow developers, easy.
As with any programming it is highly important to stick to best practices:
We already said that our package will contain four functions: two aggregate functions, that is, functions that return non-disclosive summaries to the analyst, and two assign functions, which store their output on the server site instead of returning it to the analyst. All four functions are basic because the purpose of this tutorial is to illustrate the fundamental principles of HDS R package development, and that goal is better served by simple functions.
DataSHIELD is written in the R programming language, and numerous packages and functions have already been written in R and deposited in the Comprehensive R Archive Network (CRAN). It would be silly to reinvent the wheel each time we write a DataSHIELD R function if we can use an existing R function. However, because DataSHIELD's main goal is to enable analysis without releasing potentially disclosive data to the analyst, we must choose R functions extremely carefully, as some do return potentially disclosive data (i.e. non-aggregated summaries). So, to use an R function, one should scrutinise its output and decide whether the function (1) is safe to use as is, (2) requires some restriction(s) to its output, or (3) requires some changes to its internal code. If none of these is possible, the developer should consider writing a function 'from scratch'. This does not happen very often, although changing the internal code of certain functions takes as much work as writing a function from scratch, in which case one should ask oneself whether it is not better to just write one's own function.
Everything said in the above information note applies mainly to aggregate functions, which are the ones where a leak of potentially disclosive data is more likely to occur. As for assign functions, the developer should just make sure the output stored on the server site does not become disclosive if processed by an aggregate function (e.g. if the output of an assign function is a single value that relates to one study participant, the value itself is not visible to the analyst, but using that value as input for an aggregate function can cause disclosure).
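As an illustration of option (2), restricting the output of an existing R function, here is a minimal sketch built on base R's `table()`. The function name `safeTableDS` and the threshold of 5 are assumptions made for this example only; they are not part of any DataSHIELD package, and a real function would take its threshold from the server settings.

```r
# Hypothetical aggregate function, for illustration only.
# It reuses base R's table() but refuses to return small cell counts.
safeTableDS <- function(xvect, min_count = 5) {
  counts <- table(xvect, useNA = "no")
  # a cell holding between 1 and (min_count - 1) observations
  # could point to individual participants
  risky <- counts > 0 & counts < min_count
  if (any(risky)) {
    return(NA)
  }
  counts
}

safeTableDS(c(rep("a", 6), rep("b", 7)))  # every cell is large enough: counts returned
safeTableDS(c(rep("a", 6), "b"))          # cell "b" holds one observation: NA returned
```

The wrapped function is unchanged; only what leaves the server is restricted, which is usually much less work than rewriting the function.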
Server site function names end with the suffix |
In RStudio, go to the File tab in the top menu, select New File and then choose R Script. This will open a new, blank R script file; copy the code below and paste it, or type it, into the blank file. Then go to the File tab on the RStudio top menu bar and choose Save As, browse to the R folder in the project directory and save the file under the name meanDS.R. Always use the same name as the function for the script file; the file extension should always be .R.
#'
#' @title Computes statistical mean of a vector
#' @description Calculates the mean value.
#' @details if the length of input vector is less than the set filter
#' a missing value is returned.
#' @param xvect a vector
#' @return a numeric, the statistical mean
#' @author Gaye, A.
#' @export
#'
meanDS <- function (xvect) {

  # check if the input vector is valid (i.e. meets DataSHIELD privacy criteria)
  check <- isValidDS(xvect)

  # compute the mean if the input vector is valid, otherwise return a missing value
  if(check){
    result <- mean(xvect, na.rm=TRUE)
  }else{
    result <- NA
  }

  return(result)
}
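To try the intended behaviour outside a DataSHIELD server, the sketch below pairs a compact version of `meanDS` with a stand-in for `isValidDS`. The stand-in and its threshold of 5 are assumptions made for this example: the real check and its filter value come from the server environment.

```r
# Stand-in for isValidDS(), for illustration only (assumed rule:
# a non-empty vector shorter than 'nfilter' is not valid).
isValidDS <- function(obj, nfilter = 5) {
  !(length(obj) > 0 && length(obj) < nfilter)
}

# Compact version of meanDS: the mean is computed only for a valid vector
meanDS <- function(xvect) {
  if (isValidDS(xvect)) mean(xvect, na.rm = TRUE) else NA
}

meanDS(c(10, 12, NA, 14, 16, 18))  # long enough: the mean of the non-missing values
meanDS(c(10, 12))                  # too short: NA is returned
```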
In RStudio, go to the File tab in the top menu, select New File and then choose R Script. This will open a new R script file; copy the code below and paste it, or type the lines, into the file. Then go to the File tab on the RStudio top menu bar and choose Save As, browse to the R folder in the project directory and save the file under the name replaceNaDS.R. Always use the same name as the function for the script file; the file extension should always be .R.
#'
#' @title Replaces the missing values in a vector
#' @description This function identifies missing values and replaces them by a value or
#' values specified by the analyst.
#' @details This function is used when the analyst prefers or requires complete vectors.
#' It is then possible to specify one value for each missing value by first returning
#' the number of missing values using the function \code{numNaDS} but in most cases
#' it might be more sensible to replace all missing values by one specific value e.g.
#' replace all missing values in a vector by the mean or median value. Once the missing
#' values have been replaced a new vector is created.
#' @param xvect a character, the name of the vector to process.
#' @param replacements a vector which contains the replacement value(s), a vector of one or
#' more values for each study.
#' @return a new vector without missing values
#' @author Gaye, A.
#' @export
#'
replaceNaDS <- function(xvect, replacements){

  # check if the input vector is valid (i.e. meets DataSHIELD criteria)
  check <- isValidDS(xvect)

  # get the indices of the missing values
  indx <- which(is.na(xvect))

  if(check){
    # if the input vector is valid replace missing values
    xvect[indx] <- replacements
  }else{
    # if the input vector is not valid and is of size > 0
    xvect[seq_along(xvect)] <- NA
  }

  # return the new vector
  return(xvect)
}
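The two ways of supplying replacements described in the header can be tried locally with the sketch below. As before, `isValidDS` is a stand-in with an assumed threshold of 5, not the real server-side check, and the function body is a compact restatement of the intended logic.

```r
# Stand-in for isValidDS(), for illustration only (assumed rule:
# a non-empty vector shorter than 'nfilter' is not valid).
isValidDS <- function(obj, nfilter = 5) {
  !(length(obj) > 0 && length(obj) < nfilter)
}

replaceNaDS <- function(xvect, replacements) {
  if (isValidDS(xvect)) {
    # valid input: fill in the missing values
    xvect[which(is.na(xvect))] <- replacements
  } else {
    # input that is not valid: blank out the whole vector
    xvect[seq_along(xvect)] <- NA
  }
  xvect
}

x <- c(1, NA, 3, NA, 5, 6)
replaceNaDS(x, 0)          # one value for every missing entry
replaceNaDS(x, c(2, 4))    # one value per missing entry, in order
replaceNaDS(c(1, NA), 0)   # too short to be valid: all NA
```

When one value per missing entry is supplied, its length should match the number of missing values; otherwise R recycles the replacements and issues a warning.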
The header of the script (lines starting with #') is used by Roxygen to produce the documentation files. Always choose a short and unambiguous @title, describe the function briefly in the @description and, if there is more to say about the function, write your explanations as @details. Then list the parameters one per line, each line starting with the keyword @param, describing them briefly. State what is returned by the function with the keyword @return. Mention your name as main @author and, if other developers contributed to the function, mention their names as well, separating names with semicolons. To finish, and this is highly important, state whether the function should be available from the client site (i.e. whether it can be called by the analyst); this is done by inserting the keyword @export. If the function is not exported it will not be available from the client site.
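The tags described above can be put together in a header like the following sketch. The function `numNaDS` is only mentioned by name in this tutorial, so the body shown here is a guess at what such a function might do, and the author names are placeholders.

```r
#'
#' @title Counts the missing values in a vector
#' @description Returns the number of missing values.
#' @details Hypothetical example written to illustrate the header layout;
#' only the count is returned.
#' @param xvect a vector
#' @return an integer, the number of missing values
#' @author Surname, A.; Surname, B.
#' @export
#'
numNaDS <- function(xvect) {
  # is.na() flags each missing entry; sum() counts the flags
  sum(is.na(xvect))
}

numNaDS(c(1, NA, 3, NA))
```

In a plain R session the #' lines are ordinary comments, so the function still runs; they only take effect when Roxygen processes the package.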
Although the header of a server site function is not as crucial as that of a client site function, it is important to give some information about the function to facilitate future maintenance and extensions.
A function may work without a header, but if the code in the body of the script is not error free it might not even be possible to build the package. We will talk about checks later. Both scripts are really simple so we will not explain each line; if you struggle to understand these lines, you are probably not yet ready to be an R developer, let alone a DataSHIELD developer.
In DataSHIELD only vectors and tables that hold a specific number of observations (set via Opal) can be processed. By calling the function |