The server side functions

Introduction to server side functions

First and foremost, let us emphasize the importance of writing 'good code'. The box below gives some advice on writing code that will not be a pain to debug, for the developer but most importantly for other developers, who will not necessarily have the same style of programming. It is hence essential to stick to good practice to make your future life and the life of fellow developers easy.

Good practice in programming

As with any programming it is highly important to stick to best practices:

  • Avoid dense/compact code for improved legibility: put space between terms.
    # write
    result <- mean(xvect, na.rm=TRUE)
    
    # rather than
    result<-mean(xvect,na.rm=TRUE)
    
  • Use brackets with 'if' and loop statements; this is again better for legibility, particularly when it comes to debugging, as we can highlight opening and closing brackets to check a specific section more easily.
    # write
    if(mystuff == 1){
      ...........
    }else{
      ...........
    }
    
    # rather than the below lines or anything not completely obvious
    if(mystuff == 1)
    ..........
    else
    ............
    
  • Comment each line of code unless the line is obvious to even a beginner; not writing comments can prove costly when it comes to debugging after a long period. However, avoid writing so many comments that it becomes difficult to 'see' the lines that are executed. Comments should be in lower case except in the rare situations where you really need to 'shout out' some important information. Using too many capital letters in comments leads to the same problem as having too many comments.
  • Separate blocks of lines by an empty line, again for improved legibility.
  • Indent properly after an 'if' or a loop statement. This is particularly important when using nested 'if' statements or nested loops and will prove valuable when tracking down errors.
  • Use internal functions to avoid extremely long scripts which can prove difficult to debug - see subsetByClassDS which makes use of many internal functions.

We already said that our package will contain four functions: two aggregate functions, that is, functions that return non-disclosive summaries to the analyst, and two assign functions, which store their output on the server site instead of returning it to the analyst. All four functions are basic ones because the purpose of this tutorial is to illustrate the fundamental principles of HDS R package development, and that goal is better served by showing simple functions.

DataSHIELD is written in the R programming language, and numerous packages and functions have already been written in R and deposited in the Comprehensive R Archive Network (CRAN). It would be silly to re-invent the wheel each time we write a DataSHIELD R function if we can use an existing R function. However, because the main goal of DataSHIELD is to enable analysis without releasing potentially disclosive data to the analyst, we must choose R functions extremely carefully, as some do return potentially disclosive data (i.e. non-aggregated summaries). So to use an R function one should scrutinise its output and decide whether the function (1) is safe to use as is, (2) requires some restriction(s) to its output, or (3) requires some changes to its internal code. If none of these is possible then the developer should consider writing a function 'from scratch'. This does not happen very often, although changing the internal code of certain functions can take as much work as writing a function from scratch, in which case one should ask oneself whether it is not better to just write one's own function.
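As an illustration of option (2), the minimal sketch below wraps the base R function table and masks small cell counts before anything is returned. The function name safeTableDS and the threshold of 5 are assumptions made for this example only; they are not part of DataSHIELD, where the filter is set via opal.

```r
# hypothetical wrapper around table(); not an actual DataSHIELD function
safeTableDS <- function(xvect, threshold = 5){

  # tabulate the non-missing values
  counts <- table(xvect, useNA = "no")

  # cells holding between 1 and (threshold - 1) observations are potentially disclosive
  unsafe <- as.vector(counts > 0 & counts < threshold)

  # mask the unsafe counts instead of returning them
  counts[unsafe] <- NA

  return(counts)
}

# level 'b' has only 2 observations so its count is masked
x <- c(rep("a", 10), rep("b", 2), rep("c", 7))
out <- safeTableDS(x)
print(out)
```

The wrapped function itself is left untouched; only what leaves the server is restricted.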

Everything said in the above information note applies mainly to aggregate functions, which are the ones where a leak of potentially disclosive data is most likely to occur. As for assign functions, the developer should just make sure the output stored on the server site does not become disclosive if processed by an aggregate function (e.g. if the output of an assign function is a single value that relates to one study participant, the value itself is not visible to the analyst, but using that value as input for an aggregate function can cause disclosure).
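The situation described in parentheses can be made concrete with a few lines of plain R. The object names and values are made up for illustration and no DataSHIELD function is involved.

```r
# values held on the server, never shown to the analyst
ages <- c(34, 51, 47, 29, 62)

# a careless assign step stores a subset that holds a single participant
oneParticipant <- ages[ages > 60]

# an aggregate function applied to that object returns the participant's value
leaked <- mean(oneParticipant)
print(leaked)
```

The stored object is never displayed, yet the summary computed from it is exactly the individual value.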

Naming of server site functions

Server site function names end with the suffix DS; the only exception concerns internal functions, which can take any name. Internal R package functions are those that are meant to be called only from within another function. If a server site function name is composed of more than one word, the second and subsequent words start with a capital letter (e.g. rowColCalcDS, recodeLevelsDS).

The function 'meanDS'

In RStudio go to the File tab in the top menu, select New File and then choose R Script. This will open up a new R script file; copy the code below and paste it, or type it, into the blank file. Then go to the File tab on the RStudio top menu bar and choose Save As, browse to the R folder in the project directory and save the file under the name meanDS.R. Always use the same name as the function for the script file; the file extension should always be .R.

#'
#' @title Computes the statistical mean of a vector
#' @description Calculates the mean value.
#' @details if the length of input vector is less than the set filter
#' a missing value is returned.
#' @param xvect a vector
#' @return a numeric, the statistical mean
#' @author Gaye, A.
#' @export
#'
meanDS <- function (xvect) {

  # check if the input vector is valid (i.e. meets DataSHIELD privacy criteria)
  check <- isValidDS(xvect)

  # compute the mean if the input vector is valid, otherwise return a missing value
  if(check){
    result <- mean(xvect, na.rm=TRUE)
  }else{
    result <- NA
  }

  return(result)
}

The function 'replaceNaDS'

In RStudio go to the File tab in the top menu, select New File and then choose R Script. This will open up a new R script file; copy the code below and paste it, or type it, into the blank file. Then go to the File tab on the RStudio top menu bar and choose Save As, browse to the R folder in the project directory and save the file under the name replaceNaDS.R. Always use the same name as the function for the script file; the file extension should always be .R.

#'
#' @title Replaces the missing values in a vector
#' @description This function identifies missing values and replaces them by a value or
#' values specified by the analyst.
#' @details This function is used when the analyst prefers or requires complete vectors.
#' It is then possible to specify one value for each missing value by first returning
#' the number of missing values using the function \code{numNaDS} but in most cases
#' it might be more sensible to replace all missing values by one specific value e.g.
#' replace all missing values in a vector by the mean or median value. Once the missing
#' values have been replaced a new vector is created.
#' @param xvect a character, the name of the vector to process.
#' @param replacements a vector which contains the replacement value(s), a vector of one or
#' more values for each study.
#' @return a new vector without missing values
#' @author Gaye, A.
#' @export
#'
replaceNaDS <- function(xvect, replacements){

  # check if the input vector is valid (i.e. meets DataSHIELD criteria)
  check <- isValidDS(xvect)

  # get the indices of the missing values
  indx <- which(is.na(xvect))

  if(check){
    # if the input vector is valid replace the missing values
    xvect[indx] <- replacements
  }else{
    # if the input vector is not valid set all its values to missing
    # (seq_along leaves a vector of size 0 unchanged)
    xvect[seq_along(xvect)] <- NA
  }

  # return the new vector
  return(xvect)

}
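To try the function outside opal one can define a stand-in for isValidDS. In the minimal sketch below the stand-in and its threshold of 5 are assumptions made purely for local testing, and replaceNaDS is repeated in condensed form, with the intended behaviour (replace when the input is valid, blank out otherwise), so that the block runs on its own.

```r
# stand-in for the real check (assumption): valid if at least 5 observations
isValidDS <- function(obj){
  return(length(obj) >= 5)
}

# condensed copy of replaceNaDS
replaceNaDS <- function(xvect, replacements){
  check <- isValidDS(xvect)
  indx <- which(is.na(xvect))
  if(check){
    xvect[indx] <- replacements
  }else{
    xvect[seq_along(xvect)] <- NA
  }
  return(xvect)
}

x <- c(2, NA, 4, NA, 6, 8)

# one replacement value used for all missing values
allSame <- replaceNaDS(x, 5)

# one replacement value per missing value, in order of appearance
oneEach <- replaceNaDS(x, c(3, 5))

# a vector that fails the check comes back with missing values only
tooShort <- replaceNaDS(c(1, NA, 3), 0)
```

Passing a number of replacement values that is neither one nor the number of missing values makes R recycle them with a warning, which is why the documentation suggests counting the missing values first.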

The code explained

The header of the script

The header of the script (lines starting with #') is used by Roxygen to produce the documentation files. Always choose a short and unambiguous @title, describe the function briefly in the @description and, if there is more to say about the function, write your explanations as @details. Then list the parameters one per line, each line starting with the keyword @param, describing them briefly. Tell what is returned by the function with the keyword @return. Mention your name as main @author and, if other developers contributed to the function, mention their names as well, separating names by semicolons. To finish, and this is highly important, tell if the function should be available from the client site (i.e. if it can be called by the analyst); this is done by inserting the keyword @export. If the function is not exported it will not be available from the client site.
Although the header bit of a server site function is not as crucial as the one of a client site function it is important to give some information about the function to facilitate future maintenance and extensions.
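For contrast with the two exported functions, here is a sketch of a header for an internal helper. The function countMissing and the author names are placeholders invented for this example; the point is only that the export tag is left out and that several authors are separated by a semicolon.

```r
#'
#' @title Counts the missing values in a vector
#' @description Returns the number of missing values.
#' @details Hypothetical internal helper, for illustration only. It has no
#' export tag, so it can only be called from other functions of the package.
#' @param xvect a vector
#' @return an integer, the number of missing values
#' @author Doe, J.; Smith, A.
#' @keywords internal
#'
countMissing <- function(xvect){

  # count the entries flagged as missing
  result <- sum(is.na(xvect))

  return(result)
}
```

The tag keywords internal is a standard Roxygen way of keeping such helpers out of the documentation index.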

The body of the script

Without a header a function may work, but without error-free code in the body of the script it might not even be possible to build the package. We will talk about checks later. Both scripts are really simple so we will not explain each line; if you struggle to understand these lines then you are not ready to be an R developer, let alone a DataSHIELD developer.

Checking that a vector or table meets DataSHIELD privacy criteria

In DataSHIELD only vectors and tables that hold at least a specific number of observations (set via opal) can be processed. By calling the function isValidDS we verify whether the input object can or cannot be processed; the function gets its argument from opal so we do not need to take care of that, we just make sure we call the function. To be able to call the function get it from here and copy it over into the R folder of your package. If the input object fails the privacy check an empty object (i.e. an object with missing values) is generated, as you can see in the rest of the code.
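To give an idea of what such a check involves, here is a minimal sketch and not the actual isValidDS. The name isValidSketch, the hard-coded threshold of 5 and the decision to treat an empty object as valid are all assumptions; the real function obtains its filter from opal.

```r
# hypothetical filter value; in DataSHIELD it is set via opal
minObs <- 5

isValidSketch <- function(obj, threshold = minObs){

  if(is.data.frame(obj) || is.matrix(obj)){
    # tables: count the rows
    n <- nrow(obj)
  }else{
    # vectors: count the elements
    n <- length(obj)
  }

  # assumption: an empty object is allowed, a size between 1 and (threshold - 1) is not
  check <- (n == 0 || n >= threshold)

  return(check)
}

isValidSketch(1:10)
isValidSketch(1:3)
isValidSketch(data.frame(a = 1:2))
```

A server site function would call the check first and only compute on the object when the result is TRUE.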

DataSHIELD Wiki by DataSHIELD is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.datashield.ac.uk/wiki