Secure ranking approach

The following is additional information detailing the cluster of functions that enable secure global ranking and estimation of key global quantiles.

This functionality is based on two pivotal clientside functions: ds.ranksSecure and ds.extractQuantiles.

ds.ranksSecure is a clientside function which calls a series of other clientside and serverside functions to securely generate the global ranks of a numeric vector "V2BR" (vector to be ranked) in order to set up analyses on V2BR based on non-parametric methods, for some types of survival analysis and to derive key global quantiles (e.g. the median, lower (25%) and upper (75%) quartiles, and the 95% and 97.5% quantiles) across all sources simultaneously. In general, these global quantiles are different to the equivalent quantiles calculated independently in each data source separately and then summarised by their mean or median across the data sources.

The ranking procedure progresses in a series of steps controlled by ds.ranksSecure, and each step is enacted by a series of secondary functions which are a mixture of clientside and serverside functions (the latter including both assign and aggregate functions). The analytic steps are as follows:

(1) The quantileMeanDS (serverside aggregate) function is called to generate the global mean of V2BR (mean.input.var) and a deliberately non-conservative estimate of its standard deviation (max.sd.input.var) in order to intelligently prime the first stage of the encryption process (see below).

(2) The ds.dmtC2S (clientside) function returns the estimates of the global mean and standard deviation of V2BR to the serverside at each study.

(3) The minMaxRandDS (serverside aggregate) function uses a pseudo-random procedure based on a non-identifiable and non-repeatable seed to create a "random" maximum value in each study that is definitely more positive (by an indeterminate amount) than the actual maximum of V2BR in that study and an equivalent random minimum that is definitely more negative than the actual minimum. These are returned to the clientside from each study and an algorithm embedded in the code of ds.ranksSecure identifies the maximum (most positive) maximum across all studies and the corresponding minimum minimum (most negative) across all studies.

(4) The ds.dmtC2S (clientside) function returns the maximum maximum and minimum minimum to the serverside so all studies share the same values. These are not used further unless there is an explicit decision to declare missings (NAs) as high or low via the argument NA.manage = "NA.hi" or NA.manage = "NA.low", in which case the NAs in all studies will be replaced by the shared maximum maximum (if NA.manage = "NA.hi") or the shared minimum minimum (if NA.manage = "NA.low"). In the former case all NAs will appear as the highest ranked values in every study (with the same values and ranking in all studies); in the latter case they will be the lowest ranked values in all studies (but will again be the same in all studies). Please note that the precise values of the maximum maximum or minimum minimum do not matter (hence the effect of the analysis can be replicated precisely even though it invokes a non-repeatable seed, see above). This is because the actual analysis will ultimately be based on ranks, and the top and bottom rankings related to the maximum maximum and the minimum minimum will always be the same regardless of the actual quantitative values of the maximum maximum and minimum minimum.
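The claim that the precise value of the shared maximum does not matter can be checked with a few lines of R. This is a minimal sketch, not the package code: the helper replace_na_hi and the two candidate maxima are hypothetical, chosen only so that both exceed every real value.

```r
# Hypothetical helper: replace NAs by a shared value that is larger than
# every real observation (the role played by the "maximum maximum").
replace_na_hi <- function(x, shared.max) {
  x[is.na(x)] <- shared.max
  x
}

v <- c(4.2, NA, 1.5, NA, 9.8)

# Two different (arbitrary) shared maxima, both above max(v, na.rm = TRUE)
ranks.a <- rank(replace_na_hi(v, shared.max = 57.3))
ranks.b <- rank(replace_na_hi(v, shared.max = 1234.5))

# The ranks agree: the NAs tie for the top ranks whichever value is used
same.ranks <- identical(ranks.a, ranks.b)
```

Because only the ranks are carried forward, the non-repeatable seed used to pad the maximum has no effect on the final result.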

(5) This is one of the most fundamental steps. In it, the blackBoxDS (serverside assign) function is called with a series of the key arguments specified in the original call to ds.ranksSecure (see details of parameters in the function header) or else referring to objects newly created by the serverside functions described above. These arguments include: input.var.name, max.sd.input.var, mean.input.var, shared.seedval=shared.seed.value, synth.real.ratio, and NA.manage=NA.manage. The function blackBoxDS first generates a set of randomly synthesised pseudo-data that span the distribution of the real data, but with substantial overlap. The number of pseudo-observations generated is determined by the synth.real.ratio argument, which defaults to 2, so twice as many pseudo-data are generated in each study as there were originally real data.

If the original data are not continuous real numbers but are rounded to a given precision, blackBoxDS identifies the predominant form of rounding, e.g. integers, tens, thousands, 0.001, 0.1 etc. Then, rather than generating synthetic data completely at random (i.e. in double precision), blackBoxDS instead generates a vector of appropriately rounded terms: e.g. if the predominant rounding is tens, part of the random rounding vector may be something like ... 50,-30,0,100,40,90,-80 ... . The corresponding vector of synthetic data is then obtained by adding a random sample of the original data (in an indeterminate random order) to the random rounding data. Thus, if a sample of the original real data is ... 10,20,-60,0,30.23,-60,120.4 ... and the random rounding vector is as above, the relevant component of the resultant synthetic data will be ... 60,-10,-60,100,70.23,30,40.4 ... . This approach ensures the nature of the rounding in the synthetic data is broadly the same as in the original data. This is important because the degree of rounding influences the number of ties expected and if (in an illustrative extreme setting) the original data were all integers while the synthetic data were double-precision, most ties (even after encryption) would be identifiable as real data. This is a crucial strategy to mitigate disclosure risk in the step when encrypted original and synthetic data are co-located on the client (see below).
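The worked example above can be reproduced directly in R. The sketch below is illustrative only: the way random.tens is drawn (random multiples of ten between -100 and 100) and the seed are assumptions made for this example, not a description of how blackBoxDS generates its rounding vector.

```r
set.seed(42)  # arbitrary seed so the illustration is repeatable

# The sample of real data and the rounding vector quoted in the text
real.sample  <- c(10, 20, -60, 0, 30.23, -60, 120.4)
rounding.vec <- c(50, -30, 0, 100, 40, 90, -80)

# Synthetic data = real sample + rounding vector
synthetic <- real.sample + rounding.vec

# Hypothetical way of drawing a fresh rounding vector in units of ten,
# added to the real sample taken in a random order
random.tens      <- 10 * sample(-10:10, size = length(real.sample), replace = TRUE)
synthetic.random <- sample(real.sample) + random.tens
```

Because every rounding term is a multiple of ten, each synthetic value keeps the trailing digits of the real value it was built from, so the pattern of rounding (and hence the expected frequency of ties) stays similar to that of the real data.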

Next, blackBoxDS uses the global estimates of mean and standard deviation (as described above) to approximately (but deliberately not precisely) centralise the real and pseudo-data. It then sequentially applies 7 rounds of transformation to both the real and pseudo-data. With data held in either double precision or any level of rounding, every transformation algorithm faithfully maintains the rank order of the data. The first algorithm simply applies a probit transformation which converts the original approximately centralised real and pseudo-data from values typically running from -k through 0 up to +k to proportions which are all strictly greater than 0 and less than 1. This is not strictly encryption because it is a known deterministic transformation that could easily be reversed, but it sets up the data in a way that makes it easy to ensure that a range of encrypting transformations can then be applied that will definitely maintain the original rank order and cannot be replicated without knowing the order in which the transforming functions were applied and the value of the random parameters with which each transformation is associated.

To be specific, the next 6 rounds of encryption each invoke one of three monotonic functions with a single randomly selected parameter (lambda) associated with each transformation. Each value of lambda is drawn (independently) from a pseudo-random uniform(0.0001,1) distribution. So if the current element of the variable being transformed is x[i] the transformed value is obtained as x[i+1]<- x[i]^lambda (under algorithm 1); x[i+1]<- x[i]+lambda (under algorithm 2); and x[i+1]<- x[i]*lambda (under algorithm 3). The three algorithms are each applied twice in the block of six, but their order of application is randomly selected as well as the values of lambda. The randomisation process generating the values of lambda and the order of sequential application of the three algorithms is initialised via a seed shared across the different studies so each study applies the various algorithms in the same order with the same lambda as all the other studies.
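The six rounds described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the blackBoxDS source: the seed value, the object names (transforms, alg.order, lambdas), the order in which random numbers are drawn, and the use of pnorm() to produce starting values strictly between 0 and 1 are all choices made for this example.

```r
set.seed(1234)  # stands in for the shared seed (arbitrary value)

# Starting values strictly between 0 and 1, including one tie
x <- pnorm(c(-1.2, 0.3, 0.3, 2.1, -0.4))

# The three monotonic algorithms, each taking a single parameter lambda
transforms <- list(
  function(x, lambda) x^lambda,   # algorithm 1
  function(x, lambda) x + lambda, # algorithm 2
  function(x, lambda) x * lambda  # algorithm 3
)

# Each algorithm is used twice; the order of application is random
alg.order <- sample(rep(1:3, each = 2))
# One lambda per round, drawn from uniform(0.0001, 1)
lambdas <- runif(6, min = 0.0001, max = 1)

ranks.before <- rank(x)
for (i in seq_along(alg.order)) {
  x <- transforms[[alg.order[i]]](x, lambdas[i])
}
ranks.after <- rank(x)
```

All three functions are strictly increasing for positive inputs and positive lambda, so ranks.after matches ranks.before, ties included. Any study that starts from the same seed draws the same alg.order and lambdas, which is what makes the encrypted values comparable across studies.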

At present this shared seed is simply determined by the argument <shared.seed.value>. Even knowing that seed it would be exceedingly difficult to recover the original starting values of the input.var without having direct access to the real data on a server. This is because although the process generating the random order of the transformation procedures and the values of lambda for each transformation is the same in each study (initialised by the shared seed), the procedure leading to random generation of the pseudo-data is deliberately different in each study (the shared seed being first modified by a function that depends on the precise numeric characteristics of the variable being ranked [V2BR], before the randomisation sequence is started). This makes it more difficult for the analyst (working on the DataSHIELD client) to distinguish between encrypted real and encrypted pseudo-observations when they are transferred to the client (see below). Furthermore, when (at the very end of the running of ds.ranksSecure) the encrypted real data appear alone on the client (also see below) the data will first have been through a procedure that converts encrypted real values to ranks and then encrypts those ranks, and this makes reverse engineering near impossible even knowing the shared seed. However, to increase security yet further, in the immediate future we will be implementing a new function (already prototyped) that allows a number (the shared seed) to be securely shared across the studies without the client analyst being able to infer its value, even in theory.

Before moving to the next main step in the function (where data are transferred to the client) blackBoxDS checks that the ranks in the original data, the probit transformed data and the data following each of the six rounds of random transformation are all identical. This means the ranks, including ties in the ranks, are precisely the same for all of these eight vectors, in all studies. If they are not the same, a message is returned that, amongst other information, suggests you might try a different shared random seed. This is in case some exceedingly unlikely transformation has led to, for example, an NA. To date no example has arisen where this has happened. If it repeatedly fails despite using different seeds, it suggests there is something more fundamentally wrong with the data or analysis code which should be explored and corrected.

Finally, blackBoxDS writes the key components of its output to a data.frame called **"blackbox.output.df"** on the serverside. This contains 8 columns in each study: column 1 = vector containing all original real values and synthesised values in the given study; column 2 = the equivalent vector containing values after all seven rounds of transformation are complete; column 3 = the ranks of values in column 1 but, as the data frame is sorted (ascending) by the values in column 1, it simply runs 1:Ns where Ns is the total number of real and synthetic values in the given study; column 4 = ranks of column 2, which also runs 1:Ns because of the sorting of the data frame; column 5 holds the original sequential IDs for the data vector, consisting of values 1:Ns ordered in alignment with a vector containing all real values of input.var in their original order stacked over all synthetic values in their order of synthesis (the values 1:Ns in column 5 appear in a haphazard order, but by re-sorting on column 5 all of the data frame values relating to the real data and synthetic data can easily be separated and extracted in their original order); column 6 is a vector of 0s and 1s denoting whether each row relates to a real value or a synthetic value; column 7 is the same as columns 3 and 4 (i.e. 1:Ns) but it is called "ID.by.val" and can be used as the basis for re-sorting the data.frame back to value order (ascending) based on column 1 if it has been temporarily re-sorted to a different order; column 8 is called SOURCE.ID, which is a vector in which all elements take the value s in study s, where s denotes that the specified study is the sth listed study amongst the datasources used. Please note that columns 3, 4 and 7 are technically redundant and it is possible that only column 7 will be kept in later versions of this function.
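The role of the two ID columns can be illustrated with a small mock data frame. This is a sketch only: apart from ID.by.val, the column names (value, ID.seq, is.real) and the coding of 1 for real and 0 for synthetic rows are assumptions made for this example, and only a subset of the eight columns is reproduced.

```r
real  <- c(5.2, 1.1, 3.3)                      # real values, original order
synth <- c(2.0, 4.4, 0.7, 6.1, 3.9, 2.8)       # synthetic values, order of synthesis

df <- data.frame(
  value   = c(real, synth),                    # real stacked over synthetic
  ID.seq  = seq_along(c(real, synth)),         # original sequential ID (column 5 role)
  is.real = rep(c(1, 0), c(length(real), length(synth)))  # assumed coding
)

# Sort ascending by value, as the data frame is held on the server
df <- df[order(df$value), ]
df$ID.by.val <- seq_len(nrow(df))              # 1:Ns in value order (column 7 role)

# Re-sorting on the sequential ID recovers the original stacking,
# from which the real rows can be pulled out in their original order
restored  <- df[order(df$ID.seq), ]
real.back <- restored$value[restored$is.real == 1]

# Re-sorting on ID.by.val returns the data frame to value order
by.value <- restored[order(restored$ID.by.val), ]
```

Keeping both IDs means the data frame can be moved between original order and value order without ever needing to re-rank the values themselves.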

(6) Next the ranksSecureDS1 (serverside aggregate) function extracts columns 2 and 7 of the serverside data frame **"blackbox.output.df"** and transmits them to the clientside. These are the 7-fold transformed ('final-encrypted') original and synthetic values from each study in ascending order and the 1:Ns IDs sitting beside them. Although they carry equivalent information pertaining to ranking as the original real/synthesised data, their encryption ensures they are non-disclosive, and it is in any case impossible to separate the real and synthetic data.

Code in ds.ranksSecure next takes the 2 column data frame consisting of the final-encrypted values and values 1:Ns from study s and adds an extra column consisting solely of the value s. These three column study-specific data frames from each study are then stacked using the rbind() function. The resultant data frame is then reordered based on the global ranks of column 1 and a fourth column is added with values 1:M, where M is the sum of the number of real and synthetic values (Ns) across all studies s combined. This fourth column therefore holds the global ranks of the original values of input.var (the input variable) across all studies as well as the intervening global ranks for the synthetic data. Ranks are modified as would be expected if there are ties: e.g. if a short vector to be ranked was c(10,3,8,3,2) it would generate ranks c(5,2.5,4,2.5,1). The function ds.ranksSecure next creates a subset of the 4 column data frame for each study s. This contains values from study s only, obtained by selecting the rows in which column 3 takes the value s.
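The stacking and global ranking just described might look like this on the client. It is a minimal sketch with made-up encrypted values; the column names encrypted, studyid and global.rank are assumptions for this example (ID.by.val is the name given in the text).

```r
# Final-encrypted values (ascending) and 1:Ns IDs as received from two studies
study1 <- data.frame(encrypted = c(0.11, 0.35, 0.35, 0.80), ID.by.val = 1:4)
study2 <- data.frame(encrypted = c(0.05, 0.35, 0.62),       ID.by.val = 1:3)
per.study <- list(study1, study2)

# Add a third column holding the study number s
for (s in seq_along(per.study)) {
  per.study[[s]]$studyid <- s
}

# Stack, rank globally (ties get the average rank) and sort by global rank
stacked <- do.call(rbind, per.study)
stacked$global.rank <- rank(stacked$encrypted)
stacked <- stacked[order(stacked$global.rank), ]

# Study-specific subsets, ready to be returned to each server
by.study <- split(stacked, stacked$studyid)

# The tie-handling example quoted in the text
example.ranks <- rank(c(10, 3, 8, 3, 2))
```

Within each subset the rows are still in ascending value order, so the global ranks line up row for row with the serverside data frame they came from.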

(7) The ds.dmtC2S (clientside) function then returns each of the 4 column study-specific data frames (to be called sR4.df on the server) to the corresponding serverside for study s. It should be noted that each of these data frames is still in the original order (1:Ns) based on ascending sorting of the original input variable (real and synthetic) in study s, which is also, by definition, in the order of the corresponding global ranks across ALL studies. Because the variable numstudies, which denotes the number of studies and was used earlier, has been lost as serverside functions are called, run, and closed, the ds.dmtC2S (clientside) function is also used to retransmit the value of numstudies to the serverside.

(8) Next the ranksSecureDS2 (serverside assign) function is used to cbind the **blackbox.output.df** data frame to **sR4.df** in each study and then to select just the rows with real (rather than synthetic) data.

Because of the original sort (ascending) by the value of the encrypted input.var, blackbox.output.df and sR4.df in any given study have the same number of rows in the same order. In effect, this means that ranksSecureDS2 might most simply be viewed as stamping the global ranks on to blackbox.output.df and restricting the resultant data frame to include only the rows corresponding to real data. This is written as the data frame "sR5.df" on to the serverside for each separate study, and this forms the basis of the next major component of the ranking analysis. Note that at this stage, the global ranks (based on real and synthetic data) of all the real data are in precisely the same order as their equivalent global ranks would be if they were based solely on the real data. But the numeric value of the actual ranks will in general differ in the two settings (see below).

(9) The key component of sR5.df is the vector of global ranks in column 9, which is designated sR5.df$global.rank. The next step in analysis is based on a modified version of blackBoxDS (called blackBoxRanksDS, which is also a serverside assign function). The input.var to this modified function is now declared as sR5.df$global.rank and blackBoxRanksDS undertakes multiple rank-consistent transformations of this input variable in an equivalent way to blackBoxDS when applied to the original input.var. Following the transformation procedure the global ranks become one of the key components of a data frame designated blackbox.ranks.df that still remains in the order of the earlier processing steps (ascending on value of input.var, with any ties left in the same order as before).

Unlike the blackBoxDS function, blackBoxRanksDS does not generate any synthetic data. This means that when the encrypted ranks are transferred to the clientside in the next step, they are not obscured by any pseudo-data. But, because these are encrypted ranks, not encrypted quantitative values, even if someone could reverse engineer the encryption all they would get are ranks rather than actual values. In principle we already know that the ranks 1:M [incorporating possible ties] exist across all the studies, and deriving the actual ranks by reverse engineering would tell a hacker nothing particularly disclosive about the original values of real data. The use of ranks rather than quantitative values would not have been possible when the original encrypted values were transmitted to the clientside (see step 6) because in order to globally rank the data it was at some stage necessary for (encrypted) quantitative values to be compared from all studies simultaneously. If only study-specific ranks had been transferred to the clientside, all we would know would be that the ranks in each study ran from 1:Ns and it would be impossible to know, for example, whether the actual quantitative value corresponding to a given rank in one study was the same, higher or lower than the value corresponding to the same rank in another study.

(10) The next step, referred to under step 9, is based on the ranksSecureDS3 (serverside aggregate) function. This extracts the column of encrypted global ranks from blackbox.ranks.df and cbinds them to an ID vector 1:NRs (where NRs is the original number of real, not real+synthetic, observations in study s). This means that even tied ranks have unique ID values for re-sorting and checking the quality of linkage etc. The function ranksSecureDS3 also cbinds a studyid vector (all elements having the same value, s) to the encrypted global ranks vector and the ID vector. The resultant data frames from each study are copied into the clientside and then stacked across all studies using the rbind() function, and the combined data frame is sorted by the encrypted global ranks. A new global.real.rank is then allocated using R's standard ranking function: global.real.rank <- rank(global.rank). In essence this converts the original global ranks (which were based on ranking both real and synthetic data, and so ran sequentially from 1:M [with appropriate allocation for ties]) into new global ranks based only on the real data. These run from 1:MR where MR = the sum of NRs over all studies s. These two sets of ranks have precisely the same order, but the original range of real + synthetic ranks contains many values that are absent from the global.real.rank vector and so the numeric values of the two sets of ranks differ.
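The effect of re-ranking can be seen with a short made-up vector. In this sketch the values of global.rank are hypothetical: they stand for the global ranks of six real observations, with gaps where synthetic observations sat, and are shown unencrypted for readability (encryption preserves their order, so rank() gives the same answer either way).

```r
# Global ranks of the real rows only, taken from a ranking of real + synthetic data
global.rank <- c(2, 5.5, 5.5, 9, 14, 17)

# Re-rank using the real data alone: values now run over 1:MR (here MR = 6)
global.real.rank <- rank(global.rank)

# Same ordering before and after, but different numeric values
same.order <- identical(order(global.rank), order(global.real.rank))
```

The tie at 5.5 is carried through as a tie at 2.5, which is why the two rankings always agree on order even though their values differ.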

(11) As before (see step 7) the ds.dmtC2S (clientside) function is now used to return the key information in the clientside data frame to the serverside. The key information includes the encrypted global ranks (the global.real.ranks vector) and the studyid vector. Only rows where studyid=s are passed to the serverside in study s. In addition, by dividing the global.real.ranks by the total number of real observations across all studies (i.e. MR) an extra column is created called global.real.quantile. Together these three vectors form a data frame called "global.ranks.quantiles.df", which is what ds.dmtC2S actually passes to the serverside.

(12) The global.ranks.quantiles.df data frame returned to each study is still ordered by the value of the original real input.var data (ascending), i.e. with the same order and the same number of rows as blackbox.ranks.df at the end of step 10. In the current step, the ranksSecureDS4 (serverside assign) function cbinds the serverside data frame blackbox.ranks.df to the global.ranks.quantiles.df data frame. This combined data frame is then sorted either by the quantitative value of the original input data (ascending), i.e. the order is left unchanged from that of blackbox.ranks.df, or by the order of the original sequential ID applied to the real values of input.var, i.e. in the same order as the original real data. Which of these two sort orders is applied is determined by the argument <ranks.sort.by> in ds.ranksSecure, which can take the value "ID.orig" or "vals.orig". The former results in a data frame in the same order as the original input data, the latter in a data frame in which the magnitude of the original input.var and corresponding ranks increase consistently down the data frame. Regardless of how it is sorted, this data frame is then written to the serverside as the data frame output.ranks.df, which can be named as desired by the character string associated with the argument <output.ranks.df>.

(13) Next, ranksSecureDS5 (serverside assign function) takes the comprehensive data frame output.ranks.df and creates a new object summary.output.ranks.df (for which a desired name can be specified by the character string associated with the argument <summary.output.ranks.df>). This summary data frame contains only the most essential elements of the final ranking output. To be specific, in each study s, this consists of five columns: [1] ID.seq.real.orig, contains the values of the original sequential ID for the input data in study s. This will run sequentially from 1:NRs if ranks.sort.by = "ID.orig". On the other hand, it will generally appear haphazard if ranks.sort.by = "vals.orig"; [2] input.var.real.orig, contains the original (real) values of input.var in study s. These will be in increasing order if ranks.sort.by = "vals.orig" but the order will be the same as in the original input data if ranks.sort.by = "ID.orig"; [3] final.ranks.global, contains the global ranks (i.e. based upon the real data values across ALL studies) that correspond to the values in input.var.real.orig and will therefore be in order of increasing value if ranks.sort.by = "vals.orig" but will typically appear haphazard if ranks.sort.by = "ID.orig"; [4] final.quantiles.global, contains the quantile values corresponding to the final.ranks.global held in column 3. In association with column 2, these may now be used to estimate the true data values corresponding to key quantiles (across all studies) such as the median, quartiles and the 5% and 95% quantiles; [5] studyid, all elements take the value s in study s. **The data frame summary.output.ranks.df**, which is written to the serverside at each study (and contains the ranking data relating only to that study), may be viewed as the primary output from the whole ranking procedure. All other outputs from the analysis (including the more comprehensive data frame output.ranks.df) are tidied up by deletion at the end of the running of the function unless ds.ranksSecure's argument rm.residual.objects is set to FALSE.

(14) All or selected columns from the summary.output.ranks.df data frame can now be cbinded to the original input.var variable or to the data frame from which input.var was obtained in order to add its corresponding global ranks or global quantiles. These ranks can form the basis of a wide range of non-parametric analyses.

ds.extractQuantiles

(15) In addition to undertaking a standard non-parametric analysis based on the ranks generated by ds.ranksSecure, what is also sometimes needed is to calculate and quote the actual values (in V2BR) for key quantiles across all studies. This is enabled by the client function ds.extractQuantiles which is called at the end of ds.ranksSecure.

(16) Given that we have calculated the global quantile corresponding to every observation in every study, it would appear that this procedure should be straightforward. But, in practice, there is a challenge. Specifically, if we wish to calculate, for example, the global median, then in a study in which the summary.output.ranks.df data frame contains a value of precisely 0.500 in the final.quantiles.global column we can infer that the true global median is the corresponding element in the input.var.real.orig column. But in a study in which the final.quantiles.global column does not contain the precise value 0.500 the global median will be indeterminate. Similarly, if the true median actually falls at a value in the middle of a large tied cluster, it is conceivable it could take the quantile value 0.49 or 0.52 in several studies and no study will contain a quantile value of 0.500. So if all studies (and the client) are to have access to all the key quantile values, we need to identify these key values in a more sophisticated way.

(17) One of the arguments for ds.ranksSecure is <quantiles.for.estimation>. This identifies the key quantiles to be estimated and returned. The broadest range is selected by quantiles.for.estimation="0.025-0.975". This indicates estimation of the following quantiles: c(0.025,0.05,0.10,0.20,0.25,0.30,0.3333,0.40,0.50,0.60,0.6667,0.70,0.75,0.80,0.90,0.95,0.975). The alternative allowable options for the <quantiles.for.estimation> argument all represent narrower subsets of these values (see below for information about the parameter: quantiles.for.estimation). Two serverside aggregate functions (extractQuantilesDS1 and then extractQuantilesDS2) sequentially process the information in the summary.output.ranks.df data frame to generate the precise values of the input.var that correspond to each of the selected key quantile values. These values are available to the client because they are generated by an aggregate function, and the final results are also written to each data server as the data frame final.quantile.df, which contains two columns. The first of these is the vector evaluation.quantiles, which lists all of the key quantile values targeted by the argument <quantiles.for.estimation>, while the second is final.quantile.vector, which contains the precise values of input.var that correspond to the key quantiles in the first column.

The precise values in final.quantile.vector are identified in a two stage process (hence the two serverside functions). The first function (extractQuantilesDS1) identifies the lowest value of final.quantiles.global in each study that lies at or above each key target quantile value in the evaluation.quantiles vector. For convenience, these may be called: min.at.or.above.key.value_t,s for target quantile t in study s. The function extractQuantilesDS1 also identifies max.at.or.below.key.value_t,s, which is the corresponding maximum value in study s at or below the target value. These vectors are sent to the client which identifies, for each target value, the minimum value of min.at.or.above.key.value_t,s across all studies and the studies in which that value falls. Similarly, the maximum value of max.at.or.below.key.value_t,s across all studies is also identified. Please note that, in relation to any given threshold t, these two values may appear in different studies, and either or both can appear in several (or even all) studies. But that doesn't matter. Wherever the minimum minimum or maximum maximum value for a given threshold sits, the corresponding value of input.var is identified and recorded. Whenever one of these values occurs multiple times, we simply take the mean of the corresponding values of input.var which are, by definition, all the same and so the mean is the same as any individual value.

This process results in a list of two input.var values that span each threshold. One is the minimum minimum of the values that lie at or above the threshold and the other is the maximum maximum of the values that lie at or below the threshold. We then take the mean of these two values and that is declared as the value of input.var that corresponds to the particular threshold of interest. If the data are such that the precise value of the threshold appears somewhere in the final.quantiles.global column of at least one study (e.g. 0.500 for the median or 0.750 for the upper quartile) the input.var values corresponding to the minimum minimum and maximum maximum values are the same and both precisely equal to the value required for that given quantile. So their mean is clearly the correct value. If the precise value for the required quantile does not appear in the final.quantiles.global column in any study, then the required value of the input.var corresponding to the required threshold is derived as the mean of the closest two values on either side of the threshold, regardless of which studies they actually fall in. From a heuristic perspective this would appear to be a reasonable approach to estimating a value for a required quantile whenever it falls between two different values.
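The two-stage logic can be sketched in a single R session as follows. This is an illustration under stated assumptions, not the extractQuantilesDS1/extractQuantilesDS2 code: the helper names per_study_bounds and estimate_quantile are hypothetical, the data are made up, and the values from all studies are pooled in one place purely for demonstration (in real use the input.var values remain on the servers). Targets lying outside the range of observed quantiles are not handled.

```r
# Stage 1 (per study): nearest quantile at or above, and at or below, the target
per_study_bounds <- function(df, target) {
  q <- df$final.quantiles.global
  above <- q[q >= target]
  below <- q[q <= target]
  c(min.at.or.above = if (length(above)) min(above) else NA_real_,
    max.at.or.below = if (length(below)) max(below) else NA_real_)
}

# Stage 2: combine across studies and average the two spanning values
estimate_quantile <- function(studies, target) {
  bounds <- sapply(studies, per_study_bounds, target = target)
  hi.q <- min(bounds["min.at.or.above", ], na.rm = TRUE)  # minimum minimum
  lo.q <- max(bounds["max.at.or.below", ], na.rm = TRUE)  # maximum maximum
  pooled <- do.call(rbind, studies)                       # illustration only
  hi.val <- mean(pooled$input.var.real.orig[pooled$final.quantiles.global == hi.q])
  lo.val <- mean(pooled$input.var.real.orig[pooled$final.quantiles.global == lo.q])
  mean(c(lo.val, hi.val))
}

# Two mock studies; quantiles are global ranks divided by MR = 6
studies <- list(
  data.frame(input.var.real.orig = c(1, 3, 7),
             final.quantiles.global = c(1, 3, 5) / 6),
  data.frame(input.var.real.orig = c(2, 5, 9),
             final.quantiles.global = c(2, 4, 6) / 6)
)

median.est <- estimate_quantile(studies, 0.50)  # 0.5 occurs exactly in study 1
upper.est  <- estimate_quantile(studies, 0.75)  # falls between values 5 and 7
lower.est  <- estimate_quantile(studies, 0.25)  # falls between values 1 and 2
```

For the median the two spanning values coincide (both are 3), so the estimate is exact; for the quartiles the estimate is the midpoint of the two nearest values, which here sit in different studies.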

At this point a data frame is created called **final.quantile.df** in which the first column is the vector of quantile values to be estimated as indicated by the argument <quantiles.for.estimation> and the second column is the vector of input.var values that correspond to those values based on the methods described above. This is the primary output from ds.extractQuantiles and so is not deleted even if ds.ranksSecure's argument rm.residual.objects is set to TRUE (the default).