Road map

Newcastle University is involved in several work packages and deliverables across multiple grants, with various stakeholders. This high-level roadmap/work programme groups these work packages and deliverables into broad projects that we are either currently working on or planning to do.


[Gantt-style roadmap, June 2019 to October 2021, with "Now" and DataSHIELD conference markers]

  • Testing: develop testing framework; write the remainder of tests.
  • Releases: write training material; prepare v5.0; v6.0 (DSI); v6.1; update training material; prepare v5.1; v6.0 training material update; future releases.
  • Infrastructure: develop and implement containerisation; develop DS Resources; server reporting and monitoring.
  • GUI: spec GUI; develop GUI.
  • Policy development: initiate steering committee.
  • Non-parametric: develop first version of non-parametric package.
  • dsOmics: dsOmics development.

Current work 

Testing framework 


Key stakeholders: All DataSHIELD users 

Background 

  • We have no standard or automatic way of knowing whether a change to a DataSHIELD function introduces errors, either by failing in unexpected ways or by giving the wrong answer. Because end users never see the raw data, they may not notice when something is wrong, so problems could go unnoticed for a long time.
  • The same argument applies if anything in the tool stack changes (the operating system, C libraries, R libraries, Opal, Java, etc.).
  • We will require tests to be run when new code is developed, and tests to pass before new code is accepted into the master branch.
  • We will require automated continuous integration to run tests regularly (likely daily) to pick up regressions when anything changes.
  • We are using testthat as the testing framework.
  • The number and status of tests are a KPI for EUCAN WP5.

Aims

  • An easy way for DataSHIELD function developers to check that their code works and does not break other functionality downstream.
  • An automatic way to test the entire stack for regressions.
  • Currently planned classes of tests:
    • Test files are syntactically correct.
    • Test that all relevant imports have been declared.
    • Test that answers are mathematically sensible (e.g. no standard deviation returning negative values).
    • Test that answers from DataSHIELD are the same as those from native R (see the sketch after this list).
    • Test that answers are correct on both a single and a multiply partitioned data set.
    • Test behaviour for unexpected input arguments.

    • Test adherence to disclosure control settings.

    • Possibly more...
  • A measure of code coverage.
  • A simple digest of test status across all functions.
  • Runs across a range of (specified) versions of key software (DataSHIELD, Opal, R, others?).
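
As an illustration of the "same as native R" class of test, a minimal testthat sketch is shown below. The connection list `conns`, the local reference copy `local_data` and the client-side function `ds.exampleMean()` are hypothetical stand-ins for whatever the real test harness provides.

```r
library(testthat)

test_that("server-side mean matches mean() on a local copy of the data", {
  # 'conns' (a list of DSConnection objects), the server-side data frame 'D'
  # and its local reference copy 'local_data' are set up by the test harness;
  # ds.exampleMean() is a hypothetical client-side aggregate function.
  ds_result <- ds.exampleMean(x = "D$age", datasources = conns)
  r_result  <- mean(local_data$age, na.rm = TRUE)
  expect_equal(ds_result, r_result, tolerance = 1e-10)
})
```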

Status 

  • We have over 6,000 tests.
  • They can be run with standard testthat on any DataSHIELD install.
  • We use Azure Pipelines for continuous integration, which builds a sample DataSHIELD install from scratch, with bleeding-edge versions of everything, on a vanilla Ubuntu 16.04 VM.
  • There is a public-facing status page at https://datashield.github.io/testStatus/

Planned work 

  • Write documentation to make it easy for function developers to write tests.
  • Roll out tests for all functions.
  • Aim for high test coverage.

Possible future work

  • Add Docker to the testing framework.
  • Add DSLite to the testing framework (see the sketch after this list).
  • Add dsDanger to CI so that settings can be changed, to test how this affects the stack (e.g. disclosure settings).
  • Run matrix builds across different OS versions, Opal versions, R versions and DataSHIELD versions.
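
A minimal sketch of how DSLite could be added as an in-process test target, assuming the login pattern documented for the DSLite and DSI packages and using mtcars as a stand-in data set:

```r
library(DSLite)
library(DSI)

# Serve a stand-in data set from an in-process DSLite server (no Opal needed).
dslite.server <- newDSLiteServer(tables = list(mtcars = mtcars))

builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", url = "dslite.server",
               table = "mtcars", driver = "DSLiteDriver")
logindata <- builder$build()

# 'conns' can then be passed to client-side functions inside testthat tests.
conns <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
```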

 


Minor DataSHIELD releases (6.1 onwards)

Key stakeholders: All DataSHIELD users

Background

  • There is a continual stream of requests for new functions which needs to be managed.

Required work/decisions to be made

  • We need a prioritisation mechanism for deciding which functions to develop.


Minor DataSHIELD release policy (v6.1 onwards) 

Key stakeholders: All DataSHIELD users 

Background 

  • After the v6.0 release we will aim to release DataSHIELD functions more often. 

Required work/decisions to be made 

  • Need to have a policy for moving functions into the main DataSHIELD repos. Points to include:
    • Which classes of tests should be required? 
    • Require valid examples? 
    • Should multiple people review the code? 
    • Who decides when it is time to move it? Just the NU team or wider community? 
    • How do we manage upgrades? Replacing a function in e.g. dsBase might break working projects. 
    • How do we inform all the consortia of an upgrade? 
  • We suggest starting to discuss this now, with the intention of having a draft policy in place by the DataSHIELD workshop in September 2020 and enacting the policy from DataSHIELD v6.1. 


dsOmics

Key stakeholders: EUCAN, ATHLETE

Background

  • There is a strong desire to be able to do omics analysis with DataSHIELD.

Status

  • A huge amount of development work has gone into this, making use of the newly developed resources feature in Opal (see the sketch below).
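
As a hedged illustration of the resources workflow that dsOmics builds on: assuming an existing connection list `conns` and a hypothetical resource identifier `RSRC.example_vcf` registered in Opal, a client might assign and coerce the resource roughly as follows.

```r
library(DSI)

# Assign a server-side resource (e.g. a VCF file registered in Opal) to the
# symbol 'res'; 'RSRC.example_vcf' is a hypothetical project.resource name.
datashield.assign.resource(conns, symbol = "res", resource = "RSRC.example_vcf")

# Coerce the resource into a usable R object on the server side;
# as.resource.object() is provided by the resourcer package on the server.
datashield.assign.expr(conns, symbol = "vcf", expr = quote(as.resource.object(res)))
```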

Non-parametric package

Background

  • There is a desire to be able to do non-parametric analysis with DataSHIELD. This is difficult to do in a non-disclosive way, since many of the existing algorithms rely on having access to all the data.

Status

  • This is being actively worked on.

Future work 

  

DataSHIELD Graphical User Interface (GUI) 

Key stakeholders: NU, EUCAN (deliverable), TRUST. 

Background 

  • A GUI has been needed for a long time, for a variety of reasons: 
    • A menu-driven, simple interface. 
    • A reporting interface. 
    • Hooks for piping into VR. 

Status 

  • We submitted two grant applications to get extra funding for this; both were unsuccessful. We are now starting to work with the HCI team at NU to develop a specification.

 

DataSHIELD server status at remote sites 

Key stakeholders: NU, EUCAN. 

Background 

  • It is hard to keep track of which DataSHIELD servers are up, and data providers often do not know when there is a problem (as was demonstrated in BioSHARE). This will be a big problem with the number of sites in EUCAN-CONNECT. 
  • We implemented Nagios as a monitoring solution towards the end of BioSHARE. 
  • Nagios can monitor almost anything; we monitored CPU, RAM, disk usage and Opal accessibility. Status checks can be arbitrary: for example, whether Opal is up, whether a simple R command runs, which version of R is installed on the server, which version of DataSHIELD is installed, etc. (see the sketch below). 
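
As an illustration, a minimal sketch of a Nagios-style check, written in R, that tests whether an Opal server responds over HTTP. The URL is a hypothetical example, and the exit codes follow the standard Nagios convention (0 = OK, 2 = CRITICAL):

```r
#!/usr/bin/env Rscript
# Minimal Nagios-style reachability check for an Opal server.
# Usage: check_opal.R https://opal.example.org   (URL is a hypothetical example)
library(httr)

url <- commandArgs(trailingOnly = TRUE)[1]
exit_code <- tryCatch({
  resp <- GET(url, timeout(10))
  if (status_code(resp) < 400) 0L else 2L  # 0 = OK, 2 = CRITICAL
}, error = function(e) 2L)

cat(sprintf("%s - %s\n", if (exit_code == 0L) "OPAL OK" else "OPAL CRITICAL", url))
quit(status = exit_code, save = "no")
```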

Required work/decisions to be made 

  • Should we resurrect this work? 
  • The checks were run as cron jobs on the VMs; how would this work with containers? 

 

DataSHIELD integration with health text 

Background 

  • We want to integrate natural language processing of health-related text data into DataSHIELD. 
  • This will need new libraries, etc. 

Status 

  • We have not started actively developing this yet. 

 

