Platform for orchestrating Debian environments for reproducible and replicable Data Science

Speaker: px

Track: Systems administration, automation and orchestration

Type: Long talk (45 minutes)

Patients' health-related data is among the most sensitive data when it comes to privacy. As such, data science projects in the medical domain must overcome a very high barrier before data researchers can be given access to potentially personally identifiable data, or to pseudonymized patient data that carries an inherent risk of depseudonymization.

In the project “Data Science Orchestrator”, we are developing an organizational framework for ethically chaperoning and risk-managing such projects while they are under way. We are also proposing a software stack that aids in this task while providing an audit trail across the project that is verifiable even by partial-knowledge participants, and that keeps open the option of future reproducibility and replicability studies.

To this end, the software stack orchestrates tightly defined and self-documenting Debian environments called “Data Science Reproducible Execution Environments” (DS:RXEs), while the data lifecycle (with its confidentiality constraints) is tightly controlled with git-annex, creating a reliable audit log in the process.

Because collecting study-relevant data is often a time- and labor-intensive endeavour in the medical domain, many projects are undertaken by associations that span multiple hospitals, administrative domains, and often even multiple states. Therefore, the “Data Science Orchestrator” also allows for distributed data science computations, which can honor these existing administrative boundaries by means of a federated access model, all while keeping the most sensitive data in-house and exclusively in a tightly controlled computation environment.

URLs