px

Philipp Kaluza received their Diploma in Informatics from the Technical University of Brunswick in 2010.
They work as a freelance IT consultant, currently mostly in the medical and research domain.
They have been using, deploying and administrating Linux and Debian since the late 90s, and haven't regretted a day of it - always empowered to make their own improvements to the ecosystem.

Accepted Talks:

Platform for orchestrating Debian environments for reproducible and replicable Data Science

Patients' health-related data is among the most sensitive data where privacy is concerned. As such, data science projects in the medical domain must overcome a very high barrier before data researchers may be granted access to potentially personally identifiable data, or to pseudonymized patient data that carries an inherent risk of depseudonymization.

In the project “Data Science Orchestrator”, we are developing an organizational framework for ethically chaperoning and risk-managing such projects while they are under way. We are also proposing a software stack that will aid in this task while at the same time providing an audit trail across the project that is verifiable even by partial-knowledge participants, and keeping the option for future reproducibility studies and replicability studies open.

To this end, the software stack orchestrates tightly defined and self-documenting Debian environments called “Data Science Reproducible Execution Environments” (DS:RXEs), while the data lifecycle (with its confidentiality constraints) is tightly controlled with git-annex, creating a reliable audit log in the process.
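The general shape of such a git-annex-controlled lifecycle can be sketched with stock git-annex commands. The repository name, remote description, and file name below are illustrative assumptions, not part of the project itself:

```
# Minimal sketch of a git-annex-managed data lifecycle
# (repository and file names are illustrative).
$ git init patient-study && cd patient-study
$ git annex init "analysis workstation"
# Register the input data; git-annex records a checksummed key
# in git history rather than the content itself.
$ git annex add cohort-2024.csv
$ git commit -m "Add pseudonymized cohort data"
# Require at least two verified copies before content may be dropped.
$ git annex numcopies 2
# The per-file location log doubles as an audit trail:
$ git annex whereis cohort-2024.csv
```

Because every add, transfer, and drop is recorded in git-annex's location log inside git history, participants can verify where data has been, even without holding the data themselves.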

Because collecting study-relevant data is often a time- and labor-intensive endeavour in the medical domain, many projects are undertaken by associations spanning multiple hospitals and administrative domains, often even multiple states. The “Data Science Orchestrator” therefore also allows for distributed data science computations, which can honor these existing administrative boundaries by means of a federated access model, all while keeping the most sensitive data in-house and exclusively within a tightly controlled computation environment.

Reopening the discussion on data distribution

Roughly a decade to a decade and a half ago, we had a discussion on “data-only” packages, prompted at the time by the presence of large game data files in the Debian archive. While it did not result in the proposed additional source package format, it helped shape thinking on this topic for quite a while.

While commercial CDNs do exist in 2024, distributing larger data sets remains an ongoing, unsolved problem for most of the world, newly exacerbated by the advent of huge deep learning models.

In this BoF we want to explore the possibility space for a stronger data distribution story within the Debian project (possibly utilizing git-annex, possibly other tools), all the while staying respectful of our mirror operators’ incredible commitment and limited resources.

We will also use the BoF to gauge interest in a longer-form workshop.

Data distribution hands-on

If there is enough interest in the preceding BoF, let’s get hands-on with git-annex et al., exchange workflow ideas, and maybe even plan new infrastructure to build.
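As a starting point for the workflow discussion, one possible distribution pattern uses a git-annex special remote alongside existing HTTP mirrors. The remote name, host, and URLs below are hypothetical placeholders:

```
# Illustrative sketch: publishing data through an rsync special
# remote (remote name, host, and URLs are hypothetical).
$ git annex initremote mirror type=rsync rsyncurl=example.org:/srv/data encryption=none
$ git annex copy bigdata.tar --to mirror
# Consumers who clone only the (small) git repository can then
# fetch the content on demand:
$ git annex get bigdata.tar
# Existing HTTP mirrors can also be attached as sources for a file:
$ git annex addurl --file bigdata.tar https://example.org/data/bigdata.tar
```

A design like this would let mirror operators serve annexed content with plain rsync or HTTP, without any git-annex-specific server software.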

Combining containerized desktop applications with a mature distribution model

We finally have most of the groundwork laid for sensibly containerizing desktop applications, which bring with them some desirable (and overdue) new security properties, often inspired by the mobile “app” ecosystems. Also inspired by these is a new software distribution model that promises software developers more immediate access to their end users, faster update cycles - and, in some cases, “paying only one bridge troll”.

Concerningly, these new systems (e.g. Flatpak and Snap) are sometimes used as a way of shirking the system integration responsibilities that Linux distributors have traditionally held. It is up to us to demonstrate once more that a Linux distribution like Debian is more than an annoying gatekeeper: that our methodical system integration and continuous security work are a value-add for the discerning customer (especially one who plans deployments in the traditional distributor/site administrator/system administrator tiers), and that we want to embrace newer security models and take on the challenge of updating our processes accordingly.

I hope to present some early work in this area, and get a wider discussion started within the Debian project.