Aug 05, 2021

Continuous Science - how Software Engineering Practices can solve Scientific Reproducibility

Software Engineering and Continuous Science

by Joseph Bloom, Data Scientist, Mass Dynamics

Introduction

Computational biologists spend more time benchmarking and assessing code than they do writing basic functionality.

Luckily for us, software engineers face many similar problems and have developed a myriad of solutions to ensure that software continues to perform well over its lifetime.

Continuous Integration and Continuous Delivery (CI/CD) is the culmination of these solutions, allowing software developers in the agile software development community to deliver solutions faster and more safely than ever before.

CI/CD is a culture and collection of practices which has an immense amount to offer the scientific community.

I’ve spent the past year working as a data scientist in industry, building and benchmarking bioinformatics pipelines alongside experienced software developers. During this time, it has become clear to me that introducing CI/CD, or an analogous system, to scientific computing could offer a great deal.

“Continuous Science”, as I call it, offers a chance to improve reproducibility, reduce waste and accelerate our collective understanding throughout data-driven science.

Current Challenges with Scientific Software

It is no secret that modern science is facing a reproducibility crisis, and scientific software is no exception.

While reproducing a published paper is difficult in general, reproducing computation is especially difficult to achieve. Often, a scientist must retrieve all of the open source code, all of the data inputs and all of the data outputs of an existing solution before they can perform a comparison to their new proposal.

Retrieving all of these resources is difficult and slow at best. Often it is not possible at all, because of poor documentation of data, limited code sharing or version control, and a lack of expertise with prior solutions, which may be implemented in different programming languages.
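To make the comparison step concrete, here is a minimal Python sketch of the kind of check involved: fingerprint the exact input data, then compare a re-run against previously published outputs within a tolerance. Everything in it is an assumption for illustration. The function names `fingerprint` and `compare_to_baseline`, the tolerance, and the protein values are hypothetical and do not come from any particular pipeline.

```python
import hashlib
import math


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest identifying an exact input or output."""
    return hashlib.sha256(data).hexdigest()


def compare_to_baseline(baseline, candidate, rel_tol=1e-6):
    """List the keys whose values differ between two result dictionaries.

    A key missing from either side counts as a difference.
    """
    differing = []
    for key in sorted(set(baseline) | set(candidate)):
        if key not in baseline or key not in candidate:
            differing.append(key)
        elif not math.isclose(baseline[key], candidate[key], rel_tol=rel_tol):
            differing.append(key)
    return differing


# Hypothetical published results and a re-run of the same analysis.
published = {"protein_A": 1.25, "protein_B": -0.40}
rerun = {"protein_A": 1.25, "protein_B": -0.35}

# Record exactly which input produced the results being compared.
input_id = fingerprint(b"sample,intensity\ns1,1000\n")
mismatches = compare_to_baseline(published, rerun)
print(input_id[:12], mismatches)
```

A check like this only works when the inputs, the outputs and the code version are all recorded alongside each other, which is exactly what is so often missing.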

Challenges of reproducibility affect individuals acutely by slowing down their research and costing them valuable time that they must spend verifying software integrity.

The communal effects are more serious, though, because the same friction slows entire fields and hinders progress in important scientific domains like proteomics.

Unfortunately, all this means that people like myself who write scientific code can often spend more time trying to achieve valid comparisons between algorithms and code than building new and exciting solutions.

This is incredibly wasteful when we consider how much duplicated effort is occurring across labs globally.

Furthermore, if it’s hard for the tool-builders to compare, then it is hard to believe that users such as biologists can be properly informed about the tools they are using.

Enter CI/CD and the development (pun intended) of Continuous Science.

To continue reading, the full blog post, published on July 27th, 2021, can be found on Medium here.

Algorithms are his second language. He is a self-confessed hyper-verbalist and views life through the lens of data models. He loves to publish and is as eloquent with words as he is with numbers.
