Aug 05, 2021

Continuous Science - how Software Engineering Practices can solve Scientific Reproducibility

Software Engineering and Continuous Science

by Joseph Bloom, Data Scientist, Mass Dynamics

Introduction

Computational biologists spend more time benchmarking and assessing code than they do writing basic functionality.

Luckily for us, software engineers face many similar problems and have developed a myriad of solutions for ensuring that software continues to perform well over its lifetime.

Continuous Integration and Continuous Delivery (CI/CD) is the culmination of these solutions, allowing developers in the agile software development community to deliver faster and more safely than ever before.

CI/CD is a culture and collection of practices that has a great deal to offer the scientific community.

I’ve spent the past year working as a data scientist in industry, building and benchmarking bioinformatic pipelines alongside experienced software developers. During this time it has become clear to me that introducing CI/CD, or an analogous system, could greatly benefit scientific computing.

“Continuous Science”, as I call it, offers a chance to improve reproducibility, reduce waste and accelerate our collective understanding throughout data-driven science.

Current Challenges with Scientific Software

It is no secret that modern science is facing a reproducibility crisis, and scientific software is no exception.

While reproducing a published paper is difficult in general, reproducing computation is especially difficult to achieve. Often, a scientist must retrieve all of the open source code, all of the data inputs and all of the data outputs of an existing solution before they can perform a comparison to their new proposal.

Retrieving all of these resources is not only difficult and lengthy; it is often impossible because of poorly documented data, limited code sharing or version control, and a lack of expertise with prior solutions, which may be implemented in different programming languages.
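To make the retrieval problem concrete, here is a minimal sketch of one way a run could record which code version, inputs and outputs it used, so that a later comparison can check it is working from the same files. The manifest layout and the function names (`build_manifest`, `changed_files`) are assumptions made for illustration; they do not come from any particular tool or from the pipelines discussed in this post.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(code_version: str, inputs: list[Path], outputs: list[Path]) -> dict:
    """Record what a run used and produced (hypothetical manifest layout)."""
    return {
        "code_version": code_version,
        "inputs": {p.name: sha256_of(p) for p in inputs},
        "outputs": {p.name: sha256_of(p) for p in outputs},
    }


def changed_files(manifest: dict, inputs: list[Path], outputs: list[Path]) -> list[str]:
    """List the files whose current hash no longer matches the manifest."""
    current = build_manifest(manifest["code_version"], inputs, outputs)
    return [
        name
        for section in ("inputs", "outputs")
        for name, recorded in manifest[section].items()
        if current[section].get(name) != recorded
    ]
```

A manifest like this could be stored next to the results (for example as JSON) and checked before any benchmark comparison. It does not make a result reproducible on its own, but it turns "which data did they use?" into a question with a checkable answer.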

Challenges of reproducibility affect individuals acutely by slowing down their research and costing them valuable time that they must spend verifying software integrity.

The communal effects are more serious, though: the same delays slow entire fields and hinder progress in important scientific domains like proteomics.

Unfortunately, all this means that people like me who write scientific code often spend more time trying to achieve valid comparisons between algorithms and code than building new and exciting solutions.

This is incredibly wasteful when we consider how much duplicated effort is occurring across labs globally.

Furthermore, if it’s hard for the tool-builders to compare tools, it is hard to believe that users such as biologists can be properly informed about the tools they are using.

Enter CI/CD and the development (pun intended) of Continuous Science.

To continue reading, the full blog post, published on July 27th, 2021, can be found on Medium here.

Algorithms are his second language. He is a self-confessed hyper-verbalist and views life through the lens of data models. He loves to publish and is as eloquent with words as he is with numbers.