by Joseph Bloom, Data Scientist, Mass Dynamics
Introduction
Computational biologists often spend more time benchmarking and assessing code than they do writing basic functionality.
Luckily for us, software engineers face many similar problems and have developed myriad solutions for ensuring that software continues to perform well over its lifetime.
Continuous Integration and Continuous Delivery (CI/CD) is the culmination of these solutions, allowing software developers in the agile software development community to deliver solutions faster and more safely than ever before.
CI/CD is a culture and collection of practices which has an immense amount to offer the scientific community.
I’ve spent the past year working as a data scientist, building and benchmarking bioinformatic pipelines whilst working among experienced software developers in industry. During this time, it has become clear to me that introducing CI/CD, or an analogous system, could greatly benefit scientific computing.
“Continuous Science”, as I call it, offers a chance to improve reproducibility, reduce waste and accelerate our collective understanding throughout data-driven science.
Current Challenges with Scientific Software
It is no secret that modern science is facing a reproducibility crisis, and scientific software is no exception.
While reproducing a published paper is difficult in general, reproducing computation is especially difficult to achieve. Often, a scientist must retrieve all of the open source code, all of the data inputs and all of the data outputs of an existing solution before they can perform a comparison to their new proposal.
Retrieving all of these resources is not only difficult and time-consuming; it is often impossible because of poor documentation of data, limited code sharing or version control, and a lack of expertise with prior solutions, which may be implemented in different programming languages.
Challenges of reproducibility affect individuals acutely by slowing down their research and costing them valuable time that they must spend verifying software integrity.
The communal effects are more serious, though: poor reproducibility slows entire fields and hinders progress in important scientific domains like proteomics.
Unfortunately, all this means that people like me who write scientific code often spend more time trying to achieve valid comparisons between algorithms and code than building new and exciting solutions.
This is incredibly wasteful when we consider how much duplicated effort is occurring across labs globally.
Furthermore, if it’s hard for the tool-builders to compare tools, it is hard to believe that users such as biologists can be properly informed about the tools they are using.
Enter CI/CD and the development (pun intended) of Continuous Science.
To continue reading, see the full blog post, published on Medium on July 27th, 2021.