Abstract

With an ever-increasing amount of available data, scalability is much more of a concern than it was 20 years ago, and climate change calls for greener computing. In bioinformatics, a few methods have become de facto standards for specific tasks such as gene calling, multiple sequence alignment, or protein domain annotation. These tools now form the basis of numerous annotation pipelines that are executed thousands of times daily. Academia, however, favors scientific novelty over efficiency, and being written in C or C++ is often considered more than enough for software to be categorized as high-performance.

In this work, we use profiling to identify the critical parts of established bioinformatics methods. We then employ several optimization techniques to make these tools more efficient. We show how inlining a few functions halves the runtime of the PRANK aligner; how caching the alignment matrices used by trimAl reduces its runtime 10-fold; how SIMD speeds up the gene scoring step of Prodigal; and how parallel hashing in FastANI improves its efficiency on multi-core machines.

The goal of this software engineering experiment is to introduce some efficient programming habits to the community, and to change how we look at the ubiquitous software we run in our pipelines without a second thought. More practically, we provide several of these patches as Python packages that can be used as drop-in replacements for the originals. To put this in perspective, we present some general figures about the energy cost of computing, and how much CO2 can be saved with the aforementioned optimizations.