Abstract
With an ever-increasing amount of available data, scalability problems are much
more of a concern than they were 20 years ago, and climate change calls
for greener computing. In bioinformatics, a few methods have become de facto
standards to approach specific issues such as gene-calling, multiple sequence
alignment, or protein domain annotation. These tools are now the basis of numerous
annotation pipelines that are executed thousands of times daily. Academia, however,
favors scientific novelty over efficiency, and merely being written in C or C++ is
more than enough for software to be categorized as high-performance.
In this work, we use profiling to identify critical parts of established
bioinformatics methods. We then employ several optimization techniques to
make these tools more efficient. We show how inlining a few functions halves
the runtime of the PRANK aligner; how caching alignment matrices used by trimAl
reduces the runtime 10-fold; how SIMD speeds up the gene scoring step of
Prodigal; and how parallel hashing in FastANI increases efficiency on
multi-core machines.
The goal of this software engineering experiment is to introduce some efficient
programming habits to the community, and to change how we view the ubiquitous
software we run in our pipelines without a second thought. More practically, we
provide several of these patches as Python packages to
be used as drop-in replacements for the originals. To put this in perspective, we
present some general figures on the energy cost of computing, and on
how much CO2 can be saved with the aforementioned optimizations.