[PREPRINT] Machine learning inference of natural product chemistry across biosynthetic gene cluster types

Larralde M, Zeller G, bioRxiv (2025).

Abstract

With ever-increasing volumes of sequencing data for biosynthetic gene clusters (BGCs), computational methods that accurately predict which secondary metabolites these clusters produce are critically lacking. Here, we present CHAMOIS, a machine learning-based tool for predicting chemical properties of secondary metabolites from protein domains annotated in the input BGCs. CHAMOIS infers 485 chemical properties from the ChemOnt ontology using logistic regression. It accurately predicts 111 of these properties (AUPRC > 0.5) in cross-validation against known instances. Although CHAMOIS is not explicitly trained on biosynthetic knowledge, many of the inferred links between protein domains and metabolite properties are consistent with the scientific literature, while others suggest new biochemical functions for uncharacterised biosynthetic domains. Finally, CHAMOIS can pinpoint which BGC within a given genome produces a pre-specified metabolite (the correct BGC ranks among the top 5 in 69% of cases), which holds great potential for prioritising experimental BGC characterisation and the discovery of novel biosynthetic enzymes.
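
To make the idea in the abstract concrete, the following is a minimal sketch (not the CHAMOIS implementation) of per-property logistic regression over domain presence, followed by ranking candidate BGCs against a query metabolite. Everything specific here is an assumption made for illustration: the domain labels, the property names, the weights and biases, and in particular the scoring rule (log-likelihood of the query's property profile under the predicted probabilities), which the abstract does not specify.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_properties(domains, weights, biases):
    """Apply one independent logistic model per chemical property.

    `domains` is the set of protein domains annotated in a BGC; each
    domain present contributes its weight to the property's score.
    """
    probs = {}
    for prop, w in weights.items():
        z = biases[prop] + sum(w.get(d, 0.0) for d in domains)
        probs[prop] = sigmoid(z)
    return probs

def rank_bgcs(bgcs, query_props, weights, biases):
    """Rank BGCs by how well predicted properties match the query.

    ASSUMPTION: the score is the Bernoulli log-likelihood of the query
    metabolite's property set; this is a hypothetical scoring rule.
    """
    scored = []
    for name, domains in bgcs.items():
        probs = predict_properties(domains, weights, biases)
        score = sum(
            math.log(p if prop in query_props else 1.0 - p)
            for prop, p in probs.items()
        )
        scored.append((score, name))
    scored.sort(reverse=True)  # highest log-likelihood first
    return [name for _, name in scored]

# Hypothetical toy model: three properties, hand-picked weights.
weights = {
    "peptide":   {"Condensation": 2.0, "AMP-binding": 1.5},
    "macrolide": {"KS": 2.0, "AT": 1.5},
    "glycoside": {"Glycos_transf": 2.5},
}
biases = {"peptide": -2.0, "macrolide": -2.0, "glycoside": -2.0}

# Hypothetical BGCs from one genome, each given as a set of domain labels.
bgcs = {
    "bgc_A": {"Condensation", "AMP-binding"},
    "bgc_B": {"KS", "AT", "Glycos_transf"},
    "bgc_C": {"KS"},
}

# Query metabolite described by the properties it is known to have.
query = {"macrolide", "glycoside"}
ranking = rank_bgcs(bgcs, query, weights, biases)
print(ranking)
```

In this toy setting, a top-k evaluation of the kind reported in the abstract would check whether the true producer appears within the first k entries of `ranking`; a real model would learn the weights from annotated BGC and metabolite pairs rather than use hand-set values.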