Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra

Dührkop, Kai; Nothias, Louis-Félix; Fleischauer, Markus; Reher, Raphael; Ludwig, Marcus; Hoffmann, Martin A.; Petras, Daniel; Gerwick, William H.; Rousu, Juho; Dorrestein, Pieter C.; Böcker, Sebastian

doi:10.1038/s41587-020-0740-8

Article
Published: 23 November 2020

Nature Biotechnology volume 39, pages 462–471 (2021)

Abstract

Metabolomics using nontargeted tandem mass spectrometry can detect thousands of molecules in a biological sample. However, structural molecule annotation is limited to structures present in libraries or databases, restricting analysis and interpretation of experimental data. Here we describe CANOPUS (class assignment and ontology prediction using mass spectrometry), a computational tool for systematic compound class annotation. CANOPUS uses a deep neural network to predict 2,497 compound classes from fragmentation spectra, including all biologically relevant classes. CANOPUS explicitly targets compounds for which neither spectral nor structural reference data are available and predicts classes lacking tandem mass spectrometry training data. In evaluation using reference data, CANOPUS reached very high prediction performance (average accuracy of 99.7% in cross-validation) and outperformed four baseline methods. We demonstrate the broad utility of CANOPUS by investigating the effect of microbial colonization in the mouse digestive system, through analysis of the chemodiversity of different Euphorbia plants and regarding the discovery of a marine natural product, revealing biological insights at the compound class level.

**Fig. 2: Method evaluation: number of ClassyFire compound classes predicted with a particular performance measure.**

**Fig. 3: Comparing the digestive system of GF and SPF mice.**

**Fig. 4: Molecular network of daidzein.**

**Fig. 5: Compound class distribution in *Euphorbia* species.**

**Fig. 6: Structural analysis of rivulariapeptolide 1155 using CANOPUS.**

**Fig. 7: Heterogeneous training for compound class prediction.**

Data availability

Input mzML/mzXML files are available at MassIVE (https://massive.ucsd.edu/) with the accession nos. MSV000079949 (mice data) and MSV000081082 (Euphorbia data). The mass spectrometry data for Rivularia sp. cyanobacteria were deposited at MassIVE (accession no. MSV000085578). The spectra for rivulariapeptolide 1155 were annotated in the GNPS spectral library (accession nos. CCMSLIB00005723986 and CCMSLIB00005723388). The structure database with ClassyFire annotations, the publicly available part of the evaluation data and the Cytoscape files for network visualization can be downloaded from https://bio.informatik.uni-jena.de/data/ and https://doi.org/10.6084/m9.figshare.13073051. Source data are provided with this paper.

Code availability

CANOPUS is part of SIRIUS software and can be downloaded from https://bio.informatik.uni-jena.de/software/canopus/. The source code of CANOPUS is available at https://github.com/boecker-lab/sirius-libs. The scripts for analysis and visualization of CANOPUS results are available at https://github.com/kaibioinfo/canopus_treemap.

References

Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45, 703–714 (2010).
Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).
Guijas, C. et al. METLIN: a technology platform for identifying knowns and unknowns. Anal. Chem. 90, 3156–3164 (2018).
Kind, T. et al. Identification of small molecules using accurate mass MS/MS search. Mass Spectrom. Rev. 37, 513–532 (2018).
Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI–MS/MS spectra for putative metabolite identification. Metabolomics 11, 98–110 (2015).
Brouard, C. et al. Fast metabolite identification with Input Output Kernel Regression. Bioinformatics 32, i28–i36 (2016).
Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).
Ridder, L. et al. Automatic chemical structure annotation of an LC-MSⁿ based metabolic profile from green tea. Anal. Chem. 85, 6033–6040 (2013).
Ruttkies, C., Schymanski, E. L., Wolf, S., Hollender, J. & Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminform. 8, 3 (2016).
Tsugawa, H. et al. Hydrogen rearrangement rules: computational MS/MS fragmentation and structure elucidation using MS-FINDER Software. Anal. Chem. 88, 7946–7958 (2016).
Schymanski, E. L. et al. Critical assessment of small molecule identification 2016: automated methods. J. Cheminform. 9, 22 (2017).
Blaženović, I., Kind, T., Ji, J. & Fiehn, O. Software tools and approaches for compound identification of LC-MS/MS data in metabolomics. Metabolites 8, 31 (2018).
Silva, R. R., Dorrestein, P. C. & Quinn, R. A. Illuminating the dark matter in metabolomics. Proc. Natl Acad. Sci. USA 112, 12549–12550 (2015).
Tsugawa, H. Advances in computational metabolomics and databases deepen the understanding of metabolisms. Curr. Opin. Biotechnol. 54, 10–17 (2018).
Montenegro-Burke, J. R., Guijas, C. & Siuzdak, G. METLIN: a tandem mass spectral library of standards. Methods Mol. Biol. 2104, 149–163 (2020).
Vinaixa, M. et al. Mass spectral databases for LC/MS- and GC/MS-based metabolomics: state of the field and future prospects. Trends Anal. Chem. 78, 23–35 (2016).
Aksenov, A. A., Silva, R., Knight, R., Lopes, N. P. & Dorrestein, P. C. Global chemical analysis of biology by mass spectrometry. Nat. Rev. Chem. 1, 0054 (2017).
Frainay, C. et al. Mind the gap: mapping mass spectral databases in genome-scale metabolic networks reveals poorly covered areas. Metabolites 8, 51 (2018).
Venkataraghavan, R., McLafferty, F. W. & Lear, G. E. Computer-aided interpretation of mass spectra. Org. Mass Spectrom. 2, 1–15 (1969).
Curry, B. & Rumelhart, D. E. MSnet: a neural network that classifies mass spectra. Tetrahedron Comput. Methodol. 3, 213–237 (1990).
Werther, W., Lohninger, H., Stancl, F. & Varmuza, K. Classification of mass spectra: a comparison of yes/no classification methods for the recognition of simple structural properties. Chemom. Intell. Lab. Syst. 22, 63–76 (1994).
Heinonen, M., Shen, H., Zamboni, N. & Rousu, J. Metabolite identification and molecular fingerprint prediction via machine learning. Bioinformatics 28, 2333–2341 (2012).
Hastings, J. et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res. 41, D456–D463 (2013).
Rogers, F. B. Communications to the editor. Bull. Med. Libr. Assoc. 51, 114–116 (1963).
Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminform. 8, 61 (2016).
Blaženović, I. et al. Structure annotation of all mass spectra in untargeted metabolomics. Anal. Chem. 91, 2155–2162 (2019).
Ernst, M. et al. Assessing specialized metabolite diversity in the cosmopolitan plant genus Euphorbia L. Front. Plant Sci. 10, 846 (2019).
Tsugawa, H. et al. A cheminformatics approach to characterize metabolomes in stable-isotope-labeled organisms. Nat. Methods 16, 295–298 (2019).
Barupal, D. K. & Fiehn, O. Chemical Similarity Enrichment Analysis (ChemRICH) as alternative to biochemical pathway mapping for metabolomic datasets. Sci. Rep. 7, 14567 (2017).
Rasche, F. et al. Identifying the unknowns by aligning fragmentation trees. Anal. Chem. 84, 3417–3426 (2012).
Treutler, H. et al. Discovering regulated metabolite families in untargeted metabolomics studies. Anal. Chem. 88, 8082–8090 (2016).
Ernst, M. et al. MolNetEnhancer: enhanced molecular networks by integrating metabolome mining and annotation tools. Metabolites 9, 144 (2019).
Lowry, S. R. et al. Comparison of various K-nearest neighbor voting schemes with the self-training interpretive and retrieval system for identifying molecular substructures from mass spectral data. Anal. Chem. 49, 1720–1722 (1977).
Askenazi, M. & Linial, M. ARISTO: ontological classification of small molecules by electron ionization-mass spectrometry. Nucleic Acids Res. 39, W505–W510 (2011).
Peters, K. et al. Chemical diversity and classification of secondary metabolites in nine bryophyte species. Metabolites 9, 222 (2019).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451 (1975).
Wolf, S., Schmidt, S., Müller-Hannemann, M. & Neumann, S. In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinformatics 11, 148 (2010).
Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).
Cooper, B. T. et al. Hybrid search: a method for identifying metabolites absent from tandem mass spectrometry libraries. Anal. Chem. 91, 13924–13932 (2019).
Allard, P.-M. et al. Integration of molecular networking and in-silico MS/MS fragmentation for natural products dereplication. Anal. Chem. 88, 3317–3323 (2016).
Silva, R. R. et al. Propagating annotations of molecular networks using in silico fragmentation. PLoS Comput. Biol. 14, e1006089 (2018).
Fox Ramos, A. E. et al. CANPA: computer-assisted natural products anticipation. Anal. Chem. 91, 11247–11252 (2019).
Quinn, R. A. et al. Global chemical effects of the microbiome include new bile-acid conjugations. Nature 579, 123–129 (2020).
Minamida, K. et al. Production of equol from daidzein by Gram-positive rod-shaped bacterium isolated from rat intestine. J. Biosci. Bioeng. 102, 247–250 (2006).
Quinn, R. A. et al. Molecular networking as a drug discovery, drug metabolism, and precision medicine strategy. Trends Pharmacol. Sci. 38, 143–154 (2017).
Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
van der Hooft, J. J. J., Wandy, J., Barrett, M. P., Burgess, K. E. V. & Rogers, S. Topic modeling for untargeted substructure exploration in metabolomics. Proc. Natl Acad. Sci. USA 113, 13738–13743 (2016).
Vasas, A. & Hohmann, J. Euphorbia diterpenes: isolation, structure, biological activity, and synthesis (2008–2012). Chem. Rev. 114, 8579–8612 (2014).
Yang, M. et al. Studies on the fragmentation pathways of ingenol esters isolated from Euphorbia esula using IT-MSn and Q-TOF-MS/MS methods in electrospray ionization mode. Int. J. Mass Spectrom. 323-324, 55–62 (2012).
Riina, R. et al. A worldwide molecular phylogeny and classification of the leafy spurges, Euphorbia subgenus Esula (Euphorbiaceae). TAXON 62, 316–342 (2013).
Horn, J. W. et al. Phylogenetics and the evolution of major structural characters in the giant genus Euphorbia L. (Euphorbiaceae). Mol. Phylogenet. Evol. 63, 305–326 (2012).
Horn, J. W. et al. Evolutionary bursts in Euphorbia (Euphorbiaceae) are linked with photosynthetic pathway. Evolution 68, 3485–3504 (2014).
Peirson, J. A., Bruyns, P. V., Riina, R., Morawetz, J. J. & Berry, P. E. A molecular phylogeny and classification of the largely succulent and mainly African Euphorbia subg. Athymalus (Euphorbiaceae). TAXON 62, 1178–1199 (2013).
Dorsey, B. L. et al. Phylogenetics, morphological evolution, and classification of Euphorbia subgenus Euphorbia. TAXON 62, 291–315 (2013).
Yang, Y. et al. Molecular phylogenetics and classification of Euphorbia subgenus Chamaesyce (Euphorbiaceae). TAXON 61, 764–789 (2012).
Pluskal, T., Castillo, S., Villar-Briones, A. & Oresic, M. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics 11, 395 (2010).
Nothias, L.-F. et al. Feature-based molecular networking in the GNPS analysis environment. Nat. Methods 17, 905–908 (2020).
Schmid, R. et al. Ion identity molecular networking in the GNPS Environment. Preprint at bioRxiv https://doi.org/10.1101/2020.05.11.088948 (2020).
Röst, H. L. et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat. Methods 13, 741–748 (2016).
Benton, H. P., Wong, D. M., Trauger, S. A. & Siuzdak, G. XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization. Anal. Chem. 80, 6382–6389 (2008).
Shinbo, Y. et al. in Plant Metabolomics Vol. 57 (eds Saito, K. et al.) 165–181 (Springer, 2006).
Wishart, D. S. et al. HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res. 46, D608–D617 (2018).
Kanehisa, M., Goto, S., Kawashima, S. & Nakaya, A. The KEGG databases at GenomeNet. Nucleic Acids Res. 30, 42–46 (2002).
Bobach, C., Böhme, T., Laube, U., Püschel, A. & Weber, L. Automated compound classification using a chemical ontology. J. Cheminform. 4, 40 (2012).
Klekota, J. & Roth, F. P. Chemical substructures that enrich for biological activity. Bioinformatics 24, 2518–2525 (2008).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Willighagen, E. L. et al. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J. Cheminform. 9, 33 (2017).
Hähnke, V. D., Kim, S. & Bolton, E. E. PubChem chemical structure standardization. J. Cheminform. 10, 36 (2018).
Rogers, D. J. & Tanimoto, T. T. A computer program for classifying plants. Science 132, 1115–1118 (1960).
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
Abadi, M. N. et al. in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (eds Keeton, K. & Roscoe, T.) 265–283 (USENIX, 2016).
Platt, J. C. Advances in Large Margin Classifiers (MIT Press, 2000).
Böcker, S. & Dührkop, K. Fragmentation trees reloaded. J. Cheminform. 8, 5 (2016).
Ludwig, M. et al. Database-independent molecular formula annotation using Gibbs sampling through ZODIAC. Nat. Mach. Intell. 2, 629–641 (2020).
Moorthy, A. S., Wallace, W. E., Kearsley, A. J., Tchekhovskoi, D. V. & Stein, S. E. Combining fragment-ion and neutral-loss matching during mass spectral library searching: a new general purpose algorithm applicable to illicit drug identification. Anal. Chem. 89, 13261–13268 (2017).
Mann, H. B. & Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60 (1947).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

Acknowledgements

We thank Deutsche Forschungsgemeinschaft for providing financial support (no. BO 1910/20 to S.B., K.D. and M.L. and no. PE 2600/1 to D.P.), and the Academy of Finland (no. 310107/MACOME to J.R.). P.C.D., R.R. and W.H.G. were supported by the Gordon and Betty Moore Foundation (no. GBMF7622) and by the US National Institutes of Health (NIH; no. R01 GM107550). P.C.D. was supported by NIH grants nos. P41 GM103484 and R03 CA211211. L.-F.N. was supported by NIH grant no. R01 GM107550 and by the European Union’s Horizon 2020 program (MSCA-GF, no. 704786). We thank F. Kuhlmann and Agilent Technologies, Inc. for providing data used in the evaluation of CANOPUS. We thank Y. Djoumbou Feunang, D. Arndt and D. Wishart for providing ClassyFire annotations for a database of molecular structures. We thank K. Alexander, E. Caro-Diaz and B. Naman for assistance with the collection of Rivularia sp. Further, we thank S. Whitner and K. Joosten for 16S ribosomal DNA (rDNA) analysis. We thank M. Ernst for valuable discussions on the Euphorbia plant study, and J. van der Hooft and S. Rogers for feedback on the manuscript.

Author information

Authors and Affiliations

Chair for Bioinformatics, Friedrich-Schiller University, Jena, Germany
Kai Dührkop, Markus Fleischauer, Marcus Ludwig, Martin A. Hoffmann & Sebastian Böcker
Collaborative Mass Spectrometry Innovation Center, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, CA, USA
Louis-Félix Nothias, Daniel Petras & Pieter C. Dorrestein
Center for Marine Biotechnology and Biomedicine, Scripps Institution of Oceanography, University of California, San Diego, La Jolla, CA, USA
Raphael Reher & William H. Gerwick
International Max Planck Research School ‘Exploration of Ecological Interactions with Molecular and Chemical Techniques’, Max Planck Institute for Chemical Ecology, Jena, Germany
Martin A. Hoffmann
Scripps Institution of Oceanography, University of California, San Diego, La Jolla, CA, USA
Daniel Petras
Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, CA, USA
William H. Gerwick
Helsinki Institute for Information Technology, Department of Computer Science, Aalto University, Espoo, Finland
Juho Rousu

Contributions

K.D., J.R. and S.B. designed the research. K.D. and S.B. developed the computational method. K.D. implemented the computational method with contributions from M.L., M.F. and M.A.H. M.F. integrated CANOPUS into SIRIUS v.4.4. K.D., L.-F.N. and P.C.D. applied and evaluated the method in the mouse and Euphorbia studies. R.R. isolated rivulariapeptolide 1155 and applied CANOPUS (on mass spectrometry data collected and analyzed by D.P. and R.R. and supervised by W.H.G.) and one-/two-dimensional NMR analysis for its structural elucidation. K.D., S.B., L.-F.N. and R.R. wrote the manuscript, in concert with all authors.

Corresponding author

Correspondence to Sebastian Böcker.

Ethics declarations

Competing interests

S.B., K.D., M.L., M.F. and M.A.H. are cofounders of Bright Giant GmbH. P.C.D. is scientific advisor for Sirenas LLC.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 CANOPUS performance sunburst plot.

Matthews correlation coefficient (MCC) for the 782 of 2,497 compound classes with at least 50 positive examples in the SVM training dataset. A darker green coloring corresponds to better prediction performance for the class. The size of each slice is chosen such that all classes fit into the figure and has no further meaning. Inner slices represent parent classes of outer slices.
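
For readers who want to reproduce a per-class score of this kind, the Matthews correlation coefficient for one binary class can be computed from the four confusion counts. The sketch below is illustrative only: the function names, the 0/1 label encoding and the convention of returning 0 when the denominator vanishes are assumptions, not taken from the CANOPUS code. Only the cutoff of 50 positive examples comes from the caption.

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """MCC for one binary class; labels are 0/1 (assumed encoding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0 when a row or column of the
    # confusion matrix is empty and the MCC is undefined.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def classes_with_enough_positives(labels_by_class, min_positives=50):
    """Keep only classes with at least `min_positives` positive examples."""
    return [name for name, labels in labels_by_class.items()
            if sum(labels) >= min_positives]
```

Because the MCC uses all four confusion counts, it stays informative for rare classes where plain accuracy would be dominated by true negatives.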

Extended Data Fig. 2 Effect of removing a subclass from the MS/MS training data.

a–c, Regular evaluation setup: classes and subclasses are distributed into cross-validation folds, ensuring that methods are never evaluated on the same MS/MS data or structures they were trained on. d–f, We remove all flavonoid glycosides (the subclass) from the MS/MS training data (d), and then evaluate the predictor for glycosides (the class) on these removed MS/MS spectra (e). A perfect method would still classify all flavonoid glycoside MS/MS spectra as glycosides (f). CANOPUS exhibits only a small drop in correct classifications, to 97% after the removal (c,f). In contrast, direct prediction performed mostly on par with CANOPUS before removing flavonoid glycosides from the MS/MS training data (c), but misses almost all of them (8%) afterwards (f). We were able to attribute this to the presence of isoflavonoid glycosides in the training data; these do not belong to the flavonoid class, but have highly similar structures and MS/MS spectra, except for the presence of a sugar residue. We observed that direct prediction in (d–f) uses the presence of a sugar residue to infer that an MS/MS spectrum is not a glycoside. In contrast, CANOPUS does not fall for this ‘bait’; heterogeneous training allows us to integrate the substantially more comprehensive structure data in its predictions.
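
The regular setup in a–c relies on folds that never share a structure between training and evaluation data. One simple way to obtain that property is to assign folds per structure rather than per spectrum, as in the sketch below. The round-robin assignment, the function name and the use of a structure key (for example a standardized SMILES string) are assumptions for illustration; the caption does not say how the folds were actually built.

```python
def structure_disjoint_folds(spectra, n_folds=5):
    """Split spectra into folds so that each structure lives in one fold.

    `spectra` is a list of (spectrum_id, structure_key) pairs; both
    names are hypothetical and only serve this example.
    """
    # Assign every distinct structure to a fold, round-robin over a
    # sorted list so the split is deterministic.
    structures = sorted({structure for _, structure in spectra})
    fold_of = {s: i % n_folds for i, s in enumerate(structures)}

    # All spectra of a structure follow that structure into its fold.
    folds = [[] for _ in range(n_folds)]
    for spectrum_id, structure in spectra:
        folds[fold_of[structure]].append(spectrum_id)
    return folds
```

Splitting per spectrum instead would let two spectra of the same compound land on both sides of the split, which inflates the measured performance.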

Extended Data Fig. 3 Relative number of compounds annotated at varying ClassyFire class levels in the mice study (a) and the Euphorbia plant study (b).

The ClassyFire ChemOnt ontology is organized as a tree, where the Kingdom is either Organic compounds or Inorganic compounds. Superclasses such as Lipids and lipid-like molecules and Benzenoids are children of the Kingdom level. Flavonoids and Steroids and steroid derivatives are examples of the Class level, while Flavonoid glycosides and Bile acids, alcohols, and derivatives are examples of subclasses. There can be up to 11 levels in the ontology. c, ClassyFire classes of compounds in the biological databases. We observe a similar distribution of class levels as for the two biological datasets, indicating that CANOPUS is comprehensively classifying compounds at all possible compound class levels.
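
The tree structure described above can be sketched with a child-to-parent map. The class names below are the ones named in this caption; the specific parent links (for example, Steroids and steroid derivatives under Lipids and lipid-like molecules) are filled in from ChemOnt as we understand it, and the helper names and the "Level N" label for deeper nodes are assumptions made for this example.

```python
# Child -> parent links for a few ChemOnt classes (illustrative subset).
PARENT = {
    "Lipids and lipid-like molecules": "Organic compounds",
    "Benzenoids": "Organic compounds",
    "Steroids and steroid derivatives": "Lipids and lipid-like molecules",
    "Bile acids, alcohols, and derivatives": "Steroids and steroid derivatives",
}
LEVEL_NAMES = ["Kingdom", "Superclass", "Class", "Subclass"]

def ancestors(node):
    """Return the chain of parents from `node` up to the root."""
    path = []
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def level_name(node):
    """Name the ontology level of `node` from its depth in the tree."""
    depth = len(ancestors(node))
    if depth < len(LEVEL_NAMES):
        return LEVEL_NAMES[depth]
    return f"Level {depth + 1}"  # assumed label for deeper levels

def with_ancestors(classes):
    """A compound in a class also belongs to every ancestor class."""
    closed = set(classes)
    for name in classes:
        closed.update(ancestors(name))
    return closed
```

Counting compounds per `level_name` over such a structure gives the kind of per-level distribution shown in panels a–c.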

Extended Data Fig. 4 Molecular network and compound class annotations (single class annotations) for the mice digestive system.

Node colors indicate the compound class annotated by CANOPUS; displayed compound classes were manually selected. When a compound is annotated with multiple classes, the class with the larger structural pattern is selected. Nodes are connected by an edge if the spectral similarity is 0.7 or higher.
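
The edge rule in this caption (connect two nodes when their spectral similarity is 0.7 or higher) can be sketched as follows. The caption does not state which similarity score was used, so the plain cosine on binned peaks below is a stand-in, and the bin width, function names and input layout are assumptions for illustration.

```python
import math
from itertools import combinations

def binned(peaks, bin_width=0.01):
    """Sum intensities of (m/z, intensity) peaks into fixed-width bins."""
    vector = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        vector[key] = vector.get(key, 0.0) + intensity
    return vector

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)

def network_edges(spectra, threshold=0.7):
    """Return node pairs whose spectral similarity reaches the threshold."""
    vectors = {name: binned(peaks) for name, peaks in spectra.items()}
    return [(u, v) for u, v in combinations(sorted(vectors), 2)
            if cosine(vectors[u], vectors[v]) >= threshold]
```

Molecular networking tools typically use a modified cosine that also matches peaks shifted by the precursor mass difference; the plain cosine here only illustrates the thresholding step.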

Extended Data Fig. 5 Molecular network and compound class annotations (multiple class annotations) for the mice digestive system.

Node colors indicate the compound class annotated by CANOPUS; compound classes are the same as in Supplementary Fig. 4. Compounds belonging to multiple classes are displayed as multicolored nodes. Nodes are connected by an edge if the spectral similarity is 0.7 or higher.

Extended Data Fig. 6 Number of compounds detected for each Euphorbia subgenus.

Orange bars indicate the number of compounds detected here; black ticks indicate the number of compounds reported in the original study. Higher numbers of detected features are not a measure of quality for the two methods, but depend mainly on the preprocessing executed before compound classification.

Extended Data Fig. 7 Number of compounds annotated as diterpenoids in different species of Euphorbia.

Left: absolute number of compounds. Right: relative number of compounds, that is, number of diterpenoids divided by total number of compounds in each species. Black ticks in the left figure mark the reported number of diterpenoids in the original study by Ernst et al.
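
The two panels correspond to a count and a ratio per species. A minimal sketch of that bookkeeping is shown below; the input layout (one set of class names per compound) and the function name are assumptions for this example and do not describe the authors' scripts.

```python
def class_counts_per_species(annotations, target_class):
    """Absolute and relative number of compounds in `target_class`.

    `annotations` maps a species name to a list with one set of class
    names per compound (hypothetical layout).
    """
    result = {}
    for species, compounds in annotations.items():
        hits = sum(1 for classes in compounds if target_class in classes)
        total = len(compounds)
        # Relative number: hits divided by all compounds of the species.
        result[species] = (hits, hits / total if total else 0.0)
    return result
```

Reporting the ratio alongside the raw count matters here because the total number of detected compounds differs between species.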

Extended Data Fig. 8 Number of compounds annotated as triterpenoids in different species of Euphorbia.

Left: absolute number of compounds. Right: relative number of compounds, that is, number of triterpenoids divided by total number of compounds in each species. Black ticks in the left figure mark the reported number of triterpenoids in the original study by Ernst et al.

Extended Data Fig. 9 Number of diterpenoids in different species of Euphorbia.

Black bars show the number of diterpenoids that have a benzoic acid ester (a), fatty acid ester (b) or two carboxylic acids (c).

Supplementary information

Supplementary Information

Supplementary Tables 4, 5 and 7, Figs. 1–10 and Notes 1 and 2.

Reporting Summary

41587_2020_740_MOESM3_ESM.html

Supplementary Data. Interactive comparison of Euphorbia plants. The user can select any two plant species to be compared; two sunburst plots then show the number of compounds annotated by CANOPUS for each compound class. Hovering over a segment displays details of a compound class, including the number and percentage of compounds belonging to this class and the ClassyFire ontology and description of the class.

Supplementary Table 1

Compound classes from ChemOnt ontology not predicted by CANOPUS.

Supplementary Table 2

Evaluation results for all query MS/MS from the SVM training dataset.

Supplementary Table 3

Performance of all evaluated methods for individual compound classes; evaluation on the independent dataset.

Supplementary Table 6

Standardized SMILES of all compound structures from MassBank and GNPS used as the SVM training dataset.

Source data

Source Data Fig. 3.

Source Data Extended Data Fig. 3.

Source Data Extended Data Fig. 6.

Source Data Extended Data Fig. 7.

Source Data Extended Data Fig. 8.

Source Data Extended Data Fig. 9.

About this article

Cite this article

Dührkop, K., Nothias, LF., Fleischauer, M. et al. Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nat Biotechnol 39, 462–471 (2021). https://doi.org/10.1038/s41587-020-0740-8

Received: 15 April 2020
Accepted: 16 October 2020
Published: 23 November 2020
Issue Date: April 2021
DOI: https://doi.org/10.1038/s41587-020-0740-8
