Acknowledgements
We are grateful to C. Notredame and C. Seok for hosting M.S. at the CRG in Barcelona and at Seoul National University for 12 and 18 months, respectively, and to Burkhard Rost at TU Munich for accepting the formal supervision of his PhD thesis. We thank M. Mirdita, L. van den Driesch, and C. Galiez for contributing utilities and workflows, and S. Sunagawa, M. Frith, T. Rattei and our laboratory for feedback on the manuscript. This work was supported by the European Research Council's Horizon 2020 Framework Programme for Research and Innovation (“Virus-X”, project no. 685778) and by the German Federal Ministry for Education and Research (BMBF) (grants e:AtheroSysMed 01ZX1313D, “SysCore” 0316176A).
Author information
Contributions
M.S. developed the software and performed the data analysis. M.S. and J.S. conceived of and designed the algorithms and benchmarks and wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Eliminating random memory access during k-mer match stage in MMseqs2.
Numbers in this figure are represented in hexadecimal notation (e.g. 0xFF is equal to 255 in decimal). After the end of loop 2 (Fig. 1B), the matches array on the left, containing single k-mer matches between the query sequence and various target sequences, is processed in two steps to find double k-mer matches. In the first step, the entries (target_ID, i−j) of matches are sorted into 2^B arrays (bins) according to the lowest B bits of target_ID. Here, for illustration purposes, we set B = 8. In the second step, the 2^B bins are processed one by one. For each k-mer match (target_ID, i−j), we run the code in the magenta frame of Fig. 1B. But now, the diagonal_prev array fits into L1/L2 CPU cache, because it only contains ceil(N/2^B) entries, where N is the number of sequences in the target database.
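The two-step procedure in this caption can be illustrated with a minimal Python sketch. It is not the MMseqs2 implementation: the function name, the list-based bins and the per-match check (a match counts as a double match when the previous match stored for the same target lies on the same diagonal i−j) are assumptions made for illustration, since the code in the magenta frame of Fig. 1B is not reproduced here.

```python
from math import ceil


def find_double_matches(matches, num_targets, bits=8):
    """Return (target_id, diagonal) pairs seen twice in a row for a target.

    matches: list of (target_id, diagonal) pairs, where diagonal = i - j.
    num_targets: N, the number of sequences in the target database.
    bits: B, the number of low target_id bits used to pick a bin.
    """
    num_bins = 1 << bits          # 2^B bins
    mask = num_bins - 1

    # Step 1: scatter the matches into bins by the lowest B bits of target_id.
    bins = [[] for _ in range(num_bins)]
    for target_id, diagonal in matches:
        bins[target_id & mask].append((target_id, diagonal))

    # Step 2: process one bin at a time. Within a bin, target_id >> B is a
    # unique slot, so the small array needs only ceil(N / 2^B) entries.
    slots = ceil(num_targets / num_bins)
    doubles = []
    for bin_matches in bins:
        diagonal_prev = [None] * slots
        for target_id, diagonal in bin_matches:
            slot = target_id >> bits
            # Assumed per-match step: same diagonal as the previous match
            # for this target means a double k-mer match.
            if diagonal_prev[slot] == diagonal:
                doubles.append((target_id, diagonal))
            diagonal_prev[slot] = diagonal
    return doubles


example = [(0x101, 5), (0x002, 3), (0x101, 5), (0x002, 4), (0x201, 5)]
result = find_double_matches(example, num_targets=0x300, bits=8)
```

In this toy input only target 0x101 has two matches on the same diagonal, so it is the only double match reported. The point of the binning is that each bin touches a small, contiguous diagonal_prev array instead of one array of N entries accessed at random.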
Supplementary Figure 2 Multi-core scaling of MMseqs2.
Runtimes of MMseqs and MMseqs2 searches in fast and default sensitivity modes using 1, 2, 4, 8 and 16 threads on a 2 × 8-core server with 128 GB main memory. Theoretically optimal scaling is indicated as a dashed black line for each method. We searched with 6370 full-length protein queries against 30 million UniProt sequences. On 16 cores, MMseqs achieves 58% and MMseqs2 85% of their theoretical maximum performance extrapolated from the single-core measurement. The improvement in scaling behaviour from MMseqs to MMseqs2 is due to minimizing random main memory accesses, as explained in Supplementary Fig. 1.
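One common way to express the "fraction of theoretical maximum performance" quoted above is parallel efficiency: the measured speedup over the single-thread run divided by the number of threads. The sketch below assumes that definition. The function name and the runtimes in the example are hypothetical and are not measurements from the figure.

```python
def parallel_efficiency(t_single, t_parallel, threads):
    """Fraction of ideal linear scaling achieved with `threads` threads.

    t_single: runtime with one thread.
    t_parallel: runtime with `threads` threads (same units as t_single).
    """
    speedup = t_single / t_parallel   # ideal speedup would equal `threads`
    return speedup / threads


# Hypothetical runtimes in seconds, chosen only to illustrate the arithmetic.
efficiency = parallel_efficiency(t_single=1600.0, t_parallel=125.0, threads=16)
```

With these made-up numbers the speedup is 12.8 on 16 threads, giving an efficiency of 0.8.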
Supplementary Figure 3 Runtime of MMseqs2 against UniProt at different sensitivity and database split settings.
We measured the search time with query sets of 10,000 and 100,000 sequences against the UniProt database (Release 2017_03 with 80,204,488 sequences) using four sensitivity settings (faster, fast, default, and sensitive) and splitting the database into 1, 2, and 4 chunks. Runtimes for RefSeq/GenBank (Release March 3, 2017 with 81,027,309 sequences) are very similar. The memory consumption of the index table for the split levels of 1, 2, and 4 was 190 GB, 101 GB, and 57 GB, respectively. All searches ran on a 2 × 14-core server with 768 GB main memory.
Supplementary Figure 4 False discovery rate versus E-value threshold.
False discovery rate versus E-value threshold in version 2 of the sequence search sensitivity benchmark using unshuffled query sequences. Colors are the same as in Fig. 2a.
Supplementary Figure 5 Sequence searching sensitivity assessment with unshuffled query sequences.
Cumulative distribution of area under the curve (AUC) sensitivity for all 6324 queries in version 2 of the sequence search sensitivity benchmark using unshuffled query sequences. Higher curves signify higher sensitivity.
Supplementary Figure 6 Sequence profile searching sensitivity assessment with unshuffled query sequence profiles.
Cumulative distribution of area under the curve (AUC) sensitivity for all 6324 queries in version 2 of the sequence profile search sensitivity benchmark using unshuffled query sequences. Higher curves signify higher sensitivity. 2 IT: 2 search iterations, etc.
Supplementary Figure 7 False discovery rate versus E-value threshold for profile searches.
False discovery rate versus E-value threshold in version 2 of the sequence profile search sensitivity benchmark using unshuffled query sequences.
Supplementary Figure 8 Accuracy of reported E-values.
The expected number of false positives is the E-value threshold times the number of searches, E × 6324. The observed number of false positives is the total number of false positives below the E-value threshold in all 6324 searches. If E-values were accurate, observed and expected numbers of false positives would coincide (diagonal grey line). LAST and MMseqs2 report the most accurate E-values. The false positives shown were obtained with version 2 of the sequence search sensitivity benchmark. Colors are the same as in Fig. 2a.
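The comparison described in this caption amounts to two small calculations, sketched below under stated assumptions. The function names and the input layout (one list of false-positive E-values per search) are hypothetical, and the sample E-values are invented for illustration only.

```python
def expected_false_positives(e_threshold, num_searches=6324):
    """Expected false positives: E-value threshold times number of searches."""
    return e_threshold * num_searches


def observed_false_positives(fp_evalues_per_search, e_threshold):
    """Count false-positive matches below the threshold over all searches.

    fp_evalues_per_search: one list of false-positive E-values per search
    (a hypothetical layout chosen for this sketch).
    """
    return sum(
        1
        for evalues in fp_evalues_per_search
        for e in evalues
        if e < e_threshold
    )


# Invented data for three searches, not benchmark results.
toy_fp_evalues = [[1e-5, 0.5], [0.02], []]
observed = observed_false_positives(toy_fp_evalues, e_threshold=0.1)
expected = expected_false_positives(0.1, num_searches=len(toy_fp_evalues))
```

If reported E-values are accurate, the observed and expected counts should agree on average over many searches, which is what the diagonal line in the figure represents.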
Supplementary Figure 9 Sequence searching sensitivity assessment with single-domain SCOP sequences.
Cumulative distribution of area under the curve (AUC) sensitivity for all 7616 single domain SCOP sequences. Higher curves signify higher sensitivity. AUC up to the first false positive is the fraction of true positive matches found with better E-value than the first false positive match.
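The per-query measure defined in this caption can be computed as in the following sketch. The function name and the input format, a list of (E-value, is_true_positive) pairs plus the total number of true positives for the query, are assumptions for illustration. Ties between a true and a false positive are resolved against the true positive, because the caption requires a strictly better E-value.

```python
def sensitivity_up_to_first_fp(hits, num_true_positives):
    """Fraction of true positives ranked strictly before the first false positive.

    hits: list of (evalue, is_true_positive) pairs for one query.
    num_true_positives: total number of true positives that exist for the query.
    """
    if num_true_positives == 0:
        return 0.0
    found = 0
    # Sort by E-value; on ties, False sorts before True, so a false positive
    # with the same E-value stops the count before the tied true positive.
    for evalue, is_tp in sorted(hits, key=lambda h: (h[0], h[1])):
        if not is_tp:
            break
        found += 1
    return found / num_true_positives


# Invented hit list for one query with four true positives in the database.
toy_hits = [(1e-30, True), (1e-10, True), (1e-3, False), (1e-2, True)]
score = sensitivity_up_to_first_fp(toy_hits, num_true_positives=4)
```

Here two of the four true positives rank ahead of the first false positive, so the score is 0.5. The figure plots the cumulative distribution of this quantity over all queries.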
Supplementary Figure 10 False discovery rate versus E-value threshold for the single-domain benchmark.
False discovery rate versus E-value threshold for the single-domain SCOP sequence search sensitivity benchmark.
Supplementary Figure 12 Algorithmic changes to perform fast sequence profile searches using MMseqs2.
We precompute all similar k-mers above a similarity threshold for each target profile and store them in the index table. For each query sequence, we run over its overlapping, spaced k-mers (loop 2) and look up only the exact same k-mer in the index table (blue frame). At the ungapped alignment stage, we use the target profile consensus sequence. We then transpose the results, i.e., we exchange the roles of query and target, and, as the last step, align the profiles against all query sequences and transpose back.
Supplementary information
Supplementary Information
Supplementary Figures Tables and Texts (PDF 5454 kb)
Supplementary Information
Supplementary Source Code (ZIP 7337 kb)
About this article
Cite this article
Steinegger, M., Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol 35, 1026–1028 (2017). https://doi.org/10.1038/nbt.3988
DOI: https://doi.org/10.1038/nbt.3988