MPL resolves genetic linkage in fitness inference from complex evolutionary histories

Sohail, Muhammad Saqib; Louie, Raymond H. Y.; McKay, Matthew R.; Barton, John P.

doi:10.1038/s41587-020-0737-3

Published: 30 November 2020

Nature Biotechnology volume 39, pages 472–479 (2021)

Abstract

Genetic linkage causes the fate of new mutations in a population to be contingent on the genetic background on which they appear. This makes it challenging to identify how individual mutations affect fitness. To overcome this challenge, we developed marginal path likelihood (MPL), a method to infer selection from evolutionary histories that resolves genetic linkage. Validation on real and simulated data sets shows that MPL is fast and accurate, outperforming existing inference approaches. We found that resolving linkage is crucial for accurately quantifying selection in complex evolving populations, which we demonstrate through a quantitative analysis of intrahost HIV-1 evolution using multiple patient data sets. Linkage effects generated by variants that sweep rapidly through the population are particularly strong, extending far across the genome. Taken together, our results argue for the importance of resolving linkage in studies of natural selection.

**Fig. 1: MPL accurately recovers selection from complex dynamics.**

**Fig. 2: MPL compares favorably with state-of-the-art methods.**

**Fig. 3: Patterns of strong selection in intrahost HIV-1 evolution.**

**Fig. 4: Maps of strong contributions of linkage to inferred selection.**

**Fig. 5: Estimates of selection coefficients for viral escape mutations must account for clonal interference.**

**Fig. 6: Complex patterns of selection in HIV-1 Env following superinfection in an individual who develops broadly neutralizing antibodies.**

Data availability

Raw data used in our analysis are available in the GitHub repository located at https://github.com/bartonlab/paper-MPL-inference. Source data are provided with this paper.

Code availability

Code used in our analysis is available in the GitHub repository located at https://github.com/bartonlab/paper-MPL-inference. The repository also contains Jupyter notebooks that can be run to reproduce the results presented here. The source code is shared under the GPL-3.0 license (https://github.com/bartonlab/paper-MPL-inference/blob/master/LICENSE-GPL). An executable version is also provided on Code Ocean at https://codeocean.com/capsule/3400567/tree (ref. ³⁰), distributed under the GPL-3.0 license (https://opensource.org/licenses/gpl-license/).

References

Bignell, G. R. et al. Signatures of mutation and selection in the cancer genome. Nature 463, 893–898 (2010).
Greaves, M. & Maley, C. C. Clonal evolution in cancer. Nature 481, 306–313 (2012).
Burrell, R. A., McGranahan, N., Bartek, J. & Swanton, C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature 501, 338–345 (2013).
Nik-Zainal, S. et al. The life history of 21 breast cancers. Cell 149, 994–1007 (2012).
Landau, D. A. et al. Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell 152, 714–726 (2013).
Łuksza, M. et al. A neoantigen fitness model predicts tumour response to checkpoint blockade immunotherapy. Nature 551, 517–520 (2017).
McMichael, A. J., Borrow, P., Tomaras, G. D., Goonetilleke, N. & Haynes, B. F. The immune response during acute HIV-1 infection: clues for vaccine development. Nat. Rev. Immunol. 10, 11–23 (2010).
Allen, T. M. et al. Selective escape from CD8⁺ T-cell responses represents a major driving force of human immunodeficiency virus type 1 (HIV-1) sequence diversity and reveals constraints on HIV-1 evolution. J. Virol. 79, 13239–13249 (2005).
Zanini, F. et al. Population genomics of intrapatient HIV-1 evolution. eLife 4, e11282 (2015).
Strelkowa, N. & Lässig, M. Clonal interference in the evolution of influenza. Genetics 192, 671–682 (2012).
Łuksza, M. & Lässig, M. A predictive fitness model for influenza. Nature 507, 57–61 (2014).
Muller, H. J. The relation of recombination to mutational advance. Mut. Res. 1, 2–9 (1964).
Smith, J. M. & Haigh, J. The hitch-hiking effect of a favourable gene. Genet. Res. 23, 23–35 (1974).
Hegreness, M., Shoresh, N., Hartl, D. & Kishony, R. An equivalence principle for the incorporation of favorable mutations in asexual populations. Science 311, 1615–1617 (2006).
Lang, G. I. et al. Pervasive genetic hitchhiking and clonal interference in forty evolving yeast populations. Nature 500, 571–574 (2013).
Tenaillon, O. et al. Tempo and mode of genome evolution in a 50,000-generation experiment. Nature 536, 165–170 (2016).
Levy, S. F. et al. Quantitative evolutionary dynamics using high-resolution lineage tracking. Nature 519, 181–186 (2015).
Bollback, J. P., York, T. L. & Nielsen, R. Estimation of 2N_es from temporal allele frequency data. Genetics 179, 497–502 (2008).
Malaspinas, A.-S., Malaspinas, O., Evans, S. N. & Slatkin, M. Estimating allele age and selection coefficient from time-serial data. Genetics 192, 599–607 (2012).
Mathieson, I. & McVean, G. Estimating selection coefficients in spatially structured populations from time series data of allele frequencies. Genetics 193, 973–984 (2013).
Feder, A. F., Kryazhimskiy, S. & Plotkin, J. B. Identifying signatures of selection in genetic time series. Genetics 196, 509–522 (2014).
Lacerda, M. & Seoighe, C. Population genetics inference for longitudinally-sampled mutants under strong selection. Genetics 198, 1237–1250 (2014).
Foll, M., Shim, H. & Jensen, J. D. WFABC: a Wright–Fisher ABC–based approach for inferring effective population sizes and selection coefficients from time-sampled data. Mol. Ecol. Resour. 15, 87–98 (2015).
Ferrer-Admetlla, A., Leuenberger, C., Jensen, J. D. & Wegmann, D. An approximate Markov model for the Wright–Fisher diffusion and its application to time series data. Genetics 203, 831–846 (2016).
Taus, T., Futschik, A. & Schlötterer, C. Quantifying selection with Pool-Seq time series data. Mol. Biol. Evol. 34, 3023–3034 (2017).
Illingworth, C. J. R. & Mustonen, V. Distinguishing driver and passenger mutations in an evolutionary history categorized by interference. Genetics 189, 989–1000 (2011).
Illingworth, C. J. R., Fischer, A. & Mustonen, V. Identifying selection in the within-host evolution of influenza using viral sequence data. PLoS Comput. Biol. 10, e1003755 (2014).
Terhorst, J., Schlötterer, C. & Song, Y. S. Multi-locus analysis of genomic time series data from experimental evolution. PLoS Genet. 11, e1005069 (2015).
Sohail, M. S., Louie, R. H. Y., McKay, M. R. & Barton, J. P. MPL resolves genetic linkage in fitness inference from complex evolutionary histories. GitHub https://github.com/bartonlab/paper-MPL-inference (2020).
Sohail, M. S., Louie, R. H. Y., McKay, M. R. & Barton, J. P. MPL resolves genetic linkage in fitness inference from complex evolutionary histories. Code Ocean https://doi.org/10.24433/CO.1795728.v1 (2020).
Mustonen, V. & Lässig, M. Fitness flux and ubiquity of adaptive evolution. Proc. Natl Acad. Sci. USA 107, 4248–4253 (2010).
Illingworth, C. J. R., Parts, L., Schiffels, S., Liti, G. & Mustonen, V. Quantifying selection acting on a complex trait using allele frequency time series data. Mol. Biol. Evol. 29, 1187–1197 (2011).
Schraiber, J. G. A path integral formulation of the Wright–Fisher process with genic selection. Theor. Popul. Biol. 92, 30–35 (2014).
Ewens, W. J. Mathematical Population Genetics 1: Theoretical Introduction (Springer Science & Business Media, 2012).
Iranmehr, A., Akbari, A., Schlötterer, C. & Bafna, V. CLEAR: Composition of likelihoods for evolve and resequence experiments. Genetics 206, 1011–1023 (2017).
Liu, M. K. P. et al. Vertical T cell immunodominance and epitope entropy determine HIV-1 escape. J. Clin. Invest. 123, 380–393 (2013).
Moore, P. L. et al. Multiple pathways of escape from HIV broadly cross-neutralizing V2-dependent antibodies. J. Virol. 87, 4882–4894 (2013).
Doria-Rose, N. A. et al. Developmental pathway for potent V1V2-directed HIV-neutralizing antibodies. Nature 509, 55–62 (2014).
Liu, Y. et al. Selection on the human immunodeficiency virus type 1 proteome following primary infection. J. Virol. 80, 9519–9529 (2006).
Neher, R. A. & Leitner, T. Recombination rate and selection strength in HIV intra-patient evolution. PLoS Comput. Biol. 6, e1000660 (2010).
Batorsky, R. et al. Estimate of effective recombination rate and average selection coefficient for HIV in chronic infection. Proc. Natl Acad. Sci. USA 108, 5661–5666 (2011).
Wang, S. et al. Manipulating the selection forces during affinity maturation to generate cross-reactive HIV antibodies. Cell 160, 785–797 (2015).
Liao, H.-X. et al. Co-evolution of a broadly neutralizing HIV-1 antibody and founder virus. Nature 496, 469–476 (2013).
Ganusov, V. V. et al. Fitness costs and diversity of the cytotoxic T lymphocyte (CTL) response determine the rate of CTL escape during acute and chronic phases of HIV Infection. J. Virol. 85, 10518–10528 (2011).
Ganusov, V. V., Neher, R. A. & Perelson, A. S. Mathematical modeling of escape of HIV from cytotoxic T lymphocyte responses. J. Stat. Mech.: Theory Exp. 2013, P01010 (2013).
Kessinger, T., Perelson, A. & Neher, R. Inferring HIV escape rates from multi-locus genotype data. Front. Immunol. 4, 252 (2013).
Pandit, A. & de Boer, R. J. Reliable reconstruction of HIV-1 whole genome haplotypes reveals clonal interference and genetic hitchhiking among immune escape variants. Retrovirology 11, 11–56 (2014).
Leviyang, S. & Ganusov, V. V. Broad CTL response in early HIV infection drives multiple concurrent CTL escapes. PLoS Comput. Biol. 11, e1004492 (2015).
Beerenwinkel, N., Günthard, H. F., Roth, V. & Metzner, K. J. Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data. Front. Microbiol. 3, 329 (2012).
Turajlic, S., Sottoriva, A., Graham, T. & Swanton, C. Resolving genetic heterogeneity in cancer. Nat. Rev. Genet. 20, 404–416 (2019).
Good, B. H., McDonald, M. J., Barrick, J. E., Lenski, R. E. & Desai, M. M. The dynamics of molecular evolution over 60,000 generations. Nature 551, 45–50 (2017).
Kouyos, R. D., Althaus, C. L. & Bonhoeffer, S. Stochastic or deterministic: what is the effective population size of HIV-1? Trends Microbiol. 14, 507–511 (2006).
Cocco, S., Feinauer, C., Figliuzzi, M., Monasson, R. & Weigt, M. Inverse statistical physics of protein sequences: a key issues review. Rep. Prog. Phys. 81, 032601 (2018).
Socolich, M. et al. Evolutionary information for specifying a protein fold. Nature 437, 512–518 (2005).
Weigt, M., White, R. A., Szurmant, H., Hoch, J. A. & Hwa, T. Identification of direct residue contacts in protein–protein interaction by message passing. Proc. Natl Acad. Sci. USA 106, 67–72 (2009).
Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl Acad. Sci. USA 108, E1293–E1301 (2011).
Russ, W. P., Lowery, D. M., Mishra, P., Yaffe, M. B. & Ranganathan, R. Natural-like function in artificial WW domains. Nature 437, 579–583 (2005).
Ferguson, A. L. et al. Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design. Immunity 38, 606–617 (2013).
Mann, J. K. et al. The fitness landscape of HIV-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS Comput. Biol. 10, e1003776 (2014).
Figliuzzi, M., Jacquier, H., Schug, A., Tenaillon, O. & Weigt, M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Mol. Biol. Evol. 33, 268–280 (2015).
Barton, J. P. et al. Relative rate and location of intra-host HIV evolution to evade cellular immunity are predictable. Nat. Commun. 7, 11660 (2016).
Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
Louie, R. H. Y., Kaczorowski, K. J., Barton, J. P., Chakraborty, A. K. & McKay, M. R. Fitness landscape of the human immunodeficiency virus envelope protein that is targeted by antibodies. Proc. Natl Acad. Sci. USA 115, E564–E573 (2018).
Quadeer, A. A., Louie, R. H. Y. & Mckay, M. R. Identifying immunologically-vulnerable regions of the HCV E2 glycoprotein and broadly neutralizing antibodies that target them. Nat. Commun. 10, 2073 (2019).
Quadeer, A. A., Barton, J. P., Chakraborty, A. K. & McKay, M. R. Deconvolving mutational patterns of poliovirus outbreaks reveals its intrinsic fitness landscape. Nat. Commun. 11, 377 (2020).
Kimura, M. Diffusion models in population genetics. J. Appl. Probab. 1, 177–232 (1964).
Tataru, P., Bataillon, T. & Hobolth, A. Inference under a Wright-Fisher model using an accurate beta approximation. Genetics 201, 1133–1141 (2015).
He, Z., Beaumont, M. & Yu, F. Effects of the ordering of natural selection and population regulation mechanisms on Wright-Fisher models. G3: Genes, Genomes, Genetics 7, 2095–2106 (2017).
Tataru, P., Simonsen, M., Bataillon, T. & Hobolth, A. Statistical inference in the Wright-Fisher model using allele frequency data. Syst. Biol. 66, e30–e46 (2017).
Risken, H. The Fokker–Planck Equation: Methods of Solution and Applications 2nd edn (Springer, 1989).
Gaschen, B., Kuiken, C., Korber, B. & Foley, B. Retrieval and on-the-fly alignment of sequence fragments from the HIV database. Bioinformatics 17, 415–418 (2001).
Korber, B. et al. in Human Retroviruses and AIDS (eds Korber, B. et al.) 102–111 (Los Alamos National Laboratory, 1998).
Zanini, F., Puller, V., Brodin, J., Albert, J. & Neher, R. A. In vivo mutation rates and the landscape of fitness costs of HIV-1. Virus Evol. 3, vex003 (2017).

Acknowledgements

We thank A.K. Chakraborty, C.J.R. Illingworth, B. Lee and J.G. Schraiber for helpful discussions and comments on the manuscript. The work of M.S.S., R.H.Y.L. and M.R.M. was supported by the Hong Kong Research Grants Council under grant number 16234716. M.S.S. and M.R.M. were also supported by the Hong Kong Research Grants Council under grant number 16201620, while R.H.Y.L. was also supported by Australia’s National Health and Medical Research Council under grant number APP1121643. The work of J.P.B. reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award R35GM138233.

Author information

These authors contributed equally: Muhammad Saqib Sohail, Raymond H. Y. Louie.

Authors and Affiliations

Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong, China
Muhammad Saqib Sohail, Raymond H. Y. Louie & Matthew R. McKay
Institute for Advanced Study, Hong Kong University of Science and Technology, Hong Kong, China
Raymond H. Y. Louie
The Kirby Institute, University of New South Wales, Sydney, New South Wales, Australia
Raymond H. Y. Louie
School of Medical Sciences, University of New South Wales, Sydney, New South Wales, Australia
Raymond H. Y. Louie
Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology, Hong Kong, China
Matthew R. McKay
Department of Physics and Astronomy, University of California, Riverside, Riverside, CA, USA
John P. Barton

Contributions

All authors designed research, developed methods, analyzed data, interpreted results and wrote the paper.

Corresponding authors

Correspondence to Matthew R. McKay or John P. Barton.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 MPL accurately recovers selection coefficients from complex simulated evolutionary trajectories.

a, Trajectories of mutant allele frequencies over time exhibit complex dynamics in a WF simulation with a simple fitness landscape. b, Separate views of individual trajectories for beneficial, neutral, and deleterious mutants (left panel) and inferred selection coefficients (right panel) for a single simulation run. Note that many neutral mutations exhibit temporal variation similar to beneficial or deleterious mutations. MPL estimates the underlying selection coefficients used to generate these trajectories, presented as mean values ± one theoretical standard deviation, and distinguishes between beneficial, neutral, and deleterious mutations, using Eq. (11). Dashed lines mark the true selection coefficients. c, Distributions of selection coefficient estimates across n = 100 replicate simulations with identical parameters in the special case of perfect sampling. MPL is also robust to finite sampling constraints, accurately classifying beneficial (d) and deleterious (e) mutants even when the number of sequences sampled per time point n_s is low, and the spacing between time samples Δt is large. Simulation parameters. L = 50 loci with two alleles at each locus (mutant and WT): ten beneficial mutants with s = 0.025, 30 neutral mutants with s = 0, and ten deleterious mutants with s = −0.025. Mutation probability μ = 10⁻³, population size N = 10³. Initial population composed of approximately equal numbers of three random founder sequences, evolved over T = 400 generations.
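
The simulation setup quoted in this caption can be sketched in a few lines. What follows is a minimal illustration, not the authors' simulation code: it assumes a haploid population, additive fitness of the form 1 + Σ s_i g_i, multinomial resampling each generation, symmetric mutation and no recombination. The function name `simulate_wf` and its arguments are hypothetical.

```python
import numpy as np

def simulate_wf(s, N=1000, mu=1e-3, T=400, n_founders=3, seed=0):
    """Simulate a haploid Wright-Fisher population (illustrative assumptions only).

    s: per-locus selection coefficients, length L.
    Returns mutant allele frequencies with shape (T + 1, L).
    """
    rng = np.random.default_rng(seed)
    s = np.asarray(s, dtype=float)
    L = s.size
    # Start from roughly equal numbers of random founder sequences.
    founders = rng.integers(0, 2, size=(n_founders, L))
    pop = founders[rng.integers(0, n_founders, size=N)]
    freqs = [pop.mean(axis=0)]
    for _ in range(T):
        # Selection: parents are drawn in proportion to fitness 1 + sum_i s_i g_i.
        fitness = 1.0 + pop @ s
        parents = rng.choice(N, size=N, p=fitness / fitness.sum())
        pop = pop[parents]
        # Mutation: each site flips independently with probability mu.
        flips = rng.random((N, L)) < mu
        pop = np.where(flips, 1 - pop, pop)
        freqs.append(pop.mean(axis=0))
    return np.array(freqs)

# Parameters quoted in the caption: 10 beneficial, 30 neutral, 10 deleterious loci.
s_true = np.concatenate([np.full(10, 0.025), np.zeros(30), np.full(10, -0.025)])
x = simulate_wf(s_true)
```

Each row of `x` holds the mutant frequencies at one generation, which is the kind of trajectory panel a describes.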

Extended Data Fig. 2 MPL improves selection inference for simulated data sets.

In Fig. 2, we showed the performance of MPL and existing methods on simulated test data, averaged over n = 100 replicate simulations with identical parameters. Here we show the improvement of MPL over existing methods for the classification of beneficial (a) and deleterious (b) mutations, and for the error in the estimated selection coefficients (c), for each individual simulation. Selection is more difficult to infer in some simulated data sets, but results from MPL show better agreement with the true parameters in the vast majority of simulations. Simulation parameters. L = 50 loci with two alleles at each locus (mutant and WT): ten beneficial mutants (with s = 0.1 for complex, s = 0.025 for simple), 30 neutral mutants (s = 0 for both scenarios), and ten deleterious mutants (s = −0.1 for complex, s = −0.025 for simple). Mutation probability μ = 10⁻⁴, population size N = 10³. For the complex case, the initial population is composed of equal numbers of five random founder sequences, evolved over T = 310 generations. Recorded trajectory used for inference begins at generation 10. For the simple case, the initial population begins with all WT sequences, evolved over T = 1000 generations.

Extended Data Fig. 3 MPL performs well in the presence of recombination.

a, Classification performance of MPL is robust to variation in the per-locus recombination probability, r. Results are shown for n = 100 independent Monte-Carlo runs. The lower and upper edges of the boxplot correspond to the 25th and 75th percentiles, the bar corresponds to the median, and the top and bottom whiskers show the maximum and minimum values within 1.5× the interquartile range from the box. Linkage effects in the data decrease as the recombination probability increases. As a measure of the linkage disequilibrium in the data, we plot the histograms (b) of the covariance (x_ij − x_i x_j) of mutant allele frequencies integrated over time (300 generations) for a range of recombination probabilities. The number of mutant pairs with strong pairwise covariance values decreases with increasing values of r, indicating lower linkage disequilibrium. Simulation parameters. Same as those of the simple scenario used in Fig. 2, that is, L = 50 loci with two alleles at each locus (mutant and WT): ten beneficial mutants (s = 0.025), 30 neutral mutants (s = 0), and ten deleterious mutants (s = −0.025). Mutation probability μ = 10⁻⁴, population size N = 10³, r = {0, 10⁻⁵, 10⁻⁴, 10⁻³}. The initial population begins with all WT sequences, evolved over T = 300 generations.
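
As a rough illustration of the quantity plotted in panel b, the covariance x_ij − x_i x_j can be accumulated over sampled time points. This sketch assumes binary genotype matrices and a simple left-endpoint sum over time; the function name and input layout are hypothetical and may differ from the authors' code.

```python
import numpy as np

def integrated_covariance(genotypes, times):
    """Sum dt * (x_ij - x_i x_j) over sampled time points.

    genotypes: list of (n_s, L) 0/1 arrays, one per sampled time point.
    times: sampling times in generations, same length as genotypes.
    """
    L = np.asarray(genotypes[0]).shape[1]
    C_int = np.zeros((L, L))
    for k in range(len(times) - 1):
        g = np.asarray(genotypes[k], dtype=float)
        x_i = g.mean(axis=0)           # single-site mutant frequencies x_i
        x_ij = g.T @ g / g.shape[0]    # pairwise mutant frequencies x_ij
        C_int += (times[k + 1] - times[k]) * (x_ij - np.outer(x_i, x_i))
    return C_int

# Two perfectly linked loci, each at frequency 0.5, sampled at three time points.
linked = [np.array([[1, 1], [0, 0]])] * 3
C = integrated_covariance(linked, [0, 1, 2])
```

In this toy case every entry of `C` equals 0.5, because the off-diagonal covariance is as large as the single-site variance. Recombination would shrink the off-diagonal entries toward zero.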

Extended Data Fig. 4 Performance of MPL on data with HIV-1-like sampling profiles.

a, The number of sequences per time point n_s is drawn from a binomial distribution with n = 1000 and p = 0.0139, with the same mean as that of the HIV data. b, The time between samples is drawn from a mixture of two gamma distributions f(x;k,θ), where k and θ are the shape and scale parameters. The mixture distribution has the form w₁ × (f(x;k₁,θ₁) + m₁) + w₂ × (f((k₂θ₂ + m₂ − x);k₂,θ₂) + m₂) where m₁ = 0 and m₂ = 120 are constants added to shift the mean, k₁ = 3.5, k₂ = 3, θ₁ = 8.4, θ₂ = 2, while w₁ = 0.87 and w₂ = 0.13 are the mixing weights. The parameters were chosen to mimic the distribution of the time between samples of the HIV data analyzed in the manuscript (Supplementary Table 1). c, The number of generations used for inference is also drawn from a mixture of two gamma distributions, having the form given above and with parameters k₁ = 5.5, k₂ = 15, θ₁ = 7.2, θ₂ = 8, m₁ = 5, m₂ = 143, w₁ = 0.21, and w₂ = 0.79. The parameters were chosen to mimic the distribution of the trajectory lengths of the HIV data analyzed in the manuscript (Supplementary Table 1). d, A typical sampled trajectory of allele frequencies: beneficial (red), deleterious (blue) and neutral (gray). Dashed lines indicate the sampling time-points. e, The AUROC performance of identifying beneficial and deleterious selection coefficients under perfect and heterogeneous sampling scenarios. Results are evaluated for those sites that are polymorphic in the heterogeneous sampling case. Results are shown for n = 100 independent Monte-Carlo runs. The lower and upper edges of the boxplot correspond to the 25th and 75th percentiles, the bar corresponds to the median, and the top and bottom whiskers show the maximum and minimum values within 1.5× the interquartile range from the box.
Simulation parameters: population size N = 1000, L = 50 loci with two alleles at each locus (mutant and WT), ten beneficial mutants with selection coefficients s uniformly distributed over the range [0.075, 0.125], 30 neutral mutants with s = 0, and ten deleterious mutants with selection coefficients uniformly distributed over the range [−0.125, −0.075], mutation probability per site per generation μ = 10⁻⁴, and recombination probability per site per generation r = 10⁻⁴.
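
The sampling profile in panels a to c can be drawn with the Python standard library. This is a sketch under one stated interpretation of the mixture formula: the first component is read as a gamma variate shifted by m₁, and the second as a mirrored gamma variate placed so that its mean is m₂. The helper name `draw_shifted_gamma_mixture` is hypothetical.

```python
import random

def draw_shifted_gamma_mixture(rng, k1, theta1, m1, k2, theta2, m2, w1):
    """Draw one value from the two-component mixture (an interpretation of the caption)."""
    if rng.random() < w1:
        # Component 1: a gamma variate (shape k1, scale theta1) shifted right by m1.
        return rng.gammavariate(k1, theta1) + m1
    # Component 2: a mirrored gamma variate, positioned so that its mean equals m2.
    return k2 * theta2 + m2 - rng.gammavariate(k2, theta2)

rng = random.Random(0)
# Sequences per time point: Binomial(n = 1000, p = 0.0139), as a sum of Bernoulli trials.
n_s = sum(rng.random() < 0.0139 for _ in range(1000))
# Time between samples, using the panel b parameters.
dt = draw_shifted_gamma_mixture(rng, 3.5, 8.4, 0, 3, 2, 120, 0.87)
# Number of generations used for inference, using the panel c parameters.
T = draw_shifted_gamma_mixture(rng, 5.5, 7.2, 5, 15, 8, 143, 0.21)
```

Under this reading, most inter-sample gaps come from the first component (mean k₁θ₁ ≈ 29 generations), with a minority of long gaps centered near 120 generations.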

Extended Data Fig. 5 Most genetic variants have little effect on inferred selection at other sites, but a small minority have strong effects.

After computing the pairwise effects \(\Delta \hat s_{ij}\) of each variant i on the inferred selection coefficient for each other variant j, referred to as the target, we summed the absolute value of the \(\Delta \hat s_{ij}\) values over all target variants j to quantify the influence of each variant i on selection at other sites. One histogram is shown for each sequencing region, for each individual. For the vast majority of variants, the total effect on selection at other sites is near zero. However, a small minority have strong effects. We defined a variant to be ‘highly influential’ if the sum of the absolute values of the \(\Delta \hat s_{ij}\) over all targets j was larger than 0.4 (=40%).
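
The summation and cutoff described here are simple to express in code. The sketch below assumes the pairwise effects are already available as a square matrix `delta_s`, with `delta_s[i][j]` the effect of variant i on target j. How those effects are computed is not covered here, and the example numbers are invented for illustration.

```python
def highly_influential(delta_s, threshold=0.4):
    """Return indices i whose summed |delta_s[i][j]| over targets j != i exceeds threshold."""
    flagged = []
    for i, row in enumerate(delta_s):
        total = sum(abs(v) for j, v in enumerate(row) if j != i)
        if total > threshold:
            flagged.append(i)
    return flagged

# Toy 3-variant example with made-up numbers (not data from the paper).
delta_s = [
    [0.0, 0.30, -0.20],   # variant 0: total effect 0.50, above the cutoff
    [0.01, 0.0, 0.02],    # variant 1: total effect 0.03
    [-0.05, 0.10, 0.0],   # variant 2: total effect 0.15
]
print(highly_influential(delta_s))  # prints [0]
```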

Extended Data Fig. 6 Variants that strongly influence inferred selection at other sites often act across large genomic distances.

Plot of all linkage effects on inferred selection coefficients \(\Delta \hat s_{ij}\) for which |\(\Delta \hat s_{ij}\)| > 0.004. One plot is shown for each sequencing region, for each individual. These strong effects of linkage on inferred selection coefficients can act at long range across the genome. Approximately 40% of highly influential variants, characterized by strong effects on inferred selection at other sites, lie within identified CD8⁺ T cell epitopes. The 5′ region for individual CH607 is not shown because no \(\Delta \hat s_{ij}\) values are larger than the cutoff.

Extended Data Fig. 7 For most variants, effects on inferred selection coefficients for other variants, and linkage disequilibrium, are stronger at smaller genomic distances.

a, Histogram of the absolute value of linkage effects on inferred selection coefficients for other variants |\(\Delta \hat s_{ij}\)|, divided into subgroups based on the distance along the genome between variant i and target variant j. Consistent with intuition, the large effects on inferred selection coefficients occur most frequently for different variants that occur at the same site on the genome (that is, distance equal to zero). ‘Interactions’ between such variants are necessarily perfectly competitive because only a single nucleotide is allowed at each position in the genetic sequence. For most variants, stronger linkage effects on inferred selection coefficients are more frequently observed for other variants within a distance of ten base pairs (bp). Large linkage effects for pairs of variants within a distance of 30 bp, the approximate length of a linear T cell epitope, occur appreciably more frequently than for pairs of variants at greater genomic distances. However, there is little difference in the distribution of linkage effect sizes for pairs of variants that are between 31 bp and 100 bp apart compared to pairs of variants that are more than 100 bp apart. Nonetheless, some strong linkage effects on inferred selection are observed at long genomic distances (see Fig. 4 and Supplementary Fig. 5). b, Linkage disequilibrium, measured by the absolute value of the off-diagonal entries of the integrated allele frequency covariance matrix, C_int. Like the |\(\Delta \hat s_{ij}\)|, linkage decays along with the distance between variants along the genome. However, we note that linkage disequilibrium values in general appear to be more long-ranged.
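
The distance subgroups in panel a can be reproduced with a small binning helper. The group boundaries below (0, 1-10, 11-30, 31-100 and more than 100 bp) are inferred from the wording of the caption and should be treated as an assumption. The function names and the example inputs are hypothetical.

```python
from bisect import bisect_left

# Distance groups read from the caption: 0, 1-10, 11-30, 31-100, and more than 100 bp.
UPPER_EDGES = [0, 10, 30, 100]
LABELS = ["0", "1-10", "11-30", "31-100", ">100"]

def distance_group(pos_i, pos_j):
    """Label the genomic distance between two variants."""
    return LABELS[bisect_left(UPPER_EDGES, abs(pos_i - pos_j))]

def group_effects(positions, delta_s):
    """Collect |delta_s[i][j]| for i != j into the distance groups."""
    groups = {label: [] for label in LABELS}
    for i, row in enumerate(delta_s):
        for j, v in enumerate(row):
            if i != j:
                groups[distance_group(positions[i], positions[j])].append(abs(v))
    return groups

# Two variants at the same site (position 10) and one 40 bp away; values are made up.
groups = group_effects([10, 10, 50], [[0, 0.2, 0.1], [0.3, 0, 0.0], [0.1, 0.0, 0]])
```

Variants that share a position fall in the "0" group, matching the caption's note that different variants at the same site are treated as distance zero.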

Extended Data Fig. 8 Estimates of selection coefficients in a simple example of clonal interference.

a, Two escape mutations arise in the TW10 epitope targeted by individual CH58 and compete for dominance. b, MPL infers that both TW10 escape variants are positively selected. Estimates based on trajectories of individual variants only infer substantial positive selection for the 1514A variant that fixes. The magnitude of selection inferred with the independent model is also smaller than that inferred by MPL. c, Inferred selection in the HIV-1 5′ half-genome sequence for CH58. Inferred selection coefficients are plotted in tracks. Coefficients of transmitted/founder nucleotides are normalized to zero. Tick marks denote polymorphic sites. Inner links, shown for sites connected to the TW10 epitope, have widths proportional to matrix elements of the inverse of the integrated covariance. Linked sites affect selection estimates within the epitope.
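
The contrast between MPL and the independent model in panel b can be illustrated with a toy calculation. This sketch assumes a regularized linear estimator of the form ŝ = (C_int + γI)⁻¹ [x(T) − x(0) − μ Σ Δt (1 − 2x)], with the independent model keeping only the diagonal of C_int. The exact estimator is not reproduced in this excerpt, so the form, the regularization strength `gamma`, and all numbers below are assumptions for illustration.

```python
import numpy as np

def infer_selection(x, C_int, times, mu, gamma=1.0, linked=True):
    """Estimate selection coefficients from frequency trajectories (assumed form).

    x: (K, L) allele frequencies at the sampled times; C_int: (L, L) integrated covariance.
    linked=False drops the off-diagonal covariances (independent-site model).
    """
    x = np.asarray(x, dtype=float)
    C_int = np.asarray(C_int, dtype=float)
    dt = np.diff(times)
    # Expected frequency change from mutation alone, accumulated over time.
    mut_flux = mu * (dt[:, None] * (1.0 - 2.0 * x[:-1])).sum(axis=0)
    b = x[-1] - x[0] - mut_flux
    A = C_int if linked else np.diag(np.diag(C_int))
    return np.linalg.solve(A + gamma * np.eye(len(b)), b)

# Made-up two-variant example: variant 0 sweeps, variant 1 is outcompeted.
x = [[0.10, 0.10], [0.90, 0.05]]
C_int = [[20.0, -8.0], [-8.0, 10.0]]   # negative covariance: the variants compete
s_mpl = infer_selection(x, C_int, [0, 100], mu=0.0)
s_ind = infer_selection(x, C_int, [0, 100], mu=0.0, linked=False)
```

With these invented inputs, the linked estimate assigns positive selection to both variants, while the independent estimate is positive only for the variant that sweeps. This mirrors the qualitative pattern described for the TW10 escape variants.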

Extended Data Fig. 9 Estimates of selection coefficients in a complex example of clonal interference.

a, Multiple escape variants for the Nef epitope EV11, targeted by individual CH131, interfere with one another over the course of nearly one year. Here we have omitted the trajectories for transient variants with a deletion at sites 8988a-8988c, which are insertions with respect to the HXB2 reference sequence. b, MPL infers that all nonsynonymous EV11 escape variants are positively selected. Variants 9000C and 9006T are both synonymous, and are inferred to be nearly neutral by MPL. As in previous examples, inferences based on the trajectories of individual variants alone yield substantial positive selection only for variants that are polymorphic at the final time point, or where the transmitted/founder (TF) allele at the same site appears strongly selected against. In the latter case, positive selection is inferred because all selection coefficients are normalized such that the selection coefficient for the TF variant is zero. This is why the independent model infers 8988T to be beneficial despite its low frequency at the final time point. Note that the independent model also infers the synonymous mutation 9000C to be beneficial. c, Inferred selection in the HIV-1 3′ half-genome sequence for CH131. Inferred selection coefficients are plotted in tracks. Coefficients of TF nucleotides are normalized to zero. Tick marks denote polymorphic sites. Inner links, shown for sites connected to the EV11 epitope, have widths proportional to matrix elements of the inverse of the integrated covariance. Linked sites affect selection estimates within the epitope.
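
The normalization to the TF variant explains why a rare variant can appear beneficial. A minimal sketch, assuming the normalization is a per-site shift that subtracts the TF coefficient from every nucleotide at that site; the function name and the numbers are illustrative only.

```python
def normalize_to_tf(site_coeffs, tf_nucleotide):
    """Shift the coefficients at one site so that the TF nucleotide sits at zero."""
    offset = site_coeffs[tf_nucleotide]
    return {nuc: s - offset for nuc, s in site_coeffs.items()}

# Made-up numbers: the TF allele (C) looks selected against before normalization,
# so after the shift a rarely observed variant (T) appears beneficial.
raw = {"C": -0.03, "T": 0.0, "A": 0.01}
normalized = normalize_to_tf(raw, "C")
```

After the shift, the TF nucleotide has coefficient zero and every other nucleotide at the site is reported relative to it.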

Extended Data Fig. 10 Inferred selection coefficients across patients using different conventions for data processing.

Inferred selection coefficients are highly similar following different choices for processing the sequence data. Pearson R² values between inferred selection coefficients range from 0.97 to 1.00, with an average of 0.99. Data processing conventions. Reference: current data processing conventions. Max Δt = 200/400: remove time points that are more than 200/400 days beyond the last included time point (reference: 300 days). Max gap freq. = 80%/99%: remove sites where >80%/99% of observed variants are gaps (reference: 95%). Max gap num. = 50/500: remove sequences with >50/500 gaps in excess of subtype consensus (reference: 200). Min seqs. = 2/6: remove time points with <2/6 available sequences (reference: 4). Remove ambiguous: remove sequences that contain ambiguous nucleotides if any other nucleotide variation is observed at the same site. LTR, long terminal repeat.
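
Three of the reference conventions listed above (Min seqs. = 4, Max Δt = 300 days, Max gap freq. = 95%) can be sketched as simple filters. The function names, the input layout and the order in which the filters are applied are assumptions made for illustration, not a description of the authors' pipeline.

```python
def filter_time_points(samples, min_seqs=4, max_dt=300):
    """samples: list of (day, sequences) sorted by day. Returns the retained samples."""
    kept = []
    for day, seqs in samples:
        if len(seqs) < min_seqs:
            continue   # too few sequences at this time point
        if kept and day - kept[-1][0] > max_dt:
            break      # too long since the last included time point
        kept.append((day, seqs))
    return kept

def gap_filtered_sites(sequences, max_gap_freq=0.95):
    """Return indices of alignment columns where gaps make up at most max_gap_freq."""
    n = len(sequences)
    return [i for i in range(len(sequences[0]))
            if sum(seq[i] == "-" for seq in sequences) / n <= max_gap_freq]

# Toy input: day 100 has too few sequences, and day 700 is 450 days after day 250.
samples = [(0, ["A"] * 5), (100, ["A"] * 2), (250, ["A"] * 6), (700, ["A"] * 8)]
kept_days = [day for day, _ in filter_time_points(samples)]
```

Changing `min_seqs`, `max_dt` or `max_gap_freq` corresponds to the alternative conventions compared in this figure.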

Supplementary information

Supplementary Information

Supplementary Figs. 1 and 2, Supplementary Table 1 and Supplementary Text.

Reporting Summary

About this article

Cite this article

Sohail, M.S., Louie, R.H.Y., McKay, M.R. et al. MPL resolves genetic linkage in fitness inference from complex evolutionary histories. Nat Biotechnol 39, 472–479 (2021). https://doi.org/10.1038/s41587-020-0737-3

Received: 16 September 2019
Accepted: 14 October 2020
Published: 30 November 2020
Issue Date: April 2021
DOI: https://doi.org/10.1038/s41587-020-0737-3
