Handling missing values in population data: consequences for maximum likelihood estimation of haplotype frequencies

Gourraud, Pierre-Antoine; Génin, Emmanuelle; Cambon-Thomsen, Anne

doi:10.1038/sj.ejhg.5201233

Published: 14 July 2004

Pierre-Antoine Gourraud¹,
Emmanuelle Génin² &
Anne Cambon-Thomsen¹

European Journal of Human Genetics volume 12, pages 805–812 (2004)

Abstract

Haplotype frequency estimation in population data is an important problem in genetics and different methods including expectation maximisation (EM) methods have been proposed. The statistical properties of EM methods have been extensively assessed for data sets with no missing values. When numerous markers and/or individuals are tested, however, it is likely that some genotypes will be missing. Thus, it is of interest to investigate the behaviour of the method in the presence of incomplete genotype observations. We propose an extension of the EM method to handle missing genotypes, and we compare it with commonly used methods (such as ignoring individuals with incomplete genotype information or treating a missing allele as any other allele). Simulations were performed, starting from data sets of haematopoietic stem cell donors genotyped at three HLA loci. We deleted some data to create incomplete genotype observations in various proportions. We then compared the haplotype frequencies obtained on these incomplete data sets using the different methods to those obtained on the complete data. We found that the method proposed here provides better estimations, both qualitatively and quantitatively, but increases the computation time required. We discuss the influence of missing values on the algorithm's efficiency and the advantages and disadvantages of deleting incomplete genotypes. We propose guidelines for missing data handling in routine analysis.

Introduction

The numerous polymorphic genetic markers throughout the genome, the recent improvements in molecular techniques and the new possibilities of automation¹ allow the development of large genetic studies in populations. HLA population genetics data were one of the first applications of maximum likelihood (ML) estimation of haplotypes 30 years ago.^{2, 3, 4} The genetic structure of the HLA region (6p21.3) is of particular interest, since it has numerous contiguous loci and a high number of alleles at many loci, generating a theoretical number of phenotypes and haplotypes greater than the usual sample size (for example, HLA-DRB1 N=330 alleles).⁵ Further, there is a high number of low-frequency haplotypes. The occurrence of incomplete genotypes has been reduced by the continuing improvements in HLA typing techniques. Nevertheless, when analysing large data sets such as volunteer potential haematopoietic stem cell donor Registries, the influence of missing values in haplotype frequency estimation must be addressed. Thus, haplotype estimation in large data sets of contiguous loci becomes an important issue in population-based molecular genetics.

To overcome the lack of phase information provided by the techniques, likelihood-based calculations in the general framework of the Expectation Maximisation (EM) algorithm have been formalised and further developed by Dempster.⁶ Many of the properties of the EM algorithm for ML estimation have already been discussed.^{7, 8} These include the accuracy of the estimation of haplotype frequency, departure from Hardy–Weinberg equilibrium, the number of alleles, the number of loci, the type of markers, linkage disequilibrium measure, the influence of collapsing over a locus, computational properties and genotyping error.^{9, 10, 11, 12}

The EM algorithm is primarily set to handle the missing phase information, and it can be adapted to deal with complete and incomplete genotypes at the same time (i.e. missing phase information and missing values within a genotype).⁶ We were interested in the influence of missing values on haplotype frequency estimation. In practice, missing values are usually handled in one of two ways: individuals with incomplete data are ignored (as in the EH software;¹³ hereafter referred to as MVDEL), or missing values are coded as an additional allele (in the ARLEQUIN software).¹⁴ This last ‘method’ is an acknowledged bug in the ARLEQUIN implementation for the estimation of haplotype frequencies. Several stand-alone programs^{15, 16, 17, 18, 19, 20, 21} can take incomplete genotypes into account in their analyses, but they are suitable only for specific kinds of data (biallelic markers), for specific kinds of missing data (for example recessive data) or within the framework of familial data.

Here we explore several possible solutions to this problem of missing values and look at the consequences on the estimation of haplotype frequencies by maximum likelihood methods. We implemented them in software named LOGINSERM_ESTIHAPLO.

Population and methods

Population data

The data were obtained from the French Registry of volunteer unrelated potential haematopoietic stem cell donors.^{22, 23} In all, 30 independent data sets of 1000 individuals were obtained by randomly drawing individuals without replacement from the main database of 85 933 individuals typed for HLA-A, -B and -DR. For each of these 30 data sets without missing values (referred to as ‘initial data sets’), HLA-A, B, DR haplotype frequencies were estimated by EM and used as references to study the impact of missing data on haplotype frequency estimation.
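The sampling design can be sketched as follows. This is only an illustration: the function name `draw_datasets` and the use of integer indices in place of Registry records are assumptions, and the text does not say whether an individual may appear in more than one data set, so each set is simply drawn separately here.

```python
import random


def draw_datasets(registry_size, n_sets=30, set_size=1000, seed=None):
    """Draw `n_sets` samples of `set_size` individual indices.

    Within each data set, individuals are drawn without replacement.
    Indices stand in for the registry records (hypothetical layout).
    """
    rng = random.Random(seed)
    return [rng.sample(range(registry_size), set_size) for _ in range(n_sets)]


# Example with the sizes quoted in the text.
datasets = draw_datasets(85933, n_sets=30, set_size=1000, seed=1)
```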

Missing values definition

We considered a genotype to present a ‘missing value’ when only one allele, or none, is reported at a particular locus. We assumed that the missing values were independent of the other reported polymorphisms.

Missing values simulations

Simulations were used to generate proportions of missing values ranging from 5 to 25%. To distribute the missing values randomly in the data, a uniform random number was drawn for each allele in each data set. If this number was smaller than the required proportion of missing values, the allele was deleted. Thus, one or two alleles could be missing at each locus.
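The deletion procedure can be sketched in Python. The data layout (one `(allele, allele)` tuple per locus, `None` for a deleted allele) and the name `delete_alleles` are assumptions made for illustration; they are not taken from the authors' program.

```python
import random


def delete_alleles(genotypes, missing_rate, seed=None):
    """Return a copy of `genotypes` with alleles deleted at random.

    `genotypes` is a list of individuals; each individual is a list of
    loci; each locus is a pair (allele_1, allele_2). One uniform number
    is drawn per allele, and the allele is replaced by None when the
    number is smaller than `missing_rate`.
    """
    rng = random.Random(seed)
    result = []
    for individual in genotypes:
        new_individual = []
        for a1, a2 in individual:
            a1 = None if rng.random() < missing_rate else a1
            a2 = None if rng.random() < missing_rate else a2
            new_individual.append((a1, a2))
        result.append(new_individual)
    return result


# Toy data: two individuals typed at three loci (allele labels are made up).
toy = [[("A1", "A2"), ("B7", "B8"), ("DR3", "DR4")],
       [("A1", "A1"), ("B7", "B44"), ("DR1", "DR4")]]
with_missing = delete_alleles(toy, missing_rate=0.25, seed=0)
```

Because each allele is treated independently, a locus can lose zero, one or both of its alleles.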

Missing values handling

Two methods were compared:

1. The MVDEL method (for Missing Values Deleted) ignores any individual with missing data. If data are missing at any locus, all information is deleted for that individual. This is the method implemented in the program EH.¹³
2. The MVSAS method (for Missing Values Statistically Assessed) allows the missing value to be any allele that is consistent with the incomplete genotype and the haplotypes already observed in the sample. This second method was inspired by Excoffier and Dempster.^{6, 24} However, not all the alleles at a locus were possible. Only those already found associated with the observed alleles at the other loci in the data set were considered to substitute missing values. Indeed, the contribution of incomplete observations to the haplotype estimation is weighted by the probability of possible haplotypes in the same data set. All complete or incomplete observations are used to identify the allelic associations resulting in the possible haplotype diversity. For instance, consider an observed individual with genotype (1,1) at one biallelic locus and genotype (1,?) at a second biallelic locus. In the Dempster or Excoffier approach, ‘?’ will be replaced by 1 or 2. In our method, it will depend on the possible haplotypes that have been deduced from individuals without missing data. If haplotype 1,2 is never estimated to exist, then ‘?’ could only be replaced by 1. This procedure is extended to all possible patterns of missing values.
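The difference between the two strategies can be sketched in Python. This is an illustrative sketch, not the LOGINSERM_ESTIHAPLO code: the names `mvdel` and `compatible_pairs`, and the data layout (one `(allele, allele)` tuple per locus, `None` for a missing allele), are assumptions, and `known_haplotypes` stands in for the set of haplotypes estimated to exist in the sample.

```python
from itertools import combinations_with_replacement


def is_complete(individual):
    """True when no allele of the individual is missing."""
    return all(a is not None for locus in individual for a in locus)


def mvdel(genotypes):
    """MVDEL-style handling: drop every individual with a missing allele."""
    return [ind for ind in genotypes if is_complete(ind)]


def locus_matches(observed, allele_1, allele_2):
    """Check that the observed (non-missing) alleles fit the two carried alleles."""
    carried = [allele_1, allele_2]
    for a in observed:
        if a is None:
            continue  # a missing allele is compatible with anything
        if a not in carried:
            return False
        carried.remove(a)
    return True


def compatible_pairs(individual, known_haplotypes):
    """MVSAS-style handling: list the unordered pairs of known haplotypes
    that are consistent with a (possibly incomplete) genotype."""
    pairs = []
    for h1, h2 in combinations_with_replacement(sorted(known_haplotypes), 2):
        if all(locus_matches(obs, h1[i], h2[i]) for i, obs in enumerate(individual)):
            pairs.append((h1, h2))
    return pairs


# The example from the text: genotype (1,1) at locus 1 and (1,?) at locus 2.
individual = [(1, 1), (1, None)]
only_11 = compatible_pairs(individual, {(1, 1)})
with_12 = compatible_pairs(individual, {(1, 1), (1, 2)})
```

When haplotype 1,2 is not among the known haplotypes, the only resolution is the pair of 1,1 haplotypes, so ‘?’ can only be 1; when 1,2 is known, the pair (1,1)/(1,2) becomes a second candidate.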

EM algorithm

The estimation of haplotype frequencies by maximum likelihood within the EM algorithm has been performed as described elsewhere.⁷ The Expectation Step (E-step) generally computes the likelihood of the sample using haplotype estimations of the previous iteration, or the initial values at first step (which are chosen at random; no multiple starting conditions are used). The counting procedure is extended to the presence of missing values. The criteria to stop iteration are modified in order to compare different models.

1. Maximisation step (M-step): In the M-step, haplotype frequency estimation is inspired by a gene-counting procedure.^{25, 26} For each genotype, the presence of a haplotype is counted through the probability of its resulting phase. We extend this procedure to incomplete genotypes in the example below, using notation for three loci, indexed by i, j and k. This notation generalises to any number of loci by extending the number of indices. The implementation in our software works with up to seven loci:

$$h_{ijk}^{(t+1)} = \frac{1}{2N} \sum_{n=1}^{N} \left[ P_1(h_{ijk}/n) + P_2(h_{ijk}/n) \right]$$

where h_ijk^(t+1) is the estimation of the frequency of haplotype i–j–k at iteration t+1; N is the number of genotypes observed; and P₁(h_ijk/n) and P₂(h_ijk/n) are the probabilities of observing the haplotype i–j–k as the first and second haplotype of genotype n, respectively. P₁(h_ijk/n) and P₂(h_ijk/n) are calculated as functions of the haplotype estimations at iteration (t).
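One explicit form consistent with the definitions in this section is sketched here. It is a reconstruction under the assumption of random pairing of haplotypes, not necessarily the authors' exact expression. The sum in the numerator runs over the complementary haplotypes h_c that can accompany i–j–k in genotype n, and the sum in the denominator over the unordered pairs (h₁, h₂) compatible with genotype n. Under this form, a genotype compatible only with the homozygous pair (h, h) contributes P₁ + P₂ = 2 to the count of h, and a genotype compatible with a single heterozygous pair contributes 1 to each of its two haplotypes:

$$P_1(h_{ijk}/n) = P_2(h_{ijk}/n) = \frac{\sum_{h_c} h_{ijk}^{(t)}\, h_c^{(t)}}{\sum_{(h_1,h_2)} \left(2-\delta_{h_1,h_2}\right) h_1^{(t)}\, h_2^{(t)}}$$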

Here, P₁(h_ijk/n) is the probability of observing the i–j–k haplotype for a given genotype n; h_c^(t) is the frequency of the complementary haplotype (or possible haplotypes) needed alongside the i–j–k haplotype to form genotype n; h₁^(t) and h₂^(t) are the pseudo-haplotype frequency estimations at iteration (t) for genotype n; the notation (h₁, h₂) refers to all possible pairs of haplotypes that may result in the observation of genotype n; h_ijk^(t) is the i–j–k haplotype frequency estimated at iteration (t); and δ_h1,h2 is the Kronecker delta, equal to 1 if h₁ = h₂ (homozygous genotype) and 0 otherwise. This probability is the ratio of the probability of the haplotype combination to the probability of observing genotype n. The M-step is performed for all h_ijk; that is, all haplotype estimations are computed at each iteration. Although the notation h_ijk describes three-locus (i, j, k) haplotypes as parameters in the three-locus HLA data used here, the indexes ‘i, j, k’ generalise to any number of loci; h_ijk would then refer to the appropriate set of pseudo-haplotypes (the set of haplotypes compatible with a given genotype). The asymptotic properties of the EM algorithm are not modified.⁶
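A gene-counting update of this kind can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' C implementation: the function name `em_step` is hypothetical, each genotype is assumed to have already been expanded into the list of unordered haplotype pairs compatible with it (complete or incomplete), and haplotypes are assumed to pair at random.

```python
def em_step(pair_lists, freqs):
    """One EM update of haplotype frequencies by gene counting.

    `pair_lists[n]` lists the unordered haplotype pairs (h1, h2) that are
    compatible with genotype n. `freqs` maps each haplotype to its current
    frequency estimate. Returns the updated frequency dictionary.
    """
    counts = {h: 0.0 for h in freqs}
    for pairs in pair_lists:
        # Probability of each pair under random pairing: a heterozygous
        # pair can arise in two orders, a homozygous pair in one.
        weights = [(1 if h1 == h2 else 2) * freqs[h1] * freqs[h2]
                   for h1, h2 in pairs]
        total = sum(weights)
        if total == 0.0:
            raise ValueError("genotype has zero probability under current estimates")
        for (h1, h2), w in zip(pairs, weights):
            # Each pair contributes its posterior weight to both haplotypes.
            counts[h1] += w / total
            counts[h2] += w / total
    n_genotypes = len(pair_lists)
    return {h: c / (2 * n_genotypes) for h, c in counts.items()}


# Two biallelic loci: one homozygote for haplotype A and one double
# heterozygote, whose phase is ambiguous (A/D or B/C).
A, B, C, D = (1, 1), (1, 2), (2, 1), (2, 2)
start = {A: 0.25, B: 0.25, C: 0.25, D: 0.25}
updated = em_step([[(A, A)], [(A, D), (B, C)]], start)
```

An incomplete genotype enters the same update simply with a longer list of compatible pairs, which is why missing values increase the cost of each iteration.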

2. Iterations of EM: The method usually assumes that if the likelihood does not vary by more than a given very small value (say, for instance, 10e−4) between two iterations, the estimations of haplotype frequencies are stable. Here, since we compare estimations obtained under different likelihood models, we use a direct measure of the overall stability of the estimations, the sum of absolute errors (SAE), defined as follows:

$$\mathrm{SAE} = \sum_{i=1}^{H} \left| h_i^{(t+1)} - h_i^{(t)} \right|$$

where H is the total number of haplotypes estimated and h_i^(t) and h_i^(t+1) are the estimations of the frequency of haplotype number i at iterations t and t+1, respectively. No multiple starting conditions were used routinely, but convergence was assessed separately. Iterations of the algorithm were stopped when SAE reached 10e−4.
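The stopping rule can be sketched as a small driver loop. The names `sum_abs_error` and `iterate_until_stable` are hypothetical, `step` is any function that maps one frequency dictionary to the next, and the default tolerance of 1e-4 is an assumption about how the threshold written ‘10e−4’ in the text should be read.

```python
def sum_abs_error(previous, current):
    """Sum of absolute differences between two sets of frequency estimates."""
    keys = set(previous) | set(current)
    return sum(abs(current.get(h, 0.0) - previous.get(h, 0.0)) for h in keys)


def iterate_until_stable(step, freqs, tol=1e-4, max_iter=1000):
    """Apply `step` repeatedly until the SAE between iterations is <= tol.

    Returns the final estimates and the number of iterations performed.
    """
    for iteration in range(1, max_iter + 1):
        updated = step(freqs)
        if sum_abs_error(freqs, updated) <= tol:
            return updated, iteration
        freqs = updated
    raise RuntimeError("no convergence within max_iter iterations")


# Toy update rule (not an EM step): move halfway towards a fixed target.
target = {"a": 0.7, "b": 0.3}


def halfway(freqs):
    return {h: f + 0.5 * (target[h] - f) for h, f in freqs.items()}


final, n_iter = iterate_until_stable(halfway, {"a": 0.5, "b": 0.5})
```

Because the criterion looks only at the estimates themselves, it can be applied unchanged to runs whose likelihood models differ.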

The modified EM algorithm described above for the integration of missing genotype data has been implemented in a C-written program called ‘LOGINSERM_ESTIHAPLO’ (available on request).

Method for comparison

A comparison of the accuracy of haplotype estimations was made using the ‘IH’ measure (for Identification of Haplotypes). IH takes the value of 1 if the set of estimated haplotypes is identical to the reference set of haplotypes.⁷ It can be applied to all parameters estimated (all haplotype frequencies estimated) or only to frequencies estimated above 1/2N, where N is the sample size; 1/2N is the threshold of estimated existence of a haplotype in the sample.

IH is computed from K_ref, the number of parameters (frequency estimations) in the reference; K_missed, the number of reference parameters that are absent from the estimated frequencies; and K_est, the number of parameters estimated in the haplotype estimations compared to the reference.
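As an illustration, the index can be computed as below. The formula used, IH = 2(K_ref − K_missed)/(K_ref + K_est), is an assumption: it is one form that uses the three quantities named in the text and equals 1 when the two sets are identical, but the exact expression is not given in the text above. The function name `identification_index` is hypothetical. With N = 1000, the threshold 1/2N is 1/2000 = 0.0005.

```python
def identification_index(reference, estimated, threshold=0.0):
    """Assumed form of IH: 2 * (K_ref - K_missed) / (K_ref + K_est).

    `reference` and `estimated` map haplotypes to frequency estimates.
    Only haplotypes with a frequency above `threshold` are counted.
    """
    ref = {h for h, f in reference.items() if f > threshold}
    est = {h for h, f in estimated.items() if f > threshold}
    k_ref, k_est = len(ref), len(est)
    k_missed = len(ref - est)  # reference haplotypes absent from the estimates
    if k_ref + k_est == 0:
        raise ValueError("no haplotype above the threshold in either set")
    return 2 * (k_ref - k_missed) / (k_ref + k_est)


reference = {"h1": 0.50, "h2": 0.30, "h3": 0.15, "h4": 0.05}
estimated = {"h1": 0.55, "h2": 0.30, "h3": 0.15}
ih_identical = identification_index(reference, reference)
ih_one_lost = identification_index(reference, estimated)
# Applying the 1/2N threshold for a sample of N = 1000 individuals.
ih_thresholded = identification_index(reference, estimated, threshold=1 / (2 * 1000))
```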

Haplotype inference methods on complete data can generate errors as compared to population (‘true’) frequencies, due to sampling errors.¹⁶ To compare the haplotype estimations obtained on the complete data to those obtained in the presence of missing values, and to evaluate specifically the impact of the missing values on haplotype inference, we computed the difference between haplotype estimations using three classical indexes:

1. The Mean Square Error (MSE) and the Mean Absolute Error (MAE), defined as:

$$\mathrm{MSE} = \frac{1}{H} \sum_{i=1}^{H} \left( h_i^{(ref)} - h_i^{(est)} \right)^2$$

$$\mathrm{MAE} = \frac{1}{H} \sum_{i=1}^{H} \left| h_i^{(ref)} - h_i^{(est)} \right|$$

where H is the number of parameters shared by the reference and the compared estimations, and h_i^(ref) and h_i^(est) are the estimations of the frequency of haplotype i in the initial data set and in the data with missing values, respectively.
2. The similarity index ‘If’,^{7, 12} used to measure the estimation accuracy. We also introduced another measure of accuracy, a normalised similarity index ‘Ifn’, which considers both the number of shared haplotype estimations and the absolute error on frequencies. It involves K_shared, the number of parameters (haplotype frequency estimations) shared with the initial data set, and K_true, the number of parameters (haplotype frequencies estimated) in the initial data set.
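These indexes can be sketched in Python. MSE and MAE are computed over the haplotypes shared by the two sets of estimates, as described in the text. The similarity index is written here as 1 − ½ Σ |h_ref − h_est| over all haplotypes, with an absent haplotype counted at frequency 0; this form is an assumption, since the text only names the index and cites references 7 and 12. ‘Ifn’ is not sketched because its expression is not given. The function names are hypothetical.

```python
def mse_mae(reference, estimated):
    """Mean square error and mean absolute error over shared haplotypes."""
    shared = sorted(set(reference) & set(estimated))
    if not shared:
        raise ValueError("no haplotype shared by the two sets of estimates")
    diffs = [reference[h] - estimated[h] for h in shared]
    mse = sum(d * d for d in diffs) / len(shared)
    mae = sum(abs(d) for d in diffs) / len(shared)
    return mse, mae


def similarity_index(reference, estimated):
    """Assumed form of 'If': 1 - 0.5 * sum of absolute frequency differences,
    taken over every haplotype present in either set."""
    haplotypes = set(reference) | set(estimated)
    total = sum(abs(reference.get(h, 0.0) - estimated.get(h, 0.0))
                for h in haplotypes)
    return 1.0 - 0.5 * total


reference = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
estimated = {"h1": 0.6, "h2": 0.4}  # h3 was lost
mse, mae = mse_mae(reference, estimated)
sim = similarity_index(reference, estimated)
```

In this toy case the lost haplotype h3 does not enter MSE or MAE, which only use shared haplotypes, but it lowers the similarity index; this is the kind of difference a normalised index is meant to capture.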

Results

The different ways of handling missing values may affect the estimation of haplotype frequencies at different levels: qualitative (identification of possible haplotypes) and quantitative (their frequencies). These two levels can be considered for all possible haplotypes, or for those expected to be present in the sample. The results presented correspond to the haplotypes present in the sample. (All comparisons are available on request.)

Identification of haplotypes expected to be present in the sample according to the reference estimation

Tables 1 and 2 are built on the comparison of frequency estimates above 1/2N, where N is the analysed sample size. In the presence of missing values, the ‘MVDEL’ method reduces the number of haplotypes shared with the reference (Table 1; columns ‘Kept’ and ‘Lost’, rows ‘MVDEL’). Deleting incomplete observations results in a decrease in the haplotypic diversity in the population, with some haplotypes being lost. The number of lost haplotypes is greater with MVDEL (Table 1; column ‘Lost’, rows ‘MVDEL’) than with MVSAS (Table 1; column ‘Lost’). Interestingly, with both methods, when only estimations above the 1/2N frequency threshold were analysed, no haplotype estimations were added compared to the reference (not shown). In this case, the number of added haplotype estimations is not forced to zero. The value of IH, which summarises the conservation of haplotype estimations vs the reference without missing values, is greater for the algorithm developed here (Table 1; Figure 1). MVSAS is therefore qualitatively better than the MVDEL method. Following this qualitative analysis of the haplotypes generated by the algorithms, it is necessary to analyse the influence of missing-value handling on the frequency estimation.

Table 1 Comparison of the average number of different haplotypes with frequency estimation above 1/2000, obtained according to two different ways of handling missing values

Table 2 Comparison of the accuracy of haplotype frequency estimations in the presence of missing values in the data set, restricted to estimations above 1/2000

Frequency estimation of haplotypes

The haplotype frequencies estimated by the different methods using incomplete data are similar to those obtained on the initial sample with no missing data. Several global measures of the accuracy of these methods are presented in Table 2. This shows that the accuracy of the method developed (MVSAS) is at least as good as that of the MVDEL method. The question of the global accuracy of haplotype estimation was addressed in several ways. Using squared errors, there was no apparent difference in error range in the two methods (Table 2; column MSE). Similarly, using absolute error as an evaluation of the differences between estimations, no significant global modifications were seen (Table 2; column MAE). The frequencies obtained with MVSAS seem to be closer to the reference estimations than those obtained with MVDEL. Analysis of vectorial error (not shown) shows a tendency to overestimate the haplotype frequencies, which may be due to several reasons. For example, for MVSAS, the weight of lost haplotypes is distributed over other possible haplotypes. For MVDEL, the decrease in the sample size due to deletion of observations leads to an overestimation of the frequency of the remaining ones. The Ifn global measure of the accuracy of estimations is consistent with the computed MAE (Table 2; column ‘Ifn’) and confirms the slight improvement of the estimations provided.

Calculation time

Obviously, the calculation cost of the MVSAS method is greater than that required when there are no missing values. This depends on both the percentage and the distribution of the missing values. In the simulations performed on a quad Xeon 700 MHz machine (cache 1 MB; random access memory 4 GB; operating system: Linux Red Hat 7.3), the additional computation time was approximately 1 min per 1% of missing values. On these data, the observed relationship is linear.

Discussion

Even though other methods are available (Parsimony,²⁷ Pseudo Bayesian²⁸ and Partition-Ligation Bayesian²⁹), EM remains the most widely used algorithm for the estimation of haplotype frequencies. Thus, we focus only on the modification of the ML estimations provided by the EM method in the presence of missing values. We do not discuss the general properties of the estimations provided by this method as these have been discussed previously in the literature.

Although the ideal situation is to have no missing values, this is rarely the case. The use of unrelated individuals does not allow deduction of the missing values or genotyping errors. Missing values are sometimes nonrandom as they can be related to typing difficulties or to particular combinations of alleles. Such cases are addressed at the technical level as part of the quality control procedure. In the statistical handling, the assumption is made that the missing values are independent of the identity of the missing allele at the locus being considered and independent of the alleles at the other loci. This is the case, for example, for a nongenotyped locus. The importance of data validation for large data sets has been underlined,³⁰ along with the consequences of genotyping error.¹⁰

Consequences related to the presence of missing values

The incidence of missing values in the data set modifies what can be deduced about phase, for at least three reasons. First, computational algorithms cannot replace experimental data, thus missing information is handled in the framework of the theoretical model but remains unsolved. Secondly, it modifies the likelihood model because the parameters (i.e. the number of haplotypes) are different, and because the sample itself is modified. If one ignores the actual implementation of missing values handling by the software, the influence of the incomplete observations cannot be anticipated. Having incomplete observations influences the distribution of an observation over its possible phases. In ARLEQUIN,¹⁴ missing values are considered as an additional allele at each locus. Consequently, the algorithm creates artificial haplotypes. This results in a systematic bias in haplotype frequencies. In the MVDEL method, lost haplotypes may arise for two reasons: either from the missing values themselves (MVSAS method; Table 1; column ‘Lost’), or from the initial decision to delete all the information about individuals with missing values. Lost (or added) estimates are expected to influence the accuracy of the estimations.
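The effect of coding a missing value as an additional allele can be illustrated by counting the haplotypes that become possible. This is a simple counting sketch, not a description of ARLEQUIN's internals; the function name `haplotype_space` and the allele counts used in the example are made up.

```python
from math import prod


def haplotype_space(alleles_per_locus, extra_missing_allele=False):
    """Number of distinct haplotypes that can be formed from the given
    number of alleles at each locus, optionally adding one pseudo-allele
    per locus to stand for 'missing'."""
    extra = 1 if extra_missing_allele else 0
    return prod(k + extra for k in alleles_per_locus)


# Two biallelic loci: 4 real haplotypes, 9 once 'missing' counts as an allele.
real = haplotype_space([2, 2])
with_pseudo_allele = haplotype_space([2, 2], extra_missing_allele=True)
```

The additional haplotypes all contain the pseudo-allele, so any frequency assigned to them is taken away from real haplotypes.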

Criteria of choice for handling incomplete genotypes

The adaptation of ML estimation of haplotype frequencies to incorporate missing values slightly increases the accuracy of the estimations obtained. The MVSAS method is particularly relevant when the main interest of the study is focused on rare haplotypes. We have shown that the two methods presented here for handling missing genotypes (MVDEL and MVSAS) have different consequences on the haplotype estimates. Depending on the aim of the study and on whether one is interested in the most frequent haplotypes, a rare haplotype (disease or candidate haplotype) or the whole set of estimations for global population analysis and gametic disequilibrium measurement, one or the other methods may be best. If one is interested in common haplotypes then MVDEL may be used, since, even though haplotypes may be lost with this method, this will mainly concern rare haplotypes. If the sample size is sufficiently large, therefore, haplotype diversity is usually not affected. If, however, one is interested in rare haplotypes, then MVSAS should be used. Estimation of rare haplotype distribution remains a difficult issue. In fact, MVSAS can be adapted to any situation and it works well even if missing values are concentrated over data at a given locus. The price to pay, however, is the computation time. Indeed, the calculation cost may be prohibitive when the number of missing values, the sample size and the number of loci increase.

The main difference between MVDEL and MVSAS is attributable to missing values distribution over the sample. As underlined by Fallin and Schork,⁹ the ML estimations are sensitive to sampling error. This is particularly true for the missing values sampling. Using the MVDEL method, the decrease of the sample size makes sampling errors more frequent than in the other methods and therefore results in less accurate estimations.
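The loss of sample size under MVDEL can be quantified with a short calculation. It assumes, as in the simulation design described earlier, that each allele is deleted independently with the same probability; the function name `expected_complete_fraction` is hypothetical.

```python
def expected_complete_fraction(missing_rate, n_loci, alleles_per_locus=2):
    """Expected fraction of individuals with no missing allele when each
    allele is deleted independently with probability `missing_rate`."""
    return (1.0 - missing_rate) ** (n_loci * alleles_per_locus)


# Three loci, at the two extremes of the simulated range.
kept_at_5_percent = expected_complete_fraction(0.05, n_loci=3)
kept_at_25_percent = expected_complete_fraction(0.25, n_loci=3)
```

Under this assumption, about 73.5% of individuals remain complete at 5% missing alleles and about 17.8% at 25%, so a data set of 1000 individuals shrinks to roughly 735 and 178 usable observations, respectively.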

Ambiguities, nomenclature in the data set

Techniques sometimes give results as ‘ambiguities’, and from the molecular observation some of the known alleles can be discarded (for example, when the results provide a list of possible alleles and a list of absent alleles). These are not missing values but partial information, and could easily be handled using the same statistics as those presented for missing values. Depending on the complexity of ambiguities in nomenclatures (see Marsh⁵ for HLA), it turns out that this theoretically simple process becomes complicated to implement. Such ambiguities might be taken into account to set the initial nature of the haplotypes (preliminary step of EM).

The methods presented here for HLA are of general relevance and can be applied to microsatellite and SNP haplotype estimation. Regarding the general properties of the method, it is all the more efficient, as the gametic disequilibrium is strong in the region.¹¹ Thus, the genetic structure of the region influences the statistical reassessment of missing values. It means that the gametic disequilibrium allows the deduction of a polymorphism, based on knowledge of the contiguous one in the MVSAS method. In the MVDEL method, it suggests that enough global information for phase information reconstruction remains after deleting some observations. Similarly, for polymorphic markers, missing values are expected to affect low-frequency haplotypes qualitatively, whereas high-frequency ones are affected quantitatively. In this sense, it is consistent with the general differential confidence inherent to the method on rare vs frequent haplotype frequencies estimation (i.e. the more frequent the haplotype, the more reliable its estimation). If missing values affect bi-allelic markers, the estimation of haplotype frequencies may essentially be quantitative, with a higher impact on low frequency and low gametic disequilibrium haplotypes. In such cases, the PLEM strategy may be an alternative method for dealing with missing values; keeping multiple outputting of possible haplotypes should be recommended, as reported.¹⁵

Convergence velocity and stopping criteria

We did not choose the classical likelihood stability criterion to stop iteration. Indeed, the likelihood depends not only on the values of the parameters estimated (haplotype frequencies), but also on the number of parameters, the likelihood model and the data set retained for the analysis, which differ according to the way missing values are handled. Similar considerations were made in the SNPHAP software (available at David Clayton's web site): the skimming procedure used to speed up computations modifies the likelihood model while iterating. Thus, the stability of the estimators was measured directly, using the sum of absolute errors (SAE) between the estimations from one iteration and the next.

Another choice we made here may differ from the classical ones: we only used haplotypes that have been estimated to exist in complete observations, thereby reducing the number of parameters. The alternative – inclusion of all the possible alleles – does not change the result, but increases the running time.

The M-step is the limiting one, together with the number of haplotype frequencies to be estimated. Other calculations may solve the problem or may allow multi-point haplotype frequencies to be computed. Although trimming procedures^{31, 32} have been proposed to reduce the number of parameters while iterating, the final likelihood cannot be used in log likelihood-based tests.

The evolution of the possibilities in large-scale genotyping requires the statistical treatment of the data and motivated our investigation on handling missing values for ML estimation. The statistical handling of missing values increases the quality of the haplotype frequencies provided. Deleting the incomplete observations is acceptable when using large data sets or when the estimation is computer intensive. These conclusions contribute to the enhancement of the use of haplotype estimation and allow better analysis of the data. The structure of the data influences the effectiveness of the method and puts the methodological consideration on this haplotype estimation into perspective. Indeed, the kind of polymorphism, the number of loci, the sample size, or the population may require different computational implementations.

References

1. Gut IG: Automation in genotyping of single nucleotide polymorphisms. Hum Mutat 2001; 17: 475–492.
2. Morton NE, Simpson SP, Lew R, Yee S: Estimation of haplotype frequencies. Tissue Antigens 1983; 22: 257–262.
3. Piazza A: Haplotypes and linkage disequilibrium from three-locus phenotypes. Histocompat Test Munksgaard 1975; 923–927.
4. Yasuda N: Estimation of haplotype frequency and linkage disequilibrium parameter in the HLA system. Tissue Antigens 1978; 12: 315–322.
5. Marsh SG: Nomenclature for factors of the HLA system, update February 2003. Hum Immunol 2003; 64: 656–657.
6. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J Roy Statist Soc B 1977; 39: 1–38.
7. Excoffier L, Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 1995; 12: 921–927.
8. Schipper RF, D'Amaro J, Bakker JT, Bakker J, van Rood JJ, Oudshoorn M: HLA gene haplotype frequencies in bone marrow donors worldwide registries. Hum Immunol 1997; 52: 54–71.
9. Fallin D, Schork NJ: Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am J Hum Genet 2000; 67: 947–959.
10. Kirk KM, Cardon LR: The impact of genotyping error on haplotype reconstruction and frequency estimation. Eur J Hum Genet 2002; 10: 616–622.
11. Xu CF, Lewis K, Cantone KL et al: Effectiveness of computational methods in haplotype prediction. Hum Genet 2002; 110: 148–156.
12. Single RM, Meyer D, Hollenbach JA et al: Haplotype frequency estimation in patient populations: the effect of departures from Hardy–Weinberg proportions and collapsing over a locus in the HLA region. Genet Epidemiol 2002; 22: 186–195.
13. Xie X, Ott J: Testing linkage disequilibrium between a disease gene and marker loci. Am J Hum Genet 1993; 53 (Suppl): 1107.
14. ARLEQUIN a program for population genetic analysis [computer program]. Version;, 1996–2002.
15. Qin ZS, Niu T, Liu JS: Partition-Ligation–Expectation-Maximization Algorithm for Haplotype Inference with Single-Nucleotide Polymorphisms. Am J Hum Genet 2002; 71: 1242–1267.
16. Hawley ME, Kidd KK: HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J Hered 1995; 86: 409–411.
17. Long JC, Williams RC, Urbanek M: An E–M algorithm and testing strategy for multiple-locus haplotypes. Am J Hum Genet 1995; 56: 799–810.
18. Mander AP: Haplotype analysis in population based study. Stata J 2001; 1: 58–75.
19. Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin – rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 2002; 30: 97–101.
20. Clayton D, Jones H: Transmission/disequilibrium tests for extended marker haplotypes. Am J Hum Genet 1999; 65: 1161–1169.
21. Zhao JH, Sham PC: Faster haplotype frequency estimation using unrelated subjects. Hum Hered 2002; 53: 36–41.
22. Raffoux C, Baouz A, Cozic F, Marry E: France Greffe de Moelle: Rapport d'activité 2001. Paris: France Greffe de Moelle, December 2001.
23. Lonjou C, Clayton J, Cambon-Thomsen A, Raffoux C: HLA -A, -B, -DR haplotype frequencies in France – implications for recruitment of potential bone marrow donors. Transplantation 1995; 60: 375–383.
24. Excoffier L: Arlequin Bugs; Available at: http://lgb.unige.ch/arlequin/software/2.000/doc/buglist/buglist.html.
25. Smith CAB: Counting methods in genetical statistics. Ann Hum Genet 1957; 21: 254–276.
26. Cepellini R: The estimation of gene frequencies in random mating population. Ann Hum Genet 1955; 20: 97–115.
27. Clark AG: Inference of haplotypes from PCR-amplified samples of diploid populations. Mol Biol Evol 1990; 7: 111–122.
28. Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction from population data. Am J Hum Genet 2001; 68: 978–989.
29. Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 2002; 70: 157–169.
30. Schipper RF, Oudshoorn M, D'Amaro J et al: Validation of large data sets, an essential prerequisite for data analysis: an analytical survey of the Bone Marrow Donors Worldwide. Tissue Antigens 1996; 47: 169–178.
31. SNPHAP [computer program]. Clayton DG, http://www-gene.cimr.cam.ac.uk/clayton/software/.
32. Thomas A: GCHap: fast MLEs for haplotype frequencies by gene counting. Bioinformatics 2003; 19: 2002–2003.

Acknowledgements

We thank the three reviewers for their comments on this manuscript, Pr Jean-Pierre Florens for his advice and comments on this work, and acknowledge the help of France Greffe de Moelle and Dr C Raffoux for the use of HLA Registry data. This work was supported by the EU grant ‘MADO’: FP5-QLG7-CT-2001-00065, and Etablissement Français des Greffes Grant. The Ecole Normale Supérieure de Lyon, France, has supported PAG in his studies.

Author information

Authors and Affiliations

Unité INSERM 558-Faculté de médecine, 37 allées Jules Guesde, Toulouse, F-31073, France
Pierre-Antoine Gourraud & Anne Cambon-Thomsen
Unité INSERM 535-Hôpital Paul Brousse, Bâtiment Leriche, BP1000, Villejuif, 94817, Cedex, France
Emmanuelle Génin

Corresponding author

Correspondence to Pierre-Antoine Gourraud.

About this article

Cite this article

Gourraud, PA., Génin, E. & Cambon-Thomsen, A. Handling missing values in population data: consequences for maximum likelihood estimation of haplotype frequencies. Eur J Hum Genet 12, 805–812 (2004). https://doi.org/10.1038/sj.ejhg.5201233

Received: 27 June 2003
Revised: 29 April 2004
Accepted: 05 May 2004
Published: 14 July 2004
Issue Date: 01 October 2004
DOI: https://doi.org/10.1038/sj.ejhg.5201233
