A new study by Jody Hey,1 published in PLoS Biology, sets new standards in the analysis of human genetic data. Using new statistical methods and a combined analysis of nine genes, Hey provides a detailed picture of the events associated with the first migration of Asians into the Americas.

Explorations into the use of DNA sequence data for human demographic inferences began in the late 1980s and early 1990s.2, 3 The research focused on testing the out-of-Africa hypothesis, and the main inferential tool was the estimation of gene trees. However, it soon became apparent that demographic inferences cannot easily be made on the basis of an estimated gene tree, mainly because the relationship between particular demographic models and gene trees is very complex: the same gene tree may arise from several different demographic models. A method for connecting gene trees with demographic models was needed. Coalescent theory4 turned out to provide this link. Using coalescent theory, it is possible to calculate how likely a particular gene tree is under a particular demographic model. The coalescent framework was used to estimate population growth rates, and methods for inferring migration rates and other parameters were developed.5, 6, 7, 8 Unfortunately, most of the models were so demographically naïve that they were hardly applicable to real human data. The fundamental problem has been that the effects of various factors, such as changes in population sizes, gene flow between populations (migration), and divergence of populations from a shared ancestral population, are intertwined, making it impossible to determine the effect of one factor without taking the others into account. The only solution to this problem is to construct complex models that take all (or as many as possible) of the relevant factors into account.
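To make the link between gene trees and demographic models concrete, here is a minimal sketch of the simplest case: a single population of constant size. It is not the model used in Hey's study, which also includes size changes, migration, and divergence. The sketch assumes the standard result that, with k lineages in a diploid population of effective size N, the waiting time to the next coalescence is exponentially distributed with rate k(k-1)/(4N) per generation. The function name, the example waiting times, and the candidate population sizes are all hypothetical.

```python
import math


def coalescent_log_likelihood(intervals, n_e):
    """Log density of coalescent waiting times under a constant-size model.

    intervals: waiting times in generations, ordered from the present
               backwards; the first entry is the time spent with
               len(intervals) + 1 lineages, the last with 2 lineages.
    n_e:       assumed diploid effective population size.

    Under neutrality the tree topology does not depend on n_e, so only
    the waiting times carry information about population size here.
    """
    n = len(intervals) + 1
    log_lik = 0.0
    for i, t in enumerate(intervals):
        k = n - i                            # lineages during this interval
        rate = k * (k - 1) / (4.0 * n_e)     # coalescence rate per generation
        log_lik += math.log(rate) - rate * t  # log of exponential density
    return log_lik


# Hypothetical waiting times for a sample of four sequences
# (4 -> 3 -> 2 -> 1 lineages), in generations.
intervals = [500.0, 1200.0, 4000.0]

# Compare a few candidate effective sizes (illustrative values only).
candidates = [500, 1000, 2000, 5000, 10000]
best = max(candidates, key=lambda n_e: coalescent_log_likelihood(intervals, n_e))

for n_e in candidates:
    print(n_e, round(coalescent_log_likelihood(intervals, n_e), 3))
print("best-supported candidate:", best)
```

In practice the gene tree is not known, which is why methods of the kind described here average the likelihood over many plausible trees rather than evaluating it on a single estimated one.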

The study by Hey, of Rutgers University, sets the bar for such studies. His model incorporates changes in population size, gene flow, and divergence, allowing new explorations into human genetic demography. Inferences are made in a coalescent-based statistical framework that takes into account the uncertainty in the data regarding the gene tree (no gene tree can be estimated with 100% accuracy) and can combine the information from many different loci. He applied this method to data from nine loci from East Asian and Amerind-speaking Native American populations. The major objective was to determine the timing of the earliest migrations into the Americas from Asia and to estimate the effective population sizes of past and present populations. The results suggest that the first wave of migration occurred relatively recently and that the effective number of migrants was about 90.

Although much emphasis has been put on the exact number of migrants populating the Americas, it should be noted that the estimates obtained in genetic studies are of the effective population size at the time of migration. The actual number of people could be substantially higher than the effective number. For example, Hey found the effective size of the ancestral Asian population to be approximately 9000, implying that the number of individuals peopling the Americas in the first wave corresponds to about 1% (90 of 9000) of the entire East Asian population. The results also show that there could have been substantial levels of migration between Asians and Amerindians in the years after the first wave of migration. Nonetheless, the study paints a clear picture of demographic events that includes strong growth in population size after the first wave of migration and a very recent migration event.

Some of the parameters of interest could not be estimated with great certainty. For example, the date of the first migration event was associated with much statistical uncertainty, and the relative importance of migration after the first migration event could not be determined. Although this could be seen as a weakness of the study, it really points to the strength of the methodology. The method rests on a statistical framework that takes all the relevant information in the genetic data into account, so when some of the parameters are difficult to estimate, it implies that the data do not contain enough information about these parameters. In this way, the methodology helps to quantify the uncertainty in the data. It also raises serious concerns about previous studies that, based on much less data and without the use of rigorous statistical methods, have made strong claims about human demography using genetic data.

What sets Hey's study apart from other similar studies is the use of complex and more realistic models. While no model can be exactly true, the approach by Hey can help distinguish good models from bad ones. Genetic data in human demographic studies have often been analyzed by interpreting an estimated gene tree or network. As Hey points out, such verbal interpretations are themselves models, and often very simplistic ones. The method presented by Hey is an important step forward in the field of human genetic demography, replacing ad hoc storytelling with rigorous model testing and statistical inference.