Introduction

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus emerged at the end of 2019 and spread rapidly across the world, with the WHO announcing a global pandemic on 11 March 2020. This new betacoronavirus had not been seen before, but it is related to the severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) coronaviruses1. We now know that SARS-CoV-2 uses the human ACE2 receptor for viral entry2, initially infecting and replicating in epithelial cells in the nasopharynx and subsequently gaining access to the distal alveolar space3,4. The virus is recognized by immune cells through pattern-recognition receptors, prominently by members of the Toll-like receptor group such as TLR3 and TLR7, which promote the synthesis of type I interferons5,6,7, and by cytoplasmic RNA sensors retinoic acid-inducible gene I (RIGI; also known as DDX58) and interferon-induced helicase C domain-containing protein 1 (IFIH1; also known as MDA5) inducing type I/III interferon responses8,9. Secreted type I interferons signal via interferon receptors (IFNARs) to switch on Janus kinase 1 (JAK1) and tyrosine kinase 2 (TYK2) and, consequently, promote the expression of interferon‐stimulated genes such as oligoadenylate synthetase 1 (OAS1), OAS2 and OAS3 (ref.10). Severe forms of coronavirus disease 2019 (COVID-19) involve a dysregulation of the immune response that results in insufficient or delayed type I interferon response11,12. Eventually, sustained hyperinflammation results in increased immune infiltration in the lungs, reduction in alveolar lacunar space, cell death by apoptosis and lung fibrosis13,14.

COVID-19 manifests with a wide range of symptoms and degrees of severity. Although most cases are now known to be asymptomatic or mild, some patients develop a severe form of the disease that results in acute respiratory distress syndrome and consequent multi-organ complications15,16. Disease severity is correlated with several risk characteristics, including older age, male sex and smoking; clinical comorbidities such as obesity or being immunocompromised17; and clinical biomarkers such as autoantibodies to type I interferons, cytokines and inflammation markers18. In the early days of the pandemic, it was already noted that these clinical factors did not fully explain the variability in COVID-19 disease severity between individuals, and severe cases were observed among young individuals without apparent pre-existing conditions, sometimes clustering in families19, suggesting a role for human genetics as a risk factor.

Finding host genetic factors for infection susceptibility and disease severity is important because it leads to a better understanding of the viral infection and of the pathophysiological changes that occur owing to disease, and to the discovery of potential drug targets. It can also shed light on the causal relationships between risk factors, biomarkers and disease outcomes, and can inform prevention strategies. Well-known examples of successful human genetic studies of infectious diseases include identification of the CCR5Δ32 mutation for protection against HIV infection20,21, and the protection against Plasmodium falciparum infection (malaria disease) in individuals who are heterozygous carriers of a sickle cell allele of the haemoglobin-β (HBB) gene22,23,24. We refer to the Review by Kwok et al.25 for a broader overview of human genetic influences on infectious diseases.

Compared with studies of other common complex diseases, studying the human genetics of infectious disease poses additional challenges, including uneven exposure to the virus within a population, the differential treatment of patients with severe disease under a pandemic emergency and the implementation and uptake of vaccination programmes. Nonetheless, the existing worldwide expertise in generation and analysis of human genetic data has allowed for rapid large-scale studies of the host genetics of COVID-19. In this Review we provide an overview of current study designs enabling discovery of human genetic variation associated with COVID-19, with a focus on large-scale population-based association studies, the genetic discoveries made so far and what we have learnt in terms of biology and public health impact. Finally, we outline some of the key challenges ahead for the field in this evolving pandemic and beyond.

Study designs for COVID-19 host genetics

Many types of study have contributed to host genetic investigations for COVID-19 during the pandemic.

Clinical studies

Clinical studies collect deep and disease-relevant phenotypic information and typically focus on patients with severe COVID-19 (refs26,27,28,29). Most are of small to medium size with up to a few thousand patients and were initiated after the emergence of SARS-CoV-2 specifically to study COVID-19. However, one of the largest clinical studies, GenOMICC/ISARIC28, predated the pandemic and was already studying the genetics of critical illness due to infection. These researchers were able to rapidly harness existing clinical study and recruitment frameworks for the study of COVID-19. Clinical studies are well positioned to study disease severity once appropriate controls are also collected, and can be used to investigate how genetic risk factors affect a patient’s clinical trajectory after infection. To investigate the genetic bases of COVID-19, these studies generally invest in whole-exome sequencing (WES) and/or whole-genome sequencing (WGS) data generation and analysis.

Biobank and cohort studies

Existing biobank and cohort studies can be used to study COVID-19 given a large enough sample size and sufficient infection rate within the population. These studies typically identify COVID-19-positive cases through linkage with electronic health records or questionnaires. Individuals who are not COVID-19 positive or who tested negative can be used as controls. These studies can provide a more representative sample of patients with COVID-19 than clinical studies, although participants enrolling in biobank and cohort studies are often not fully representative of the general population. For some of the established epidemiological cohorts, participants have been extensively recontacted for the collection of longitudinal information about COVID-19 symptoms30. With few exceptions (for example, the UK Biobank and DiscovEHR collaboration31) most of these studies use genotyping microarrays and are not well suited to study variants with population frequency below 0.1%.

Direct-to-consumer genetic companies

Direct-to-consumer genetic companies have engaged in COVID-19 research to an unprecedented extent. For example, 23andMe32 and AncestryDNA33,34, two of the largest companies in this space, have designed surveys allowing collection of detailed self-reported information. Given the large number of customers, these companies were well powered to identify new common genetic variants associated with various COVID-19 phenotypes, including vaccination side effects35 and specific COVID-19 symptoms36. The disadvantage of such studies is that COVID-19-positive status was self-reported and severe cases are under-represented, although SARS-CoV-2 PCR test results and hospitalization due to COVID-19 are presumed to be quite reliably self-reportable.

COVID-19 phenotypes

Most of the host genetic studies for COVID-19 have focused on identifying variation in the genome that is associated with susceptibility to infection, disease severity and disease-related symptoms.

Susceptibility to infection

Susceptibility to infection is typically defined as being COVID-19 positive given exposure to the virus. This is the most challenging phenotype to collect because viral exposure is difficult to trace. Roberts and colleagues from the AncestryDNA Science Team37 made the most direct attempt to capture susceptibility by comparing COVID-19-negative and COVID-19-positive individuals who had a housemate with a confirmed COVID-19 diagnosis. The COVID-19 Host Genetics Initiative (HGI)38 used a simpler approach, comparing individuals who are COVID-19 positive with population controls, and named this phenotype ‘reported SARS-CoV-2 infection’. Despite the suboptimal choice of the control group, which probably included controls who had not been exposed to the virus, the results overlapped with those from AncestryDNA.

Disease severity and progression

Disease severity is often captured by comparing individuals who are COVID-19 positive who have been hospitalized or who have been admitted to an intensive care unit (ICU) with those who have less severe disease or are asymptomatic but still positive for the virus. Hospitalization, admission to an ICU and requirement for respiratory support represent ad hoc definitions of severity that are robust enough to be captured across studies with heterogeneous designs. The COVID-19 HGI38 and the GenOMICC/ISARIC study28, in their main analyses, used population controls instead of individuals who are COVID-19 positive with non-severe disease. This can result in misclassification of controls because some controls might turn out to be cases if exposed to the virus. Nonetheless, this approach is more powerful than using individuals who are COVID-19 positive with non-severe disease as controls because of the large availability of population controls, especially within biobank studies38. In support of the usefulness of population controls, the results have been shown to be robust once a more appropriate control definition is used38.
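The trade-off between power and control misclassification can be made concrete with a rough calculation. The following is a minimal sketch, not the procedure used by the COVID-19 HGI or GenOMICC/ISARIC: the function names, the allelic odds-ratio model and every number in the usage note are illustrative assumptions.

```python
def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """Effective sample size of a case-control study: 4 / (1/n_cases + 1/n_controls)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def odds(p: float) -> float:
    """Convert a probability (here an allele frequency) to odds."""
    return p / (1.0 - p)


def observed_odds_ratio(true_or: float, control_freq: float,
                        latent_case_fraction: float) -> float:
    """Allelic odds ratio seen when some 'controls' are unrecognised cases.

    control_freq is the risk-allele frequency among true controls
    (hypothetical input); latent_case_fraction is the share of population
    controls who would be cases if exposed (hypothetical input).
    """
    # Allele frequency among true cases implied by the true odds ratio.
    case_odds = true_or * odds(control_freq)
    case_freq = case_odds / (1.0 + case_odds)
    # Population controls are a mixture of true controls and latent cases.
    mixed_freq = ((1.0 - latent_case_fraction) * control_freq
                  + latent_case_fraction * case_freq)
    return odds(case_freq) / odds(mixed_freq)
```

Under these assumptions, 5,000 cases with 500,000 population controls give an effective sample size of roughly 19,800, compared with 10,000 for 5,000 cases and 5,000 non-severe infected controls, whereas a true odds ratio of 1.5 at a control allele frequency of 0.2 is attenuated only to about 1.47 even if 5% of controls were latent cases.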

Disease-related symptoms

Some genetic studies have focused on a single symptom (for example, loss of taste and smell36) or on a combination of symptoms that can be used to detect undiagnosed COVID-19 cases39. Such study designs were particularly valuable in the absence of widespread testing, as at the beginning of the pandemic.

Complexity in the phenotype definitions

In addition to some of the limitations described above, there are several layers of complexity when studying infectious diseases such as COVID-19 (Fig. 1). First of all, although SARS-CoV-2 has spread rapidly, not all individuals in any population have been exposed at the time of study recruitment. Furthermore, this level of exposure is clearly time dependent throughout the pandemic. There are also large differences in socio-economic and demographic factors that contribute to viral exposure, such as ethnicity, job and age. When the whole population has not yet been exposed to the virus, those identified as cases or controls are not a random sample owing to the selection biases currently present in the population in question40. Ongoing vaccination programmes are also shifting the rates and demographics of infection, and there are large differences in epidemic management and inequalities between vaccination programmes across countries. The severity of the disease, as captured by hospitalization or ICU admission, is also dependent on the health practice in different countries, which might have also varied in different phases of the pandemic. Finally, different viral strains can affect infection susceptibility and COVID-19 disease severity. Host genetics can influence all of these stages from the socio-economic factors contributing to the chance of exposure, through infection and the development of initial symptoms, to progression to severe disease.

Fig. 1: Schematic of the disease progression trajectory for individuals exposed to SARS-CoV-2.

The black horizontal arrow shows the progression through different stages of coronavirus disease 2019 (COVID-19), and the decreasing cylinder sizes represent that only a subset of individuals at each stage progress to more-advanced disease states. The true stages of the disease do not always correspond to what is captured in most COVID-19 studies. For example, many asymptomatic individuals are not captured. Thus, the dashed ellipses represent ‘checkpoints’ that one needs to cross to be identified with a certain COVID-19-related phenotype and be included in most COVID-19 studies. Environmental and external factors (shown above the cylinders) influence not only the checkpoints but also the underlying chance and speed of transition between various stages of the disease. Each factor can influence various stages of disease progression, and some (for example, socio-demographic factors) affect each step in the progression from severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) exposure to death. On the bottom, we represent the impact of the host genome. The host genome affects each phase of disease progression either by acting directly on infection susceptibility and disease severity or via environmental factors.

Genetic findings

Genetic association studies can identify genomic regions linked to infection susceptibility and disease, but these studies are also susceptible to various biases that may arise during sample collection, data generation and processing. Furthermore, such findings require additional analyses and functional follow-up to pinpoint the specific variants and genes that directly affect the observed phenotypes. We next discuss the current findings primarily from the largest genetic studies for SARS-CoV-2 infection and COVID-19 disease. In Table 1 we summarize the key evidence for some of the most robust and interpretable associations and report our confidence for the suspected causal gene.

Table 1 Genetic loci associated with SARS-CoV-2 infection susceptibility and COVID-19 severity, including the putative causal genes

Rare variants

There is an extensive literature on rare variants that cause inborn errors of innate immunity that can result in severe, idiosyncratic outcomes from common infectious diseases. We refer readers to the work by Casanova and Abel41 for further details on the topic. These rare variants have typically been discovered by studying small family pedigrees and individuals with extreme phenotypic manifestations. By contrast, well-powered population-based WES and WGS studies have been lacking, and more widely available genotyping microarray data are not as useful for this purpose (for further information, see the sections covering common variants), as such variants can be extremely rare and specific to individual families. Sequence data have the advantage of capturing variants that have usually occurred in relatively recent generations or de novo and may have large effects on a disease outcome. Typically, variants with large effects remain at low frequency in the population or are purged owing to selective pressure. Rare non-synonymous coding variants are of particular interest because they can point directly to the causal gene and, thus, reveal potential therapeutic targets.

Van der Made and colleagues42 published one of the first studies on rare variants in the context of COVID-19 severity. They searched for rare non-synonymous and possibly damaging variants in a group of genes with known associations with immunodeficiencies. Their analyses on data from two families with affected males (brother pairs) pointed to X chromosome variants in the TLR7 gene, which is involved in the pathogen recognition pathway and innate and adaptive immunity. This finding has been replicated by Fallerini et al.7 in 561 individuals and Asano et al.5 in a larger sample of 1,533 individuals. Both studies and a further follow-up study by Mantovani et al.43 performed functional investigations highlighting the role of TLR7 loss of function in impaired type I interferon responses.

A larger case–control study was conducted by Zhang et al.44 by comparing exome sequence data from 659 patients with life-threatening COVID-19, including children, with data from 534 individuals with mild or asymptomatic COVID-19. They focused on 13 candidate genes previously associated with monogenic immunological disorders or that are involved in these pathways and concluded that at least 3.5% of patients with life-threatening COVID-19 pneumonia had genetic defects in some of these genes implicated in the type I interferon pathway.

Because the aforementioned studies focus on candidate genes instead of using a hypothesis-free genome-wide approach that requires a larger sample size and a more stringent significance threshold, the results need to be carefully scrutinized and replicated. Only the TLR7 association reached exome-wide significance in unpublished work by the COVID-19 HGI WES/WGS working group, which now includes up to 23,000 cases and 500,000 controls (G. Butler-Laporte, personal communication). A smaller study of 7,491 patients who were critically ill and 48,400 controls did not identify any significant rare variant associations45. This and two other studies31,46 have not been able to replicate the rare variant associations with the 13 immune genes reported by Zhang et al.44, despite substantially larger sample sizes. These differences may be partially due to different definitions of COVID-19 severity, age distribution and in silico versus experimental validation of non-synonymous variants47,48. The power of rare variant discovery in COVID-19 will be improved by increased sample sizes of WES and WGS data sets, which in time may provide conclusive associations. To summarize, TLR7 is currently the only gene consistently replicated for association of rare non-synonymous variants with severe COVID-19, although it is expected that more findings will be confirmed as studies increase in power.
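The difference in stringency between a candidate-gene analysis and a hypothesis-free exome-wide scan can be sketched as follows. This is an illustration only and does not reproduce the tests used in the cited studies: it assumes a simple Bonferroni correction, a round figure of 20,000 protein-coding genes for the exome-wide case, and a one-sided Fisher's exact test on carrier counts as a stand-in for a gene-level burden test. Any counts passed to it are hypothetical.

```python
from math import comb


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def fisher_one_sided(case_carriers: int, n_cases: int,
                     control_carriers: int, n_controls: int) -> float:
    """One-sided Fisher's exact p-value for carrier enrichment among cases."""
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    denom = comb(total, carriers)
    upper = min(carriers, n_cases)
    # Hypergeometric tail: probability of at least `case_carriers`
    # carriers falling among the cases when carriers are placed at random.
    tail = sum(
        comb(n_cases, k) * comb(n_controls, carriers - k)
        for k in range(case_carriers, upper + 1)
    )
    return tail / denom


# Thresholds for 13 candidate genes versus an assumed 20,000-gene exome.
candidate_threshold = bonferroni_threshold(13)
exome_wide_threshold = bonferroni_threshold(20_000)
```

With these assumptions the exome-wide threshold is more than a thousand times stricter than the candidate-gene one, which is why a carrier enrichment that looks convincing among 13 pre-selected genes may fall short in a genome-wide analysis of the same data.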

Common variant: introduction

Although rare variant studies for COVID-19 are still in their nascent phase, there is now robust, replicated evidence for multiple loci harbouring common variants associated with infection susceptibility and disease severity. These studies have mainly used microarray-based genotyping technology, which is scalable and cost-effective. Genotyping microarrays are designed to capture the more common variation across the genome using a sparse number of genetic markers in coding and non-coding regions, followed by statistical imputation of the remaining known sites of genetic variation, both common and rare. Genome-wide association studies (GWAS) using genotype data are powerful for capturing associations for variants with population frequency >0.1% that typically have mild to moderate effects on the phenotype49,50. Owing to the relatively quick and cheap generation of genotype data, GWAS have proved an important starting point for distinguishing between the genetic variants that affect susceptibility to SARS-CoV-2 infection and those increasing the risk of developing a severe form of COVID-19 disease once infected.
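Why array-based GWAS lose power for rarer variants can be seen from a standard large-sample approximation. The sketch below is a textbook-style calculation under an additive model on the log-odds scale, not an analysis from any cited study; the function name, the sample sizes and the odds ratio used in the usage note are assumptions chosen for illustration.

```python
from math import log, sqrt
from statistics import NormalDist


def gwas_power(n_cases: int, n_controls: int, allele_freq: float,
               odds_ratio: float, alpha: float = 5e-8) -> float:
    """Approximate power of a two-sided 1-df association test.

    Uses the small-effect approximation in which the standard error of the
    log odds ratio is 1 / sqrt(N * phi * (1 - phi) * 2 * f * (1 - f)),
    with N the total sample size, phi the case proportion and f the
    allele frequency.
    """
    n = n_cases + n_controls
    phi = n_cases / n
    # Non-centrality parameter of the chi-square statistic.
    ncp = (n * phi * (1 - phi)
           * 2 * allele_freq * (1 - allele_freq)
           * log(odds_ratio) ** 2)
    std = NormalDist()
    z_crit = std.inv_cdf(1 - alpha / 2)  # genome-wide threshold on the z scale
    shift = sqrt(ncp)
    # Probability that |Z + shift| exceeds the critical value.
    return std.cdf(shift - z_crit) + std.cdf(-shift - z_crit)
```

For a hypothetical study of 10,000 cases and 100,000 controls and an odds ratio of 1.2, this approximation gives power close to 1 at an allele frequency of 20% but well below 1% at a frequency of 0.1%, which is consistent with rare variants requiring either larger effects or much larger samples.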

Common variant: infection susceptibility

We have previously mentioned how current genetic studies can only imprecisely capture susceptibility to SARS-CoV-2 infection. Nonetheless, well-powered analyses clearly point to a group of loci that are associated with COVID-19 disease, but are not specific to disease severity. The COVID-19 HGI has recently formalized this observation, by developing a Bayesian framework to assign posterior probability for a variant to belong to either disease severity or susceptibility to infection51. Briefly, by contrasting effect sizes in severe COVID-19 with those seen in COVID-19 populations with severe cases removed, one can analytically distinguish those variants involved in susceptibility to infection (equal in the two groups when compared with controls) and those specifically involved in severe progressions that manifest uniquely or much more substantially in the severe group.
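The logic of contrasting the two sets of effect sizes can be illustrated with a simplified two-model comparison. This is a sketch under stated assumptions and is not the COVID-19 HGI implementation: it assumes that the two effect estimates are independent (which does not hold when controls are shared), that true effects follow a zero-mean normal prior with standard deviation `tau`, and that only two models compete, a 'susceptibility' model in which the effect is equal in both groups and a 'severity' model in which the effect is zero outside the severe group. All names and default values are hypothetical.

```python
from math import exp, log, pi


def log_bvn(x: float, y: float, var_x: float, var_y: float, cov_xy: float) -> float:
    """Log-density of a zero-mean bivariate normal evaluated at (x, y)."""
    det = var_x * var_y - cov_xy ** 2
    quad = (var_y * x * x - 2 * cov_xy * x * y + var_x * y * y) / det
    return -log(2 * pi) - 0.5 * log(det) - 0.5 * quad


def posterior_susceptibility(beta_inf: float, se_inf: float,
                             beta_sev: float, se_sev: float,
                             tau: float = 0.1, prior: float = 0.5) -> float:
    """Posterior probability that a variant follows the susceptibility model.

    beta_inf / se_inf: log odds ratio and standard error in infected
    individuals with severe cases removed, versus controls.
    beta_sev / se_sev: the same for severe cases versus controls.
    """
    v_inf, v_sev, t2 = se_inf ** 2, se_sev ** 2, tau ** 2
    # Susceptibility model: one shared true effect, so estimates covary by tau^2.
    log_s = log_bvn(beta_inf, beta_sev, v_inf + t2, v_sev + t2, t2)
    # Severity model: true effect is zero in the non-severe group.
    log_v = log_bvn(beta_inf, beta_sev, v_inf, v_sev + t2, 0.0)
    log_odds = log(prior) - log(1 - prior) + log_s - log_v
    # Numerically stable logistic transform.
    if log_odds >= 0:
        return 1.0 / (1.0 + exp(-log_odds))
    return exp(log_odds) / (1.0 + exp(log_odds))
```

A variant with similar, clearly non-zero estimates in both groups receives a posterior close to 1 for the susceptibility model, whereas a variant with an estimate near zero in the non-severe group and a strong estimate in the severe group receives a posterior close to 0.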

The strongest signal within the susceptibility group of loci is the ABO (histo-blood group ABO system transferase) gene, which was initially identified by the Severe Covid-19 GWAS group26. The ABO alleles determine an individual’s blood group by enzymatically catalysing the production of A and B antigens in human cells. There is now robust evidence that ABO is associated with susceptibility to SARS-CoV-2 infection, with both Shelton et al.32 and HGI38 reporting similar effect sizes for the infection susceptibility and disease severity phenotypes. The data suggest that individuals with O blood group, who have neither A nor B antigens, are protected against the viral infection (odds ratio (OR) ≈0.90). This result is consistent with several observational studies that found that blood group A was associated with infection susceptibility52. The exact mechanism is, however, unclear. It has been suggested that this association can be attributed to protective effects exerted by anti-A IgG antibodies and not the blood group itself53. Others have shown that the ABO variant associates with higher levels of CD209 protein, which has been shown to directly interact with the spike protein of SARS-CoV-2 (ref.54). Nonetheless, the association between ABO and susceptibility to infection adds to an extensive list of evidence linking blood type with infectious diseases55, including the recent observation by Shelton et al.32 that blood group O appeared to be a risk-increasing factor for influenza symptoms in the years before the COVID-19 pandemic.

A second infection susceptibility locus is ACE2, which is worth mentioning because the gene encodes a key protein involved in the viral entry pathway of SARS viruses2,3,4. GWAS by Horowitz et al.56 and COVID-19 HGI51 point to a protective variant (rs190509934) 60 bp upstream of the ACE2 gene. This variant, which is rare among individuals of European ancestry (0.2% in the Genome Aggregation Database (gnomAD)) but more common in South Asians (2.7%), was associated with a 39% reduction in ACE2 expression in liver tissues.

A third infection susceptibility signal lies in the 3p21.31 locus and is independent of the largest signal for severe COVID-19 disease, which is in the same region (Fig. 2). This rather surprising proximity has caused the susceptibility signal to be overlooked in some studies. Roberts et al.34 were the first to highlight the presence of a susceptibility signal in 3p21.31, and later the COVID-19 HGI38 showed that there are several independent signals (r² ≈ 0) associated with SARS-CoV-2 infection susceptibility, all located within the gene body of SLC6A20, which encodes an amino acid transporter protein that is known to functionally interact with the SARS-CoV-2 receptor ACE2 (ref.57). We discuss some of the functional work that has been done to decipher this locus in more detail in Box 1.
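Independence between two signals is usually judged by the squared correlation r² between the alleles at the two lead variants. The following minimal sketch computes r² from phased haplotypes coded as 0/1; the function name and the toy haplotypes in the usage note are illustrative and are not data from the 3p21.31 locus.

```python
from typing import Sequence


def ld_r2(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """Squared correlation r^2 between two biallelic variants.

    hap_a and hap_b hold the allele (0 or 1) carried at each variant by the
    same set of phased haplotypes, in the same order.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype lists must be non-empty and of equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    # Frequency of haplotypes carrying allele 1 at both variants.
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a variant is monomorphic")
    return d * d / denom
```

Two variants whose alleles always co-occur, such as `[0, 0, 1, 1]` and `[0, 0, 1, 1]`, give r² = 1, whereas `[0, 0, 1, 1]` and `[0, 1, 0, 1]` give r² = 0, the situation described for independent signals.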

Fig. 2: Genetic association patterns in the chromosome 3p21.31 region from COVID-19 HGI meta-analysis.

a,b | Locuszoom128 plots of the 3p21.31 locus for coronavirus disease 2019 (COVID-19) hospitalization (panel a) and reported infection (as a proxy for susceptibility to infection) (panel b) from the COVID-19 Host Genetics Initiative (HGI)51 release 6. Points are coloured based on r² linkage disequilibrium (LD) values to each lead variant, and the purple diamond represents the lead variant. c | Local LD structure of the region. The heatmap represents r² values among the significantly associated variants (plotted region is highlighted with background shading in panels a and b). The Neanderthal-derived 49.4 kb haplotype region with high LD97 is highlighted in darker grey background shading. The region displays patterns of long regions in strong LD and harbours within it several genes (annotated below panel b). Identifying the causal variants by statistical means in regions of long LD is challenging, as the lack of recombination events can lead to multiple variants having similar evidence for association in the locus. Causal variants at this risk-associated locus may have relevance for different ancestries given the different global frequencies of introgressed Neanderthal alleles. More information about this locus can be found in Box 1. Figure and legend provided by M. Kanai (laboratory of M.J.D.).

In addition to the three loci highlighted above, there are additional loci that can be linked to SARS-CoV-2 infection susceptibility and for which we describe the potential causal genes in Table 1.

Common variant: COVID-19 severity

GWAS of severity phenotypes (that is, hospitalization, admission to ICU or death due to COVID-19) have identified more loci than GWAS of infection susceptibility. The largest signal in the 3p21.31 locus was described by the Severe Covid-19 GWAS Group only 3 months after the declaration of the pandemic and posted as a preprint in June 2020 (ref.26). Given the relevance of this locus to severe COVID-19 disease we provide more detailed insights in Box 1 and show the associations graphically in Fig. 2.

The next leap in the discovery of new severity-associated loci came from the combined effort of the GenOMICC/ISARIC28 and COVID-19 HGI studies38,58. The GenOMICC/ISARIC study28 included just over 2,000 patients who were critically ill with COVID-19 from ICUs across the UK, and their strategy of enrichment for very severe cases resulted in improved power and discovery of eight new loci. The COVID-19 HGI38 provided replication for the GenOMICC/ISARIC study, and independent results were released online. The main findings from these severity analyses directly indicate that both genes involved in immune response and others involved in lung disease pathology are central to severe COVID-19 progression.

First, we highlight three instances in which genes modulating the immune response to viral infection are plausibly implicated. TYK2 has been extensively explored in the human genetic literature owing to its relevance as a potential therapeutic target for autoimmune diseases and cancer. Individuals with complete loss of TYK2 function present with immunodeficiencies59,60, whereas individuals heterozygous or homozygous for low-frequency hypomorphic variants that cause lowered TYK2 signalling (via decreased phosphorylated STAT) have a more complex presentation. Although these individuals do not seem to be impacted in health measures or mortality in a large cohort study and are in fact protected from common autoimmune diseases61, they are susceptible to tuberculosis infection owing to impaired immune signalling62,63,64. By current understanding, TYK2 is involved in balancing the cytokine response and is therefore an interesting target for drug development. Of note, the missense variant (rs34536443:G>C or p.Pro1104Ala) previously associated with protection from certain autoimmune diseases, increases the risk for severe COVID-19.

The second locus points to IFNAR2 (IFNα and IFNβ receptor subunit 2/3), which has been replicated in multiple studies28,38,56 and proposed as a druggable target through Mendelian randomization (MR) studies65. However, we note the close proximity between IFNAR2, IL10RB and IFNAR1, and it is not yet fully established that IFNAR2 is the only relevant gene in this locus. Patients with severe COVID-19 show evidence of a dysregulated type I interferon response to the SARS-CoV-2 virus66,67,68, and drugs inducing the interferon pathway in the early stages of infection have also been shown to be beneficial69. This could imply that the timing of either stimulation or down-regulation of the interferon pathway during the course of infection could affect the outcome in patients66.

The third locus overlaps the OAS gene cluster, which encodes proteins involved in viral clearance. Several lines of evidence point to OAS1 as the causal gene4,70,71: genetically predicted higher levels of circulating OAS1 are protective against severe COVID-19 (ref.4), and the causal haplotype is associated with decreased nonsense-mediated decay of OAS1 transcripts, and thereby potentially faster initial responses to viral infections and viral clearance71. Through a detailed functional study, Wickenhagen and colleagues72 showed that SARS-CoV-2 was inhibited by the action of OAS1 interacting with several regions of the SARS-CoV-2 genome, with the most prominent sites mapping to the first 54 nucleotides of the 5′ untranslated region, which is present in all SARS-CoV-2 positive-sense viral RNAs. These findings are interesting in the light of COVID-19 treatment, as OAS1-activating drugs already exist. Additionally, a recent targeted fine-mapping study identified a candidate causal splice variant, leading to a more active OAS1 enzyme and downstream antiviral activity73.
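Statements such as 'genetically predicted higher levels of circulating OAS1 are protective' typically rest on Mendelian randomization. In its simplest single-variant form, the causal estimate is the Wald ratio of the variant's effect on the outcome to its effect on the exposure. The sketch below shows that calculation under the usual instrumental-variable assumptions; it is not the analysis pipeline of the cited studies, and the function names and the numbers in the usage note are hypothetical.

```python
from math import exp
from typing import Tuple


def wald_ratio(beta_exposure: float, beta_outcome: float,
               se_outcome: float) -> Tuple[float, float]:
    """Single-instrument Mendelian randomization estimate (Wald ratio).

    beta_exposure: per-allele effect on the exposure (e.g. protein level).
    beta_outcome: per-allele log odds ratio for the disease outcome.
    Returns the log odds ratio per unit of exposure and a first-order
    standard error that ignores uncertainty in beta_exposure.
    """
    if beta_exposure == 0:
        raise ValueError("the instrument must have a non-zero effect on the exposure")
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)
    return estimate, se


def odds_ratio_ci(estimate: float, se: float,
                  z: float = 1.96) -> Tuple[float, float, float]:
    """Odds ratio with an approximate 95% confidence interval."""
    return exp(estimate), exp(estimate - z * se), exp(estimate + z * se)
```

For example, a hypothetical variant that raises protein levels by 0.5 standard deviations per allele and has a per-allele log odds ratio of -0.1 (standard error 0.02) for severe disease yields a Wald ratio of -0.2 per standard deviation, that is, an odds ratio below 1 whose confidence interval excludes 1.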

The other major insight gained from the human genetic findings comes from the overlap between genetic signals for COVID-19 severity and lung diseases. This overlap is consistent with the epidemiological evidence associating pre-existing lung conditions with COVID-19 severity74,75 and respiratory failure being the major cause of death among hospitalized patients with COVID-19 (ref.69). At least four loci associated with COVID-19 severity have been previously linked to interstitial lung disease, lung fibrosis, lung carcinomas and/or decreased lung function28,38. Genes harboured within these published loci include dipeptidyl peptidase 9 (DPP9), forkhead box protein P4 (FOXP4), surfactant protein D (SFTPD) and mucin 5B (MUC5B)51. The lead variant at the MUC5B locus (rs35705950-T) is associated with increased MUC5B expression in lung tissue76, which has been associated with muco-ciliary dysfunction and increased bleomycin-induced fibrosis in mice77. This specific variant is protective against severe COVID-19 but has the strongest known association with a substantially increased risk of idiopathic pulmonary fibrosis (IPF)76. This opposite direction of effect is intriguing given the concordant direction observed for two other genome-wide significant loci and the overall positive genetic correlation between IPF and COVID-19 (ref.78). Nonetheless, this result is also consistent with the MUC5B promoter variant being associated with twofold improved survival among patients with IPF79. For FOXP4, a promoter region signal is associated with increased COVID-19 severity38,51 and is also associated with increased expression of FOXP4. This specific variant is infrequent in samples with European ancestry and much more common in East and South Asia and in admixed Hispanic–Latino samples of the Americas80, underscoring the importance of taking a global approach for more comprehensive and equitable gene discovery. Importantly, this same association has been previously noted in lung cancer81,82 and in interstitial lung diseases83 (all in a concordant direction), suggesting another potential therapeutic target. For SFTPD, the missense variant identified by the HGI51 is consistent with emerging results pointing to the involvement of surfactant proteins in severe COVID-19 risk. Surfactant proteins are secreted by alveolar cells in the lung, and maintain healthy lung function and facilitate pathogen clearance84. SFTPD is involved in the immune response pathway and the SFTPD missense variant has been linked to reduced lung function and severe COVID-19 (ref.85).

Together with the other findings, these paint an overall picture in which variants in genes involved in upkeep of healthy lung tissue and maintenance of the immune system and its regulation upon viral exposure can affect the course of the disease in an individual.

The human leukocyte antigen (HLA) system orchestrates immune regulation, and the largest GWAS of common infections have implicated HLA in 13 of them86. Thus, it was thought that this region would have a prominent role in explaining variability in COVID-19 severity and infection susceptibility, yet the region is far from being the strongest signal in GWAS. However, associations for HLA class II have now been detected by GenOMICC/ISARIC28 and COVID-19 HGI51. Additionally, smaller targeted studies that were able to impute the HLA genotypes and thus gain better resolution of the region have also implicated HLA class I genotypes87,88. What is still needed are definitive large-scale studies that properly account for the complexity in linkage disequilibrium (LD) and ancestry differences in the region. Therefore, the lack of HLA associations from some GWAS of COVID-19 severity might partially reflect limitations of the study designs rather than a genuine lack of biological association. The recent availability of multi-ancestry HLA imputation panels89 and integration with imputation servers might facilitate this much-needed activity.

Overall, what has perhaps come as a surprise from GWAS of COVID-19 is how many loci point to plausible biology relative to other complex traits, particularly considering the challenges in defining a reliable and consistent phenotype during an ongoing pandemic. Nonetheless, these results have been mainly used to confirm existing biological hypotheses and have not yet provided profoundly novel insights into COVID-19 disease, thus highlighting the challenges in rapidly connecting variants to function.

Effect of age

The genetic architecture of complex disease is not fixed, and genetics tends to have a larger proportional contribution to disease burden in younger age groups90. Given the extreme importance of age as a risk factor for severe COVID-19 (refs26,91), age should be considered in genetic analyses. Some evidence is emerging for age-specific effects at candidate rare variant loci7,44 and one common risk locus29. Large meta-analyses with access to detailed individual-level data will be needed to better understand the relationship of age and severe disease, particularly for individuals with rare variants.

Effect of sex

Male sex is one of the most impactful epidemiological risk factors for hospitalization and severe respiratory syndrome due to COVID-19, but initially large-scale genetic studies did not report sex-specific effects for infection susceptibility or severe disease. However, some reports of sex-specific effects are starting to emerge for loci containing immune-related genes34,92. Moreover, the rare variants in the chromosome X gene TLR7 affect males and are associated with severe COVID-19 outcomes93. Overall, genetics is unlikely to explain much of the increased COVID-19 severity among men. The general lack of sex-specific factors is not totally surprising as the genetics of numerous, well-studied immune-mediated diseases that significantly differ in their prevalence between sexes have not demonstrated a significant contribution of sex-specific genetic factors to such differences.

Population genetics and ancestry differences

Epidemiological studies have shown that people from non-white ethnic backgrounds are more at risk of infection and of severe COVID-19 (refs17,94,95), raising questions about whether human genetics can explain some of these differences. Generally, non-genetic factors are much more relevant than genetic factors in explaining health disparities. However, the scale and diversity of participants in the COVID-19 HGI provide an opportunity to determine whether any of this difference might be explained by genetic variants that are risk factors for COVID-19 having higher frequencies in certain ancestries, and/or genetic variants having similar frequencies, but different magnitude of effects, across ancestries or environments.

Heterogeneity of variant effects across populations has been compared in several studies. Shelton et al.32 showed no significant difference in effect across several genetically defined ancestry groups at the most prominent risk loci, the 3p21.31 and ABO loci. However, with increasing sample sizes and improved representation of non-European ancestry groups, the COVID-19 HGI has recently reported a significantly different effect between ancestry groups for the FOXP4 locus51. Apart from this locus, the authors suggest that the observed heterogeneity at the remaining loci is more likely to be due to differences in study inclusion criteria (for example, variable definition of COVID-19 severity owing to different thresholds for testing, hospitalization and patient recruitment). Additionally, a smaller study by Parikh et al.96 used admixture mapping — a method of gene mapping that uses differential risk by ancestry to identify ancestry-specific effects — and identified two genomic regions associated within local ancestries, suggesting that some ancestry-specific effects might exist.
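A comparison of effect sizes across ancestry groups, as described above, is commonly formalized with a heterogeneity statistic. The following is a minimal sketch using Cochran's Q and the derived I² statistic; it is a generic illustration and not necessarily the procedure used in the cited studies. The function name and all effect sizes and standard errors are hypothetical.

```python
from typing import List, Tuple


def cochran_q(betas: List[float], ses: List[float]) -> Tuple[float, int, float, float]:
    """Cochran's Q for per-ancestry effect estimates at one variant.

    betas: per-group effect sizes (for example, log odds ratios).
    ses:   matching standard errors.
    Returns (Q, degrees of freedom, pooled fixed-effect estimate, I^2).
    """
    if len(betas) != len(ses) or len(betas) < 2:
        raise ValueError("need at least two groups with matching betas and SEs")
    # Inverse-variance weights: more precise groups count for more.
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    # Q sums the weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, pooled, i2


# Hypothetical effect estimates for one locus in two ancestry groups.
q, df, pooled, i2 = cochran_q([0.1, 0.5], [0.1, 0.1])
```

Under the null hypothesis of a shared effect, Q approximately follows a chi-squared distribution with the returned degrees of freedom, so a large Q relative to that distribution indicates heterogeneity. As noted above, such a signal can also arise from differences in study inclusion criteria rather than from ancestry itself.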

Whereas the magnitudes of effect at currently established loci seem to be consistent across ancestry groups, lead variants at several loci show substantial frequency differences across populations (see the example of the 3p21.31 locus in Box 1). Some of the differences can be explained by negative selection as in the case of TYK2 (ref.64). However, for other loci such as the 3p21.31 locus and the OAS gene cluster in which variants originated from Neanderthal introgression70,97, it is as yet unknown whether the introgression drove selection or whether (as for other loci) the allele frequency differences might simply be consistent with genetic drift. Overall, we do not observe any specific ancestry group with consistently higher or lower frequencies at established COVID-19-associated variants. However, in-depth analysis of this issue has not been conducted, and existing analysis reporting that signatures of adaptation might be linked to an ancient epidemic in East Asian populations did not use GWAS-associated loci98. Furthermore, as we do not know the exact causal variants for COVID-19 severity and susceptibility, it is difficult to draw conclusions even from accurate comparisons of ancestry-specific effect sizes. Beyond answering some key population genetics questions, more samples from diverse ancestries are needed to build a more comprehensive map of the effects of host genetics and to improve the statistical refinement of functional underpinnings of the loci associated with COVID-19, by, for example, co-localization and fine-mapping.

Overall, current evidence does not suggest that human genetics has a major role in explaining differences in COVID-19 severity and infection susceptibility across different ancestry groups. Thus, as for most health disparities, the differences observed between ancestry groups are most likely due to differences in environmental and socio-economic factors that affect an individual’s chance of contracting COVID-19 and/or obtaining rapid and effective health-care interventions upon infection. Larger sample sizes in continental ancestry groups other than Europeans will allow further investigation of these questions.

Clinical and public health impact

Genetic instruments to identify causal risk factors

Genetics can be used to identify risk factors and biomarkers that correlate with COVID-19 and to support causal relationships with new or established risk factors99,100,101. For example, large-scale genetic studies can identify shared genetic effects between COVID-19 and other traits. This is typically achieved using genetic correlations100. The main advantage of genetic correlations compared with phenotypic correlations is that risk factors and COVID-19 phenotypes do not need to be measured on the same set of individuals. Genetic correlations for genetic liability to SARS-CoV-2 infection or more severe disease have recapitulated most of the established phenotypic (clinical) correlations with severe COVID-19 (for example, increased body mass index (BMI), smoking, diabetes, ischaemic stroke and educational attainment)28,38. However, these results alone need to be interpreted with caution as they are subject to the same set of biases and confounders as standard epidemiological analyses, with the additional caveat that genetic studies are normally conducted on non-representative populations.

Genetic correlations can be combined with MR studies, which aim to identify causal associations between exposures and outcomes101,102. This MR approach can reveal which risk factors might be causal for COVID-19 severity and which might be merely comorbid. For example, the HGI used MR to show that type 2 diabetes (T2D) was not a causal risk factor for severe COVID-19, but instead the association might be mediated by increased BMI. However, the most valuable application of MR studies in the context of COVID-19 is to evaluate the causal relationship with protein products that are targets of currently licensed drugs (drug repurposing) or drugs in clinical development. Specifically, if a putative drug target can be shown to have a causal effect on COVID-19 severity, then there can be more confidence that targeting that protein might be able to modify the disease course. An important consideration when homing in on potential drug targets, though, is their potential pleiotropic effects; a drug target with specific downstream effects may be more desirable than modifying the function of a target that is involved in multiple pathways or biological processes. We note here that although MR analyses can pinpoint interesting candidates for follow-up, various in silico analyses and in vitro and in vivo models have a crucial role in preclinical target identification.
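To make the MR logic above concrete, the sketch below implements one widely used two-sample estimator: a per-variant Wald ratio combined by fixed-effect inverse-variance weighting (IVW). It is an illustration under simplifying assumptions (independent variants, valid instruments, first-order standard errors), not the specific analysis performed by the studies cited here. The function names and all summary statistics are hypothetical.

```python
import math
from typing import List, Tuple


def wald_ratio(beta_exposure: float, beta_outcome: float, se_outcome: float) -> Tuple[float, float]:
    """Causal estimate from a single variant: outcome effect / exposure effect.

    The standard error is a first-order approximation that ignores
    uncertainty in the exposure effect.
    """
    if beta_exposure == 0:
        raise ValueError("variant has no effect on the exposure; not a usable instrument")
    return beta_outcome / beta_exposure, abs(se_outcome / beta_exposure)


def ivw_estimate(instruments: List[Tuple[float, float, float]]) -> Tuple[float, float]:
    """Fixed-effect inverse-variance weighted mean of per-variant Wald ratios.

    instruments: (beta_exposure, beta_outcome, se_outcome) per variant.
    Returns (pooled causal estimate, its standard error).
    """
    numerator = 0.0
    denominator = 0.0
    for beta_x, beta_y, se_y in instruments:
        estimate, se = wald_ratio(beta_x, beta_y, se_y)
        weight = 1.0 / se ** 2
        numerator += weight * estimate
        denominator += weight
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical summary statistics: variant effects on an exposure
# (for example, BMI) and on a COVID-19 severity outcome.
instruments = [(0.1, 0.02, 0.01), (0.2, 0.04, 0.01)]
causal_effect, causal_se = ivw_estimate(instruments)
```

In practice, the IVW estimate is usually accompanied by sensitivity analyses that relax the assumption of no pleiotropy, which is the concern raised above for candidate drug targets.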

MR studies on COVID-19 have now suggested several proteins as potential drug targets, some of which are already targeted by existing drugs. For example, Gaziano et al.65 found the best potential for druggable COVID-19 targets to be IFNAR2 and ACE2, which are known players in immune response and SARS-CoV-2 entry, respectively. The GenOMICC/ISARIC study28 also performed MR for an a priori list of candidate genes, which were targets of drugs that at the time had been proposed as potentially effective treatments for COVID-19. Their analysis for causal associations with the risk of developing severe COVID-19 prioritized IFNAR2 and TYK2, which were previously implicated by GWAS. Another GWAS-implicated gene, OAS1, has also been supported by a study from Zhou et al.4 who investigated the levels of hundreds of circulating proteins in individuals (non-infectious state) and identified a causal relationship between higher plasma OAS1 levels and COVID-19 severity.

Perhaps the clearest example of where MR supports clinical findings is the IL-6 receptor (IL-6R). During the early pandemic, IL-6R inhibition was proposed as a potentially effective mechanism for treating severe COVID-19 (refs103,104). Elevated levels of IL-6, which is a known immune-stimulating cytokine, have been regarded as a biomarker of severe COVID-19 in hospitalized patients who have elevated or dysregulated immune responses15. An MR analysis by Bovijn et al.105 found a significant causal relationship between IL-6R genetic variants that resulted in reduced levels of the receptor and improved outcome in patients with COVID-19. Indeed, a recent meta-analysis of 27 randomized trials showed that administration of IL-6 antagonists, compared with usual care or placebo, was associated with lower 28-day all-cause mortality in patients hospitalized for COVID-19 (ref.106), supporting the results of the MR analysis. Some debate exists on how similar the mechanisms of action of the naturally occurring variants and the molecular inhibitors are, as Garbers and Rose-John107 have suggested that IL-6R inhibitors block both soluble and cell-bound IL-6R, thus eliminating the IL-6 signalling pathway, but functional genetic variants in the IL6R gene might instead affect the proportion of soluble to membrane-bound protein. Nevertheless, as the treatment has been shown to be beneficial, understanding the specific mechanisms of natural versus pharmacological modulation of the protein is likely to be of academic interest but will not affect the introduction of these drugs into clinical use in patients with COVID-19.

Polygenic scores

A polygenic score (PS; also known as polygenic risk score (PRS)) summarizes the measurable individual genetic risk for a chosen trait or disease based on the genotypes at several loci from GWAS. These scores are typically constructed either from variants only in loci that reach statistical significance for association with the trait or by additionally including variants across loci that did not reach genome-wide statistical significance. At a population level, PS alone or in combination with other risk factors can be used to assign an estimate of risk to each individual108,109. A few studies have now tried to calculate PSs for COVID-19, but these have so far been generally weakly powered, and most variation in the phenotype explained by PS is due to the inclusion of a few of the most significant signals, for example, the 3p21.31 locus29,31,56.
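As a minimal sketch of the construction described above, a PS can be computed as a weighted sum of an individual's risk-allele dosages, with per-variant weights taken from GWAS effect sizes. The variant identifiers, weights and dosages below are hypothetical placeholders rather than values from any study, and skipping ungenotyped variants is a simplifying assumption of this sketch.

```python
from typing import Dict


def polygenic_score(dosages: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of risk-allele dosages over the variants in the score.

    dosages: risk-allele count (0 to 2) per variant for one individual.
    weights: per-variant effect size from a GWAS (for example, log odds ratio).
    Variants absent from `dosages` are skipped (assumption for this sketch).
    """
    return sum(weights[v] * dosages[v] for v in weights if v in dosages)


# Hypothetical three-variant score for a single individual.
weights = {"variant_a": 0.5, "variant_b": 0.25, "variant_c": -0.125}
dosages = {"variant_a": 2, "variant_b": 1, "variant_c": 0}
score = polygenic_score(dosages, weights)
```

Restricting `weights` to genome-wide significant variants, or extending it to sub-threshold variants, corresponds to the two construction strategies mentioned above. When one weight is much larger than the rest, the score is dominated by that locus, as reported for 3p21.31.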

A clinical application for PS of SARS-CoV-2 infection susceptibility or severity is unlikely in the short term. First, in a clinical setting, genetic information is not routinely collected at scale or available for consultation by clinicians. Second, although many risk prediction tools for COVID-19 have been developed110,111,112, to our knowledge none has been used in clinical practice. Thus, it would be unlikely for a COVID-19 PS to be widely adopted. However, there might be some value for PS in identifying individuals who are at higher risk of developing severe COVID-19 symptoms amongst younger individuals without pre-existing risk factors. A study by Nakanishi et al.29 showed that in COVID-19-positive individuals younger than 60 years, a single genetic risk factor (the 3p21.31 locus) can be as predictive of death and respiratory failure as some established comorbidities such as T2D. Nonetheless, more research is needed not only to evaluate more powerful PSs, but also to address inherent limitations such as the lack of PS transferability across ancestry groups.

Research applications of PS are nonetheless valuable. PS can be used to summarize our current knowledge on the genetic risk factors that underlie infection susceptibility and COVID-19 severity, and this summary can then be tested against other outcomes. For example, are individuals at higher genetic risk more likely to develop vaccine breakthrough infection, to experience more severe side effects or to develop post-COVID syndrome?

In conclusion, GWAS results can be used to construct PSs that are valuable for research purposes, but are unlikely to have a clinical value in the short term.

Conclusions and future perspectives

Genetic association studies have been exceptionally fast in delivering new genetic signals underlying COVID-19 severity and infection susceptibility. On a sobering note, these discoveries have had a limited impact on the management of the COVID-19 pandemic thus far, and it is our hope that the next phase of the pandemic will see more application of human genetics results and better functional insights. Here, we provide some perspective on the key opportunities ahead for the field, while taking for granted that increased sample size will fuel new discoveries.

Expanding COVID-19 phenotypes and post (long) COVID-19

As reviewed here, most genetic studies of COVID-19 to date have focused on pinpointing factors that make some individuals more susceptible to SARS-CoV-2 infection and explaining why others develop severe symptoms. However, with ever-expanding understanding of the disease and the data collected, future genetic studies may expand to investigating, at scale, particular symptoms associated with the infection or severe comorbid conditions such as multisystem inflammatory syndromes113,114,115. Furthermore, some individuals who have contracted COVID-19 experience long-term symptoms that may result in a considerable health burden in the years to come116,117. There is large variability in the symptoms experienced by those affected by post (long) COVID-19 (refs116,117). Human genetics can be helpful in this context because some of the post-COVID-19 symptoms have directly or indirectly been studied by GWAS. For example, one might test the hypothesis that COVID-19 accelerates existing genetic predispositions to some of the symptoms. Together with observational epidemiological analysis, MR can be used as an additional pillar to triangulate evidence of causal relationship between COVID-19 and downstream consequences. Global networks such as the COVID-19 HGI can play a key part in such undertakings because they bring together studies with different designs, including biobank studies with longitudinal medical information pre- and post-infection and direct-to-consumer studies that can capture self-reported symptoms on a large number of individuals.

Interaction between host genomes and viral genomes

The interaction between host and viral genomes is surprisingly understudied, partially reflecting the lack of interaction between the corresponding scientific communities, but, most importantly, the lack of studies in which both types of information have been collected at scale118,119. A recent report120 showed that the protective effect of the sickle cell allele of host HBB against severe malaria is not detected in the presence of certain alleles in the parasite’s genome. These parasite alleles are particularly common in strains found in Africa, illustrating the importance of host–pathogen interaction analyses for understanding regional disease epidemiology and selective pressures in infectious disease. Variability in symptoms and resulting disease severity have also been observed across SARS-CoV-2 strains121,122, but it is not clear whether the underlying host genetic factors are the same. Parikh et al.96 have conducted an initial study combining viral and human genetic data, but they did not find significant results from the phylogenetic information constructed from the viral RNA. To overcome the lack of large samples, one might perform targeted studies focusing on genome-wide significant loci or PSs. Additionally, with recent temporal waves of disease dominated by delta and then omicron variants, the time and location of infection could potentially be used to infer a proxy for the likely variant.

Vaccination response and breakthrough infections

Rollout of vaccines brings challenges and opportunities to the study of the human genetic epidemiology of COVID-19. On one hand, the different strategies employed by countries can shape the epidemic differently in different parts of the world, inevitably changing the major demographic groups who become infected or severely affected by the disease, and can ultimately challenge the interpretation of genetic discoveries. On the other hand, widespread vaccination opens the possibility to study vaccination side effects and breakthrough infections. Bolze et al.35 have reported that individuals who carry the HLA-A*03:01 allele were more likely to experience severe difficulties with daily routine after vaccination. For other more severe and rare side effects, it will be of paramount importance to leverage existing international collaboration to obtain robust and replicable results.

Data sharing

Although this pandemic has shown the importance of rapid data sharing, open methodological reporting and academic–commercial partnership science, the sharing of individual-level data is still far from being a reality. Widespread, yet safe, access to individual-level data can foster discoveries and methodological developments beyond what is currently possible with sharing of summary statistics. Yet despite repeated evidence showing that study participants endorse data sharing123,124,125, legal and data protection challenges have hindered these efforts within and beyond the human genetics community126. Consortia such as the COVID-19 HGI38,51,58 have clearly demonstrated the impact of transparent science: despite the challenges of the pandemic, they set common goals early on and prioritized the sharing of resources and data, and the result was one of the largest genetic studies ever performed, with representation from almost every continent. These types of effort should be considered as a roadmap to future collaborative initiatives. Currently, with the exception of the UK Biobank and a small subset of the HGI initiative (EGAC00001002188), there is no large data set with human genetic and COVID-19 disease information that is accessible to the entire scientific community via established repositories. We hope the next phase of the pandemic will see a shift in the attitude towards sharing of individual-level data.

Outlook for COVID-19 host genetics

Continued investigations into host genetic factors that contribute to severe COVID-19 and susceptibility to SARS-CoV-2 viral infection will be essential to maximize the chances of finding new therapeutic avenues to treating the disease, whether it be through drug repurposing or the longer-term endeavour of new drug development. These findings should be integrated with multi-omics results to provide clearer biological insights. As for any other complex disease, genetic risk prediction is likely to add value to clinical risk prediction in a hospital setting for identification of patients who are more likely to develop further severe symptoms, and thus continued efforts on the identification of risk factors and the development of predictive biomarkers are warranted. Host genetics is not the sole key to cracking the code to successful and effective treatment of COVID-19, but with continuation of open science and partnerships between academia, industry, health-care providers and policy-makers, we will hopefully see large leaps towards that goal in the near future.