Introduction

One of the powerful and widely used methods to detect associations between phenotypes and genetic variants is the genome-wide association study (GWAS), which analyzes genetic variants in common diseases. This method has been proved useful through an extreme increase in published GWAS results over time from its introduction (~2008) to the present (MacArthur et al. 2017). Some studies have recently reviewed novel techniques and methodologies for data pre-processing and GWAS methodologies, which increase the power of the analysis and help achieve accurate results from GWAS (Mortezaei and Tavallaei 2021; Tam et al. 2019).

Although GWASs can identify disease mechanisms, leveraging the wealth of GWAS-implicated loci and inferring truly causal variants is the main bottleneck that leads to gaps between genetic studies and therapeutic applications (Schaid et al. 2018). To close these gaps, post-GWAS pipelines have been developed (Box 1). Based on cell culture-based experiments and biological post-GWAS functional studies, candidate causal variants can be identified, and genetic variants in haplotypes associated with diseases can be defined. Post-GWAS analysis can identify genes functionally related to specific diseases and more quickly connect the functional part of the genome with clinical applications (Lin et al. 2018); for instance, the post-GWAS analysis helped gain new insight into causal germline variants and their impact on the aetiology of prostate cancer and translate genetic variants into therapeutic and clinically meaningful results (Farashi et al. 2019).

Meta-analysis of GWASs can be performed to increase the power of association detection by analyzing more genomic variants in human and nonhuman species. For example, a meta-analysis of GWASs was performed to discover flavor-associated single-nucleotide polymorphisms (SNPs) in tomatoes. The results indicated that, in comparison with traditional cherry tomatoes, the majority of alleles associated with high sugar levels have been lost in modern cultivation. Such results can provide new insight into the genetics of tomato flavor and how to control it (Zhao et al. 2019a). Further, a GWAS meta-analysis for multiple myeloma in humans identified suggestive novel risk alleles that could better capture disease risk in individuals (Du et al. 2020).

In post-GWAS analysis, another applicable technique is cross-phenotype association analysis, which refers to cases when loci or genes have significant associations with multiple traits. One of the limitations of quantitative trait analysis approaches, such as GWAS, is the challenge of identifying epistasis and pleiotropy. Epistasis refers to the influence of genetic mutations on other mutations, and pleiotropy refers to a phenomenon in which a single locus can control multiple phenotypic characteristics. Epistasis can cause pleiotropy, and pleiotropy is known to be an underlying cause of cross-phenotype associations (Polster et al. 2016).

This review considers the challenges of detecting truly causal variants from GWAS, identifying the functions of identified loci, and understanding the contributions of most identified loci to the pathogenesis of complex traits and the subsequent application of post-GWAS analysis techniques to overcome these challenges. Then, different post-GWAS integrative and cross-phenotype association analysis methods, considering epistasis and pleiotropy, that can provide valuable information from GWAS results, are discussed. The post-GWAS study design is indicated in Fig. 1. In addition, Box 1 summarizes the objectives and types of the post-GWAS method reviewed in this paper and key post-GWAS methods, including LD score regression, genetic correlations, and polygenic risk scores (PRSs), are discussed in Box 2.

Fig. 1: Post-GWAS study design.

One study set is shown in the large circle. Significant SNPs detected from the GWAS results can be causative variants, shown in the small area, or variants in LD with causal variants, for which functional annotations are required, shown in another area. The post-GWAS integrative analysis, shown in the circles below, combines GWAS results with somatic or transcriptomic data and can be used to boost GWAS power. Such analysis and its results for each phenotype can be used in cross-phenotype associations and phenotypic comparisons, using another study set shown in a differently labeled circle.

Causality detection

According to the American College of Medical Genetics and Genomics, a genetic variant is causative if it is involved in the development of a specific phenotype. These phenotypes can be human diseases, behaviors, morphology (e.g., height), economic success, and also behavioral, functional, and productive phenotypes in plants and animals (Richards et al. 2015). It is possible that a pathogenic variant is not causative; this can happen when nonfunctional and benign variants are involved in the pathogenesis of the patient’s phenotype. Depending on the genomic architecture, the density of genotype data, selection signals, and genotyping technologies (including whether SNP array designs include SNPs inside known genes), GWAS can identify the causative loci or those in linkage disequilibrium (LD) with them. Common tag SNPs can identify causal variants, their variance proportion, and their true effect size to help account for the missing heritability in GWAS (Boudellioua et al. 2017). A method called PhenomeNET Variant Predictor (PVP) was developed that compares gene-phenotype associations with patient phenotypes; this approach considers patients’ phenotypic similarities to rank potential candidate genes and facilitate causal variant identification. PVP relies solely on the phenotypes of the modeled organism. It is only applicable when variants are in known disease genes, and it does not provide information regarding oligogenic or digenic inheritance. DeepPVP (Boudellioua et al. 2019) and OligoPVP (Boudellioua et al. 2018) are methods developed from PVP that can be employed for causal variant detection in complex traits. The performance of PVP has been evaluated in a previous study on congenital hypothyroidism (CH) for the detection of potentially pathological variants; that study analyzed a series of exomes in the UK10K dataset, and the results indicated likely causative variants (Boudellioua et al. 2017).

To examine the effect of causal variants on a specific disease, Mendelian randomization (MR) is a suitable approach. The name MR comes from Mendel’s law of independent assortment when an individual’s genotype is formed randomly during segregation (Grover et al. 2017). In recent years, diverse MR methodologies have been developed, and the selection of an appropriate method depends on considering a combination of conditions such as data availability, number of SNPs, and correlations between SNPs. For example, two-stage least squares, limited information maximum likelihood, inverse variance weighted, MR-Egger, weighted median regression, multivariable MR, Bayesian MR, structural mean models, and generalized methods of moments are different MR strategies that provide causal estimates for genetic instruments (Kou et al. 2020). There are also some R packages, such as TwoSampleMR and PathD, and a STATA package called MRrobust for MR analysis (Davis et al. 2018). One previous study has employed MR in osteoporosis for causal variant inference and potential risk factor detection. Horizontal pleiotropy, LD, population stratification, trait heterogeneity, the complexity of association, dynastic effects, clinical period effects, selection bias, and weak instrument bias are some limitations of MR that can make it more complicated (Kou et al. 2020).
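As a rough illustration of one of the strategies listed above, the fixed-effect inverse variance weighted (IVW) estimator can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used by TwoSampleMR or any other package: the function name and argument names are hypothetical, the SNPs are assumed to be independent (no LD between instruments), and the uncertainty in the SNP-exposure effects is ignored (a first-order approximation).

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW causal estimate from summary statistics (sketch).

    beta_exposure: per-SNP effect sizes on the exposure
    beta_outcome:  per-SNP effect sizes on the outcome
    se_outcome:    per-SNP standard errors of the outcome effects

    Assumes independent instruments and ignores the standard errors of
    the exposure effects (first-order approximation).
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have the same length")
    if not beta_exposure:
        raise ValueError("at least one SNP is required")
    if any(se <= 0 for se in se_outcome):
        raise ValueError("standard errors must be positive")

    # Weighted regression of outcome effects on exposure effects through
    # the origin, with weights 1 / se_outcome^2.
    numerator = sum(
        bx * by / se ** 2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)
    )
    if denominator == 0:
        raise ValueError("all exposure effects are zero")

    estimate = numerator / denominator
    std_error = math.sqrt(1.0 / denominator)
    return estimate, std_error
```

Each SNP contributes a ratio estimate (outcome effect divided by exposure effect), and the IVW estimate is the precision-weighted average of those ratios. The sketch does not address horizontal pleiotropy or weak instruments, which are among the limitations noted above.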

On the other hand, propensity score approaches are conditional probability assignments that can be applied in population-based genetic association studies to obtain valid estimates and address confounders such as disease and patient characteristics or genetic ancestry. For example, the combination of principal components and propensity scores (PCAPS) can be used to address confounders due to population stratification. The advantage of using PCAPS is the ability to detect true associations and reduce false-positive findings in GWASs by capturing and summarizing the variability in principal component analysis (PCA). PCA can be carried out on GWAS results using EIGENSOFT software, whose predictions in the logistic model can be employed in PCAPS estimation. Compared to other PCA methods, PCAPS can correctly identify false-positive results. PCAPS has been examined in testicular cancer as a practical and innovative way to correct for GWAS population stratification and identify false positives (Zhao et al. 2018). Another method is the propensity score adjustment method (PSAM), which uses estimated propensity scores to adjust for the influences of epistasis or correlations. This method tests for single-locus associations and uses genetic variant interactions or correlations to adjust for their effects and account for the missing heritability. The PSAM methodology starts with SNP subset selection and estimation of propensity scores and disease associations for each SNP. Next, univariate logistic regression is used for each SNP, and stepwise multivariate logistic regression is performed using the logit model. Without increasing the model complexity, the PSAM can increase the power of logistic regression tests for single-point association analysis when accounting for factors such as missing heritability. PSAM can be employed to determine treatment and outcome association. Furthermore, some spurious treatment-outcome associations caused by covariate confounders can be removed using PSAM. Seven simulated data types were used to evaluate the PSAM performance, and the results indicated a 15% improvement in the power of disease association identification compared to other methods. Afterward, applying PSAM to rheumatoid arthritis and immunity identified significantly associated SNPs (Rai et al. 2018). One limitation of propensity score approaches is that they do not ensure balance in unmeasured confounders and cannot substitute for randomization. The methods mentioned for causality detection, along with their applications and limitations, are explained in Table 1.

Table 1 Methods in causality detection.

Functional annotations

In the identification of disease-associated genetic variants, although the GWAS method is powerful, it cannot directly address genetic association signals, which are a set of variants within a locus that can influence target genes and are associated with a complex trait (Cannon and Mohlke 2018). To address such problems, post-GWAS analysis is performed by predicting the genes identified from reported GWAS variants that are most likely to be associated with the disease (Gallagher and Chen-Plotkin 2018). The post-GWAS analysis can use eQTL (Box 1), genetic and ontology data and co-functional gene networks to predict disease-related genes. The post-GWAS analysis can also consider associations between promoters and regulatory elements to predict disease-related genes distal or proximal to regulatory elements or GWAS signals. Such post-GWAS analysis can identify disease genes and then score such variants to prioritize the most likely signals (Broekema et al. 2020). For example, in a case study of Alzheimer’s disease, post-GWAS analysis identified 131 highly scored putative risk genes among 552 candidate genes (Lin et al. 2018). Furthermore, pathway analysis and Gene Ontology (GO) (The Gene Ontology Consortium 2019) terms, mammalian phenotypes (Weng and Liao 2010), and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (Kanehisa et al. 2017) can be combined to analyze such results and identify the most likely candidate genes. Post-GWAS pathway analysis has been successfully employed to identify novel risk pathways and biological mechanisms of type 2 diabetes (Liu et al. 2017).

On the other hand, fine-mapping approaches, which usually combine functional annotations and statistics, can be applied to variants. Such studies usually include genotyping arrays for studying specific SNPs, statistical approaches for the detection of causal SNPs, and functional annotations (Osgood and Knight 2018). Usually, genes close to GWAS-identified SNPs are assumed to be high-risk genes, and distant genes are ignored. To address this problem, an integrated post-GWAS analysis of schizophrenia has been performed to identify distant disease risk genes through regulatory elements (Lin et al. 2016).

Within a locus, independent association signals can be determined using fine-mapping approaches that rely on stepwise conditional analysis involving targeted re-sequencing (Salomon et al. 2016) and imputation (Howie et al. 2012). Then, a credible set can be defined using posterior probabilities from a Bayesian approach. Next, the functional annotations of the credible set can be determined using National Institutes of Health (NIH) Roadmap studies (Romanoski et al. 2015) or the Encyclopedia of DNA Elements (ENCODE) (ENCODE Project Consortium 2012). For example, in type 1 diabetes, 50 susceptibility loci were examined using a Bayesian fine-mapping approach (Onengut-Gumuscu et al. 2015); in addition, a Bayesian approach has been successfully used to detect significant loci associated with 22 traits in the Kaiser cohort (Majumdar et al. 2018).
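The step of defining a set of candidate variants from posterior probabilities can be sketched as follows. This is a minimal illustration, assuming that per-variant posterior probabilities of causality have already been computed for one locus under a single-causal-variant model; the function name, the 95% default coverage, and the variant identifiers in the usage note are hypothetical choices, not taken from the cited studies.

```python
def credible_set(posterior_probs, coverage=0.95):
    """Smallest set of variants whose posterior mass reaches `coverage`.

    posterior_probs: dict mapping variant ID -> posterior probability of
                     being the causal variant at this locus.
    coverage:        target cumulative posterior probability (assumed
                     default of 0.95).
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    total = sum(posterior_probs.values())
    if total <= 0:
        raise ValueError("posterior probabilities must sum to a positive value")

    # Rank variants from most to least probable.
    ranked = sorted(posterior_probs.items(), key=lambda kv: kv[1], reverse=True)

    selected = []
    cumulative = 0.0
    for variant, prob in ranked:
        selected.append(variant)
        cumulative += prob / total  # normalise in case inputs do not sum to 1
        if cumulative >= coverage:
            break
    return selected
```

For example, with hypothetical posteriors `{"rs1": 0.6, "rs2": 0.3, "rs3": 0.07, "rs4": 0.03}`, the first three variants are retained, and functional annotation can then be restricted to that subset.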

Overall, the value of the generated data is strongly related to the selected tissue or cell type. Genome editing based on clustered regularly interspaced short palindromic repeats (CRISPR) is another approach for identifying causal variants by introducing deletion/insertion mutations in a locus (Cong et al. 2013). For example, this approach was successfully applied in a study on Parkinson’s disease (Soldner et al. 2016). In addition, a GWAS-identified locus can be edited to match orthologues of other loci. In such approaches, genome editing can make precise changes, such as SNP mutations, to identify important gene regulatory regions (Bauer et al. 2013).

Usually, causal variants coincide with regions associated with transcription factor (TF) binding sites of chromatin interactions or histone modification and open chromatin (Rivandi et al. 2018). Data on the locations of DNA methylation, open chromatin, histone modification, TF binding sites, DNA expression and other regulatory features are publicly available from ENCODE (https://www.encodeproject.org/) (ENCODE Project Consortium 2012), the NIH roadmap epigenomics project (Zhou et al. 2015), the FunciSNP R package (http://www.bioconductor.org/packages/release/bioc/html/FunciSNP.html) (R Core Team 2012), RegulomeDB (http://www.regulomedb.org/) (Boyle et al. 2012) and HaploReg (http://archive.broadinstitute.org/mammals/haploreg/haploreg.php) (Ward and Kellis 2011). Inferring the mechanism of causal variants is complicated because GWAS-identified loci may regulate multiple RNAs or target genes. To address this challenge, information about gene expression and chromatin interaction, regulatory data and bioinformatics developments can be useful (Rivandi et al. 2018).

Enrichr (http://amp.pharm.mssm.edu/Enrichr) is a comprehensive resource containing a collection of gene sets and their biological knowledge to further analyze GWAS results. The number of annotated gene sets in Enrichr is more than 180,184 (Kuleshov et al. 2016). Another web-based platform that can be used for the functional annotation of GWAS results and the prioritization of causal genetic variants is FUMA (http://fuma.ctglab.nl). It provides insight into the biological implications of genetic variants by combining biological data repositories and tools (Watanabe et al. 2017). In addition, the Expression Modifier Score (EMS), which is trained on eQTL fine-mapping, is a genomic score method used to predict the regulatory effects of variants on gene expression and could leverage epigenetic marker prediction. Compared with other genomic score methods, the EMS has higher prediction accuracy and is useful for regulatory variant prioritization. Initially, score bins were predicted for that method, and then the fraction of positively labeled samples was calculated to scale the output score and derive the EMS. The EMS has been validated and applied to the statistical fine-mapping of QTLs. Then, using the UK Biobank (UKBB) dataset (Bycroft et al. 2018) for hematopoietic traits, the Finucane lab (https://www.finucanelab.org/) used the EMS to prioritize putative causal variants in non-coding regions (Wang et al. 2021).

One of the most important challenges in the field of GWAS is that most significant SNPs identified through GWASs fall outside of coding regions; thus, the function and contribution of most loci to the pathogenesis of complex diseases are largely unknown (Mortezaei et al. 2017). Thus, it is critical to understand the biological functions, roles and disease effects of genetic variants. The detection of functional genetic variants in non-coding elements is discussed in Box 3.

Post-GWAS integrative analysis

GWAS results can be compared with prior findings to obtain more valuable genetic results that can be used for real-world medical applications. For GWAS summary-level data, a comprehensive collection can be accessed using the GWAS Central database (Beck, Shorter T 2019) or the GWAS database (GWASdb) (http://jjwanglab.org/gwasdb) (Lin et al. 2016) to obtain unified and combined data. GWAS Central is a collection of metadata and GWAS summary-level data from many sources, including the Open Access Database of Genome-wide Association Results (Johnson and O’Donnell 2009), the National Human Genome Research Institute-European Bioinformatics Institute (NHGRI-EBI) (Buniello et al. 2019), and published or unpublished GWAS data. One of the advantages of using GWAS Central is that all summary-level data in that database are available for use, rather than only results with significant p-values (Beck, Shorter T 2019). In comparison with GWAS Central, GWASdb contains a larger number of GWAS publications that studied population-specific traits (Lin et al. 2016).

Integrating GWAS results with other resources, such as clinical findings, co-functional genes, somatic mutations, metabolite-transcript correlation, and eQTL data, can provide valuable information about the genetics of quantitative traits (Wang et al. 2016). For example, network-based integrative analysis of GWAS biological signals with networks of co-functional genes provided an opportunity to augment GWAS findings and detect highly probable candidate genes in association with quantitative traits in Arabidopsis thaliana (Lee and Lee 2018). Integrative analysis has also been applied to combine GWAS findings with a network of metabolite-transcript correlations for Arabidopsis. This strategy can be used to identify gene-metabolite associations and discover novel genes in relation to the metabolites (Wu et al. 2016). Applying large-scale integrative analysis of GWAS data with methylation QTLs could also identify multiple disease-specific genes and pathways and provide novel insight into their genetic mechanisms (Zhao et al. 2017). In addition, the integration of GWAS results with epigenomic data can be achieved by applying the GWAS3D database for the identification of genetic variants with the ability to affect regulatory elements such as enhancers and promoters. Evaluation of GWAS3D was successfully performed for plasma low-density lipoprotein cholesterol to prioritize regulatory variants (Li et al. 2013).

It has been shown that integrative analysis that links genetic variations with their biological roles and functions is important and useful in genetic prediction. Integrative analysis of GWAS results has revealed some genes in association with obesity-related phenotypes, considering their contribution to the regional fat distribution (Ahn et al. 2019). In addition, in previous studies, some hub genes in relation to milk yield in Mediterranean buffaloes were found using co-expression network analysis and GWAS data (Deng et al. 2019).

Germline variants and somatic mutations

Single-cell analysis can be employed, for example, in studies of cancer, a disease caused by uncontrolled proliferation and invasiveness and by somatic mutations (Ren et al. 2018). For instance, somatic single-nucleotide variants in bone marrow were discovered by performing enhanced whole-genome sequencing (Petti et al. 2019). In addition, mosaic chromosomal alterations of specific tissues, a form of somatic mutation, can be detected using genome re-sequencing or array genotyping data. As an illustration, it has been previously reported that the risk of hematological cancer is ten times higher in individuals with mosaic chromosomal alterations (Loh et al. 2018). Another study employed SNP-array data from the UK Biobank to detect mosaic chromosomal alterations in relation to blood cancer (Loh et al. 2020).

For an analysis of prostate cancer, 305 individuals with aggressive tumors and 52 control samples were selected from The Cancer Genome Atlas (TCGA) (Cancer Genome Atlas Research Network et al. 2013). In addition, 61 germline variants associated with prostate cancer were downloaded from the GWAS catalogue database (Welter et al. 2014), and information about somatic mutations was obtained from the Catalogue of Somatic Mutations in Cancer (COSMIC) (Tate et al. 2018). Possible genetic cooperation and oncogenic interactions between germline variants and somatic mutations were then investigated. For enrichment analysis of germline and somatic mutations, Ingenuity Pathway Analysis (IPA) can be used (Kramer et al. 2014). The results highlighted the power of post-GWAS integrative analysis to determine the biological context of aggressive prostate cancer (Mamidi et al. 2019). Another study by Wu et al. (2019) integrated germline and somatic mutations for carcinogenesis-related gene identification in triple-negative breast cancer. As a result, 237 genes were discovered that were functionally related to germline and somatic mutations. These functionally related germline and somatic mutations can be used for prognostic marker identification and the development of prevention strategies (Wu et al. 2019).

In addition, using gene expression data, germline and somatic mutation information has been integrated, and 124 common genes associated with prostate cancer have been identified. In this study, to gain insight into the biological function of germline and somatic mutations, molecular networks of differentially expressed genes were generated and biological pathway analyses were performed using IPA (Kramer et al. 2014). The results of such analyses can be used to discover interactions between germline and somatic mutations and the putative functional bridges between them (Mamidi et al. 2019). In addition, the results of such studies can demonstrate that the somatic evolution of tumors can be affected by germline variants. The existence of germline and somatic mutation interactions can indicate the existence of some cooperation between such mutations, although the mechanism of such interactions has not been investigated, and more research is required (Jia and Zhao 2016).

Relations among germline variants, somatic mutations, and genetic drug targets of complex human disorders can be employed to provide new insights into complex human diseases. The genetic findings of such studies can be translated into clinical applications (Chen et al. 2019). For example, such integrative analyses of the genetics of cancer (Ung et al. 2016) and the genetics of neurodegenerative diseases (NDs) (Mortezaei et al. 2019) have been reported. These studies can help identify genetic modules with clinical roles in the initiation, development, and treatment of complex human disorders, such as cancer or NDs.

In such studies, a directed functional interactome, node classes of germline variants, somatic mutations and drug targets for complex human diseases were combined, and the relative positions of the node classes were identified by network analyses. For the created node classes containing germline variants, somatic mutations and drug targets, the genetic functional interactions were downloaded from the Reactome database (http://www.reactome.org/pages/download-data/) (Jassal et al. 2019). As indicated in Fig. 2, through integration of germline variants, somatic mutations, and drug targets via network-based analysis, the hierarchical structure of the networks was also evaluated to compare the roles and importance of elements in those biological networks. As a result, all such studies revealed that drug targets are the most important factors functionally influencing others, followed by somatic mutations and germline variants (Mortezaei et al. 2019; Ung et al. 2016).

Fig. 2: Integrative network analysis.

A network of germline variants from GWAS results, with networks of somatic mutations and genetic drug targets, was used to create an integrated network containing germline variants, somatic mutations and genetic drug targets. Then, the results of integrative genetic analysis performed using hierarchical network analysis and network centralities were used to identify novel candidate genes important in the pathogenesis of complex diseases. As an example, the results of such an analysis, used to assess the importance of gene mutations in cancer and neurodegenerative diseases in humans, are shown. The integrative genetic analysis shows that somatic mutations related to cancer or neurodegenerative diseases in humans have strong, independent effects on their genetic drug targets, which can be used for individual treatments.

Proteins that bound to drugs in nonhuman studies were shown to have some homologs in humans; these homologs were identified as potential drug targets in humans and retrieved from the Protein Data Bank (PDB) database (wwPDB Consortium 2019). Somatic and germline mutations also occur in nonhuman species such as animals and plants. It has been demonstrated previously that the rate of somatic mutations is higher than that of germline mutations in animals such as mice, and the rates of both kinds of mutations are higher than those of mutations in humans. In plants, somatic mutations occur during mitotic cell division in gametophytes or sporophytes, and gametic mutations occur during meiosis (Milholland et al. 2017). Considering similarities or differences between genetic drug targets and germline or somatic mutations in humans and nonhuman species, such integrative analysis can be performed for all kinds of species.

GWAS and transcriptomic data

Integrating GWAS and eQTL data provided novel susceptibility genes in relation to obesity and some clues for studies of their mechanisms (Liu et al. 2018). GWAS data can also be integrated with eQTL and protein-protein interaction data to detect disease-associated genes and prioritize candidate genes. With such studies, one can go beyond GWAS, eQTL, and protein-protein interaction approaches (Wang et al. 2018). When integrating GWAS with transcriptome data on complex traits, estimation of the causal effects of gene expression can be performed by applying the MR approach. Transcriptome-wide Mendelian randomization (TWMR) is a multivariable MR approach integrating GWAS summary-level data with eQTLs to estimate the causal effects of gene expression on complex traits. Previously, TWMR has been successfully applied to assess gene expression’s causal associations with 43 complex traits (Porcu et al. 2019).

A previously conducted study analyzed GWAS results using regulatory datasets, such as eQTL, to identify causal variants (Lin et al. 2018). Another study concentrated on integrating GWAS results with eQTL data for disease gene identification, and then the strength of candidate genes was scored using ontology datasets (Peat et al. 2020). Another study combined GWAS results with eQTL data to identify Alzheimer’s disease-associated genes and prioritize significant SNPs (Zhao et al. 2019b). Functional enrichment analysis of SNPs and eQTL-based SNP ontology platforms have been constructed before, which is helpful to identify significant SNPs in association with complex diseases, such as neurodegenerative disease (Li et al. 2016).

From the genotype-tissue expression (GTEx) consortium, eQTL data were downloaded and integrated with GWAS summary statistics for body mass index to identify signals with the same causal variants. The results of such analyses are tissue specific, which indicates that different tissues and molecular mechanisms are involved (The GTEx Consortium 2013). Another study integrated GWAS data with eQTL data from 44 tissues selected from the GTEx project. In that study, regulatory variations were used to assess GWAS tissue specificity and to discover causal genes in multiple tissues. Several approaches have been used for such integrative analysis to identify genetic variations of different diseases. These approaches include heritability analysis, enrichment analysis, linking contributions of tissue-specific eQTLs, and true positive rate estimation (Mortlock et al. 2020). Then, in such studies and for all tested GWAS tissues, Bonferroni correction was used to assess significant GWAS-trait pairs. Finally, a gene-set enrichment analysis was used to test for GWAS and eQTL target genes. For instance, to identify potential risk alleles and causal genes, eQTL has been applied on gene expression, genotype data from the GTEx project for colon tissue, and TCGA data for colorectal tumor tissue (Loo et al. 2017).
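The Bonferroni step mentioned above can be sketched as follows. This is a minimal illustration rather than the procedure of the cited studies: the function name, the default family-wise error rate of 0.05, and the idea of keying results by (tissue, trait) pairs are assumptions made for the example.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the tests that pass a Bonferroni-corrected threshold.

    p_values: dict mapping a test label, e.g. a (tissue, trait) tuple,
              to its unadjusted p-value.
    alpha:    family-wise error rate to control (assumed default 0.05).
    """
    n_tests = len(p_values)
    if n_tests == 0:
        return {}
    # Each individual test must reach alpha divided by the number of tests.
    threshold = alpha / n_tests
    return {label: p for label, p in p_values.items() if p <= threshold}
```

With four tests and alpha = 0.05, for instance, only tests with p-values at or below 0.0125 are retained. Bonferroni correction is conservative when tests are correlated, as is often the case across tissues.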

Cross-phenotype associations

It has also been demonstrated that detected loci or genes can have significant associations with multiple traits, referred to as cross-phenotype associations (Li et al. 2017). Some cross-phenotype association tests can be used to boost the power of GWASs. For example, multiple GWASs have demonstrated associations of a gene desert on chromosome 8p24 with chronic lymphocytic leukemia and colon, breast, prostate, bladder and ovarian cancers (Turnbull et al. 2010). Other studies have demonstrated that a functional variation of the PTPN22 gene is associated with systemic lupus erythematosus, type 1 diabetes, Graves’ disease and rheumatoid arthritis (Solovieff et al. 2013).

Statistical methods for detecting cross-phenotype associations have been broadly classified into univariate and multivariate analyses (Box 4) (Broadaway et al. 2016). For example, multivariate analysis of variance (MANOVA) can be used in regression of multiple phenotype analysis (Yang et al. 2019). One of the methods that can be applied in this direction is Fisher’s combined P value method (Li and Zhu 2017). In addition, a linear mixed model can be used for multivariate analysis. Table 2 summarizes some properties, including the advantages/disadvantages of using univariate vs. multivariate approaches (Saccenti et al. 2014). In cases where the phenotypes are non-normally distributed or categorical, cross-phenotype analysis can be performed using modifications of original regression, generalized estimating equations and the Bayesian framework (O’Reilly et al. 2012). There exist some multi-phenotype methods, such as MV-BIMBAM software (Shim et al. 2015) for multivariate association analysis, which uses a Bayesian model comparison, and multivariate-linear mixed model (MV-LMM), which can be used for related phenotypes. For MV-LMM, two types of optimization algorithms can be used: an expectation-maximization (EM) algorithm followed by a Newton-Raphson (NR) algorithm, which combines the stability of EM with the faster convergence of NR (Zhou and Stephens 2014). As an illustration, multivariate, univariate, and bivariate analyses have been successfully performed on a European population of 43,870 cardiovascular and neurological diseases (Zhang et al. 2019).
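Fisher’s combined P value method, mentioned above, can be sketched without external libraries because the chi-square survival function has a closed form when the degrees of freedom are even. This is a minimal illustration assuming independent tests, which may not hold for correlated phenotypes; the function name is hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method (sketch).

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k is the number of
    p-values. Returns (statistic, combined_p).
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0 < p <= 1) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    statistic = -2.0 * sum(math.log(p) for p in p_values)

    # Survival function of chi-square with 2k degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term = 1.0
    series = 1.0
    for i in range(1, k):
        term *= half / i
        series += term
    combined_p = math.exp(-half) * series
    return statistic, combined_p
```

For two p-values p1 and p2, the combined value reduces to p1·p2·(1 − ln(p1·p2)), which offers a quick check of the implementation.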

Table 2 Univariate vs. multivariate analysis.

The phenome-wide association study (PheWAS) (Box 4) approach can be used to investigate cross-phenotype associations, to demonstrate genetic architecture in relation to genes and pleiotropy (when one locus affects more than one trait or phenotype) and for diagnosis (Bush et al. 2016). In network analyses (Box 4), SNP-SNP interaction networks from GWASs have been constructed graphically, and the BridGE approach (Fang et al. 2017) was applied to identify single or multiple biological pathways enriched for those interactions. For example, SNP-SNP interaction analysis has been performed for cardiovascular risk in autoimmune diseases, which helped classify them into more closely associated groups (Perrotti et al. 2017). When BridGE searches for within-pathway, between-pathway or hub pathway models, the results of such pathway-level interaction analysis can provide useful information about increasing or decreasing the risk of related diseases (Wang et al. 2017). For instance, significant interactions have been identified between type 2 diabetes, Parkinson’s disease, breast cancer, prostate cancer, and schizophrenia by applying BridGE (Fang et al. 2019). This kind of network analysis has some limitations, including phenotypic differences across studies and limited individual-level phenotype or genotype information from networks of summary statistics. These limitations were addressed in a study by Verma et al. (2019) in which a single-source electronic health record (EHR) (Agrawal and Prabakaran 2020) was used for specific definitions of phenotypes and an individual genotyping platform was applied. Finally, the results of 31,017 PheWASs were used to create a disease-disease network (DDN). For example, DDNs have been successfully constructed to identify genetic similarities between diseases, such as rheumatoid arthritis, type 1 diabetes, and multiple sclerosis (Verma et al. 2019).

Electronic health record (EHR)-based PheWASs can be used to identify cross-phenotype associations, construct DDNs on the basis of shared associations and understand genetic similarities between diseases. An EHR-based comprehensive PheWAS has been performed to provide the landscape of associations across diseases and quantitative traits (Verma et al. 2018). The results of such analyses revealed previously reported associations between type 1 diabetes, morbid obesity and a primary hypercoagulable state (Wellcome Trust Case Control Consortium 2007). In addition, a large number of genetic variants indicated strong connections between autoimmune disorders such as type 1 diabetes, psoriasis, rheumatoid arthritis and multiple sclerosis. This indicates that even if different types of tissues are affected in each autoimmune disorder, they all share similar genetic components via shared genetic pathways (Tettey et al. 2015).

DDNs are bipartite networks that can be constructed and visualized using Gephi software (https://gephi.org), in which statistical packages can be used as plug-ins for network analysis. In the analysis of DDNs, one of the key goals is the identification of strongly linked diseases within and between disease classes and the identification of meaningful connections. In addition, integrating genetic functional knowledge with association results can broaden our understanding of biologically relevant findings (Halu et al. 2019). Then, DDNs and epigenetic knowledge can be integrated to examine tissue-specific changes (Verma et al. 2019).
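
The idea of linking diseases through shared associations can be sketched in a few lines of Python. This is an illustrative simplification and not the pipeline used in the cited studies: the function name, the `min_shared` threshold, and the toy SNP and disease labels are all hypothetical. Two diseases are connected when they share at least `min_shared` associated SNPs, and the edge weight is the number of shared SNPs.

```python
from collections import defaultdict
from itertools import combinations


def build_ddn(associations, min_shared=1):
    """Build a weighted disease-disease edge list from (snp, disease) pairs.

    Returns a dict mapping (disease_a, disease_b) -> number of shared SNPs,
    with each pair stored in alphabetical order.
    """
    # Group diseases by the SNP they are associated with.
    diseases_by_snp = defaultdict(set)
    for snp, disease in associations:
        diseases_by_snp[snp].add(disease)

    # Every pair of diseases sharing a SNP gains one unit of edge weight.
    edge_weights = defaultdict(int)
    for diseases in diseases_by_snp.values():
        for a, b in combinations(sorted(diseases), 2):
            edge_weights[(a, b)] += 1

    # Keep only edges supported by enough shared SNPs.
    return {edge: w for edge, w in edge_weights.items() if w >= min_shared}


# Toy placeholder input: significant PheWAS hits as (SNP, disease) pairs.
toy_associations = [
    ("snp1", "disease_A"),
    ("snp1", "disease_B"),
    ("snp2", "disease_A"),
    ("snp2", "disease_B"),
    ("snp2", "disease_C"),
    ("snp3", "disease_D"),
]

ddn = build_ddn(toy_associations)
for (a, b), weight in sorted(ddn.items()):
    print(f"{a} -- {b}: {weight} shared SNP(s)")
```

An edge list of this form can be written to a delimited file for import into network tools for visualization and community detection.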

Network analysis, such as community detection, can also be applied to extract subnetworks of diseases that are biologically relevant. Various community-detection techniques, such as Louvain’s method (Blondel et al. 2008), can be run in Gephi software (Bastian et al. 2009) for subnetwork detection. For such analysis, only SNPs in the enhancers of specific tissues were considered. As a result, the liver has been found to have the largest number of associated diseases, such as hyperlipidemia, essential hypertension, chronic non-alcoholic diseases, cardiovascular diseases, morbid obesity and cirrhosis of the liver (Verma et al. 2019). Network analysis can be extended to include associations between EHR clinical laboratory measures and genetic variants to conduct large studies based on gene-trait associations. Another study by Verma et al. (2018) used RNA-Seq data from the Roadmap Epigenomics project for genetic association prioritization based on gene expression measures. This study first calculated the correlation between gene expression and chromatin state, then generated a gene expression measurement matrix and finally performed regression analysis between binary chromatin model measures and gene expression. This type of analysis can be applied to improve the understanding of the effects of genetic variations on phenotypes when explorations beyond those of protein-coding regions are possible. Epigenetic knowledge helps identify associated diseases and their biological relevance in the context of cross-phenotype associations (Gonzalez-Serna et al. 2020). The cross-phenotype methods are discussed further and classified in Box 4.

Using principal component-based methods for cross-phenotype associations in flies, some regulatory loci have been identified that jointly associate with multiple metabolic pathways. In addition, the cross-phenotype association test in Drosophila was used to detect metabolism-related genes. These results can be applied for genotype-to-phenotype mapping of metabolic traits. In addition, in Drosophila, cross-phenotype association mapping has been used to examine starvation resistance, glucose content and body weight (Nelson et al. 2016). This study applied phenotype measurements from the Drosophila Genetic Reference Panel and the SMAT R package (Schifano et al. 2013) for the given traits. These analyses revealed significant associations between triglyceride levels, starvation resistance and CG7560 and cht12 loci. In cross-phenotype tests using enriched association signals, starvation resistance revealed associations with genes enriched in ventral cord development and glucose content. These results illustrated that characters affecting the central nervous system are associated with hyperactivity during starvation.

The Enhancing Neuroimaging Genetics through Meta-analysis (ENIGMA) consortium started more than ten years ago, aiming to study neuroimaging genetics on a large scale. Its analyses used more than 50,000 individuals to identify genetic markers robustly associated with brain function and structure, and found more than 200 loci significantly associated with brain variations. Afterward, ENIGMA applied multivariate methods to address the challenges of quantifying complex relationships in brain networks. In addition, the cross-disorder groups of ENIGMA used multiple genomic data to answer transdiagnostic questions. An exemplar approach for these groups is examining brain organization in patients with psychiatric disorders and their unaffected first-degree relatives (Thompson et al. 2020).

Epistasis and pleiotropy

Epistasis refers to a case in which genetic mutations are influenced by the presence or absence of other genetic mutations. Therefore, the expression of genes in a locus is altered by another locus. In such a case, at different loci, multiple genes interact with each other to affect a trait. Epistasis effects are known to be one of the factors underlying missing heritability. This is because epistasis can reveal genetic interactions and provide insights for complex genotype and phenotype mapping that cannot be achieved from association studies. Within gene regulatory networks and biological pathways, the result of physical interactions between biomolecules is called biological epistasis. On the other hand, genotype and phenotype relationships summarized using mathematical modeling and their deviations from additivity are called statistical epistasis. Therefore, biological epistasis and statistical epistasis provide two different perspectives and are consistent with strategies such as gene-gene, SNP-SNP, and protein-protein interactions (Slim et al. 2018).
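
The notion of statistical epistasis as a deviation from additivity can be made concrete with a two-locus regression model. The formulation below is one common way to express it and is offered as an illustration only; the symbols are introduced here and do not come from the cited studies. Here $y$ is a quantitative phenotype, $g_A$ and $g_B$ are genotypes at two loci coded as allele counts (0, 1 or 2), and $\varepsilon$ is a residual error term.

```latex
y = \beta_0 + \beta_A\, g_A + \beta_B\, g_B + \beta_{AB}\, g_A g_B + \varepsilon,
\qquad
H_0:\ \beta_{AB} = 0 \quad \text{(purely additive effects)}
```

Under this model, the two loci act additively when $\beta_{AB} = 0$, and a nonzero interaction coefficient $\beta_{AB}$ is the signal of statistical epistasis. Whether such a signal reflects biological epistasis, that is, a physical interaction between biomolecules, cannot be decided from the regression alone.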

On the other hand, pleiotropy is a phenomenon in which a single locus influences or controls multiple phenotypic characteristics. Pleiotropy is known as an underlying cause of cross-phenotype associations. However, cross-phenotype associations are more general than pleiotropy: they can arise from biological pleiotropy, phenotypic causal relationships, spurious associations, study design, and confounder biases. In genetic epidemiologic studies, cross-phenotype associations are often incorrectly interpreted as examples of pleiotropy, although pleiotropy is only one possible explanation for a cross-phenotype association. Therefore, a careful dissection of cross-phenotype associations is necessary for the detection of true pleiotropic loci. Both the univariate and multivariate methods mentioned in the “Cross-phenotype association” section are pleiotropy informed, considering cross-phenotype correlations (Salinas et al. 2017).

The ubiquity and the relation of pleiotropy with human effect size can be examined using the GWAS catalog as a comprehensive database (Welter et al. 2014). It has been found that nearly half of the genes in that database are associated with more than one disease, and that number will continue to increase. In addition, in the UniProt database, ~12% of protein-coding genes were identified as pleiotropic (The UniProt Consortium 2018). In the case of pleiotropy, in which a gene is associated with more than one phenotype, the genes are likely to be involved in many biological processes and to have a strong phenotypic effect. In quantitative trait analyses such as GWASs, one of the limitations is the inability to detect epistasis and pleiotropy (Polster et al. 2016).

Since pleiotropy depends on genetic interactions, epistasis can cause pleiotropic variation at a locus. This is because the way genes affect more than one trait depends on their interactions with other genes. Therefore, for the evolution of pleiotropy, genetic variations caused by epistasis are necessary. Previous studies have identified epistasis and pleiotropy in all types of organisms, such as plants, viruses, bacteria, and humans. For example, it has been identified that epistasis and pleiotropy play fundamental roles in human immunodeficiency virus (HIV) infection and multiple drug resistance (Polster et al. 2016).

Modeling of discoveries from GWASs can be performed using effect direction meta-analysis (EDME) for pleiotropy quantification. EDME has been applied in cattle to discover trait variation and better understand the biology of complex traits. In that study, EDME of GWASs on cows and dairy bulls was performed to discover pleiotropic variants, their related effects and the biology behind each complex trait (Xiang et al. 2020).

Phenotypic comparisons

Following the GWAS and cross-phenotype association analysis approach, a study by Gu et al. (2019) concentrated on genetic similarities between phenotypes, employing a statistical approach that is analogous to Fisher’s probability test for a set of SNPs relating to different molecular traits. Based on this approach, the similarities between phenotypes can be determined using functionally related genes and their common molecular mechanisms. As an example, in that study, it was found that breast cancer, prostate cancer, lung cancer, fasting glucose and fasting insulin were clustered together. It has also been determined that common molecular underpinnings, such as AMP/GMP signaling, insulin/NADPH oxidase/ROS and apoptosis, play important roles in connecting the mentioned phenotypes (Gu et al. 2019).
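
As a rough illustration of the family of tests that the approach above is described as resembling, the sketch below computes a one-sided hypergeometric tail probability (the basis of Fisher’s exact test) for the overlap between the gene sets implicated for two phenotypes. It is not the exact procedure of Gu et al. (2019); the function name, its parameters and the example counts are hypothetical.

```python
from math import comb


def overlap_p_value(n_universe, n_a, n_b, n_shared):
    """One-sided hypergeometric test for gene-set overlap.

    n_universe: total number of genes considered
    n_a, n_b:   number of genes implicated for phenotype A and phenotype B
    n_shared:   number of genes implicated for both
    Returns P(overlap >= n_shared) if the two sets were drawn independently.
    """
    max_overlap = min(n_a, n_b)
    if not (0 <= n_a <= n_universe and 0 <= n_b <= n_universe):
        raise ValueError("set sizes must lie between 0 and n_universe")
    if not (0 <= n_shared <= max_overlap):
        raise ValueError("n_shared must lie between 0 and min(n_a, n_b)")

    # Number of ways to choose set B so that exactly k of its genes fall in A,
    # summed over all overlaps at least as large as the observed one.
    tail = sum(
        comb(n_a, k) * comb(n_universe - n_a, n_b - k)
        for k in range(n_shared, max_overlap + 1)
    )
    return tail / comb(n_universe, n_b)


# Hypothetical counts: 20,000 genes, 150 and 200 implicated, 12 shared.
p = overlap_p_value(20000, 150, 200, 12)
print(f"overlap p-value = {p:.3e}")
```

A small p-value suggests that two phenotypes share more implicated genes than expected by chance, which is one simple way to quantify similarity between phenotypes before clustering them.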

GWAS summary-level data is an informative source of pooled data that can be used to compare phenotypes within and between species. For example, GWAS Central (Beck et al. 2019) allows unified data visualization and interrogation by restricting the display of risk alleles, and GWASdb (Lin et al. 2016) integrates comprehensive resources for data content extension and population studies. The inclusion of additional ontologies such as those in the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is an example of the extension of information about semantic phenotypes that can be applied for future planning (Al-Hablani 2017). In addition, the reverse GWAS (RGWAS) approach uses the genetic basis of multiple traits from GWAS results to classify phenotypes and produce homogeneous subtypes of samples. RGWAS has two steps: the initial step includes multitrait GWAS dataset clustering with the regression method, and the second step is biological assessment between those clusters. The “rgwas” R package (https://github.com/andywdahl/rgwas) is available for RGWAS implementation. RGWAS can handle residual trait correlations, covariates, quantitative traits and mixed binary data. For example, RGWAS has been successfully applied to recover stress subtypes in depressive disorder and identify metabolic traits (Dahl et al. 2019). One limitation of this approach that needs to be addressed in the future is that it cannot propagate first-step uncertainty.

To understand the biology of diseases, their prognosis and their treatment, it may be essential to distinguish their subtypes. For example, different subtypes of breast cancer have been distinguished by prognoses, population structure, treatment responses, and different genetic risk factors (Iqbal et al. 2015). Gene-environment interactions (Fairfax and Knight 2014), disease misclassification (The Brainstorm Consortium 2018), and gene-gene interactions (Fang et al. 2019) create distinct subtypes of complex diseases.

Phenotypic comparisons can also be performed via post-GWAS integrative analysis. For example, as shown in Box 5, integrating germline variants from GWASs with somatic mutations revealed similarities between cancer and NDs.

Conclusions

In the last decade, the number of established SNP-trait associations from GWASs has increased dramatically, but the determination of causal variants from them remains a major challenge. Post-GWAS analysis can be used to address such challenges in the detection of causal variants from GWASs and to determine their mechanisms of action. On the other hand, it is critical to understand the biological functions of genetic variants and how they can affect diseases. This requires interpreting the functions and contributions of most loci to the pathogenesis of complex diseases, considering the fact that most significant SNPs identified through GWASs fall outside of coding regions. Additionally, to identify functional genetic variants, GWAS results can be combined with chromatin features, genome-wide maps and genetic transcription data to help overcome challenges in the field.

In addition, integrative analysis of GWAS results with co-functional genes, clinical findings, eQTL data and metabolite-transcript correlations can provide valuable information about the genetics of complex diseases, and such results can be translated into clinical applications. For example, post-GWAS integrative analysis has revealed that the somatic evolution of tumors can be affected by germline variants. Such interactions between germline variants and somatic mutations can result from cooperation between them. This interaction mechanism has not been investigated, and further research is required to answer this question.

Furthermore, the challenges of detecting epistasis and pleiotropy in quantitative analysis approaches, such as GWASs, can be overcome using post-GWAS analysis. Pleiotropy, in which genetic variants contribute to multiple traits, is an underlying cause of cross-phenotype associations. It has also been concluded that in GWASs, focusing on one trait can result in missing the opportunity to evaluate multiple phenotypes, especially when cross-phenotype associations exist; in such cases, phenome-wide data can be used to improve the statistical power of genetic association studies. Similarities between phenotypes can be measured from GWAS summary-level data. The results of such analyses followed by enrichment analysis can facilitate the development of effective treatment and prevention options for complex diseases. For example, GWAS results were used to compare human cancer with NDs, which indicated that the genetic drug targets for both kinds of diseases were responsible for initiating signaling cascades. Another common conclusion from analyses of GWAS results for both cancer and NDs is that drug targets and somatic mutations correspond to bottleneck proteins that can transfer signals to the nucleus. Such a conclusion can be used to develop a framework for future studies and to better understand the genetics of different complex diseases. This review also discussed, with examples, how post-GWAS methods, which can be used to weight the results, can be applied to both humans and nonhuman species.