A clustering approach to improve our understanding of the genetic and phenotypic complexity of chronic kidney disease

Eoli, A.; Ibing, S.; Schurmann, C.; Nadkarni, G. N.; Heyne, H. O.; Böttinger, E.

doi:10.1038/s41598-024-59747-4

Download PDF

Article
Open access
Published: 26 April 2024

A clustering approach to improve our understanding of the genetic and phenotypic complexity of chronic kidney disease

A. Eoli^1,2,6^na1,
S. Ibing^1,2,6^na1,
C. Schurmann^1,2,6,
G. N. Nadkarni^4,5,
H. O. Heyne^1,2,4,6 &
…
E. Böttinger^1,2,3,4,6

Scientific Reports volume 14, Article number: 9642 (2024) Cite this article

194 Accesses
2 Altmetric
Metrics details

Subjects

Abstract

Chronic kidney disease (CKD) is a complex disorder that causes a gradual loss of kidney function, affecting approximately 9.1% of the world's population. Here, we use a soft-clustering algorithm to deconstruct its genetic heterogeneity. First, we selected 322 CKD-associated independent genetic variants from published genome-wide association studies (GWAS) and added association results for 229 traits from the GWAS catalog. We then applied nonnegative matrix factorization (NMF) to discover overlapping clusters of related traits and variants. We computed cluster-specific polygenic scores and validated each cluster with a phenome-wide association study (PheWAS) on the BioMe biobank (n = 31,701). NMF identified nine clusters that reflect different aspects of CKD, with the top-weighted traits signifying areas such as kidney function, type 2 diabetes (T2D), and body weight. For most clusters, the top-weighted traits were confirmed in the PheWAS analysis. Results were found to be more significant in the cross-ancestry analysis, although significant ancestry-specific associations were also identified. While all alleles were associated with a decreased kidney function, associations with CKD-related diseases (e.g., T2D) were found only for a smaller subset of variants and differed across genetic ancestry groups. Our findings leverage genetics to gain insights into the underlying biology of CKD and investigate population-specific associations.

Multi-phenotype genome-wide association studies of the Norfolk Island isolate implicate pleiotropic loci involved in chronic kidney disease

Article Open access 30 September 2021

A catalog of genetic loci associated with kidney function from analyses of a million individuals

Article 31 May 2019

Multivariate canonical correlation analysis identifies additional genetic variants for chronic kidney disease

Article Open access 09 March 2024

Introduction

Chronic kidney disease (CKD) is a primarily asymptomatic disease characterized by a gradual loss of kidney function over a period extending from several months to years¹. CKD affects approximately 9.1% of the global population, with a higher prevalence in high-income countries². The leading risk factors for developing CKD are diabetes (40% of cases) and hypertension (29% of cases), followed by heart disease, family history of CKD, and obesity³. Other factors, such as exposure to HIV and contaminants, are additionally important in low-income countries^4,5. Genetic ancestry also plays a crucial role, with increased risk rates of kidney failure in Black/African Americans and Hispanics/Latinos compared to individuals of European ancestry⁶. If left untreated, CKD increases the mortality risk for individuals with cardiovascular disease (CVD) and can result in the complete loss of kidney function⁷. Therefore, early detection is critical for improving quality of life and life expectancy. During the early stages of CKD, cost-effective treatment options are available and can be tailored to the cause of the disease⁸.

CKD is defined by a reduced functionality of the kidneys, which limits its filtering capability over a period of at least three months⁹. The main biomarkers for CKD detection include the urinary albumin/creatinine ratio (ACR) and the estimated glomerular filtration rate (eGFR)¹. While ACR facilitated diagnosing albuminuria—an indicator of kidney damage characterized by an elevated excretion of urinary albumin—the eGFR estimates the filter volume of the glomerulus per unit of time using different biomarkers such as serum creatinine¹. An abnormal kidney activity is indicated by high ACR values, reduced eGFR, or both.

Over the past few decades, many large-scale genomic studies, such as genome-wide association studies (GWAS), have successfully identified more than 500 independent genetic variants associated with reduced kidney function^10,11,12. The association between genetic variants and various phenotypes has been studied, and the results are often shared in publicly available databases, like the GWAS Catalog¹³. The association of one genetic variant with multiple traits can be considered to identify secondary traits associated with a phenotype. This understanding can help elucidate potentially shared disease mechanisms, assuming that genetic variants affecting a shared pathway have a similar impact on the associated traits.

Soft-clustering methods provide a means to reduce the genetic complexity of a heterogeneous disease while also accounting for shared disease mechanisms. In contrast to hard-clustering approaches like K-means or hierarchical clustering, soft-clustering enables the factorization of high-dimensional data by identifying overlapping clusters¹⁴. Non-negative matrix factorization (NMF) is a family of algorithms within multivariate analysis that addresses the dimensionality challenge by extracting meaningful features from a given data set^15,16. Using this approach, different genetically driven subtypes have been identified for Type 2 Diabetes in the past and were even associated with differences in the clinical outcomes^17,18,19,20.

In this study, we aimed to deconstruct the heterogeneity of CKD by identifying its genetic subtypes. First, we collect all published variant-trait associations for variants associated with reduced kidney function and apply soft-clustering using NMF. We used the algorithm’s weights to calculate cluster-specific polygenic scores (cPGS) within the BioMe biobank. Finally, we use a phenome-wide association study (cPGS-PheWAS) to validate and interpret the clusters. By deconstructing the complexity of CKD, this methodology contributes new insights into the disease pathways of CKD and enhances our understanding of population-specific differences for CKD.

Results

NMF identified nine clusters of CKD-associated variants

We identified 508 independent genetic variants associated with decreased kidney function from the literature. Then, we retrieved all proxy SNPs in linkage disequilibrium with the identified variants and linked them to 805 associated traits using the GWAS Catalog database. After multiple filtering steps, including GWAS sample size thresholds and correction of GWAS p-values for multiple associations, we constructed a final matrix of trait-variant associations, which included 322 variants and 229 associated traits. We applied NMF to factorize the variants-traits association matrix X into a traits matrix W and variants matrix H with a shared number of clusters, k (Fig. 1). We identified nine clusters of CKD-associated variants using a hypothesis-free approach.

The most frequent CKD-associated secondary traits retrieved from the GWAS Catalog are related to kidney function (e.g., blood urea nitrogen, urea, uric acid, and cystatin C measurements), hemoglobin levels (e.g., hemoglobin measurements, hematocrit, and erythrocyte counts), T2D, body weight (e.g., body height, appendicular lean mass, BMI, BMI-adjusted waist-hip ratio), and pulse pressure (systolic and diastolic blood pressure measurements), among others (see Fig. S1). CKD-associated traits and their associated CKD variants were factorized into nine partly overlapping clusters by conducting NMF. To ensure the results were robust, we repeated the clustering with Bayesian NMF (bNMF) and got comparable results (Table S1). The top seven traits per cluster are summarised in Fig. 2. The ‘Reduced lipids’ cluster is associated with decreasing blood lipid levels (triglycerides, total cholesterol, use of lipid-lowering medications) and liver enzymes. The top traits of the cluster ‘Increased body mass’ show a positive association with body weight (appendicular lean mass, body height, and body weight). The clusters ‘Increased blood volume’ and ’Reduced blood volume’ are positively and negatively associated with volemic traits (e.g., mean corpuscular volume and mean corpuscular hemoglobin), respectively.

Similarly, clusters ‘Increased/Reduced hematocrit’ show opposite associations with hemoglobin content (e.g., hematocrit, hemoglobin measurements, red blood cell density, erythrocyte count), and clusters ‘Increased/Reduced inflammation’ convey opposite associations with markers of inflammation (C-reactive protein) and blood lipids. Lastly, cluster ‘Increased urate’ is positively associated with kidney function biomarkers like urate, blood/serum urea nitrogen, blood proteins, and Cystatin C. The complete lists of the top features and variants per cluster, defined as traits and variants in the top decile of the cluster weights of the matrices H and W, are listed in Table S2. The matrices H and W are also available as supplementary material. Figure S2 summarises how the variants are distributed in each cluster, showing their overlaps.

Comparing three different pathway analysis approaches, we could identify significantly enriched pathways (q-value < 0.05) based on Ingenuity Pathway Analysis (IPA) for four of the nine identified clusters (Fig. S3). Overall, only two of the 22 identified enriched clusters were enriched in more than one of the clusters, namely, the myelination signaling pathway (enriched for the ‘Increased hematocrit’ and ‘Reduced urate’ clusters) and the estrogen receptor signaling pathway (enriched for the ‘Reduced hematocrit’ and ‘Reduced urate’ clusters). For some of the enriched pathways, we could identify an association with the top-weighted traits of the corresponding clusters, suggesting their biological plausibility. For instance, the IL-12 Signaling and Production in Macrophages is significantly enriched (q-value = 0.02) for the top-weighted genes of the increased inflammation cluster. IL-12 is a pro-inflammatory cytokine that, in the past, has been associated with multiple immune-mediated diseases²¹. For the ‘Increased hematocrit’ cluster, characterized by increased hematocrit, hemoglobin, red blood cell density measures, and erythrocyte count, BMP signaling was one of the significantly enriched pathways (q-value = 0.03). Bone Morphogenic Protein (BMP) has been implicated with hematopoiesis²². The top associated genes of the ‘Increased urate cluster’ suggest alterations of the metabolomic and renal functions. These suggestions can be validated by the Xenobiotic Metabolism AHR Signaling pathway, which can be activated in CKD patients due to the accumulation of uremic toxins and can promote renal fibrosis²³.

PheWAS validated biological pathways of most clusters across ancestries

Each cluster was examined by conducting a cPGS-PheWAS with 988 quantitative traits and 832 binary traits on the four BioMe cohorts (ALL, AFR, AMR, EUR). Except for the ‘Increased body mass’ (abbreviated as ‘IBM’), ‘Reduced lipids’, and the ‘Reduced inflammation’ clusters, we validated at least 2 of the clusters’ top 5 traits (Fig. 3). The level of significance was reached more frequently across ancestries (ALL) than when validating on the individual ones (AMR, AFR, EUR) (Table S3). In addition to the validated traits, significant associations with decreasing eGFR were seen in clusters ‘Increased urate’ (β = − 0.04 [− 0.06 to − 0.03], p-value = 6.7e−09) and ‘Reduced hematocrit’ (β = − 0.05 [− 0.07 to − 0.04], p-value = 4.0e−12) (Table S3). ‘Reduced hematocrit’ was also nominally associated with an increased risk for chronic renal failure (OR = 1.11 [1.05–1.16], p-value = 1.2e−04) and with the curated phenotype ‘diabetic and hypertensive CKD’ (OR = 1.27 [1.11–1.46], p-value = 6.6e−04). Besides showing negative associations with disorders of lipoid metabolism, cluster ‘Increased inflammation’ shows strong negative associations with Alzheimer’s disease (OR = 0.60 [0.52–0.7], p-value = 1.5e−11) and dementias (OR = 0.77 [0.71–0.84], p-value = 1.0e−09) (Fig. S4). Regarding the individual ancestries, EUR showed the strongest associations when validating on binary traits, with an increased risk for “visual disturbances” (OR = 1.51 [1.27–1.79], p-value = 2.1e−06) in the cluster ‘Reduced inflammation,’ while AFR showed the strongest associations when validating on quantitative traits, with the strongest association being for the LDL-HDL ratio (β = − 0.14 [− 0.17 to − 0.11], p-value = 9.3e−21) in the ‘Increased inflammation’ cluster. Figure 3 summarizes which of the top traits of each cluster have been validated, while the complete list of cPGS-PheWAS results by ancestry is stored in Table S3.

cPGSs suggest apparent differences between genetic ancestries

We extracted the cluster weights of the W matrix and used them to calculate cluster-specific polygenic scores (cPGS) for participants of the BioMe cohort. Figure 5 shows the standardized polygenic score distributions for all NMF clusters across the BioMe cohort (ALL) and the individual continental populations EUR (n = 7447), AMR (n = 5336), and AFR (n = 5660). A normal distribution was observed for the cluster ‘Increased urate’ (EUR, AMR, and AFR; Anderson–Darling test, all p-values available in Table S5). Although polygenic scores are expected to have a normal distribution²⁴, the other eight clusters present either a skewed tail (e.g., ‘Increased hematocrit’) or several peaks in their cPGS distributions (e.g., ‘Reduced inflammation’). As illustrated in Fig. 4, the peaks are caused by a few variants with relatively high cluster weights (the complete list of cluster weights for the top variants of each cluster is available in Table S2). For example, the top variant in cluster ‘Increased inflammation’ (rs429358, mapped gene: APOE) weighs 4.6, while the second one (rs17050272) weighs 0.2. In Fig. 5, we can also observe how this variant is more frequent in participants of inferred EUR ancestry. Similarly, the top variant of ‘Reduced inflammation’ (rs1260326, mapped gene: GCKR) weighs 5.7 and seems more frequent in the AFR population, while the second one (rs4418728) weighs 0.9. This unbalance in weight creates the three peaks of the distributions: the lower peak includes the scores of individuals without the top variant (0 copies), the middle one the heterozygous (1 copy), and the higher peak includes scores of participants with two copies of the top variant. Other ancestry-specific differences are visible in the distributions of four clusters and are significant when testing with the Mann–Whitney test (all p-values available in Table S4). This suggests that some variants appear with different frequencies in people that do not share similar ancestry: ‘Increased inflammation’ (all combinations), ‘Reduced inflammation’ (all combinations), ‘Reduced lipids’ (EUR vs. AFR), and ‘Increased body mass’ (EUR vs AFR and AFR vs AMR).

Distribution of participants across clusters

As the cPGS are calculated separately per cluster, each BioMe participant might have high polygenic scores in multiple clusters. Therefore, to understand the cluster overlap in terms of relative risk, we checked how many individuals belonged to the top decile of 1 or more clusters. 58% (18,431/31,701) of the whole BioMe cohort (ALL) were at high risk in at least 1 cluster. Of these, 60.2% were in the top decile for only 1 cluster, while 37% were at risk for 2–3 clusters (Fig. S5).

Discussion

CKD is typically defined as a progressive loss of kidney function over time. Although numerous genetic variants have been identified as associated with CKD, their relationship to disease pathways remains largely unclear. The work described here is the most comprehensive assessment of how variants associated with CKD can be grouped according to different CKD-related factors. Specifically, we included variant-trait associations of 322 CKD SNPs and 229 related metabolic traits from publicly available GWAS datasets. By analyzing these associations with NMF, a factorization approach that allows for minimal overlap between groups, we identified 9 clusters of CKD variants and associated traits.

CKD is commonly recognized as a heterogeneous condition with various underlying causes and risk factors, which are unlikely to represent a single disease process. This complexity is also reflected by the associated traits retrieved from published GWAS, which are related to kidney function, hemoglobin levels, T2D, body weight, and pulse pressure, among others. Attempting to deconvolute CKD's genetic heterogeneity and differentially grouping these traits, the nine clusters we identified represented different aspects of CKD. For example, the ‘Increased urate’ cluster, whose clustering weights represent abnormal levels of urinary metabolites like urate, blood/serum urea nitrogen, blood proteins, and Cystatin C, is related to decreasing kidney function. In normal conditions, such blood metabolites are excreted by the kidneys, but in CKD, they accumulate and exert a detrimental biological activity^25,26. A second cluster, which we summarised as ‘Increased inflammation,’ was strongly clustered around rising serum C-reactive protein (CRP) concentrations. CRP is a common inflammatory biomarker in chronic diseases like CKD, diabetes, and cardiovascular diseases^27,28,29. In line with that, patients with CKD commonly experience chronic inflammatory states³⁰. These states tend to worsen as the disease progresses toward end-stage renal disease and are reflected, or even modulated³¹, by increasing CRP levels^32,33,34. For these two pathways, we identified significantly enriched pathways using the top 25% weighted genes in IPA that are in line with the described top-weighted traits. Overall, the difference in significantly enriched pathways between the clusters suggests biological differences between them.

We then studied the genotype–phenotype correlation to demonstrate the utility of the clusters. We could validate most of the top-weighted features on quantitative traits (i.e., biomarkers), while the validation on binary traits (i.e., diagnoses) was less robust and required additional clinical interpretation. For the clusters of ‘Increased urate’ and ‘Increased inflammation,’ the top traits were confirmed by the PheWAS. CKD is also associated with dyslipidemia, which is comprised of high levels of triglycerides and LDL-cholesterol, and low levels of HDL-cholesterol and apolipoprotein A1³⁵. We could observe similar associations in clusters ‘Increased inflammation,’ ‘Reduced lipids,’ and ‘Reduced inflammation.’ Notably, we found multiple significant associations for cluster ‘Increased inflammation’ with reduced risk of dementias. Glycerophospholipids play an essential role in neural membranes^36,37, and their levels are directly correlated with serum triglycerides and inversely correlated with total cholesterol and eGFR³⁸.

A limitation of this study is the need for more genetic diversity in the GWAS Catalog, which mainly consists of studies performed on the European population. This European bias is well described in the literature and has important implications for disease risk prediction across global populations³⁹. Despite this lack of genetic diversity, which affected the initial selection of variants used in the input matrix, we could still validate our results in BioMe, a biobank enriched for populations with non-European ancestries. We were most powered when jointly analyzing across ancestries (ALL), while signals validated in different ancestral groups with some group-specific differences. This result suggests that ancestry-specific studies are essential although most CKD risk factors converge across ancestral groups. Another two limitations are the filtering rules used to select traits and variants for the algorithm’s input matrix and the possible existence of non-additive interactions between risk factors that we did not consider in this study. Lastly, one of the input CKD studies, the PAGE study⁴⁰, was also conducted using BioMe data. However, this should not impact the results since we are not looking at CKD case/control scenarios but at CKD subtypes.

Understanding the biological pathways that lead to CKD is essential to improve clinical management. For example, some clusters group similar traits but with opposite effect directions (e.g., ‘Increased hematocrit’ and ‘Reduced hematocrit’), while others suggest potentially protective effects (e.g., against dyslipidemia in cluster ‘Increased inflammation’). This behavior might indicate that CKD can affect the same metabolic pathways differently, confirming the genetic complexity of the disease. Additionally, the clusters have a limited degree of overlap, and as each represents a specific set of variants, participants might be high risk (i.e., in the top decile of the polygenic score) for more than one cluster. This additive disease model, similar to the mutational signatures in cancer, suggests a possible interplay of genetic susceptibility to multiple disease-causing mechanisms⁴¹.

In summary, by clustering genetic variants associated with CKD, we identified clusters with distinct trait associations, likely representing mechanistic pathways involved in CKD. We confirmed the validity of these clusters phenotypically. Further clinical investigations could explore whether individuals with a common disrupted pathway also share similar complications, a comparable rate of disease progression, or a different treatment response. In the future, classifying patients with CKD using their genotype may improve care by offering a more personalized and genetically informed clinical plan.

Methods

Trait-variants selection

We identified and aligned the alleles of 508 independent genetic variants associated either with decreased kidney function (defined as low eGFR levels for at least three months) or with CKD (using ICD-9/10 codes) from the most recent GWAS and GWAS meta-analyses^{10,11,12, 40, 42, 43} (Fig. 1a). We then used the R package LDlinkR (R version 4.2.1) to retrieve all proxy SNPs in linkage disequilibrium (r² > = 0.6) with the lead variants, across all available 1000G human populations^44,45. We used the GWAS Catalog database to link the proxy SNPs to 805 associated traits (as of July 30th, 2022)¹³. We excluded gender-specific GWAS and GWAS performed on less than 100 individuals. Additionally, as we are interested in secondary features associated with CKD, we excluded GWAS of traits directly related to eGFR or CKD (e.g., “Mild to moderate chronic kidney disease,” “Estimated Glomerular Filtration Rate”). We kept trait-variant associations with a significance threshold of less than 1 × 10⁻⁶ using a Bonferroni correction for all 2401 associations in our data set. To reduce sparsity in the data, we excluded traits associated with less than five variants; this threshold was empirically defined by comparing the clustering results of traits associated with up to 15 CKD variants. We standardized effect sizes across all GWAS by dividing the regression coefficient beta (B) by the standard error, using the GWAS summary statistic results. Traits and variants were then arranged as a matrix with the standardized effect sizes (β) as values. Table S5 contains, for each input CKD variant, the list of CKD-associated secondary traits extracted from the GWAS Catalog and the corresponding exclusion criteria for those excluded during the filtering steps.

NMF

NMF factorizes the input matrix of trait-variant associations (X, of dimensions 229 × 322) into a matrix of traits (H, 229xK) and one of variants (W, Kx322), so that HxW ≈ X¹⁵ (Fig. 1b). The factorization rank K corresponds to the number of clusters. We implemented NMF using the R package ButchR with 10,000 iterations, 30 random initiations, and the convolution threshold set to 80⁴⁶. The number of expected clusters was set between 2 and 20. ButchR suggests the optimal K based on six cluster evaluation metrics, like the mean silhouette width and the Frobenius error. If two or more K were presented, we considered results with the highest mean silhouette width and the lowest Frobenius error, as suggested by Alexandrov et al.⁴¹. As additional validation, we also performed a Bayesian version of NMF¹⁶, using the code provided by Udler et al.¹⁷. bNMF was run 1000 times with up to 200,000 iterations in each run.

Pathway analyses

To identify cluster-specific enriched pathways, we conducted three different pathway analysis approaches using the closest genes to the variants (n = 262). For gene set enrichment analysis (GSEA), using the clusterProfiler R package⁴⁷, we tested for enrichment of the Hallmark, Ractome and KEGG^48,49,50,51 gene pathways from MSigDB. Ranks of the genes were defined based on the weights from the variant-cluster matrix H. If a gene was annotated to multiple variants, we only considered the highest weights. Using the top 25% weighted SNPs and their corresponding annotated genes, we further conducted Ingenuity Pathway Analysis⁵² and Overrepresentation Analysis using the WEB-based GEne SeT AnaLysis (WebGestalt) toolkit⁵³, focusing on the KEGG, Reactome, Panther, and Wiki pathways.

Cluster-specific polygenic scores

The results of clustering provide cluster-specific weights for each variant and trait. We used PLINK and the variant cluster weights to calculate cluster-specific polygenic scores (cPGS) of the BioMe biobank participants⁵⁴. cPGS were standardized within each cluster. The normality of each cPGS distribution was tested with the Anderson–Darling method. Differences between ancestry-specific distributions were tested with the Mann–Whitney test.

Validation cohort (BioMe)

We validated our results using the genetic and linked electronic health records (EHR) data of 31,701 BioMe biobank participants⁵⁵ (Fig. 1c). As a fine-scale population structure can improve the risk prediction of complex diseases within genetic groups⁵⁶, we inferred the genetic ancestry of the BioMe participants. We then performed a Principal Component Analysis (PCA) using PLINK, excluding relatives above 2nd-degree (kinship method, estimated using KING⁵⁷) and variants with minor allele frequency below 0.05^54,58. We trained a random forest classifier to infer the genetic ancestry of BioMe participants using the 1000 Genomes labels as reference⁵⁹. The labeled ancestries are Admixed American (AMR, n = 5336), African (AFR, n = 5660), European (EUR, n = 7447), South Asian (SAS, n = 613), and East Asian (EAS, n = 728). For sub-population-specific analyses, we removed participants with mixed ancestry (defined as having a random forest probability ≤ 0.5) and outliers by only including the quantiles 0.25–0.90⁶⁰ (n = 11,404).

Modeling disease outcomes as a function of cluster-specific polygenic scores

For each cluster, the cPGS were associated with the phenotypes available in the BioMe data set by performing a phenome-wide association study (cPGS-PheWAS). We fitted linear regression models to analyze 988 quantitative traits (e.g., laboratory results) and logistic regression models for 832 binary traits with cPGS as independent variables, adjusting for sex, age, and the first ten genetic principal components (stats R package⁶¹). Binary traits included Phecodes mapped to ICD-9 and ICD-10 codes (a Phecode is considered if at least two relevant diagnostic codes were present in a patient’s EHR)⁶² and curated phenotypes⁶³. Controls were identified as the reference category. Traits were only considered if present or measured in at least 100 biobank participants. The model parameters were standardized using the effectsize R package (refit method)⁶⁴. Standardized coefficient estimates (linear regression) and odd ratios (logistic regression, defined as change of 1 SD in the PGS) were reported with the corresponding 95% confidence intervals. The Bonferroni method was used to adjust for multiple testing, and the alpha threshold was defined as 3.1e−06 (0.05/[9 * (988 + 832)]). We then compared the PheWAS results with the traits in the top decile of NMF’s trait weights.

Data availability

All publicly available data (input variants, trait-variant associations) used to support the findings of this study are included in this published article (and its Supplementary Information files) and are also available from the cited publications and GWAS Catalog. Additional data generated for the analysis steps, including source code and intermediate results, are available from the corresponding author upon reasonable request. The data used to validate the findings of this study are available from BioMe biobank (https://icahn.mssm.edu/research/ipm/programs/biome-biobank), but restrictions apply to their availability. To access the data, please reach out to biomebiobank@mssm.edu.

References

Cockwell, P. & Fisher, L.-A. The global burden of chronic kidney disease. Lancet 395, 662–664 (2020).
Article PubMed Google Scholar
Bikbov, B. et al. Global, regional, and national burden of chronic kidney disease, 1990–2017: A systematic analysis for the Global Burden of Disease Study 2017. Lancet 395, 709–733 (2020).
Article Google Scholar
Couser, W. G., Remuzzi, G., Mendis, S. & Tonelli, M. The contribution of chronic kidney disease to the global burden of major noncommunicable diseases. Kidney Int. 80, 1258–1270 (2011).
Article PubMed Google Scholar
Jha, V. et al. Chronic kidney disease: Global dimension and perspectives. Lancet 382, 260–272 (2013).
Article PubMed Google Scholar
Ekrikpo, U. E. et al. Chronic kidney disease in the global adult HIV-infected population: A systematic review and meta-analysis. PLoS One 13, e0195443 (2018).
Article PubMed PubMed Central Google Scholar
Saran, R. et al. US Renal Data System 2015 Annual Data Report: Epidemiology of kidney disease in the United States. Am. J. Kidney Dis. 67, A7–A8 (2016).
Article Google Scholar
Levey, A. S. & Coresh, J. Chronic kidney disease. Lancet 379, 165–180 (2012).
Article PubMed Google Scholar
Ruggenenti, P., Cravedi, P. & Remuzzi, G. Mechanisms and Treatment of CKD. J. Am. Soc. Nephrol. 23, 1917–1928 (2012).
Article CAS PubMed Google Scholar
Levey, A. S. et al. Definition and classification of chronic kidney disease: A position statement from Kidney Disease: Improving Global Outcomes (KDIGO). Kidney Int. 67, 2089–2100 (2005).
Article PubMed Google Scholar
Gorski, M. et al. Meta-analysis uncovers genome-wide significant variants for rapid kidney function decline. Kidney Int. 99, 926–939 (2021).
Article CAS PubMed Google Scholar
Teumer, A. et al. Genome-wide association meta-analyses and fine-mapping elucidate pathways influencing albuminuria. Nat. Commun. 10, 4130 (2019).
Article ADS PubMed PubMed Central Google Scholar
Morris, A. P. et al. Trans-ethnic kidney function association study reveals putative causal genes and effects on kidney-specific disease aetiologies. Nat. Commun. 10, 29 (2019).
Article ADS CAS PubMed PubMed Central Google Scholar
Sollis, E. et al. The NHGRI-EBI GWAS Catalog: Knowledgebase and deposition resource. Nucleic Acids Res. 51, D977–D985 (2023).
Article CAS PubMed Google Scholar
Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).
Article ADS CAS PubMed Google Scholar
Paatero, P. & Tapper, U. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5, 111–126 (1994).
Article Google Scholar
Févotte, C. & Idier, J. Algorithms for nonnegative matrix factorization with the β-divergence. Neural Comput. 23, 2421–2456 (2011).
Article MathSciNet Google Scholar
Udler, M. S. et al. Type 2 diabetes genetic loci informed by multi-trait associations point to disease mechanisms and subtypes: A soft clustering analysis. PLoS Med. 15, e1002654 (2018).
Article PubMed PubMed Central Google Scholar
Kim, H. et al. High-throughput genetic clustering of type 2 diabetes loci reveals heterogeneous mechanistic pathways of metabolic disease. Diabetologia 66, 495–507 (2023).
Article CAS PubMed Google Scholar
DiCorpo, D. et al. Type 2 diabetes partitioned polygenic scores associate with disease outcomes in 454,193 individuals across 13 cohorts. Diabetes Care 45, 674–683 (2022).
Article CAS PubMed PubMed Central Google Scholar
Chun, S. et al. Non-parametric polygenic risk prediction via partitioned GWAS summary statistics. Am. J. Hum. Genet. 107, 46–59 (2020).
Article CAS PubMed PubMed Central Google Scholar
Tait Wojno, E. D., Hunter, C. A. & Stumhofer, J. S. The immunobiology of the interleukin-12 family: Room for discovery. Immunity 50, 851–870 (2019).
Article CAS PubMed Google Scholar
Bhatia, M. et al. Bone morphogenetic proteins regulate the developmental program of human hematopoietic stem cells. J. Exp. Med. 189, 1139–1148 (1999).
Article CAS PubMed PubMed Central Google Scholar
Liu, J.-R. et al. Gut microbiota-derived tryptophan metabolism mediates renal fibrosis by aryl hydrocarbon receptor signaling activation. Cell. Mol. Life Sci. 78, 909–922 (2021).
Article CAS PubMed Google Scholar
Choi, S. W., Mak, T.S.-H. & O’Reilly, P. F. Tutorial: A guide to performing polygenic risk score analyses. Nat. Protoc. 15, 2759–2772 (2020).
Article CAS PubMed PubMed Central Google Scholar
Gagnebin, Y. et al. Exploring blood alterations in chronic kidney disease and haemodialysis using metabolomics. Sci. Rep. 10, 19502 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Vanholder, R. et al. Review on uremic toxins: Classification, concentration, and interindividual variability. Kidney Int. 63, 1934–1943 (2003).
Article CAS PubMed Google Scholar
Stenvinkel, P. & Lindholm, B. C-reactive protein in end-stage renal disease: Are there reasons to measure it?. Blood Purif. 23, 72–78 (2005).
Article CAS PubMed Google Scholar
Kalantar-Zadeh, K., Stenvinkel, P., Pillon, L. & Kopple, J. D. Inflammation and nutrition in renal insufficiency. Adv. Ren. Replace. Ther. 10, 155–169 (2003).
Article PubMed Google Scholar
Hansson, G. K. Inflammation, atherosclerosis, and coronary artery disease. N. Engl. J. Med. 352, 1685–1695 (2005).
Article CAS PubMed Google Scholar
Stenvinkel, P. et al. Strong association between malnutrition, inflammation, and atherosclerosis in chronic renal failure. Kidney Int. 55, 1899–1911 (1999).
Article CAS PubMed Google Scholar
Li, Z. et al. C-reactive protein promotes acute renal inflammation and fibrosis in unilateral ureteral obstructive nephropathy in mice. Lab. Investig. 91, 837–851 (2011).
Article CAS PubMed Google Scholar
Stuveling, E. M. et al. C-reactive protein is associated with renal function abnormalities in a non-diabetic population. Kidney Int. 63, 654–661 (2003).
Article CAS PubMed Google Scholar
Menon, V. et al. C-reactive protein and albumin as predictors of all-cause and cardiovascular mortality in chronic kidney disease. Kidney Int. 68, 766–772 (2005).
Article CAS PubMed Google Scholar
Muntner, P. et al. The prevalence of nontraditional risk factors for coronary heart disease in patients with chronic kidney disease. Ann. Intern. Med. 140, 9 (2004).
Article PubMed Google Scholar
Theofilis, P., Vordoni, A., Koukoulaki, M., Vlachopanos, G. & Kalaitzidis, R. G. Dyslipidemia in chronic kidney disease: Contemporary concepts and future therapeutic perspectives. Am. J. Nephrol. 52, 693–701 (2021).
Article CAS PubMed Google Scholar
Farooqui, A. A., Horrocks, L. A. & Farooqui, T. Glycerophospholipids in brain: Their metabolism, incorporation into membranes, functions, and involvement in neurological disorders. Chem. Phys. Lipids 106, 1–29 (2000).
Article CAS PubMed Google Scholar
Frisardi, V., Panza, F., Seripa, D., Farooqui, T. & Farooqui, A. A. Glycerophospholipids and glycerophospholipid-derived lipid mediators: A complex meshwork in Alzheimer’s disease pathology. Prog. Lipid Res. 50, 313–330 (2011).
Article CAS PubMed Google Scholar
Chen, H. et al. Combined clinical phenotype and lipidomic analysis reveals the impact of chronic kidney disease on lipid metabolism. J. Proteome Res. 16, 1566–1578 (2017).
Article CAS PubMed Google Scholar
Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177, 26–31 (2019).
Article CAS PubMed PubMed Central Google Scholar
Lin, B. M. et al. Genetics of chronic kidney disease stages across ancestries: The PAGE study. Front. Genet. 10, 494 (2019).
Article CAS PubMed PubMed Central Google Scholar
Alexandrov, L. B., Nik-Zainal, S., Wedge, D. C., Campbell, P. J. & Stratton, M. R. Deciphering signatures of mutational processes operative in human cancer. Cell Rep. 3, 246–259 (2013).
Article CAS PubMed PubMed Central Google Scholar
Stanzick, K. J. et al. Discovery and prioritization of variants and genes for kidney function in >1.2 million individuals. Nat. Commun. 12, 4350 (2021).
Article CAS PubMed PubMed Central Google Scholar
Lin, B. M. et al. Whole genome sequence analyses of eGFR in 23,732 people representing multiple ancestries in the NHLBI trans-omics for precision medicine (TOPMed) consortium. EBioMedicine 63, 103157 (2021).
Article CAS PubMed PubMed Central Google Scholar
Machiela, M. J. & Chanock, S. J. LDlink: A web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants: Fig. 1. Bioinformatics 31, 3555–3557 (2015).
Article CAS PubMed PubMed Central Google Scholar
Myers, T. A., Chanock, S. J. & Machiela, M. J. LDlinkR: An R package for rapidly calculating linkage disequilibrium statistics in diverse populations. Front. Genet. 11, 157 (2020).
Article PubMed PubMed Central Google Scholar
Quintero, A. et al. ShinyButchR: Interactive NMF-based decomposition workflow of genome-scale datasets. Biol. Methods Protoc. 5, bpaa022 (2020).
Article PubMed PubMed Central Google Scholar
Wu, T. et al. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation 2, 100141 (2021).
CAS PubMed PubMed Central Google Scholar
Kanehisa, M. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
Article CAS PubMed PubMed Central Google Scholar
Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M. & Ishiguro-Watanabe, M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 51, D587–D592 (2023).
Article CAS PubMed Google Scholar
Milacic, M. et al. The reactome pathway knowledgebase 2024. Nucleic Acids Res. 52, D672–D678 (2024).
Article PubMed Google Scholar
Liberzon, A. et al. The molecular signatures database hallmark gene set collection. Cell Syst. 1, 417–425 (2015).
Article CAS PubMed PubMed Central Google Scholar
Krämer, A., Green, J., Pollard, J. & Tugendreich, S. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 30, 523–530 (2014).
Article PubMed Google Scholar
Liao, Y., Wang, J., Jaehnig, E. J., Shi, Z. & Zhang, B. WebGestalt 2019: Gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 47, W199–W205 (2019).
Article CAS PubMed PubMed Central Google Scholar
Purcell, S. et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Article CAS PubMed PubMed Central Google Scholar
BioMe BioBank Program | Icahn School of Medicine. Icahn School of Medicine at Mount Sinai. https://icahn.mssm.edu/research/ipm/programs/biome-biobank
Belbin, G. M. et al. Toward a fine-scale population health monitoring system. Cell 184, 2068-2083.e11 (2021).
Article CAS PubMed Google Scholar
Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
Article CAS PubMed PubMed Central Google Scholar
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Article CAS PubMed Google Scholar
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Article Google Scholar
Quality Control (QC) | Pan UKBB. https://pan-dev.ukbb.broadinstitute.org/docs/qc
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2013).
Google Scholar
Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1111 (2013).
Article CAS PubMed PubMed Central Google Scholar
Nadkarni, G. N. et al. Development and validation of an electronic phenotyping algorithm for chronic kidney disease. AMIA Annu. Symp. Proc. AMIA Symp. 2014, 907–916 (2014).
PubMed Google Scholar
Ben-Shachar, M., Lüdecke, D. & Makowski, D. effectsize: Estimation of effect size indices and standardized parameters. J. Open Source Softw. 5, 2815 (2020).
Article ADS Google Scholar

Download references

Acknowledgements

We would like to thank Miriam Udler, who kindly provided us with the script of the bNMF algorithm, which was used in this study to confirm the clusters identified by NMF. This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Additionally, this work was supported by the Office of Research Infrastructure of the National Institutes of Health under award number S10OD026880, which allowed us to use Mount Sinai Data Warehouse (MSDW) data. Regarding HPI.MS resources, funding was provided by the Hasso Plattner Foundation (HPF). Additionally, the research leading to these results has received funding from the Horizon 2020 Programme of the European Commission under Grant Agreement No. 826117 (Smart4Health). The Mount Sinai BioMe Biobank has been supported by The Andrea and Charles Bronfman Philanthropies and in part by Federal funds from the NHLBI and NHGRI (U01HG00638001; U01HG007417; X01HL134588). We thank all participants in the Mount Sinai BioMe Biobank. We also thank all of our recruiters who have assisted in data collection and management, and we are grateful for the computational resources and staff expertise provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

These authors contributed equally: A. Eoli and S. Ibing.

Authors and Affiliations

Digital Engineering Faculty, University of Potsdam, Potsdam, Germany, Prof.-Dr.-Helmert-Str. 2-3, 14482
A. Eoli, S. Ibing, C. Schurmann, H. O. Heyne & E. Böttinger
Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York City, NY, USA
A. Eoli, S. Ibing, C. Schurmann, H. O. Heyne & E. Böttinger
Department of Medicine, Icahn School of Medicine at Mount Sinai, New York City, NY, USA
E. Böttinger
Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai, New York City, NY, USA
G. N. Nadkarni, H. O. Heyne & E. Böttinger
The Charles Bronfman Institute of Personalized Medicine, New York City, NY, USA
G. N. Nadkarni
Hasso Plattner Institute for Digital Engineering gGmbH, Prof.-Dr.-Helmert-Str. 2-3, 14482, Potsdam, Germany
A. Eoli, S. Ibing, C. Schurmann, H. O. Heyne & E. Böttinger

Authors

A. Eoli
View author publications
You can also search for this author in PubMed Google Scholar
S. Ibing
View author publications
You can also search for this author in PubMed Google Scholar
C. Schurmann
View author publications
You can also search for this author in PubMed Google Scholar
G. N. Nadkarni
View author publications
You can also search for this author in PubMed Google Scholar
H. O. Heyne
View author publications
You can also search for this author in PubMed Google Scholar
E. Böttinger
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Conceptualization, C.S., E.B.; Methodology, C.S., S.I.; Data analysis, A.E., S.I.; Writing of the original draft, A.E.; Tables and Figures, A.E.; Supervision, S.I., H.O.H., E.B.; Project administration, E.B.; Funding acquisition, E.B., H.O.H. All authors reviewed and edited the manuscript. All authors have read, discussed, and approved the manuscript, its analyses, and interpretations.

Corresponding author

Correspondence to A. Eoli.

Ethics declarations

Competing interests

Claudia Schurmann is a paid employee of Bayer AG, Pharmaceuticals. All other authors do not have any competing interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Supplementary Table S2.

Supplementary Table S3.

Supplementary Table S4.

Supplementary Table S5.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Eoli, A., Ibing, S., Schurmann, C. et al. A clustering approach to improve our understanding of the genetic and phenotypic complexity of chronic kidney disease. Sci Rep 14, 9642 (2024). https://doi.org/10.1038/s41598-024-59747-4

Download citation

Received: 09 October 2023
Accepted: 15 April 2024
Published: 26 April 2024
DOI: https://doi.org/10.1038/s41598-024-59747-4

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.