Rummagene: massive mining of gene sets from supporting materials of biomedical research publications

Clarke, Daniel J. B.; Marino, Giacomo B.; Deng, Eden Z.; Xie, Zhuorui; Evangelista, John Erol; Ma’ayan, Avi

doi:10.1038/s42003-024-06177-7

Published: 20 April 2024

Communications Biology volume 7, Article number: 482 (2024)

Abstract

Many biomedical research publications contain gene sets in their supporting tables, and these sets are currently not available for search and reuse. By crawling PubMed Central, the Rummagene server provides access to hundreds of thousands of such mammalian gene sets. So far, we scanned 5,448,589 articles to find 121,237 articles that contain 642,389 gene sets. These sets are served for enrichment analysis, free text, and table title search. Investigating statistical patterns within the Rummagene database, we demonstrate that Rummagene can be used for transcription factor and kinase enrichment analyses, and for gene function predictions. By combining gene set similarity with abstract similarity, Rummagene can find surprising relationships between biological processes, concepts, and named entities. Overall, Rummagene brings to surface the ability to search a massive collection of published biomedical datasets that are currently buried and inaccessible. The Rummagene web application is available at https://rummagene.com.

Introduction

The introduction of omics technologies has gradually moved biological and biomedical research from studying single genes and proteins towards studying gene sets, clusters of genes, molecular complexes, and gene expression modules¹. Many biomedical and biological research studies produce and publish gene and protein sets. For example, differentially expressed genes and proteins from transcriptomics and proteomics assays, genes associated with genomic variants identified to be relevant to a phenotype, gene knockouts associated with a cellular or an organismal phenotype, target genes of transcription factors as determined by ChIP-seq experiments, proteins identified in differential phosphoproteomics, proteins identified in a complex from immunoprecipitation followed by mass-spectrometry studies, genes associated with a cellular phenotype from CRISPR screens, and many more types of sets can be generated. These gene sets are highly valuable but not often reused. This lack of reuse is partially because there are no standards for submitting gene sets in publications, and there are no centralized community repositories for depositing gene and protein sets. As a result, the potentially useful information about gene sets is buried in supporting material tables stored as PDF, Excel, CSV, or Word file formats. Since general and domain specific search engines do not index the contents of such supporting materials, there is no way to search through these tables. These supporting tables are not indexed by search engines because most search engines can only deal with free text and are not capable of parsing data tables.

Named entity recognition methods have been widely applied to biomedical and biological publication text, but not yet to extract gene sets from supporting tables. Manual gene set annotations and extraction of gene sets from publications have been achieved, but such work is time-consuming, labor-intensive, and requires domain expertise. Most such efforts miss many relevant studies. For example, to create the ChIP-x Enrichment Analysis (ChEA) resource we manually extracted gene sets from supporting materials of ChIP-seq studies²,³. While the ChEA database achieved great success, it is difficult to maintain. Efforts such as ReMap⁴, Recount⁵, and ARCHS4⁶ aim to address this challenge by uniformly reprocessing all the raw data available from community repositories to recompute gene sets from published studies, but such efforts rely on the existence of community repositories and uniform data collection standards. Another effort to automate the extraction of gene sets from publications is Pathway Figure Optical Character Recognition (PFOCR)⁷. PFOCR automatically extracts pathways from publications by scanning pathway diagrams. Surprisingly, as far as we know, there are no publications, databases, or community repositories that contain gene sets extracted from supporting materials of scientific biomedical research publications. Rummagene is a web-based software application that serves hundreds of thousands of gene sets extracted from publications listed on PubMed Central (PMC). It contains a softbot that scans supporting materials of publications listed on PMC to keep the resource consistently updated. The Rummagene website provides the ability to search the corpus of gene sets by an input gene set query, a PMC free text search, and a table title search. 
To understand the statistical patterns within the Rummagene corpus, we performed various exploratory analyses and demonstrate how this rich resource of organized biological knowledge can be used for specific applications.

Results

Descriptive statistics

The initial version of Rummagene contains 642,389 gene sets extracted from 121,237 articles. These 121,237 articles were identified as containing gene sets among 5,448,589 scanned PMC articles. The distribution of the occurrence of genes in gene sets is uneven: some genes are found in many sets, but most genes are members of few sets (Fig. 1a). At the same time, most identified gene sets contain fewer than one hundred genes (Fig. 1b). While most publications contributed only one or two gene sets to the Rummagene collection, a few publications contributed a few hundred sets (Fig. 1c). Over the years, more and more gene sets are found in publications (Fig. 1d). In fact, in the past five years, publications included many more sets compared to sets identified in the 30 years between 1988 and 2018. Since 2005, the average length of gene sets jumped from less than 20 genes in each set to ~150 genes in each set (Fig. 1e). This is likely due to the introduction of omics technologies and publications reporting gene sets identified from such studies. By projecting the gene set content into two dimensions with UMAP⁸, we see that, on average, short gene sets contain genes that are more commonly studied (Fig. 1f, g). While this is a general trend, some genes occur in many sets but are less commonly studied (Fig. 1h, i). Specifically, we identified 604 gene sets that are enriched in understudied genes. Understudied gene sets are defined as gene sets where the median citations per gene is more than 3 standard deviations below the median citations observed for randomly assembled gene sets of similar size (Supplementary Data 1). Many of these sets are made up of orphan GPCRs and Znf family members. Other sets are mainly modules of differentially expressed genes. These modules likely serve critical biological roles but are less explored. Next, we noticed that the Rummagene collection of gene sets has many duplicate entries. 
In fact, duplicated gene sets make up approximately 15% of the Rummagene gene sets. Many of these duplicate sets are found in the same publication. The publications having multiple tables often list the same sets but with different measurements or statistics, for example, measuring the expression of a set of genes under different conditions. We found fewer duplicate gene sets across multiple papers (Fig. 1j, k).

**Fig. 1: Distributions of the genes and gene sets in the Rummagene database.**

Annotated collections of themed gene set libraries

From the collection of 642,389 gene sets, we can identify subsets of gene sets associated with specific biological themes, such as sets related to kinases, transcription factors, cell types, cell lines, and tissues. Such themed gene sets can be used for specific enrichment analysis tasks such as kinase enrichment analysis⁹ and transcription factor enrichment analysis². Producing such subsets of gene sets can be done by simply searching the table titles for terms that match named entities such as protein kinases or transcription factors. Indeed, we identified 4525 gene sets that contain named human kinases, and 8078 gene sets that contain named transcription factors in the table titles. These collections contain 444 unique kinase names and 1121 unique transcription factor names (Fig. 2a, Supplementary Data 2, 3). Similarly, we identified 4443 gene sets that contain named cell lines, and 6268 gene sets that contain cell types or tissues in table titles, with 450 and 670 unique terms, respectively (Fig. 2b). In addition, 5560 sets had the term “down” and 6677 had the term “up” in their table titles (Fig. 2c). These sets likely contain up- and down-regulated genes from gene expression signatures. A large portion of the identified gene sets contain gene names in their titles. Specifically, 97,478 table titles contain human gene symbols or synonyms (Fig. 2c). For the subset of gene sets containing known transcription factors in their titles, Uniform Manifold Approximation and Projection (UMAP) plots were generated from the inverse document frequency (IDF) vectors for all gene sets in the subset. Points representing different gene sets are colored by both the PubMed Central ID (PMCID) of the original publication (Fig. 2f), and by the associated transcription factor (Fig. 2e). We found that these gene sets tend to cluster by transcription factor even when they are derived from different publications. 
This was further confirmed to be statistically significant (t-test; p < 0.0001) by comparing the distribution of Jaccard index similarities between gene sets from different publications that mention the same transcription factor with that of gene sets that do not mention the same TF (Fig. 2d). We also applied the same process to generate UMAP plots for the subset of terms containing known kinases, and similarly saw that these gene sets clustered by kinase (Fig. 2g) although originating from different PMCIDs (Fig. 2h). This trend was also confirmed statistically (Fig. 2d). Next, we aimed to assess whether kinase and transcription factor gene set libraries created from Rummagene contain useful information for performing gene set enrichment analysis. To make this assessment, we queried each gene set from the Rummagene kinase and transcription factor libraries against corresponding kinase and transcription factor libraries created from multiple sources²,⁹. We observe a significant recovery of the correct kinases and transcription factors with all libraries, with best agreement observed for KEA⁹ for kinases, and ChEA 2022¹⁰ for transcription factors (Fig. 3a–f). This is likely because these two resources are manual efforts of extracting gene and protein sets from publications, including data from supporting tables. Compared with KEA and ChEA, the kinase and transcription factor Rummagene libraries are likely more comprehensive and up to date, but less accurate.
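
The Jaccard index used in this comparison can be illustrated with a minimal Python sketch. The gene sets below are hypothetical toy examples, not sets taken from Rummagene, and the function name is ours.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two gene collections."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as having no similarity (an assumption).
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy gene sets, for illustration only.
same_tf_1 = {"MYC", "MAX", "CDK4", "NCL"}
same_tf_2 = {"MYC", "CDK4", "NCL", "LDHA"}
other_tf = {"GATA1", "KLF1", "HBB", "NCL"}

within = jaccard(same_tf_1, same_tf_2)   # 3 shared genes out of 5 distinct
between = jaccard(same_tf_1, other_tf)   # 1 shared gene out of 7 distinct
```

In an analysis like the one described, such pairwise values would be collected separately for same-TF and different-TF pairs before the two distributions are compared.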

**Fig. 2: Extracting kinases, transcription factors, tissues, cell types, and cell lines from Rummagene gene sets.**

**Fig. 3: Benchmarking the consensus transcription factor gene and kinase set libraries created from Rummagene.**

Topic modeling

To obtain a global view of the contents of the gene sets in Rummagene, we performed latent Dirichlet allocation (LDA) analysis¹¹ on all abstracts from publications containing at least one extracted gene set. Nine topics were identified and subsequently manually labeled based on the most common terms and their relative weights (Fig. 4a). Some of the most frequently appearing terms across all topics included gene, cell, expression, DNA, patient, cancer, and analysis. The greatest portion of abstracts relates to mutations and variants in diseases, protein-protein interactions, and mechanisms, while the topics with the fewest abstracts relate to immune functions and genome-wide associations and risks. The visualization of abstracts in topic space also reveals the relation and similarity between topics (Fig. 4b). For instance, the topic mutations and variants in disease borders DNA transcription and methylation. Additionally, the genome-wide association and risk topic is isolated from the other topic clusters. The data and modeling topic is located adjacent to most of the other topics, suggesting that abstracts with this topic may be related to a variety of other topics, as expected. Overall, the topic analysis reveals the predominant categories of gene sets in Rummagene, specifically those concerning mutations and variants in diseases and those concerning protein interactions and functional mechanisms.

Similar gene set pairs that are distant in abstract space

Next, we asked whether the knowledge embedded in Rummagene can lead to the construction of hypotheses by identifying gene sets with high similarity in gene set space that are completely disjoint in publication abstract text space. The rationale is that in this way we can identify undiscovered associations between named entities such as genes and diseases. Surprisingly, we first observed that the pairs of gene sets with the highest similarity at the gene set level, with no similarity at the abstract level, are highly enriched in proteins that are commonly detected in mass-spectrometry proteomics studies (Fig. 5a), highly expressed in RNA-seq assays (Fig. 5b), and widely studied (Fig. 5c). This is likely because proteomics studies tend to commonly report the same abundant, large-size, and “sticky” proteins, transcriptomics studies tend to detect highly expressed genes as differentially expressed, and gene sets in publications commonly report overlapping genes in pathways and ontology terms containing highly studied genes. After filtering out pairs of gene sets that are proteomics-rich, contain highly expressed genes, or are composed of highly studied genes, we identified a few pairs of sets that contain a gene name in the table title of one set, and a disease name in the table title of the second set (Supplementary Note). For example, some of the top identified pairs highlight a possible relationship between the proteins identified to interact with CLUH¹², and gene sets identified in hypoxia¹³, melanoma¹⁴, and glioma¹⁵. This connection is logical because CLUH was found to be critical to mitochondrial function, which is altered in these conditions. Similarly, other top overlapping pairs include the TOPBP1 interactome¹⁶ and a potential relationship to melanoma¹⁴, hypoxia¹⁷, and teratomas¹⁸. 
To assist in possibly explaining these connections, we utilized the GPT-4 API, a large language model, to compose hypotheses that suggest how such seemingly unrelated named entities might be in fact related by giving GPT-4 the two abstracts. For example, when asked about the connection between the gene CLUH and the disease hypoxia, prompted with the abstracts and gene set terms, the LLM responded with a plausible explanation concerning mitochondrial function, specifically: “Therefore, it is plausible that the CLUH gene may be involved in the adaptive response of SKOV-3 ovarian cancer cells to hypoxia, possibly by regulating the translation and stability of mitochondrial proteins. This could explain the high overlap between the two gene sets. Further experimental studies would be needed to confirm this hypothesis.” The model successfully determined the cell line used to produce the gene set concerning hypoxia from the abstract provided and it made a reasonable hypothesis about the relationship between the two gene sets given the dissimilar context of the abstracts. Additionally, when asking the LLM about the connection between the gene sets with TOPBP1 and teratomas in their column names, using the two abstracts associated with these gene sets, the LLM produced a plausible explanation about their similarity after stating a hypothesis and reiterating information from the abstracts: “Given the role of TOPBP1 in DNA repair and the importance of gene mutations in the development of teratomas, it is plausible that mutations or dysregulation of TOPBP1 could contribute to the development or progression of teratomas. This could explain the high overlap between the two gene sets. Further research would be needed to confirm this hypothesis and elucidate the exact mechanisms involved”. The summaries produced by the GPT-4 LLM are mostly helpful and logical but should be manually verified as the model states on its own.

**Fig. 5: Distribution of percent of sticky proteins.**

Gene function predictions

Large collections of gene sets can be used to effectively predict gene functions with semi-supervised learning¹⁹. The first step to produce such predictions is to construct a gene-gene similarity matrix from the Rummagene database of gene sets. This can be done with different algorithms. Here we tested the ability of three previously published co-occurrence algorithms²⁰ to make such predictions, and compared the quality of such predictions to predictions made with a similar method that utilizes gene-gene co-expression correlations from thousands of RNA-seq samples⁶. The gene-gene similarity matrices from Rummagene were able to predict with high accuracy and precision the gene membership for functional terms created from the Gene Ontology (GO) Biological Process²¹, GWAS Catalog²², Mouse Genome Informatics (MGI) Mammalian Phenotypes (MP)²³, and WikiPathways²⁴ (Fig. 6a). To illustrate an example for one term, the term “Fasting Plasma Glucose” from GWAS Catalog was selected. The top 10 genes that are closest to the genes known to be associated with this phenotype are SLCO1B3-SLCO1B7, P3R3URF-PIK3R3, SLC30A8, FAM240B, MTNR1B, PERCC1, EEF1AKMT4-ECE2, KLF14, CCDC201, and PAX4; and the ROC curve to assess the quality of the predictions has an area under the curve of 0.75 (Fig. 6b). The top 10 predicted genes for each term from these gene set libraries are provided as a supporting table (Supplementary Data 4).

**Fig. 6: Benchmarking gene function prediction using Rummagene gene sets.**

The knowledge space that is covered by Rummagene compared with Enrichr

To assess the breadth and coverage of the automatically curated Rummagene gene set space, we contrasted it against the Enrichr¹⁰ gene set space. Enrichr is a large-scale curated database of gene sets that is similar in size to Rummagene. UMAP²⁵ was applied to project over 1 million gene sets into two dimensions for data visualization, where each point represents a gene set from either Rummagene or Enrichr. Gene sets are colored by whether they originate from Rummagene or by Enrichr’s gene set library categorization: Transcription, Pathways, Ontologies, Diseases/Drugs, Cell Types, Miscellaneous, Legacy, Crowd (Fig. 7a, b). We observe that Rummagene gene sets cluster into many punctate clusters that likely represent themed gene sets (Fig. 7a). Also, Enrichr’s gene sets are clustered by category (Fig. 7b). When overlaying the Rummagene gene sets on the Enrichr gene sets, most categories are covered, with a few exceptions. We observe that some gene set libraries are not covered by Rummagene, while a few areas in gene set space are much more common in Rummagene compared with Enrichr. To quantitatively verify the presence of these unique clusters, UMAP-enhanced clustering was employed, using a UMAP projection with min_dist of 0 followed by HDBSCAN clustering²⁶. Clusters were assigned labels based on whether 25% of the gene sets within that cluster were from a given Enrichr gene set library; otherwise, they were labeled by a cluster number. Clusters that are mostly Enrichr or mostly Rummagene, where one source makes up 90% of the gene sets in the cluster, are visible across the projection (Fig. 7c). The largest clusters that are mostly from Enrichr are from gene set libraries that were created from unique sources, for example, the LINCS L1000 data²⁷,²⁸, single cell transcriptomics²⁹, virus-host protein-protein interactions³⁰, pathways extracted from figures³¹, and gene sets related to NIH funded investigators³² (Fig. 7d). 
On the other hand, several clusters were unique to Rummagene (Supplementary Data 5–8). One of these clusters, namely cluster 81, contains gene sets that are exclusively transcription factors. This is likely because there are specific assays and studies that focus on profiling these genes exclusively.
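
The cluster labeling rule described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the analysis: the function name, the use of `None` to mark Rummagene gene sets, the choice of the most frequent library, the "at least" comparison, and the library name in the example are all hypothetical.

```python
from collections import Counter


def label_cluster(cluster_id, sources, threshold=0.25):
    """Label a cluster by an Enrichr library if it reaches the threshold.

    sources holds one entry per gene set in the cluster: the Enrichr
    library name, or None for a Rummagene gene set (assumed encoding).
    """
    counts = Counter(s for s in sources if s is not None)
    if counts:
        library, n = counts.most_common(1)[0]
        # Assumption: the most frequent library is used when its share
        # of the cluster is at least the threshold.
        if n / len(sources) >= threshold:
            return library
    # Otherwise fall back to the cluster number.
    return f"cluster {cluster_id}"


# Hypothetical clusters for illustration.
mixed = label_cluster(3, ["LINCS_L1000", None, None, "LINCS_L1000"])
rummagene_only = label_cluster(81, [None, None, None, None])
```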

**Fig. 7: Visualizing the global space of the gene sets contained within Rummagene and Enrichr.**

The Rummagene website

The Rummagene data is served on the website https://rummagene.com with three search engines. The first search engine accepts gene sets as the input query and then returns matching gene sets based on the overlap between the input gene set and the unique sets in the Rummagene database. The results are ranked by the Fisher’s exact test, and to optimize responsiveness, a fast in-memory algorithm is implemented. The results are presented to the user in paginated tables with hyperlinks to the original publication and the supplemental material from which the gene sets were extracted, the genes in the matching sets, the overlapping genes, the p-values and the Benjamini-Hochberg corrected p-values of the overlap, and the odds ratios. When clicking the overlap numbers, a popup screen shows the overlapping genes with the ability to copy them to the clipboard, submit them to Rummagene, or submit them to Enrichr¹⁰. Similarly, the original gene set can be accessed by clicking on the column name. The second search engine facilitates a free-text PMC search. This search engine queries PMC with the entered terms to receive PMCIDs that match the query. It then compares the returned PMCIDs to the PMCIDs in Rummagene to identify matching PMCIDs. Once such matches are detected, the gene sets in the Rummagene database are returned to the user as a paginated table with hyperlinks to the original publications and the matching gene sets. The third search engine queries the table titles from which gene sets were extracted. Table titles that match the inputted search terms are displayed in a paginated table with hyperlinks to the matching publications and gene sets. All search engine results can be filtered, shared by URL, and downloaded. The entire database is available for download as a text file, and access to the data is provided via a GraphQL API. Importantly, the Rummagene resource is updated automatically once a week.

Discussion

By crawling through full articles and supporting materials from over five million research publications available from PMC, we were able to identify over 120,000 publications that contain over 600,000 mammalian gene sets of various lengths. Smaller gene sets are enriched for widely studied genes, while longer lists contain less studied genes, likely due to their origin from omics studies. Interestingly, in the past five years, the publication of gene sets in articles has been increasing exponentially. Hence, most gene sets in the Rummagene database are from this period. Here we demonstrated how the Rummagene resource can be used for various applications. Specifically, we showed how a subset of the extracted sets can be used for transcription factor and kinase enrichment analyses. We also showed how the rich knowledge in Rummagene can be used for gene function predictions. In addition, we demonstrated how we can form hypotheses by identifying gene set pairs with high similarity in gene set space and low similarity in abstract space. However, many additional applications are possible. For example, Rummagene can be used to produce textual descriptions for gene sets using large language models (LLMs). Given the large collection of Rummagene gene sets, as well as the fast enrichment search engine that is implemented, we could provide the Rummagene API to an LLM to act as a chatbot that searches for relevant papers that are related to a given gene set and then summarizes the collective functions identified in these papers. This is different from just giving an LLM a gene set because it adds focus to the search by utilizing the Rummagene API. The LLM use case currently implemented in Rummagene is forming hypotheses about two highly overlapping gene sets with dissimilar abstracts. 
We show that when the two abstracts are submitted to an LLM with a request to explain why the seemingly unrelated abstracts might have highly overlapping gene sets, the LLM is constrained to provide a plausible explanation. Although such an explanation is at times trivial, in all cases that we tested, it was based on correct facts. Hence, the prompt is detailed and constrained enough to produce high-quality responses from the LLM. One of the opportunities provided by Rummagene is its integration with other resources that contain large collections of gene sets and signatures, for example, Enrichr¹⁰, ARCHS4⁶, and SigCom LINCS²⁷. Biomedical research has been traditionally communicated via hardcopy printed paper journals. With the transition to fully digital research communication and the introduction of omics technologies, increased efforts are being placed on better annotation and standardization of published research data, including the publication of gene sets and data tables. During this transition period toward such improved annotations, Rummagene plays an important role in making previously published data, buried in supplemental materials of publications, more findable, accessible, interoperable, and reusable (FAIR)³³.

Methods

Crawler to extract gene sets from publications listed in PMC

The PMC Open Access Subset³⁴ contains millions of journal articles available under license terms that permit reuse. Additionally, PMC provides uniformly structured bundles that can be retrieved in bulk over FTP. An index file contains a tabular listing of all PMCIDs represented, with a pointer to the compressed bundle corresponding to each PMCID. Each bundle has a PDF of the paper, an XML document containing structured metadata about the paper, figures, and supplemental material files. First, the index file is downloaded, and a job is then submitted for each paper. The job downloads and extracts the archive and processes the XML structured paper by loading the tables from the paper and all supplemental files. Both the tables from the main paper and the tables in the supplemental files may have captions or labels. These captions or labels are saved. Additionally, places in the text that mention the table, or the supplemental file, are identified when they are linked in the markup; at most, 15 words before such a call to the tables are saved. Every supplemental file is processed by one of several table extractor functions, selected based on the file extension. These extractor functions include support for Excel, CSV, TSV, and inferred separator loading of TXT files, as well as a PDF table extractor based on Tabula-Py. For each supporting materials table that is extracted, every column in the table is considered. The extractor function attempts to map all unique strings to gene symbols. Mapping may be direct, or through a synonym or identifier. Any column where more than half of the strings can be successfully mapped to a valid human gene symbol using NCBI’s Gene Info³⁵ file for Homo sapiens is retained. In other words, all columns passing this filter become a gene set in the Rummagene gene set library. 
This approach aims to capture human gene sets, but also captures gene sets from other mammalian organisms such as mouse or rat because of the high overlap in gene symbols. Hence, we consider the overall collection of gene sets in Rummagene as mammalian. The term describing the gene set is made of the PMCID, the file name in the bundle, the Excel spreadsheet name or the XML table label, the column’s first cell, and additional sequential numbers that are added to the term to make it unique if needed. The description field is constructed by concatenating any available caption, label, and text mention. The original items in each table column that pass the filter are preserved, but genes are included only if they can be mapped to official symbols. In addition to filtering out columns with too few mapped genes (< 5), columns with too many mapped genes (> 2500) are ignored. This is because these are likely to contain gene sets that cover all measured genes and not a subset of identified genes with a potentially unique function. This pipeline produces a large gene matrix transpose (GMT) file which can be added to incrementally. The pipeline is designed to continue where it left off when it is re-run. It is set to run weekly to extend the database with any new publications that are added to the PMC Open Access database. The new entries to the GMT are stored in the Rummagene database to be accessed from the web-based application. By extracting gene sets from supporting material of published research articles we can make these more accessible for search and reuse.
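
The column filter described above can be sketched in Python. This is a minimal illustration, not the crawler's actual code: the function name and the `symbol_lookup` dictionary (a stand-in for a table built from NCBI's Gene Info file) are hypothetical, and case-sensitive matching is an assumption.

```python
def column_to_gene_set(column, symbol_lookup, min_genes=5, max_genes=2500):
    """Return the mapped gene set for a table column, or None if rejected.

    symbol_lookup maps symbols, synonyms, or identifiers to official
    symbols; it is a hypothetical stand-in for NCBI Gene Info data.
    """
    # Work on the unique, non-empty strings in the column.
    unique = {str(cell).strip() for cell in column if str(cell).strip()}
    if not unique:
        return None
    mappable = [s for s in unique if s in symbol_lookup]
    # Keep the column only if more than half of its strings map to genes.
    if 2 * len(mappable) <= len(unique):
        return None
    mapped = {symbol_lookup[s] for s in mappable}
    # Drop sets with fewer than 5 or more than 2500 mapped genes.
    if len(mapped) < min_genes or len(mapped) > max_genes:
        return None
    return mapped


# Hypothetical lookup and column for illustration.
lookup = {g: g for g in ["TP53", "MYC", "EGFR", "BRCA1", "KRAS", "PTEN"]}
lookup["p53"] = "TP53"  # a synonym mapped to its official symbol
genes = column_to_gene_set(
    ["TP53", "MYC", "EGFR", "BRCA1", "KRAS", "PTEN", "n/a"], lookup
)
```

A column of numeric statistics fails the first filter, while a short column of two or three genes fails the size filter even if every string maps.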

Search engine implementation

The large size of the Rummagene gene set library requires special implementation of an algorithm that can quickly compare the input gene set to all the gene sets in the Rummagene database. Besides a fast algorithm that can compare the input set to all other sets, efficient storage of the gene sets is needed as well as sufficient hardware. To enable a fast gene set search, a Rust-powered REST API was implemented. The algorithm first initializes several in-memory data structures: 1) a background sorted set of all genes across all gene sets in the database; 2) the index of each gene saved in a hashmap mapping where each gene is mapped to a 32 bit unsigned integer (U32) index; 3) the gene set IDs and unique hashes stored as UUIDs; and 4) a hashset of mapped genes using the Fowler–Noll–Vo (FNV) hash function on each gene for each unique gene set. FNV is known to perform well when dealing with small keys. This is the case in our implementation which uses 32-bit unsigned integer keys. In our tests, FNV performed much faster than the default hasher. These data structures are created by querying the database with Rust. When the user presses the search button, the queried gene sets are forwarded to the API. After ensuring that the index is initialized, the code maps the user submitted gene set to a U32 hash set. It then computes the intersections between the user’s gene set and the gene sets in memory and performs the Fisher’s exact test using the identified overlap. Parallel processing with Rayon³⁶ is employed to further speed up this process. Once completed, Benjamini-Hochberg adjusted p-values are computed. Next, the results are sorted by p-value, temporarily cached, and returned. The gene sets in Rummagene are stored in a Postgres database³⁷. A function in the Postgres database is responsible for mapping the gene symbols to UUIDs before passing them to the Rust API to obtain results. 
These returned results can be joined by ID with the gene sets and genes in the database to facilitate further filtering. In this way, the use of an API is transparent to the front-end which queries the database with PostGraphile powered GraphQL. By implementing an advanced fast search engine, we can offer an interactive real-time service to users of the Rummagene application. The Rummagene database is automatically updated once a week by processing all the new articles added to PMC in the past week to identify new gene sets in the supporting materials of these articles. When a batch of new gene sets are added to the database, a new reference of valid gene names is constructed with the complete set of genes in the database. At that time, the API is called to prepare the new gene name reference prior to removing the old reference. By automatically updating the database, we ensure that it will remain relevant and current long term with minimal effort.
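
The overlap scoring described in this section can be sketched in Python. The production service is written in Rust, so this is only an illustration of the statistics: the function names are ours, the one-sided (enrichment) form of Fisher's exact test is an assumption, and every library gene set is assumed to be a subset of the background.

```python
from math import comb


def fisher_right_tail(overlap, n_query, n_set, n_background):
    """One-sided Fisher's exact test: P(X >= overlap), hypergeometric null."""
    total = comb(n_background, n_query)
    upper = min(n_query, n_set)
    tail = sum(
        comb(n_set, k) * comb(n_background - n_set, n_query - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrichment_search(query, library, background):
    """Score every library gene set against the query, sorted by p-value."""
    query = set(query) & background
    rows = []
    for term, genes in library.items():
        k = len(query & genes)
        p = fisher_right_tail(k, len(query), len(genes), len(background))
        rows.append((term, k, p))
    adjusted = benjamini_hochberg([p for _, _, p in rows])
    results = [(t, k, p, q) for (t, k, p), q in zip(rows, adjusted)]
    return sorted(results, key=lambda r: r[2])


# Hypothetical toy data for illustration.
background = {f"G{i}" for i in range(10)}
library = {"A": {"G0", "G1", "G2"}, "B": {"G7", "G8", "G9"}}
results = enrichment_search(["G0", "G1", "G2"], library, background)
```

Each result row holds the term, the overlap size, the raw p-value, and the adjusted p-value. Representing genes as small integers in hash sets, as the text describes, mainly speeds up the intersection step.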

Extracting functional terms from column titles

To assess the contents of the extracted gene sets, the column titles for each table were examined to identify a variety of functional terms. Supplementary table titles often include DOI and other identification information, so these were ignored when conducting this analysis. After separating the column titles in each gene set, the column titles were split on dashes, underscores, and periods. To identify genes in each column title, each resulting string was examined to assess whether it was an NCBI Entrez³⁸ approved human gene symbol or a listed synonym. All gene synonyms were subsequently converted to their official symbols. Although genes can be represented with integer identifiers, strings containing only numbers were ignored because manual examination showed that many of these were artifacts. Additionally, strings containing ‘S’ followed by an integer were ignored because the vast majority of these refer to a supplemental table number. Transcription factors and kinases were subsequently identified from the extracted gene symbols. To identify gene sets that may represent signatures, the strings ‘up’, ‘down’, and ‘dn’ were searched for in the split column titles. To identify tissues, cell types, and cell lines present in the column titles, the Brenda Tissue Ontology (BTO)³⁹ official terms and synonyms were extracted, and exact matches were identified. For gene sets containing multiple BTO terms, the terms were hyphenated to capture, for instance, a cell type from a specific tissue.
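
The column-title parsing described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `SYNONYM_TO_SYMBOL` table is a tiny hypothetical stand-in for the NCBI gene symbol and synonym mappings, the function name `parse_column_title` is invented here, and the case-insensitive matching and the exact `S<integer>` pattern are assumptions.

```python
import re

# Hypothetical stand-in for the NCBI gene symbol and synonym table.
SYNONYM_TO_SYMBOL = {
    "STAT3": "STAT3",
    "APRF": "STAT3",
    "TP53": "TP53",
    "P53": "TP53",
}
DIRECTION_TOKENS = {"up", "down", "dn"}


def parse_column_title(title):
    """Split a column title and collect gene symbols and up/down tokens."""
    # Split on dashes, underscores, and periods, dropping empty pieces.
    tokens = [t for t in re.split(r"[-_.]", title) if t]
    genes, directions = [], []
    for token in tokens:
        if token.isdigit():
            continue  # purely numeric strings were treated as artifacts
        if re.fullmatch(r"S\d+", token):
            continue  # e.g. "S3" usually refers to a supplemental table number
        if token.lower() in DIRECTION_TOKENS:
            directions.append(token.lower())
        # Assumption: matching is case-insensitive; synonyms map to official symbols.
        symbol = SYNONYM_TO_SYMBOL.get(token.upper())
        if symbol is not None:
            genes.append(symbol)
    return {"genes": genes, "directions": directions}


if __name__ == "__main__":
    print(parse_column_title("Table_S3.p53_KO-up_12"))
```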

Visualization of the kinase and TF gene set libraries

For each extracted gene set, IDF vectors were computed with the Scikit-learn⁴⁰ Python package, using the set of all included genes as the corpus. Using the Scanpy⁴¹ Python package, Uniform Manifold Approximation and Projection (UMAP)⁸ plots for different categories of gene sets were then generated from the IDF vectors, and clusters were automatically computed using the Leiden algorithm⁴². To visualize broad patterns across the data, each point representing a gene set was colored by cluster, associated PMCID, and associated kinase or transcription factor, if applicable. By visualizing the kinase and TF gene set libraries, we can observe higher-level functional clusters of related kinases and TFs.
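
To make the notion of an IDF vector for a gene set concrete, the sketch below treats each gene set as a document and each gene as a term. It uses only the Python standard library rather than Scikit-learn, and it assumes the smoothed formula ln((1 + n) / (1 + df)) + 1, which is the Scikit-learn default; whether that default was kept, and whether the vectors were normalized, is not stated in the text. The function name `idf_vectors` is hypothetical.

```python
from math import log


def idf_vectors(gene_sets):
    """Return the sorted gene vocabulary and one IDF vector per gene set.

    gene_sets maps a gene set name to a set of gene symbols.
    """
    vocabulary = sorted(set().union(*gene_sets.values()))
    n_sets = len(gene_sets)
    # Document frequency: in how many gene sets does each gene appear?
    doc_freq = {g: sum(g in s for s in gene_sets.values()) for g in vocabulary}
    # Assumed smoothed IDF; rarer genes receive larger weights.
    idf = {g: log((1 + n_sets) / (1 + doc_freq[g])) + 1.0 for g in vocabulary}
    vectors = {
        name: [idf[g] if g in members else 0.0 for g in vocabulary]
        for name, members in gene_sets.items()
    }
    return vocabulary, vectors


if __name__ == "__main__":
    vocab, vecs = idf_vectors({"s1": {"A", "B"}, "s2": {"B", "C"}})
    print(vocab)
    print(vecs)
```

Genes shared by many sets contribute little to the vector, so two gene sets end up close in this space mainly when they share uncommon genes.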

Benchmarking transcription factor and kinase enrichment analyses

Consensus transcription factor and kinase gene set libraries were created by performing a metadata search of the Rummagene database with the kinase or transcription factor named entities as the search terms. Returned entries are matches where the transcription factor or kinase term appears in the gene set’s table title, table legend, or column legend. The gene set for each transcription factor and kinase is the union of all identified gene sets corresponding to the given transcription factor or kinase. Benchmarking datasets were sourced from ChEA3² for transcription factors and from KEA3⁹ for kinases. To benchmark enrichment analysis performed with the constructed consensus gene set libraries, the rank of each transcription factor/kinase was identified using the Fisher’s exact test p-value for each matching gene set in each benchmarking dataset. To generate ROC curves, we downsampled the negative class to the same size as the positive class to achieve class balance. ROC curves were then bootstrapped over 5000 iterations, and the mean ROC curves and AUCs were reported. Because the negative class is downsampled at random, bootstrapping the curve over several thousand iterations gives a more accurate depiction of the ability of the Rummagene transcription factor and kinase gene set libraries to predict the perturbed transcription factor or kinase. The NumPy interp function was used to linearly interpolate between all points from the 5000 ROC curves to generate a composite ROC curve for each benchmarking library.
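
The balanced downsampling and bootstrapping idea can be illustrated with a short standard-library Python sketch. It reports only the mean AUC and does not reproduce the interpolated composite ROC curve. The function names, the fixed random seed, and the convention that a higher score means a more confident prediction are assumptions for illustration; in the benchmark itself the scores are ranks derived from Fisher's exact test p-values.

```python
import random


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def balanced_bootstrap_auc(pos_scores, neg_scores, n_iter=5000, seed=0):
    """Mean AUC over repeated random downsampling of the negative class."""
    rng = random.Random(seed)
    # Downsample negatives to the size of the positive class.
    k = min(len(pos_scores), len(neg_scores))
    aucs = []
    for _ in range(n_iter):
        sampled_neg = rng.sample(list(neg_scores), k)
        aucs.append(auc(pos_scores, sampled_neg))
    return sum(aucs) / len(aucs)


if __name__ == "__main__":
    positives = [0.9, 0.8, 0.4]
    negatives = [0.7, 0.3, 0.2, 0.1, 0.05, 0.6]
    print(balanced_bootstrap_auc(positives, negatives, n_iter=1000))
```

Averaging over many random draws reduces the dependence of the reported value on any single choice of negatives.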

Topic modeling

To identify the predominant topics associated with gene sets in the Rummagene database, the abstracts of each paper contributing at least one gene set were assembled from the PMC bulk download. The text contained within the <abstract> tags was concatenated. Papers containing no abstracts were excluded from the analysis. Each abstract was then tokenized, stripped of stop words, and lemmatized using the Python package Natural Language Toolkit (NLTK)⁴³. The LdaModel class of the Python package Gensim⁴⁴ was then used to identify nine topics with a chunksize of 100 over 10 passes. The number of topics was chosen manually by observing the separation of topics given different sets of parameters. Word counts and word importance were extracted from the model for each of the nine topics. The abstracts were visualized in topic space with t-SNE²⁵, using the vectors produced by the latent Dirichlet allocation (LDA) model¹¹ that describe the adherence of each paper to each topic.

Similar gene set pairs that are distant in abstract space

The preprocessing of publications’ abstracts followed the same procedure as in topic modeling: abstracts were first extracted from the PMC bulk download, then cleaned of stop words and lemmatized using the NLTK⁴³ Python package. Abstracts were then converted to word counts using the count vectorizer and subsequently transformed into term frequency-inverse document frequency (TF-IDF) vectors using the Scikit-learn⁴⁰ Python package. The cosine similarity of each paper abstract to all other abstracts was then assessed using the Scikit-learn pairwise linear kernel metric on the computed TF-IDF vectors; the linear kernel equals the cosine similarity when the TF-IDF vectors are L2-normalized, which is the Scikit-learn default. Only pairs of gene sets from different publications with zero cosine similarity of their abstracts were retained. For each pair of such gene sets, Fisher’s exact test was performed to assess the significance of the overlap among the genes within these two sets. Only pairs with p < 0.05 were retained for further analysis. Pairs with identical gene sets were excluded. Pairs were further filtered to only include those with overlaps of more than 50 genes. Additionally, to assess the novelty of the recovered pairs, the percentage of their overlapping genes that are ‘sticky proteins’ identified in an analysis of protein-protein interactions⁴⁵ was used (Supplementary Data 9). In the analysis of gene set pairs including a gene or a disease in the table or column title and legend, only the top 10,000 most significant pairs with < 10% ‘sticky proteins’ were included. To assess the number of highly cited genes present in the overlapping genes of gene set pairs, the top 500 most cited genes according to GeneRIF³⁸ were used (Supplementary Data 9). Additionally, to determine the number of highly expressed genes present in the overlapping genes of gene set pairs, the top 500 most highly expressed protein coding genes were sourced based on mean expression across 5000 random samples from ARCHS4⁶ (Supplementary Data 9).
To identify disease names in column titles of the gene set pairs, DisGeNET⁴⁶ disease terms were used, and gene names were identified using NCBI Gene³⁸ mappings. The OpenAI API chat completion module with the GPT-4 model was used to hypothesize about the connection between the top pairs of gene sets that remained after the filtering steps described above. When prompting the model, we provided it with the gene set terms, the abstracts of both papers, and any identified disease or gene extracted from the gene set term column title in the following format: “Based on the pair of extracted gene sets from two research publications, hypothesize why there might be a connection between these gene sets based on the two abstracts, and the provided gene and disease terms: Gene set term 1: [term1], disease from gene set 1 term: [disease], abstract of publication for gene set term 1: [term1_abstract], Gene set term 2: [term2], gene(s) from gene set 2 term: [gene], abstract of publication for gene set term 2: [term2_abstract].” Additionally, the system message explains the task as follows: “You are a biologist who attempts to generate a hypothesis about why two gene sets, which are lists of genes, may have a high overlap despite being extracted from two publications that have dissimilar abstracts. The gene set/paper pairs you will be given have one gene set with a disease term and the other with a gene name, so you should include reasoning as to a possible connection between the disease and the gene and explain this possible connection. Such a connection should be related to the abstracts.” The response from the model, along with statistics about the significance of the overlap and a PubMed query with the disease and the gene, is provided to help uncover whether this association is already published in the literature.
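
The pair-filtering criteria listed earlier in this section can be summarized in a short Python sketch. Because TF-IDF weights are non-negative and every term that occurs in an abstract receives a positive weight, a cosine similarity of exactly zero means the two preprocessed abstracts share no vocabulary term, so the sketch checks token disjointness directly. The function names, the argument layout, and the assumption that the Fisher's exact test p-value has already been computed are hypothetical choices made for illustration.

```python
def sticky_fraction(overlap, sticky_genes):
    """Fraction of overlapping genes that appear in the sticky-protein list."""
    return len(overlap & sticky_genes) / len(overlap) if overlap else 0.0


def keep_pair(tokens_a, tokens_b, genes_a, genes_b, p_value,
              alpha=0.05, min_overlap=50, sticky_genes=None, max_sticky=0.10):
    """Apply the filtering steps to one pair of gene sets from two papers.

    tokens_a and tokens_b are the preprocessed abstract tokens; p_value is
    the Fisher's exact test p-value for the overlap, computed elsewhere.
    """
    # Zero cosine similarity of TF-IDF vectors means no shared term.
    if set(tokens_a) & set(tokens_b):
        return False
    genes_a, genes_b = set(genes_a), set(genes_b)
    if genes_a == genes_b:
        return False  # identical gene sets are excluded
    if p_value >= alpha:
        return False
    overlap = genes_a & genes_b
    if len(overlap) <= min_overlap:
        return False  # overlaps of more than 50 genes are required
    # Optional filter used only for the gene/disease analysis.
    if sticky_genes is not None:
        if sticky_fraction(overlap, set(sticky_genes)) >= max_sticky:
            return False
    return True


if __name__ == "__main__":
    a = {f"g{i}" for i in range(100)}
    b = {f"g{i}" for i in range(40, 140)}
    print(keep_pair(["kinase", "yeast"], ["glacier", "pollen"], a, b, 1e-10))
```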

Gene function predictions

A total of 50,000 gene sets were randomly selected from Rummagene and filtered for sets with fewer than 2000 genes. For all human genes, we formed a matrix $A$ where $A(i,j)=1$ if gene $i$ is a member of gene set $j$ and $0$ otherwise. The co-occurrence matrix is then $\varPhi =A\cdot {A}^{T}$. As previously described²⁰, the co-occurrence probability between two genes is:

$$P\left(\alpha ,\beta \right)=\frac{\varPhi \left(\alpha ,\beta \right)}{{\phi }_{0}},$$

where ${\phi }_{0}$ is the total number of co-occurrences, and the marginal probability $P(\alpha )=\frac{1}{{\phi }_{0}}\mathop{\sum}\limits_{\beta \ne \alpha }\varPhi (\alpha ,\beta )$.

The cosine similarity, Jaccard index, and normalized pointwise mutual information (NPWMI) for each pair of genes were then calculated as follows:

$${Cosine}(\alpha ,\beta )=\frac{P(\alpha ,\beta )}{\sqrt{P(\alpha )P(\beta )}}$$

$${Jaccard}(\alpha ,\beta )=\frac{P(\alpha ,\beta )}{P(\alpha )+P(\beta )-P(\alpha ,\beta )}$$

$${NPWMI}(\alpha ,\beta )=\frac{-1}{\ln (P(\alpha ,\beta ))}\cdot \max \left\{0,\ln \left(\frac{P(\alpha ,\beta )}{P(\alpha )P(\beta )}\right)\right\}$$

The NPWMI is a value between 0 and 1, where a larger value indicates that the two genes co-occur with greater probability than expected by chance⁴⁸. Four gene set libraries were used to benchmark gene function prediction: GO Biological Process (2023), GWAS Catalog (2023), MGI Mammalian Phenotypes (2021), and Human WikiPathways (2021). To predict the likelihood that a gene belongs to a gene set, we measured the similarity of each gene to each gene set in each library by computing the average similarity of the gene to each gene in the gene set. Suppose $L$ is a matrix where $L(i,j)=1$ if gene $i$ is a member of gene set $j$ in the library, and $0$ otherwise. Let $D$ be one of the similarity matrices described above, with the diagonal set to $0$. The gene/gene-set association matrix is $G=\frac{D\cdot L}{{1}^{T}\cdot L}$, where $1$ is a column vector of ones, so that ${1}^{T}\cdot L$ holds the number of genes in each gene set, and the division is applied elementwise along each column. Each entry $G(i,j)$ is then the mean similarity of gene $i$ to all the genes in gene set $j$. The matrix $G$ can then be used to predict membership of gene $i$ in any gene set. ROC curves and AUC values for each term in the library were computed using the Python sklearn.metrics module⁴⁰.
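
The similarity measures and the gene-to-gene-set score defined in this section can be traced on a toy example with the standard-library Python sketch below. It is not the authors' code. Two points are assumptions: $\phi_0$ is taken to be the sum of all ordered off-diagonal co-occurrence counts, which makes $P(\alpha)$ the marginal of $P(\alpha,\beta)$, and pairs that never co-occur are assigned a similarity of 0. The function names are hypothetical, and every gene in a library set is assumed to be present in the gene list.

```python
from math import log, sqrt


def pair_probabilities(gene_sets, genes):
    """Return joint P(a, b) for a != b and marginal P(a) from co-membership."""
    phi = {a: {b: 0 for b in genes if b != a} for a in genes}
    for members in gene_sets:
        present = [g for g in genes if g in members]
        for a in present:
            for b in present:
                if a != b:
                    phi[a][b] += 1
    # Assumption: phi_0 sums all ordered off-diagonal co-occurrences.
    phi_0 = sum(sum(row.values()) for row in phi.values())
    joint = {a: {b: c / phi_0 for b, c in row.items()} for a, row in phi.items()}
    marginal = {a: sum(row.values()) for a, row in joint.items()}
    return joint, marginal


def similarities(joint, marginal, a, b):
    """Cosine, Jaccard, and NPWMI for one gene pair, as defined in the text."""
    p_ab, p_a, p_b = joint[a][b], marginal[a], marginal[b]
    if p_ab == 0:
        return {"cosine": 0.0, "jaccard": 0.0, "npwmi": 0.0}
    cosine = p_ab / sqrt(p_a * p_b)
    jaccard = p_ab / (p_a + p_b - p_ab)
    npwmi = (-1.0 / log(p_ab)) * max(0.0, log(p_ab / (p_a * p_b)))
    return {"cosine": cosine, "jaccard": jaccard, "npwmi": npwmi}


def gene_set_score(joint, marginal, gene, library_set, measure="cosine"):
    """Mean similarity of `gene` to the genes in `library_set`.

    Self-similarity is treated as 0, mirroring the zeroed diagonal of D,
    and the sum is divided by the full size of the library gene set.
    """
    total = sum(
        similarities(joint, marginal, gene, other)[measure]
        for other in library_set
        if other != gene
    )
    return total / len(library_set)


if __name__ == "__main__":
    genes = ["A", "B", "C", "D"]
    sets = [{"A", "B"}, {"A", "B", "C"}, {"C", "D"}]
    joint, marginal = pair_probabilities(sets, genes)
    print(similarities(joint, marginal, "A", "B"))
    print(gene_set_score(joint, marginal, "C", {"A", "B"}))
```

In the toy data, genes A and B share two of the three sets, so they score higher with each other than either does with D, which they never co-occur with.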

Comparing the Rummagene gene set space to the Enrichr gene set space

All the gene set libraries in Enrichr were assembled and processed together with the Rummagene gene sets so that they could be projected into the same two-dimensional space. First, all genes were mapped to their official NCBI gene symbols for Homo sapiens or filtered out. Gene sets were then converted into vectors with values corresponding to the inverse document frequency (IDF)⁴⁹. Truncated Singular Value Decomposition (Truncated SVD)⁵⁰ was then used to reduce the dimensionality of the IDF vectors to the 50 components with the largest singular values. UMAP⁸ with the default settings was then used to embed all samples into two dimensions. Finally, to better position the visualization, we computed the mean and standard deviation of the embedding dimension axes and show the bulk of the samples that are within 1.68 standard deviations from the mean.
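
The final positioning step, which keeps the view within 1.68 standard deviations of the mean on each embedding axis, can be sketched as follows. This is an illustrative standard-library version; the function names are hypothetical, and the use of the population standard deviation is an assumption because the text does not specify it.

```python
from statistics import mean, pstdev


def bulk_window(points, n_sd=1.68):
    """Return ((x_lo, x_hi), (y_lo, y_hi)) bounds at mean +/- n_sd * SD."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    bounds = []
    for axis in (xs, ys):
        center, spread = mean(axis), pstdev(axis)
        bounds.append((center - n_sd * spread, center + n_sd * spread))
    return tuple(bounds)


def within_window(points, bounds):
    """Keep only the 2D points that fall inside the bounds on both axes."""
    (x_lo, x_hi), (y_lo, y_hi) = bounds
    return [p for p in points if x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi]


if __name__ == "__main__":
    embedding = [(0, 0), (1, 1), (2, 2), (3, 3), (100, 100)]
    window = bulk_window(embedding)
    print(window)
    print(within_window(embedding, window))
```

The bounds can be used either to filter points, as here, or simply as axis limits for the plot.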

Data availability

The Rummagene dataset version analyzed here is available for download from https://rummagene.com/download and from Figshare⁵¹. The most recent version of the Rummagene dataset is also available from https://rummagene.com/download. This dataset is updated weekly on Mondays. Additional files needed to reproduce the results are provided as Supplementary Data files.

Code availability

The Rummagene web server application is available from: https://rummagene.com/. The Rummagene source code is available from: https://github.com/MaayanLab/rummagene and a snapshot of the source code was deposited in Figshare⁵². The code and files needed to reproduce the figures are available from: https://github.com/MaayanLab/rummagene/tree/main/figures.

References

1. Manzoni, C. et al. Genome, transcriptome and proteome: the rise of omics data and their integration in biomedical sciences. Brief. Bioinform. 19, 286–302 (2018).
2. Keenan, A. B. et al. ChEA3: transcription factor enrichment analysis by orthogonal omics integration. Nucleic Acids Res. 47, W212–W224 (2019).
3. Lachmann, A. et al. ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinformatics 26, 2438–2444 (2010).
4. Hammal, F., de Langen, P., Bergon, A., Lopez, F. & Ballester, B. ReMap 2022: a database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments. Nucleic Acids Res. 50, D316–D325 (2022).
5. Wilks, C. et al. recount3: summaries and queries for large-scale RNA-seq expression and splicing. Genome Biol. 22, 323 (2021).
6. Lachmann, A. et al. Massive mining of publicly available RNA-seq data from human and mouse. Nat. Commun. 9, 1366 (2018).
7. Shin, M.-G. & Pico, A. Using Published Pathway Figures in Enrichment Analysis and Machine Learning. bioRxiv https://doi.org/10.1101/2023.07.06.548037 (2023).
8. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv [stat.ML] (2018).
9. Kuleshov, M. V. et al. KEA3: improved kinase enrichment analysis via data integration. Nucleic Acids Res. 49, W304–W316 (2021).
10. Kuleshov, M. V. et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 44, W90–W97 (2016).
11. Pritchard, J. K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
12. Hémono, M., Haller, A., Chicher, J., Duchêne, A.-M. & Ngondo, R. P. The interactome of CLUH reveals its association to SPAG5 and its co-translational proximity to mitochondrial proteins. BMC Biol. 20, 13 (2022).
13. Bileck, A. et al. Inward Outward Signaling in Ovarian Cancer: Morpho-Phospho-Proteomic Profiling Upon Application of Hypoxia and Shear Stress Characterizes the Adaptive Plasticity of OVCAR-3 and SKOV-3 Cells. Front. Oncol. 11, 746411 (2021).
14. Rolfs, F., Piersma, S. R., Dias, M. P., Jonkers, J. & Jimenez, C. R. Feasibility of Phosphoproteomics on Leftover Samples After RNA Extraction With Guanidinium Thiocyanate. Mol. Cell. Proteom. 20, 100078 (2021).
15. Monsivais, D. et al. Mass-spectrometry-based proteomic correlates of grade and stage reveal pathways and kinases associated with aggressive human cancers. Oncogene 40, 2081–2095 (2021).
16. Mooser, C. et al. Treacle controls the nucleolar response to rDNA breaks via TOPBP1 recruitment and ATR activation. Nat. Commun. 11, 123 (2020).
17. Salaverry, L. S. et al. Metabolic plasticity in blast crisis-chronic myeloid leukaemia cells under hypoxia reduces the cytotoxic potency of drugs targeting mitochondria. Discov. Oncol. 13, 60 (2022).
18. Shen, H. et al. Integrated Molecular Characterization of Testicular Germ Cell Tumors. Cell Rep. 23, 3392–3406 (2018).
19. Lachmann, A. et al. Geneshot: search engine for ranking genes from arbitrary text queries. Nucleic Acids Res. 47, W571–W577 (2019).
20. Ma’ayan, A. & Clark, N. R. Large Collection of Diverse Gene Set Search Queries Recapitulate Known Protein-Protein Interactions and Gene-Gene Functional Associations. arXiv [q-bio.MN] (2016).
21. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).
22. MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Res. 45, D896–D901 (2017).
23. Smith, C. L., Goldsmith, C.-A. W. & Eppig, J. T. The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biol. 6, R7 (2005).
24. Pico, A. R. et al. WikiPathways: pathway editing for the people. PLoS Biol. 6, e184 (2008).
25. Van Der Maaten, L., Postma, E. O., van den Herik, H. J. & Others. Dimensionality reduction: A comparative review. J. Mach. Learn. Res. 10, 13 (2009).
26. Campello, R. J. G. B., Moulavi, D. & Sander, J. Density-based clustering based on hierarchical density estimates. in Advances in Knowledge Discovery and Data Mining 160–172 (Springer Berlin Heidelberg, Berlin, Heidelberg, 2013).
27. Evangelista, J. E. et al. SigCom LINCS: data and metadata search engine for a million gene expression signatures. Nucleic Acids Res. 50, W697–W709 (2022).
28. Subramanian, A. et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell 171, 1437–1452.e17 (2017).
29. Tabula Sapiens Consortium et al. The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science 376, eabl4896 (2022).
30. Lasso, G. et al. A Structure-Informed Atlas of Human-Virus Interactions. Cell 178, 1526–1541.e16 (2019).
31. Hanspers, K., Riutta, A., Summer-Kutmon, M. & Pico, A. R. Pathway information extracted from 25 years of pathway figures. Genome Biol. 21, 273 (2020).
32. Talley, E. M. et al. Database of NIH grants using machine-learned categories and graphical clustering. Nat. Methods 8, 443–444 (2011).
33. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
34. Gamble, A. PubMed Central (PMC). Charlest. Advisor 19, 48–54 (2017).
35. Brown, G. R. et al. Gene: a gene-centered information resource at NCBI. Nucleic Acids Res. 43, D36–D42 (2015).
36. Pieper, R., Löff, J., Hoffmann, R. B., Griebler, D. & Fernandes, L. G. High-level and efficient structured stream parallelism for rust on multi-cores. J. Computer Lang. 65, 101054 (2021).
37. Obe, R. O. & Hsu, L. S. PostgreSQL: Up and Running: A Practical Guide to the Advanced Open Source Database. (O’Reilly Media, Inc., 2017).
38. Maglott, D., Ostell, J., Pruitt, K. D. & Tatusova, T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 33, D54–D58 (2005).
39. Gremse, M. et al. The BRENDA Tissue Ontology (BTO): the first all-integrating ontology of all organisms for enzyme sources. Nucleic Acids Res. 39, D507–D513 (2011).
40. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
41. Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
42. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
43. Bird, S., Klein, E. & Loper, E. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. (O’Reilly Media, Inc., 2009).
44. Rehurek, R. & Sojka, P. Gensim–python framework for vector space modelling. NLP Centre, Faculty of Informatics, Masaryk University (2011).
45. Mazloom, A. R. et al. Recovering protein-protein and domain-domain interactions from aggregation of IP-MS proteomics of coregulator complexes. PLoS Comput. Biol. 7, e1002319 (2011).
46. Piñero, J. et al. DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Res. 45, D833–D839 (2017).
47. Sun, B. B. et al. Genetic regulation of the human plasma proteome in 54,306 UK Biobank participants. bioRxiv 2022.06.17.496443 https://doi.org/10.1101/2022.06.17.496443 (2022).
48. Chiarcos, C., de Castilho, R. E. & Stede, M. Von Der Form Zur Bedeutung: Texte Automatisch Verarbeiten: From Form to Meaning: Processing Texts Automatically. Proceedings of the Biennial GSCL Conference 2009. (Narr Francke Attempto Verlag, 2009).
49. Spärck Jones, K. A statistical interpretation of term specificity and its application in retrieval. J. Documentation 28, 11–21 (1972).
50. Chicco, D. & Masseroli, M. Software Suite for Gene and Protein Annotation Prediction and Similarity Search. IEEE/ACM Trans. Comput. Biol. Bioinform. 12, 837–843 (2015).
51. Clarke, D. J. B. et al. Rummagene gene sets with descriptions 01172024. figshare. Dataset. https://doi.org/10.6084/m9.figshare.25017023.v3 (2024).
52. Clarke, D. J. B. et al. Rummagene source code snapshot from 03132024. figshare. Software. https://doi.org/10.6084/m9.figshare.25404637.v1 (2024).

Acknowledgements

This study is partially supported by NIH grants OT2OD030160, U24CA264250, RC2DK131995, and U24CA224260.

Author information

Authors and Affiliations

Department of Pharmacological Sciences, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
Daniel J. B. Clarke, Giacomo B. Marino, Eden Z. Deng, Zhuorui Xie, John Erol Evangelista & Avi Ma’ayan

Contributions

D.J.B.C. and G.B.M. developed the website, wrote the manuscript, performed data analysis, and produced figures. D.J.B.C. wrote the crawler and developed the code for the fast search engine. Z.X. and E.Z.D. wrote the manuscript and performed data analysis. J.E.E. contributed to the data analysis. A.M. conceived the project, wrote the manuscript, managed the project, and was responsible for funding the project.

Corresponding author

Correspondence to Avi Ma’ayan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Biology thanks Alexander R. Pico and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editors: Chien-Yu Chen and Tobias Goris.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Note

Description of Supplementary Materials

Supplementary-Data-1

Supplementary-Data-2

Supplementary-Data-3

Supplementary-Data-4

Supplementary-Data-5

Supplementary-Data-6

Supplementary-Data-7

Supplementary-Data-8

Supplementary-Data-9

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Clarke, D.J.B., Marino, G.B., Deng, E.Z. et al. Rummagene: massive mining of gene sets from supporting materials of biomedical research publications. Commun Biol 7, 482 (2024). https://doi.org/10.1038/s42003-024-06177-7

Received: 11 October 2023
Accepted: 10 April 2024
Published: 20 April 2024
DOI: https://doi.org/10.1038/s42003-024-06177-7
