Introduction

Science-driven research productivity and associated innovation processes have become increasingly complex for a number of reasons (Bloom et al. 2020; Boyack et al. 2017; Chen 2006; Chu and Evans 2021; Jones 2009; Kozlow 2023). Globally, over 2.6 million scientific articles were published in 2018 alone (White 2019). As scientific output increases over time, there has also been an increasing variety of sources of emergent topics as a result of the recombination of subjects and fields. Emergent topics that cross science fields are expected to be less path dependent than past patterns of scientific knowledge production. In line with this, Fortunato et al. (2018) emphasize the need to understand the science of science, especially as disciplinary boundaries break down.

Contemporary science is a dynamical system of undertakings driven by complex interactions among social structures, knowledge representations, and the natural world. Scientific knowledge is constituted by concepts and relations embodied in research papers, books, patents, software and other scholarly artifacts, organized into scientific disciplines and broader fields. These social, conceptual, and material elements are connected through formal and informal flows of information, ideas, research practices, tools, and samples. Science can thus be described as a complex, self-organizing, and constantly evolving multiscale network. (Fortunato et al. 2018, p. 1)

While research output has risen, scientific productivity, or the value derived from that output, has fallen across fields (Bloom et al. 2020). The rate of innovation has slowed because the level of specialization (Jones 2009) and the size of teams (Kozlow 2023) needed to conduct science have increased. Intertwined with specialization and team size, the costs of research and development have risen sharply, reducing the rate of science productivity (Bloom et al. 2020). Another reason is how emergence has been measured. For instance, as the volume of scientific output increases, the ability to evaluate emerging research topics decreases because canonical literature is more likely to be cited (Chu and Evans 2021). “Could we be missing fertile new paradigms because we are locked into overworked areas of study?” (Chu and Evans 2021, p. 5). Moreover, could we be misidentifying where emerging value is derived from science?

This has important implications given the role of scientific forecasting in understanding and developing effective science, technology, and innovation (STI) policy initiatives that aim to support science and to predict innovation trajectories (Börner et al. 2018). Essentially, innovative outcomes are frequently the result of converging technologies that often depend heavily on interdisciplinary scientific inputs (Kogler et al. 2022). Thus, and perhaps not surprisingly, contemporary attempts to address global grand challenges are directed toward interdisciplinary research, where a deep integration of disciplines that combines different types of scientific and technological paradigms in genomics/biotechnology, nanotechnology, and information technology (e.g., blockchain, sensors, AI, and Big Data) is often believed to be the most promising avenue to pursue (Petersen et al. 2021). Recent examples, such as the mRNA vaccine for COVID-19, confirm this notion as they are usually the result of several decades of scientific research that might only become highly effective once the advances in various scientific fields are combined in a single applicable technological solution or innovation. Past convergence (Footnote 1) stems from emergent interdisciplinary fields, e.g., biotechnology, which further catalyze innovations from other sectors (Feldman et al. 2015). Thus, changes at the interdisciplinary boundaries that are in flux may provide further insights into potential future convergence activities.

New discoveries, especially those with multi-disciplinary roots, are usually difficult to attribute to existing classification schemas (Fagerberg et al. 2012), but equally, they define the frontier of the innovation process as they combine existing forms of knowledge into something entirely novel (Eisenhardt and Martin 2000; Lee et al. 2015; Schumpeter 1934; 1942). Thus, interdisciplinary fields of science can be used to define the emergence of new topics (Chakraborty 2018; Khan and Wood 2015; Lee et al. 2015). Utilizing bibliometric network analysis on publication metadata, the present study proposes an approach capable of identifying where interdisciplinary science fields emerge, based on a global science map that also indicates changes in the growth of influence.

Specifically, the investigation employs topic modeling to classify scientific research topics from a large amount of data using unsupervised algorithms. The suggested embedded topic modeling approach then enables identification of emerging science topics in line with Schumpeterian notions of knowledge recombination processes where it is possible to observe how the combination of multiple disciplines or science categories unfolds over time. Unlike technology convergence that has been studied more systematically (Lee et al. 2019), few studies, to the best of our knowledge, have directed similar research efforts towards interdisciplinary knowledge recombination processes and how these might impact the overall evolution of the entire scientific knowledge landscape and subsequent innovation outcomes. Moreover, the application of topic modeling in natural language processing (NLP) environs to emerging interdisciplinary science studies holds the potential to provide important insights. The novel approach of combining embedded topic modeling and co-occurrence network analysis methods across global science maps can help with identifying emerging science topics before they consolidate into fields and predict those with potential value for knowledge recombination leading to global convergence.

The overarching goal is to analyze the complexity, self-organization, and evolution of scientific knowledge production while sifting through a large volume of scientific publications, and to understand how it might be possible to anticipate scientific innovations as they emerge from converging areas of research. The main objective of the present study is then to provide a novel approach to the bibliometric analyses toolkit by combining network analysis and embedded topic modeling techniques for the identification of emergent scientific topics of research interdisciplinarity.

Further, a novel measure for emergent topics is developed and employed, utilizing the network centrality index. Additionally, we leverage an embedded topic modeling technique, specifically BERTopic, which builds on BERT (Bidirectional Encoder Representations from Transformers) embeddings, to gain insights into the emergent and globally domain-crossing profiles within interdisciplinary science fields. Through this comprehensive approach, we aim to illuminate the evolution of the science of science by investigating the changing boundaries of interdisciplinary research.

In the following sections, we provide an overview of the relevant literature in this line of inquiry, introduce the methodology followed by overall and detailed empirical findings, and finally offer a detailed discussion and some concluding thoughts.

Literature review

Science maps were developed to understand patterns related to the science of science, which include identifying topics of interest (Zahedi and van Eck 2018), identifying growth rates of science (Bornmann and Mutz 2015), identifying topic emergence (Jung and Segev 2022a), and detecting patterns and trends in the scientific literature (Kim and Chen 2015), especially through new combinations of interdisciplinary fields of science and technologies (Blei and Lafferty 2007; Eum and Maliphol 2023; Khan and Wood 2015; Lee et al. 2015). Science maps are network representations of the scientific literature that have evolved in research approaches (Chen 2006). Underlying these past approaches is an emphasis on finding radically new innovations within a specialized domain of science.

The evolution of the literature on emergence began with citation analysis and currently combines methods that identify network patterns using topic modeling techniques (Rotolo et al. 2015). Network analysis is commonly used to map the trends and patterns in the scientific literature, e.g., linked through citations, including the emergence of new seminal discoveries that change the course of a science specialization (Chen 2006). Science mapping linking research literature through citations can be used to demonstrate different evolutionary stages of scientific development over time, allowing the identification of transformative contributions through predictive analysis (Chen 2017). Models have been designed to include different aspects of the science of science. Science overlay maps represent subsets or networks of publications of global base maps, distinguishing different levels of research field categorization (Sjögårde 2022).

Emerging technologies from science can be defined by characteristics measured through bibliometric indicators and text analysis (Rotolo et al. 2015). By combining full-text analysis and bibliometric indicators, Glenisson et al. (2005) piloted a study that demonstrated the usefulness of data mining and bibliometric techniques that facilitate mapping fields of science. Patterns of scientific emergence have been modeled through clustering (Glänzel and Thijs 2012; Yau et al. 2014), national output (Suominen and Toivanen 2016), and using networks to demonstrate emergence (Khan and Wood 2015).

The emergent topics are expected to grow rapidly out of uncertain and ambiguous areas of research and converge to make a novel impact (Rotolo et al. 2015). Past studies on emergence focus on local maps or predefined areas of study, e.g. Curran and Leker (2011) on the nutraceuticals industry; Rey-Martí et al. (2016) on social entrepreneurship; and Song et al. (2017) on personalized medicine. Existing studies that demonstrate emergence have been carried out through bibliometric analyses using frequency-based topic modeling techniques that identified science topics (Griffith et al. 2004), topic coherence (Newman et al. 2011), topic “bursts” (Mane and Börner 2004), and patterns of scientific breakthrough (Winnink et al. 2019). Emergence is often identified through a measure of diversity within the local map, e.g., Rao-Stirling diversity and relative variety (Leydesdorff and Rafols 2011; Leydesdorff et al. 2019; Rafols and Meyer 2010).

The studies of emergent science are limited in scope by constraining fields of study through specific journals, articles, or authors. Once the science map is generated, topic modeling is analyzed based on network values generated from the map. The terms with higher frequency in the corpus are identified as emergent topic clusters. Thus, these studies examine the science of science generated within a science subject, category, or journal group based on measures of frequency and diversity within a local map. These approaches define the distance of interdisciplinarity through relative measures within the field of science. By relying on frequency, past approaches are more subject to canonical bias and may ignore context. Thus, the influence or importance of an interdisciplinary science pair in a science map offers an alternative approach to identifying emergence.

Novelty is also necessary to define emergence (Rotolo et al. 2015). Novelty can be identified through the merging of previously separate “streams of research” or fields of science (Day and Schoemaker 2000; Shin et al. 2022; Small et al. 2014). Thus, another measure of emergent organization is fast-growing multiple field or technology interdisciplinarity (Bornmann 2013; Bornmann and Marx 2014; Lee et al. 2021; Leydesdorff et al. 2013). Over time, research has become increasingly interdisciplinary (Chakraborty 2018). Research fields go through three stages: growth, maturity, and interdisciplinarity (Chakraborty 2018).

How disciplines are classified and differentiated, however, is still unsettled and still needs to be operationalized (Sugimoto and Weingart 2015). One method of defining disciplines is by using data-based publication indices such as Web of Science (WoS) categories (Sugimoto and Weingart 2015). Following this, interdisciplinarity can be modeled using keywords, authors’ fields of study, and citations that cross multiple disciplines (Chakraborty 2018; Xu et al. 2018, 2019). Topic prediction using network analysis has been used to find emergent patterns across domains that are pre-defined and linked through co-occurrence frequency (Jung and Segev 2022b).

The measure of interdisciplinarity must balance variety and similarity (Leydesdorff 2018). When comparing against global data, limiting topic detection within a single discipline neglects to consider the increasingly interdisciplinary nature in which science is conducted (Boyack 2017). Using global maps leads to more accurate partitions and higher textual coherence of topics because the entire context is preserved (Klavans and Boyack 2011). Moreover, long distances between interdisciplinary topics tend to have a higher scientific impact (Larivière et al. 2015). When scientific research incorporates new technological ideas, the convergent science tends to have a greater impact (Kwon et al. 2019). Further, humanities and social science research tends to have lower citation density which leads to lower measures of interdisciplinarity (Larivière et al. 2015).

While many investigations use interdisciplinary measures of emergence, past studies frequently restricted the analysis to local science maps that focus on a narrow field of science using relative measures for emergence. Furthermore, the formation of interdisciplinary research in the relevant literature has been mainly modeled through the evolution of keyword co-occurrence (Xu et al. 2018). Thus, one of the significant limitations of existing studies concerning the identification of thematic structures and dynamic patterns is that researchers constructed scientific maps around pre-defined topics (Gläser et al. 2017). By limiting the topic scope, the approaches resorted to using frequency-based measures of variety to determine relative novelty, and speed to define emergence. Frequency-based keyword evolution, however, can constrain our understanding of interdisciplinarity, disregard context, and intensify canonical bias. In contrast, global science maps can provide unbiased results if the size of the documents is sufficiently large (Rafols et al. 2010). While some studies differentiate between multi-, inter-, and trans-disciplinary (Chakraborty 2018; Leydesdorff et al. 2018), the operationalization of these distinctions remains limited. Thus, this study distinguishes the concept of growing and dominant sciences focused on broadly identifying the importance of interdisciplinarity across networks of STEM domains.

Methodology

The present study combines network analysis and BERTopic and applies them to understand cross-domain topic areas. BERTopic is an integrated topic modeling technique that uses embedding vectors and c-TF-IDF to create dense clusters, allowing interpretable topics to be derived from text data. Traditional text analysis is a labor-intensive activity that limits sample sizes to the speed at which human researchers can read; even ambitious studies are limited to a few hundred texts. For this reason, frequency-based topic modeling techniques (e.g., Latent Semantic Analysis, Latent Dirichlet Allocation, Dynamic Topic Model) were introduced to derive unobserved topics from a very large number of texts. However, frequency-based approaches remove context by relying only on term frequencies. New embedding-based approaches, such as BERTopic, allow us to consider the contextual knowledge of large text data sets. The Web of Science Raw Data (WoS; Footnote 2), with over 63 million publication records found in 12,500 high-quality journals, is a common target of bibliometric analysis.

The data and methods used for the empirical analysis are introduced in accordance with the overall research process (Fig. 1), which consists of data collection and pre-processing followed by two stages: network analysis of an interdisciplinary science dataset (stage 1) and topic modeling of the newly constructed dataset (stage 2). The data are first gathered and prepared from the journal publication metadata for network analysis and topic modeling. In stage 1, science category-subject network analysis is conducted to construct an interdisciplinary science network. In this constructed interdisciplinary science network, the science category-subjects that have greater network centrality, i.e., those that have greater potential value in terms of knowledge recombination, are defined. Here, the dataset is divided into two consecutive periods to create two interdisciplinary science networks. Comparing network values in the two periods, science category-subjects that are more likely to grow (emerging science field) and that are more likely to have greater frequency (dominant science field) in the following period are selected to filter the final text dataset for topic modeling. Through this step, more precise and accurate data on publications can be extracted by keeping only those publications that include such science category-subjects, which restricts the data to the ‘emerging science fields’. Utilizing the filtered list of publications, in the following subsection (Fig. 1, stage 2), topic modeling is conducted to explore the emerging topics in each interdisciplinary science field. This stage includes all the required processes for running the BERTopic model analysis. Through this process, latent topics representing each interdisciplinary science are derived. For qualitative validation, the publications that are the most representative of the emergent topics, which have been identified through the unsupervised learning process, are analyzed to identify what the topics of interest are for the given interdisciplinary categories.

Fig. 1: Overall research process.

The overall research process is performed in two stages: (i) defining a network of documents based on science-subject pairs and (ii) identifying topics from the network data.

Data collection

For the empirical analysis, the metadata is collected from the Web of Science Database. The database provides bibliometric information on scientific publications, including the publication title, year, journal title, author, institution, institution’s address, broad category, subject field, funding, citations, etc. The metadata should also include fields that enable differentiation by document type (e.g., article, editorial material, review, biographical item, letter, bibliography, correction, book review, meeting abstract, or proceedings paper) and publication type (journal, book in series, or book). These criteria allow us to restrict our sample to publications that are written for the same purpose, to maintain the quality of articles, and to avoid duplication. The dataset employed here is limited to journal articles by filtering on document and publication types.

Then, the list of publications that meet the definition of interdisciplinary science is selected and divided into three-year periods, which helps to stabilize dataset rankings (Archambault et al. 2009). By definition, interdisciplinary science refers to the cases where the scientific outcome is based on different research areas. In the WoS database, the research areas are defined by the scientific classifications, subheadings, and subjects. The broad global science category (‘subheading’ in WoS) indicates the top-level classification for the scientific fields including life-science & biomedicine (LSB), technology (TE), physical sciences (PS), arts & humanities, and social sciences. These categories are mutually exclusive. The subject field refers to a lower-tier classification of science that is assigned to an accordant category subheading. Here, all classifications are provided by WoS, as all journals and books included in WoS are categorized accordingly. In this study, an interdisciplinary science field is defined as the scientific outcome based on at least two subheadings, which are science categories.

In our WoS publication sample dataset, publications with technology- and science-based subheadings (LSB, TE, and PS) are used to maintain the consistency of the scientific fields. A total of 7,453,987 publications (from 10,138 journals) with 226 subjects are first collected over the reference period from 2012 to 2017. From this data set, global interdisciplinary science publications are filtered, which gives us 1,194,332 publications (from 1,137 journals) with 172 subjects. Our final sample is restricted to publications that are classified as Journal Article (doc_type = ‘Article’ and pub_type = ‘Journal’) without missing abstracts. Table 1 presents the basic descriptive statistics on the number of publications, subjects, and journals for each interdisciplinary science field included in our final sample. Among all the interdisciplinary sciences, PS-TE has the greatest number of publications, subjects, and journals, showing that it is the most active interdisciplinary science field. The increase in publications across all interdisciplinary science activities reflects the global trend of technology convergence as more heterogeneous technologies and industrial fields are used together over time.
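The sample restrictions described above can be sketched as a simple record filter. This is a minimal illustration, not the pipeline used in the study: only the doc_type and pub_type fields and their values come from the text, while the record layout, the abstract and category_subjects field names, and the function name are hypothetical.

```python
# Hypothetical record layout: each publication is a dict. Only doc_type and
# pub_type (and their values) are taken from the text; the other field names
# are assumptions made for illustration.
STEM_CATEGORIES = {"LSB", "TE", "PS"}


def is_interdisciplinary_article(record):
    """Keep journal articles with an abstract that span at least two of the
    LSB, TE, and PS categories."""
    if record.get("doc_type") != "Article" or record.get("pub_type") != "Journal":
        return False
    if not (record.get("abstract") or "").strip():
        return False  # drop records with a missing or empty abstract
    categories = {category for category, _subject in record.get("category_subjects", [])}
    return len(categories & STEM_CATEGORIES) >= 2


records = [
    {"doc_type": "Article", "pub_type": "Journal", "abstract": "Some text.",
     "category_subjects": [("PS", "Chemistry"), ("TE", "Engineering")]},
    {"doc_type": "Review", "pub_type": "Journal", "abstract": "Some text.",
     "category_subjects": [("PS", "Chemistry"), ("TE", "Engineering")]},
]
sample = [r for r in records if is_interdisciplinary_article(r)]
```

Counting only categories in the LSB, TE, and PS set mirrors the restriction to technology- and science-based subheadings; how publications that also carry other subheadings were treated is not stated in the text and is left open here.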

Table 1 Descriptive Statistics of Interdisciplinary Science Exploration Sets.

Science category-subject co-occurrence network analysis

Science category-subject pair set

Prior to the science category-subject co-occurrence network analysis, a science category-subject co-occurrence pair set is constructed. In the interdisciplinary science dataset, a list of science category-subjects that are relevant to the category subheadings is assigned to each publication. Each science category-subject represents a node in the network connected by publications. To conduct co-occurrence network analysis, the combinations of category-subjects for each publication are transformed into a pair-form dataset for each interdisciplinary science field that defines the edges between nodes. We illustrate science category-subjects by signifying their categories with a capital letter (A, B, or C) and a number (1–9) to differentiate the science category-subjects within the categories. If publication X contains three science category-subjects, A3, B6, and C9, it will have three rows of pair sets: A-B, B-C, and A-C. If publication Y contains three science category-subjects, A1, A2, and B5, it will have two rows that belong to the same interdisciplinary pair set: A-B and A-B. Once the data set is transformed, the science category-subject pairs are aggregated by counting the number of publications that include each pair. The aggregated science category-subject pair set, therefore, presents the number of publications for each science category-subject pair in each interdisciplinary science field in the respective period.
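The pair-set construction can be illustrated with the two example publications above. The following is a minimal sketch that assumes the illustrative notation of the text (a capital letter for the category followed by a digit for the subject); the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cross_category_pairs(subjects):
    """Return all subject pairs of one publication whose categories differ.

    Subjects use the illustrative notation of the text, e.g. "A3": the first
    character is the category, the remainder identifies the subject.
    """
    pairs = []
    for s, t in combinations(sorted(set(subjects)), 2):
        if s[0] != t[0]:  # same-category pairs are not interdisciplinary
            pairs.append((s, t))
    return pairs


def aggregate_pairs(publications):
    """Count, for every subject pair, the number of publications containing it."""
    counts = Counter()
    for subjects in publications.values():
        counts.update(cross_category_pairs(subjects))
    return counts


publications = {"X": ["A3", "B6", "C9"], "Y": ["A1", "A2", "B5"]}
pair_counts = aggregate_pairs(publications)
# Publication X yields three rows (A-B, A-C, B-C); publication Y yields two
# rows, both belonging to the A-B pair set.
```

Each entry of pair_counts can then serve as the weight of an undirected edge between two science category-subject nodes.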

Science category-subject co-occurrence network analysis

Using subject pair sets, subject co-occurrence network analysis is conducted for interdisciplinary science fields in each period. A co-occurrence network is an effective method for analyzing the structural relationship between elements. A similar approach has been used with patent data for technology convergence analysis (Curran and Leker 2011; Kogler et al. 2017; Kim et al. 2018, 2019). In this regard, a co-occurrence network using publication data can provide greater understanding of how science category-subjects are being used and related to each other across interdisciplinary science fields. In a subject co-occurrence network, science category-subjects are used as nodes, and publications are used as edges. For the linkage rule, undirected and weighted networks are adopted. As shown in Fig. 2, science category-subjects are connected only if they were used in the same publication. For instance, subjects A and C have a total of two edges because they are used in publications 1 and 2.

Fig. 2: Science category-subject co-occurrence network.

The Science category-subject co-occurrence network shows an example network of publication nodes, e.g., Publication 1, linked by listed subjects, e.g., A1.

Once the global network map is constructed for interdisciplinarity, the Eigenvector centrality (EIG) values of all nodes (in this network, science category-subjects) are measured. In this science category-subject co-occurrence network of interdisciplinary science, a science category-subject that is more important or influential can be regarded as a key science category-subject in an interdisciplinary science field, and those with a greater network value should be highlighted as they are the ones leading science category interdisciplinarity. Here, EIG measures the influence of network nodes beyond mere frequency counts by considering the centrality of connected nodes (West et al. 2013). For instance, a science category-subject connected to important science category-subjects is considered to have greater influence in the network. Rather than assuming equal importance, this measure differentiates the weight of edges by the importance of connected nodes. Unlike degree centrality, which solely focuses on the number of connections, EIG assesses a node’s importance by evaluating the significance of its connections. This approach captures the qualitative aspect of network relationships. Furthermore, while PageRank is specifically tailored for directed networks, EIG’s versatility allows it to be effectively applied to undirected networks as well. In this respect, EIG can be used as an indicator for measuring the importance or influence of the emergent field interdisciplinarity (Heo and Lee 2019; Qian et al. 2017; Rapach et al. 2015). With EIG, a network index that measures the influence of a node in a network by assigning weights to each connection based on the centrality of the connected node (Bonacich 2007), the key science category-subjects, i.e., those that are comparatively more important, can be isolated.
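To make the measure concrete, the sketch below computes eigenvector centrality on a small weighted, undirected network by power iteration. It is an illustration under stated assumptions rather than the implementation used in the study: the function name, the edge weights, and the convergence settings are hypothetical, and the iteration uses A + I instead of A (same eigenvectors) so that it also converges on bipartite networks, which can arise when only two categories are linked.

```python
def eigenvector_centrality(edge_weights, max_iter=500, tol=1e-10):
    """Eigenvector centrality of a weighted, undirected network.

    edge_weights maps (node, node) pairs to publication counts.
    Scores are normalized to unit Euclidean length.
    """
    nodes = sorted({n for pair in edge_weights for n in pair})
    adjacency = {n: {} for n in nodes}
    for (s, t), w in edge_weights.items():
        adjacency[s][t] = adjacency[s].get(t, 0) + w
        adjacency[t][s] = adjacency[t].get(s, 0) + w

    score = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # One step of x <- (A + I) x: a node inherits the scores of its
        # neighbours, weighted by the strength of each connection.
        updated = {
            n: score[n] + sum(w * score[m] for m, w in adjacency[n].items())
            for n in nodes
        }
        norm = sum(v * v for v in updated.values()) ** 0.5
        updated = {n: v / norm for n, v in updated.items()}
        converged = max(abs(updated[n] - score[n]) for n in nodes) < tol
        score = updated
        if converged:
            break
    return score


# Hypothetical weights: A1 and B1 co-occur often, A2 and B2 are peripheral.
weights = {("A1", "B1"): 3, ("A1", "B2"): 1, ("A2", "B1"): 1}
eig = eigenvector_centrality(weights)
```

In this toy network A2 and B2 have the same number of connections, but each inherits its score from the well-connected node it is attached to, which is the property that distinguishes EIG from a plain frequency or degree count.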

Using EIG, the conceptual framework of dominant and emerging science fields is proposed for the following purposes. First, by using EIG and its growth rate (EIG.GR), either dominant- or growing-sciences in terms of knowledge recombination can be determined. The threshold for dominant and growing interdisciplinary science is set to the top 10% of science category-subjects. Essentially, only those that are ranked in the top 10% in each measure are selected and named as dominant- and growing-sciences, respectively. Choosing the top 10% threshold for EIG and EIG.GR as criteria for identifying dominant or emerging science subjects is a deliberate methodological decision. This threshold is designed to selectively highlight the most influential or rapidly evolving fields, accounting for the skewed distribution of scientific networks where a few nodes accumulate the majority of connections. It allows for the identification of both established and emerging fields, reflecting on the dynamic nature of scientific research. A conservative approach like this minimizes false positives due to statistical fluctuations, ensuring that only subjects with consistently high metrics are considered. Furthermore, setting a clear benchmark facilitates comparative analysis over time and across disciplines, providing a consistent and reliable method for tracking changes in the scientific landscape. This choice underscores a strategic approach to recognizing significant trends and shifts within the realm of scientific research, emphasizing the importance of both sustained influence and notable growth in determining the prominence of science subjects.

As illustrated in Fig. 3, if the EIG (or EIG.GR) value of a science category-subject falls within the top 10%, it is considered to be a dominant (or emerging) science. If the values of both EIG and EIG.GR are within the top 10%, then the science category-subject can be classified as both dominant and emerging, signifying not only its current influence but also a significant increase in its impact. Conversely, if neither value falls within the top 10%, the science category-subject is not considered either dominant or emerging. This allows us to focus on the specific list of publications that are more valuable in interdisciplinary science activity. Also, this contributes to improving the computation process for running text analysis by reducing the sample size. Rather than running text analysis for the whole sample, focusing on the selected publications that can be assumed to have more potential and to be consistent in terms of science subjects can improve the precision of our analysis. In this regard, selected growing interdisciplinary science category-subjects can be used as a reference for potential ones in the future. Due to the path-dependent nature of knowledge, a strong tendency or preference to follow such a trajectory is often observed, especially in knowledge-intensive activities. In other words, either a present network position or current network growth is very likely to be consistent also in the following period. This will be discussed in more detail with empirical findings in the following section.
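The classification rule can be sketched as follows. This is a minimal illustration with hypothetical names and data: the text does not give the formula for EIG.GR, so the relative change between the two periods is assumed here, and using the later period for the EIG ranking is likewise an assumption.

```python
import math


def top_decile(values):
    """Return the keys ranked in the top 10% of values (at least one key)."""
    k = max(1, math.ceil(len(values) / 10))
    ranked = sorted(values, key=values.get, reverse=True)
    return set(ranked[:k])


def classify_subjects(eig_prev, eig_curr):
    """Label each subject as 'dominant', 'emerging', 'both', or 'neither'.

    Assumption: EIG.GR is the relative change in EIG between periods.
    Subjects with no (or zero) EIG in the first period have no defined
    growth rate and are left out of the growth ranking.
    """
    growth = {
        s: (eig_curr[s] - eig_prev[s]) / eig_prev[s]
        for s in eig_curr
        if eig_prev.get(s, 0) > 0
    }
    dominant = top_decile(eig_curr)
    emerging = top_decile(growth) if growth else set()

    labels = {}
    for s in eig_curr:
        if s in dominant and s in emerging:
            labels[s] = "both"
        elif s in dominant:
            labels[s] = "dominant"
        elif s in emerging:
            labels[s] = "emerging"
        else:
            labels[s] = "neither"
    return labels


# Hypothetical EIG values for ten subjects in two consecutive periods.
eig_prev = {f"S{i}": 0.1 * (10 - i) for i in range(10)}
eig_curr = dict(eig_prev)
eig_curr["S9"] = 0.3  # a peripheral subject whose influence triples
labels = classify_subjects(eig_prev, eig_curr)
```

With ten subjects, the top 10% corresponds to a single subject per measure, so S0 is labelled dominant and S9 emerging, while all others fall outside both groups.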

Fig. 3: Concept of growing and dominant interdisciplinary subjects.

The graphs demonstrate how emerging science differs from dominant science as measured by Eigenvector centrality and the growth rate of Eigenvector centrality.

Since the main interest of this study is exploring new rising topics in interdisciplinary science fields, we focus on growing-sciences rather than dominant-sciences. For the following step, publications representing growing interdisciplinary science category-subjects are filtered.

Embedded topic modeling

BERTopic

To derive topics for growing-sciences of each interdisciplinary science document, the BERTopic model is used. BERT, also known as Bidirectional Encoder Representations from Transformers, is a deep learning-based language model built on the Transformer architecture and developed by Google (Devlin et al. 2019). As presented in Fig. 4, BERTopic is an integrated topic modeling technique that incorporates BERT embeddings, Uniform Manifold Approximation and Projection (UMAP), Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), and a class-based Term Frequency-Inverse Document Frequency (c-TF-IDF) (Grootendorst 2022).

Fig. 4: Process of BERTopic modeling.

The process of BERTopic modeling involves transforming document data into vectorized data, reducing the dimensionality, and organizing the data into clusters and topics.
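For orientation, the components of this process can be wired together with the bertopic Python package roughly as sketched below. This is a hedged illustration, not the configuration used in the study: apart from the “all-MiniLM-L6-v2” embedding model, all hyperparameter values are assumptions (they follow values commonly shown in the BERTopic documentation), and the function name and the abstracts variable in the usage note are hypothetical. The sketch assumes the bertopic, umap-learn, hdbscan, and scikit-learn packages are installed; they are imported inside the function so that the settings can be inspected without them.

```python
# Assumed hyperparameters; the study does not report the values it used.
UMAP_PARAMS = dict(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
HDBSCAN_PARAMS = dict(
    min_cluster_size=10,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,
)


def build_topic_model():
    """Assemble a BERTopic model from its four components."""
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    return BERTopic(
        embedding_model="all-MiniLM-L6-v2",      # step 1: document embeddings
        umap_model=UMAP(**UMAP_PARAMS),          # step 2: dimensionality reduction
        hdbscan_model=HDBSCAN(**HDBSCAN_PARAMS), # step 3: density-based clustering
        # step 4: term counts per cluster, which BERTopic weights with c-TF-IDF
        vectorizer_model=CountVectorizer(stop_words="english"),
    )
```

Given a list of abstract strings, a call such as `topics, probs = build_topic_model().fit_transform(abstracts)` would assign each document to a topic, with the label -1 reserved for documents that HDBSCAN treats as noise.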

The first step is embedding vectorization, which transforms target documents into vectors. Unlike conventional topic modeling methods that rely on Bag-of-Words (BoW) approaches, focusing solely on the frequency of terms, BERTopic utilizes embedding vectors. These embeddings represent documents in a space that, while lower in dimension compared to the vast potential vocabulary of BoW, is rich in capturing the deep semantic information inherent in the text. This allows for a higher contextual understanding of documents. By leveraging pre-trained embedding models, BERTopic enables the analysis of documents with nuanced insights into their contextual meanings, surpassing the limitations of traditional encoding vectorization methods. Here, we utilized the default text representation model, “all-MiniLM-L6-v2”, for our analysis. This model, designed as an all-purpose model, converts sentences and paragraphs into a 384-dimensional dense vector space. It is versatile and suitable for tasks like clustering or semantic search, especially for English language text. Compared to the “all-mpnet-base-v2” model, which is known to provide the best quality, it operates five times faster while still offering good quality (Footnote 3), and its effectiveness has led to its adoption in various relevant studies (Samsir et al. 2023; Wang et al. 2023).

The second step in BERTopic is dimensionality reduction. This is crucial because the clustering algorithms that are integral to topic modeling perform better with lower-dimensional data. The main challenge addressed here is the ‘curse of dimensionality,’ whereby high-dimensional spaces degrade the efficiency and effectiveness of clustering algorithms. By reducing the dimensionality of the embedding space, BERTopic mitigates this issue and facilitates more coherent and accurate topic clusters. For this reason, the UMAP algorithm is used to reduce the complexity of the embedding vectors while preserving their essential structure. Assuming that high-dimensional data lie on a lower-dimensional manifold, UMAP efficiently maps complex data onto a simpler space while preserving relative distances and density, which makes it easier to identify clusters of similar documents (McInnes et al. 2016).

The following step is document clustering using HDBSCAN, which generates clusters based on the density of data points by using the hierarchical tree method. One of the strengths of HDBSCAN is that it can effectively identify and handle noise, which can help to derive more meaningful clusters. In addition, the combination of UMAP and HDBSCAN shows better performance in text clustering (Asyaky and Mandala 2021), and the clustering results can be modified by adjusting the hyperparameters regarding cluster generation.
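The embedding, reduction, and clustering stages described so far can be sketched as a single BERTopic configuration. This is a minimal illustration rather than the configuration used in this study: the function name `build_topic_model` and the seed are hypothetical, and the UMAP and HDBSCAN values are BERTopic's documented defaults, assumed here because the text does not report them.

```python
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"  # embedding model named in the text


def build_topic_model(min_topic_size=10, nr_topics=None, n_gram_range=(1, 1), seed=42):
    """Assemble a BERTopic model from explicit embedding, UMAP, and HDBSCAN parts.

    The UMAP and HDBSCAN values mirror BERTopic's defaults; they are
    assumptions, not settings reported in this study.
    """
    # Imported lazily so the sketch can be read or imported without the heavy
    # third-party packages (bertopic, sentence-transformers, umap-learn, hdbscan).
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from umap import UMAP

    # Step 1: documents -> 384-dimensional sentence embeddings.
    embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)
    # Step 2: reduce the embeddings to a few dimensions before clustering.
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=seed)
    # Step 3: density-based clustering; unassigned documents become outliers (-1).
    hdbscan_model = HDBSCAN(min_cluster_size=min_topic_size, metric="euclidean",
                            cluster_selection_method="eom", prediction_data=True)
    return BERTopic(embedding_model=embedding_model, umap_model=umap_model,
                    hdbscan_model=hdbscan_model, n_gram_range=n_gram_range,
                    nr_topics=nr_topics)
```

With the packages installed, a call such as `topics, probs = build_topic_model().fit_transform(abstracts)` would fit the model, where `abstracts` is a hypothetical list of document strings. Because a custom HDBSCAN object is passed, the minimum topic size is set through `min_cluster_size`.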

The last step is topic generation with c-TF-IDF, an adaptation of TF-IDF designed to capture the representative terms of the documents in each topic. TF-IDF is an effective measure for finding representative terms that combines term frequency and inverse document frequency (Salton and Buckley 1988). Under the assumption that a representative term should be distinctive of its document, the measure favors terms that occur frequently in a document but infrequently in other documents. By using c-TF-IDF (Footnote 4; Eq. 1), the importance of a term within a specific class can be found.

$$\mathrm{c\text{-}TF\text{-}IDF}_{i,c}=\frac{\mathrm{tf}_{i,c}}{w_{c}}\times \log \frac{N}{\mathrm{Docs}(w)}$$
(1)
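A small worked example may help make Eq. 1 concrete. The symbol definitions are not reproduced in this section, so the sketch below rests on assumptions: tf_{i,c} is taken as the count of term i in class c, w_c as the total number of words in class c, N as the total number of documents, and Docs(w) as the number of documents containing the term. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_class):
    """Score each term per class following Eq. 1 under the assumptions above.

    docs_by_class maps a class (topic) label to a list of tokenized documents.
    """
    n_docs = sum(len(docs) for docs in docs_by_class.values())  # N (assumed)
    doc_freq = Counter()  # Docs(w): number of documents containing each term (assumed)
    for docs in docs_by_class.values():
        for doc in docs:
            doc_freq.update(set(doc))

    scores = {}
    for label, docs in docs_by_class.items():
        tf = Counter(token for doc in docs for token in doc)  # tf_{i,c}
        w_c = sum(tf.values())                                # words in class c
        scores[label] = {
            term: (count / w_c) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
    return scores


# Toy corpus: two classes with two tokenized documents each.
toy = {
    "wood": [["wood", "fiber", "bark"], ["wood", "panel"]],
    "water": [["water", "membrane"], ["water", "wood"]],
}
scores = c_tf_idf(toy)
```

In this toy corpus, "wood" appears in three of the four documents, so it is penalized in both classes, whereas "fiber" and "membrane" are distinctive of a single class and score higher.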

Qualitative validation of results

Once the interdisciplinary science maps have been analyzed, a list of representative publications for each interdisciplinary category can be generated based on the topics defined through BERTopic. Reliance on machine learning, however, can lead to misclassification (Lyutov et al. 2021), so we examine the results of the topic modeling to identify where the newly emergent topics stem from and to describe them. Many recent studies that apply BERTopic have performed qualitative or manual validation of the results (Balcı et al. 2023; Capra 2024; de Lima et al. 2023; Kasperiuniene et al. 2020; Wang et al. 2023). Following these studies, we review the results of the BERTopic process qualitatively to validate them. After BERTopic is performed on the data sets, a list of topic keywords and representative articles emerges for each topic (e.g., topic-1) through the unsupervised process. First, the topic keywords are considered to determine whether they provide a common theme for the articles under each topic, and the topics are examined to identify the characteristics of emergent topics. Second, traceability requires parsimony: the representations should not be unnecessarily complex, so that even non-experts are able to interpret them (Rafols et al. 2010). The results are therefore checked to confirm that they are rational, or “make sense”, to non-experts. Finally, the journal lists are evaluated to discern the characteristics of the topics. Nonsensical topics would be expected to be random or not to fit our definition of global interdisciplinarity.

Case Study on Interdisciplinary Science in the Web of Science

Preparing the interdisciplinary science dataset

Following previous bibliometric studies using topic modeling techniques (Suominen and Toivanen 2016; Velden et al. 2017; Yau et al. 2014), we use the Web of Science Core Collection (WoS; Footnote 5), a database of peer-reviewed scholarly journals published worldwide. The WoS database provides the metadata required for pre-processing, e.g., selecting peer-reviewed journal articles.

Results of science category-subject co-occurrence network analysis

In this section, the results of the science category-subject co-occurrence network analysis are presented. Figure 5 illustrates the dominant- and growing-interdisciplinary science using the conceptual framework presented in Fig. 3, and Table 2 presents the full list of dominant- and growing-sciences. All nodes represent the science category-subjects included in each interdisciplinary science field, and the dominant-sciences (located further to the right on the x-axis) and growing-sciences (located higher on the y-axis) are labeled. Notably, a clear distinction between dominant- and growing-interdisciplinary science is observed in all cases. Considering the path-dependent nature of knowledge, the dominant-sciences are likely to remain dominant in the following period. The prediction of key emergence trends, however, focuses on new mergers of interdisciplinary science category-subjects that are expected to become more influential, rather than on those that are already well known. The gap between the two types of science category-subjects justifies our approach of distinguishing science category-subjects that are promising for the future from those that already prevail and, more importantly, indicates that focusing on the emerging topics better fits the purpose of this research.

Fig. 5: Subject co-occurrence network analysis result.

a LSB-TE. b LSB-PS. c PS-TE. d LSB-PS-TE. Note: The growing interdisciplinary science subjects are in bold.

Table 2 List of dominant and growing science category-subjects in interdisciplinary science fields.

This study focuses on the growing influence of interdisciplinary science to investigate the key topics that are likely to rise in the near future. Accordingly, the publications that include growing-interdisciplinary science are used in the next step of the analysis. As shown in Table 2 and Fig. 6, the Eigenvector centrality (EIG) values of growing cross-domain science category-subjects in the following period tend to be greater than those of other science fields. This indicates that the science category-subjects identified as growing in the current period show the greatest increases in the following period. With few exceptions, these subjects differ from those in the dominant-science fields. For BERTopic modeling, therefore, the set of cross-domain publications that include growing-science is used.

Fig. 6: Comparison of the EIG in following period between Growing-Interdisciplinary Science category-subjects and others.

Note: On average, Eigenvector centrality in the following period of Growing-Interdisciplinary Science category-subjects (0.348) is higher than others (0.093).
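To illustrate how an Eigenvector centrality value of the kind compared in Fig. 6 can be obtained from a co-occurrence network, the following sketch applies power iteration to a small weighted graph. It is an assumption-laden illustration, not the procedure used in this study: the subject names and co-occurrence counts are hypothetical, and the identity shift in the update is a common convergence aid rather than something described in the text.

```python
import math


def eigenvector_centrality(adj, max_iter=200, tol=1e-10):
    """Power iteration on a symmetric weighted adjacency given as dict of dicts."""
    nodes = sorted(adj)
    x = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # Multiply by (A + I); the shift leaves the leading eigenvector unchanged
        # and avoids oscillation on bipartite graphs.
        new = {n: x[n] + sum(w * x[m] for m, w in adj[n].items()) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        if sum(abs(new[n] - x[n]) for n in nodes) < tol:
            return new
        x = new
    return x


# Hypothetical co-occurrence counts between four subjects (symmetric).
edges = [("Energy", "Materials", 5), ("Energy", "Water", 3),
         ("Materials", "Water", 2), ("Materials", "Forestry", 1)]
adj = {}
for a, b, w in edges:
    adj.setdefault(a, {})[b] = w
    adj.setdefault(b, {})[a] = w

centrality = eigenvector_centrality(adj)
ranking = sorted(centrality, key=centrality.get, reverse=True)
```

A subject scores highly when it co-occurs often with subjects that are themselves well connected, which is why the measure captures influence rather than raw frequency.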

Unsupervised classification of the emergent interdisciplinary science topics

BERTopic setting

While conventional topic modeling approaches consider the number of topics as an important hyperparameter to run analysis, BERTopic does not necessarily require it because UMAP and HDBSCAN ease the optimization of the clustering process, and automatically generate the list of topics. However, setting the number of topics is still important because a fully automated learning process may end up with an incomprehensible result. For instance, if BERTopic is conducted with its default settings and HDBSCAN optimization algorithms, it will automatically generate a list of topics, but this does not guarantee that the result is also acceptable in terms of application and obtaining insights.

For this reason, three hyperparameters (n-gram range, number of topics, and minimum topic size) are tested within ranges to find the best BERTopic model results (Table 3). The n-gram range determines whether terms cover unigrams, bigrams, or trigrams; the number of topics sets the initial number of topics when running BERTopic; and the minimum topic size sets the minimum number of documents that each topic should contain. The first two hyperparameters were tested over the same ranges in every case (n-gram range: unigram, bigram, trigram; number of topics: 5–1000), whereas the minimum topic size was set in proportion to the total number of publications. A fixed minimum topic size interacts strongly with the number of documents and can yield topics that are too broad or too narrow in different cases; in particular, it affects the size of the outlier topic and can produce an unexplainable number of topics. Applying a proportional minimum topic size therefore helps to minimize the size of outlier topics and maintain an explainable number of topics. Accordingly, for each case the minimum topic size is an integer representing 0.5–3% of total publications. To cover combinations of hyperparameters over these wide ranges, a random search is used to find an optimized parameter set from random combinations, limited to no more than 100 iterations.

Table 3 Hyperparameter testing of BERTopic.
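The random search over the three hyperparameters can be sketched as follows. This is an illustrative sampler under stated assumptions: the encoding of unigram, bigram, and trigram as `(1, 1)`, `(1, 2)`, and `(1, 3)`, the uniform sampling, the floor of 2 on the minimum topic size, and the function name `sample_configs` are all hypothetical choices, not details reported in the text.

```python
import random

# Assumed encoding of "unigram, bigram, trigram" as inclusive n-gram ranges.
NGRAM_RANGES = [(1, 1), (1, 2), (1, 3)]


def sample_configs(n_publications, n_iter=100, seed=0):
    """Draw up to n_iter random hyperparameter combinations for one dataset."""
    rng = random.Random(seed)
    configs = []
    for _ in range(n_iter):
        share = rng.uniform(0.005, 0.03)  # 0.5-3% of total publications
        configs.append({
            "n_gram_range": rng.choice(NGRAM_RANGES),
            "nr_topics": rng.randint(5, 1000),  # inclusive bounds
            # Integer minimum topic size proportional to the dataset size.
            "min_topic_size": max(2, round(share * n_publications)),
        })
    return configs


configs = sample_configs(n_publications=4000)
```

For a hypothetical dataset of 4,000 publications, the sampled minimum topic sizes fall between 20 and 120 documents, so the same proportional rule scales to larger or smaller interdisciplinary categories.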

For each iteration, the information entropy value is measured (Eq. (2)) (MacKay 2003). By finding cases with an uneven distribution of words in the topic, a set of topics with explicit semantic expression can be found (Wang et al. 2023). As a measure of uncertainty, information entropy provides a means to determine whether topics can be clearly distinguished. Accordingly, the model with the lowest information entropy value is selected as the best model.

$$\mathrm{Entropy}_{i}=-K\sum_{i=1}^{m}P\left(W_{i}\mid T\right)\log P\left(W_{i}\mid T\right)$$
(2)
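The selection rule based on Eq. (2) can be illustrated with a short calculation. The sketch makes several assumptions that the text does not specify: P(W_i|T) is obtained by normalizing a topic's word weights, K is set to 1, the natural logarithm is used, and a model's score is the mean entropy over its topics. All names and the toy weights are hypothetical.

```python
import math


def topic_entropy(word_weights, k=1.0):
    """Entropy of one topic's word distribution, following Eq. (2)."""
    total = sum(word_weights)
    probs = [w / total for w in word_weights if w > 0]  # P(W_i | T), assumed
    return -k * sum(p * math.log(p) for p in probs)


def model_entropy(topics, k=1.0):
    """Mean topic entropy for one fitted model (aggregation is an assumption)."""
    return sum(topic_entropy(weights, k) for weights in topics) / len(topics)


def select_best(candidates):
    """Return the name of the candidate model with the lowest entropy."""
    return min(candidates, key=lambda name: model_entropy(candidates[name]))


# Two hypothetical models, each with two topics of four word weights.
candidates = {
    "flat": [[1, 1, 1, 1], [1, 1, 1, 1]],
    "peaked": [[0.97, 0.01, 0.01, 0.01], [0.7, 0.1, 0.1, 0.1]],
}
best = select_best(candidates)
```

A topic whose weight is concentrated on a few words has low entropy, so the "peaked" model is preferred over the "flat" one, in which every word is equally likely.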

BERTopic results

Once the dataset has been divided into different interdisciplinary sciences, the BERTopic process identifies articles that have similar topics, limited to the number of topics defined. The topics are defined through an unsupervised algorithm that identifies common lists of keywords that describe the topics.

Table 4 presents the groups of topics that appear in the greatest number of articles for each pairing of the subheadings: LSB-TE, LSB-PS, PS-TE, and LSB-PS-TE. The list of topic keywords identified in the interdisciplinary text set is used to define the topics. Outlier groups are used to prevent the formation of nonsensical or isolated topic groups.

Table 4 Topics and keyword lists for science category co-occurrence pairs.

Note: The following prompt was used with ChatGPT (GPT-4) to label the topics: “I have topic that contains the scientific publications related to [“Name of Interdisciplinary Science”]. The topic is described by the following keywords: [“List of keywords”] Based on the above information, can you give a short label of the topic?”
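The labeling prompt can be filled programmatically for each topic. The sketch below only assembles the prompt text and does not call any model; the template wording is taken from the prompt quoted for Table 4, while the function name, the comma-separated keyword format, and the example keywords are hypothetical.

```python
# Template wording follows the prompt quoted for Table 4; placeholders replace
# the bracketed fields.
PROMPT_TEMPLATE = (
    "I have topic that contains the scientific publications related to {field}. "
    "The topic is described by the following keywords: {keywords} "
    "Based on the above information, can you give a short label of the topic?"
)


def build_label_prompt(field, keywords):
    """Fill the template for one topic; keywords are joined with commas (assumed)."""
    return PROMPT_TEMPLATE.format(field=field, keywords=", ".join(keywords))


# Hypothetical keywords for a topic in the LSB-TE category.
prompt = build_label_prompt("LSB-TE", ["wood", "fiber", "bark"])
```

Keeping the template fixed across topics means that differences in the returned labels reflect the keywords rather than the wording of the request.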


Qualitative validation of results

Following recent studies that apply BERTopic (Balcı et al. 2023; Capra 2024; de Lima et al. 2023; Kasperiuniene et al. 2020; Wang et al. 2023), this study performed qualitative or manual validation of the results. While topic modeling may allow for the analysis of a large corpus of data, the results of the topic modeling should remain decipherable to non-experts (Rafols et al. 2010). Thus, we perform small-scale, qualitative analysis to verify that this condition holds.

While all articles are matched with the topic that is the most likely fit, not all articles that fall under the topic are equally representative of the topic. The representative articles are identified through the topic modeling technique, which means that they have the highest probability of matching the topic. The top 3 representative articles that fit the topics defined through topic modeling are provided in Table 5. All of the representative articles can be readily fit with the topics with which they are matched.

Table 5 Representative articles for each interdisciplinary emergent topic.

When considering the LSB-TE case, “Mechanical Properties and Composition of Natural Fibrous Materials” is most represented by articles LSB-TE-0-A through C. The article titles contain the phrases that are recognizably appropriate for the emergent topic: “tree bark,” “insulation material,” “manufacturing,” “green-glued plywood panel,” “resistance of thermally modified,” “under extreme pressure,” and “ash wood.” Moreover, the journal titles are also representative of the topic: Forest Products Journal and European Journal of Wood and Wood Products (appears twice). Similar patterns are found for the other emergent topics listed in Table 5. Therefore, we find that the emergent topics that have been defined represent an easily recognizable theme. More broadly, many of the emergent topics are related to green technologies and sciences and to a lesser extent health-related technologies.

The journals with the greatest number of emergent interdisciplinary topic publications can be identified from the list of identified topics (Table 6). Yet, the journals in which the topics appear are clustered among a small portion of all publications; the distribution of publications with emergent interdisciplinary topics is skewed towards a small share of all journals in the dataset. Half of all publications were published in the top quintile of all journals for each interdisciplinary category group: 14th percentile (LSB-TE), 13th percentile (LSB-PS), 10th percentile (PS-TE), and 18th percentile (LSB-PS-TE). Additionally, when considering the top journals that emerge from the ranking of interdisciplinarity results, the categories become clearer when considering the emergent topics. For PS-TE, the emergent topics can only be seen in Desalination and Water Treatment and International Journal of Hydrogen Energy. The other titles are suggestive of the science and technologies involved: physical chemistry, sensors, and materials.

Table 6 Top 10 journals by interdisciplinary category pairs.
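The concentration of publications among journals can be quantified by finding the smallest share of journals that accounts for half of all publications. The sketch below is one plausible way to compute such a figure, assuming journals are ranked by publication count; it is not necessarily the calculation behind the percentiles reported above, and the journal names and counts are hypothetical.

```python
def journal_share_for_half(pubs_per_journal, target=0.5):
    """Smallest fraction of journals (ranked by output) covering `target` of publications."""
    counts = sorted(pubs_per_journal.values(), reverse=True)
    total = sum(counts)
    running = 0
    for rank, count in enumerate(counts, start=1):
        running += count
        if running >= target * total:
            return rank / len(counts)
    return 1.0


# Hypothetical publication counts for ten journals (100 publications in total).
toy_counts = dict(zip(
    [f"J{i}" for i in range(1, 11)],
    [40, 20, 10, 8, 6, 5, 4, 3, 2, 2],
))
share = journal_share_for_half(toy_counts)
```

In this toy case the two most productive journals already cover 60 of 100 publications, so half of the output sits within the top 20% of journals.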

Discussion and conclusion

As science continues to expand its research output, the science of science emergence provides an opportunity to understand where new knowledge, the source of innovation, originates by examining global interdisciplinarity. Most previous studies have focused on breakthroughs or on identifying popular directions within narrow fields of study measured by frequency. These past approaches apply the logic of identifying patterns of frequency-based dominant topics within a specific field of science. In contrast, the present study provides an alternative perspective on the science of science emergence, with a focus on the influence of the changing boundaries of conjoining science across categories. The main contributions of our research are (i) to expand the definition of interdisciplinarity to include global domain-crossing science categories, (ii) to use Eigenvector centrality as a measure of influence on emergent topics, and (iii) to demonstrate the use of embedded topic modeling over a dataset that represents a global science map. This study provides an early foray into applying unsupervised classification using BERTopic modeling on interdisciplinary science datasets. It is one of the few contemporary studies that apply text-embedding-based topic modeling techniques to the science of science emergence, and the only one to focus on the influence of existing science topics on emergence.

Furthermore, the present investigation provides a simple model to achieve the desired analysis and, in addition, demonstrates that the originating subjects of interdisciplinary topics can be identified using embedded topic modeling. Using the Schumpeterian definition of knowledge creation based on recombination processes, the model examines the intersection of interdisciplinary sciences to identify the most influential topics related to emergent scientific knowledge based on science topics that are projected onto a global science map. The results can be used to identify trend profiles of the interdisciplinary sources of emergent topics over time.

Since dominant science is subject to the bias of size and canonical fields, emergent science based on the influence of co-occurring science domains provides an alternative measure. The Eigenvector centrality value can be used as a measure of the growth of interdisciplinarity that differs from approaches that focus on dominant science in a co-occurrence network of interdisciplinary emergence. Dominant science subjects differ from the topics related to growing interdisciplinary science, which differentiates the results of this study from prior studies that emphasize frequency-based, dominant science. The approach that we used allows us to retain contextual knowledge in text analysis. Nonetheless, science subjects that appear in both growing and dominant interdisciplinary sciences, such as “Green & Sustainable Science & Technology”, may indicate greater influence on research for society and have greater potential for applications.

This study suggests that identifying emergent topics may help us better understand how to direct and use innovative research. This study detected that green- and health-related topics are emergent across many of the global interdisciplinary science categories. As global challenges emerge, more efficient and effective means to identify emergent research to address them are necessary; yet, it has become increasingly difficult to meet this aim (Petersen et al. 2021). Bloom et al. (2020) posit that if firms are shifting towards defensive research activities, then government policy must reconsider how research is publicly funded. In order to increase economic productivity, the sources of (and barriers to) innovation need to be detected within sectors and individuals. Although this may help when focusing on economic-related challenges, additional measures of research productivity may be needed when considering socially oriented innovation demands. Thus, an alternative explanation for the decline in science productivity is that social innovation, rather than economic imperatives, may be driving research.

Although the present study has departed from prior studies in several aspects, further research is needed to address its limitations. First, the number of topics that were automatically generated was small, which means that there are likely additional emergent topics that can be identified in follow-up studies. Nevertheless, the current investigation adopted a conservative approach to ensure that the topics identified were meaningful, especially when considering that the distributions are highly skewed. Future research should also consider how to refine the level at which emergent topics are still acceptably defined, e.g., recursive clustering on large-scale bibliometric data (cf. Mejia and Kajikawa 2020) while balancing the diversity of domains and similarity of emergent topics. Additionally, the NLP approach adopted here requires a comparably large amount of computing power, which, in turn, might pose a challenge for universal day-to-day applications and policy purposes.

Another limitation is that our data is constrained to scientific journal articles in the WoS. Not all innovations—especially social innovations—may be derived from science and technology fields. This approach may also ignore disciplines that tend to produce other types of publications. A broader approach that considers these types of interdisciplinarity may provide alternative sources of identifying social innovation. Lastly, while this study focused on specific characteristics of emergence defined through interdisciplinarity in the WoS, future research assessments should “consider the value and impact of all research outputs” and “consider a broad range of impact measures,” as stated in the San Francisco Declaration on Research Assessment (Cagan 2013). Rather than redefine emergence through science maps, this study aimed to explore a different approach to understanding emergence by providing an alternative perspective on emergence.

The science of science can link existing knowledge reservoirs for technology development, especially as global challenges influence the direction of science emergence that can be applied to the innovation of new technologies. A better understanding of the existing topics that are cross-domain and, as such, generate new innovative outcomes and solutions can help to apply the science of science to applicable and effective STI policy initiatives that incorporate social innovation objectives as well.