Introduction

Water pollution caused by pesticides (PCs) is a major environmental concern that hinders the achievement of the UN Sustainable Development Goals, especially in emerging countries like India and China1. The rapid urbanization and industrialization that fuel agricultural modernization have accelerated the spread of pesticides in the ecosystem, posing serious threats to biota, water quality, and human health2. Anthropogenic activities, industrial wastewater, and agricultural runoff contribute to the contamination of surface and groundwater by pesticides1. These toxic and mobile contaminants eventually enter the food chain, leading to retinal degeneration, cardiovascular problems, muscle degeneration, cancer, and irreversible cell membrane damage3,4. Various ex situ and in situ remediation processes have been implemented to treat water polluted with PCs. Among these, in situ removal of PCs using biochar adsorption matrices has gained widespread recognition as a sustainable, cost-effective, and efficient remediation procedure that is easy to design5,6.

Biochar has shown great potential for retrieving pesticides from contaminated ecosystems due to its superior physicochemical properties compared to primary feedstocks7. Biochar can be produced from various natural biological sources, such as agricultural residues, sludge, and animal manure, using different heating methods, such as fast pyrolysis, slow pyrolysis, hydrothermal carbonization, and gasification, to treat both organic and synthetic forms of pesticides, such as herbicides, insecticides, and fungicides in aqueous environments5. However, the PC adsorption efficiency of biochar-mediated aqueous systems depends on several factors, including feedstock choice, production conditions, water matrix (e.g., pH, concentration, temperature), experimental conditions (dose, treatment time), and pesticide properties (e.g., solubility, molecular weight)8. Numerous adsorption mechanisms, such as hydrogen bonding, surface complexation, electrostatic interactions, π–π interactions, van der Waals forces, and pore filling, contribute to PC removal efficiency9,10,11. Despite extensive research, the optimal settings for maximizing PC adsorption in aqueous solutions via biochar remain highly variable. Concurrent adsorption studies examining the synergistic effect of all experimental parameters are challenging. Review studies, bibliometrics, and quantitative analyses have been used to assess the effectiveness of pesticide adsorption12,13,14. Still, these methods are time-consuming and ill-suited to estimating the relative contribution of adsorption variables to removal efficiency. An empirical approach that uses a computational framework to optimize the factors related to biochar characteristics, experimental conditions, and aqueous matrix configuration, and to highlight their relative contribution to enhancing pesticide removal, can lead to an overall understanding of complex adsorption behavior15. This approach can reduce the time, cost, and resources involved in biochar-based adsorption procedures and help determine the optimal conditions for maximum PC sequestration from water.

In the chemical sciences, machine learning (ML) based data-driven models are gaining traction as powerful tools for promoting environmentally conscious research. Diverse machine learning techniques such as support vector machines (SVM), convolutional neural networks (CNN), random forests (RF), and artificial neural networks (ANN) have proven useful for identifying pesticides and their derivatives in real samples16,17,18. Additionally, ML models have been used to track the dissipation of pesticides in plants19, assess their effects on soil microbial communities20, and evaluate their potential genotoxic impact on humans21. However, the application of ensemble ML algorithms to predict the efficacy of biochar adsorption matrices in treating pesticides has not been reported. While various studies have experimentally investigated pesticide remediation from water using different biochar systems and validated scientific hypotheses using classical sorption models, machine-learning algorithms remain unexplored in this context22.

Furthermore, previous studies have utilized molecular simulations to forecast biochar adsorption23,24. However, these investigations have focused on capturing minuscule interactions rather than evaluating the impact of physicochemical parameters on the remediation of pesticides using bulk biochar. To fill the existing research gap, we utilize three ensemble machine learning (ML) models, namely Categorical boost (CatBoost), Light gradient boost machine (LightGBM), and Random forest (RF), to predict the adsorption efficiency of biochar in removing pesticides from aqueous environments. While traditional statistical methods can establish linear or quadratic relationships between an individual independent variable and the target variable, ML models can simultaneously consider all correlated features and establish a more complex relationship with the output variable. Moreover, the developed models can assist researchers in designing experiments for biochar-based pesticide remediation. In this study, we considered ten adsorption attributes linked to three different aspects of the adsorbate-adsorbent system, namely (i) biochar properties, (ii) aqueous matrix configuration, and (iii) experimental conditions, and assessed their impact on pesticide adsorption in biochar-treated aqueous media.

Results

Modeling pesticide adsorption on biochars

Decision-tree-based ensemble machine learning (ML) models, including CatBoost, LightGBM, and Random forest (RF), were utilized to predict the efficacy of adsorptive sequestration of pesticides from an aqueous medium using biochar. These models integrate multiple individual models to enhance the accuracy of predictions. The decision tree-based model flow utilized eight input attributes to split the data into smaller subsets based on specific features. This process was recursively repeated to achieve a decision or prediction. Grid search optimization was employed to determine the best possible combination of hyperparameters, minimizing the root-mean-squared error (RMSE) on both the train and test sets. More information regarding the hyperparameters of the fine-tuned models can be found in Supplementary Table 1 in the supplementary information (SI).

Figure 1a–c shows scatter plots that juxtapose the experimentally determined pesticide adsorption efficacy values with those predicted by the model. The CatBoost framework yielded the highest coefficient of determination (R2) values of 0.968 and 0.956 for the training and test subsets, respectively, surpassing the LightGBM (R2train = 0.931, R2test = 0.862) and Random forest (R2train = 0.820, R2test = 0.796) models. CatBoost was the best-performing model because the gap between its train and test losses was smaller than that of LightGBM, while the RF prediction error was much higher than that of either boosting model (see SI, Supplementary Fig. 1). The feature importance graph (Fig. 1d) indicated that surface area (SA) had the most significant effect on the model, followed by pesticide concentration (Co) and pore volume, respectively. Figure 1e shows the SHAP values, which visually represent each feature’s relative contribution to the final predictions generated by the CatBoost model.

Fig. 1: Performance of machine learning models and feature importance.

The figure presents the performance comparison of three machine learning models: a Categorical boost (CatBoost), b Light Gradient Boosting Machine (LightGBM), and c Random Forest (RF) in predicting the pesticide adsorption capacity of biochars in aqueous systems. In addition, d and e show the CatBoost built-in feature importance and the SHAP feature importance plots, respectively.

These results suggest the superiority of CatBoost in predicting pesticide adsorption on biochars. The influence of each input feature on the output, i.e., the PC adsorption performance, was assessed for the best-performing CatBoost model using both the CatBoost built-in feature importance criterion and SHapley Additive exPlanations (SHAP). Figure 1d exhibits the feature importance of the trained model based on the total percentage gain attributed to each feature. The findings of this study highlight surface area (SA) as the most significant physicochemical characteristic affecting pesticide adsorption on biochar, followed by pesticide concentration (Co), pore volume (Vt), and biochar dose (Fig. 1d, e).

Model feature insights and their implication on sustainability in agriculture

The partial dependence plots (PDP), depicted in Fig. 2I, II, offer valuable insights into the impact of biochar attributes (SA, Vt, pH_BC), experimental factors (CT, dose), and aqueous conditions (pH, Co, and T) on the capacity of biochar to treat pesticide-contaminated water in agricultural fields. Among the biochar attributes, the textural characteristics of biochar, especially surface area, have the most significant influence on the model prediction results. In contrast, the biochar pH contributes the least among all the input parameters. Increases in surface area (SA) and pore volume (Vt) correlate directly with increases in the pesticide adsorption capacity of biochar within specific ranges (0.25–1000 m2/g and 0.004–0.5 cm3/g, respectively) (Fig. 2I-a, Fig. 2II-a). The expansion in the surface area leads to an increased number of active adsorption sites25,26,27,28. At the same time, the interconnected pore structure of well-structured biochar facilitates efficient diffusion and transfer of adsorbate molecules through its pores, thereby augmenting adsorption performance29. This information can help researchers, agricultural scientists, and biochar producers optimize biochar production processes for targeted remediation of specific pesticides, thereby promoting the responsible and efficient use of biomass feedstocks to produce biochar with effective textural properties.

Fig. 2: Attribute analysis for pesticide adsorption on biochar.

(i) 1D-PDP of Model Attributes: One-dimensional Partial Dependence Plots (1D-PDP) showcasing the relationship between individual model attributes and pesticide adsorption capacity. (ii) 2D-PDP on Pesticide Adsorption Capacity: Two-dimensional Partial Dependence Plots (2D-PDP) exploring the combined effects of model attributes on pesticide adsorption capacity of biochars.

Among the experimental factors, biochar dose significantly influenced the model predictions (Fig. 2I-f). The pesticide adsorption capacity of biochar decreased with the increasing biochar dose and was found to be highest for a biochar dose of less than 1 g/L. As the quantity of biochar increases, the available surface area per unit mass decreases, or pore blockage occurs due to particles’ aggregation, which reduces the accessibility of active adsorption sites, resulting in reduced adsorption capacity30. The biochar dose of less than 1 g/L with a treatment duration of fewer than 500 min provided the biochar’s maximum pesticide adsorption capacity (Fig. 2II-b). These findings can aid in optimizing the biochar dose with the treatment time, which maximizes the removal of specific pesticides from agricultural water. By optimizing the biochar dose and contact time during pesticide treatment, the residual biochar in the treated water can be utilized for subsequent soil amendment31,32.

Among the water matrix parameters (i.e., Co, pH, and T), pesticide concentration has the greatest impact on the model predictions, followed by pH and water temperature. The biochar adsorption capacity increased linearly with increasing pesticide concentration in water (0.2–2000 mg/L) (Fig. 2I-h). Higher pesticide concentrations result in steeper concentration gradients between the solution and the biochar surface. This concentration gradient promotes more efficient mass transfer of pesticides from the solution to the biochar, allowing for increased adsorption capacity33. The model results can help in optimizing the water matrix parameters for maximum adsorption of pesticides on biochar (Fig. 2II-c, Fig. 2II-d). Such optimization can help improve water quality near agricultural fields by reducing pesticide runoff and subsequent contamination.

With the acquired insights into the capacity of biochar to adsorb pesticides by manipulating model attributes across a wide range of parameters, machine learning-mediated modeling can be pivotal in advancing sustainable agricultural practices. This can be achieved through optimizing resource utilization to achieve a well-balanced biochar design, enhancing soil health via applying biochar for soil amendments, and mitigating pesticide runoff and environmental contamination. The present data utilized to train the model primarily comprises experiments conducted on laboratory-scale solutions. Nonetheless, to enhance the predictive capability of the machine learning model, it is essential to incorporate real-world data obtained from diverse agricultural scenarios and environments, thus ensuring its applicability and generalizability.

Discussion

Benefits of the ML intervention and its pathway to impact agriculture

Utilizing a machine learning framework to predict pesticide removal from agricultural systems using biochar holds significant advantages for various stakeholders in the agricultural sector. Firstly, agrarian practitioners can significantly benefit from adopting machine learning models to assess pesticide removal using biochar. By doing so, farmers can gain valuable insights into the performance and effectiveness of biochar-based remediation methods. With this knowledge, individuals can make well-informed decisions when implementing these techniques to address irrigation water concerns. This approach not only minimizes pesticide exposure but also fosters sustainable agricultural practices.

Moreover, researchers investigating sustainable agricultural practices and water quality can leverage this technology to their advantage. By utilizing predictions generated by the machine learning model, they can explore the impact of biochar on pesticide adsorption and its subsequent effects on soil and water systems. Through extensive data analysis, researchers can develop robust models capable of predicting the efficiency of biochar in removing specific pesticides from water. Consequently, this promotes subsequent investigations aimed at optimizing the utilization of biochar as an adsorbent and gaining insights into the kinetics of pesticide elimination. Ultimately, this enhances our comprehension of the intricate interactions among biochar characteristics, soil water chemistry, and pesticide behavior.

Furthermore, biochar producers can harness this technology to tailor their products for adsorbing pesticides in agricultural runoffs. Producers can optimize their production processes by understanding the relationship between biochar characteristics and pesticide adsorption efficiency and offer highly effective biochar products for pesticide treatment. This contributes to biochar’s commercialization and drives its adoption as a sustainable solution for treating agricultural water and reducing pesticide contamination.

Lastly, environmental agencies can rely on the machine learning model to assess the potential impacts of pesticides from agricultural runoffs on water bodies. By incorporating the predicted adsorption efficiency of biochar, these agencies can evaluate the effectiveness of different biochar types in reducing pesticide concentrations in soil and water. The resulting knowledge enables the development of more targeted policies and guidelines to mitigate pesticide contamination, thereby safeguarding the environment and public health.

The pathway to impact in agriculture involves empowering stakeholders with knowledge and tools to make informed decisions about agricultural water treatment, pesticide management, and water resource protection. The technology helps optimize the use of biochar with a balanced design as an effective adsorbent for pesticide removal, promoting sustainable agricultural practices, minimizing environmental contamination, and supporting soil and water conservation efforts.

The application of biochar to remediate pesticides from agricultural runoffs offers a sustainable waste management and water treatment solution. The adoption of ensemble machine learning tools can help in the balanced design and use of biochar-based pesticide removal processes. This study demonstrates the potential of the CatBoost framework to derive insights from the biochar adsorption data and help identify the relevant attributes impacting the retrieval of pesticides from aqueous solutions. The CatBoost model reveals the dominance of textural characteristics, pesticide concentration, and biochar dose on the adsorptive capture of pesticides. As more pertinent data related to pesticide properties and selectivity parameters of biochar become available, the CatBoost technique, with its test R2 of ~0.96, can be extended into real-world treatment systems to formulate a more holistic model framework that will help researchers and agricultural scientists predict the potential of biochar to treat specific pesticides in water and soil systems. Moreover, the present research focuses on the role of ML in analyzing pesticide adsorption behavior; in the future, it will be intriguing to explore ML-assisted management models that address the post-adsorption environmental challenges related to desorbed pesticides and the management of pesticide-saturated biochars.

Methods

The sections below briefly discuss the pre-processing steps in the pesticide treatment data preparation using biochars, model development, and feature importance analysis.

Pesticide remediation dataset based on biochars

Compilation of data

A comprehensive collection of 96 academic research articles was compiled to investigate the adsorption of pesticides in biochar-mediated aqueous systems. The articles were gathered using reputable online search engines such as Google Scholar, Scopus Index, and Web of Science. Pesticides treated using biochar adsorption matrix methods are depicted using a tag cloud (Fig. 3). Supplementary Table 2 provides an elaborate tabular representation of feedstock biomass data, while Supplementary Fig. 2 offers a visual overview of the biochar dataset. A comprehensive and detailed description of the gathered data is presented in an Excel file, which can be accessed through a downloadable link provided in Supplementary Table 3. The file contains extensive information about the specific biochar utilized for pesticide treatment, research locations, publication years, and the dataset.

Fig. 3

Visualization of a cloud representing various pesticides adsorbed on biochar, with further details available in Supplementary Table 2.

The information about the PC adsorption capacity from the collected research articles was extracted from graphs using Plot Digitizer Software34, acquired from tables, or calculated using Eq. (1).

$${PC\; adsorption\; capacity}=\frac{\left({C}_{{PO}}-{C}_{{PB}}\right)\times V}{m}$$
(1)

where \({C}_{{PO}}\) is the initial PC concentration in the solution and \({C}_{{PB}}\) is the PC concentration of the pesticide solution after biochar treatment at time t. \(V\) designates the volume of the aqueous solution, and \(m\) denotes the mass of the biochar. The pesticide adsorption capacity was expressed in mg g−1. Concentrations in the PC-contaminated solutions, which decrease during biochar-mediated adsorption, were expressed in mg L−1. Domain knowledge was used to group the ten input attributes, i.e., the surface area of biochar (SA), pH of biochar (pH_BC), total pore volume (Vt), biochar dose (dose), cation exchange capacity (CEC), pH of the solution (pH), initial pesticide concentration (Co), treatment or contact time (CT) and temperature (T) into the empirical categories of the adsorbate-adsorbent system mentioned in the Introduction.
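
As a minimal sketch of the batch mass balance behind Eq. (1), the snippet below assumes the conventional form in which the removed pesticide mass, (C_PO − C_PB) × V, is divided by the biochar mass m, which gives a capacity in mg g−1 as stated above. The function name and the numerical values are hypothetical and are not taken from the compiled dataset.

```python
def adsorption_capacity(c_initial, c_final, volume_l, biochar_mass_g):
    """Return pesticide uptake per gram of biochar (mg/g).

    c_initial, c_final: pesticide concentration before/after treatment (mg/L)
    volume_l: volume of the aqueous solution (L)
    biochar_mass_g: mass of biochar added (g)
    """
    if biochar_mass_g <= 0:
        raise ValueError("biochar mass must be positive")
    # Mass of pesticide removed from solution (mg), normalized by biochar mass (g).
    return (c_initial - c_final) * volume_l / biochar_mass_g


# Hypothetical batch test: 50 mg/L falls to 10 mg/L in 0.1 L with 0.05 g of biochar.
q = adsorption_capacity(50.0, 10.0, 0.1, 0.05)
print(f"{q:.1f} mg/g")
```

In this hypothetical case, 4 mg of pesticide is removed by 0.05 g of biochar, i.e., 80 mg g−1.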

Imputation of missing data and assessment of correlations

The research studies did not report all the values associated with the ten sets of attributes. Missing-value imputation was applied only to features for which most of the data were available. In this work, a threshold of 50% was employed, meaning that attributes with a fraction of missing values greater than the threshold were discarded from the dataset. Thus, the cation exchange capacity (CEC) attribute was excluded from the dataset, and the median value was employed to impute missing values for the remaining features. The median was selected because it was the most representative of the overall data and did not introduce bias. Pearson’s correlation matrix, consisting of nine sets of features, is presented in Supplementary Fig. 3. Subsequently, the dataset matrix with dimensions of 878 × 9, encapsulating data variability as shown in Table 1, was employed to generate ensemble machine learning (ML) models.
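
The pre-processing described above can be sketched with pandas as follows. This is an illustration under stated assumptions rather than the authors' script: the helper name `drop_sparse_and_impute` and the toy records are hypothetical, and only the 50% threshold, the median imputation, and the Pearson correlation come from the text.

```python
import numpy as np
import pandas as pd


def drop_sparse_and_impute(df, max_missing=0.5):
    """Drop columns whose missing fraction exceeds max_missing, then median-impute."""
    missing_fraction = df.isna().mean()            # per-column share of NaN values
    kept = df.loc[:, missing_fraction <= max_missing]
    return kept.fillna(kept.median(numeric_only=True))


# Hypothetical toy records; column names follow the abbreviations used in the text.
raw = pd.DataFrame({
    "SA":  [120.0, np.nan, 310.0, 45.0],
    "Vt":  [0.10, 0.25, np.nan, 0.05],
    "CEC": [np.nan, np.nan, np.nan, 12.0],   # 75% missing, above the threshold
    "Co":  [10.0, 50.0, 100.0, 20.0],
})

clean = drop_sparse_and_impute(raw)
pearson = clean.corr(method="pearson")      # pairwise Pearson correlation matrix
print(clean)
print(pearson.round(2))
```

In the toy table, CEC is discarded because three of its four values are missing, and the gaps in SA and Vt are filled with the respective column medians.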

Table 1 Representation of data variability using the quartile range (QR).

Development and evaluation of models

Data division, scaling, and hyperparameter tuning

Predictive models were developed using three machine-learning techniques: CatBoost, LightGBM, and Random Forest (RF). Figure 4 showcases a flowchart that provides an overview of the technical aspects of the machine learning pipeline. This flowchart encompasses the steps involved in data preparation, model training, validation, and, ultimately, the prediction of pesticide adsorption. A sample space was established to perform ML modeling, comprising 878 data records encompassing nine attributes: SA, Vt, pH_BC, pH, CT, Dose, T, Co, and the desired output attribute, Pesticide Adsorption. Using a stratified random sampling algorithm, 703 samples were chosen as the training set and 175 samples were designated as the test set from this sample space. The grid search algorithm was used to identify the optimal hyperparameters in the training process for each specific model (CatBoost, LightGBM, and RF). The fundamentals of each ML model are described in Supplementary Methods in SI. To mitigate the risk of overfitting, we implemented a 5-fold cross-validation approach. This technique was employed to assess the predictive performance of our models on unseen data samples, thereby enhancing the reliability and generalizability of our results. Finally, the accuracy of the predictive models was measured using the coefficient of determination (R2) and the root-mean-squared error (RMSE). R2 and RMSE values were computed using Eqs. (2) and (3).

$${R}^{2}=1-\frac{\mathop{\sum }\limits_{i=1}^{N}{({y}_{m,i}-{y}_{e,i})}^{2}}{\mathop{\sum }\limits_{i=1}^{N}{({y}_{e,i}-{y}_{{m\; av}})}^{2}}$$
(2)
$${RMSE}=\sqrt{\frac{\mathop{\sum }\limits_{i=1}^{N}{\left({y}_{m,i}-{y}_{e,i}\right)}^{2}}{N}}$$
(3)

where \({y}_{m,i}\) is the output value predicted by the model, \({y}_{e,i}\) is the corresponding output value obtained experimentally, \({y}_{{m\; av}}\) is the average of all the outcomes predicted by the model, and N is the sample size of the training or testing subset. We used the Python programming language and relevant libraries to implement the models. The Scikit-Learn library31 was employed for performance analysis, while the conda packages of the CatBoost, LightGBM, and RF implementations were used for the model algorithms. The pyplot library was also used for exploratory data analysis and graph generation.
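
For illustration, the two metrics can be computed with NumPy as in the sketch below. The function name and the sample values are hypothetical. Note that the sketch centers the denominator of R2 on the mean of the experimental values, the convention followed by scikit-learn's `r2_score`; this is an assumption relative to the definition of \({y}_{{m\; av}}\) given above.

```python
import numpy as np


def r2_and_rmse(y_exp, y_pred):
    """Return (R2, RMSE) for experimental values y_exp and model predictions y_pred."""
    y_exp = np.asarray(y_exp, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual_ss = np.sum((y_pred - y_exp) ** 2)       # numerator of Eq. (2)
    total_ss = np.sum((y_exp - y_exp.mean()) ** 2)    # assumed: mean of observed values
    r2 = 1.0 - residual_ss / total_ss
    rmse = np.sqrt(residual_ss / y_exp.size)          # Eq. (3)
    return r2, rmse


# Hypothetical adsorption capacities (mg/g): measured vs. predicted.
r2, rmse = r2_and_rmse([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```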

Fig. 4: Machine learning workflow employed to assess the potential of a biochar-mediated aqueous system for pesticide treatment.

The first block (i) represents the compilation of pesticide adsorption data, i.e., biochar’s textural properties, water matrix parameters, experimental factors, and adsorption capacity, followed by data pre-processing for model development. The second block (ii) outlines the steps involved in the model training process, where a five-fold cross-validation methodology was implemented to address overfitting concerns. The third block (iii) pertains to model performance evaluation using the test data and the utilization of the best predictive model for feature exploration.
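
The data division and tuning steps of this workflow can be sketched with scikit-learn as shown below. Several elements are assumptions made only for illustration: the data are synthetic, the continuous target is stratified by binning it into quintiles (the text does not specify the stratification variable), the hyperparameter grid is hypothetical, and only the Random Forest regressor is shown in place of all three models.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
features = ["SA", "Vt", "pH_BC", "pH", "CT", "Dose", "T", "Co"]
X = pd.DataFrame(rng.random((878, len(features))), columns=features)
# Synthetic target for illustration only; not the compiled adsorption data.
y = 50 * X["SA"] + 30 * X["Co"] - 10 * X["Dose"] + rng.normal(0, 1, 878)

# Assumed stratification scheme: bin the continuous target into quintiles.
strata = pd.qcut(y, q=5, labels=False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=175, stratify=strata, random_state=0
)

# Hypothetical grid; each combination is scored by 5-fold cross-validated RMSE.
param_grid = {"n_estimators": [25, 50], "max_depth": [4, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated RMSE:", -search.best_score_)
print("test R2:", search.best_estimator_.score(X_test, y_test))
```

Holding out the 175 test records until after the grid search keeps the final R2 an estimate of performance on unseen data; in this sketch the cross-validation folds are drawn from the 703 training records only.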

Feature importance

The significance and impact of each attribute (feature) on the output attribute (pesticide adsorption capacity) were computed in two ways in Python. First, the built-in feature importance function provided by the CatBoost library was used to plot features in the order of their impact on model fit. The systematic relationship between the input adsorption attributes and the output was illustrated using partial dependence plots generated with CatBoost. A second feature importance criterion, SHAP (SHapley Additive exPlanations)35,36, was also employed to quantify the impact of the various attributes on the model predictions.
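
A minimal sketch of built-in feature ranking and a one-dimensional partial dependence curve is given below. It swaps in scikit-learn's `RandomForestRegressor` and `sklearn.inspection.partial_dependence` for the CatBoost and SHAP tooling named above, and it uses synthetic data in which SA dominates by construction; it therefore illustrates the technique rather than reproducing the reported analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

rng = np.random.default_rng(1)
names = ["SA", "Co", "pH_BC"]
X = rng.random((300, 3))
# Synthetic target: SA has the largest effect, Co a smaller one, pH_BC none.
y = 40 * X[:, 0] + 15 * X[:, 1] + rng.normal(0, 0.5, 300)

model = RandomForestRegressor(n_estimators=100, random_state=1).fit(X, y)

# Impurity-based importances, sorted from most to least influential feature.
ranking = sorted(zip(names, model.feature_importances_), key=lambda pair: -pair[1])
for name, score in ranking:
    print(f"{name}: {score:.3f}")

# 1D partial dependence of the prediction on SA, averaged over the other features.
pdp = partial_dependence(model, X, features=[0], grid_resolution=10, kind="average")
sa_curve = pdp["average"][0]
print(sa_curve.round(1))
```

The partial dependence curve averages the model prediction over the dataset while one feature is swept across a grid, which is why it can show the direction of an effect that an importance score alone does not.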