Introduction

A reason behind the very successful and ubiquitous applications of AI and machine learning is the rapid development of clever and more effective metaheuristic algorithms for optimization purposes1,2,3,4,5. One such class is the class of nature-inspired metaheuristic algorithms that includes genetic algorithm (GA), differential evolution (DE) and particle swarm optimization (PSO), among many others. Each of these algorithms has been widely and successfully tested for optimizing different types of complex objective functions across disciplines. The more popular and exemplary ones have many variants, which are modified versions or improvements of the original version. For example, a variant may converge faster, be less prone to premature convergence, or have a greater chance of extricating itself from a local optimum. The ready availability of codes in R, Matlab and Python to run metaheuristics greatly facilitates their use and popularity in practice. For example, the website https://pyswarms.readthedocs.io/en/latest/ houses a comprehensive set of PSO tools written in Python6. More recently,7 provided a high-level Python package for selecting machine learning algorithms and their parameters using PSO. Hybridized algorithms that creatively combine suitable algorithms, metaheuristic or not, can also markedly increase the performance of a metaheuristic algorithm; see details and applications in8,9,10.

There are many applications of nature-inspired metaheuristic algorithms across disciplines. For example, PSO, being an exemplary nature-inspired algorithm, is widely used to tackle problems due to COVID-1911,12,13,14. There are many monographs on nature-inspired metaheuristic algorithms at various levels; see, for example,15,16,17,18. Some are targeted to specific disciplines; for example, applications in building energy power and storage systems4, agriculture19, chemical engineering20, or for feature selection21 with numerous applications in finance and reinforcement learning, to name a few. Overview papers on metaheuristics are plentiful; see, for example,22,23,24. A recent paper that gives a comprehensive overview of metaheuristics is25.

The motivation for our work comes from our observation that nature-inspired metaheuristic algorithms are very under-utilized in research in the statistical and life sciences. The aim of this paper is to demonstrate the usefulness of such algorithms for solving very different types of optimization problems in the statistical and life sciences. As an example, we consider a recently proposed metaheuristic algorithm called competitive swarm optimizer with mutated agents (CSO-MA) by one of the coauthors26 and demonstrate its utility for solving different types of optimization problems in bioinformatics, psychology, ecology, biostatistics and also in the manufacturing industry.

Nature-inspired metaheuristic algorithms

Nature-inspired metaheuristic algorithms have emerged as a dominant component in the field of optimization27,28. These algorithms have gained significant popularity in solving real, high-dimensional, and complex optimization problems. They have found widespread application in engineering, computer science, and various other disciplines to address challenging optimization problems22,29. Despite their versatility, these algorithms are under-utilized in some disciplines. One of their key strengths is the availability of widely accessible free codes for users to implement them. In addition, they are fast, assumption-free, and serve as general-purpose optimization algorithms. While they do not guarantee the discovery of an optimal solution, they often yield optimal or near-optimal solutions in a timely manner. Recent studies have demonstrated the ability of swarm-based algorithms to effectively search for previously elusive optimal designs that require solving a 3-layer optimization problem30. In the next subsection, we briefly discuss competitive swarm optimization (CSO) and one of its variants.

Competitive swarm optimizer

Competitive swarm optimizer (CSO) is a swarm-based algorithm proposed by31 and has proven its effectiveness for solving different types of optimization problems with various dimensions. For example,32 applied CSO to select variables for high-dimensional classification models, and33 used CSO to study a power system economic dispatch problem, which is typically a complex, nonlinear, multivariable, strongly coupled optimization problem with equality and inequality constraints.

CSO minimizes a given continuous function \(f(\textbf{x})\) over a user-specified compact space \(\varvec{\Omega }\) by first randomly generating a set of candidate solutions. They take the form of a swarm of n particles at positions \(\textbf{x}_1, \;\cdots , \;\textbf{x}_n\), along with their corresponding random velocities \(\textbf{v}_1, \;\cdots , \;\textbf{v}_n\). For tackling design problems, each particle is a candidate design and upon convergence, the solution is the optimal design.

After the initial swarm is generated, at each iteration we randomly divide the swarm into \(\left\lfloor \frac{n}{2} \right\rfloor \) pairs and compare their objective function values. At iteration t, we identify \(\textbf{x}^t_i\) as the winner and \(\textbf{x}^t_j\) as the loser if \(f(\textbf{x}^t_i) < f(\textbf{x}^t_j)\). The winner retains the status quo and the loser learns from the winner. The two defining equations for CSO are

$$\begin{aligned}{}&\textbf{v}^{t+1}_{j} = \textbf{R}_1 \otimes \textbf{v}^t_{j} + \textbf{R}_2 \otimes (\textbf{x}^t_{i} - \textbf{x}^t_{j}) +\phi \textbf{R}_3 \otimes (\bar{\textbf{x}}^t - \textbf{x}^t_{j}) \end{aligned}$$
(1)
$$\begin{aligned}{}&\text {and}\,\textbf{x}^{t+1}_{j} = \textbf{x}^t_{j} + \textbf{v}^{t+1}_{j}, \end{aligned}$$
(2)

where \(\textbf{R}_1, \;\textbf{R}_2, \;\textbf{R}_3\) are all random vectors whose elements are drawn from U(0, 1). The operation \(\otimes \) represents element-wise multiplication and the vector \(\bar{\textbf{x}}^t\) is the swarm center at iteration t. The social factor \(\phi \) controls the influence of the neighboring particles on the loser; a large value is helpful for enhancing swarm diversity (but possibly impacts the convergence rate). This process iterates until a pre-specified stopping criterion or criteria are met.
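To make the update concrete, below is a minimal Python sketch of Eqs. (1)-(2) for a single winner-loser pair; the function name, the argument layout and the clipping of the loser back into the search region are our own illustrative choices rather than part of the original algorithm specification.

```python
import numpy as np

def cso_pair_update(x_winner, x_loser, v_loser, x_mean, lower, upper, phi=0.3):
    """One winner-loser update following Eqs. (1)-(2).

    The winner is left unchanged; the loser's velocity is refreshed with
    random vectors R1, R2, R3 ~ U(0,1) elementwise and the loser moves
    toward the winner and the swarm centre x_mean."""
    d = x_loser.size
    r1, r2, r3 = np.random.rand(d), np.random.rand(d), np.random.rand(d)
    v_new = r1 * v_loser + r2 * (x_winner - x_loser) + phi * r3 * (x_mean - x_loser)
    x_new = np.clip(x_loser + v_new, lower, upper)   # keep the particle in the search region
    return x_new, v_new
```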

Simulation results have repeatedly shown that CSO either outperforms or is competitive with several state-of-the-art evolutionary and swarm-based algorithms, including several enhanced versions of PSO. This conclusion was reached after comparing CSO's performance with state-of-the-art EAs on a variety of benchmark functions with dimensions up to 5000, where CSO was frequently the fastest and produced the best quality results31,34,35,36.

Competitive swarm optimizer with mutated agents

Zhang et al. (2017)37 proposed an improvement on CSO and called the enhanced version competitive swarm optimizer with mutated agents (CSO-MA). After pairing up the swarm in groups of two at each iteration, the variant randomly chooses a loser particle p as an agent, randomly picks a variable indexed by q and then randomly changes the value of \(\textbf{x}_{pq}\) to either \(\textbf{xmax}_{q}\) or \(\textbf{xmin}_q\), where \(\textbf{xmax}_q\) and \(\textbf{xmin}_q\) represent, respectively, the upper and lower bounds of the q-th variable. If the current optimal value is already close to the global optimum, this change does not hurt because the mutation is applied to a loser particle, which is not leading the movement of the whole swarm; otherwise, the chosen agent restarts its journey from the boundary and has a chance to escape from a local optimum. Figure 1 shows the flowchart of CSO-MA. The mutation step (the box in purple) is a key feature of CSO-MA that differentiates it from the standard CSO. The mutation is intended to increase the diversity of the solutions and prevent premature convergence to a local optimum by allowing particles to explore more distant regions of the search space; see37 for details.
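A minimal sketch of this mutation step is given below; the function name and the use of NumPy's default random generator are our own choices, and only one coordinate of one loser particle is reset per call, consistent with the description above.

```python
import numpy as np

def mutate_agent(x_loser, lower, upper, rng=None):
    """CSO-MA mutation sketch: pick one coordinate q of a loser particle at
    random and reset it to either the lower or the upper bound of that
    coordinate, giving the particle a chance to escape a local optimum."""
    rng = np.random.default_rng() if rng is None else rng
    q = rng.integers(x_loser.size)                 # randomly chosen coordinate
    x_new = x_loser.copy()
    x_new[q] = upper[q] if rng.random() < 0.5 else lower[q]
    return x_new
```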

Let n be the swarm size and let D be the dimension of the problem. The computational complexity of CSO is \(\mathcal {O}(nD)\) and, since our modification only adds one coordinate mutation operation, the computational complexity of CSO-MA is the same as that of CSO. The improved performance of CSO-MA over CSO on many complex multi-dimensional benchmark functions has been validated26. In the next section, we apply CSO-MA to different estimation problems and show that it can produce better quality solutions than conventional methods.

All computations were performed on a MacBook Pro (16-inch, 2021) with an Apple M1 Max chip and 64GB of memory. The operating system was macOS Sonoma version 14.1.1 and the programming languages were Matlab 2023a and Python 3.9.13. Throughout, the hyper-parameter of CSO-MA was set to \(\phi =0.3\) and the rest of its parameters and those for CSO and PSO are all set to default values. The codes for all the computations are available from the first author by request.

Figure 1. Flowchart of CSO-MA.

In response to a referee’s comment, we further show that CSO-MA is competitive with recently proposed metaheuristic algorithms. As noted in38, there is a continuing plethora of new or slightly modified algorithms proposed as nature-inspired metaheuristics, and it is desirable to limit their number unless they are competitive. To this end, we further compare the performance of CSO-MA with PSO and CSO using 3 additional randomly selected CEC static benchmark functions not used for comparison in26 and described in39. These three functions have different mathematical and geometric properties: function \(f_9\) is the Weierstrass function (separable), \(f_{10}\) is the Quartic function and \(f_{11}\) is the Ackley function (non-separable). All three functions have a global minimum of 0; the optimum is attained at \(\textbf{0}\) for \(f_9\) and \(f_{11}\), and at \(\textbf{1}\) for \(f_{10}\). We also tested the 3 algorithms for their ability to optimize a sphere function \(f_{12}\), which is a 2022 CEC dynamic benchmark function. This function was selected at random from the list and is much harder to optimize because it comes from a dynamic optimization problem40,41. The dimensions of the four functions selected for the additional comparison were \(D = 100\) and \(D=500\).
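For readers unfamiliar with these benchmarks, a common (unshifted, unrotated) form of the Ackley and sphere functions is sketched below in Python; the exact CEC versions apply shifts, rotations and, for the dynamic suite, time-varying transformations, so this is only an illustration of the basic shapes of the functions.

```python
import numpy as np

def ackley(x):
    """Basic Ackley function; global minimum 0 at the origin."""
    x = np.asarray(x, dtype=float)
    return (-20.0 * np.exp(-0.2 * np.sqrt(np.mean(x ** 2)))
            - np.exp(np.mean(np.cos(2.0 * np.pi * x))) + 20.0 + np.e)

def sphere(x):
    """Basic sphere function; global minimum 0 at the origin."""
    x = np.asarray(x, dtype=float)
    return float(np.sum(x ** 2))
```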

Table 1 displays the comparison results after 30 repeated runs. We observe from the table that CSO-MA found the smallest mean values of the optimum when compared with CSO or PSO for \(f_{9},f_{11},f_{12}\), but not for \(f_{10}\). To test whether there is a significant difference in the medians of the optimal values found by CSO and PSO compared with those from CSO-MA, we applied Wilcoxon's non-parametric test. Table 2 reports the p-values and suggests that CSO-MA tends to perform more similarly to CSO than to PSO in low dimensional optimization problems and that CSO-MA outperforms PSO significantly for the two dimensions tested.

Table 1 Performances of the three algorithms for minimizing 3 CEC2008 benchmark static functions (\(f_{9},f_{10},f_{11}\)) and 1 CEC2020 benchmark dynamic function with multiple optima (\(f_{12}\)).
Table 2 The p values of Wilcoxon's tests for comparing differences in the medians of the optimized values found by the three algorithms for the functions in Table 1. They show that CSO-MA outperforms PSO and CSO in 2 of the 3 CEC2008 benchmark static functions and in the CEC2022 benchmark dynamic function.

Estimation problems

Metaheuristics have been used to find estimates for model parameters, and there is work showing that they can outperform statistical packages or find estimates when the latter fail to do so. For example,42 showed that PSO can find better L1-estimates for some models than those from statistical packages. In what follows, we demonstrate that CSO-MA can find better maximum likelihood estimates and is also able to find them when some statistical packages cannot. Our applications include finding maximum likelihood estimates for models in bioinformatics and education research, and M-estimates for a Cox regression in a Markov renewal model.

Single-cell generalized trend model (scGTM)

Cui et al. (2022)43 proposed a model called scGTM to study the relationship between pseudotime44 and gene expression data. The model assumes that the gene expression has a ‘hill’ trend along the pseudotime and can be modeled using a set of interpretable parameters. Below is a brief description of the model; we show that CSO-MA outperforms the PSO algorithm for all but one gene in terms of finding the optimal value of the negative log-likelihood function; details are in43.

For a hill-shaped gene, the scGTM parameters are \(\Theta =(\mu _{\text {mag}}, k_1,k_2,t_0,\phi ,\alpha ,\beta )^T\) and they are estimated from the observed expression counts \(\varvec{y} = (y_1, \ldots , y_C)^T\) and cell pseudotimes \(\varvec{t} = (t_1, \ldots , t_C)^T\) using the constrained maximum likelihood method. Here C is the number of cells and the interpretations of the parameters in the model are given in Section 2.1 of43. If \(\log L(\Theta \mid \varvec{y}, \varvec{t})\) is the log likelihood function, the optimization problem is:

$$\begin{aligned}{} & {} \max _{\Theta } \log L(\Theta \mid \varvec{y}, \varvec{t}) \end{aligned}$$
(3)

such that

$$\begin{aligned}{}&\min _{c \in \{1,\ldots ,C\}}\log (y_c+1)\le \mu _{\text {mag}}\le \max _{c \in \{1,\ldots ,C\}}\log (y_c+1)\,,\nonumber \\&k_1,k_2\ge 0\,,\; \min _{c \in \{1,\ldots ,C\}} t_c\le t_0\le \max _{c \in \{1,\ldots ,C\}} t_c\,,\; \phi \in \mathbb {Z}_+\,, \end{aligned}$$
(4)

where

$$\begin{aligned}{}&\log L(\Theta \mid \varvec{y}, \varvec{t})=\log \left[ \prod _{c=1}^C\textbf{P}(Y_{c}=y_{c} \mid t_c)\right] \nonumber \\&\quad =\sum _{c=1}^C \log \Big [(1-p_c)f(y_c|t_c) + p_c \; \mathbb {I}(y_c=0)\Big ] \end{aligned}$$
(5)

and

$$\begin{aligned}{}&f(y_c|t_c)=\frac{\tau _c^{y_c}}{y_c!} \frac{\Gamma (\phi +y_c)}{\Gamma (\phi )(\phi +\tau _c)^{y_c}} \frac{1}{\left( 1+\frac{\tau _c}{\phi }\right) ^\phi }\,,\\&\log (\tau _{c}+1)={\left\{ \begin{array}{ll} b + \mu _{\text {mag}}\exp {(-k_1(t_c-t_0)^2)} &{}\text { if }t_c \le t_0\\ b + \mu _{\text {mag}}\exp {(-k_2(t_c-t_0)^2)} &{}\text { if }t_c > t_0 \end{array}\right. }\,,\\&\log \left( \frac{p_c}{1-p_c}\right) =\alpha \log (\tau _c+1)+\beta \,, \end{aligned}$$

which are all functions of \(\Theta \). There are two difficulties in the optimization problem (3). First, the likelihood function (5) is neither convex nor concave. Second, the constraint is linear in \(\mu _{\text {mag}}\), \(k_1\), \(k_2\), and \(t_0\), but \(\phi \) is a positive integer-valued variable. Hence, conventional optimization algorithms, like P-IRLS in GAM45,46 and L-BFGS in switchDE47, are unlikely to work well. The authors proposed PSO to solve for the constrained MLEs and a Python package is available online. We now apply CSO-MA to the same problem and compare results with those from the Python package. In addition, we compared CSO-MA's performance with results from two recently proposed metaheuristic algorithms: the prairie dog optimization algorithm (PDO) proposed by48 and the Runge-Kutta optimization (RUN) algorithm proposed by49. Table 3 displays the negative log likelihood function values found by CSO-MA, PSO and PDO for the 20 exemplary genes in50 after 1000 function evaluations of Eq. (5), and it shows that CSO-MA outperformed PSO and PDO in all but three of the 20 genes. The Wilcoxon test of CSO-MA against the other two algorithms produced p-values less than 0.001 (0.00077 for PSO and 0.00026 for PDO), suggesting that CSO-MA indeed outperformed PSO and PDO in this example.
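For concreteness, the sketch below codes the objective in Eq. (5) in Python; the function name, the treatment of the baseline term b as a fixed constant, and the numerical safeguards are our own assumptions, and the authors' released Python package should be taken as the reference implementation.

```python
import numpy as np
from scipy.special import gammaln, expit

def scgtm_nll(theta, y, t, b=0.0):
    """Negative log-likelihood of Eq. (5) for a hill-shaped gene.
    theta = (mu_mag, k1, k2, t0, phi, alpha, beta); y are counts, t pseudotimes."""
    mu_mag, k1, k2, t0, phi, alpha, beta = theta
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    k = np.where(t <= t0, k1, k2)
    log_tau1 = b + mu_mag * np.exp(-k * (t - t0) ** 2)   # log(tau_c + 1)
    tau = np.clip(np.exp(log_tau1) - 1.0, 1e-12, None)
    p_zero = expit(alpha * log_tau1 + beta)              # zero-inflation probability
    # log of the count component f(y_c | t_c) in Eq. (5)
    log_f = (y * np.log(tau) - gammaln(y + 1.0)
             + gammaln(phi + y) - gammaln(phi) - y * np.log(phi + tau)
             - phi * np.log1p(tau / phi))
    lik = (1.0 - p_zero) * np.exp(log_f) + p_zero * (y == 0)
    return -np.sum(np.log(np.clip(lik, 1e-300, None)))
```

A metaheuristic such as CSO-MA then minimizes this function over the constrained region in (4), treating \(\phi \) as an integer-valued coordinate.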

Table 3 Optimized negative log likelihood (NLL) values (multiplied by \(10^5\)) obtained by CSO-MA, PSO and PDO after 1000 function evaluations. The lowest NLL value among the three algorithms is in bold for each gene, and overall the results suggest that CSO-MA outperforms PSO and PDO in almost all cases.

Figure 2 displays the fitted PAEP gene given by CSO-MA, PSO and PDO. We observe that CSO-MA captures the “fast decreasing trend” when \(t\ge 0.8\) better than PSO does, and it reaches a higher peak than PDO does. Figures for other genes show a consistent pattern.

Figure 2. Comparison of CSO-MA, PDO and PSO results for the fitted scGTM with gene PAEP.

Parameter estimation for a Rasch model

The Rasch model is one of the most widely used item response models in education and psychology research51. Estimating the parameters in the Rasch and other item response models can be challenging, and there is continuing interest in estimating them using different methods and in studying the various computational issues. For example,52,53 reported that there are at least 27 R packages indexed with the word “Rasch” and 11 packages capable of estimating parameters and performing analyses for the Rasch model.

The expectation-maximization (EM) algorithm is a common method for parameter estimation in statistics54,55,56. The Bock-Aitkin algorithm is a variant of the EM algorithm and is one of the most popular algorithms for estimating parameters in the Rasch model57. Because the Rasch model also has many extensions with applications in agriculture, health care studies and marketing research58,59,60, this subsection compares, for the first time, how metaheuristic algorithms perform relative to Bock-Aitkin's method.

We give a brief review of the Rasch model before we compare the estimation results given by CSO-MA, Bock-Aitkin's method (in the R package ltm) and two other metaheuristic algorithms, CA and PSO, in terms of their likelihood values. In a Rasch model, we work with \(N \times I\) binary item response data, where 1 indicates a correct and 0 an incorrect response. The data come from a cognitive assessment (e.g., math or reading) that includes I test items. A group of N students responded to the I items, and their binary answers to each of the I items were scored and analyzed51. The Rasch model is given by:

$$\begin{aligned}{} & {} \text{ logit } \big ( \textbf{P} \big (Y_{ji} =1 | \theta _j \big ) \big ) = \theta _j - \beta _i , \; \; \theta _j \sim N(0, \sigma ^2). \end{aligned}$$
(6)

The item parameter \(\beta _i\) represents the difficulty of item i and parameter \(\theta _j\) represents the ability of person j. We assume that \(\theta _j \sim N(0, \sigma ^2)\). This model is called the one-parameter model because it considers one type of item characteristic (difficulty). Let \(p_{ji}=\textbf{P} \big (Y_{ji} =1 | \theta _j\big )\) and write the marginal likelihood function for model (6) as

$$\begin{aligned}{}&L({\Theta }) = \prod _{j=1}^N \int \prod _{i=1}^I p_{ji}^{Y_{ji}} (1-p_{ji})^{1-Y_{ji}} \pi (\theta ) d\theta , \end{aligned}$$
(7)

where \(\Theta = \big (\beta _1, \cdots , \beta _I, \sigma ^2 \big )^T \) and \(\pi (\theta )\) is the prior of \(\theta \).
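To make Eq. (7) computable, the integral over the ability distribution can be approximated by Gauss-Hermite quadrature. The sketch below is a minimal Python version of the resulting marginal negative log-likelihood; the function name, the number of quadrature nodes and the log-scale parameterization of \(\sigma \) are our own illustrative choices.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.special import expit

def rasch_marginal_nll(params, Y, n_quad=21):
    """Marginal negative log-likelihood of the Rasch model in Eq. (7).
    params = (beta_1, ..., beta_I, log sigma); Y is the N x I 0/1 response matrix."""
    Y = np.asarray(Y, dtype=float)
    N, I = Y.shape
    beta, sigma = params[:I], np.exp(params[I])
    z, w = hermgauss(n_quad)                     # nodes/weights for int exp(-z^2) g(z) dz
    theta = np.sqrt(2.0) * sigma * z             # abilities at the quadrature nodes
    p = expit(theta[:, None] - beta[None, :])    # node-by-item success probabilities
    # log-likelihood of each person's response pattern at each node
    loglik_nodes = Y @ np.log(p).T + (1.0 - Y) @ np.log(1.0 - p).T
    weighted = loglik_nodes + np.log(w / np.sqrt(np.pi))[None, :]
    m = weighted.max(axis=1, keepdims=True)      # log-sum-exp over the nodes
    return -np.sum(m.squeeze() + np.log(np.exp(weighted - m).sum(axis=1)))
```

Any of the metaheuristics compared below can be applied directly to this scalar objective.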

Metaheuristics have been shown to provide superior performance over standard statistical methods. For instance,61 tackled the challenge of deriving the maximum likelihood estimates for parameters in a mixture of two Weibull distributions with complete and multiply censored data. Their simulation outcomes indicated that particle swarm optimization (PSO) frequently outperformed the quasi-Newton method and the EM algorithm in terms of bias and root mean square errors.

In this study, we present similar results and show that the competitive swarm optimizer with mutated agents (CSO-MA) can also give more precise maximum likelihood estimates than three of its competitors: PSO, Bock-Aitkin's method, and the Cat Swarm Algorithm (CA). PSO is an exemplary and widely used nature-inspired swarm-based algorithm; CA was introduced by62, and its effectiveness as an optimizer for a single objective function was demonstrated in63, which showed its competitive edge against several contemporary top-performing algorithms.

We employed the “Verbal Aggression” data set from the R Archive64 and let NLL denote the minimized value of the negative log-likelihood function. Table 4 displays the NLLs from the 4 algorithms, where a swarm size of 30 was used for the 3 metaheuristic algorithms. The hyper-parameter for CSO-MA was set to \(\phi =0.3\), and the hyper-parameters for PSO and CA were set to the default values in the R package metaheuristicOpt65. Evidently, CSO-MA has the smallest NLL value and is the winner. The estimated NLL values from CSO-MA, PSO, and Bock-Aitkin are similar, but that from CA is not, suggesting that CA is less reliable since its estimates (gold points and lines in the left panel of Figure 3) do not come close to the others.

Table 4 Negative log likelihood values from the four algorithms, with CSO-MA outperforming the other three algorithms.

Figure 3 presents a two-panel visualization. The left panel illustrates the estimated parameters derived from the four algorithms: CSO-MA, Bock-Aitkin, PSO, and CA. Here, the x-axis represents all 24 parameters (encompassing 23 items in addition to the variance parameter) in the model, while the y-axis depicts their estimated values. The right panel shows the progression trajectories of the negative log-likelihood functions of the four algorithms, spanning about 100 function evaluations. The left panel shows that, except for the CA algorithm, Bock-Aitkin, PSO and CSO-MA give similar parameter estimates; the right panel shows that Bock-Aitkin converges fastest in terms of the number of function evaluations while PSO is the slowest. However, CSO-MA has the smallest negative log-likelihood value, or equivalently, the largest log-likelihood value.

Figure 3. The left panel shows estimated parameters from the four algorithms: CSO-MA, Bock-Aitkin, PSO and CA. The x-axis refers to all 24 parameters (23 items plus the variance parameter) in the model and the y-axis refers to the estimated parameter values. The right panel shows the trajectories of the negative log likelihood functions of the four algorithms as they evaluate the negative log-likelihood function about 100 times.

M-estimation for Cox regression in a Markov renewal model

In this subsection, we show that CSO-MA can solve estimating equations and produce M-estimates for model parameters that are sometimes more efficient than those from statistical packages. Askin et al. (2017)66 correctly noted that metaheuristics are rarely used to solve estimating equations in the statistical community.

In a survival study, the experience of a patient may be modelled as a process with finitely many states67, and modelling is based on transition probabilities among different states. We take bone marrow transplantation (BMT) as an example. BMT is a primary treatment for leukemia but has major complications, notably Graft-Versus-Host Disease (GVHD), in which the transplanted marrow's immune cells react against the recipient's cells in two forms: Acute (AGVHD) and Chronic (CGVHD). The main treatment failure is death in remission, often seen in patients with AGVHD or both GVHD types, occurring unpredictably before relapse. The term “death in remission” refers to the death of a patient who is in remission from leukemia: the patient has achieved remission, with no detectable leukemia cells in the body, but died from other causes not directly related to the active progression of leukemia. However, both AGVHD and CGVHD reduce leukemia relapse risks. Hence, we adopt a five-state model: transplant (TX), AGVHD, and CGVHD are transient states, while relapse and death in remission are absorbing states68. Figure 4 shows the possible transitions among the different states (i.e., TX, AGVHD, CGVHD, Relapse and Death in Remission).

Figure 4. A five-state Markov renewal model for BMT failure. Reproduced from68. TX = Transplant, AGVHD = Acute Graft-Versus-Host Disease, CGVHD = Chronic Graft-Versus-Host Disease, Relapse = Relapse of leukemia, Death in Remission = Death of a patient who is in remission from leukemia.

To model such a process in a mathematically rigorous way, we assume the observations on each individual form a Markov renewal process with a finite state space, say \(\{1,2,\cdots , r\}\)69. That is, we observe a process \((X, T)=\{(X_n, T_n):n\ge 0\}\), where \(0=T_0<T_1<T_2<\cdots \) are the calendar times of entrances into the states \(X_0, X_1, \cdots , X_n\in \{1,2,\cdots ,r\}\) (for simplicity, we do not consider censoring in this subsection). In the BMT example, \(r=5\), \(X_n\) takes values in \(\{\)TX, AGVHD, CGVHD, Relapse, Death in Remission\(\}\) and \(W_n=T_n-T_{n-1}\) represents the sojourn time between successive states. We also observe a covariate matrix \(\textbf{Z}=\{\textbf{Z}_{ij}:i,j=1,2,\cdots ,r\}\) where each \(\textbf{Z}_{ij}\) is itself a vector. In practice, we assume that the sojourn time \(W_n\) given \(X_{n-1}=i\) and \(\textbf{Z}\) has survival probability70

$$\begin{aligned} \textbf{P}(W_n>x|X_{n-1}=i, \textbf{Z})=\exp \left( -\sum _{k=1,k\not =i}^rA_{0,ik}(x)e^{\beta ^TZ_{ik}}\right) \end{aligned}$$

and the transition probability is (\(i\not =j\))

$$\begin{aligned} \textbf{P}(X_{n}=j|X_{n-1}=i, W_n)=\frac{\alpha _{0,ij}(W_n)e^{\beta ^TZ_{ij}}}{\sum _{k\not =i}\alpha _{0,ik}(W_n)e^{\beta ^TZ_{ik}}}, \end{aligned}$$

where \(\beta \) is the parameter of interest, \(A_{0,ik}(x)=\int _0^x\alpha _{0,ik}(s)ds\) is the baseline cumulative hazard from state i to state k and \(\alpha _{0, ik}(x)\) is the hazard function from state i to state k71. Suppose we observe M iid individuals and suppose the risk process for an individual is given by \(Y_{i}(x)=\sum _{n\ge 1}\mathbb {I}(W_n\ge x, X_{n-1}=i)\). For a fixed x, \(Y_i(x)\) counts the number of visits to state i with sojourn time more than x for a particular individual. In the five-state model in Figure 4, since we cannot revisit the states that we have already exited, \(Y_i(x)\) is a binary variable. Then from68,72,73, the estimating equation for \(\beta \) is

$$\begin{aligned}{}&\textbf{U}(\beta )=\sum _{m=1}^M\sum _{i\not =j}^r\int _0^\infty \left[ \textbf{Z}_{ijm}-\frac{S_{ij}^{(1)}(x,\beta )}{S_{ij}^{(0)}(x,\beta )}\right] dN_{ijm}(x). \end{aligned}$$
(8)

Here \(N_{ijm}(x)=\sum _{n\ge 1}\mathbb {I}(T_n\le x, X_n=j, X_{n-1}=i)\), \(S_{ij}^{(0)}(x,\beta )=\frac{1}{M}\sum _{m=1}^MY_{im}(x)e^{\beta ^TZ_{ijm}}\) and \(S_{ij}^{(1)}(x,\beta )\) is the first partial derivative of \(S_{ij}^{(0)}\) with respect to \(\beta \). The M-estimates of \(\beta \) are obtained by solving \(\textbf{U}(\beta )=0\). To apply CSO-MA to obtain the estimates, we turn the problem of solving \(\textbf{U}(\beta )=0\) into a minimization problem as follows:

$$\begin{aligned}{}&\widehat{\beta }=\arg \min _\beta \Vert \textbf{U}(\beta )\Vert _p \end{aligned}$$
(9)

where \(p\in [1,\infty ]\) is a user-selected constant. If a solution to \(\textbf{U}(\beta )=0\) exists, then \(\min \Vert \textbf{U}(\beta )\Vert _p=0\) for any \(p\ge 1\). Metaheuristics have been used creatively to solve systems of nonlinear equations74,75; results from our simulation study suggest that the choice of p affects neither the convergence speed of CSO-MA nor the estimated parameters.
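The reformulation in Eq. (9) is straightforward to code: any routine that evaluates \(\textbf{U}(\beta )\) can be wrapped into a scalar objective for CSO-MA or another metaheuristic. A minimal sketch follows; the names norm_objective and U_score are our own, and U_score stands for any user implementation of Eq. (8).

```python
import numpy as np

def norm_objective(beta, U, p=2):
    """Return ||U(beta)||_p, the objective in Eq. (9), guarding against
    non-finite values that can arise for extreme trial parameters."""
    u = np.asarray(U(beta), dtype=float)
    return float(np.linalg.norm(u, ord=p)) if np.all(np.isfinite(u)) else np.inf

# usage sketch: pass lambda b: norm_objective(b, U_score, p=2) to the optimizer,
# where U_score(beta) evaluates the estimating equation in Eq. (8).
```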

For the simulation, we set \(p=2\) and assume \(r=3\), \(A_{0, ij}(x)=0.5x\) for all \(i\not =j\), the true parameter vector \(\beta =(0.901, 0.759, 0.348)^T\), and elements of the covariate matrix \(\textbf{Z}\) that are random uniform variates from \([-1, 1]\). In total, we generated \(M=100\) individuals and the left panel of Figure 5 shows one of the realizations. The swarm size for CSO-MA was 20 and we ran it for 100 function evaluations; the right panel of Figure 5 shows the convergence of CSO-MA. The estimated parameter is \(\widehat{\beta }=(0.908, 0.753, 0.329)^T\), which is close to the true value. The observed vector of biases \((0.007,0.006,0.017)^T\) is likely due to both the optimization algorithm and the method of partial likelihood itself. The first issue can be reduced by trying different initial values for CSO-MA, and the second issue may be resolved with a larger sample size so that consistency of the estimators is guaranteed theoretically. For space considerations, we omit additional simulation results that support the effectiveness of CSO-MA for estimating the true parameters correctly.

Figure 5. Application of CSO-MA to find M-estimates for a Cox regression in a Markov renewal model. The left panel is one of the realizations of 100 individuals; the red dots represent the jump times and the transitions for the pair \((X_n, T_n)\). The right panel shows the convergence trajectory of CSO-MA.

To further investigate the scalability of CSO-MA and compare it with other algorithms, we performed another simulation study where the state space of \(X_i\) consists of two states, i.e., \(\{1,2\}\), and 2 is an absorbing state. Consequently, the Markov renewal model is equivalent to a two-state Markov model, or a Cox proportional hazards model71. The sample size is \(M=10,000\) and the \(\beta \) parameter is the \(100\times 1\) vector with all entries equal to 1. The elements of the covariate matrix \(\textbf{Z}\) are again generated uniformly from \([-1,1]\) to mimic the high-dimensional scenario in statistical applications76. The simulation was performed on the Matlab 2023a platform. Instead of minimizing the norm of \(\textbf{U}(\beta )\), we minimize the negative partial log-likelihood (NLL)68. We compare CSO-MA with PDO and Runge-Kutta optimization (RUN), another recently proposed metaheuristic49, in terms of their optimum values, stability and running time. The results are given in Table 5. We ran each algorithm 30 times to get reasonable statistical results; the number of function evaluations was set to 1000 and the swarm size for each algorithm was set to 30. The results suggest that RUN performs best in terms of NLL and its stability; CSO-MA has the best performance in terms of average elapsed time and PDO is the slowest among the three algorithms.

Table 5 Negative log likelihood values from the three algorithms, with CSO-MA outperforming the other two recently proposed algorithms in running time.

Matrix completion (missing data imputation) in a two-compartment model

In real studies, such as clinical trials, missing or incomplete data are omnipresent. They occur in computer vision, clinical trials and genomics, just to name a few areas77. Missing data also appear frequently in recommendation (recommender) systems, which are decision making strategies for users under complex information environments78; see79 for an overview of this emerging area of research to alleviate the problem of information overload. The best strategy for dealing with missing data is to avoid having them in the first place. This would require constant monitoring of the data and filling in the missing data as soon as they are discovered. Despite the best efforts, missing data abound and pose problems in data analysis. Matrix completion is the task of filling in the missing entries of a partially observed matrix that represents the data structure. In many instances, the task is equivalent to performing data imputation in statistics. This leads to matrix completion problems, which occur across disciplines. Ensemble models have also been built based on matrix completion for computational drug repurposing to fight the virus SARS-CoV-280.

In this subsection, we apply CSO-MA to a missing data imputation problem in a non-linear Gaussian regression model using simulated data. Missing data are ubiquitous in all research fields. Imputation is one of the most common ways to fill in and analyze missing data81, and the expectation-maximization (EM) method54 is a popular choice for imputing multivariate normal data. We briefly describe the problem and the EM algorithm below.

Suppose that \((Y_1,Y_2)\in \mathbb {R}^2\) has a bivariate normal distribution with mean \(\mu (\theta )=(\mu _1(x,\theta ),\mu _2(x,\theta ))^T\) and a known covariance matrix \(\Sigma =\left( \begin{matrix} \sigma _1^2&{} \quad \rho \sigma _1\sigma _2\\ \rho \sigma _1\sigma _2&{} \quad \sigma _2^2 \end{matrix}\right) \), where \(\theta \) is a vector of parameters characterizing \(\mu \) and x is (possibly) a vector of covariates. We observe n realizations \(y_i=(y_{i1},y_{i2})^T, i=1,2,\cdots ,n\), and \(y_{ij}\) contains missing values for some i and j. Let \(Y_{(0)}\) and \(Y_{(1)}\) denote the observed and missing parts, respectively. Following pages 250-251 of Little and Rubin (2019)81, at the \((t+1)^{th}\) iteration, the E-step of the algorithm calculates

$$\begin{aligned} \textbf{E}\left( \sum _{i=1}^ny_{ij}\Big |Y_{(0)},\theta ^{(t)}\right) =\sum _{i=1}^ny_{ij}^{(t+1)} \end{aligned}$$

and

$$\begin{aligned} \textbf{E}\left( \sum _{i=1}^ny_{ij}y_{ik}\Big |Y_{(0)},\theta ^{(t)}\right) =\sum _{i=1}^n\left( y_{ij}^{(t+1)}y_{ik}^{(t+1)}+c_{jki}^{(t+1)}\right) \end{aligned}$$

for \(j,k=1,2,\cdots ,K\) where

$$\begin{aligned} y_{ij}^{(t+1)}={\left\{ \begin{array}{ll} y_{ij}&{} \quad \text { if }y_{ij}\in Y_{(0)}\\ \textbf{E}\left( y_{ij}\Big |Y_{(0)},\theta ^{(t)}\right) &{} \quad \text { if }y_{ij}\in Y_{(1)} \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} c_{jki}^{(t+1)}={\left\{ \begin{array}{ll} 0&{} \quad \text { if }y_{ij}\text { or }y_{ik}\text { is observed.}\\ Cov\left( y_{ij},y_{ik}\Big |Y_{(0)},\theta ^{(t)}\right) &{} \quad \text { if }y_{ij},y_{ik}\in Y_{(1)}. \end{array}\right. } \end{aligned}$$

After the E-step, missing values are replaced by the conditional expectation derived above. Next, for the M-step, we maximize the following conditional log-likelihood with respect to \(\theta \) using CSO-MA:

$$\begin{aligned}{} & {} \textbf{E}\left( l(\theta |Y_{(0)},Y_{(1)})\Big |Y_{(0)},\theta ^{(t)}\right) = - \frac{1}{2}\sum _{i=1}^n\left( \textbf{y}_i^{(t+1)}-\mu (x_i,\theta )\right) ^T\Sigma ^{-1}\left( \textbf{y}_i^{(t+1)}-\mu (x_i,\theta )\right) + C \end{aligned}$$
(10)

where \(\textbf{y}_i^{(t+1)}=(y_{i1}^{(t+1)},\ y_{i2}^{(t+1)})\) and C is a constant independent of \(\theta \). Section 8.6 in Wild and Seber (1989)82 (page 414) illustrates a two-compartment model with (see also chapter 7 in83)

$$\begin{aligned} y_{ij}=\mu _j(x_i,\theta )+\epsilon _{ij},i=1,2,\cdots ,n,\ j=1,2, \end{aligned}$$

where x refers to time, \((\epsilon _{i1},\epsilon _{i2})^T\) are independently drawn from \(N_2\left( 0,\Sigma \right) \), and the two means are

$$\begin{aligned}{} & {} \mu _1(x,\theta )=\theta _1e^{-\theta _2 x}+(1-\theta _1)e^{-\theta _3x},\ \mu _2(x,\theta )=1-(\theta _1+\theta _4)e^{-\theta _2x}+(\theta _1+\theta _4-1)e^{-\theta _3x}, \end{aligned}$$

where

$$\begin{aligned} \theta _4=\frac{(\theta _3-\theta _2)\theta _1(1-\theta _1)}{(\theta _3-\theta _2)\theta _1+\theta _2}. \end{aligned}$$

Suppose that at some time point x, the operator failed to record \(y_{i1}\), \(y_{i2}\) or both, so that we observe \(Y_{(0)}\) and \(Y_{(1)}\) (n observations in total). To make inference about \(\theta \), we still want to make use of the partially observed data. In this case, we apply the EM algorithm described above to maximize the conditional likelihood (10).
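A compact sketch of the resulting EM scheme is given below, assuming missing entries are coded as NaN and using the bivariate normal conditional means in the E-step; scipy's differential_evolution is used only as a stand-in optimizer for the M-step (the paper uses CSO-MA), and the function names are our own.

```python
import numpy as np
from scipy.optimize import differential_evolution  # stand-in for CSO-MA

def mu(x, th):
    """Two-compartment mean functions mu_1 and mu_2 defined above."""
    th1, th2, th3 = th
    th4 = (th3 - th2) * th1 * (1 - th1) / ((th3 - th2) * th1 + th2)
    m1 = th1 * np.exp(-th2 * x) + (1 - th1) * np.exp(-th3 * x)
    m2 = 1 - (th1 + th4) * np.exp(-th2 * x) + (th1 + th4 - 1) * np.exp(-th3 * x)
    return np.column_stack([m1, m2])

def em_impute(x, Y, Sigma, theta0, bounds, n_iter=10):
    """EM sketch: impute NaN entries of Y under the bivariate normal model,
    then update theta by maximizing Eq. (10) (here via a stand-in optimizer)."""
    theta = np.asarray(theta0, dtype=float)
    s1, s2 = np.sqrt(Sigma[0, 0]), np.sqrt(Sigma[1, 1])
    rho = Sigma[0, 1] / (s1 * s2)
    Sinv = np.linalg.inv(Sigma)
    miss1, miss2 = np.isnan(Y[:, 0]), np.isnan(Y[:, 1])
    Yc = Y.copy()
    for _ in range(n_iter):
        m = mu(x, theta)
        # E-step: conditional means given the observed component, or the
        # unconditional mean when both components are missing
        Yc[:, 0] = np.where(miss1 & ~miss2,
                            m[:, 0] + rho * s1 / s2 * (Y[:, 1] - m[:, 1]),
                            np.where(miss1, m[:, 0], Y[:, 0]))
        Yc[:, 1] = np.where(miss2 & ~miss1,
                            m[:, 1] + rho * s2 / s1 * (Y[:, 0] - m[:, 0]),
                            np.where(miss2, m[:, 1], Y[:, 1]))
        # M-step: minimize the quadratic form in Eq. (10)
        def nll(th):
            r = Yc - mu(x, th)
            return 0.5 * np.sum(np.einsum('ij,jk,ik->i', r, Sinv, r))
        theta = differential_evolution(nll, bounds, seed=0, tol=1e-8).x
    return theta, Yc
```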

We analyze a real data set to illustrate this idea. The data set comes from Beauchamp and Cornell (1966)84; see also Section 11.2 of Wild and Seber (1989)82. We randomly mask some of the values in the data in Table 6 to be missing and denote them by NA.

Table 6 The dataset from Beauchamp and Cornell (1966)84.
Table 7 The imputed dataset.

Using the complete observations, we estimated the covariance \(\Sigma \) to be \(\left( \begin{matrix} 0.075 &{} - 0.06\\ - 0.06 &{} 0.06 \end{matrix}\right) \); in the original paper, using the full data, the authors estimated the parameters to be \(\widehat{\theta }=(0.060, \ 0.007,\ 0.093)^T.\) For the EM algorithm, we set the initial \(\theta \) to be \((0.381,\ 0.021,\ 0.197)\) and ran CSO-MA for 200 iterations with \(\phi =0.3\). The whole algorithm alternates between computing expression (10) and applying CSO-MA to maximize (10). We ran 10 EM iterations in total and the imputed results are given in Table 7. We further performed a simulation study (not reported here) with sample size \(n=80\) and 40 missing values in total. The true parameter \(\theta \) is \((0.4, 0.05, 0.3)^T\) and the initial value for the EM algorithm is \((0.1, 0.1, 0.1)^T\). The algorithm terminates after 5 iterations, with the estimated parameter value \(\widehat{\theta }=(0.392, 0.056, 0.275)^T\). This shows that CSO-MA performs well in its role as an optimizer.

A variable selection problem in ecology

In addition to numerous applications of metaheuristics in engineering and computer science, metaheuristics have also found applications ranging from addressing sustainability issues85 to land use19 and agriculture58. See also86, who used metaheuristic algorithms to design placements of groundwater wells in the Los Angeles Basin.

In this subsection, we apply CSO-MA to a penalized linear regression problem in ecology. Model selection is essential in much of ecology because ecological systems are often too large and slow-moving for our hypotheses to be tested through manipulative experiments at the relevant temporal and spatial scales87.

The data come from a plateau lake in Yunnan, China, and were collected by a group of researchers at the Department of Environmental Engineering, Tsinghua University in 2019. They took water samples in March (Spring), June (Summer), September (Autumn) and December (Winter). At each time, 30 sites were sampled from different parts of the waterway. Due to weather issues at the plateau lake in June, data from 6 sites were not recorded. Therefore, the total number of samples is 114 (\(=30\times 4 -6\) records). The outcome variable is CRAP and the goal is to determine if and how 17 key variables affect the outcome. Table 8 lists all the regression variables and, for space considerations, we display in the same table only the first two sets of measurements from the \(114\times 18\) data matrix.

Table 8 Two samples of measurements for the regression variables in the model: Cyanobacteria relative abundance in Phytoplankton (CRAP), the sampling depth of water (Depth), Chlorophyll abundance (Chl-a), dissolved oxygen (DO), turbidity of water (Turbidity), potential of hydrogen (pH), Ammonium Nitrogen (NH4-N), Nitrate Nitrogen (NO3-N), total concentration of Nitrogen (TN), total Phosphorus (TP), total organic Carbon (TOC), total dissolved solids (TDS), water temperature (T), Calcium (Ca), Potassium (K), Magnesium (Mg), Sodium (Na) and Fluorine (F).

Cyanobacteria can form dense blooms and sometimes produce algal toxins. In extreme cases, a cyanobacterial bloom, with high cyanobacterial density or a high proportion of cyanobacteria in the phytoplankton, can threaten the aquatic ecosystem, fisheries and the safety of drinking water. Over the years, cyanobacterial blooms have increased in frequency, magnitude and duration globally88. A cyanobacterial bloom is influenced by the surrounding environment. To effectively control and prevent cyanobacterial blooms, one of the most important scientific questions is how other factors affect CRAP (Cyanobacteria relative abundance in Phytoplankton). High values of CRAP often indicate a cyanobacterial bloom. Therefore, if we can control the key factors that are associated with CRAP, we can improve environmental health dramatically.

Linear regression analysis is a default choice for detecting associations and outliers. We expect that many covariates are correlated. For example, NH4-N and NO3-N are highly correlated with TN. Thus, in reality, some measurements are more important than others to ecologists. In statistics, variable selection and penalized regression methods have been proposed to address this issue. In what follows, we use CSO-MA and a penalized regression method known as the smoothly clipped absolute deviation (SCAD)89 to select variables into the model.

Let y be the vector of CRAP responses in the linear model and let X be the covariate matrix containing the variables Depth through F, where each column of X is standardized by subtracting its mean and dividing by its standard deviation so that each column of X has mean 0 and standard deviation 1. This standardization step is crucial because we want to analyze the relative influence of the variables on CRAP, and different scales can cause confusion. All these variables are listed in Table 8. Let \(\beta \) be the vector of unknown parameters to be estimated by solving the following optimization problem:

$$\begin{aligned}{}&\min _\beta \ \Vert y-X\beta \Vert _2^2 + \rho \left( \sum _{j=1}^pP(\beta _j|\lambda ,a)\right) , \end{aligned}$$
(11)

where \(\rho \) is the regularization parameter, \(a,\lambda \) are tuning parameters and

$$\begin{aligned} P(\beta _j|\lambda ,a)= {\left\{ \begin{array}{ll} \lambda |\beta _j| &{}\text { if }|\beta _j|\le \lambda \\ \frac{2a\lambda |\beta _j|-\beta _j^2-\lambda ^2}{2(a-1)} &{}\text { if }\lambda <|\beta _j|\le a\lambda \\ \frac{\lambda ^2(a+1)}{2} &{}\text { if }|\beta _j|>a\lambda \end{array}\right. } \end{aligned}$$

is a non-convex function called the SCAD penalty. The parameter \(\rho \) controls the degree of shrinkage applied to the coefficients. A larger \(\rho \) increases the penalty on the coefficients, driving them toward zero, and thus helps prevent overfitting by enforcing sparsity in the model. We set \(a=2.5\) and \(\lambda =1\), apply SCAD regression to the data (X, y) for different choices of \(\rho \) (see formula (11)) and optimize it using the CSO-MA algorithm. We used 13 different values for \(\rho \), namely \(10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}\), 0.01, 0.025, 0.05, 0.1, 0.2, 0.5, 1, 10, 100. For each \(\rho \), we record the best particle position found by CSO-MA as our estimate of \(\beta \). The CSO-MA algorithm is initialized with 25 particles and iterates 100 times (i.e., 100 function evaluations). We run the algorithm 50 times for each \(\rho \) to analyze the stability of CSO-MA. For illustration purposes, we report the average and standard deviation of the 50 runs when \(\rho =0.025\) in Table 9; further, the average minimum of (11) when \(\rho =0.025\) is 0.315 with a standard deviation of 0.0009 (the other \(\rho \)'s have similar standard deviations and minimum values), suggesting the stability of the CSO-MA algorithm.
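For completeness, a Python sketch of the objective in Eq. (11) is given below; the function names are our own, and CSO-MA (or any other optimizer) can be applied directly to scad_objective for fixed \((\rho , \lambda , a)\).

```python
import numpy as np

def scad_penalty(beta, lam=1.0, a=2.5):
    """Coordinate-wise SCAD penalty P(beta_j | lambda, a)."""
    b = np.abs(beta)
    mid = (2 * a * lam * b - b ** 2 - lam ** 2) / (2 * (a - 1))
    tail = lam ** 2 * (a + 1) / 2
    return np.where(b <= lam, lam * b, np.where(b <= a * lam, mid, tail))

def scad_objective(beta, X, y, rho, lam=1.0, a=2.5):
    """Objective in Eq. (11): residual sum of squares plus the scaled penalty."""
    resid = y - X @ beta
    return float(resid @ resid + rho * np.sum(scad_penalty(beta, lam, a)))
```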

Table 9 Average and standard deviation of the parameter estimates over 50 runs.

Figure 6 illustrates the solution path of SCAD using the CSO-MA algorithm. The x-axis represents the scaled \(\rho \) values. As \(\rho \) decreases from 100, the estimated coefficient of turbidity is among the first to deviate from 0. This suggests that turbidity is one of the most important measurements associated with the level of CRAP. One possible reason for this phenomenon is that turbid water prevents light from penetrating, which in turn indicates a lower amount of algae carrying out photosynthesis. Further, temperature (T) is another variable whose coefficient deviates from 0 early. The reason is that the optimal temperature for algae growth is above \(20\,^{\circ }\)C, and the lower the temperature, the less active the metabolism of algae is. In addition, when \(\rho \) decreases from 0.05 to 0.01 (x from 7 to 5), the parameter estimates for chemical elements, such as K, Mg and Na, all deviate from 0, suggesting that the concentrations of these chemical elements have a somewhat weaker association with CRAP.

This subsection shows that CSO-MA can be usefully applied along with SCAD penalized regression to explore associations among different components of water quality and how they affect the outcome CRAP. The interpretation of the solution path is in line with scientific common sense.

Figure 6. Solution path of SCAD using CSO-MA. Each line represents the trajectory of an estimated coefficient for a predictor variable across the ordered values of the regularization parameter \( \rho \). The y-axis denotes the estimated coefficient values. The x-axis corresponds to the ordinal position of each \( \rho \) value in the set \( 10^{-6}, 10^{-5}, \ldots , 100 \), which has been rescaled to 1, 2, ... for clarity of presentation.

Design problems

Design problems are important because experimental costs are always increasing and a well designed study can provide maximum statistical inference precision at minimal cost. Given a design region, a regression model with several factors or independent variables, a design criterion and the total number of observations allowed, an optimal design problem involves finding the optimal number of design points, the optimal combination of factor levels and the proportions of observations to take at the design points. Sometimes the proportions are called weights (\(w_i\)) and they sum to unity. By working with weights and a convex design criterion, the optimal design problem can be formulated as a convex optimization problem, where theoretical tools are available to confirm whether a solution is optimal. In particular, an equivalence theorem, one for each convex criterion, can be derived using convex analysis results. A by-product is a design efficiency lower bound that assesses the proximity of a design to the theoretical optimum without knowing the latter. In general, design efficiency is a ratio between 0 and 1 of the criterion value of a design relative to that of the optimum, and designs with high efficiencies are sought. If a design has an efficiency of 0.5 or 50%, this means that the design has to be replicated twice to provide the same level of information as the optimal design; references90,91 provide the technical details.

There are algorithms in the statistics literature for finding optimal experimental designs, and some of them can be proven to converge mathematically to the optimum92,93. However, they may not work well in practice when the model is nonlinear and has several interacting factors. As some of the references below indicate, metaheuristics can outperform traditional algorithms or solve optimization problems that the latter cannot94. For example,30 solved a standardized maximin optimal design problem and95 found a minimax optimal design for a random effects hierarchical linear model. Both involved optimizing a non-differentiable objective function, and some of these problems require multiple nested layers of optimization.

Below is a new application that shows CSO-MA can find a locally D-optimal design for estimating all parameters in a logistic model with 10 factors and 3 pairwise interaction terms. D-optimal designs are popular because, when errors are normally distributed, they minimize the volume of the confidence ellipsoid of the parameters and hence the parameters are accurately estimated. Previous attempts using other metaheuristic algorithms to solve this design problem were less successful because of the large dimension of the optimization problem. For example,96 applied GA, PSO and CSO to find locally D-optimal designs for the Poisson model and the logistic model with 5 factors and all pairwise interaction terms, and97 used quantum PSO (QPSO) and modified the codes to also find locally D-optimal designs. The modified code d-QPSO found a locally D-optimal design for a 10-factor logistic model but interactions were not allowed. Likewise,98 applied differential evolution to find locally D-optimal designs for the same model with 5 pairwise interaction terms. However, optimality of their design could not be confirmed; instead, its proximity to the optimum was assessed using a D-efficiency lower bound90. The reported design has at least 95% D-efficiency, suggesting that it is close enough to the optimum (without knowing what the optimum is) and likely suffices for most practical purposes.

Car refueling experiment

The authors of99 described an experiment, based on the logistic model, for testing a vision-based car refueling system, where the question was whether a computer-controlled nozzle could insert itself into the gas pipe correctly97. The experiment includes four binary explanatory factors (\(x_1\sim x_4\), numerically taking -1 or 1): ring type (white paper or reflective), lighting (room lighting or 2 flood lights and room lights), sharpening (without or with), and smoothing (without or with); and six continuous factors (\(x_5\sim x_{10}\)): lighting angle (50 to 90 degrees), gas-cap angle 1 (30 to 55 degrees), gas-cap angle 2 (0 to 10 degrees), car distance (18 to 48 inches), reflective ring thickness (0.125 to 0.425 inches) and threshold step value (5 to 15). Experts' opinions suggest that the model should include five specific interaction terms: the pairwise interactions between ring type and reflective ring thickness, between lighting and lighting angle, and between smoothing and car distance, along with 2 three-factor interaction terms. To test CSO-MA's potential for finding a locally D-optimal design for a more realistic model, we include selected interaction terms, namely 3 two-factor interaction terms and 2 three-factor interaction terms. Table 10 lists all the terms in the model.

Table 10 All the model terms in the car refueling study.

The full model has 10 factors and 16 parameters. The authors of97 assumed a set of nominal parameter values and found a locally D-optimal design using a swarm-based algorithm called quantum-behaved PSO (d-QPSO). On average, the runtime for finding the optimal design for the additive linear part of the model without interaction terms was 140 seconds. They did not report the locally D-optimal design for the full model.

We applied CSO-MA to search for locally D-optimal designs for both models, with and without interaction terms. Since not all factors are likely to interact, we chose to include, as an example, 3 two-factor interaction terms and 2 three-factor interaction terms. We set \(k = 20\), which is the initial guess of the number of design points required by the optimal design. We set the number of particles in the algorithm to \(n = 200\) and the stopping criterion is whether the change in the fitness value is within the pre-specified tolerance of \(10^{-5}\). We ran the algorithm 10 times independently and, on average, CSO-MA took 24 seconds to find the same locally D-optimal design for the no-interaction model, which is significantly shorter than the time required by the d-QPSO algorithm employed in97. For the model with the interaction terms, CSO-MA found the locally D-optimal design shown in Table 11; it has 17 design points, with the corresponding weights in the last column, the criterion value is 7.256, and a direct calculation shows that its D-efficiency lower bound is 97%. It is not easy to confirm optimality of a design for a multi-factor model because it is difficult to visually appreciate the fine features in a high dimensional plot; see97,98. Additional numerical checks similar to those described in96 support that the design found by CSO-MA has all the required features in the equivalence theorem.
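To indicate how the design search is set up, the sketch below evaluates the log-determinant of the Fisher information matrix (the local D-criterion) of a candidate design for a logistic model; CSO-MA then searches over the design points and weights to maximize this quantity. The function name and argument layout are our own, and building the row vectors f(x_i) with the intercept, main effects and the chosen interaction terms is left to the user.

```python
import numpy as np
from scipy.special import expit

def log_det_info(F, w, theta):
    """Local D-criterion for a logistic model.

    F: k x p matrix whose i-th row is f(x_i), the model terms at design point x_i;
    w: nonnegative design weights summing to one;
    theta: nominal parameter values (local D-optimality)."""
    p = expit(F @ theta)                       # success probabilities at the design points
    v = w * p * (1.0 - p)                      # per-point information weights
    M = F.T @ (F * v[:, None])                 # sum_i w_i p_i (1 - p_i) f(x_i) f(x_i)^T
    sign, logdet = np.linalg.slogdet(M)
    return logdet if sign > 0 else -np.inf     # singular designs are penalized
```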

Table 11 A CSO-MA generated locally D-optimal design with 17 design points for the 10-factor model with selected interaction terms in the car refueling experiment. The nominal set of values for the model parameters is \(\varvec{\theta }^T = (3.0, 0.5, 0.75, 1.25, 0.8,\) \(0.5, 0.8, -0.4, -1.00, 2.65, 0.65, 1.1, -0.2, 0.9, -0.36, 1.07)\).

Conclusions

Nature-inspired metaheuristic algorithms are general-purpose optimization tools and they require virtually no assumptions to work reasonably well. While they are typically used when all other known optimization methods fail, we note that

  • improved metaheuristics, such as CSO-MA, can outperform earlier metaheuristic algorithms; this was the case for parameter estimation in the single-cell generalized trend model, where CSO-MA produced significantly better objective function values than two recently proposed metaheuristics;

  • they can also produce better quality solutions than those obtained from traditional methods or via commercial statistical packages; this was the case in Table 4, where the negative log-likelihood value from the deterministic Bock-Aitkin algorithm is larger than those from PSO and CSO-MA, suggesting that the metaheuristics outperformed the Bock-Aitkin procedure;

  • the ecological application also demonstrated that parameter estimates from CSO-MA can have small standard deviations across repeated runs, suggesting that CSO-MA gives stable results when optimizing the SCAD penalized regression;

  • improved metaheuristics, such as CSO-MA, can solve optimization problems that were deemed problematic before; such is the case for the car refueling design problem, where this paper considers more interacting factors than earlier papers, which included only a handful of factors in interaction terms.

We close by returning to the question posed in the title of the paper. Based on the current work and other optimization work we have done using metaheuristics, our cumulative experience suggests that they are able to explore and exploit complex optimization problems in statistics and arrive at an optimal solution, or close to the optimum100. More interestingly, there are increasing examples showing that metaheuristics can outperform statistical optimization methods with theoretical convergence properties in terms of speed or quality of the solution. An example is the class of Fedorov-type algorithms, which are commonly used to generate an optimal design by adding one design point at each iteration and then periodically collapsing nearby points into a single point heuristically. For this reason, we believe that metaheuristics offer exciting and fertile ground for theoretical and applied researchers and can potentially revolutionize the world of optimization.