Genetic Algorithm and Particle Swarm Optimization

Genetic Algorithm

The Genetic Algorithm (GA) is a popular optimization method that predates most other nature-inspired optimizers. It belongs to the Evolutionary Computing family of methods, a particularly robust branch of AI. This tutorial explains the Genetic Algorithm and Particle Swarm Optimization.

 

Although this style of optimization was first introduced in the 1960s by Ingo Rechenberg, the GA framework wasn't fully realized until the early 1970s by John Holland and his team. Holland popularized the approach with his book Adaptation in Natural and Artificial Systems, published in 1975.

 

GAs are heavily influenced by Darwinian evolution. The idea behind them is that each solution is part of a population of chromosomes that evolves over a number of generations (the equivalent of epochs in ANNs and iterations in PSO). As the population evolves, it gets closer and closer to the optimal solution of the optimization problem it models.

 

We'll examine the specifics of the GA optimization framework and its core algorithm, see how to implement it in Julia, point out several variants, and discuss how GAs are applicable to data science.

 

The idea of the GA framework is to view the problem as a set of discrete elements, forming what is referred to as a chromosome. Each one of these elements is referred to as a gene, and they can be arbitrary in number, depending on the problem at hand. Although each gene is usually a bit, encoding can take a variety of forms.
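For instance (a minimal illustration, not part of the original code), a binary-encoded chromosome in Julia can be represented as a vector of Bool values, with each element playing the role of a gene:

chromosome = rand(Bool, 8)   # a chromosome of 8 genes, each one a single bit
println(chromosome)          # e.g. Bool[1, 0, 0, 1, 1, 0, 1, 0]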

 

A collection of all these chromosomes is called a genome. Through a series of processes, the genome evolves into the ideal combination of genes. This “perfect combination” is called a “genotype,” and it encapsulates the solution we are after. The information captured in each gene encoding is referred to as a trait.

 

Unlike PSO, solution elements of GAs don’t change through motion, but through a pair of processes called mutation and crossover.

 

These terms are again borrowed from biology, as the processes are similar to those that occur in replicating DNA. In nature, this process leads to the birth of new organisms; that’s why we refer to different iterations of this evolutionary process as “generations”.

 

Mutation is the simplest process, as it involves a single chromosome. Basically, it ensures that over each generation, there is a chance that some gene in the chromosome will change randomly.

 

The probability of this happening is fairly small, but the whole evolutionary process takes so long that it is almost guaranteed to happen at least once. Furthermore, it is theoretically possible to have multiple mutations in the same chromosome (especially if it is large enough). The purpose of mutation is to ensure diversity in the traits, which would otherwise stagnate.

 

Crossover (or recombination) is the most common process by which elements change. It involves two chromosomes merging into a single one, at either a random or a predefined location such as the middle.

 

However, certain instances of crossover can involve two locations, or even a logical operator like AND. For simplicity, we'll work with the basic single-point crossover in this blog.
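As a toy illustration (assuming bit-vector chromosomes, as in the implementation later in this article), single-point crossover can be sketched like this:

function single_point_crossover(a::Vector{Bool}, b::Vector{Bool})
    q = rand(1:(length(a) - 1))          # random crossover point
    return vcat(a[1:q], b[(q+1):end])    # child takes the head of a and the tail of b
end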

 

The crossover process ensures that the genome changes over time, through traits that already manifest in the parents (e.g. eye color). Which of these traits survive in the long run depends on another aspect of the evolutionary process called fitness.

 

Not all chromosomes get to cross over. A selection process ensures that the best-performing chromosomes are the most likely to have descendants in the next generation, much like in a species, where only the better-equipped individuals (e.g. faster, more adaptable, with stronger immune systems) manage to survive and procreate, ensuring that their genes don't die out.

 

Fitness is the measure of how well these chromosomes perform as potential solutions to the problem we are solving. Just like with PSO, we are trying to maximize or minimize a fitness function that evaluates a solution.

 

As the number of chromosomes must remain constant through the whole evolutionary process (otherwise we’d risk a population explosion, draining our computational resources), only the best chromosomes make it to the next generation, based on their fitness.

 

Elitism is an auxiliary aspect of the GA framework that is often used to ensure that the best solution is always present. It acts like a fail-safe, guarding against the possibility that the new genome is worse than that of the previous generation due to bad crossovers and/or bad mutations.

 

Elitism makes sure that the best performing chromosome or chromosomes remain in the next generation regardless.

 

Although elitism was not part of the original GA framework, it is strongly recommended you make use of it, as it has been shown to generally improve the performance of the optimizer.

 

However, if you overdo it with elitism, promoting too many well-performing chromosomes to the next generation at the expense of other, less well-performing ones, you may end up with an overly homogeneous population.

 

This would result in an optimization process that converges prematurely, making the yielded solution more likely to be sub-optimal. Note that elitism is controlled by a parameter indicating how many of the best-performing chromosomes to keep (see the elitism() function later on).

 

The search space of problems tackled with GAs ideally involves a huge number of potential solutions, usually more than could be examined exhaustively or handled analytically. A modest example: if a GA tried to solve a problem where each chromosome has 60 genes represented as bits, it would face 2^60 (over a quintillion) potential solutions.
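You can compute this number directly in Julia to get a feel for its magnitude:

2^60   # 1152921504606846976, i.e. roughly 1.15 quintillion candidate bit strings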

 

In general, problems that lend themselves to GAs fall under the umbrella of "NP-hard" problems. These are problems for which no known algorithm finds an exact solution quickly; in practice, solving them exactly takes exponential time.

 

This means that if the dimensionality of the problem doubles, its complexity grows far more than twofold; for a problem with a 2^n-sized search space, doubling n squares the number of candidate solutions.

 

A typical NP-hard problem with many applications in logistics is the Traveling Salesman Problem (TSP). This involves finding the shortest route through a graph that visits every node exactly once and ends back where it started. Despite its simple description, this is an exceptionally difficult problem as the number of nodes in the graph grows.
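As a rough sketch (not part of the original code), a TSP fitness function could score a tour by summing the edge lengths of the round trip, given a hypothetical distance matrix D; a GA would then minimize this value:

function tour_length(tour::Vector{Int}, D::Matrix{Float64})
    total = 0.0
    for i in 1:(length(tour) - 1)
        total += D[tour[i], tour[i+1]]    # distance between consecutive cities
    end
    return total + D[tour[end], tour[1]]  # close the loop by returning to the start
end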

 

As the scope of these problems makes finding the best solution quite unrealistic, we opt for a “good enough” solution—one that yields a quite large (or small) value for the fitness function we are trying to maximize or minimize.

 

Standard Genetic Algorithm

Let’s now look at the actual algorithm that lies at the core of the GAs framework, the original Genetic Algorithm itself. The main process is as follows:

 

1. Initialization stage: Generate a random population of n chromosomes (potential solutions for the problem). Define Fitness function F() and optimization mode (maximization or minimization).

 

Define stopping conditions such as the maximum number of generations, or minimum progress of fitness over a given number of generations. Define crossover and mutation probabilities (pc and pm respectively), as well as selection scheme.

 

2. Fitness evaluation: Evaluate the fitness of each chromosome x in the population by calculating F(x).

 

3. New population: Create a new genome by repeating the following steps until the new set of chromosomes is complete:

 

a. Selection: Select two parent chromosomes from a population according to their fitness. Namely, select them with a probability p that is proportional to their fitness scores.

 

b. Crossover: With a crossover probability pc, cross over the parents to form new offspring (children). If no crossover is performed, the offspring are exact copies of their parents.

 

c. Mutation: With a mutation probability pm, mutate new offspring at each position in it.

 

d. Population update: Place the new offspring in a new population and discard the previous population.

4. Loop Process: Repeat steps 2-3 until a stopping condition has been met.

5. Output results: Output the best-performing chromosome and its fitness score.

 

The selection process involves one of two main methods to stochastically determine which chromosomes get to be parents (candidates of the crossover process) and which don’t. These are roulette wheel selection and rank selection.

 

The first approach involves creating a "wheel" based on the fitnesses of all the chromosomes: the fitness scores are first passed through a scaling function like exp(x) or sqrt(x), depending on the problem at hand, and then normalized so that they add up to 1.

 

We then obtain a random number in the [0, 1) interval and pick the chromosome corresponding to the wheel section that contains that number. We repeat the process once more to find the other parent.
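In code, the "wheel" is just the cumulative sum of the scaled and normalized fitness scores; a minimal sketch of a single spin (not from the original code) could look like this:

function spin_wheel(weights::Vector{Float64})     # weights are assumed to sum to 1
    r = rand()                                    # random number in [0, 1)
    return searchsortedfirst(cumsum(weights), r)  # index of the selected chromosome
end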

 

The rank selection approach uses the ranking of the fitness scores instead of the scores themselves. So, the worst-performing chromosome gets a value of 1, the second-worst a value of 2, and the best one a value of n, where n is the total number of chromosomes. In all other respects, it's the same as the roulette wheel approach.

 

The rank selection approach ensures that all chromosomes have a decent chance of getting selected, especially in cases where a small number of chromosomes dominate the population in terms of performance (because they are significantly better than the rest).

 

With so many parameters in the GA framework, it can be overwhelming to figure out how to use it for your optimization problems. What follows are some rules of thumb for selecting values for these parameters.

 

Naturally, fiddling with these parameters is a great way to learn, but these guidelines will help you at least get started.

 

As far as the crossover is concerned, you can use a probability between 0.8 and 0.95. This means that around 90% of the time, there will be a crossover taking place for a given chromosome.

 

Regarding mutation probability, a value of around 0.005 to 0.01 generally works. Over time, a mutation on its own can produce a decent solution without any crossover at all. Setting this too high will result in a highly unstable genome that will change uncontrollably and never converge.

 

Population size is a bit trickier to set since a larger population would still work, but take longer for the algorithm to run. That’s why having a number of chromosomes equal to the number of genes in a chromosome is generally a good place to start.

 

When it comes to selection type, generally the roulette wheel method is fine. However, if you find that a small set of chromosomes monopolize the solution process (resulting in largely sub-optimal results for the whole system), then rank selection may be a better option.

 

Implementation of GAs in Julia

Let’s now look at how this algorithm can be implemented in Julia. Below is a sample implementation of a GA, with the elitism add-on included.

 

We’ve also included a sample fitness function so that you can test it. Note that some variables in this code are abbreviated. These are as follows:

X = population data (matrix)

c = coefficients vector for sample function, for testing purposes

ff = fitness function to maximize

nv = number of variables to consider

maximize = whether the function needs to be maximized or not

ips = initial population size

s = desired sum of chromosomes in generated population (an optional but useful parameter for certain problems)

px = probability of event x happening

ng = number of generations

 

The code is written to be easily customized whenever needed, as per the functional programming paradigm that Julia follows.

using Random # needed for randperm(), used in GeneratePopulation()

function sample_ff(x::Array{<:Real, 1},
                   c::Array{<:Real, 1} = ones(Int64, length(x))) # function to maximize
    z = abs.(x) .* c
    return 1 / (1 + sum(z))
end

ShouldActivate(p::Float64) = rand() < p # activation trigger for an event of probability p

evaluation(ff::Function, x::Array{<:Real, 1}) = ff(x)

function scaling(y::Array{Float64, 1}) # scaling fitness function values
    y_ = exp.(y)
    # y_ = sqrt.(y) # another alternative for scaling
    return y_ / sum(y_)
end

function selection(X::Array{<:Real, 2}, y::Array{Float64, 1}, nv::Int64, ips::Int64)
    y_ = scaling(y)
    c = 0
    ind = 1
    xs = Array{eltype(X)}(undef, 2, nv)
    ys = Array{Float64}(undef, 2)
    while true
        if ShouldActivate(y_[ind])
            c += 1
            xs[c, :] = X[ind, :]
            ys[c] = y[ind]
            if c == 2; return xs, ys; end
        end
        ind += 1
        if ind > ips; ind = 1; end
    end
end

function mutation(x::Array{<:Real, 1}, p::Float64, nv::Int64) # change population a bit
    new = false
    for i = 1:nv
        if ShouldActivate(p)
            new = true
            if eltype(x) <: Bool
                x[i] = !x[i]
            else
                x[i] = 1 - x[i]
            end
        end
    end
    return x, new
end

function crossover(xs::Array{<:Real, 2}, ys::Array{Float64, 1}, p::Float64, nv::Int64)
    d = rand(1:2) # dominant gene
    z = xs[d, :]
    w = ys[d]
    new = false
    if ShouldActivate(p)
        new = true
        r = 3 - d # recessive gene
        q = rand(1:(nv-1))
        z[1:q] = xs[r, 1:q]
        w = [NaN]
    end
    return z, w, new
end

function elitism(X::Array{<:Integer, 2}, y::Array{Float64, 1}, n::Int64)
    YX = hcat(y, X)
    YX_sorted = swi(YX)[1] # swi() sorts the matrix rows by fitness (see SortWithIndexes.jl)
    X_elite = round.(Int64, YX_sorted[1:n, 2:end])
    y_elite = YX_sorted[1:n, 1]
    return X_elite, y_elite
end

function GeneratePopulation(ips::Int64, nv::Int64, s::Int64 = -1) # function for creating original population
    if s == -1
        X = rand(Bool, ips, nv)
        for i = 1:ips
            if sum(X[i,:]) == 0
                X[i, rand(1:nv)] = true
            end
        end
    else
        x = falses(nv)
        for i = 1:s; x[i] = true; end
        X = Array{Bool}(undef, ips, nv)
        for i = 1:ips
            X[i,:] = x[randperm(nv)]
        end
    end
    return X
end

function runga(ff::Function, nv::Int64, ips::Int64 = nv, s::Int64 = div(nv, 2),
               ng::Int64 = 1000, pm::Float64 = 0.01, pc::Float64 = 0.9, pr::Float64 = 0.1) # wrapper function
    X = GeneratePopulation(ips, nv, s)
    y = Array{Float64}(undef, ips)
    for i = 1:ips
        y[i] = evaluation(ff, X[i,:]) # fitness scores of population
    end
    n = round(Int64, ips*pr) # elite size
    X_ = Array{Bool}(undef, ips, nv) # offspring population
    y_ = Array{Float64}(undef, ips) # fitness scores of offspring population
    for i = 1:ng
        X_[1:n, :], y_[1:n] = elitism(X, y, n)
        for j = (n+1):ips
            xs, ys = selection(X, y, nv, ips)
            z, w, new1 = crossover(xs, ys, pc, nv)
            z, new2 = mutation(z, pm, nv)
            if new1 || new2
                y_new = evaluation(ff, z)
            else
                y_new = w
            end
            X_[j,:] = copy(z)
            y_[j] = y_new[1]
        end
        X = copy(X_)
        y = copy(y_)
    end
    ind = findmax(y)[2]
    return X[ind, :], y[ind]
end
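As a quick usage sketch (assuming the swi() helper used by elitism() has been loaded from the SortWithIndexes.jl script mentioned later in this article), you could run the optimizer on the sample fitness function like this:

include("SortWithIndexes.jl")           # provides swi(); assumed to be in the working directory
best_x, best_y = runga(sample_ff, 20)   # 20 binary genes, default settings otherwise
println(best_x, " -> ", best_y)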

 

Beyond these examples, there are various applications of GAs that make learning them a worthwhile effort. A classic one is the knapsack problem, which involves finding the optimum collection of items to put in a knapsack, given some weight restrictions and a value for each item.

 

The idea is to maximize the overall value of the collection, while also keeping the total weight under a given threshold, for each container (knapsack). However, other kinds of problems, including the optimization of continuous variables, can be solved using GAs.

 

For example, GAs can be used in feature selection, as a dimensionality reduction methodology. The restriction, in this case, would be that the selected features are below a given proportion of the original feature set, while at the same time capturing as much information of the original feature set as possible.

 

Naturally, in cases with a lot of features, selecting a good subset of those features can be quite challenging due to the enormous amount of combinations that are possible—which is exactly where GAs come in handy.
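As a hedged sketch of this idea (the names and the penalty rule here are illustrative, not from the original code), a feature-selection fitness function could reward model quality while rejecting chromosomes that switch on too many features:

# x marks which features are selected; quality() is a placeholder for your model-evaluation score
function feature_selection_fitness(x::Vector{Bool}, quality::Function, max_ratio::Float64 = 0.3)
    ratio = sum(x) / length(x)        # proportion of features kept
    ratio > max_ratio && return 0.0   # violates the restriction, so the chromosome is worthless
    return quality(x)                 # otherwise, fitness is the model's quality score
end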

 

In addition, GAs can be used in all kinds of situations where discrete optimization is required, such as non-linear dynamic systems. One such case is the stock market, where GAs can be used to select the best stocks for a portfolio.

 

Naturally, building a portfolio this way would require some additional analysis to determine the value of each stock, but GAs can play an important role in the selection process itself, once those values have been established.

 

GAs can also be used in other optimization scenarios, such as scheduling. Building an optimal schedule around many restrictions and preferences is a classical discrete optimization problem, applicable nearly everywhere from project management to logistics. As long as the problem is properly encoded, GAs can be very effective at finding a solution.

 

An application closer to data science is the use of GAs to design ANNs, both in terms of architecture and in terms of weights (training). Although PSO can be great at figuring out the weight values, it isn’t best suited for working out the architecture of an ANN. 

 

The architecture of such a network involves the number of neurons in each layer, as well as how they are connected (and not all ANNs have dense connectivity).

 

Finally, GAs are used for biological research, as in the case of finding the shape of a protein molecule. This is particularly useful for the research of potential treatments for cancer since the closely-linked problem of protein folding is an NP-hard problem with a huge solution space.

 

There are several more applications beyond these, some more specialized than others. The fact that GAs find use in such a broad spectrum of domains is a testament to their versatility and value.

 

Main variants of GAs

The GA is a very general framework, so the standard algorithm we saw earlier has several variations that perform better in many cases.

 

One such variant is the Hybrid Genetic Algorithm (HGA), which is basically an ensemble approach to GAs, executed in a sequential fashion. The standard GA is combined with a different optimizer that is better suited for finding an optimum in a smaller solution space, usually using derivatives or other additional information about the fitness function.

 

The GA is applied globally first; once it has homed in on a promising region, the other optimizer takes over to refine the solution. HGAs are useful when the fitness function can be differentiated, or when a more accurate solution is required.

 

Another interesting variant is the Self-Organizing Genetic Algorithm (SOGA). As the name suggests, SOGAs involve a process whereby the parameters of the optimization method are optimized along with the variables of the problem at hand. So, instead of fine-tuning the optimization model yourself, the model does it for you.

 

The Variable Selective Pressure Model (VSPM) is a newer variant of GAs worth considering. This relatively sophisticated approach to GAs involves changing the selection strategy so that you can steer how much diversity exists in the population, thereby avoiding overly homogeneous or overly diverse populations. The idea is to introduce an “infant mortality” rate that limits the presence of weaker chromosomes in the population.

 

Genetic programming

Moreover, Genetic Programming (GP) is a powerful GA variant that focuses on continuous variables. As it is quite distinct from the standard GA, and since its usefulness in data science applications is noteworthy, we’ll take a closer look at GP.

 

Finally, an add-on to the standard GA, such as the elitism process, can be viewed as a variant of sorts. After all, the broad framework lends itself to tweaking, so if you delve into it a bit, you are bound to come up with your own variant of the optimization method.

 

Genetic Programming is an interesting variant of GAs which has received a lot of attention in the past few years. It involves "building" a synthetic function out of other, simpler functions that take the form of genes.

 

However, there is also some data that is used in those functions, so the inputs of a GP model include a couple of data streams, one for the inputs and one for the outputs of the function that needs to be approximated.

 

Note that this problem can also be approached with a regression model, though such a model's success would be quite limited, as it is restricted to linear combinations of the inputs. Of course, you can have a non-linear regression model, but you'd have to create the non-linear features yourself (which takes considerable effort).

 

You could also use an ANN, but you wouldn’t be able to see how the inputs are mapped to the outputs (just like the “black box” issue we discussed in previous blogs). If you require a mapping through a function that you can view in its entirety, GP is the best way to go.

 

However, the function that GP yields is not necessarily a good model, since it is bound to overfit if you use the error as a fitness function (which you'd minimize afterward).

 

That’s why it would make more sense to use some heuristic, such as the correlation coefficient or a similarity metric in general, as the fitness function (which you’d maximize in this case).

 

Such problems have been solved successfully for feature fusion scenarios, as part of a predictive analytics model.

 

The functions that GP uses can be anything from a simple polynomial to some complex trigonometrical function. They must be in their most basic form since more complex functions will inevitably emerge through the evolution process of the GP system.

 

For example, if you have the functions tan(x), x^2, and x^3, GP will at some point have chromosomes consisting of tan(x^2), tan(x^3), (tan(x))^2, x^6, etc.
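To see how such compositions arise, here is a toy illustration (not the actual GP machinery) of primitive function "genes" being combined into more complex expressions:

primitives = [x -> tan(x), x -> x^2, x -> x^3]   # basic building blocks
compose(f, g) = x -> f(g(x))                     # combine two genes into a new one
h = compose(primitives[1], primitives[2])        # h(x) = tan(x^2)
println(h(0.5))                                  # evaluate the synthesized function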

 

That's why special care must be taken when selecting those functions, so that the data used with them makes sense. For instance, there is no point in using sqrt(x) if your data contains negative values of x, though sqrt(abs(x)) could be used instead.

 

GP is a truly ingenious method that often adds more value than the conventional GA variants, opening new possibilities. Apart from feature fusion, it has been used for building LISP programs. Don’t let the fact that many AI experts don’t know about GP deter you from trying it out.

 

GA framework tips

Working with GAs is not entirely straightforward; you must be aware of several considerations. For starters, as GAs have a lot of parameters, some fine-tuning is usually necessary. Configuring a GA properly is often a big part of its effectiveness, especially in complex problems.

 

Also, even though GAs are capable of handling continuous optimization problems, they are generally not effective in those situations. That's why, if you want to tackle such problems, you have to either use some other optimization framework or a specialized variant of GAs, such as GP.

 

Moreover, the standard GA is not all that robust, compared to its variants. So, if you are to use the GA framework for a complex problem, it is best to explore the algorithm’s various add-ons, so that you get a more effective and more efficient optimization system.

 

Furthermore, just like PSO, GAs are great if your fitness function cannot be differentiated, or the process of differentiating is very expensive computationally.

 

However, if you have easy access to the derivatives of the fitness function, other optimization methods may be more effective for finding an accurate solution.

 

Summary

The Genetic Algorithm (or GA) is an evolutionary computation framework. These algorithms mimic the biology of genetic reproduction in order to model optimization problems. Here, chromosomes represent potential solutions.

 

The collection of all the chromosomes is called the genome, and it progresses in terms of fitness over several iterations called generations.

 

GAs involve two main processes for evolving the solutions toward the optimum they are after. These are mutation and crossover.

 

Mutation is the process according to which, at any given generation, a chromosome may change at a random gene by flipping its value. Mutation ensures that the genome doesn't stagnate and that the solutions aren't limited to a particular part of the search space.

 

Crossover is the process whereby two chromosomes A and B merge to form two other chromosomes, each consisting of a part of A and a part of B. The split of the parent chromosomes can happen either at a random or a predefined position. Crossover ensures that the solutions are dynamic and therefore able to evolve over time.

 

The best-performing chromosome (or chromosomes) is sometimes retained in the next generation. This ensures that the best solution doesn’t degrade when the crossovers, mutations, or selections don’t work out favorably. This process is called elitism.

 

The original Genetic Algorithm involves creating a population of chromosomes, evaluating them using a fitness function, stochastically selecting some of them based on their fitness scores, applying crossover and mutation to them based on these selections, creating a new population based on the results, and repeating this process for a given number of generations or until some other stopping criterion has been met.

 

Special care must be taken during the encoding part of a GA model, since it's not always straightforward, and some additional functions may need to be coded.

 

Some of the main variants of GAs include Hybrid GAs, Self-organizing GAs, Variable Selective Pressure Models, and Genetic Programming.

 

Genetic Programming (GP) is a very useful variant of the GA. In GPs, different functions are used as chromosomes, in order to create a synthetic function that approximates a given mapping (like a regression system). GP has been used for feature fusion, among other applications.

 

Although GAs can handle continuous variables (with the proper encoding), they are not as accurate as other AI systems, such as PSO and Simulated Annealing.

 

GAs are ideal for cases where the fitness function cannot be differentiated or the differentiation process is too computationally expensive. Unfortunately, NDA restrictions prohibit the discussion of details about such problems.

 

Particle Swarm Optimization Algorithm

The significance of PSO lies in the fact that many of the alternative optimizers are merely variations of the cornerstone PSO. As a result, understanding this optimizer grants you access to a whole set of optimization methods that can solve much more than conventional data analytics problems.

 

In fact, their applications span over so many fields that one can argue that many data analytics methods are just a niche application of this AI framework.

 

PSO belongs to a general class of systems called Evolutionary Computation, which is a type of Computational Intelligence.

 

Computational Intelligence is a popular subclass of AI (at least in the research world) that involves the development and application of clever ways to solve complex problems, using just a computational approach.

 

In this blog, we’ll examine the inner workings of PSO, as well as some of its most important variants, with a focus on the Firefly optimizer. We’ll also show how PSO can be implemented in Julia. We’ll close with some useful considerations about PSO, and a summary of the key points of this blog.

 

PSO algorithm

The logic behind PSO is to have a set of potential solutions (akin to a swarm of particles) that continuously evolve, becoming better and better, based on some fitness function the system tries to maximize or minimize.

 

The particles “move” with varying speeds throughout several dimensions (also called variables), influenced by the best-performing particle, so that they collectively reach an optimal solution in an efficient manner.

 

In addition, each particle “remembers” its best performance historically, and it takes that into account when changing its position.

 

Naturally, the best performing particle may be a different one over the duration of the search (you can imagine the group of solutions moving towards the best possible solution like a swarm of insects, so which insect is closest to that solution is bound to be different every time you look at the swarm).

 

Still, there is generally an improvement in the best solution over time, even if the rate of this improvement gradually diminishes. This is because the closer you get to the best solution, the more likely the swarm is bound to deviate from it (albeit slightly) while “zeroing in” on that best solution.

 

All these traits make the PSO algorithm ideal for optimizing the parameters of a complex system. PSO is relatively new as an algorithm; its creators, Dr. Eberhart and Dr. Kennedy, invented it in 1995. The pseudocode of PSO is as follows:

 

For each particle in the swarm

Initialize particle by setting random values to its initial state

End

Do

For each particle in the swarm

Calculate fitness value

If the fitness value is better than the best fitness value in its history (pBest): pBest <— fitness value of particle

End

gBest <— particle with the best fitness value of all the particles in the swarm

For each particle

Calculate particle velocity according to equation A

Update particle position according to equation B

End

Repeat Until maximum iterations is reached OR minimum error criteria is attained

The key equations for the updating of a particle’s speed and position, respectively, are the following:

eq. A: v[] += c1 * rand() * (pbest[] - present[]) + c2 * rand() * (gbest[] - present[])

eq. B: present[] += v[]

 

Note that c1 and c2 are parameters of the PSO algorithm, each having the default value of 2.0 (though any values between 0.0 and 4.0 can be used), while the value of each is independent of the value of the other.
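In Julia, these two update rules translate almost literally; here is a minimal sketch (not the full implementation, which follows later) of the update for a single particle, with the position, velocity, and best-position vectors passed in explicitly:

function update_particle!(present::Vector{Float64}, v::Vector{Float64},
                          pbest::Vector{Float64}, gbest::Vector{Float64},
                          c1::Float64 = 2.0, c2::Float64 = 2.0)
    v .+= c1 * rand() .* (pbest .- present) .+ c2 * rand() .* (gbest .- present)  # eq. A
    present .+= v                                                                 # eq. B
    return present, v
end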

 

Also, the velocity of a particle for any given dimension is limited to Vmax (another parameter set by the user) to avoid the particles swarming out of control. The exact value of this parameter depends on the problem.

 

Other parameters include the number of particles (usually at least 20, with more complex problems usually requiring more particles), the range of the values of the particles (which is dependent on the problem), and the stopping conditions—namely the total number of iterations and the minimum error threshold. These stopping conditions are also dependent on the problem.

 

To make PSO faster, we can include an additional parameter that affects the progress the algorithm makes as it searches through the solution space. If a certain number of iterations take place without any significant progress in the objective function, then the algorithm can terminate; in these cases, the swarm usually has gotten stuck in a local optimum.

 

Main PSO variants

Just like most well-established algorithms in AI, PSO has its share of variants, most of which are better suited for certain sets of problems. The most important of these variants are:

 

PSO with inertia: a variation of PSO that uses an “inertia weight” (usually around 0.9), which gradually decreases, allowing for a narrowing of the search over time. This enables PSO to switch from exploratory to exploitative mode, yielding more accurate solutions.

 

PSO with Neighborhood Operator: a popular variant of PSO that considers other particles in the same neighborhood. The idea is that through this method the chances of getting trapped in a local optimum are greatly reduced, making the whole system more robust.

 

Discrete PSO: a variant of PSO that enables the solution of discrete optimization problems.

 

Constriction PSO: a version of PSO that doesn’t make use of the Vmax parameter. It manages to keep velocities in check by introducing a couple of additional parameters, one of which is the constriction coefficient χ (suggested value: 0.7289). These parameters ensure that the velocity of the algorithm remains manageable, making PSO converge smoothly.

 

Fully informed PSO: a case of Constriction PSO where the two parameters are the same, generalized to any number of particles. This enables each particle to be influenced by every other particle, making the whole process more stable.

 

Bare-bones PSO: a lighter version of the original PSO algorithm, with the velocity aspect dropped altogether.

 

Firefly: a somewhat different approach to the whole “movement” part of the algorithm. We’ll examine this in more detail in the following section.

 

Note that since PSO is a topic of active research, there are continually new variants being developed. The variants mentioned here are just the few that have stood the test of time; they go on to show that there is a lot of promise in this optimization framework.

 

Firefly optimizer

The Firefly optimizer is one of the most interesting variants of PSO—in fact, it is unique enough to be considered a different optimizer by many people. However, when examined in depth, it becomes clear that the Firefly optimizer is a more creative take on the key components of PSO.

 

Despite sharing the same principles as PSO, a couple of distinct differences give Firefly its niche. For starters, each particle is attracted by all other particles—not just the best-performing one. This is reminiscent of the Fully-Informed PSO variant.

 

In addition, there is no velocity in this algorithm, since the concept of inertia is replaced by “fogginess.” In other words, light is not dispersed perfectly as if in a vacuum (such a case would make the algorithm very unstable and the particles’ movement chaotic).

 

This is expressed by the light-absorption coefficient γ, which ensures that the attraction fades exponentially, while the intensity of the light follows the Newtonian model of gravity.

 

The exponential diminishing of the attractiveness of other particles ensures that the fireflies don’t get too confused and that it is generally the closest well-performing firefly that has the most impact.

 

Other than all that, the Firefly algorithm follows the same strategy as the PSO algorithm. You can read about the details of the Firefly algorithm in its documentation.

 

The key advantages of Firefly over PSO are that it is faster and more accurate, across a variety of objective functions. On the downside, it has a bunch of parameters that need to be set, and tweaking the algorithm is quite a challenge.

 

Fortunately, the Firefly algorithm is good enough to be useful off the shelf with the default parameter values. The code that accompanies this blog includes an implementation of Firefly when we discuss optimization ensembles.

 

PSO versus other optimization methods

A few optimization methods are similar to PSO but are not part of its taxonomy. The two most well-known of these are Ant Colony Optimization (ACO) and Genetic Algorithms (GAs). Note that both these algorithms are also part of the Evolutionary Computing (EC) family.

 

Despite its similarities to PSO, ACO takes a probabilistic approach to optimization. The whole framework resembles a chaotic system, with pheromones playing the role of attractors. Pheromones are the influential forces a solution exercises over other solutions.

 

The potential solutions are called “ants” instead of particles, but the idea is the same. What’s more, ACO has spawned its own set of variants and similar methods, such as Bee Colony Optimization (BCO).

 

As for GAs, they involve a lot of tweaking of solutions, both through random changes and through interactions with other solutions. Furthermore, they involve coding each solution in a binary pattern.

 

We’ll explore GA optimization in detail in the next blog. Although GAs are versatile in terms of the kind of problems they can solve, they are rarely used for optimizing continuous variables. This is because they fail to perform as well as PSOs and other similar methods in the EC family.

 

PSO implementation in Julia

PSO can be easily implemented in any programming language. For simplicity and performance, we’ll be using Julia (ver. 1.0) for this and all the other optimization methods in this blog. Note that a PSO implementation is already available on Github.

 

However, it is unlikely that you’ll find an implementation more comprehensive than the one we provide here since few Julia programmers have delved into this topic extensively.

 

The PSO implementation in this section takes the fitness function as input argument ff, which in turn takes as input an array of numbers.

 

This means that PSO has no way of knowing how many variables it is dealing with, since nothing in the function ff denotes that. As a result, the number of variables to optimize needs to be included in the inputs as well (variable nv).

 

It has been found empirically that a swarm size of about 10 times the number of variables works well; this is what ps, the number-of-particles parameter, defaults to. Beyond these, there are a few additional inputs based on what we've discussed previously.

function pso(ff::Function, nv::Int64, minimize::Bool = true, ps::Int64 = 10*nv,
             ni::Int64 = 2000, c = [2.0, 2.0], maxv = 2.0, iwp = 50)
    buffer = div(iwp, 2)
    ni += iwp
    tol = 1e-6
    PP = randn(ps, nv)                 # population positions
    PV = randn(ps, nv)                 # population velocities
    PC = Array{Float64}(undef, ps)     # population costs (scores)
    Pp_best = copy(PP)                 # each particle's best position
    gb = Array{Float64}(undef, ni)     # global best over time
    if minimize
        temp = Inf
    else
        temp = -Inf
    end
    for i = 1:iwp; gb[i] = temp; end
    for i = 1:ps; PC[i] = ff(PP[i,:][:]); end
    p_best = copy(PC)
    if minimize
        m, ind = findmin(PC)
    else
        m, ind = findmax(PC)
    end
    gb[1 + buffer] = m
    Pg_best = PP[ind,:]
    for iter = (iwp + 1):ni
        gb[iter] = gb[iter - 1]   # carry over the best score found so far
        for i = 1:ps
            PV[i,:] += c[1] * rand() * (Pp_best[i,:] - PP[i,:]) + c[2] * rand() * (Pg_best - PP[i,:])
            for j = 1:nv
                if PV[i,j] > maxv; PV[i,j] = maxv; end
                if PV[i,j] < -maxv; PV[i,j] = -maxv; end
            end
            PP[i,:] += PV[i,:]
            PC[i] = ff(PP[i,:][:])
            if minimize
                if PC[i] < p_best[i]
                    p_best[i] = PC[i]
                    Pp_best[i,:] = PP[i,:]
                    if PC[i] < gb[iter]
                        gb[iter] = PC[i]
                        Pg_best = PP[i,:]
                    end
                end
            else # maximizing mode
                if PC[i] > p_best[i]
                    p_best[i] = PC[i]
                    Pp_best[i,:] = PP[i,:]
                    if PC[i] > gb[iter]
                        gb[iter] = PC[i]
                        Pg_best = PP[i,:]
                    end
                end
            end
        end # of i loop
        if abs(gb[iter] - gb[iter - iwp]) < tol
            return Pg_best, gb[iter] # best solution and best value respectively
        end
    end # of iter loop
    return Pg_best, gb[end] # best solution and best value respectively
end

 

Despite its length, the core of the algorithm is simple, quite fast, and relatively light on computational resources. Note that most of the parameters are optional since their default values are predefined.

 

Simply feed it the fitness function and the number of variables, and decide whether you want it to be minimized or not. If you don’t specify the latter, the PSO method defaults to minimization of the fitness function.

 

Note that we use here the "vanilla" version of PSO, with minimal add-ons. As a result, its performance is not great. We'll investigate an improved Julia script of PSO in blog 10, along with its parallelized version.

 

PSO in action

The first practical application of PSO proposed by its creators was training ANNs. However, PSO's flexible nature has made it useful in various other domains, such as combinatorial optimization, signal processing, telecommunications, control systems, data mining, design, power systems, and more.

 

Also, as more specialized algorithms for training ANNs became available, PSO ceased being a relevant option for optimizing the weights of an ANN.

 

Although most versions of PSO involve a single-objective approach (having a single fitness function), with some changes, PSO can be used in multiple-objective and dynamic problems (with varying configurations).

 

The possibility of having constraints in the solution space has also been explored (the constraints in this latter case are inherently different from the constriction PSO variant).

 

So, even though PSO was originally a data science-oriented algorithm, its applicability has made it a useful tool for all sorts of problems. This clearly shows how AI is an independent field that dovetails well with almost any data-related scenario.

 

Nevertheless, some organizational problems require the use of an optimizer, rather than a machine learning system. Examples of such issues include creating an optimal schedule, finding the best way to stock a warehouse, or working out the most efficient route for a delivery driver.

 

These problems are so common in so many industries that familiarity with a robust optimizer like PSO can be a good distinguishing factor, professionally. Besides, having a variety of skills can help you develop a more holistic view of a challenging situation, empowering you to find a better strategy for tackling it.

 

Note that just like any other AI optimizer, PSO does not guarantee the best possible solution to a problem, nor does it offer mathematical precision. However, it is very efficient. As such, PSO adds a lot of value in cases where an approximate solution is sufficient, especially if the time it takes to find this solution is also important.

 

Furthermore, when the problems involve functions that cannot be easily analyzed mathematically (e.g. functions that aren’t “smooth” enough to calculate a derivative function), a method like PSO is the most viable option.

 

Minimizing a polynomial expression

The following examples of PSO involve two different problems, as expressed by a couple of different fitness functions.

 

In the first case we consider a minimization problem, while in the latter, we’ll look at a maximization problem. First, let’s start with defining the fitness function, F, for the first problem, which involves a complex (highly non-linear) polynomial expression:

function F(X::Array{Float64})
    return X[1]^2 + X[2]^2 + abs(X[3]) + sqrt(abs(X[4]*X[5])) + 1.0
end

You can also write the above function as:

F(X::Array{Float64}) = X[1]^2 + X[2]^2 + abs(X[3]) + sqrt(abs(X[4]*X[5])) + 1.0

 

Though more compact, this may not be as useful for complex functions involving a lot of variables.

Whatever the case, we expect to get a solution that’s close to (0, 0, 0, 0, 0), since this is the solution that corresponds to the absolute minimum of this function (which is in this case 1.0 since 0^2 + 0^2 + |0| + sqrt(|0*0|) + 1 = 1).

 

Next, we need to run the PSO algorithm, using the above function as an input. We’ll work with the default values for the input parameters, for the majority of them:

pso(F, 5)

For one of the runs of PSO, the solution [-0.0403686, 0.0717666, -0.0102388, 0.0966982, -0.129386] was yielded, corresponding to a fitness score of approximately 1.243.

 

Although this solution is not particularly impressive, it is decent, considering the complexity of the problem and the fact that we used the most basic version of the optimizer.

 

We can try a smaller swarm – say, of 20 particles – for comparison:

pso(F, 5, true, 20)

The result in this case was [0.164684, -0.241848, 0.0640438, -0.0186612, -0.882855], having a fitness score of about 1.388. Additional runs may yield better scores. This shows that PSO systems can yield acceptable results, even without lots of particles.

We can measure how long this whole process takes using the @time meta-command, as follows:

@time pso(F, 5)

 

In this case, for a solution of comparable fitness, we learn that the whole process took about 0.008 seconds—not bad at all. As a bonus, we get some information about how many computational resources the process consumes.

 

That is 7.179 MB of RAM through its 87.6K allocations. Note that for this report to be accurate, the command must run more than once. This is true of all Julia functions benchmarked using this meta-command.

 

Maximizing an exponential expression

Let’s try something a bit more challenging for the maximization example. This problem consists of six variables, one of which is raised to the 4th power, making the solution space a bit rougher.

 

function F2(X::Array{Float64})
    return exp(-X[1]^2) + exp(-X[2]^2) + exp(-abs(X[3])) + exp(-sqrt(abs(X[4]*X[5]))) + exp(-X[6]^4)
end

Like in the previous case, we expect to get something close to (0, 0, 0, 0, 0, 0) as a solution, since this is the absolute maximum of this function (which is equal to 5.0, since F2(0, 0, 0, 0, 0, 0) = exp(-0^2) + exp(-0^2) + exp(-|0|) + exp(-sqrt(|0*0|)) + exp(-0^4) = 1 + 1 + 1 + 1 + 1 = 5).

 

To use PSO, we simply type:

pso(F2, 6, false)

The solution obtained is [0.370003, 0.0544304, 0.0980422, 0.00426721, -0.011095, 0.294815], corresponding to a fitness score of about 4.721, which is quite close to the maximum value we were expecting.

Again, we can see how much time and computational resources this whole process took in this case:

@time pso(F2, 6, false)

 

The time the whole problem took was about 0.009 seconds, while it took about 15.006 MB of memory and around 183.1K allocations. Clearly, this is a somewhat tougher problem, involving a larger swarm, so it takes a bit more time and memory (though the time overhead is quite small).

 

If we were to solve either one of these problems with a deterministic optimizer, though, it would probably take the same computer longer.

 

PSO tips

Despite its simplicity, avoiding suboptimal results with PSO requires some attention to detail. For instance, if you use a low value for Vmax, the algorithm will take a long time to converge (not to mention the increased risk of it getting stuck at a local optimum, yielding a mediocre solution).

 

On the other hand, a very large value would make the whole process very unstable (and unable to converge on an optimum).

 

Furthermore, a very large number of particles makes the whole system fairly slow, while too few particles make it difficult to find the optimum solution. The empirical default value of 10 times the number of variables seems to work well for all the benchmarks tried, but it's just a rule of thumb; make sure you experiment with this parameter when you fine-tune your PSO model.

 

In addition, in some cases, PSO is used with a variable Vmax parameter, to ensure that it converges more smoothly.

 

For example, you can reduce it by a factor k every so many iterations, so that as the swarm approaches the optimum value of the function, its particles stay closer together, yielding better precision. Once you get the hang of PSO, you can experiment with such parameters to improve its performance.
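A minimal sketch of such a schedule (the decay factor and the interval used here are illustrative values, not prescriptions) could be:

function decay_schedule(maxv::Float64 = 2.0, k::Float64 = 0.9, ni::Int64 = 2000)
    for iter in 1:ni
        # ... the usual PSO velocity and position updates would go here ...
        if iter % 100 == 0
            maxv *= k   # shrink the velocity cap every 100 iterations
        end
    end
    return maxv
end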

 

What's more, it's a good idea to make sure that the swarm covers a meaningful area when deployed, to ensure that it won't get stuck in a local optimum. In other words, if you are optimizing a set of three parameters that all take values between 0 and 1, it's best to spread the swarm to cover as much of that volume as possible, instead of having all the particles close to (0, 0, 0).

 

This is because if the optimal solution is close to (0, 1, 1), for example, it could take the swarm a long time to approach it.

 

How much area exactly a swarm covers when deployed is something you may want to experiment with, since it largely depends on the problem at hand.

 

Also, consider the distribution of the particles across the various dimensions of the problem space. The distribution used in this implementation is Gaussian, as shown through the randn() function used to initialize the particles.

 

The algorithm’s performance can be greatly improved if you parallelize it. The best way to do so involves defining a number of workers, each one undertaking an instance of the algorithm, and then comparing their findings, taking the smaller or larger of their solutions, depending on the type of optimization problem you are solving.

 

Make sure you use the @everywhere meta-command in front of all the functions, however, or the parallelization will not work. 

 

Finally, PSO is still a work in progress, so don’t be afraid to experiment a bit, changing it to suit the problem you need to solve. We also recommend you try to implement the Firefly algorithm. 

 

Summary

Particle Swarm Optimization (PSO) is a fundamental optimization algorithm under the umbrella of nature-inspired optimizers. It is also part of the Computational Intelligence group of systems, which is a subclass of AI.

 

PSO entails a set of potential solutions which constantly evolve as a group, becoming better and better, based on some fitness function the system tries to optimize.

 

Just like most robust algorithms of this type, PSO is ideal for tackling complex, highly non-linear problems, usually involving many variables, such as the parameters of a complex system like an ANN.

 

PSO is noticeably different from Ant Colony Optimization (ACO) as well as from Genetic Algorithms (GAs). There also exist some differences among the variants of PSO; these differences mainly concern the scope and the specifics of the method.

 

There are various versions of PSO. Firefly is one of the most noteworthy variations, partly due to its distinct approach to the problem space.

 

The “swarm” used in Firefly is a set of fireflies, attracted to each other based on how well they perform in the fitness function the swarm is trying to optimize. Instead of using velocities, however, the particles, in this case, are “pulled” by all of the other particles, based on how far they are and how “bright” they shine.

 

Firefly is generally faster and more accurate as an optimizer, compared to PSO (as well as a few other nature-inspired optimizers).

 

The original PSO and most of its variants are ideal for optimizing continuous variables.

The fitness function of an optimizer like PSO does not need to be differentiable since no derivatives of it are ever calculated.

 

PSO has a variety of applications, including ANN training, signal processing, and combinatorial optimization problems. Different versions of PSO can handle more sophisticated optimization scenarios, such as multiple-objective problems, constraint-based cases, and dynamic problems. One version of PSO (Discrete PSO) even tackles discrete optimization problems.

 

PSO on its own is not as robust as its variants, but it’s very useful to know. Understanding its original form makes learning its variants (or creating new ones) significantly easier.

 

Optimization ensembles often go by the term "hybrid systems" in the scientific literature. However, we prefer to avoid adding yet another term to the vast lexicon of data science when "ensemble" works well as a general descriptor, plus it's the term most commonly used in the industry. So, if you come across the term "hybrid" in a paper, keep in mind that it usually refers to what we call an ensemble here.

 

Optimization ensembles combine the performances of various optimization systems that tackle the same problem. These systems may be iterations of the same method, with different parameters and/or starting values, or they can comprise different methods altogether.

 

Alternatively, optimization ensembles can exchange data from one optimizer to another, in an attempt to further improve the accuracy of the results.

 

Naturally, this approach requires a more in-depth understanding of the methods involved, as well as more fine-tuning of each optimizer to the problem at hand. In this blog, we'll focus on the simpler kind of ensemble described above.

 

The role of parallelization in optimization ensembles

Parallelization is a key to creating ensembles, as it enables the optimization methods to work simultaneously. Much like big data modeling systems that use various machines (or components, like GPUs) to process data while simultaneously training a system, parallelized optimizers can train and seek the optimum solution at the same time, trying out different strategies.

 

This can minimize the risk of getting trapped in a local optimum, or getting an inaccurate solution due to a sub-optimal configuration in the optimizer’s parameter settings.

 

Parallelization, therefore, makes better use of available resources, saving us time and facilitating better results. Depending on the nature of a problem, some optimizers can be designed to focus on only one part of the problem, such as the rough optimization that takes place in the beginning, before zeroing in on the optimum solution. This may require some advanced programming, though, so we won’t delve into it in this blog.

 

The framework of a basic optimization ensemble

 

Let’s now examine a basic optimization ensemble, comprising two optimizers working in parallel. Each system has different parameters, but they are both sharing the same fitness function (otherwise it wouldn’t be an ensemble at all).

 

Once they search for solutions, they compare their outputs. The solution that corresponds to a higher or lower value (depending on whether it is a maximization or a minimization problem) is selected as the output of the whole system.

 

The key strength of this setup is that the two optimizers work simultaneously, so the whole system is quite fast. You can create an ensemble of more than two optimizers, each with its own set of parameters, all sharing the same fitness function—provided there are enough processing units to run them.

 

Note that if the optimizers used in the ensemble have significantly different running times, this will result in delays. This is because the ensemble wrapper function must wait for all the workers to finish, not being able to outpace its slowest optimizer.

 

A case study with PSO Systems in an ensemble

To make all this more concrete, let’s consider a basic optimization ensemble consisting of three PSO systems, working to minimize a given function. These examples are based on a quad-core computer, so they are appropriate for a machine having at least this computing power.

 

If you are using an older computer, you will need to adjust the number of workers to suit its specifications—otherwise, you may experience delays when running the ensemble code.

 

For starters, let’s get the PSO code in memory. Do note the one minor improvement, made for better accuracy in the results: if the best solution doesn’t improve beyond a given tolerance threshold within a buffer period, the maximum velocity diminishes by a given factor (in this case 0.618).

 

Next, we need to load the Distributed package (one of Julia's standard libraries) so that we can use parallelization:

using Distributed

Next, we need to add a few workers to apply parallelization:

addprocs(3)

We can verify that we now have 4 workers (one for each PSO system in the ensemble) as follows:

nprocs()

 

Even though we only need 3 workers for this example, we strongly recommend adding all the workers you will need for the entire project during this initialization phase. Otherwise, you risk serious confusion among the functions used by these workers.

 

Afterward, we need to load a function called swi() to help us with sorting matrices, available in a custom script called SortWithIndexes.jl:

include("SortWithIndexes.jl")

Next, we need to define the fitness function (let’s call it FF) for the optimizers, and we’ll opt to minimize it:

@everywhere function FF(X::Array{Float64})
    return X[1]^2 + abs(X[2]) + sqrt(abs(X[3]*X[4])) + 1.0
end

 

For this function to be accessible to all the workers, it must be preceded by the @everywhere meta-command.

The same prefix is added to the pso() function, implementing the PSO algorithm for each available worker, as well as in the swi() function for the matrix sorting process.
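In practice (assuming the functions live in script files with these hypothetical names), the pattern looks like this:

@everywhere include("pso.jl")               # hypothetical file holding the pso() function
@everywhere include("SortWithIndexes.jl")   # makes swi() available on every worker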

 

If you bring additional workers onboard later, they will not have access to the FF function—until you run the previous code again.

 

Similarly, if you plan to run different versions of an optimizer among these workers, it is best to activate each worker and run its corresponding optimizer code individually, without the @everywhere meta-command.

 

For the wrapper function driving the whole ensemble, we’ll use the following code. It’s basically a meta-PSO function, encompassing all the optimizers in the ensemble, through their corresponding workers:

function ppso(ff::Function, minimize::Bool = true, nv::Int64 = 4, ps::Int64 = 10*nv, ni::Int64 = 2000)
    np = nprocs()
    F = Any[ff for i = 1:np]
    Z = pmap(f -> pso(f, nv, minimize, ps, ni), F)   # one PSO run per process
    G = Array{Float64}(undef, np)
    for i = 1:np
        G[i] = Z[i][2]
    end
    if minimize
        ind = argmin(G)
    else
        ind = argmax(G)
    end
    println(G)
    return Z[ind]
end

 

In this case, we don’t need to use the @everywhere meta-command since the function doesn’t need to run on every single worker. Also, the println() command is not entirely necessary, though it can be helpful if you’d like to see how the various members of the ensemble perform.

 

To run the PSO ensemble on the fitness function FF, just type:

ppso(FF, true, 4)

This ensemble’s solution is very close to the global optima of [0, 0, 0, R] and [0, 0, R, 0] (where R is any real number), having a fitness value of 1.0. The solution found in this case is [-0.00527571, -1.66536e-6, 1.44652e-5, -3.29118e-6], corresponding to a fitness value of about 1.00008.

 

This is quite close to the actual optimal value, and the whole process of finding this solution was not noticeably slower than running a single PSO search.
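If you’d like to verify the shape of these optima, you can evaluate FF directly at a couple of such points; any vector whose first two elements are zero and whose last two elements include at least one zero yields the minimum fitness of 1.0:

FF([0.0, 0.0, 0.0, 5.0])    # returns 1.0
FF([0.0, 0.0, -2.3, 0.0])   # also returns 1.0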

 

Because of the default settings of the ppso() function, even if we typed ppso(FF), the ensemble would still work. Also remember that to use this function for a maximization problem, you must set the second parameter to false:

ppso(FF, false, 4)

 

A case study with a PSO and Firefly ensemble

Let’s now consider another basic optimization ensemble, this time involving two different optimizers: a PSO optimizer and a Firefly optimizer. The objective, in this case, is again to minimize a given function. To make things more interesting, we’ll have two optimizers for each algorithm, for a total of 4 workers.

 

In this example, we’ll need the code for both PSO and Firefly. For the latter, we need to employ some auxiliary functions, all of which are contained in the corresponding notebook. For brevity’s sake, we’ll omit that code here. The Firefly optimization system is implemented using a function called ffo().

 

For the ensemble, we can use the wrapper function co(), which stands for “combined optimizer”:

function co(ff::Function, minimize::Bool = true, nv::Int64 = 5)
    np = round(Int64, nprocs() / 2)   # number of processes for each optimizer
    N = 2*np
    F = Function[ff for i = 1:np]
    M = Bool[minimize for i = 1:np]
    NV = Int64[nv for i = 1:np]
    G = Array{Float64}(undef, N)
    X = pmap((ff, minimize, nv) -> pso(ff, minimize, nv), F, M, NV)   # PSO runs
    Y = pmap((ff, minimize, nv) -> ffo(ff, minimize, nv), F, M, NV)   # Firefly runs
    Z = vcat(X, Y)   # PSO results first, then Firefly, matching the discussion below
    for i = 1:N
        G[i] = Z[i][2]   # fitness of each run's best solution
    end
    if minimize
        ind = findmin(G)[2]
    else
        ind = findmax(G)[2]
    end
    println(G)
    return Z[ind]
end

 

As in the previous example, we don’t need to use the @everywhere macro for this function, since it doesn’t need to run on every single worker. Remember, we only need the @everywhere macro when code has to run on every worker, i.e. in parallel; wrapper functions generally don’t.

 

The println() command in the function is purely for educational purposes, allowing you to view the results of the various members of the ensemble. In this case, it’s particularly interesting, as we have optimizers from two different families (the first two being PSO and the last two being Firefly).

 

To test the whole thing on the fitness function FF (defined previously), we just need to type:

co(FF, true, 4)

This returns a solution very close to a global optimum, [-0.0103449, 6.77669e-8, -7.33842e-7, -7.93017e-10], corresponding to a fitness value of about 1.0001.

 

The fitness values of the best solutions of the optimizers are 1.01052, 1.00636, 1.0003, and 1.00011, respectively. 

Although the final result is not as accurate as the previous example’s, it is quite good overall. Even though the PSO systems didn’t do so well on this problem, the ensemble still managed to perform well, thanks to the two Firefly optimizers.

 

The co() function is versatile by design, in terms of the number of workers used. It basically allocates half the available workers to one optimizer, and the remaining workers to the other. You can experiment with it using different numbers of workers, particularly if you have enough processing units at your disposal.

 

How optimization ensembles fit into the data science pipeline

At this point, you may be wondering how all this optimization ensemble stuff is relevant to the data science process.

 

After all, you can get insights with more conventional methods such as basic machine learning systems and statistical models, or through the deep learning systems we examined in the first part of the blog. How do optimization ensembles improve upon these existing methodologies?

 

In short, most modern data science problems involve optimization in one way or another, and conventional optimizers just won’t cut it. Older methods worked well for the simple problems they were built for.

 

Just a few decades ago, analysts (with lots of patience and a knack for math) solved simple problems by hand. But data science is a huge realm with massive quantities of complex data, involving many dimensions and variables.

 

What’s more, although the optimizers we studied may be able to handle these problems, oftentimes we don’t have the time to fine-tune these systems. That’s where optimization ensembles come in.

 

From performing feature selection to optimizing the boundaries of parameters in a 1-class SVM for anomaly detection, optimization is essential to many data science projects.

 

More importantly, solving complex optimization problems efficiently through ensembles will give you a better perspective on problem-solving in data science. Many available methods are already optimized, but sometimes you may need to break new ground, tackling a problem that hasn’t been tackled before (at least, not in the open-source world).

 

That’s where experience with ensembles (and other topics just beyond the crash-course level) can come in handy since not all problems can or should be solved with off-the-shelf methods.

 

The scope of a good data scientist is no longer limited to traditional data science methods; it is rapidly expanding to include AI methods like optimization ensembles.

 

Ensemble tips

The optimization ensembles covered in this blog may seem straightforward, but as usual, you must keep a few key things in mind. For example, you must ensure that the optimizers are set up properly, by configuring the corresponding parameters with values that make sense for the problem at hand.

 

Otherwise, the results may not be that useful, or a lot of computational resources will be wasted (e.g. through delayed convergence).

 

Also, you are better off using optimizers you are quite comfortable with and that you understand well. This way, you can obtain good results from such a system, even if the optimizers you use are instances of the same algorithm.

 

Besides, due to their stochastic nature, optimizers are bound to yield different solutions every time you run them, so putting several of them in an ensemble can definitely have a positive impact on the result of the optimization.

 

When building an optimization ensemble, pay attention to the computing power at your disposal. Julia can create as many workers as you request, but if the number of workers exceeds the number of processing threads available on your machine, the ensemble will not be efficient.

 

Some threads will need to take on more tasks than others, leading to a workload imbalance that is bound to delay the whole system. Remember, the workers may correspond to individual processing units (such as CPU cores) when their number is small.

 

If that number exceeds the total number of processing units accessible to Julia, though, performance is bound to suffer, even if Julia doesn’t throw any errors or exceptions.
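As a quick sanity check before building a larger ensemble, you can compare the number of worker processes to the number of logical CPU threads Julia reports; this is a minimal sketch, and how you act on the warning is up to you:

using Distributed

# Warn if the workers outnumber the logical CPU threads on this machine
if nworkers() > Sys.CPU_THREADS
    @warn "More workers than logical CPU threads; some threads will be oversubscribed."
end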

 

What’s more, using an optimization ensemble does not guarantee that the solution will be a global optimum. Sometimes, even a group of ensembles can become trapped in local optima, yielding a sub-optimal solution at the end of the pipeline. Still, the chances of that are smaller when using an ensemble than when using a single optimizer.

 

Summary

Optimization ensembles (also called “hybrid systems”) utilize a combination of optimizers to tackle a given fitness function. Such an ensemble produces a result at least as good as that of any of its individual members.

 

Optimization ensembles come in different forms. Simple ensembles run two (or more) optimizers in parallel on a given problem and then compare their outputs. The optimizers can be of different algorithm families, though it’s best to use algorithms with similar running times, to ensure efficiency.

 

The coding in a basic optimization ensemble is quite straightforward. Use the @everywhere macro to make sure all the optimizer functions are available to all the workers in the ensemble.

 

Understanding the ins and outs of optimization ensembles will help you hone your problem-solving skills, and empower you to tackle challenging and complex data science problems. Such problems may not be commonplace, but when they do arise, they are often critical to the project at hand.

 

When working with optimization ensembles, remember to set the parameters properly and pay attention to the computing power available.

 

The result of an optimization ensemble is not guaranteed to be a global optimum, but it is almost certainly better than the result of a single optimizer run on its own.
