Interpolation using a genetic algorithm
This weekend I was in Milan to get a visa and I had the opportunity to work with a friend, Michele, on genetic algorithms. It was the first time I dig up in such field and it was very exciting. In this post I want to explain some bits of our work.
A brief introduction to GA
A genetic algorithm is a search/optimization algorithm that uses an heuristic approach to reduce the search space and evolve gradually to a solution.
Population
It is an algorithm that has its root in the theory of natural selectioni by Charles Darwin. The main components of a GA are:
- the population, that concentrate all the available solutions at a given time;
- the fitness function, that gives an approximation of the quality of the solution codified by a given member of the population.
In a GA the first thing to do is to generate a population.
A population is a group of objects with given attributes, usually a string, and they contains in some form the solution (usually inside a string); the first population is randomly generated and contains a big number of solutions, but not every solution (this is not a bruteforce approach).
After this step the fitness functions evaluates the quality of every solution that a given member carries: the evaluation should be considered from a bottom up point of view.
Reproduction
Now, as in Darwin's theory of evolution, the member of the population are going to "reproduce": two members are going to be coupled to generate a new member of the second generation and every child member will contain a solution that is the product of the original genes of their parent members.
This time the reproduction of the population into a second one is not entirely random. The fitness function gives us an approximation of the quality of every gene that a member carries and by the rule of the "survival by the fittest" the probability that a member is going to reproduce with another one is proportional to the quality of its genes.
When we have a second generation of members we can recur on our GA and generate a third generation. From this point we can recur until we converge to a solution that is common to every member, or at least that is suited to our needs.
Mutation
Actually, in some cases, a mutation function can be added, so that, like in real world, some times the genes are "scrambled" indipendently from the fitness function.
There is more to a GA, for example we could talk about possible ways of storing the genes inside a member or when to use mutation, anyway I want to stop here and continue with an analysis of my problem.
Interpolating a function using a GA
Me and Michele decided to spend some time developing a little python script to explore GA capabilities and we decided to interpolate some points on a cartesian plane.
Our program, that is available here uses a class to define the various members of the population and a string for the genes, a class as well for the points on the plane.
The fitness function is not as precise as it should be because this is only a proof of concept:
.. code:: python
mutationProbability = 0.1 rangeLimit = 5 def fitness(item, pointList, n): value = 0 for p in pointList: y = 0 for i in range(n): y += item.gene[i] * pow(p.x, i) result = 1 - (abs (p.y - y) / rangeLimit) if result < 0: result = 0 value += result return value / n
item is just a member of the population, poinList is the list of points and n is the number of points (n - 1 is the grade of the function).
for i in range(n): y += item.gene[i] * pow(p.x, i)
this piece of code gives us the value of the function encoded in the genes in the points of pointList;
result = 1 - (abs (p.y - y) / rangeLimit) if result < 0: result = 0
while here the script stores 1 - the previous result because if the GA has yield a good result there should be distance = 0 from the function evaluated and the points; If this is the case, the fitness function should attribute the highest possible reproduction probability for that member. At the end the fitness function returns the total value over the number of points evaluated.
As you can see this fitness function is by no means an optimal one. The reproduction probability is higher for functions that crosses some points and are really distant from others rather than for functions that are closer to every point but crosses none. Anyway for simple cases the GA yields good results, as an example for points (0 0), (1 4), (2 9) one of the member with the highest reproduction probability has this function in its genes:
-0.0487839869993989 * x^0 + 4.600339125358671 * x^1 + -0.2780958075230644 * x^2
that crosses this points: (0 -0.0488), (1 4.2735), (2 8.0395) given 80 iterations, initial population of 600 members and a two digit approximation.
For a more precise computation a higher population size and a really high number of iterations should be used.