Published in Analytics Vidhya


Andrea Castiglioni

Aug 3, 2020 · 7 min read

Python Symbolic Regression with gplearn: how to discover analytical relationships in your data
In this tutorial I want to introduce you to Genetic Programming in Python
with the library gplearn.

Symbolic regression is a machine learning technique that aims to identify an underlying mathematical expression that best describes a relationship. It begins by building a population of naive random formulas to represent a relationship between known independent variables and their dependent variable targets in order to predict new data. Each successive generation of programs is then evolved from the one that came before it by selecting the fittest individuals from the population to undergo genetic operations.

1. Introduction
Imagine you were a scientist working in any field. You can basically split your work into three steps: the first is to gather the data, the second is to propose a phenomenological explanation of those data, and the third is to trace those data and phenomenological observations back to first principles.
Imagine you are about to gather data and discover gravity. There are three scientists we should remember in this field: Tycho Brahe (data acquisition), who took extensive and precise measurements of the positions of the planets over time.
Johannes Kepler, who, from Brahe's measurements, derived analytical expressions that describe the motion of the solar system in a concise manner (data analysis).
Finally, the theorist, Isaac Newton. He recognized the mechanism underlying the planets' travel around the Sun, which could be formulated into a universal law (derivation from first principles).

Machine learning (ML) models are currently the tools of choice for uncovering such physical laws. Although they have shown some promising performance in predicting material properties, typical parameterized machine learning models do not allow the ultimate step of scientific research: automating Newton. Consider, for example, ML/deep learning algorithms that can predict with a perfect score the number of infected people in a global pandemic: if they can't explain that number, they are of limited use. That's why simpler mechanistic models like SIR are still widely used. ML models can be predictive, but their descriptions are often too verbose (e.g. deep-learning models with thousands of parameters) or mathematically restrictive (e.g. assuming the target variable is a linear combination of input features).

In this tutorial we focus on an alternative to machine-learning models: symbolic regression. Symbolic regression simultaneously searches for the optimal form of a function and its set of parameters for the given problem, and is a powerful regression technique when little, if any, a priori knowledge of the data structure/distribution is available.

Genetic programming flowchart depicting the iterative solution-finding process (source: arXiv).
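To make the loop in the flowchart concrete, here is a deliberately tiny, self-contained sketch of the idea. This is purely illustrative and not gplearn's implementation: candidate formulas are plain strings, fitness is the mean absolute error, evolution uses only selection and point mutation (no crossover), and the target x + x**2 is made up for the example.

import random
import numpy as np

# toy data and a made-up target relationship to rediscover
x = np.linspace(-1, 1, 50)
target = x + x**2

OPS = ['+', '-', '*']
TERMS = ['x', 'x**2', 'x**3', '1']

def random_formula():
    # naive random formula: term operator term
    return f"{random.choice(TERMS)} {random.choice(OPS)} {random.choice(TERMS)}"

def fitness(formula):
    # mean absolute error between the formula's output and the target
    pred = eval(formula, {"x": x, "__builtins__": {}})
    return np.mean(np.abs(pred - target))

def mutate(formula):
    # point mutation: swap one token for another of the same kind
    parts = formula.split()
    i = random.randrange(len(parts))
    parts[i] = random.choice(OPS) if parts[i] in OPS else random.choice(TERMS)
    return " ".join(parts)

population = [random_formula() for _ in range(200)]
for generation in range(20):
    population.sort(key=fitness)    # evaluate and rank the population
    survivors = population[:50]     # select the fittest individuals
    population = survivors + [mutate(random.choice(survivors)) for _ in range(150)]

print(sorted(population, key=fitness)[0])   # best formula found

gplearn does all of this (and much more, including crossover and tree-based programs) for us.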

2. Initial dataset and data analysis


We generate the data like we did in the regression tutorial. The only difference here is that we modify the number of samples and the coefficients to give us a more complex function. We see that the result is the combination of a linear term in x, a sin(x) term and an x**3 term.

import numpy as np
import pandas as pd

# generate a noisy target: linear term + sine term + cubic term + intercept
nsample = 400
sig = 0.2
x = np.linspace(-50, 50, nsample)
X = np.column_stack((x/5, 10*np.sin(x), (x-5)**3, np.ones(nsample)))
beta = [0.01, 1, 0.001, 5.]

y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

df = pd.DataFrame()
df['x'] = x
df['y'] = y

The shape of the curve is illustrated below:

Function to be predicted.
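The plotting code for this figure isn't shown in the article; a minimal sketch that reproduces the view (assuming matplotlib) could be:

import matplotlib.pyplot as plt

# quick look at the generated curve
plt.scatter(df['x'], df['y'], s=5, alpha=0.5)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Function to be predicted')
plt.show()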

3. GPlearn imports and implementation


We will import SymbolicRegressor from gplearn, and also the decision tree and random forest regressors from sklearn, against which we will compare the results.

From sympy we will use sympify to make the resulting expression more readable.

Basic libraries to import from gplearn, sklearn and sympy.
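The import cell itself is embedded as an image in the original post; based on the code that follows, it presumably looks something like this (the exact cell is an assumption):

from gplearn.genetic import SymbolicRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sympy import sympify, sin, cos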

Since we will be comparing the results from different approaches, we split our
dataset in train/test:

X = df[['x']]
y = df['y']
y_true = y
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.30, random_state=42)

Our first fit will be extremely easy: in the function set we include most, if not all, of the default functions which come with the library. Then we initialize the regressor with some standard parameters.

With this code we're telling the regressor to use the functions from the function_set list. The number of generations (evolutions) of the expression will be 40, with an initial population of 5000. Each of those 5000 programs will combine with one another to form expressions which will then be evaluated. The evaluation function assesses the "fitness" of each program with an error metric (by default gplearn uses the mean absolute error). There's a stopping_criteria, set here to 0.01, which halts evolution early if a program reaches this fitness before the last generation. Then there are four parameters linked with evolution: p_crossover, p_subtree_mutation, p_hoist_mutation and p_point_mutation. For more detail about the specific use of those parameters, see the documentation. One thing you should note, however, is that the sum of these probabilities should be less than or equal to 1. We can also try to control the total length of our expression with parsimony_coefficient: the higher this parameter, the more strongly long expressions are penalized and the shorter the regressor will try to keep the result. With max_samples=0.9 we're telling the fitness function to evaluate each program against 90% of our data.

# First Test
function_set = ['add', 'sub', 'mul', 'div', 'cos', 'sin', 'neg', 'inv']

est_gp = SymbolicRegressor(population_size=5000,
                           function_set=function_set,
                           generations=40, stopping_criteria=0.01,
                           p_crossover=0.7, p_subtree_mutation=0.1,
                           p_hoist_mutation=0.05, p_point_mutation=0.1,
                           max_samples=0.9, verbose=1,
                           parsimony_coefficient=0.01, random_state=0,
                           feature_names=X_train.columns)

We define a dictionary called converter: its use will become clear later, but essentially we need it to convert a function name, i.e. a string, into its equivalent mathematical expression for sympy.

converter = {
    'sub': lambda x, y: x - y,
    'div': lambda x, y: x / y,
    'mul': lambda x, y: x * y,
    'add': lambda x, y: x + y,
    'neg': lambda x: -x,
    'pow': lambda x, y: x**y,
    'sin': lambda x: sin(x),
    'cos': lambda x: cos(x),
    'inv': lambda x: 1/x,
    'sqrt': lambda x: x**0.5,
    'pow3': lambda x: x**3
}

We then fit the data, evaluate our R2 score and, with the help of sympify, print the resulting expression:

est_gp.fit(X_train, y_train)


print('R2:',est_gp.score(X_test,y_test))
next_e = sympify((est_gp._program), locals=converter)
next_e

With verbose=1 the output will look something like this: the first column indicates the generation number, then we have the average length and fitness of the whole population (5000 programs), and on the right we have the length, fitness and out-of-bag fitness of the best individual in each generation.

With the code above, the result will look like this:

R squared and analytical function from gplearn.

4. Comparing Gplearn to traditional ML approaches
We reached a good level of correlation with the original expression. However, I must say that it's particularly difficult to interpret this equation! Let's see how it compares with standard ML tools:

# fit the two reference models
est_tree = DecisionTreeRegressor(max_depth=5)
est_tree.fit(X_train, y_train)
est_rf = RandomForestRegressor(n_estimators=100, max_depth=5)
est_rf.fit(X_train, y_train)

# predictions and R2 scores on the test set
y_gp = est_gp.predict(X_test)
score_gp = est_gp.score(X_test, y_test)
y_tree = est_tree.predict(X_test)
score_tree = est_tree.score(X_test, y_test)
y_rf = est_rf.predict(X_test)
score_rf = est_rf.score(X_test, y_test)

From the scores above, the result does not look impressive at first: both ML tools scored pretty well compared to our analytical expression:

RF score:0.993
DT score: 0.992
GPlearn score: 0.978

However, if we plot the results, we can see that our analytical function is actually better in the middle range, while it goes off at the extreme points.

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 10))

# full dataset in green, test-set values/predictions in red
for i, (y, score, title) in enumerate([(y_test, None, "Ground Truth"),
                                       (y_gp, score_gp, "SymbolicRegressor"),
                                       (y_tree, score_tree, "DecisionTreeRegressor"),
                                       (y_rf, score_rf, "RandomForestRegressor")]):
    ax = fig.add_subplot(2, 2, i+1)
    points = ax.scatter(X, y_true, color='green', alpha=0.5)
    test = ax.scatter(X_test, y, color='red', alpha=0.5)
    plt.title(title)

plt.show()


Comparison of full data (green points) and test data (red points) for three different approaches. Top left indicates original data.

Above: relationship between y_pred and y_true. Below: prediction error for the three algorithms.
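The code for these diagnostic plots isn't shown either; a possible sketch, reusing the test-set predictions computed above, is:

# sketch (not the article's original code): predicted vs. true values (top row)
# and prediction error along x (bottom row) for the three models
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

for j, (y_pred, title) in enumerate([(y_gp, "SymbolicRegressor"),
                                     (y_tree, "DecisionTreeRegressor"),
                                     (y_rf, "RandomForestRegressor")]):
    axes[0, j].scatter(y_test, y_pred, alpha=0.5)
    axes[0, j].set_xlabel('y_true')
    axes[0, j].set_ylabel('y_pred')
    axes[0, j].set_title(title)

    axes[1, j].scatter(X_test['x'], y_test - y_pred, alpha=0.5)
    axes[1, j].set_xlabel('x')
    axes[1, j].set_ylabel('prediction error')

plt.tight_layout()
plt.show()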

5. Improving the result
By looking at the function and using some prior knowledge of mathematics, we can improve our algorithm by feeding it some extra functions. gplearn allows us to create custom functions, and here we will implement a simple cubic function.

from gplearn.functions import make_function

# custom cubic function to add to gplearn's function set
def pow_3(x1):
    f = x1**3
    return f

pow_3 = make_function(function=pow_3, name='pow3', arity=1)

# add the new function to the function_set
function_set = ['add', 'sub', 'mul', 'div',
                'cos', 'sin', 'neg', 'inv', pow_3]

est_gp = SymbolicRegressor(population_size=5000,
                           function_set=function_set,
                           generations=45, stopping_criteria=0.01,
                           p_crossover=0.7, p_subtree_mutation=0.1,
                           p_hoist_mutation=0.05, p_point_mutation=0.1,
                           max_samples=0.9, verbose=1,
                           parsimony_coefficient=0.01, random_state=0,
                           feature_names=X_train.columns)

est_gp.fit(X_train, y_train)
print('R2:', est_gp.score(X_test, y_test))
next_e = sympify((est_gp._program), locals=converter)
next_e

Result of gplearn

The R2 is now really close to 1; also note how much simpler our equation is! If we compare again with the traditional ML tools, we can really appreciate the improvement from this step.

Above: relationship between y_pred and y_true. Below: prediction error for the three algorithms.

6. Conclusion
I think symbolic regression is a great tool to be aware of. It isn't perhaps perfect for every kind of approach, but it gives you another option, which can be really useful as the outcome is readily understandable.
