gplearn Symbolic Regression
by Andrea Castiglioni | Analytics Vidhya
1. Introduction
Imagine you are a scientist working in any field. You can split your work into three steps: the first is to gather the data, the second is to propose a phenomenological explanation of those data, and the third is to trace those data and phenomenological observations back to first principles.
Imagine you are about to gather data and discover gravity. There are three scientists we should remember in this field: Tycho Brahe (data acquisition), who took extensive and precise measurements of the positions of the planets over time.
Johannes Kepler, who, from Brahe's measurements, derived analytical expressions that describe the motion of the solar system in a concise manner (data analysis).
Finally, the theorist: Isaac Newton. He recognized the mechanism underlying the planets' motion around the Sun, which could be formulated as a universal law (derivation from first principles).
Machine learning (ML) models are currently the tools of choice for uncovering these physical laws. Although they have shown promising performance in predicting materials properties, typical parameterized machine learning models do not allow the ultimate step of scientific research: automating Newton. Consider, for example, ML/deep learning algorithms that can predict with a perfect score the number of infected people in a global pandemic: if they can't explain that number, they are of limited use. That's why simpler mechanistic models like SIR are still widely used. ML models can be predictive, but their descriptions are often too verbose (e.g. deep learning models with thousands of parameters) or mathematically restrictive (e.g. assuming the target variable is a linear combination of the input features).
Genetic programming flowchart depicting the iterative solution-finding process. Source: arXiv.
Let's generate some noisy sample data to fit:

import numpy as np

nsample = 400   # number of samples
sig = 0.2       # noise amplitude
x = np.linspace(-50, 50, nsample)
# design matrix: linear, sinusoidal, cubic and constant columns
X = np.column_stack((x/5, 10*np.sin(x), (x-5)**3, np.ones(nsample)))
beta = [0.01, 1, 0.001, 5.]
y = np.dot(X, beta) + sig * np.random.normal(size=nsample)  # assumed target construction
Function to be predicted.
We import from sympy; in particular, we will use sympify to make the resulting expression more readable.
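A minimal sketch of those imports; sin and cos are pulled in as well, since the converter dictionary defined below needs them to be SymPy functions:

from sympy import sympify, sin, cos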
Since we will be comparing the results from different approaches, we split our dataset into train and test sets:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({'x': x, 'y': y})  # assumed: the generated data collected in a DataFrame

X = df[['x']]
y = df['y']
y_true = y
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.30,
                                                    random_state=42)
Our first fit will be extremely easy: in the function set we include most, if not all, of the default functions that come with the library. Then we initialize the regressor with some standard parameters.
With this code we're telling the regressor to use the functions from the function set we just defined.
# First Test
from gplearn.genetic import SymbolicRegressor

function_set = ['add', 'sub', 'mul', 'div', 'cos', 'sin', 'neg', 'inv']
est_gp = SymbolicRegressor(population_size=5000,
                           function_set=function_set,
                           generations=40, stopping_criteria=0.01,
                           p_crossover=0.7, p_subtree_mutation=0.1,
                           p_hoist_mutation=0.05, p_point_mutation=0.1,
                           max_samples=0.9, verbose=1,
                           parsimony_coefficient=0.01, random_state=0,
                           feature_names=X_train.columns)
We define a dictionary called converter: its use will become clear later, but essentially we need it to convert a function name, or string, into its equivalent Python mathematical code.
converter = {
'sub': lambda x, y : x - y,
'div': lambda x, y : x/y,
'mul': lambda x, y : x*y,
'add': lambda x, y : x + y,
'neg': lambda x : -x,
'pow': lambda x, y : x**y,
'sin': lambda x : sin(x),
'cos': lambda x : cos(x),
'inv': lambda x: 1/x,
'sqrt': lambda x: x**0.5,
'pow3': lambda x: x**3
}
We then fit the data, evaluate our R2 score, and with the help of sympify we print the resulting expression:
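A sketch of this step, mirroring the code shown later for the second fit (the str() cast around est_gp._program is added here for robustness with recent SymPy versions):

est_gp.fit(X_train, y_train)
print('R2:', est_gp.score(X_test, y_test))
# convert the winning program into a readable SymPy expression
best_expr = sympify(str(est_gp._program), locals=converter)
best_expr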
With verbose=1 the output will look something like this: the first column indicates the generation number; then we have the average length and fitness of the whole population (5000 individuals); on the right we have the statistics of the best individual.
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# fit two standard ML baselines on the same training data
est_tree = DecisionTreeRegressor(max_depth=5)
est_tree.fit(X_train, y_train)
est_rf = RandomForestRegressor(n_estimators=100, max_depth=5)
est_rf.fit(X_train, y_train)

# predictions and R2 scores on the test set
y_gp = est_gp.predict(X_test)
score_gp = est_gp.score(X_test, y_test)
y_tree = est_tree.predict(X_test)
score_tree = est_tree.score(X_test, y_test)
y_rf = est_rf.predict(X_test)
score_rf = est_rf.score(X_test, y_test)
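A quick, trivial sketch for printing the three scores side by side:

for name, score in [('RF', score_rf), ('DT', score_tree), ('GPlearn', score_gp)]:
    print(f'{name} score: {score:.3f}')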
From the scores above, the result does not seem impressive at first: both ML tools scored slightly better than our analytical expression:

RF score: 0.993
DT score: 0.992
GPlearn score: 0.978

However, if we plot the results, we can see that our analytical function is actually better in the middle range, while it drifts off at the extreme points.
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 8))
# assumed loop: panel 1 shows the raw test data, panels 2-4 each model's test predictions
for i, (y_pred, title) in enumerate([(y_test, 'Original data'), (y_gp, 'GP'),
                                     (y_tree, 'DT'), (y_rf, 'RF')]):
    ax = fig.add_subplot(2, 2, i + 1)
    points = ax.scatter(X, y_true, color='green', alpha=0.5)
    test = ax.scatter(X_test, y_pred, color='red', alpha=0.5)
    plt.title(title)
plt.show()
Comparison of full data (green points) and test data (red points) for three different approaches. Top left
indicates original data.
Above: relationship between y_pred and y_true. Below: prediction error for the three algorithms.
We can help the regressor by adding a custom function, the cube, to the function set via gplearn's make_function:
from gplearn.functions import make_function

# custom cube primitive: x -> x**3
def pow_3(x1):
    return x1**3

pow_3 = make_function(function=pow_3, name='pow3', arity=1)
function_set.append(pow_3)  # assumed: add the new primitive to the function set
est_gp = SymbolicRegressor(population_size=5000,
                           function_set=function_set,
                           generations=45, stopping_criteria=0.01,
                           p_crossover=0.7, p_subtree_mutation=0.1,
                           p_hoist_mutation=0.05, p_point_mutation=0.1,
                           max_samples=0.9, verbose=1,
                           parsimony_coefficient=0.01, random_state=0,
                           feature_names=X_train.columns)
est_gp.fit(X_train, y_train)
print('R2:', est_gp.score(X_test, y_test))
next_e = sympify(str(est_gp._program), locals=converter)
next_e
Result of gplearn
The R2 is now really close to 1; also note how much simpler our equation is! If we compare again to the traditional ML tools, we can really appreciate the improvement brought by this step.
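For completeness, a minimal sketch of how that comparison could be repeated, reusing names from the earlier snippets:

print('GPlearn with pow3:', est_gp.score(X_test, y_test))
print('RandomForest:', score_rf)
print('DecisionTree:', score_tree)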
Above: relationship between y_pred and y_true. Below: prediction error for the three algorithms.
6. Conclusion
I think symbolic regression is a great tool to be aware of. It perhaps isn't perfect for every kind of approach, but it gives you another option, which can be really useful as the outcome is readily understandable.