Data-Engineering EINDE
1. Pre-processing
Our first step was to load the data from Kaggle (Mathien, 2016). We used pandas DataFrames so that everything was immediately in a convenient format. The first merge combined the match dataframe with the player dataframe. As this merged dataframe had already been built for the exploratory analysis, we simply reused it. Once these two dataframes were merged, we proceeded to drop the columns that were not needed for machine learning or that had been duplicated by the merge.
Next, we merged the player dataframe with the player-attributes dataframe and again dropped the unnecessary columns. Another problem we came across was that the player attributes were stored as a time series, which gave an abundance of observations per player. We therefore used groupby to take the average of the continuous variables and the mode of the discrete ones. Since this produced two separate dataframes, we merged the continuous and discrete variables back into a single dataframe.
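As an illustration, a minimal sketch of this aggregation step is given below; df_players_merged is the dataframe used in the appendix, the column subsets are only examples, and taking the first mode on ties is an assumption.

# minimal sketch of the per-player aggregation (column subsets are illustrative)
cont_cols = ['overall_rating', 'potential']            # continuous attributes
disc_cols = ['preferred_foot', 'attacking_work_rate']  # discrete attributes

# mean per player for the continuous variables
df_players_cont = df_players_merged.groupby('player_name')[cont_cols].mean()
# mode per player for the discrete variables (take the first mode if there are ties)
df_players_disc = df_players_merged.groupby('player_name')[disc_cols].agg(lambda s: s.mode().iloc[0])

# merge the continuous and discrete summaries back into one dataframe
df_players_agg = df_players_cont.join(df_players_disc)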
After this we were left with two large dataframes that still had to be merged, which was done in the next step. We then dropped the player names, which we had only used for the merge, because the names themselves should not influence the outcome of a match; only the players' attributes should.
Next, we defined our target variable: whether or not the home team would lose. We determined this by comparing the goals scored by the home team with those scored by the away team. Finally, we merged this target one last time with the big dataframe.
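As a minimal sketch, the target can be derived directly from the goal columns of the Kaggle Match table (the dataframe name df_match is illustrative):

# 1 if the home team lost the match, 0 otherwise
df_match['home_team_lost'] = (df_match['home_team_goal'] < df_match['away_team_goal']).astype(int)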
Then the second big task of the data pre-processing started. Our first step was to check where all the NaNs were in our dataset; we also checked for infinities and zeros.
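A quick sanity check of this kind could look as follows; df_COMPLETE is the merged dataframe from the appendix, and the exact checks used may have differed.

import numpy as np

print(df_COMPLETE.isnull().sum())             # NaN's per column
numeric = df_COMPLETE.select_dtypes(include=[np.number])
print(np.isinf(numeric).sum())                # infinities per numeric column
print((numeric == 0).sum())                   # zeros per numeric column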
2. Exploratory data analysis
In this project we worked with data on more than 25,000 matches played in the seasons 2008 to 2016 (Mathien, 2016). The database contained seven tables: Country, League, Match, Player, Player_Attributes, Team and Team_Attributes, and was provided as an SQLite database.
For the exploratory data analysis, three questions were proposed: for each year, what are the most successful teams? What are the attributes of the top 1% of scoring players? And what is the age of each player in every match? To answer these questions, the first step was to copy the dataframes so that the original data would not be corrupted if a mistake was made. Each question was then answered with its own set of steps.
We first determined which teams were the most successful. Based on the final dataframe we obtained, we made a graph showing the points of the best teams of each year from 2008 to 2016.
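As a rough illustration (not necessarily the exact steps used in the report), season points per team can be derived from the goal columns of the Match table and then reduced to the best score per season:

import numpy as np
import pandas as pd

# points from home fixtures: 3 for a win, 1 for a draw, 0 for a loss
home = df_match[['season', 'home_team_api_id', 'home_team_goal', 'away_team_goal']].copy()
home['points'] = np.select([home['home_team_goal'] > home['away_team_goal'],
                            home['home_team_goal'] == home['away_team_goal']], [3, 1], default=0)
home = home.rename(columns={'home_team_api_id': 'team_api_id'})
# points from away fixtures
away = df_match[['season', 'away_team_api_id', 'home_team_goal', 'away_team_goal']].copy()
away['points'] = np.select([away['away_team_goal'] > away['home_team_goal'],
                            away['away_team_goal'] == away['home_team_goal']], [3, 1], default=0)
away = away.rename(columns={'away_team_api_id': 'team_api_id'})
# total points per team per season, then the best score of every season
season_points = pd.concat([home, away]).groupby(['season', 'team_api_id'])['points'].sum()
best_per_season = season_points.groupby('season').max()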
Secondly, we determined the characteristics of the top 1% of scoring players in a similar series of steps.
From the final dataframe obtained in this part, we made a graph of the player names against their overall rating. The graph shows the average overall rating of the ten top scoring players.
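A minimal sketch of such a bar chart with seaborn, assuming a dataframe df_top_scorers that holds the ten top scoring players and their average overall rating (the name is illustrative):

import seaborn as sns
import matplotlib.pyplot as plt

sns.barplot(data=df_top_scorers, x='player_name', y='overall_rating')
plt.xticks(rotation=45)     # keep the player names readable
plt.tight_layout()
plt.show()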
Finally, to calculate the age of each player for each match, we created loops for both the home and the away teams after making the respective copy of the dataframe; the corresponding code is included in the appendix.
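Condensed to a single player slot, the age calculation looks roughly like this (the full loops over all eleven home and away players are reproduced in the appendix):

import pandas as pd

match_date = pd.to_datetime(df_match_processed['date'])
birthday = pd.to_datetime(df_match_processed['birthday_home_player_1'])
# age at the time of the match, in years rounded to one decimal
df_match_processed['age_home_player_1'] = round((match_date - birthday).dt.days / 365, 1)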
3. Machine learning
Four models were built: a logistic regression, a K-nearest neighbours (KNN) classifier, a decision tree, and a random forest evaluated with K-fold cross-validation. The target variable was whether or not the home team would lose. The input variables were the attributes of the players and three bookmaker odds: the odds that the home team would win, that the match would end in a draw, and that the away team would win. For efficiency's sake we first dropped all columns that were not needed; this was done in the pre-processing step, so the resulting dataframe could be used directly for machine learning.
The first step of the machine learning was to create random indices for a training, validation and test set. We used the validation set to make sure we did not overfit the model, and the test set to calculate the AUC and accuracy. After this, we extracted all features and targets from the dataframe. Then we pre-processed the discrete variables, which in practice means that we dummy-encoded (one-hot encoded) all of them, including the discrete input variables and the target variable. After that we pre-processed the continuous variables by scaling them, so that all variables carried equal weight in the machine learning algorithms.
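A minimal sketch of the dummy-encoding and scaling is given below; discrete_columns and continuous_columns are illustrative placeholders for the actual column lists, and the creation of the random training, validation and test indices is illustrated together with the split in the sketch after the next paragraph.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# illustrative placeholders for the actual column lists
discrete_columns = ['preferred_foot_home_player_1']
continuous_columns = ['overall_rating_home_player_1', 'BWH']

# dummy-encode ("dummify") the discrete variables
Discrete = pd.get_dummies(df_COMPLETE[discrete_columns])

# scale the continuous variables to [0, 1] so they carry equal weight
scaler = MinMaxScaler()
specs_numbers_prep = pd.DataFrame(scaler.fit_transform(df_COMPLETE[continuous_columns]),
                                  columns=continuous_columns, index=df_COMPLETE.index)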
Secondly, we concatenated all the variables: the discrete and continuous inputs as well as the target variable. The first row of the concatenated dataframe consisted entirely of NaNs, so we dropped it. After that we dropped all remaining NaNs, '_0' and 0 values. This was done on all variables together, so that dropping a row from the training data also dropped it from the target variable. We then took the target variable back out of the dataframe and, lastly, created the training, validation and test sets. Once this was done we could start with the machine learning itself.
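Continuing the previous sketch, the concatenation, cleaning and splitting could look as follows; the column name 'target', the split fractions and the random seed are assumptions, not the exact values used.

import numpy as np
import pandas as pd

# put features and target together so rows are dropped consistently
horizontal_stack = pd.concat([specs_numbers_prep, ODDS_coding, Discrete, TARGET], axis=1)
horizontal_stack = horizontal_stack.iloc[1:]    # the first row was entirely NaN
horizontal_stack = horizontal_stack.dropna()    # drop the remaining rows with missing values

# take the target back out of the dataframe ('target' as column name is an assumption)
TARGET_clean = horizontal_stack.pop('target')

# random indices for the test, validation and training sets (fractions and seed are assumptions)
rng = np.random.default_rng(0)
indices = rng.permutation(len(horizontal_stack))
n_test, n_val = int(0.2 * len(indices)), int(0.1 * len(indices))
indices_test_list = indices[:n_test]
indices_val_list = indices[n_test:n_test + n_val]
indices_train_list = indices[n_test + n_val:]

X_test, Y_test = horizontal_stack.iloc[indices_test_list], TARGET_clean.iloc[indices_test_list]
X_val, Y_val = horizontal_stack.iloc[indices_val_list], TARGET_clean.iloc[indices_val_list]
X_train, Y_train = horizontal_stack.iloc[indices_train_list], TARGET_clean.iloc[indices_train_list]
X_train_total, Y_train_total = pd.concat([X_train, X_val]), pd.concat([Y_train, Y_val])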
The literature states that football matches are hard to predict with machine learning; the market predicted roughly 70% of the games correctly in 2019 (Empirics Asia, 2019), which is about the accuracy we reached. However, we are working with imbalanced data: only 28.77% of the matches were lost by the home team, so a model that always predicts "not lost" already reaches roughly 71% accuracy. This led us to look at the AUC instead of the accuracy.
The first model we checked was the logistic regression. This model outperformed the rest by a small margin: its accuracy was 72.24% and its AUC was 65.22%. The area under the curve is displayed in the ROC graph; the closer the ROC curve lies to the top-left corner, the better the model.
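A minimal sketch of how this model and its ROC curve can be produced with scikit-learn (the hyperparameters are illustrative):

import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_total, Y_train_total)

probs = log_reg.predict_proba(X_test)[:, 1]
print('accuracy:', accuracy_score(Y_test, log_reg.predict(X_test)))
print('AUC:', roc_auc_score(Y_test, probs))

# ROC curve: the closer it lies to the top-left corner, the better
fpr, tpr, _ = roc_curve(Y_test, probs)
plt.plot(fpr, tpr, label='logistic regression')
plt.plot([0, 1], [0, 1], linestyle='--', label='chance level')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()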
The second model we checked was KNN. However, this model had no real discriminative power, as it simply predicted the majority class (the home team not losing) for every match, which led to an accuracy of 71.87%. A way to fix this would have been to use a different cut-off value. Another technique that might have been useful is NearMiss undersampling. We would not recommend SMOTE, since that technique oversamples the minority class with synthetic samples and Python had already warned us that the training dataset was large. However, we also looked at the AUC, as this metric is classification-threshold-invariant; the AUC was 63.33%.
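A minimal sketch of the KNN model (the number of neighbours is an illustrative choice):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_total, Y_train_total)
knn_probs = knn.predict_proba(X_test)[:, 1]
print('KNN accuracy:', accuracy_score(Y_test, knn.predict(X_test)))
print('KNN AUC:', roc_auc_score(Y_test, knn_probs))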
The last technique we tried was a decision tree, which was by far the worst model. It had the same 71.87% accuracy as KNN and the same problem of having no discriminative power, but its AUC was lower than KNN's: only 60.41%. We also tried a random forest with K-fold cross-validation, but it did not have a significantly higher AUC.
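A minimal sketch of the decision tree and the random forest with K-fold cross-validation (tree depth, forest size and number of folds are illustrative):

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import roc_auc_score

tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train_total, Y_train_total)
print('tree AUC:', roc_auc_score(Y_test, tree.predict_proba(X_test)[:, 1]))

forest = RandomForestClassifier(n_estimators=100, random_state=0)
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(forest, X_train_total, Y_train_total, cv=kfold, scoring='roc_auc')
print('random forest AUC (K-fold mean):', scores.mean())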
Model / Metric     Logistic Regression   KNN       Decision Tree   Random Forest
Accuracy           72.24%                71.87%    71.87%          -
AUC                65.22%                63.33%    60.41%          -
To compare the models we looked at the AUC values and the ROC graphs. We chose to show only the two models with the highest AUC values, and as you can see, the difference between them is minimal. We eventually went with the logistic regression because it has the highest AUC and, on top of that, the best accuracy. From our experiments with these machine learning models on football matches we conclude that logistic regression worked best for us.
4. References
● Empirics Asia. (2019, December 19). How I Used Machine Learning to Predict Football Games for 24 Months Straight. Retrieved from Empirics Asia: https://empirics.asia/how-i-used-machine-learning-to-predict-football-games-for-24-months-straight/
● Mathien, H. (2016, October 16). European Soccer Database. Retrieved from Kaggle: https://www.kaggle.com/hugomathien/soccer
● Seaborn documentation: https://seaborn.pydata.org/tutorial.html
● Scikit-learn documentation: https://scikit-learn.org/stable/user_guide.htm
5. Appendix
# import libraries
import pandas as pd
import pyodbc as pdb
import numpy as np
import datetime as dt
import shutil
import math
import os
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
from heapq import nlargest
#%% data cleaning: we will now clean both datasets so we can cleanly merge them. This step prepares for the ML
# this dataframe was made in the exploratory part; we put the code at the end of the document
# after BWH,BWA and BWD we drop everything we don't understand (IWH till BSA)
df_match_processed.drop(df_match_processed.columns[4:28], axis=1, inplace=True)
#now we remove the age of the players as this is not explanatory in our model
df_match_processed = df_match_processed[df_match_processed.columns.drop(list(df_match_processed.filter(regex='age')))]
#now we groupby so that there is no time-evolution for the players, but an overall mean (continuous) or mode (discrete)
#first of all we groupby the continuous variables
df_players_cont = df_players_merged.groupby('player_name').mean()
df_match_player_merged = df_match_player_merged.rename(columns={"player_name": f"away_player_name_{x}"})  # f-string so the player index x is filled in
#%% names of the players should not influence the match, but their specs should. So we drop all of their names
df_match_player_merged_final = df_match_player_merged.copy()
df_match_player_merged_final.drop(df_match_player_merged_final.columns[4:26], axis=1, inplace=True)
#getting the results back
return Percen
percent_missing = df_COMPLETE.isnull().sum() * 100 / len(df_COMPLETE)  # percentage of missing values per column
#%% Extracting all features and target from the train dataframe
Target = df_COMPLETE.iloc[:,179]  # the target variable is in the last column of our dataframe
#import libraries
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
#we make the odds a dataframe instead of a tuple
ODDS_coding = pd.DataFrame(data=specs_numbers_prep1, index=np.array(range(1, 21375)), columns=np.array(range(1, 4)))
#%% We put all the variables together and get the NaN's, _0, ... out
horizontal_stack = pd.concat([specs_numbers_prep, ODDS_coding, Discrete, TARGET], axis=1)
horizontal_stack = horizontal_stack[1:]  # the first row consisted entirely of NaN's
horizontal_stack = horizontal_stack[horizontal_stack.isnull().sum(axis=1) < 2]  # drop rows with 2 or more missing values
#%% we check how many of the matches the home teams lost
Procent = sum(TARGET)/19675  # fraction of matches lost by the home team
#creating the test, train, validation data for the input data
X_train=horizontal_stack.iloc[indices_train_list,]
X_test=horizontal_stack.iloc[indices_test_list,]
X_val=horizontal_stack.iloc[indices_val_list,]
#creating the test, train, validation data for the target variable
Y_train=TARGET.iloc[indices_train_list,]
Y_test=TARGET.iloc[indices_test_list,]
Y_val=TARGET.iloc[indices_val_list,]
X_train_total=pd.concat([X_train, X_val]) #concatenate two data frames
Y_train_total=pd.concat([Y_train, Y_val]) #concatenate two data frames
# Now we need to create a player DataFrame with the info we want, like we did before,
# and attach it to the match dataframe for every player slot
for x in range(1, 12):
    df_player_info = df_player[['player_api_id', 'player_name', 'birthday']]
    # player_column_name was not shown in the original fragment; it is reconstructed
    # here for the home players (an analogous loop was run for the away players)
    player_column_name = f'home_player_{x}'
    df_player_info = df_player_info.rename(columns={
        'player_api_id': player_column_name,
        'player_name': f'name_{player_column_name}',
        'birthday': f'birthday_{player_column_name}',
    })
    # (reconstructed) merge the player info onto the match dataframe for this slot
    df_match_processed = df_match_processed.merge(df_player_info, on=player_column_name, how='left')

for x in range(1, 12):
    # convert the birthday columns to datetimes
    df_match_processed[f'birthday_home_player_{x}'] = pd.to_datetime(df_match_processed[f'birthday_home_player_{x}'])
    df_match_processed[f'birthday_away_player_{x}'] = pd.to_datetime(df_match_processed[f'birthday_away_player_{x}'])
    # age at the time of the match, in years rounded to one decimal
    df_match_processed[f'age_home_player_{x}'] = round((pd.to_datetime(df_match_processed['date']) - df_match_processed[f'birthday_home_player_{x}']).dt.days / 365, 1)
    df_match_processed[f'age_away_player_{x}'] = round((pd.to_datetime(df_match_processed['date']) - df_match_processed[f'birthday_away_player_{x}']).dt.days / 365, 1)

# show results
df_match_processed.head()