
Academic year 2020/21

Faculty of Business and Economics

Data Engineering Project


Data engineering
Prof. Len Feremans

Hélène Truyers s0145497


Carolina Salcedo Ortiz s0200385
Oussama El Bouazzaoui s0162659

1. Pre-processing

Our first step was to load the data from Kaggle. We read every table into a pandas DataFrame so that everything was already in the right format. The first merge was between the match dataframe and the player dataframe. As this dataframe had already been built for the exploratory analysis, we simply reused it. Once these two dataframes were merged, we proceeded to clean up the columns that were not necessary for machine learning or that were duplicated.

Next, we merged the dataframes of the players and their attributes, and dropped the unnecessary columns there as well. Another problem we came across was that the attributes of the players were stored as a time series. This gave an abundance of observations, so we used groupby to take the average for the continuous variables and the mode for the discrete ones, as sketched below. Since groupby created two different dataframes, we had to merge the continuous and the discrete variables back into one dataframe.
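
A minimal sketch of this aggregation, assuming df_players_merged holds one row per player per attribute snapshot (the full version is in the appendix):

# continuous attributes: average over the time series
cont = df_players_merged.groupby('player_name').mean(numeric_only=True)
# discrete attributes: take the most frequent value per player
# (assumes every group has at least one non-null value)
discr = (df_players_merged.select_dtypes(exclude='number')
         .groupby('player_name').agg(lambda s: s.mode().iat[0]))
# recombine both aggregates into one dataframe per player
df_cleaned = discr.merge(cont, on='player_name')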

After this was done we had two big dataframes that needed to be merged, which happened in the next step. We then dropped the player names, which we had used for the merge, because our reasoning was that the names should not influence the outcome of a match; only the attributes of the players should.

Next up, we defined our target variable: whether or not the home team would lose. We compared the goals of the home team to those of the away team. After this we merged with the big dataframe one last time.
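
Condensed from the appendix, a minimal sketch of the target definition; np.where labels draws explicitly as part of the "no home loss" class:

import numpy as np

match_final = df_match[['match_api_id', 'home_team_goal', 'away_team_goal']].copy()
# LOSE_HOME when the away team scores more, otherwise WIN_HOME_TIE (win or draw)
match_final['Target'] = np.where(match_final['home_team_goal'] < match_final['away_team_goal'],
                                 'LOSE_HOME', 'WIN_HOME_TIE')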

Then the second big task of the data pre-processing started: we checked where all the NaNs were in our dataset, and we also checked for infinities and zeros.
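
A minimal sketch of such a scan, assuming df_COMPLETE is the merged dataframe from the appendix:

import numpy as np

numeric = df_COMPLETE.select_dtypes(include=np.number)
nan_share = df_COMPLETE.isna().mean()   # fraction of NaNs per column
inf_share = np.isinf(numeric).mean()    # fraction of infinities per numeric column
zero_share = (numeric == 0).mean()      # fraction of zeros per numeric column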

2. Exploratory data analysis

In this project we worked on data collected for more than 25,000 matches played in the seasons 2008 to 2016. The database contained seven tables: Country, League, Match, Player, Player_Attributes, Team and Team_Attributes, and was provided in SQLite format.

For the exploratory data analysis, three questions were proposed: for each year, what are the most successful teams? What are the attributes of the best 1% scoring players? And what is the age of each player in every match? To answer the questions, the first step was to copy the dataframes so that if a mistake was made, the original data would not be corrupted. Each question was then followed by a different set of steps.

We completed the following steps to determine which teams were the most successful:

1. We derived the year from the date field in df_match.
2. We assigned points to the home and away team.
3. We melted the dataframe twice.
4. We merged df_best_teams with df_team on team_id and team_api_id.
5. We kept only the relevant columns.
6. Then, we grouped by year and team.
7. We selected the most successful team for each year.
8. Finally, we displayed the results.

Based on the final dataframe obtained, we made the following graph, which shows the points of the best teams of each year from 2008 to 2016.
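
A minimal sketch of steps 1, 2 and 6, assuming the standard 3/1/0 point scheme and using concat/rename in place of the two melts:

import numpy as np

df = df_match.copy()
df['year'] = pd.to_datetime(df['date']).dt.year    # step 1: derive the year

# step 2: 3 points for a win, 1 for a draw, 0 for a loss (our assumption)
df['home_points'] = np.select([df['home_team_goal'] > df['away_team_goal'],
                               df['home_team_goal'] == df['away_team_goal']], [3, 1], 0)
df['away_points'] = np.select([df['away_team_goal'] > df['home_team_goal'],
                               df['home_team_goal'] == df['away_team_goal']], [3, 1], 0)

# one row per (year, team, points), for home and away appearances
long_df = pd.concat([
    df[['year', 'home_team_api_id', 'home_points']].set_axis(['year', 'team_api_id', 'points'], axis=1),
    df[['year', 'away_team_api_id', 'away_points']].set_axis(['year', 'team_api_id', 'points'], axis=1)])

# step 6: total points per team per year; keep the top team of each year
best = (long_df.groupby(['year', 'team_api_id'])['points'].sum().reset_index()
               .sort_values('points', ascending=False).drop_duplicates('year'))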

Secondly, we determined the characteristics of the top 1% of scoring players by following these steps:

1. We calculated the number of rows that corresponds to the top 1%.
2. We filtered the top 1%.
3. Then, we merged with df_player to get the name of each player and sorted by the finishing and date attributes.
4. We dropped duplicates and kept only the first row found (this explains why we sorted on date in the previous step).
5. We kept only the relevant columns and reset the index.
6. We renamed player_api_id so we could merge later.
7. Finally, we displayed the results.

From the final dataframe obtained in this section, we made the following graph using the player names and their overall rating as parameters. The graph shows the average overall rating of the ten top-scoring players:
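
A sketch of steps 1 to 4, using finishing as the scoring attribute (matching the sort in the appendix code):

n_top = int(len(df_player_attr) * 0.01)                            # step 1: size of the top 1%
top = (df_player_attr.sort_values(['finishing', 'date'], ascending=False)
                     .head(n_top))                                 # step 2: filter the top 1%
top = top.merge(df_player, on='player_api_id')                     # step 3: attach the player names
top = top.drop_duplicates('player_api_id')                         # step 4: keep the newest row per player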

Finally, in order to calculate the age of each player for each match, after we created the respective copy we used the following loops for both the home and the away teams (a condensed sketch follows the list). The first five steps describe what was done in each loop:

1. We added the birthday for each player of the home/away team.
2. We automatically created the name of the column to be merged.
3. We created a player DataFrame with the information we wanted.
4. We replaced the column names to allow merging.
5. Lastly, we performed the merge.
6. After doing the steps above for both teams, we dropped the columns starting with home_ and away_.
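
Condensed from the full loops in the appendix, the core of the age calculation per player slot looks like this:

match_date = pd.to_datetime(df_match_processed['date'])
for x in range(1, 12):
    bday = pd.to_datetime(df_match_processed[f'birthday_home_player_{x}'])
    # age in years at the date of the match, rounded to one decimal
    df_match_processed[f'age_home_player_{x}'] = round((match_date - bday).dt.days / 365, 1)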

3. Machine learning

Three algorithms were run to create the models. The first one was logistic regression, the second
one was a decision tree and lastly a random forest with Kfold. The target variable was whether
or not the home team would lose. The input variables were the attributes of the players and
three odds from the bookmakers, namely, the odds that the home team would win, that they
would play a draw or that the away team would win. For efficiency's sake we first dropped all
columns that weren’t needed. This was done in the pre-processing step, so we could just use the
dataframe in machine learning.

The first step of the machine learning stage was creating random indices for a test, training and validation set. We used the validation set to make sure we did not overfit our model, and the test set to calculate the AUC and accuracy. After this was done we extracted all features and targets from the dataframe. Then we pre-processed the discrete variables, which means we dummified all of them; this included our discrete input variables and the target variable. After that we pre-processed the continuous variables by scaling them all, so that every variable carried an equal weight in the machine learning algorithm.
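
A condensed sketch of both preprocessing steps, using the variable names from the appendix:

from sklearn.preprocessing import MinMaxScaler

# one-hot encode the discrete input variables
discrete = pd.get_dummies(specs_strings, sparse=True)
# scale the continuous variables to [0, 1] so they carry equal weight
continuous = pd.DataFrame(MinMaxScaler().fit_transform(specs_numbers))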

Secondly, we concatenated all the variables: the discrete and the continuous variables, but also the target variable. The first row of this concatenated dataframe contained only NaNs, so we dropped it. After that we dropped all remaining rows with NaNs, _0 or 0 values. This was done on all variables together, so that if we dropped a row in the training data it would also be dropped from the target variable. After this we took the target variable back out of the dataframe, and lastly we made the training, validation and test sets. When this was all done we could start with the machine learning.
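
As a sketch, the assembly and split can be summarised as follows (continuous, discrete and TARGET as above; the full index-based version is in the appendix):

from sklearn.model_selection import train_test_split

stack = pd.concat([continuous, discrete, TARGET], axis=1).dropna()  # drop rows in lockstep
X, y = stack.iloc[:, :-1], stack.iloc[:, -1]                        # take the target back out
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)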

The literature stated that it would be hard to predict a football match using machine learning: the market correctly predicted roughly 70% of the games in 2019 (Empirics Asia, 2019), which is about the accuracy we obtained. However, we are working with skewed data. Only 28.77% of the matches were lost by the home team, so a model that always predicts that the home team will not lose already reaches about 71.23% accuracy. This led us to look at the AUC instead of the accuracy.

The first model we checked was the logistic regression. This model outperformed the rest by a small margin: its accuracy was 72.24% and its AUC was 65.22%. The Area Under the Curve is displayed in the ROC graph; the closer the ROC curve is to the top left corner, the better the model.
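
The graph can be reproduced with scikit-learn as sketched below, where clf is our name for the fitted logistic regression:

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

probs = clf.predict_proba(X_test)[:, 1]     # predicted probability of a home loss
fpr, tpr, _ = roc_curve(Y_test, probs)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(Y_test, probs):.4f}')
plt.plot([0, 1], [0, 1], linestyle='--')    # random-guess diagonal
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()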

The second model we checked was KNN. However, this model did not have any discriminative power, as it simply predicted the majority class (no home loss) for every match. This led to an accuracy of 71.87%. A way to fix this would have been to use a different cut-off value. Another technique that might have been useful is Near Miss undersampling. We would not recommend SMOTE, since that technique adds synthetic minority-class samples and Python had already warned us that the training dataset was large. However, we also looked at the AUC, as this metric is classification-threshold-invariant; the AUC was 63.33%.
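
As a sketch of the Near Miss idea, assuming the imbalanced-learn package is available:

from imblearn.under_sampling import NearMiss

# undersample the majority class so both classes are equally represented
X_res, Y_res = NearMiss().fit_resample(X_train, Y_train)
# KNN could then be refitted on the balanced X_res and Y_res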

The last technique we tried was a decision tree, which was by far the worst model. Its accuracy is essentially the same as KNN's and it shares the same problem: the model has no discriminative power. Its AUC, however, is lower than KNN's, at only 60.41%. We also tried a random forest with k-fold cross-validation, but it did not have a significantly higher AUC.
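
A sketch of the random forest with k-fold cross-validation; the number of trees and folds shown here are illustrative, not the exact values we used:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf = RandomForestClassifier(n_estimators=100, random_state=0)
auc_per_fold = cross_val_score(rf, X_train_total, Y_train_total, cv=5, scoring='roc_auc')
print(auc_per_fold.mean())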

Model                  Accuracy   AUC
Logistic Regression    72.24%     65.22%
KNN                    71.87%     63.33%
Decision Tree          71.78%     60.41%
Random Forest          71.87%     63.32%

We looked at the AUC and ROC graphs to compare the models, and chose to show only the two models with the highest AUC values. As you can see, there is only a minimal difference between the two. We eventually went with the logistic regression because it has the highest AUC value and, on top of that, the best accuracy. From our experiments with these machine learning models for soccer matches, we conclude that logistic regression was the best choice for us.

4. References
● Empirics Asia. (2019, December 19). How I Used Machine Learning to Predict Football Games for 24 Months Straight. Retrieved from Empirics Asia: https://empirics.asia/how-i-used-machine-learning-to-predict-football-games-for-24-months-straight/
● Mathien, H. (2016, October 16). European Soccer Database. Retrieved from Kaggle: https://www.kaggle.com/hugomathien/soccer
● Seaborn documentation: https://seaborn.pydata.org/tutorial.html
● Scikit-learn documentation: https://scikit-learn.org/stable/user_guide.html

5. Appendix

# import libraries
import pandas as pd
import pyodbc as pdb
import numpy as np
import datetime as dt
import shutil
import math
import os
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
from heapq import nlargest

# make connection to sqlite database


con = sqlite3.connect("database.sqlite")

# create dataframe for each table


df_match = pd.read_sql_query("SELECT * FROM Match", con)
df_player = pd.read_sql_query("SELECT * FROM Player", con)
df_player_attr = pd.read_sql_query("SELECT * FROM Player_Attributes", con)
df_country = pd.read_sql_query("SELECT * FROM Country", con)
df_team = pd.read_sql_query("SELECT * FROM Team", con)
df_team_attr = pd.read_sql_query("SELECT * FROM Team_Attributes", con)
df_league = pd.read_sql_query("SELECT * FROM League", con)

#%% data cleaning: we will now clean both datasets so we can cleanly merge them. This step is preparing for the ML

#first we clean df_match_processed
#(df_match_processed was built in the exploratory part; that code is at the end of this document)

df_match_processed = df_match_processed.loc[:,~df_match_processed.columns.duplicated()]
#we delete the first 6 columns, so id, country_id, league_id, season, stage and date will be deleted
df_match_processed.drop(df_match_processed.columns[0:6], axis=1, inplace=True)
#now we will drop everything that is in XML, as extracting XML is out of the scope of this project
df_match_processed.drop(df_match_processed.columns[1:9], axis=1, inplace=True)
#we want to keep BWH, BWD and BWA since these are odds; this leaves us to drop the first 3 columns
df_match_processed.drop(df_match_processed.columns[1:4], axis=1, inplace=True)

# after BWH,BWA and BWD we drop everything we don't understand (IWH till BSA)
df_match_processed.drop(df_match_processed.columns[4:28], axis=1, inplace=True)
#now we remove the age of the players as this is not explanatory in our model
df_match_processed = df_match_processed[df_match_processed.columns.drop(list(df_match_processed.filter(regex='age')))]

#now we clean up the player_merged


#we create a dataset with the attributes of the players and their names
df_players_merged = df_player.merge(df_player_attr, on='player_api_id').sort_values(["finishing","date"],ascending=[False,False])
#we first drop the player_api_id
df_players_merged.drop(df_players_merged.columns[0], axis=1, inplace=True)
#now we drop the id, fifa_api_id and date
df_players_merged.drop(df_players_merged.columns[1:4], axis=1, inplace=True)
#now we drop everything after heading_accuracy
df_players_merged.drop(df_players_merged.columns[9:42], axis=1, inplace=True)

#now we group by so that there is no time evolution for the players, but an overall mean (continuous) or mode (discrete)
#first of all we group by for the continuous variables
df_players_cont=df_players_merged.groupby('player_name').mean()

# now we groupby for the discrete variables


df_players_discr=df_players_merged.groupby('player_name').agg(pd.Series.mode)
df_players_discr.drop(df_players_discr.columns[3:6], axis=1, inplace=True)

df_cleaned = df_players_discr.merge(df_players_cont, on='player_name')

#%% now we merge df_match_processed with df_cleaned


#merge df_match_processed with df_cleaned for away player one
df_match_player_merged = pd.merge(df_match_processed, df_cleaned, how='inner', left_on=['name_away_player_1'], right_on=['player_name'])
df_match_player_merged = df_match_player_merged.rename(columns={"player_name": "away_player_name_1"})

# merge for every other away player (2-11)
for x in range(2,12):
    df_match_player_merged = pd.merge(df_match_player_merged, df_cleaned, how='inner', left_on=[f'name_away_player_{x}'], right_on=['player_name'])
    df_match_player_merged = df_match_player_merged.rename(columns={"player_name": f"away_player_name_{x}"})

#merge for the home players
for x in range(1,12):
    df_match_player_merged = pd.merge(df_match_player_merged, df_cleaned, how='inner', left_on=[f'name_home_player_{x}'], right_on=['player_name'])
    df_match_player_merged = df_match_player_merged.rename(columns={"player_name": f"home_player_name_{x}"})

#%% names of the players should not influence the match, but their specs should, so we drop all of their names
df_match_player_merged_final = df_match_player_merged.copy()
df_match_player_merged_final.drop(df_match_player_merged_final.columns[4:26], axis=1, inplace=True)

#%% we need to convert the scores of df_match to won-lost-tied


match_final = df_match[['match_api_id','home_team_goal','away_team_goal']].copy()
# now we make our target variable in relation to the home team
# (>= so that draws are also labelled WIN_HOME_TIE instead of being left as NaN)
match_final.loc[match_final["home_team_goal"] >= match_final["away_team_goal"],"Target"] = 'WIN_HOME_TIE'
match_final.loc[match_final["home_team_goal"] < match_final["away_team_goal"],"Target"] = 'LOSE_HOME'

#now we merge df_match_player_merged_final and match_final for the complete dataframe


df_COMPLETE = df_match_player_merged_final.merge(match_final, on='match_api_id')
# now we just need to drop the match_api_id and then we have a complete dataframe
df_COMPLETE.drop(df_COMPLETE.columns[0], axis=1, inplace=True)
#we drop home_team_goal and away_team_goal, as they are not available when we want to predict who is going to win
df_COMPLETE.drop(df_COMPLETE.columns[179:181], axis=1, inplace=True)

#%% looking for the percentage of NaN


# calculate the percentage of NaN values in a single column
def nan_percen(column):
    spec = df_COMPLETE.loc[:,column]
    aantalnan = spec.isna()
    Percen = sum(aantalnan)/len(spec)
    #getting the results back
    return Percen

# percentage of missing values per column
percent_missing = df_COMPLETE.isnull().sum() * 100 / len(df_COMPLETE)

#%% Data preparation


#Create random indices for a test, training and validation set

from sklearn.model_selection import train_test_split


indices=np.arange(19675)
indices_train, indices_test = train_test_split(indices, test_size=0.2, random_state=0)
indices_train, indices_val = train_test_split(indices_train, test_size=0.2, random_state=0)

#%% Extracting all features and the target from the train dataframe
Target = df_COMPLETE.iloc[:,179] # the target variable is the last column of our dataframe

specs=df_COMPLETE.iloc[:,3:178]#all the specs


specs_numbers=specs.select_dtypes(include=np.number)#the numeric columns
specs_strings=specs.select_dtypes(exclude=np.number)#the non numeric columns
specs_strings=specs_strings.astype(str) #convert to string

#%%Preprocessing for discrete variables (for the specs)


Discrete=pd.get_dummies(specs_strings, sparse=True)#dummify the discrete variables
#%%
TARGET=pd.get_dummies(Target, sparse=True)#dummify the target variable
Target1=TARGET.iloc[:,0]
#%% Making sure we have a copy of our target variable in case something goes wrong
TARGET = Target1

#%% Now we pre-process the ODDS (these are continuous variables)


odds = df_COMPLETE.iloc[:,0:3]
data_preprocessed = [] #initialise

#import libraries
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler

#we scale our continuous variables


scaler = MinMaxScaler()
specs_numbers_prep1 = scaler.fit_transform(odds)
data_preprocessed.append(specs_numbers_prep1)

#we turn the scaled odds (a numpy array) back into a dataframe
ODDS_coding = pd.DataFrame(data=specs_numbers_prep1, index=np.array(range(1, 21375)), columns=np.array(range(1, 4)))

#%% Preprocessing for continuous variables (step 2)


#we scale our continuous variables
scaler = MinMaxScaler()
specs_numbers_prep = scaler.fit_transform(specs_numbers)
data_preprocessed.append(specs_numbers_prep)
#we turn the scaled specs (a numpy array) back into a dataframe
specs_numbers_prep = pd.DataFrame(data=specs_numbers_prep, index=np.array(range(1, 21375)), columns=np.array(range(1, 110)))

#%%We put all the variables together and get the NaN’s, _0,... out
horizontal_stack = pd.concat([specs_numbers_prep, ODDS_coding,Discrete,TARGET], axis=1)
horizontal_stack = horizontal_stack[1:]
horizontal_stack = horizontal_stack[horizontal_stack.isnull().sum(axis=1) < 2]

#%% we extract the target variable out of the horizontal stack


TARGET = horizontal_stack.iloc[: , -1]
horizontal_stack = horizontal_stack.iloc[: , :-1]

#%% we check how many of the home teams' matches were lost
Procent = sum(TARGET)/19675

#%% Create train, validation and test data


#tolist() returns a list of values so here for train test and val data
indices_train_list=indices_train.tolist()
indices_test_list=indices_test.tolist()
indices_val_list=indices_val.tolist()

#creating the test, train, validation data for the input data
X_train=horizontal_stack.iloc[indices_train_list,]
X_test=horizontal_stack.iloc[indices_test_list,]
X_val=horizontal_stack.iloc[indices_val_list,]

#creating the test, train, validation data for the target variable
Y_train=TARGET.iloc[indices_train_list,]
Y_test=TARGET.iloc[indices_test_list,]

Y_val=TARGET.iloc[indices_val_list,]
X_train_total=pd.concat([X_train, X_val]) #concatenate two data frames
Y_train_total=pd.concat([Y_train, Y_val]) #concatenate two data frames

EXPLORATORY PART NEEDED TO UNDERSTAND THE FIRST MERGE


#%% Calculate the age of each player for every match
# First, copy the df_match DataFrame, as we will work on it
df_match_processed = df_match.copy()

# add birthday for each player of the home team

for x in range(1,12):

    # Here we automatically create the name of the column to be merged
    player_column_name = f'home_player_{x}'

    # Now we need to create a player DataFrame, with the info we want, like we did before
    df_player_info = df_player[['player_api_id', 'player_name', 'birthday']]

    # And now, we have to replace the names to allow merging
    df_player_info = df_player_info.rename(columns={
        'player_api_id': player_column_name,
        'player_name': f'name_{player_column_name}',
        'birthday': f'birthday_{player_column_name}',
    })

    # Finally, we can make the merge
    df_match_processed = df_match_processed.merge(df_player_info, on=player_column_name)

# add birthday for each player of the away team

for x in range(1,12):

    # Here we automatically create the name of the column to be merged
    player_column_name = f'away_player_{x}'

    # Now we need to create a player DataFrame, with the info we want, like we did before
    df_player_info = df_player[['player_api_id', 'player_name', 'birthday']]

    # And now, we have to replace the names to allow merging
    df_player_info = df_player_info.rename(columns={
        'player_api_id': player_column_name,
        'player_name': f'name_{player_column_name}',
        'birthday': f'birthday_{player_column_name}',
    })

    # Finally, we can make the merge
    df_match_processed = df_match_processed.merge(df_player_info, on=player_column_name)

# drop columns starting with home_ and away_

df_match_processed = df_match_processed.loc[:,~df_match_processed.columns.str.startswith('home_')]
df_match_processed = df_match_processed.loc[:,~df_match_processed.columns.str.startswith('away_')]

# calculate the age for each player

for x in range(1,12):
    df_match_processed[f'birthday_home_player_{x}'] = pd.to_datetime(df_match_processed[f'birthday_home_player_{x}'])
    df_match_processed[f'birthday_away_player_{x}'] = pd.to_datetime(df_match_processed[f'birthday_away_player_{x}'])
    df_match_processed[f'age_home_player_{x}'] = round(((pd.to_datetime(df_match_processed['date']) - df_match_processed[f'birthday_home_player_{x}']).dt.days)/365,1)
    df_match_processed[f'age_away_player_{x}'] = round(((pd.to_datetime(df_match_processed['date']) - df_match_processed[f'birthday_away_player_{x}']).dt.days)/365,1)

# show results
df_match_processed.head()

#now we don't need the birthday anymore so we can drop these columns


df_match_processed = df_match_processed[df_match_processed.columns.drop(list(df_match_processed.filter(regex='birthday')))]
