Predicting The Outcome of English Premier League Matches Using Machine Learning
Muntaqim Ahmed Raju∗, Md. Solaiman Mia†, Md. Abu Sayed‡ and Md. Riaz Uddin§
∗Department of Computer Science and Engineering, Dhaka International University, Bangladesh
†Assistant Professor, Department of Computer Science and Engineering, Green University of Bangladesh, Bangladesh
‡§Lecturer, Department of Computer Science and Engineering, Dhaka International University, Bangladesh
Abstract—English Premier League (EPL) is the world's most popular football league. Since it is such a prominent league, there have been a variety of preceding endeavors, both commercial and scholastic, to predict EPL match results. In this paper, machine learning, a promising tool of the fourth industrial revolution (Industry 4.0), has been used to introduce a model for predicting the outcomes of EPL matches both in multi-class (home, draw, and away) and in binary-class (home, and not-home) with the last five seasons of football matches. We have employed five machine learning algorithms along with different machine learning techniques, ranging from data pre-processing to hyper-parameter optimization, to find the best results. In addition, the comparative results demonstrate that our proposed model gives 70.27% accuracy in multi-class and 77.43% accuracy in binary-class compared to the best known existing models in the literature.

Keywords—Football Prediction, English Premier League, Machine Learning, Data Mining

I. INTRODUCTION

English Premier League (EPL), with a potential television audience of 4.7 billion people, is the highest tier of the English football league system and the world's largest sports community. There is a great deal of madness among people about it, which is why predicting the outcome of EPL matches is such a huge phenomenon among football fans. Hence, a lot of football supporters and experts have been giving predictions in different ways about who is going to win before the match begins. In fact, there is a whole industry around it: there are pre-match and post-match analyses by commentators to anticipate who is going to win, and channels like ESPN are committed to trying to predict who is going to dominate a game. This craze has been going on for as long as the game has existed.

Artificial Intelligence (AI) is the brain behind Industry 4.0. Machine Learning, a subset of AI, has emerged as a promising tool in intelligent predictive applications for smart manufacturing in Industry 4.0. Thus, in this paper, we have used an intelligent machine learning predictive model which attempts to anticipate who is going to win. The predominant objective of this work is to accurately determine the outcome in multi-class and binary-class of EPL matches. Initially, a survey of the last five seasons of the English Premier League has been conducted. Since then, we have explored a wide variety of soccer blogs, pre-match, and post-match analyses to find out the key factors for predicting football match results. We have employed the feature-engineering technique to create the most substantial features. Thereafter, feature scaling is carried out by using min-max normalization to scale all the features. Uni-variate feature selection based on the Chi-Square statistical test is used as a feature selection method for choosing the best viable feature set, primarily based on the scores of their correlation with the outcome variable. Then, to perceive the most promising method of prediction, we have monitored five different machine learning algorithms: Support Vector Machine (SVM), Logistic Regression (LR), Naive Bayes (NB) classifier, Decision Tree Classifier (DTC) and AdaBoost Classifier (ABC). Lastly, the hyper-parameter optimization technique is used to achieve the best possible hyper-parameters of every model.

A variety of experiments to forecast football matches have been carried out in the literature. In the exploration [1], the authors discussed the prediction of football matches using tree-based model algorithms such as C5.0, random forest, and extreme gradient boosting, and the best accuracy, 68.55%, was generated by the random forest algorithm. However, the primary downside of that analysis is its feature collection: the model of [1] cannot be used to determine football matches before the game starts because it uses features such as home team shots, away team shots, home team corners, away team corners, etc., which are not available before the match starts. The accuracy of our study is higher than the study mentioned, and our model is capable of predicting football matches before the game starts. Some studies [2], [3] have been conducted using different algorithms, but the predictive accuracy is low, i.e., only 59% and 58.5%, respectively. In another research [4], the output of football matches has some limitations: the algorithm used is LR, which gives only two results, i.e., home or not-home, while in a football match there are three possible outcomes, home win, away win, or draw. In this paper, we will discuss prior works before analyzing feature selection, discussing the performance of various models, and analyzing our results.

II. LITERATURE REVIEW

Various studies have been conducted to find the criteria for foreseeing the result of football matches more exactly.
The following investigations have been conducted to locate an ideal model for the prediction of football matches.

Alfredo et al. [1] discussed football match prediction using tree-based model algorithms such as C5.0, random forest, and extreme gradient boosting. The backward wrapper method was used as a feature selection methodology to assist in picking the best features to improve the accuracy of the model. This study used 10 seasons of EPL football match history with 15 initial features to predict the match results (home win, away win or draw). The random forest algorithm generated the best accuracy of 68.55%, whereas the C5.0 algorithm had the lowest accuracy of 64.87% and the extreme gradient boosting algorithm provided 67.89% accuracy.

Sathe et al. [2] prepared a dataset to predict the outcome (home win, away win or draw) of EPL matches by web crawling team ratings from sofifa and considering the performance of each team at the home field and away field. Their final dataset consists of FIFA ratings of each team along with their performances over the last 10 seasons. They used three machine learning classification methods, which are Support Vector Machine (SVM), Naive Bayes (NB), and Random Forest (RF). The best accuracy obtained is 59% with the SVM method.

Similarly, Baboota et al. [3] worked on building a generalized predictive model for predicting the results (home win, away win or draw) of the English Premier League. They used data from 2005 to 2016, spanning 11 seasons. They divided their dataset into nine seasons of training data from 2005 to 2014, and kept the remaining two seasons from 2014 to 2016 as test data. Using feature engineering and exploratory data analysis, they created a feature set for determining the most important factors for predicting the result of a football match, and consequently created a highly accurate predictive system using machine learning. Their best model, using gradient boosting, produced an accuracy of 58.5%.

Rana et al. [4] described a Logistic Regression model to predict match outcomes (home, not-home) of the English Premier League. They used SVM, XGBoost and Logistic Regression classifiers for primary classification of the data, and then selected the best algorithm out of these three to predict the appropriate label. The application of these classifiers is done on real team data gathered from football-data.co.uk for the seasons ranging from 2003-04 to 2018-19. The prediction accuracy of the built model is 65.63%.

III. GOAL OF THE STUDY

The main goal of this work is to create the most influential features through feature engineering to accurately determine the outcome in multi-class and binary-class of EPL matches. None of the existing works mentioned in Section II worked for both multi-class and binary-class. Since football is a very adaptive game, we have designed our model in such a way that very recent data has been added to the model. It will be possible to predict every new season with the help of the most influential features.

IV. RESEARCH METHODOLOGY

In this section, we have presented the proposed examinations of this exploration, which employ five different popular machine learning algorithms.

A. Data Collection

The dataset employed in this research originates from DataHub.io, which is a typical dataset to be utilized in football match prediction research. The data used is based on five seasons of EPL matches, from the 2014-2015 season to the 2018-2019 season. The total number of data used for this whole investigation is 1870 historical match records.

B. Data Preprocessing

The dataset used in this research needed to be preprocessed since it was composed of several features of each season. Many of these features, such as match date, referee name, football team name, and bookmaker odds, were practically superfluous. In this process, our essential assignment was to remove the irrelevant attributes or features which had no impact on the model development and keep only the attributes or features we particularly needed. From the retained attributes, feature engineering was done to make the final features that were utilized for model advancement.

1) Feature Engineering: Feature engineering is an important but labor-intensive component of machine learning applications [5]. To use feature engineering, a model's feature vector is expanded by adding new features that are computed based on other properties [6]. The final 23 features have been established as mathematical conversions of our retained attributes. Some of the features are given below:

• Home team goals scored per game at home: It is a function of home team goals scored at home and home team matches played at home. It helps to predict or forecast the number of goals that may be scored by a home team at home.
• Home team goals conceded per game at home: This is based on home team goals conceded at home and home team matches played at home. It allows estimating or determining how many goals a home team would potentially concede at home.
• Home team win percentage: It is a function of the total wins and the total matches played by the home team. It indicates the home team's possibility of winning its future matches.
• Away team win percentage: This is a measure of the total wins and total matches played by the away team. It provides the potential of an away team to win future matches.

2) Feature Scaling: Feature scaling is the technique of standardizing individual features over a defined range [7]. For the scaling intent of this study, we have exercised min-max normalization. It is a technique that scales an element or observation into the range of 0 and 1 [8]. The mathematical equation for min-max normalization is:

x_new = (x_i - min(x)) / (max(x) - min(x))    (1)
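To make the preprocessing and feature-engineering steps concrete, the following is a minimal sketch of how features of this kind could be derived from a raw match table with pandas. The column names (HomeTeam, AwayTeam, FTHG, FTAG, FTR) and the file name are assumptions in the spirit of public EPL result files; the paper does not publish its code, so this is an illustration rather than the authors' exact implementation.

```python
import pandas as pd

# Hypothetical raw columns: HomeTeam, AwayTeam, FTHG (full-time home goals),
# FTAG (full-time away goals), FTR (full-time result: 'H', 'D', 'A').
matches = pd.read_csv("epl_2014_2019.csv")

# Drop attributes the paper treats as superfluous (date, referee, odds, etc.).
matches = matches[["HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR"]]

def home_team_stats(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-team statistics from the home side's perspective."""
    grouped = df.groupby("HomeTeam")
    return pd.DataFrame({
        # Home team goals scored per game at home
        "home_goals_scored_pg": grouped["FTHG"].sum() / grouped.size(),
        # Home team goals conceded per game at home
        "home_goals_conceded_pg": grouped["FTAG"].sum() / grouped.size(),
        # Simplified win percentage: home wins only (the paper's version
        # also counts away wins).
        "home_win_pct": grouped["FTR"].apply(lambda r: (r == "H").mean()) * 100,
    })

print(home_team_stats(matches).head())
```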
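Equation (1) maps directly onto scikit-learn's MinMaxScaler. The sketch below assumes the engineered features are held in a small pandas DataFrame (toy values only) and rescales every column into [0, 1], matching the min-max normalization described above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix standing in for the paper's 23 engineered features.
X = pd.DataFrame({
    "home_goals_scored_pg": [1.2, 2.5, 0.8, 1.9],
    "home_win_pct": [40.0, 65.0, 25.0, 55.0],
})

# MinMaxScaler implements x_new = (x - min(x)) / (max(x) - min(x)), i.e. Eq. (1).
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
print(X_scaled)

# Equivalent manual computation for a single column.
col = X["home_win_pct"].to_numpy()
manual = (col - col.min()) / (col.max() - col.min())
assert np.allclose(manual, X_scaled["home_win_pct"])
```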
3) Feature Selection: Feature selection is the process of selecting a subset of relevant features which contribute most to the prediction variable or output [9]. In this study, we have used uni-variate feature selection [10] based on the Chi-Square statistical test, which picks up the intrinsic properties of the features. The features with the highest Chi-Square statistical test scores are illustrated in Fig. 1.

D. Models

For the intent of this analysis, we have primarily employed five mainstream supervised machine learning algorithms [12] (SVM, LR, NB classifier, DTC, ABC) to address our classification problem.

1) Support Vector Machine (SVM): The SVM is a kernel-based learning algorithm that addresses classification and regression problems. It produces ideal separating boundaries between data sets by resolving a quadratic optimization problem. The algorithm characterizes the best hyper-plane, which divides the points with a maximum margin associated with the different class labels [13]. SVM is a predictive data classification algorithm, so we have examined it to take care of our classification issues as well. For building a model with SVM, we have taken advantage of the 23 features that we have already developed through feature engineering. Thereafter, we have tuned the hyper-parameters using grid search with k-fold cross-validation (we used a k-value of 10), where the best hyper-parameters were C = 1, gamma = 0.1, kernel = 'sigmoid', as illustrated in Fig. 2.
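The uni-variate Chi-Square selection described in Section IV-B3 corresponds to scikit-learn's SelectKBest with the chi2 score function. The sketch below is an assumed illustration on stand-in data; the number of retained features (k = 10 here) is a placeholder, not a value reported in the paper. Chi-Square requires non-negative inputs, which the min-max scaled features satisfy.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Stand-ins: min-max normalized feature matrix (1870 matches x 23 features)
# and outcome labels (0 = away win, 1 = draw, 2 = home win).
rng = np.random.default_rng(0)
X_scaled = rng.random((1870, 23))
y = rng.integers(0, 3, size=1870)

# Score every feature with the Chi-Square statistic and keep the top k.
selector = SelectKBest(score_func=chi2, k=10)   # k is an assumed value
X_selected = selector.fit_transform(X_scaled, y)

# Per-feature Chi-Square scores (the quantity visualized in Fig. 1).
for idx in np.argsort(selector.scores_)[::-1][:10]:
    print(f"feature {idx}: chi2 = {selector.scores_[idx]:.3f}")
```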
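A minimal sketch of the grid search with 10-fold cross-validation used to tune the SVM follows. The parameter grid is an assumption (the paper only reports the optimum it found), and the data are stand-ins; the comments note the best combination reported in the paper (C = 1, gamma = 0.1, kernel = 'sigmoid').

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-ins for the selected features and multi-class labels.
rng = np.random.default_rng(0)
X_selected = rng.random((1870, 10))
y = rng.integers(0, 3, size=1870)

# Assumed search space; it contains the paper's reported optimum.
param_grid = {
    "C": [0.1, 1, 10],
    "gamma": [1, 0.1, 0.01],
    "kernel": ["rbf", "sigmoid", "poly"],
}

# Exhaustive grid search scored by accuracy over 10 folds (k = 10).
grid = GridSearchCV(SVC(), param_grid, cv=10, scoring="accuracy", n_jobs=-1)
grid.fit(X_selected, y)

print("Best hyper-parameters:", grid.best_params_)
print("Best cross-validated accuracy:", grid.best_score_)
```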
Fig. 3. Grid Search for Logistic Regression model's hyper-parameters.

Using the features chosen through the feature selection process, the NB model achieved 67.71% accuracy in multi-class and 74.92% accuracy in binary-class, respectively.

4) Decision Tree Classifier (DTC): The DTC algorithm represents a function that takes as input a vector of attribute values and returns a single decision output value. A decision tree reaches its decision by performing a sequence of tests [16]. It can be used to solve both regression and classification problems. Since our problem is also a classification problem, we have built a predictive model utilizing DTC. We have tuned its hyper-parameters to control the learning process using grid search, where the best hyper-parameters were min_samples_split = 200, criterion = 'gini', min_samples_leaf = 1, as illustrated in Fig. 5.
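As with the SVM, the decision tree hyper-parameters can be tuned with a grid search over 10 folds; the sketch below assumes a small illustrative grid around the optimum the paper reports (min_samples_split = 200, criterion = 'gini', min_samples_leaf = 1) and uses the same stand-in data as before.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-ins for the selected features and multi-class labels.
rng = np.random.default_rng(0)
X_selected = rng.random((1870, 10))
y = rng.integers(0, 3, size=1870)

# Assumed search space containing the reported best combination.
param_grid = {
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 50, 100, 200],
    "min_samples_leaf": [1, 5, 10],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid,
                    cv=10, scoring="accuracy", n_jobs=-1)
grid.fit(X_selected, y)
print("Best hyper-parameters:", grid.best_params_)
```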
According to the accuracy rates of each model, the feature selection process has slightly increased the prediction performance.

V. EXPERIMENTAL RESULTS AND ANALYSIS

This section exhibits our research findings as well as a comparative analysis with the existing models. Since the features chosen through the feature selection process slightly increased the performance, we have employed those selected features for the evaluation of each model.

A. Evaluation of Each Model for Multi-Class

The total number of matches was 1870, which consisted of 861 home team wins, 565 away team wins, and 444 draws. The 10-fold cross-validation method with the confusion matrix was executed to measure the efficiency of each classification model. The performance of each model for multi-class is shown from Table I to Table V.

TABLE I
PERFORMANCE OF THE SVM MODEL

Class      Precision   Recall    F1-Score
Away       70%         68%       69%
Draw       43%         68%       52%
Home       82%         70%       76%
Average    65%         68.66%    65.66%

TABLE II
PERFORMANCE OF THE LR MODEL

Class      Precision   Recall    F1-Score
Away       69%         71%       70%
Draw       42%         71%       53%
Home       85%         70%       77%
Average    65.33%      70.66%    66.66%

TABLE III
PERFORMANCE OF THE NB MODEL

Class      Precision   Recall    F1-Score
Away       72%         65%       69%
Draw       46%         59%       52%
Home       76%         73%       74%
Average    64.66%      65.66%    65%

TABLE IV
PERFORMANCE OF THE DTC MODEL

TABLE V
PERFORMANCE OF THE ABC MODEL

TABLE VI
THE RESULTS OF THE PREDICTION PROCESS IN MULTI-CLASS

Model   Accuracy   Precision   Recall    F1-Score
SVM     69.15%     65%         68.66%    65.66%
LR      70.27%     65.33%      70.66%    66.66%
NB      67.71%     64.66%      65.66%    65%
DTC     67.76%     64.66%      66.33%    65.33%
ABC     69.15%     65%         68.66%    65.66%

According to Table VI, the performance values of the Logistic Regression model are a little higher than those of the rest of the models. Therefore, we considered the LR model as the proposed model of this literature for multi-class classification.

B. Evaluation of Each Model for Binary-Class

The total number of matches was 1870, which consisted of 861 home team wins and 1009 wins for not-home. To evaluate the efficiency of each classification model, the 10-fold cross-validation method was used with the confusion matrix. The performance of each model for binary-class is displayed from Table VII to Table XI.

TABLE VII
PERFORMANCE OF THE SVM MODEL

Class      Precision   Recall    F1-Score
Not-Home   85%         75%       80%
Home       67%         79%       73%
Average    76%         77%       76.50%

TABLE VIII
PERFORMANCE OF THE LR MODEL

Class      Precision   Recall    F1-Score
Not-Home   77%         81%       79%
Home       78%         74%       76%
Average    77.50%      77.50%    77.50%

TABLE IX
PERFORMANCE OF THE NB MODEL

Class      Precision   Recall    F1-Score
Not-Home   74%         78%       76%
Home       76%         71%       74%
Average    75%         74.50%    75%
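The per-class precision, recall and F1-scores reported in the tables of Sections V-A and V-B can be produced from 10-fold cross-validated predictions. The following minimal sketch, placed here before the remaining binary-class tables, uses the LR model and stand-in data as an example; it is an illustration of the evaluation protocol, not the paper's exact code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict

# Stand-ins for the selected features and multi-class labels
# (0 = away win, 1 = draw, 2 = home win).
rng = np.random.default_rng(0)
X = rng.random((1870, 10))
y = rng.integers(0, 3, size=1870)

# Out-of-fold predictions from 10-fold cross-validation (k = 10, as in the paper).
model = LogisticRegression(max_iter=1000)
y_pred = cross_val_predict(model, X, y, cv=10)

print("Accuracy:", accuracy_score(y, y_pred))
print(confusion_matrix(y, y_pred))
# Per-class precision, recall and F1, as tabulated for each model.
print(classification_report(y, y_pred, target_names=["Away", "Draw", "Home"]))
```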
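For the binary-class task, the three-way outcome is collapsed into home versus not-home (away wins and draws together). A small sketch of that mapping and its evaluation follows, again with assumed variable names and stand-in data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict

# Multi-class labels: 0 = away win, 1 = draw, 2 = home win (stand-in data).
rng = np.random.default_rng(0)
X = rng.random((1870, 10))
y_multi = rng.integers(0, 3, size=1870)

# Collapse draws and away wins into a single "not-home" class.
y_binary = np.where(y_multi == 2, 1, 0)   # 1 = home win, 0 = not-home

y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y_binary, cv=10)
print(classification_report(y_binary, y_pred, target_names=["Not-Home", "Home"]))
```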
TABLE X
PERFORMANCE OF THE DTC MODEL

Class      Precision   Recall    F1-Score
Not-Home   77%         76%       76%
Home       71%         73%       72%
Average    74%         74.50%    74%

TABLE XI
PERFORMANCE OF THE ABC MODEL

Class      Precision   Recall    F1-Score
Not-Home   77%         77%       77%
Home       72%         73%       73%
Average    74.50%      75%       75%

TABLE XII
THE RESULTS OF THE PREDICTION PROCESS IN BINARY-CLASS

Model   Accuracy   Precision   Recall    F1-Score
SVM     76.85%     76%         77%       76.5%
LR      77.43%     77.50%      77.50%    77.50%
NB      74.92%     75%         74.50%    75%
DTC     75.93%     74%         74.50%    74%
ABC     76.15%     74.50%      75%       75%

According to Table XII, the performance values of the Logistic Regression model are a little higher than those of the rest of the models. Therefore, we considered the LR model as the proposed model of this literature for binary-class classification.

C. Comparative Results

In this sub-section, a comparative analysis is presented to prove the superiority of the proposed model of EPL match prediction over existing models.

TABLE XIII
COMPARISON OF THE PROPOSED MODEL WITH THE EXISTING MODELS IN MULTI-CLASS

Parameters                           Accuracy
Proposed Model in Multi-Class        70.27%
Existing Model [1] in Multi-Class    68.55%
Existing Model [2] in Multi-Class    59%
Existing Model [3] in Multi-Class    58.5%

Table XIII shows the comparison between the proposed model and the existing models [1], [2] and [3] in multi-class, where the proposed model accuracy is 70.27% and the existing models [1], [2] and [3] have 68.55%, 59% and 58.5% accuracy, respectively.

TABLE XIV
COMPARISON OF THE PROPOSED MODEL WITH THE EXISTING MODEL IN BINARY-CLASS

Parameters                           Accuracy
Proposed Model in Binary-Class       77.43%
Existing Model [4] in Binary-Class   65.63%

Table XIV shows the comparison between the proposed model and the existing model [4] in binary-class, where the proposed model accuracy is 77.43% and the existing model [4] accuracy is 65.63%.

VI. CONCLUSION

The model we devised is based on statistical analysis of past football games. With it, we are able to make fairly accurate predictions. Although the accuracy of this model is pretty good, it is not guaranteed to always be right, and there is a lot of scope for future work in this regard. We could bring in sentiment analysis, features such as individual player and team performance metrics, studying the trending hashtags on Twitter on match day, the posts from fans on social media, etc., to further enhance the accuracy of the model.

ACKNOWLEDGEMENT

This work was partially supported by the "Research Fund" of Green University of Bangladesh.

REFERENCES

[1] Y. F. Alfredo and S. M. Isa, "Football Match Prediction with Tree Based Model Classification", I. J. Intelligent Systems and Applications, vol. 11, no. 7, pp. 20-28, 2019.
[2] S. Sathe, D. Kasat, N. Kulkarni and R. Satao, "Predictive Analysis of Premier League Using Machine Learning", I. J. Innovative Research in Computer and Communication Engineering, vol. 5, no. 3, pp. 4121-4124, 2017.
[3] R. Baboota and H. Kaur, "Predictive analysis and modelling football results using machine learning approach for English Premier League", I. J. Forecasting, vol. 35, no. 2, pp. 741-755, 2019.
[4] D. Rana and A. Vasudeva, "Premier League Match Result Prediction using Machine Learning", Jaypee University of Information Technology, 2019.
[5] Y. Bengio, A. Courville and P. Vincent, "Representation learning: A review and new perspectives", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798-1828, 2013.
[6] A. Coates, A. Y. Ng and H. Lee, "An analysis of single-layer networks in unsupervised feature learning", I. Conf. Artificial Intelligence and Statistics, pp. 215-223, 2011.
[7] X. Wan, "Influence of feature scaling on convergence of gradient iterative algorithm", J. Physics: Conf. Series, vol. 1213, no. 3, pp. 1-5, 2019.
[8] S. G. K. Patro and K. K. Sahu, "Normalization: A Preprocessing Stage", IARJSET, vol. 2, no. 3, pp. 20-22, 2015.
[9] J. Tang, S. Alelyani and H. Liu, "Feature Selection for Classification: A Review", Data Classification: Algorithms and Applications, CRC Press, pp. 37-64, 2014.
[10] R. H. Subho, M. R. Chowdhury, D. Chaki and S. Islam, "A Univariate Feature Selection Approach for Finding Key Factors of Restaurant Business", IEEE Region 10 Symposium (TENSYMP), pp. 605-610, 2019.
[11] E. Eryarsoy and D. Delen, "Predicting the Outcome of a Football Game: A Comparative Analysis of Single and Ensemble Analytics Methods", HICSS, pp. 1107-1115, Hawaii, 2019.
[12] S. Chakravarty, H. Demirhan and F. Baser, "Fuzzy regression functions with a noise cluster and the impact of outliers on mainstream machine learning methods in the regression setting", Applied Soft Computing, vol. 96, pp. 1-17, 2020.
[13] T. Cheng, D. Cui, Z. Fan, J. Zhou and S. Lu, "A new model to forecast the results of matches based on hybrid neural networks in the soccer rating system", Proc. Fifth Int. Conf. Computational Intelligence and Multimedia Applications (ICCIMA), IEEE, 2003.
[14] S. Dreiseitl and L. Ohno-Machado, "Logistic regression and artificial neural network classification models: a methodology review", J. Biomedical Informatics, vol. 35, no. 5-6, pp. 352-359, 2002.
[15] D. J. Hand and K. Yu, "Idiot's Bayes: not so stupid after all?", Int. Statistical Review, vol. 69, no. 3, pp. 385-398, 2001.
[16] L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, "Classification and Regression Trees", Biometrics, vol. 40, no. 3, pp. 874, 1984.
[17] C. Ying, M. Qi-Guang, L. Jia-Chen and G. Lin, "Advance and prospects of AdaBoost algorithm", Acta Automatica Sinica, vol. 39, no. 6, pp. 745-758, 2013.