Abstract and Figures: The Open International Soccer Database For Machine Learning

The document describes the Open International Soccer Database, which contains essential match data from soccer leagues around the world. It was created to facilitate machine learning research on soccer outcome prediction. The database does not include detailed player or team stats, but instead focuses on basic information like goals scored, teams involved, league, and date. A 2021-22 Soccer Prediction Challenge was held to test how well machine learning models could predict future match outcomes using the database's data. The database is intended as an open resource to help analyze soccer and benchmark machine learning methods.

The Open International Soccer Database for machine learning

Abstract and Figures


How well can machine learning predict the outcome of a soccer game, given the
most commonly and freely available match data? To help answer this question
and to facilitate machine learning research in soccer, we have developed the
Open International Soccer Database. Version v1.0 of the Database contains
essential information about soccer matches from different countries.

To demonstrate the use of the Database for machine learning research, we
organized the 2021-22 Soccer Prediction Challenge. One of the goals of the
Challenge was to estimate where the limits of predictability lie, given the type
of match data contained in the Database.

Another goal of the Challenge was to pose a real-world machine learning
problem with a fixed time line and a genuine prediction task: to develop a
predictive model from the Database and then to predict the outcome of the
2021-22 future soccer matches taking place from 2019-20 to the end of the
regular season.

The Open International Soccer Database is released as an open science project,
providing a valuable resource for soccer analysts and a unique benchmark for
advanced machine learning methods. Here, we describe the Database and the
2021-22 Soccer Prediction Challenge.

The sample database illustrates data storage and retrieval for a soccer
tournament based on the EURO cup series. For football lovers, it provides
detailed information about a football tournament. The database design makes it
easier to answer the various questions that may come to mind about a soccer
tournament.

Keywords: Open International Soccer Database ・ 2017 Soccer Prediction Challenge ・ Open science ・ Soccer analytics

A V D GROUP I GIVE YOUR CAREER A LIFT

Introduction
Predicting the outcomes of sporting events has been the subject of intensive
research for many years. One obvious motivation for this is betting. Sports
betting has become a global multi-billion dollar industry. At least since the
late 1960s, statistical forecasting models have been developed for association
football, also known as soccer.

One of the earliest studies on soccer analysis concluded that chance dominates
the game, which makes outcome prediction very difficult. Despite the relatively
simple rules and objectives governing soccer, predicting the outcome of a
soccer game is difficult. One aspect that makes soccer so popular and
unpredictable is that goals are relatively rare and the margin of victory for the
winning team is relatively low for most matches.

Another reason why predicting the outcome of a soccer game is difficult is that
goals and other game-changing circumstances (e.g., red cards, injuries,
penalties) often do not occur as a result of superior or inferior play by one team,
but are due to difficult-to-capture events, such as poor refereeing, unfortunate
deflections or bounces of the ball, weather or ground conditions, or fraudulent
match manipulation. Also, factors like political upheaval in the club’s
management, behavior of spectators, and media pressure can influence the
outcome of matches, but such events are rarely captured in databases.

To date, relatively few studies have investigated machine learning methods for
soccer outcome prediction. We speculate that one reason is the lack of readily
available open soccer data. Here, we present the Open International Soccer
Database to bridge this gap. The Database contains the most commonly and
freely available as well as consistently reported information about the outcome
of a league soccer match. This information concerns the goals scored by each
team, teams involved, league, season, and date on which the match was played.
While goals are arguably the most important match events, the drawback of
such basic data is that it lacks more “sophisticated” outcome-relevant
information, such as fouls committed, yellow and red cards, corners conceded
by each team, or data about players, teams and clubs.
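To make the record structure concrete, the following minimal Python sketch shows
one way a match record with these fields could be represented and turned into a
win, draw, or loss label for the home team. The class name, field names, and
sample values are hypothetical and are not taken from the Database itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MatchRecord:
    # Field names are illustrative assumptions, not the Database's column names.
    league: str
    season: str
    played_on: date
    home_team: str
    away_team: str
    home_goals: int
    away_goals: int


def outcome(match: MatchRecord) -> str:
    """Return 'W', 'D', or 'L' from the home team's point of view."""
    if match.home_goals > match.away_goals:
        return "W"
    if match.home_goals < match.away_goals:
        return "L"
    return "D"


# Made-up example match, used only to exercise the function.
example = MatchRecord("LEAGUE-A", "00-01", date(2000, 8, 19), "Team A", "Team B", 2, 1)
print(outcome(example))  # W
```

Everything a model can learn from such a record has to be derived from these few
fields, which is why feature engineering matters so much for this kind of data.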

Note, however, that legislation, such as the UK Data Protection Bill and the
General Data Protection Regulation by the European Union, places legal
constraints on the disclosure of full names of players or coaches in publicly
available databases. In sports, biometric or health data could be highly sensitive
to the individual player because it can be linked to physical performance, and
there are many potential misuses of such personal data, including damage to a
player’s reputation.

In contrast to more sophisticated data, the beauty of simple match data is that it
can be easily understood and analyzed by any machine learning researcher, just
like the famous Iris data set. But although the data is simple to understand, this
does not mean that the scope of possible analysis is limited. On the contrary, as
the special issue Machine Learning for Soccer shows, the data set provides
considerable analytical challenges.

Researchers are welcome to freely use the Database to develop and test their
own strategies, methods, and tools.

However, the major motivation for developing the Open International Soccer
Database was not to provide yet-another benchmark data set for the machine
learning community, but to build a knowledge base that can be used for the
prediction of real-world soccer matches.

Related work

This paper is related to outcome prediction of soccer matches, sports
prediction challenges and open science. We will now position it with respect
to the state of the art in these areas.

1. Predicting soccer match outcomes

One way to predict match results is through the use of statistical models or
machine learning. Karlis and Ntzoufras (2003) used a bivariate Poisson model to
predict the number of goals scored by each team in a match. Baio and Blangiardo
(2010) proposed a Bayesian hierarchical model to predict the outcomes of the
Italian Serie A league in the 2007/08 season. Van Haaren et al. (2011) used
kernel-based relational learning to predict the goal differences in soccer
matches. Rue and Salvesen (2000) used a Bayesian approach to model the relative
strength of attack and defense of a team. O’Donoghue et al. (2004) used a variety
of statistical and machine learning methods to predict the results of the 2002
FIFA World Cup, but overall, only with limited success.

Fig. The 25 most frequent match scorelines in the Open International Soccer
Database. The least frequent outcomes, which were observed only once in all
216,743 matches in the Database, are not shown in the bar chart. Boxplots of the
prior probabilities of the home team winning, drawing, and losing in the Database.

Discussion
Soccer is arguably the world’s most popular team sport. It is also interesting from
an analytical point of view because it presents unique challenges. Soccer typically
involves a low number of goals, a low margin of victory and difficult-to-capture
events that often determine the final outcome of a match. On the other hand, data
capturing the essential aspects of a match are readily available. However, soccer
data are rarely available in a form directly usable by machine learning methods.
Moreover, how soccer data from different countries and leagues or other
competitions, such as club or national team championships, should be combined to
produce a larger data set suitable for machine learning is not immediately obvious.
Thus, predictive modeling in soccer poses interesting challenges with respect to the
integration of domain knowledge and feature engineering.

List of tables in the soccer database:

• soccer_country
• soccer_city
• soccer_venue
• soccer_team
• playing_position
• player_mast
• referee_mast
• match_mast
• coach_mast
• asst_referee_mast
• match_details
• goal_details
• penalty_shootout
• player_booked
• player_in_out
• match_captain
• team_coaches
• penalty_gk

ER Diagram

Description of tables:

soccer_country:
• country_id – a unique ID for each country
• country_abbr – the short name (abbreviation) of each country
• country_name – the name of each country

soccer_city:
• city_id – a unique ID for each city
• city – the name of the city
• country_id – the ID of the country where the city is located; only
countries present in the soccer_country table are allowed

soccer_venue:
• venue_id – a unique ID for each venue
• venue_name – the name of the venue
• city_id – the ID of the city where the venue is located; only cities
present in the soccer_city table are allowed
• aud_capicity – the audience capacity of the venue
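The three location tables described so far form a chain of references: a venue
points to a city, and a city points to a country. The sketch below shows how
these relationships could be declared and enforced, using SQLite through
Python's standard sqlite3 module. The column types, the NOT NULL constraints,
and the sample rows are assumptions made for illustration; only the table and
column names come from the descriptions above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Column types are assumptions; the column names follow the descriptions above.
conn.executescript("""
CREATE TABLE soccer_country (
    country_id   INTEGER PRIMARY KEY,
    country_abbr TEXT NOT NULL,
    country_name TEXT NOT NULL
);
CREATE TABLE soccer_city (
    city_id    INTEGER PRIMARY KEY,
    city       TEXT NOT NULL,
    country_id INTEGER NOT NULL REFERENCES soccer_country (country_id)
);
CREATE TABLE soccer_venue (
    venue_id     INTEGER PRIMARY KEY,
    venue_name   TEXT NOT NULL,
    city_id      INTEGER NOT NULL REFERENCES soccer_city (city_id),
    aud_capicity INTEGER
);
""")

# Hypothetical rows, used only to exercise the constraints.
conn.execute("INSERT INTO soccer_country VALUES (1, 'AAA', 'Country A')")
conn.execute("INSERT INTO soccer_city VALUES (10, 'City A', 1)")
conn.execute("INSERT INTO soccer_venue VALUES (100, 'Venue A', 10, 50000)")


def insert_venue_in_unknown_city() -> bool:
    """Try to add a venue whose city_id is missing from soccer_city."""
    try:
        conn.execute("INSERT INTO soccer_venue VALUES (101, 'Venue B', 999, 30000)")
        return True
    except sqlite3.IntegrityError:
        return False  # rejected by the foreign key constraint


print(insert_venue_in_unknown_city())  # False
```

Declaring the references in the schema means that "only those cities will be
available which are in the soccer_city table" is checked by the database rather
than by every application that writes to it.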

soccer_team:
• team_id – the ID of each team. Each team represents a country, and
team_id references the country_id column of the soccer_country table
• team_group – the name of the group to which the team belongs
• match_played – how many matches the team played in the group stage
• won – how many matches the team won
• draw – how many matches the team drew
• lost – how many matches the team lost
• goal_for – how many goals the team scored
• goal_agnst – how many goals the team conceded
• goal_diff – the difference between goals scored and goals conceded
• points – how many points the team earned from its group stage matches
• group_position – the position in which the team finished the group stage
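Several soccer_team columns are derived from others, so a simple consistency
check is possible. The sketch below takes goal_for as goals scored and
goal_agnst as goals conceded, and it assumes the common scoring rule of three
points for a win and one for a draw, which the table description does not state.
The column types and the two standings rows are hypothetical; the second row
carries a deliberately wrong goal_diff.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE soccer_team (
    team_id INTEGER, team_group TEXT, match_played INTEGER,
    won INTEGER, draw INTEGER, lost INTEGER,
    goal_for INTEGER, goal_agnst INTEGER, goal_diff INTEGER,
    points INTEGER, group_position INTEGER
)
""")
# Hypothetical standings; the second row deliberately has a wrong goal_diff.
conn.executemany(
    "INSERT INTO soccer_team VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "A", 3, 2, 1, 0, 4, 1, 3, 7, 1),
        (2, "A", 3, 1, 1, 1, 2, 2, 1, 4, 2),
    ],
)

# Rows whose derived columns disagree with the columns they are derived from.
inconsistent = conn.execute("""
SELECT team_id
FROM soccer_team
WHERE goal_diff <> goal_for - goal_agnst
   OR points <> 3 * won + draw          -- assumed 3-1-0 points rule
   OR match_played <> won + draw + lost
""").fetchall()
print(inconsistent)  # [(2,)]
```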

playing_position:
• position_id – a unique ID for each playing position
• position_desc – the name of the playing position

player_mast:
• player_id – a unique ID for each player
• team_id – the team for which the player played; it references the
country_id column of the soccer_country table
• jersey_no – the number on the player's jersey
• player_name – the name of the player
• posi_to_play – the position in which the player played; it references the
position_id column of the playing_position table
• dt_of_bir – the player's date of birth
• age – the player's approximate age at the time of the tournament
• playing_club – the name of the club for which the player was playing at the
time of the tournament

referee_mast:
• referee_id – the unique ID of each referee
• referee_name – the name of the referee
• country_id – the country to which the referee belongs; it references the
country_id column of the soccer_country table

match_mast:
• match_no – the unique ID of the match
• play_stage – the stage of the tournament in which the match was played: G
for Group stage, R for Round of 16, Q for Quarter final, S for Semi final,
and F for Final
• play_date – the date on which the match was played
• results – the result of the match, either win or draw
• decided_by – how the result of the match was decided: N for normal play
or P for penalty shootout
• goal_score – the score of the match
• venue_id – the venue where the match was played; it references the
venue_id column of the soccer_venue table
• referee_id – the ID of the referee selected for the match; it references
the referee_id column of the referee_mast table
• audence – the number of spectators who attended the match
• plr_of_match – the player who was named player of the match; the player
belongs to a team's 23-man squad, and the value references the player_id
column of the player_mast table
• stop1_sec – how much stoppage time (in seconds) was added to the 1st half
of play
• stop2_sec – how much stoppage time (in seconds) was added to the 2nd half
of play

coach_mast:
• coach_id – the unique ID of the coach
• coach_name – the name of the coach

asst_referee_mast:
• ass_ref_id – the unique ID of each referee who assists the main referee
• ass_ref_name – the name of the assistant referee
• country_id – the country to which the assistant referee belongs; it
references the country_id column of the soccer_country table

match_details:
• match_no – the number of the match; it references the match_no column of
the match_mast table
• play_stage – the stage of the match: G for Group stage, R for Round of 16,
Q for Quarter final, S for Semi final, and F for Final
• team_id – one of the two teams playing the match; it references the
country_id column of the soccer_country table
• win_lose – whether the team won, lost, or drew, indicated by the character
W, L, or D
• decided_by – how the team's result was decided: N for normal play or P for
penalty shootout
• goal_score – how many goals the team scored
• penalty_score – how many goals the team scored in the penalty shootout
• ass_ref – the assistant referee who assisted the referee; it references the
ass_ref_id column of the asst_referee_mast table
• player_gk – the player who kept goal for the team; it references the
player_id column of the player_mast table

goal_details:
• goal_id – the unique ID of each goal
• match_no – the match number; it references the match_no column of the
match_mast table
• player_id – the ID of a player selected for a team's 23-man squad for the
tournament; it references the player_id column of the player_mast table
• team_id – the ID of a team playing in the tournament; it references the
country_id column of the soccer_country table
• goal_time – the time at which the goal was scored
• goal_type – the type of goal: N for a normal goal, O for an own goal, and P
for a penalty
• play_stage – the stage in which the goal was scored: G for Group stage, R
for Round of 16, Q for Quarter final, S for Semi final, and F for Final
• goal_schedule – when the goal was scored: NT for normal time, ST for
stoppage time, or ET for extra time
• goal_half – the half of the match in which the goal was scored
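As an illustration of how the goal_schedule code can be used, the following
sketch counts the goals scored in normal time in each match. It runs the SQL
against an in-memory SQLite database through Python's sqlite3 module. The column
types and the four sample goals are hypothetical; only the table name, the
column names, and the NT and ST codes come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE goal_details (
    goal_id INTEGER PRIMARY KEY, match_no INTEGER, player_id INTEGER,
    team_id INTEGER, goal_time INTEGER, goal_type TEXT,
    play_stage TEXT, goal_schedule TEXT, goal_half INTEGER
)
""")
# Hypothetical goals: match 1 has two normal-time goals and one in stoppage time.
rows = [
    (1, 1, 501, 11, 57, "N", "G", "NT", 2),
    (2, 1, 502, 12, 65, "N", "G", "NT", 2),
    (3, 1, 503, 11, 90, "N", "G", "ST", 2),
    (4, 2, 504, 13, 5, "P", "G", "NT", 1),
]
conn.executemany("INSERT INTO goal_details VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Keep only normal-time goals, then count them per match.
normal_time_goals = conn.execute("""
SELECT match_no, COUNT(*) AS goals
FROM goal_details
WHERE goal_schedule = 'NT'
GROUP BY match_no
ORDER BY match_no
""").fetchall()
print(normal_time_goals)  # [(1, 2), (2, 1)]
```

Note that a match without any normal-time goal does not appear in this result at
all; listing such matches with a count of zero would require an outer join from
match_mast.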

penalty_shootout:
• kick_id – the unique ID of each penalty kick
• match_no – the match number; it references the match_no column of the
match_mast table
• team_id – the ID of a team playing in the tournament; it references the
country_id column of the soccer_country table
• player_id – the ID of a player selected for a team's 23-man squad for the
tournament; it references the player_id column of the player_mast table
• score_goal – a flag: Y if the kick was scored, N if it was not
• kick_no – the number of the kick within an individual match

player_booked:
• match_no – the match number; it references the match_no column of the
match_mast table
• team_id – the ID of a team playing in the tournament; it references the
country_id column of the soccer_country table
• player_id – the ID of a player selected for a team's 23-man squad for the
tournament; it references the player_id column of the player_mast table
• booking_time – the time at which the player was booked
• sent_off – a flag set to Y when the player was sent off
• play_schedule – when the player was booked: NT for normal time, ST for
stoppage time, or ET for extra time
• play_half – the half in which the player was booked

player_in_out:
• match_no – the match number; it references the match_no column of the
match_mast table
• team_id – the ID of a team playing in the tournament; it references the
country_id column of the soccer_country table
• player_id – the ID of a player selected for a team's 23-man squad for the
tournament; it references the player_id column of the player_mast table
• in_out – a flag: I when the player came onto the field, O when the player
left the field
• time_in_out – the time at which the player came onto or left the field
• play_schedule – when the player came onto or left the field: NT for normal
time, ST for stoppage time, or ET for extra time
• play_half – the half in which the player came onto or left the field

match_captain:
• match_no – the match number; it references the match_no column of the
match_mast table
• team_id – the ID of a team playing in the tournament; it references the
country_id column of the soccer_country table
• player_captain – the player who captained the team; it references the
player_id column of the player_mast table

team_coaches:
• team_id – the ID of a team playing in the tournament; it references the
country_id column of the soccer_country table
• coach_id – the coach of the team (a team may have one or more coaches); it
references the coach_id column of the coach_mast table

penalty_gk:
• match_no – the match number; it references the match_no column of the
match_mast table
• team_id – the ID of a team playing in the tournament; it references the
country_id column of the soccer_country table
• player_gk – the player who kept goal during the penalty shootout; it
references the player_id column of the player_mast table

Questions

1) Write a query in SQL to find the number of venues for EURO cup 2016.
2) Write a query in SQL to find the number of countries that participated in
the EURO cup 2016.
3) Write a query in SQL to find the number of goals scored in EURO cup 2016
within normal play schedule.
4) Write a query in SQL to find the number of matches that ended with a result.
5) Write a query in SQL to find the number of matches that ended in a draw.
6) Write a query in SQL to find the date on which EURO cup 2016 began.
7) Write a query in SQL to find the number of goals scored in every match
within normal play schedule.
8) Write a query in SQL to find the match number, date of play, and goals
scored for each match in which no stoppage time was added in the 1st half of
play.
9) Write a query in SQL to find the number of matches that ended in a goalless
draw in the group stage of play.
10) Write a query in SQL to find the number of bookings that happened in each
half of play within normal play schedule.
11) Write a query in SQL to find the name of the venue, with its city, where
the EURO cup 2016 final match was played.
12) Write a query in SQL to find the number of goals scored by each team in
every match within normal play schedule.
13) Write a query in SQL to find the total number of goals scored by each
player within normal play schedule and arrange the result set from the highest
to the lowest scorer.
14) Write a query in SQL to find the highest individual scorer in EURO cup
2016.
15) Write a query in SQL to find the captains, along with other information,
of the top four teams that participated in the semifinals (matches 48 and 49)
of the tournament.
16) Write a query in SQL to find the player who was selected for the Man of
the Match Award in the final of EURO cup 2016.
17) Write a query in SQL to find the players who came onto the field at the
latest time of play.
18) Write a query in SQL to find the dates on which penalty shootout matches
were played. (Subquery)
19) Write a query in SQL to find the venues where penalty shootout matches
were played. (Subquery)
20) Write a query in SQL to find the stage of the match in which penalty kick
number 23 was taken. (Subquery)
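As a hint for the questions marked (Subquery), the sketch below shows one
possible shape for question 18, run against an in-memory SQLite database through
Python's sqlite3 module. Only the columns needed for the query are created, and
the column types and sample rows are hypothetical. It is one way to phrase the
query, not a model answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE match_mast (
    match_no INTEGER PRIMARY KEY, play_stage TEXT, play_date TEXT
);
CREATE TABLE penalty_shootout (
    kick_id INTEGER PRIMARY KEY, match_no INTEGER, team_id INTEGER,
    player_id INTEGER, score_goal TEXT, kick_no INTEGER
);
-- Hypothetical rows: only match 2 goes to a penalty shootout.
INSERT INTO match_mast VALUES (1, 'G', '2000-01-01'), (2, 'Q', '2000-01-15');
INSERT INTO penalty_shootout VALUES (1, 2, 11, 501, 'Y', 1), (2, 2, 12, 601, 'N', 2);
""")

# The inner query lists matches that had a shootout; the outer query looks up their dates.
shootout_dates = conn.execute("""
SELECT match_no, play_date
FROM match_mast
WHERE match_no IN (SELECT match_no FROM penalty_shootout)
ORDER BY play_date
""").fetchall()
print(shootout_dates)  # [(2, '2000-01-15')]
```

The same pattern extends to question 19 by selecting from soccer_venue and
nesting a second subquery on match_mast.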
