Accessing and Working With StatsBomb Data in R
What Is R and Why Use It?
R is a programming language that is useful for managing large datasets. It is especially useful in the
world of football data, as it allows us to manipulate that data to various ends, such as creating metrics
out of the data and visualising it.
R can be downloaded from a CRAN mirror: https://cran.r-project.org/mirrors.html
We at StatsBomb use R regularly (amongst other coding languages) in day-to-day work, particularly
within our analysis department. Spreadsheets are a viable route when you’re just starting out, but
eventually the datasets become too big and unwieldy, and performing nuanced dissection of them becomes
too complicated.
Once you’ve gotten over the learning curve, R is ideal for parsing data and working with it quickly,
however you like.
RStudio
The base version of R is a somewhat cumbersome piece of software. This has led to the creation of many
different IDEs (integrated development environments). These are wrappers around the initial R install that
make most tasks within R easier and more manageable for the end user. The most popular of these is RStudio:
https://www.rstudio.com/products/rstudio/
It is recommended that you install RStudio (or any similar IDE that you find and prefer) as most users do. It
will make working with StatsBomb’s data a cleaner, simpler process.
Opening a New R ‘Project’
This (minus the annotations of course) is what you should see when you load up RStudio.
The main packages we will focus on here, and which need installing, are:
‘tidyverse’: tidyverse contains a whole host of other packages (such as dplyr and magrittr) that are useful for
manipulating data. install.packages("tidyverse")
‘devtools’: Most packages are hosted on CRAN. However, there are also countless useful ones hosted on
Github. devtools allows for downloading of packages directly from Github. install.packages("devtools")
‘ggplot2’: The most popular package for visualising data within R. It is contained within tidyverse.
Once a package is installed, it can be loaded into R by running library(PackageNameHere). You should load all
of these at the start of any session.
What Is ‘StatsBombR’ and How to Install It?
StatsBomb’s former data scientist Derrick Yam created ‘StatsBombR’, an R package dedicated to making it much
easier to use StatsBomb’s data in R. It can be found on Github at the following link, along with much more
information on its uses. There are lots of helpful functions within it that you should get to know.
https://github.com/statsbomb/StatsBombR
To install the package in R, you’ll first need the ‘devtools’ package. Both can be installed by running the
following lines of code:
install.packages("devtools")
devtools::install_github("statsbomb/StatsBombR")
Finding More Info On Packages
FreeCompetitions() - This shows you all the competitions that are available as free data.
If you want to store the output of this (or any other function) so you can pull it up at any time, instead of just
having it in the R console, you can run something like Comp <- FreeCompetitions(). Then, any time you run Comp
(or whatever name you choose to store it under), you will see the output of FreeCompetitions().
Matches <- FreeMatches(Comp) - This shows the available matches within the competitions chosen.
StatsBombData <- StatsBombFreeEvents(MatchesDF = Matches, Parallel = T) - This pulls all the event data
for the matches that are chosen.
Pulling the Free Data
Now we’re going to run through an example of how to pull the data into R. Open up a new ‘script’, so we can
store this code and have it easily accessible, by going to File -> New File -> R Script. This script can be saved
at any time.
library(tidyverse)
library(StatsBombR) # 1

Comp <- FreeCompetitions() %>%
  filter(competition_id==11 & season_name=="2005/2006") # 2

Matches <- FreeMatches(Comp) # 3

StatsBombData <- StatsBombFreeEvents(MatchesDF = Matches, Parallel = T) # 4

StatsBombData = allclean(StatsBombData) # 5

1: tidyverse loads many different packages. Most important for this task are dplyr and magrittr. StatsBombR
loads StatsBombR.
2: This grabs the competitions that are available to the user and filters it down, using dplyr’s ‘filter’
function, to just the 2005/06 La Liga season in this example.
3: This pulls all the matches for the desired competition.
4: Now we have created a ‘dataframe’ (essentially a table) called ‘StatsBombData’ (or whatever you choose to
call it) of the free event data for the La Liga season in 2005/2006.
5: Extracts lots of relevant information such as x/y coordinates. More information can be found in the
package info. Be sure to familiarise yourself with the columns it creates using names(nameofyourdfhere).
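To see what the filter step is doing without downloading anything, here is a minimal sketch on a small made-up table. The rows are hypothetical stand-ins for the output of FreeCompetitions(); only the competition_id and season_name values used in the example above come from this guide.

```r
library(dplyr)

# Hypothetical stand-in for the table returned by FreeCompetitions().
# Only competition_id 11 / "2005/2006" comes from the guide; the rest is made up.
comps <- tibble(
  competition_id = c(11, 11, 99),
  season_name    = c("2005/2006", "2006/2007", "2005/2006")
)

# Keep only rows where BOTH conditions are true
la_liga_0506 <- comps %>%
  filter(competition_id == 11 & season_name == "2005/2006")

print(la_liga_0506)
```

Only one row survives, because filter() keeps a row only when every condition joined by & holds for it.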
Working With the Data
Getting to Know the Data
On our Github page - where our free data is hosted - we have put the specification documents for StatsBomb
Data. These are available to view or download at any time and will hopefully answer any questions you may
have about what a certain event type is or any similar inquiries.
Open Data Competitions v2.0.0.pdf - Covers the objects contained within the competitions information
(FreeCompetitions()).
Open Data Matches v3.0.0.pdf - Describes the match info download (FreeMatches()).
Open Data Lineups v2.0.0.pdf - Describes the structure of the lineup info (getlineupsFree()).
Open Data Events v4.0.0.pdf - Explains the meaning of the column names within the event data.
StatsBomb Event Data Specification v1.1.pdf - The full breakdown of all the events within the data.
Data Use Cases
Now that we have our StatsBombData file, we’re going to run through some ways you can use the data and
familiarise yourself with R in the process. There will be four use cases, increasing in complexity as they go:
Use Case 1: Shots and Goals - A simple but important starting point. Here we will extract shots and goals
totals for each team, then look at how to do the same but on a per game basis.
Use Case 2: Graphing Shots On a Chart - After we have the shots and goals data, how can we take that and
create a starter chart from it?
Use Case 3: Player Shots Per 90 - Getting shots for players is simple enough after doing so for teams. But
then how can we adjust those figures on a per 90 basis?
Use Case 4: Mapping Passes - Filtering our data down to just a subset of passes and then using R’s ggplot2 to
plot those passes on a pitch.
Data Use Case 1: Goals and Shots
shots_goals = StatsBombData %>%
  group_by(team.name) %>%
  summarise(shots = sum(type.name=="Shot", na.rm = TRUE),
            goals = sum(shot.outcome.name=="Goal", na.rm = TRUE))

shots = sum(type.name=="Shot", na.rm = TRUE) is telling it to create a new column called ‘shots’ that sums up
all the rows under the ‘type.name’ column that contain the word “Shot”. na.rm = TRUE tells it to ignore any NAs
within that column.
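For the per game version mentioned in Use Case 1, one option is to divide each total by the number of distinct matches a team appears in. The following is a minimal sketch on a small made-up table, not real StatsBomb data. It assumes the event data carries a match_id column alongside the columns used above.

```r
library(dplyr)

# Made-up events for two hypothetical teams across two matches
events <- tibble(
  match_id          = c(1, 1, 1, 2, 2, 2),
  team.name         = c("Team A", "Team A", "Team B", "Team A", "Team B", "Team B"),
  type.name         = c("Shot", "Pass", "Shot", "Shot", "Shot", "Pass"),
  shot.outcome.name = c("Goal", NA, "Saved", "Saved", "Goal", NA)
)

# Same totals as before, divided by the number of matches each team played
shots_goals_per_game <- events %>%
  group_by(team.name) %>%
  summarise(
    shots = sum(type.name == "Shot", na.rm = TRUE) / n_distinct(match_id),
    goals = sum(shot.outcome.name == "Goal", na.rm = TRUE) / n_distinct(match_id)
  )

print(shots_goals_per_game)
```

n_distinct(match_id) counts each match once per team, so the division works even though every match contributes many event rows.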
This relabels the shots axis.
Now we are telling ggplot to format it as a bar chart.
This removes the title for the axis.
Here we cut down on the space between the bars and the edge of the plot.
This flips the entire plot, with the bars now going horizontally instead.
theme_SB() is our own internal visual aesthetic for ggplot charts that we have packaged with StatsBombR.
Optional of course.
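The steps described above can be put together roughly as follows. This is a minimal sketch rather than the exact chart code: the team totals are made up, the object name shot_chart is our own, and ggplot2’s built-in theme_minimal() is swapped in for theme_SB() so that it runs without StatsBombR installed.

```r
library(ggplot2)

# Made-up totals standing in for the shots_goals table from Use Case 1
shots_goals <- data.frame(
  team.name = c("Team A", "Team B", "Team C"),
  shots     = c(12, 9, 15)
)

shot_chart <- ggplot(shots_goals, aes(x = reorder(team.name, shots), y = shots)) +
  geom_bar(stat = "identity", width = 0.5) +   # draw the totals as bars
  labs(y = "Shots") +                          # relabel the shots axis
  theme_minimal() +                            # stand-in for theme_SB()
  theme(axis.title.y = element_blank()) +      # drop the title of the team axis
  scale_y_continuous(expand = c(0, 0)) +       # trim space between bars and plot edge
  coord_flip()                                 # bars run horizontally

print(shot_chart)
```

The complete theme is added before the theme() tweak here, since a complete theme added afterwards would overwrite the removed axis title.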
Data Use Case 2: From Data to a Chart
All that should result in a chart like this.
https://ggplot2.tidyverse.org/reference/
Data Use Case 3: Player Shots Per 90
player_shots = StatsBombData %>%
  group_by(player.name, player.id) %>%
  summarise(shots = sum(type.name=="Shot", na.rm = TRUE)) # 1

player_minutes = get.minutesplayed(StatsBombData)

1: Much the same as the team calculation. We are including ‘player.id’ here as it will be important later.
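From here, one way to get to a per 90 figure is to total each player’s minutes, join that onto the shots table by player.id, and divide. The following is a minimal sketch on made-up numbers. It assumes the minutes table has one row per player per match with a column called MinutesPlayed; check the actual output of get.minutesplayed() with names() before relying on that name.

```r
library(dplyr)

# Made-up shot totals for two hypothetical players
player_shots <- tibble(
  player.id   = c(1, 2),
  player.name = c("Player A", "Player B"),
  shots       = c(6, 3)
)

# Made-up minutes, one row per player per match (column name is an assumption)
player_minutes <- tibble(
  player.id     = c(1, 1, 2),
  MinutesPlayed = c(90, 90, 45)
)

shots_per90 <- player_minutes %>%
  group_by(player.id) %>%
  summarise(minutes = sum(MinutesPlayed)) %>%   # total minutes per player
  left_join(player_shots, by = "player.id") %>% # attach shot totals
  mutate(nineties    = minutes / 90,            # how many full 90s were played
         shots_per90 = shots / nineties)

print(shots_per90)
```

Joining on player.id rather than player.name is why the id was carried through the earlier group_by.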
The pitch-plotting package we’ll be using here comes courtesy of FC rStats, a Twitter user who has put together
various helpful, public R packages for parsing football data. The package is called ‘SBpitch’ and it does exactly
what it says on the tin. There are further options under ‘Other Useful Packages’ at the end of this document.
First let’s get it installed with the following code:
devtools::install_github("FCrSTATS/SBpitch")
We’re going to plot Messi’s completed passes into the box for the 2005/2006 La Liga season. Plotting all of his
passes would get messy of course, so this is a clearer subset. Make sure you’ve used the functions previously
discussed to pull that data.
Data Use Case 4: Plotting Passes
library(SBpitch)

passes = messidata %>%
  filter(type.name=="Pass" & is.na(pass.outcome.name) & player.id==5503) %>% # 1
  filter(pass.end_location.x>=102 & pass.end_location.y<=62 & pass.end_location.y>=18) # 2

create_Pitch() +
  geom_segment(data = passes, aes(x = location.x, y = location.y,
                                  xend = pass.end_location.x, yend = pass.end_location.y),
               lineend = "round", size = 0.6, arrow = arrow(length = unit(0.08, "inches"))) + # 3
  labs(title = "Lionel Messi, Completed Box Passes", subtitle = "La Liga, 2005/2006") + # 4
  scale_y_reverse() + # 5
  coord_fixed(ratio = 105/100) # 6

1: Pull some of the Messi data of your choice and call it ‘messidata’ for us to work with here. Then we can
filter to Messi’s passes. is.na(pass.outcome.name) filters to only completed passes, since a completed pass has
no outcome recorded.
2: Filtering to passes within the box. The coordinates for pitch markings in SBD can be found in our event spec.
3: This creates an arrow from one point (location.x/y, the start of the pass) to an end point
(pass.end_location.x/y, the end of the pass). lineend, size and length are all customization options for the
arrow.
4: Creates a title and a subtitle for the plot. You can also add captions using caption =, along with other
options.
5: Reverses the y axis. Otherwise the data would be plotted on the wrong side of the pitch.
6: Fixes the plot to a certain aspect ratio of your choice, so it doesn’t look stretched.
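To see what the box filter keeps, here is a minimal sketch on a few made-up pass end locations (not real data), using the same thresholds as the code above.

```r
library(dplyr)

# Made-up pass end locations on StatsBomb-style pitch coordinates
pass_ends <- tibble(
  pass.end_location.x = c(110, 95, 115, 105),
  pass.end_location.y = c(40, 40, 70, 18)
)

# Keep passes ending at x >= 102 and between y = 18 and y = 62 (inclusive)
box_passes <- pass_ends %>%
  filter(pass.end_location.x >= 102 &
           pass.end_location.y <= 62 &
           pass.end_location.y >= 18)

print(box_passes)
```

The second row is dropped for ending short of the box and the third for ending too wide, while a pass ending exactly on the y = 18 line is kept because the comparisons are inclusive.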
You’ll have this plot. Again, it’s simple and
bare but it starts you off and from here you
can layer on all sorts of customization.
allclean() - Mentioned previously but to elucidate: this extrapolates lots of new, helpful columns from the
pre-existing columns. For example, it takes the location column and splits it up into separate x/y columns. It
also extracts freeze frame data and goalkeeper information. Make sure to use it.
get.playerfootedness() - Gives you a player’s assumed preferred foot using our pass footedness data.
get.opposingteam() - Returns an opposing team column for each team in each match.
get.gamestate() - Returns information for how much time each team spent in various game states
(winning/drawing/losing) for each match.
The community around R is packed with packages that fulfill all sorts of needs. Chances are that, if you’re
looking to do something in R or fix some sort of issue, there’s a package out there for it. There are far too many
to name but here’s a brief selection of some that may be relevant to working with StatsBomb Data:
Ben Torvaney, ggsoccer - A package that contains an alternative for plotting a pitch with SB Data.
Joe Gallagher, soccermatics - Also offers an option for pitch plotting along with other useful shortcuts for
creating heatmaps and so on.
ggrepel - Useful for when you’re having issues with overlapping labels on a chart.
gganimate - If you ever feel like getting more elaborate with your graphics, this gives you a simple way to
create animated ones within R and ggplot.
Hope You Enjoy the Data!
Any questions: