Data Analysis Using Logistic Regression
EXCEL
Logistic regression is one of the most widely used predictive modelling techniques. In this
book we will learn how to use logistic regression to aid decision making.
We will use data from our favourite sport, Cricket, to illustrate the application of logistic
regression in decision-making situations.
HOW WILL THIS GUIDE HELP ME?
The purpose of this guide is to demonstrate a step-by-step approach to data analysis using
data from the sport of Cricket. You will learn how to handle a data set, how to become
intimate with it, run descriptive analytics and build predictive models using logistic
regression on it, and draw insights from the results to guide your decisions.
HOW DO I USE THIS GUIDE?
The data set analyzed in this guide is available for free download. In order to get the full
benefit from this guide, you should download this data set and perform the steps
illustrated in each chapter before moving on to the next one.
Table of Contents
How will this guide help me?
How do I use this guide?
Introduction
Who is the greatest ODI batsman India has ever produced?
Problem Definition
Sachin, Sourav, Rahul
Data Exploration
What is the available information?
What kind of questions can I answer using this data?
Business application of Data exploration
Data Exploration Step 2
How much data is there?
What does the data represent?
Examining all variables
EXERCISE
Data Preparation
Cleaning the Opposition field
Cleaning the Runs field
Cleaning up the Results field
Data preparation in business analytics
EXERCISE
Descriptive Analytics
EXERCISE
Predictive Modelling
An introduction to Regression
INTRODUCTION
We chose Cricket as our analytics case study for two reasons. The first reason is
that a majority of the readers of this e-book will be Cricket fans. You will be able to
relate to the problems we attempt to solve in this book. In many cases you will already
have gut-based opinions on the topics we discuss. You will find it interesting to see if
analytics verifies or diverges from your gut.
For the purpose of this book we will be analysing the performance of some of India's top
ODI batsmen, with a focus on the batting genius Sachin Tendulkar.
WHO IS THE GREATEST ODI BATSMAN INDIA HAS EVER PRODUCED?
This is a debate that has raged many a time across India, from water coolers to drawing
rooms to canteens to social media, and is unlikely to have a conclusive or decisive end.
There are many reasons why this debate is often inconclusive, not least being the
completely different and arbitrary set of criteria used by people to back the player they
rate supreme. "Greatest" as a term is open to many interpretations and, having been
witness to and often been a part of many such debates, I figured this needed an objective
approach.
Being data scientists, we thought of using a purely statistical and data-driven approach to
answer this question.
And like any statistical research, step one involved clearly defining the research objective.
Statistics    Sachin    Sourav    Dravid
Innings       431       292       307
Runs          17,742    11,255    10,536
Of course, for each I found plenty of backers willing to argue their case:
"I think Dada is the best because of the way he ripped apart the bowlers before they
started to bowl short at him."
"I think Dravid is the best because he is such a joy to watch. Every innings of his is pure
class."
"Sachin has scored 49 ODI centuries and was the first player ever to hit a double hundred
in ODIs. Of course he is the best. No question about it."
Others have put forward the names of Sehwag and Dhoni, and even Virat Kohli's name has
started creeping in, but none of them is near 10,000 ODI runs in overall
contribution, and that is the first statistic that eliminated them from this research.
So now we have restated the objective and defined the scope of our analysis as well:
"Amongst those who have scored more than 10,000 runs in ODIs, which batsman has had
the most impact on India's win-rate through the runs they have scored?"
Now that we have defined the scope of our analysis in very precise terms, we will explore
the data that is available to us.
Match Id    Opposition      Ground       Start Date   Runs   Result   Margin      BR   Toss   Bat
ODI # 593   v Pakistan      Gujranwala   18-Dec-89    0      lost     7 runs      -    won    2nd
ODI # 612   v New Zealand   Dunedin      01-Mar-90    0      lost     108 runs    -    won    2nd
ODI # 616   v New Zealand   Wellington   06-Mar-90    36     won      1 runs      -    won    1st
ODI # 623   v Sri Lanka     Sharjah      25-Apr-90    10     lost     3 wickets   4    lost   1st
ODI # 625   v Pakistan      Sharjah      27-Apr-90    20     lost     26 runs     -    won    2nd
ODI # 634   v England       Leeds        18-Jul-90    19     won      6 wickets   12   won    2nd
ODI # 635   v England       Nottingham   20-Jul-90    31     won      5 wickets   12   won    2nd
For our analysis, we will need to download the data for all 3 batsmen under consideration,
i.e. Sachin, Sourav and Rahul. We will illustrate the data exploration and preparation
steps for Sachin's data only. The same process will then be repeated for the other two as
well.
In all, this is pretty good information. If we look at the first row of the data, it tells us
about the game with Match Id 593.
Match Id    Opposition   Ground       Start Date   Runs   Result   Margin   BR   Toss   Bat
ODI # 593   v Pakistan   Gujranwala   18-Dec-89    0      lost     7 runs   -    won    2nd

India played against Pakistan at Gujranwala on 18 Dec 1989. India won the toss, decided
to field and, while chasing, fell short of the target by 7 runs. Sachin got out for a duck in
this game.
Similarly, the second field "Ground" helps us add the venue dimension to analysing
Sachin's performance.
Example questions:
What is Sachin's average at different venues?
Where has he scored the most centuries?
Where has he scored the most half-centuries?
Where does he have the highest win rate?
"Start Date" tells us when the game was played. It provides the time dimension to the
analysis.
Example questions:
The field "Runs" is important for obvious reasons. This variable is a measure of Sachin's
performance in a game.
Note that all the other variables are used as dimensions, i.e. they are a means to slice
and dice the data for the measure "Runs". For example, we can look at Sachin's total
runs scored or average runs scored by Opposition. "Opposition" here is the dimension and
we are slicing the data along this dimension. "Runs", on the other hand, is a measure.
The field "Result" gives the result of that particular game. We use this field as another
dimension in the analysis.
Example questions:
What is Sachin's average when team India wins a game vs. when they lose it?
How many centuries has Sachin scored in India's victories vs. losses?
The field "Margin" is a slightly tricky one. It gives the margin of victory in runs when
the team batting first wins, and in wickets when the team batting second wins the game. This
field will need some transformation for it to be used effectively. If required, we will come
to that in the data preparation stage.
Similarly, for the field "BR".
The fields "Toss" and "Bat" also add dimensions to our analysis. We can analyse Sachin's
performance when India wins the toss vs. when they lose it, and when they bat first vs.
when they bat second.
Note one thing here. We had mentioned that the field "Runs" is a measure and all other
fields are dimensions. Well, that's not entirely correct. Even the field "Result" can be
used as a measure depending on what we are analysing. For example, consider the
question "What is India's win rate when they win the toss vs. when they lose it?" In this
case, the field "Result" is the measure and the field "Toss" is the dimension.
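The distinction between dimensions and measures can be sketched in R, the language used later in this guide. This is only an illustration: the data frame games and its handful of rows are made up for the example, and the real data set would be read from the downloaded file.

```r
# Toy rows in the shape of the data set; the values are illustrative only.
games <- data.frame(
  Opposition = c("Pakistan", "England", "England", "Pakistan"),
  Toss       = c("won", "won", "lost", "lost"),
  Result     = c("lost", "won", "won", "lost"),
  Runs       = c(0, 19, 31, 20),
  stringsAsFactors = FALSE
)

# Measure = Runs, dimension = Opposition: average runs against each team.
avg_runs_by_opposition <- tapply(games$Runs, games$Opposition, mean)

# Measure = Result, dimension = Toss: share of games won for each toss outcome.
win_rate_by_toss <- tapply(games$Result == "won", games$Toss, mean)

print(avg_runs_by_opposition)
print(win_rate_by_toss)
```

The same tapply call switches from one question to another simply by changing which column is the measure and which is the dimension.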
In this section, we have completed the first step in data exploration. We have identified
the information contained in the data set. We have looked at each field and understood its
definition. We have also looked at several examples of questions that we can answer with
this data.
It is advisable to spend plenty of time on the data dictionary. The analyst needs to be
comfortable with the definition of all the variables before proceeding any further with the
analysis.
Since the first row contains the headers, this means there are 463 rows of data. Each row
represents one match, so we have data on 463 matches.
We know that 18-Dec 1989 was Sachin's debut game. We can confirm that Sachin has
played 463 games from then till 18-Mar 2012.
This implies that we have data on all of Sachin's games from his debut till 18-Mar 2012.
We have now established that in our data set, we have data on 463 games. This
represents all the games that Sachin has played for India from his debut till 18-Mar
2012.
Figure 1 Match Id
that we have 2 different values representing the same thing. For example, "U.A.E." could also
be written as "UAE" (without the dots) in some rows, and we would then need to change some
entries to make them consistent. We could easily do a "Replace All" in Excel to change all
"UAE" values to "U.A.E.".
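For readers who prefer to do this check outside Excel, here is a minimal R sketch of the same idea. The vector opposition is a made-up example, not the actual column.

```r
# Made-up sample of the Opposition column with two spellings of one team.
opposition <- c("U.A.E.", "UAE", "Pakistan", "UAE")

# Listing the distinct values plays the role of the Excel filter drop-down.
print(unique(opposition))

# Replace only cells that are exactly "UAE", leaving other values untouched.
opposition[opposition == "UAE"] <- "U.A.E."

print(table(opposition))
```

Matching whole cells rather than substrings avoids accidentally altering other team names that happen to contain the same letters.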
In this manner, I scan all the fields and make a note of all the points that need to be
worked on. Let us now move to the next stage, i.e. the data preparation stage. This is
where we will manipulate and transform the data into the format we want.
EXERCISE
Download the data by clicking on this link: Cricket data for Sachin, Sourav and Rahul
Perform the following steps on the data for Sourav and Rahul
1. Open the data in Excel
2. Examine the data. How many games' worth of data are there for each of these
players?
3. Examine all the variables independently using the filter option and make a note of
the changes you would need to make on the data in the data prep stage.
On the next screen, I simply click on the space between "v" and the
opposition name, and a line appears between the two signifying a
break.
I click Finish and I now have the data broken into 2 columns. The
original column contains all the "v"s and the column on the right
now contains all the opposition names without the "v".
With a little bit of cleaning, I now have my "Opposition" field in the
format I want.
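The same clean-up can be sketched in R for comparison. The sample values below are taken from the first few rows of the data; the variable names are our own.

```r
# Raw values as they appear in the Opposition field.
opposition_raw <- c("v Pakistan", "v New Zealand", "v Sri Lanka")

# Drop the leading "v " only when it occurs at the start of the value.
opposition <- sub("^v ", "", opposition_raw)

print(opposition)
```

Anchoring the pattern with ^ means a letter "v" inside a team name is never touched.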
It is a good idea to make a note of all the changes we are making. I have noted that we
have deleted data on 11 games here. In these 11 games, Sachin did not bat and hence this
data was not useful for our analysis.
The next thing I noticed in the "Runs" field is the presence of a number of entries where
the score is followed by an asterisk (*) sign. This is a common convention to denote a
not-out score. In all these innings, Sachin stayed not out at the end. There are 41 such
innings in our data set.
What should we do with this issue? Removing the asterisk at the end is fairly simple in
Excel, but before we do that we need to carefully understand the implications of that on
our analysis. Converting a score of 40* to just 40 means that we are saying that the impact
from Sachin's runs remains the same whether he scores 40* or gets out at 40.
I think this is a fair assumption. Since we are measuring a batsman's impact solely through
the runs they have scored, it is OK to discard the information on whether the batsman got
out or not.
Having gone through this exercise mentally, I think it is fine to go ahead with this
approach. I now proceed to remove the asterisk at the end.
I will again insert a column to the right of this field and use the "Text to Columns" function.
This time I choose the "Delimited" option.
When I click Next, I am asked to choose the character that I want to use as a delimiter.
I choose the "Other" option, enter the character *, and click on Finish.
This tells Excel to treat every asterisk as a delimiter, keep the
content to the left of it in the original cell and move the content to the right of it into the
cell on the right. In our case, this simply eliminates the * from this field.
We have now cleaned up the Runs field in our data set and made it amenable to
mathematical operations that we will perform in the next step.
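As a rough R equivalent of the two clean-up steps on the "Runs" field, consider the sketch below. The marker "DNB" for games where the batsman did not bat is an assumption made for this example; the actual file may use a different label.

```r
# Made-up sample of the Runs field; "DNB" is an assumed "did not bat" marker.
runs_raw <- c("0", "36", "40*", "DNB", "100*")

# Keep only entries that are a number, optionally followed by an asterisk.
batted <- grepl("^[0-9]+\\*?$", runs_raw)

# Remember which innings were not out before the asterisk is discarded.
not_out <- grepl("\\*$", runs_raw[batted])

# Strip the trailing asterisk and convert the text to numbers.
runs <- as.numeric(sub("\\*$", "", runs_raw[batted]))

print(runs)
print(sum(!batted))   # number of games dropped because the batsman did not bat
```

Keeping the not_out flag is optional for this analysis, but it makes the discarded information easy to recover if the assumption is revisited later.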
This brings us to the end of data preparation. Before we proceed any further, it is
important to summarize what we have done here.
1. We cleaned up the "Opposition" field by removing the "v" before each team name.
2. We cleaned up the "Runs" field by removing all games where Sachin did not bat.
We deleted 11 games this way.
3. We removed the * at the end of scores where Sachin did not get out. Our data now
does not differentiate between innings where Sachin was out and innings where he
wasn't.
4. We removed all games where the result was not a straight win or loss. We removed
an additional 21 games this way.
5. We started with 463 games and now we are considering only 431 games for our
actual analysis.
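The last two points of this summary can be checked with a short R sketch. The labels "tied" and "n/r" are assumed spellings for results that are not a straight win or loss; only "won" and "lost" appear in the sample rows shown earlier.

```r
# Made-up sample of the Result field; "tied" and "n/r" are assumed labels.
results <- c("won", "lost", "tied", "n/r", "won")

# Keep only straight wins and losses.
keep <- results %in% c("won", "lost")
print(results[keep])

# Reconcile the counts reported above: 463 games, minus 11 where Sachin
# did not bat, minus 21 without a straight result, leaves 431.
games_remaining <- 463 - 11 - 21
print(games_remaining)
```

Writing down this reconciliation makes it easy to spot a mistake if the row count after cleaning does not match.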
Deriving variables is also a part of data preparation. Sometimes we need to create new
variables from the existing ones for the purpose of our analysis. For example, if we need
to create a "Year" variable, we can derive that from the "Start Date" variable. We could
also derive the "Country" where the match was played from the "Ground" field. This will
involve creating a separate lookup table which maps venues to countries.
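Both derived variables can be sketched in R as follows. The helper year_from_start_date and the small lookup vector are our own illustrations; a full lookup table would need one entry per ground in the data. The year rule assumes all dates fall between 1989 and 2012, as they do for Sachin's data.

```r
# Sample values in the format used by the data set (dd-Mon-yy).
start_date <- c("18-Dec-89", "06-Mar-90", "25-Apr-90")
ground     <- c("Gujranwala", "Wellington", "Sharjah")

# Take the two-digit year after the last hyphen and expand it to four digits.
# Assumes the data spans 1989 to 2012, so 89-99 means 19xx and the rest 20xx.
year_from_start_date <- function(d) {
  yy <- as.integer(sub(".*-", "", d))
  ifelse(yy >= 89, 1900 + yy, 2000 + yy)
}
year <- year_from_start_date(start_date)

# Lookup table mapping each ground to its country (only three entries here).
ground_to_country <- c(Gujranwala = "Pakistan",
                       Wellington = "New Zealand",
                       Sharjah    = "U.A.E.")
country <- unname(ground_to_country[ground])

print(data.frame(start_date, year, ground, country))
```

A ground missing from the lookup would come back as NA, which is a convenient way to find venues that still need to be mapped.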
Data preparation is an important part of any analysis but it becomes even more important
when dealing with complex business data. Effective data preparation increases the
strength of the predictive models by harnessing the power of the available data in the
most efficient manner.
Now that we have prepared the data, we are ready for the next stage, i.e. predictive
modelling. But before we get into that, it is a good time to perform some descriptive
analytics on the data.
EXERCISE
Perform the following steps on the data for Sourav and Rahul
1. Clean up the "Opposition" field by removing the "v" before each team name.
2. Clean up the "Runs" field by removing all games where the batsman did not bat.
3. Remove the * at the end of scores where the batsman did not get out.
4. Remove all games where the result was not a straight win or loss.
5. Make a note of the total number of games you started with and what you are left
with for further analysis.
Descriptive analytics like this helps an analyst understand the data better. It also helps
her spot anything unusual, anything that requires further investigation.
Descriptive analytics is a useful tool to understand the data, generate insights and spot
unusual occurrences that require further investigation.
EXERCISE
Descriptive analytics offers unlimited ways of analysing any kind of data. You are only
limited by your imagination. Here are some things you can do with your data at this stage.
1. Analyze the batsman's performance over time: total runs scored by calendar year
and average runs scored by calendar year
2. Analyze the batsman's performance by opposition, by venue (home and away) etc.
3. Create and examine the distribution of scores
What you will find is that in most cases descriptive analytics will confirm your belief or
intuition. But in a few cases, every once in a while, you will find patterns or insights that
you did not know or that run counter to your intuition. These counter-intuitive or hidden
insights are what make descriptive analytics such a valuable tool.
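The exercise items above could be approached in R along these lines. The data frame innings and its values are invented for the illustration, and the score bands are an arbitrary choice.

```r
# Invented innings with a derived Year column.
innings <- data.frame(
  Year = c(1990, 1990, 1991, 1991, 1991),
  Runs = c(10, 20, 0, 62, 104)
)

# Total and average runs scored by calendar year.
total_by_year   <- tapply(innings$Runs, innings$Year, sum)
average_by_year <- tapply(innings$Runs, innings$Year, mean)

# Distribution of scores: bands are [0,25), [25,50), [50,100) and 100 or more.
score_bands  <- cut(innings$Runs, breaks = c(0, 25, 50, 100, Inf), right = FALSE)
distribution <- table(score_bands)

print(total_by_year)
print(average_by_year)
print(distribution)
```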
There does seem to be a general trend of improvement in India's win rate as Sachin's
scores become higher.
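One way to look for such a trend is to group the innings into score bands and compute the win rate in each band. The sketch below uses invented scores and results, and the band boundaries are our own choice, so it shows the mechanics rather than the actual numbers.

```r
# Invented scores and match results, one entry per innings.
runs   <- c(5, 12, 48, 51, 77, 110, 3, 95)
result <- c("lost", "lost", "won", "lost", "won", "won", "won", "won")

# Group scores into bands: below 50, 50 to 99, and 100 or more.
band <- cut(runs, breaks = c(0, 50, 100, Inf), right = FALSE,
            labels = c("0-49", "50-99", "100+"))

# Win rate per band is the share of innings in that band that ended in a win.
win_rate <- tapply(result == "won", band, mean)

print(win_rate)
```

If the win rate rises from band to band on the real data, that supports modelling the result as a function of runs scored.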
TYPES OF REGRESSION
There are many types of regression techniques that are applied by statisticians depending
on the nature of the problem and the variables involved. Linear and logistic are two of the
most popular ones.
Linear regression assumes a linear relationship between the dependent and the
independent variable. If the relationship between Sachin's score and India's win-rate
could be quantified with a straight line, then linear regression would be a suitable
modelling technique.
In our problem, we have seen in the previous graphs that the relationship between our
dependent and independent variable is not exactly linear.
Further, the variable that we are trying to predict, i.e. the outcome of the game, is a
binary variable (win/loss). In our case, a technique called logistic regression is more
suitable. Logistic regression does not need a linear relationship between the probability
of the outcome and the independent variables. Instead, it assumes that the log of the odds
of the outcome is linear in the independent variables, which gives an S-shaped curve for
the probability that always stays between 0 and 1.
LOGISTIC REGRESSION
In this book, we will not go into the mathematical details of logistic regression. Instead we
are going to focus on its application for a given problem.
The result of a logistic regression model is an equation in this format:
Log[p/(1-p)] = a + bx
Here Log is the natural logarithm, p is the probability of India winning the game and x is
the number of runs scored by the batsman.
Let us interpret this equation in the context of our problem.
The model will generate 2 values: a (the intercept) and b (the coefficient of x).
Using the equation above, we can calculate the value of p for any given value of x.
We first calculate the value of Log[p/(1-p)] by putting in the values of a, b and x. Let us
call this Y.
Log[p/(1-p)] = Y
We can then use the antilog or exponent function to calculate the value of p/(1-p).
p/(1-p) = exp(Y)
From there we can easily calculate the value of p as well.
p = exp(Y)/(1 + exp(Y))
p, if you remember, is the probability of India winning the game. Thus, using a logistic
model, for any given value of x, we are able to calculate p, India's predicted win-rate.
This is how we interpret the results of the logistic regression model.
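The calculation above can be written as a small R function. The function name win_probability and the values of a and b used here are placeholders chosen for illustration, not estimates from the data.

```r
# Convert runs scored (x) into a predicted win probability,
# following the steps Y = a + bx and p = exp(Y) / (1 + exp(Y)).
win_probability <- function(x, a, b) {
  y <- a + b * x
  exp(y) / (1 + exp(y))
}

# Placeholder coefficients, for illustration only.
a <- -0.5
b <- 0.02

print(win_probability(c(0, 25, 100), a, b))

# R's built-in plogis() computes the same transformation.
print(plogis(a + b * c(0, 25, 100)))
```

With these placeholder values, a score of 25 gives Y = 0 and therefore a win probability of exactly 0.5, which is a handy sanity check on the formula.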
Now we need to find the values of a and b so that we can calculate the probability p for
any given x (runs scored).
Once we read in the data, we can quickly run summary statistics on the data using a
simple command.
summary(data.frame.sachin)
As you can see, this command creates 6 measures for each field. The min value, the 25th
percentile value, the 50th percentile value, the 75th percentile value, the max value and
the mean.
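Here is a minimal sketch of reading data and summarising it. To keep the example self-contained, the rows are supplied as inline text (a few scores and results from the sample rows shown earlier); with a real file you would pass the file name to read.csv instead, for example a hypothetical "sachin.csv".

```r
# A few rows in CSV form; in practice this would be a file on disk.
csv_text <- "Runs,Result
0,lost
36,won
10,lost
20,lost
19,won
31,won"

# Read the data into a data frame.
data.frame.sachin <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Summary statistics for every field, then for Runs alone.
print(summary(data.frame.sachin))
print(summary(data.frame.sachin$Runs))
```

For a text field such as Result, summary only reports the length and type, so the six measures are meaningful for numeric fields like Runs.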
The argument family = binomial is important. The glm command in R can fit several
kinds of generalised linear models. family = binomial tells R to fit a logistic regression.
Since we have assigned a name to our model (smodel), R will not print the results of the
model. For this we again use the summary command.
summary(smodel)
a is the estimate of the intercept and it has a value of -0.258734 as per our model.
b is the coefficient estimate for Runs and its value is 0.010062.
We now have information to solve the regression equation and can calculate the value of p
for each value of X.
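The modelling steps described in this section can be sketched as follows. The names smodel and data.frame.sachin follow the text; the column name Win (1 for a win, 0 for a loss) is an assumption, and the data here is simulated purely so that the example runs on its own. The estimates it prints will therefore not match the ones quoted above.

```r
# Simulated stand-in for the prepared data: Runs and a 0/1 Win indicator.
set.seed(1)
Runs <- round(runif(200, 0, 120))
Win  <- rbinom(200, 1, plogis(-0.3 + 0.01 * Runs))
data.frame.sachin <- data.frame(Runs = Runs, Win = Win)

# Fit the logistic regression; family = binomial selects the logistic model.
smodel <- glm(Win ~ Runs, data = data.frame.sachin, family = binomial)

# Print the model results, including the coefficient estimates.
print(summary(smodel))

# Extract a (intercept) and b (coefficient for Runs) for later calculations.
a <- coef(smodel)[["(Intercept)"]]
b <- coef(smodel)[["Runs"]]
print(c(a = a, b = b))
```

On the real data, the same two coef() lines give the values of a and b needed to calculate p for any score.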
EXERCISE
For this exercise you should have first prepared the data as described in the previous
section.
Save the newly prepared data for both Sourav and Rahul into separate csv sheets.
Read the data sets into R
Get summary statistics for the data using the summary function
Perform logistic regression on the data using the glm function.
Summarize the results of the model using the summary command
Note the estimates for the intercept and the variable Runs
This means that if Sachin scores a duck (0 runs), India's predicted probability of winning
that game is 44%.
X = Runs   Y = a + bx   p      Increment   Cumulative Increment
0          -0.258734    0.44   0           -
1          -0.248672    0.44   0.25%       0.25%
2          -0.23861     0.44   0.25%       0.50%
3          -0.228548    0.44   0.25%       0.74%
4          -0.218486    0.45   0.25%       0.99%
5          -0.208424    0.45   0.25%       1.24%
6          -0.198362    0.45   0.25%       1.49%
7          -0.1883      0.45   0.25%       1.74%
8          -0.178238    0.46   0.25%       1.99%
9          -0.168176    0.46   0.25%       2.24%
10         -0.158114    0.46   0.25%       2.49%
11         -0.148052    0.46   0.25%       2.74%
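A table like the one above can be generated in R from the two coefficients reported by the model. The column names in the data frame are our own; the coefficient values are the ones quoted earlier for Sachin's model.

```r
# Coefficients from Sachin's model as reported above.
a <- -0.258734
b <- 0.010062

x <- 0:11                    # runs scored
y <- a + b * x               # Y = a + bx
p <- exp(y) / (1 + exp(y))   # predicted win probability

# Gain in win probability from each extra run, and the gain relative to a duck.
increment  <- c(0, diff(p))
cumulative <- p - p[1]

probability_table <- data.frame(
  Runs          = x,
  Y             = y,
  p             = round(p, 2),
  Increment_pct = round(100 * increment, 2),
  Cumul_pct     = round(100 * cumulative, 2)
)
print(probability_table)
```

Extending x to cover every possible score gives the full curve of predicted win probability against runs.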
Ganguly's line (the green line) is at the top, which means Ganguly's runs have the highest
impact on India's win rate.
In other words, if everything else remains constant, if Ganguly scores 60 runs, it is likely to
improve India's chances of winning more than if either Sachin or Rahul had scored the
same number of runs (60).
Putting it another way, if India were in the World Cup finals and you had to pray for one
batsman's success (out of these 3), you should be praying for a high score from Ganguly, as
that will improve India's chances of winning more than if Sachin or Rahul were to hit that
same high score.
As an example, Rahul Dravid scored 69 runs in his last innings. As per our model, his
innings improved India's chances of winning from 35% to 57%, an improvement of
22 percentage points. Thus Rahul's contribution through that innings was 22%. If we repeat
this process for all his innings and calculate the average contribution per innings, this
would represent Rahul's average contribution to India's victories over his entire career.
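The contribution calculation can be sketched in R as below. Since the coefficients of Rahul's model are not listed in this guide, the sketch uses the coefficients quoted for Sachin's model, and the scores are just the few values from the sample rows shown earlier, so the output is illustrative only.

```r
# Coefficients quoted for Sachin's model; swap in another player's as needed.
a <- -0.258734
b <- 0.010062

# Predicted win probability for a given score.
win_probability <- function(runs) plogis(a + b * runs)

# A few scores from the sample rows; the real input is every innings played.
scores <- c(0, 36, 10, 20, 19, 31)

# Contribution of an innings = win probability at that score
# minus the win probability had the batsman scored a duck.
contribution <- win_probability(scores) - win_probability(0)

average_contribution <- mean(contribution)
print(round(100 * contribution, 2))
print(round(100 * average_contribution, 2))
```

Measuring every innings against the zero-run baseline is what makes the contributions of different batsmen comparable.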
Here is a comparison of the 3 Indian stalwarts.
Again, Ganguly has the highest average contribution per innings. For every innings that he
played, the runs scored by him improved India's chances of winning by 13% on average.
In comparison, Rahul's average contribution is 11% and Sachin's is 10%.
Thus, as per our statistical analysis, Ganguly is the player who had the highest average
contribution per innings.
From a purely statistical point of view, if we look at the average contribution per innings as
the defining measure, Ganguly has come out to be the most important contributor and is
therefore the best ODI player for India amongst the 3 highest run getters.
LIFETIME CONTRIBUTION
Is average contribution the best way to measure a player's contribution?
For example, Sachin has played in over 430 games (more than any other batsman in the
world), while Sourav has played only 292 and Dravid 307.
Statistics    Sachin    Sourav    Dravid
Innings       431       292       307
Runs          17,742    11,255    10,536
Is it not better to look at their total contribution over the entire career rather than the
average contribution per game?
Better or not, it is surely a different way to measure the player's contribution. If we
compare the lifetime contribution of the 3 players, it gives a different picture.
Sachin leads the comparison by a fair bit. Ganguly is second and Dravid is a distant third.
Of course, a big thing to consider is that Sachin is still playing while the other two have
retired. Sachin will surely improve his lifetime contribution by the time he retires.
If we look at the lifetime contribution of a player as the defining measure, Sachin has
emerged as the best ODI player for India amongst the 3 highest run getters.
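To see why the two measures can rank players differently, here is a tiny R sketch. We assume that lifetime contribution is simply the sum of the per-innings contributions, and both players and their numbers are hypothetical.

```r
# Hypothetical per-innings contributions (gain in win probability).
player_a <- c(0.20, 0.10, 0.15)                     # few innings, high impact
player_b <- c(0.10, 0.12, 0.08, 0.10, 0.10, 0.10)   # many innings, lower impact

# Average contribution per innings favours player A.
print(c(a = mean(player_a), b = mean(player_b)))

# Lifetime contribution (sum over all innings) favours player B.
print(c(a = sum(player_a), b = sum(player_b)))
```

The average rewards impact per game, while the sum also rewards longevity, which is why a long career can overturn the ranking.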
EXERCISE
Use the coefficients obtained from the models to generate the average and lifetime
contributions for both Sourav and Rahul. Compare with the results here.
Expected results if the model were no good:

Actual result   Predicted Loss   Predicted Win   Grand Total
Loss            100              100             200
Win             115.5            115.5           231
Grand Total     215.5            215.5           431

Results predicted by the model:

Actual result   Predicted Loss   Predicted Win   Grand Total
Loss            114              86              200
Win             87               144             231
Grand Total     201              230             431
We expect the model (if it is no good) to correctly predict 215.5 of the 431 games. The
model actually correctly predicts (114 + 144) = 258 games.
We can thus see that the model seems to be good. It has a (258/431), i.e. 60%, accuracy
when predicting the outcome of the game based on Sachin's score.
We can use the Chi-square test to determine if this difference between what we expected
to see and what we actually see is statistically significant or not.
Using Excel again to run this test, we find a p-value below 0.0001. In other words, if the
model were no better than chance, there would be less than a 0.01% chance of seeing a
difference this large.
We can thus conclude that our model is indeed a good model and we can be fairly
confident about our findings.
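The test can also be reproduced in R from the two tables of counts. This is a sketch under one assumption: that the test is run with 1 degree of freedom, as is usual for a 2 x 2 table. The statistic is computed by hand because the expected counts here come from the "no good" model above rather than from R's own independence calculation.

```r
# Observed counts: rows are actual results, columns are predicted results.
observed <- matrix(c(114, 86,
                     87, 144), nrow = 2, byrow = TRUE,
                   dimnames = list(Actual = c("Loss", "Win"),
                                   Predicted = c("Loss", "Win")))

# Expected counts if the model were no better than chance.
expected <- matrix(c(100, 100,
                     115.5, 115.5), nrow = 2, byrow = TRUE)

# Accuracy: correctly predicted games (the diagonal) over all games.
accuracy <- sum(diag(observed)) / sum(observed)

# Chi-square statistic and its p-value (1 degree of freedom assumed).
chi_square <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi_square, df = 1, lower.tail = FALSE)

print(round(accuracy, 3))
print(round(chi_square, 3))
print(p_value)
```

A p-value this small means the gap between 258 correct predictions and the 215.5 expected by chance is very unlikely to be a fluke.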
EXERCISE
Create the tables with actual and expected values for both players
Run Chi-square tests to check validity of the model
CONCLUSION
In this book, we have looked at an application of logistic regression to get statistical
insights from the data that help us decide who is the best ODI batsman in
India.
PROBLEM DEFINITION
We started with a vaguely defined question: "Who is the greatest ODI batsman India has
ever produced?"
In the first step, we defined the problem in more certain terms. If we want to apply any
kind of analytics on the data, the problem definition needs to be unambiguous and
precise. We changed the definition to "Which batsman has had the most impact on India's
win-rate through the runs they have scored in ODIs?"
Then, in order to limit the scope of the analysis, we added another constraint. We
modified the problem definition to this form: "Amongst those who have scored more
than 10,000 runs in ODIs, which batsman has had the most impact on India's win-rate
through the runs they have scored?"
DATA EXPLORATION
Data exploration was done in 2 steps.
In the first step we identified the information contained in the data set. We looked at
each field and understood its definition. We also looked at several examples of questions
that we can answer with this data.
In the second step, we dug a little deeper to understand the data even better.
DATA PREPARATION
This is the stage where we prepare the data for extensive further analysis. This involves
cleaning up of variables and removing data that is not required for analysis. For Sachin's
analysis, we started with 463 games. We removed the games where the player did not bat
or the result was a tie or no result. We ended with 431 games' worth of data.
Data preparation often involves other important activities like outlier removal, missing
value treatment and variable transformation as well.
DESCRIPTIVE ANALYTICS
Having prepared the data for analysis, we first ran a series of descriptive analytics on the
data. Descriptive analytics like this helps an analyst understand the data better. It also
helps her spot anything unusual, anything that requires further investigation.
PREDICTIVE MODELING
The next step is to build a predictive model that will help us answer the original question.
We used logistic regression to estimate the relationship between Sachin's score and India's
win rate. In other words, we built a model that helps us predict, for a given number of
runs scored by Sachin, the probability of India winning the game. This model is
also able to estimate the increase in the probability of India winning with each additional run
scored by Sachin.
MODEL VALIDATION
We also spent some time validating the model. We used a Chi-square test to determine if
the model is statistically significant or not. We found that our model was highly significant
and thus concluded that the insights generated from the model are indeed valid.
This brings us to the end of this book. We hope that you will find this book a useful
tutorial to get a beginner's understanding of the application of logistic regression to solve
problems and make decisions.