Data Analysis Using Logistic Regression

This document provides an introduction and guide to using logistic regression to analyze cricket data and determine the greatest Indian ODI batsman. It will cover exploring the available cricket statistics, preparing the data by cleaning fields, descriptive analytics, building a logistic regression model in R and Excel, interpreting the results, and validating the model. The purpose is to demonstrate how to analyze a dataset step-by-step to provide objective answers to debates like determining the greatest batsman based on their impact on India's win rate. The analysis will focus on Sachin Tendulkar, Sourav Ganguly, and Rahul Dravid, India's top 3 run scorers in ODI cricket.

Uploaded by

SaurabhVerma


BEGINNER'S GUIDE TO LOGISTIC REGRESSION USING R AND EXCEL
Logistic regression is one of the most widely used predictive modelling techniques. In this
book we will learn how to use logistic regression to aid decision making.
We will use data from our favourite sport, Cricket, to illustrate the application of logistic
regression in decision-making situations.
HOW WILL THIS GUIDE HELP ME?
The purpose of this guide is to demonstrate a step-by-step approach to data analysis using
data from the sport of Cricket. You will learn how to handle a data set, how to become
intimate with it, run descriptive analytics and build predictive models using logistic
regression on it, and draw insights from the results to guide your decisions.
HOW DO I USE THIS GUIDE?
The data set analyzed in this guide is available for free download. In order to get the full
benefit from this guide, you should download this data set and perform the steps
illustrated in each chapter before moving on to the next one.

Table of Contents
How will this guide help me?
How do I use this guide?
Introduction
  Who is the greatest ODI batsman India has ever produced?
Problem Definition
  Sachin, Sourav, Rahul
Data Exploration
  What is the available information?
  What kind of questions can I answer using this data?
  Business application of Data exploration
  Data Exploration Step 2
  How much data is there?
  What does the data represent?
  Examining all variables
  EXERCISE
Data Preparation
  Cleaning the Opposition field
  Cleaning the Runs field
  Cleaning up the Results field
  Data preparation in business analytics
  EXERCISE
Descriptive Analytics
  EXERCISE
Predictive Modelling
  An introduction to Regression
  Types of regression
  Logistic Regression
  Building a logistic regression model
  Reading data into R
  Running a Logistic Model
  EXERCISE
Interpreting the output
  What about the batting average?
  Lifetime Contribution
  EXERCISE
Model Validation
  EXERCISE
Conclusion
  Problem definition
  Data Exploration
  Data Preparation
  Descriptive Analytics
  Predictive Modeling
  Interpreting the results
  Model Validation

INTRODUCTION
We chose Cricket as our analytics case study for two reasons. The first reason is
that a majority of the readers of this e-book will be Cricket fans. You will be able to
relate to the problems we attempt to solve in this book. In many cases you will already
have gut-based opinions on the topics we discuss. You will find it interesting to see if
analytics verifies or diverges from your gut.
For the purpose of this book we will be analysing the performance of some of India's top
ODI batsmen, with a focus on the batting genius Sachin Tendulkar.
WHO IS THE GREATEST ODI BATSMAN INDIA HAS EVER PRODUCED?
This is a debate that has raged many a time across India, from water coolers to drawing
rooms to canteens to social media, and is unlikely to have a conclusive or decisive end.
There are many reasons why this debate is often inconclusive, not least being the
completely different and arbitrary set of criteria used by people to back the player they
rate supreme. "Greatest" as a term is open to many interpretations and, having been
witness to and often been a part of many such debates, I figured this needed an objective
approach.
Being data scientists, we thought of using a purely statistical and data-driven approach to
answer this question.
And like any statistical research, step one involved clearly defining the research objective.

STAGE 1: PROBLEM DEFINITION

Given that "greatest" is a term used in many contexts, the first task was to restate the
question under argument as one that would provide conclusive, objective answers. I
came up with:
"Which batsman has had the most impact on India's win-rate through the runs they
have scored in ODIs?"
The restatement of the problem immediately narrows the discussion to batting
performances only and their impact on wins. To some it's a cruel elimination of factors
like the elegance of a particular cover drive or the ability to pace an innings. To the data
scientist, it is moving the argument to a turf where the conversation stops moving round
and round and instead lurches towards facts that should shape opinions.

SACHIN, SOURAV, RAHUL

Remember, this is still a discussion on who is the greatest of them all. India has produced
a number of ODI cricketers (in fact many think that far too many have worn the cap
without merit) but the discussion of the greatest needs to be limited to a select few.
The first elimination criterion used was the total number of career runs scored. For
further analysis, I zeroed in on the top 3.
Sachin Tendulkar, Sourav Ganguly and Rahul Dravid are India's all-time highest ODI run
getters. Sachin, at 17742 runs, is still going strong while Sourav and Rahul have both retired.

Statistics   Sachin   Sourav   Dravid
Innings      431      292      307
Runs         17742    11255    10536

Of course, for each I found plenty of backers willing to make their case:
"I think Dada is the best because of the way he ripped apart the bowlers before they
started to bowl short at him."
"I think Dravid is the best because he is such a joy to watch. Every innings of his is pure
class."
"Sachin has scored 49 ODI centuries and was the first player ever to hit a double hundred
in ODIs. Of course he is the best. No question about it."
There are others who have quoted the names of Sehwag and Dhoni, and even the name Virat
Kohli has already started creeping in, but none of them is near 10,000 ODI runs in overall
contribution and that is the first statistic that eliminated them from this research.
So now we have re-stated the objective and defined the scope of our analysis as well.

"Amongst those who have scored more than 10000 runs in ODIs, which batsman has had
the most impact on India's win-rate through the runs they have scored?"
Now that we have defined the scope of our analysis in very precise terms, we will explore
the data that is available to us.

STAGE 2: DATA EXPLORATION

Data exploration is an important part of any analysis. It becomes even more important
when dealing with a data set for the first time.
In our case, we first need to identify the data to be used for this analysis. We used the site
www.espncricinfo.com to download the available data.
There is a lot of information available on Cricket players on this website. For the purpose
of our example, we will consider a small sample of the available information.
Our analysis table contains 10 fields. Here is a snippet of the data set.
Match Id   Opposition     Ground      Start Date  Runs  Result  Margin     BR  Toss  Bat
ODI # 593  v Pakistan     Gujranwala  18-Dec-89   0     lost    7 runs         won   2nd
ODI # 612  v New Zealand  Dunedin     01-Mar-90         lost    108 runs       won   2nd
ODI # 616  v New Zealand  Wellington  06-Mar-90         won     1 runs         won   1st
ODI # 623  v Sri Lanka    Sharjah     25-Apr-90   10    lost    3 wickets  4   lost  1st
ODI # 625  v Pakistan     Sharjah     27-Apr-90   20    lost    26 runs        won   2nd
ODI # 634  v England      Leeds       18-Jul-90   19    won     6 wickets  12  won   2nd
ODI # 635  v England      Nottingham  20-Jul-90         won     5 wickets      won   2nd

For our analysis, we will need to download the data for all 3 batsmen under consideration,
i.e. Sachin, Sourav and Rahul. We will illustrate the data exploration and preparation
steps for Sachin's data only. The same process will then be repeated for the other two as
well.

WHAT IS THE AVAILABLE INFORMATION?

The first step in data exploration is to understand the information available to us. Let us
spend some time on our data set.
The first field, "Match Id", is a unique identifier for each ODI game. We can see that each
row in the data has a unique Match Id. This means that each row in our data corresponds
to one game. The first row in the data corresponds to ODI # 593. You can see that it
refers to Sachin's debut game against Pakistan.
The second field, "Opposition", is self-explanatory. The opposition in this match was
Pakistan. The third field, "Ground", tells us where the match was held. The field "Start
Date" gives us the date of the match. "Runs" is the number of runs scored by the
batsman (Sachin Tendulkar) in that game. Next we have "Result", the result of the game.
"Margin" gives us the margin of victory. If the team batting first won the game, then this
field gives us the number of runs they won the game by. If the team batting second won the
game, this field tells us the number of wickets they won by. The field "BR" is populated
only in cases where the team batting second won the game. It gives the number of balls
remaining when the victory was achieved. "Toss" tells us whether India won or lost the toss.
The final field, "Bat", tells us if India batted first or second.
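To make the ten fields concrete, here is a minimal sketch of one row as a record. It is written in Python rather than the Excel used in this guide, and the class name MatchRow and its attribute names are our own illustrative choices, not part of the data set. The values are those of the first row of the snippet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchRow:
    """One row of the data set: one ODI game (attribute names are illustrative)."""
    match_id: str      # unique identifier, e.g. "ODI # 593"
    opposition: str    # team India played against
    ground: str        # where the match was held
    start_date: str    # date of the match, kept as text here
    runs: str          # runs scored by the batsman, kept as text here
    result: str        # result of the game for India
    margin: str        # margin of victory, in runs or in wickets
    br: Optional[int]  # balls remaining; only set when the chasing team won
    toss: str          # whether India won or lost the toss
    bat: str           # whether India batted 1st or 2nd


# Sachin's debut game, taken from the first row of the snippet.
debut = MatchRow("ODI # 593", "v Pakistan", "Gujranwala", "18-Dec-89",
                 "0", "lost", "7 runs", None, "won", "2nd")
print(debut.opposition, debut.runs, debut.result)
```

Runs and Start Date are deliberately kept as text at this point, because the raw fields still need cleaning before they can be treated as numbers and dates.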

In all, this is pretty good information. If we look at the first row of the data, it tells us
about the game with Match Id 593.
Match Id   Opposition     Ground      Start Date  Runs  Result  Margin     BR  Toss  Bat
ODI # 593  v Pakistan     Gujranwala  18-Dec-89   0     lost    7 runs         won   2nd
ODI # 612  v New Zealand  Dunedin     01-Mar-90         lost    108 runs       won   2nd
ODI # 616  v New Zealand  Wellington  06-Mar-90         won     1 runs         won   1st
ODI # 623  v Sri Lanka    Sharjah     25-Apr-90   10    lost    3 wickets  4   lost  1st
ODI # 625  v Pakistan     Sharjah     27-Apr-90   20    lost    26 runs        won   2nd
ODI # 634  v England      Leeds       18-Jul-90   19    won     6 wickets  12  won   2nd
ODI # 635  v England      Nottingham  20-Jul-90         won     5 wickets      won   2nd
India played against Pakistan at Gujranwala on 18 Dec 1989. India won the toss, decided
to field and while chasing, fell short of the target by 7 runs. Sachin got out for a duck in
this game.

WHAT KIND OF QUESTIONS CAN I ANSWER USING THIS DATA?

Let us examine each of the fields and understand the kind of insights this information can
provide. The first field is what is called the primary key in data mining parlance. It is a
unique number assigned to each game in order to identify the game and distinguish it from
others. This key is useful for data manipulation but not for analysis itself.
The second field, "Opposition", tells us who Sachin was playing against. We can analyse
Sachin's performance by opposition. Think of any statistic that will help us analyse
Sachin's performance. The field "Opposition" helps us add this dimension to the analysis.
Example questions:
What is Sachin's average against each of the teams?
What is the win rate by opposition?
At what rate has he scored half-centuries and centuries against different opposition?

Similarly, the third field, "Ground", helps us add the venue dimension to analysing
Sachin's performance.
Example questions:
What is Sachin's average at different venues?
Where has he scored the most centuries?
Where has he scored the most half-centuries?
Where does he have the highest win rate?

"Start Date" tells us when the game was played. It provides the time dimension to the
analysis.
Example questions:
What is Sachin's average in each of the last 20 years?
When did he score the most centuries?
When did he score the most half-centuries?
In how many years has he scored more than 1000 runs?

The field "Runs" is important for obvious reasons. This variable is a measure of Sachin's
performance in a game.
Note that all the other variables are used as "dimensions", i.e. they are a means to slice
and dice the data for the measure "Runs". For example, we can look at Sachin's total
runs scored or average runs scored by Opposition. Opposition here is the dimension and
we are slicing the data along this dimension. Runs, on the other hand, is a measure.

The field "Result" gives the result of that particular game. We use this field as another
dimension in the analysis.
Example questions:
What is Sachin's average when team India wins a game vs. when they lose it?
How many centuries has Sachin scored in India's victories vs. losses?
The field "Margin" is a slightly tricky one. It gives the margin of victory in runs when
the team batting first wins, and in wickets when the team batting second wins the game. This
field will need some transformation for it to be used effectively. If required, we will come
to that in the data preparation stage.
The same applies to the field "BR".
The fields "Toss" and "Bat" also add dimensions to our analysis. We can analyse Sachin's
performance when India wins the toss vs. when they lose it and when they bat first vs.
when they bat second.
Note one thing here. We had mentioned that the field "Runs" is a measure and all other
fields are dimensions. Well, that's not entirely correct. Even the field "Result" can be
used as a measure depending on what we are analysing. For example, suppose we want to
answer the question "What is India's win rate when they win the toss vs. when they lose
it?" In this case, the field "Result" is the measure and the field "Toss" is the dimension.
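The difference between a dimension and a measure can be sketched in a few lines of Python (the guide itself does this kind of slicing in Excel). The helper name mean_by is our own, and the four rows are taken from the snippet shown earlier, so the numbers only illustrate the mechanics and say nothing about Sachin's career.

```python
from collections import defaultdict

# Four rows from the snippet: (opposition, runs, result, toss).
rows = [
    ("Pakistan", 0, "lost", "won"),
    ("Sri Lanka", 10, "lost", "lost"),
    ("Pakistan", 20, "lost", "won"),
    ("England", 19, "won", "won"),
]


def mean_by(rows, dimension, measure):
    """Average the measure within each distinct value of the dimension."""
    groups = defaultdict(list)
    for row in rows:
        groups[dimension(row)].append(measure(row))
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# "Runs" as the measure, "Opposition" as the dimension.
avg_runs_by_opposition = mean_by(rows, lambda r: r[0], lambda r: r[1])

# "Result" as the measure (1 for a win, 0 otherwise), "Toss" as the dimension.
win_rate_by_toss = mean_by(rows, lambda r: r[3], lambda r: 1 if r[2] == "won" else 0)

print(avg_runs_by_opposition)
print(win_rate_by_toss)
```

Note that the first result is average runs per game, which is not the same as a cricket batting average (runs divided by dismissals).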

In this section, we have completed the first step in data exploration. We have identified
the information contained in the data set. We have looked at each field and understood its
definition. We have also looked at several examples of questions that we can answer with
this data.

BUSINESS APPLICATION OF DATA EXPLORATION

This is a simplified scenario that we have taken for the purpose of this guide. Business
situations can be far more complex.
The data set that we have contains 10 fields. Business data sets can have many more
fields. Data sets in financial services can have up to 1000 fields. Most business data sets
tend to have between 10 and 100 variables.
Further, our data set has very intuitive fields. They are easy to understand and are not
vague in definition. In business situations, variables may not be this easy to understand.
In such a situation, there is something called a "data dictionary" which comes in very
handy for the analyst. The data dictionary is a document (usually an Excel sheet) which
has the names and definitions of all the fields in the data set.
A snippet from a data dictionary

It is advisable to spend plenty of time on the data dictionary. The analyst needs to be
comfortable with the definition of all the variables before proceeding any further with the
analysis.

DATA EXPLORATION STEP 2

So far, we have explored what information is available to us. We have looked at all
the different fields in our data set and understood exactly what they mean.
The next step is to explore the data itself. How much information do we have? What is the
quality of the available data? How do we need to prepare the data?
For this step, we will need to look at each of the fields in the data individually.

HOW MUCH DATA IS THERE?

We can simply scroll down in Excel to see how many rows of data there are. In our case, we
find that there is data till the 464th row.
Since this is a fairly small dataset, we are going to perform the data exploration and
preparation steps in Excel. However, when we come to the predictive modelling stage we
will use R.

Since the first row contains the headers, this means there are 463 rows of data. Each row
represents one match, so we have data on 463 matches.
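The same count (464 rows in the sheet, minus one header row, gives 463 matches) can be reproduced outside Excel. The sketch below is in Python and assumes the sheet has been exported as CSV text. The three-row sample string and the function name count_matches are invented for illustration.

```python
import csv
import io

# A tiny stand-in for the exported sheet: one header row plus three matches.
sample = """Match Id,Opposition,Runs
ODI # 593,v Pakistan,0
ODI # 623,v Sri Lanka,10
ODI # 625,v Pakistan,20
"""


def count_matches(text):
    """Return (header, number of data rows) for CSV text with one header row."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # the first row holds the field names, not a match
    return header, sum(1 for _ in reader)


header, n_matches = count_matches(sample)
print(header, n_matches)
```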

WHAT DOES THE DATA REPRESENT?

We now need to find the time period this data pertains to. The field "Start Date" can provide
us that information. We would normally sort the data on start date (which refers to the date
the match was played on), but the data is already sorted on it. We can see that the first
game in the data was played on 18 Dec 1989 and the last one on 18 Mar 2012.

We know that 18 Dec 1989 was Sachin's debut game. We can confirm that Sachin
played 463 games from then till 18 Mar 2012.
This implies that we have data on all of Sachin's games from his debut till 18 Mar 2012.
We have now established that our data set has data on 463 games. This
represents all the games that Sachin played for India from his debut till 18 Mar
2012.
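For a larger file, the date range can be checked in code instead of by eye. The following Python sketch assumes the dates are stored as text in the form shown in the snippet (for example 18-Dec-89) and that month abbreviations are in English. The short list of dates is a sample, with the last entry standing in for the final game on 18 Mar 2012.

```python
from datetime import datetime

start_dates = ["18-Dec-89", "01-Mar-90", "06-Mar-90", "25-Apr-90", "18-Mar-12"]


def parse_start_date(text):
    """Parse a date such as '18-Dec-89' (assumes English month abbreviations)."""
    return datetime.strptime(text, "%d-%b-%y").date()


parsed = [parse_start_date(d) for d in start_dates]
first_game, last_game = min(parsed), max(parsed)

# True when the rows are already in date order, as they are in our sheet.
already_sorted = parsed == sorted(parsed)
print(first_game, last_game, already_sorted)
```

Two-digit years are ambiguous in general. Python's %y reads 69 to 99 as 1969 to 1999 and 00 to 68 as 2000 to 2068, which happens to suit this data set.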

EXAMINING ALL VARIABLES

Now let us examine the values in all the fields individually. Since this is a small data set,
we can scan the values manually. The easiest way to do this is to apply filters on all the
fields and examine each filter one by one.
The first field is the Match Id. We scan all the unique values by scrolling within the
filter. All the values are in order.

Figure 1 Match Id

We do the same with Opposition and find all the values in order.
There are multiple things we are looking for when we scan these values. The first is to
detect something we do not expect to see. For example, if we see "China" in the
opposition, that's an unexpected value that needs to be investigated further. A more likely
error is that we have 2 different values representing the same thing. For example, "U.A.E."
could also be written as "UAE" (without the dots) in some rows, and we would then need to
change some entries to make them consistent. We could easily do a "Replace all" in Excel to
change all "UAE" values to "U.A.E.".
In this manner, I scan all the fields and make a note of all the points that need to be
worked on. Let us now move to the next stage, i.e. the data preparation stage. This is
where we will manipulate and transform the data into the format we want.
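The filter scan and the "Replace all" step have simple equivalents in code. This is a Python sketch on a made-up Opposition column that contains the UAE inconsistency described above. The names ALIASES and EXPECTED are hypothetical, and a real list of expected teams would be longer.

```python
from collections import Counter

# Hypothetical column containing the inconsistency described above.
opposition = ["Pakistan", "U.A.E.", "England", "UAE", "Pakistan"]

# Equivalent of scrolling through the filter: each distinct value and its count.
print(Counter(opposition))

# Equivalent of "Replace all": map known variants onto one spelling.
ALIASES = {"UAE": "U.A.E."}
cleaned = [ALIASES.get(name, name) for name in opposition]

# Flag anything outside the set of teams we expect to see.
EXPECTED = {"Pakistan", "U.A.E.", "England"}
unexpected = sorted(set(cleaned) - EXPECTED)
print(cleaned, unexpected)
```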

EXERCISE
Download the data by clicking on this link: Cricket data for Sachin, Sourav and Rahul
Perform the following steps on the data for Sourav and Rahul:
1. Open the data in Excel.
2. Examine the data. How many games' worth of data is there for each of these
players?
3. Examine all the variables independently using the filter option and make a note of
the changes you would need to make to the data in the data prep stage.

STAGE 3: DATA PREPARATION

It is a good practice to create a copy of the data set at this stage. Now we will start
making modifications to the data. Some of them may be irreversible. Creating a copy of
the data set at this stage gives us the option to go back to the original data set at any
stage later on.

CLEANING THE OPPOSITION FIELD

One thing that bugs me here is the presence of a "v" before the team names. For
example, the entry for a game where the opposition is Pakistan is "v Pakistan". The "v" is
here as a short form for "versus", but I feel it is redundant. While it is not essential
for this analysis to remove the "v", I will do it for aesthetic reasons.
There are many ways to remove the "v" here. I will use the "Text to columns" function in
Excel. First I insert a column to the right of the "Opposition" column.
Then I simply select the cells where the data is located, click on the "Text to columns"
function under the "Data" tab and then choose the "Fixed width" option. Then I click
"Next".

On the next screen, I simply click on the space between "v" and the opposition name and a
line appears between the two, signifying a break.
I click "Finish" and I now have the data broken into 2 columns. The original column
contains all the "v"s and the column on the right now contains all the Opposition names
without the "v".
With a little bit of cleaning, I now have my Opposition field in the format I want.
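As an alternative to "Text to columns", the prefix can be stripped with a small function. This Python sketch is not the method used in the guide, and the function name strip_versus is our own.

```python
def strip_versus(opposition):
    """Remove a leading 'v ' from an opposition entry, if present."""
    prefix = "v "
    if opposition.startswith(prefix):
        return opposition[len(prefix):].strip()
    return opposition.strip()


raw = ["v Pakistan", "v New Zealand", "v Sri Lanka"]
clean = [strip_versus(name) for name in raw]
print(clean)
```

Checking for the prefix first means that running the function on already-cleaned data leaves it unchanged.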

CLEANING THE RUNS FIELD

When I examined the "Runs" field, I found a couple of things I will need to correct before I
can use this field for mathematical analysis.
First, there are a couple of text entries in this field. You can see the values "DNB" and
"TDNB" in the adjoining figure. Both of these refer to situations where Tendulkar did not
get to bat.
We now think back to the goal of our analysis. The goal is "Which batsman has had the most
impact on India's win-rate through the runs they have scored in ODIs?"
With this goal in mind, we can safely exclude all matches where Sachin did not bat. If he
did not bat, he could not have had any impact on the team's win-rate through his runs.
Note that he could still have had an impact through his bowling and fielding but we are not
trying to measure that impact.
We can simply remove this data from our analysis data set by filtering and deleting the
rows.

It is a good idea to make a note of all the changes we are making. I have noted that we
have deleted data on 11 games here. In these 11 games, Sachin did not bat and hence this
data was not useful for our analysis.
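The filter-and-delete step, together with the note of how many rows were removed, might look like this in Python. The five rows and their match ids are invented for illustration. Only the markers DNB and TDNB come from the data set.

```python
DID_NOT_BAT = {"DNB", "TDNB"}

# Hypothetical rows: (match id, runs as text).
rows = [("A", "0"), ("B", "DNB"), ("C", "40*"), ("D", "TDNB"), ("E", "19")]

# Keep only the games in which the batsman actually batted.
kept = [row for row in rows if row[1] not in DID_NOT_BAT]
removed = len(rows) - len(kept)

# Record the change, as we would in our notes.
print(f"Removed {removed} games where the batsman did not bat; {len(kept)} remain")
```

Building a new list instead of deleting in place leaves the original rows untouched, which serves the same purpose as keeping a copy of the sheet.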
The next thing I noticed in the "Runs" field is the presence of a number of entries where
the score is followed by an asterisk (*). This is a common convention to denote a not-out
score. In all these innings, Sachin stayed not out at the end. There are 41 such
innings in our data set.
What should we do about this? Removing the asterisk at the end is fairly simple in
Excel, but before we do that we need to carefully understand the implications for
our analysis. Converting a score of 40* to just 40 means that we are saying that the impact
of Sachin's runs remains the same whether he scores 40* or gets out at 40.
I think this is a fair assumption. Since we are measuring a batsman's impact solely through
the runs they have scored, it is OK to discard the information on whether the batsman got
out or not.
Having gone through this exercise mentally, I think it is fine to go ahead with this
approach. I now proceed to remove the asterisk at the end.
I will again insert a column to the right of this field and use the "Text to columns" function.
This time I choose the "Delimited" option.

When I click "Next", I am asked to choose the character that I want to use as a delimiter.

I choose the "Other" option, enter the character * and click on "Finish".
This tells Excel to treat every asterisk as a delimiter, keep the
content to the left of it in the original cell and move the content to the right of it into the
cell on the right. In our case, this simply eliminates the * from this field.
We have now cleaned up the "Runs" field in our data set and made it amenable to
the mathematical operations that we will perform in the next step.
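Here is a small Python sketch of the same conversion. Unlike the Excel steps, it returns the not-out information as a separate flag instead of throwing it away, which is our own design choice. The analysis in this guide uses only the runs. The function name parse_runs is hypothetical.

```python
def parse_runs(text):
    """Convert a score such as '40*' or '19' into (runs, not_out)."""
    text = text.strip()
    not_out = text.endswith("*")
    return int(text.rstrip("*")), not_out


scores = ["0", "40*", "19"]
parsed = [parse_runs(s) for s in scores]

# Only the numeric part is used in the analysis.
runs = [r for r, _ in parsed]
print(parsed, runs)
```

The function raises a ValueError on entries such as DNB, so it should be applied after the did-not-bat rows have been removed.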

CLEANING UP THE RESULTS FIELD

The next thing on my list is to clean up the "Result" field. There are 4 kinds of results in
our data set: "won", "lost", "n/r" and "tied". "n/r" stands for no result. Since we want
to measure the impact on the team's win-rate, we can exclude the matches where the
result is "n/r" or "tied". You can argue that "tied" means India did not win and hence can
be counted as "lost", but by the same logic, "tied" means India did not lose either. Hence
we decide to exclude all games where the result is "n/r" or "tied".
We note that we have deleted another 21 games due
to this criterion.
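In code, this exclusion is a one-line filter. The Python sketch below also encodes the remaining results as 1 for a win and 0 for a loss. That encoding is our assumption about what a win/loss model will need and is not a step taken in the text above. The five results are sample values.

```python
results = ["won", "lost", "n/r", "tied", "won"]

# Keep only straight wins and losses.
decisive = [r for r in results if r in ("won", "lost")]
excluded = len(results) - len(decisive)

# Encode the outcome as 1 for a win and 0 for a loss.
won_flag = [1 if r == "won" else 0 for r in decisive]
print(decisive, excluded, won_flag)
```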

This brings us to the end of data preparation. Before we proceed any further, it is
important to summarize what we have done here.
1. We cleaned up the "Opposition" field by removing the "v" before each team name.
2. We cleaned up the "Runs" field by removing all games where Sachin did not bat.
We deleted 11 games this way.
3. We removed the * at the end of scores where Sachin did not get out. Our data now
does not differentiate between innings where Sachin was out and innings where he
wasn't.
4. We removed all games where the result was not a straight win or loss. We removed
an additional 21 games this way.
5. We started with 463 games and now we are considering only 431 games
(463 - 11 - 21) for our actual analysis.

DATA PREPARATION IN BUSINESS ANALYTICS

When dealing with business data, data preparation can be a long and exhausting process.
What we have discussed here can be considered more as data cleaning. We have not really
touched upon certain other important aspects of data preparation.
Anomaly detection or outlier correction is used extensively when dealing with business
data. The idea here is to remove unusual occurrences from the data before building a
predictive model. This is because outliers can have undue influence on our models. In our
case, we have limited data and there is nothing in the data that justifies outlier
correction.
Missing data treatment is another crucial step in data preparation. In our dataset, we
have no missing values (Thank you espncricinfo!) but if, for example, we had some innings
for which we had no values in the Runs field, we would have to do something about it.
Typically missing data treatment involves either imputing or estimating the missing values
or removing the data with missing values from our analysis.
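The two treatments can be compared on a toy example. In this Python sketch the Runs values and the two gaps are invented, and the median is only one of several possible imputation choices. Our actual data set has no missing values.

```python
from statistics import median

# Hypothetical Runs column with two missing entries (None).
runs = [0, 10, None, 20, 19, None]

observed = [r for r in runs if r is not None]

# Option 1: remove the rows with missing values.
dropped = observed

# Option 2: impute, here with the median of the observed values.
fill = median(observed)
imputed = [fill if r is None else r for r in runs]
print(dropped, imputed)
```

Removing rows shrinks the data set, while imputing keeps its size but adds values that were never observed. Which is preferable depends on how much data is missing and why.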

Deriving variables is also a part of data preparation. Sometimes we need to create new
variables from the existing ones for the purpose of our analysis. For example, if we need
to create a "Year" variable, we can derive that from the "Start Date" variable. We could
also derive the country where the match was played from the "Ground" field. This will
involve creating a separate lookup table which maps grounds to countries.
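Both derivations can be sketched briefly in Python. The lookup table below covers only the six grounds that appear in our snippet, and the names GROUND_TO_COUNTRY and derive are our own. A real table would need an entry for every ground in the data.

```python
from datetime import datetime

# Illustrative lookup table; a real one would cover every ground in the data.
GROUND_TO_COUNTRY = {
    "Gujranwala": "Pakistan",
    "Dunedin": "New Zealand",
    "Wellington": "New Zealand",
    "Sharjah": "U.A.E.",
    "Leeds": "England",
    "Nottingham": "England",
}


def derive(start_date, ground):
    """Return (year, country) derived from the start date and the ground."""
    year = datetime.strptime(start_date, "%d-%b-%y").year
    # Grounds missing from the table are flagged instead of guessed.
    country = GROUND_TO_COUNTRY.get(ground, "unknown")
    return year, country


print(derive("18-Dec-89", "Gujranwala"))
print(derive("18-Jul-90", "Leeds"))
```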
Data preparation is an important part of any analysis but it becomes even more important
when dealing with complex business data. Effective data preparation increases the
strength of the predictive models by harnessing the power of the available data in the
most efficient manner.
Now that we have prepared the data, we are ready for the next stage, i.e. predictive
modelling. But before we get into that, now is a good time to perform some descriptive
analytics on the data.

EXERCISE
Perform the following steps on the data for Sourav and Rahul:
1. Clean up the "Opposition" field by removing the "v" before each team name.
2. Clean up the "Runs" field by removing all games where the batsman did not bat.
3. Remove the * at the end of scores where the batsman did not get out.
4. Remove all games where the result was not a straight win or loss.
5. Make a note of the total number of games you started with and what you are left
with for further analysis.

STAGE 4: DESCRIPTIVE ANALYTICS

In the data exploration stage, we had compiled a long list of questions that could be
answered from this data. Here are some interesting charts.

Here is a graph of the distribution of Sachin's innings scores.

Descriptive analytics like this helps an analyst understand the data better, generate
insights and spot anything unusual that requires further investigation.
EXERCISE
Descriptive analytics offers unlimited ways of analysing any kind of data. You are only
limited by your imagination. Here are some things you can do with your data at this stage.
1. Analyze the batsman's performance over time: total runs scored by calendar year
and average runs scored by calendar year
2. Analyze the batsman's performance by opposition, by venue (home and away) etc.
3. Create and examine the distribution of scores
What you will find is that in most cases descriptive analytics will confirm your belief or
intuition. But in a few cases, every once in a while, you will find patterns or insights that
you did not know or that run counter to your intuition. These counter-intuitive or hidden
insights are what make descriptive analytics such a valuable tool.

STAGE 5: PREDICTIVE MODELLING

Now that we have explored the data, prepared it for analysis and run descriptive analytics
on it, the next stage is predictive modelling.
We again refer back to the goal of our analysis: "Which batsman has had the most
impact on India's win-rate through the runs they have scored in ODIs?"
We need to establish a relationship between India's win-rate and the number of runs
scored by Sachin in a particular game. Let us first examine a graph where we have India's
win-rate on the vertical axis and Sachin's scores (in buckets of 20) on the horizontal axis.
When Sachin scores less than 21 runs, India's win-rate is 42%. It climbs to 56% when he
scores between 21 and 40 runs. It goes up to a whopping 83% when Sachin scores
between 121 and 140 runs. The win-rate does come down for scores greater than 140, but
this aberration could be attributed to sparse data for such high scores. Since Sachin has
scored more than 140 in only 11 games, 1 or 2 unusual results can make a big impact on
this win-rate.

There does seem to be a general trend of improvement in India's win-rate as Sachin's
scores become higher.
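For readers who would rather build the bucketed win-rate in R than in Excel, here is a minimal sketch. The data frame below is a handful of made-up innings, not Sachin's record, and the column names Runs and Outcome (1 = win, 0 = loss) are assumptions about how the prepared data is laid out.

```r
# Hypothetical toy data: a handful of made-up innings, NOT Sachin's real record.
innings <- data.frame(
  Runs    = c(5, 18, 27, 33, 45, 62, 88, 101, 134, 12),
  Outcome = c(0, 0, 1, 0, 1, 1, 1, 1, 1, 1)   # 1 = India won, 0 = India lost
)

# Buckets 0-20, 21-40, 41-60, ... (cut() makes right-closed intervals by default).
breaks <- seq(0, 200, by = 20)
innings$Bucket <- cut(innings$Runs, breaks = breaks, include.lowest = TRUE)

# Win-rate per bucket = mean of the 0/1 outcome within the bucket.
win_rate <- tapply(innings$Outcome, innings$Bucket, mean)
games    <- table(innings$Bucket)

# Empty buckets show up with 0 games and a missing (NA) win-rate.
print(data.frame(Bucket  = names(win_rate),
                 Games   = as.vector(games),
                 WinRate = round(as.vector(win_rate), 2)))
```

Printing the number of games next to each win-rate makes sparse buckets easy to spot, which is the caveat raised above for scores over 140.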

What if we could quantify this relationship? What if we could create a mathematical
formula that calculates India's win-rate for any given Sachin score? For example, if
Sachin scores 25 runs in an innings, we could just plug his score into the formula and
bam! It gives us the probability of India winning that game.
We will now attempt to do exactly this via a regression model. We will estimate the
relationship between Sachin's score and India's win-rate. In other words, we will build a
model that predicts, for a given number of runs scored by Sachin, the probability of India
winning the game. This model will also be able to estimate the increase in the probability
of India winning with each additional run scored by Sachin.
AN INTRODUCTION TO REGRESSION
Regression is one of the most popular predictive techniques. In simple terms, regression
helps us understand how the typical value of one variable, called the dependent
variable (in this case, India's win-rate), changes when some other variable, called the
independent variable (here Sachin's score), varies.
This is a simplified case of regression. In many situations, regression models are used to
understand the effect of several variables on one variable. For example, India's win-rate
could also be influenced by factors like whether India batted first or second, whether
India was playing at home or away, or even the toss. We could theoretically build a model
which takes into account the effect of all these variables on India's win-rate.

TYPES OF REGRESSION
There are many types of regression techniques that are applied by statisticians depending
on the nature of the problem and the variables involved. Linear and logistic are two of the
most popular ones.
Linear regression assumes a linear relationship between the dependent and the
independent variable. If the relationship between Sachin's score and India's win-rate
could be quantified with a straight line, then linear regression would be a suitable
modelling technique.
In our problem, we have seen in the previous graphs that the relationship between our
dependent and independent variable is not exactly linear.
Further, the variable that we are trying to predict, i.e. the outcome of the game, is a
binary variable (win/loss). In our case, a regression technique called logistic regression is
more suitable. Logistic regression does not need a linear relationship between the
dependent and independent variables. Instead, it assumes that the log of the odds of the
outcome is a linear function of the independent variables, which keeps every predicted
probability between 0 and 1.

LOGISTIC REGRESSION
In this book, we will not go into the mathematical details of logistic regression. Instead we
are going to focus on its application to a given problem.
The result of a logistic regression model is an equation in this format:
Log[p/(1-p)] = a + bX
Here Log is the natural logarithm, p is the probability of India winning the game and X is
the number of runs scored.
Let us interpret this equation in the context of our problem.

The model will generate 2 values: a and b.
Using the equation above, we can calculate the value of p for any given value of X.
We first calculate the value of Log[p/(1-p)] by putting in the values of a, b and X. Let us
call this Y.

Log[p/(1-p)] = Y
We can then use the antilog or exponent function to calculate the value of p/(1-p).
p/(1-p) = exp(Y)
From there we can easily calculate the value of p as well.
p = exp(Y)/(1 + exp(Y))
p, if you remember, is the probability of India winning the game. Thus, using a logistic
model, for any given value of X, we are able to calculate p, India's predicted win-rate.
This is how we interpret the results of the logistic regression model.
Now we need to find the values of a and b so that we can calculate the probability p for
any given X (runs scored).
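The conversion from Y back to p can be sketched in a couple of lines of R. The function name logit_to_prob is our own, introduced only for illustration. plogis() is a built-in R function that computes the same quantity.

```r
# Convert a log-odds value Y = a + bX into a probability p.
logit_to_prob <- function(y) {
  exp(y) / (1 + exp(y))
}

logit_to_prob(0)                     # log-odds of 0 means even odds
logit_to_prob(c(-2, -1, 0, 1, 2))    # every result lies between 0 and 1

# R's built-in plogis() computes the same quantity.
all.equal(logit_to_prob(1.5), plogis(1.5))
```

Negative values of Y give probabilities below 50% and positive values give probabilities above 50%, which is a handy sanity check when doing the same calculation in Excel.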

BUILDING A LOGISTIC REGRESSION MODEL

We are going to use a combination of R and Excel to build this model. We will calculate
the coefficients (a and b) using R. We will perform all other calculations in Excel.
R is an open source tool that is available as a free download. Anyone can download R on
their machine and start working with it.
Download and install R

READING DATA INTO R

It is a lot simpler to load CSV files in R than Excel files. We will copy and paste our data
into another Excel sheet and save it as a .csv file.
We use the read.table command in R to read in the data.
data.frame.sachin <- read.table("E:\\jigsaw\\Blog\\sachin.csv",
                                header = TRUE,   # first row holds the column names
                                sep = ",")       # fields are separated by commas
This command creates a new table (or data frame) called data.frame.sachin by reading in
data from the file sachin.csv. We have specified the location of the file as well. Note that
in a Windows pathname R requires you to write each backslash twice (or to use single
forward slashes instead). The header = TRUE argument tells R to treat the first row of the
data as headers. The sep = "," argument tells R that the data is separated by commas
(since it is a CSV, or comma-separated values, file).
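If you do not have the file at hand, the following sketch shows the same step end to end. It writes a three-row CSV to a temporary file and reads it back with read.csv(), which is read.table() with header = TRUE and sep = "," already set. The column names and values here are hypothetical stand-ins for the real file.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The column names and values are hypothetical stand-ins for the real file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Runs,Outcome",
             "12,0",
             "57,1",
             "101,1"),
           csv_path)

# read.csv() is read.table() with header = TRUE and sep = "," as defaults.
toy <- read.csv(csv_path)

str(toy)   # structure of the data frame: 3 observations of 2 variables

# On Windows, forward slashes avoid the doubled backslashes, e.g.
# read.csv("E:/jigsaw/Blog/sachin.csv")
```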

Once we read in the data, we can quickly run summary statistics on the data using a
simple command.
summary(data.frame.sachin)

As you can see, this command creates 6 measures for each numeric field: the minimum,
the 25th percentile (1st quartile), the median (50th percentile), the mean, the 75th
percentile (3rd quartile) and the maximum.

RUNNING A LOGISTIC MODEL

Once we have read in the data and run summary statistics on it, the next step is to build
the model. We are building a simple 2-variable model. The variable Outcome is the
dependent variable. This is what we will try to predict. The variable Runs is the
independent variable. This is what we will use for prediction. In other words, we will
quantify the relationship between runs scored and the outcome (or the probability of the
outcome being a win).
We use the glm command in R.
smodel <- glm(Outcome ~ Runs, data = data.frame.sachin, family = "binomial")
smodel is the name we have given to our model.
glm stands for generalized linear model. This is the broad family of techniques under
which logistic regression falls. Here it is the R function that allows us to build various
kinds of regression models.
Outcome ~ Runs tells R to use the variable Runs to predict the variable Outcome. R is
case-sensitive, so both names must match the column headers in the CSV file exactly.
Suppose we want to use 2 variables instead of 1 for prediction. Along with Runs, we also
want to use the variable Toss. In this case, this part of the command will change from
Outcome ~ Runs to Outcome ~ Runs + Toss.
data = data.frame.sachin tells R to use the data frame we read in earlier as the data on
which to do the analysis.

family = "binomial" is important. The glm command supports several model families.
With the binomial family and its default logit link, R fits a logistic regression.
Since we have assigned the model to a name (smodel), R will not print the results of the
model. For this we again use the summary command.
summary(smodel)

For a full understanding of the model output, click here.

For our purpose, we are interested in knowing the values of a and b. If you remember, the
logistic regression equation is as follows:
Log[p/(1-p)] = a + bX
a is the constant value that is calculated by the model and is called the intercept. b is
another constant, which is the coefficient of X (runs scored).
In the model output, we can find the values of a and b under the heading Coefficients.

a is the estimate of the intercept and it has a value of -0.258734 as per our model.
b is the coefficient estimate of Runs and its value is 0.010062.
We now have the information to solve the regression equation and can calculate the value
of p for each value of X.
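Instead of reading a and b off the printed summary, we can also pull them out of the model object with coef(). The sketch below is only an illustration: it fits the same kind of model on simulated data (not the real innings) whose true intercept and slope we pick ourselves, so that we can see the estimates land near them. The names sim, sim_model, true_a and true_b are our own.

```r
set.seed(1)   # make the simulation reproducible

# Simulated data, NOT the real innings: outcomes are drawn from a logistic
# model whose true intercept and slope we choose ourselves.
true_a <- -0.25
true_b <- 0.01
sim <- data.frame(Runs = sample(0:150, 5000, replace = TRUE))
sim$Outcome <- rbinom(nrow(sim), size = 1,
                      prob = plogis(true_a + true_b * sim$Runs))

# Same call shape as in the text, fitted on the simulated data frame.
sim_model <- glm(Outcome ~ Runs, data = sim, family = "binomial")

# coef() returns the estimates as a named numeric vector.
a <- coef(sim_model)[["(Intercept)"]]
b <- coef(sim_model)[["Runs"]]
c(a = a, b = b)   # should land close to true_a and true_b
```

The same two coef() lines applied to smodel would return the intercept and the Runs coefficient quoted above.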

EXERCISE
For this exercise you should have first prepared the data as described in the previous
section.

Save the newly prepared data for both Sourav and Rahul into separate csv sheets.
Read the data sets into R
Get summary statistics for the data using the summary function
Perform logistic regression on the data using the glm function.
Summarize the results of the model using the summary command
Note the estimates for the intercept and the variable Runs

STAGE 6: INTERPRETING THE OUTPUT

We have run our model and obtained the results. Let us now understand and interpret the
results.
Case 1: When Runs = 0
For X = 0,
a + bX = a
which is -0.258734
Thus Log[p/(1-p)] = -0.258734. Or,
p/(1-p) = exp(-0.258734) = 0.772. Or,
p = 0.772/(1 + 0.772) = 0.44 or 44%.

This means that if Sachin scores a duck (0 runs), India's predicted probability of winning
that game is 44%.

Case 2: When Runs = 50

For X = 50, a + bX = -0.258734 + 0.010062 * 50
Thus Log[p/(1-p)] = 0.244366,
p/(1-p) = exp(0.244366) = 1.28. Or,
p = 1.28/(1 + 1.28) = 0.56 or 56%.
If Sachin scores a half century (50 runs), India's predicted probability of winning that
game is 56%.
As per the model, India's chances of winning a game increase by over 12% (in absolute
terms, from 44% to 56%) if Sachin scores exactly 50 runs in a game vs. if he scores 0.
In other words, the contribution of Sachin's 50 runs is an increment of 12% in India's
chances of winning.
In this way, we can calculate Sachin's contribution (in the form of an increase in India's
chances of winning) for every score that he has ever made in these 431 games.
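The two cases can be checked quickly in R. This is a small sketch using the coefficients reported in the model output. The helper name win_prob is our own.

```r
# Coefficients reported in the model output.
a <- -0.258734
b <-  0.010062

# p = exp(Y) / (1 + exp(Y)) with Y = a + b * runs; plogis() does this for us.
win_prob <- function(runs) plogis(a + b * runs)

p0  <- win_prob(0)    # Case 1: a duck
p50 <- win_prob(50)   # Case 2: a half century

# Rounded to two decimals these are the 0.44 and 0.56 worked out by hand.
round(c(p0 = p0, p50 = p50, increase = p50 - p0), 4)
```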

X = Runs   Y = a + bX    p      Increment   Cumulative Increment
0          -0.258734     0.44   0
1          -0.248672     0.44   0.25%       0.25%
2          -0.23861      0.44   0.25%       0.50%
3          -0.228548     0.44   0.25%       0.74%
4          -0.218486     0.45   0.25%       0.99%
5          -0.208424     0.45   0.25%       1.24%
6          -0.198362     0.45   0.25%       1.49%
7          -0.1883       0.45   0.25%       1.74%
8          -0.178238     0.46   0.25%       1.99%
9          -0.168176     0.46   0.25%       2.24%
10         -0.158114     0.46   0.25%       2.49%
11         -0.148052     0.46   0.25%       2.74%

A snippet of the table used to calculate the increment in Excel

When Sachin scores a 0 in a game, his contribution is 0 for that game.

When he scores 1 run, his contribution is a 0.25% increase in India's chances of winning.
When he scores 10 runs, his contribution is a 2.49% increase in India's chances of winning.
And so on.
Using this method, we can calculate Sachin's total (lifetime) and average (per game)
contribution towards the team's win-rate.
We can use the exact same method to calculate the same measures for the other 2 players
included in our analysis: Sourav Ganguly and Rahul Dravid.
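The Excel table can also be rebuilt in R, which is a useful cross-check on the spreadsheet formulas. The sketch below uses Sachin's coefficients from the model output. The column layout is our reading of the snippet, with the cumulative increment taken as the gain over a score of 0.

```r
a <- -0.258734   # intercept from the model output
b <-  0.010062   # coefficient of Runs from the model output

runs <- 0:11
y    <- a + b * runs          # log-odds, Y = a + bX
p    <- plogis(y)             # predicted probability of a win

increment  <- c(0, diff(p))   # gain from the latest single run
cumulative <- p - p[1]        # gain relative to scoring 0

tab <- data.frame(
  Runs       = runs,
  Y          = round(y, 6),
  p          = round(p, 2),
  Increment  = sprintf("%.2f%%", 100 * increment),
  Cumulative = sprintf("%.2f%%", 100 * cumulative)
)
print(tab, row.names = FALSE)
```

Changing runs to 0:200 extends the table to every score we are likely to need.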
Here is a comparison of the effects of the runs scored by each of the three batsmen.

Ganguly's line (the green line) is at the top, which means Ganguly's runs have the highest
impact on India's win-rate.
In other words, if everything else remains constant, a score of 60 from Ganguly is likely to
improve India's chances of winning more than the same score of 60 from either Sachin or
Rahul.
Putting it another way, if India were in the World Cup final and you had to pray for one
batsman's success (out of these 3), you should be praying for a high score from Ganguly, as
that will improve India's chances of winning more than if Sachin or Rahul were to hit that
same high score.

WHAT ABOUT THE BATTING AVERAGE?

OK, so the models suggest that each run from Ganguly's bat was more useful than one
from Sachin's or Rahul's bat. But what about the actual runs scored in every innings?
Sachin scores his runs at a batting average of 45, Ganguly scored at 41 and Dravid at
39. Even though Ganguly's runs are more valuable, Sachin has scored more in each innings
on average. Could his contribution be more than Ganguly's?
To find out each player's average contribution, we need to look at their contribution for
each of the innings played.

As an example, Rahul Dravid scored 69 runs in his last innings. As per our model, his
innings improved India's chances of winning from 35% to 57%, an improvement of
22%. Thus Rahul's contribution through that innings was 22%. If we repeat this process for
all his innings and calculate the average contribution per innings, this would represent
Rahul's average contribution to India's victories over his entire career.
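Here is a minimal R sketch of that repeat-and-average step. Since Rahul's coefficients are not listed in this guide, the sketch uses Sachin's. The scores are made up for illustration, and treating the lifetime figure as the plain sum of the per-innings contributions is our assumption.

```r
a <- -0.258734   # Sachin's intercept from the model output
b <-  0.010062   # Sachin's coefficient of Runs

# Contribution of one innings = predicted win probability at that score
# minus the predicted win probability at a score of 0.
contribution <- function(runs) plogis(a + b * runs) - plogis(a)

# Hypothetical scores for illustration only, NOT a real career record.
scores <- c(0, 14, 50, 7, 112, 33)

per_innings <- contribution(scores)
average_contribution  <- mean(per_innings)   # average contribution per innings
lifetime_contribution <- sum(per_innings)    # total over all listed innings (assumed definition)

round(100 * per_innings, 1)
round(100 * c(average = average_contribution,
              lifetime = lifetime_contribution), 1)
```

Swapping in another player's a and b, and that player's full list of scores, gives the measures used in the comparisons that follow.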
Here is a comparison of the 3 Indian stalwarts.

Again, Ganguly has the highest average contribution per innings. For every innings that he
played, the runs scored by him improved India's chances of winning by 13% on average.
In comparison, Rahul's average contribution is 11% and Sachin's is 10%.
Thus, as per our statistical analysis, Ganguly is the player who had the highest average
contribution per innings.
From a purely statistical point of view, if we look at the average contribution per innings
as the defining measure, Ganguly comes out as the most important contributor and is
therefore the best ODI player for India amongst the 3 highest run getters.

LIFETIME CONTRIBUTION
Is average contribution the best way to measure a player's contribution?
For example, Sachin has played in over 430 games (more than any other batsman in the
world), while Sourav has played only 292 and Dravid 307.

Statistics   Sachin   Sourav   Dravid
Innings      431      292      307
Runs         17,742   11,255   10,536

Is it not better to look at their total contribution over the entire career rather than the
average contribution per game?
Better or not, it is surely a different way to measure a player's contribution. If we
compare the lifetime contribution of the 3 players, it gives a different picture.

Sachin leads the comparison by a fair bit. Ganguly is second and Dravid is a distant third.
Of course, a big thing to consider is that Sachin is still playing while the other two have
retired. Sachin will surely improve his lifetime contribution by the time he retires.
If we look at the lifetime contribution of a player as the defining measure, Sachin
emerges as the best ODI player for India amongst the 3 highest run getters.

EXERCISE
Use the coefficients obtained from the models to generate the average and lifetime
contributions for both Sourav and Rahul. Compare with results here.

STAGE 7: MODEL VALIDATION

We have used logistic regression to create a model to measure the impact of a player's
runs scored on the team's chances of winning that game.
We have found that Sourav Ganguly has the highest average contribution per innings while
Sachin has the highest lifetime contribution.
Both of these conclusions are based on the models that we have built. It is therefore
important to verify how good, if at all, these models are.
There are many ways of measuring the quality of a model. We will look at a very intuitive
way of doing this. We have already discussed the Chi-square test previously. We will use
this versatile test once again, this time to measure the quality of our model.
If you remember, the Chi-square test is commonly used to compare observed data with
data we would expect to obtain according to a specific hypothesis.
Let us understand how we can use it to assess the quality of our model.
There are 431 games in our data (for Sachin). Since we have taken out all games with no
result or a tied result, we know that all of these remaining 431 games had a win/loss result.
Now if we try to predict the results of these 431 games with no additional information, we
would expect to get it right in 50% of the cases. Thus we would expect our guess to be
right in 215 or 216 of the games and wrong in the remaining ones.
Now we can make the same prediction about the game outcome using our model as well.
If the model predicts a probability higher than .5, we can say that it is predicting a win for
India. If the model predicts a probability lower than .5, we can assume that the model is
predicting a loss for India.
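With only one predictor, this rule boils down to a cut-off score. The model predicts a win exactly when a + bX is above 0, that is, when X is above -a/b. A short sketch with Sachin's coefficients from the model output follows. The example scores are arbitrary.

```r
a <- -0.258734   # intercept from the model output
b <-  0.010062   # coefficient of Runs from the model output

# p > 0.5 exactly when the log-odds a + b * runs are above 0,
# i.e. when runs > -a / b.
cutoff <- -a / b
cutoff

# Turn predicted probabilities into win/loss calls for a few example scores.
runs <- c(0, 25, 26, 50, 100)
p    <- plogis(a + b * runs)
calls <- data.frame(Runs = runs,
                    p = round(p, 3),
                    Prediction = ifelse(p > 0.5, "Win", "Loss"))
print(calls, row.names = FALSE)
```

The cut-off falls between 25 and 26 runs, so this model calls a win whenever Sachin scores 26 or more and a loss otherwise.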
If the model is not really any good, we would expect it to be right in half the cases.
Out of the 431 games, India has lost 200. We would expect such a model to predict 100 of
these as losses and 100 as wins. Similarly, of the 231 games that India won, we would
expect the model to predict 115.5 correctly as wins and the remaining 115.5 incorrectly as
losses.
This is what we would expect to see.
                   Model Prediction
Actual result      Loss      Win       Grand Total
Loss               100       100       200
Win                115.5     115.5     231
Grand Total        215.5     215.5     431

And this is what we actually see.

                   Model Prediction
Actual result      Loss      Win       Grand Total
Loss               114       86        200
Win                87        144       231
Grand Total        201       230       431

We expect the model (if it is no good) to correctly predict 215.5 of the 431 games. The
model actually correctly predicts (114 + 144) = 258 games.
We can thus see that the model seems to be good. It has a (258/431) i.e. 60% accuracy
when predicting the outcome of the game based on Sachin's score.
We can use the Chi-square test to determine if this difference between what we expected
to see and what we actually see is statistically significant or not.
Using Excel again to run this test, we get a p-value of less than 0.0001. In other words, if
the model were no better than guessing, there would be less than a 0.01% chance of seeing
a difference this large.
We can thus conclude that our model does carry real predictive information and we can be
fairly confident about our findings.
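The same check can be sketched in R from the counts in the two tables. Using 1 degree of freedom for a 2 x 2 table is our assumption about how the Excel test was set up. The last line runs R's own chisq.test(), which is a slightly different test because it derives the expected counts from the row and column totals rather than from a coin flip.

```r
# Observed counts from the "what we actually see" table (rows = actual result).
observed <- matrix(c(114,  86,
                      87, 144),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Actual    = c("Loss", "Win"),
                                   Predicted = c("Loss", "Win")))

# Counts expected from a coin-flip model: half of each row in each column.
expected <- matrix(c(100.0, 100.0,
                     115.5, 115.5),
                   nrow = 2, byrow = TRUE)

accuracy <- sum(diag(observed)) / sum(observed)   # (114 + 144) / 431

# Chi-square statistic against those expected counts.
# df = 1 for a 2 x 2 table is an assumption about the Excel setup.
chi_sq  <- sum((observed - expected)^2 / expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

round(c(accuracy = accuracy, chi_sq = chi_sq), 3)
p_value < 0.0001

# R's own test of independence derives the expected counts from the margins.
chisq.test(observed, correct = FALSE)
```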
EXERCISE

Create the tables with actual and expected values for both players
Run Chi-square tests to check validity of the model

CONCLUSION
In this book, we have looked at an application of logistic regression to get statistical
insights from the data that help us decide who is the best ODI batsman in India.
PROBLEM DEFINITION
We started with a vaguely defined question: "Who is the greatest ODI batsman India has
ever produced?"
In the first step, we defined the problem in more certain terms. If we want to apply any
kind of analytics on the data, the problem definition needs to be unambiguous and
precise. We changed the definition to "Which batsman has had the most impact on India's
win-rate through the runs they have scored in ODIs?"

Then, in order to limit the scope of the analysis, we added another constraint. We
modified the problem definition to this form: "Amongst those who have scored more
than 10000 runs in ODIs, which batsman has had the most impact on India's win-rate
through the runs they have scored?"

DATA EXPLORATION
Data exploration was done in 2 steps.
In the first step we identified the information contained in the data set. We looked at
each field and understood its definition. We also looked at several examples of questions
that we can answer with this data.
In the second step, we dug a little deeper to understand the data even better.

How much information do we have?

What is the quality of the available data?

How do we need to prepare the data?

DATA PREPARATION
This is the stage where we prepare the data for extensive further analysis. This involves
cleaning up variables and removing data that is not required for analysis. For Sachin's
analysis, we started with 464 games. We removed the games where the player did not bat
or the result was a tie or no result. We ended with 431 games' worth of data.

Data preparation often involves other important activities like outlier removal, missing
value treatment and variable transformation as well.
DESCRIPTIVE ANALYTICS
Having prepared the data for analysis, we first ran a series of descriptive analytics on the
data. Descriptive analytics like this helps an analyst understand the data better. It also
helps her spot anything unusual that requires further investigation.

PREDICTIVE MODELING
The next step is to build a predictive model that will help us answer the original question.
We used logistic regression to estimate the relationship between Sachin's score and India's
win-rate. In other words, we built a model that predicts, for a given number of runs scored
by Sachin, the probability of India winning the game. This model is also able to estimate
the increase in the probability of India winning with each additional run scored by Sachin.

INTERPRETING THE RESULTS

Having built the model, the next step is to interpret the results of the model and generate
insights from it. We used 2 separate parameters to measure who is the greatest ODI
batsman.
The first parameter measures the average contribution of the batsman per innings. Sourav
Ganguly emerged as the most valuable contributor using this method.
The second parameter measures the lifetime contribution of the batsman over his entire
career. Sachin Tendulkar emerged as the top contributor using this method.

MODEL VALIDATION
We also spent some time validating the model. We used a Chi-square test to determine if
the model is statistically significant or not. We found that our model was highly significant
and thus concluded that the insights generated from the model are indeed valid.

This brings us to the end of this book. We hope that you will find this book a useful
tutorial to get a beginner's understanding of the application of logistic regression to solve
problems and make decisions.

IT Architect Series: Foundation In the Art of Infrastructure Design: A Practical Guide for IT Architects
From Everand
IT Architect Series: Foundation In the Art of Infrastructure Design: A Practical Guide for IT Architects
John Yani Arrasjid, VCDX-001
No ratings yet
Fundamentals of Cyber Security: Principles, Theory and Practices
From Everand
Fundamentals of Cyber Security: Principles, Theory and Practices
Mayank Bhushan
No ratings yet
The Data-Confident Internal Auditor: A Practical, Step-by-Step Guide
From Everand
The Data-Confident Internal Auditor: A Practical, Step-by-Step Guide
Yusuf Moolla
No ratings yet
Atkinson
No ratings yet
Atkinson
14 pages
Data Analysis Using Logistic Regression PDF
No ratings yet
Data Analysis Using Logistic Regression PDF
38 pages
Chavi Blog Heading 3
No ratings yet
Chavi Blog Heading 3
3 pages
Choithram School North Campus: Maths Activity
No ratings yet
Choithram School North Campus: Maths Activity
20 pages
"Data Analysis" Basic Concepts and Applications
From Everand
"Data Analysis" Basic Concepts and Applications
Sukanta Bhattacharya
No ratings yet
Quantifying Individual and Team Performance in Cricket: Hot-Hand Effect
No ratings yet
Quantifying Individual and Team Performance in Cricket: Hot-Hand Effect
15 pages
An Analysis of Batting Performance of The Cricket Players
No ratings yet
An Analysis of Batting Performance of The Cricket Players
6 pages
Group 9 - ODI Batsmen Ranking
No ratings yet
Group 9 - ODI Batsmen Ranking
10 pages
Business Intelligence and Data Mining Techniques
From Everand
Business Intelligence and Data Mining Techniques
Dwaipayan Sethi
No ratings yet
Statistical Breakdown On Odi Career of Shakib Al Hasan and Angelo Mathews
No ratings yet
Statistical Breakdown On Odi Career of Shakib Al Hasan and Angelo Mathews
24 pages
Project PPT 2
No ratings yet
Project PPT 2
14 pages
Essentials of Data Analysis
From Everand
Essentials of Data Analysis
Agasti Khatri
No ratings yet
Decision Making with Data
From Everand
Decision Making with Data
Ravi Deshpande
No ratings yet
Mastering Data Mining Techniques
From Everand
Mastering Data Mining Techniques
Dhaanyalakshmi Ahuja
No ratings yet
Maths Project Applied
No ratings yet
Maths Project Applied
11 pages
IP Project
No ratings yet
IP Project
28 pages
Ayush Harshit DHV
No ratings yet
Ayush Harshit DHV
13 pages
PYTHON FOR DATA ANALYTICS: Mastering Python for Comprehensive Data Analysis and Insights (2023 Guide for Beginners)
From Everand
PYTHON FOR DATA ANALYTICS: Mastering Python for Comprehensive Data Analysis and Insights (2023 Guide for Beginners)
Waldo Todd
No ratings yet
"Big Data Science" Basic Concepts and Applications
From Everand
"Big Data Science" Basic Concepts and Applications
Sukanta Bhattacharya
No ratings yet
Project Summary On SAS Descriptive Analysis On Cricket Triangular Series
No ratings yet
Project Summary On SAS Descriptive Analysis On Cricket Triangular Series
2 pages
Data Analytics
From Everand
Data Analytics
Jeffery Short
1/5 (1)
What Is Data Analytics? A Complete Guide For Beginners
From Everand
What Is Data Analytics? A Complete Guide For Beginners
Piyush Kumar Jain
No ratings yet
Oracle Business Intelligence : The Condensed Guide to Analysis and Reporting
From Everand
Oracle Business Intelligence : The Condensed Guide to Analysis and Reporting
Yuli Vasiliev
No ratings yet
Capstone Notes-1
No ratings yet
Capstone Notes-1
18 pages
Introduction to Data Analytics
From Everand
Introduction to Data Analytics
Dan Martin
No ratings yet
BA3
No ratings yet
BA3
3 pages
Webscrappingprojectreport (045020)
No ratings yet
Webscrappingprojectreport (045020)
18 pages
Data Mining for Beginners: A Programmer’s Guide
From Everand
Data Mining for Beginners: A Programmer’s Guide
Agasti Khatri
No ratings yet
Project Math
No ratings yet
Project Math
10 pages
(Excerpts From) Investigating Performance: Design and Outcomes With Xapi
From Everand
(Excerpts From) Investigating Performance: Design and Outcomes With Xapi
Janet Laane Effron
No ratings yet
Baseball and Business Intelligence
No ratings yet
Baseball and Business Intelligence
13 pages
Capstone Presentation
No ratings yet
Capstone Presentation
10 pages
PYTHON FOR DATA ANALYSIS: A Practical Guide to Manipulating, Cleaning, and Analyzing Data Using Python (2023 Beginner Crash Course)
From Everand
PYTHON FOR DATA ANALYSIS: A Practical Guide to Manipulating, Cleaning, and Analyzing Data Using Python (2023 Beginner Crash Course)
Ike Beck
No ratings yet
Data Science Career Guide Interview Preparation
From Everand
Data Science Career Guide Interview Preparation
Gradient Publication
No ratings yet
Principles of Data Mining
From Everand
Principles of Data Mining
Subodh Keshari
No ratings yet
Data Science Fundamentals and Practical Approaches: Understand Why Data Science Is the Next (English Edition)
From Everand
Data Science Fundamentals and Practical Approaches: Understand Why Data Science Is the Next (English Edition)
Dr. Gypsy Nandi
No ratings yet
3072 6115 1 SM
No ratings yet
3072 6115 1 SM
6 pages
WinnoveX: Win Your Way To Innovation Excellence
From Everand
WinnoveX: Win Your Way To Innovation Excellence
Leandre Adifon
No ratings yet
Jor S 2 Rating Player in Test Cricket
No ratings yet
Jor S 2 Rating Player in Test Cricket
13 pages
Big Data for IoT, Cloud, and AI
From Everand
Big Data for IoT, Cloud, and AI
Anasooya Khanna
No ratings yet
Cricket and Six Sigma
No ratings yet
Cricket and Six Sigma
6 pages
Thriving in a Data World: A Guide for Leaders and Managers
From Everand
Thriving in a Data World: A Guide for Leaders and Managers
Sangeeta Krishnan
No ratings yet
Data Science and Machine Learning Interview Questions Using R: Crack the Data Scientist and Machine Learning Engineers Interviews with Ease
From Everand
Data Science and Machine Learning Interview Questions Using R: Crack the Data Scientist and Machine Learning Engineers Interviews with Ease
Vishwanathan Narayanan
No ratings yet
Analysis of Factors Affecting The Winning Percentage ODI Cricket Matches
0% (1)
Analysis of Factors Affecting The Winning Percentage ODI Cricket Matches
24 pages
Introduction to Data Science Using R
From Everand
Introduction to Data Science Using R
Prema Alla
No ratings yet
Introduction to Business Analytics
From Everand
Introduction to Business Analytics
Dwaipayan Sethi
No ratings yet
Analytical Sport Business
No ratings yet
Analytical Sport Business
11 pages
PYTHON DATA SCIENCE: A Practical Guide to Mastering Python for Data Science and Artificial Intelligence (2023 Beginner Crash Course)
From Everand
PYTHON DATA SCIENCE: A Practical Guide to Mastering Python for Data Science and Artificial Intelligence (2023 Beginner Crash Course)
Calvert Long
No ratings yet
Importance of Analytics in Sport Management: Indian Perspective
No ratings yet
Importance of Analytics in Sport Management: Indian Perspective
7 pages
Big Data, Machine Learning, and Data Mining Explained
From Everand
Big Data, Machine Learning, and Data Mining Explained
Chitrali Kaul
No ratings yet
Madan Gopal Jhanwar
No ratings yet
Madan Gopal Jhanwar
11 pages
Data Analytics for Businesses 2019: Master Data Science with Optimised Marketing Strategies using Data Mining Algorithms (Artificial Intelligence, Machine Learning, Predictive Modelling and more)
From Everand
Data Analytics for Businesses 2019: Master Data Science with Optimised Marketing Strategies using Data Mining Algorithms (Artificial Intelligence, Machine Learning, Predictive Modelling and more)
Riley Adams
5/5 (1)
Big Data and Data Science: Analytics for the Future
From Everand
Big Data and Data Science: Analytics for the Future
Dhaanyalakshmi Ahuja
No ratings yet
PES UNIVERSITY, Bangalore UE18CS203 B.Tech, Sem III Session: Aug-Dec, 2019 Ue18Cs203 - Introduction To Data Science
No ratings yet
PES UNIVERSITY, Bangalore UE18CS203 B.Tech, Sem III Session: Aug-Dec, 2019 Ue18Cs203 - Introduction To Data Science
4 pages
Cricket and Math
No ratings yet
Cricket and Math
85 pages
Charts Case Study
No ratings yet
Charts Case Study
6 pages
Data-Driven Decision Making
From Everand
Data-Driven Decision Making
Aadinath Pothuvaal
No ratings yet
Be Data Curious!: Be Data Curious!, #1
From Everand
Be Data Curious!: Be Data Curious!, #1
Nick Jewell
No ratings yet
Midterm 2010 F
No ratings yet
Midterm 2010 F
15 pages
The C-Reactive Protein/Albumin Ratio Is Useful For Predicting Short-Term Survival in Cancer and Noncancer Patients
No ratings yet
The C-Reactive Protein/Albumin Ratio Is Useful For Predicting Short-Term Survival in Cancer and Noncancer Patients
6 pages
Journal Article On Racial Discrimination
No ratings yet
Journal Article On Racial Discrimination
8 pages
Linear Models Reading
No ratings yet
Linear Models Reading
26 pages
Credit Risk Prediction With and Without Weights of Evidence
No ratings yet
Credit Risk Prediction With and Without Weights of Evidence
20 pages
On Applying On Applying On Applying On Applying Neuro Neuro Neuro Neuro - C C C Computing in E Omputing in E Omputing in E Omputing in E - Com Domain Com Domain Com Domain Com Domain
No ratings yet
On Applying On Applying On Applying On Applying Neuro Neuro Neuro Neuro - C C C Computing in E Omputing in E Omputing in E Omputing in E - Com Domain Com Domain Com Domain Com Domain
5 pages
Srujan ML 2 Project Fin
No ratings yet
Srujan ML 2 Project Fin


