
Big Data Assn Document PDF

The document discusses using R studio to develop linear regression models to analyze factors influencing the sold price of players in the Indian Premier League (IPL). Simple linear regression models showed that batting strike rate and number of sixers scored had a statistically significant relationship with sold price, though they only explained a small percentage of variation. A multiple regression model incorporating both batting strike rate and sixers scored was also significant and showed that number of sixers scored had a greater impact on sold price than batting strike rate alone.

Uploaded by

SRI PRAGATHI

NATIONAL INSTITUTE OF FASHION TECHNOLOGY, PATNA

DEPARTMENT OF FASHION MANAGEMENT STUDIES

Big Data, Business Analytics, Advanced IT and Digital Management

End Term Jury

on

“IPL Case study evaluation using R studio”

Semester-3

18 December, 2019

SUBMITTED TO: SUBMITTED BY:

Prof. Kislay Kashyap Rakshit Jain – MFM/18/354

Assistant Professor

FMS

CONTENTS

1. Introduction to R studio

2. Case: Pricing of Players in The Indian Premier League

3. Conclusion

List of Figures

Figure 1: Linear regression model graphically

Figure 2: Scatter plot for Striking rate vs Sold price

Figure 3: Scatter plot for sold price vs sixers scored

Figure 4: Input and output for Q.3

Figure 5: Multiple regression model- Sixers score and batting strike rate vs sold price

Figure 6: Scatter plot for Q.4

List of Tables

Table 1: Parameters to accept or reject the factors influencing sold price

1. Introduction to R studio

RStudio is an integrated development environment (IDE) for R, a programming language for statistical computing and graphics. R can be extended with a large number of add-on packages, which can be installed and loaded from within RStudio.

It can be used for advanced data computations such as data mining, web scraping and text analysis.

For this document, the scope of RStudio is limited to simple and multiple linear regression.

Simple Linear Regression:

The simple linear regression is used to predict a quantitative outcome y on the basis of one single
predictor variable x. The goal is to build a mathematical model (or formula) that defines y as a
function of the x variable.

The mathematical formula of the linear regression can be written as y = b0 + b1*x + e,

where:

b0 and b1 are known as the regression beta coefficients or parameters:

b0 is the intercept of the regression line; that is the predicted value when x = 0.

b1 is the slope of the regression line which depicts change in y per unit change in x

e is the error term (also known as the residual errors), the part of y that cannot be explained by
the regression model.

The figure below illustrates the linear regression model, where:

• the best-fit regression line is in blue
• the intercept (b0) and the slope (b1) are shown in green
• the error terms (e) are represented by vertical red lines

Figure 1: Linear regression model graphically
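
The formula y = b0 + b1*x + e can be tried out with a minimal R sketch. The data below are synthetic and purely hypothetical (they are not the IPL data), and the names x, y, fit, b0, b1 and e are chosen only for illustration.

```r
# Synthetic data for illustration only (assumption: not the IPL data set)
set.seed(1)
x <- 1:20
y <- 5 + 2 * x + rnorm(20, sd = 1)

# Least-squares fit of y = b0 + b1*x + e
fit <- lm(y ~ x)

b0 <- coef(fit)[["(Intercept)"]]  # intercept: predicted y when x = 0
b1 <- coef(fit)[["x"]]            # slope: change in y per unit change in x
e  <- residuals(fit)              # error terms: part of y the line does not explain

# The fitted line plus the residuals reproduces the observed y values
stopifnot(isTRUE(all.equal(unname(b0 + b1 * x + e), y)))
```

Calling summary(fit) on such a model prints the coefficient table, the p-values and the multiple R-squared that are used in the rest of this document.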

Multiple Linear Regression:

Multiple linear regression is an extension of simple linear regression used to predict an outcome
variable (y) on the basis of multiple distinct predictor variables (x).

With three predictor variables (x), the prediction of y is expressed by the following equation:

y = b0 + b1*x1 + b2*x2 + b3*x3

The “b” values are called the regression weights (or beta coefficients). They measure the
association between the predictor variable and the outcome. “b” can be interpreted as the average
effect on y of a one unit increase in “x”, holding all other predictors fixed.
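
The "holding all other predictors fixed" interpretation can be checked with a short R sketch. The data frame d and its columns x1, x2, x3 and y are hypothetical and generated at random; only lm(), coef() and predict() from base R are used.

```r
# Synthetic predictors and outcome (assumption: illustrative values only)
set.seed(2)
n <- 50
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 3 * d$x2 + 0.5 * d$x3 + rnorm(n, sd = 0.1)

# Fit y = b0 + b1*x1 + b2*x2 + b3*x3
fit <- lm(y ~ x1 + x2 + x3, data = d)
b <- coef(fit)

# Raise x1 by one unit while x2 and x3 stay fixed
base   <- data.frame(x1 = 0, x2 = 1, x3 = 1)
bumped <- data.frame(x1 = 1, x2 = 1, x3 = 1)
effect <- unname(predict(fit, bumped) - predict(fit, base))

# The change in the prediction equals the coefficient of x1
stopifnot(isTRUE(all.equal(effect, b[["x1"]])))
```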

2. Case: Pricing of Players in The Indian Premier League

The year 2008, in which the Indian Premier League (IPL) was launched, was a game changer for cricket as a sport. The right price for a player, and the factors that influenced it, puzzled many sports analysts. The franchises acquired players through an English auction with several rules. The price of a player in any sport is driven by many factors, and not all of them are directly related to performance on the field. The factors thought to affect the price were compiled in tabular form to analyze which factors are strongly associated with the price of a player and which factors negatively impact it. Franchises can use these findings to decide whether to pick or drop a player while bidding, and the analysis also plays a key role in rationally assigning a sold price to players.

Some of the keys for the data file are as follows:

AGE: Age of the player at the time of the auction classified into three categories: L25 means player
is less than 25 years old, B25-35 means that age is between 25 and 35, A35 means age is more
than 35.

HS: Highest score by a batsman in IPL

Ave-B: average runs scored by a batsman in IPL

AVE-BL: Bowling average

SR-B: Batting strike rate


SR-BL: Bowling strike rate

Sixers: No. of six runs scored by a player in IPL

WKTS: No. of wickets taken by a player in IPL

Captaincy EXP: Captained either a T20 or national team

ODI-SR-B: Batting strike rate in ODI

ODI-SR-BL: Bowling strike rate in ODI

ODI-RUN-S: Runs scored in ODI

ODI-WKTS: Wickets taken in ODI

T-RUNS-S: Runs scored in Test matches

T-WKTS: Wickets taken in Test matches

Player-SKILL: Player’s primary skill

COUNTRY: Country of Origin of the player

YEAR-A: Year of Auction in IPL

IPL Team: Teams for which the player had played in the IPL

Q.1 Develop a simple linear regression model between the sold price and batting strike rate,
is there a statistically significant relationship between sold price and batting strike rate?

The simple linear regression model is developed using the following commands, and each type of output is defined in the figure below:
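
A minimal sketch of commands that would fit such a model is given here. It assumes the column names SOLD.PRICE and SR..B and the model name model1 that appear in the plotting commands of this report; the IPLdata data frame built below is a randomly generated, hypothetical stand-in so that the sketch runs on its own, and its numbers are not the case data.

```r
# Hypothetical stand-in for the IPL data set (assumption: random values, not the real file)
set.seed(3)
IPLdata <- data.frame(SR..B = runif(100, 0, 180))
IPLdata$SOLD.PRICE <- 300000 + 1500 * IPLdata$SR..B + rnorm(100, sd = 400000)

# Simple linear regression of sold price on batting strike rate
model1 <- lm(SOLD.PRICE ~ SR..B, data = IPLdata)
s <- summary(model1)

# Quantities used for the hypothesis test and the goodness of fit
slope_p   <- coef(s)["SR..B", "Pr(>|t|)"]  # p-value for H0: slope = 0
r_squared <- s$r.squared                   # multiple R-squared
```

With the real data loaded into IPLdata, the same two quantities would be read off the printed output of summary(model1).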

Null hypothesis: β1 = 0

Sold price and batting strike rate in IPL are not significantly related.

Alternate hypothesis: β1 ≠ 0

Sold price and batting strike rate in IPL are significantly related.

Here, the estimated β1 is non-zero, and the p-value of 0.0358 is less than 0.05.

So, reject the null hypothesis and accept the alternate hypothesis.

Therefore, sold price and batting strike rate in IPL are statistically significantly related.

R-squared for the model is 0.03396.

R-squared is the statistical measure of the proportion of variation in the dependent variable that is explained by variation in the independent variable.

Batting strike rate therefore explains only 3.4% of the variation in sold price. The rest of the variation in sold price is unexplained by the model generated here.

Figure 2: Scatter plot for Striking rate vs Sold price

The scatter plot was made using the commands

plot(IPLdata$SR..B, IPLdata$SOLD.PRICE, main = "scatterplot")

abline(model1)

The scatter plot of the model clearly shows that the data points lie far from the best-fit line. This is consistent with the multiple R-squared value of only 3.4%.

Q.2 What is the impact of ability to score “SIXERS” on the player’s price?

• The impact of scoring sixers in IPL on the sold price of the player is significant. Every additional sixer scored by the player is associated with an increase of 7693 in his sold price.
• Using multiple R-squared, we can see that 20.3% of the variation in sold price is explained by the sixers scored by the player.
• Statistically, the p-value is close to zero. So the alternate hypothesis, which states that sold price is significantly related to the sixers scored by the player in IPL, is accepted.
• The scatter plot of the model shows that the data are clustered in one zone of the plot, yet the model is not a good fit, as the data points are scattered above and below the best-fit line rather than aligned with the fitted model.

Figure 3: Scatter plot for sold price vs sixers scored

Q.3 Develop a multiple linear regression model between Sold price and batting striking rate and Sixers? What
do you conclude from this model?

As per the output of the model,

Sold Price = 395327 - 102.4 (SR..B) + 7758.7 (SIXERS)

For SR..B = 1 and SIXERS = 0 (one unit of batting strike rate, no sixers scored):

Sold Price = 395327 - 102.4 = 395224.6

For SR..B = 0 and SIXERS = 1 (one sixer scored, batting strike rate nil):

Sold Price = 395327 + 7758.7 = 403085.7

For SR..B = 1 and SIXERS = 1:

Sold Price = 395327 - 102.4 + 7758.7 = 402983.3

Predicted increase in Sold Price for a one-unit increase in both Sixers and Batting strike rate in IPL = 7758.7 - 102.4 = 7656.3
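
The arithmetic above can be reproduced with a small R helper. The function name predict_price and its argument names are hypothetical; the three coefficients are the ones reported in the model equation.

```r
# Fitted equation from the multiple regression output:
# Sold Price = 395327 - 102.4 * SR..B + 7758.7 * SIXERS
predict_price <- function(sr_b, sixers) {
  395327 - 102.4 * sr_b + 7758.7 * sixers
}

p10 <- predict_price(1, 0)  # 395224.6
p01 <- predict_price(0, 1)  # 403085.7
p11 <- predict_price(1, 1)  # 402983.3

# Change in predicted price when both predictors rise by one unit
joint_increase <- predict_price(1, 1) - predict_price(0, 0)  # 7656.3
```

Because the model is linear, the joint increase is simply the sum of the two coefficients, so it is dominated by the sixers term.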
Figure 4: Input and output for Q.3

The batting strike rate of a player is negatively associated with the sold price, and the association is not statistically significant.

The p-value for SR..B is greater than 0.05, so there is no significant relation between SR..B and sold price in this model.

Whereas the sixers scored by a player are positively associated with sold price, and the association is strong: each additional sixer raises the predicted sold price by 7758.

Also, the p-value is less than 0.05 and close to 0, which shows that sixers scored and sold price are significantly related.

When the two are combined, the model explains only 20.3% of the variation in sold price.

Even in the scatter plot of the model, the data points are spread around the best-fit line rather than aligned with it, which indicates that the two variables combined cannot generate a good fit.
Figure 5: Multiple regression model- Sixers score and batting strike rate vs sold price

Q.4 Cricket in the T20 format is considered a young man’s sport, is there evidence that the player’s price is
influenced by age?

In order to give evidence regarding age influencing the sold price, dummy variables are introduced using the following commands:

➢ CatAGE <- cut(IPLdata$AGE, breaks = c(0, 1, 2, 3), labels = c("L25", "B25-35", "A35"))

➢ IPLdata2 <- data.frame(IPLdata, CatAGE)

The categories generated are:

• L25 means the player is less than 25 years old,
• B25-35 means that age is between 25 and 35,
• A35 means age is more than 35.

Once the variables are grouped by creating categories, the linear regression model is run by entering the following commands:

➢ model4 <- lm(SOLD.PRICE ~ CatAGE, data = IPLdata2)

➢ summary(model4)

The output of the model is as follows:

For category L25: 1 if age is less than 25; 0 otherwise

For category B25-35: 1 if age is between 25 and 35; 0 otherwise

For category A35: 1 if age is more than 35; 0 otherwise

The output can be interpreted as follows, with L25 as the baseline category:

µ = β0 + β1 X(B25-35) + β2 X(A35)

µ = 720,250 – 235,715 X(B25-35) – 200,071 X(A35) --------- (a)

For mean of category L25:

In equation (a), put X(B25-35) = 0 and X(A35) = 0,

µ = 720,250

For mean of category B25-35:

In equation (a), put X(B25-35) = 1 and X(A35) = 0,

µ = 720,250 – 235,715*1 = 484,535

For mean of category A35:

In equation (a), put X(B25-35) = 0 and X(A35) = 1,

µ = 720,250 – 200,071*1 = 520,179
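
The link between the dummy-variable coefficients and the category means can be verified on a toy example. The prices data frame below is hypothetical (seven made-up players, prices in arbitrary units); it only illustrates that, with L25 as the first factor level, the intercept is the L25 mean and each coefficient is the difference from that mean.

```r
# Toy data (assumption: invented values, not the IPL data)
prices <- data.frame(
  AGE = factor(c("L25", "L25", "B25-35", "B25-35", "B25-35", "A35", "A35"),
               levels = c("L25", "B25-35", "A35")),
  PRICE = c(700, 740, 480, 500, 460, 510, 530)
)

# Regression on the age category; R creates the dummy variables itself
fit <- lm(PRICE ~ AGE, data = prices)
b <- unname(coef(fit))

# Category means rebuilt from the coefficients, as in equation (a)
means_from_model <- c(b[1], b[1] + b[2], b[1] + b[3])

# Category means computed directly from the data
group_means <- as.vector(tapply(prices$PRICE, prices$AGE, mean))

stopifnot(isTRUE(all.equal(means_from_model, group_means)))
```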

The mean sold price of players aged less than 25 is the highest, followed by players who are above 35. The mean sold price of players aged between 25 and 35 is the lowest. This implies that sold price is positively associated with age less than 25, while the association is negative for the older categories.

The scatter plot of the model shown in Figure 6 also depicts the negative association of the older categories with the sold price: players in the B25-35 and A35 categories sell for less, on average, than players under 25, although the decline is not steady across the three categories.

From the p-values of the age categories, it is clear that for age 25-35 the p-value is less than 0.05, and thus the association is statistically significant. Whereas for age above 35 the p-value is greater than 0.05, so that difference is not statistically significant.

Figure 6: Scatter plot for Q.4

Q.5 Are players of Indian origin paid more than players from other countries?

Step 1: The factors are first converted into numerical values using the following command.

Result: This is the output from the above command input.

Step 2: Locating the correct numerical representative of each country

Once the countries are coded numerically, they are sorted and numbered.

By looking at the values in the console, we can easily locate which number represents which country. For example, South Africa (SA) – 7, IND – 2, BAN – 3 and so on.

Step 3: Now the numeric values of the countries were also cut into categories.

Result:

Now, keeping IND (of numeric value 2) equal to A and all other numerical values of the country equal to B, the mean sold price for both categories was calculated.

Findings:

Mean sold price of category A = 652339.6

Mean sold price of category B = 430974

Now, the regression analysis was done.

Result:

Analysis:

The linear regression model of the given sample, with one indicator variable, is:

Mean Sold Price = β1 + β2*(A)

From the linear analysis, we get the intercept, β1 = 430974, which is the mean sold price of players from other countries, and the coefficient of the indicator for IND, β2 = 221366.

Solving the indicator variable:

A = 1, if the player is of Indian origin

A = 0, otherwise

Therefore, Mean Sold Price for A = 1 is,

Mean Sold Price = 430974 + 221366*(1) = 6,52,340

For a player from another country, A = 0

Therefore, Mean Sold Price for A = 0 is,

Mean Sold Price = 430974 + 221366*(0) = 4,30,974

Conclusion:

The mean sold price of a player of Indian origin is 6,52,340, which is higher than the mean sold price of a player of non-Indian origin (4,30,974). Hence, yes, players of Indian origin are paid more than players of non-Indian origin.
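
The indicator-variable approach can also be sketched in a more direct way than the numeric coding described in the steps above. The players data frame below is hypothetical (six invented players, prices in arbitrary units), and building the indicator with ifelse() is an assumed alternative rather than the commands actually used for this question.

```r
# Toy data (assumption: invented values, not the IPL data)
players <- data.frame(
  COUNTRY = c("IND", "IND", "IND", "SA", "AUS", "BAN"),
  PRICE   = c(600, 650, 700, 400, 450, 440)
)

# Indicator variable: A = 1 for Indian origin, 0 otherwise
players$A <- ifelse(players$COUNTRY == "IND", 1, 0)

# Mean Sold Price = beta1 + beta2 * A
fit <- lm(PRICE ~ A, data = players)
beta1 <- coef(fit)[["(Intercept)"]]  # mean price of non-Indian players
beta2 <- coef(fit)[["A"]]            # difference between the two group means

mean_indian <- beta1 + beta2
mean_other  <- beta1
```

The p-value of the coefficient on A in summary(fit) indicates whether the difference between the two means is statistically significant, which the comparison of means alone does not show.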

Q.6 Develop the model which can be used by Franchises to predict the sold price.

The franchise model for predicting the price of a player when buying is developed in the following steps:

Step 1: Input

The input command is as follows:

➢ model6 <- lm(SOLD.PRICE ~ PLAYING.ROLE + T.RUNS + T.WKTS + HS + SR..B + SIXERS + WKTS + AVE.BL + ECON + SR.BL, data = IPLdata)

➢ summary(model6)

The model includes independent variables that are generally important in the final selection of a player and will thus impact the price of the player during bidding sessions.

Step 2: Output

The output of the model as per RStudio is as follows. The output is then used to select the variables for which a unit change produces the largest change in sold price.

The key variables are:

Playing role: All-rounder, batsman, bowler and keeper

Total runs; total wickets; highest score; batting strike rate; sixers scored; wickets taken; bowling average; economy rate of the player; bowling strike rate.

Sold Price = 158470.32 (Playing role all-rounder) + 189003.87 (Playing role batsman) - 10209 (Playing role bowler) + 69253.24 (Playing role wicket keeper) + 19.13 (Total runs) + 291.51 (Total wickets) - 572.03 (Highest score) + 30.72 (Striking rate) + 7211.71 (Sixers scored) + 3737.22 (Wickets taken in IPL) + 11996.39 (Bowling average) - 2476.09 (Economy rate) - 10794 (Bowling strike rate)

Step 3: Selecting variables for final model

Table 1: Parameters to accept or reject the factors influencing sold price

Factor                  Slope value      Level of impact /   P-value   Is relationship   Selected/Rejected
                        (per unit        Association type              statistically     (owing to slope
                        increase in                                    significant?      value)
                        sold price)

All Rounder             158470           High / +            0.30945   No                Selected
Batsman                 189003           High / +            0.08279   No                Selected
Bowler                  10209            Average / -         0.92190   No                Selected
Wicket keeper           69253            High / +            0.65159   No                Selected
Total runs              19.31            Low / +             0.10601   No                Rejected
Total wickets           291.51           Low / +             0.26262   No                Rejected
Highest score           572.03           Low / -             0.75624   No                Rejected
Striking batting rate   30.72            Low / +             0.97699   No                Rejected
Sixers                  7211.71          Low / +             0.00185   Yes               Selected
Wickets in IPL          3737.22          Low / +             0.06701   No                Rejected
Bowling average         11996.39         Average / +         0.19847   No                Selected
Economy rate            2476.09          Low / -             0.80255   No                Rejected
Striking bowling rate   10794.78         Average / -         0.40680   No                Selected

Keys for the table:

Low: <10,000

Average: 10,000-20,000

High: >20,000

Selected: if the slope value is High or Average, or if the p-value is < 0.05

Rejected: if the slope value is Low and the p-value is > 0.05
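
The selection rule in the keys can be applied mechanically to the values of Table 1. The sketch below is one possible way to do so in R; the data frame name tbl and its column names are hypothetical, the slopes are entered as magnitudes (as in the table), and the rule is read as "Selected if the slope is Average or High, or if the p-value is below 0.05".

```r
# Slope magnitudes and p-values copied from Table 1
tbl <- data.frame(
  name = c("All Rounder", "Batsman", "Bowler", "Wicket keeper", "Total runs",
           "Total wickets", "Highest score", "Striking batting rate", "Sixers",
           "Wickets in IPL", "Bowling average", "Economy rate",
           "Striking bowling rate"),
  slope = c(158470, 189003, 10209, 69253, 19.31, 291.51, 572.03, 30.72,
            7211.71, 3737.22, 11996.39, 2476.09, 10794.78),
  p_value = c(0.30945, 0.08279, 0.92190, 0.65159, 0.10601, 0.26262, 0.75624,
              0.97699, 0.00185, 0.06701, 0.19847, 0.80255, 0.40680)
)

# Level of impact: Low < 10,000; Average 10,000-20,000; High > 20,000
tbl$level <- cut(tbl$slope, breaks = c(0, 10000, 20000, Inf),
                 labels = c("Low", "Average", "High"), right = FALSE)

# Selected if the slope is Average or High, or if the p-value is below 0.05
tbl$selected <- tbl$level != "Low" | tbl$p_value < 0.05

selected_names <- tbl$name[tbl$selected]
```

Under this reading, Sixers is the only factor selected on statistical significance alone; the other selected factors are retained because of the size of their slopes.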

Step 4: Omitting unwanted variables from the model equation

The final model equation will be as follows:

Sold Price = 158470.32 (Playing role all-rounder) + 189003.87 (Playing role batsman) - 10209 (Playing role bowler) + 69253.24 (Playing role wicket keeper) + 7211.71 (Sixers scored) + 11996.39 (Bowling average) - 10794 (Bowling strike rate)

Therefore, the key factors determining the selling price of the players are as follows:
➢ Playing role
➢ Sixers record
➢ Bowling average
➢ Bowling strike rate

3. Conclusion

By means of RStudio, the data were easily analyzed to assess the factors affecting the selling price of players in IPL bidding. The key findings are:

• Sold price and batting strike rate in IPL are statistically significantly related in the simple regression model.
• Batting strike rate explains only 3.4% of the variation in sold price. The rest of the variation in sold price is unexplained by the model generated here.
• The impact of scoring sixers in IPL on the sold price of the player is significant. In the simple regression model, every additional sixer scored by the player is associated with an increase of 7693 in his sold price.
• In the multiple regression model, the batting strike rate of a player is negatively associated with the sold price and the association is not statistically significant.
• Whereas the sixers scored by a player are positively associated with sold price, and the association is strong: each additional sixer raises the predicted sold price by 7758.
• Sold price is positively associated with age less than 25. The association with age is negative when the age increases beyond 35.
• The mean sold price of a player of Indian origin is 6,52,340, which is higher than the mean sold price of a player of non-Indian origin. Hence, yes, players of Indian origin are paid more than players of non-Indian origin.
• The key factors determining the selling price of the players are as follows:
o Playing role
o Sixers record
o Bowling average
o Bowling strike rate
