Big Data Assn Document PDF
Semester-3
18 December, 2019
Assistant Professor
FMS
CONTENTS
3. Conclusion
List of Figures
Figure 5: Multiple regression model- Sixers score and batting strike rate vs sold price
List of Tables
Table 1: Parameters to accept or reject the factors influencing sold price
1. Introduction to RStudio
RStudio can be used for advanced data computations such as data mining, web scraping and text analysis.
For this document, the scope of RStudio is limited to simple and multiple linear regression.
The simple linear regression is used to predict a quantitative outcome y on the basis of one single
predictor variable x. The goal is to build a mathematical model (or formula) that defines y as a
function of the x variable:
y = b0 + b1*x + e
where:
b0 is the intercept of the regression line; that is the predicted value when x = 0.
b1 is the slope of the regression line which depicts change in y per unit change in x
e is the error term (also known as the residual errors), the part of y that cannot be explained by
the regression model.
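The roles of b0, b1 and e can be seen in a minimal sketch. The five data points below are made up for illustration (they are not the auction data), and the variable names are hypothetical. The sketch fits the line with lm() and then recomputes the slope and intercept from the closed-form least-squares formulas to show that the two agree.

```r
# Made-up toy data, not the IPL auction data.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

fit <- lm(y ~ x)                   # least-squares fit of y = b0 + b1*x + e
b0 <- coef(fit)[["(Intercept)"]]   # intercept: predicted y when x = 0
b1 <- coef(fit)[["x"]]             # slope: change in y per unit change in x
e  <- residuals(fit)               # part of y the fitted line does not explain

# Same slope and intercept from the closed-form least-squares formulas.
b1_manual <- cov(x, y) / var(x)
b0_manual <- mean(y) - b1_manual * mean(x)

print(c(b0 = b0, b1 = b1))
```

For a least-squares fit with an intercept, the residuals sum to zero, which is one quick check that the line was fitted as intended.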
Multiple Linear Regression:
Multiple linear regression is an extension of simple linear regression used to predict an outcome
variable (y) on the basis of multiple distinct predictor variables (x).
With three predictor variables (x), the prediction of y is expressed by the following equation:
y = b0 + b1*x1 + b2*x2 + b3*x3 + e
The “b” values are called the regression weights (or beta coefficients). They measure the
association between the predictor variable and the outcome. “b” can be interpreted as the average
effect on y of a one unit increase in “x”, holding all other predictors fixed.
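The phrase "holding all other predictors fixed" can be illustrated with a small sketch. The data frame below is made up, and y is constructed without noise from assumed weights (1, 2, -1, 0.5) so that the fit recovers them; none of these numbers come from the auction data. The sketch predicts y at two points that differ only by one unit of x1 and shows that the difference equals the x1 coefficient.

```r
# Made-up toy data; y is built without noise so the fit recovers the weights.
d <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2, 1, 4, 3, 6, 5),
  x3 = c(1, 0, 0, 1, 1, 0)
)
d$y <- 1 + 2 * d$x1 - 1 * d$x2 + 0.5 * d$x3

fit <- lm(y ~ x1 + x2 + x3, data = d)
b <- coef(fit)

# Raise x1 by one unit while x2 and x3 stay fixed.
new_points <- data.frame(x1 = c(3, 4), x2 = c(2, 2), x3 = c(1, 1))
pred <- predict(fit, newdata = new_points)
effect_x1 <- unname(pred[2] - pred[1])   # equals the x1 coefficient

print(round(b, 6))
```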
The year 2008 was a game changer for cricket as a sport. It changed cricket forever. The right price
for a player and the factors that influenced the pricing puzzled many sports analysts. The franchises
acquired players through an English auction with several rules. The price of players in any
sport is driven by many factors, and not all the factors that drove the price of a player are directly
related to performance on the field. The factors thought to affect the price were compiled
in tabular form to analyze which factors are strongly associated with the price of a player and which
factors negatively impact it. Franchises can use these findings to pick or
drop a player while bidding. These analyses will also play a key role in rationally assigning a sold
price to the players.
AGE: Age of the player at the time of the auction classified into three categories: L25 means player
is less than 25 years old, B25-35 means that age is between 25 and 35, A35 means age is more
than 35.
IPL Team: Teams for which the player had played in the IPL
Q.1 Develop a simple linear regression model between the sold price and batting strike rate.
Is there a statistically significant relationship between sold price and batting strike rate?
The simple linear regression is developed using the following commands, and each output type is defined in the figure
below:
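The original commands appear only in a figure that is not reproduced here. As a minimal sketch, assuming the data sit in a data frame with columns named SOLD.PRICE and SR..B (the names used in the plot command in this document), the fit could look like the following. The data frame name ipl and its six rows are hypothetical and made up for illustration; they are not the auction data.

```r
# Hypothetical data frame; values are made up, not the auction data.
ipl <- data.frame(
  SR..B      = c(100, 110, 120, 130, 140, 150),
  SOLD.PRICE = c(200000, 350000, 300000, 500000, 450000, 600000)
)

# Simple linear regression of sold price on batting strike rate.
model1 <- lm(SOLD.PRICE ~ SR..B, data = ipl)

# Coefficient table: estimate, standard error, t value and p-value per term.
coef_table <- coef(summary(model1))
slope      <- coef_table["SR..B", "Estimate"]
p_value    <- coef_table["SR..B", "Pr(>|t|)"]

print(coef_table)
```

The p-value in the SR..B row is the one compared against 0.05 when deciding whether the slope differs significantly from zero.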
Null hypothesis: β1 = 0
Sold price and batting strike rate in IPL are not significantly related.
Alternate hypothesis: β1 ≠ 0
Sold price and batting strike rate in IPL are significantly related.
R squared for the model is 0.03396.
Batting strike rate therefore explains only about 3.4% of the variation in sold
price. The remaining variation in sold price is unexplained by the
model generated here.
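To make the meaning of R squared concrete, the sketch below recomputes it by hand as 1 minus the ratio of residual to total sum of squares and compares it with the value reported by summary(). The data frame ipl and its values are hypothetical and made up for illustration, so the resulting R squared is not the 0.03396 reported above.

```r
# Hypothetical data frame; values are made up, not the auction data.
ipl <- data.frame(
  SR..B      = c(100, 110, 120, 130, 140, 150),
  SOLD.PRICE = c(200000, 350000, 300000, 500000, 450000, 600000)
)
model1 <- lm(SOLD.PRICE ~ SR..B, data = ipl)

# R squared as reported by R.
r2_reported <- summary(model1)$r.squared

# R squared by hand: share of total variation explained by the model.
rss <- sum(residuals(model1)^2)                            # unexplained variation
tss <- sum((ipl$SOLD.PRICE - mean(ipl$SOLD.PRICE))^2)      # total variation
r2_manual <- 1 - rss / tss

# In simple regression R squared equals the squared correlation.
r2_from_cor <- cor(ipl$SR..B, ipl$SOLD.PRICE)^2

print(c(reported = r2_reported, manual = r2_manual, from_cor = r2_from_cor))
```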
The scatter plot was made using the commands:
plot(SR..B,SOLD.PRICE,main= "scatterplot")
abline(model1)
The scatter plot of the model clearly shows that the data points lie far from the best-fit line.
Q.2 What is the impact of ability to score “SIXERS” on the player’s price?
• Scoring sixers in IPL has a significant impact on the sold price of the player. Each
additional sixer scored is associated with an increase of 7693 in sold price.
• From the multiple R-squared, 20.3% of the variation in sold price is explained by the sixers
scored by the player.
• Statistically, the p-value is close to zero, so the alternate hypothesis is accepted:
sold price is significantly related to the sixers scored by the player in IPL.
• The scatter plot of the model shows the data clustered in one zone of the plot, yet the fit is not
good: the data points are scattered above and below the best-fit line rather than aligned
with the fitted model.
Figure 3: Scatter plot for sold price vs sixers scored
Q.3 Develop a multiple linear regression model between sold price and batting strike rate and sixers. What
do you conclude from this model?
For SR..B = 1 and SIXERS = 0 (per-unit increase in sold price when no sixers are scored)
For SR..B = 0 and SIXERS = 1 (per-unit increase in sold price when batting strike rate is nil)
Predicted increase in sold price per unit increase in sixers and batting strike rate in IPL = 7656.3
Figure 4: Input and output for Q.3
The batting strike rate of a player is negatively associated with the sold price, and the
association is not statistically significant: the p-value is greater than 0.05, so there is no
significant relation between SR..B and sold price.
In contrast, the sixers scored by a player are positively associated with sold price, and the
association is strong: each additional sixer raises the predicted sold price by 7758.
The p-value is less than 0.05 and close to 0, which shows that sixers scored and sold price are
significantly related.
When the two predictors are combined, the model explains only 20.3% of the variation in sold price.
In the scatter plot of the model, the data points are spread around the best-fit line rather than aligned
with it, which indicates that the two variables combined do not produce a good fit.
Figure 5: Multiple regression model- Sixers score and batting strike rate vs sold price
Q.4 Cricket in the T20 format is considered a young man’s sport. Is there evidence that the player’s price is
influenced by age?
To provide evidence on whether age influences the sold price, dummy variables are introduced
using the following command:
Once the variables are grouped into categories, the linear regression model is run by entering
the following commands:
model4<-lm(SOLD.PRICE~CatAGE)
summary(model4)
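The command that creates the age categories is not reproduced in this document. The sketch below shows one way it could be done, assuming AGE is stored as the codes 1, 2 and 3 for the three categories (this encoding is an assumption, not something stated in the text). The data frame ipl and its seven rows are hypothetical and made up for illustration. With a categorical predictor, lm() creates the dummy variables itself and treats the first level as the baseline.

```r
# Hypothetical toy data; AGE codes 1/2/3 are an assumed encoding.
ipl <- data.frame(
  AGE        = c(1, 1, 2, 2, 2, 3, 3),
  SOLD.PRICE = c(700000, 740000, 450000, 500000, 550000, 600000, 640000)
)

# Turn the age codes into a labelled categorical variable.
ipl$CatAGE <- factor(ipl$AGE, levels = c(1, 2, 3),
                     labels = c("L25", "B25-35", "A35"))

# lm() builds the dummy variables; L25 (first level) is the baseline.
model4 <- lm(SOLD.PRICE ~ CatAGE, data = ipl)
b <- coef(model4)

# Mean sold price per age category, for comparison with the coefficients.
group_means <- tapply(ipl$SOLD.PRICE, ipl$CatAGE, mean)

print(b)
print(group_means)
```

In this setup the intercept is the mean of the baseline category, and each remaining coefficient is the difference between that category's mean and the baseline mean.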
For category L25: 1 if age is less than 25; 0 otherwise
For category B25-35: 1 if age is between 25 and 35; 0 otherwise
For category A35: 1 if age is more than 35; 0 otherwise
µ = β0 + β1 X(B25-35) + β2 X(A35)    (a)
µ = 720,250
In equation (a), put X(B25-35) = 1 and X(A35) = 0.
The mean sold price of players aged less than 25 is the highest, followed by players who are
above 35. The mean selling price of players aged between 25 and 35 is the lowest. This implies that
sold price is positively associated with age below 25. The association with age is negative
when age increases beyond 35.
The scatter plot of the model, shown in Figure 6, also depicts the negative association of the
categories with the sold price. As the age of the player increases, the selling price decreases.
From the p-values of the age categories, it is clear that for age 25-35 the p-value is less than 0.05,
so the association is statistically significant. For age >35 the p-value is >0.05, so
the association is not statistically significant.
Q.5 Are players of Indian origin paid more than players from other countries?
Step 1: The factors are first converted into numeric values using the following command.
Step 2: Locating the correct numerical representative of each country
Once the countries are coded numerically, they are sorted and number coded.
By looking at the values in the console, we can easily identify which number represents which
country. For example, South Africa (SA) – 7, IND – 2, BAN – 3 and so on.
Step 3: The numeric values of the countries were then cut into categories.
Result:
Now, with IND (numeric value 2) assigned to category A and all other country codes
assigned to category B, the mean selling price of each category was calculated.
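The three steps can be sketched as follows. The commands used in the original analysis are not reproduced in this document, so this is only one possible approach: it uses ifelse() rather than cutting the numeric codes, and the data frame ipl, the column name ORIGIN, the model name model5 and all values are hypothetical and made up for illustration. The numeric codes produced here depend on which countries are present, so they need not match the codes quoted above.

```r
# Made-up toy data, not the auction data.
ipl <- data.frame(
  COUNTRY    = c("IND", "AUS", "IND", "SA", "AUS", "IND"),
  SOLD.PRICE = c(600000, 400000, 700000, 300000, 500000, 650000)
)

# Step 1: factor -> numeric codes (levels are sorted alphabetically by default).
country_factor <- factor(ipl$COUNTRY)
country_code   <- as.integer(country_factor)

# Step 2: look up which code stands for which country.
code_table <- setNames(seq_along(levels(country_factor)), levels(country_factor))

# Step 3: A = India, B = every other country.
ipl$ORIGIN <- factor(ifelse(country_code == code_table[["IND"]], "A", "B"))

# Mean sold price per category, then the same comparison as a regression.
group_means <- tapply(ipl$SOLD.PRICE, ipl$ORIGIN, mean)
model5 <- lm(SOLD.PRICE ~ ORIGIN, data = ipl)   # intercept = mean of A; slope = B minus A

print(code_table)
print(group_means)
```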
Findings:
Next, the regression analysis was carried out.
Result:
Analysis:
From the linear regression, the slope for other countries is β1 = 430974 and the slope for IND is
β2 = (221366).
0, otherwise
Conclusion:
The mean selling price of a player of Indian origin is 6,52,340, which is higher than
the mean sold price of a player of non-Indian origin. Hence, players of
Indian origin are paid more than players of non-Indian origin.
Q.6 Develop a model that can be used by franchises to predict the sold price.
The franchise model for predicting the price of a player is built in the following
steps:
Step 1: Input
The model includes the independent variables that generally matter in the final selection of a player
and that will therefore affect the price of the player during bidding sessions.
Step 2: Output
The output of the model from RStudio is as follows. The output is then used to select the variables
that produce the largest change in sold price per unit change in the variable.
Total runs; total wickets; Highest score; Striking batting rate; Sixers scored; wickets taken;
Average of bowling; Economy rate of the player; Striking bowling rate.
Sold Price = 158470.32 (Playing role all-rounder) + 189003.87 (Playing role batsman)
- 10209 (Playing role bowler) + 69253.24 (Playing role wicket keeper)
+ 19.13 (Total runs) + 291.51 (Total wickets) - 572.03 (Highest score)
+ 30.72 (Striking rate) + 7211.71 (Sixers scored) + 3737.22 (Wickets taken in IPL)
+ 11996.39 (Bowling average) - 2476.09 (Economy rate) - 10794 (Bowling strike rate)
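As a rough illustration of how a franchise might apply this equation, the sketch below wraps the coefficients in a small function. The coefficients are transcribed from the equation above, which has no separate intercept term as printed; the function name predict_sold_price, the argument names and the example player are hypothetical. The playing role is treated as a set of dummy variables, so exactly one role coefficient is added, and any statistic not supplied is treated as zero.

```r
# Coefficients transcribed from the equation above.
role_coef <- c(Allrounder = 158470.32, Batsman = 189003.87,
               Bowler = -10209, WicketKeeper = 69253.24)
stat_coef <- c(total_runs = 19.13, total_wickets = 291.51,
               highest_score = -572.03, strike_rate = 30.72,
               sixers = 7211.71, ipl_wickets = 3737.22,
               bowling_average = 11996.39, economy_rate = -2476.09,
               bowling_strike_rate = -10794)

# Hypothetical helper: role is one of names(role_coef); stats is a named
# numeric vector whose names are a subset of names(stat_coef).
predict_sold_price <- function(role, stats) {
  stopifnot(role %in% names(role_coef),
            all(names(stats) %in% names(stat_coef)))
  unname(role_coef[[role]] + sum(stat_coef[names(stats)] * stats))
}

# Made-up example player (not from the auction data).
player <- c(total_runs = 1000, highest_score = 100,
            strike_rate = 130, sixers = 20)
pred <- predict_sold_price("Batsman", player)
print(pred)
```

For this made-up batsman the terms are 189003.87 + 19130 - 57203 + 3993.6 + 144234.2, which gives a predicted sold price of 299158.67.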
Sixers 7211.71 Low/ + 0.00185 Yes Selected
Low: <10000
High: >20,000
Therefore, the key factors determining the selling price of the players are as follows:
➢ Playing role
➢ Sixers record
➢ Bowling average
➢ Bowling strike rate
3. Conclusion
Using RStudio, the data were analyzed to assess the factors affecting the selling
price of players in IPL bidding. The key findings are:
• Sold price and batting strike rate in IPL are statistically significantly related.
• Batting strike rate explains only 3.4% of the variation in sold price. The remaining variation in
sold price is unexplained by the model generated here.
• Scoring sixers in IPL has a significant impact on the sold price of the player. In the simple
regression, each additional sixer is associated with an increase of 7693 in sold price.
• In the multiple regression, the batting strike rate of a player is negatively associated with
the sold price, and the association is not statistically significant.
• The sixers scored by a player, in contrast, are positively associated with sold price, and the
association is strong: in the multiple regression, each additional sixer raises the predicted sold
price by 7758.
• Sold price is positively associated with age below 25. The association with age is
negative when age increases beyond 35.
• The mean selling price of a player of Indian origin is 6,52,340, which is higher
than the mean sold price of a player of non-Indian origin. Hence, players
of Indian origin are paid more than players of non-Indian origin.
• The key factors determining the selling price of the players are as follows:
o Playing role
o Sixers record
o Bowling average
o Bowling strike rate