Advanced Data Analytics Assignment
Submitted by
Ajitesh Agarwal E003
Abhishek Batra C011
Prabhjit Singh Walia B063
Ishan Katyal E024
Anusha Aggarwal I003
SUMMARY
This case deals with the pricing of players in the IPL. The IPL (Indian Premier League) is based on giving
exposure to Indian cricket players. It is a T20-format competition in which players are classified into
baskets, each with a minimum base price. Players are valued not only on their cricketing skills but also on
their marketing skills. The case also points out that a player who is good in the ODI/Test format is not
guaranteed to be good in T20.
Approach
We first extended the raw data in the case with the following additional columns.
1) BOW*SR
2) BAT*ECO
3) BAT*SRBL
4) BOW*RUNS-S
5) ALL*RUNS-S
6) BATS*WKI
7) ALL*WKI
8) BOW*TRUNS
9) ALL*TRUNS
10) BOW*ODI RUNS
11) ALL*ODI RUNS
12) BATS*Wk-O
13) ALL*WK-O
This was done to incorporate the role dummy variables (Batsman, Bowler and All-Rounder) into the data.
Each new column is the product of a role dummy and a performance statistic, so the data takes into account
the contribution made by the batsman, the bowler and the all-rounder separately.
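The construction of these columns can be illustrated with a minimal sketch. It assumes that each role is stored as a 0/1 dummy and that an interaction column is simply the dummy multiplied by the statistic; the three players, their values and the helper `add_interactions` are hypothetical and are not taken from the case data.

```python
import numpy as np

# Hypothetical mini data set: three players, one per role (not from the case).
players = {
    "BAT": np.array([1, 0, 0]),   # dummy: 1 if the player is a batsman
    "BOW": np.array([0, 1, 0]),   # dummy: 1 if the player is a bowler
    "ALL": np.array([0, 0, 1]),   # dummy: 1 if the player is an all-rounder
    "SR":  np.array([130.0, 90.0, 120.0]),   # illustrative strike rates
    "ECO": np.array([0.0, 7.5, 8.0]),        # illustrative economy rates
}

def add_interactions(data, pairs):
    """Return a copy of `data` with one DUMMY*STAT column per (dummy, stat) pair."""
    out = dict(data)
    for dummy, stat in pairs:
        # The product is the statistic for players in that role and 0 otherwise.
        out[f"{dummy}*{stat}"] = data[dummy] * data[stat]
    return out

extended = add_interactions(players, [("BOW", "SR"), ("BAT", "ECO"), ("ALL", "ECO")])
print(extended["BOW*SR"])
```

A column such as BOW*SR is therefore non-zero only for bowlers, which lets the regression estimate a separate slope for that role.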
1. Linear Regression
The data was entered into SPSS, where we ran a first regression analysis with the majority of
the variables as independent variables and SQRT(S-B) as the dependent variable.
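For readers without SPSS, the same kind of fit can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the variable `s_b` stands in for the S-B quantity of the case, and `fit_ols` is a hypothetical helper built on NumPy's least-squares solver rather than the SPSS procedure itself.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; returns (coefficients, R-squared)."""
    X1 = np.column_stack([np.ones(len(y)), X])       # prepend the constant term
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return beta, 1.0 - ss_res / ss_tot

# Synthetic stand-in data (assumption): 100 players, 3 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
root = 20.0 + X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=100)
s_b = root ** 2                  # hypothetical untransformed S-B values
y = np.sqrt(s_b)                 # dependent variable SQRT(S-B)

beta, r2 = fit_ols(X, y)
print(beta, r2)
```

The square-root transform is applied to the dependent variable before fitting, so the coefficients describe changes in SQRT(S-B), not in S-B itself.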
Model Summary
The model has an R-Square of 0.610. This is a moderate R-Square and can be improved by applying
other statistical techniques.
Model Summaryb
In this step we removed all the dummy variables such as batsman, all-rounder, bowler, country, etc.
Although the R-Square decreased, this model laid the foundation for outlier analysis. We requested
Cook's distance in the regression analysis and used it to eliminate outliers.
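As a rough illustration of how Cook's distance identifies influential observations, the sketch below computes D_i = (e_i^2 / (p * MSE)) * h_ii / (1 - h_ii)^2 for a simple regression, where e_i is the residual, h_ii the leverage and p the number of fitted coefficients. The data is synthetic, the function name is hypothetical, and the 4/n cutoff is a common rule of thumb assumed here; the cutoff actually used in the analysis is not stated.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for every observation of an OLS fit with intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    n, p = X1.shape
    hat = X1 @ np.linalg.pinv(X1.T @ X1) @ X1.T     # hat (projection) matrix
    h = np.diag(hat)                                # leverages h_ii
    resid = y - hat @ y
    mse = float(resid @ resid) / (n - p)
    return (resid ** 2 / (p * mse)) * h / (1.0 - h) ** 2

# Synthetic data (assumption): a clean line plus one corrupted point.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=30)
y[29] += 25.0                                       # planted outlier

d = cooks_distance(x.reshape(-1, 1), y)
cutoff = 4.0 / len(y)                               # assumed rule of thumb
outliers = np.where(d > cutoff)[0]
print(outliers)
```

Observations above the cutoff would be dropped before the regression is re-run.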
3. Outlier Analysis
Model Summaryb
a. Predictors: (Constant), Year, BAT*RUN-S, ODI-SR-BL, ALL*RUNS, BOW*T-RUNS, All*T-runs, SR -B, WKTS,
BATS*SR-B, ODI-SR-B, T-RUNS, BOW*RUNS, ECON, BOW*SR-BL, ALL*WKI, T-WKTS, AVE, BATS*ECO, All*ODI,
bow*ODIR, ALL*WKI, SIXERS, ALL*ECON, HS, MTS, BAT*ODI-RUNS, ALL*SR-B, BOW*SR, BOW*WK-O, AVE-BL,
RUNS-C, ALL*SR-BL, ODI-WKTS, BOW*WK-I
b. Dependent Variable: SQRT(S-B)
Model Summaryb
ANOVAa
a. Dependent Variable: SQRT(S-B)
b. Predictors: (Constant), Year, ALL*RUNS, ODI-SR-BL, BATS*ECO, bow*ODIR, All*T-runs, SR -B, MTS,
ODI-SR-B, T-RUNS, T-WKTS, BOW*RUNS, AVE-BL, ALL*WKI, All*ODI, BATS*SR-B, BOW*WK-I, AVE, ALL*ECON,
SIXERS, BOW*T-RUNS, HS, BOW*ECO, ALL*WKI, BOW*WK-O, BAT*SR, ODI-RUNS, RUNS-S, ALL*SR-B, RUNS-C,
BOW*SR-BL, ALL*SR-BL, WKTS, ODI-WKTS
The ANOVA F-test is significant here, so the null hypothesis that all regression coefficients are zero is
rejected. We now check tolerance/VIF: if VIF > 5 for a variable, it is collinear with the other predictors.
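The VIF check can be sketched as follows, assuming the usual definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors, and tolerance = 1 / VIF. The data is synthetic and the helper `vif` is hypothetical; SPSS reports the same quantities in its collinearity statistics.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / ((target - target.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors (assumption): column 2 is almost a copy of column 0.
rng = np.random.default_rng(2)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
X = np.column_stack([a, b, c])

vifs = vif(X)
tolerance = 1.0 / vifs
collinear = [j for j, v in enumerate(vifs) if v > 5.0]   # the VIF > 5 rule
print(vifs, collinear)
```

Only the two nearly identical columns exceed the threshold, while the independent column stays close to 1.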
Coefficients
B and Std. Error are the unstandardized coefficients, Beta is the standardized coefficient, and
Tolerance and VIF are the collinearity statistics.
Model                B Std. Error    Beta       t  Sig. Tolerance       VIF
(Constant)    ########  30544.661          -4.000  .000
BOW*SR-BL       -2.550      9.350   -.096   -.273  .786      .019    53.442
BOW*WK-I        11.489     11.198    .778   1.026  .309      .004   249.183
BOW*WK-O         5.536      6.029   1.408    .918  .362      .001  1017.570
T-RUNS           -.014       .023   -.124   -.608  .545      .055    18.159
T-WKTS          -2.420       .647   -.766  -3.740  .000      .055    18.171
ODI-RUNS          .027       .024    .243   1.116  .269      .049    20.546
ODI-SR-B         1.267       .809    .099   1.566  .122      .583     1.716
ODI-WKTS        -3.499      6.031  -1.053   -.580  .564      .001  1426.374
ODI-SR-BL        -.937       .995   -.070   -.941  .350      .414     2.416
ALL*ECON        57.357     37.892    .612   1.514  .135      .014    70.886
AVE             16.339      3.874    .547   4.218  .000      .137     7.284
SR B             -.668       .634   -.076  -1.054  .296      .449     2.225
SIXERS           5.136      2.241    .331   2.292  .025      .111     9.005
RUNS-C            .416       .187    .720   2.220  .030      .022    45.482
WKTS           -21.750     11.593  -1.493  -1.876  .065      .004   274.142
AVE-BL           -.028      6.340   -.002   -.004  .997      .019    52.986
BATS*ECO        18.530      8.830    .162   2.098  .040      .389     2.572
BATS*SR-B        -.464     11.293   -.009   -.041  .967      .050    20.086
BOW*RUNS        -1.613       .644   -.231  -2.506  .015      .273     3.665
ALL*RUNS          .540       .211    .371   2.556  .013      .110     9.125
The variables highlighted in red in the SPSS output, i.e. those with VIF below 5, do not exhibit
multicollinearity. All the remaining variables are passed through PCA.
4. Principal Component Analysis (1st time)
Communalities
Initial Extraction
Total Variance Explained
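The quantities reported in the Communalities and Total Variance Explained tables can be reproduced in outline with the sketch below. It assumes PCA on the correlation matrix, with loadings equal to eigenvectors scaled by the square root of their eigenvalues and extraction communalities equal to the sum of squared loadings over the retained components. The data is synthetic and `pca_loadings` is a hypothetical helper.

```python
import numpy as np

def pca_loadings(X, n_components):
    """PCA on the correlation matrix: loadings, variance shares, communalities."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                # sort to descending
    eigvals = np.clip(eigvals[order], 0.0, None)     # guard tiny negative values
    eigvecs = eigvecs[:, order]
    loadings = eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])
    explained = eigvals / eigvals.sum()              # share of total variance
    communalities = (loadings ** 2).sum(axis=1)      # extraction communalities
    return loadings, explained, communalities

# Synthetic data (assumption): 6 variables driven by 2 latent factors.
rng = np.random.default_rng(3)
factors = rng.normal(size=(200, 2))
mix = rng.normal(size=(2, 6))
X = factors @ mix + 0.3 * rng.normal(size=(200, 6))

loadings, explained, communalities = pca_loadings(X, n_components=2)
_, _, full_communalities = pca_loadings(X, n_components=6)
print(np.round(explained, 3), np.round(communalities, 3))
```

With every component retained each communality equals 1, which corresponds to the Initial column; the Extraction column is lower because only some components are kept.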
KMO and Bartlett's Test
Df 378
Sig. .000
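The degrees of freedom of Bartlett's test of sphericity are p(p-1)/2 for p variables, so a df of 378 corresponds to p = 28 variables entering the PCA. The sketch below computes the test statistic and its degrees of freedom under the standard formula chi2 = -(n - 1 - (2p + 5)/6) * ln|R|, where R is the correlation matrix. The data is random and purely illustrative, and the function name is hypothetical.

```python
import numpy as np

def bartlett_sphericity(X):
    """Bartlett's test of sphericity: returns (chi-square statistic, df)."""
    n, p = X.shape
    corr = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)              # log of det(R), at most 0
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    df = p * (p - 1) // 2
    return chi2, df

# Illustrative random data (assumption): 100 observations, 28 variables.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 28))

chi2, df = bartlett_sphericity(X)
print(chi2, df)
```

A significance value of .000 means the correlation matrix differs from the identity matrix, so the variables are correlated enough for PCA to be worthwhile.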
Anti-image Correlation (diagonal entries are marked with a)
BAT*SR: .812a .586 .348 .163 -.074 .082 -.072 -.059 .088 -.105 .535 .067 .210 .163 -.264 -.287 -.156 -.102 -.068 -.190 .213 .232 .100 .014 .303 -.036 -.036 -.112
BOW*ECO: .586 .860a .143 -.032 -.122 -.054 .057 .161 .111 .275 .391 .073 .208 -.203 .058 .009 -.140 -.161 .031 -.213 .215 .070 -.017 .119 .177 -.123 -.072 -.136
BOW*SR-BL: .348 .143 .668a .302 .009 .006 -.241 .009 .003 -.043 -.029 .560 .277 .091 -.083 -.128 -.087 -.102 -.269 -.919 .869 .094 .291 .073 .295 -.034 -.023 -.017
BOW*WK-I: .163 -.032 .302 .604a -.191 -.009 -.058 -.083 .192 -.453 .041 .160 .075 .426 -.275 -.035 .112 .198 -.950 -.289 .496 -.003 .942 -.007 .111 .035 .017 -.195
BOW*WK-O: -.074 -.122 .009 -.191 .673a .111 -.039 -.005 -.991 -.123 -.081 .028 -.015 .064 -.024 -.001 .012 .172 .152 -.042 .040 .006 -.211 -.026 -.058 -.045 -.013 .988
T-RUNS: .082 -.054 .006 -.009 .111 .609a -.060 -.932 -.104 -.082 .108 -.018 -.012 .158 -.324 -.121 .326 .071 .008 .017 -.054 -.059 -.038 -.094 -.287 .076 .473 .097
The above image shows the anti-image correlation matrix. From it, variables are removed whose value is
5. Principal Component Analysis (n times)
PCA is then repeated several times, each time with the flagged variables removed.
Rotated Component Matrix (a)
                                    Component
Variable           1       2       3       4       5       6
BAT*SR          .619   -.389   -.261   -.340   -.175   -.398
BOW*ECO        -.493   -.332    .389    .411   -.171    .373
BOW*SR-BL      -.407   -.327    .375    .380   -.117    .429
BOW*WK-I       -.274   -.260    .165    .831   -.077    .162
BOW*WK-O       -.200   -.135    .919    .108   -.111    .117
T-RUNS          .079   -.094   -.050   -.164    .074   -.899
T-WKTS         -.152   -.063    .715    .074    .462    .041
ODI-RUNS        .190   -.026   -.087   -.172    .138   -.859
ODI-WKTS       -.183    .120    .777    .093    .530    .034
MTS             .738   -.066   -.043    .587   -.095    .084
ALL*SR-B       -.040    .852   -.122   -.114    .416    .102
ALL*SR-BL      -.024    .827   -.108   -.147    .414    .149
ALL*ECON       -.059    .846   -.128   -.145    .378    .138
RUNS-S          .938   -.024   -.147   -.079   -.069   -.115
HS              .855    .122   -.188   -.205   -.024   -.218
AVE             .794    .131   -.191   -.284    .028   -.172
SIXERS          .915    .067   -.121   -.056   -.057   -.038
RUNS-C         -.111    .182    .194    .912    .003    .216
WKTS           -.152    .122    .140    .937   -.010    .156
AVE-BL         -.182    .313    .170    .132    .219    .645
ALL*RUNS        .192    .878   -.039    .089    .028    .081
ALL*WKI         .111    .800   -.029    .240    .129   -.053
BOW*T-RUNS     -.148   -.122    .882    .172   -.098    .105
All*T-runs     -.055    .235   -.010   -.062    .885   -.043
bow*ODIR       -.126   -.062    .826    .106   -.169    .084
All*ODI        -.014    .398   -.032   -.064    .782   -.025
ALL*WKI        -.029    .340    .006    .007    .895   -.074
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 7 iterations.
After applying varimax rotation, we treat a variable as loading on more than one factor when it has
multiple loadings greater than 0.4. We remove these variables.
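The removal rule can be sketched as a simple filter over the rotated loadings. The four rows below are copied from the Rotated Component Matrix; comparing absolute values against the 0.4 cutoff is an assumption made here (the text only says "values > 0.4"), and the helper name is hypothetical.

```python
import numpy as np

# Four rows copied from the Rotated Component Matrix (components 1 to 6).
names = ["MTS", "RUNS-S", "T-WKTS", "WKTS"]
loadings = np.array([
    [ .738, -.066, -.043,  .587, -.095,  .084],
    [ .938, -.024, -.147, -.079, -.069, -.115],
    [-.152, -.063,  .715,  .074,  .462,  .041],
    [-.152,  .122,  .140,  .937, -.010,  .156],
])

def cross_loading(names, loadings, cutoff=0.4):
    """Names of variables whose |loading| exceeds `cutoff` on 2+ components."""
    counts = (np.abs(loadings) > cutoff).sum(axis=1)
    return [name for name, count in zip(names, counts) if count > 1]

flagged = cross_loading(names, loadings)
print(flagged)
```

MTS and T-WKTS each exceed the cutoff on two components and would be dropped, while RUNS-S and WKTS load cleanly on a single component and are kept.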
PCA is applied again after removing these variables.
Bartlett's Test: df 351, Sig. .000
Component: 1 2 3 4 5
After PCA, all the variables that were removed during factor analysis are added back.
The list of independent variables includes:
I. Variables excluded during PCA
II. Variables removed after PCA
III. Components generated in PCA
IV. Dummy variables
Model Summaryb
a. Predictors: (Constant), BATS*WKI, MTS, L25, REGR factor score 4 for analysis 7,
ODI-SR-B, AUSTRALIA, ECON, SR -B, ALL*WKI, BOW*RUNS, REGR factor score 2 for
analysis 7, A35, ODI-SR-BL, BATS*SR-B, Year, CAPTAINCY EXP, T-WKTS,
BATS*WKT, BOW*SR-BL, REGR factor score 5 for analysis 7, INDIA, T-RUNS, REGR
factor score 3 for analysis 7, BATS*ECO, REGR factor score 1 for analysis 7,
ALL*SR-B, BOW*SR, BAT*ODI-RUNS, ODI-WKTS, BAT*RUN-S, BOW*ECO, ALL*SR-BL, ALL,
BOW, BAT*T-RUNS
b. Dependent Variable: SQRT(S-B)
The model now has an R-Square of 0.815.
ANOVAa
Total 10264737.621 99
b. Predictors: (Constant), BATS*WKI, MTS, L25, REGR factor score 4 for analysis 7, ODI-SR-B,
AUSTRALIA, ECON, SR -B, ALL*WKI, BOW*RUNS, REGR factor score 2 for analysis 7, A35,
ODI-SR-BL, BATS*SR-B, Year, CAPTAINCY EXP, T-WKTS, BATS*WKT, BOW*SR-BL, REGR
factor score 5 for analysis 7, INDIA, T-RUNS, REGR factor score 3 for analysis 7, BATS*ECO,
REGR factor score 1 for analysis 7, ALL*SR-B, BOW*SR, BAT*ODI-RUNS, ODI-WKTS,
BAT*RUN-S, BOW*ECO, ALL*SR-BL, ALL, BOW, BAT*T-RUNS
The ANOVA F-test is significant, so the null hypothesis is rejected.
On closer inspection, we found that the VIF for some variables is still greater than 5, so
we applied stepwise regression.
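The idea behind stepwise selection can be illustrated with a simplified forward-only sketch: at each step the candidate with the largest partial F statistic is added, and selection stops when no candidate reaches the entry threshold. This is an approximation under stated assumptions, not the SPSS procedure: it has no removal step, the threshold of 3.84 (roughly the 5% critical value of F with 1 numerator degree of freedom and many denominator degrees of freedom) is assumed, and the data and function names are hypothetical.

```python
import numpy as np

def sse(X, y, cols):
    """Residual sum of squares of an OLS fit on the given columns plus intercept."""
    X1 = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def forward_select(X, y, f_enter=3.84):
    """Forward selection by partial F; returns column indices in entry order."""
    n, k = X.shape
    chosen = []
    current = sse(X, y, chosen)
    while len(chosen) < k:
        best = None
        for c in range(k):
            if c in chosen:
                continue
            new = sse(X, y, chosen + [c])
            df_resid = n - len(chosen) - 2          # intercept + chosen + candidate
            f_stat = (current - new) / (new / df_resid)
            if best is None or f_stat > best[0]:
                best = (f_stat, c, new)
        if best[0] < f_enter:
            break                                   # no candidate is worth adding
        chosen.append(best[1])
        current = best[2]
    return chosen

# Synthetic data (assumption): only columns 0 and 3 truly matter.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = 4.0 * X[:, 0] + 2.0 * X[:, 3] + rng.normal(size=100)

chosen = forward_select(X, y)
print(chosen)
```

Each step of the loop corresponds to one numbered model in a stepwise output, with one more predictor than the model before it.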
Model Summaryh
ANOVA
Model  Source      Sum of Squares  df  Mean Square  F       Sig.
1      Total       10264737.621    99
2      Regression  5894270.584     2   2947135.292  65.410  .000c
       Residual    4370467.037     97  45056.361
       Total       10264737.621    99
3      Regression  6464348.355     3   2154782.785  54.431  .000d
       Residual    3800389.265     96  39587.388
       Total       10264737.621    99
4      Regression  6730623.874     4   1682655.969  45.231  .000e
       Residual    3534113.746     95  37201.197
       Total       10264737.621    99
5      Regression  6899498.050     5   1379899.610  38.544  .000f
       Residual    3365239.570     94  35800.421
       Total       10264737.621    99
6      Regression  7049546.454     6   1174924.409  33.985  .000g
       Residual    3215191.166     93  34571.948
       Total       10264737.621    99
7      Regression  7184763.344     7   1026394.763  30.659  .000h
       Total       10264737.621    99
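The entries of the ANOVA table are linked by simple arithmetic: each mean square is the sum of squares divided by its degrees of freedom, F is the ratio of the two mean squares, and R-Square is the regression sum of squares divided by the total. The short check below uses the figures reported for model 2; the variable names are illustrative.

```python
# Values copied from the ANOVA table for model 2.
ss_regression = 5894270.584
ss_residual = 4370467.037
ss_total = 10264737.621
df_regression, df_residual = 2, 97

ms_regression = ss_regression / df_regression   # mean square, regression
ms_residual = ss_residual / df_residual         # mean square, residual
f_stat = ms_regression / ms_residual            # F statistic
r_squared = ss_regression / ss_total            # share of variance explained

print(round(ms_residual, 3), round(f_stat, 3), round(r_squared, 4))
```

The same calculation applied to each successive model shows the regression sum of squares, and hence R-Square, rising as predictors are added.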