Stats Project Doc

The document analyzes descriptive statistics from data on Pakistani ODI cricketers including their career years and total runs scored. It finds that most players only played for one year and had low run totals, while a few played for over 20 years and had over 10,000 runs. It examines variables like mean, median, mode, standard deviation and distributions to understand patterns in the data.

Uploaded by

zeenia ahmed

Introduction

Our data is taken from the list of Pakistani ODI cricketers. The list is arranged in the order in which each
player won his first ODI cap; where more than one player won his first ODI cap in the same match, those
players are listed alphabetically. For simplicity’s sake, we used only two columns for our statistical
analysis: the career years of each player (obtained by subtracting the starting year from the ending
year) and the runs he made during that time. Our analysis therefore rests solely on the relationship
between the time players spent as professional cricketers and their batting performance.
There were 228 records in total, of which we took 198 for the statistical analysis in SPSS. The 30
excluded records either had missing values or belonged to players who had yet to retire, whose
unfinished careers would have biased the analysis. As stated by the source of this data, the statistics
were correct as of 3rd November 2020.
In the following sections we give the descriptive statistics of our data and draw certain inferences
from them. We then perform normality tests, state our assumptions about the data, and test our
hypotheses using the appropriate tests. The results are discussed in detail. Snapshots of our data as
loaded into SPSS are attached in the appendix of this document, and the source of the data is cited as
a reference at the very end.
Variables
Runs and Career Years of ODI cricketers are quantitative variables. Both have a true zero and cannot take
negative values, so they were treated as ratio data. Because SPSS labels both ratio and interval data as
“scale”, Runs and Career Years were entered as scale variables in SPSS.

Descriptive Statistics

runs

N Valid                   198
N Missing                 0
Mean                      840.85
Median                    88.00
Mode                      0
Std. Deviation            1925.076
Variance                  3705917.882
Skewness                  3.261
Std. Error of Skewness    .173
Kurtosis                  11.033
Std. Error of Kurtosis    .344
Minimum                   0
Maximum                   11701

Out of the 198 records, the minimum runs made by a player was zero, while the maximum was 11701. The
mean is about 841, the median is 88 and the mode is 0. What we can deduce from this is that a small number
of very high run totals pull the mean well above both the median and the mode. As the median is not
affected by these extreme values, it is only about 1/10th of the mean. Coming to the mode, the most common
run total at the end of a career was zero (16 of the 198 players). This value arises because the data
covers all retired ODI cricketers, including those whose careers were cut very short for whatever reason.
Had we excluded such players, the mean, the median and especially the mode would have been quite
different. The standard deviation is about 1925, more than twice the mean, which shows that the data is
not clustered about a single value or group of values but is widely spread out. The skewness and kurtosis
of the data are analyzed in the coming sections. With respect to normality, we can expect the data to be
positively skewed, as the mean is greater than the median, which in turn is greater than the mode. This
expectation is investigated further in the coming sections.
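The standard errors of skewness and kurtosis in the table above depend only on the sample size. The short Python sketch below recomputes them for N = 198, assuming SPSS uses the usual small-sample formulas; the function names are our own and are not part of SPSS.

```python
import math

def se_skewness(n: int) -> float:
    """Standard error of skewness for a sample of size n (assumed SPSS formula)."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def se_kurtosis(n: int) -> float:
    """Standard error of kurtosis for a sample of size n (assumed SPSS formula)."""
    return 2 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

n = 198
ses = se_skewness(n)
sek = se_kurtosis(n)

# Ratio of the reported skewness of runs (3.261) to its standard error.
skew_ratio = 3.261 / ses

print(round(ses, 3), round(sek, 3))
```

As a rule of thumb, a skewness more than about twice its standard error points to a clearly asymmetric distribution, which is the case for runs.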
Career_years

N Valid                   198
N Missing                 0
Mean                      5.04
Median                    3.50
Mode                      1
Std. Deviation            4.776
Variance                  22.811
Skewness                  1.167
Std. Error of Skewness    .173
Kurtosis                  .710
Std. Error of Kurtosis    .344
Minimum                   1
Maximum                   21

This variable gives the number of career years for each of the 198 players. The minimum number of years
that a player was part of the cricket team was 1, while the maximum was 21. The mean was about 5 years,
the median was 3.5 years and the mode was 1 year. Relating this to the runs made by each player, the mode
of 0 runs is plausibly linked to the mode of 1 career year: the largest single group of players (79 of
198) played for only one year, so their run totals were low, quite possibly zero, and the chance of a
higher run total increases with the number of career years. Furthermore, the maximum of 21 career years
helps explain the large standard deviation we saw for the runs made: the extreme values that inflated
that mean are run totals far above it, built up over long careers. Coming back to this variable, the
standard deviation is about 4.8 years, which suggests that the values are spread out from the mean, but
not as much as those for runs made. The values of skewness and kurtosis are discussed in the coming
sections. For now it is sufficient to say that the distribution does not seem to be normal: the mean is
greater than the median which, in turn, is greater than the mode, so the distribution is positively
skewed.

Frequency Distribution:
Runs:
Runs    Frequency  Percent  Valid Percent  Cumulative Percent
0       16         8.1      8.1            8.1
1       3          1.5      1.5            9.6
2       4          2.0      2.0            11.6
3       2          1.0      1.0            12.6
4       1          .5       .5             13.1
5       5          2.5      2.5            15.7
7       2          1.0      1.0            16.7
8       1          .5       .5             17.2
9       1          .5       .5             17.7
10      3          1.5      1.5            19.2
11      2          1.0      1.0            20.2
12      2          1.0      1.0            21.2
13      3          1.5      1.5            22.7
14      3          1.5      1.5            24.2
15      4          2.0      2.0            26.3
17      1          .5       .5             26.8
18      3          1.5      1.5            28.3
20      1          .5       .5             28.8
22      1          .5       .5             29.3
24      1          .5       .5             29.8
25      3          1.5      1.5            31.3
26      3          1.5      1.5            32.8
27      4          2.0      2.0            34.8
28      2          1.0      1.0            35.9
29      1          .5       .5             36.4
31      1          .5       .5             36.9
34      3          1.5      1.5            38.4
35      1          .5       .5             38.9
36      1          .5       .5             39.4
37      1          .5       .5             39.9
39      1          .5       .5             40.4
41      1          .5       .5             40.9
42      1          .5       .5             41.4
45      1          .5       .5             41.9
48      3          1.5      1.5            43.4
53      1          .5       .5             43.9
56      1          .5       .5             44.4
60      1          .5       .5             44.9
61      1          .5       .5             45.5
62      1          .5       .5             46.0
65      1          .5       .5             46.5
66      1          .5       .5             47.0
69      1          .5       .5             47.5
74      1          .5       .5             48.0
78      1          .5       .5             48.5
80      1          .5       .5             49.0
84      1          .5       .5             49.5
87      1          .5       .5             50.0
89      2          1.0      1.0            51.0
97      1          .5       .5             51.5
99      1          .5       .5             52.0
100     1          .5       .5             52.5
110     1          .5       .5             53.0
111     2          1.0      1.0            54.0
113     1          .5       .5             54.5
116     2          1.0      1.0            55.6
119     1          .5       .5             56.1
127     1          .5       .5             56.6
130     1          .5       .5             57.1
131     1          .5       .5             57.6
133     1          .5       .5             58.1
141     1          .5       .5             58.6
142     1          .5       .5             59.1
147     1          .5       .5             59.6
154     1          .5       .5             60.1
166     1          .5       .5             60.6
184     1          .5       .5             61.1
193     1          .5       .5             61.6
197     1          .5       .5             62.1
199     1          .5       .5             62.6
209     1          .5       .5             63.1
210     1          .5       .5             63.6
221     2          1.0      1.0            64.6
234     1          .5       .5             65.2
236     1          .5       .5             65.7
242     1          .5       .5             66.2
262     1          .5       .5             66.7
267     1          .5       .5             67.2
271     1          .5       .5             67.7
297     1          .5       .5             68.2
314     2          1.0      1.0            69.2
321     1          .5       .5             69.7
324     1          .5       .5             70.2
330     1          .5       .5             70.7
348     1          .5       .5             71.2
349     1          .5       .5             71.7
383     1          .5       .5             72.2
394     1          .5       .5             72.7
399     2          1.0      1.0            73.7
457     1          .5       .5             74.2
504     1          .5       .5             74.7
524     1          .5       .5             75.3
543     1          .5       .5             75.8
556     1          .5       .5             76.3
593     1          .5       .5             76.8
641     1          .5       .5             77.3
642     1          .5       .5             77.8
711     1          .5       .5             78.3
725     1          .5       .5             78.8
741     1          .5       .5             79.3
768     1          .5       .5             79.8
782     1          .5       .5             80.3
786     1          .5       .5             80.8
812     1          .5       .5             81.3
966     1          .5       .5             81.8
969     1          .5       .5             82.3
1068    1          .5       .5             82.8
1265    1          .5       .5             83.3
1269    1          .5       .5             83.8
1336    1          .5       .5             84.3
1418    1          .5       .5             84.8
1521    1          .5       .5             85.4
1579    1          .5       .5             85.9
1709    1          .5       .5             86.4
1719    1          .5       .5             86.9
1845    1          .5       .5             87.4
1877    1          .5       .5             87.9
1895    1          .5       .5             88.4
2028    1          .5       .5             88.9
2185    1          .5       .5             89.4
2572    1          .5       .5             89.9
2605    1          .5       .5             90.4
2653    1          .5       .5             90.9
3194    1          .5       .5             91.4
3236    1          .5       .5             91.9
3266    1          .5       .5             92.4
3709    1          .5       .5             92.9
3717    1          .5       .5             93.4
4780    1          .5       .5             93.9
5080    1          .5       .5             94.4
5122    1          .5       .5             94.9
5841    1          .5       .5             95.5
6564    1          .5       .5             96.0
7170    1          .5       .5             96.5
7240    1          .5       .5             97.0
7381    1          .5       .5             97.5
7534    1          .5       .5             98.0
8064    1          .5       .5             98.5
8823    1          .5       .5             99.0
9720    1          .5       .5             99.5
11701   1          .5       .5             100.0
Total   198        100.0    100.0
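To make the Percent and Cumulative Percent columns concrete, the following minimal Python sketch recomputes them for the first four rows of the Runs table. It assumes each percentage is the frequency (or running total of frequencies) divided by N = 198 and rounded to one decimal place; the variable names are illustrative only.

```python
n_total = 198

# (runs, frequency) pairs copied from the first four rows of the Runs table.
first_rows = [(0, 16), (1, 3), (2, 4), (3, 2)]

running = 0   # running total of frequencies
table = []
for runs, freq in first_rows:
    running += freq
    percent = round(100 * freq / n_total, 1)
    cumulative = round(100 * running / n_total, 1)
    table.append((runs, freq, percent, cumulative))

for row in table:
    print(row)
```

Because there are no missing values, the Valid Percent column is identical to the Percent column.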

Career_years:
Career years  Frequency  Percent  Valid Percent  Cumulative Percent
1             79         39.9     39.9           39.9
2             13         6.6      6.6            46.5
3             7          3.5      3.5            50.0
4             13         6.6      6.6            56.6
5             11         5.6      5.6            62.1
6             11         5.6      5.6            67.7
7             8          4.0      4.0            71.7
8             13         6.6      6.6            78.3
9             10         5.1      5.1            83.3
10            6          3.0      3.0            86.4
11            3          1.5      1.5            87.9
12            5          2.5      2.5            90.4
13            4          2.0      2.0            92.4
14            6          3.0      3.0            95.5
15            2          1.0      1.0            96.5
16            1          .5       .5             97.0
17            1          .5       .5             97.5
18            1          .5       .5             98.0
19            2          1.0      1.0            99.0
20            1          .5       .5             99.5
21            1          .5       .5             100.0
Total         198        100.0    100.0
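Since the Career_years frequency table lists every value, the descriptive statistics reported earlier can be cross-checked from it. The Python sketch below expands the table back into 198 observations and uses the standard library's statistics module; it is only a check on the SPSS output, not the procedure we used.

```python
import statistics

# Career years -> frequency, copied from the Career_years frequency table.
career_freq = {
    1: 79, 2: 13, 3: 7, 4: 13, 5: 11, 6: 11, 7: 8, 8: 13, 9: 10, 10: 6, 11: 3,
    12: 5, 13: 4, 14: 6, 15: 2, 16: 1, 17: 1, 18: 1, 19: 2, 20: 1, 21: 1,
}

# Expand the table into one value per player.
values = [years for years, freq in career_freq.items() for _ in range(freq)]

n = len(values)
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
variance = statistics.variance(values)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)       # sample standard deviation

print(n, round(mean, 2), median, mode, round(variance, 3), round(std_dev, 3))
```

The median falls between the 99th and 100th ordered values (3 and 4 years), which is why it is 3.5 rather than a whole number.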


Graphical analysis:
Histograms were made for both our variables, as they are quantitative in nature and are therefore best
represented by a histogram. The normal curve drawn over each histogram also let us assess the normality
of the data visually. Analysis of both graphs is as follows.

Runs:

From the histogram we can see that the distribution is positively skewed. At first glance the curve also
appears platykurtic. The numerical measures confirm the positive skew, with a skewness of 3.261 as
reported in the descriptive statistics, but they do not support the flat shape: with a kurtosis of
11.033, the distribution is actually leptokurtic. We can also deduce from the graph that the positive
skewness is caused by the players who have more than 2000 runs. As these values are far greater than the
rest of the data, they pull the distribution away from the normal curve and stretch its right tail. It
can also be easily seen that the greatest number of players have runs between 0 and 1000.
Career:

In this histogram of Career years of the players, we can see that the distribution is positively skewed.
At first glance the curve also appears platykurtic. The numerical measures confirm the positive skew,
with a skewness of 1.167 as reported in the descriptive statistics, but they do not support the flat
shape: with a kurtosis of 0.710, the distribution is actually mildly leptokurtic. We can also deduce from
the graph that the positive skewness is caused by the handful of players who have more than 10 career
years. As these values are far greater than the rest of the data, they pull the distribution away from
the normal curve and stretch its right tail. Furthermore, it can be easily seen that the greatest number
of players have career years between 1 and 10.
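The skewness and kurtosis of Career years can be recomputed from the frequency table given earlier. The Python sketch below uses the bias-adjusted sample formulas that we assume SPSS applies (kurtosis is reported as excess kurtosis, so a normal distribution scores 0); the results should come out close to the reported 1.167 and 0.710.

```python
import math

# Career years -> frequency, copied from the Career_years frequency table.
career_freq = {
    1: 79, 2: 13, 3: 7, 4: 13, 5: 11, 6: 11, 7: 8, 8: 13, 9: 10, 10: 6, 11: 3,
    12: 5, 13: 4, 14: 6, 15: 2, 16: 1, 17: 1, 18: 1, 19: 2, 20: 1, 21: 1,
}
values = [years for years, freq in career_freq.items() for _ in range(freq)]

n = len(values)
mean = sum(values) / n
s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))  # sample SD

# Sums of the third and fourth powers of the standardized values.
z3 = sum(((x - mean) / s) ** 3 for x in values)
z4 = sum(((x - mean) / s) ** 4 for x in values)

# Bias-adjusted sample skewness and excess kurtosis (assumed SPSS definitions).
skewness = n / ((n - 1) * (n - 2)) * z3
kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

print(round(skewness, 3), round(kurtosis, 3))
```

A positive skewness means a longer right tail, and a positive excess kurtosis means heavier tails than the normal curve, which is what leptokurtic refers to.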

Scatter Plot for Runs against Career Years


Although the frequency distributions describe each variable on its own, a scatter plot gives a much
clearer picture of the relationship between Career years and Runs.

Normality Tests
As discussed above, our variables did not appear to be normally distributed. Here we test for normality
using the Kolmogorov-Smirnov (KS) and Shapiro-Wilk (SW) tests in order to determine whether our earlier
inferences were correct.

Runs
Tests of Normality

        Kolmogorov-Smirnov            Shapiro-Wilk
        Statistic   df    Sig.        Statistic   df    Sig.
runs    .331        198   .000        .490        198   .000

Let H0: The variable (Runs) is normally distributed.
And H1: The variable (Runs) is not normally distributed.
If the significance value (p value) is less than 0.05, the null hypothesis (H0) is rejected.
SPSS reports the p value of both the KS test and the SW test as .000, i.e. less than 0.001, so both are
less than 0.05. Therefore, we reject the null hypothesis (H0) and conclude that the Runs variable is not
normally distributed.

Career_Years:

Tests of Normality

          Kolmogorov-Smirnov            Shapiro-Wilk
          Statistic   df    Sig.        Statistic   df    Sig.
career    .202        198   .000        .823        198   .000

Let H0: The variable (Career Years) is normally distributed.
And H1: The variable (Career Years) is not normally distributed.
If the significance value (p value) is less than 0.05, the null hypothesis (H0) is rejected.
SPSS reports the p value of both the KS test and the SW test as .000, i.e. less than 0.001, so both are
less than 0.05. Therefore, we reject the null hypothesis (H0) and conclude that the Career Years variable
is not normally distributed.
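To show what the Kolmogorov-Smirnov statistic measures, the Python sketch below recomputes it for Career years from the frequency table given earlier. It assumes the statistic is the largest gap between the empirical cumulative distribution and a normal curve with the sample mean and standard deviation; the p value (which SPSS obtains with a correction for the estimated parameters) is not reproduced here.

```python
import math

# Career years -> frequency, copied from the Career_years frequency table.
career_freq = {
    1: 79, 2: 13, 3: 7, 4: 13, 5: 11, 6: 11, 7: 8, 8: 13, 9: 10, 10: 6, 11: 3,
    12: 5, 13: 4, 14: 6, 15: 2, 16: 1, 17: 1, 18: 1, 19: 2, 20: 1, 21: 1,
}

n = sum(career_freq.values())
mean = sum(x * f for x, f in career_freq.items()) / n
s = math.sqrt(sum(f * (x - mean) ** 2 for x, f in career_freq.items()) / (n - 1))

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

d_stat = 0.0
seen = 0
for years in sorted(career_freq):
    before = seen / n            # empirical CDF just below this value
    seen += career_freq[years]
    after = seen / n             # empirical CDF at this value
    expected = normal_cdf(years, mean, s)
    d_stat = max(d_stat, abs(after - expected), abs(expected - before))

print(round(d_stat, 3))
```

The largest gap occurs at 2 career years, where the normal curve expects far fewer players than the 46.5% actually observed at or below that point.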

T-Test
As the population variance is unknown, it has to be estimated from the sample, so the appropriate test is
the one-sample t test rather than the z test. With 198 records the t distribution is very close to the
standard normal, so the two tests would give almost identical results, and SPSS offers the t test
directly. Although neither variable is normally distributed, the sample is large enough for the sampling
distribution of the mean to be approximately normal, which is what the t test relies on; the strong skew
of Runs means its result should still be read with some caution. We therefore applied the t test to our
hypotheses about the average runs made and the average career years in the dataset. A t test can be one
tailed or two tailed; we applied the two tailed test to both hypotheses.

Runs
One-Sample Test (Test Value = 1000)

        t        df    Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference
                                                           Lower      Upper
runs    -1.163   197   .246              -159.146          -428.95    110.65

Analysis: (let’s suppose the average runs per player is 1000)

Formulation of null and alternative hypothesis
H0: µ = 1000
H1: µ ≠ 1000
Level of significance
α = 0.05
P value = 0.246
Critical region:
Reject H0 if p value < α
Interpretation
As the p value (0.246) is greater than α, we fail to reject H0. There is not enough evidence to conclude
that the average runs per player differs from 1000.
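The t statistic in the table can be reproduced from the summary statistics alone. The Python sketch below does so for Runs; the p value in it uses a normal approximation to the t distribution (reasonable at 197 degrees of freedom, but an assumption on our part), so it comes out slightly below the .246 that SPSS reports.

```python
import math
from statistics import NormalDist

n = 198
sample_mean = 840.854    # test value plus the reported mean difference (1000 - 159.146)
sample_sd = 1925.076     # reported standard deviation of runs
test_value = 1000
alpha = 0.05

standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - test_value) / standard_error
df = n - 1

# Two-tailed p value using the standard normal as a stand-in for t with 197 df.
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))
reject_h0 = p_approx < alpha

print(round(t_stat, 3), df, round(p_approx, 3), reject_h0)
```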

Career:

One-Sample Test (Test Value = 8)

          t        df    Sig. (2-tailed)   Mean Difference   95% Confidence Interval of the Difference
                                                             Lower      Upper
career    -8.720   197   .000              -2.960            -3.63      -2.29

Analysis: (let’s suppose the average career length of a player is 8 years)

Formulation of null and alternative hypothesis
H0: µ = 8
H1: µ ≠ 8
Level of significance
α = 0.05
P value = 0.000 (reported by SPSS as .000, i.e. less than 0.001)
Critical region:
Reject H0 if p value < α
Interpretation
As the p value (less than 0.001) is less than α, we reject H0 and conclude that the average number of
career years is not 8. The negative mean difference shows that the average is below 8 years.
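The 95% confidence interval of the difference can be rebuilt from the summary statistics in the same way. In the Python sketch below, the critical value 1.972 is an assumption read from a t table for roughly 197 degrees of freedom rather than something SPSS printed, and the inputs are the rounded values from the descriptive statistics.

```python
import math

n = 198
sample_mean = 5.04     # reported mean career years
sample_sd = 4.776      # reported standard deviation
test_value = 8
t_critical = 1.972     # assumed two-tailed 5% critical value for about 197 df

standard_error = sample_sd / math.sqrt(n)
mean_difference = sample_mean - test_value
t_stat = mean_difference / standard_error

lower = mean_difference - t_critical * standard_error
upper = mean_difference + t_critical * standard_error

print(round(t_stat, 2), round(lower, 2), round(upper, 2))
```

Because the whole interval lies below zero, it leads to the same decision as the p value: the mean career length is significantly lower than 8 years.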
