Math AI IA

Uploaded by

George Joseph
Which of the measured statistics in a given basketball game has the largest effect on probability of success?

Candidate Code - lpc478


Mathematics AI SL May 2025

Introduction
Basketball is a sport where strategy and statistics intersect. As a basketball player and fan
myself, I find that quantifying success in my basketball games often comes down to the measured
statistics (often referred to as “stats”). This investigation examines the relationship
between these statistics and the outcome of a game.

As someone who plays in the “center” role, my contribution to the team is mainly to recover the
ball after a missed shot (a rebound) and score if possible. My role is, admittedly, more defensive than
offensive. The inspiration for this topic stems from the growing role of data analytics in sports.
Oftentimes, more specific statistics are overlooked in favour of Total Points. Although points are
the ultimate deciding factor in a basketball game, the other factors leading to points being
scored are worth examining to identify their importance, and also assist players, like myself, in
placing importance on the ideal skill sets to achieve success. This study aims to contribute to
this discourse by using mathematical tools to identify patterns and relationships. Specifically, the
investigation seeks to answer the question: Which of the measured statistics in a given
basketball game has the largest effect on probability of success?

This investigation blends statistical techniques with real-world data. By quantifying relationships
between game statistics and outcomes, this study hopes to provide valuable insights into the
dynamics of winning basketball games while showcasing the relevance of mathematics in sports
analysis.

Rationale
Mathematics plays a fundamental role in analyzing sports data, where identifying relationships
between variables can offer actionable insights for teams and analysts. In this investigation,
statistical tools are employed to evaluate the relationship between key performance
metrics—points scored, field goals made, and assists—and the Chicago Bulls’ chances of
winning during the 2023–24 NBA season. The use of Pearson’s correlation coefficient and the
chi-square test of independence ensures that this study investigates both linear relationships
and categorical associations comprehensively.

These statistical methods are particularly appropriate for this investigation as they address
different aspects of association. While Pearson’s r examines linear relationships between numerical
variables, the chi-square test focuses on categorical associations, so together they give a fuller
picture of the data. From the various statistics available in basketball games, three were chosen
for this investigation due to their popularity and importance. Points are the key
factor by which victory is decided, and Assists are directly related to points being scored
in a game. Additionally, Field Goals are the most common way points are scored in games,
among the 3 possible scoring methods (Free Throws, Field Goals and 3-Pointers).

Data Collection
All scores and statistical data were collected from www.basketball-reference.com, a trusted
website which tracks all statistics for not only games in the NBA, but the other various basketball
leagues as well, such as the G-League and the WNBA. Of the 82 total games played by the
Chicago Bulls in the 2023-24 season, 45 games were selected using random sampling and
used for this analysis. Random Sampling was used as the selection method because it
minimizes bias when representing the population, including a variety of scenarios such as home
and away games, games against “stronger” or “weaker” teams, and games with varying levels of
performance.
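The selection step described above can be sketched in a few lines of Python. This is only an illustration of simple random sampling without replacement, not necessarily the tool that was actually used for this investigation; the function name and the seed value are hypothetical and only make the draw repeatable.

```python
import random

TOTAL_GAMES = 82   # regular-season games played by the Bulls in 2023-24
SAMPLE_SIZE = 45   # number of games analysed in this investigation


def sample_game_numbers(seed=None):
    """Return SAMPLE_SIZE distinct game numbers drawn uniformly from 1..TOTAL_GAMES."""
    rng = random.Random(seed)  # the seed is a hypothetical value, used only for repeatability
    return sorted(rng.sample(range(1, TOTAL_GAMES + 1), SAMPLE_SIZE))


games = sample_game_numbers(seed=2025)
print(games)
```

Because every game number has the same chance of being drawn, home and away games and stronger and weaker opponents are all eligible for the sample.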

Furthermore, analysing 45 games provides a sufficiently large dataset to identify patterns and
correlations, since it is more than half of the total 82 games played in an NBA regular season.
A common rule of thumb associated with the Central Limit Theorem is that a sample of at least 30
is large enough to approximate the characteristics of a population reasonably well. My sample size of
45 exceeds this. The risk of random anomalies and/or outliers skewing the results is also reduced,
since they have less of an effect on the overall calculations of a larger sample.

The statistics from each game I am considering for this analysis are:
●​ Total Points Scored (TP): total number of points scored by the Bulls in each game
●​ Field Goals (FG): total number of successful baskets scored not including free throws
●​ Assists (AST): total number of assists recorded from all members of the team together
●​ Game Result (W/L): Win or Loss.

Collected data from all regular season games for the Chicago Bulls in the 2023-24 season were
grouped into a table. The data from 45 randomly sampled games was then grouped into a
separate table, shown below.

Table 1. Randomly sampled games and corresponding statistics (Basketball Reference)
SN Game No. W/L TP FG AST

1 5 0 105 42 19
2 6 0 107 43 21
3 7 0 101 39 28
4 9 0 115 42 30
5 14 1 102 36 24
6 15 0 100 37 21
7 17 0 108 42 20
8 20 1 120 45 32
9 21 1 124 48 32
10 22 1 111 38 23
11 23 1 121 47 27
12 24 0 129 47 25
13 25 0 106 36 26
14 27 0 116 37 25
15 29 1 124 48 25
16 31 0 95 35 23
17 32 1 118 42 23
18 35 0 97 37 17
19 36 0 100 37 24
20 38 1 119 45 27
21 39 1 124 42 31
22 41 1 122 46 31
23 42 0 91 35 21
24 44 1 125 47 32
25 46 0 132 47 27
26 50 0 115 42 24
27 51 1 129 45 24
28 53 0 108 43 21
29 54 1 136 51 29
30 56 0 112 44 30
31 60 0 97 36 25
32 63 1 125 46 29
33 64 0 102 37 27
34 65 0 92 35 19
35 66 1 132 48 22
36 67 0 111 41 28
37 68 1 127 49 33
38 70 0 117 42 23
39 71 0 113 47 36
40 75 1 109 42 30
41 77 1 108 44 27
42 79 0 117 47 20
43 80 1 127 51 28
44 81 1 129 48 28
45 82 0 119 49 25

In the W/L column, wins have been represented with the value 1 and losses have been
represented with the value 0. This is done so the values can be used in our mathematical
calculations later. Also, the Game No. column shows which game from the regular season is
being represented in the table, in relation to the 82 total games the team played.

Data Analysis
Before performing any statistical tests, it is important to summarise and visualise the data to
identify patterns, trends, and outliers. In the table below, I have the mean and median values for
each performance statistic. Additionally, to gain better understanding of their variance, I have
calculated the standard deviation of each performance metric using the following formula:
$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$$

Table 2. Mean, Median and Standard Deviation for each statistic

Statistic   Mean (Σx/n)    Median ((n+1)/2-th value)    Std. Dev.
TP          114.1555556    115                          11.75112288
FG          42.82222222    43                           4.763921651
AST         25.82222222    25                           4.307967718
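As a cross-check, the Total Points row of this table can be reproduced with Python's standard `statistics` module. This is a minimal sketch using the TP column of Table 1; the variable names are my own, and `statistics.stdev` is used because it divides by n - 1, matching the formula above.

```python
import statistics

# Total Points (TP) for the 45 sampled games, in the row order of Table 1
tp = [105, 107, 101, 115, 102, 100, 108, 120, 124, 111,
      121, 129, 106, 116, 124,  95, 118,  97, 100, 119,
      124, 122,  91, 125, 132, 115, 129, 108, 136, 112,
       97, 125, 102,  92, 132, 111, 127, 117, 113, 109,
      108, 117, 127, 129, 119]

mean_tp = statistics.mean(tp)      # sum of the values divided by n
median_tp = statistics.median(tp)  # middle (23rd) value of the sorted data
stdev_tp = statistics.stdev(tp)    # sample standard deviation (divides by n - 1)

print(f"mean = {mean_tp:.4f}, median = {median_tp}, std. dev. = {stdev_tp:.4f}")
```

The same three calls applied to the FG and AST columns give the remaining rows.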

To understand the relationships and variance in the statistics, as well as identify overall trends
present in the dataset, I have visualised the different statistics appropriately. First, I have used a
scatter plot diagram, made using Google Sheets, to graphically represent the total points scored
in each game, and how they relate to each other.

Figure 1. Scatter plot of wins and losses with trend lines

Linear regression trend lines have also been plotted on the graph using Google Sheets, showing us
that in general, the winning scores are higher than the losing scores. However, there are outliers to this,
such as Games 46 and 24, which see the Bulls losing even with a relatively high score of 132 and 129
respectively.

For Field Goals and Assists, I have created box plots to compare their respective distributions. The
data sets have been split up further into Assists in Games Won (ASTW), Assists in Games Lost (ASTL),
Field Goals in Games Won (FGW), and Field Goals in Games Lost (FGL). This was done to be able to
compare the metrics in winning and losing games. In this diagram, the boxes span the interquartile
range (the middle 50% of games), while the vertical lines (referred to as “whiskers”) extend towards the lowest and highest values, where any outliers lie.

Figure 2. Box plots of Assists in won and lost games

Figure 3. Box plots of Field Goals in won and lost games

In Assists, the box plots reveal an overall higher concentration in winning games, but high-assist
outliers in lost games can be seen at the upper whisker of the box plot (e.g. Game 71, Game 56). Similarly, in
Field Goals, higher numbers of successful field goals are seen in winning games. As expected,
fewer field goals are seen in losing games, even considering outliers.
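The values behind a box plot can also be listed numerically. The sketch below computes a five-number summary (minimum, lower quartile, median, upper quartile, maximum) for Assists in games won and lost, using the AST and W/L columns of Table 1. The helper name is hypothetical, and the quartile convention (`method="inclusive"`) is an assumption: spreadsheet tools may use a slightly different rule, so the quartiles may not match the figures exactly.

```python
import statistics

# Assists from Table 1, split by result (ASTW = games won, ASTL = games lost)
astw = [24, 32, 32, 23, 27, 25, 23, 27, 31, 31,
        32, 24, 29, 29, 22, 33, 30, 27, 28, 28]
astl = [19, 21, 28, 30, 21, 20, 25, 26, 25, 23, 17, 24, 21,
        27, 24, 21, 30, 25, 27, 19, 28, 23, 36, 20, 25]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); the quartile method is an assumed convention."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


summary_w = five_number_summary(astw)
summary_l = five_number_summary(astl)
print("ASTW:", summary_w)
print("ASTL:", summary_l)
```

The median for games won sits above the median for games lost, while the single highest assist total (36, Game 71) belongs to a loss.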

Pearson's correlation coefficient


Pearson's correlation coefficient (r) is a statistical measure that evaluates the strength and
direction of the linear relationship between two variables. It ranges from −1 to 1, where:

● r = 1: Perfect positive correlation (as the independent variable increases, the
dependent variable increases). Values from 0.9 to 1 are treated as near-perfect.
● r = -1: Perfect negative correlation (as the independent variable increases, the
dependent variable decreases). Values from -0.9 to -1 are treated as near-perfect.
● r = 0: No linear correlation.

It is commonly used to determine how strongly two variables are related. Here, Pearson’s correlation
coefficient (r) was used to measure the strength of the linear relationship between each
performance statistic and the outcome of the 45 sampled games. In the dataset, game
outcomes are treated as numerical values (Win = 1, Loss = 0), allowing the correlation to reflect
the degree to which performance is associated with success. Cases such as these, in which
one of the variables is dichotomous (i.e., has only two values), are known as Point-Biserial
Correlation, a special case of the Pearson Correlation Coefficient. The
equation remains the same, as follows:
$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$

Where:
●​ 𝑥𝑖 represents the independent variable (Total Points Scored, Field Goals, Assists)
●​ 𝑦𝑖 represents the dependent variable (Game Outcome)

A perfect negative or positive correlation is rare in cases such as these, so the range of possible
r values is classified into groups:
● Weak Positive Correlation (0.1 – 0.3)
● Moderate Positive Correlation (0.3 – 0.5)
● Strong Positive Correlation (0.5 – 0.9)
● Weak Negative Correlation (-0.1 – -0.3)
● Moderate Negative Correlation (-0.3 – -0.5)
● Strong Negative Correlation (-0.5 – -0.9)

Using the above equation, I have calculated the r values for each statistic against Game
Outcome, as well as determined what each result implies. I have presented my findings in the
following table:

Table 3. Pearson's correlation coefficient (r) for each statistic against Game Outcome

Statistic   r               Inference
TP          0.5730306314    Strong Positive correlation
FG          0.4894471199    Moderate Positive correlation
AST         0.4257678174    Moderate Positive correlation
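The r values can be reproduced directly from Table 1. The following is a minimal sketch that applies the Pearson formula to each column against the W/L column; the function names are my own, and the handling of values that fall exactly on a band boundary (and the label used below 0.1) is an assumption, since the classification above does not specify it.

```python
import math

# Columns of Table 1 in row order (SN 1-45); W/L is coded 1 = win, 0 = loss
wl = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1,
      0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0,
      0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0]
tp = [105, 107, 101, 115, 102, 100, 108, 120, 124, 111, 121, 129, 106, 116, 124,
      95, 118, 97, 100, 119, 124, 122, 91, 125, 132, 115, 129, 108, 136, 112,
      97, 125, 102, 92, 132, 111, 127, 117, 113, 109, 108, 117, 127, 129, 119]
fg = [42, 43, 39, 42, 36, 37, 42, 45, 48, 38, 47, 47, 36, 37, 48,
      35, 42, 37, 37, 45, 42, 46, 35, 47, 47, 42, 45, 43, 51, 44,
      36, 46, 37, 35, 48, 41, 49, 42, 47, 42, 44, 47, 51, 48, 49]
ast = [19, 21, 28, 30, 24, 21, 20, 32, 32, 23, 27, 25, 26, 25, 25,
       23, 23, 17, 24, 27, 31, 31, 21, 32, 27, 24, 24, 21, 29, 30,
       25, 29, 27, 19, 22, 28, 33, 23, 36, 30, 27, 20, 28, 28, 25]


def pearson_r(x, y):
    """Pearson's r: sum of cross-deviations over the root of the product of squared deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def classify(r):
    """Label r using the bands listed above (boundary handling is an assumption)."""
    size = abs(r)
    if size >= 0.9:
        strength = "Near-perfect"
    elif size >= 0.5:
        strength = "Strong"
    elif size >= 0.3:
        strength = "Moderate"
    elif size >= 0.1:
        strength = "Weak"
    else:
        return "No meaningful correlation"  # label assumed for |r| < 0.1
    direction = "Positive" if r > 0 else "Negative"
    return f"{strength} {direction} correlation"


for name, column in (("TP", tp), ("FG", fg), ("AST", ast)):
    r = pearson_r(column, wl)
    print(f"{name}: r = {r:.4f} ({classify(r)})")
```

Because the outcome variable only takes the values 0 and 1, this is the point-biserial case described earlier; no separate formula is needed.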

Total Points Scored vs. Game Outcome:


The correlation coefficient r calculated for Total Points Scored (TP) against Game
Outcome (W/L) is 0.5730306314. This shows a strong positive correlation between the
independent and dependent variables. From this result, we can infer that higher points
scored are strongly associated with winning games. The distinction between correlation and
causation applies here: a higher score is associated with victory, but does not
guarantee it. For example, Game 46 has a score of 132 points, one of the
highest scores in the entire sample, and the result was still a loss.

Field Goals vs Game Outcome


The r value calculated for Field Goals (FG) against Game Outcome is 0.4894471199.
While this result still suggests a positive correlation, it falls under the classification of a
moderate positive correlation by convention. This unsurprisingly implies that making field
goals does positively impact the chances of the Bulls winning, but not to the same
degree as scoring more overall points.

Assists vs. Game Outcome


The correlation between Assists (AST) and Game Outcome has a value of
0.4257678174. Similar to Field Goals, this suggests a moderate correlation between
higher assists and winning games, but one that is weaker than for Field Goals and weaker still
than for Total Points Scored. From this we can conclude that assists are not as critical a
factor in determining game outcomes.

Chi-Square test of independence


The chi-square test of independence is a statistical method used to determine whether there is
a significant association between two categorical variables. It compares the observed
frequencies in a contingency table to the expected frequencies (assuming the variables are
independent).

● A large chi-square statistic (one that exceeds the critical value) indicates a significant association between the variables.
● A small chi-square statistic (one below the critical value) suggests no significant relationship.

It is commonly used to test if one variable influences or is related to another in a dataset. This
test was also conducted to determine whether each performance metric is significantly
associated with game outcomes, but unlike Pearson’s r, it examines categorical associations
rather than linear relationships.

Before calculating the Chi-Square statistic, we must define the hypotheses for each test:
●​ Null Hypothesis (H0): There is no association between the performance metric (TP, FG or
AST) and game outcomes. The variables are independent.
●​ Alternative Hypothesis (H1): There is an association between the performance metric
and game outcomes. The variables are dependent.

The test is set up by categorizing the data into class intervals. For each statistic, intervals were
made and organised into a contingency table for testing. The intervals were decided by splitting
the data of each observed performance metric into separate categories depending on their
respective frequencies. These categories were then further split into wins and losses, showing
how many games within each interval were won or lost.

Table 4. Chi-Square Contingency table for Points Scored

wins losses Total


91-110 5 19 24
111-130 15 5 20
131-150 0 1 1
Total 20 25 45

Once this contingency table, containing the observed frequency of each statistic, is constructed,
the expected frequencies of each metric have to be calculated using the following formula:
$$E = \frac{(\text{Row Total})(\text{Column Total})}{\text{Grand Total}}$$
For example, the expected frequency for winning games between 91-110 points would be:

$$E = \frac{(24)(20)}{45} = 10.66666667$$

Using this equation, another table was constructed with the expected frequencies of each
categorised interval.

Table 5. Expected Frequency table for Points Scored

wins losses
91-110 10.66666667 13.33333333
111-130 8.888888889 11.11111111
131-150 0.4444444444 0.5555555556
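The expected-frequency formula can be applied to a whole contingency table at once. The sketch below is illustrative only: the function name is my own, and the observed counts are entered exactly as printed in Table 4 (rows are the three point intervals, columns are wins and losses).

```python
def expected_frequencies(observed):
    """Return E = (row total)(column total) / (grand total) for every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed counts as printed in Table 4
# rows: 91-110, 111-130, 131-150; columns: wins, losses
observed_tp = [[5, 19], [15, 5], [0, 1]]
expected_tp = expected_frequencies(observed_tp)

for label, row in zip(("91-110", "111-130", "131-150"), expected_tp):
    print(label, [round(value, 4) for value in row])
```

Each row of the output corresponds to a row of Table 5, and each row still sums to its observed row total.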

Similarly, I have created observed frequency and expected frequency tables for Field Goals and
Assists. I have split the class intervals on the basis of the value being above or below the
median of the respective statistic’s data.

Table 6-9. Observed and Expected Frequency tables for Field Goals and Assists

Field Goals
          wins    losses    Total
35-42     5       17        22
43-51     15      8         23
Total     20      25        45

Expected Frequency (FG)
          wins           losses
35-42     9.777777778    12.22222222
43-51     10.22222222    12.77777778

Assists
          wins    losses    Total
17-25     6       17        23
26-36     14      8         22
Total     20      25        45

Expected Frequency (AST)
          wins           losses
17-25     10.22222222    12.77777778
26-36     9.777777778    12.22222222

The Chi-Square statistic is calculated using the following formula:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

Where:
● $O_i$: Observed Frequency in each cell
● $E_i$: Expected Frequency for each cell

The critical value and degrees of freedom must also be found for the test. The degrees of
freedom are calculated as the product of (number of rows - 1) and (number of
columns - 1). In my 2x2 tables, the degrees of freedom would be 1 [(2-1)(2-1)]. The critical value of a
Chi-Square test with 1 degree of freedom at my chosen significance level of 0.05 is 3.841. With
this knowledge in mind, we can calculate the Chi-Square value for each individual performance
statistic.

However, in the table for points scored, the upper range of 131-150 has only one game
recorded under it. This game was a loss, and the expected frequencies as seen
above have values 0.44 and 0.56 for wins and losses respectively. Chi-square tests are
unreliable if more than 20% of the expected values are less than 5, which is the case here. However, this
can be rectified by combining the ranges so none of the expected frequencies fall under 5.

A new table has been created with the new data groupings:

Table 10. New Observed Frequency table for Points Scored

Points Scored
           wins    losses    Total
91-110     3       14        17
111-150    17      11        28
Total      20      25        45

The same formula from above is used to calculate the new expected frequencies:

Table 11. New Expected Frequency table for Points Scored

Expected Frequency (TP)
           wins           losses
91-110     7.555555556    9.444444444
111-150    12.44444444    15.55555556

Finally, a Chi-Square test can be conducted on each of the statistics.

Table 12. Results of Chi-Square test

Χ2 p-value
TP 7.9459 0.00482
FG 8.2218 0.004139
AST 6.4209 0.011278
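These values can be reproduced from the 2x2 observed tables given earlier. The sketch below is one possible way to do it in plain Python; the function names are my own. For the p-value it relies on the fact that, for 1 degree of freedom only, the upper-tail chi-square probability equals erfc(sqrt(Χ2 / 2)); a statistics package would normally be used for other degrees of freedom.

```python
import math


def chi_square(observed):
    """Sum of (O - E)^2 / E over all cells, with E = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e
    return statistic


def p_value_df1(statistic):
    """Upper-tail probability for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Observed 2x2 tables: rows = lower / upper interval, columns = wins, losses
tables = {
    "TP": [[3, 14], [17, 11]],
    "FG": [[5, 17], [15, 8]],
    "AST": [[6, 17], [14, 8]],
}

results = {}
for name, table in tables.items():
    stat = chi_square(table)
    results[name] = (stat, p_value_df1(stat))
    print(f"{name}: chi-square = {stat:.4f}, p = {results[name][1]:.6f}")
```

All three p-values fall below the chosen significance level of 0.05.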

However, these results may overstate the evidence. With smaller sets of data, the standard
chi-square test tends to overestimate statistical significance (Campbell). In the case of 2x2 tables
specifically, “N-1” Chi-square tests are recommended to calculate statistical significance. In this test, Χ2 is
multiplied by a factor of (N-1)/N, N being the total sample size. Formerly, the Yates’ correction
for continuity,

$$\chi^2 = \sum \frac{(|f_o - f_e| - 0.5)^2}{f_e}$$

was used to correct this issue, but this method tends to overcorrect, and
can give overly conservative results that fail to reject the null hypothesis when it should be rejected
(Watson). Hence, I have used the N-1 method to correct these results for a more accurate
outcome.

Table 13. Results of N-1 Chi-Square test

Statistic   Χ2        Χ2 × (N-1)/N
TP          7.9459    7.769324444
FG          8.2218    8.039093333
AST         6.4209    6.278213333
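The adjustment and the decision rule can be sketched as follows. The Χ2 inputs are the values from Table 12; the names are my own, and the critical value used is the standard tabulated figure for 1 degree of freedom at the 5% significance level (approximately 3.841).

```python
N = 45                  # total number of sampled games
CRITICAL_VALUE = 3.841  # chi-square critical value, 1 degree of freedom, 5% level

# Unadjusted chi-square statistics from Table 12
chi2 = {"TP": 7.9459, "FG": 8.2218, "AST": 6.4209}


def n_minus_1_adjust(statistic, n):
    """Apply the "N-1" adjustment: multiply the statistic by (n - 1) / n."""
    return statistic * (n - 1) / n


adjusted = {name: n_minus_1_adjust(value, N) for name, value in chi2.items()}
reject_h0 = {name: value > CRITICAL_VALUE for name, value in adjusted.items()}

for name in chi2:
    decision = "reject H0" if reject_h0[name] else "fail to reject H0"
    print(f"{name}: adjusted chi-square = {adjusted[name]:.4f} -> {decision}")
```

With N = 45 the factor 44/45 lowers each statistic only slightly, so all three remain above the critical value.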

Conducting an N-1 Chi-Square test on Total Points Scored gives a Χ2 value of 7.7693. Since the
critical value of our test is 3.841, and the Χ2 value exceeds that, we can reject the null
hypothesis and accept the alternative hypothesis. Accepting the alternative hypothesis gives us
the result that Total Points Scored does have a significant association with the outcome of a
game.

Field Goals gives a similar answer, with an N-1 Χ2 value of 8.0391, which implies the same result as
TP. Rejecting H0 and accepting H1, we come to the conclusion that there is also an association
between Field Goals made and the outcome of a game. The same goes for Assists as well.
Calculating for Assists gives us an N-1 Χ2 value of 6.2782. Once again we reject H0, accept H1, and
conclude that there is an association between Assists and game outcome.

Conclusion & Evaluation


Comparing the results for all 3 statistics, we can see that Pearson's r is highest for Total Points
Scored (0.573), followed by Field Goals (0.489) and Assists (0.426). The N-1 Chi-Square values
tell a similar story, although there Field Goals (8.04) is marginally ahead of Total Points (7.77).
Assists has the lowest value under both measures (6.28 for the N-1 Chi-Square test).
Therefore, while it does depend on a multitude of variables, the Chicago Bulls’ success is most
strongly associated with their Total Points Scored. The results for Field Goals being similar to Points is not
surprising, since successful Field Goals give most of the points that a team scores in a game.
Since both Field Goals and Assists lead to more points for a team, and points have the highest
correlation with game outcome, their associations are significant as well.

Works Cited

Basketball Reference. “2023-24 Chicago Bulls Roster and Stats.” Basketball-Reference.com,
https://www.basketball-reference.com/teams/CHI/2024.html. Accessed 27 February 2025.

Campbell, Ian. “Statistical tests for two-by-two tables.” iancampbell.co.uk, 2006,
https://www.iancampbell.co.uk/twobytwo/twobytwo.htm.

Watson, Peter. “When do I use the correction for continuity when performing a chi-square
analysis on a 2x2 table? - FAQ/Yates.” MRC CBU Wiki, 2013,
https://imaging.mrc-cbu.cam.ac.uk/statswiki/FAQ/yates#:~:text=The%20effect%20of%20Yates'%20correction,correction%20may%20tend%20to%20overcorrect.
