Basketball Prediction Using Multiple Regression As A Data Model in Predicting The Outcome of Game
John Ian Paulo I. Lumbao1, Kristen Kaye V. Sindac2, Fatima Corrine Angel B. Paed3,
Juan Miguel M. Gonzaga4, Danna Jade Y. Canoza5, Eduardo P. Pablo Jr.6, Herminiño C.
Lagunzad7, Jake M. Libed8, Mary Rose V. Hilarion9 and Arnold S. De Guzman10
1,2,3,4,5,6 Student, College of Information Technology, The National Teachers College, Quiapo, Manila
1 [email protected], 2 [email protected], 3 [email protected], 4 [email protected], 5 [email protected], 6 [email protected]
7,8,9,10 IT Instructor, College of Information Technology, The National Teachers College, Quiapo, Manila
7 [email protected], 8 [email protected], 9 [email protected], 10 [email protected]
Abstract
In basketball, analyzing team performance is one of the more difficult tasks for coaches. Identifying the
factors that lead to winning a game is a crucial aspect for every team. In this study, the researchers use
multiple regression to identify the relevant variables for predicting the outcome of a basketball game.
Keywords
data mining, multiple regression, classification, basketball prediction
1. Introduction
Technology and application software are advancing rapidly, and sports data mining is one such technology.
In sports, it is used to analyze the statistics and capabilities of players and to predict the outcomes of
games. Data mining (DM) is the process of finding useful associations, patterns and trends in large amounts
of data. It is also the process of extracting implicit, previously unknown but potentially useful
information and knowledge from large volumes of incomplete, noisy, fuzzy and random data [1]. DM systems aim
to assist coaches, players and other sports enthusiasts not only in prediction, but also in evaluating
player performance. Predicting game results has become widely popular among sports fans, especially
basketball fans, throughout the world [2].
Classification is a data mining technique that assigns items in a collection to target categories or classes.
The aim of classification is to predict the target class for each case in the data accurately. Classification is
important when a data repository contains samples that can be used as the basis for future decision making.
There are several classification techniques that can be used for decision support systems [3].
Regression is a statistical technique to determine the linear relationship between two or more variables.
Regression is primarily used for prediction and causal inference. In its simplest (bivariate) form, regression
shows the relationship between one independent variable (X) and a dependent variable (Y), as in the
formula [4]: Y = β0 + β1X + u, where β0 is the intercept, β1 is the slope coefficient and u is the error term.
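As a minimal sketch of the bivariate form above, the following estimates β0 and β1 by ordinary least squares. The function name and the sample values are hypothetical and chosen only so the arithmetic is easy to follow; they are not taken from the study.

```python
def fit_simple_regression(xs, ys):
    """Estimate beta0 and beta1 in Y = beta0 + beta1*X + u by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Hypothetical data lying exactly on the line y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
beta0, beta1 = fit_simple_regression(xs, ys)
print(beta0, beta1)
```

Because the sample points lie exactly on a line, the error term u is zero here; with real game data the fitted line would only approximate the observations.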
2. Review of Related Literature and Studies
Multiple Regression is a statistical tool that allows you to examine how multiple independent variables are
related to a dependent variable. Once you have identified how these multiple variables relate to your
dependent variable, you can take information about all the independent variables and use it to make much
more powerful and accurate predictions about why things are the way they are. This latter process is called
“Multiple Regression” [5].
\[ Y = a + b_1 X_1 + b_2 X_2 \tag{1} \]
where
Y = the predicted value of Y (the dependent variable)
a = the Y intercept
b1 = the change in Y for each one-unit change in X1
b2 = the change in Y for each one-unit change in X2
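The interpretation of b1 and b2 in equation 1 can be illustrated with a short sketch. The function name `predict_y` and the coefficient values below are hypothetical, chosen only to make the effect of a one-unit change visible.

```python
def predict_y(a, b1, b2, x1, x2):
    """Evaluate equation 1: Y = a + b1*X1 + b2*X2."""
    return a + b1 * x1 + b2 * x2


# Hypothetical coefficients, not estimated from any data
a, b1, b2 = 1.0, 2.0, 3.0

base = predict_y(a, b1, b2, x1=4.0, x2=5.0)
x1_plus_one = predict_y(a, b1, b2, x1=5.0, x2=5.0)
x2_plus_one = predict_y(a, b1, b2, x1=4.0, x2=6.0)

# Raising X1 by one unit (X2 held fixed) changes Y by b1, and likewise for X2
print(x1_plus_one - base, x2_plus_one - base)
```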
Lopez, M. and Matthews, G. (2014) used data from the 2014 NCAA men's basketball tournament and more than
400 predictions of game outcomes submitted to a contest hosted by the website Kaggle. The researchers
built a prediction model for men's basketball tournament outcomes under the binomial log-likelihood loss
function. Under different sets of true underlying game probabilities, they simulated tournament outcomes
and imputed pool standings to determine how much of an entry's success can be attributed to luck. While
one of the researchers' two submissions finished first in the Kaggle contest, they estimate that this
winning entry had no more than about a 12% chance of doing so, even under the most optimistic of game
probability scenarios [6].
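For readers unfamiliar with the binomial log-likelihood loss mentioned above, the sketch below shows its usual form, the average negative log-likelihood of the observed outcomes. This is a generic illustration, not the code of Lopez and Matthews; the function name and the sample inputs are hypothetical.

```python
import math


def binomial_log_loss(outcomes, probabilities):
    """Average negative log-likelihood of binary outcomes (1 = win, 0 = loss).

    Each probability must lie strictly between 0 and 1.
    """
    total = 0.0
    for y, p in zip(outcomes, probabilities):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(outcomes)


# Hypothetical forecasts: an uninformed 50/50 guess versus a confident, correct one
coin_flip = binomial_log_loss([1, 0], [0.5, 0.5])
confident = binomial_log_loss([1, 0], [0.9, 0.1])
print(coin_flip, confident)
```

Lower values are better: the confident and correct forecast scores below the 50/50 guess, whose loss equals ln 2.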
Arnold, T. and Godbey, J. (2012) conducted a study to determine the key factors that affect a basketball
game. They gathered data on the 12 players of the Georgia State University basketball team and used
multiple regression to analyze the players' data. The variables used in the analysis were points per game,
minutes per game and rebounds per game. Based on the analyzed data, they reported good results in
predicting the outcome of a basketball game [7].
Yuan, L.H., et al. (2015) aimed to forecast the outcome of the NCAA men's basketball tournament, which
spans 63 games over 3 weeks. Statistical prediction of game outcomes involves a multitude of possible
covariates and information sources, large performance variations from game to game, and a scarcity of
detailed historical data. In this paper, the researchers used logistic regression and presented the
results of a team of modelers working together to forecast the 2014 NCAA men's basketball tournament. The
researchers present not only the methods and data used, but also several novel ideas for post-processing
statistical forecasts and decontaminating data sources [8].
3. Research Work
3.1 The Basketball Game System Architecture
The system architecture used in this study is shown in Figure 1. It consists of three major phases:
(1) Data Preprocessing; (2) Apply Multiple Regression; and (3) Prediction.
Figure 1. Basketball Game Prediction System Architecture
Variable        Description
Points          Total number of points
Free Throw %    Total free throw percentage
Rebound         Total number of rebounds
3.2.A.a The Dependent (y) and Independent Variables (x1) and (x2)
Table 4 displays the selected data: the dependent variable, Points, and the independent variables,
Free Throw % and Rebound.
Table 4. Dependent and Independent Variables
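The selection of the dependent and independent variables can be sketched as below. The record layout, the field names (`points`, `free_throw_pct`, `rebounds`) and the values are hypothetical, and dropping rows with a missing value is an assumption made for this sketch; the study does not state how incomplete records were handled.

```python
# Hypothetical raw records; field names and values are illustrative only
raw_games = [
    {"team": "A", "points": 78, "free_throw_pct": 71.4, "rebounds": 41, "assists": 15},
    {"team": "B", "points": 65, "free_throw_pct": 62.5, "rebounds": 38, "assists": 11},
    {"team": "C", "points": 83, "free_throw_pct": None, "rebounds": 45, "assists": 19},
]


def select_variables(games):
    """Keep only y (points), x1 (free throw %) and x2 (rebound) from each record."""
    y, x1, x2 = [], [], []
    for game in games:
        values = (game.get("points"), game.get("free_throw_pct"), game.get("rebounds"))
        if any(v is None for v in values):
            continue  # assumption: skip records with a missing value
        y.append(float(values[0]))
        x1.append(float(values[1]))
        x2.append(float(values[2]))
    return y, x1, x2


y, x1, x2 = select_variables(raw_games)
print(y, x1, x2)
```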
\[ Y = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_i x_i \tag{2} \]
where
Y = dependent variable
x = independent variable
a = intercept
b = coefficient
For two independent variables:
\[ \hat{y} = a + b_1 x_1 + b_2 x_2 \]
The coefficients obtained from the computation are shown in equation 3 [10].
\[ \hat{y} = a + b_1 x_1 + b_2 x_2 \]
\[ b_1 = 0.3269129, \quad b_2 = -0.42211, \quad a = 32.0068 \tag{3} \]
Table 5 shows the computed values of x1y, x2y, x1x2, x1², x2² and y². These are obtained as
(x1*y), (x2*y), (x1*x2), (x1)², (x2)² and (y)².
Table 5. Computed values of x1y, x2y, x1x2, x1², x2² and y²
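The product columns of Table 5 can be computed row by row, as in the following sketch. The input values are small stand-ins chosen for easy arithmetic and are not the study's data; the function and key names are hypothetical.

```python
# Stand-in values only; they are not the study's data
y = [6.0, 5.0, 13.0, 12.0]   # points
x1 = [1.0, 2.0, 3.0, 4.0]    # stands in for free throw %
x2 = [1.0, 0.0, 2.0, 1.0]    # stands in for rebound


def product_columns(y, x1, x2):
    """Return one dict per row holding x1*y, x2*y, x1*x2, x1^2, x2^2 and y^2."""
    rows = []
    for yi, x1i, x2i in zip(y, x1, x2):
        rows.append({
            "x1y": x1i * yi,
            "x2y": x2i * yi,
            "x1x2": x1i * x2i,
            "x1_sq": x1i ** 2,
            "x2_sq": x2i ** 2,
            "y_sq": yi ** 2,
        })
    return rows


table5 = product_columns(y, x1, x2)
for row in table5:
    print(row)
```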
The sums and averages are shown in Table 6. The sum is taken over each column: points, free throw %,
rebound, x1y, x2y, x1x2, x1², x2² and y². The average of each column is also computed for use in the
next step.
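A minimal sketch of the sum and average step follows. The column values are stand-ins chosen for easy arithmetic, not the study's data, and the helper name is hypothetical.

```python
# Stand-in values only; they are not the study's data
columns = {
    "y": [6.0, 5.0, 13.0, 12.0],   # points
    "x1": [1.0, 2.0, 3.0, 4.0],    # stands in for free throw %
    "x2": [1.0, 0.0, 2.0, 1.0],    # stands in for rebound
}

# Product columns, computed row by row
columns["x1y"] = [a * b for a, b in zip(columns["x1"], columns["y"])]
columns["x2y"] = [a * b for a, b in zip(columns["x2"], columns["y"])]
columns["x1x2"] = [a * b for a, b in zip(columns["x1"], columns["x2"])]
columns["x1_sq"] = [v ** 2 for v in columns["x1"]]
columns["x2_sq"] = [v ** 2 for v in columns["x2"]]
columns["y_sq"] = [v ** 2 for v in columns["y"]]


def column_sums_and_means(columns):
    """Return the sum and the average of every column."""
    sums = {name: sum(values) for name, values in columns.items()}
    means = {name: sums[name] / len(values) for name, values in columns.items()}
    return sums, means


sums, means = column_sums_and_means(columns)
print(sums)
print(means)
```

Only the averages of y, x1 and x2 are needed later, for the intercept; the sums feed the next computation.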
The formulas in equation 4 are then used to compute the corrected sums of squares and cross-products for
the dependent variable (y) and the independent variables (x1) and (x2) [10].
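\[
\begin{aligned}
\boldsymbol{\sum y^{2}} &= \sum y^{2} - \frac{\left(\sum y\right)^{2}}{N} \\
\boldsymbol{\sum x_{1} y} &= \sum x_{1} y - \frac{\left(\sum x_{1}\right)\left(\sum y\right)}{N} \\
\boldsymbol{\sum x_{1}^{2}} &= \sum x_{1}^{2} - \frac{\left(\sum x_{1}\right)^{2}}{N} \\
\boldsymbol{\sum x_{2} y} &= \sum x_{2} y - \frac{\left(\sum x_{2}\right)\left(\sum y\right)}{N} \\
\boldsymbol{\sum x_{2}^{2}} &= \sum x_{2}^{2} - \frac{\left(\sum x_{2}\right)^{2}}{N} \\
\boldsymbol{\sum x_{1} x_{2}} &= \sum x_{1} x_{2} - \frac{\left(\sum x_{1}\right)\left(\sum x_{2}\right)}{N}
\end{aligned}
\tag{4}
\]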
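The formulas of equation 4 translate directly into code. The sketch below is one possible implementation; the function name, the dictionary keys and the input values are hypothetical stand-ins, not the study's data.

```python
def corrected_sums(y, x1, x2):
    """Apply equation 4: each raw sum minus the product of the column totals over N."""
    n = len(y)
    sum_y, sum_x1, sum_x2 = sum(y), sum(x1), sum(x2)
    return {
        "yy": sum(v * v for v in y) - sum_y ** 2 / n,
        "x1y": sum(a * b for a, b in zip(x1, y)) - sum_x1 * sum_y / n,
        "x1x1": sum(v * v for v in x1) - sum_x1 ** 2 / n,
        "x2y": sum(a * b for a, b in zip(x2, y)) - sum_x2 * sum_y / n,
        "x2x2": sum(v * v for v in x2) - sum_x2 ** 2 / n,
        "x1x2": sum(a * b for a, b in zip(x1, x2)) - sum_x1 * sum_x2 / n,
    }


# Stand-in values only; they are not the study's data
y = [6.0, 5.0, 13.0, 12.0]
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 0.0, 2.0, 1.0]

corrected = corrected_sums(y, x1, x2)
print(corrected)
```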
Equation 5 shows the results of these computations for the dependent variable (y) and the independent
variables (x1) and (x2).
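The last formulas give the values of the intercept a and the coefficients b1 and b2, as shown in
equation 6 [10]. The bold sums are the corrected sums from equation 4, and the formula for b2 mirrors
that for b1 with the roles of x1 and x2 exchanged.
\[
\begin{aligned}
b_{1} &= \frac{\left(\boldsymbol{\sum x_{2}^{2}}\right)\left(\boldsymbol{\sum x_{1} y}\right) - \left(\boldsymbol{\sum x_{1} x_{2}}\right)\left(\boldsymbol{\sum x_{2} y}\right)}{\left(\boldsymbol{\sum x_{1}^{2}}\right)\left(\boldsymbol{\sum x_{2}^{2}}\right) - \left(\boldsymbol{\sum x_{1} x_{2}}\right)^{2}} \\
b_{2} &= \frac{\left(\boldsymbol{\sum x_{1}^{2}}\right)\left(\boldsymbol{\sum x_{2} y}\right) - \left(\boldsymbol{\sum x_{1} x_{2}}\right)\left(\boldsymbol{\sum x_{1} y}\right)}{\left(\boldsymbol{\sum x_{1}^{2}}\right)\left(\boldsymbol{\sum x_{2}^{2}}\right) - \left(\boldsymbol{\sum x_{1} x_{2}}\right)^{2}} \\
a &= \bar{y} - b_{1}\bar{x}_{1} - b_{2}\bar{x}_{2}
\end{aligned}
\tag{6}
\]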
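The coefficient formulas can be checked with a short sketch. Here b2 is computed with the standard two-predictor expression that mirrors b1 with x1 and x2 exchanged. The function name, dictionary keys and input numbers are hypothetical; the numbers correspond to stand-in data generated from y = 1 + 2*x1 + 3*x2, not to the study's data.

```python
def fit_two_predictors(s, mean_y, mean_x1, mean_x2):
    """Compute a, b1 and b2 from corrected sums s (keys x1x1, x2x2, x1x2, x1y, x2y)."""
    denominator = s["x1x1"] * s["x2x2"] - s["x1x2"] ** 2
    if denominator == 0:
        raise ValueError("x1 and x2 are perfectly collinear; b1 and b2 are undefined")
    b1 = (s["x2x2"] * s["x1y"] - s["x1x2"] * s["x2y"]) / denominator
    b2 = (s["x1x1"] * s["x2y"] - s["x1x2"] * s["x1y"]) / denominator
    a = mean_y - b1 * mean_x1 - b2 * mean_x2
    return a, b1, b2


# Corrected sums and averages of stand-in data that follow y = 1 + 2*x1 + 3*x2 exactly
corrected = {"x1x1": 5.0, "x2x2": 2.0, "x1x2": 1.0, "x1y": 13.0, "x2y": 8.0}
a, b1, b2 = fit_two_predictors(corrected, mean_y=9.0, mean_x1=2.5, mean_x2=1.0)
print(a, b1, b2)
```

The zero-denominator check is an addition for this sketch: when the two independent variables move in exact proportion, the formulas cannot separate their effects.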
Figure 7 shows the resulting values of the intercept a and the coefficients b1 (Free Throw %) and b2 (Rebound).
3.4 Prediction
This is the final process, in which points are predicted from free throw % and rebound. The formula and
computation for predicting points are shown in equation 10.
Equation 11 shows the predicted points after the computation.
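As a minimal sketch of the prediction step, the coefficients reported in equation 3 can be substituted into ŷ = a + b1 x1 + b2 x2. The input values below (a free throw percentage of 70 and 40 rebounds) are hypothetical, and treating free throw % as a number on a 0 to 100 scale is an assumption; neither comes from the study's tables.

```python
# Coefficients as reported in equation 3
A = 32.0068        # intercept
B1 = 0.3269129     # coefficient for free throw %
B2 = -0.42211      # coefficient for rebound


def predict_points(free_throw_pct, rebounds):
    """Predicted points: y_hat = a + b1*x1 + b2*x2."""
    return A + B1 * free_throw_pct + B2 * rebounds


# Hypothetical inputs; free throw % is assumed to be on a 0-100 scale
predicted = predict_points(free_throw_pct=70.0, rebounds=40.0)
print(round(predicted, 2))
```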
5. References
[1] Tian, L. and Wang, F. 2013. Design and Application of Basketball Tactics Analysis Based on the Database
and Data Mining Technology. Proceedings of the 2nd International Conference on Computer Science and
Electronics Engineering (ICCSEE 2013)
[2] Haghighat, M., Rastegari, H. and Nourafza, N. 2013. A Review of Data Mining Techniques for Result
Prediction in Sports. ACSIJ Advances in Computer Science: an International Journal. Volume 2, Issue 5,
Number 6, ISSN:2322-5157
[3] Vaghela, C., Bhatt, N. and Mistry, D. 2015. A Survey on Various Classification Techniques for Clinical
Decision Support System. International Journal of Computer Applications (0975 – 8887). Volume 116,
Number 23
[4] Campbell, D. and Campbell S. Statlab Workshop Introduction to Regression and Data Analysis.
http://statlab.stat.yale.edu/workshops/IntroRegression/StatLab-IntroRegressionFa08.pdf [retrieved:
February, 2017]
[5] Introduction to Multiple Regression. http://www.biddle.com/documents/bcg_comp_chapter4.pdf [retrieved:
February, 2017]
[6] Lopez, M. and Matthews, G. 2014. Building an NCAA men's basketball predictive model and quantifying its
success. https://arxiv.org/pdf/1412.0248.pdf [retrieved: February, 2017]
[7] Arnold, T. and Godbey, J. 2012. Introducing Linear Regression: An Example Using Basketball Statistics.
Journal of Economics And Finance Education Volume 11, Number 2
[8] Yuan, LH., Liu, A., Yeh, A., Kaufman, A., Reece, A., Bull, P., Franks, A., Wang, S., Illushin, D. and Bornn,
L. 2015. A mixture-of-modelers approach to forecasting NCAA tournament outcomes. Journal of
Quantitative Analysis in Sports, Volume 11, Issue 1, 2015, pp 13–27
[9] “National Collegiate Athletics Association Basketball Statistics” http://ncaaph.org/basketball/statistics/
[retrieved: February, 2017]
[10] “Elements of Multiple Regression Analysis: Two Independent Variables”
http://jonathantemplin.com/files/regression/ersh8320f07/ersh8320f07_06.pdf [retrieved: February, 2017]
Authors
John Ian Paulo I. Lumbao
4th Year Bachelor of Science in Information Technology
Former President, Computer Society Club
The National Teachers College
Arnold S. De Guzman
Master in Information Technology - ongoing
IT Instructor
The National Teachers College