Classification in Horse Race Prediction Through Principal Component Decomposition
Jason West
Vlad Kazakov
Abstract
The established view for horse race handicapping and staking strategies is to model them as a
classification problem using factors describing horse, jockey, trainer, and racing history coupled with
public odds, solved via a logistic regression. Logistic regression probabilities are then normalised, and bets are filtered by a threshold or by anomalous pricing. However, published algorithms do not show systematic profitability, nor do machine learning approaches using algorithmic betting strategies. This deficiency is due to three factors. First, wins are rare and racing data are thus imbalanced. Second, racing factors are multicollinear. Third, the number of factors needed for accurate prediction is very large. We show that alternative methods using variants of principal component analysis produce sustainable profitability regardless of staking strategy through a reduction of factors to fundamental
drivers. We apply a partial least squares regression methodology to Australian thoroughbred racing.
This approach is shown to outperform logistic regression and machine learning methods in classifying
winners for a profitable trading strategy. This method can be applied to multiple betting domains.
Keywords: Partial least squares; logistic regression; horse racing; imbalanced data.
1.0 Introduction
The established analytical approach for thoroughbred race handicapping and staking strategies is for
odds estimation in betting to be modelled as a classification problem using factors describing the
horse, jockey, and/or trainer, as well as training and racing history, coupled with public odds, and
solved using a logistic regression. Logistic regression probabilities are then normalized, and optimal
bets can be chosen using a threshold filter or by searching for under-priced runners and applying a
staking process aligned with the Kelly criterion (Kelly, 1956). Benter (1994) famously demonstrated
the capacity to earn positive returns using a computer-generated betting strategy based on a variant of
this approach. Well-known gambling identities Alan Woods and Patrick Veitch have described
detailed implementation methods and factors used to rank thoroughbreds for odds estimation.
However, the durability of these methods in practice has been limited.
To the best of our knowledge, most published algorithms fail to show systematic profitability.
Similarly, none of the published solutions from the machine learning domain result in a viable algorithmic betting strategy. One can develop and apply any of them, adapt them to local data, and generate bets, yet the expectation is that one will lose money over the medium term. Sustained
profitability based on persistent pricing anomalies from racing metrics is difficult to achieve in
practice.
There are several reasons explaining the dislocation between a theoretically proven betting/staking
strategy and profitability. First, wins are rare and racing data are imbalanced. Classifiers, including
logistic regression methods, will underestimate the probability of rare events unless this imbalance is
corrected. Popular approaches to rebalancing data are divided into pre-processing, post-processing,
and hybrid methods. Pre-processing is a simple but effective technique that avoids issues of signal
detection using arbitrary thresholds from model estimates. Any one of several pre-processing
corrections can be applied, including up-sampling, down-sampling, and class re-weighting. In this
analysis we adopt the pre-processing technique of up-sampling (scaling) to address the imbalance,
which retains the benefits of a large sample size to maximise predictive power.
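As a minimal sketch of this pre-processing step (assuming a pandas data frame with an illustrative binary win label column), the winner class can be bootstrap-resampled until it forms roughly 40 per cent of the training data:

```python
import pandas as pd
from sklearn.utils import resample

def upsample_winners(train: pd.DataFrame, target_share: float = 0.4,
                     label_col: str = "win") -> pd.DataFrame:
    """Bootstrap-resample winning runners until they make up roughly
    `target_share` of the training rows."""
    winners = train[train[label_col] == 1]
    losers = train[train[label_col] == 0]
    # winners / (winners + losers) ~= target_share  =>  required number of winner rows
    n_winners = int(target_share / (1.0 - target_share) * len(losers))
    winners_up = resample(winners, replace=True, n_samples=n_winners, random_state=42)
    return pd.concat([losers, winners_up]).sample(frac=1.0, random_state=42)
```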
Second, many of the factors used to predict racing ability are multicollinear. To defend against model
misspecification, regression methodologies demand the use of independent factors or at least the use
of an alternative approach whereby factors are transformed into principal components. Finally, racing
guides contain, and experts use, many more factors than the 20 or so factors often cited in published
algorithms. Racing data is now so prevalent that potentially thousands of factors are available for
analysis, many of which are at least weakly informative and thus can provide incremental information
for prediction.
In this analysis we show how classification accuracy improves when adjusting for imbalanced data
using a pre-processing approach. We then define principal components, using a variant of Partial
Least Squares (PLS) estimation, from a large set of factors and align them with expert knowledge and
publicly available odds. We show that integrating these derived factors with a logistic regression approach greatly reduces prediction error and improves profitability regardless of staking strategy.
This approach can accommodate any number of correlated factors from data with limited sample size
and missing values, with only a minor impact on prediction accuracy.
2.0 Methodology
The application of conditional logistic regression to horse racing was popularised by Bolton and Chapman (1986) and extended by a range of authors (Benter, 1994; Edelman, 2007; Silverman and
Suchard, 2013). The evolution of the research in this field has been aimed at improving the
calculation of a horse's "strength" estimated using a conditional logistic regression on identifiable
factors and combining this with a payoff dividend related to market odds. The likelihood of the
conditional logit can be expressed as
$$\prod_{r=1}^{R} \frac{e^{\alpha_{rh}^{w}\beta_{1} + p_{rh}^{w}\beta_{2}}}{\sum_{h \in r} e^{\alpha_{rh}^{w}\beta_{1} + p_{rh}^{w}\beta_{2}}}, \qquad (1)$$
which combines previous model predicted "strength" 𝛼 = {1,2, … , 𝐴} and historical limit 𝐴 with the
"odds implied probability" 𝑝 = {1,2, … , 𝑃} for a runner. The payoff dividend for a horse is converted
to an implied probability $p(x_h) = 1/d_h$, or $\tilde{p}(x_h) = 1/(d_h + \delta)$ where $\delta$ is the track take from a parimutuel
pool. The value $p$ is an aggregate estimate of the betting public's opinion and thus reflects bettor-implied confidence rather than a genuine estimate of outcome probabilities.
The expression above is formulated as follows. A race 𝑟 = {1,2, … , 𝑅} is run between multiple horses ℎ = {1,2, … , 𝐻} with a single runner winning. The racing characteristics of each horse are represented by a 𝑘 dimensional vector 𝑋𝑟ℎ with the covariates 𝑋 represented by an 𝑁 × 𝐾
dimensional matrix where each row represents the covariates of each horse / jockey / trainer
characteristics, 𝛽 as a column of 𝑘 regression coefficients, and 𝑝𝑖 as the probability of winning an
event (raw response variable). The probability of each runner ℎ winning in race 𝑟 is
$$p_{rh} = \frac{e^{X_{rh}\beta}}{\sum_{h \in r} e^{X_{rh}\beta}}, \qquad (2)$$
where the winning probabilities of race 𝑟 sum to 1, ∑ 𝑝𝑟ℎ = 1. Estimates for the regression
coefficients 𝛽 in 𝑋𝑟ℎ 𝛽 are typically derived using logistic regression
$$\frac{p_r}{1 - p_r} = e^{X_{rh}\beta}. \qquad (3)$$
The formulation in (1) combines the set of regression coefficient estimates (the "strength" estimate) $\alpha_{rh}^{w}\beta_1$ with the odds implied probability $p_{rh}^{w}\beta_2$ to derive a fundamental probability estimate for each
runner. Strength estimates can be obtained using various techniques including conditional logistic
regression (Benter, 1994), support vector regression on a runner's finishing position (Edelman, 2007),
support vector classifier coupled with distance from the hyperplane (Lessmann and Sung, 2007),
CART (Lessmann and Sung, 2010), LASSO regression or variants of the Cox Proportional Hazards
model (Silverman and Suchard, 2013). Despite different methods for calculating the “strength” of a
horse, existing models rely on the standard conditional logistic regression as a key component for
their prediction algorithm.
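As an illustration of the per-race normalisation in equations (1) and (2), the sketch below applies a softmax within each race; the column names strength, implied_prob, and race_id and the fitted weights b1 and b2 are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def race_win_probabilities(df: pd.DataFrame, b1: float, b2: float) -> pd.Series:
    """Softmax of the combined 'strength' and odds-implied-probability score
    within each race, so that probabilities sum to one per race."""
    linear = b1 * df["strength"] + b2 * df["implied_prob"]
    expd = np.exp(linear)
    # Normalise within each race group
    return expd / expd.groupby(df["race_id"]).transform("sum")
```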
There are well-known shortfalls to this approach, and in many cases previous scholars have admitted
to these issues as limitations to the generalisability of their results. When a large set of 𝑗 explanatory
variables 𝑥𝑗 , 𝑗 ∈ {1,2, … , 𝑛} are required to forecast a single response variable 𝑦 using a sample of size
𝑛 where all regressors contain information, the use of ordinary least squares (OLS) or logistic
regression (LR) can be problematic when 𝑗 ≫ 𝑛 or 𝑗 is at least large relative to 𝑛. An OLS or logit
model of this type may also be mis-specified when the variables 𝑥𝑗 are collinear, and where the
variance of the model is of order 𝑗 ≈ 𝑛 the prediction estimates may be biased. Importantly, even
when 𝑗 > 3, the OLS or logit estimator can be inadmissible using a mean square error (MSE) criterion
(James and Stein, 1961) which can undermine the typical diagnostics applied to such models to test
for validity.
These issues have motivated the use of shrinkage estimation methods. Some shrinkage estimators
such as RIDGE and LASSO are relatively specific criteria-based choices in the space of OLS-
consistent solutions. Recent algorithms have applied random forests and gradient boosting machines
(GBM). Whereas random forests build an ensemble of deep independent trees, GBMs build an
ensemble of shallow and weak successive trees with each tree learning and improving on the
previous. These can be combined such that the many weak successive trees produce a powerful
ensemble algorithm. However, shrinkage estimation and machine learning derivatives still suffer from
the same issues of collinearity and small sample size relative to the number of factors that undermine the more orthodox methods.
We offer an alternative to existing approaches that entirely circumvents the limitations related to standard regressions. Partial Least Squares (PLS) techniques, along with other dimension
reduction methods, have been used in chemometrics and related applications where 𝑗 ≫ 𝑛. In an early
PLS method Wold (1966) uses latent quantitative factor variables (akin to principal components) from
the data to regress an outcome variable against many components. The contribution of each variable is
evaluated using standardized model coefficients, with the outputs indicating the direction and
magnitude of the effect. A positive correlation between the independent variable and the outcome
variable is inferred for positive coefficients.
For a typical multiple linear regression approach the least-squares solution for
$$\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{B} + \boldsymbol{\varepsilon}, \qquad (4)$$
is
$$\boldsymbol{B} = (\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T}\boldsymbol{Y}. \qquad (5)$$
When 𝑗 ≫ 𝑛 and/or in the presence of collinearities, the matrix $\boldsymbol{X}^{T}\boldsymbol{X}$ becomes singular and the least-squares estimate has no unique solution. The PLS approach averts this by decomposing 𝑿 into orthogonal 'scores' 𝑭 and 'loadings' 𝑷
$$\boldsymbol{X} = \boldsymbol{F}\boldsymbol{P}, \qquad (6)$$
and regressing 𝒀 not on 𝑿 itself but on the first 𝑚 columns of the 'scores' 𝑭. The PLS approach
therefore incorporates information on both 𝑿 and 𝒀 in the definition of the 'scores' and 'loadings'. The
PLS algorithm performs a simultaneous bilinear decomposition of the outcome variable and the
regressors. Scores 𝑭 and loadings 𝑷 are orthogonal and the following decompositions are carried out
concurrently
$$x = q_{1}'f_{1} + \cdots + q_{m}'f_{m} + E_{m}, \qquad (7a)$$
$$y = p_{1}'f_{1} + \cdots + p_{m}'f_{m} + e_{m}, \qquad (7b)$$
where $f_i$ are the individual scores, and $p_i$ and $q_i$ are the loadings generated at each step. The suffix $m$ represents the final step of the calibration process and, by design, the algorithm converges in the sense that after $m$ steps the remaining factors are identical and equal to zero. The recursive formulas for the scores and loadings provide a linear form for the predictions of the estimators
$$y_{PLS} = x_{t=0}'\,\beta_{PLS}(m), \qquad (8)$$
where
$$\beta_{PLS}(m) = \boldsymbol{W}_{m}\left(\boldsymbol{W}_{m}'\boldsymbol{\Sigma}\boldsymbol{W}_{m}\right)^{-1}\boldsymbol{W}_{m}'\,\sigma_{xy}, \qquad (9)$$
and 𝑾𝑚 = (𝑤1 , … , 𝑤𝑚 ) is obtained after 𝑚 recursions of the algorithm by stacking weights generated
at each step, which are effectively the weighted covariances of the predictors and the response.
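A minimal sketch of this decomposition, using scikit-learn's PLSRegression on synthetic data with assumed shapes (the choice of 8 components mirrors the selection discussed in Section 4):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 22))              # correlated racing factors (synthetic stand-in)
y = (rng.random(500) < 0.15).astype(float)  # rare 0/1 win outcome treated as the response

pls = PLSRegression(n_components=8, scale=True).fit(X, y)

scores = pls.transform(X)       # orthogonal scores F (latent components)
loadings = pls.x_loadings_      # loadings P, one column per component
beta_pls = pls.coef_            # implied coefficients on the original factors
y_hat = pls.predict(X).ravel()  # linear PLS predictions of the response
```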
We extend the PLS approach to a generalised linear regression model (PLSGLR) which appropriately
accounts for missing data. The PLSGLR regression of the response 𝑦𝑃𝐿𝑆 on variables 𝑥𝑖 is defined as
$$g(\theta)_{i} = \sum_{h=1}^{H}\left(\mathbf{B}\sum_{j=1}^{u} w_{jx}^{*}\, x_{ij}\right), \qquad (10)$$
with 𝐻 components, where 𝜃 is a probability vector of the response variable 𝑦𝑃𝐿𝑆 with a finite
support. The components 𝑥𝑖𝑗 are built to be orthogonal and the link function 𝑔(. ) is a logistic function
to fit the model to the data.
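The sketch below is a rough analogue of this PLSGLR idea rather than the exact algorithm: orthogonal PLS components are extracted from the correlated factors, and a logistic link is then fitted on those component scores (the synthetic data and component count are illustrative assumptions):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 22))            # correlated racing factors (synthetic stand-in)
y = (rng.random(500) < 0.15).astype(int)  # 0/1 win outcome

pls = PLSRegression(n_components=8, scale=True).fit(X, y)
T = pls.transform(X)                      # orthogonal component scores

logit = LogisticRegression(max_iter=1000).fit(T, y)   # logistic link on the scores
p_win = logit.predict_proba(pls.transform(X))[:, 1]   # estimated win probabilities
```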
In conceptual terms, the PLS method projects the input and output variables in directions of maximum
covariance, with the calibration performed between the orthogonal 'latent' variables. This approach
has been shown to be stable with respect to collinearity (Esbensen, 2002) and is able to generate
unbiased prediction equations from pre-processed datasets (Sundberg et al., 1999; Wold, Sjöström, &
Eriksson, 2001).
We validate the PLS method using a k-fold leave-one-out cross-validation approach to identify
significant variables contributing to the best fit of the model. The advantage of PLS over other
approaches is that it identifies only relevant predictor variables, while other linear models require pre-
selection of potential predictor variables prior to regression analysis. The disadvantage of PLS is that
the resulting components don't necessarily correspond to a specific 'factor' unlike typical regressions
that identify predictive attributes directly related to the data.
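A hedged sketch of this validation step, using plain k-fold cross-validation to trace the RMSEP curve against the number of components (synthetic data; the grid of component counts is an assumption):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 22))              # training factors (synthetic stand-in)
y = (rng.random(500) < 0.15).astype(float)  # 0/1 win outcome

cv = KFold(n_splits=10, shuffle=True, random_state=1)
rmsep = []
for n_comp in range(1, 16):
    pred = cross_val_predict(PLSRegression(n_components=n_comp), X, y, cv=cv)
    rmsep.append(np.sqrt(np.mean((y - pred.ravel()) ** 2)))

best_n_components = int(np.argmin(rmsep)) + 1  # the RMSEP curve typically flattens here
```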
Model estimation and subsequent prediction are affected by the presence of imbalanced data (many losers, few winners). Importantly, model diagnostics are also affected by imbalanced data. For classification, accuracy (and its complement, the error rate) is used to assess performance, defined as
$$Accuracy = \frac{TP + TN}{TP + FN + TN + FP}, \qquad (11)$$
where the correctly classified number of outcomes for True Positive (TP) and True Negative (TN) are
combined with incorrectly classified outcomes for False Positives (FP) and False Negatives (FN).
High accuracy can be obtained by predicting a loss for all test races while winners are misclassified.
This metric is of little use when predicting relatively rare outcomes is the objective.
An alternative metric for imbalanced data is the geometric mean 𝐺𝑚𝑒𝑎𝑛 which is defined as
$$G_{mean} = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}} = \sqrt{\text{sensitivity} \times \text{specificity}}. \qquad (12)$$
The $G_{mean}$ computes the geometric mean of the per-class accuracies and thereby measures the relative balance between classes. These metrics are used to compare model accuracy in the
analysis below.
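For reference, a small helper computing both metrics from a binary confusion matrix (a sketch; the winner is taken as the positive class):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def imbalance_metrics(y_true, y_pred):
    """Accuracy and G-mean from a binary confusion matrix (winner = positive class)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    g_mean = np.sqrt(sensitivity * specificity)
    return accuracy, g_mean
```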
3.0 Data and setup
We use a data set of all competitive thoroughbred horse races in Australia from January 2019 to
December 2021 in both metropolitan and provincial grades (but excluding country-level races which
are in remote locations with limited competitiveness and market liquidity). Race distances vary from
1000m to 3200m under all weather conditions and multiple field sizes. Hurdles and steeplechase races
are excluded. To eliminate possible bias, maidens (runners who are yet to win) are initially excluded, although adding them back to the training and test data sets for model verification made little difference to accuracy. Events with fewer than five runners, more than one winner (a tie), or no available pricing data were excluded.
Data is obtained for a range of primary variables which are provided in the Appendix. There are
hundreds of potential variables that can be used but we constrain the list to 22 of the most promising
for this analysis. Figure 1 provides a pictorial correlation matrix of the strength of the relationships between variables over the full period. Several variables are highly correlated (positive in blue,
negative in red) which is known to bias regression coefficient estimates. The effects of high
correlation are of no concern for the PLS regression when the dimensions of the data are reduced to
orthogonal components but have deleterious effects on the applicability of regression methods.
Figure 1. Correlation matrix between factors used in the analysis.
We split the data into two halves with pre-July 2020 designated as the training data set and post-July
2020 as the test data set. This results in 9,060 events in each of the training and test data sets. Table 1
provides a chi-squared style comparison of win percentage in both the training and test data sets against the expected win probability implied by the starting prices for each event. The 𝜒2 p-values for win percentages in both the training and test data sets are <0.001. A feature of these results is that no hint of
favourite-longshot bias (i.e., overvaluing longshots and undervaluing favourites; Sobel & Raines,
2003) or other forms of bias appears in either data set.
Price Category Training Data Win % Test Data Win % Expected Win %
$1 - $2 56.83% 57.46% 56.97%
$2 - $3 39.28% 37.86% 38.35%
$3 - $4 27.65% 28.39% 28.06%
$4 - $5 21.35% 21.64% 22.00%
$5 - $10 13.72% 14.02% 13.25%
$10 - $20 6.97% 6.92% 6.67%
$20 - $50 3.30% 3.38% 3.10%
$50 - $100 1.60% 1.36% 1.41%
$100+ 0.45% 0.41% 0.37%
Table 1. Chi-squared style comparison of win percentage between training and test data sets, against
expected win probability implied by starting prices, Australian thoroughbreds, 2019-2021.
To avoid issues related to bias from the presence of few winners relative to many losers, the training
data set was 'up-sampled' using bootstrap resampling so that the number of winners exceeded 40 per
cent of the data used to train each model, where the number of observations (individual runners)
increased from 85,540 to 147,170. We use both logistic regression and PLSGLR regression to classify runners into winners and losers per race.
4.0 Results
The logistic regression results using the up-sampled training data set are provided in Table 2.
Variable Coefficient Std Err. Odds Ratio 95% CI p-value
Start price last start -0.041 0.012 0.94 (0.87, 1.01) 0.091
Track win (previous) 0.016 0.005 1.17 (1.06, 1.29) 0.002
Grade difference -0.005 0.001 0.99 (0.98, 1.00) 0.007
Distance difference 0.000 0.000 1.00 (1.00, 1.00) 0.018
Finishing speed -0.065 0.015 0.93 (0.86, 1.02) 0.110
Horse rating 0.032 0.004 1.03 (1.01, 1.06) 0.006
Jockey rating 0.223 0.012 1.32 (1.24, 1.40) <0.001
Expected position -0.002 0.014 1.07 (0.99, 1.17) 0.100
Barrier adjustment -0.005 0.027 0.80 (0.67, 0.95) 0.009
Quick back up 0.009 0.002 0.98 (0.96, 0.99) 0.007
Handicap -0.012 0.003 0.96 (0.93, 0.99) 0.002
Peak rating 0.014 0.003 0.94 (0.90, 0.98) 0.002
Race distance 0.000 0.000 1.00 (1.00, 1.00) 0.001
Price at start of day 0.385 0.003 1.57 (1.54, 1.61) <0.001
Table 2: Logistic regression results (odds ratios, 95% confidence intervals, p-values) for
thoroughbred races in Australia, Jan 2019-Jun 2020.
A total of 14 variables are significant at the 95% confidence level, with an AIC of 15,975 marginally improved over the AIC of 16,025 for the alternative formulation using all variables. Variables
representing previous wins at the current track, jockey rating, horse rating, and price are strongly
related to win probability.
The PLSGLR logistic regression was calibrated using the training data set and the results are
provided in graphical form for ease of interpretation. Figure 2 indicates the explanatory strength of
each coefficient (numbered 1 to 22 on the x-axis) represented by each regression component (first five
components are shown). One variable (price at start of day) exerts a strong influence on the
decomposition relative to other variables in a similar way to its influence over the logistic regression
model. The correlation loadings between the two most significant components are provided in Figure 3, which indicates that while the first component is dominant, it is also unbiased.
Figure 2: Partial least squares regression coefficient estimates by variable (Australian thoroughbreds Jan 19 – Jun 20).
Figure 3: Partial least squares regression correlation loading, first two components (Australian thoroughbreds Jan 19 – Jun 20).
Figure 4 depicts the Root Mean Squared Error of Prediction (RMSEP) from the bias-corrected cross-
validation estimate as the number of principal components is added to the model. The RMSEP
reaches a stable minimum after 8 components are added. Prediction quality is provided in Figure 5
using the test data, which shows predicted probabilities relative to measured probabilities (implied by
market odds). There are some differences at the lower 'scores' (low win probability) but the bulk of
the predicted scores for runners at higher probabilities tracks measured values.
Figure 4: Cross-validated RMSEP curves for PLS regression (Australian thoroughbreds Jan 19 – Jun 20).
Figure 5: Cross-validated predictions for PLS regression (Australian thoroughbreds Jul 20 – Dec 21).
A comparison of the effect of up-sampling winners in the training data set is depicted in Figure 6.
Estimated prices using the PLSGLR model derived from the up-sampled data are closely aligned with market observed prices (start prices), which have been shown to be relatively accurate (Table 1). Some divergence is present at higher prices (lower win probabilities); however, most trades are placed on runners at odds representing a winning probability greater than around 15 per cent, which is the critical zone in which correct model estimates matter. Model estimates for the PLSGLR model using the non-scaled data indicate a divergence at higher prices (lower win probabilities), which demonstrates the bias away from the winning category when data is imbalanced.
Figure 6. Forecast price against traded starting price for the up-sampled (scaled) and unscaled test data sets (truncated at a start price of $10).
Diagnostics for the original test data and the up-sampled test data are provided in Table 3. The PLS model produces improved accuracy measures over the logistic regression approach as well as a superior G-mean metric.
Test statistic Log. regression (scaled) Partial least squares (scaled)
Accuracy 0.857 0.871
(95% CI) (0.855, 0.859) (0.868, 0.873)
Sensitivity 0.920 0.925
Specificity 0.325 0.328
Prevalence 0.894 0.916
Detection rate 0.827 0.840
Balanced accuracy 0.623 0.613
G-mean 0.547 0.551
Table 3. Model diagnostics for logistic regression and partial least squares models using existing test
data and up-sampled test data, Australian thoroughbreds 2019-2021.
To compare profitability profiles between models, we apply a simple flat staking betting strategy of
$1 for each trade. We trade on the highest rating runner per race using the logistic regression model.
We bet on runners whose scores exceed a threshold equivalent to the 80th percentile for the PLS
model. This means that for the PLS method, there are potentially multiple bets for some races and
zero bets for other races. For comparison, we also use a naïve strategy of simply betting on the race
favourite for each race. The trading results comparison is provided in Figure 7. Profitability from trades informed by the PLS approach clearly dominates the alternative methods. Surprisingly, the naïve approach generates a profit over the testing period; however, the resulting return on investment makes this approach infeasible.
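A minimal sketch of this flat-staking comparison, assuming a test-set data frame with illustrative columns score (PLS rating), lr_score (logistic regression rating), start_price, win, and race_id:

```python
import numpy as np
import pandas as pd

def flat_stake_pnl(df: pd.DataFrame, threshold: float) -> float:
    """$1 flat-staking P/L: back every runner whose model score exceeds the threshold;
    winners return (start_price - 1), losers lose the $1 stake."""
    bets = df[df["score"] > threshold]
    return float(np.where(bets["win"] == 1, bets["start_price"] - 1.0, -1.0).sum())

# PLS strategy: bet runners above the 80th percentile of model scores
# pls_threshold = test["score"].quantile(0.80)
# pnl_pls = flat_stake_pnl(test, pls_threshold)

# Logistic regression strategy: bet only the top-rated runner in each race
# top = test.loc[test.groupby("race_id")["lr_score"].idxmax()]
# pnl_lr = float(np.where(top["win"] == 1, top["start_price"] - 1.0, -1.0).sum())
```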
Figure 7. Profit/loss results for PLS versus logistic regression model and naïve methods using a flat
staking strategy $1 per bet, Australian thoroughbreds Jul 2020 – Dec 2021.
The profitability breakdown by price category (i.e., inverse odds) is provided in Table 4. All
strategies are unprofitable using flat staking for prices less than $3.00 (odds 4:1 against). While there
is a higher chance of earning a return at lower odds, the ability of the models to distinguish between
winners predicted by the model versus those predicted by the market is relatively poor. At higher
price categories (i.e., odds greater than 5:1 against), profitability increases. Simply backing the
favourite (the naïve strategy) is also unprofitable at lower odds and there are naturally lower profits
available at higher odds.
The PLS model earns over half its profits in the $5-$10 price range (win probability of roughly 8.3 -
14.3 per cent) indicating that shorter priced favourites are often overpriced relative to the model
choice. The logistic regression and naïve strategy profitability is highest in the $4-$5 price range (win
probability of 14.3 - 16.6 per cent).
Price Category Logistic Reg. ($ / %) PLS ($ / %) Naïve ($ / %)
$1 - $2 -$10.80 -3.3% -$22.50 -4.0% -$11.50 -14.9%
$2 - $3 -$18.00 -5.6% -$22.80 -4.1% -$97.20 -125.9%
$3 - $4 $63.80 19.7% $140.80 25.1% $18.60 24.1%
$4 - $5 $174.20 53.8% $138.40 24.6% $159.50 206.6%
$5 - $10 $110.50 34.1% $327.80 58.4% $7.80 10.1%
$10 - $20 $60.00 18.5% $14.00 2.5% $- 0.0%
$20 - $50 -$38.00 -11.7% -$10.00 -1.8% $- 0.0%
$50 - $100 -$12.00 -3.7% -$3.00 -0.5% $- 0.0%
$100+ -$6.00 -1.9% -$1.00 -0.2% $- 0.0%
Table 4. Profitability by price category for PLS versus logistic regression model and naïve methods
using a flat staking strategy $1 per bet, Australian thoroughbreds Jul 2020 – Dec 2021.
Concerns over model degradation over time resulting in a loss of accuracy for either approach are
addressed through the accuracy plot provided in Figure 8. The plot depicts the positive predictive
value (precision) as win percentage for each month from July 2020 – December 2021 in the test data
defined as TP/(TP+FP). While precision varies from month to month, there is no persistent rate of
degradation over the 18-month test data series. The PLS model outperforms the logistic regression model on precision in every month apart from August 2021, when the precision of the PLS approach fell to roughly 25 per cent of win predictions compared with 28 per cent for the logistic regression.
Figure 8. Precision (win % rate) for PLS versus logistic regression model by month in the test data,
Australian thoroughbreds Jul 2020 – Dec 2021.
5.0 Discussion
The benefits of the PLS approach extend to the use of many more variables than used in this analysis, which can serve as 'weak learners' that incrementally improve the information content from large,
complex, and interrelated data. The need to account for multicollinearity and interaction terms when
using regression and other classifiers in complex data sets is avoided with the PLS approach. While
PLS model outputs don't directly attribute explanatory strength for features (variables) used to
estimate principal components as they do in regression models, this is a small price to pay for the
opportunity to improve model accuracy.
Probability estimates are greatly improved when accounting for imbalanced data using a relatively
simple scaling (up-sampling) process. Pre-processing the data by bootstrapping to re-weight towards winners avoids the arbitrary filtering techniques of post-processing methods such as adopting
thresholds for selection or using hybrid methods that rely on learning 'agents' from alternative
classifiers. While post-processing is a valuable tool for prediction in complex settings (e.g., weather
forecasting, genomics, computational biology), the simple but effective method of up-sampling
winners achieves close alignment in estimated odds relative to market observed odds in thoroughbred
racing. The simpler bootstrapping method also avoids issues of overfitting associated with post-
processing methods.
The combination of principal decomposition and data scaling results in superior profitability and
return on investment using a narrow data set for estimation. The accuracy can be shown to improve
even further with data sets comprising many more features, but the incremental increase in accuracy
rapidly diminishes. Profitability can also be enhanced through Kelly staking or proportional Kelly
staking strategies, particularly when the PLS method is able to reliably identify winners offering
higher market prices that represent a greater return on investment relative to a flat staking strategy.
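As a pointer to how such a staking rule works, the standard Kelly (1956) fraction can be computed as in the sketch below; the example numbers are illustrative only:

```python
def kelly_fraction(p_win: float, decimal_price: float, scale: float = 1.0) -> float:
    """Kelly (1956) stake as a fraction of the bankroll for a single bet.
    p_win is the model's win probability, decimal_price the market price (stake included),
    and scale < 1 gives a proportional (fractional) Kelly stake."""
    b = decimal_price - 1.0              # net odds received on a win
    f = (b * p_win - (1.0 - p_win)) / b  # full-Kelly fraction
    return max(0.0, scale * f)           # never bet when the edge is negative

# e.g. kelly_fraction(p_win=0.18, decimal_price=7.0, scale=0.5) ≈ 0.022,
# i.e. stake roughly 2.2% of the bankroll on a half-Kelly basis.
```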
6.0 Summary
We have shown that the use of generalised principal component analysis to dissect data for trading strategies can deliver superior accuracy and profitability relative to existing methods. Also, simple up-sampling of categorical data greatly improves accuracy at lower odds. A limitation of PLS in
prediction is that its validity pertains to the conditions under which the observed data was obtained.
While seasonality is apparent in racing (e.g., high profile runners prepare and compete for racing
'carnivals'), its effects are generally assumed to be minimal. We did not detect degradation in accuracy
due to seasonal effects in the test data. The persistence of factors and the accuracy of the PLS did not
materially change when the model was trained on alternative periods or across seasons. Dimension
reduction methods for large and complex data are robust and offer improved accuracy, which should
be considered for algorithmic selection strategies in similar settings.
Appendix
Factors used in the analysis of racing data used for the logistic regression and PLS models.
• Number of starts
• Days since last run
• Days since second last run
• Days since third last run
• Win rate (% wins by starts)
• Place rate (% places by starts)
• Log transform of starting price in previous race
• Number of times won at current track
• Differential rating between grades for this race against previous race
• Weight difference between this race and previous race
• Distance difference between this race and previous race
• Number of times won at current Grade
• Finishing speed in previous 5 races (averaged)
• Horse's rating in previous 5 races (averaged)
• Current jockey rating computed as % wins in previous 12 months
• Jockey rating differential between this race and previous race
• Average rating in first up runs, if last run ≥60 days
• Number of times horse has competed against current class
• Highest rating over the last 3 races
• Highest rating over the last 8 races
• Expected position in race (adjusted for distance, turns, barrier)
• Adjustment factor for wide barriers over track and distance
• Quick back up (days since last run (DSLR) if DSLR<=10)
• Horse age if <=3 years old competing in an open age race
• Average pace over last 200m (previous 8 races)
• Odds (price) listed at start of racing day
References
W Benter 'Computer-based horse race handicapping and wagering systems: a report' (1994) Efficiency
of Racetrack Betting Markets pp 183-198.
D Edelman 'Adapting support vector machine methods for horserace odds prediction' (2007) Annals
of Operations Research 151 pp 325-336.
RN Bolton and RG Chapman 'Searching for positive returns at the track: a multinomial logit model for handicapping horse races' (1986) Management Science 32(8) pp 1040-1060.
KH Esbensen Multivariate data analysis - In practice: An introduction to multivariate data analysis
and experimental design (Oslo, Norway: Camo Process AS, 5th edn, 2002).
IE Frank and JH Friedman 'A statistical view of some chemometrics regression tools (with
discussion)' (1993) Technometrics 35(2) pp 109-148.
P Geladi and BR Kowalski 'Partial least squares regression: A tutorial' (1986) Analytica Chimica Acta
185 pp 1-17.
W James and C Stein 'Estimation with quadratic loss' (1961) Proceedings of the 4th Berkeley
Symposium on Mathematical Statistics and Probability 1 pp 361-379.
JR Kelly Jr 'A New interpretation of information rate' (1956) Bell System Technical Journal 35 pp
917-926.
S Lessmann, MS Sung, and JEV Johnson 'Alternative methods of predicting competitive events: An
application in horserace betting markets' (2010) International Journal of Forecasting 26(2) pp 518-
536.
N Silverman and M Suchard 'Predicting horse race winners through a regularized conditional logistic
regression with frailty' (2013) Journal of Prediction Markets 7(1) pp 43-52.
RS Sobel and ST Raines 'An examination of the empirical derivatives of the favourite-longshot bias
in racetrack betting' (2003) Applied Economics 35(4) pp 371-385.
R Sundberg, PJ Brown, H Martens, T Næs, SD Oman and S Wold 'Multivariate calibration: Direct
and indirect regression methodology' (1999) Scandinavian Journal of Statistics 26(2) pp 161–207.
H Wold 'Estimation of principal components and related models by iterative least squares' in PR
Krishnaiah (ed) Multivariate Analysis (New York: Academic Press, 1966).
S Wold, M Sjöström, and L Eriksson 'PLS-regression: A basic tool of chemometrics' (2001)
Chemometrics and Intelligent Laboratory Systems 58(2) pp 109–130.