Signal Selector
Jonas Svallin
Senior Director Quantitative Solutions
[email protected]
Georgi Mitov
Director of Research
[email protected]
Nikolay Radev
Senior Quantitative Researcher
[email protected]
Alexander Atanasov
Quantitative Researcher
[email protected]
Copyright © 2025 FactSet Research Systems Inc. All rights reserved. 1 FactSet Research Systems Inc. | www.factset.com
White Paper
Contents

1. Introduction
2. Multicollinearity Reduction
   2.1. Correlation Plot
   2.2. Singular Value Decomposition (SVD) Analysis
   2.3. Variance Inflation Factor (VIF)
3. Selecting the Best Return-Predicting Signals
   3.1. Stepwise Regression Selection
   3.2. Monte Carlo Stepwise Regression
   3.3. Lasso Regression
4. Conclusion
Appendix
1. Introduction
With sufficient time and energy, numerous signals (also referred to as factors) can be discovered that show a significant correlation between their values and subsequent returns. However, quantitative investment management, much like portfolio construction, is ultimately about selecting a set of signals that collectively correlate with future returns. This implies an assumption: adding more signals, up to a point, provides diversification benefits. Hence, it is crucial to select the most suitable set of signals from the broader pool available. In practice, most investment professionals use more than one alpha source, making it important to combine these multiple alpha signals optimally. The process of blending alpha signals optimally is complex and includes different stages (Fig. 1). We will discuss the first step, suitability selection, in detail and show how the FactSet Programmatic Environment (FPE) SignalSelector module can be utilized for that purpose.
Variable selection for a linear regression model is more of an art than a science. While there are various selec-
tion algorithms and practices, none guarantee universal applicability. Frequently, the final solution combines
expert opinion with algorithmic output. FPE users have access to a suite of algorithms and visualization tools that enable informed decision-making. Signal selection often proceeds along two main avenues:
• Reduction of multicollinearity
• Selection of the subset that best predicts stock returns without overfitting
The first approach aims to minimize the number of signals by eliminating those that either duplicate or
correlate too highly with other signals. The second aims to choose the signals that best predict returns.
Often these methods are applied in combination – first reducing to a linearly independent set, and then
selecting the best regression model features out of this set. These two algorithmic steps complement each
other effectively as the first step tends to manage larger signal sets, while the second step performs more
reliably and efficiently with smaller, linearly independent sets. In Section 2, we examine the available techniques for reducing multicollinearity: Correlation Plot, Singular Value Decomposition (SVD), and Variance Inflation Factors (VIF). In Section 3, we discuss methods for signal selection based on
predictive capability for returns: Stepwise Regression (SwR), Monte Carlo Stepwise Regression (MCSwR),
and Lasso Regression.
2. Multicollinearity Reduction
Multicollinearity in a regression model means that one or more explanatory variables (signals) are linear
functions of others, or that high correlations are present. This presents a problem, as explanatory vari-
ables should be independent. The presence of multicollinearity subsequently compromises the reliability of
statistical inferences during the model fitting and interpretation.
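The first technique listed in the contents, the Correlation Plot, amounts to inspecting the pairwise correlation matrix of the signals and flagging the most highly correlated pairs. A minimal numpy sketch (the function names and the 0.9 threshold are illustrative choices, not the SignalSelector API):

```python
import numpy as np

def correlation_matrix(X: np.ndarray) -> np.ndarray:
    """Pairwise Pearson correlations between signal columns of X (rows = observations)."""
    return np.corrcoef(X, rowvar=False)

def highly_correlated_pairs(X: np.ndarray, names: list[str], threshold: float = 0.9):
    """Return signal pairs whose absolute correlation exceeds the chosen threshold."""
    C = correlation_matrix(X)
    return [(names[i], names[j], C[i, j])
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if abs(C[i, j]) > threshold]
```

In a notebook, the matrix returned by `correlation_matrix` is typically rendered as a heatmap; the pair list gives the same information in tabular form.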
The Variance Inflation Factor (VIF) of the $j$-th explanatory variable is defined as

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2} \tag{1}$$
where $R_j^2$ is the $R^2$-value obtained by regressing the $j$-th explanatory variable on all the other remaining variables:

$$x_{j,t} = b_0 + \sum_{i \neq j} x_{i,t}\, b_i + \epsilon_t \tag{2}$$
VIF can be calculated for each independent variable, and a high VIF indicates that the corresponding variable
is highly collinear with the other variables in the model. Typically, VIF values above 3–5 are considered
high and indicate problematic multicollinearity [3]. Based on this quantity, we provide a procedure to try
to eliminate the multicollinearity in a set of variables. The procedure iteratively removes the variable with
the highest VIF value until all variables have values lower than a selected threshold (VIFs are recalculated
after each removal, as the regression in Eq. (2) changes). Figure 4 shows an example report from the VIF
selection method of the SignalSelector module. In that example, two of the signals are strongly correlated
to each other and have VIFs above the specified threshold. One of them is dropped, causing the other’s VIF
to drop below the threshold.
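The iterative elimination procedure described above can be sketched in a few lines of numpy. This is an illustrative reimplementation, not the SignalSelector code; the function names and the default threshold of 4 are assumptions:

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing X[:, j] on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])      # include intercept b0
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    r2 = 1.0 - (resid @ resid) / tss
    return 1.0 / (1.0 - r2)

def vif_select(X: np.ndarray, names: list[str], threshold: float = 4.0) -> list[str]:
    """Iteratively drop the highest-VIF signal until all VIFs are below the threshold.

    VIFs are recalculated after each removal, since the regression in Eq. (2) changes.
    """
    X, names = X.copy(), list(names)
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names
```

As in Figure 4, a pair of strongly correlated signals will produce two large VIFs; dropping one of the pair is enough to push the other's VIF back below the threshold.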
So far, we have discussed various methods to reduce a pool of signals to a linearly independent subset but
have not yet matched any of these signals to returns. In the following section, we present algorithms that
select signals based on how well they predict stock returns.
------------------------------------
Singular Values Decomposition Report
------------------------------------
Signals list (17):
['Value-- Book Yield', 'Value-- Earnings Yield', 'Value-- Sales Yield', 'Value-- CFO Yield',
'Sentiment-- Stdzd Analyst PT', 'Sentiment-- PT Revisions', 'Sentiment-- Earnings Est. Revisions',
'Sentiment-- Earnings Est Stability', 'Quality-- FCF Mgn', 'Quality-- FCF Mgn Stability',
'Quality-- Interest Coverage', 'Quality-- Piotroski F-Score', 'Technical-- Avg True Range',
'Technical-- Short-Term Reversal', 'Technical-- Velocity', 'Technical-- DownBetaR2', 'Value Composite']
Results
-------
There are total of 1 redundant signals.
Linear dependence found in the following groups of signals, with number of redundant signals in brackets:
('Value-- Book Yield', 'Value-- Earnings Yield', 'Value-- Sales Yield', 'Value-- CFO Yield', 'Value Composite') (1)
Figure 3: An example report of the Singular Value Decomposition method of the SignalSelector module.
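The linear-dependence detection illustrated in Figure 3 can be reproduced with a singular value decomposition: singular values near zero indicate redundant directions, and the corresponding right-singular vectors point to the groups of signals involved. A simplified sketch (not the module's implementation; the tolerances are illustrative):

```python
import numpy as np

def redundant_signals(X: np.ndarray, tol: float = 1e-8) -> tuple[int, list[int]]:
    """Count redundant signals and identify the columns involved, via SVD.

    Returns (number of near-zero singular values, sorted column indices with
    non-negligible weight in the corresponding right-singular vectors).
    """
    # Standardize columns so differences in scale do not mask linear dependence.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    null_mask = s / s[0] < tol                      # relative tolerance on singular values
    n_redundant = int(null_mask.sum())
    involved = sorted({j for v in vt[null_mask]
                         for j in np.nonzero(np.abs(v) > 1e-6)[0]})
    return n_redundant, involved
```

In the Figure 3 example, the 'Value Composite' signal is a combination of the four Value signals, so the standardized signal matrix is rank-deficient by one; the near-null singular vector loads on all five of those columns.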
The methods in this section evaluate signals through a regression of returns on lagged signal values:

$$r_t = b_0 + \sum_i b_i\, x_{i,(t-1-\mathrm{lag})} + \epsilon_t \tag{3}$$
While having a linearly independent set of signals is desirable for a reliable regression model, not every
signal is necessary for a good model – overfitting and data mining are common pitfalls that must be avoided.
Information theory provides some useful ways of quantifying the quality of a regression model as a balance
between goodness of fit and limiting the number of independent variables. Information criteria, like the
Bayesian Information Criterion (BIC) (4), Akaike Information Criterion (AIC) (5), and Corrected Akaike
Information Criterion (AICc) (6), are powerful and well-established statistical metrics used in many scientific
fields [1, 2].
$$\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L}) \tag{4}$$

$$\mathrm{AIC} = 2k - 2 \ln(\hat{L}) \tag{5}$$

$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k^2 + 2k}{n - k - 1} \tag{6}$$

Here, for a given regression model, $k$ is the number of variables (signals), $n$ is the sample size (number of data points), and $\ln(\hat{L}) = -\frac{n}{2}\left(\ln(2\pi) + \ln\!\left(\frac{ssr}{n}\right) + 1\right)$ is the maximized log-likelihood, with $ssr$ being the sum of squared residuals.
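These criteria can be computed directly from the sum of squared residuals of a fitted regression. A minimal helper following the formulas above (an illustrative function, not part of the SignalSelector API):

```python
import numpy as np

def information_criteria(ssr: float, n: int, k: int) -> dict[str, float]:
    """AIC, BIC, and AICc from the sum of squared residuals of an OLS fit.

    Uses ln(L) = -(n/2) * (ln(2*pi) + ln(ssr/n) + 1), the maximized
    Gaussian log-likelihood for a linear regression.
    """
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    aic = 2 * k - 2 * loglik
    bic = k * np.log(n) - 2 * loglik
    aicc = aic + (2 * k**2 + 2 * k) / (n - k - 1)
    return {"AIC": aic, "BIC": bic, "AICc": aicc}
```

Note that AICc is always larger than AIC, with the gap vanishing as the sample size grows relative to the number of signals; BIC penalizes each extra signal more heavily than AIC once n exceeds about 8 observations.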
We can use these criteria as the target of optimization for our signal selection methods. There are no
analytic solutions to this discrete problem, and while it would seem attractive to run a best subset regression
(exhaustive search) the combinatorial explosion becomes unmanageable even with as few as 20 signals (more
---------------------------------------------------------------
Variance Inflation Factor selection: 2020-12-31, universe
---------------------------------------------------------------
Parameters:
-----------
Time window: 47
VIF Threshold: 4
Results:
--------
Selected signals (15):
['Value-- Book Yield', 'Value-- Earnings Yield', 'Value-- Sales Yield', 'Value-- CFO Yield',
'Sentiment-- Stdzd Analyst PT', 'Sentiment-- PT Revisions', 'Sentiment-- Earnings Est. Revisions',
'Sentiment-- Earnings Est Stability', 'Quality-- FCF Mgn', 'Quality-- FCF Mgn Stability',
'Quality-- Interest Coverage', 'Quality-- Piotroski F-Score', 'Technical-- Avg True Range',
'Technical-- Velocity', 'Technical-- DownBetaR2']
WARNING: VIF selection only drops signals to reduce multicollinearity, but not all selected signals
are necessarily significant to explaining and predicting returns
Figure 4: A report from the Variance Inflation Factors (VIF) selection method of the SignalSelector Module. One
of a pair of correlated signals was dropped.
than a million combinations and, at 40 signals, a trillion combinations). Therefore, a more efficient process that converges on the desired solution in a directed manner is necessary. The following two subsections present two
methods available in the SignalSelector module of FPE that leverage information theory in the selection
of signals that best predict returns.
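The combinatorial explosion mentioned above is easy to verify: an exhaustive (best subset) search over p candidate signals must fit every non-empty subset.

```python
# Number of non-empty signal subsets an exhaustive search must evaluate.
def n_subsets(p: int) -> int:
    return 2 ** p - 1

print(f"{n_subsets(20):,}")  # just over a million
print(f"{n_subsets(40):,}")  # about a trillion
```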
The final subsection covers a method that takes an alternative approach to finding a constrained optimal subset of return-predicting signals: Lasso Regression. It is a constrained version of the simple least squares regression:
$$\min_{b_0,\, b} \left\{ \left\| r_t - b_0 - x_{(t-1-\mathrm{lag})}\, b \right\|_2^2 \right\} \quad \text{subject to} \quad \|b\|_1 \le L \tag{7}$$
An equivalent formulation of the Lasso Regression, and also the one used in SignalSelector, is the Lagrangian form:
$$\min_{b_0,\, b} \left\{ \frac{1}{N} \left\| r_t - b_0 - x_{(t-1-\mathrm{lag})}\, b \right\|_2^2 + \lambda \|b\|_1 \right\} \tag{8}$$
---------------------------------------------------------
Stepwise Regression selection: 2020-12-31, universe
---------------------------------------------------------
Parameters:
-----------
Time window: 47
Results:
--------
Selected signals (7):
['Value-- Book Yield', 'Value-- Earnings Yield', 'Quality-- FCF Mgn Stability', 'Quality-- Interest Coverage',
'Technical-- Avg True Range', 'Technical-- Short-Term Reversal', 'Technical-- DownBetaR2']
AIC: -44973.03271716798
R-squared: 0.0029447592175323445
Figure 5: A report from the Stepwise Regression selection method of the SignalSelector Module. A detailed log of the steps taken by the procedure, with the evolution of the information criterion and R², can be seen.
The Monte Carlo Stepwise Regression (MCSwR) method repeats what the simple Stepwise Regression method does, using a randomized starting model for each iteration. The starting model is selected by 'tossing a coin' for each available signal to decide whether it is included. The method also provides statistical information that helps gauge the confidence in the reached solution (Fig. 6). We have found, throughout testing (including exhaustive brute-force validation for small enough sets of signals), that the number of iterations, N, should be at least

N ≳ (number of signals)

for strong confidence that the best global solution was found among these iterations. Ideally, each of the different solutions should occur more than a handful of times, say more than the number of different solutions found. This last condition is not a requirement, as there can often be 'rare' solutions only
reached from very few specific starting models, and these are almost never the best solutions. An example
can be seen in Figure 6, where 50 iterations were used to select an optimal BIC subset from among 15 signals (50 > 15). Two potential solutions were found (50 ≫ 2), and the solutions were reached 20 and 30 times
respectively (20, 30 > 2). In fact, the better, and more commonly found, of these two solutions is the best
global solution as confirmed by an exhaustive subset search in this case.
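The randomized-restart procedure described above can be sketched as follows, using BIC as the optimization target. This is a simplified illustration (OLS scoring, coin-toss starting models, a tally of the local solutions found), not the module's implementation:

```python
from collections import Counter
import numpy as np

def bic(X: np.ndarray, y: np.ndarray, subset: tuple[int, ...]) -> float:
    """BIC of an OLS fit of y on the selected signal columns (plus intercept)."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    ssr = float(((y - A @ beta) ** 2).sum())
    k = len(subset) + 1
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    return k * np.log(n) - 2 * loglik

def stepwise(X, y, start: set[int]) -> tuple[float, tuple[int, ...]]:
    """Greedy bidirectional stepwise: toggle one signal in/out while BIC improves."""
    current = set(start)
    best = bic(X, y, tuple(sorted(current)))
    improved = True
    while improved:
        improved = False
        for j in range(X.shape[1]):
            trial = current ^ {j}               # toggle signal j in or out
            score = bic(X, y, tuple(sorted(trial)))
            if score < best:
                best, current, improved = score, trial, True
    return best, tuple(sorted(current))

def mc_stepwise(X, y, n_iter: int = 50, seed: int = 0) -> Counter:
    """Monte Carlo restarts: coin-toss a random starting model for each iteration."""
    rng = np.random.default_rng(seed)
    solutions = Counter()
    for _ in range(n_iter):
        start = {j for j in range(X.shape[1]) if rng.random() < 0.5}
        score, subset = stepwise(X, y, start)
        solutions[(round(score, 6), subset)] += 1
    return solutions                            # local solutions and their counts
```

The returned tally is the raw material for a Figure 6-style breakdown: the best (lowest-criterion) solution, how many distinct local solutions appeared, and how often each was reached.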
The Appendix provides empirical statistics, showing the reliability and scalability (as compared to brute
force/exhaustive search methods) of the stepwise regression methods. Monte Carlo Stepwise Regression is
significantly more tractable with respect to the number of signals, and capable of reliably finding the optimal information criterion subset from a pool of up to the low hundreds of signals (100–200) in less than an hour (a search that would take more than the current age of the universe with a simple exhaustive algorithm).
---------------------------------------------------------------
Monte Carlo Stepwise Regression selection: 2020-12-31, universe
---------------------------------------------------------------
Parameters:
-----------
Time window: 47
Number of iterations: 50
Results:
--------
Selected signals (3):
['Value-- Book Yield', 'Value-- Earnings Yield', 'Technical-- Short-Term Reversal']
BIC: -44930.668554305994
R-squared: 0.0021769390469288386
Figure 6: A report from the Monte Carlo Stepwise Regression selection method of the SignalSelector Module.
Breakdown of the local solutions found can be seen.
------------------------------------------------
Lasso Regression selection: 2020-12-31, universe
------------------------------------------------
Parameters:
-----------
Time window: 47
Results:
--------
Selected signals (10):
['Value-- Book Yield', 'Value-- Earnings Yield', 'Value-- CFO Yield', 'Sentiment-- Stdzd Analyst PT',
'Sentiment-- Earnings Est Stability', 'Quality-- FCF Mgn', 'Quality-- FCF Mgn Stability',
'Technical-- Avg True Range', 'Technical-- Short-Term Reversal', 'Technical-- DownBetaR2']
Lambda: 0.0005232991146814947
Figure 7: A report from the Lasso Regression selection of the SignalSelector Module. It was run with a desired number of signals specified (k = 10); the Lambda (λ) that achieves this can be seen in the results.
4. Conclusion
We have presented the diverse array of signal suitability selection tools in the SignalSelector module of
FPE. These tools are designed to help sort through the ever-growing zoo of signals and pick the ones that
can be combined into a desirable alpha signal. We have provided several complementary techniques for
detecting and reducing multicollinearity. These synergize with the regression-based methods for selecting
signals with the best return-predicting power, as they provide the most reliable statistical inferences when
applied to linearly independent sets of signals.
The SignalSelector methods can be conveniently applied to the entire universe (panel regression across
time and assets), as well as user-specified groups of stocks (e.g., by sectors) and/or sub-periods (e.g., for
out-of-sample validation or studying signal decay). All of these tools are deeply integrated with the Backtest module, the backbone of the FPE equity workflow, ensuring a seamless transition into the subsequent stages – calculating optimal signal weights (OptimalWeightsEngine) and backtesting.
Bibliography
[1] K.P. Burnham and D.R. Anderson. Model Selection and Multimodel Inference: A Practical
Information-Theoretic Approach. Springer New York, 2003. ISBN 9780387953649. URL
https://books.google.bg/books?id=BQYR6js0CC8C.
[2] Sadanori Konishi and Genshiro Kitagawa. Information Criteria and Statistical Modeling. Springer
Publishing Company, Incorporated, 1st edition, 2007. ISBN 0387718869.
[3] I. Pardoe. Applied Regression Modeling. John Wiley & Sons, Ltd, 2020. ISBN 9781119615941. doi:
https://doi.org/10.1002/9781119615941.ch5. URL
https://onlinelibrary.wiley.com/doi/abs/10.1002/9781119615941.ch5.
Appendix
Here, we present empirical statistics showcasing the reliability and scalability of the Monte Carlo Stepwise
Regression selection method we have implemented. The setup we used was the following:
• Russell 1000 asset universe
• Two-year time window (2019-2021), monthly frequency
• 62 potential signals (linearly independent to VIF < 5):
['Tax Burden', 'Tangible Book to Price', 'Standardized Analyst Price Target', 'Sales to Price',
'Sales Estimate Stability', 'Sales Estimate Revisions (75D)', 'Return on Invested Capital Change',
'Return on Invested Capital', 'Return on Assets Change', 'Retention Ratio', 'Price Target Estimate Stability',
'Price Target Estimate Revisions (75D)', 'Piotroski F Score', 'Operating Margin', 'Operating Cash Flow Yield',
'Operating Cash Flow Stability', 'Operating Cash Flow Margin Stability', 'Operating Cash Flow Margin Change',
'Operating Cash Flow Growth - 1 Year', 'Net Margin Stability', 'Market Share Industry Group Growth Rate',
'Liability Coverage Ratio Change', 'Liability Coverage Ratio', 'Interest Coverage Ratio Change',
'Interest Coverage Ratio', 'Interest Burden', 'Intangible Assets to Sales',
'Free Cash Flow to Enterprise Value', 'Free Cash Flow Margin Stability', 'Free Cash Flow Margin Change',
'Free Cash Flow Margin', 'Equity Turnover Change', 'Equity Turnover', 'Equity Issuance Growth',
'Equity Buyback Ratio', 'Earnings Estimate Stability','Earnings Estimate Revisions (75D)', 'EPS Stability',
'EPS Growth Rate', 'EBITDA Stability', 'EBITDA Margin Stability', 'EBITDA Margin Change', 'EBITDA Margin',
'EBIT to Enterprise Value', 'EBIT Margin Change', 'EBIT Estimate Stability','EBIT Estimate Revisions (75D)',
'Debt Service Ratio', 'Debt Issuance Growth', 'Change in Intangible Assets to Sales',
'Cash Generating Power Ratio', 'Cash Earnings Ratio', 'Cash Coverage Ratio Change', 'Cash Coverage Ratio',
'CAPEX to Depreciation', 'CAPEX Growth', 'Beneish M Score', 'Asset Turnover Change', 'Asset Turnover',
'Asset Growth', 'Accruals Ratio - Cash Flow Method','Accruals Ratio - Balance Sheet Method']
We applied the Monte Carlo Stepwise Regression (MCSwR) method to find the combination of signals that
best predicts stock returns across this universe and time window. This was repeated for all three available
information criteria (AIC, AICc, BIC) as the optimization target. Similar analysis was performed on random
samples of the whole set of 62 signals to demonstrate the scalability of the methods. For the sufficiently small
sample signal pools (n ≤ 20), an exhaustive calculation of all information criteria for all possible combinations
was performed, using our optimized framework (same as for SwR and MCSwR). This exhaustive calculation
was used to validate that the best solution is indeed found by the MCSwR method. All these calculations
were run on the same machine with an Intel Core i7-12800, 2400MHz, 14 Cores, 20 Logical Processors CPU.
Table 1 shows these results, with a few notable features:
• Brute force (exhaustive subset search) time is exponential in the number of signals – it could not finish (DNF) for 25 signals, though by extrapolation it would have taken ∼5 days.
• MCSwR is considerably more tractable with respect to the number of signals and can feasibly work
with numbers in the low hundreds.
• For all brute force feasible cases (n ≤ 20), MCSwR provably finds the global solution much faster.
• The best solution, found throughout the iterations, is always the one that occurs most often – this is
expected behavior and has been observed in all of our experiments, but is not strictly guaranteed.
• From the two observations above, we can infer strong confidence that MCSwR also finds the global
solutions in cases with a bigger number of available signals (n ≳ 20), even if brute force validation is
infeasible (taking from weeks at ∼ 30 signals to the age of the universe at ∼ 64).
• For the full case (62 signals), both the 50 and 500 iteration runs find the same best solution; the latter
offers stronger statistical confidence that this is indeed the global solution.
• Even a simple Stepwise Regression (equivalent to a single iteration of MCSwR) has a high probability
of finding the global solution, though more iterations are advisable for statistical confidence. See the last column, which shows the percentage of random starting-point iterations (single stepwise regressions) that land on the global solution.
Table 1: Empirical Statistics for Monte Carlo Stepwise Regression Selection method. For small signal selection pools
(n ≤ 20), a brute force method was also used, validating that the global solution was found.