Embedded Methods for Feature Selection in Neural Networks

A Preprint
Vinay Varma K.
Machine Learning Engineer
Niveshi
New Delhi, PIN 110034
[email protected]
Abstract
The representational capacity of modern neural network architectures has made them the default choice in various applications with high dimensional feature sets. But these high dimensional and potentially noisy features, combined with black box models like neural networks, negatively affect the interpretability, generalizability, and training time of these models. Here, I propose two integrated approaches for feature selection that can be incorporated directly into the parameter learning. One involves adding a drop-in layer and performing sequential weight pruning. The other is a sensitivity-based approach. I benchmarked both methods against Permutation Feature Importance (PFI) - a general-purpose feature ranking method - and a random baseline. The suggested approaches turn out to be viable methods for feature selection, consistently outperforming the baselines on the tested datasets - MNIST, ISOLET, and HAR. They can be added to any existing model with only a few lines of code.
1 Introduction
1.1 Motivation
Outside of domains like Computer Vision and Natural Language Processing (NLP), identifying a relevant input feature set for the task is not obvious. For example, consider the problem of forecasting stock returns. Here, potentially relevant features might range from the returns of the index and other stocks to the tweets of the CEO. Even in applications where domain expertise renders a good feature set, the predictions from the model are rarely the end goal.1 This work aims to present two simple methods that reliably give a feature importance ranking in neural networks. It's important to note that the methods need to be embedded - they use the specific predictor (in this case, neural networks) and are integrated into the parameter learning. This ensures that interaction effects relevant to the optimization criteria are taken into account and that the resulting redundancies are removed. Though the results are presented on feed-forward neural networks in a classification setting, by construction the same methods can be used with any learning criterion and architecture.
1.2 Background
For a given dataset X ∈ R^{n×d}, any feature ranking method returns a list I = [f1, f2, ..., fd]. The magnitude of the elements in the list is used to rank order the features. Here I detail the Permutation Feature Importance (PFI) [1] algorithm, which I used as a baseline.
1 Complete end-to-end decision-making systems aren't yet possible with machine learning, at least in high stakes environments. For instance, consider using machine learning for decision making in healthcare [11]. Though, in theory, diagnosing and treating a patient can be formulated under active reinforcement learning, present machine learning models only augment the expertise of the human doctor.
PFI is a widely used importance attribution technique for random forests. Here, I adapted the model-agnostic version [2] of the method for neural networks. PFI has numerous shortcomings, especially under correlated features [3]. Nevertheless, the generality and simplicity of the method make it a good baseline.
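For reference, the following is a minimal sketch of the model-agnostic PFI baseline, assuming a fitted classifier with a scikit-learn-style predict method and a metric(y_true, y_pred) function (both interface assumptions, as the paper does not specify one):

import numpy as np

def permutation_feature_importance(model, X, y, metric, c=10, seed=0):
    # Model-agnostic PFI: the importance of feature j is the mean drop in
    # `metric` when column j of X is randomly permuted, averaged over c shuffles.
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(c):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])  # permute a single feature column in place
            drops.append(baseline - metric(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances  # I = [f1, ..., fd]; a larger drop means more important

Note the nested loop over features and permutations: this is the linear scaling with dimensionality discussed in Section 2.1.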
The contributions of the present work are two-fold:
1. A drop-in layer with iterative pruning of its weights is shown to be a viable feature selection method.
2. Many prediction-explanation methods [4, 5, 6, 7] are used in neural networks. These are post-hoc, instance-wise explainers. This work uses a gradient-based sensitivity method for global feature selection. I also propose that the testing framework used herein, where the selected features are removed and the model is retrained, should be adopted to approximate the relative validity of any feature attribution technique. As the approach doesn't distinguish between instance-specific and global attribution methods, the results allow us to compare between them and also to use them for the task of feature selection.
2 Proposed Methods
2.1 Sensitivity based Selection (SBS)
Several feature attribution methods exist that are specific to neural networks. These approaches typically involve backpropagating an importance signal from the output neuron to the input features. For instance, Simonyan et al. used saliency maps [7] for images, which involve computing the gradient of the output w.r.t. the pixels of an input image. To the best of my knowledge, these are designed as visualization aids and are not used for feature selection. The sensitivity based selection method can be summarized as follows. Consider a classification problem with input X ∈ R^{n×d}. First, train the model on this data. Now, for each data point x ∈ X, compute the gradient of the output label w.r.t. x on this model. Take the absolute value of the average of this gradient across all the data instances. The resulting list of values I = [f1, f2, ..., fd] gives the global feature sensitivity of the model under consideration. Similarly, in a regression setting, the gradient can be taken w.r.t. the error signal. The intuition is that the absolute value of the partial derivative of the relevant output neuron (in a classification setting, the corresponding class probability) w.r.t. an input feature quantifies that feature's importance. This method can be seen as adapting PFI to the special case of neural networks using gradient information. PFI, being model agnostic, randomly permutes a feature and uses the drop in accuracy (or another similar metric) as a proxy for feature importance. Moreover, with PFI the computational cost scales linearly with the dimensionality of the dataset. Neural networks implemented with libraries like PyTorch [8] allow the computation of this feature sensitivity in a single forward and backward pass through the network.
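A minimal PyTorch sketch of SBS follows, assuming the model outputs class logits, inputs are flat vectors of shape (batch, d), and the data arrives as a standard (x, y) DataLoader:

import torch

def sensitivity_based_selection(model, loader, device="cpu"):
    # Global feature sensitivity: the absolute value of the average gradient
    # of the predicted class probability w.r.t. the input, over all instances.
    model.eval()
    total, n_seen = 0.0, 0
    for x, _ in loader:
        x = x.to(device).requires_grad_(True)
        probs = torch.softmax(model(x), dim=1)
        # one scalar per instance: the probability of the predicted class
        top = probs.max(dim=1).values.sum()
        (grads,) = torch.autograd.grad(top, x)
        total = total + grads.sum(dim=0)  # accumulate per-feature gradients
        n_seen += x.shape[0]
    return (total / n_seen).abs()         # |mean gradient|, shape (d,)

The returned vector plays the role of I = [f1, ..., fd]: sorting it gives the global feature ranking.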
2.2 Stepwise Weight Pruning Algorithm (SWPA)
In neural networks, weight pruning refers to systematically removing parameters from an existing network. For a given reduction in parameters, pruning techniques aim to minimize the loss in the performance of the original model.2 Here, I use a drop-in layer that can be incorporated into any neural network and sequentially prune its weights to arrive at the most important ones. The idea is that if these weights W ∈ R^{1×d} are in the first layer of the network and are all initialized to ones, and the output of the first layer O = {w1·x1, ..., wd·xd} is the elementwise multiplication of the weights with the corresponding
2 Performance doesn't necessarily always decrease with pruning; see [10].
Figure 1: The Drop-in Layer, with weights w1, ..., wd gating the inputs x1, ..., xd elementwise.
elements of the input, then pruning a weight wi has a direct interpretation of removing the feature xi. The complete algorithm is summarized below.
Algorithm 2: Stepwise Weight Pruning Algorithm (SWPA)
Input: training data X ∈ R^{n×d}, training labels Y, base network fθ(.), Drop-in Layer W, Step Counter n ∈ Z≥1, Selection Factor f ∈ [0, 1]
1   for count in 1, ..., n + 1 do
2       O ← {w1·x1, ..., wd·xd}
3       if count > 1 then
4           k ← ((1 − f)/n) · d
5           Sort the weights W of the Drop-in Layer based on their absolute value.
6           Set the least k of them to 0.
7       end
8       Train the base network on O
9   end
10  Take the features corresponding to the top f fraction of the weights in W based on their absolute value and train them on the base network.
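To make the procedure concrete, here is a minimal PyTorch sketch of the Drop-in Layer and the pruning loop. The helper train_fn (assumed to train the given model to convergence) and the masking buffer that keeps pruned weights at zero across training rounds are assumptions not specified above:

import torch
import torch.nn as nn

class DropInLayer(nn.Module):
    # Elementwise gate on the inputs: masking entry i removes feature x_i.
    def __init__(self, d):
        super().__init__()
        self.w = nn.Parameter(torch.ones(d))         # weights initialized to ones
        self.register_buffer("mask", torch.ones(d))  # 0 where a weight is pruned
    def forward(self, x):
        return self.w * self.mask * x                # O = {w1*x1, ..., wd*xd}

def swpa(train_fn, base_net, d, n=4, f=0.1):
    # Sketch of SWPA: each pruning step zeroes k = ((1 - f)/n)*d more of the
    # live gate weights, so n steps leave the top f fraction of the features.
    gate = DropInLayer(d)
    model = nn.Sequential(gate, base_net)
    k = int((1 - f) / n * d)
    for count in range(1, n + 2):                    # n + 1 training rounds
        if count > 1:
            score = (gate.w * gate.mask).detach().abs()
            score[gate.mask == 0] = float("inf")     # skip already-pruned entries
            gate.mask[torch.argsort(score)[:k]] = 0.0
        train_fn(model)
    live = (gate.w * gate.mask).detach().abs()
    return torch.topk(live, int(f * d)).indices      # selected feature indices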
3 Experiments
momentum of 0.9. Other hyperparameters include the Step Counter n of SWPA, which is set to 4, the Selection Factor f, set to 0.1, and the number of random permutations c in PFI, set to 10.
3.2 Results
In Table 1, both the top and the bottom 10% of the features are selected based on the proposed methods, and the corresponding accuracies on the test set are presented. Both methods are able to effectively rank order the input features. Of the two, SWPA outperforms on all the datasets. Table 2 summarizes the results on the baseline - Permutation Feature Importance (PFI). As expected, PFI reduces the feature importance when a correlated feature is added [3] and shares it between the two. This explains the small accuracy gap between the top and the bottom features, as informative features are spread between both groups.3 But it performs better than random assignment, acting as a reasonable baseline. For the accuracies on the randomly chosen features, 10 runs are used.
Dataset    (n, d)         #feat   Top SWPA   Btm SWPA   Top SBS   Btm SBS
HAR        (5744, 561)    56      0.976      0.654      0.932     0.574
ISOLET     (7797, 617)    61      0.900      0.411      0.841     0.498
MNIST      (60000, 784)   78      0.941      0.134      0.880     0.153
Table 1: Test-set classification accuracies on the top and bottom 10% of features selected by SWPA and SBS.

Table 2: Classification accuracies on PFI based features and the randomly assigned ones.
4 Ablations
4.1 Effect of Step Counter
In this section, I answer the question: is the Step Counter n > 1 (SWPA) adding any value? I compared the accuracy of the resulting models with n = 1 and n = 4. It is interesting to note that sequential pruning adds a noticeable improvement in all the cases. In Figure 2, I have chosen two instances of the MNIST dataset that SWPA (n = 1) misclassified. About 35% of the features in SWPA (n = 4) are different from those in SWPA (n = 1). Given the performance of the model with SWPA (n = 4), this indicates that these features are responsible for the marginal gains.
4.2 Effect of Adding Constraints
Here, I tested the effect of adding various constraints to the original objective function L(fθ, W). First, I added l1 regularization on the weights W of the Drop-in Layer:
3 All the used datasets have a high degree of correlated features.
l1 = λ · ‖W‖1    (1)
Further, to make the weights more spread out and hence distinctive, I also considered adding a weight variance loss (wvl) as in (2). In all these experiments I chose n = 1 for simplicity. λ and γ are set to 1 and 10, respectively.
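A minimal sketch of the resulting objective in PyTorch follows. Since equation (2) is not reproduced above, the exact form of the variance term (subtracting the weight variance, scaled by γ, to reward spread-out weights) is an assumption:

import torch

def constrained_loss(task_loss, W, lam=1.0, gamma=10.0):
    # l1 penalty on the Drop-in Layer weights, as in eq. (1)
    l1 = lam * W.abs().sum()
    # Assumed form of the weight variance loss: lower loss for
    # higher-variance (more spread out, more distinctive) weights.
    wvl = -gamma * W.var()
    return task_loss + l1 + wvl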
Table 4: Effects of adding constraints - l1 and weight variance factor (wvf). A comparison of the features selected by each of these methods is given in Appendix A.
As there is no considerable difference in performance, I conclude that the marginal benefit of adding constraints like l1
or wvf is negligible for the task of selecting informative features.
5 Conclusion
As neural network based methods are becoming the default choice for modeling problems in high stakes applications like finance and medicine, maintaining the model and being able to reason about its predictions is crucial. Moreover, in real-world applications, performance on the validation set or even the test set doesn't give a complete picture, as statistical correlations can always break and new ones can emerge in the future. In this regard, a simpler model with fewer features is always desirable. In this work, I proposed two approaches for feature selection in neural networks, of which a simple pruning based method (SWPA) outperformed on all the datasets.
References
[1] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[2] Aaron Fisher, Cynthia Rudin, and Francesca Dominici. All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously. Journal of Machine Learning Research, 20(177):1–81, 2019.
[3] Christoph Molnar. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/
[4] Leila Arras, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. Explaining recurrent neural network predictions in sentiment analysis. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 159–168, 2017.
[5] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR, 2017.
[6] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating
activation differences. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference
on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153. PMLR, 2017.
[7] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image
classification models and saliency maps. In ICLR Workshop, 2014.
[8] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.
[9] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the State of Neural Network Pruning? In Proceedings of Machine Learning and Systems, 2020.
[10] Jonathan Frankle and Michael Carbin. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In ICLR, 2019.
[11] Chao Yu, Jiming Liu, and Shamim Nemati. Reinforcement Learning in Healthcare: A Survey. arXiv:1908.08796, 2019.
A Comparison of Selected Features
Table 5: HAR
Table 6: ISOLET
Table 7: MNIST
B Network
A 3-layer feed-forward neural network is used with a reduction factor of 2, meaning at each successive layer the dimensionality of the input is reduced by half. The final layer has dimensionality equal to the number of distinct classes in the dataset. Tables 8 and 9 summarize the number of neurons in each layer for the networks on both the reduced and the full feature sets.
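The following sketch constructs such a network in PyTorch; the ReLU activations are an assumption, as the nonlinearity is not specified above:

import torch.nn as nn

def make_network(d_in, n_classes):
    # Reduction factor of 2: each hidden layer halves the dimensionality
    # of its input; the final layer maps to the number of classes.
    h1, h2 = d_in // 2, d_in // 4
    return nn.Sequential(
        nn.Linear(d_in, h1), nn.ReLU(),
        nn.Linear(h1, h2), nn.ReLU(),
        nn.Linear(h2, n_classes),
    )

For example, make_network(784, 10) gives the 784-392-196-10 architecture implied for the full-feature MNIST network.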