Embedded Methods for Feature Selection in Neural Networks

A Preprint
Vinay Varma K.
Machine Learning Engineer
Niveshi
New Delhi, PIN 110034
[email protected]
Abstract
The representational capacity of modern neural network architectures has made them the default choice in various applications with high dimensional feature sets. But these high dimensional and potentially noisy features, combined with black box models like neural networks, negatively affect the interpretability, generalizability, and training time of these models. Here, I propose two integrated approaches for feature selection that can be incorporated directly into the parameter learning. One involves adding a drop-in layer and performing sequential weight pruning. The other is a sensitivity-based approach. I benchmarked both methods against Permutation Feature Importance (PFI) - a general-purpose feature ranking method - and a random baseline. The suggested approaches turn out to be viable methods for feature selection, consistently outperforming the baselines on the tested datasets - MNIST, ISOLET, and HAR. They can be added to any existing model with only a few lines of code.
1 Introduction
1.1 Motivation
Outside of domains like Computer Vision and Natural Language Processing (NLP), identifying a relevant input feature set for the task is not obvious. For example, consider the problem of forecasting stock returns. Here, potentially relevant features might range from the returns of the index and other stocks to the tweets of the CEO. Even in applications where domain expertise renders a good feature set, the predictions from the model are rarely the end goal.1 This work aims to present two simple methods that reliably give a feature importance ranking in neural networks. It's important to note that the methods need to be embedded - they use the specific predictor (in this case, neural networks) and are integrated into the parameter learning. This ensures that interaction effects relevant to the optimization criteria are taken into account and that the resulting redundancies are removed. Though the results are presented on feed-forward neural networks in a classification setting, by construction the same methods can be used with any learning criterion and architecture.
1.2 Background
For a given dataset X ∈ R^{n×d}, any feature ranking method returns a list I = [f1, f2, ..., fd]. The magnitude of the elements in the list is used to rank order the features. Here I detail the Permutation Feature Importance (PFI) [1] algorithm, which I used as a baseline.
1 Complete end-to-end decision-making systems aren't yet possible with machine learning, at least in high stakes environments. For instance, consider using machine learning for decision making in healthcare [11]. Though, in theory, diagnosing and treating a patient can be formulated under active reinforcement learning, present machine learning models only augment the expertise of the human doctor.
PFI is a widely used importance attribution technique for random forests. Here, I adapted the model-agnostic version [2] of the method for neural networks. PFI has numerous shortcomings, especially under correlated features [3]. Nevertheless, the generality and simplicity of the method make it a good baseline.
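For reference, the following is a minimal sketch of the model-agnostic PFI baseline, assuming a fitted classifier with a scikit-learn-style predict method and a metric(y_true, y_pred) function (both interface assumptions, as the paper does not specify one):

import numpy as np

def permutation_feature_importance(model, X, y, metric, c=10, seed=0):
    # Model-agnostic PFI: the importance of feature j is the mean drop in
    # `metric` when column j of X is randomly permuted, averaged over c shuffles.
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(c):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])  # permute a single feature column in place
            drops.append(baseline - metric(y, model.predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances  # I = [f1, ..., fd]; a larger drop means more important

Note the nested loop over features and permutations: this is the linear scaling with dimensionality discussed in Section 2.1.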
The contributions of the present work are two-fold:
1. A drop-in layer with iterative pruning of its weights is shown to be a viable feature selection method.
2. Many prediction-explanation methods [4, 5, 6, 7] are used in neural networks. These are post-hoc, instance-wise explainers. This work uses a gradient-based sensitivity method for global feature selection. I also propose that the testing framework used herein, where the selected features are removed and the model is retrained, should be adopted to approximate the relative validity of any feature attribution technique. As the approach doesn't distinguish between instance-specific and global attribution methods, the results allow us to compare between them and also to use them for the task of feature selection.
2 Proposed Methods
2.1 Sensitivity based Selection (SBS)
Several feature attribution methods exist that are specific to neural networks. These approaches typically involve backpropagating an importance signal from the output neuron to the input features. For instance, Simonyan et al. used saliency maps [7] for images, which involve computing the gradient of the output w.r.t. the pixels of an input image. To the best of my knowledge, these are designed as visualization aids and are not used for feature selection. The sensitivity based selection method can be summarized as follows. Consider a classification problem with input X ∈ R^{n×d}. First, train the model on this data. Now, for each data point x ∈ X, compute the gradient of the output label w.r.t. x on this model. Take the absolute value of the average of this gradient across all the data instances. The resulting list of values I = [f1, f2, ..., fd] gives the global feature sensitivity of the model under consideration. Similarly, in a regression setting, the gradient can be taken w.r.t. the error signal. The intuition is that the absolute value of the partial derivative of the relevant output neuron (in a classification setting, the corresponding class probability) w.r.t. an input feature quantifies that feature's importance. This method can be seen as adapting PFI to the special case of neural networks using gradient information. PFI, being model agnostic, randomly permutes a feature and uses the drop in accuracy (or another similar metric) as a proxy for feature importance. Moreover, with PFI the computational cost scales linearly with the dimensionality of the dataset. Neural networks implemented with libraries like PyTorch [8] allow the computation of this feature sensitivity in a single forward and backward pass through the network.
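A minimal PyTorch sketch of SBS follows, assuming the model outputs class logits, inputs are flat vectors of shape (batch, d), and the data arrives as a standard (x, y) DataLoader:

import torch

def sensitivity_based_selection(model, loader, device="cpu"):
    # Global feature sensitivity: the absolute value of the average gradient
    # of the predicted class probability w.r.t. the input, over all instances.
    model.eval()
    total, n_seen = 0.0, 0
    for x, _ in loader:
        x = x.to(device).requires_grad_(True)
        probs = torch.softmax(model(x), dim=1)
        # one scalar per instance: the probability of the predicted class
        top = probs.max(dim=1).values.sum()
        (grads,) = torch.autograd.grad(top, x)
        total = total + grads.sum(dim=0)  # accumulate per-feature gradients
        n_seen += x.shape[0]
    return (total / n_seen).abs()         # |mean gradient|, shape (d,)

The returned vector plays the role of I = [f1, ..., fd]: sorting it gives the global feature ranking.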
2.2 Stepwise Weight Pruning Algorithm (SWPA)
In neural networks, weight pruning refers to systematically removing parameters from an existing network. For a given reduction in parameters, pruning techniques aim to minimize the loss in the performance of the original model.2 Here, I use a drop-in layer that can be incorporated into any neural network and sequentially prune its weights to arrive at the most important ones. The idea is that if these weights W ∈ R^{1×d} are in the first layer of the network and are all initialized to ones, and the output of the first layer O = {w1·x1, ..., wd·xd} is the elementwise multiplication of the weights with the corresponding
2 Performance doesn't necessarily always decrease with pruning; see [10].
Figure 1: The Drop-in Layer, with weights w1, ..., wd gating the inputs x1, ..., xd elementwise.
elements of the input, then pruning a weight wi has a direct interpretation of removing the feature xi. The complete algorithm is summarized below.
Algorithm 2: Stepwise Weight Pruning Algorithm (SWPA)
Input: training data X ∈ R^{n×d}, training labels Y, base network fθ(.), Drop-in Layer W, Step Counter n ∈ Z≥1, Selection Factor f ∈ [0, 1]
1   for count in 1, ..., n + 1 do
2       O ← {w1·x1, ..., wd·xd}
3       if count > 1 then
4           k ← ((1 − f)/n) · d
5           Sort the weights W of the Drop-in Layer based on their absolute value.
6           Set the least k of them to 0.
7       end
8       Train the base network on O
9   end
10  Take the features corresponding to the top f fraction of the weights in W based on their absolute value and train them on the base network.
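To make the procedure concrete, here is a minimal PyTorch sketch of the Drop-in Layer and the pruning loop. The helper train_fn (assumed to train the given model to convergence) and the masking buffer that keeps pruned weights at zero across training rounds are assumptions not specified above:

import torch
import torch.nn as nn

class DropInLayer(nn.Module):
    # Elementwise gate on the inputs: masking entry i removes feature x_i.
    def __init__(self, d):
        super().__init__()
        self.w = nn.Parameter(torch.ones(d))         # weights initialized to ones
        self.register_buffer("mask", torch.ones(d))  # 0 where a weight is pruned
    def forward(self, x):
        return self.w * self.mask * x                # O = {w1*x1, ..., wd*xd}

def swpa(train_fn, base_net, d, n=4, f=0.1):
    # Sketch of SWPA: each pruning step zeroes k = ((1 - f)/n)*d more of the
    # live gate weights, so n steps leave the top f fraction of the features.
    gate = DropInLayer(d)
    model = nn.Sequential(gate, base_net)
    k = int((1 - f) / n * d)
    for count in range(1, n + 2):                    # n + 1 training rounds
        if count > 1:
            score = (gate.w * gate.mask).detach().abs()
            score[gate.mask == 0] = float("inf")     # skip already-pruned entries
            gate.mask[torch.argsort(score)[:k]] = 0.0
        train_fn(model)
    live = (gate.w * gate.mask).detach().abs()
    return torch.topk(live, int(f * d)).indices      # selected feature indices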
3 Experiments
momentum of 0.9. Other hyperparameters include the Step Counter n of SWPA, which is set to 4, the Selection Factor f, set to 0.1, and the number of random permutations c in PFI, set to 10.
3.2 Results
In Table 1, both the top and the bottom 10% of the features are selected based on the proposed methods, and the corresponding accuracies on the test set are presented. Both methods are able to effectively rank order the input features. Of the two, SWPA outperforms on all the datasets. Table 2 summarizes the results on the baseline - Permutation Feature Importance (PFI). As expected, PFI reduces the feature importance when a correlated feature is added [3] and shares it between the two. This explains the small accuracy gap between the top and the bottom features, as informative features are spread between both groups.3 But it performs better than random assignment, acting as a reasonable baseline. For the accuracies on the randomly chosen features, 10 runs are used.
Dataset    (n, d)         #feat   Top SWPA   Btm SWPA   Top SBS   Btm SBS
HAR        (5744, 561)    56      0.976      0.654      0.932     0.574
ISOLET     (7797, 617)    61      0.900      0.411      0.841     0.498
MNIST      (60000, 784)   78      0.941      0.134      0.880     0.153
Table 1: Test-set classification accuracies on the top and bottom 10% of features selected by SWPA and SBS.

Table 2: Classification accuracies on PFI based features and the randomly assigned ones.
4 Ablations
4.1 Effect of Step Counter
In this section, I answer the question: is the Step Counter n > 1 (SWPA) adding any value? I compared the accuracy of the resulting models with n = 1 and n = 4. It is interesting to note that sequential pruning adds a noticeable improvement in all the cases. In Figure 2, I have chosen two instances of the MNIST dataset that SWPA (n = 1) misclassified. About 35% of the features in SWPA (n = 4) are different from those in SWPA (n = 1). Given the performance of the model with SWPA (n = 4), this indicates that these features are responsible for the marginal gains.
4.2 Effect of Adding Constraints
Here, I tested the effect of adding various constraints to the original objective function L(fθ, W). First, I added l1 regularization on the weights W of the Drop-in Layer:
3 All the used datasets have a high degree of correlated features.
l1 = λ · ‖W‖1    (1)
Further, to make the weights more spread out and hence distinctive, I also considered adding a weight variance loss (wvl) as in (2). In all these experiments I chose n = 1 for simplicity. λ and γ are set to 1 and 10, respectively.
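A minimal sketch of the resulting objective in PyTorch follows. Since equation (2) is not reproduced above, the exact form of the variance term (subtracting the weight variance, scaled by γ, to reward spread-out weights) is an assumption:

import torch

def constrained_loss(task_loss, W, lam=1.0, gamma=10.0):
    # l1 penalty on the Drop-in Layer weights, as in eq. (1)
    l1 = lam * W.abs().sum()
    # Assumed form of the weight variance loss: lower loss for
    # higher-variance (more spread out, more distinctive) weights.
    wvl = -gamma * W.var()
    return task_loss + l1 + wvl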
Table 4: Effects of adding constraints - l1 and weight variance factor (wvf). A comparison of the features selected by each of these methods is given in Appendix A.
As there is no considerable difference in performance, I conclude that the marginal benefit of adding constraints like l1
or wvf is negligible for the task of selecting informative features.
5 Conclusion
As neural network based methods are becoming the default choice for modeling problems in high stakes applications like finance and medicine, maintaining the model and being able to reason about its predictions is crucial. Moreover, in real-world applications, performance on the validation set or even the test set doesn't give a complete picture, as statistical correlations can always break and new ones can emerge in the future. In this regard, a simpler model with fewer features is always desirable. In this work, I proposed two approaches for feature selection in neural networks, of which a simple pruning based method (SWPA) outperformed on all the datasets.
References
[1] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[2] Aaron Fisher, Cynthia Rudin, and Francesca Dominici. All Models are Wrong, but Many are Useful: Learning a Variable's Importance by Studying an Entire Class of Prediction Models Simultaneously. Journal of Machine Learning Research, 20(177):1–81, 2019.
[3] Christoph Molnar. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. https://christophm.github.io/interpretable-ml-book/
[4] Leila Arras, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. Explaining recurrent neural network predictions in sentiment analysis. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 159–168, 2017.
[5] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR, 2017.
[6] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating
activation differences. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference
on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153. PMLR, 2017.
[7] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image
classification models and saliency maps. In ICLR Workshop, 2014.
[8] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.
[9] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the State of Neural Network Pruning? In Proceedings of Machine Learning and Systems, 2020.
[10] Jonathan Frankle and Michael Carbin. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In ICLR, 2019.
[11] Chao Yu, Jiming Liu, and Shamim Nemati. Reinforcement Learning in Healthcare: A Survey. arXiv:1908.08796, 2019.
A Comparison of Selected Features
Table 5: HAR
Table 6: ISOLET
Table 7: MNIST
B Network
A 3-layer feed-forward neural network is used with a reduction factor of 2, meaning at each successive layer the dimensionality of the input is reduced by half. The final layer has dimensionality equal to the number of distinct classes in the dataset. Tables 8 and 9 summarize the number of neurons in each layer for the networks on both the reduced and the full feature sets.
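The following sketch constructs such a network in PyTorch; the ReLU activations are an assumption, as the nonlinearity is not specified above:

import torch.nn as nn

def make_network(d_in, n_classes):
    # Reduction factor of 2: each hidden layer halves the dimensionality
    # of its input; the final layer maps to the number of classes.
    h1, h2 = d_in // 2, d_in // 4
    return nn.Sequential(
        nn.Linear(d_in, h1), nn.ReLU(),
        nn.Linear(h1, h2), nn.ReLU(),
        nn.Linear(h2, n_classes),
    )

For example, make_network(784, 10) gives the 784-392-196-10 architecture implied for the full-feature MNIST network.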