
UNIT-4: FEATURE ENGINEERING
Prof. Atmiya Patel
Feature Transformation

■ It transforms features so that they conform to the assumptions of a model.
■ It is an important tool for dimensionality reduction.
■ Two goals of feature transformation:
– Achieving the best reconstruction of the original features.
– Achieving the highest efficiency in the learning task.
■ It can be applied to numeric or non-numeric features (e.g., text and images).
Feature Construction

■ The process discovers missing information about the relationships between features and expands the feature space by creating additional features.
■ It adds more features to the dataset.
■ Techniques:
– Quantization or Binning
– Log Transform
– Feature Scaling or Normalization, etc.
Quantization or Binning

■ The original data values that fall into a given small interval, a bin, are replaced by a value representative of that interval, often the central value. It is a form of quantization.
■ Statistical data binning is a way to group numbers of more or less continuous values into a smaller number of "bins".
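A minimal sketch of fixed-width binning with pandas; the age values and bin edges below are hypothetical, chosen only to illustrate the idea:

```python
import pandas as pd

# Hypothetical continuous values (e.g., ages) to be grouped into bins.
ages = pd.Series([3, 17, 25, 34, 49, 61, 78])

# Each value is replaced by the interval (bin) it falls into.
bins = pd.cut(ages, bins=[0, 18, 35, 60, 100],
              labels=["child", "young", "adult", "senior"])
print(bins.tolist())
```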
Log Transform

■ A powerful tool for dealing with large positive numbers that follow a heavy-tailed distribution.
■ It compresses the long tail at the high end of the distribution into a shorter tail and expands the low end into a longer head.
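A minimal sketch, assuming a hypothetical heavy-tailed count feature; log1p (the log of x + 1) is used so that zeros are handled safely:

```python
import numpy as np
import pandas as pd

# Hypothetical heavy-tailed counts (e.g., review counts per product).
counts = pd.Series([0, 3, 10, 250, 10_000, 2_000_000])

# The log transform compresses the long right tail into a much smaller range.
log_counts = np.log1p(counts)
print(log_counts.round(2).tolist())
```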
Feature Scaling / Normalization

■ Some features are bounded in value, while other numeric features increase without bound; models are affected by the scale of the input.
■ If the model is sensitive to the scale of the input features, feature scaling can help.
■ It is also called feature normalization.
■ It is done individually for each feature.
Min-Max Scaling
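As a minimal sketch, min-max scaling maps each value to (x - min) / (max - min), squeezing the feature into the range [0, 1]; the feature values below are hypothetical:

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

X = np.array([[10.0], [20.0], [30.0], [50.0]])  # hypothetical feature column

# Min-max scaling: x' = (x - min) / (max - min), so values end up in [0, 1].
scaler = MinMaxScaler()
print(scaler.fit_transform(X).ravel())
```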
Variance Scaling

■ Standardization (also called Z-score normalization or variance scaling) scales the values taking the standard deviation of the feature into account.
■ The mean of the feature is subtracted, and the result is divided by the standard deviation of the feature.
■ The resulting feature has a mean of 0 and a standard deviation of 1.
■ If the original feature has a normal distribution, then the scaled feature also has a normal distribution.
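A minimal sketch with scikit-learn's StandardScaler on a hypothetical feature column:

```python
from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[4.0], [8.0], [15.0], [16.0], [23.0], [42.0]])  # hypothetical feature

# Subtract the mean and divide by the standard deviation of the feature.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean().round(3), X_scaled.std().round(3))  # ~0.0 and 1.0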
l2 Normalization

■ It normalizes the original feature value by the l2 norm, also known as the Euclidean norm.
■ The l2 norm measures the length of the vector in coordinate space.
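A minimal sketch with NumPy on a hypothetical feature column; after dividing by the l2 norm, the column has unit length:

```python
import numpy as np

x = np.array([3.0, 4.0, 12.0])  # hypothetical feature column; its l2 norm is 13

# Divide every value by the column's Euclidean (l2) norm, sqrt(sum(x_i^2)).
x_l2 = x / np.linalg.norm(x)
print(x_l2, np.linalg.norm(x_l2))  # normalized values, and their l2 norm (1.0)
```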


Encoding Categorical Variables

■ A categorical variable is used to represent categories or labels.
■ Large categorical variables are particularly common in transactional records, for example IP addresses.
■ Even though user IDs and IP addresses are numeric, their magnitude is not relevant to the task.
■ The IP address might be relevant when doing fraud detection on individual transactions.
■ The categories of a categorical variable are usually not numeric, so an encoding method is needed to turn these non-numeric categories into numbers.
One-Hot Encoding

■ It creates new (binary) columns, indicating the presence of each possible value from the original data.
■ Each bit represents a possible category.
■ One-hot encoding is simple, but it uses more bits than strictly necessary.
■ The sum of all the bits must be equal to 1.
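A minimal sketch with pandas, using a hypothetical colour column (the same Black/Brown/Blue categories referenced in the next slide):

```python
import pandas as pd

# Hypothetical categorical column.
colours = pd.DataFrame({"colour": ["Black", "Brown", "Blue"]})

# One binary column per category; exactly one bit is 1 in each row.
one_hot = pd.get_dummies(colours, columns=["colour"])
print(one_hot)
```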
Dummy Coding

■ The problem with one-hot encoding is that it allows for k degrees of freedom, while the variable itself needs only k-1.
■ Dummy coding removes the extra degree of freedom by using only k-1 features in the representation.
■ One feature is dropped and is represented by the vector of all zeros. This is known as the reference category.
■ The column "Blue" is deleted, as it contained 0 for the first two rows.
■ The last row has both "Black" and "Brown" as 0, meaning that "Blue" must be 1.
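A minimal sketch with pandas; note that `drop_first=True` drops the first category level (alphabetically "Black" here), while the slide's example drops "Blue" instead; the reference-category idea is the same:

```python
import pandas as pd

colours = pd.DataFrame({"colour": ["Black", "Brown", "Blue"]})

# Only k-1 = 2 columns remain; the dropped category becomes the all-zeros row.
dummy = pd.get_dummies(colours, columns=["colour"], drop_first=True)
print(dummy)
```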
Feature Hashing

■ Large categorical features, such as user IDs, website URLs, IP addresses, etc., pose computational challenges in terms of memory efficiency and storage.
■ To overcome this problem, feature hashing is used; it makes working with large categorical variables less computationally intensive and yet produces accurate models that are fast to train.
■ Hashing, in general, is the process of taking input information of any length and finding a unique fixed-length representation of that input.
■ It is the process of finding a unique message digest (or hash value) that corresponds to the input information.
■ It can be used in several different domains, such as information security, cryptocurrency, high-performance programming, and creating quick lookup tables.
■ In machine learning, hash functions can be constructed for any object that can be represented numerically, such as numbers, strings, complex structures, etc.
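A minimal sketch with scikit-learn's FeatureHasher; the IP addresses and the choice of 8 output columns are hypothetical:

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical large categorical values (IP addresses) hashed into 8 columns.
hasher = FeatureHasher(n_features=8, input_type="string")
ips = [["192.168.0.1"], ["10.0.0.7"], ["172.16.4.2"]]

X = hasher.transform(ips)  # sparse matrix with a fixed number of columns
print(X.toarray())
```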
Handling Textual Features

■ We often need to apply machine learning to textual features such as product reviews, comments, story lines, news reports, etc.
■ List of techniques:
– Bag-of-Words
– Bag-of-n-Grams
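A minimal sketch of both techniques with scikit-learn's CountVectorizer; the two review sentences are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the phone is great", "the phone is not great"]  # hypothetical reviews

# Bag-of-Words counts single words; Bag-of-n-Grams also counts word pairs.
bow = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(sorted(bow.vocabulary_))
print(sorted(bigrams.vocabulary_))
```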
Feature Extraction

■ It is the process of extracting or creating a new set of features from the current dataset using some functional mapping.
■ It is used for dimensionality reduction.
■ It can be either supervised or unsupervised.
■ Popular methods for feature extraction are:
– Principal Component Analysis (PCA)
– Singular Value Decomposition (SVD)
– Linear Discriminant Analysis (LDA)
■ These are linear projection methods; PCA is unsupervised, while LDA is a supervised method.
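A minimal sketch of PCA with scikit-learn on randomly generated, hypothetical data:

```python
from sklearn.decomposition import PCA
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # hypothetical dataset with 5 features

# Unsupervised linear projection onto the 2 directions of highest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.round(2))
```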
Feature Subset Selection

■ Feature selection techniques discard unnecessary features to reduce the complexity of the resulting model.
■ It is a similar activity to dimensionality reduction.
■ The goal is a parsimonious model that is fast to compute, with little or no degradation in predictive accuracy.
Key Drivers of Feature Selection

■ Which features should be selected?
■ Which features should be excluded?
■ There are two key drivers for selecting features:
– Feature Relevance
– Feature Redundancy
Feature Relevance

■ Any feature that is irrelevant in the context of the machine learning task at hand is a potential candidate for rejection when selecting a subset of features.
■ This is decided on a case-by-case basis.
■ In the example, the "Name" feature is the most irrelevant feature for age prediction.
Feature Redundancy

■ A feature may contribute information that is similar to the information contributed by one or more other features in the same data set.
■ All features having potential redundancy are candidates for rejection in the final feature subset.
■ The "Site length", "Site Breadth" and "Site Area" features all describe the dimensions of the site, so the redundant ones can be removed.
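A minimal sketch of spotting redundancy through correlation; the site measurements below are hypothetical, with "Site Area" equal to length times breadth:

```python
import pandas as pd

# Hypothetical site data: "Site Area" is length * breadth, hence redundant.
df = pd.DataFrame({
    "Site length": [20, 30, 25, 40],
    "Site Breadth": [10, 15, 12, 20],
    "Site Area": [200, 450, 300, 800],
})

# Highly correlated columns carry overlapping information and are
# candidates for rejection in the final feature subset.
print(df.corr().round(2))
```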
Overall Feature Selection Process

■ Generation of possible subsets.
■ Subset evaluation.
■ Stop searching based on some stopping criterion.
■ Validation of the result with respect to the chosen subset.
Feature Selection Approaches

1. Filter: features are pre-processed to remove the ones that are unlikely to be useful for the model.
2. Wrapper: allows trying out subsets of features against the actual model.
3. Hybrid: takes advantage of both the filter and wrapper approaches.
4. Embedded: performs feature selection as part of the model training process.
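A minimal sketch of the filter approach with scikit-learn's SelectKBest, scoring each feature independently with an ANOVA F-test on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Filter approach: score each feature on its own and keep the best 2.
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
print(X_new.shape, selector.get_support())  # which features were kept
```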
Thank you…
