Module-3 DSV
Module-3 Syllabus:
Feature Generation and Feature Selection
Extracting Meaning from Data: Motivating application: user (customer) retention. Feature
Generation (brainstorming, role of domain expertise, and place for imagination), Feature Selection
algorithms. Filters; Wrappers; Decision Trees; Random Forests. Recommendation Systems: Building
a User-Facing Data Product, Algorithmic ingredients of a Recommendation Engine, Dimensionality
Reduction, Singular Value Decomposition, Principal Component Analysis, Exercise: build your own
recommendation system.
• Feature generation involves identifying, creating, and selecting meaningful variables from the raw data that can
be used in machine learning models to make predictions or understand patterns.
• This process is both an art and a science. Having a domain expert involved is beneficial, but
using creativity and imagination is equally important.
• Remember, feature generation is constrained by two factors: the feasibility of capturing certain
information and the awareness to consider capturing it.
2. Explain the importance of feature selection along with the types of feature selection methods.
Feature Selection
Feature Selection refers to the process of selecting the most relevant features (or variables) from
the dataset to use in building a predictive model. The goal is to improve the performance of the
model by eliminating irrelevant or redundant features, which can lead to better accuracy, reduced
overfitting, and more efficient computation.
Importance of Feature Selection: Feature selection is crucial because it helps in simplifying
models, making them easier to interpret, reducing computational cost, and often improving the
generalization of the model by reducing overfitting.
Filter Methods:
Filters prioritize features based on specific metrics or statistics, such as correlation with the
outcome variable, offering a quick overview of each feature's predictive power. They use statistical
techniques to evaluate the relevance of each feature individually based on its relationship with
the target variable. Examples include correlation coefficients, Chi-square tests, and mutual
information.
A subset of features is selected based on their relationship to the target variable. The selection
is independent of any machine learning algorithm; instead, filter methods measure the
"relevance" of the features with respect to the output via statistical tests. Which test is
appropriate depends on whether the feature and the target are continuous or categorical (for
example, Pearson's correlation for a continuous feature and continuous target, and the Chi-square
test when both are categorical).
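A minimal sketch of a filter method is given below, using scikit-learn's SelectKBest with mutual information as the scoring statistic; the customer-retention-style column names and the tiny dataset are made up for illustration, and no model is involved in the ranking.

# Filter method sketch: rank features by a statistic computed against the target,
# independently of any learning algorithm. Data below is made up for illustration.
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

df = pd.DataFrame({
    "visits_per_week":       [1, 5, 2, 7, 0, 6, 3, 8],
    "days_since_last_login": [30, 2, 20, 1, 45, 3, 15, 1],
    "account_age_days":      [100, 400, 120, 380, 90, 410, 200, 500],
    "churned":               [1, 0, 1, 0, 1, 0, 1, 0],
})

X = df.drop(columns="churned")
y = df["churned"]

# Score every feature by its mutual information with the target and keep the top 2.
selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)

print(pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False))
print("selected:", list(X.columns[selector.get_support()]))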
Wrapper methods
In wrapper methods, the feature selection process is based on a specific machine
learning algorithm that we are trying to fit on a given dataset.
It follows a greedy search approach, evaluating candidate combinations of features
against the evaluation criterion. The evaluation criterion is simply a performance
measure which depends on the type of problem.
Forward selection:
Forward selection involves systematically adding features to a regression model one at a time based
on their ability to improve model performance according to a selection criterion. This iterative
process continues until further feature additions no longer enhance the model performance.
Backward elimination:
Begins with a regression model containing all features. One feature is then removed at a time:
the one whose removal gives the biggest improvement in the selection criterion. Removal stops
when dropping any further feature would make the selection criterion worse.
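Both search directions can be sketched with scikit-learn's SequentialFeatureSelector, which greedily adds or removes features based on the cross-validated performance of a chosen model; the breast-cancer dataset and logistic regression are arbitrary choices for illustration, not part of the original notes.

# Wrapper method sketch: the selection is driven by the performance of a specific model.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Forward selection: start with no features and repeatedly add the most helpful one.
forward = SequentialFeatureSelector(model, n_features_to_select=5,
                                    direction="forward", cv=5).fit(X, y)
print("forward selection keeps   :", forward.get_support(indices=True))

# Backward elimination: start with all features and repeatedly drop the least useful one.
backward = SequentialFeatureSelector(model, n_features_to_select=5,
                                     direction="backward", cv=5).fit(X, y)
print("backward elimination keeps:", backward.get_support(indices=True))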
B. Selection Criterion
The choice of selection criteria in feature selection methods may seem arbitrary. To address this,
experimenting with various criteria can help assess model robustness. Different criteria may yield
diverse models, necessitating the prioritization of optimization goals based on the problem context
and objectives.
R-squared
R-squared can be interpreted as the proportion of variance explained by your model.
p-values
In regression analysis, the interpretation of p-values involves assuming a null hypothesis where the
coefficients (βs) are zero. A low p-value suggests that observing the data and obtaining the
estimated coefficient under the null hypothesis is highly unlikely, indicating a high likelihood that
the coefficient is non-zero.
AIC (Akaike Information Criterion)
Given by the formula 2k−2ln(L), where k is the number of parameters in the model and
ln(L) is the “maximized value of the log likelihood.” The goal is to minimize AIC.
BIC (Bayesian Information Criterion)
Given by the formula k*ln(n) −2ln(L), where k is the number of parameters in the model, n
is the number of observations (data points, or users), and ln(L) is the maximized value of
the log likelihood. The goal is to minimize BIC.
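To make the two criteria concrete, here is a tiny helper that implements the formulas above directly; the log-likelihood, k and n values in the example are made-up numbers, not taken from the notes.

import numpy as np

def aic(log_likelihood, k):
    # AIC = 2k - 2 ln(L); smaller is better.
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    # BIC = k ln(n) - 2 ln(L); penalizes extra parameters more strongly as n grows.
    return k * np.log(n) - 2 * log_likelihood

# Illustrative numbers: a 3-parameter model, 100 observations, maximized log-likelihood -120.5.
print(aic(-120.5, k=3))         # 247.0
print(bic(-120.5, k=3, n=100))  # about 254.8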
Entropy
Entropy measures the impurity (uncertainty) of a set of labels: H = -Σ p_i log2(p_i), where p_i is
the proportion of class i. A split that lowers entropy produces purer, more informative partitions;
this is the criterion used by the decision trees discussed below.
Embedded Methods: These involve algorithms that perform feature selection during the model
training process. Regularization methods are the typical example: LASSO (Least Absolute Shrinkage
and Selection Operator) shrinks some coefficients exactly to zero and thereby selects features,
while the closely related Ridge Regression only shrinks coefficients without eliminating them.
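A brief sketch of embedded selection with LASSO follows; the diabetes dataset is an arbitrary choice, and LassoCV simply picks the regularization strength by cross-validation while the fit itself zeroes out some coefficients.

# Embedded method sketch: selection happens during training, because LASSO drives
# some coefficients exactly to zero. Dataset choice is for illustration only.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # regularization is sensitive to feature scale

lasso = LassoCV(cv=5).fit(X, y)
print("non-zero coefficient indices:", np.flatnonzero(lasso.coef_))
print("coefficients:", np.round(lasso.coef_, 1))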
Decision Trees
A decision tree is a non-parametric supervised learning algorithm used for both classification and
regression tasks. It is a hierarchical model used in decision support that depicts decisions and their
potential outcomes, incorporating chance events, resource costs, and utility, and it is built from
conditional control statements. The tree structure consists of a root node, branches, internal nodes,
and leaf nodes.
It is a tool with applications spanning several different areas. As the name suggests, it uses a
flowchart-like tree structure to show the predictions that result from a series of feature-based
splits: it starts at a root node and ends with a decision made at the leaves, which makes the
resulting models easy to understand.
Example of Decision Tree
https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/
https://medium.com/geekculture/step-by-step-decision-tree-id3-algorithm-from-scratch-in-python-no-fancy-library-4822bbfdd88f
Solution:
First, check which attribute provides the highest Information Gain in order to split the
training set based on that attribute. We need to calculate the expected information to
classify the set and the entropy of each attribute.
The entropy of the full training set (9 "yes" and 5 "no" records) is:
Entropy(S) = I(9,5) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.94
The information gain of an attribute is this entropy minus the weighted entropy of the partitions
produced by splitting on that attribute.
For Age, we have three values: age<=30 (2 yes and 3 no), age31..40 (4 yes and 0 no), and
age>40 (3 yes and 2 no):
Entropy(age) = 5/14(0.971) + 4/14(0) + 5/14(0.971) = 0.694, so Gain(Age) = 0.940 - 0.694 = 0.246
For Income, we have three values: income=high (2 yes and 2 no), income=medium (4 yes and 2 no),
and income=low (3 yes and 1 no):
Entropy(income) = 4/14(1) + 6/14(0.918) + 4/14(0.811) = 0.911, so Gain(Income) = 0.940 - 0.911 = 0.029
Next, consider the Student attribute. For Student, we have two values: student=yes (6 yes and 1 no)
and student=no (3 yes and 4 no):
Entropy(student) = 7/14(0.5916) + 7/14(0.9852) = 0.789, so Gain(Student) = 0.940 - 0.789 = 0.151
For Credit_Rating, we have two values: credit_rating=fair (6 yes and 2 no) and
credit_rating=excellent (3 yes and 3 no):
Entropy(credit_rating) = 8/14(0.8112) + 6/14(1) = 0.892, so Gain(Credit_Rating) = 0.940 - 0.892 = 0.048
Since Age has the highest Information Gain, we start splitting the dataset using
the Age attribute.
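The hand calculation above can be reproduced with a few lines of Python; the yes/no counts are taken from the example, and small differences in the last decimal place are only due to rounding in the manual computation.

# Recompute the information gains for the 14-record buys_computer training set.
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(parent, splits):
    # splits: one (yes, no) pair of counts per attribute value.
    n = sum(parent)
    remainder = sum(sum(s) / n * entropy(s) for s in splits)
    return entropy(parent) - remainder

S = (9, 5)  # 9 yes, 5 no
print("Entropy(S)          =", round(entropy(S), 3))                              # 0.940
print("Gain(Age)           =", round(info_gain(S, [(2, 3), (4, 0), (3, 2)]), 3))  # 0.247
print("Gain(Income)        =", round(info_gain(S, [(2, 2), (4, 2), (3, 1)]), 3))  # 0.029
print("Gain(Student)       =", round(info_gain(S, [(6, 1), (3, 4)]), 3))          # 0.152
print("Gain(Credit_Rating) =", round(info_gain(S, [(6, 2), (3, 3)]), 3))          # 0.048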
Decision Tree after step 1
Since all records under the branch age31..40 belong to the same class (Yes), we can replace
that node with a leaf labelled Class = Yes.
The same process of splitting has to happen for the two remaining branches.
Left sub-branch
For branch age<=30 we still have attributes income, student, and credit_rating. Which
one should be used to split the partition?
For Income, we have three values: income=high (0 yes and 2 no), income=medium (1 yes and 1 no)
and income=low (1 yes and 0 no), which still leaves mixed partitions.
For Student, we have student=yes (2 yes and 0 no) and student=no (0 yes and 3 no): both partitions
are pure, so Entropy(student) = 0 and the information gain is the maximum possible (0.971).
We can therefore safely split on the Student attribute without checking the other attributes,
since its information gain is maximized.
Right sub-branch
The entropy of this partition is Entropy(S_age>40) = I(3,2) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.97
For Income, we have two values: income=medium (2 yes and 1 no) and income=low (1 yes and 1 no):
Entropy(income) = 3/5(-2/3 log2(2/3) - 1/3 log2(1/3)) + 2/5(-1/2 log2(1/2) - 1/2 log2(1/2))
= 3/5(0.918) + 2/5(1) = 0.951, so Gain(Income) = 0.97 - 0.951 = 0.02
For Credit_Rating, we have two values: credit_rating=fair (3 yes and 0 no) and
credit_rating=excellent (0 yes and 2 no):
Entropy(credit_rating) = 0, so Gain(Credit_Rating) = 0.97, the maximum possible.
We then split on Credit_Rating. These splits give partitions whose records all belong to the
same class, so we simply turn them into leaf nodes with their class labels attached:
credit_rating = fair → Buys_computer = yes
credit_rating = excellent → Buys_computer = no
4. Explain the Random Forest algorithm with an example. (or) What are the drawbacks of
decision trees? Mention the key features of Random Forest and explain how it works.
Random Forest
High Predictive Accuracy: Imagine Random Forest as a team of decision-making wizards. Each
wizard (decision tree) looks at a part of the problem, and together, they weave their insights into a
powerful prediction tapestry. This teamwork often results in a more accurate model than what a
single wizard could achieve.
Resistance to Overfitting: Because each tree is trained on a different bootstrap sample and a random
subset of features, the ensemble does not get too caught up in any one view of the training data,
which makes the model less prone to overfitting.
Large Datasets Handling: Each helper takes on a part of the dataset, ensuring that the
expedition is not only thorough but also surprisingly quick.
Variable Importance Assessment: It assesses the importance of each clue in solving the case,
helping you focus on the key elements that drive predictions.
Built-in Cross-Validation: The out-of-bag (OOB) samples left out of each tree's bootstrap sample act
as a built-in validation set, so the model doesn't just ace the training data but is also checked on
records it has not seen.
https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/
4. Apply the random forest algorithm to the iris dataset. Compute the accuracy using a decision tree
and a random forest and justify the results.
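A possible sketch of this exercise with scikit-learn is shown below; exact accuracies depend on the train/test split and random seed, but the comparison illustrates the point that the forest is at least as accurate as a single tree.

# Compare a single decision tree with a random forest on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("decision tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))
print("random forest accuracy:", accuracy_score(y_test, forest.predict(X_test)))
# Justification: averaging many de-correlated trees reduces variance, so the forest
# generalizes at least as well as a single tree; on an easy dataset like iris both
# scores are high and may even tie.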
A recommendation engine filters the data using different algorithms and recommends
the most relevant items to users. It first captures the past behavior of a customer and
based on that, recommends products which the users might be likely to buy.
If a completely new user visits an e-commerce site, that site will not have any past
history of that user. So how does the site go about recommending products to the user
in such a scenario? One possible solution could be to recommend the best selling
products, i.e. the products which are high in demand. Another possible solution could
be to recommend the products which would bring the maximum profit to the business.
If we can recommend a few items to a customer based on their needs and interests, it
will create a positive impact on the user experience and lead to frequent visits. Hence,
businesses nowadays are building smart and intelligent recommendation engines by
studying the past behavior of their users.
Data collection is the first and most crucial step for building a recommendation engine. The data
can be collected by two means: explicitly and implicitly. Explicit data is information
that is provided intentionally, i.e. input from the users such as movie ratings. Implicit
data is information that is not provided intentionally but gathered from available data
streams like search history, clicks, order history, etc.
Filtering Algorithms
1. Content based filtering
This algorithm recommends products which are similar to the ones that a user has liked
in the past.
The similarity between two movies' feature vectors is measured with cosine similarity,
cos(θ) = (A·B) / (||A|| ||B||). Based on this cosine value, which ranges from -1 to 1, the movies
are arranged in descending order and one of the two approaches below is used for recommendations:
Top-n approach: where the top n movies are recommended (Here n can be decided by
the business)
Rating scale approach: Where a threshold is set and all the movies above that
threshold are recommended
Euclidean Distance: Similar items will lie in close proximity to each other if plotted in
n-dimensional space. So, we can calculate the distance between items and, based on
that distance, recommend items to the user. The Euclidean distance between items x and y is
given by:
d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
Pearson's Correlation: It tells us how much two items are correlated; the higher the
correlation, the greater the similarity. Pearson's correlation can be calculated using
the following formula:
r = Σ(x_i - x̄)(y_i - ȳ) / sqrt(Σ(x_i - x̄)^2 · Σ(y_i - ȳ)^2)
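The three measures above can be sketched in a few lines of NumPy; the two "movie feature vectors" are made up purely to illustrate how the numbers are read.

# Similarity/distance measures commonly used in content-based filtering.
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def pearson_correlation(a, b):
    return np.corrcoef(a, b)[0, 1]

movie_1 = np.array([4.0, 1.0, 2.0, 4.0])   # made-up feature vectors
movie_2 = np.array([5.0, 1.0, 3.0, 4.0])

print("cosine   :", round(cosine_similarity(movie_1, movie_2), 3))   # near 1 -> very similar
print("euclidean:", round(euclidean_distance(movie_1, movie_2), 3))  # small distance -> similar
print("pearson  :", round(pearson_correlation(movie_1, movie_2), 3)) # near 1 -> strongly correlated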
2. Collaborative filtering
Let us understand this with an example. If person A likes 3 movies, say Interstellar,
Inception and Predestination, and person B likes Inception, Predestination and The
Prestige, then they have almost similar interests. We can say with some certainty
that A should like The Prestige and B should like Interstellar. The collaborative
filtering algorithm uses “User Behavior” for recommending items.
User-User collaborative filtering
This algorithm first finds the similarity score between users. Based on this similarity
score, it then picks out the most similar users and recommends products which these
similar users have liked or bought previously.
This algorithm finds the similarity between each user based on the ratings they have
previously given to different movies. The prediction of an item for a user u is
calculated by computing the weighted sum of the user ratings given by other users
to an item i.
The prediction Pu,i is given by:
Pu,i = Σv (Rv,i · Su,v) / Σv Su,v
Here,
Pu,i is the prediction of item i for user u
Rv,i is the rating given by a user v to a movie i
Su,v is the similarity between users u and v, and the sums run over the users v who have rated item i
Let us understand it with an example:
The correlation between user A and C is more than the correlation between B and C.
Hence users A and C have more similarity and the movies liked by user A will be recommended
to user C and vice versa.
This algorithm is quite time consuming as it involves calculating the similarity for
each user and then calculating prediction for each similarity score. One way of
handling this problem is to select only a few users (neighbors) instead of all to
make predictions, i.e. instead of making predictions for all similarity values, we
choose only few similarity values.
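The prediction formula above can be sketched as follows; the tiny rating matrix is made up, Pearson correlation is used as the similarity S(u,v), and only positively correlated neighbours who rated the item are kept, in the spirit of the neighbour-selection idea just described.

# User-user collaborative filtering sketch: P(u,i) = sum_v(R(v,i)*S(u,v)) / sum_v(S(u,v)).
import numpy as np
import pandas as pd

ratings = pd.DataFrame(
    {"m1": [4, 1, np.nan], "m2": [5, 2, 4], "m3": [1, 5, 2], "m4": [4, 2, 3]},
    index=["A", "B", "C"],
)

def similarity(u, v):
    # Pearson correlation over the items both users have rated.
    common = ratings.loc[u].notna() & ratings.loc[v].notna()
    return np.corrcoef(ratings.loc[u, common], ratings.loc[v, common])[0, 1]

def predict(user, item):
    # Neighbours: other users who rated the item and are positively correlated with `user`.
    others = [v for v in ratings.index if v != user and pd.notna(ratings.loc[v, item])]
    sims = {v: similarity(user, v) for v in others}
    sims = {v: s for v, s in sims.items() if s > 0}
    return sum(s * ratings.loc[v, item] for v, s in sims.items()) / sum(sims.values())

# C's ratings correlate strongly with A's and negatively with B's, so the prediction
# for movie m1 is driven by A's rating of m1.
print(predict("C", "m1"))   # 4.0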
Item-Item collaborative filtering
In this algorithm, we compute the similarity between each pair of movies and recommend movies
similar to the ones a user has already rated highly. It works analogously to user-user collaborative
filtering, but the prediction uses the weighted sum of ratings of "item-neighbours" instead of
"user-neighbours".
User/Movie         x1    x2    x3    x4    x5
A                   4     1     2     4     4
B                   2     4     4     2     1
C                   –     1     –     3     4
Mean Item Rating    3     2     3     3     3
Here the mean item rating is the average of all ratings given to a particular item (compare this
with the table used for user-user filtering). Instead of finding user-user similarity, we now
calculate item-item similarity. For example, to compare movies (x1, x4) and (x1, x5), we look at
the users who have rated both items: the common users who have rated movies x1 and x4 are A and B,
and the users who have rated movies x1 and x5 are also A and B.
The similarity between movie x1 and x4 is more than the similarity between movie x1
and x5. So based on these similarity values, if any user searches for movie x1, they
will be recommended movie x4 and vice versa.
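To make the comparison concrete, the similarity values can be checked with cosine similarity over the ratings of the common users A and B (the notes do not fix a particular similarity metric, so cosine is an illustrative choice):

# Item-item similarity check using the ratings of the common users A and B.
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x1 = np.array([4, 2])   # ratings of movie x1 by users A and B
x4 = np.array([4, 2])   # ratings of movie x4 by users A and B
x5 = np.array([4, 1])   # ratings of movie x5 by users A and B

print("sim(x1, x4) =", round(cosine(x1, x4), 3))  # 1.0
print("sim(x1, x5) =", round(cosine(x1, x5), 3))  # about 0.976 -> x4 is the closer match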
Data Quality Issues: Check the quality and consistency of your data. Ensure that your
data is clean, free from outliers, and properly preprocessed.
Distance Metric Selection: The choice of distance metric (e.g., Euclidean distance,
cosine similarity) can significantly impact the performance of the algorithm.
Experiment with different metrics to see which one best captures the similarity between
items or users in your dataset.
Cold Start Problem: Nearest neighbor algorithms may struggle with cold start
problems, where there isn't enough data available for new users or items. Consider
using hybrid approaches or incorporating content-based features to handle this scenario.
Scalability: For large datasets, computing distances between all pairs of items or users
can be computationally expensive. Look into approximate nearest neighbor methods or
data structures like KD-trees or Ball trees to improve efficiency (see the sketch after this list).
Normalization: Ensure that features used for calculating similarity are properly
normalized to prevent certain features from dominating the distance calculation.
User/item representation: Make sure that your representation of users and items
(feature vectors) appropriately captures the relevant characteristics that define similarity
in your recommendation context.
Evaluation Metrics: Use appropriate evaluation metrics (e.g., precision, recall, RMSE
for rating prediction) to assess the performance of your recommendation system and
identify areas for improvement.
By systematically addressing these potential issues, you should be able to diagnose and
improve the performance of your nearest neighbor algorithm in your recommendation
system.
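As a sketch of the scalability point above, scikit-learn's NearestNeighbors can build a ball-tree index so that neighbour lookups do not require computing every pairwise distance; the random item matrix below is purely illustrative.

# Neighbour lookup with a ball tree instead of brute-force pairwise distances.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
item_features = rng.random((10_000, 32))   # 10k items with 32-dimensional representations

index = NearestNeighbors(n_neighbors=6, algorithm="ball_tree").fit(item_features)

# Nearest neighbours of item 0 (the first hit is the item itself, so it is skipped).
distances, indices = index.kneighbors(item_features[:1])
print("neighbours of item 0:", indices[0][1:])
print("distances           :", np.round(distances[0][1:], 3))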
7. What is dimensionality reduction? Describe the different techniques along with the benefits
and applications of dimensionality reduction.
Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of input variables (features) in a
dataset while preserving as much of the useful information as possible.
Techniques:
Principal Component Analysis (PCA): PCA is one of the most widely used
dimensionality reduction techniques. It transforms the original variables into a new set
of orthogonal variables (principal components) that capture the maximum variance in
the data.
Linear Discriminant Analysis (LDA): LDA is often used in supervised learning tasks
to find the feature subspace that maximizes class separability.
Autoencoders: These are neural network models that learn efficient representations
of data by encoding input into a lower-dimensional latent space and then reconstructing
the output from this representation.
Benefits:
Improved Model Performance: By reducing noise and irrelevant features,
dimensionality reduction can lead to better generalization and predictive performance
of machine learning models.
Efficiency: Reduced dimensionality can lead to faster training times and less memory
usage, especially beneficial for large datasets.
Singular Value Decomposition (SVD)
SVD factorizes a matrix A into A = U Σ V^T, where U and V have orthonormal columns and Σ is a
diagonal matrix of singular values.
Dimensionality Reduction:
SVD is used for reducing the dimensionality of data. By retaining only the most
significant singular values and corresponding vectors, you can represent the original
matrix with reduced dimensions.
Matrix Approximation:
o SVD allows for approximating a matrix A by using only the first k singular
values and vectors. This approximation can be useful for compressing data or
denoising.
o PCA can be seen as a specific application of SVD, where the covariance matrix of a
dataset is decomposed to find its principal components.
Implicit Feedback Handling: SVD can handle implicit feedback data (e.g., user views
or clicks) effectively by capturing underlying patterns in user-item interactions.
Scalability: Techniques like incremental SVD and stochastic gradient descent can be
used to scale SVD to large datasets.
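A minimal NumPy sketch of the rank-k approximation idea is shown below; the small user-item matrix and the choice k = 2 are made up for illustration.

# Truncated SVD: keep only the k largest singular values/vectors to approximate A.
import numpy as np

A = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-2 approximation in the least-squares sense

print("singular values:", np.round(s, 2))
print("rank-2 approximation:")
print(np.round(A_k, 2))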
Principal Component Analysis (PCA) is a widely used statistical technique in data science for
reducing the dimensionality of data while retaining as much variance as possible. It transforms a
set of correlated variables (or features) into a set of linearly uncorrelated variables called
principal components, which are ordered by the amount of variance they explain in the original
data.
Covariance Matrix:
PCA starts by computing the covariance matrix of the (centred) dataset, which captures the
pairwise relationships between the different variables. The eigenvectors of this matrix, ordered by
their eigenvalues, are the principal components, and projecting the data onto the top components
gives the reduced representation.
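These steps can be sketched directly in NumPy: centre the data, build the covariance matrix, take its eigenvectors, and project; the correlated 2-D data below is randomly generated for illustration.

# PCA from first principles via the covariance matrix.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])   # correlated features

X_centred = X - X.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)        # pairwise covariances between the variables

eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: the covariance matrix is symmetric
order = np.argsort(eigvals)[::-1]            # order components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = X_centred @ eigvecs                 # data expressed in the principal components
print("explained variance ratio:", np.round(eigvals / eigvals.sum(), 3))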
Applications of PCA:
Dimensionality Reduction:
PCA is primarily used for reducing the number of variables in a dataset while retaining
most of the information. This is beneficial for improving computational efficiency,
reducing noise, and avoiding overfitting in machine learning models.
Visualization:
PCA is valuable for visualizing high-dimensional data. By reducing data to two or
three principal components, it becomes easier to plot and understand the underlying
structure and relationships.
Feature Extraction:
PCA can be used as a feature extraction technique where the principal components
serve as new features that may be more informative or less redundant than the original
variables.
Noise Reduction:
PCA can effectively filter out noise by emphasizing variations in data that are
significant (captured by principal components with high eigenvalues) and disregarding
variations that are less significant (captured by components with low eigenvalues).
Advantages of PCA:
Interpretability: Principal components are linear combinations of original variables,
making them interpretable in terms of the contributions of different features.
Data Compression: PCA allows for data compression by reducing the number of
dimensions while preserving most of the variance, which is useful for storage and
computation.
Improves Model Performance: By reducing the number of input variables, PCA can
lead to simpler and more efficient models that generalize better to new data.