
21CS644- INTRODUCTION TO DATA SCIENCE AND VISUALIZATION

Module-3 Syllabus:
Feature Generation and Feature Selection
Extracting Meaning from Data: Motivating application: user (customer) retention. Feature
Generation (brainstorming, role of domain expertise, and place for imagination), Feature Selection
algorithms. Filters; Wrappers; Decision Trees; Random Forests. Recommendation Systems: Building
a User-Facing Data Product, Algorithmic ingredients of a Recommendation Engine, Dimensionality
Reduction, Singular Value Decomposition, Principal Component Analysis, Exercise: build your own
recommendation system.

1. Define feature generation. How can information be categorized during feature generation? Explain in detail.

Feature Generation or Feature Extraction


Feature generation, also known as feature extraction, is the process of transforming raw data into a
structured format where each column represents a specific characteristic or attribute (feature) of the
data, and each row represents an observation or instance.

• This involves identifying, creating, and selecting meaningful variables from the raw data that can
be used in machine learning models to make predictions or understand patterns.
• This process is both an art and a science. Having a domain expert involved is beneficial, but
using creativity and imagination is equally important.
• Remember, feature generation is constrained by two factors: whether it is feasible to capture certain
information at all, and whether it occurs to you to capture it.

Information can be categorized into the following buckets:


• Relevant and useful, but it’s impossible to capture it.
Keep in mind that much user information isn't captured, like free time, other apps, employment
status, or insomnia, which might predict their return. Some captured data may act as proxies for
these factors, such as playing the game at 3 a.m. indicating insomnia or night shifts.
• Relevant and useful, possible to log it, and you did.
The decision to log this information during the brainstorming session was crucial. However, mere
logging doesn't guarantee understanding its relevance or usefulness. The feature selection process
aims to uncover this information.
• Relevant and useful, possible to log it, but you didn’t.
Human limitations can lead to overlooking crucial information, emphasizing the need for creative
feature selection. Usability studies help identify key user actions for better feature capture.
• Not relevant or useful, but you don’t know that and log it.
Feature selection aims to address this case: you have logged certain information without knowing
whether it is actually needed.
• Not relevant or useful, and you either can’t capture it or it didn’t occur to you.

2. Explain the importance of feature selection along with the types of feature selection methods.

Feature Selection
Feature Selection refers to the process of selecting the most relevant features (or variables) from
the dataset to use in building a predictive model. The goal is to improve the performance of the
model by eliminating irrelevant or redundant features, which can lead to better accuracy, reduced
overfitting, and more efficient computation.
Importance of Feature Selection: Feature selection is crucial because it helps in simplifying
models, making them easier to interpret, reducing computational cost, and often improving the
generalization of the model by reducing overfitting.

Why is Feature Selection important?

 It simplifies the model: data reduction, less storage, Occam’s razor, and better visualization
 Reduces training time
 Avoids over-fitting
 Improves accuracy of the model
 Avoids the curse of dimensionality

Types of Feature Selection Methods:


Feature selection methods can be grouped into three categories: filter method, wrapper method
and embedded method.

Filter Methods:

Filters prioritize features based on specific metrics or statistics, such as correlation with the
outcome variable, offering a quick overview of predictive power. These methods use statistical
techniques to evaluate the relevance of each feature individually based on its relationship with
the target variable. Examples include correlation coefficients, Chi-square tests, and mutual
information.

A subset of features is selected based on their relationship to the target variable. The selection
does not depend on any machine learning algorithm. Instead, filter methods measure the
“relevance” of the features with respect to the output via statistical tests (for example, Pearson’s
correlation when both feature and target are continuous, and ANOVA or Chi-square tests when
one or both are categorical).
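
As an illustrative sketch only (not part of the original notes), a filter-style selection can be done with scikit-learn's SelectKBest; the dataset and scoring function here are assumptions chosen for brevity:

```python
# Filter method sketch: rank features with a univariate statistic, keep the top k.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Each feature is scored independently of any downstream model (ANOVA F-test here).
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print("F-scores per feature:", selector.scores_)
print("Selected feature indices:", selector.get_support(indices=True))
```
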
Wrapper methods
In wrapper methods, the feature selection process is based on a specific machine
learning algorithm that we are trying to fit on a given dataset.
It follows a greedy search approach, evaluating candidate combinations of
features against the evaluation criterion. The evaluation criterion is simply the
performance measure, which depends on the type of problem.

In wrapper feature selection, two key aspects require consideration:

✓ first, the choice of an algorithm for feature selection, and


✓ second, the determination of a selection criterion or filter to find out the usefulness of
the chosen feature set.
A. Selecting an Algorithm. The most commonly used techniques under wrapper methods are:

Forward selection:
Forward selection involves systematically adding features to a regression model one at a time based
on their ability to improve model performance according to a selection criterion. This iterative
process continues until further feature additions no longer enhance the model performance.

Backward elimination:
Begins with a regression model containing all features. One feature is then removed at a time: the
feature whose removal produces the biggest improvement in the selection criterion. Removal stops
when dropping any further feature makes the selection criterion worse.

Bi-directional elimination (Stepwise Selection):


The combined approach in feature selection blends forward selection and backward elimination to
strike a balance between maximizing relevance and minimizing redundancy. It iteratively adds and
removes features based on their significance and impact on model fit, resulting in a subset of features
optimized for predictive power.
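
A minimal sketch of forward selection and backward elimination, assuming scikit-learn's SequentialFeatureSelector wrapped around a linear regression model (the estimator, dataset, and number of features are illustrative assumptions):

```python
# Wrapper method sketch: greedily add or remove features based on cross-validated model performance.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# Forward selection: start with no features, add the one that improves the CV score most.
forward = SequentialFeatureSelector(model, n_features_to_select=4, direction="forward", cv=5)
forward.fit(X, y)
print("Forward-selected features:", forward.get_support(indices=True))

# Backward elimination: start with all features, drop the least useful one at each step.
backward = SequentialFeatureSelector(model, n_features_to_select=4, direction="backward", cv=5)
backward.fit(X, y)
print("Backward-selected features:", backward.get_support(indices=True))
```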

B. Selection Criterion
The choice of selection criteria in feature selection methods may seem arbitrary. To address this,
experimenting with various criteria can help assess model robustness. Different criteria may yield
diverse models, necessitating the prioritization of optimization goals based on the problem context
and objectives.

R-squared
R-squared can be interpreted as the proportion of variance explained by your model.

p-values
In regression analysis, the interpretation of p-values involves assuming a null hypothesis where the
coefficients (βs) are zero. A low p-value suggests that observing the data and obtaining the
estimated coefficient under the null hypothesis is highly unlikely, indicating a high likelihood that
the coefficient is non-zero.
AIC (Akaike Information Criterion)
Given by the formula 2k−2ln(L), where k is the number of parameters in the model and
ln(L) is the “maximized value of the log likelihood.” The goal is to minimize AIC.
BIC (Bayesian Information Criterion)

Given by the formula k*ln(n) −2ln(L), where k is the number of parameters in the model, n
is the number of observations (data points, or users), and ln(L) is the maximized value of
the log likelihood. The goal is to minimize BIC.
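
For illustration, both criteria can be computed directly from the formulas above; this small helper assumes k, n, and the maximized log-likelihood are already known for a fitted model (the example values are hypothetical):

```python
import math

def aic(k: int, log_likelihood: float) -> float:
    # AIC = 2k - 2 ln(L); smaller is better.
    return 2 * k - 2 * log_likelihood

def bic(k: int, n: int, log_likelihood: float) -> float:
    # BIC = k ln(n) - 2 ln(L); smaller is better, with a stronger penalty on model size.
    return k * math.log(n) - 2 * log_likelihood

# Hypothetical: a 3-parameter model fit on 100 observations with log-likelihood -120.
print(aic(3, -120.0), bic(3, 100, -120.0))
```
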
Entropy

Entropy is a measure of disorder or impurity in the given dataset. For a set S whose classes occur with
proportions p1, …, pc, Entropy(S) = -Σ pi log2(pi); it is 0 for a pure set and largest when the classes are
evenly mixed.

Embedded Methods: These involve algorithms that perform feature selection during the model
training process. Regularization methods like LASSO (Least Absolute Shrinkage and Selection
Operator) and Ridge Regression are examples.
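
As a sketch of an embedded method (illustrative only), LASSO in scikit-learn drives some coefficients exactly to zero while the model is being trained, so the surviving non-zero coefficients act as the selected features; the dataset and alpha value are assumptions:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)  # L1 regularization is sensitive to feature scale

lasso = Lasso(alpha=0.5)  # alpha controls the strength of the L1 penalty (assumed value)
lasso.fit(X, y)

# Features with non-zero coefficients are the ones the model effectively selected.
selected = [i for i, coef in enumerate(lasso.coef_) if coef != 0.0]
print("Coefficients:", lasso.coef_)
print("Selected feature indices:", selected)
```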

3. Explain and construct decision tree with an example.

Decision Trees

A decision tree is a non-parametric supervised learning algorithm for classification and regression
tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf
nodes, and it produces models that are easy to understand and interpret.

As a decision-support tool, a decision tree depicts decisions and their potential outcomes,
incorporating chance events, resource costs, and utility. The algorithm uses conditional control
statements to split the data.

Decision trees have applications spanning several different areas. As the name suggests, the model
uses a flowchart-like tree structure to show the predictions that result from a series of feature-based
splits: it starts at a root node and ends with a decision made at the leaves.
Example of Decision Tree
https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/

https://medium.com/geekculture/step-by-step-decision-tree-id3-algorithm-from-
scratch-in-python-no-fancy-library-4822bbfdd88f

ID3 Algorithm Decision Tree – Solved Example – Machine Learning


Problem Definition:
Build a decision tree using ID3 algorithm for the given training data in the table (Buy
Computer data), and predict the class of the following new example: age<=30,
income=medium, student=yes, credit-rating=fair

Solution:

First, check which attribute provides the highest Information Gain in order to split the
training set on that attribute. We need to calculate the entropy of the whole set and the
weighted entropy of each attribute; the information gain of an attribute is the entropy of the
set minus the attribute's weighted entropy.

The entropy of the full training set (9 yes, 5 no) is:

Entropy(S) = E(9,5) = -9/14 log2(9/14) – 5/14 log2(5/14) = 0.94

Now Consider the Age attribute

For Age, we have three values: age<=30 (2 yes and 3 no), age 31..40 (4 yes and 0 no),
and age>40 (3 yes and 2 no).

Entropy(age) = 5/14 (-2/5 log2(2/5) - 3/5 log2(3/5)) + 4/14 (0) + 5/14 (-3/5 log2(3/5) - 2/5 log2(2/5))

= 5/14 (0.9709) + 0 + 5/14 (0.9709) = 0.6935

Gain(age) = 0.94 – 0.6935 = 0.2465

Next, consider Income Attribute

For Income, we have three values: income=high (2 yes and 2 no), income=medium (4 yes
and 2 no), and income=low (3 yes and 1 no).

Entropy(income) = 4/14 (-2/4 log2(2/4) - 2/4 log2(2/4)) + 6/14 (-4/6 log2(4/6) - 2/6 log2(2/6)) + 4/14 (-3/4 log2(3/4) - 1/4 log2(1/4))

= 4/14 (1) + 6/14 (0.918) + 4/14 (0.811)

= 0.285714 + 0.393428 + 0.231714 = 0.9108

Gain(income) = 0.94 – 0.9108 = 0.0292

Next, consider the Student attribute.

For Student, we have two values: student=yes (6 yes and 1 no) and student=no (3 yes and 4 no).

Entropy(student) = 7/14 (-6/7 log2(6/7) - 1/7 log2(1/7)) + 7/14 (-3/7 log2(3/7) - 4/7 log2(4/7))

= 7/14 (0.5916) + 7/14 (0.9852)

= 0.2958 + 0.4926 = 0.7884

Gain(student) = 0.94 – 0.7884 = 0.1516

Finally, consider the Credit_Rating attribute.

For Credit_Rating, we have two values: credit_rating=fair (6 yes and 2 no) and credit_rating=excellent (3 yes and 3 no).

Entropy(credit_rating) = 8/14 (-6/8 log2(6/8) - 2/8 log2(2/8)) + 6/14 (-3/6 log2(3/6) - 3/6 log2(3/6))

= 8/14 (0.8112) + 6/14 (1)

= 0.4635 + 0.4285 = 0.8920

Gain(credit_rating) = 0.94 – 0.8920 = 0.048

Since Age has the highest Information Gain we start splitting the dataset using
the age attribute.
Decision Tree after step 1
Since all records under the branch age 31..40 belong to the same class, Yes, we can replace
that node with a leaf labelled Class=Yes.

Decision Tree after step 1_1

Now build the decision tree for the left subtree

The same process of splitting has to happen for the two remaining branches.

Left sub-branch
For branch age<=30 we still have attributes income, student, and credit_rating. Which
one should be used to split the partition?

The entropy of this partition is Entropy(S age<=30) = E(2,3) = -2/5 log2(2/5) – 3/5 log2(3/5) = 0.97

For Income, we have three values income high (0 yes and 2 no), income medium
(1 yes and 1 no) and income low (1 yes and 0 no)

Entropy(income) = 2/5(0) + 2/5 (-1/2log2(1/2)-1/2log2(1/2)) + 1/5 (0) = 2/5 (1) = 0.4

Gain(income) = 0.97 – 0.4 = 0.57


For Student, we have two values student yes (2 yes and 0 no) and student no (0 yes
3 no)
Entropy(student) = 2/5(0) + 3/5(0) = 0

Gain (student) = 0.97 – 0 = 0.97

We can then safely split on attribute student without checking the other attributes
since the information gain is maximized.

Decision Tree after step 2


Since these two new branches are from distinct classes, we make them into leaf nodes
with their respective class as label:

Decision Tree after step 2_2

Now build the decision tree for the right subtree.

Right sub-branch
The entropy of this partition is Entropy(S age>40) = E(3,2) = -3/5 log2(3/5) – 2/5 log2(2/5) = 0.97

For Income, we have two values income medium (2 yes and 1 no) and income low
(1 yes and 1 no)
Entropy(income) = 3/5 (-2/3 log2(2/3) - 1/3 log2(1/3)) + 2/5 (-1/2 log2(1/2) - 1/2 log2(1/2))

= 3/5 (0.9182) + 2/5 (1) = 0.55 + 0.4 = 0.95

Gain(income) = 0.97 – 0.95 = 0.02


For Student, we have two values student yes (2 yes and 1 no) and student no (1 yes and 1 no)

Entropy(student) = 3/5(-2/3log2(2/3)-1/3log2(1/3)) + 2/5(-1/2log2(1/2)-1/2log2(1/2))


= 0.95

Gain (student) = 0.97 – 0.95 = 0.02

For Credit_Rating, we have two values: credit_rating=fair (3 yes and 0 no) and
credit_rating=excellent (0 yes and 2 no).

Entropy(credit_rating) = 0

Gain(credit_rating) = 0.97 – 0 = 0.97

We then split based on credit_rating. These splits give partitions each with records
from the same class. We just need to make these into leaf nodes with their class label
attached:

Decision Tree for Buys Computer


New example: age<=30, income=medium, student=yes, credit-rating=fair

Following the branch age<=30 and then student=yes, we predict Class=Yes.

Buys_computer = yes
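
The entropy and information-gain arithmetic above can be verified with a short script; this sketch uses only the class counts quoted in the worked example (it does not reconstruct the full Buy Computer table):

```python
import math

def entropy(counts):
    # Entropy of a class distribution given as raw counts, e.g. (9, 5).
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, splits):
    # splits: (yes, no) counts for each value of the attribute.
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in splits)
    return entropy(parent_counts) - weighted

parent = (9, 5)  # 9 yes, 5 no in the full training set
print("Entropy(S)          =", round(entropy(parent), 4))                                    # ~0.94
print("Gain(age)           =", round(information_gain(parent, [(2, 3), (4, 0), (3, 2)]), 4)) # ~0.2465
print("Gain(income)        =", round(information_gain(parent, [(2, 2), (4, 2), (3, 1)]), 4)) # ~0.029
print("Gain(student)       =", round(information_gain(parent, [(6, 1), (3, 4)]), 4))         # ~0.1516
print("Gain(credit_rating) =", round(information_gain(parent, [(6, 2), (3, 3)]), 4))         # ~0.048
```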

4. Explain the Random Forest algorithm with an example (or) What are the drawbacks of
decision trees? Mention the key features of random forest and explain how it works.

Random Forest

Random forest is a supervised learning algorithm. The “forest” it builds is an ensemble of decision
trees, usually trained with the bagging method.
 Random forests generalize decision trees with bagging, otherwise known as Bootstrap
Aggregating.
 Makes the models more accurate and more robust, but at the cost of interpretability
 But easy to specify – 2 Hyperparameters: Number of Trees (N) in the forest and Number
of Features (F) to randomly select for each tree
 A bootstrap sample is a sample with replacement, which means we might sample the same
data point more than once. We usually take the sample size to be 80% of the size of the
entire (training) dataset, but of course this parameter can be adjusted depending on
circumstances. This is technically a third hyperparameter of our random forest algorithm.
 To construct a random forest, you construct N decision trees as follows:
 For each tree, take a bootstrap sample of your data, and for each node you randomly
select F features, say 5 out of the 100 total features.
 Then you use your entropy-information-gain engine as described in the previous section
to decide which among those features you will split your tree on at each stage.

Algorithm for Random Forest Work:

Step 1: Select random K data points from the training set.
Step 2: Build the decision trees associated with the selected data points (subsets).
Step 3: Choose the number N of decision trees that you want to build.
Step 4: Repeat Steps 1 and 2.
Step 5: For new data points, find the predictions of each decision tree, and assign the
new data points to the category that wins the majority vote.
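
As a minimal sketch (an assumption about tooling, not part of these notes), the hyperparameters described above map directly onto scikit-learn's RandomForestClassifier; the numeric values are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,    # N: number of trees in the forest
    max_features=5,      # F: features randomly considered at each split
    bootstrap=True,      # each tree is trained on a bootstrap sample (sampling with replacement)
    max_samples=0.8,     # bootstrap sample of ~80% of the training set, as described above
    criterion="entropy", # split nodes using the same entropy / information-gain idea as ID3
)
print(forest)
```
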
Key Features of Random Forest
Some of the Key Features of Random Forest are discussed below:

High Predictive Accuracy: Imagine Random Forest as a team of decision-making wizards. Each
wizard (decision tree) looks at a part of the problem, and together, they weave their insights into a
powerful prediction tapestry. This teamwork often results in a more accurate model than what a
single wizard could achieve.

Resistance to Overfitting: This approach helps prevent getting too caught up with the training data
which makes the model less prone to overfitting.

Large Datasets Handling: Each helper takes on a part of the dataset, ensuring that the
expedition is not only thorough but also surprisingly quick.

Variable Importance Assessment: It assesses the importance of each clue in solving the case,
helping you focus on the key elements that drive predictions.

Built-in Cross-Validation: This built-in validation ensures your model doesn’t just ace the training
but also performs well on new challenges.

https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/

4. Apply the random forest algorithm to the iris dataset. Compute the accuracy using a decision tree
and a random forest and justify the results.

Refer to the Colab notebook for the implementation:

Decision Tree and Random Forest.ipynb - Colab (google.com)
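
The notebook itself is not reproduced here; the following is a hedged sketch of what such a comparison typically looks like (the split, random seed, and exact accuracies are assumptions and will vary):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("Decision tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))
print("Random forest accuracy:", accuracy_score(y_test, forest.predict(X_test)))
# The forest usually matches or exceeds the single tree: bagging plus random feature
# selection reduces the variance (overfitting) of individual trees.
```
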
5. What are recommendation systems? Explain various filtering algorithms used in
recommendation systems with examples.

A recommendation engine filters the data using different algorithms and recommends
the most relevant items to users. It first captures the past behavior of a customer and
based on that, recommends products which the users might be likely to buy.

If a completely new user visits an e-commerce site, that site will not have any past
history of that user. So how does the site go about recommending products to the user
in such a scenario? One possible solution could be to recommend the best selling
products, i.e. the products which are high in demand. Another possible solution could
be to recommend the products which would bring the maximum profit to the business.

If we can recommend a few items to a customer based on their needs and interests, it
will create a positive impact on the user experience and lead to frequent visits. Hence,
businesses nowadays are building smart and intelligent recommendation engines by
studying the past behavior of their users.

How does a Recommendation Engine Work?


Step 1: Data Collection

This is the first and most crucial step for building a recommendation engine. The data
can be collected by two means: explicitly and implicitly. Explicit data is information
that is provided intentionally, i.e. input from the users such as movie ratings. Implicit
data is information that is not provided intentionally but gathered from available data
streams like search history, clicks, order history, etc.

Filtering Algorithms
1. Content based filtering
This algorithm recommends products which are similar to the ones that a user has liked
in the past.

Consider Example of Netflix


Recommendation engines save all information related to each user in a vector form
known as the profile vector, which contains the user’s past behavior, including liked
or disliked movies and given ratings. Information about movies is stored in another
vector called the item vector, which includes details such as genre, cast, and director.
The content-based filtering algorithm uses cosine similarity to measure the cosine of the
angle between the profile vector and the item vector. If A is the profile vector and B is
the item vector, the similarity between them is the cosine of the angle between these two
vectors: sim(A, B) = cos(θ) = (A · B) / (||A|| ||B||).

Based on the cosine value, which ranges between -1 to 1, the movies are arranged in
descending order and one of the two below approaches is used for recommendations:

Top-n approach: where the top n movies are recommended (Here n can be decided by
the business)

Rating scale approach: Where a threshold is set and all the movies above that
threshold are recommended

Other methods that can be used to calculate the similarity are:

Euclidean Distance: Similar items will lie in close proximity to each other if plotted in
n-dimensional space. So, we can calculate the distance between items and, based on that
distance, recommend items to the user. The Euclidean distance between points x and y is
given by: d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + … + (xn - yn)^2)

Pearson’s Correlation: It tells us how strongly two items are correlated; the higher the
correlation, the greater the similarity. Pearson’s correlation is given by:
r = Σ(xi - x̄)(yi - ȳ) / ( sqrt(Σ(xi - x̄)^2) · sqrt(Σ(yi - ȳ)^2) )
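
A short sketch of the three similarity measures above, using a hypothetical profile vector and item vector encoded as NumPy arrays (the numbers are made up for illustration):

```python
import numpy as np

profile = np.array([4.0, 1.0, 0.0, 4.0])  # hypothetical user profile vector
item = np.array([5.0, 0.0, 1.0, 4.0])     # hypothetical item vector

# Cosine similarity: cosine of the angle between the two vectors (range -1 to 1).
cosine = profile @ item / (np.linalg.norm(profile) * np.linalg.norm(item))

# Euclidean distance: smaller distance means the vectors are more similar.
euclidean = np.linalg.norm(profile - item)

# Pearson's correlation: linear correlation between the two vectors.
pearson = np.corrcoef(profile, item)[0, 1]

print(cosine, euclidean, pearson)
```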

2. Collaborative filtering
Let us understand this with an example. If person A likes 3 movies, say Interstellar,
Inception and Predestination, and person B likes Inception, Predestination and The
Prestige, then they have almost similar interests. We can say with some certainty
that A should like The Prestige and B should like Interstellar. The collaborative
filtering algorithm uses “User Behavior” for recommending items.
User-User collaborative filtering
This algorithm first finds the similarity score between users. Based on this similarity
score, it then picks out the most similar users and recommends products which these
similar users have liked or bought previously.
This algorithm finds the similarity between each user based on the ratings they have
previously given to different movies. The prediction of an item for a user u is
calculated by computing the weighted sum of the user ratings given by other users
to an item i.
The prediction Pu,i is given by:

Pu,i = Σv (Su,v · Rv,i) / Σv Su,v

Here,
 Pu,i is the prediction of an item
 Rv,i is the rating given by a user v to a movie i
 Su,v is the similarity between users
Let us understand it with an example:

Consider the user-movie rating matrix:

User/Movie    x1   x2   x3   x4   x5   Mean User Rating
A             4    1    –    4    –    3
B             –    4    –    2    3    3
C             –    1    –    4    4    3

Here we have a user-movie rating matrix. To understand this in a more practical manner, let’s
find the similarity between users (A, C) and (B, C) in the above table. The common movies rated
by A and C are x2 and x4, and the common movies rated by B and C are x2, x4 and x5.

The correlation between user A and C is more than the correlation between B and C.
Hence users A and C have more similarity and the movies liked by user A will be recommended
to user C and vice versa.
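
The claim above can be checked with a small sketch that computes Pearson's correlation over the commonly rated movies from the table (ratings copied from the table; this is illustrative code, not part of the original notes):

```python
import numpy as np

# Ratings from the user-movie matrix above (missing entries simply omitted).
ratings = {
    "A": {"x1": 4, "x2": 1, "x4": 4},
    "B": {"x2": 4, "x4": 2, "x5": 3},
    "C": {"x2": 1, "x4": 4, "x5": 4},
}

def similarity(u, v):
    # Pearson's correlation over the movies both users have rated.
    common = sorted(set(ratings[u]) & set(ratings[v]))
    a = np.array([ratings[u][m] for m in common], dtype=float)
    b = np.array([ratings[v][m] for m in common], dtype=float)
    return np.corrcoef(a, b)[0, 1]

print("sim(A, C) =", similarity("A", "C"))  # 1.0 -> A and C are very similar
print("sim(B, C) =", similarity("B", "C"))  # about -0.87 -> B and C are dissimilar
```
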
This algorithm is quite time consuming as it involves calculating the similarity for
each user and then calculating prediction for each similarity score. One way of
handling this problem is to select only a few users (neighbors) instead of all to
make predictions, i.e. instead of making predictions for all similarity values, we
choose only few similarity values.
Item-Item collaborative filtering

This algorithm finds the similarity between pairs of movies and recommends similar ones, analogous
to user-user collaborative filtering. It uses the weighted sum of ratings of “item-neighbors” instead
of “user-neighbors” to make predictions.

Let us understand it with an example.

User/Movie         x1   x2   x3   x4   x5
A                  4    1    2    4    4
B                  2    4    4    2    1
C                  –    1    –    3    4
Mean Item Rating   3    2    3    3    3

The mean item rating is the average of all ratings given to a particular item (analogous to the
mean user rating in the user-user filtering table). Instead of finding user-user similarity,
item-item similarity is calculated. For example, to compare the movie pairs (x1, x4) and (x1, x5):
the common users who have rated x1 and x4 are A and B, and the common users who have rated x1
and x5 are also A and B.

The similarity between movie x1 and x4 is more than the similarity between movie x1
and x5. So based on these similarity values, if any user searches for movie x1, they
will be recommended movie x4 and vice versa.

6. Describe the problems with Nearest Neighbor in recommendation systems.


The kNN algorithm is a reliable and intuitive recommendation approach that leverages user
or item similarity to provide personalized recommendations. kNN recommender systems are
helpful in e-commerce, social media, and healthcare, and continue to be an important tool
for generating accurate and personalized recommendations.

The nearest neighbor algorithm is a popular approach in recommendation systems for
identifying the items or users that are most similar to a given item or user.

Some problems with nearest neighbor:

Data Quality Issues: Check the quality and consistency of your data. Ensure that your
data is clean, free from outliers, and properly preprocessed.

Distance Metric Selection: The choice of distance metric (e.g., Euclidean distance,
cosine similarity) can significantly impact the performance of the algorithm.
Experiment with different metrics to see which one best captures the similarity between
items or users in your dataset.

Curse of Dimensionality: In high-dimensional spaces, distance-based algorithms like
nearest neighbors can become less effective due to the increased sparsity of data
points. Consider dimensionality reduction techniques like PCA (Principal Component
Analysis) or feature selection to mitigate this issue.

Cold Start Problem: Nearest neighbor algorithms may struggle with cold start
problems, where there isn't enough data available for new users or items. Consider
using hybrid approaches or incorporating content-based features to handle this scenario.

Scalability: For large datasets, computing distances between all pairs of items or users
can be computationally expensive. Look into approximate nearest neighbor methods or
data structures like KD-trees or Ball trees to improve efficiency.

Normalization: Ensure that features used for calculating similarity are properly
normalized to prevent certain features from dominating the distance calculation.

Hyperparameter Tuning: If using algorithms like k-nearest neighbors (k-NN),
experiment with different values of k and other hyperparameters to find the optimal
configuration for your dataset.

User/item representation: Make sure that your representation of users and items
(feature vectors) appropriately captures the relevant characteristics that define similarity
in your recommendation context.

Evaluation Metrics: Use appropriate evaluation metrics (e.g., precision, recall, RMSE
for rating prediction) to assess the performance of your recommendation system and
identify areas for improvement.

Implementation Bugs: Double-check your implementation for bugs or logical errors
that could affect the correctness of your results.

By systematically addressing these potential issues, you should be able to diagnose and
improve the performance of your nearest neighbor algorithm in your recommendation
system.
7. What is dimensionality reduction? Describe the different techniques, along with the benefits
and applications of dimensionality reduction.

Dimensionality Reduction

Dimensionality reduction is a fundamental technique in data science aimed at reducing
the number of input variables (dimensions) under consideration. It's particularly useful
in scenarios where datasets have a large number of features or dimensions, which can
lead to increased computational complexity, overfitting, and reduced model
interpretability. Here are some key points about dimensionality reduction:

Purpose: The primary goal of dimensionality reduction is to simplify data
representation while retaining important information. This simplification can aid in
better understanding the underlying structure of data, improving computational
efficiency, and enhancing model performance.

Techniques:

Principal Component Analysis (PCA): PCA is one of the most widely used
dimensionality reduction techniques. It transforms the original variables into a new set
of orthogonal variables (principal components) that capture the maximum variance in
the data.

t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is effective for
visualizing high-dimensional data by mapping similar instances to nearby points in a
lower-dimensional space.

Linear Discriminant Analysis (LDA): LDA is often used in supervised learning tasks
to find the feature subspace that maximizes class separability.

Autoencoders: These are neural network models that learn efficient representations
of data by encoding input into a lower-dimensional latent space and then reconstructing
the output from this representation.

Benefits:
Improved Model Performance: By reducing noise and irrelevant features,
dimensionality reduction can lead to better generalization and predictive performance
of machine learning models.

Visualization: Lower-dimensional representations are easier to visualize, enabling
better exploration and understanding of data patterns.

Efficiency: Reduced dimensionality can lead to faster training times and less memory
usage, especially beneficial for large datasets.

Application in Recommendation Systems:


In recommendation systems, dimensionality reduction techniques can be applied to
user-item interaction matrices to uncover latent factors or preferences.
By reducing the dimensionality of feature vectors representing users and items,
recommendation algorithms can efficiently compute similarities or recommendations
while mitigating the effects of sparsity and noise in data.
8. Explain Singular value decomposition.

Singular Value Decomposition (SVD) is a powerful matrix factorization technique used
extensively in data science and recommendation systems. SVD factorizes a matrix A into the
product A = UΣVᵀ, where U and V have orthonormal columns and Σ is a diagonal matrix of
singular values. Here’s an overview of SVD and its relevance:

Key Concepts and Uses of SVD:

Dimensionality Reduction:

SVD is used for reducing the dimensionality of data. By retaining only the most
significant singular values and corresponding vectors, you can represent the original
matrix with reduced dimensions.

Matrix Approximation:

o SVD allows for approximating a matrix A by using only the first k singular
values and vectors. This approximation can be useful for compressing data or
denoising.

Collaborative Filtering in Recommendation Systems:

o In recommendation systems, SVD is used for collaborative filtering. It helps in
uncovering latent factors that represent user preferences and item characteristics.

Principal Component Analysis (PCA):

o PCA can be seen as a specific application of SVD, where the covariance matrix of a
dataset is decomposed to find its principal components.

Steps for Using SVD in Recommendation Systems:
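
The original list of steps is not reproduced here; as a hedged sketch under the assumption of a small example ratings matrix, the usual workflow with NumPy's numpy.linalg.svd looks like this:

```python
import numpy as np

# Hypothetical user-item rating matrix (rows = users, columns = items, 0 = unrated).
A = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

# Step 1: decompose A = U * diag(s) * Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Step 2: keep only the k largest singular values (the latent factors).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Step 3: use the low-rank reconstruction as predicted ratings for the unrated items.
print(np.round(A_k, 2))
```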


Advantages of SVD in Recommendation Systems:

Implicit Feedback Handling: SVD can handle implicit feedback data (e.g., user views
or clicks) effectively by capturing underlying patterns in user-item interactions.

Personalization: By learning latent factors, SVD can provide personalized
recommendations based on user preferences.

Scalability: Techniques like incremental SVD and stochastic gradient descent can be
used to scale SVD to large datasets.

9. Explain Principal Component Analysis

PCA is a statistical method that transforms a set of correlated variables (or features)
into a set of linearly uncorrelated variables called principal components. These
principal components are ordered by the amount of variance they explain in the original
data.

Principal Component Analysis (PCA) is a widely used technique in data science for
reducing the dimensionality of data while retaining as much variance as possible.

Key Concepts and Steps in PCA:

Covariance Matrix:
PCA starts by computing the covariance matrix of the dataset, which captures the
pairwise relationships between different variables.

Eigen decomposition or Singular Value Decomposition (SVD):


The covariance matrix is then decomposed into its eigenvectors and eigenvalues, which
represent the directions and magnitudes of maximum variance in the data.
Selecting Principal Components:
Principal components are selected based on the eigenvalues, with higher eigenvalues
indicating greater variance explained. Typically, the number of principal components
chosen is based on the desired level of variance retention.

Transforming the Data:


Finally, the original dataset is transformed into the new space defined by the selected
principal components. This transformation projects the data onto a lower-dimensional
subspace while preserving as much variance as possible.
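
These steps correspond closely to scikit-learn's PCA; the sketch below is illustrative only (the dataset and number of components are assumptions, not part of the notes):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

# Keep two principal components (the directions of maximum variance).
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Transformed shape:", X_2d.shape)
```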

Applications of PCA:

Dimensionality Reduction:
PCA is primarily used for reducing the number of variables in a dataset while retaining
most of the information. This is beneficial for improving computational efficiency,
reducing noise, and avoiding overfitting in machine learning models.

Visualization:
PCA is valuable for visualizing high-dimensional data. By reducing data to two or
three principal components, it becomes easier to plot and understand the underlying
structure and relationships.

Feature Extraction:
PCA can be used as a feature extraction technique where the principal components
serve as new features that may be more informative or less redundant than the original
variables.

Noise Reduction:
PCA can effectively filter out noise by emphasizing variations in data that are
significant (captured by principal components with high eigenvalues) and disregarding
variations that are less significant (captured by components with low eigenvalues).

Advantages of PCA:
Interpretability: Principal components are linear combinations of original variables,
making them interpretable in terms of the contributions of different features.

Data Compression: PCA allows for data compression by reducing the number of
dimensions while preserving most of the variance, which is useful for storage and
computation.

Improves Model Performance: By reducing the number of input variables, PCA can
lead to simpler and more efficient models that generalize better to new data.
