Lab #3

Data Analytics Lab
By
Dr. Akriti Nigam
Assistant Professor
Birla Institute of Technology, Mesra, Ranchi
What is Feature Reduction?
• Feature reduction, also known as dimensionality reduction, is the process of
reducing the number of features in a dataset, and hence the computational cost of
working with it, without losing important information.
• There are many techniques by which feature reduction is accomplished.
• Another benefit of feature reduction is that it makes data easier for humans to
visualize, particularly when the data is reduced to two or three dimensions, which
can be easily displayed graphically.
• An interesting problem that feature reduction can help with is called the curse of
dimensionality.
• This refers to a group of phenomena in which a problem has so many dimensions
that the data becomes sparse, as the short sketch below illustrates.
• Feature reduction decreases the number of dimensions, making the data less
sparse and giving machine learning algorithms a sounder statistical footing.
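
A minimal NumPy sketch (not from the slides; the sample size and dimensions are purely illustrative) of why high-dimensional data becomes sparse: as the dimension grows, distances between points concentrate around a common value, so neighbourhoods lose their meaning.

import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    points = rng.random((1000, d))  # 1000 random points in the d-dimensional unit cube
    dists = np.linalg.norm(points - points.mean(axis=0), axis=1)
    # The relative spread of distances shrinks as d grows: the points become
    # nearly equidistant, the hallmark of the curse of dimensionality.
    print(f"d={d:4d}  std/mean of distances = {dists.std() / dists.mean():.3f}")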
Principal Component Analysis

• Principal component analysis, or PCA, is a statistical technique to convert
high-dimensional data to low-dimensional data by constructing new features, the
principal components, as linear combinations of the original features that capture
the maximum information about the dataset.
• The components are ranked by the variance they explain. The component that
captures the highest variance is the first principal component. The component
responsible for the second-highest variance is the second principal component, and
so on.
The components that explain a large share of the variance are retained, while the components that explain
little variance can be dropped; a from-scratch sketch of this computation is given below.
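
The lab assignment at the end asks for PCA without the inbuilt function, so here is a minimal from-scratch sketch of the computation just described (the function name and variables are illustrative, not from the slides):

import numpy as np

def pca_fit_transform(X, k):
    X_centered = X - X.mean(axis=0)         # center each feature at zero
    cov = np.cov(X_centered, rowvar=False)  # feature-by-feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]       # sort components by decreasing variance
    components = eigvecs[:, order[:k]]      # keep the top-k principal directions
    return X_centered @ components          # project the data onto them (n x k)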
Advantages of PCA

There are two main advantages of dimensionality reduction with PCA.

• The training time of the algorithms reduces significantly with a smaller
number of features.
• It is not always possible to analyze data in high dimensions. For instance, if
there are 100 features in a dataset, the total number of pairwise scatter plots
required to visualize the data would be 100(100-1)/2 = 4950. Practically, it is not
possible to analyze data this way.
Normalization of Features

• It is imperative to mention that a feature set must be normalized before
applying PCA (a sketch of this scaling step follows below). For instance, if a
feature set has data expressed in units of kilograms, light years, or millions, the
variance scale is huge in the training set.
• Hence, principal components will be biased towards features with high
variance, leading to false results.
• PCA is a statistical technique and can only be applied to numeric data.
Therefore, categorical features must be converted into numerical features before
PCA can be applied.
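
A minimal sketch of this scaling step using Scikit-Learn's StandardScaler, assuming the X_train and X_test produced by the train/test split in the implementation below (the code there otherwise omits scaling):

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # learn mean and std from the training set only
X_test = sc.transform(X_test)        # apply the same scaling to the test set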
Implementing PCA with Scikit-Learn
• Importing Libraries

import numpy as np
import pandas as pd

• Importing Dataset

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)
• Preprocessing
The first preprocessing step is to divide the dataset into a feature set and
corresponding labels.

X = dataset.drop('Class', axis=1)  # feature set: the four measurement columns
y = dataset['Class']               # labels: the species column

The next preprocessing step is to divide the data into training and test sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


• Applying PCA
Performing PCA using Scikit-Learn is a two-step process:
i. Initialize the PCA class by passing the number of components to the
constructor.
ii. Call the fit and then transform methods by passing the feature set to
these methods.
from sklearn.decomposition import PCA

pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

The PCA class exposes explained_variance_ratio_, which returns the fraction of the total variance
explained by each of the principal components.

explained_variance = pca.explained_variance_ratio_
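
A common way to choose the number of components k is to inspect the cumulative explained variance; a minimal sketch (assuming the pca object fitted above; the 0.95 threshold is purely illustrative):

import numpy as np

cumulative = np.cumsum(explained_variance)  # running total of variance explained
k = int(np.argmax(cumulative >= 0.95)) + 1  # smallest k covering 95% of the variance
print(cumulative, k)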
Any number k of principal components can be used to train the algorithm. For example, with a single component:

from sklearn.decomposition import PCA

pca = PCA(n_components=1)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

• Training and Making Predictions

from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
• Performance Evaluation

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

cm = confusion_matrix(y_test, y_pred)
print(cm)
print('Accuracy:', accuracy_score(y_test, y_pred))
Lab Assignment
1. Apply PCA on the iris dataset (without using the inbuilt function). Generate the
scatter plot of the data before and after applying PCA (across the top 2 PCs).
2. Generate the scree plot for all four PCs.
3. Apply Random Forest on the iris dataset using 2 PCs from the above step (k=2)
and using 3 PCs (k=3).
4. Compare the results of step 3.
