Lab #3
Analytics Lab
By
Dr. Akriti Nigam
Assistant Professor
Birla Institute of Technology, Mesra, Ranchi
What is Feature Reduction?
• Feature reduction, also known as dimensionality reduction, is the process of
reducing the number of features in a dataset without losing important
information, which lowers the computational cost of working with the data.
• There are many techniques by which feature reduction is accomplished.
• Another benefit of feature reduction is that it makes data easier to visualize for
humans, particularly when the data is reduced to two or three dimensions which
can be easily displayed graphically.
• An interesting problem that feature reduction can help with is called the curse of
dimensionality.
• This refers to a group of phenomena in which a problem will have so many
dimensions that the data becomes sparse.
• Feature reduction is used to decrease the number of dimensions, making the data
less sparse and more statistically significant for machine learning applications.
Principal Component Analysis
import numpy as np
import pandas as pd
• Importing Dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)
• Preprocessing
The first preprocessing step is to divide the dataset into a feature set
and corresponding labels.
The next preprocessing step is to divide data into training and test
sets.
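The two preprocessing steps above can be sketched as follows. This uses scikit-learn's bundled copy of the iris data so the snippet runs offline (a stand-in for the CSV loaded above); the 80/20 split ratio and random seed are illustrative choices, and the standardization step is added because PCA is sensitive to feature scale:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Bundled copy of the iris data (stand-in for the CSV loaded above).
X, y = load_iris(return_X_y=True)

# Hold out 20% of the rows as a test set (illustrative ratio and seed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# PCA is scale-sensitive, so standardize using training statistics only.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```

Fitting the scaler on the training set alone avoids leaking test-set statistics into the model.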
from sklearn.model_selection import train_test_split
X = dataset.drop('Class', axis=1).values
y = dataset['Class'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn.decomposition import PCA
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
The PCA class exposes explained_variance_ratio_, which returns the fraction of
the total variance explained by each principal component.
explained_variance = pca.explained_variance_ratio_
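As a quick illustration of how these ratios guide the choice of k, they can be cumulated to find the smallest number of components retaining, say, 95% of the variance (the threshold is an illustrative choice; standardized iris data is assumed, as in the preprocessing above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the iris features, then fit PCA keeping all components.
X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

explained_variance = pca.explained_variance_ratio_
cumulative = np.cumsum(explained_variance)

# Smallest k whose leading components retain at least 95% of the variance.
k = int(np.searchsorted(cumulative, 0.95)) + 1
print(explained_variance, k)
```

On the standardized iris data the first component alone explains roughly 73% of the variance, and two components are enough to cross the 95% threshold.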
Any number k of principal components can be retained to train the model; for example, with k = 1:
pca = PCA(n_components=1)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
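The evaluation below presupposes a trained classifier producing y_pred, a step the slides do not show. One possible sketch, using the random forest that the lab assignment calls for (the hyperparameters, split ratio, and seed are illustrative, and scikit-learn's bundled iris copy stands in for the CSV):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Standardize, then project onto the first principal component (k = 1).
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
pca = PCA(n_components=1)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

# Train a classifier on the reduced features and predict on the test set.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```

With y_pred in hand, the confusion matrix and accuracy below can be computed.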
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print('Accuracy:', accuracy_score(y_test, y_pred))
Lab Assignment
1. Apply PCA on the iris dataset (without using inbuilt function). Generate the
scatter plot of the data before and after applying PCA (across the top 2 PCs).
2. Generate the scree plot for all four PCs.
3. Apply Random Forest on the iris dataset using 2 PCs from the above step (k=2) and
using 3 PCs (k=3).
4. Compare the results of step 3.