
UNIT-3

Building good training data sets

 Dealing with missing data


 Handling categorical data
 Partitioning a data set into separate training and test datasets
 Bringing features onto the same scale
 Selecting meaningful features
 Assessing feature importance with random forests

Compressing data via dimensionality reduction

 Unsupervised dimensionality reduction via PCA


 Supervised data compression via linear discriminant analysis

Dealing with missing data:

Missing values can arise for many reasons, for example:

a. Faulty sensors
b. Incorrect data entered manually
c. A mass update of a table with an unwanted value
d. People not responding to a survey
e. Many other reasons

Missing data means the absence of observations in columns. It appears as values such as "0", "NA", "NaN", "NULL", "Not Applicable", or "None".

type(None)

Output: NoneType

import numpy as np

type(np.nan)

Output: float

NaN is an undefined number ("not a number").

np.isnan(np.nan)

Output: True
Why does a dataset have missing values?

The cause can be data corruption, failure to record data, lack of information, incomplete results, a person not providing the data intentionally, a system or equipment failure, and so on. There could be any number of reasons for missing values in your dataset.

Why handle missing values?

One of the biggest impacts of missing data is that it can bias the results of machine learning models or reduce their accuracy. It is therefore very important to handle missing values.

How to check for missing data?

The first step in handling missing values is to look at the data carefully and find all the missing values. To check for missing values in a pandas DataFrame, we use the functions isnull() and notnull(), which check whether a value is NaN and return boolean values.

dataset = pd.read_csv('train.csv')

Check the data:

dataset[dataset['age'].isnull()].head()

dataset[dataset['age'].notnull()].head()

How does it behave if we try to replace some value with NaN?

dataset['sex'] = dataset['sex'].replace('female', np.nan)

How to handle missing data?

Missing values are handled differently depending on whether they occur in continuous or categorical columns, because the methods differ between these two data types. Using the dtypes attribute in pandas we can filter the columns of a dataset by type.

There are THREE ways to treat missing values in a dataset:

 DROPPING

 IMPUTION

 PREDICTIVE MODEL

Dropping missing values

This method is commonly used to handle null values. It is easy to implement and requires no manipulation of the data. Whether it is appropriate varies from case to case, depending on how much information you think the variable carries. If the dataset information is valuable or the training dataset has few records, deleting rows can have a negative impact on the analysis. Deletion works well when the data is missing completely at random (MCAR), but for non-random missing values it can create bias in the dataset if a large amount of a particular type of variable is deleted. The three deletion strategies are listed below (a short pandas sketch follows the list):

1. deleting rows (listwise deletion)

2. deleting columns

3. pairwise deletion
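
A minimal pandas sketch of the three strategies, assuming a small illustrative DataFrame (the column names 'age' and 'salary' are hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 40], 'salary': [50000, 60000, np.nan]})

# 1. Listwise deletion: drop every row that has at least one missing value
rows_dropped = df.dropna(axis=0)

# 2. Column deletion: drop every column that has at least one missing value
cols_dropped = df.dropna(axis=1)

# 3. Pairwise deletion: each statistic uses all value pairs that are available;
#    pandas does this by default when computing correlations
pairwise_corr = df.corr()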

Imputing missing values

There exist many approaches to missing-data imputation, and the choice usually depends on your problem and on how your algorithm behaves. We will look at missing data in time-series problems and in general problems.

Time-series problem: Time series datasets may contain trends and seasonality. A trend exists when there is a long-term increase or decrease in the series, which can result in a varying mean over time. Seasonality exists when the data is influenced by seasonal factors, such as the day of the week, the month, or the quarter of the year, which can result in a changing variance over time.

So, time series with missing data can be divided into three categories (a short pandas sketch follows this list):

1. data without trend and without seasonality: missing values can be filled with the mean, median, mode, or by random sample imputation

2. data with trend but without seasonality: missing values can be filled by linear interpolation

3. data with trend and seasonality: missing values can be filled by seasonal adjustment plus interpolation
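
A minimal pandas sketch of the first two cases, using a small hypothetical daily series (the values are only illustrative):

import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, 12.0, np.nan, 15.0],
              index=pd.date_range('2024-01-01', periods=5, freq='D'))

# Case 1: no trend, no seasonality -- fill with a simple statistic such as the mean
s_mean_filled = s.fillna(s.mean())

# Case 2: trend but no seasonality -- linear interpolation follows the trend
s_interpolated = s.interpolate(method='linear')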

General problem: The method of handling missing values differs between the two data types, continuous and categorical.

1. Missing values in continuous data can be filled by imputing the mean, median, or mode, or by multiple imputation

2. Missing values in categorical data can be filled with the mode or by multiple imputation

MEAN AND MEDIAN

If the features are numeric, you can use simple approaches such as the mean or the median. This is the most common method of imputing missing values in numeric columns; scikit-learn's SimpleImputer can be used for this. If there are outliers, the mean is not appropriate; it is better to use the median (the middlemost value) for imputation in that case.

MODE (Frequent Category Imputation)

The mode is the most frequently occurring value and is used for categorical features. This technique replaces the missing values with the most frequent value (the mode) of that column.

In some cases, imputing with the previous value (forward fill, ffill), a single constant value, or the next value (backward fill, bfill) instead of the mean, mode, or median is more appropriate. These fills can be used with strings as well as numeric data; pandas provides the fillna function for this, as sketched below.
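
A minimal pandas sketch of these fills, assuming a hypothetical 'city' column:

import pandas as pd
import numpy as np

df = pd.DataFrame({'city': ['Delhi', np.nan, 'Mumbai', np.nan]})

# Fill with the most frequent value (mode) of the column
df['city_mode'] = df['city'].fillna(df['city'].mode()[0])

# Forward fill: propagate the previous observed value
df['city_ffill'] = df['city'].ffill()

# Backward fill: propagate the next observed value
df['city_bfill'] = df['city'].bfill()

# Fill with a single constant value
df['city_const'] = df['city'].fillna('Unknown')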

Other Imputation Methods:

There are many other techniques to impute missing values; a few are given below in addition to the methods above.

Multiple imputation is flexible and is essentially an iterative form of stochastic imputation. It is a statistical technique for handling missing data that preserves sample size and statistical power.

The KNNImputer and IterativeImputer classes impute missing values using a multivariate approach, in which more than one feature is taken into consideration (a short sketch follows).
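
A minimal sketch using scikit-learn's KNNImputer on a small illustrative array (the values are only an example):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, 6.0]])

# Each missing entry is replaced by the mean of that feature over the
# n_neighbors nearest rows, measured on the features that are present
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)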

Arbitrary Value Imputation is a technique that can handle both numerical and categorical variables. It should be used when data is not missing at random, or with tree-based algorithms, but not with linear or logistic regression. The missing values in a column are grouped together and assigned a new value such as 999, -999, "Missing", or "Not defined". It is easy to use, but it can create outliers.

Predictive Model

This approach runs a predictive model to estimate values that substitute for the missing data: you predict the missing values using the non-missing data. We divide the dataset into two parts, one with no missing values in the target column as the training set and the other, containing the missing values, as the test set. We then use the training set to build a model that predicts the missing values (a sketch follows).
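
A minimal sketch of this idea, assuming a DataFrame with a numeric 'age' column to be predicted from a numeric 'salary' column (both column names and the regressor choice are only illustrative):

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({'salary': [30, 40, 50, 60, 70],
                   'age': [25, 30, np.nan, 40, np.nan]})

known = df[df['age'].notnull()]     # rows with 'age' present -> training set
unknown = df[df['age'].isnull()]    # rows with 'age' missing -> to be predicted

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(known[['salary']], known['age'])

# Fill the missing ages with the model's predictions
df.loc[df['age'].isnull(), 'age'] = model.predict(unknown[['salary']])
print(df)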

Example code using Python:

1. Remove the missing values

2. Use an imputer (if we do not want to remove the records, an imputer can be applied to columns that are numeric or float)

3. Group by and fill the missing values

import pandas as pd

import numpy as np

df = pd.read_csv('datapreprocessing.csv')

Get the rows that contain null (NaN) values:

df.isnull().sum()

Data without missing values with respect to columns:

data_without_missing_values_cols = df.dropna(axis=1)

Data without missing values with respect to rows:

data_without_missing_values_rows = df.dropna(axis=0)

If we want to get the columns in a dataset that have missing values, we can use the following approach:

cols_with_missing_values = [col for col in df.columns if df[col].isnull().any()]

To drop the columns that contain the missing values, we follow the approach below:

reduced_data = df.drop(cols_with_missing_values, axis=1)

Example 2: imputer

Second approach: an ML dataset has two parts:

1. features: input variables or independent variables

2. labels: output variables, target variables, or dependent variables

See the data:

df.info()

features = df.iloc[:, :-1].values

Output:

type(features)

features.shape

labels = df.iloc[:, -1].values

Output:

labels

labels.shape

from sklearn.preprocessing import Imputer   # fit and transform; newer scikit-learn versions use SimpleImputer from sklearn.impute instead

imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)

Two-step transformation: fit and transform.

imputer.fit(features[:, [1, 6]])

features[:, [1, 6]] = imputer.transform(features[:, [1, 6]])

(fit_transform combines both steps: features[:, [1, 6]] = imputer.fit_transform(features[:, [1, 6]]))

Output:

features

df1 = pd.DataFrame(features)

Output:

df1.head(10)

cols = ['occupation', 'employment status', 'employment type']

If we want to replace the missing values with the mode values:

df[cols] = df[cols].fillna(df.mode().iloc[0])

Output: df.info()
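
A minimal sketch of the group-by fill listed above as approach 3, assuming hypothetical 'sex' and 'age' columns:

import pandas as pd
import numpy as np

df = pd.DataFrame({'sex': ['male', 'female', 'male', 'female'],
                   'age': [30, np.nan, np.nan, 25]})

# Fill each missing age with the mean age of that row's group
df['age'] = df['age'].fillna(df.groupby('sex')['age'].transform('mean'))
print(df)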

SimpleImputer

1. Impute with mean values

2. Impute with median values

3. Impute with modal (most frequent) values

4. Impute with constant values

Mean:

import pandas as pd

import numpy as np

from sklearn.impute import SimpleImputer

df2 = pd.DataFrame()

df2['col'] = [75, 88, np.nan, 94, 168, np.nan, 543]

mean_imputer = SimpleImputer(strategy='mean')

df2.iloc[:, :] = mean_imputer.fit_transform(df2)

Median:

df2 = pd.DataFrame()

df2['col'] = [75, 88, np.nan, 94, 168, 220, 543]

median_imputer = SimpleImputer(strategy='median')

df2.iloc[:, :] = median_imputer.fit_transform(df2)

print(df2)

Mode:

df2 = pd.DataFrame()

df2['col'] = [75, 88, np.nan, 94, 94, np.nan, 543]

mode_imputer = SimpleImputer(strategy='most_frequent')

df2.iloc[:, :] = mode_imputer.fit_transform(df2)

print(df2)

Constant:

df2 = pd.DataFrame()

df2['col'] = [75, 88, np.nan, 94, 94, np.nan, 543]

constant_imputer = SimpleImputer(strategy='constant', fill_value=100)

df2.iloc[:, :] = constant_imputer.fit_transform(df2)

print(df2)

Handling Categorical Data :


Data that takes only a limited number of values is referred to as categorical data; the values are often known as categories or levels, and categorical data is described in two ways: nominal or ordinal. Data that lacks any intrinsic order, such as colors, genders, or animal species, is nominal categorical data, while ordinal categorical data is naturally ranked or ordered, such as customer satisfaction levels or educational attainment.

Categorical data is often represented as text labels, while many machine learning algorithms require numerical input. Customer demographics, product classifications, and geographic areas are just a few examples of real-world data that include categorical features, which must be converted into a numerical representation before being used in machine learning algorithms.

Ways to Handle Categorical Data

Example 1 - One Hot Encoding

One-Hot Encoding is a technique used to convert categorical data into numerical format. It
creates a binary vector for each category in the dataset. The vector contains a 1 for the category it
represents and 0s for all other categories

The pandas and scikit-learn libraries provide functions to perform One-Hot Encoding. The
following code snippet shows how to perform One-Hot Encoding using pandas and scikit-learn.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a pandas DataFrame with categorical data


df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red']})

# Create an instance of OneHotEncoder


encoder = OneHotEncoder()

# Fit and transform the DataFrame using the encoder


encoded_data = encoder.fit_transform(df)

# Convert the encoded data into a pandas DataFrame


encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out())
print(encoded_df)

Output
   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          1.0        0.0
4         0.0          0.0        1.0
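
For comparison, pandas also provides a one-line alternative for one-hot encoding; a minimal sketch on the same illustrative 'color' column:

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red']})

# get_dummies creates one binary indicator column per category
encoded_df = pd.get_dummies(df, columns=['color'])
print(encoded_df)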

Example 2 - Ordinal Encoding

Ordinal encoding is a popular technique for encoding categorical data in which each category is given a different numerical value based on its rank or order. The categories with the lowest rank receive the smallest integers, while those with the highest rank receive the largest integers. This strategy is especially useful when the categories have a natural order, as with ratings (poor, fair, good, outstanding) or educational attainment (high school, college, graduate school). Let us do ordinal encoding using pandas and the category_encoders package:

import pandas as pd
import category_encoders as ce

# create a sample dataset


data = {'category': ['red', 'green', 'blue', 'red', 'green']}
df = pd.DataFrame(data)

# initialize the encoder


encoder = ce.OrdinalEncoder()

# encode the categorical feature


df['category_encoded'] = encoder.fit_transform(df['category'])

# print the encoded dataframe


print(df)

Output
category category_encoded
0 red 1
1 green 2
2 blue 3
3 red 1
4 green 2

As you can see, the red category has been given the value 1, green has been given the value 2,
and blue has been given the value 3. The sequence in which the categories occurred in the
original dataset served as the basis for this encoding.

Example 3: Target Encoding using Category Encoders

Target Encoding is another technique used for encoding categorical data, particularly when
dealing with high cardinality features. It replaces each category with the average target value for
that category. Target Encoding is useful when there is a strong relationship between the
categorical feature and the target variable.

import pandas as pd
import category_encoders as ce

# create a sample dataset


data = {'category': ['red', 'green', 'blue', 'red', 'green'], 'target': [1, 0, 1, 0, 1]}
df = pd.DataFrame(data)

# initialize the encoder


encoder = ce.TargetEncoder()

# encode the categorical feature


df['category_encoded'] = encoder.fit_transform(df['category'], df['target'])

# print the encoded dataframe


print(df)

In this example, we create a sample dataset with a single categorical feature called "category"
and a corresponding target variable called "target". We import the category_encoders library and
initialize a TargetEncoder object. We use the fit_transform() method to encode the categorical
feature based on the target variable and add the encoded feature to the original dataframe.

Output
category target category_encoded
0 red 1 0.585815
1 green 0 0.585815
2 blue 1 0.652043
3 red 0 0.585815
4 green 1 0.585815

How to split a Dataset into Train and Test Sets using Python
The train-test split is used to estimate the performance of machine learning algorithms on data they were not trained on. It is a fast and easy procedure: we train a model on one part of the data and compare its predictions on the held-out part against the known results. A common convention is to use about 70% of the data for training and 30% for testing; scikit-learn's own default, when no sizes are given, is a 75%/25% split.
Dataset Splitting:

Scikit-learn alias sklearn is the most useful and robust library for machine learning in Python.
The scikit-learn library provides us with the model_selection module in which we have the
splitter function train_test_split().
Syntax:
train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True,
stratify=None)
Parameters:
1. *arrays: inputs such as lists, arrays, data frames, or matrices
2. test_size: a float between 0.0 and 1.0 that represents the proportion of the data to put in the test set. Its default value is None.
3. train_size: a float between 0.0 and 1.0 that represents the proportion of the data to put in the training set. Its default value is None.
4. random_state: controls the shuffling applied to the data before the split. It acts as a seed.
5. shuffle: whether to shuffle the data before splitting. Its default value is True.
6. stratify: used to split the data in a stratified fashion.

Example : real estate file

Code:

# import modules
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# read the dataset


df = pd.read_csv('Real estate.csv')

# get the locations


X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# split the dataset


X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.05, random_state=0)
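
For classification problems it is often useful to keep the class proportions the same in both sets. A minimal, self-contained sketch of a stratified split (the iris dataset is used only as an illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class distribution of y the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)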

What is Feature Scaling?


Feature Scaling is a technique to standardize the independent features present in the data to a fixed range. It is performed during data pre-processing to handle highly varying magnitudes, values, or units. If feature scaling is not done, a machine learning algorithm tends to treat features with larger values as more important and features with smaller values as less important, regardless of the unit of the values.
Why use Feature Scaling?
In machine learning, feature scaling is employed for a number of purposes:
 Scaling guarantees that all features are on a comparable scale and have comparable ranges.
This process is known as feature normalisation. This is significant because the magnitude of
the features has an impact on many machine learning techniques. Larger scale features may
dominate the learning process and have an excessive impact on the outcomes. You can
avoid this problem and make sure that each feature contributes equally to the learning
process by scaling the features.
 Algorithm performance improvement: Several machine learning methods, including gradient descent-based algorithms, distance-based algorithms (such as k-nearest neighbours), and support vector machines, perform better or converge more quickly when the features are scaled. Scaling the features can therefore hasten the convergence of the algorithm to the optimal result.
 Preventing numerical instability: Numerical instability can be prevented by avoiding significant scale disparities between features. In distance calculations or matrix operations, for example, features with radically different scales can cause numerical overflow or underflow problems. Scaling the features mitigates these issues and ensures stable computations.
 Equal treatment of features: Scaling ensures that each feature is given the same consideration during the learning process. Without scaling, larger-scale features could dominate the learning and produce skewed outcomes. Scaling removes this bias and guarantees that each feature contributes fairly to the model's predictions.
Absolute Maximum Scaling
This method of scaling requires two steps:
1. First select the maximum absolute value out of all the entries of a particular column.
2. Then divide each entry of the column by this maximum absolute value.

After performing these two steps, each entry of the column lies in the range -1 to 1. This method is not used very often because it is too sensitive to outliers, and outliers are very common in real-world data.
For demonstration we will use a simplified version of the house price prediction dataset, keeping only two columns from the original dataset. The first five rows of the data are shown below:

import pandas as pd

df = pd.read_csv('SampleFile.csv')
print(df.head())

Output:
LotArea MSSubClass
0 8450 60
1 9600 20
2 11250 60
3 9550 70
4 14260 60

Now let’s apply the first method which is of the absolute maximum scaling. For this first, we
are supposed to evaluate the absolute maximum values of the columns.

import numpy as np

max_vals = np.max(np.abs(df))

max_vals

Output:
LotArea 215245
MSSubClass 190
dtype: int64

Strictly, absolute maximum scaling only divides by the maximum absolute value (df / max_vals), which maps the columns into the range -1 to 1. The snippet below, as given in the source, additionally subtracts the maximum before dividing, which shifts the values:

print((df - max_vals) / max_vals)

Output:
LotArea MSSubClass
0 -0.960742 -0.684211
1 -0.955400 -0.894737
2 -0.947734 -0.684211
3 -0.955632 -0.631579
4 -0.933750 -0.684211
... ... ...
1455 -0.963219 -0.684211
1456 -0.938791 -0.894737
1457 -0.957992 -0.631579
1458 -0.954856 -0.894737
1459 -0.953834 -0.894737

[1460 rows x 2 columns]
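
For reference, a minimal sketch of the strict divide-only form of absolute maximum scaling on the same DataFrame df:

max_vals = np.max(np.abs(df))

# Divide each column by its maximum absolute value; the result lies in [-1, 1]
print((df / max_vals).head())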

Min-Max Scaling
This method of scaling requires the following two steps:
1. First, find the minimum and the maximum value of the column.
2. Then subtract the minimum value from each entry and divide the result by the difference between the maximum and the minimum value.

Because it uses the maximum and the minimum value, this method is also prone to outliers, but the data will lie in the range 0 to 1 after performing the two steps.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

scaled_data = scaler.fit_transform(df)

scaled_df = pd.DataFrame(scaled_data,

columns=df.columns)

scaled_df.head()

Output:
LotArea MSSubClass
0 0.033420 0.235294
1 0.038795 0.000000
2 0.046507 0.235294
3 0.038561 0.294118
4 0.060576 0.235294

Normalization
Mean normalization is similar to min-max scaling, but instead of the minimum value we subtract the mean of the column from each entry and then divide the result by the difference between the maximum and the minimum value. Note, however, that scikit-learn's Normalizer used below does something different: it rescales each row (sample) to unit norm rather than scaling the columns.

from sklearn.preprocessing import Normalizer

scaler = Normalizer()

scaled_data = scaler.fit_transform(df)

scaled_df = pd.DataFrame(scaled_data,

columns=df.columns)

print(scaled_df.head())

Output:
LotArea MSSubClass
0 0.999975 0.007100
1 0.999998 0.002083
2 0.999986 0.005333
3 0.999973 0.007330
4 0.999991 0.004208
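
A minimal pandas sketch of the column-wise mean normalization described above (note again that this is not what Normalizer computes), assuming df is the two-column DataFrame loaded earlier:

# Subtract the column mean and divide by the column range (max - min)
mean_normalized_df = (df - df.mean()) / (df.max() - df.min())
print(mean_normalized_df.head())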

Standardization
This method of scaling is basically based on the central tendencies and variance of the data.
1. First, we should calculate the mean and standard deviation of the data we would like to
normalize.
2. Then we are supposed to subtract the mean value from each entry and then divide the result
by the standard deviation.
This gives the data a mean equal to zero and a standard deviation equal to 1 (it rescales the data but does not by itself turn a skewed distribution into a normal one).
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaled_data = scaler.fit_transform(df)

scaled_df = pd.DataFrame(scaled_data,

columns=df.columns)

print(scaled_df.head())

Output:
LotArea MSSubClass
0 -0.207142 0.073375
1 -0.091886 -0.872563
2 0.073480 0.073375
3 -0.096897 0.309859
4 0.375148 0.073375

Robust Scaling
In this method of scaling, we use two main statistical measures of the data.
 Median
 Inter-Quartile Range
After calculating these two values we are supposed to subtract the median from each entry and
then divide the result by the interquartile range.

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,

columns=df.columns)

print(scaled_df.head())

Output:
LotArea MSSubClass
0 -0.254076 0.2
1 0.030015 -0.6
2 0.437624 0.2
3 0.017663 0.4
4 1.181201 0.2

Feature Importance with Random Forests




Features in machine learning play a significant role in model accuracy. Exploring feature importance in Random Forests enhances model performance and efficiency.
What is Feature Importance?
Features in machine learning, also known as variables or attributes, are individual measurable
properties or characteristics of the phenomena being observed. They serve as the input to the
model, and their quality and quantity can greatly influence the accuracy and efficiency of the
model. There are three primary categories of features:
 Numerical Features: These features represent quantitative data, expressed as numerical
values (integers or decimals). Examples include temperature (°C), weight (kg), and age
(years).
 Categorical Features: These features represent qualitative data, signifying the category to
which a data point belongs. Examples include hair color (blonde, brunette, black) and
customer satisfaction (satisfied, neutral, dissatisfied).
 Ordinal Features: These features are a subtype of categorical features, possessing an
inherent order or ranking. Examples include movie ratings (1 star, 2 stars, etc.) and customer
service experience (poor, average, excellent).
Why Feature Importance Matters?
Understanding feature importance offers several advantages:
 Enhanced Model Performance: By identifying the most influential features, you can
prioritize them during model training, leading to more accurate predictions.
 Faster Training Times: Focusing on the most relevant features streamlines the training
process, saving valuable time and computational resources.
 Reduced Overfitting: Overfitting occurs when a model memorizes the training data instead
of learning general patterns. By focusing on important features, you can prevent the model
from becoming overly reliant on specific data points.
Feature Importance in Random Forests
Random Forests, a popular ensemble learning technique, are known for their efficiency and
interpretability. They work by building numerous decision trees during training, and the final
prediction is the average of the individual tree predictions.
Several techniques can be employed to calculate feature importance in Random Forests, each
offering unique insights:
 Built-in Feature Importance: This method utilizes the model’s internal calculations to
measure feature importance, such as Gini importance and mean decrease in accuracy.
Essentially, this method measures how much the impurity (or randomness) within a node of
a decision tree decreases when a specific feature is used to split the data.
 Permutation feature importance: Permutation importance assesses the significance of each
feature independently. By evaluating the impact of individual feature permutations on
predictions, it calculates importance.
 SHAP (SHapley Additive exPlanations) Values: SHAP values delve deeper by explaining
the contribution of each feature to individual predictions. This method offers a
comprehensive understanding of feature importance across various data points.
Feature Importance in Random Forests: Implementation
The iris dataset is used below to demonstrate the implementation of feature importance.
Prerequisites: install the necessary libraries
!pip install shap
Python3

from sklearn.datasets import load_iris


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import shap
from sklearn.metrics import accuracy_score
import numpy as np

iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split dataset into 75% train and 25% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
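
The snippet above only trains the forest. A minimal sketch of how the three techniques described earlier could be applied from here (permutation_importance and shap.TreeExplainer are standard APIs; the exact format of shap's output varies between shap versions):

# Built-in (impurity-based) feature importance
importances = pd.Series(clf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))

# Permutation importance, evaluated on the held-out test set
from sklearn.inspection import permutation_importance
perm = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)
print(pd.Series(perm.importances_mean, index=feature_names).sort_values(ascending=False))

# SHAP values for individual predictions (depending on the shap version, the
# result is a list of per-class arrays or a single 3-D array)
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_test)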
PART-2

Compressing data via dimensionality reduction:


 Unsupervised dimensionality reduction via PCA
 Supervised data compression via linear discriminant analysis

What is Principal Component Analysis(PCA)?


Principal Component Analysis (PCA) technique was introduced by the mathematician Karl
Pearson in 1901. It works on the condition that while the data in a higher dimensional space is
mapped to data in a lower dimension space, the variance of the data in the lower dimensional
space should be maximum.
 Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. PCA is the most widely used tool in exploratory data analysis and in machine learning for predictive models.
 Principal Component Analysis (PCA) is an unsupervised learning algorithm technique used
to examine the interrelations among a set of variables. It is also known as a general factor
analysis where regression determines a line of best fit.
 The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality of a
dataset while preserving the most important patterns or relationships between the variables
without any prior knowledge of the target variables.
Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set by
finding a new set of variables, smaller than the original set of variables, retaining most of the
sample’s information, and useful for the regression and classification of data.

Overall, PCA is a powerful tool for data analysis and can help to simplify complex datasets,
making them easier to understand and work with.

Python

import pandas as pd
import numpy as np
# Here we are using an inbuilt dataset of scikit-learn
from sklearn.datasets import load_breast_cancer
# instantiating
cancer = load_breast_cancer(as_frame=True)
# creating dataframe
df = cancer.frame
# checking shape
print('Original Dataframe shape :', df.shape)
# Input features
X = df[cancer['feature_names']]
print('Inputs Dataframe shape :', X.shape)

Python

# Mean
X_mean = X.mean()
# Standard deviation
X_std = X.std()
# Standardization
Z = (X - X_mean) / X_std
The covariance matrix helps us visualize how strong the dependency of two features is with
each other in the feature space.
Python

# covariance
c = Z.cov()
# Plot the covariance matrix
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(c)
plt.show()

Python

# Eigendecomposition of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(c)
# Index the eigenvalues in descending order
idx = eigenvalues.argsort()[::-1]
# Sort the eigenvalues in descending order
eigenvalues = eigenvalues[idx]
# Sort the corresponding eigenvectors accordingly
eigenvectors = eigenvectors[:, idx]
Explained variance is the term that gives us an idea of the amount of the total variance which
has been retained by selecting the principal components instead of the original feature space.
Python

explained_var = np.cumsum(eigenvalues) / np.sum(eigenvalues)
explained_var

Determine the Number of Principal Components


Here we can either choose the number of principal components directly or set a threshold on the explained variance. Here I am requiring the explained variance to be at least 50%. Let us check how many principal components this requires.
Python

n_components = np.argmax(explained_var >= 0.50) + 1
n_components


Output:
2

Project the Data onto the Selected Principal Components


 Find the projection matrix. It is a matrix of the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the data; it projects the high-dimensional dataset onto a lower-dimensional subspace.
 The eigenvectors of the covariance matrix of the data are referred to as the principal axes of the data, and the projections of the data instances onto these principal axes are called the principal components.
Python

# PCA components (projection matrix)
u = eigenvectors[:, :n_components]
pca_component = pd.DataFrame(u, index=cancer['feature_names'], columns=['PC1', 'PC2'])
# plotting heatmap
plt.figure(figsize=(5, 7))
sns.heatmap(pca_component)
plt.title('PCA Component')
plt.show()

Python

# Matrix multiplication or dot product
Z_pca = Z @ pca_component
# Rename the columns
Z_pca.rename({'PC1': 'PCA1', 'PC2': 'PCA2'}, axis=1, inplace=True)
# Print the principal component values
print(Z_pca)

PCA using Using Sklearn


Several libraries automate the whole principal component analysis process behind a single function; we just have to pass the number of principal components we would like to keep. Sklearn is one such library and can be used for PCA as shown below.
Python

# Importing PCA
from sklearn.decomposition import PCA
# Let's say, components = 2
pca = PCA(n_components=2)
pca.fit(Z)
x_pca = pca.transform(Z)
# Create the dataframe
df_pca1 = pd.DataFrame(x_pca, columns=['PC{}'.format(i + 1) for i in range(n_components)])
print(df_pca1)
# giving a larger plot
plt.figure(figsize=(8, 6))
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=cancer['target'], cmap='plasma')
# labeling x and y axes
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()
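
After fitting, scikit-learn also reports how much variance each component captures; a one-line check using the pca object fitted above:

# Fraction of the total variance explained by each principal component
print(pca.explained_variance_ratio_)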

Advantages of Principal Component Analysis


1. Dimensionality Reduction: Principal Component Analysis is a popular technique used
for dimensionality reduction , which is the process of reducing the number of variables in a
dataset. By reducing the number of variables, PCA simplifies data analysis, improves
performance, and makes it easier to visualize data.
2. Feature Selection: Principal Component Analysis can be used for feature selection, which
is the process of selecting the most important variables in a dataset. This is useful in
machine learning, where the number of variables can be very large, and it is difficult to
identify the most important variables.
3. Data Visualization: Principal Component Analysis can be used for data visualization. By
reducing the number of variables, PCA can plot high-dimensional data in two or three
dimensions, making it easier to interpret.
4. Multicollinearity: Principal Component Analysis can be used to deal
with multicollinearity, which is a common problem in a regression analysis where two or
more independent variables are highly correlated. PCA can help identify the underlying
structure in the data and create new, uncorrelated variables that can be used in the
regression model.
5. Noise Reduction: Principal Component Analysis can be used to reduce the noise in data.
By removing the principal components with low variance, which are assumed to represent
noise, Principal Component Analysis can improve the signal-to-noise ratio and make it
easier to identify the underlying structure in the data.
6. Data Compression: Principal Component Analysis can be used for data compression. By
representing the data using a smaller number of principal components, which capture most
of the variation in the data, PCA can reduce the storage requirements and speed up
processing.
7. Outlier Detection: Principal Component Analysis can be used for outlier
detection. Outliers are data points that are significantly different from the other data points
in the dataset. Principal Component Analysis can identify these outliers by looking for data
points that are far from the other points in the principal component space.

Disadvantages of Principal Component Analysis


1. Interpretation of Principal Components: The principal components created by Principal
Component Analysis are linear combinations of the original variables, and it is often
difficult to interpret them in terms of the original variables. This can make it difficult to
explain the results of PCA to others.
2. Data Scaling: Principal Component Analysis is sensitive to the scale of the data. If the data
is not properly scaled, then PCA may not work well. Therefore, it is important to scale the
data before applying Principal Component Analysis.
3. Information Loss: Principal Component Analysis can result in information loss. While
Principal Component Analysis reduces the number of variables, it can also lead to loss of
information. The degree of information loss depends on the number of principal
components selected. Therefore, it is important to carefully select the number of principal
components to retain.
4. Non-linear Relationships: Principal Component Analysis assumes that the relationships
between variables are linear. However, if there are non-linear relationships between
variables, Principal Component Analysis may not work well.
5. Computational Complexity: Computing Principal Component Analysis can be
computationally expensive for large datasets. This is especially true if the number of
variables in the dataset is large.
6. Overfitting: Principal Component Analysis can sometimes result in overfitting, which is
when the model fits the training data too well and performs poorly on new data. This can
happen if too many principal components are used or if the model is trained on a small
dataset.

What is Linear Discriminant Analysis (LDA)?

Although the standard logistic regression algorithm is limited to two-class problems, Linear Discriminant Analysis is applicable to classification problems with more than two classes.

Linear Discriminant Analysis is one of the most popular dimensionality reduction techniques used for supervised classification problems in machine learning. It is also used as a pre-processing step in machine learning and pattern classification applications.

Whenever there is a requirement to separate two or more classes that have multiple features efficiently, Linear Discriminant Analysis is one of the most common techniques used. For example, if we have two classes with multiple features and try to separate them using a single feature, the classes may overlap; this is why multiple features are used together rather than a single one.

Example:
Assume we have to classify two different classes, each having a set of data points in a 2-dimensional plane. Suppose it is not possible to draw a single straight boundary in the 2-D plane that separates these data points efficiently; using Linear Discriminant Analysis we can reduce the 2-D plane to a 1-D line. Using this technique, we can also maximize the separability between multiple classes.

How Linear Discriminant Analysis (LDA) works?

Linear Discriminant Analysis is used as a dimensionality reduction technique in machine learning, with which we can, for example, project 2-D or 3-D data onto a 1-dimensional line.

Let's consider an example where we have two classes in a 2-D plane with an X-Y axis, and we need to classify them efficiently. As we saw in the example above, LDA enables us to draw a straight line that can completely separate the two classes of data points. LDA uses the X-Y axis to create a new axis by separating the classes with a straight line and projecting the data onto this new axis.

Hence, we can maximize the separation between these classes and reduce the 2-D plane into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:

o It maximizes the distance between means of two classes.


o It minimizes the variance within the individual class.
Using the above two conditions, LDA generates a new axis in such a way that it can maximize
the distance between the means of the two classes and minimizes the variation within each class.

In other words, we can say that the new axis will increase the separation between the data points
of the two classes and plot them onto the new axis.
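
These two conditions can be written compactly as Fisher's criterion, which LDA maximizes over the direction w of the new axis (here S_B is the between-class scatter matrix and S_W the within-class scatter matrix; this standard notation is supplied for reference, not taken from the text above):

J(w) = (w^T S_B w) / (w^T S_W w)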

Why LDA?

o Logistic Regression is one of the most popular classification algorithms that perform well
for binary classification but falls short in the case of multiple classification problems with
well-separated classes. At the same time, LDA handles these quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of features, just as
PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract
useful data from different faces. Coupled with eigenfaces, it produces effective results.

Drawbacks of Linear Discriminant Analysis (LDA)

Although LDA is used to solve supervised classification problems with two or more classes, which standard logistic regression cannot handle, LDA itself fails in cases where the means of the class distributions are shared: it then cannot create a new axis that makes the classes linearly separable.

To overcome such problems, we use non-linear Discriminant analysis in machine learning.

Extension to Linear Discriminant Analysis (LDA)

Linear Discriminant analysis is one of the most simple and effective methods to solve
classification problems in machine learning. It has so many extensions and variations as follows:

1. Quadratic Discriminant Analysis (QDA): each class uses its own estimate of variance (or covariance, when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): used when non-linear combinations of inputs, such as splines, are needed.
3. Regularized Discriminant Analysis (RDA): introduces regularization into the estimate of the variance (actually covariance), moderating the influence of different variables on LDA.

Real-world Applications of LDA

Some of the common real-world applications of Linear discriminant Analysis are given below:

o Face Recognition
Face recognition is a popular application of computer vision, where each face is represented as a combination of a large number of pixel values. Here LDA is used to reduce the number of features to a manageable number before the classification process. It generates a new template in which each dimension consists of a linear combination of pixel values. If the linear combination is generated using Fisher's linear discriminant, the result is called a Fisherface.
o Medical
In the medical field, LDA is used to classify a patient's disease severity as mild, moderate, or severe on the basis of various parameters of the patient's health and the ongoing medical treatment. This classification helps doctors decide whether to increase or decrease the pace of the treatment.
o Customer Identification
LDA is applied in customer identification: it can identify and select the features that characterize the group of customers who are likely to purchase a specific product in a shopping mall. This is helpful when we want to target the group of customers who mostly purchase a particular product.
o For Predictions
LDA can also be used for making predictions and hence in decision making. For example, "will you buy this product?" gives a predicted result in one of two possible classes, buying or not buying.
o In Learning
Nowadays robots are trained to learn and talk in order to simulate human work, and this can also be treated as a classification problem. In this case, LDA builds similar groups on the basis of different parameters such as pitch, frequency, sound, and tune.
Example

# necessary imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# load the iris dataset
iris = load_iris()
dataset = pd.DataFrame(columns=iris.feature_names, data=iris.data)
dataset['target'] = iris.target

# divide the dataset into features and target variable
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

# preprocess the dataset and divide into train and test
sc = StandardScaler()
X = sc.fit_transform(X)
le = LabelEncoder()
y = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# apply Linear Discriminant Analysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

# plot the scatterplot
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='rainbow',
            alpha=0.7, edgecolors='b')

# classify using a random forest classifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# print the accuracy and confusion matrix
print('Accuracy: ' + str(accuracy_score(y_test, y_pred)))
conf_m = confusion_matrix(y_test, y_pred)
print(conf_m)

Advantages of using LDA


1. It is a simple and computationally efficient algorithm.
2. It can work well even when the number of features is much larger than the number of
training samples.
3. It can handle multicollinearity (correlation between features) in the data.
Disadvantages of LDA
1. It assumes that the data has a Gaussian distribution, which may not always be the case.
2. It assumes that the covariance matrices of the different classes are equal, which may not be
true in some datasets.
3. It assumes that the data is linearly separable, which may not be the case for some datasets.
4. It may not perform well in high-dimensional feature spaces.
