Unit 3
Common reasons for missing data:
a. Faulty sensors
b. Incorrect data entered manually
c. A mass update of a table with unwanted values
d. People not responding to surveys
e. Many other reasons
Missing data means the absence of observations in one or more columns. It often appears as values such as "0", "NA", "NaN", "NULL", "Not Applicable", or "None".
type(None)
import numpy as np
type(np.nan)
Output: float
np.isnan(np.nan)
Output: True
Why does a dataset have missing values?
The cause can be data corruption, failure to record data, lack of information, incomplete results, a person intentionally not providing the data, some system or equipment failure, etc. There could be any number of reasons for missing values in your dataset.
One of the biggest impacts of missing data is that it can bias the results of machine learning models or reduce the accuracy of the model. So it is very important to handle missing values.
The first step in handling missing values is to look at the data carefully and find all the missing values. To check for missing values in a pandas DataFrame, we use functions such as isnull() and notnull(), which check whether a value is "NaN" and return boolean values.
dataset = pd.read_csv('train.csv')
dataset[dataset['age'].isnull()].head()
dataset[dataset['age'].notnull()].head()
Missing values can be handled in different ways depending on whether the missing values are in continuous or categorical columns, because the methods for handling missing values differ between these two data types. By using the "dtypes" attribute in pandas we can filter the columns of the dataset by type.
DROPPING
IMPUTATION
PREDICTIVE MODEL
Dropping is commonly used to handle null values. It is easy to implement and no manipulation of the data is required. Whether it is appropriate varies from case to case, depending on how much information you think the variable carries. If the dataset information is valuable or the training dataset has few records, then deleting rows might have a negative impact on the analysis. Deletion methods work well when the missing data is missing completely at random (MCAR), but for non-random missing values they can create bias in the dataset if a large amount of a particular type of variable is deleted.
1. deleting rows
2. deleting columns
3. pairwise deletion
Time-series problem: Time-series datasets may contain trends and seasonality. A trend pattern exists when there is a long-term increase or decrease in the series, which can result in a varying mean over time. Seasonality exists when the data is influenced by seasonal factors, such as the day of the week, the month, or the quarter of the year, which can result in a changing variance over time.
1. Data without trend and without seasonality: missing data can be handled by mean, median, mode or random sample imputation, etc.
2. Data with trend and without seasonality: missing data can be handled by linear interpolation (see the interpolation sketch after this list).
3. Data with trend and with seasonality: missing data can be handled by seasonal adjustment plus interpolation.
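As a rough illustration of item 2, pandas' interpolate() can fill gaps in a series that has a trend; the small series below is a made-up example, not data from these notes.
import pandas as pd
import numpy as np

# A made-up series with an upward trend and two missing entries
s = pd.Series([10, 12, np.nan, 16, 18, np.nan, 22])

# Linear interpolation fills each gap on the straight line between its neighbours
print(s.interpolate(method='linear'))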
General problem: The methods for handling missing values differ between the two data types, continuous and categorical.
1. Missing values in continuous data can be imputed with the mean, median or mode, or with multiple imputation.
2. Missing values in categorical data can be imputed with the mode (the most frequent category) or grouped under a separate category.
The mode is the most frequently occurring value and is used in the case of categorical features. This technique replaces the missing values with the mode of that column, i.e., the value with the highest frequency.
In some cases, imputing the values with the "previous value" (ffill), a single constant value, or the "next value" (bfill), instead of the mean, mode or median, is more appropriate. This can be used with string or numeric data. You can use the 'fillna' function of the pandas library for that, as sketched below.
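A minimal sketch of fillna on a small made-up series (the values are illustrative only):
import pandas as pd
import numpy as np

s = pd.Series([5, np.nan, 7, np.nan, 9])

print(s.ffill())      # "previous value" fill, equivalent to fillna(method='ffill')
print(s.bfill())      # "next value" fill, equivalent to fillna(method='bfill')
print(s.fillna(0))    # fill every missing entry with a single constant value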
There are many other imputation techniques to impute missing values; a few are given below in addition to the above methods.
The KNNImputer and IterativeImputer classes impute missing values using a multivariate approach, in which more than one feature is taken into consideration, as in the sketch below.
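A minimal sketch of KNNImputer on made-up data (the array and the n_neighbors value are illustrative):
import numpy as np
from sklearn.impute import KNNImputer

# Two related features with some missing entries
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [np.nan, 8.0],
              [5.0, 10.0]])

# Each missing value is replaced using the values of the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))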
Arbitrary value imputation is an important technique because it can handle both numerical and categorical variables. It should be used when the data is not missing at random, or with tree-based algorithms, but not with linear regression or logistic regression. In this technique we group the missing values in a column and assign them a new value such as 999, -999, "Missing" or "Not defined". It is easy to use but can create outliers, as in the sketch below.
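A minimal sketch of arbitrary value imputation with pandas fillna (the column names and sentinel values are illustrative):
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 40, np.nan],
                   'city': ['Pune', None, 'Delhi', None]})

# Numerical column: flag missing entries with an arbitrary sentinel value
df['age'] = df['age'].fillna(-999)

# Categorical column: group missing entries under a new category
df['city'] = df['city'].fillna('Missing')

print(df)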
Predictive Model
This approach runs a predictive model to estimate values that substitute for the missing data: you predict the missing values using the non-missing data. We divide the dataset into two parts, one with no missing data used as the training set and one with missing values used as the test set. We then use the training set to build a model that predicts the target variable, and use it to predict the missing values, as sketched below.
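A minimal sketch of this idea, predicting a missing numeric column from another column with linear regression (the DataFrame and column names are made up):
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'experience': [1, 3, 5, 7, 9, 11],
                   'salary': [30, 50, np.nan, 90, np.nan, 130]})

train = df[df['salary'].notnull()]   # rows where the target column is present
test = df[df['salary'].isnull()]     # rows where the target column is missing

model = LinearRegression()
model.fit(train[['experience']], train['salary'])

# Fill the missing entries with the model's predictions
df.loc[df['salary'].isnull(), 'salary'] = model.predict(test[['experience']])
print(df)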
import pandas as pd
import numpy as np
df = pd.read_csv('datapreprocessing.csv')
df.isnull().sum()
data_without_missing_values_cols = df.dropna(axis=1)
data_without_missing_values_rows = df.dropna(axis=0)
If we want to get the columns in the dataset that contain missing values, we use the following approach:
cols_with_missing_values = [col for col in df.columns if df[col].isnull().any()]
To drop the columns that contain the missing values, we follow the approach below:
reduced_data = df.drop(cols_with_missing_values, axis=1)
Example 2: Imputer
df.info()
features = df.iloc[:, :-1].values
output:
type(features)
features.shape
labels = df.iloc[:, -1].values
output:
labels
labels.shape
# the legacy scikit-learn Imputer class; newer versions use SimpleImputer instead
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
Two-step transformation:
features[:, [1, 6]] = imputer.fit_transform(features[:, [1, 6]])
Output:
features
df1 = pd.DataFrame(features)
Output:
df1.head(10)
# fill missing values in the selected columns (cols) with the mode of each column
df[cols] = df[cols].fillna(df.mode().iloc[0])
output: df.info()
SimpleImputer
Mean:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
df2 = pd.DataFrame()
df2['col'] = [75, 88, np.nan, 94, 168, np.nan, 543]
mean_imputer = SimpleImputer(strategy='mean')
df2.iloc[:, :] = mean_imputer.fit_transform(df2)
print(df2)
Median:
df2 = pd.DataFrame()
df2['col'] = [75, 88, np.nan, 94, 168, 220, 543]
median_imputer = SimpleImputer(strategy='median')
df2.iloc[:, :] = median_imputer.fit_transform(df2)
print(df2)
Mode:
df2 = pd.DataFrame()
df2['col'] = [75, 88, np.nan, 94, 94, np.nan, 543]
mode_imputer = SimpleImputer(strategy='most_frequent')
df2.iloc[:, :] = mode_imputer.fit_transform(df2)
print(df2)
Constant:
df2 = pd.DataFrame()
df2['col'] = [75, 88, np.nan, 94, 94, np.nan, 543]
constant_imputer = SimpleImputer(strategy='constant', fill_value=100)
df2.iloc[:, :] = constant_imputer.fit_transform(df2)
print(df2)
Categorical data is often represented as text labels, and many machine learning algorithms
require numerical input data. Customer demographics, product classifications, and geographic
areas are just a few examples of real-world datasets that include categorical data which must be
converted into numerical representation before being used in machine learning algorithms.
One-Hot Encoding is a technique used to convert categorical data into numerical format. It
creates a binary vector for each category in the dataset. The vector contains a 1 for the category it
represents and 0s for all other categories.
The pandas and scikit-learn libraries provide functions to perform One-Hot Encoding. The
following code snippet shows how to perform One-Hot Encoding using pandas and scikit-learn.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from category_encoders import OrdinalEncoder, TargetEncoder
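As a minimal sketch, assuming a small sample colour column (not given above) and scikit-learn 1.0 or newer, the encoding step could look like this; it produces output of the form shown next:
# Hypothetical sample data
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red']})

# Fit the encoder on the raw values and convert the sparse result to a dense array
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[['color']].values).toarray()

# With plain array input, the generated feature names take the form x0_<category>
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())
print(encoded_df)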
Output
x0_blue x0_green x0_red
0 0.0 0.0 1.0
1 1.0 0.0 0.0
2 0.0 1.0 0.0
3 0.0 1.0 0.0
4 0.0 0.0 1.0
Ordinal encoding is a popular technique for encoding categorical data where each category is given a different numerical value based on its rank or order. The categories with the lowest values receive the smallest integers, while those with the highest values receive the largest integers. This strategy is extremely useful when the categories have a natural order, as with ratings (poor, fair, good, outstanding) or educational achievement (high school, college, graduate school). Let us do ordinal encoding using pandas and the category_encoders package:
import pandas as pd
import category_encoders as ce
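As a minimal sketch, assuming the sample categories shown in the output below, the encoding step could look like this:
# Hypothetical sample data
df = pd.DataFrame({'category': ['red', 'green', 'blue', 'red', 'green']})

# category_encoders assigns integers in order of first appearance (red=1, green=2, blue=3)
encoder = ce.OrdinalEncoder(cols=['category'])
df['category_encoded'] = encoder.fit_transform(df[['category']])['category']
print(df)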
Output
category category_encoded
0 red 1
1 green 2
2 blue 3
3 red 1
4 green 2
As you can see, the red category has been given the value 1, green has been given the value 2,
and blue has been given the value 3. The sequence in which the categories occurred in the
original dataset served as the basis for this encoding.
Target Encoding is another technique used for encoding categorical data, particularly when
dealing with high cardinality features. It replaces each category with the average target value for
that category. Target Encoding is useful when there is a strong relationship between the
categorical feature and the target variable.
import pandas as pd
import category_encoders as ce
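As a minimal sketch, assuming the sample categories and binary target shown in the output below (the exact encoded values depend on the encoder's smoothing defaults), the encoding step could look like this:
# Hypothetical sample data
df = pd.DataFrame({'category': ['red', 'green', 'blue', 'red', 'green'],
                   'target': [1, 0, 1, 0, 1]})

# Each category is replaced by a smoothed mean of the target for that category
encoder = ce.TargetEncoder(cols=['category'])
df['category_encoded'] = encoder.fit_transform(df[['category']], df['target'])['category']
print(df)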
In this example, we create a sample dataset with a single categorical feature called "category"
and a corresponding target variable called "target". We import the category_encoders library and
initialize a TargetEncoder object. We use the fit_transform() method to encode the categorical
feature based on the target variable and add the encoded feature to the original dataframe.
Output
category target category_encoded
0 red 1 0.585815
1 green 0 0.585815
2 blue 1 0.652043
3 red 0 0.585815
4 green 1 0.585815
How to split a Dataset into Train and Test Sets using Python
The train-test split is used to estimate the performance of machine learning algorithms on prediction tasks. It is a fast and easy procedure that lets us check how well a model trained on one portion of the data performs on data it has not seen. A common split assigns 30% of the data to the test set and 70% to the training set (if no size is specified, scikit-learn's train_test_split reserves 25% for the test set by default).
Dataset Splitting:
Scikit-learn (imported as sklearn) is one of the most useful and robust libraries for machine learning in Python. It provides the model_selection module, which contains the splitter function train_test_split().
Syntax:
train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True,
stratify=None)
Parameters:
1. *arrays: inputs such as lists, arrays, data frames, or matrices
2. test_size: a float between 0.0 and 1.0 representing the proportion of the dataset to put in the test split. Its default value is None.
3. train_size: a float between 0.0 and 1.0 representing the proportion of the dataset to put in the train split. Its default value is None.
4. random_state: controls the shuffling applied to the data before the split is applied; it acts as a seed.
5. shuffle: whether to shuffle the data before splitting. Its default value is True.
6. stratify: used to split the data in a stratified fashion.
Code:
# import modules
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
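A minimal sketch of the split itself, using the hypothetical house-price columns from the sample file used later in this unit (the choice of feature and target here is purely illustrative):
# Hypothetical example: LotArea as the input feature, MSSubClass as the target
df = pd.read_csv('SampleFile.csv')
X = df[['LotArea']]
y = df['MSSubClass']

# 70% of the rows go to training, 30% to testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)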
Absolute Maximum Scaling
In this method we first find the absolute maximum value of each column, then subtract that value from every entry and divide the result by the same maximum. After performing these two steps, each entry of the column lies in the range of -1 to 1. This method is not used very often, however, because it is too sensitive to outliers, and when dealing with real-world data the presence of outliers is a very common thing.
For demonstration purposes, we will use a dataset that is a simpler version of the original house price prediction dataset, having only two of the original columns. The first five rows of the data are shown below:
import pandas as pd
df = pd.read_csv('SampleFile.csv')
print(df.head())
Output:
LotArea MSSubClass
0 8450 60
1 9600 20
2 11250 60
3 9550 70
4 14260 60
Now let's apply the first method, absolute maximum scaling. For this we first evaluate the absolute maximum values of the columns.
import numpy as np
max_vals = np.max(np.abs(df))
max_vals
Output:
LotArea 215245
MSSubClass 190
dtype: int64
Now we subtract these maximum values from the data and then divide the results by the same maximum values, as in the line below.
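A one-line sketch of that computation (assuming df and max_vals from above); it reproduces the output shown below:
print((df - max_vals) / max_vals)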
Output:
LotArea MSSubClass
0 -0.960742 -0.684211
1 -0.955400 -0.894737
2 -0.947734 -0.684211
3 -0.955632 -0.631579
4 -0.933750 -0.684211
... ... ...
1455 -0.963219 -0.684211
1456 -0.938791 -0.894737
1457 -0.957992 -0.631579
1458 -0.954856 -0.894737
1459 -0.953834 -0.894737
Min-Max Scaling
This method of scaling requires two steps:
1. First, find the minimum and the maximum value of the column.
2. Then subtract the minimum value from each entry and divide the result by the difference between the maximum and the minimum value.
As we are using the maximum and the minimum value, this method is also prone to outliers, but the data ends up in the range 0 to 1 after performing the above two steps.
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
scaled_df.head()
Output:
LotArea MSSubClass
0 0.033420 0.235294
1 0.038795 0.000000
2 0.046507 0.235294
3 0.038561 0.294118
4 0.060576 0.235294
Normalization
This method is more or less the same as the previous one, but instead of the minimum value we subtract each entry by the mean value of the whole column and then divide the result by the difference between the maximum and the minimum value (mean normalization). Note that scikit-learn's Normalizer, used in the code below, actually rescales each row (each sample) to unit norm rather than scaling the columns, which is why the output differs from mean normalization.
from sklearn.preprocessing import Normalizer

scaler = Normalizer()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
Output:
LotArea MSSubClass
0 0.999975 0.007100
1 0.999998 0.002083
2 0.999986 0.005333
3 0.999973 0.007330
4 0.999991 0.004208
Standardization
This method of scaling is based on the central tendency and variance of the data.
1. First, calculate the mean and standard deviation of the data we would like to standardize.
2. Then subtract the mean value from each entry and divide the result by the standard deviation.
This gives the data a mean equal to zero and a standard deviation equal to 1 (standardization by itself does not turn a skewed distribution into a normal one).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data,
columns=df.columns)
print(scaled_df.head())
Output:
LotArea MSSubClass
0 -0.207142 0.073375
1 -0.091886 -0.872563
2 0.073480 0.073375
3 -0.096897 0.309859
4 0.375148 0.073375
Robust Scaling
In this method of scaling, we use two main statistical measures of the data.
Median
Inter-Quartile Range
After calculating these two values we are supposed to subtract the median from each entry and
then divide the result by the interquartile range.
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
scaled_data = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_data, columns=df.columns)
print(scaled_df.head())
Output:
LotArea MSSubClass
0 -0.254076 0.2
1 0.030015 -0.6
2 0.437624 0.2
3 0.017663 0.4
4 1.181201 0.2
Features in machine learning play a significant role in model accuracy. Exploring feature importance in Random Forests enhances model performance and efficiency.
What is Feature Importance?
Features in machine learning, also known as variables or attributes, are individual measurable
properties or characteristics of the phenomena being observed. They serve as the input to the
model, and their quality and quantity can greatly influence the accuracy and efficiency of the
model. There are three primary categories of features:
Numerical Features: These features represent quantitative data, expressed as numerical
values (integers or decimals). Examples include temperature (°C), weight (kg), and age
(years).
Categorical Features: These features represent qualitative data, signifying the category to
which a data point belongs. Examples include hair color (blonde, brunette, black) and
customer satisfaction (satisfied, neutral, dissatisfied).
Ordinal Features: These features are a subtype of categorical features, possessing an
inherent order or ranking. Examples include movie ratings (1 star, 2 stars, etc.) and customer
service experience (poor, average, excellent).
Why Feature Importance Matters?
Understanding feature importance offers several advantages:
Enhanced Model Performance: By identifying the most influential features, you can
prioritize them during model training, leading to more accurate predictions.
Faster Training Times: Focusing on the most relevant features streamlines the training
process, saving valuable time and computational resources.
Reduced Overfitting: Overfitting occurs when a model memorizes the training data instead
of learning general patterns. By focusing on important features, you can prevent the model
from becoming overly reliant on specific data points.
Feature Importance in Random Forests
Random Forests, a popular ensemble learning technique, are known for their efficiency and interpretability. They work by building numerous decision trees during training; the final prediction is the average of the individual tree predictions (or a majority vote for classification).
Several techniques can be employed to calculate feature importance in Random Forests, each
offering unique insights:
Built-in Feature Importance: This method utilizes the model’s internal calculations to
measure feature importance, such as Gini importance and mean decrease in accuracy.
Essentially, this method measures how much the impurity (or randomness) within a node of
a decision tree decreases when a specific feature is used to split the data.
Permutation feature importance: Permutation importance assesses the significance of each
feature independently. By evaluating the impact of individual feature permutations on
predictions, it calculates importance.
SHAP (SHapley Additive exPlanations) Values: SHAP values delve deeper by explaining
the contribution of each feature to individual predictions. This method offers a
comprehensive understanding of feature importance across various data points.
Feature Importance in Random Forests: Implementation
To show the implementation, the Iris dataset is used throughout this part to understand how feature importance is computed.
Prerequisites: Install the necessary libraries
!pip install shap
Python3
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
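A minimal sketch of the built-in and permutation importances described above (the model parameters are illustrative):
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Built-in (impurity-based) feature importance
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
for name, score in zip(feature_names, model.feature_importances_):
    print(name, round(score, 3))

# Permutation importance: how much the test score drops when one feature is shuffled
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
for name, score in zip(feature_names, perm.importances_mean):
    print(name, round(score, 3))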
Overall, PCA is a powerful tool for data analysis and can help to simplify complex datasets,
making them easier to understand and work with.
Python
import pandas as pd
import numpy as np

# Here we are using the built-in breast cancer dataset of scikit-learn
from sklearn.datasets import load_breast_cancer

# instantiating
cancer = load_breast_cancer(as_frame=True)

# creating dataframe
df = cancer.frame

# checking shape
print('Original Dataframe shape :', df.shape)

# Input features
X = df[cancer['feature_names']]
print('Inputs Dataframe shape :', X.shape)
Python
# Mean
X_mean = X.mean()

# Standard deviation
X_std = X.std()

# Standardization
Z = (X - X_mean) / X_std
The covariance matrix helps us visualize how strong the dependency of two features is with
each other in the feature space.
Python
# covariance
c = Z.cov()
# Plot the covariance matrix
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(c)
plt.show()
Python
# Importing PCA
from sklearn.decomposition import PCA

# Let's say, components = 2
n_components = 2
pca = PCA(n_components=n_components)
pca.fit(Z)
x_pca = pca.transform(Z)

# Create the dataframe
df_pca1 = pd.DataFrame(x_pca, columns=['PC{}'.format(i + 1) for i in range(n_components)])
print(df_pca1)

# giving a larger plot
plt.figure(figsize=(8, 6))
plt.scatter(x_pca[:, 0], x_pca[:, 1], c=cancer['target'], cmap='plasma')

# labeling x and y axes
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()
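As a small follow-up (not part of the original snippet), the explained variance ratio shows how much of the data's variance the two components retain:
# Proportion of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print('Total variance retained:', pca.explained_variance_ratio_.sum())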
Although the logistic regression algorithm is limited to two classes, Linear Discriminant Analysis is applicable to classification problems with more than two classes.
Linear Discriminant Analysis is one of the most popular dimensionality reduction techniques used for supervised classification problems in machine learning. It is also considered a pre-processing step for modeling differences in ML and for applications of pattern classification.
Whenever there is a requirement to separate two or more classes with multiple features efficiently, the Linear Discriminant Analysis model is considered the most common technique to solve such classification problems. For example, if we have two classes with multiple features and need to separate them efficiently, classifying them using a single feature may show overlapping.
To overcome the overlapping issue in the classification process, we must keep increasing the number of features.
Example:
Let's assume we have to classify two different classes having two sets of data points in a 2-dimensional plane, as shown in the image below:
However, it may be impossible to draw a straight line in the 2-D plane that separates these data points efficiently, but using Linear Discriminant Analysis we can reduce the 2-D plane to a 1-D plane. Using this technique, we can also maximize the separability between multiple classes.
Let's consider an example where we have two classes in a 2-D plane with an X-Y axis, and we need to classify them efficiently. As we have already seen in the above example, LDA enables us to draw a straight line that can completely separate the two classes of data points. Here, LDA uses the X-Y axes to create a new axis, separating the classes with a straight line and projecting the data onto this new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D plane into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:
1. Maximize the distance between the means of the two classes.
2. Minimize the variation (scatter) within each individual class.
In other words, the new axis increases the separation between the data points of the two classes while the points of each class stay close together when plotted onto it.
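A compact way to state these two criteria (a standard formulation, not taken from these notes) is Fisher's criterion, which LDA maximizes when choosing the projection direction $w$:

$$ J(w) = \frac{(\mu_1 - \mu_2)^2}{s_1^2 + s_2^2} $$

where $\mu_1$ and $\mu_2$ are the projected class means and $s_1^2$, $s_2^2$ are the projected within-class scatters: a large numerator pushes the class means apart, while a small denominator keeps each class tightly clustered around its mean.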
Why LDA?
o Logistic Regression is one of the most popular classification algorithms; it performs well for binary classification but falls short for multi-class classification problems with well-separated classes, which LDA handles quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of features, just as
PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to extract
useful data from different faces. Coupled with eigenfaces, it produces effective results.
Although LDA is specifically used to solve supervised classification problems for two or more classes, which is not possible using logistic regression, LDA also fails in some cases, for example where the means of the class distributions are shared. In that case, LDA cannot create a new axis that makes both classes linearly separable.
Linear Discriminant Analysis is one of the simplest and most effective methods for solving classification problems in machine learning. It has many extensions and variations, such as:
1. Quadratic Discriminant Analysis (QDA): For multiple input variables, each class deploys its own estimate of variance.
2. Flexible Discriminant Analysis (FDA): Used when non-linear combinations of inputs are used, such as splines.
3. Regularized Discriminant Analysis (RDA): This uses regularization in the estimate of the variance (actually covariance) and hence moderates the influence of different variables on LDA.
Some of the common real-world applications of Linear Discriminant Analysis are given below:
o Face Recognition
Face recognition is a popular application of computer vision, where each face is represented as a combination of a number of pixel values. Here LDA is used to reduce the number of features to a manageable number before the classification process. It generates a new template in which each dimension consists of a linear combination of pixel values. If the linear combination is generated using Fisher's linear discriminant, it is called Fisher's face.
o Medical
In the medical field, LDA is used to classify a patient's disease on the basis of various parameters of the patient's health and the ongoing medical treatment. On such parameters, it classifies the disease as mild, moderate, or severe. This classification helps the doctors in either increasing or decreasing the pace of the treatment.
o Customer Identification
LDA is also applied in customer identification: it helps us identify and select the features that characterize the group of customers who are likely to purchase a specific product in a shopping mall.
o For Predictions
LDA can also be used for making predictions, and so for decision making. For example, "Will you buy this product?" gives a predicted result of one of two possible classes: buying or not buying.
o In Learning
Nowadays, robots are being trained to learn and talk so as to simulate human behaviour, which can also be treated as a classification problem. In this case, LDA builds similar groups on the basis of different parameters such as pitch, frequency, sound, tune, etc.
Example
import pandas as pd
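A minimal, hypothetical sketch of LDA used for dimensionality reduction on the Iris dataset with scikit-learn (the original worked example is not included here):
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the Iris dataset (3 classes, 4 features)
iris = load_iris()
X, y = iris.data, iris.target

# Project the 4-D feature space onto 2 linear discriminants
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)                      # (150, 2)
print(lda.explained_variance_ratio_)    # share of between-class separation per component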