
MACHINE LEARNING

By
R Soujanya
Assistant Professor,
CSE, GRIET
Unit-1 Content
 Introduction: Introduction to Machine Learning, Supervised Learning, Unsupervised Learning, Reinforcement Learning, Deep Learning.

 Feature Selection: Filter, Wrapper, Embedded methods.

 Feature Normalization: min-max normalization, z-score normalization, and constant factor normalization.

 Introduction to Dimensionality Reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA).
Introduction to
Machine Learning
Definitions of Machine Learning

 Learning is any process by which a system improves its performance from experience.

 Machine learning (ML) is a subset/branch of Artificial Intelligence.

1. Machine learning is the "field of study that gives computers the ability to learn without being explicitly programmed" – defined by Arthur Samuel in 1959.

In machine learning, algorithms are trained to find patterns and correlations in large data sets and to make the best decisions and predictions based on that analysis.

(OR)

2. Machine learning: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E – by Tom Mitchell in 1997.

Machine Learning is the study of algorithms that improve their performance P at some task T with experience E.
Continued..
 Machine learning behaves similarly to the growth of a child. As a child grows, her experience E in
performing task T increases, which results in higher performance measure (P).
 For instance, we give a “shape sorting block” toy to a child. (We know that the toy has different
shapes and shape holes).
 In this case, our task T is to find an appropriate shape hole for a shape. Afterward, the child observes
the shape and tries to fit it in a shaped hole.
 Let us say that this toy has three shapes: a circle, a triangle, and a square. In her first attempt at
finding a shaped hole, her performance measure(P) is 1/3, which means that the child found 1 out of 3
correct shape holes.
 Second, the child tries again and notices that she is now a little experienced at this task. Considering the experience gained (E), the child tries the task another time, and when the performance (P) is measured, it turns out to be 2/3. After repeating this task (T) 100 times, the child has figured out which shape goes into which shape hole.
Continued..
 So as the number of attempts at this toy increased, her experience (E) increased and her performance (P) also increased, which results in higher accuracy.
 Such execution is similar to machine learning. What a machine does is take a task (T), execute it, and measure its performance (P). A machine has a large amount of data, so as it processes that data, its experience (E) increases over time, resulting in a higher performance measure (P). After going through all the data, our machine learning model's accuracy increases, which means that the predictions made by our model will be very accurate.
 3. Machine Learning is the ability of systems to learn from data, identify patterns, and enact
lessons from that data without human interaction or with minimal human interaction.
 Machine learning makes day-to-day and repetitive work much easier!
Machine Learning
Data → Information → Knowledge
 Need For Machine Learning

 Machine learning is a tool for turning information into knowledge.


 Ever since the technical revolution, we’ve been generating an immeasurable amount of data.
 Google gets over 3.5 billion searches daily.
 WhatsApp users exchange up to 65 billion messages daily.
 As per research, we generate around 2.5 quintillion bytes of data every single day! It is
estimated that by 2020, 1.7MB of data will be created every second for every person on
earth.(0.0025ZB)
 Facebook generates 4 petabytes of data per day
 With the availability of so much data, it is finally possible to build predictive models that
can study and analyze complex data to find useful insights and deliver more accurate
results.
List of reasons why Machine Learning is so important:

• Increase in Data Generation: Due to the excessive production of data, we need a method that can be used to structure, analyze and draw useful insights from data. This is where Machine Learning comes in. It uses data to solve problems and find solutions to the most complex tasks faced by organizations.
• Improve Decision Making: By making use of various algorithms, Machine Learning can be
used to make better business decisions. For example, Machine Learning is used to forecast
sales, predict downfalls in the stock market, identify risks and anomalies, etc.
• Uncover patterns & trends in data: Finding hidden patterns and extracting key insights
from data is the most essential part of Machine Learning. By building predictive models and
using statistical techniques, Machine Learning allows you to dig beneath the surface and
explore the data at a minute scale. Understanding data and extracting patterns manually
will take days, whereas Machine Learning algorithms can perform such computations in less
than a second.
• Solve complex problems: From detecting the genes linked to the deadly ALS disease to
building self-driving cars, Machine Learning can be used to solve the most complex
problems.
Importance of Machine Learning and Machine Learning Applications:

• Netflix's Recommendation Engine: The core of Netflix is its famous recommendation engine. Over 75% of what you watch is recommended by Netflix, and these recommendations are made by implementing Machine Learning.
• Facebook's Auto-tagging feature: The logic behind Facebook's DeepFace face verification system is Machine Learning and Neural Networks. DeepFace studies the facial features in an image to tag your friends and family.
• Amazon's Alexa: The famous Alexa, which is based on Natural Language Processing and Machine Learning, is an advanced-level virtual assistant that does more than just play songs on your playlist. It can book you an Uber, connect with the other IoT devices at home, track your health, etc.
• Google’s Spam Filter: Gmail makes use of Machine Learning to filter out spam
messages. It uses Machine Learning algorithms and Natural Language Processing
to analyze emails in real-time and classify them as either spam or non-spam.
Features of Machine Learning
Machine Learning Vs Traditional Programming

 Traditional Programming: Data and a program are run on the computer to produce the output.
Data + Program → Computer → Output

 Machine Learning: Data and output are run on the computer to create a program. This program can then be used in traditional programming.
Data + Output → Computer → Program
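To make the contrast concrete, here is a minimal Python sketch (not from the slides; the temperature-conversion task, values, and scikit-learn usage are illustrative assumptions) in which the rule is first written by hand and then learned from example data:

# Traditional programming: we write the rule (program) ourselves.
def fahrenheit_from_celsius(c):
    return c * 9 / 5 + 32                       # Data + Program -> Output

print(fahrenheit_from_celsius(100))             # 212.0

# Machine learning: we supply data and outputs, and the computer learns the rule.
from sklearn.linear_model import LinearRegression

celsius = [[0], [10], [20], [30], [40]]         # Data
fahrenheit = [32, 50, 68, 86, 104]              # Output
model = LinearRegression().fit(celsius, fahrenheit)   # Data + Output -> Program (the model)
print(model.predict([[100]]))                   # approximately [212.]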
Relation between Data Science, Machine learning , Deep Learning &
Artificial Intelligence
Relation between Data Science, Machine learning & Artificial Intelligence:

 Artificial Intelligence (AI): Developing machines to mimic human intelligence and behavior.
 The ability of a machine to imitate human intelligence.
 AI is a broader term that describes the capability of a machine to learn and solve problems just like humans. In other words, AI refers to the replication of human intelligence: how humans think, work and function.
 Artificial intelligence is used when a machine completes a task using human
intellect and behaviours.
 Ex: Roomba, the smart robotic vacuum, uses AI to analyze the size of the
room, obstacles, and pathways.
 There are two ways of incorporating intelligence in artificial things i.e., to
achieve artificial intelligence. One is through machine learning and another
is through deep learning. That means DL and ML are ways of achieving AI.
Relation between Data Science, Machine learning & Artificial Intelligence:

 Machine Learning (ML): Algorithms that learn from structured data to predict outputs and discover patterns in that data.
 ML is an application or subset of AI.
 The major aim of ML is to allow systems to learn by themselves through experience, without any kind of human intervention or assistance.
 Ex: We use machine learning in our day-to-day life when we use services like recommendation systems on Netflix, YouTube, and Spotify; search engines like Google and Yahoo; and voice assistants like Google Home and Amazon Alexa. (structured data)
 In Machine Learning we train the algorithm by providing it with a lot of
data and allowing it to learn more about the processed information.
Relation between Data Science, Machine learning & Artificial Intelligence:

 Deep Learning (DL): Algorithms based on highly complex neural networks that mimic
the way a human brain works to detect patterns in large unstructured data sets.
 Deep learning is the evolution of machine learning and neural networks, which uses
advanced computer programming and training to understand complex patterns hidden
in large data sets.
 DL is about understanding how the human brain works in different situations and then
trying to recreate its behaviour.
 Deep learning is used to complete complex tasks and train models using unstructured
data.
 Ex: Deep learning is commonly used in image classification tasks like facial recognition.
Although machine learning models can also identify faces, deep learning models are
more accurate.
 In this case, it takes the unstructured data (images of faces) and extracts factors such
as the various facial features. The extracted features are then matched to those stored
in a database.
Two major advantages of DL over ML:

1. Feature Extraction
 Machine learning algorithms such as Naive Bayes, Logistic Regression, SVM, etc., are termed "flat algorithms". By flat, we mean these algorithms require a pre-processing phase (known as Feature Extraction, which is quite complicated and computationally expensive) before being applied to data such as images, text, or CSV files.
 For instance, if we want to determine whether a particular image is of a cat or dog using
the ML model.
 We have to manually extract features from the image such as size, color, shape, etc., and
then give these features to the ML model to identify whether the image is of a dog or cat.
 However, DL models do not need any feature extraction pre-processing step and are capable of classifying data into different classes and categories themselves. That is, in the case of identifying a cat or a dog in an image, we do not need to extract features from the image and give them to the DL model. Instead, the image can be given as direct input to the DL model, whose job is then to classify it without human intervention.
 Raw Data is given to DL model. Pre-processed data is given to ML model.
Two major advantages of DL over ML:

2. Big Data
 With technology and the ever-increasing use of the web, it is estimated that every
second 1.7MB of data is generated by every person on the planet Earth. Therefore,
analyzing and learning from data is of utmost importance.
 Deep Learning is seen as a rocket whose fuel is data.
 The accuracy of ML models stops increasing with an increasing amount of data after
a point while the accuracy of the DL model keeps on increasing with increasing data.
All the technologies at a glance………
ML Tools
Step:1 - Gathering the data
Data:

 Data: It can be any unprocessed fact, value, text, sound, or picture that is not being
interpreted and analyzed.
 Data is the most important part of all Data Analytics, Machine Learning, Artificial
Intelligence.
 Without data, we can’t train any model and all modern research and automation will go
in vain. Big enterprises are spending lots of money just to gather as much relevant data as possible.
 Data is typically divided into two types: labeled and unlabeled. Labeled data includes a
label or target variable that the model(Supervised) is trying to predict, whereas
unlabeled data does not include a label or target variable (UnSupervised) .
 A labeled dataset is one where you already know the target answer. 
 The data used in machine learning is typically numerical or categorical.
• Numerical data includes values that can be ordered and measured, such as age or
income.(Regression-if target variable is numerical)
• Categorical data/Nominal data: includes values that represent categories, such as
gender or type of fruit.(Classification-if target variable is Categorical)
Types of Data

The data is classified into majorly four categories:

• Nominal data
• Ordinal data
• Discrete data
• Continuous data
Types of Data
1. Quantitative Data Type:
 This type of data consists of numerical values – anything which is measured by numbers.
 E.g., profit, quantity sold, height, weight, temperature, etc.
This is again of two types:
A.) Discrete Data Type (counting process):
 Numeric data which takes discrete values or whole numbers. This type of variable has no proper meaning if its value is expressed in decimal format. Its values can be counted.
 E.g., no. of cars you have, no. of marbles in containers, students in a class, etc.
Types of Data
 B.) Continuous Data Type (measuring process):
 Numerical measures which can take any value within a certain range. This type of variable has true meaning when its value is expressed in decimal format. Its values cannot be counted, only measured, and can be infinite.
 E.g., height, weight, time, area, distance, measurement of rainfall, etc.
Types of Data
2. Qualitative Data Type:
 These are the data types that cannot be expressed in numbers. They describe categories or groups and are hence known as the categorical data type.
These can be divided into:
A. Structured Data:
 This type of data is either numbers or words. It can take numerical values, but mathematical operations cannot be performed on them. This type of data is expressed in tabular format.
 E.g., Sunny=1, Cloudy=2, Windy=3, or binary-form data like 0 or 1, Good or Bad, etc.
Types of Data
 B. Unstructured Data:
 This type of data does not have a proper format and is therefore known as unstructured data. It comprises textual data, sounds, images, videos, etc.
Types of Data
 Besides this, there are also other types, referred to as data type preliminaries or data measures:
1. Nominal
2. Ordinal
3. Interval
4. Ratio
 These can also be referred to as different scales of measurement.
 I. Nominal Data Type:
 This is used to express names or labels which have no order and are not measurable.
 E.g., male or female (gender), race, country, etc.

Fig: Gender (Female, Male), an example of the nominal data type
Types of Data
 II. Ordinal Data Type:
 This is also a categorical data type like nominal data but has some natural ordering associated with it.
 E.g., Likert rating scale, shirt sizes, ranks, grades, etc.

 III. Interval Data Type:
 This is numeric data which has a proper order, but zero does not mean the true absence of a value; zero itself still carries some value. The differences between values are meaningful, but there is no absolute zero. This is a local scale.
 E.g., temperature measured in degrees Celsius, time, SAT score, credit score, pH, etc.
Types of Data
 IV. Ratio Data Type:
 This quantitative data type is the same as the interval data type but has an absolute zero. Here zero means complete absence, and the scale starts from zero. This is a global scale.
 E.g., temperature in Kelvin, height, weight, etc.

Fig: Weight, an example of the ratio data type
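As a small illustration of these measurement scales, the following Python/pandas sketch (all column names and values are made up for this example) marks a nominal column as an unordered categorical type and an ordinal column as an ordered one, while interval and ratio columns stay numeric:

import pandas as pd

# Hypothetical toy data covering the four measurement scales.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],       # nominal: labels with no order
    "shirt_size": ["S", "M", "L"],                # ordinal: labels with a natural order
    "temp_celsius": [36.6, 37.2, 38.1],           # interval: differences meaningful, no absolute zero
    "weight_kg": [70.5, 55.0, 82.3],              # ratio: has an absolute zero
})

df["gender"] = pd.Categorical(df["gender"])                                   # unordered categories
df["shirt_size"] = pd.Categorical(df["shirt_size"],
                                  categories=["S", "M", "L"], ordered=True)   # ordered categories

print(df.dtypes)
print(df["shirt_size"] > "S")    # order-aware comparison is meaningful only for ordinal data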


What is a Data set?
A data set is an organized collection of data. Data sets are generally associated with a unique body of work and typically cover one topic at a time.
Each data set has one output variable and one or more input variables.
Rows are also called instances, observations, records, samples, or objects.
Columns are also called features, attributes, variables, fields, characteristics, or predictors.

• Independent variables: input variables / predictor variables
• Dependent variable: output variable / target variable / response variable
Types of datasets:
 1. Data set consists of only numerical attributes
 2. Data set consists of only categorical attributes
 3. Data set consists of both numerical and categorical attributes

 Dataset 1 (numerical only):
age   income   height   weight
20    12000    6.3      30
40    15000    5.2      70
35    20000    5.6      65
60    100000   5.4      59

 Dataset 2 (categorical only):
age      income      student
youth    Fair        Yes
youth    Good        No
senior   excellent   Yes
middle   Good        Yes
senior   Fair        No
middle   good        no

 Dataset 3 (numerical and categorical):
age      income    Credit rating
youth    12000     Yes
senior   15000     No
middle   20000     Yes
youth    100000    Yes
Step:2 – Data Preparation

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets
and summarize their main characteristics, often employing data visualization methods.
Data Preparation

 Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.
Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other
words, whenever the data is gathered from different sources it is collected in raw format which is not
feasible for the analysis.
Data preparation is also known as data "pre-processing," "data wrangling," "data cleaning," and "feature engineering." It is a key stage of the machine learning workflow.
 Few essential tasks when working with data in the data preparation step.
• Data cleaning: This task includes the identification of errors and making corrections or improvements to
those errors.
• Feature Selection: We need to identify the most important or relevant input data variables for the model.
• Data Transforms: Data transformation involves converting raw data into a well-suitable format for the
model.
• Feature Engineering: Feature engineering involves deriving new variables from the available dataset.
• Dimensionality Reduction: The dimensionality reduction process involves converting higher
dimensions into lower dimension features without changing the information
Data Preparation
 Data Preparation is the process of cleaning and transforming raw data to make predictions accurately through using ML
algorithms. Although data preparation is considered the most complicated stage in ML, it reduces process complexity later
in real-time projects. Various issues have been reported during the data preparation step in machine learning as follows:
• Mismatched data types: When you collect data from many different sources, it may come to you in different formats.
While the ultimate goal of this entire process is to reformat your data for machines, you still need to begin with similarly
formatted data. For example, if part of your analysis involves family income from multiple countries, you’ll have to convert
each income amount into a single currency.
• Mixed data values: Perhaps different sources use different descriptors for features – for example, man or male. These
value descriptors should all be made uniform.
• Data outliers: Outliers can have a huge impact on data analysis results. For example if you're averaging test scores for a
class, and one student didn’t respond to any of the questions, their 0% could greatly skew the results.
• Missing data: Take a look for missing data fields, blank spaces in text, or unanswered survey questions. This could be due
to human error or incomplete data. To take care of missing data, you’ll have to perform data cleaning.
• Unstructured data format: Data comes from various sources and needs to be extracted into a different format. Hence,
before deploying an ML project, always consult with domain experts or import data from known sources.
• Limited Features: Whenever data comes from a single source, it contains limited features, so it is necessary to import data
from various sources for feature enrichment or build multiple features in datasets.
• Understanding feature engineering: Feature engineering helps develop additional content in the ML models, increasing model performance and accuracy in predictions.

Data Preprocessing
 Data preprocessing is the process of transforming raw data into a useful, understandable format. Real-world
or raw data usually has inconsistent formatting, human errors, and can also be incomplete. Data preprocessing
resolves such issues and makes datasets more complete and efficient to perform data analysis.
 In other words, data preprocessing is transforming data into a form that computers can easily work on. It
makes data analysis or visualization easier and increases the accuracy and speed of the machine learning
algorithms that train on the data.
Why is data preprocessing required?
 A database is a collection of data points. Data points are also called observations, data samples, events, and
records.
 Each sample is described using different characteristics, also known as features or attributes. Data
preprocessing is essential to effectively build models with these features.
 If you’re aggregating data from two or more independent datasets, the gender field may have two different
values for men: man and male. Likewise, if you’re aggregating data from ten different datasets, a field that’s
present in eight of them may be missing in the rest two.
 By preprocessing data, we make it easier to interpret and use. This process eliminates inconsistencies or
duplicates in data, which can otherwise negatively affect a model’s accuracy. Data preprocessing also ensures
that there aren’t any incorrect or missing values due to human error or bugs. In short, employing data
preprocessing techniques makes the database more complete and accurate.

The four stages of data preprocessing


 There are four stages of data processing: cleaning, integration, reduction, and transformation.
1. Data cleaning: also known as cleansing or scrubbing, this is the process of cleaning datasets by accounting for missing values, removing outliers, correcting inconsistent data points, and smoothing noisy data. In essence, the motive behind data cleaning is to offer complete and accurate samples for machine learning models.
• Missing values
• Noisy data
i) Missing values:
 The problem of missing data values is quite common. It may happen during data collection or due to
some specific data validation rule. In such cases, you need to collect additional data samples or look
for additional datasets.
 The issue of missing values can also arise when you concatenate two or more datasets to form a
bigger dataset. If not all fields are present in both datasets, it’s better to delete such fields before
merging.

 Here are some ways to account for missing data:


• Manually fill in the missing values. This can be a tedious and time-consuming approach and is not
recommended for large datasets.
• Make use of a standard value to replace the missing data value. You can use a global constant like
“unknown” or “N/A” to replace the missing value. Although a straightforward approach, it isn’t
foolproof.
• Fill the missing value with the most probable value. To predict the probable value, you can use
algorithms like logistic regression or decision trees.
• Use a central tendency to replace the missing value. Central tendency is the tendency of a value to
cluster around its mean, mode, or median.
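A minimal sketch of two of these options, assuming a small made-up pandas DataFrame: replacing a missing categorical entry with a standard constant, and imputing missing numeric values with a central tendency (the mean) using scikit-learn's SimpleImputer:

import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing values.
df = pd.DataFrame({
    "age":    [25, None, 35, 40],
    "income": [30000, 45000, None, 60000],
    "city":   ["Hyderabad", None, "Delhi", "Mumbai"],
})

# Standard-value replacement for a missing categorical field.
df["city"] = df["city"].fillna("unknown")

# Central-tendency (mean) imputation for the numeric fields.
imputer = SimpleImputer(strategy="mean")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])

print(df)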
ii) Noisy data
 A large amount of meaningless data is called noise. More precisely, it’s the random variance in a
measured variable or data having incorrect attribute values. Noise includes duplicate or semi-
duplicates of data points, data segments of no value for a specific research process, or unwanted
information fields.
 For example, if you need to predict whether a person can drive, information about their hair color,
height, or weight will be irrelevant.

 An outlier can be treated as noise, although some consider it a valid data point. Suppose you’re
training an algorithm to detect tortoises in pictures. The image dataset may contain images of turtles
wrongly labeled as tortoises. This can be considered noise.
 However, there can be a tortoise’s image that looks more like a turtle than a tortoise. That sample can
be considered an outlier and not necessarily noise. This is because we want to teach the algorithm all
possible ways to detect tortoises, and so, deviation from the group is essential.
 For numeric values, you can use a scatter plot or box plot to identify outliers.
 The following are some methods used to solve the problem of noise:
• Regression: Regression analysis can help determine the variables that have an impact. This will
enable you to work with only the essential features instead of analyzing large volumes of data. Both
linear regression and multiple linear regression can be used for smoothing the data.
• Binning: Binning methods can be used for a collection of sorted data. They smoothen a sorted value
by looking at the values around it. The sorted values are then divided into “bins,” which means
sorting data into smaller segments of the same size. There are different techniques for binning,
including smoothing by bin means and smoothing by bin medians.
• Clustering: Clustering algorithms such as k-means clustering can be used to group data and detect
outliers in the process.
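The sketch below illustrates the "smoothing by bin means" idea on a made-up list of sorted values; the number of bins and the values themselves are arbitrary:

import numpy as np

# Hypothetical sorted values to be smoothed.
values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Split the sorted data into equal-size bins and replace each value with its bin mean.
n_bins = 3
bins = np.array_split(values, n_bins)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])

print(smoothed)   # e.g. the first bin [4, 8, 9, 15] becomes [9.0, 9.0, 9.0, 9.0]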

2. Data integration
Data integration is the part of a data analysis task that combines data from multiple sources into a coherent data store. These sources may include multiple databases. How can the data be matched up? A data analyst may find Customer_ID in one database and cust_id in another; how can he be sure that these two belong to the same entity? Databases and data warehouses have metadata (data about data), which helps in avoiding such errors.
Since data is collected from various sources, data integration is a crucial part of data preparation. Integration
may lead to several inconsistent and redundant data points, ultimately leading to models with inferior
accuracy.
 Here are some approaches to integrate data:
• Data consolidation: Data is physically brought together and stored in a single place. Having all data in
one place increases efficiency and productivity. This step typically involves using 
data warehouse software.
• Data virtualization: In this approach, an interface provides a unified and real-time view of data from
multiple sources. In other words, data can be viewed from a single point of view.
• Data propagation: Involves copying data from one location to another with the help of specific
applications. This process can be synchronous or asynchronous and is usually event-driven.

3. Data reduction 
 As the name suggests, data reduction is used to reduce the amount of data and thereby reduce the
costs associated with data mining or data analysis.
 It offers a condensed representation of the dataset. Although this step reduces the volume, it
maintains the integrity of the original data. This data preprocessing step is especially crucial when
working with big data as the amount of data involved would be gigantic.
 The following are some techniques used for data reduction.
 Dimensionality reduction, also known as dimension reduction, reduces the number of features or
input variables in a dataset.
 The number of features or input variables of a dataset is called its dimensionality. The higher the
number of features, the more troublesome it is to visualize the training dataset and create a predictive
model.
 In some cases, most of these attributes are correlated, hence redundant; therefore, dimensionality
reduction algorithms can be used to reduce the number of random variables and obtain a set of
principal variables.

3. Data reduction 
 There are two segments of dimensionality reduction: feature selection and feature extraction.
i. Feature selection (selecting a subset of the variables)--try to find a subset of the original set of features. This
allows us to get a smaller subset that can be used to visualize the problem using data modeling
ii. Feature extraction (extracting new variables from the data)---reduces the data in a high-dimensional space
to a lower-dimensional space, or in other words, space with a lesser number of dimensions.
 The following are some ways to perform dimensionality reduction:
• Principal component analysis (PCA): A statistical technique used to extract a new set of variables from a large
set of variables. The newly extracted variables are called principal components. This method works only for
features with numerical values.
• High correlation filter: A technique used to find highly correlated features and remove them; otherwise, a pair
of highly correlated variables can increase the multicollinearity in the dataset. 
• Missing values ratio: This method removes attributes having missing values more than a specified threshold.
• Low variance filter: Involves removing normalized attributes having variance less than a threshold value as
minor changes in data translate to less information.
• Random forest: This technique is used to assess the importance of each feature in a dataset, allowing us to keep
just the top most important features.
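As an illustration of PCA in practice, here is a minimal scikit-learn sketch on synthetic, deliberately correlated data (the data is generated only to show the shape of the API):

import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 6 samples with 4 numeric features, two of which are correlated copies.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
X = np.hstack([X, X @ np.array([[1.0, 0.5], [0.2, 0.8]])])   # now 4 correlated features

# Keep only the top 2 principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # (6, 2)
print(pca.explained_variance_ratio_)   # share of variance captured by each component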

4. Data Transformation
 Data transformation is the process of converting data from one format to another. In essence, it involves methods for
transforming data into appropriate formats that the computer can learn efficiently from.
 For example, the speed units can be miles per hour, meters per second, or kilometers per hour. Therefore a dataset may store
values of the speed of a car in different units as such. Before feeding this data to an algorithm, we need to transform the data into
the same unit.
 The following are some strategies for data transformation.
 Smoothing
 This statistical approach is used to remove noise from the data with the help of algorithms. It helps highlight the most valuable
features in a dataset and predict patterns. It also involves eliminating outliers from the dataset to make the patterns more visible.
 Aggregation
 Aggregation refers to pooling data from multiple sources and presenting it in a unified format for data mining or analysis.
Aggregating data from various sources to increase the number of data points is essential as only then the ML model will have
enough examples to learn from.
 Discretization
 Discretization involves converting continuous data into sets of smaller intervals. For example, it’s more efficient to place people
in categories such as “teen,” “young adult,” “middle age,” or “senior” than using continuous age values.
 Generalization
 Generalization involves converting low-level data features into high-level data features. For instance, categorical attributes such
as home address can be generalized to higher-level definitions such as city or state.

4. Data Transformation
 Normalization
 Normalization refers to the process of converting all data variables into a specific range. In other words,
it’s used to scale the values of an attribute so that it falls within a smaller range, for example, 0 to 1.
Decimal scaling, min-max normalization, and z-score normalization are some methods of data
normalization.
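A short sketch of these scaling methods with scikit-learn (the income values are made up; decimal scaling has no dedicated scaler in scikit-learn, so it is computed by hand):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical attribute with a wide range of values.
income = np.array([[12000.0], [15000.0], [20000.0], [100000.0]])

# Min-max normalization: rescales the values into the range [0, 1].
print(MinMaxScaler().fit_transform(income).ravel())

# Z-score normalization: centers to mean 0 and scales to unit standard deviation.
print(StandardScaler().fit_transform(income).ravel())

# Decimal scaling: divide by 10^j so that the largest absolute value is below 1.
j = int(np.floor(np.log10(np.abs(income).max()))) + 1
print((income / 10 ** j).ravel())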
 Feature construction
 Feature construction involves constructing new features from the given set of features. This method
simplifies the original dataset and makes it easier to analyze, mine, or visualize the data.
 Concept hierarchy generation
 Concept hierarchy generation lets you create a hierarchy between features, although it isn’t specified. For
example, if you have a house address dataset containing data about the street, city, state, and country, this
method can be used to organize the data in hierarchical forms.
 Accurate data, accurate results
 Machine learning algorithms are like kids. They have little to no understanding of what’s favorable or
unfavorable. Like how kids start repeating foul language picked up from adults, inaccurate or inconsistent
data easily influences ML models. The key is to feed them high-quality, accurate data, for which data
preprocessing is an essential step.

Data Preprocessing
Step:3 – Choosing the Learning Model
Types of Machine Learning
Machine Learning

 Supervised Learning
   • Classification: Decision Trees, KNN, Naïve Bayes, SVM, Logistic Regression, Multinomial Logistic Regression
   • Regression: Simple Linear, Multiple Linear, Polynomial

 Unsupervised Learning
   • Clustering: K-Means, K-Modes, K-Medoids, DBScan, Agglomerative, Divisive

 Reinforcement Learning
   • Q-Learning, Markov Decision Process

 Deep Learning
   • Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks
Step:4 – Training the Model
Training set & Test Set
Training the Model
 The dataset split ratio mainly depends on two things: first, the total number of samples (instances/rows) in your data, and second, the actual model you are training.
 Train/Validation/Test is a method to measure the accuracy of your model.
 We can split the data set into three sets: a training set , Validation and
testing set.
 70%/80% for training, and 30%/20% for testing.(it depends on the given data)
 Train the model means create the model.
 Test the model means test the accuracy of the model.
 The fundamental purpose for splitting the dataset is to assess how effective
will the trained model be in generalizing to new data.
 This split can be achieved by using train_test_split function of scikit-learn.
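A minimal example of such a split with train_test_split, using scikit-learn's built-in Iris dataset (the 80/20 ratio and the random_state value are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# A small built-in dataset: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for testing; the rest is used to train the model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)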
Training the Model
 Training dataset: The sample of data used to fit the model. The actual dataset
that we use to train the model (weights and biases in the case of a Neural
Network). The model sees and learns from this data.
 This is the actual dataset from which a model trains, i.e., the model sees and learns from this data to predict the outcome or to make the right decisions.
 Most of the training data is collected from several resources and then preprocessed and organized to provide proper performance of the model.
 The type of training data hugely determines the ability of the model to generalize, i.e., the better the quality and diversity of the training data, the better the performance of the model will be.
 This data is more than 60% of the total data available for the project.
Training the Model
 Test dataset : The sample of data used to provide an unbiased evaluation of a
final model fit on the training dataset.
 This dataset is independent of the training set but has a somewhat similar type of
probability distribution of classes and is used as a benchmark to evaluate the model,
used only after the training of the model is complete.
 Testing set is usually a properly organized dataset having all kinds of data for
scenarios that the model would probably be facing when used in the real world. Often
the validation and testing set combined is used as a testing set which is not
considered a good practice.
 If the accuracy of the model on training data is greater than that on testing data, then the model is said to be overfitting.
 This data is approximately 20-25% of the total data available for the project.
Training the Model
 Validation dataset: The sample of data used to provide an unbiased evaluation of a
model fit on the training dataset while tuning model hyperparameters. The evaluation
becomes more biased as skill on the validation dataset is incorporated into the model
configuration.
 The validation set is used to fine-tune the hyperparameters of the model and is considered
a part of the training of the model.
 The model only sees this data for evaluation but does not learn from this data, providing
an objective unbiased evaluation of the model.
 The validation dataset can also be used to regularize training by interrupting the training of the model when the loss on the validation dataset becomes greater than the loss on the training dataset, i.e., reducing bias and variance. This data is approximately 10-15% of the total data available for the project, but this can change depending upon the number of hyperparameters; if the model has many hyperparameters, then using a larger validation set will give better results. Whenever the accuracy of the model on validation data is comparable to that on training data, the model is said to have generalized well.
Step:5 – Performance Evaluation
Performance metrics
 Evaluating the performance of a Machine learning model is one of the important steps while building an
effective ML model. To evaluate the performance or quality of the model, different metrics are used,
and these metrics are known as performance metrics or evaluation metrics.
 These performance metrics help us understand how well our model has performed for the given data. In
this way, we can improve the model's performance by tuning the hyper-parameters. Each ML model aims
to generalize well on unseen/new data, and performance metrics help determine how well the model
generalizes on the new dataset.
Performance metrics
 In machine learning, each task or problem is divided into classification and Regression. Not all metrics
can be used for all types of problems; hence, it is important to know and understand which metrics
should be used. Different evaluation metrics are used for both Regression and Classification tasks. In
this topic, we will discuss metrics used for classification and regression tasks.
Performance Metrics for Classification
 In a classification problem, the category or classes of data is identified based on training data. The
model learns from the given dataset and then classifies the new data into classes or groups based on the
training. It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or Not Spam, etc. To
evaluate the performance of a classification model, different metrics are used, and some of them are
as follows:
1. Accuracy: the ratio of the number of correct predictions to the total number of predictions.
2. Confusion Matrix
3. Precision
4. Recall
5. F-Score
6. AUC(Area Under the Curve)-ROC
Performance metrics
1. Accuracy: It is the ratio of the number of correct predictions to the total number of predictions.

2. Confusion Matrix:
 A confusion matrix is a tabular representation of prediction outcomes of any binary classifier, which is
used to describe the performance of the classification model on a set of test data when true values are
known.
 The confusion matrix is simple to implement, but the terminologies used in this matrix might be confusing
for beginners.
 A typical confusion matrix for a binary classifier is a 2×2 table whose cells count the true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). (It can be extended to classifiers with more than two classes.)
Performance metrics
 Accuracy for the matrix can be calculated by taking the sum of the values lying across the "main diagonal" divided by the total number of samples, i.e.
 Accuracy = (True Positives + True Negatives) / Total Number of Samples

 3. Precision: It is the number of correct positive results divided by the number of positive results predicted by the classifier.
 Precision = TP / (TP + FP)

 4. Recall: It is the number of correct positive results divided by the number of all relevant samples.
 Recall = TP / (TP + FN)

 5. F-Score: The F-score or F1 Score is a metric to evaluate a binary classification model on the basis of the predictions made for the positive class. It is calculated with the help of Precision and Recall. It is a single score that represents both Precision and Recall: the F1 Score is the harmonic mean of precision and recall, assigning equal weight to each of them.
 The formula for calculating the F1 score is given below:
 F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
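These classification metrics can be computed directly with scikit-learn; the sketch below uses made-up true labels and predictions purely to show the calls:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# Hypothetical true labels and predictions from a binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))                # rows: actual class, columns: predicted class
print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / total samples
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))          # harmonic mean of precision and recall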
Performance metrics

 6.AUC(Area Under the Curve)-ROC


 Sometimes we need to visualize the performance of the classification model on charts; then, we can use the AUC-
ROC curve. It is one of the popular and important metrics for evaluating the performance of the classification
model.
 Firstly, let's understand ROC (Receiver Operating Characteristic curve) curve. ROC represents a graph to show
the performance of a classification model at different threshold levels. The curve is plotted between two
parameters, which are:
• True Positive Rate
• False Positive Rate
• TPR or True Positive Rate is a synonym for Recall, and hence can be calculated as: TPR = TP / (TP + FN)
• FPR or False Positive Rate can be calculated as: FPR = FP / (FP + TN)
Performance metrics
Performance Metrics for Regression
 Regression is a supervised learning technique that aims to find the relationships between the dependent and independent
variables. A predictive regression model predicts a numeric or discrete value. The metrics used for regression are different
from the classification metrics. It means we cannot use the Accuracy metric (explained above) to evaluate a regression
model; instead, the performance of a Regression model is reported as errors in the prediction. Following are the popular
metrics that are used to evaluate the performance of Regression models.
1. Mean Absolute Error (MAE): one of the simplest metrics, which measures the absolute difference between actual and predicted values, where absolute means taking the number as positive.
 MAE = (1/N) Σ |Y − Y'|, where Y is the actual outcome, Y' is the predicted outcome, and N is the total number of data points.

2. Mean Squared Error (MSE): measures the average of the squared differences between the values predicted by the model and the actual values.
 MSE = (1/N) Σ (Y − Y')²

3. R2 Score--R squared error is also known as Coefficient of Determination, which is another popular metric used for
Regression model evaluation. The R-squared metric enables us to compare our model with a constant baseline to determine
the performance of the model. To select the constant baseline, we need to take the mean of the data and draw the line at the
mean.
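A short scikit-learn sketch computing these regression metrics on made-up actual and predicted values:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual values (Y) and model predictions (Y').
y_true = [3.0, 5.0, 7.5, 10.0]
y_pred = [2.5, 5.5, 7.0, 11.0]

print("MAE:", mean_absolute_error(y_true, y_pred))   # mean of |Y - Y'|
print("MSE:", mean_squared_error(y_true, y_pred))    # mean of (Y - Y')^2
print("R2 :", r2_score(y_true, y_pred))              # 1 - SS_residual / SS_total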
Performance metrics
4. Adjusted R2
 Adjusted R squared, as the name suggests, is the improved version of the R squared error. R squared has the limitation that its score improves as more terms are added to the model, even though the model is not actually improving, which may mislead data scientists.
 To overcome this issue of R squared, adjusted R squared is used, which will always show a value lower than or equal to R². This is because it adjusts for the number of predictors and only shows an improvement if there is a real improvement.
 We can calculate the adjusted R squared using the standard formula:
 Ra² = 1 − [(1 − R²)(n − 1) / (n − k − 1)]
 where n is the number of observations, k denotes the number of independent variables, and Ra² denotes the adjusted R².
Step:6 – Hyperparameter Tuning
Parameters and hyperparameters
Parameters
A model parameter is a configuration variable that is internal to the model and whose value can be estimated from data.
• They are required by the model when making predictions.
• Their values define the skill of the model on your problem.
• They are estimated or learned from data.
• They are often not set manually by the practitioner.
• They are often saved as part of the learned model.
 Parameters are key to machine learning algorithms. They are the part of the model that is learned from historical training data.
 The term "parameter" is also used in statistics and programming, and both senses carry over to machine learning:
• Statistics: In statistics, you may assume a distribution for a variable, such as a Gaussian distribution. Two parameters of the Gaussian
distribution are the mean (mu) and the standard deviation (sigma). This holds in machine learning, where these parameters may be
estimated from data and used as part of a predictive model.
• Programming: In programming, you may pass a parameter to a function. In this case, a parameter is a function argument that could have
one of a range of values. In machine learning, the specific model you are using is the function and requires parameters in order to make a
prediction on new data.
 Whether a model has a fixed or variable number of parameters determines whether it may be referred to as “parametric” or
“nonparametric“.
 Some examples of model parameters include:
• The weights in an artificial neural network.
• The support vectors in a support vector machine.
• The coefficients in a linear regression or logistic regression.
Parameters and hyperparameters
 Hyperparameters
 A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.
• They are often used in processes to help estimate model parameters.
• They are often specified by the practitioner.
• They can often be set using heuristics.
• They are often tuned for a given predictive modeling problem.
 We cannot know the best value for a model hyperparameter on a given problem. We may use rules of thumb, copy values used on other
problems, or search for the best value by trial and error.
 When a machine learning algorithm is tuned for a specific problem, such as when you are using a grid search or a random search, you are tuning the hyperparameters of the model in order to discover the parameters of the model that result in the most skillful predictions.
 Model hyperparameters are often referred to as model parameters which can make things confusing. A good rule of thumb to overcome this
confusion is as follows:
 If you have to specify a model parameter manually then
it is probably a model hyperparameter.
 Some examples of model hyperparameters include:
• The learning rate for training a neural network.
• The C and sigma hyperparameters for support vector machines.
• The k in k-nearest neighbors.
Hyperparameter Tuning
 Hyperparameters are adjustable parameters that let you control the model training process. For example, with
neural networks, you decide the number of hidden layers and the number of nodes in each layer. Model
performance depends heavily on hyperparameters.
 Hyperparameter tuning, also called hyperparameter optimization, is the process of finding the configuration of
hyperparameters that results in the best performance. The process is typically computationally expensive and
manual.
 We are usually not aware of the optimal values for the hyperparameters that would generate the best model output, so we tell the model to explore and select the optimal model configuration automatically. This selection procedure for hyperparameters is known as Hyperparameter Tuning.
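One common way to automate this search is a grid search with cross-validation; the sketch below tunes k for a k-nearest neighbors classifier on the Iris dataset (the candidate values and cv=5 are arbitrary choices, not a recommendation from the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for the hyperparameter k of k-nearest neighbors.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

# Grid search tries every candidate with 5-fold cross-validation and keeps the best one.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # the best value of k found by the search
print(search.best_score_)    # mean cross-validated accuracy of that setting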
Hyperparameter Tuning
Step:7 – Prediction
Types of
Machine Learning
Types of Machine Learning
 Machine learning is an application of AI that provides systems the ability to learn on their own and
improve from experiences without being programmed externally.
 Machine Learning is an application of Artificial Intelligence that enables systems to learn from vast
volumes of data and solve specific problems. It uses computer algorithms that improve their
efficiency automatically through experience.
Types of Machine Learning
 How Machine Learning Works
Types of Machine Learning
 How Machine Learning Works
 Consider a system with input data that contains photos of various kinds of fruits. Now the system
wants to group the data according to the different types of fruits. 
 First, the system will analyze the input data. Next, it tries to find patterns, like shapes, size, and color.
 Based on these patterns, the system will try to predict the different types of fruit and segregate them.
 Finally, it keeps track of all the decisions it made during the process to ensure it is learning. The next
time you ask the same system to predict and segregate the different types of fruits, it won't have to go
through the entire process again. That’s how machine learning works.
Types of Machine Learning
Machine Learning

 Supervised Learning
   • Classification: Decision Trees, KNN, Naïve Bayes, SVM, Logistic Regression, Multinomial Logistic Regression
   • Regression: Simple Linear, Multiple Linear, Polynomial

 Unsupervised Learning
   • Clustering: K-Means, K-Modes, K-Medoids, DBScan, Agglomerative, Divisive

 Reinforcement Learning
   • Q-Learning, Markov Decision Process

 Deep Learning
   • Artificial Neural Networks, Convolutional Neural Networks, Recurrent Neural Networks
Types of Machine Learning
 There are primarily three types of machine learning: Supervised, Unsupervised, and
Reinforcement Learning.
• Supervised machine learning: The user supervises the machine while training it to work on its own. This requires labeled training data.
• Unsupervised learning: There is training data, but it won't be labeled.
• Reinforcement learning: The system learns on its own.
1.Supervised Learning
 Supervised learning is a type of machine learning that uses labeled data to train machine
learning models. In labeled data, the output is already known. The model just needs to map
the inputs to the respective outputs. 
 An example of supervised learning is to train a system that identifies the image of an
animal.
 Supervised learning algorithms take labeled inputs and map them to the known outputs,
which means you already know the target variable.
 Supervised Learning methods need external supervision to train machine learning models.
Hence, the name supervised. They need guidance and additional information to return the
desired result.
 First, you have to provide a data set that contains pictures of a kind of fruit, e.g., apples. 
 Then, provide another data set that lets the model know that these are pictures of apples.
This completes the training phase. 
 Next, provide a new set of data that only contains pictures of apples. At this point, the system can recognize what the fruit is and will remember it.
1.Supervised Learning
1.Supervised Learning
 Supervised learning algorithms are generally used for solving classification and regression problems.
• Classification: predicts a class label (categorical output)
• Regression: predicts a numerical value (continuous output)
 Classification: Classification is used when the output variable is
categorical i.e. with 2 or more classes. For example, yes or no,
male or female, true or false, etc.
 In order to predict whether a mail is spam or not, we need to
first teach the machine what a spam mail is. This is done based
on a lot of spam filters - reviewing the content of the mail,
reviewing the mail header, and then searching if it contains any
false information. Certain keywords and blacklist filters from already blacklisted spammers are also used.
 All of these features are used to score the mail and give it a spam score. The lower the total spam score of the email, the more likely it is that the mail is not spam.
 Based on the content, label, and the spam score of the new
incoming mail, the algorithm decides whether it should land in
the inbox or spam folder.
1.Supervised Learning
 Regression:
 Regression is used when the output variable is a real or continuous
value. In this case, there is a relationship between two or more variables
i.e., a change in one variable is associated with a change in the other
variable. For example, salary based on work experience or weight based
on height, etc.
 Let’s consider two variables - humidity and temperature. Here,
‘temperature’ is the independent variable and ‘humidity' is the dependent
variable. If the temperature increases, then the humidity decreases. 
 These two variables are fed to the model and the machine learns the
relationship between them. After the machine is trained, it can easily
predict the humidity based on the given temperature. 
Note: Real-Life Applications of Supervised Learning
Risk Assessment-to assess the risk in financial services or insurance
domains 
Image Classification--Facebook can recognize your friend in a picture from
an album of tagged photos. 
Fraud Detection--To identify whether the transactions made by the user are
authentic or not. 
Visual Recognition--The ability of a machine learning model to identify
objects, places, people, actions, and images
2.Unsupervised Learning
 Unsupervised learning is a type of machine learning that uses unlabeled data to train machines.
 Unlabeled data doesn’t have a fixed output variable.
 The model learns from the data, discovers the patterns and features in the data, and returns the
output. 
 Consider a cluttered dataset: a collection of pictures of different spoons.
 Feed this data to the model, and the model analyzes it to recognize any patterns.
 The machine categorizes the photos into two types based on their similarities.
 Flipkart uses this model to find and recommend products that are well suited for you.
2.Unsupervised Learning
 Depicted below is an example of an unsupervised learning technique that uses the images of
vehicles to classify if it’s a bus or a truck.
 The model learns by identifying the parts of a vehicle, such as the length and width of the vehicle, the front and rear end covers, the roof hoods, the types of wheels used, etc.
 Based on these features, the model classifies if the vehicle is a bus or a truck.
2.Unsupervised Learning
 Unsupervised learning finds patterns and understands the trends in the data to discover the output.
So, the model tries to label the data based on the features of the input data.
 The training process used in unsupervised learning techniques does not need any supervision to
build models. They learn on their own and predict the output.
 Unsupervised learning can be further grouped into types:
1. Clustering
2. Association
 Clustering: Clustering is the method of dividing objects into clusters such that objects within a cluster are similar to each other and dissimilar to objects belonging to other clusters. For example, finding out which customers made similar product purchases.
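A minimal k-means clustering sketch in scikit-learn, using a made-up customer table (monthly data usage and daily call minutes) that loosely mirrors the telecom example discussed next:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customers: [monthly data usage in GB, call minutes per day].
customers = np.array([
    [12, 90], [11, 85], [13, 95],   # heavy data use and long calls
    [10,  5], [11,  8], [12,  4],   # heavy data use, short calls
    [ 1, 80], [ 2, 95], [ 1, 70],   # light data use, long calls
])

# Group the customers into 3 clusters based on similarity.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)

print(kmeans.labels_)            # cluster assignment for each customer
print(kmeans.cluster_centers_)   # the centroid of each cluster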
2.Unsupervised Learning
 Suppose a telecom company wants to reduce its customer churn rate by providing personalized call
and data plans. The behavior of the customers is studied and the model segments the customers
with similar traits. Several strategies are adopted to minimize churn rate and maximize profit
through suitable promotions and campaigns.
 In the resulting grouping (shown as a graph on the original slide), Group A customers use more data and also have high call durations. Group B customers are heavy Internet users, while Group C customers have high call durations. So, Group B will be given more data benefit plans, Group C will be given cheaper call rate plans, and Group A will be given the benefit of both.
 2. Association:
 Association is a rule-based machine learning technique used to discover the probability of the co-occurrence of items in a collection. For example, finding out which products were purchased together.
2.Unsupervised Learning
 Let’s say that a customer goes to a supermarket and buys bread, milk, fruits, and wheat. Another
customer comes and buys bread, milk, rice, and butter. Now, when another customer comes, it is
highly likely that if he buys bread, he will buy milk too. Hence, a relationship is established based on
customer behavior and recommendations are made. 
 Real-Life Applications of Unsupervised Learning:
• Market Basket Analysis: It is a machine learning model based on the algorithm that if you buy a
certain group of items, you are less or more likely to buy another group of items.
• Semantic Clustering: Semantically similar words share a similar context. People post their queries
on websites in their own ways. Semantic clustering groups all these responses with the same
meaning in a cluster to ensure that the customer finds the information they want quickly and easily.
It plays an important role in information retrieval, good browsing experience, and comprehension.
• Delivery Store Optimization: Machine learning models are used to predict the demand and keep up
with supply. They are also used to open stores where the demand is higher and optimizing roots for
more efficient deliveries according to past data and behavior.
• Identifying Accident Prone Areas: Unsupervised machine learning models can be used to identify
accident-prone areas and introduce safety measures based on the intensity of those accidents.
Difference between Supervised and Unsupervised Learning:

S. No | Supervised Learning | Unsupervised Learning
1 | The data used in supervised learning is labeled. The system learns from the labeled data and makes future predictions. | This algorithm does not require any labeled data because its job is to look for patterns in the input data and organize it.
2 | We get feedback: once you receive the output, the system remembers it and uses it for the next operation. | That does not happen with unsupervised learning.
3 | Supervised learning is mostly used to predict data. | Unsupervised learning is used to find hidden patterns or structures in data.
Reinforcement learning:
 Reinforcement learning is a sub-branch of Machine Learning that trains a model to return an
optimum solution for a problem by taking a sequence of decisions by itself. 
 Reinforcement Learning is a feedback-based Machine Learning technique in which an agent learns to
behave in an environment by performing actions and seeing the results of those actions.
 For each good action, the agent gets positive feedback, and for each bad action, the agent gets
negative feedback or a penalty.
• In Reinforcement Learning, the agent learns automatically from feedback, without any labeled
data, unlike supervised learning.
• Since there is no labeled data, the agent learns from its experience alone.
• RL solves a specific type of problem where decision making is sequential, and the goal is long-term,
such as game-playing, robotics, etc.
• The agent interacts with the environment and explores it by itself. The primary goal of an agent in
reinforcement learning is to improve its performance by collecting the maximum positive reward.
• The agent learns by trial and error, and based on this experience, it learns to perform the task in a
better way. Hence, we can say that "Reinforcement learning is a type of machine learning
method where an intelligent agent (computer program) interacts with the environment
and learns to act within it." How a robotic dog learns the movement of its arms is an example of
reinforcement learning.
Reinforcement learning:
• It is a core part of Artificial Intelligence, and all AI agents work on the concept of reinforcement learning.
Here we do not need to pre-program the agent, as it learns from its own experience without any human
intervention.
• Example: Suppose there is an AI agent present within a maze environment, and its goal is to find the
diamond. The agent interacts with the environment by performing some actions, and based on those actions,
the state of the agent changes, and it also receives a reward or penalty as feedback.
• The agent continues doing these three things (take an action, change state or remain in the same state, and
get feedback), and by repeating these actions, it learns and explores the environment.
• The agent learns which actions lead to positive feedback (rewards) and which actions lead to negative
feedback (penalties). As a positive reward, the agent gets a positive point, and as a penalty, it gets a negative
point.
Terms used in Reinforcement Learning:
Agent: An entity that can perceive/explore the environment and act upon it.
Environment: The situation in which the agent is present or by which it is surrounded. In RL, we assume a
stochastic environment, which means it is random in nature.
Action: Actions are the moves taken by an agent within the environment.
State: A situation returned by the environment after each action taken by the agent.
Reward: Feedback returned to the agent from the environment to evaluate the agent's action.
Policy: A strategy applied by the agent to decide the next action based on the current state.
Value: The expected long-term return with the discount factor, as opposed to the short-term reward.
Q-value: Similar to value, but it takes one additional parameter, the current action (a).
Reinforcement Learning:
 Key Features of Reinforcement Learning
• In RL, the agent is not instructed about the environment or which actions need to be taken.
• It is based on a trial-and-error process.
• The agent takes the next action and changes state according to the feedback from the previous action.
• The agent may get a delayed reward.
• The environment is stochastic, and the agent needs to explore it to obtain the maximum positive
reward. A toy Q-learning sketch is shown below.
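 To make the terms above concrete, here is a small, self-contained Q-learning sketch on a hypothetical 1-D corridor where the agent must reach a goal cell; the environment, rewards, and hyperparameters are all illustrative assumptions rather than a standard benchmark.

```python
# Toy Q-learning sketch: an agent learns to walk right along a 1-D corridor to a goal.
import random

N_STATES = 5              # cells 0..4; the goal (reward) is at cell 4
ACTIONS = [-1, +1]        # move left or right
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration (assumed values)

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        # Epsilon-greedy policy: explore sometimes, otherwise act greedily on current Q-values.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])

        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else 0.0  # positive feedback only at the goal

        # Q-learning update: move Q(s, a) toward reward + discounted best future value.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy action in every non-goal state should be "move right" (+1).
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})
```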
 Applications of Reinforcement Learning
Difference between Supervised, Unsupervised and Reinforcement Learning:

S. No | Supervised | Unsupervised | Reinforcement
1 | Data provided is labeled, with output values specified | Data provided is unlabeled; output values are not specified, so the machine makes its own predictions | The machine learns from its environment using rewards and errors
2 | Used to solve classification and regression problems | Used to solve clustering and association problems | Used to solve reward-based problems
3 | Labeled data is used | Unlabeled data is used | No predefined data is used
4 | External supervision | No supervision | No supervision
5 | Solves problems by mapping labeled input to known output | Solves problems by understanding patterns and discovering output | Follows a trial-and-error problem-solving approach
Machine Learning algorithms
Deep Learning
 Deep Learning is about learning multiple levels of representation and
abstraction that help to make sense of data such as images, sound, and text.
It makes use of deep neural networks.
 Deep learning mimics the network of neurons in the brain.
 It is a subset of Machine Learning; its algorithms are constructed with
connected layers.
Feature Selection:
Filter, Wrapper, Embedded
methods
.

Feature Selection
 While developing the machine learning model, only a few variables in the dataset are useful for building the
model, and the rest features are either redundant or irrelevant. If we input the dataset with all these redundant and
irrelevant features, it may negatively impact and reduce the overall performance and accuracy of the model.
Hence it is very important to identify and select the most appropriate/relevant features from the data and remove
the irrelevant or less important features, which is done with the help of feature selection in machine learning.
 “Feature selection is a way of selecting the subset of the most relevant features from the original features set
by removing the redundant, irrelevant, or noisy features.”
What is Feature Selection?
 A feature is an attribute that has an impact on a problem or is useful for the problem, and choosing the important
features for the model is known as feature selection. Each machine learning process depends on feature
engineering, which mainly consists of two processes: Feature Selection and Feature Extraction. Although
feature selection and extraction may have the same objective, they are completely different from each
other. The main difference is that feature selection is about selecting a subset of the original
feature set, whereas feature extraction creates new features. Feature selection is a way of reducing the input
variables for the model by using only relevant data in order to reduce overfitting in the model.
 So, we can define feature Selection as, "It is a process of automatically or manually selecting the subset of most
appropriate and relevant features to be used in model building." Feature selection is performed by either
including the important features or excluding the irrelevant features in the dataset without changing them.
.

 Selecting the best features helps the model to perform well. For example, Suppose we want to create a model that
automatically decides which car should be crushed for a spare part, and to do this, we have a dataset. This dataset
contains a Model of the car, Year, Owner's name, Miles. So, in this dataset, the name of the owner does not
contribute to the model performance as it does not decide if the car should be crushed or not, so we can remove
this column and select the rest of the features(column) for the model building.
Below are some benefits of using feature selection in machine learning:
• It helps in avoiding the curse of dimensionality.
• It helps in the simplification of the model so that it can be easily interpreted by the researchers.
• It reduces the training time.
• It reduces overfitting and hence enhances generalization.
Feature Selection Techniques
 There are mainly two types of Feature Selection techniques, which are:
• Supervised Feature Selection technique
Supervised Feature selection techniques consider the target variable and can be used for the labelled dataset.
• Unsupervised Feature Selection technique
Unsupervised Feature selection techniques ignore the target variable and can be used for the unlabelled dataset.
.
.

1. Filter Methods
 In Filter Method, features are selected on the basis of statistics measures. This method does not depend on the
learning algorithm and chooses the features as a pre-processing step.
 The filter method filters out the irrelevant feature and redundant columns from the model by using different
metrics through ranking.
 The advantage of using filter methods is that it needs low computational time and does not overfit the data.
• Information Gain
• Chi-square Test (computed between each feature and the target variable)
• Fisher's Score (features are ranked by Fisher score in descending order; variables with a large Fisher
score are selected)
• Missing Value Ratio (variables with a ratio higher than the threshold are dropped)
 Information Gain: Information gain determines the reduction in entropy while transforming the dataset. It can
be used as a feature selection technique by calculating the information gain of each variable with respect to the
target variable.
 Chi-square Test: Chi-square test is a technique to determine the relationship between the categorical variables.
The chi-square value is calculated between each feature and the target variable, and the desired number of
features with the best chi-square value is selected.
 Fisher's Score:
.

 Fisher's Score:
 Fisher's score is one of the popular supervised techniques for feature selection. It ranks the
variables on the Fisher criterion in descending order. We can then select the variables with a large Fisher
score.
 Missing Value Ratio:
 The missing value ratio can be used to evaluate the feature set against a threshold value.
The missing value ratio is obtained by dividing the number of missing values in each column by
the total number of observations. Variables with a ratio greater than the threshold value can be dropped.
A short sketch of these filter scores is shown below.
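 As a rough sketch of how filter-style scores can be computed in practice, the snippet below ranks the features of scikit-learn's Iris dataset with the chi-square test and with mutual information (an information-gain measure); the choice of dataset and of k=2 is only an assumption for illustration.

```python
# Filter-method sketch: ranking features by chi-square and mutual information scores.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Chi-square test between each (non-negative) feature and the target.
chi2_selector = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("Chi-square scores:", chi2_selector.scores_)

# Information gain (mutual information) of each feature with respect to the target.
mi_selector = SelectKBest(score_func=mutual_info_classif, k=2).fit(X, y)
print("Mutual information scores:", mi_selector.scores_)

# Keep only the 2 best features according to chi-square.
X_reduced = chi2_selector.transform(X)
print("Reduced shape:", X_reduced.shape)   # (150, 2)
```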

2. Embedded Methods
 Embedded methods combine the advantages of both filter and wrapper methods by considering the
interaction of features along with low computational cost. They are fast, similar to the filter method, but
more accurate than the filter method.
.

 These methods are also iterative: they evaluate each iteration and optimally find the most important features that
contribute the most to training in that iteration. Some techniques of embedded methods are:
• Regularization- Regularization adds a penalty term to different parameters of the machine learning model for
avoiding overfitting in the model. This penalty term is added to the coefficients; hence it shrinks some coefficients
to zero. Those features with zero coefficients can be removed from the dataset. The types of regularization
techniques are L1 Regularization (Lasso Regularization) or Elastic Nets (L1 and L2 regularization).
• Random Forest Importance - Different tree-based methods of feature selection help us with feature importance to
provide a way of selecting features. Here, feature importance specifies which features have more importance in model
building or a greater impact on the target variable. Random Forest is such a tree-based method; it is a type of
bagging algorithm that aggregates a number of decision trees. It automatically ranks the nodes by their
performance or decrease in impurity (Gini impurity) over all the trees. Nodes are arranged as per the impurity
values, which allows pruning of the trees below a specific node. The remaining nodes create a subset of the most
important features. A short sketch of both embedded techniques is shown below.
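 A brief sketch of both embedded techniques on synthetic data; the generated datasets, the alpha value, and the number of trees are assumptions made purely for demonstration.

```python
# Embedded-method sketch: L1 (Lasso) coefficients and Random Forest feature importances.
from sklearn.datasets import make_regression, make_classification
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestClassifier

# L1 regularization: coefficients of irrelevant features tend to shrink to exactly zero.
X_reg, y_reg = make_regression(n_samples=200, n_features=8, n_informative=3,
                               noise=5.0, random_state=0)
lasso = Lasso(alpha=1.0).fit(X_reg, y_reg)
print("Lasso coefficients:", lasso.coef_)          # zero coefficients -> droppable features

# Random Forest importance: impurity-based ranking of features.
X_clf, y_clf = make_classification(n_samples=200, n_features=8, n_informative=3,
                                   random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_clf, y_clf)
print("Feature importances:", forest.feature_importances_)
```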
.

There are mainly three techniques under supervised feature Selection:


3. Wrapper Methods:
 In wrapper methodology, selection of features is done by considering it as a search problem, in which different
combinations are made, evaluated, and compared with other combinations. It trains the algorithm by using the subset of
features iteratively.
 Some techniques of wrapper methods are:
• Forward selection - Forward selection is an iterative process, which begins with an empty set of features. After each
iteration, it keeps adding on a feature and evaluates the performance to check whether it is improving the performance or
not. The process continues until the addition of a new variable/feature does not improve the performance of the model.
• Backward elimination - Backward elimination is also an iterative approach, but it is the opposite of forward selection.
This technique begins the process by considering all the features and removes the least significant feature. This
elimination process continues until removing the features does not improve the performance of the model.
• Exhaustive Feature Selection - Exhaustive feature selection is one of the best feature selection methods, which evaluates
each feature set by brute force. This method tries every possible combination of features and returns the
best-performing feature set.
• Recursive Feature Elimination –

Recursive feature elimination is a recursive greedy optimization approach, where features are selected by recursively
considering a smaller and smaller subset of features. An estimator is trained on each set of features, and the importance
of each feature is determined using the coef_ attribute or the feature_importances_ attribute. A short sketch of these
wrapper techniques is shown below.
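 A minimal sketch of forward selection and RFE using scikit-learn (SequentialFeatureSelector requires scikit-learn 0.24 or later); the Iris dataset, the logistic-regression estimator, and keeping two features are assumptions for illustration.

```python
# Wrapper-method sketch: forward selection and recursive feature elimination (RFE).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector, RFE

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# Forward selection: start empty, keep adding the feature that improves the model most.
forward = SequentialFeatureSelector(estimator, n_features_to_select=2,
                                    direction="forward").fit(X, y)
print("Forward selection kept features:", forward.get_support())

# Recursive Feature Elimination: repeatedly drop the least important feature.
rfe = RFE(estimator, n_features_to_select=2).fit(X, y)
print("RFE kept features:", rfe.support_)
print("RFE ranking (1 = selected):", rfe.ranking_)
```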
.

 How to choose a Feature Selection Method?

 For machine learning engineers, it is very important to understand which feature selection method will
work properly for their model. The better we know the data types of the variables, the easier it is to choose the
appropriate statistical measure for feature selection.
 To know this, we need to first identify the types of the input and output variables. In machine learning, variables
are of mainly two types:
• Numerical Variables: Variables with continuous values such as integers or floats.
• Categorical Variables: Variables with categorical values such as Boolean, ordinal, or nominal values.
 Below are some univariate statistical measures, which can be used for filter-based feature selection:
 1. Numerical Input, Numerical Output:
 Numerical Input variables are used for predictive regression modelling. The common method to be used for
such a case is the Correlation coefficient.
• Pearson's correlation coefficient (For linear Correlation).
• Spearman's rank coefficient (for non-linear correlation).
.

 How to choose a Feature Selection Method?

 2. Numerical Input, Categorical Output:


 Numerical Input with categorical output is the case for classification predictive modelling problems. In this
case, also, correlation-based techniques should be used, but with categorical output.
• ANOVA correlation coefficient (linear).
• Kendall's rank coefficient (nonlinear).
 3. Categorical Input, Numerical Output:
 This is the case of regression predictive modelling with categorical input. It is a different example of a
regression problem. We can use the same measures as discussed in the above case but in reverse order.
 4. Categorical Input, Categorical Output:
 This is a case of classification predictive modelling with categorical Input variables.
 The commonly used technique for such a case is Chi-Squared Test. We can also use Information gain in this
case.
.

Input Variable | Output Variable | Feature Selection Technique
Numerical | Numerical | Pearson's correlation coefficient (linear correlation); Spearman's rank coefficient (non-linear correlation)
Numerical | Categorical | ANOVA correlation coefficient (linear); Kendall's rank coefficient (non-linear)
Categorical | Numerical | Kendall's rank coefficient (linear); ANOVA correlation coefficient (non-linear)
Categorical | Categorical | Chi-squared test (contingency tables); Mutual information
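 To illustrate how a few of the measures in the table above might be computed, the sketch below uses SciPy on small made-up samples; the numbers are arbitrary, and only the function calls matter.

```python
# Sketch: computing some of the filter statistics from the table above with SciPy.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # numerical input (made-up values)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])      # numerical output

r, p = stats.pearsonr(x, y)                   # numerical input, numerical output (linear)
rho, p_rho = stats.spearmanr(x, y)            # numerical input, numerical output (monotonic)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")

# Numerical input, categorical output: ANOVA F-test across class groups.
group_a, group_b = np.array([1.0, 1.2, 0.9]), np.array([2.0, 2.3, 1.9])
f_stat, p_anova = stats.f_oneway(group_a, group_b)
print(f"ANOVA F = {f_stat:.3f}")

# Categorical input, categorical output: chi-squared test on a contingency table.
table = np.array([[20, 10],
                  [15, 25]])
chi2_stat, p_chi2, dof, expected = stats.chi2_contingency(table)
print(f"Chi-squared = {chi2_stat:.3f}")
```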
Feature Normalization:-
Min-max normalization, z-score
normalization,
.

Feature/ Data Normalization


Definition:
 The process of transforming the columns in a dataset to the same/standard scale is referred to as normalization. Data normalization is also
described as the production of clean data, or as the process of organizing data so that it appears consistent across all records and fields.
 Not every dataset needs to be normalized for machine learning. It is required only when the features of a machine learning model have different ranges.
 Normalization is also known as feature scaling; it is applied to numerical features. Many machine learning algorithms, such as gradient descent
methods, the KNN algorithm, and linear and logistic regression, require data scaling to produce good results. Various scalers are defined for this purpose.
 Data normalization consists of remodeling numeric columns to a standard scale. 
 Why do you Need Data Normalization?
 As data becomes more useful to all types of businesses, the manner it is arranged in mass amounts becomes increasingly important. It is clear that when
Data Normalization is done effectively, 
• it results in a better overall business function, 
• from assuring email delivery to preventing misdials and 
• improving group analysis without the fear of duplicates. 
 Let’s say we have a dataset containing two variables: time traveled and distance covered. Time is measured in hours (e.g. 5, 10, 25 hours ) and
distance in miles (e.g. 500, 800, 1200 miles). Do you see the problem?
 One obvious problem of course is that these two variables are measured in two different units — one in hours and the other in miles. The
other problem — which is not obvious but if you take a closer look you'll find it — is the distribution of data, which is quite different in these
two variables (both within and between variables).
 The purpose of normalization is to transform data in a way that they are either dimensionless and/or have similar distributions. This process
of normalization is known by other names such as standardization, feature scaling etc. Normalization is an essential step in data pre-
processing in any machine learning application and model fitting.
.

Data Normalization
 Types of Normalization techniques in Machine Learning
 The most widely used types of normalization in machine learning are:
1. Min-Max Scaling – Subtract the minimum value from each column’s value and divide by the range. Each new column has a
minimum value of 0 and a maximum value of 1.
2. Standardization Scaling – The term "standardization" refers to the process of centering a variable at zero and
standardizing its variance to one. The procedure is to subtract the mean from each observation and then divide by
the standard deviation (Z-score normalization).
 Normalization and standardization
 Normalization and standardization are not the same things. Standardization, interestingly, refers to setting the mean to
zero and the standard deviation to one. Normalization in machine learning is the process of translating data into the range
[0, 1] (or any other range) or simply transforming data onto the unit sphere.
.

1. Min-Max Scaling/Rescaling/Linear Normalization – Subtract the minimum value from each column’s value and divide by the range.
Each new column has a minimum value of 0 and a maximum value of 1.
 This technique is also referred to as scaling. As we have already discussed above, the Min-Max scaling method helps the dataset to
shift and rescale the values of their attributes, so they end up ranging between 0 and 1.
 MinMax Scaler is one of the most popular scaling algorithms. It transforms features by scaling each feature to a given range, which is
generally [0,1], or [-1,-1] in case of negative values.

Xn = (X - Xminimum) / ( Xmaximum - Xminimum)  

• Xn = Value of Normalization
• Xmaximum = Maximum value of a feature
• Xminimum = Minimum value of a feature

Example: Let's assume we have a model dataset having maximum and minimum values of feature as mentioned above. To normalize the
machine learning model, values are shifted and rescaled so their range can vary between 0 and 1. This technique is also known as Min-Max
scaling. In this scaling technique, we will change the feature values as follows:
Case1- If the value of X is minimum, the value of the numerator will be 0; hence the normalized value will also be 0.
• Put X = Xminimum in the above formula, and we get:
• Xn = (Xminimum - Xminimum) / (Xmaximum - Xminimum)
• Xn = 0
.

Case2- If the value of X is maximum, then the value of the numerator is equal to the denominator; hence the normalized value will be 1.
• Put X = Xmaximum in the above formula, and we get:
• Xn = (Xmaximum - Xminimum) / (Xmaximum - Xminimum)
• Xn = 1
Case3- On the other hand, if the value of X is neither maximum nor minimum, then values of normalization will also be between 0
and 1.
 Hence, Normalization can be defined as a scaling method where values are shifted and rescaled to maintain their ranges
between 0 and 1, or in other words; it can be referred to as Min-Max scaling technique.
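 As a minimal sketch of the same computation with scikit-learn, the snippet below min-max scales a tiny made-up dataset (time in hours and distance in miles, as in the earlier example); the values are assumptions for illustration.

```python
# Min-max scaling sketch: each column is rescaled to the range [0, 1].
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical dataset: time traveled (hours) and distance covered (miles).
X = np.array([[ 5.0,  500.0],
              [10.0,  800.0],
              [25.0, 1200.0]])

scaler = MinMaxScaler()                 # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)      # Xn = (X - Xmin) / (Xmax - Xmin), per column
print(X_scaled)
# The minimum of each column maps to 0 and the maximum maps to 1,
# matching Case 1 and Case 2 above.
```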
.

2. Standardization Scaling / Z-score normalization – The term "standardization" refers to the process of centering a variable
at zero and standardizing its variance to one. The procedure is to subtract the mean from each observation and then divide
by the standard deviation:
 Standardization scaling is also known as Z-score normalization, in which values are centered around the mean with a unit
standard deviation; this means the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
Mathematically, we calculate the standardized value by subtracting the mean from the feature value and dividing by the standard
deviation.
 Hence, standardization can be expressed as follows:

Xn = (X - µ) / σ

 Here, µ represents the mean of the feature values, and σ represents the standard deviation of the feature values.
 However, unlike Min-Max scaling technique, feature values are not restricted to a specific range in the standardization
technique.
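 A minimal sketch of the same formula with scikit-learn, reusing the made-up time/distance values from the min-max example; the data is an assumption for illustration.

```python
# Z-score standardization sketch: center each column at 0 with unit standard deviation.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[ 5.0,  500.0],
              [10.0,  800.0],
              [25.0, 1200.0]])

X_std = StandardScaler().fit_transform(X)   # (X - mean) / std, per column
print(X_std)
print("Column means after scaling:", X_std.mean(axis=0))   # approximately 0
print("Column stds after scaling:", X_std.std(axis=0))     # approximately 1
```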
.

 Mean and standard deviation:


 The mean represents the average value in a dataset.
It is calculated as:
Sample mean = Σxi / n
where:
• Σ: A symbol that means “sum”
• xi: The ith observation in a dataset
• n: The total number of observations in the dataset
 The standard deviation represents how spread out the values are in a dataset relative to the mean.
It is calculated as:

Sample standard deviation = √( Σ(xi – xbar)² / (n − 1) )


where:
• Σ: A symbol that means “sum”
• xi: The ith value in the sample
• xbar: The mean of the sample
• n: The sample size
 Note: The relationship between the mean and standard deviation: The mean is used in the formula to calculate the
standard deviation.
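 A small worked sketch of both formulas on a made-up sample of five values (chosen only for illustration), first by hand and then with NumPy's sample standard deviation (ddof=1):

```python
# Mean and sample standard deviation sketch for a small made-up sample.
import numpy as np

x = np.array([4.0, 8.0, 6.0, 5.0, 7.0])

mean = x.sum() / len(x)                                   # Σxi / n = 6.0
std = np.sqrt(((x - mean) ** 2).sum() / (len(x) - 1))     # √( Σ(xi - xbar)² / (n - 1) ) ≈ 1.58

print(mean, std)
print(np.mean(x), np.std(x, ddof=1))   # the same values via NumPy (ddof=1 gives the sample std)
```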
.

 Example:
Introduction to Dimensionality
Reduction
.

Dimensionality Reduction
 The number of input variables or features for a dataset is referred to as its dimensionality.
 More input features often make a predictive modeling task more challenging to model, more generally referred to as the curse of
dimensionality.
 Objective:
• Large numbers of input features can cause poor performance for machine learning algorithms.
• Dimensionality reduction methods include feature selection, linear algebra methods, projection methods, and autoencoders.
 Problem With Many Input Variables
 The performance of machine learning algorithms can degrade with too many input variables.
 If your data is represented using rows and columns, such as in a spreadsheet, then the input variables are the columns that are fed as input
to a model to predict the target variable. Input variables are also called features.
 We can consider the columns of data representing dimensions on an n-dimensional feature space and the rows of data as points in that
space. This is a useful geometric interpretation of a dataset.
 Having a large number of dimensions in the feature space can mean that the volume of that space is very large, and in turn, the points that
we have in that space (rows of data) often represent a small and non-representative sample.
 This can dramatically impact the performance of machine learning algorithms fit on data with many input features, generally referred to as
the “curse of dimensionality.”
 Therefore, it is often desirable to reduce the number of input features.
 This reduces the number of dimensions of the feature space, hence the name “dimensionality reduction.”
.

Data Reduction
 Dimensionality Reduction
 Dimensionality reduction refers to techniques for reducing the number of input variables in training data.
 In machine learning classification problems, there are often too many factors on the basis of which the final
classification is done. These factors are basically variables called features. The higher the number of features, the
harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and
hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the
process of reducing the number of random variables under consideration, by obtaining a set of principal variables. It
can be divided into feature selection and feature extraction.
 An intuitive example of dimensionality reduction can be discussed through a simple e-mail classification problem,
where we need to classify whether the e-mail is spam or not. This can involve a large number of features, such as
whether or not the e-mail has a generic title, the content of the e-mail, whether the e-mail uses a template, etc.
However, some of these features may overlap. In another condition, a classification problem that relies on both
humidity and rainfall can be collapsed into just one underlying feature, since both of the aforementioned are
correlated to a high degree. Hence, we can reduce the number of features in such problems. A 3-D classification
problem can be hard to visualize, whereas a 2-D one can be mapped to a simple 2 dimensional space, and a 1-D
problem to a simple line. The below figure illustrates this concept, where a 3-D feature space is split into two 2-D
feature spaces, and later, if found to be correlated, the number of features can be reduced even further.
.

Data Reduction
.

Data Reduction
 Components of Dimensionality Reduction:
There are two components of dimensionality reduction:
 Feature selection: In this, we try to find a subset of the original set of variables, or features, to
get a smaller subset which can be used to model the problem. It usually involves three ways:
1. Filter
2. Wrapper
3. Embedded
 Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space,
i.e., a space with a smaller number of dimensions.
 Methods of Dimensionality Reduction:
The various methods used for dimensionality reduction include:
1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)
3. Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be either linear or non-linear, depending upon the method used. The
prime linear method is Principal Component Analysis (PCA).
.

Data Reduction
 Why Dimensionality Reduction:
• It prevents the curse of dimensionality
• It improves the performance of the model
• It helps to visualize and understand the data

 Advantages of Dimensionality Reduction


• It helps in data compression, and hence reduced storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.

 Disadvantages of Dimensionality Reduction


• It may lead to some amount of data loss.
• PCA tends to find linear correlations between variables, which is sometimes undesirable.
• PCA fails in cases where mean and covariance are not enough to define datasets.
• We may not know how many principal components to keep; in practice, some rules of thumb are
applied.
Principal Component Analysis
(PCA)
.

Principal Component Analysis (PCA)


 In large dimensional datasets, there might be lots of inconsistencies in the features or lots of
redundant features in the dataset, which will only increase the computation time and make
data processing and EDA more convoluted.
 To get rid of the curse of dimensionality (the set of issues that arise when working with
high-dimensional data), a process called dimensionality reduction was introduced.
 Dimensionality reduction techniques can be used to filter only a limited number of significant
features needed for training and this is where PCA comes in.
 Principal components analysis (PCA) is a dimensionality reduction technique that enables
you to identify correlations and patterns in a data set so that it can be transformed into a
data set of significantly lower dimension without loss of any important information.
 Such a process is very essential in solving complex data-driven problems that involve the use
of high-dimensional data sets. PCA can be achieved via a series of steps. Let’s discuss the
whole end-to-end process.
.

Principal Component Analysis (PCA)


Step By Step Computation Of PCA
 The below steps need to be followed to perform dimensionality reduction using PCA:
1. Standardization of the data (z-score)
2. Computing the covariance matrix
3. Calculating the eigenvectors and eigenvalues
4. Computing the Principal Components
5. Reducing the dimensions of the data set
Step1: Standardization of the data:
 Standardization is all about scaling your data in such a way that all the variables and their values lie
within a similar range.
 Consider an example, let’s say that we have 2 variables in our data set, one has values ranging between
10-100 and the other has values between 1000-5000. In such a scenario, it is obvious that the output
calculated by using these predictor variables is going to be biased since the variable with a larger range
will have a more obvious impact on the outcome.
 Therefore, standardizing the data into a comparable range is very important. Standardization is carried
out by subtracting the mean from each value in the data and dividing by the standard deviation of the data
set.
 It can be calculated as: Z = (value − mean) / standard deviation.
.

Principal Component Analysis (PCA)


Step 2: Computing the covariance matrix
 As mentioned earlier, PCA helps to identify the correlation and dependencies among the features in a data set.
A covariance matrix expresses the correlation between the different variables in the data set. It is essential to
identify heavily dependent variables because they contain biased and redundant information which reduces
the overall performance of the model.
 Mathematically, a covariance matrix is a p × p matrix, where p represents the dimensions of the data set. Each
entry in the matrix represents the covariance of the corresponding variables.
 Consider a case where we have a 2-dimensional data set with variables a and b; the covariance matrix is the 2×2
matrix shown below:

    [ Cov(a, a)   Cov(a, b) ]
    [ Cov(b, a)   Cov(b, b) ]
 In the above matrix:


• Cov(a, a) represents the covariance of a variable with itself, which is nothing but the variance of the variable ‘a’
• Cov(a, b) represents the covariance of the variable ‘a’ with respect to the variable ‘b’. And since covariance is
commutative, Cov(a, b) = Cov(b, a)
 Here are the key takeaways from the covariance matrix:
• The covariance value denotes how co-dependent two variables are with respect to each other
• If the covariance value is negative, the respective variables are inversely related: one tends to decrease as the other increases
• A positive covariance denotes that the respective variables are directly related: they tend to increase or decrease together
.

Principal Component Analysis (PCA)


Step 3: Calculating the Eigenvectors and Eigenvalues
Eigenvectors and eigenvalues are the mathematical constructs that must be computed from
the covariance matrix in order to determine the principal components of the data set.
But first, let’s understand more about principal components
 What are Principal Components?
 Simply put, principal components are the new set of variables that are obtained from the
initial set of variables. The principal components are computed in such a manner that
newly obtained variables are highly significant and independent of each other. The
principal components compress and possess most of the useful information that was
scattered among the initial variables.
 If your data set is of 5 dimensions, then 5 principal components are computed, such that,
the first principal component stores the maximum possible information and the second
one stores the remaining maximum info and so on.
 Now, where do Eigenvectors fall into this whole process?
.

Principal Component Analysis (PCA)


Step 3: Calculating the Eigenvectors and Eigenvalues
 Assuming that you all have a basic understanding of Eigenvectors and
eigenvalues, we know that these two algebraic formulations are always computed
as a pair, i.e, for every eigenvector there is an eigenvalue. The dimensions in the
data determine the number of eigenvectors that you need to calculate.
 Consider a 2-Dimensional data set, for which 2 eigenvectors (and their respective
eigenvalues) are computed. The idea behind eigenvectors is to use the
Covariance matrix to understand where in the data there is the most amount of
variance. Since more variance in the data denotes more information about the
data, eigenvectors are used to identify and compute Principal Components.
 Eigenvalues, on the other hand, denote the scalar factors (magnitudes) of the respective
eigenvectors; the larger the eigenvalue, the more variance is captured along its eigenvector. Therefore,
eigenvectors and eigenvalues together determine the Principal Components of the data set.
.

Principal Component Analysis (PCA)


Step 4: Computing the Principal Components
 Once we have computed the Eigenvectors and eigenvalues, all we have to do is order them
in the descending order, where the eigenvector with the highest eigenvalue is the most
significant and thus forms the first principal component. The principal components of lesser
significances can thus be removed in order to reduce the dimensions of the data.
 The final step in computing the Principal Components is to form a matrix known as the
feature matrix that contains all the significant data variables that possess maximum
information about the data.
Step 5: Reducing the dimensions of the data set
 The last step in performing PCA is to re-arrange the original data with the final
principal components which represent the maximum and the most significant
information of the data set. In order to replace the original data axis with the
newly formed Principal Components, you simply multiply the transpose of the
original data set by the transpose of the obtained feature vector.
 So that was the theory behind the entire PCA process. It’s time to get your hands
dirty and perform all these steps by using a real data set.
 URL: https://www.youtube.com/watch?v=osgqQy9Hr8s
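 As a rough end-to-end sketch of the five steps above, the snippet below standardizes the Iris dataset and reduces it to two principal components with scikit-learn, which performs the covariance and eigen decomposition internally; the dataset and the choice of two components are assumptions for illustration, not part of the original walkthrough.

```python
# PCA sketch: standardize the data, then project it onto the top-2 principal components.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)            # 150 samples, 4 features

# Step 1: standardization (zero mean, unit variance per feature).
X_std = StandardScaler().fit_transform(X)

# Steps 2-5: PCA internally computes the covariance matrix, its eigenvectors/eigenvalues,
# orders the components by explained variance, and projects the data onto the top ones.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print("Reduced shape:", X_reduced.shape)                     # (150, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Principal axes (eigenvectors):\n", pca.components_)
```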
.

Principal Component Analysis (PCA)


 Finding the covariance of a matrix:
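 The worked example originally shown on this slide is not reproduced here; as a minimal sketch, the covariance matrix of a small two-variable dataset can be computed with NumPy (the numbers below are arbitrary, chosen only for illustration).

```python
# Covariance matrix sketch for two made-up variables a and b.
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0])
b = np.array([1.0, 3.0, 5.0, 7.0])

# rowvar=False treats each column as a variable; the result is the 2x2 matrix
# [[Cov(a, a), Cov(a, b)],
#  [Cov(b, a), Cov(b, b)]]
cov_matrix = np.cov(np.column_stack([a, b]), rowvar=False)
print(cov_matrix)
```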
.

Principal Component Analysis (PCA)


 Finding eigenvalues and eigenvectors:
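 The numerical example originally shown on this slide is not reproduced here; as a minimal sketch, the eigenvalues and eigenvectors of a small symmetric (covariance-like) matrix can be obtained with NumPy, using an assumed 2×2 matrix for illustration.

```python
# Eigenvalue/eigenvector sketch for a small symmetric (covariance-like) matrix.
import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(C)
print("Eigenvalues:", eigenvalues)        # 3.0 and 1.0 for this matrix
print("Eigenvectors (columns):\n", eigenvectors)

# Sorting the eigenvectors by eigenvalue (descending) gives the principal components.
order = np.argsort(eigenvalues)[::-1]
print("Sorted eigenvalues:", eigenvalues[order])
```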
.

Principal Component Analysis (PCA)


 Advantages of Principal Component Analysis
• Correlated features are removed
• Enhances the performance of the algorithm.
• Enhanced Visualization
 Disadvantages of Principal Component Analysis
• The principal components are difficult to interpret
• Data normalization is required.
• Loss of information
Linear Discriminant Analysis
(LDA)
.

Linear Discriminant Analysis (LDA)


 Linear Discriminant Analysis or Normal Discriminant Analysis or Discriminant Function Analysis
is a dimensionality reduction technique that is commonly used for supervised classification
problems.
 It is used for modelling differences in groups i.e. separating two or more classes.
 It is used to project the features in higher dimension space into a lower dimension space.
 LDA is similar to PCA (principal component analysis) in the sense that LDA reduces the dimensions.
However, the main purpose of LDA is to find the line (or plane) that best separates data points
belonging to different classes.
 The key idea behind LDA is that the decision boundary should be chosen such that it maximizes
the distance between the means of the two classes while simultaneously minimizing the variance
within each class's data (the within-class scatter).
 For the mathematical calculation, see: https://www.youtube.com/watch?v=mtTVXZq-9gE
.

Linear Discriminant Analysis (LDA)


 This criterion is known as the Fisher criterion and can be expressed as the following formula
for two classes:

J(w) = (m1 − m2)² / (s1² + s2²)

where m1 and m2 are the means of the two classes after projection onto w, and s1² and s2² are the
corresponding within-class variances (scatter). Maximizing J(w) pushes the projected class means apart
while keeping each class compact.
 The following are some of the benefits of using LDA:


• LDA is used for classification problems.
• LDA is a powerful tool for dimensionality reduction. 
• LDA is less susceptible to the "curse of dimensionality" than many other machine learning
algorithms.
.

Linear Discriminant Analysis (LDA)


 How LDA works and the steps involved in the process?
• LDA is a supervised machine learning algorithm that can be used for both classification and
dimensionality reduction.
LDA algorithm works based on the following steps:
• The first step is to calculate the means and standard deviation of each feature.
• The within-class scatter matrix and the between-class scatter matrix are calculated.
• These matrices are then used to calculate the eigenvectors and eigenvalues.
• LDA chooses the k eigenvectors with the largest eigenvalues to form a transformation matrix.
• LDA uses this transformation matrix to transform the data into a new space with k dimensions.
• Once the transformation matrix transforms the data into new space with k dimensions, LDA can then
be used for classification or dimensionality reduction
 Examples of how LDA can be used in practice
The following are some examples of how LDA can be used in practice:
• LDA can be used for classification, such as classifying emails as spam or not spam.
• LDA can be used for dimensionality reduction, such as reducing the number of features in a dataset.
• LDA can be used to find the most important features in a dataset.
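 A minimal sketch of both uses (dimensionality reduction and classification) with scikit-learn's LinearDiscriminantAnalysis; the Iris dataset and the choice of two components are assumptions for illustration.

```python
# LDA sketch: supervised dimensionality reduction and classification on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)            # 3 classes, 4 features

# With 3 classes, LDA can project onto at most (3 - 1) = 2 discriminant axes.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print("Projected shape:", X_lda.shape)       # (150, 2)

# The same fitted model can also be used directly as a classifier.
print("Training accuracy:", lda.score(X, y))
```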
.
