Machine Learning Unit-1
By
R Soujanya
Assistant Professor,
CSE, GRIET
Unit-1 Content
Introduction: Introduction to Machine learning, Supervised learning, Unsupervised
learning, Reinforcement learning. Deep learning.
1. Machine learning is the "field of study that gives computers the ability to learn without being
explicitly programmed", as defined by Arthur Samuel in 1959.
In machine learning, algorithms are trained to find patterns and correlations in large data
sets and to make the best decisions and predictions based on that analysis.
(OR)
2. "A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with experience E"
– Tom Mitchell, 1998.
In other words, Machine Learning is the study of algorithms that improve their performance P at some task T with
experience E.
Continued..
Machine learning behaves similarly to the growth of a child. As a child grows, her experience E in
performing task T increases, which results in higher performance measure (P).
For instance, we give a “shape sorting block” toy to a child. (We know that the toy has different
shapes and shape holes).
In this case, our task T is to find an appropriate shape hole for a shape. Afterward, the child observes
the shape and tries to fit it in a shaped hole.
Let us say that this toy has three shapes: a circle, a triangle, and a square. In her first attempt at
finding a shaped hole, her performance measure(P) is 1/3, which means that the child found 1 out of 3
correct shape holes.
Next, the child tries again and is now a little more experienced at this task.
Considering the experience gained (E), the child attempts the task another time, and when measuring the
performance (P), it turns out to be 2/3. After repeating this task (T) 100 times, the child has figured
out which shape goes into which shape hole.
Continued..
So as her experience (E) increased, her performance (P) also increased: the more attempts she made at
this toy, the higher her performance, which results in higher accuracy.
Such execution is similar to machine learning. What a machine does is, it takes a task (T),
executes it, and measures its performance (P). Now a machine has a large number of data, so
as it processes that data, its experience (E) increases over time, resulting in a higher
performance measure (P). So after going through all the data, our machine learning model’s
accuracy increases, which means that the predictions made by our model will be very
accurate.
3. Machine Learning is the ability of systems to learn from data, identify patterns, and act on what is
learned from that data with little or no human interaction.
Machine learning makes day-to-day and repetitive work much easier!
Machine Learning
Data → Information → Knowledge
Need For Machine Learning
Traditional Programming:
Data and a program are run on the computer to produce the output.
Data + Program → Computer → Output
Machine Learning:
Data and output are run on the computer to create a program. This program can then be used in traditional programming.
Data + Output → Computer → Program
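A tiny sketch of this contrast (assuming scikit-learn and numpy are installed; the Celsius-to-Fahrenheit data below is just an illustrative choice). In traditional programming the rule is written by hand; in machine learning the "program" (here, the learned coefficients) is produced from data and outputs:

from sklearn.linear_model import LinearRegression
import numpy as np

# Traditional programming: the rule (program) is written by hand.
def fahrenheit_from_celsius(c):
    return 1.8 * c + 32.0                                        # human-coded rule

# Machine learning: the rule is learned from data and known outputs.
celsius = np.array([[0.0], [10.0], [20.0], [30.0], [40.0]])      # data
fahrenheit = np.array([32.0, 50.0, 68.0, 86.0, 104.0])           # known outputs

model = LinearRegression().fit(celsius, fahrenheit)   # this "creates the program"
print(model.coef_[0], model.intercept_)               # learned rule: about 1.8 and 32.0
print(model.predict([[25.0]]))                        # use the learned program on new data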
Relation between Data Science, Machine Learning, Deep Learning & Artificial Intelligence
Deep Learning (DL): Algorithms based on highly complex neural networks that mimic
the way a human brain works to detect patterns in large unstructured data sets.
Deep learning is the evolution of machine learning and neural networks, which uses
advanced computer programming and training to understand complex patterns hidden
in large data sets.
DL is about understanding how the human brain works in different situations and then
trying to recreate its behaviour.
Deep learning is used to complete complex tasks and train models using unstructured
data.
Ex: Deep learning is commonly used in image classification tasks like facial recognition.
Although machine learning models can also identify faces, deep learning models are
more accurate.
In this case, it takes the unstructured data (images of faces) and extracts factors such
as the various facial features. The extracted features are then matched to those stored
in a database.
Two major advantages of DL over ML:
1. Feature Extraction
Machine learning algorithms such as Naive Bayes, Logistic Regression, SVM, etc., are
termed "flat algorithms". By flat, we mean that these algorithms require a pre-processing
phase (known as feature extraction, which is quite complicated and computationally
expensive) before being applied to data such as images, text, or CSV files.
For instance, if we want to determine whether a particular image is of a cat or dog using
the ML model.
We have to manually extract features from the image such as size, color, shape, etc., and
then give these features to the ML model to identify whether the image is of a dog or cat.
However, DL models do not need any feature extraction pre-processing step and are capable of
classifying data into different classes and categories themselves. That is, in the case of
identifying a cat or dog in an image, we do not need to extract features from the
image and give them to the DL model. Instead, the image can be given as direct input to the
DL model, whose job is then to classify it without human intervention.
Raw Data is given to DL model. Pre-processed data is given to ML model.
Two major advantages of DL over ML:
2. Big Data
With technology and the ever-increasing use of the web, it is estimated that every
second 1.7MB of data is generated by every person on the planet Earth. Therefore,
analyzing and learning from data is of utmost importance.
Deep Learning is seen as a rocket whose fuel is data.
The accuracy of ML models stops increasing with an increasing amount of data after
a point while the accuracy of the DL model keeps on increasing with increasing data.
All the technologies at a glance
ML Tools
Step:1 - Gathering the data
Data:
Data can be any unprocessed fact, value, text, sound, or picture that has not yet been
interpreted and analyzed.
Data is the most important part of all Data Analytics, Machine Learning, Artificial
Intelligence.
Without data, we cannot train any model, and all modern research and automation would be
in vain. Big enterprises spend lots of money just to gather as much data
as possible.
Data is typically divided into two types: labeled and unlabeled. Labeled data includes a
label or target variable that the model(Supervised) is trying to predict, whereas
unlabeled data does not include a label or target variable (UnSupervised) .
A labeled dataset is one where you already know the target answer.
The data used in machine learning is typically numerical or categorical.
• Numerical data includes values that can be ordered and measured, such as age or
income.(Regression-if target variable is numerical)
• Categorical data/Nominal data: includes values that represent categories, such as
gender or type of fruit.(Classification-if target variable is Categorical)
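A small sketch of how this distinction shows up in practice (assuming pandas is installed; the column names and values are invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "age": [20, 40, 35],                      # numerical
    "income": [12000, 15000, 20000],          # numerical
    "gender": ["M", "F", "F"],                # categorical
    "buys_computer": ["Yes", "No", "Yes"],    # categorical target -> classification task
})

numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numerical_cols)    # ['age', 'income']
print(categorical_cols)  # ['gender', 'buys_computer']
# If the target variable were numerical (e.g. income), the task would be regression instead.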
Types of Data
• Nominal data
• Ordinal data
• Discrete data
• Continuous data
1. Quantitative Data Type:
This type of data consists of numerical values, i.e., anything that is measured by numbers.
E.g., profit, quantity sold, height, weight, temperature, etc.
This is again of two types:
A) Discrete data type (counting process):
Numeric data which takes discrete values or whole numbers. If expressed in decimal format, such values
have no proper meaning. Their values can be counted.
E.g.: number of cars you have, number of marbles in a container, students in a class, etc.
Types of Data
B) Continuous data type (measuring process):
Numerical measures which can take any value within a certain range. If expressed in decimal format, such
values have true meaning. Their values cannot be counted, only measured, and the values can be infinite.
E.g.: height, weight, time, area, distance, measurement of rainfall, etc.
Types of Data
2. Qualitative Data Type:
These are the data types that cannot be expressed in numbers. They describe categories
or groups and are hence known as the categorical data type.
This can be divided into:
A. Structured data:
This type of data is either numbers or words. It can take numerical values, but
mathematical operations cannot be performed on it. This type of data is expressed in
tabular format.
E.g.: Sunny = 1, Cloudy = 2, Windy = 3, or binary-form data like 0 or 1, Good or Bad, etc.
Types of Data
B. Unstructured data:
This type of data does not have a proper format and is therefore known as
unstructured data. It comprises textual data, sounds, images, videos, etc.
Types of Data
Besides this, there are also other types, referred to as data type preliminaries or data
measures:
1. Nominal
2. Ordinal
3. Interval
4. Ratio
These can also be referred to as different scales of measurement.
I. Nominal data type:
This is used to express names or labels which are not ordered or measurable.
E.g., male or female (gender), race, country, etc.
• Dependent variable: the output variable, also called the target variable or response variable.
Types of datasets:
1.Data set consists of only numerical attributes
2.Data set consists of only categorical attributes
3.Data set consists of both numerical and categorical attributes
Dataset1 (only numerical attributes):
age   income   height   weight
20    12000    6.3      30
40    15000    5.2      70
35    20000    5.6      65

Dataset2 (only categorical attributes):
age      income      student
youth    Fair        Yes
youth    Good        No
senior   excellent   Yes

Dataset3 (numerical and categorical attributes):
age      income    Credit rating
youth    12000     Yes
senior   15000     No
middle   20000     Yes
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets
and summarize their main characteristics, often employing data visualization methods.
Data Preparation
Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.
Data Preprocessing is a technique that is used to convert the raw data into a clean data set. In other
words, whenever the data is gathered from different sources it is collected in raw format which is not
feasible for the analysis.
Data preparation is also known as "data pre-processing," "data wrangling," "data cleaning," and
"feature engineering." It is the stage of the machine learning workflow that follows data gathering.
A few essential tasks when working with data in the data preparation step (a short code sketch follows the list):
• Data cleaning: This task includes the identification of errors and making corrections or improvements to
those errors.
• Feature Selection: We need to identify the most important or relevant input data variables for the model.
• Data Transforms: Data transformation involves converting raw data into a well-suitable format for the
model.
• Feature Engineering: Feature engineering involves deriving new variables from the available dataset.
• Dimensionality Reduction: The dimensionality reduction process involves converting higher
dimensions into lower dimension features without changing the information
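A minimal sketch of a few of these tasks with pandas (assuming pandas is installed; the file name "customers.csv" and the column names are placeholders, not from the original material):

import pandas as pd

df = pd.read_csv("customers.csv")            # hypothetical raw dataset

# Data cleaning: drop exact duplicates and fill missing numeric values.
df = df.drop_duplicates()
df["income"] = df["income"].fillna(df["income"].median())

# Data transform: convert a text category into a uniform lowercase label.
df["gender"] = df["gender"].str.strip().str.lower()

# Feature engineering: derive a new variable from the available ones.
df["income_per_member"] = df["income"] / df["household_size"]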
Data Preparation
Data preparation is the process of cleaning and transforming raw data so that ML algorithms can make accurate predictions.
Although data preparation is considered the most complicated stage in ML, it reduces process complexity later
in real-time projects. Various issues have been reported during the data preparation step in machine learning, as follows:
• Mismatched data types: When you collect data from many different sources, it may come to you in different formats.
While the ultimate goal of this entire process is to reformat your data for machines, you still need to begin with similarly
formatted data. For example, if part of your analysis involves family income from multiple countries, you’ll have to convert
each income amount into a single currency.
• Mixed data values: Perhaps different sources use different descriptors for features – for example, man or male. These
value descriptors should all be made uniform.
• Data outliers: Outliers can have a huge impact on data analysis results. For example if you're averaging test scores for a
class, and one student didn’t respond to any of the questions, their 0% could greatly skew the results.
• Missing data: Take a look for missing data fields, blank spaces in text, or unanswered survey questions. This could be due
to human error or incomplete data. To take care of missing data, you’ll have to perform data cleaning.
• Unstructured data format: Data comes from various sources and needs to be extracted into a different format. Hence,
before deploying an ML project, always consult with domain experts or import data from known sources.
• Limited Features: Whenever data comes from a single source, it contains limited features, so it is necessary to import data
from various sources for feature enrichment or build multiple features in datasets.
• Understanding feature engineering: Feature engineering helps develop additional content in the ML models, increasing
model performance and accuracy of predictions.
Data Preprocessing
Data preprocessing is the process of transforming raw data into a useful, understandable format. Real-world
or raw data usually has inconsistent formatting, human errors, and can also be incomplete. Data preprocessing
resolves such issues and makes datasets more complete and efficient to perform data analysis.
In other words, data preprocessing is transforming data into a form that computers can easily work on. It
makes data analysis or visualization easier and increases the accuracy and speed of the machine learning
algorithms that train on the data.
Why is data preprocessing required?
A database is a collection of data points. Data points are also called observations, data samples, events, and
records.
Each sample is described using different characteristics, also known as features or attributes. Data
preprocessing is essential to effectively build models with these features.
If you’re aggregating data from two or more independent datasets, the gender field may have two different
values for men: man and male. Likewise, if you’re aggregating data from ten different datasets, a field that’s
present in eight of them may be missing from the other two.
By preprocessing data, we make it easier to interpret and use. This process eliminates inconsistencies or
duplicates in data, which can otherwise negatively affect a model’s accuracy. Data preprocessing also ensures
that there aren’t any incorrect or missing values due to human error or bugs. In short, employing data
preprocessing techniques makes the database more complete and accurate.
An outlier can be treated as noise, although some consider it a valid data point. Suppose you’re
training an algorithm to detect tortoises in pictures. The image dataset may contain images of turtles
wrongly labeled as tortoises. This can be considered noise.
However, there can be a tortoise’s image that looks more like a turtle than a tortoise. That sample can
be considered an outlier and not necessarily noise. This is because we want to teach the algorithm all
possible ways to detect tortoises, and so, deviation from the group is essential.
For numeric values, you can use a scatter plot or box plot to identify outliers.
The following are some methods used to solve the problem of noise:
• Regression: Regression analysis can help determine the variables that have an impact. This will
enable you to work with only the essential features instead of analyzing large volumes of data. Both
linear regression and multiple linear regression can be used for smoothing the data.
• Binning: Binning methods can be used for a collection of sorted data. They smooth a sorted value
by looking at the values around it. The sorted values are divided into "bins," i.e., smaller segments
of the same size. There are different techniques for binning, including smoothing by bin means and
smoothing by bin medians (a short sketch follows this list).
• Clustering: Clustering algorithms such as k-means clustering can be used to group data and detect
outliers in the process.
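As an illustration of smoothing by bin means (a sketch in plain Python; the sorted values and the bin size are invented for the example):

# Sorted data, split into equal-size bins of 3 values each.
values = [4, 8, 9, 15, 21, 21, 24, 25, 26]
bin_size = 3

smoothed = []
for i in range(0, len(values), bin_size):
    bin_vals = values[i:i + bin_size]
    mean = sum(bin_vals) / len(bin_vals)      # smoothing by bin means
    smoothed.extend([round(mean, 1)] * len(bin_vals))

print(smoothed)   # [7.0, 7.0, 7.0, 19.0, 19.0, 19.0, 25.0, 25.0, 25.0]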
2. Data integration
Data integration combines data from multiple sources into a coherent data store as part of the data
analysis task. These sources may include multiple databases. How can the data be matched up?
For example, a data analyst finds Customer_ID in one database and cust_id in another: how can
he be sure that these two belong to the same entity? Databases and data warehouses
have metadata (data about data), which helps in avoiding such errors.
Since data is collected from various sources, data integration is a crucial part of data preparation. Integration
may lead to several inconsistent and redundant data points, ultimately leading to models with inferior
accuracy.
Here are some approaches to integrate data:
• Data consolidation: Data is physically brought together and stored in a single place. Having all data in
one place increases efficiency and productivity. This step typically involves using
data warehouse software.
• Data virtualization: In this approach, an interface provides a unified and real-time view of data from
multiple sources. In other words, data can be viewed from a single point of view.
• Data propagation: Involves copying data from one location to another with the help of specific
applications. This process can be synchronous or asynchronous and is usually event-driven.
3. Data reduction
As the name suggests, data reduction is used to reduce the amount of data and thereby reduce the
costs associated with data mining or data analysis.
It offers a condensed representation of the dataset. Although this step reduces the volume, it
maintains the integrity of the original data. This data preprocessing step is especially crucial when
working with big data as the amount of data involved would be gigantic.
The following are some techniques used for data reduction.
Dimensionality reduction, also known as dimension reduction, reduces the number of features or
input variables in a dataset.
The number of features or input variables of a dataset is called its dimensionality. The higher the
number of features, the more troublesome it is to visualize the training dataset and create a predictive
model.
In some cases, most of these attributes are correlated, hence redundant; therefore, dimensionality
reduction algorithms can be used to reduce the number of random variables and obtain a set of
principal variables.
3. Data reduction
There are two segments of dimensionality reduction: feature selection and feature extraction.
i. Feature selection (selecting a subset of the variables)--try to find a subset of the original set of features. This
allows us to get a smaller subset that can be used to visualize the problem using data modeling
ii. Feature extraction (extracting new variables from the data)---reduces the data in a high-dimensional space
to a lower-dimensional space, or in other words, space with a lesser number of dimensions.
The following are some ways to perform dimensionality reduction (a sketch of two of the filters follows this list):
• Principal component analysis (PCA): A statistical technique used to extract a new set of variables from a large
set of variables. The newly extracted variables are called principal components. This method works only for
features with numerical values.
• High correlation filter: A technique used to find highly correlated features and remove them; otherwise, a pair
of highly correlated variables can increase the multicollinearity in the dataset.
• Missing values ratio: This method removes attributes having missing values more than a specified threshold.
• Low variance filter: Involves removing normalized attributes having variance less than a threshold value as
minor changes in data translate to less information.
• Random forest: This technique is used to assess the importance of each feature in a dataset, allowing us to keep
just the top most important features.
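A rough sketch of the missing values ratio and the low variance filter with pandas (the thresholds and the toy data are arbitrary choices, not from the original material):

import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 62],
    "height": [5.6, None, None, None, 5.9],   # mostly missing
    "city":   [1, 1, 1, 1, 1],                # no variance at all
})

# Missing values ratio: drop columns with more than 50% missing values.
missing_ratio = df.isna().mean()
df = df.drop(columns=missing_ratio[missing_ratio > 0.5].index)

# Low variance filter: drop numeric columns whose variance is below a threshold.
variances = df.var(numeric_only=True)
df = df.drop(columns=variances[variances < 0.01].index)

print(df.columns.tolist())   # only ['age'] remains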
4. Data Transformation
Data transformation is the process of converting data from one format to another. In essence, it involves methods for
transforming data into appropriate formats that the computer can learn efficiently from.
For example, the speed units can be miles per hour, meters per second, or kilometers per hour. Therefore a dataset may store
values of the speed of a car in different units as such. Before feeding this data to an algorithm, we need to transform the data into
the same unit.
The following are some strategies for data transformation.
Smoothing
This statistical approach is used to remove noise from the data with the help of algorithms. It helps highlight the most valuable
features in a dataset and predict patterns. It also involves eliminating outliers from the dataset to make the patterns more visible.
Aggregation
Aggregation refers to pooling data from multiple sources and presenting it in a unified format for data mining or analysis.
Aggregating data from various sources to increase the number of data points is essential as only then the ML model will have
enough examples to learn from.
Discretization
Discretization involves converting continuous data into sets of smaller intervals. For example, it’s more efficient to place people
in categories such as “teen,” “young adult,” “middle age,” or “senior” than using continuous age values.
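A quick sketch of discretization with pandas (the age bins and labels below are just one possible choice):

import pandas as pd

ages = pd.Series([14, 19, 23, 37, 45, 61, 70])

age_groups = pd.cut(
    ages,
    bins=[0, 17, 30, 55, 120],
    labels=["teen", "young adult", "middle age", "senior"],
)
print(age_groups.tolist())
# ['teen', 'young adult', 'young adult', 'middle age', 'middle age', 'senior', 'senior']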
Generalization
Generalization involves converting low-level data features into high-level data features. For instance, categorical attributes such
as home address can be generalized to higher-level definitions such as city or state.
4. Data Transformation
Normalization
Normalization refers to the process of converting all data variables into a specific range. In other words,
it’s used to scale the values of an attribute so that it falls within a smaller range, for example, 0 to 1.
Decimal scaling, min-max normalization, and z-score normalization are some methods of data
normalization.
Feature construction
Feature construction involves constructing new features from the given set of features. This method
simplifies the original dataset and makes it easier to analyze, mine, or visualize the data.
Concept hierarchy generation
Concept hierarchy generation lets you create a hierarchy between features, although it isn’t specified. For
example, if you have a house address dataset containing data about the street, city, state, and country, this
method can be used to organize the data in hierarchical forms.
Accurate data, accurate results
Machine learning algorithms are like kids. They have little to no understanding of what’s favorable or
unfavorable. Like how kids start repeating foul language picked up from adults, inaccurate or inconsistent
data easily influences ML models. The key is to feed them high-quality, accurate data, for which data
preprocessing is an essential step.
Step:3 – Choosing the Learning Model
Types of Machine Learning
[Diagram: taxonomy of machine learning algorithms]
• Supervised learning: decision trees, KNN, simple linear regression, multiple linear regression, multinomial logistic regression
• Unsupervised learning: K-Means, K-Modes, divisive (hierarchical) clustering
• Reinforcement learning: Markov decision process
• Deep learning: artificial neural networks, convolutional neural networks, recurrent neural networks
Step:4 – Training the Model
Training set & Test Set
Training the Model
The dataset split ratio mainly depends on two things: first, the total number of
samples (instances/rows) in your data, and second, the actual model you are
training.
Train/Validation/Test is a method to measure the accuracy of your model.
We can split the data set into three sets: a training set , Validation and
testing set.
70%/80% for training, and 30%/20% for testing.(it depends on the given data)
Train the model means create the model.
Test the model means test the accuracy of the model.
The fundamental purpose of splitting the dataset is to assess how effective
the trained model will be in generalizing to new data.
This split can be achieved by using train_test_split function of scikit-learn.
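A minimal sketch of such a split with scikit-learn (the 80/20 ratio, the iris dataset, and the random_state are arbitrary choices):

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# 80% of the rows go to training, 20% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)   # (120, 4) (30, 4)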
Training the Model
Training dataset: The sample of data used to fit the model. The actual dataset
that we use to train the model (weights and biases in the case of a Neural
Network). The model sees and learns from this data.
This is the actual dataset from which a model learns, i.e., the model sees and learns
from this data to predict the outcome or to make the right decisions.
Most of the training data is collected from several resources and then preprocessed
and organized to provide proper performance of the model.
The type of training data hugely determines the ability of the model to generalize, i.e., the
better the quality and diversity of the training data, the better the performance of
the model will be.
This data is more than 60% of the total data available for the project.
Training the Model
Test dataset : The sample of data used to provide an unbiased evaluation of a
final model fit on the training dataset.
This dataset is independent of the training set but has a somewhat similar type of
probability distribution of classes and is used as a benchmark to evaluate the model,
used only after the training of the model is complete.
Testing set is usually a properly organized dataset having all kinds of data for
scenarios that the model would probably be facing when used in the real world. Often
the validation and testing set combined is used as a testing set which is not
considered a good practice.
If the accuracy of the model on the training data is greater than that on the testing data,
then the model is said to be overfitting.
This data is approximately 20-25% of the total data available for the project.
Training the Model
Validation dataset: The sample of data used to provide an unbiased evaluation of a
model fit on the training dataset while tuning model hyperparameters. The evaluation
becomes more biased as skill on the validation dataset is incorporated into the model
configuration.
The validation set is used to fine-tune the hyperparameters of the model and is considered
a part of the training of the model.
The model only sees this data for evaluation but does not learn from this data, providing
an objective unbiased evaluation of the model.
The validation dataset can also be utilized for regularization, by interrupting training of the model
when the loss on the validation dataset becomes greater than the loss on the training dataset (early
stopping), i.e., reducing bias and variance. This data is approximately 10-15% of the total data
available for the project, but this can change depending upon the number of hyperparameters, i.e., if
the model has many hyperparameters, then using a larger validation set will give better results.
Now, whenever the accuracy of model on validation data is greater than that on training
data then the model is said to have generalized well.
Step:5 – Performance Evaluation
Performance metrics
Evaluating the performance of a Machine learning model is one of the important steps while building an
effective ML model. To evaluate the performance or quality of the model, different metrics are used,
and these metrics are known as performance metrics or evaluation metrics.
These performance metrics help us understand how well our model has performed for the given data. In
this way, we can improve the model's performance by tuning the hyper-parameters. Each ML model aims
to generalize well on unseen/new data, and performance metrics help determine how well the model
generalizes on the new dataset.
Performance metrics
In machine learning, each task or problem is divided into classification and Regression. Not all metrics
can be used for all types of problems; hence, it is important to know and understand which metrics
should be used. Different evaluation metrics are used for both Regression and Classification tasks. In
this topic, we will discuss metrics used for classification and regression tasks.
Performance Metrics for Classification
In a classification problem, the category or classes of data is identified based on training data. The
model learns from the given dataset and then classifies the new data into classes or groups based on the
training. It predicts class labels as the output, such as Yes or No, 0 or 1, Spam or Not Spam, etc. To
evaluate the performance of a classification model, different metrics are used, and some of them are
as follows:
1. Accuracy - the ratio of the number of correct predictions to the total number of
predictions.
2. Confusion Matrix
3. Precision
4. Recall
5. F-Score
6. AUC(Area Under the Curve)-ROC
Performance metrics
1. Accuracy - the ratio of the number of correct predictions to the total number of predictions.
2. Confusion Matrix:
A confusion matrix is a tabular representation of prediction outcomes of any binary classifier, which is
used to describe the performance of the classification model on a set of test data when true values are
known.
The confusion matrix is simple to implement, but the terminologies used in this matrix might be confusing
for beginners.
A typical confusion matrix for a binary classifier looks like the image below (however, it can be
extended to classifiers with more than two classes).
Performance metrics
Accuracy for the matrix can be calculated by summing the values lying across the "main diagonal"
and dividing by the total number of samples, i.e.,
Accuracy = (True Positives + True Negatives) / Total Number of Samples
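A short sketch of these classification metrics with scikit-learn (the label vectors are invented for illustration):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

print(confusion_matrix(y_true, y_pred))    # [[TN FP], [FN TP]] = [[3 1], [1 3]]
print(accuracy_score(y_true, y_pred))      # (TP + TN) / total = 6/8 = 0.75
print(precision_score(y_true, y_pred))     # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))        # TP / (TP + FN) = 3/4
print(f1_score(y_true, y_pred))            # harmonic mean of precision and recall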
Performance Metrics for Regression
2. Mean Squared Error - It measures the average of the squared difference between the values predicted by the
model and the actual values.
3. R2 Score - R squared is also known as the Coefficient of Determination, which is another popular metric used for
regression model evaluation. The R-squared metric enables us to compare our model with a constant baseline to determine
the performance of the model. To select the constant baseline, we take the mean of the data and draw the line at the
mean.
Performance metrics
4. Adjusted R2
Adjusted R squared, as the name suggests, is the improved version of R squared error. R squared has the limitation
that its score improves as more terms are added to the model, even though the model is not actually improving, which
may mislead data scientists.
To overcome the issue of R square, adjusted R squared is used, which will always show a lower value than R². It is
because it adjusts the values of increasing predictors and only shows improvement if there is a real improvement.
We can calculate the adjusted R squared as follows:
Adjusted R² = 1 − [(1 − R²) × (n − 1) / (n − k − 1)], where n is the number of samples and k is the number of independent variables (predictors).
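A brief sketch of these regression metrics (assuming scikit-learn and numpy; the values and the single-predictor assumption are invented for illustration):

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3, 10.5])

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

n, k = len(y_true), 1                        # n samples, k predictors (assumed)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(mse, r2, adjusted_r2)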
Types of Machine Learning
There are primarily three types of machine learning: Supervised, Unsupervised, and
Reinforcement Learning.
• Supervised machine learning: the user supervises the machine while training it to work on its
own. This requires labeled training data.
• Unsupervised learning: there is training data, but it is not labeled.
• Reinforcement learning: the system learns on its own.
1.Supervised Learning
Supervised learning is a type of machine learning that uses labeled data to train machine
learning models. In labeled data, the output is already known. The model just needs to map
the inputs to the respective outputs.
An example of supervised learning is to train a system that identifies the image of an
animal.
Supervised learning algorithms take labeled inputs and map them to the known outputs,
which means you already know the target variable.
Supervised Learning methods need external supervision to train machine learning models.
Hence, the name supervised. They need guidance and additional information to return the
desired result.
First, you have to provide a data set that contains pictures of a kind of fruit, e.g., apples.
Then, provide another data set that lets the model know that these are pictures of apples.
This completes the training phase.
Next, provide a new set of data that only contains pictures of apples. At this point, the
system can recognize what the fruit is and will remember it.
1.Supervised Learning
Supervised learning algorithms are generally used for solving
classification and regression problems.
• Classification -- predicts a class label (categorical output)
• Regression -- predicts a numerical quantity (continuous output)
Classification: Classification is used when the output variable is
categorical i.e. with 2 or more classes. For example, yes or no,
male or female, true or false, etc.
In order to predict whether a mail is spam or not, we need to
first teach the machine what a spam mail is. This is done based
on a lot of spam filters - reviewing the content of the mail,
reviewing the mail header, and then searching if it contains any
false information. Certain keyword and blacklist filters are used to block mails from
already blacklisted spammers.
All of these features are used to score the mail and give it a
spam score. The lower the total spam score of the email, the
more likely it is that it is not spam.
Based on the content, label, and the spam score of the new
incoming mail, the algorithm decides whether it should land in
the inbox or spam folder.
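A toy sketch of such a spam classifier (assuming scikit-learn; the example messages and labels are made up, and a real filter would use far more data and features):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

mails = ["win a free prize now", "meeting at 10 am tomorrow",
         "claim your free lottery reward", "project report attached"]
labels = ["spam", "not spam", "spam", "not spam"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(mails)          # turn text into word-count features

model = MultinomialNB().fit(X, labels)       # learn from the labeled examples

new_mail = vectorizer.transform(["free prize waiting for you"])
print(model.predict(new_mail))               # ['spam']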
1.Supervised Learning
Regression:
Regression is used when the output variable is a real or continuous
value. In this case, there is a relationship between two or more variables
i.e., a change in one variable is associated with a change in the other
variable. For example, salary based on work experience or weight based
on height, etc.
Let’s consider two variables - humidity and temperature. Here,
‘temperature’ is the independent variable and ‘humidity' is the dependent
variable. If the temperature increases, then the humidity decreases.
These two variables are fed to the model and the machine learns the
relationship between them. After the machine is trained, it can easily
predict the humidity based on the given temperature.
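A minimal sketch of this idea with scikit-learn (the temperature and humidity readings are invented and assume a roughly linear inverse relationship):

import numpy as np
from sklearn.linear_model import LinearRegression

temperature = np.array([[20], [25], [30], [35], [40]])   # independent variable
humidity = np.array([80, 72, 65, 58, 50])                # dependent variable

model = LinearRegression().fit(temperature, humidity)
print(model.predict([[28]]))    # predicted humidity for 28 degrees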
Note: Real-Life Applications of Supervised Learning
Risk Assessment-to assess the risk in financial services or insurance
domains
Image Classification--Facebook can recognize your friend in a picture from
an album of tagged photos.
Fraud Detection--To identify whether the transactions made by the user are
authentic or not.
Visual Recognition--The ability of a machine learning model to identify
objects, places, people, actions, and images
2.Unsupervised Learning
Unsupervised learning is a type of machine learning that uses unlabeled data to train machines.
Unlabeled data doesn’t have a fixed output variable.
The model learns from the data, discovers the patterns and features in the data, and returns the
output.
Consider a cluttered dataset: a collection of pictures of different spoons.
Feed this data to the model, and the model analyzes it to recognize any patterns.
The machine then categorizes the photos into two types, as shown in the image, based on their
similarities.
Flipkart uses this model to find and recommend products that are well suited for you.
2.Unsupervised Learning
Depicted below is an example of an unsupervised learning technique that uses the images of
vehicles to classify if it’s a bus or a truck.
The model learns by identifying the parts of a vehicle, such as the length and width of the vehicle, the
front and rear end covers, the roof hood, the types of wheels used, etc.
Based on these features, the model classifies if the vehicle is a bus or a truck.
2.Unsupervised Learning
Unsupervised learning finds patterns and understands the trends in the data to discover the output.
So, the model tries to label the data based on the features of the input data.
The training process used in unsupervised learning techniques does not need any supervision to
build models. They learn on their own and predict the output.
Unsupervised learning can be further grouped into types:
1. Clustering
2. Association
Clustering: Clustering is the method of dividing objects into clusters such that objects within a cluster
are similar to one another and dissimilar to objects belonging to other clusters. For example, finding out
which customers made similar product purchases.
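A small sketch of clustering with k-means (assuming scikit-learn; the two features, say monthly data usage and call duration, and their values are invented):

import numpy as np
from sklearn.cluster import KMeans

# Each row: [monthly data usage in GB, average call duration in minutes]
customers = np.array([[20, 5], [22, 4], [2, 60], [3, 55], [18, 50], [19, 45]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # centre of each discovered segment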
2.Unsupervised Learning
Suppose a telecom company wants to reduce its customer churn rate by providing personalized call
and data plans. The behavior of the customers is studied and the model segments the customers
with similar traits. Several strategies are adopted to minimize churn rate and maximize profit
through suitable promotions and campaigns.
On the right side of the image, you can see a graph where customers are grouped. Group A
customers use more data and also have high call durations. Group B customers are heavy Internet
users, while Group C customers have high call duration. So, Group B will be given more data benefit
plans, while Group C will be given cheaper call rate plans, and Group A will be given the
benefit of both.
2. Association:
Association is a rule-based machine learning technique used to discover the probability of the co-occurrence
of items in a collection. For example, finding out which products were purchased together.
2.Unsupervised Learning
Let’s say that a customer goes to a supermarket and buys bread, milk, fruits, and wheat. Another
customer comes and buys bread, milk, rice, and butter. Now, when another customer comes, it is
highly likely that if he buys bread, he will buy milk too. Hence, a relationship is established based on
customer behavior and recommendations are made.
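A very small sketch of this idea in plain Python, estimating the confidence of the rule bread → milk from a handful of invented transactions (real association mining would use an algorithm such as Apriori):

transactions = [
    {"bread", "milk", "fruits", "wheat"},
    {"bread", "milk", "rice", "butter"},
    {"bread", "butter"},
    {"milk", "fruits"},
]

has_bread = [t for t in transactions if "bread" in t]
has_both = [t for t in has_bread if "milk" in t]

# Confidence of the rule bread -> milk: P(milk | bread)
confidence = len(has_both) / len(has_bread)
print(confidence)   # 2/3: two of the three bread buyers also bought milk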
Real-Life Applications of Unsupervised Learning:
• Market Basket Analysis: It is a machine learning model based on the algorithm that if you buy a
certain group of items, you are less or more likely to buy another group of items.
• Semantic Clustering: Semantically similar words share a similar context. People post their queries
on websites in their own ways. Semantic clustering groups all these responses with the same
meaning in a cluster to ensure that the customer finds the information they want quickly and easily.
It plays an important role in information retrieval, good browsing experience, and comprehension.
• Delivery Store Optimization: Machine learning models are used to predict demand and keep up
with supply. They are also used to open stores where demand is higher and to optimize routes for
more efficient deliveries according to past data and behavior.
• Identifying Accident Prone Areas: Unsupervised machine learning models can be used to identify
accident-prone areas and introduce safety measures based on the intensity of those accidents.
Difference between Supervised and Unsupervised Learning:
1. The data used in supervised learning is labeled; the system learns from the labeled data and makes
future predictions. Unsupervised algorithms do not require any labeled data, because their job is to look
for patterns in the input data and organize it.
2. In supervised learning we get feedback: once you receive the output, the system remembers it and uses
it for the next operation. That does not happen with unsupervised learning.
3. Supervised learning is used to solve classification and regression problems; unsupervised learning is
used to solve clustering and association problems; reinforcement learning is used to solve reward-based
problems.
Feature Selection:
Filter, Wrapper, and Embedded methods
Feature Selection
While developing the machine learning model, only a few variables in the dataset are useful for building the
model, and the rest features are either redundant or irrelevant. If we input the dataset with all these redundant and
irrelevant features, it may negatively impact and reduce the overall performance and accuracy of the model.
Hence it is very important to identify and select the most appropriate/relevant features from the data and remove
the irrelevant or less important features, which is done with the help of feature selection in machine learning.
“Feature selection is a way of selecting the subset of the most relevant features from the original features set
by removing the redundant, irrelevant, or noisy features.”
What is Feature Selection?
A feature is an attribute that has an impact on a problem or is useful for the problem, and choosing the important
features for the model is known as feature selection. Each machine learning process depends on feature
engineering, which mainly contains two processes; which are Feature Selection and Feature Extraction. Although
feature selection and extraction processes may have the same objective, both are completely different from each
other. The main difference between them is that feature selection is about selecting the subset of the original
feature set, whereas feature extraction creates new features. Feature selection is a way of reducing the input
variable for the model by using only relevant data in order to reduce overfitting in the model.
So, we can define feature Selection as, "It is a process of automatically or manually selecting the subset of most
appropriate and relevant features to be used in model building." Feature selection is performed by either
including the important features or excluding the irrelevant features in the dataset without changing them.
Selecting the best features helps the model to perform well. For example, Suppose we want to create a model that
automatically decides which car should be crushed for a spare part, and to do this, we have a dataset. This dataset
contains a Model of the car, Year, Owner's name, Miles. So, in this dataset, the name of the owner does not
contribute to the model performance as it does not decide if the car should be crushed or not, so we can remove
this column and select the rest of the features(column) for the model building.
Below are some benefits of using feature selection in machine learning:
• It helps in avoiding the curse of dimensionality.
• It helps in the simplification of the model so that it can be easily interpreted by the researchers.
• It reduces the training time.
• It reduces overfitting and hence enhances generalization.
Feature Selection Techniques
There are mainly two types of Feature Selection techniques, which are:
• Supervised Feature Selection technique
Supervised Feature selection techniques consider the target variable and can be used for the labelled dataset.
• Unsupervised Feature Selection technique
Unsupervised Feature selection techniques ignore the target variable and can be used for the unlabelled dataset.
1. Filter Methods
In the filter method, features are selected on the basis of statistical measures. This method does not depend on the
learning algorithm and chooses the features as a pre-processing step.
The filter method filters out the irrelevant feature and redundant columns from the model by using different
metrics through ranking.
The advantage of using filter methods is that they need low computational time and do not overfit the data.
• Information Gain
• Chi-square Test (a chi-square test between each feature and the target variable)
• Fisher's Score (features are ranked by Fisher's criterion in descending order, and the variables with the
largest scores are selected)
• Missing Value Ratio (a variable whose missing-value ratio is higher than the threshold is ignored)
Information Gain: Information gain determines the reduction in entropy while transforming the dataset. It can
be used as a feature selection technique by calculating the information gain of each variable with respect to the
target variable.
Chi-square Test: Chi-square test is a technique to determine the relationship between the categorical variables.
The chi-square value is calculated between each feature and the target variable, and the desired number of
features with the best chi-square value is selected.
Fisher's Score:
Fisher's score is one of the popular supervised technique of features selection. It returns the rank of the
variable on the fisher's criteria in descending order. Then we can select the variables with a large fisher's
score.
Missing Value Ratio:
The value of the missing value ratio can be used for evaluating the feature set against the threshold value.
The formula for obtaining the missing value ratio is the number of missing values in each column divided by
the total number of observations. A variable whose ratio is higher than the threshold value can be dropped.
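A short sketch of a filter method, chi-square based selection with scikit-learn (the choice of the iris dataset and k=2 is arbitrary):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)            # 4 input features

selector = SelectKBest(score_func=chi2, k=2) # keep the 2 best-scoring features
X_new = selector.fit_transform(X, y)

print(selector.scores_)                      # chi-square score per feature
print(X_new.shape)                           # (150, 2)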
2. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by considering the
interaction of features along with low computational cost. They are fast processing methods similar to the
filter method but more accurate than the filter method.
These methods are also iterative; they evaluate each iteration and optimally find the most important features that
contribute the most to training in that particular iteration. Some techniques of embedded methods are listed below,
followed by a short sketch:
• Regularization- Regularization adds a penalty term to different parameters of the machine learning model for
avoiding overfitting in the model. This penalty term is added to the coefficients; hence it shrinks some coefficients
to zero. Those features with zero coefficients can be removed from the dataset. The types of regularization
techniques are L1 Regularization (Lasso Regularization) or Elastic Nets (L1 and L2 regularization).
• Random Forest Importance - Different tree-based methods of feature selection help us with feature importance to
provide a way of selecting features. Here, feature importance specifies which feature has more importance in model
building or has a great impact on the target variable. Random Forest is such a tree-based method, which is a type of
bagging algorithm that aggregates a different number of decision trees. It automatically ranks the nodes by their
performance or decrease in the impurity (Gini impurity) over all the trees. Nodes are arranged as per the impurity
values, and thus it allows pruning of the trees below a specific node. The remaining nodes create a subset of the most
important features.
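A rough sketch of the regularization approach using Lasso with scikit-learn (the diabetes dataset and the alpha value are arbitrary choices; alpha would normally be tuned):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

lasso = Lasso(alpha=0.1).fit(X, y)           # L1 penalty shrinks some coefficients to zero

kept = np.flatnonzero(lasso.coef_ != 0)      # features with non-zero coefficients survive
print(kept)                                  # indices of the selected features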
For machine learning engineers, it is very important to understand which feature selection method will
work properly for their model. The better we know the data types of the variables, the easier it is to choose
the appropriate statistical measure for feature selection.
To know this, we need to first identify the type of input and output variables. In machine learning, variables
are of mainly two types:
• Numerical variables: variables with continuous values such as integer or float.
• Categorical variables: variables with categorical values such as Boolean, ordinal, or nominal.
Below are some univariate statistical measures, which can be used for filter-based feature selection:
1. Numerical Input, Numerical Output:
Numerical Input variables are used for predictive regression modelling. The common method to be used for
such a case is the Correlation coefficient.
• Pearson's correlation coefficient (For linear Correlation).
• Spearman's rank coefficient (for non-linear correlation).
Data Normalization
Types of Normalization techniques in Machine Learning
The most widely used types of normalization in machine learning are:
1. Min-Max Scaling – Subtract the minimum value from each column’s value and divide by the range. Each new column has a
minimum value of 0 and a maximum value of 1.
2. Standardization Scaling – The term “standardization” refers to the process of centering a variable at zero and
standardizing the variance at one. Subtracting the mean of each observation and then dividing by the standard deviation is
the procedure.(Z-score normalization)
Normalization and standardization
Normalization and standardization are not the same things. Standardization, interestingly, refers to setting the mean to
zero and the standard deviation to one. Normalization in machine learning is the process of translating data into the range
[0, 1] (or any other range) or simply transforming data onto the unit sphere.
1. Min-Max Scaling/Rescaling/Linear Normalization – Subtract the minimum value from each column’s value and divide by the range.
Each new column has a minimum value of 0 and a maximum value of 1.
This technique is also referred to as scaling. As we have already discussed above, the Min-Max scaling method helps the dataset to
shift and rescale the values of their attributes, so they end up ranging between 0 and 1.
MinMax Scaler is one of the most popular scaling algorithms. It transforms features by scaling each feature to a given range, which is
generally [0, 1], or [-1, 1] in the case of negative values.
Xn = (X - Xminimum) / ( Xmaximum - Xminimum)
• Xn = Value of Normalization
• Xmaximum = Maximum value of a feature
• Xminimum = Minimum value of a feature
Example: Let's assume we have a model dataset having maximum and minimum values of feature as mentioned above. To normalize the
machine learning model, values are shifted and rescaled so their range can vary between 0 and 1. This technique is also known as Min-Max
scaling. In this scaling technique, we will change the feature values as follows:
Case 1 - If the value of X is the minimum, the value of the numerator will be 0; hence the normalized value will also be 0.
• Put X = Xminimum in the above formula; we get
• Xn = (Xminimum - Xminimum) / (Xmaximum - Xminimum)
• Xn = 0
Case 2 - If the value of X is the maximum, then the value of the numerator is equal to the denominator; hence the normalized value will be 1.
• Put X = Xmaximum in the above formula; we get
• Xn = (Xmaximum - Xminimum) / (Xmaximum - Xminimum)
• Xn = 1
Case3- On the other hand, if the value of X is neither maximum nor minimum, then values of normalization will also be between 0
and 1.
Hence, Normalization can be defined as a scaling method where values are shifted and rescaled to maintain their ranges
between 0 and 1, or in other words; it can be referred to as Min-Max scaling technique.
2. Standardization Scaling / Z-score normalization – The term “standardization” refers to the process of centering a variable
at zero and standardizing the variance at one. Subtracting the mean of each observation and then dividing by the standard
deviation is the procedure:
Standardization scaling is also known as Z-score normalization, in which values are centered around the mean with a unit
standard deviation, which means the attribute becomes zero and the resultant distribution has a unit standard deviation.
Mathematically, we can calculate the standardization by subtracting the feature value from the mean and dividing it by standard
deviation.
Hence, standardization can be expressed as follows:
X' = (X - µ) / σ
Here, µ represents the mean of the feature values, and σ represents the standard deviation of the feature values.
However, unlike Min-Max scaling technique, feature values are not restricted to a specific range in the standardization
technique.
Example:
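A small sketch of both techniques with scikit-learn (the feature values are invented for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[20.0], [25.0], [40.0], [60.0]])

print(MinMaxScaler().fit_transform(ages).ravel())
# [0.    0.125 0.5   1.   ]  -> values rescaled to the range [0, 1]

print(StandardScaler().fit_transform(ages).ravel())
# mean 0, unit standard deviation; values are not restricted to a fixed range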
Introduction to Dimensionality Reduction
Dimensionality Reduction
The number of input variables or features for a dataset is referred to as its dimensionality.
More input features often make a predictive modeling task more challenging to model, more generally referred to as the curse of
dimensionality.
Objective:
• Large numbers of input features can cause poor performance for machine learning algorithms.
• Dimensionality reduction methods include feature selection, linear algebra methods, projection methods, and autoencoders.
Problem With Many Input Variables
The performance of machine learning algorithms can degrade with too many input variables.
If your data is represented using rows and columns, such as in a spreadsheet, then the input variables are the columns that are fed as input
to a model to predict the target variable. Input variables are also called features.
We can consider the columns of data representing dimensions on an n-dimensional feature space and the rows of data as points in that
space. This is a useful geometric interpretation of a dataset.
Having a large number of dimensions in the feature space can mean that the volume of that space is very large, and in turn, the points that
we have in that space (rows of data) often represent a small and non-representative sample.
This can dramatically impact the performance of machine learning algorithms fit on data with many input features, generally referred to as
the “curse of dimensionality.”
Therefore, it is often desirable to reduce the number of input features.
This reduces the number of dimensions of the feature space, hence the name “dimensionality reduction.”
Data Reduction
Dimensionality Reduction
Dimensionality reduction refers to techniques for reducing the number of input variables in training data.
In machine learning classification problems, there are often too many factors on the basis of which the final
classification is done. These factors are basically variables called features. The higher the number of features, the
harder it gets to visualize the training set and then work on it. Sometimes, most of these features are correlated, and
hence redundant. This is where dimensionality reduction algorithms come into play. Dimensionality reduction is the
process of reducing the number of random variables under consideration, by obtaining a set of principal variables. It
can be divided into feature selection and feature extraction.
An intuitive example of dimensionality reduction can be discussed through a simple e-mail classification problem,
where we need to classify whether the e-mail is spam or not. This can involve a large number of features, such as
whether or not the e-mail has a generic title, the content of the e-mail, whether the e-mail uses a template, etc.
However, some of these features may overlap. In another condition, a classification problem that relies on both
humidity and rainfall can be collapsed into just one underlying feature, since both of the aforementioned are
correlated to a high degree. Hence, we can reduce the number of features in such problems. A 3-D classification
problem can be hard to visualize, whereas a 2-D one can be mapped to a simple 2 dimensional space, and a 1-D
problem to a simple line. The below figure illustrates this concept, where a 3-D feature space is split into two 2-D
feature spaces, and later, if found to be correlated, the number of features can be reduced even further.
Data Reduction
Components of Dimensionality Reduction:
There are two components of dimensionality reduction:
Feature selection: In this, we try to find a subset of the original set of variables, or features, to
get a smaller subset which can be used to model the problem. It usually involves three ways:
1. Filter
2. Wrapper
3. Embedded
Feature extraction: This reduces the data in a high dimensional space to a lower dimension space,
i.e. a space with lesser no. of dimensions.
Methods of Dimensionality Reduction:
The various methods used for dimensionality reduction include:
1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)
3. Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be either linear or non-linear, depending upon the method used. The
prime linear method is Principal Component Analysis (PCA).
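A brief sketch of PCA with scikit-learn, reducing the four iris features to two principal components (the dataset and the choice of two components are arbitrary):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)            # 150 samples, 4 features

pca = PCA(n_components=2)                    # keep 2 principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)         # variance captured by each component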
Data Reduction
Why Dimensionality Reduction:
• It prevents the curse of dimensionality
• It improves the performance of the model
• It helps to visualize and understand the data