
UNIT IV

MACHINE LEARNING BASICS

Overview of machine learning concepts - Overfitting and Underfitting -
Correctness - The Bias-Variance Trade-off - Feature Extraction and Selection -
Decision Trees - Linear Regression - Naive Bayes.
OVERVIEW OF MACHINE LEARNING CONCEPTS
• Artificial Intelligence and Machine Learning are closely related, and yet
they have some differences.
• Artificial Intelligence is an overarching concept that aims to create
machines that mimic human-level intelligence.
• Artificial Intelligence is a general concept that deals with creating
human-like critical thinking capability and reasoning skills for machines.
• Machine Learning, on the other hand, is a subset or specific application of
Artificial Intelligence that aims to create machines that can learn
autonomously from data.
• Machine Learning is specific rather than general: it allows a machine to
make predictions or take decisions on a specific problem using data.
MACHINE LEARNING: DEFINITION
• Machine learning is a branch of artificial intelligence (AI) and
computer science which focuses on the use of data and algorithms to
imitate the way that humans learn, gradually improving its accuracy.

• Machine learning (ML) is the subset of artificial intelligence (AI)
that focuses on building systems that learn or improve performance
based on the data they consume.

• A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at
tasks in T, as measured by P, improves with experience E. For example,
in spam filtering, T is classifying emails as spam or not spam, P is the
fraction of emails classified correctly, and E is a corpus of emails
already labelled by users.
BROAD TYPES OF MACHINE LEARNING
• Supervised Learning
• Unsupervised Learning
• Semi-supervised Learning
• Reinforcement Learning
SUPERVISED MACHINE LEARNING

• Supervised learning is the type of machine learning in which machines
are trained using well-"labelled" training data, and on the basis of that
data, machines predict the output. Labelled data means input data that is
already tagged with the correct output.
• In supervised learning, the training data provided to the machines works
as the supervisor that teaches the machines to predict the output
correctly. It applies the same concept as a student learning under the
supervision of a teacher.
• Supervised learning is a process of providing input data as well as correct
output data to the machine learning model. The aim of a supervised
learning algorithm is to find a mapping function that maps the input
variable (x) to the output variable (y).
• In the real world, supervised learning can be used for risk assessment,
image classification, fraud detection, spam filtering, etc.
SUPERVISED MACHINE LEARNING
• Supervised Machine Learning includes Regression and Classification
algorithms.
• Regression models the relation between an independent and a dependent
variable: it demonstrates the impact on the dependent variable when the
independent variable is changed in any way.
• The independent variable is therefore called the explanatory variable,
and the dependent variable is called the factor of interest.

Some of the more popular algorithms in these categories are:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
Advantages of Supervised Learning:

• With the help of supervised learning, the model can predict the output on
the basis of prior experience.
• In supervised learning, we can have an exact idea about the classes of
objects.
• Supervised learning models help us solve various real-world problems such
as fraud detection, spam filtering, etc.

Disadvantages of Supervised Learning:

• Supervised learning models are not suitable for handling complex tasks.
• Supervised learning cannot predict the correct output if the test data is
different from the training dataset.
• Training requires a lot of computation time.
• In supervised learning, we need enough knowledge about the classes of
objects.
UNSUPERVISED LEARNING
• Unsupervised learning is a type of machine learning in which models
are trained using an unlabelled dataset and are allowed to act on that
data without any supervision.

• Unsupervised learning algorithms are used to draw inferences from
datasets consisting of input data without labelled responses.

• In unsupervised learning, no classification or categorization is
included in the observations.
Advantages of Unsupervised Learning

• Unsupervised learning is used for more complex tasks than supervised
learning because, in unsupervised learning, we don't have labelled input
data.
• Unsupervised learning is often preferable because unlabelled data is much
easier to obtain than labelled data.

Disadvantages of Unsupervised Learning

• Unsupervised learning is intrinsically more difficult than supervised
learning, as it has no corresponding output to learn from.
• The result of an unsupervised learning algorithm might be less accurate,
as the input data is not labelled and the algorithm does not know the
exact output in advance.
SOME OF THE MORE POPULAR ALGORITHMS IN THESE CATEGORIES ARE:

• K-means clustering (a minimal sketch follows this list)
• KNN (k-nearest neighbours)
• Hierarchical clustering
• Anomaly detection
• Neural Networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
• Singular Value Decomposition
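As a concrete illustration of the first algorithm above, here is a hedged
K-means sketch, assuming scikit-learn; the two "blobs" of toy data are
illustrative only:

    # Minimal K-means clustering sketch (scikit-learn assumed).
    import numpy as np
    from sklearn.cluster import KMeans

    # Unlabelled data: two obvious groups, but no labels are provided.
    X = np.array([[1, 1], [1.5, 2], [2, 1.5],
                  [8, 8], [8.5, 9], [9, 8.5]])

    # The algorithm groups the points into k clusters on its own.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)           # e.g. [0 0 0 1 1 1] (cluster ids are arbitrary)
    print(kmeans.cluster_centers_)  # one centroid per cluster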
WHAT IS MACHINE LEARNING USED FOR?

• Machine Learning is used in almost all modern technologies, and this is
only going to increase in the future.
• In fact, there are applications of Machine Learning in fields ranging
from smartphone technology to healthcare to social media, and so on.
• Smartphones use personal voice assistants such as Siri, Alexa and
Cortana.
• Machine Learning is also used in social media; take Facebook's 'People
you may know' as an example.
• Machine Learning is also very important in healthcare, where it can be
used to help diagnose a variety of problems in the medical field.
OVERFITTING AND UNDERFITTING
• Overfitting and underfitting are the two main problems that occur in
machine learning and degrade the performance of machine learning models.
• The main goal of each machine learning model is to generalize well.
Generalization is the ability of an ML model to produce suitable output
when given previously unseen inputs.
• In other words, after being trained on the dataset, the model can produce
reliable and accurate output on new data.
OVERFITTING

• Overfitting occurs when our machine learning model tries to cover all the
data points, or more data points than required, in the given dataset.
• Because of this, the model starts capturing the noise and inaccurate
values present in the dataset, and all these factors reduce the efficiency
and accuracy of the model.
• An overfitted model has low bias and high variance.
OVERFITTING AND UNDERFITTING
• Before understanding overfitting and underfitting, let's define some basic
terms that will help in understanding this topic well:
• Signal: the true underlying pattern of the data that helps the machine
learning model learn from the data.
• Noise: unnecessary and irrelevant data that reduces the performance of
the model.
• Bias: a prediction error introduced into the model by oversimplifying the
machine learning algorithm; equivalently, the difference between the
predicted values and the actual values.
• Variance: the error that appears when the machine learning model performs
well on the training dataset but does not perform well on the test
dataset.
OVERFITTING AND UNDERFITTING

[Figure: scatter plot with an overfitted curve passing through every point.]
In the figure, the model tries to cover every data point in the scatter
plot. It may look efficient, but in reality it is not: the goal of a
regression model is to find the best-fit line, and here, instead of a best
fit, the wiggly curve will generate prediction errors on new data.
HOW TO AVOID OVERFITTING IN A MODEL

• Both overfitting and underfitting degrade the performance of a machine
learning model, but the more common cause is overfitting, and there are
several ways to reduce its occurrence (a brief cross-validation sketch
follows this list):
• Cross-validation
• Training with more data
• Removing features
• Early stopping of training
• Regularization
• Ensembling
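A hedged cross-validation sketch, assuming scikit-learn; the dataset and
model choice are illustrative only:

    # 5-fold cross-validation sketch (scikit-learn assumed).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Each fold is held out once as validation data, so a model that
    # merely memorises the training folds scores poorly on average.
    scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
    print(scores.mean())  # average held-out accuracy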
UNDERFITTING

• Underfitting occurs when our machine learning model is not able to
capture the underlying trend of the data.
• To avoid overfitting, the feeding of training data can be stopped at an
early stage, as a result of which the model may not learn enough from the
training data.
• Consequently, it may fail to find the best fit for the dominant trend in
the data.
• In the case of underfitting, the model is not able to learn enough from
the training data, which reduces accuracy and produces unreliable
predictions.
UNDERFITTING

• An underfitted model has high bias and low variance.

[Figure: plot of a straight line that misses the trend of the data points.]
In the figure, the model is unable to capture the pattern of the data
points in the plot.

How to avoid underfitting (a short sketch using polynomial features follows):
• By increasing the training time of the model.
• By increasing the number of features.
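A hedged sketch of curing underfitting by increasing model capacity: a
straight line underfits quadratic data, and adding polynomial features (i.e.
increasing the number of features) helps. The toy data is illustrative only:

    # Fixing underfitting with extra (polynomial) features (scikit-learn assumed).
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = np.linspace(-3, 3, 50).reshape(-1, 1)
    y = X.ravel() ** 2 + rng.normal(0, 0.5, 50)   # quadratic trend + noise

    linear = LinearRegression().fit(X, y)
    print("linear R^2:", linear.score(X, y))      # low: the line underfits

    X_poly = PolynomialFeatures(degree=2).fit_transform(X)
    poly = LinearRegression().fit(X_poly, y)
    print("poly   R^2:", poly.score(X_poly, y))   # much higher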
CORRECTNESS

• Data scientists know that when they build training sets, they need to
watch out for data leakage in order to ensure that a model is trained only
on the correct data.
• Data leakage occurs when models are trained on information that would not
actually have been available at prediction time in the real world.
• In time-series models, data leakage is typically caused by adding features
to your training set that occurred after the moment a given prediction
would have been made.
• When feature generation, predictions, and label generation occur at
different points in time, data leakage can easily be introduced into your
training sets.
CORRECTNESS

• Imagine you have an e-commerce website that makes product
recommendations. The features for this model might include:
• RFM metrics, such as the sum of products purchased by a user over the
last week, month, or year, recalculated every week.
• A summary of the items currently in a user's cart, updated in real time.
• If the real-time cart summary reflects items added after the moment of
prediction, it leaks future information into training; a time-aware split
(sketched below) avoids this.
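A hedged sketch of avoiding time-series leakage by splitting on time rather
than at random; the column names (event_time, purchases_last_week,
bought_product) and dates are illustrative assumptions, not from the source:

    # Time-aware train/test split to prevent leakage (pandas assumed).
    import pandas as pd

    df = pd.DataFrame({
        "event_time": pd.to_datetime(
            ["2023-01-02", "2023-01-09", "2023-01-16", "2023-01-23"]),
        "purchases_last_week": [3, 1, 4, 2],   # feature computed BEFORE event_time
        "bought_product": [0, 1, 0, 1],        # label observed AT event_time
    })

    cutoff = pd.Timestamp("2023-01-15")
    train = df[df["event_time"] <= cutoff]     # only features that already
    test = df[df["event_time"] > cutoff]       # existed by prediction time
    print(len(train), len(test))               # 2 2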
THE BIAS-VARIANCE TRADE-OFF
What is Bias?
• In general, a machine learning model analyses the data, finds patterns in
it, and makes predictions.
• While training, the model learns these patterns in the dataset and
applies them to test data for prediction.
• When making predictions, a difference occurs between the values predicted
by the model and the actual/expected values; this difference is known as
bias error, or error due to bias.
• Bias can be defined as the inability of a machine learning algorithm such
as Linear Regression to capture the true relationship between the data
points.
• Every algorithm begins with some amount of bias, because bias arises from
assumptions in the model that make the target function simpler to learn.
A model has either:

• Low bias: a low-bias model makes fewer assumptions about the form of the
target function.

• High bias: a high-bias model makes more assumptions, and so becomes
unable to capture the important features of our dataset. A high-bias model
also cannot perform well on new data.

• Generally, a linear algorithm has high bias, which makes it learn fast:
the simpler the algorithm, the more bias is likely to be introduced. A
nonlinear algorithm, by contrast, often has low bias.
THE BIAS-VARIANCE TRADE-OFF
• Bias is one type of error that occurs due to wrong assumptions about the
data, such as assuming the data is linear when in reality it follows a
complex function.
• Variance, on the other hand, is introduced by high sensitivity to
variations in the training data.
• Variance is also a type of error, since we want our model to be robust
against noise.
THE BIAS-VARIANCE TRADE-OFF
• Before coming to the mathematical definitions, we need to know about
random variables and expectations.
• Let's say f(x) is the true function that our data follows. We build
several models, each denoted \hat{f}(x).
• Each point of \hat{f}(x) is a random variable taking as many values as
there are models.
• To see how well the models approximate the true function f(x), we take
the expected value of \hat{f}(x), written E[\hat{f}(x)], and define:

Bias: f(x) - E[\hat{f}(x)]
Variance: E[\hat{f}(x)^2] - (E[\hat{f}(x)])^2 = E[(\hat{f}(x) - E[\hat{f}(x)])^2]

A short sketch estimating both quantities empirically follows.
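A hedged sketch of estimating bias and variance numerically: train many
models \hat{f} on resampled data and compare them to the true f at a single
point. The sine "true function" and the linear model are assumptions chosen
to make the bias visible:

    # Empirical bias/variance estimate at one point (NumPy assumed).
    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(x)            # the "true" function f(x)
    x0 = 1.0                           # point at which we measure bias/variance

    preds = []
    for _ in range(200):               # one f_hat per simulated training set
        X = rng.uniform(0, 3, 30)
        y = f(X) + rng.normal(0, 0.3, 30)
        coeffs = np.polyfit(X, y, 1)           # a deliberately simple model
        preds.append(np.polyval(coeffs, x0))

    preds = np.array(preds)
    print("bias    :", f(x0) - preds.mean())                # f - E[f_hat]
    print("variance:", ((preds - preds.mean()) ** 2).mean())  # E[(f_hat - E[f_hat])^2]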
THE BIAS-VARIANCE TRADE-OFF
• In Machine Learning, the performance and complexity of the model depend
not only on certain parameters, assumptions, and conditions,
• but also on the quality of the data used to train the model; that is why
cleaning and standardizing the data is a step everyone goes through.
• If the data is not cleaned and standardized, then no matter how finely
tuned the model's parameters and hyper-parameters are, the model will not
be able to provide the best solution.
SKEWNESS IN DATA
• In simple words, skewness is a measure of how much the probability
distribution of a random variable deviates from the normal distribution
(a probability distribution without any skewness).
SKEWNESS IN DATA
• If our data is positively skewed, it has a larger number of data points
with low values.
• So, when we train our model on this data, it will perform better at
predicting data points with lower values than at predicting those with
higher values (a short sketch measuring and reducing skewness follows).
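A hedged sketch of measuring skewness and reducing it with a log transform,
assuming SciPy; the lognormal sample is an illustrative stand-in for
positively skewed data:

    # Measuring and reducing positive skew (NumPy and SciPy assumed).
    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(0)
    data = rng.lognormal(mean=0, sigma=1, size=1000)  # positively skewed sample

    print("raw skewness:", skew(data))            # clearly > 0
    print("log skewness:", skew(np.log(data)))    # near 0 after a log transform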

• Bias-vs-Variance Trade-Off
• This is one of the most important concepts to understand for supervised
machine learning and predictive modelling use cases; the main goal is to
choose a model to train that offers the lowest combined bias and variance
error for the dataset or business use case.
Feature Extraction and Selection
FEATURE EXTRACTION

• Feature extraction is quite a complex concept concerning the translation
of raw data into the inputs that a particular Machine Learning algorithm
requires.
• Features must represent the information in the data in the format that
best fits the needs of the algorithm being used to solve the problem.
• Some of the most popular methods of feature extraction are:
• Bag-of-Words
• TF-IDF
FEATURE EXTRACTION
• Bag-of-Words: Bag-of-Words (BoW) is one of the most fundamental methods
of transforming tokens into a set of features.

• The BoW model is used in document classification, where each word is
used as a feature for training the classifier.

• For example, in review-based sentiment analysis, the presence of words
like 'fabulous' and 'excellent' indicates a positive review, while words
like 'annoying' and 'poor' point to a negative review.
FEATURE EXTRACTION
• There are three steps in creating a BoW model:

• 1. The first step is text pre-processing, which involves:
• converting the entire text into lower-case characters;
• removing all punctuation and unnecessary symbols.

• 2. The second step is to create a vocabulary of all the unique words in
the corpus. Suppose we have a set of movie reviews; consider three of
them:
• good movie
• not a good movie
• did not like

• 3. The third step is to represent each document as a vector of word
counts over this vocabulary (the sketch below does exactly this).
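A minimal Bag-of-Words sketch over the three reviews above, assuming
scikit-learn's CountVectorizer:

    # Bag-of-Words over the three example reviews (scikit-learn assumed).
    from sklearn.feature_extraction.text import CountVectorizer

    reviews = ["good movie", "not a good movie", "did not like"]

    # Lower-casing and punctuation removal (step 1) happen inside the
    # vectorizer; fitting builds the vocabulary of unique words (step 2).
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(reviews)

    print(vectorizer.get_feature_names_out())  # ['did' 'good' 'like' 'movie' 'not']
    print(X.toarray())                         # one count vector per review (step 3)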
FEATURE SELECTION

• A feature is an attribute that has an impact on the problem or is useful
for the problem; choosing the important features for the model is known
as feature selection.
• Every machine learning process depends on feature engineering, which
mainly consists of two processes: feature selection and feature
extraction.
• Feature selection is defined as "the process of automatically or manually
selecting the subset of the most appropriate and relevant features to be
used in model building." Feature selection is performed by either
including the important features or excluding the irrelevant features of
the dataset, without changing them.
NEED FOR FEATURE SELECTION

• We collect a huge amount of data to train our model and help it learn
better.
• Generally, a dataset consists of noisy data, irrelevant data, and some
portion of useful data.
• Moreover, a huge amount of data slows down the training process of the
model, and with noisy and irrelevant data the model may not predict or
perform well.
• Some benefits of using feature selection in machine learning:
• It helps in avoiding the curse of dimensionality.
• It helps simplify the model so that it can be more easily interpreted by
researchers.
• It reduces training time.
• It reduces overfitting and hence enhances generalization.
FEATURE SELECTION TECHNIQUES

• There are mainly two types of feature selection techniques:
• Supervised feature selection techniques consider the target variable and
can be used on labelled datasets.
• Unsupervised feature selection techniques ignore the target variable and
can be used on unlabelled datasets.
DECISION TREES

• Decision Tree is a supervised learning technique that can be used for
both classification and regression problems, but it is mostly preferred
for solving classification problems. It is a tree-structured classifier,
where internal nodes represent the features of a dataset, branches
represent the decision rules, and each leaf node represents the outcome.

• In a decision tree, there are two kinds of nodes: the decision node and
the leaf node. Decision nodes are used to make decisions and have multiple
branches, whereas leaf nodes are the outputs of those decisions and do not
contain any further branches.
• The decisions or tests are performed on the basis of the features of the
given dataset.

• It is a graphical representation for getting all the possible solutions
to a problem/decision based on given conditions.

• It is called a decision tree because, like a tree, it starts with a root
node, which expands into further branches to construct a tree-like
structure.
• To build the tree, we use the CART algorithm, which stands for
Classification And Regression Tree.
• A decision tree simply asks a question and, based on the answer (Yes/No),
further splits the tree into subtrees.
• The general structure of a decision tree (root node, decision nodes, leaf
nodes) is described by the terminology below.
Decision Tree Terminologies

• Root node: the node where the decision tree starts. It represents the
entire dataset, which then gets divided into two or more homogeneous sets.

• Leaf node: a final output node; the tree cannot be split further after a
leaf node.

• Splitting: the process of dividing the decision node/root node into
sub-nodes according to the given conditions.

• Branch/sub-tree: a tree formed by splitting the tree.

• Pruning: the process of removing unwanted branches from the tree.

• Parent/child node: the root node of the tree is called the parent node,
and the other nodes are called child nodes.
LINEAR REGRESSION
• Machine Learning is a branch of Artificial Intelligence that focuses on
the development of algorithms and statistical models that can learn from
and make predictions on data.

• Linear regression is a type of machine-learning algorithm, more
specifically a supervised machine-learning algorithm, that learns from
labelled datasets and maps the data points to the most optimized linear
function, which can then be used for prediction on new datasets.
Supervised learning has two types:

• Classification: predicts the class of the dataset based on the
independent input variables, where a class is a categorical or discrete
value, e.g. whether the image of an animal shows a cat or a dog.
• Regression: predicts continuous output variables based on the independent
input variables, e.g. predicting house prices from parameters such as
house age, distance from the main road, and so on.
What Is a Regression?
• Regression is a statistical method used in finance, investing, and other disciplines that
attempts to determine the strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of other variables (known as
independent variables).

• The general form of each type of regression model is:

Simple linear regression:

Y = a + bX + u

Multiple linear regression:

Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u

where:
Y = the dependent variable
X = the explanatory (independent) variable(s)
a = the y-intercept
b = the slope (beta coefficient) of the explanatory variable(s)
u = the regression residual, or error term
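A minimal sketch of estimating a (intercept) and b (slope) in Y = a + bX + u
with scikit-learn; the toy data, generated roughly as Y = 2X plus noise, is
an assumption for illustration:

    # Fitting a simple linear regression (NumPy and scikit-learn assumed).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1], [2], [3], [4], [5]])
    Y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])   # roughly Y = 0 + 2X + noise

    model = LinearRegression().fit(X, Y)
    print("a (intercept):", model.intercept_)       # close to 0
    print("b (slope)    :", model.coef_[0])         # close to 2
    print("prediction at X=6:", model.predict([[6]])[0])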
Applications of linear regression

• Market analysis.
• Financial analysis.
• Sports analysis.
• Environmental health.
• Medicine.
• Least squares.
• Predicting outcomes.
NAIVE BAYES
• The Naïve Bayes classifier is a supervised machine learning algorithm
used for classification tasks, such as text classification.

• It is also part of the family of generative learning algorithms, meaning
that it seeks to model the distribution of inputs for a given class or
category.

• Naïve Bayes is known as a probabilistic classifier because it is based on
Bayes' Theorem.
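For reference, Bayes' Theorem relates the posterior probability of a class y
given features x1, ..., xn to quantities that can be estimated from training
data; the "naïve" part is the added assumption that the features are
conditionally independent given the class:

P(y | x1, ..., xn) = P(y) * P(x1, ..., xn | y) / P(x1, ..., xn)
                   ≈ P(y) * P(x1 | y) * ... * P(xn | y) / P(x1, ..., xn)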
Advantages

• Less complex: compared to other classifiers, Naïve Bayes is considered a
simpler classifier since its parameters are easier to estimate. As a
result, it is one of the first algorithms taught in data science and
machine learning courses.

• Scales well: compared to logistic regression, Naïve Bayes is considered a
fast and efficient classifier that is fairly accurate when the conditional
independence assumption holds. It also has low storage requirements.

• Can handle high-dimensional data: use cases such as document
classification can have a very high number of dimensions, which can be
difficult for other classifiers to manage.
Disadvantages:

• Subject to zero frequency: zero frequency occurs when a categorical value
never appears in the training set, so the model assigns it zero
probability and wipes out the whole product of probabilities (smoothing
techniques such as Laplace smoothing address this).

• Unrealistic core assumption: while the conditional independence
assumption performs well overall, it does not always hold, leading to
incorrect classifications.
Applications:

• Spam filtering: spam classification is one of the most popular
applications of Naïve Bayes cited in the literature.

• Document classification: document and text classification go hand in
hand; another popular use case of Naïve Bayes is content classification,
e.g. the content categories of a news media website.

• Sentiment analysis: while this is another form of text classification,
sentiment analysis is commonly leveraged within marketing to better
understand and quantify opinions and attitudes around specific products
and brands.

• Mental state prediction: using MRI data, Naïve Bayes has been leveraged
to predict different cognitive states in humans; the goal of this research
was to assist in better understanding hidden cognitive states,
particularly among brain-injury patients.
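A hedged Naïve Bayes text-classification sketch in the spirit of the
spam-filtering application above, assuming scikit-learn; the tiny corpus and
labels are invented for illustration:

    # Naive Bayes spam classification sketch (scikit-learn assumed).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    texts = ["win money now", "limited offer win prize",
             "meeting at noon", "project report attached"]
    labels = ["spam", "spam", "ham", "ham"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)

    # alpha=1.0 is Laplace smoothing, which guards against the
    # zero-frequency problem mentioned under the disadvantages above.
    clf = MultinomialNB(alpha=1.0).fit(X, labels)
    print(clf.predict(vectorizer.transform(["win a prize now"])))  # ['spam']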
