ML Unit-1 (CEC)

Machine learning is a branch of Artificial Intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.

Machine learning uses various algorithms to build mathematical and statistical models and to make predictions using historical data or information. Currently, it is used for various tasks such as image recognition, speech recognition, email filtering, Facebook auto-tagging, recommender systems, and many more.
 “The function of a machine learning system can be descriptive, meaning that the system uses the data to explain what happened; predictive, meaning the system uses the data to predict what will happen; or prescriptive, meaning the system will use the data to make suggestions about what action to take.”
Features of Machine Learning:
 Machine learning uses data to detect various patterns in a
given dataset.
 It can learn from past data and improve automatically.

 It is a data-driven technology.

 Machine learning is similar to data mining, as both deal with huge amounts of data.
Importance of Machine Learning: Machine learning can be easily understood through its use cases. Currently, machine learning is used in:
 Self-driving cars
 Cyber fraud detection
 Face recognition
 Facebook
 Netflix
 Recommender systems
 Handling the rapid increase in the production of data
 Solving complex problems that are difficult for a human
 Decision making in various sectors, including finance
 Finding hidden patterns and extracting useful information from data.
Classification of Machine Learning
At a broad level, machine learning can be classified into three
types:

 Supervised learning
 Unsupervised learning
 Reinforcement learning
 Supervised learning is a type of machine learning method in
which we provide sample labeled data to the machine learning
system in order to train it, and on that basis, it predicts the
output.
 The system creates a model using labeled data to understand the dataset and learn about each example. Once training and processing are done, we test the model by providing sample data to check whether it predicts the correct output.
 The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, just as a student learns under the supervision of a teacher. An example of supervised learning is spam filtering.
 Supervised learning can be grouped further in two categories
of algorithms:

 Classification
 Regression
How does supervised machine learning work?
Supervised learning algorithms are good for the
following tasks:
 Binary classification: Dividing data into two
categories.
 Multi-class classification: Choosing among more than two classes.
 Regression modeling: Predicting continuous values.
 Ensembling: Combining the predictions of multiple machine learning models to produce a more accurate prediction.
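To make the supervised workflow concrete, here is a minimal, illustrative sketch in Python (assuming scikit-learn is installed; the dataset is synthetic and the parameter choices are arbitrary), showing a binary classifier trained on labeled data and checked on held-out examples:

# Minimal supervised-learning sketch (assumes scikit-learn is installed).
# A synthetic labeled dataset is split into train/test parts, a logistic
# regression classifier is fitted on the labeled training data, and its
# predictions are checked against the held-out labels.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)          # learn from labeled examples
y_pred = model.predict(X_test)       # predict labels for unseen inputs
print("test accuracy:", accuracy_score(y_test, y_pred))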
Unsupervised learning is a learning method in which a machine
learns without any supervision.

 The training is provided to the machine with a set of data that has not been labeled, classified, or categorized, and the algorithm needs to act on that data without any supervision. The goal of unsupervised learning is to restructure the input data into new features or a group of objects with similar patterns.
 In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms:
 Clustering
 Association
 How does Unsupervised Machine Learning work?
Unsupervised learning algorithms are good for the following
tasks:
 Clustering: Splitting the dataset into groups based on
similarity.

 Anomaly detection: Identifying unusual data points in a data set.
 Association mining: Identifying sets of items in a data set that frequently occur together.
 Dimensionality reduction: Reducing the number of variables in a data set.
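As a small illustration of clustering, the following sketch (assuming scikit-learn is available; the blob data and the choice of 3 clusters are made up for the example) groups unlabeled points purely by similarity:

# Minimal unsupervised-learning sketch (assumes scikit-learn is installed).
# KMeans groups unlabeled points into clusters purely from similarity;
# no target labels are given to the algorithm.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # labels are ignored
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments:", kmeans.labels_[:10])
print("cluster centres:", kmeans.cluster_centers_)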
 Reinforcement learning is a feedback-based learning method in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of the agent is to collect the most reward points, and in doing so it improves its performance.
 A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
Reinforcement learning is often used in areas such as:

 Robotics: Robots can learn to perform tasks in the physical world using this technique.
 Video game play: Reinforcement learning has been used to teach bots to play a number of video games.
 Resource management: Given finite resources and a defined goal, reinforcement learning can help enterprises plan how to allocate resources.
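The reward/penalty loop can be illustrated with a tiny tabular Q-learning sketch in plain Python. The 5-cell corridor environment below is hypothetical, not a standard library environment; it only shows how reward feedback updates the agent's action values:

# Toy reinforcement-learning sketch: tabular Q-learning on a hypothetical
# 5-cell corridor. The agent starts at cell 0, can move left or right, and
# earns a reward of +1 only when it reaches the rightmost cell.
import random

n_states, n_actions = 5, 2             # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration
Q = [[0.0] * n_actions for _ in range(n_states)]

for episode in range(200):
    s = 0
    while s != n_states - 1:
        if random.random() < epsilon:
            a = random.randrange(n_actions)                   # explore
        else:
            a = max(range(n_actions), key=lambda x: Q[s][x])  # exploit best known action
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # reward feedback updates the action-value estimate
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print("learned Q-values (one row per state):")
for row in Q:
    print([round(v, 2) for v in row])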

 Regression and Classification algorithms are Supervised Learning algorithms. Both are used for prediction in machine learning and work with labeled datasets, but they differ in how they are applied to different machine learning problems.
 The main difference between Regression and Classification algorithms is that Regression algorithms are used to predict continuous values such as price, salary, age, etc., while Classification algorithms are used to predict/classify discrete values such as Male or Female, True or False, Spam or Not Spam, etc.
 Classification is a process of finding a function which
helps in dividing the dataset into classes based on
different parameters.

 In Classification, a computer program is trained on the training dataset and, based on that training, it categorizes the data into different classes. The task of the classification algorithm is to find the mapping function that maps the input (x) to the discrete output (y).
 Example: The best example to understand the
Classification problem is Email Spam Detection. The
model is trained on the basis of millions of emails on
different parameters, and whenever it receives a new
email, it identifies whether the email is spam or not.
If the email is spam, then it is moved to the Spam
folder.
 Types of ML Classification Algorithms:
Classification Algorithms can be further divided
into the following types:
 Logistic Regression
 K-Nearest Neighbours (KNN)
 Support Vector Machines
 Kernel SVM
 Naive Bayes
 Decision Tree Classification
 Random Forest Classification
 Regression:

 Regression is a process of finding the correlations between dependent and independent variables. It helps in predicting continuous variables, such as market trends or house prices.
 The task of the Regression algorithm is to find the mapping function that maps the input variable (x) to the continuous output variable (y).
 Example: Suppose we want to do weather
forecasting, so for this, we will use the Regression
algorithm. In weather prediction, the model is trained
on the past data, and once the training is completed, it
can easily predict the weather for future days.
 Types of Regression Algorithm
 Simple Linear Regression
 Multiple Linear Regression
 Polynomial Regression
 Support Vector Regression (SVR)
 Decision Tree Regression
 Random Forest Regression
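As a small illustration of regression, the sketch below (assuming scikit-learn and NumPy are installed; the data are synthetic with a known slope and intercept) fits a simple linear regression and predicts a continuous value:

# Minimal regression sketch: a simple linear regression maps a continuous
# input x to a continuous output y. The data here are purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))                # single input feature
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, 100)    # y is roughly 3x + 5 plus noise

reg = LinearRegression().fit(X, y)
print("learned slope:", reg.coef_[0], "intercept:", reg.intercept_)
print("prediction for x=7:", reg.predict([[7.0]])[0])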
 Machine learning gives computer systems the ability to learn automatically without being explicitly programmed. But how does a machine learning system work? It can be described using the machine learning process, a life-cycle process for building an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem or project.
 The machine learning process involves seven major steps, which are given below:
 Gathering Data
 Data preparation
 Data Wrangling/Data Preprocessing
 Analyse Data
 Train the Model
 Test the Model
 Deployment
 The most important thing in the complete process is to understand the problem and to know its purpose. Therefore, before starting the process, we need to understand the problem, because a good result depends on a good understanding of the problem.
 In the complete life-cycle process, to solve a problem we create a machine learning system called a "model", and this model is created by providing "training". But to train a model we need data; hence, the life cycle starts with collecting data.
 Data gathering is the first step of the machine learning process. The goal of this step is to identify and obtain all the data related to the problem.
 In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the process. The quantity and quality of the collected data determine the efficiency of the output: the more data we have, the more accurate the prediction will be.
 This step includes the tasks below:
 Identify various data sources
 Collect data
 Integrate the data obtained from different sources
 By performing the above tasks, we get a coherent set of data, also called a dataset, which will be used in further steps.

 After collecting the data, we need to prepare it for further steps. Data preparation is the step where we put our data into a suitable place and prepare it for use in machine learning training.
Data Exploration:

It is used to understand the nature of the data we have to work with. We need to understand the characteristics, format, and quality of the data. A better understanding of the data leads to an effective outcome. In this step, we find correlations, general trends, and outliers.
 Data wrangling is the process of cleaning and converting raw data into a usable format. It is the process of cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process. Cleaning of data is required to address quality issues.
 Collected data may have various issues, including:

 Missing Values
 Duplicate data
 Invalid data
 Noise
 So, we use various filtering techniques to clean the data, as sketched below.
 It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.
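A minimal wrangling sketch (assuming pandas is installed; the tiny DataFrame is hypothetical) showing how missing values, duplicate data, and invalid data might be handled:

# Data-wrangling sketch: drop duplicates, remove clearly invalid rows,
# and fill missing values. The values below are made up for illustration.
import pandas as pd

raw = pd.DataFrame({
    "age":    [25, 25, None, 40, -3],        # None = missing, -3 = invalid
    "salary": [50000, 50000, 62000, None, 45000],
})

clean = raw.drop_duplicates()                              # duplicate data
clean = clean[clean["age"].isna() | (clean["age"] > 0)].copy()  # invalid data
clean["age"] = clean["age"].fillna(clean["age"].median())        # missing values
clean["salary"] = clean["salary"].fillna(clean["salary"].median())
print(clean)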
 Now the cleaned and prepared data is passed on to the analysis step. This step involves:
 Selection of analytical techniques
 Building models
 Reviewing the results
 The aim of this step is to build a machine learning model to analyze the data using various analytical techniques and to review the outcome. It starts with determining the type of problem, where we select a machine learning technique such as Classification, Regression, Cluster analysis, or Association; we then build the model using the prepared data and evaluate it.

 Hence, in this step, we take the data and use machine learning algorithms to build the model.
 The next step is to train the model. In this step, we train our model to improve its performance and obtain a better outcome for the problem.
 We use datasets to train the model with various machine learning algorithms. Training a model is required so that it can learn the various patterns, rules, and features.
 Once our machine learning model has been trained on
a given dataset, then we test the model. In this step,
we check for the accuracy of our model by providing
a test dataset to it.

 Testing the model determines the percentage accuracy of the model as per the requirement of the project or problem.
 If the above-prepared model is producing an accurate
result as per our requirement with acceptable speed,
then we deploy the model in the real system. But
before deploying the project, we will check whether
it is improving its performance using available data
or not. The deployment phase is similar to making the
final report for a project.
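One common way to move a tested model into a real system is to serialise it and load it inside the serving application. The sketch below is only illustrative (it assumes scikit-learn and joblib are installed, uses the built-in Iris data, and the file name is an arbitrary choice):

# Deployment-flavoured sketch: persist a fitted model, then reload it the
# way a deployed service would before serving predictions.
import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

joblib.dump(model, "model.joblib")       # save the trained model to disk
served = joblib.load("model.joblib")     # what the live system would load
print(served.predict(X[:3]))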
 Weights and biases (commonly referred to as w and b) are the learnable parameters of some machine learning models.
 Inputs: Inputs are the set of values for which we need to predict an output value. They can be viewed as features or attributes in a dataset.
 Weights: Weights are real values attached to each input/feature; they convey the importance of the corresponding attribute in predicting the final output.
 Bias: Bias is used to shift the activation function towards the left or right; you can compare it to the y-intercept in the equation of a line.
 Summation function: The job of the summation function is to bind the weights and inputs together and calculate their sum.

 Activation Function: It is used to introduce non-linearity in the model.
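Putting these pieces together, a single artificial neuron can be sketched in a few lines of plain Python; the input values, weights, and bias below are made-up numbers chosen only to show the summation and activation steps:

# Sketch of a single artificial neuron (no libraries beyond math needed):
# the summation function combines inputs, weights, and bias, and a
# non-linear activation (here a sigmoid) produces the output.
import math

def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias   # summation function
    return 1.0 / (1.0 + math.exp(-z))                        # sigmoid activation

inputs  = [0.5, 1.5, -2.0]     # feature values
weights = [0.8, -0.4, 0.3]     # importance of each feature
bias    = 0.1                  # shifts the activation left or right
print("neuron output:", neuron(inputs, weights, bias))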



 Overfitting and Underfitting are the two main
problems that occur in machine learning and degrade
the performance of the machine learning models.

 The main goal of each machine learning model is to generalize well. Here, generalization means the ability of an ML model to provide a suitable output when adapting to a given set of unknown inputs. It means that after being trained on the dataset, the model can produce reliable and accurate output.
 Hence, underfitting and overfitting are two terms that need to be checked to assess the performance of the model and whether it is generalizing well or not.
Before understanding overfitting and underfitting, let's understand some basic terms that will help to understand this topic well:
 Noise: Noise is unnecessary and irrelevant data that
reduces the performance of the model.
 Bias: Bias is a prediction error that is introduced in
the model due to oversimplifying the machine
learning algorithms. Or it is the difference between
the predicted values and the actual values.
 Variance: If the machine learning model performs
well with the training dataset, but does not perform
well with the test dataset, then variance occurs.
Overfitting
 Overfitting occurs when our machine learning model tries to cover all the data points, or more than the required data points, present in the given dataset. Because of this, the model starts capturing noise and inaccurate values present in the dataset, and all these factors reduce the efficiency and accuracy of the model. The overfitted model has low bias and high variance.
 The chances of overfitting increase the more training we give our model: the more we train the model, the greater the chance of producing an overfitted model.
 Overfitting is the main problem that occurs in supervised learning.
 Both overfitting and underfitting degrade the performance of the machine learning model, but overfitting is the more common cause, so there are some ways by which we can reduce its occurrence in our model:
 Cross-Validation
 Training with more data
 Removing Noise
 Regularization
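As an illustration of two of these remedies, the sketch below (assuming scikit-learn is installed; the synthetic data and the alpha value are arbitrary) combines L2 regularization (Ridge) with 5-fold cross-validation:

# Ridge adds an L2 regularization penalty, and cross_val_score estimates
# performance with cross-validation instead of a single train/test split.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=30, noise=10.0, random_state=0)
model = Ridge(alpha=1.0)                      # alpha controls regularization strength
scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation (R^2 scores)
print("mean CV score:", scores.mean())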
 Underfitting: A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data, i.e., it performs poorly even on the training data and therefore also on new data. Underfitting destroys the accuracy of our machine learning model. Its occurrence simply means that our model or algorithm does not fit the data well enough. Underfitting can be avoided by using a more expressive model and more informative features, as listed in the techniques below.
Reasons for Underfitting:
 High bias and low variance
 The size of the training dataset used is not enough.
 The model is too simple.
 Training data is not clean and contains noise.
Techniques to reduce underfitting:
 Increase model complexity
 Increase the number of features
 Remove noise from the data.
 Increase the duration of training to get better results.
 Curse of Dimensionality refers to a set of problems that
arise when working with high-dimensional data. The
dimension of a dataset corresponds to the number of
attributes/features that exist in a dataset. A dataset with a
large number of attributes, generally of the order of a
hundred or more, is referred to as high dimensional data.
Some of the difficulties that come with high dimensional
data manifest during analyzing or visualizing the data to
identify patterns, and some manifest while training machine
learning models. The difficulties related to training machine
learning models due to high dimensional data are referred to
as the ‘Curse of Dimensionality’.
 The curse of dimensionality applies to our machine learning algorithms because, as the number of input dimensions gets larger, we will need more data to enable the algorithm to generalise sufficiently well: as the dimensionality grows, so will the number of data points we need. For this reason, we often have to be careful about what information we give to the algorithm, meaning that we need to understand something about the data in advance.
 As the dimensionality increases, the number of data
points required for good performance of any machine
learning algorithm increases exponentially.
 The curse of dimensionality basically means that the
error increases with the increase in the number of
features.
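One way to see this effect numerically is the following NumPy sketch (the sample sizes and dimensions are arbitrary): as the dimension grows, the nearest and farthest pairwise distances in a random sample become almost equal, which makes distance-based learning harder.

# Curse-of-dimensionality illustration: the ratio between the smallest and
# largest pairwise distances approaches 1 as the dimension d increases.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((100, d))                      # 100 random points in d dimensions
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # all pairwise Euclidean distances
    dist = dist[np.triu_indices(100, k=1)]        # keep each pair once
    print(f"d={d:5d}  min/max distance ratio = {dist.min() / dist.max():.3f}")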
 The validation set is used to evaluate a given model, but this is done frequently during development.
 A validation dataset is a sample of data held back from training your model that is used to estimate model skill while tuning the model's hyperparameters (to maximize the model's performance by minimizing a predefined loss function, producing better results with fewer errors).
 Training sets
 Initially, the development method involves initial inputs within specified project parameters. The process also requires expert setting of the weightings between the various connections of so-called neurons within the ML model or estimator.
 After the introduction of this first dataset, developers compare the resulting output to target answers. Next, they adjust the model's parameters, weighting, and functionality as needed.
 More than one epoch or iteration of this adjustment loop
is often necessary. The goal is to achieve a trained or
fitted model that relates to and corresponds with the
expected range of new, unknown data.
 Validation sets
 The next stage involves using a validation dataset to estimate the
accuracy of the ML model concerned. During this phase, developers
ensure that new data classification is precise and results are
predictable.

 Validation datasets comprise unbiased inputs and expected results designed to check the function and performance of the model. Different methods of cross-validation (CV) exist, though all aim to ensure stability by estimating how a predictive model will perform. An example is the usage of rotation estimation or out-of-sample testing to assure reasonable precision.

 Validation and fine-tuning involve various iterations. Whatever the methodology, these verification techniques aim to assess the results and check them against independent inputs. It is also possible to adjust the hyperparameters, i.e., the values used to control the overall process.

 Test sets
 The final step is to use a test set to verify the model's
functionality. Some publications refer to the validation
dataset as a test set, especially if there are only two
subsets instead of three. Similarly, if records in this final
test set have not formed part of a previous evaluation or
cross-validation, they might also constitute a holdout set.

 Test samples provide a simulated real-world check using unseen inputs and expected results. In practice, there could be some overlap between validation and testing. Each procedure shows that the ML model will function in a live environment once out of testing.
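A hedged sketch of the three-subset workflow (assuming scikit-learn is installed; the dataset, the candidate C values, and the split sizes are arbitrary choices): the validation set is used to pick a hyperparameter, and the untouched test set gives the final check.

# Train / validation / test split: tune a hyperparameter on the validation
# set, then report a final score on the held-out test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_C, best_score = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):                                   # hyperparameter tuning
    clf = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    clf.fit(X_train, y_train)
    score = clf.score(X_val, y_val)                                # evaluate on validation set
    if score > best_score:
        best_C, best_score = C, score

final = make_pipeline(StandardScaler(), LogisticRegression(C=best_C, max_iter=1000))
final.fit(X_train, y_train)
print("chosen C:", best_C, " test accuracy:", final.score(X_test, y_test))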
 A confusion matrix presents a table layout of the predicted and actual outcomes of a classification problem and helps visualize them. It tabulates all the predicted and actual values of a classifier.
 A confusion matrix is a tabular summary of the
number of correct and incorrect predictions made
by a classifier. It is used to measure the performance
of a classification model. It can be used to evaluate
the performance of a classification model through the
calculation of performance metrics like accuracy,
precision, recall, and F1-score.
 The confusion matrix is a matrix used to determine
the performance of the classification models for a
given set of test data. It can only be determined if the
true values for test data are known. The matrix itself
can be easily understood, but the related
terminologies may be confusing. Some features of
Confusion matrix are given below:
 For a classifier with 2 prediction classes, the matrix is a 2x2 table; for 3 classes, it is a 3x3 table, and so on.
 The matrix is divided into two dimensions, that
are predicted values and actual values along with
the total number of predictions.
 Predicted values are those values, which are predicted
by the model, and actual values are the true values for
the given observations.

 A good model is one which has high TP and TN rates and low FP and FN rates.

 If you have an imbalanced dataset to work with, it’s always better to use the confusion matrix as the evaluation criterion for your machine learning model.
 True Positive: The number of times our actual positive
values are equal to the predicted positive. You predicted
a positive value, and it is correct.
 False Positive: The number of times our model wrongly predicts a positive value when the actual value is negative. You predicted a positive value, and it is actually negative.
 True Negative: The number of times our actual
negative values are equal to predicted negative values.
You predicted a negative value, and it is actually
negative.
 False Negative: The number of times our model wrongly predicts a negative value when the actual value is positive. You predicted a negative value, and it is actually positive.
Sensitivity (recall) tells us what proportion of the positive class got correctly classified.
 Misclassification rate: It is also termed the error rate, and it describes how often the model gives wrong predictions. The error rate is calculated as the ratio of the number of incorrect predictions to the total number of predictions made by the classifier. The formula is given below:
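Error rate = (FP + FN) / (TP + TN + FP + FN), i.e. 1 - Accuracy. The sketch below (assuming scikit-learn is installed; the two label vectors are made-up examples) derives TP, TN, FP, FN from a confusion matrix and computes the metrics discussed next:

# Confusion-matrix sketch: extract TP, TN, FP, FN and compute the usual metrics.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # actual values
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # predicted values

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("accuracy  :", accuracy_score(y_true, y_pred))
print("precision :", precision_score(y_true, y_pred))
print("recall    :", recall_score(y_true, y_pred))
print("F1-score  :", f1_score(y_true, y_pred))
print("error rate:", (fp + fn) / (tp + tn + fp + fn))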
When to use Accuracy / Precision / Recall / F1-Score?
 a. Accuracy is used when the True Positives and True
Negatives are more important. Accuracy is a better
metric for Balanced Data.
 b. Whenever False Positive is much more important
use Precision.
 c. Whenever False Negative is much more important
use Recall.
 d. F1-Score is used when the False Negatives and
False Positives are important. F1-Score is a better
metric for Imbalanced Data.
 ROC or Receiver Operating Characteristic curve
represents a probability graph to show the
performance of a classification model at different
threshold levels. The curve is plotted between two
parameters, which are:
 True Positive Rate or TPR
 False Positive Rate or FPR
 In the curve, TPR is plotted on Y-axis, whereas FPR
is on the X-axis.
 AUC stands for Area Under the ROC Curve. As its name suggests, AUC measures the two-dimensional area under the entire ROC curve, ranging from (0,0) to (1,1).
 In the ROC curve, AUC computes the performance of
the binary classifier across different thresholds and
provides an aggregate measure. The value of AUC
ranges from 0 to 1, which means an excellent model
will have AUC near 1, and hence it will show a good
measure of Separability.
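A short sketch of how the curve is obtained in practice (assuming scikit-learn and matplotlib are installed; the synthetic dataset and the choice of logistic regression are arbitrary): roc_curve sweeps the decision threshold to produce FPR/TPR pairs, and roc_auc_score aggregates them into a single value.

# ROC/AUC sketch: plot TPR against FPR across thresholds and report the AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, probs)   # one (FPR, TPR) pair per threshold
print("AUC:", roc_auc_score(y_te, probs))
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()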
 Classification of 3D models
The curve can be used to classify 3D models and separate them from normal models. With a specified threshold level, the curve separates the 3D models from the non-3D ones.

 Healthcare
The curve has various applications in the healthcare sector. It can be
used to detect cancer disease in patients. It does this by using false
positive and false negative rates, and accuracy depends on the
threshold value used for the curve.

 Binary Classification
AUC-ROC curve is mainly used for binary classification problems
to evaluate their performance.
 While making predictions, a difference occurs
between prediction values made by the model and
actual values/expected values, and this difference is
known as bias errors or Errors due to bias. It can be
defined as an inability of machine learning algorithms
such as Linear Regression to capture the true
relationship between the data points. Each algorithm
begins with some amount of bias because bias occurs
from assumptions in the model, which makes the target
function simple to learn. A model has either:
 Low Bias: A low bias model will make fewer
assumptions about the form of the target function.
 High Bias: A model with a high bias makes more
assumptions, and the model becomes unable to
capture the important features of our dataset. A high
bias model also cannot perform well on new data.
 Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours and Support Vector Machines. On the other hand, algorithms with high bias include Linear Regression, Linear Discriminant Analysis and Logistic Regression.
 Variance specifies the amount of variation in the prediction if different training data were used. In simple words, variance tells how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at understanding the hidden mapping between input and output variables. Variance errors are either low variance or high variance.
 Low variance means there is a small variation in the
prediction of the target function with changes in the
training data set.
 High variance shows a large variation in the
prediction of the target function with changes in the
training dataset.
 Low-Bias, Low-Variance:
The combination of low bias and low variance shows
an ideal machine learning model.

 Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns with a large number of parameters and hence leads to overfitting.
 High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well from the training dataset or uses a small number of parameters. It leads to underfitting problems in the model.

 High-Bias, High-Variance:
With high bias and high variance, predictions are
inconsistent and also inaccurate on average.
 While building the machine learning model, it is
really important to take care of bias and variance in
order to avoid overfitting and underfitting in the
model. If the model is very simple with fewer
parameters, it may have low variance and high bias.
Whereas, if the model has a large number of
parameters, it will have high variance and low bias.
So, it is required to make a balance between bias and
variance errors, and this balance between the bias
error and variance error is known as the Bias-
Variance trade-off.
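The trade-off can be made visible with a small experiment (assuming scikit-learn and NumPy are installed; the sine-shaped data and the polynomial degrees are arbitrary choices): a very low-degree model underfits (high bias), a very high-degree model overfits (high variance), and an intermediate degree balances both.

# Bias-variance illustration: compare train and test error for polynomial
# models of increasing complexity on noisy sine-shaped data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(f"degree {degree:2d}: "
          f"train MSE = {mean_squared_error(y_tr, model.predict(X_tr)):.3f}, "
          f"test MSE = {mean_squared_error(y_te, model.predict(X_te)):.3f}")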
