ML Unit 2
Introduction….
● This structured representation of raw input data as a meaningful pattern is called
a model.
● The process of assigning and fitting a specific model to a data set is called model
training. Once the model is trained, the raw input data is summarized into an
abstracted form.
● Generalization is a term used to describe a model’s ability to react to new data.
That is, after being trained on a training set, a model can digest new data and make
accurate predictions.
● If a model has been trained too well on the training data, it will be unable to
generalize: it makes accurate predictions for the training data but inaccurate
predictions when given new data. This is called overfitting.
● Underfitting happens when a model has not been trained enough on the data; it is
not capable of making accurate predictions, even on the training data.
● If the outcome is systematically incorrect, the learning is said to have a bias.
Categories of Machine Learning Approaches
● Three broad categories of machine learning approaches used for resolving different types
of problems
1. Supervised
1. Classification
2. Regression
2. Unsupervised
1. Clustering
2. Association analysis
3. Reinforcement
1. Active
2. Passive
● For each of the cases, the model that has to be created/trained is different.
● Multiple factors play a role when we select a model for solving a machine
learning problem.
Predictive models
Classification models: Models used for the prediction of target features with
categorical values are known as classification models.
• Some of the popular classification models include
1. k-Nearest Neighbor (KNN),
2. Naive Bayes, and
3. Decision Tree.
Regression models: Models used for the prediction of the numerical value of the target
feature of a data instance are known as regression models.
• Some of the popular regression models include
1. Linear Regression and
2. Logistic Regression
Descriptive models
• Models for unsupervised learning or descriptive models are used to describe a data set or
gain insight from a data set.
• There is no target feature or single feature of interest in case of unsupervised learning.
• Based on the value of all features, interesting patterns or insights are derived about the
data set.
• Descriptive models which group together similar data instances, i.e. data instances
having similar values of the different features, are called clustering models.
• Examples of clustering include
1. Customer grouping or segmentation based on social, demographic, ethnic,
etc. factors
2. Grouping of music based on different aspects like genre, language, time
period, etc.
• The most popular model for clustering is k-Means.
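A minimal k-Means sketch, assuming scikit-learn and a synthetic data set (not from the original notes):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_[:10])      # cluster assignment of the first 10 instances
print(kmeans.cluster_centers_)  # coordinates of the 3 cluster centres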
Market Basket Analysis
• Descriptive models related to pattern discovery are used for market basket analysis
of transactional data.
• In market basket analysis, based on the purchase patterns available in the
transactional data, the possibility of purchasing one product given the purchase of
another product is determined.
• For example, transactional data may reveal a pattern that generally a customer who
purchases milk also purchases biscuit at the same time.
• This can be useful for targeted promotions or in-store setup.
• Promotions related to biscuits can be sent to customers of milk products, or vice
versa.
• Also, in the store, products related to milk can be placed close to biscuits.
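As a toy illustration (the transactions below are made up), the confidence of the rule "milk → biscuit" can be computed directly from transactional data:

# hypothetical transactional data: each set is one customer's basket
transactions = [
    {"milk", "biscuit", "bread"},
    {"milk", "biscuit"},
    {"milk", "eggs"},
    {"biscuit", "butter"},
]

milk_baskets = [t for t in transactions if "milk" in t]
both = [t for t in milk_baskets if "biscuit" in t]

# confidence of the rule milk -> biscuit:
# the fraction of milk purchases that also contain biscuits
print(len(both) / len(milk_baskets))  # 2/3, i.e. about 0.67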
Training a Model (for Supervised Learning)
• Holdout method
• Cross-validation methods
• Bootstrap sampling
• Lazy vs. Eager learner
Holdout method
• In supervised learning, a model is trained using the labelled input data.
• The test data may not be available immediately; also, the label value of the test
data is not known.
• That is the reason why a part of the input data is held back (the holdout) for
evaluation of the model.
• This subset of the input data is used as the test data for evaluating the
performance of the trained model.
• In general, 70% to 80% of the labelled input data is used for model training.
• Once the model is trained using the training data, the labels of the test data are
predicted using the model's target function.
• Then the predicted value is compared with the actual value of the label.
• The validation data is used for measuring the model performance. It is used in
iterations, to refine the model in each iteration.
• If the volume of input data is huge, then stratified random sampling is employed
for test data selection:
• the whole data is broken into several homogeneous groups
• a random sample is selected from each group.
• This ensures that the generated random partitions have equal proportions of each
class.
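A minimal sketch of a stratified holdout split, assuming scikit-learn and a synthetic labelled data set:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,    # hold back 30% of the labelled data as test data
    stratify=y,       # keep equal class proportions in both partitions
    random_state=42,  # make the random split reproducible
)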
• The issues in the random sampling approach, in the holdout method:
1. With smaller data sets, it is difficult to divide the data of some of the classes
proportionally amongst training and test data sets.
2. A repeated holdout is sometimes used to ensure the randomness of the composed data
sets.
• Several random holdouts are used to measure the model performance.
• In the end, the average of all performances is taken.
• As multiple holdouts have been drawn, the training and test data (and validation data)
contain representative data from all classes and resemble the original input data closely.
• This process of repeated holdout is the basis of the k-fold cross-validation technique.
k-fold cross-validation
• In k-fold cross-validation, the data set is divided into k completely distinct or
non-overlapping random partitions called folds.
• The value of 'k' in k-fold cross-validation can
be set to any number.
• There are two approaches which are extremely popular:
• 1. 10-fold cross-validation (10-fold CV)
• 2. Leave-one-out cross-validation (LOOCV)
10-fold cross-validation
• 10-fold cross-validation is by far the most popular approach.
• For each of the 10 folds, each comprising approximately 10% of the data, one of
the folds is used as the test data for validating the performance of a model trained
on the remaining 9 folds (or 90% of the data).
• This is repeated 10 times, once for each of the 10 folds being used as the test data
and the remaining folds as the training data.
• The average performance across all folds is reported.
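A minimal 10-fold cross-validation sketch, assuming scikit-learn and synthetic data:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)

# cv=10: each of the 10 folds serves once as the test data
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=10)
print(scores.mean())  # average performance across all 10 folds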
• In the figure, each circle represents a record in the input data set, while the
different colours indicate the different classes that the records belong to.
• The entire data set is broken into 'k' folds, out of which one fold is selected in
each iteration as the test data set.
• The fold selected as the test data set in each of the 'k' iterations is different.
• The contiguous circles represented as folds do not mean that they are subsequent
records in the data set; the records in a fold are drawn by using a random sampling
technique.
Lazy learning
• Lazy learning completely skips the abstraction and generalization processes; in
other words, a lazy learner doesn't 'learn' anything.
• It uses the training data as-is and uses that knowledge to classify the unlabelled
test data.
• it is also known as rote learning (i.e. memorization technique based on repetition).
• Due to its heavy dependency on the given training data instances, it is also known
as instance-based learning or non-parametric learning.
• Lazy learners take very little time in training because not much training
actually happens.
• However, they take a long time in classification because, for each record of the
test data, a comparison-based assignment of the label against the stored training
instances happens.
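k-Nearest Neighbour is the classic lazy learner. A minimal sketch, assuming scikit-learn and a synthetic data set:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=150, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)         # 'training' merely stores the instances
print(knn.score(X_test, y_test))  # the comparison work happens here, at prediction time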
Model Representation and Interpretability
● The main goal of each machine learning model is to generalize well.
● Generalization defines the ability of an ML model to provide a suitable output by
adapting to a given set of unknown inputs.
● It means that after being trained on the dataset, the model can produce reliable and
accurate output.
● Underfitting and overfitting are the two conditions that need to be checked to judge
the performance of the model and whether the model is generalizing well or not.
● Bias: the difference that occurs between the prediction values made by the model and
the actual/expected values is known as bias error, or error due to bias
(it corresponds to the error rate on the training data).
If the error rate has a high value, we call it high bias.
If the error rate has a low value, we call it low bias.
● Variance: the difference between the error rate of the training data and that of the
testing data is called variance.
If the difference of errors is high, it is called high variance.
If the difference of errors is low, it is called low variance.
Overfitting
• Overfitting occurs when our machine learning model tries to fit more than the
required patterns present in the given dataset.
• Because of this, the model starts capturing noise and inaccurate values present
in the dataset, and all these factors reduce the efficiency and accuracy of the
model.
• The overfitted model has low bias and high variance.
• Overfitting is the main problem that occurs in supervised learning.
• How to avoid overfitting in a model:
1. Early stopping of training
2. Using re-sampling techniques like cross-validation
3. Holding back a validation data set
4. Removing features
Underfitting & Overfitting
Underfitting
● Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data
● In the case of underfitting, the model is not able to learn enough from the training
data, and hence it reduces the accuracy and produces unreliable predictions.
● An underfitted model has high bias and low variance.
● How to avoid underfitting:
1. Increasing the training time of a model
2. Increasing the number of features
Bias – variance trade-off
• In supervised learning, the class value assigned by the learning model built
based on the training data may differ from the actual class value.
• This error in learning can be of two types:
1. errors due to 'bias' and
2. error due to 'variance’.
• Errors due to bias arise due to underfitting of the model. Underfitting
results in high bias.
• Errors due to variance occur from differences in the training data sets used to
train the model.
• In case of overfitting, the model matches the training data so closely that even a
small difference in the training data gets magnified in the model.
Model Accuracy
• Model accuracy is given by the total number of correct classifications (either True
Positive or True Negative) divided by the total number of classifications done.
• In the context of the above confusion matrix, the total count of TPs = 85, count of
FPs = 4, count of FNs = 2, and count of TNs = 9.
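In formula form (following the word definition above), with a worked value for these counts:

Accuracy = (TP + TN) / (TP + FP + FN + TN) = (85 + 9) / (85 + 4 + 2 + 9) = 94/100 = 0.94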
Error Rate
• The percentage of misclassifications is indicated using the error rate, which is
measured as:
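A reconstruction of the formula, using the same counts as above:

Error rate = (FP + FN) / (TP + FP + FN + TN) = (4 + 2) / 100 = 0.06 = 1 − Accuracy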
Kappa value(k)
• The kappa value of a model indicates the model accuracy adjusted for the possibility
of a correct prediction occurring merely by chance.
• It is calculated using the formula below
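The standard Cohen's kappa formula (a reconstruction consistent with the description above), with a worked value for the counts given earlier (TP = 85, FP = 4, FN = 2, TN = 9):

κ = (P(a) − P(pr)) / (1 − P(pr))

where P(a) is the observed accuracy and P(pr) is the probability of agreement by chance. Here P(a) = 0.94 and

P(pr) = ((TP+FN)/100 × (TP+FP)/100) + ((FP+TN)/100 × (FN+TN)/100)
      = 0.87 × 0.89 + 0.13 × 0.11 = 0.7886

so κ = (0.94 − 0.7886) / (1 − 0.7886) ≈ 0.72.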
Sensitivity
• The sensitivity of a model measures the proportion of TP examples or positive
cases which were correctly classified.
• It is measured as
• In the context of the above confusion matrix for the cricket match win prediction
problem:
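Using the standard definition and the counts given earlier:

Sensitivity = TP / (TP + FN) = 85 / (85 + 2) ≈ 0.977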
Specificity
• Specificity of a model measures the proportion of negative examples which have
been correctly classified.
• In the context of the above confusion matrix for the cricket match win prediction
problem:
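Using the standard definition and the counts given earlier:

Specificity = TN / (TN + FP) = 9 / (9 + 4) ≈ 0.692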
• Visualization is an easier and more effective way to understand model performance:
• 1. Receiver Operating Characteristic (ROC) curves
• 2. Area Under Curve (AUC)
• It also helps in comparing the efficiency of two models.
Receiver operating characteristic (ROC) curves
• Receiver Operating Characteristic (ROC) curve helps in visualizing the
performance of a classification model.
• It shows the efficiency of a model in the detection of true positives while avoiding
the occurrence of false positives.
• To refresh our memory, true positives are the cases where the model has correctly
classified data instances as the class of interest.
• On the other hand, FPs are those cases where the model incorrectly classified data
instances as the class of interest.
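A minimal sketch of computing ROC data, assuming scikit-learn and synthetic data (illustrative only):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # probability of the class of interest

fpr, tpr, thresholds = roc_curve(y_test, scores)  # FP rate vs TP rate at each threshold
print(roc_auc_score(y_test, scores))              # Area Under the ROC Curve (AUC)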
Supervised Learning – Regression
• A regression model predicts the numerical value of the target feature as ŷ = f(X);
for each data instance, the residual is the difference between the actual value yᵢ
and the predicted value ŷᵢ.
• Sum of Squared Errors (SSE) (of prediction) = sum of the squared residuals
= Σᵢ (yᵢ − ŷᵢ)², summed over all n data instances.
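A small numerical sketch (the data values are made up) showing SSE computed from residuals:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # hypothetical single-feature data
y = np.array([2.1, 3.9, 6.2, 7.8])          # actual target values

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)  # actual minus predicted values
sse = np.sum(residuals ** 2)      # sum of the squared residuals
print(sse)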
Internal Evaluation – Silhouette Coefficient
● The silhouette coefficient, which is one of the most popular internal evaluation
methods, uses distance (Euclidean or Manhattan distances are most commonly used)
between data elements as a similarity measure.
● The value of silhouette width ranges between
-1 and +1, with a high value indicating high
intra-cluster homogeneity and inter-cluster
heterogeneity.
• For a data set clustered into 'k' clusters, silhouette
width is calculated as:
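A standard form of the silhouette width for a single data instance i, consistent with the description above:

s(i) = (b(i) − a(i)) / max{a(i), b(i)}

where a(i) is the average distance from i to the other instances in its own cluster, and b(i) is the average distance from i to the instances of the nearest neighbouring cluster; the silhouette width of the clustering is the average of s(i) over all instances. A minimal sketch with scikit-learn (synthetic data):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
print(silhouette_score(X, labels))  # mean silhouette width across all instances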
External Evaluation
• In this approach, the class label is known for the data set subjected to
clustering.
• However, the known class labels are not a part of the data used in clustering.
• The clustering algorithm is assessed based on how close the results are to those
known class labels.
• For example, purity is one of the most popular measures of cluster algorithms;
it evaluates the extent to which clusters contain a single class.
• For a data set having 'n' data instances and 'c' known class labels, which generates
'k' clusters, purity is measured as:
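A standard form of the purity measure, consistent with this description:

Purity = (1/n) × Σᵢ maxⱼ |Cᵢ ∩ Lⱼ|

where the sum runs over the k clusters, Cᵢ is the set of instances placed in cluster i, and Lⱼ is the set of instances carrying class label j (j = 1, ..., c).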
IMPROVING PERFORMANCE OF MODEL
• Model selection is done based on several aspects:
1. Type of learning the task in hand, i.e. supervised or
unsupervised
2. Type of the data, i.e. categorical or numeric
3. Sometimes on the problem domain
4. Above all, experience in working with different models to
solve problems of diverse domains
• This approach of combining different models with diverse strengths is known as an
ensemble (see figure).
ENSEMBLE……….
● Alternatively, the same training data may be used, but the models combined are
quite varied, e.g. SVM, neural network, kNN, etc.
● The outputs from the different models are combined using a combination
function. A very simple combination strategy, in the case of a prediction task
using an ensemble, is majority voting across the different models combined.
● For example, if 3 out of 5 models predict 'win' and 2 predict 'loss', then the final
outcome of the ensemble using majority vote would be a 'win'.
● The ensemble models are:
1. Bagging or bootstrap aggregating
2. Boosting
3. Random Forest
Bagging or Bootstrap aggregating
● Bagging uses bootstrap sampling method to generate multiple
training data sets.
● These training data sets are used to generate (or train) a set of models
using the same learning algorithm.
● Then the outcomes of the models are combined by majority voting
(classification) or by average (regression).
● Bagging is a very simple ensemble technique which can perform
really well for unstable learners like a decision tree, in which a slight
change in data can impact the outcome of a model significantly.
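A minimal bagging sketch, assuming scikit-learn (the estimator parameter name assumes scikit-learn 1.2 or later) and synthetic data; BaggingClassifier draws bootstrap samples and aggregates the trees' votes:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=42)

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # an unstable base learner
    n_estimators=25,                     # 25 models, each trained on a bootstrap sample
    bootstrap=True,                      # sample the training sets with replacement
    random_state=42,
).fit(X, y)

print(bag.score(X, y))  # classification outcome combined by majority voting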
Boosting & Random Forest
● Just like bagging, boosting is another key ensemble based technique.
● The weaker learning models are trained on resampled data and the outcomes are
combined using a weighted voting approach based on the performance of different
models.
● Adaptive boosting or AdaBoost is a special variant of boosting algorithm.
● It is based on the idea of generating weak learners sequentially, with each new
learner focusing on the training examples that the previous learners misclassified.
● Random forest is another ensemble-based technique. It is an ensemble of decision
trees hence the name random forest to indicate a forest of decision trees.
● Random Forest is a powerful ensemble learning technique that leverages the
strength of decision trees while addressing their limitations such as overfitting.
● By introducing randomness in feature selection and data sampling, Random
Forest builds a diverse set of decision trees and combines their predictions to
make robust and accurate predictions for classification and regression tasks.
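A minimal Random Forest sketch with scikit-learn (synthetic data); randomness enters both through bootstrap sampling of the rows and through random feature subsets at each split:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=42)

forest = RandomForestClassifier(
    n_estimators=100,     # a forest of 100 decision trees
    max_features="sqrt",  # consider a random subset of features at each split
    random_state=42,
).fit(X, y)

print(forest.score(X, y))  # combined (majority-vote) prediction accuracy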
Basics of Feature Engineering
● A feature is an attribute of a data set that is used in a machine learning process.
● The features in a data set are also called its dimensions. So, a data set having ‘n’
features is called an n-dimensional data set.
● A model for predicting the risk of cardiac disease may have features such as the
following: Age, Gender, Weight, Whether the person smokes, etc.
● Features in machine learning are very important, because the quality of the features
in the dataset has a major impact on the quality of the insights you will get while
using the dataset for machine learning.
Feature extraction
● In feature extraction, new features are created from a combination of original features.
Some of the commonly used operators for combining the original features include
1. For Boolean features: Conjunctions, Disjunctions, Negation, etc.
2. For nominal features: Cartesian product, M of N, etc.
3. For numerical features: Min, Max, Addition, Subtraction, Multiplication, Division,
Average, Equivalence, Inequality, etc.
● Let's take an example. Say we have a data set with a feature set
F = (F1, F2, ..., Fn).
● After feature extraction using a mapping function f, we will have a set of features
F' = (F'1, F'2, ..., F'm) such that F'i = f(Fi) and m < n.
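A toy illustration (the column names are made up) of extracting a new numerical feature by combining two original features with a division:

import pandas as pd

df = pd.DataFrame({
    "weight_kg": [70, 85, 60],
    "height_m": [1.75, 1.80, 1.65],
})

# extracted feature: body-mass index, a combination of two original features
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
print(df)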
Feature extraction algorithms- PCA
• The most popular feature extraction algorithms used in machine learning are
1. Principal Component Analysis(PCA)
2. Singular value decomposition(SVD)
3. Linear Discriminant Analysis(LDA)
• Principal Component Analysis (PCA) is an unsupervised learning algorithm that is
used for dimensionality reduction in machine learning.
• It is a statistical process that converts the observations of correlated features into a set
of linearly uncorrelated features with the help of orthogonal transformation.
• These new transformed features are called the Principal Components.
• It is one of the popular tools that is used for exploratory data analysis and predictive
modeling.
• It is a technique to draw strong patterns from the given dataset by reducing the
number of dimensions while retaining as much of the variance as possible.
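A minimal PCA sketch with scikit-learn, using the built-in Iris data (four correlated numerical features) purely as an illustration:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 4 correlated numerical features
pca = PCA(n_components=2)             # keep the first 2 principal components
X_reduced = pca.fit_transform(X)      # orthogonal transformation of the observations

print(pca.explained_variance_ratio_)  # share of variance retained by each component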