ML - Unit 1
Machine Learning Introduction
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
| Machine Learning | Traditional Programming |
|---|---|
| Machine Learning is a subset of artificial intelligence (AI) that focuses on learning from data to develop an algorithm that can be used to make a prediction. | In traditional programming, rule-based code is written by the developers depending on the problem statement. |
| Machine Learning uses a data-driven approach; it is typically trained on historical data and then used to make predictions on new data. | Traditional programming is typically rule-based and deterministic. It has no self-learning features like Machine Learning and AI. |
| ML can find patterns and insights in large datasets that might be difficult for humans to discover. | Traditional programming is totally dependent on the intelligence of developers, so it has very limited capability. |
| Machine Learning is a subset of AI and is now used in various AI-based tasks like chatbots, question answering, self-driving cars, etc. | Traditional programming is often used to build applications and software systems that have specific functionality. |
When Do We Use Machine Learning?
A classic example of a task that requires machine learning: it is very hard to say explicitly what makes a handwritten digit a "2".
Slide credit: Geoffrey Hinton
Applications of Machine learning
Some more examples of tasks that are best solved by
using a learning algorithm
• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences
• Recognizing anomalies:
– Unusual credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant
• Prediction:
– Future stock prices or currency exchange rates
1. Image Recognition
Slide credit: Ray Mooney
Ingredients of Machine Learning
3. Features:
▪ A good feature representation is central to achieving high performance in machine learning.
▪ Feature selection is the process of choosing a subset of features from the original features so that the feature space is optimally reduced according to a certain criterion.
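As an illustration, a minimal feature-selection sketch using scikit-learn's SelectKBest is shown below; the dataset and the choice of k are assumptions made for the example, not taken from the slides.

```python
# Hedged sketch: univariate feature selection with scikit-learn (illustrative dataset and k).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                   # 150 samples, 4 original features
selector = SelectKBest(score_func=f_classif, k=2)   # keep the 2 highest-scoring features
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)               # (150, 4) -> (150, 2)
```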
How does Machine Learning work?
▪ A Machine Learning system learns from historical data, builds prediction models, and, whenever it receives new data, predicts the output for it. The accuracy of the predicted output depends on the amount of data: a large amount of data helps to build a better model, which predicts the output more accurately.
TRAINING AND TESTING DATASET
Training Dataset
▪ The training data is the largest (in size) subset of the original dataset, and it is used to train or fit the machine learning model.
▪ The training data is fed to the ML algorithms, which lets them learn how to make predictions
for the given task.
▪ For example, for training a sentiment analysis model, the training data could be as below:

| Input | Output (Labels) |
|---|---|
| The new UI is great | Positive |
| Update is really slow | Negative |
▪ For unsupervised learning, the training data contains unlabeled data points; for supervised learning, the training data contains labels used to train the model and make predictions.
▪ The better the quality of the training data, the better the performance of the model. Training data is typically 60% or more of the total data for an ML project.
Test Dataset
▪ The test dataset is another subset of the original data, which is independent of the training dataset.
▪ Once we train the model with the training dataset, we test the model with the test dataset. The testing data is used to check the accuracy of the model.
▪ This dataset evaluates the performance of the model and ensures that the model can generalize well to new or unseen data.
▪ The test dataset is approximately 20-25% of the total original data for an ML project.
▪ Common ratios for splitting the train and test datasets are 80:20, 70:30, or 90:10.
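A minimal sketch of such a split using scikit-learn's train_test_split; the 80:20 ratio and the iris dataset are illustrative choices, not prescribed by the slides.

```python
# Hedged sketch: an 80:20 train/test split with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)            # 20% held out for testing

print(len(X_train), "training samples,", len(X_test), "test samples")  # 120, 30
```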
POSITIVE AND NEGATIVE CLASS
The Boy Who Cried Wolf
▪ A shepherd boy gets bored tending the town's flock. To have some
fun, he cries out, "Wolf!" even though no wolf is in sight. The
villagers run to protect the flock, but then get really mad when they
realize the boy was playing a joke on them.
▪ One night, the shepherd boy sees a real wolf approaching the flock
and calls out, "Wolf!" The villagers refuse to be fooled again and
stay in their houses. The hungry wolf turns the flock into lamb chops. The town gets angry. Panic ensues.
Let's make the following definitions:
• "Wolf" is a positive class.
• "No wolf" is a negative class.
"wolf-prediction" model using a 2x2 confusion matrix that depicts all four
possible outcomes:
▪ A false positive is an outcome where the model incorrectly predicts the positive class.
And a false negative is an outcome where the model incorrectly predicts
the negative class.
Let's try calculating accuracy for the following model that classified
100 tumors as the positive class or the negative class
For binary classification, accuracy can also be calculated in terms of positives and
negatives as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
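A small sketch of this calculation; the toy labels below are invented for illustration and are not the 100-tumor example.

```python
# Hedged sketch: accuracy from the four confusion-matrix counts.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]    # 1 = positive class ("wolf"), 0 = negative class ("no wolf")
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(accuracy, accuracy_score(y_true, y_pred))      # 0.75 either way
```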
Hold-out Validation
▪ In this method, we perform training on 50% of the given dataset and the remaining 50% is used for testing.
▪ The major drawback of this method is that, since we train on only 50% of the dataset, the remaining 50% may contain important information that we leave out while training the model, i.e., higher bias.
Leave P Out Cross Validation
▪ In this approach, p data points are left out of the training data. That is, if there are a total of n data points in the original input dataset, then n − p data points are used as the training set and the p data points as the validation set.
▪ This is repeated for all combinations, and then the error is averaged.
Pros
▪ It has Zero randomness
▪ The Bias will be lower
Cons
▪ This method is exhaustive and can become computationally infeasible for large datasets.
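A minimal sketch using scikit-learn's LeavePOut; p = 2 and the four toy points are assumptions for illustration.

```python
# Hedged sketch: Leave-P-Out cross-validation splits.
import numpy as np
from sklearn.model_selection import LeavePOut

X = np.arange(8).reshape(4, 2)        # 4 data points
lpo = LeavePOut(p=2)                  # leave 2 points out in each split

print("number of splits:", lpo.get_n_splits(X))       # C(4, 2) = 6 combinations
for train_idx, test_idx in lpo.split(X):
    print("train:", train_idx, "test:", test_idx)
```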
LOOCV (Leave One Out Cross Validation)
▪ This method is similar to leave-p-out cross-validation, but instead of p points we leave out only 1 data point from the training data.
▪ For each learning set, only one data point is reserved, and the remaining data is used to train the model. This process repeats for each data point; hence for n samples, we get n different training sets and n test sets.
▪ An advantage of this method is that we make use of all data points, and hence the bias is low.
▪ The major drawback of this method is that it leads to higher variation in the testing model, as we are testing against a single data point. If that data point is an outlier, it can lead to higher variation.
▪ Another drawback is that it takes a lot of execution time, as it iterates n times, once per data point.
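A minimal LOOCV sketch with scikit-learn; the logistic-regression model and the iris dataset are illustrative assumptions.

```python
# Hedged sketch: Leave-One-Out cross-validation gives n scores for n samples.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())

print(len(scores), "iterations, mean accuracy:", scores.mean())   # 150 iterations, one per sample
```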
K-Fold Cross Validation
▪ In this method, we split the dataset into k subsets (known as folds), then perform training on k − 1 of the subsets and leave one subset for evaluation of the trained model. We iterate k times, with a different subset reserved for testing each time.
▪ For example, with a total of 25 instances and k = 5: in the first iteration we use the first 20 percent of the data for evaluation and the remaining 80 percent for training (instances [1-5] for testing and [6-25] for training), while in the second iteration we use the second subset of 20 percent for evaluation and the remaining data for training (instances [6-10] for testing and [1-5, 11-25] for training), and so on.
▪ Total instances: 25
▪ Value of k: 5

| Iteration | Training set observations | Testing set observations |
|---|---|---|
| 1 | [5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] | [0 1 2 3 4] |
| 2 | [0 1 2 3 4 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24] | [5 6 7 8 9] |
| 3 | [0 1 2 3 4 5 6 7 8 9 15 16 17 18 19 20 21 22 23 24] | [10 11 12 13 14] |
| 4 | [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 20 21 22 23 24] | [15 16 17 18 19] |
| 5 | [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19] | [20 21 22 23 24] |
▪ Pros
▪ It is far less computationally expensive than exhaustive methods such as LOOCV.
▪ Models may not be affected much if an outlier is present in data.
▪ It helps us overcome the problem of variability.
▪ Cons
▪ Imbalanced data sets will impact our model.
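The 25-instance, k = 5 example above can be reproduced with scikit-learn's KFold; this is a sketch, and the unshuffled folds give exactly the contiguous blocks shown in the table.

```python
# Hedged sketch: 5-fold split of 25 instances, matching the table above.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(25)                      # 25 instances, indexed 0..24
kf = KFold(n_splits=5)                 # no shuffling -> contiguous folds

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Iteration {i}: test = {test_idx}")
```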
Stratified K-Fold Cross-Validation
▪ K Fold Cross Validation technique will not work as expected for an Imbalanced
Data set.
▪ It is a slight change to the K-Fold cross-validation technique, such that each fold contains approximately the same proportion of samples of each output class as the complete dataset. This variation, which uses strata in K-Fold cross-validation, is known as Stratified K-Fold Cross-Validation.
▪ It helps in reducing both Bias and Variance.
▪ Let the population of a state be 51.3% male and 48.7% female. Then, to choose 1000 people from that state, you pick 513 males (51.3% of 1000) and 487 females (48.7% of 1000), i.e., 513 males + 487 females (total = 1000 people), to ask their opinion. This group of people then represents the entire state. This is called stratified sampling.
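A minimal sketch with scikit-learn's StratifiedKFold; the imbalanced toy labels are an assumption for illustration.

```python
# Hedged sketch: each stratified fold keeps the class proportions of the full dataset.
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 15 + [1] * 5)       # imbalanced: 75% class 0, 25% class 1
skf = StratifiedKFold(n_splits=5)

for train_idx, test_idx in skf.split(X, y):
    print("test fold class counts:", np.bincount(y[test_idx]))   # [3 1] in every fold
```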
TYPES OF LEARNING
Types of Learning
▪ Supervised learning
▪ Training data includes desired outputs
▪ Unsupervised learning
▪ Training data does not include desired outputs
▪ Semi-supervised learning
▪ Training data includes a few desired outputs
▪ Reinforcement learning
▪ Rewards from a sequence of actions
Supervised learning
• Classification
• Regression
Advantages and Disadvantages of Supervised Learning
▪ Advantages:
• Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
• These algorithms are helpful in predicting the output on the basis of prior
experience.
▪ Disadvantages:
• These algorithms are not able to solve complex tasks.
• It may predict the wrong output if the test data is different from the training
data.
• It requires lots of computational time to train the algorithm.
Advantages and Disadvantages of Unsupervised Learning
▪ Advantages:
• These algorithms can be used for complicated tasks compared
to the supervised ones because these algorithms work on the
unlabeled dataset.
• Unsupervised algorithms are preferable for various tasks as
getting the unlabeled dataset is easier as compared to the
labelled dataset.
▪ Disadvantages:
• The output of an unsupervised algorithm can be less accurate
as the dataset is not labelled, and algorithms are not trained
with the exact output in prior.
• Working with Unsupervised learning is more difficult as it works
with the unlabelled dataset that does not map with the output.
Applications of Unsupervised Learning
• Network Analysis: Unsupervised learning is used in document network analysis of text data, for example to identify plagiarism and copyright issues in scholarly articles.
• Recommendation Systems: Recommendation systems widely use
unsupervised learning techniques for building recommendation
applications for different web applications and e-commerce
websites.
• Anomaly Detection: Anomaly detection is a popular application of
unsupervised learning, which can identify unusual data points
within the dataset. It is used to discover fraudulent transactions.
3. Semi-Supervised Learning
▪ Advantages and disadvantages of Semi-supervised Learning
▪ Advantages:
• It is simple and easy to understand the algorithm.
• It is highly efficient.
• It is used to solve drawbacks of Supervised and Unsupervised
Learning algorithms.
▪ Disadvantages:
• Iteration results may not be stable.
• We cannot apply these algorithms to network-level data.
• Accuracy is low.
4. Reinforcement Learning
Example: The problem is as follows: we have an agent and a reward, with many hurdles in between. The agent is supposed to find the best possible path to reach the reward.
Advantages and Disadvantages of Reinforcement Learning
▪ Advantages
• It helps in solving complex real-world problems which are difficult to be solved by general
techniques.
• The learning model of RL is similar to the learning of human beings; hence, the most accurate results can be obtained.
• Helps in achieving long term results.
▪ Disadvantage
• RL algorithms are not preferred for simple problems.
• RL algorithms require huge data and computations.
• Too much reinforcement learning can lead to an overload of states which can weaken the
results.
| Criteria | Supervised ML | Unsupervised ML | Reinforcement ML |
|---|---|---|---|
| Definition | Learns by using labelled data | Trained using unlabelled data without any guidance | Works by interacting with the environment |
Example (supervised learning): a learned function f maps an input image to a label, e.g., f(image) = "apple", f(image) = "tomato", f(image) = "cow".
Slide credit: L. Lazebnik
The machine learning framework
y = f(x), where y is the output, f is the prediction function, and x is the image features.
▪ Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the
prediction function f by minimizing the prediction error on the training set
▪ Testing: apply f to a never before seen test example x and output the predicted
value y = f(x)
Training: training images → image features → learned model
Testing: test image → image features → learned model → prediction
Slide credit: D. Hoiem and L. Lazebnik
MODELS OF MACHINE LEARNING
▪ Geometric models use intuitions from geometry such as separating (hyper-)planes,
linear transformations and distance metrics.
▪ Probabilistic models view learning as a process of reducing uncertainty, modelled
by means of probability distributions.
▪ Logical models are defined in terms of easily interpretable logical expressions.
▪ Grouping models divide the instance space into segments; in each segment a very
simple (e.g., constant) model is learned.
▪ Grading models learn a single, global model over the instance space.
Geometric Model
Geometric Model
▪ Example: the nearest neighbour algorithm, used for classification and regression, works by finding the closest data point to a given query point in a geometric space.
▪ The distance between two data points can be measured
using different metrics, such as Euclidean distance or
cosine similarity. Once the closest data point is found,
the algorithm can use its properties to classify or predict
the properties of the query point.
▪ The basic linear classifier constructs a decision boundary by
half-way intersecting the line between the positive and
negative centres of mass.
▪ It is described by the equation w · x = t, with w = p − n;
▪ the decision threshold can be found by noting that (p + n)/2 is on the decision boundary, and hence t = (p − n) · (p + n)/2 = (||p||² − ||n||²)/2, where ||x|| denotes the length of vector x.
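A small sketch of this basic linear classifier, using invented 2-D points for the positive and negative classes; the data is an assumption, while the formulas w = p − n and t = (||p||² − ||n||²)/2 come from the text above.

```python
# Hedged sketch: basic linear classifier from the class centres of mass.
import numpy as np

pos = np.array([[2.0, 3.0], [3.0, 3.5], [2.5, 4.0]])   # toy positive examples
neg = np.array([[0.0, 1.0], [1.0, 0.5], [0.5, 0.0]])   # toy negative examples

p, n = pos.mean(axis=0), neg.mean(axis=0)               # positive and negative centres of mass
w = p - n                                                # normal vector of the decision boundary
t = (np.dot(p, p) - np.dot(n, n)) / 2                    # threshold so that (p + n)/2 lies on the boundary

x_new = np.array([2.0, 2.0])
print("positive" if np.dot(w, x_new) > t else "negative")
```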
Geometric Model
▪ Support vector machine used in classification task
▪ SVM works by finding a hyperplane in a high-
dimensional space that separates the data points into
different classes.
▪ The hyperplane is chosen in such a way that it
maximizes the margin between the two closest data
points from different classes.
▪ The decision boundary learned by a support vector
machine from the linearly separable data from Figure
1.
▪ The decision boundary maximises the margin, which
is indicated by the dotted lines. The circled data points
are the support vectors
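A minimal SVM sketch with scikit-learn; the blob data, the linear kernel, and the large C value are illustrative assumptions approximating a hard margin.

```python
# Hedged sketch: linear SVM on linearly separable toy data.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=6)   # two separable clusters
clf = SVC(kernel="linear", C=1000).fit(X, y)                  # large C ~ hard margin

print("hyperplane w:", clf.coef_[0], " b:", clf.intercept_[0])
print("number of support vectors:", len(clf.support_vectors_))  # the points on the margin
```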
Geometric Model
▪ Geometric models can also be used in clustering tasks, where the goal is to
group similar data points together.
▪ One example of a geometric model for clustering is the k-means algorithm,
which works by partitioning the data into k clusters based on their distance to k
initial centroids. The centroids are then updated iteratively to minimize the
distance between the data points and their respective centroids.
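A minimal k-means sketch with scikit-learn; the number of clusters and the generated blobs are assumptions for illustration.

```python
# Hedged sketch: k-means partitions unlabeled points around k centroids.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # labels are ignored: unsupervised
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("centroids:\n", kmeans.cluster_centers_)
print("first 10 cluster assignments:", kmeans.labels_[:10])
```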
Geometric Model
Challenges:
1. Curse of Dimensionality: As the dimensionality of data increases, it becomes
increasingly difficult to model and analyze the data. In order to overcome this
challenge, various techniques have been developed such as feature selection,
dimensionality reduction, and regularization.
2. Choosing the Right Model: Geometric models come in many different forms, each
with their own strengths and weaknesses. Choosing the right model for a given
problem can be a challenging task, and often requires careful experimentation and
analysis.
3. Interpreting Results: Geometric models can produce complex and high-
dimensional results, making it difficult to interpret and understand the output.
Techniques such as visualization and feature importance analysis can help to
overcome this challenge and make the results more interpretable.
Probabilistic models
▪ Probabilistic models take into consideration the uncertainty inherent in real-world data.
▪ Not all data fits well into a probabilistic framework, which can limit the usefulness of
these models in certain applications.
Categories Of Probabilistic Models
1. Generative models :
▪ Generative models aim to model the joint distribution of the input and output
variables.
▪ These models generate new data based on the probability distribution of the
original dataset.
▪ The joint distribution looks for a relationship between two variables. Once this
relationship is inferred, it is possible to infer new data points.
▪ Generative models are powerful because they can generate new data that
resembles the training data.
▪ They can be used for tasks such as image and speech synthesis, language
translation, and text generation.
2. Discriminative models:
▪ The discriminative model aims to model the conditional distribution of the output
variable given the input variable.
▪ They learn a decision boundary that separates the different classes of the output
variable.
▪ Discriminative models are useful when the focus is on making accurate predictions
rather than generating new data.
▪ They can be used for tasks such as image recognition, speech recognition, and
sentiment analysis.
3. Graphical models:
▪ Graphical models represent random variables and their probabilistic dependencies using a graph structure (e.g., Bayesian networks).
▪ They are commonly used for tasks such as image recognition, natural language processing, and causal inference.
Advantages Of Probabilistic Models
▪ The main advantage of these models is their ability to take into account
uncertainty and variability in data. This allows for more accurate predictions and
decision-making, particularly in complex and unpredictable situations.
▪ Probabilistic models can also provide insights into how different factors influence
outcomes and can help identify patterns and relationships within data.
Posterior Probability
Assuming that X and Y are the only variables we know and care about, the
posterior distribution P(Y |X) helps us to answer many questions of interest.
▪ For instance, to classify a new e-mail we determine whether the words
‘Viagra’ and ‘lottery’ occur in it, look up the corresponding probability
P(Y = spam|Viagra,lottery), and predict spam if this probability exceeds
0.5 and ham otherwise.
▪ Such a recipe to predict a value of Y on the basis of the values of X and
the posterior distribution P(Y |X) is called a decision rule
Likelihood ratio
As a matter of fact, statisticians work very often with different conditional probabilities,
given by the likelihood function P(X|Y ).
▪ I like to think of these as thought experiments: if somebody were to send me a spam e-
mail, how likely would it be that it contains exactly the words of the e-mail I’m looking
at? And how likely if it were a ham e-mail instead?
▪ What really matters is not the magnitude of these likelihoods, but their ratio: how much
more likely is it to observe this combination of words in a spam e-mail than it is in a
non-spam e-mail.
▪ For instance, suppose that for a particular e-mail described by X we have P(X|Y = spam) = 3.5 · 10⁻⁵ and P(X|Y = ham) = 7.4 · 10⁻⁶; then observing X in a spam e-mail is nearly five times more likely than it is in a ham e-mail.
▪ This suggests the following decision rule: predict spam if the likelihood ratio is larger
than 1 and ham otherwise.
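A tiny sketch of this decision rule, using the two likelihoods quoted above.

```python
# Hedged sketch: likelihood-ratio decision rule for the spam example above.
p_x_given_spam = 3.5e-5
p_x_given_ham = 7.4e-6

likelihood_ratio = p_x_given_spam / p_x_given_ham       # ~4.73: nearly five times more likely
prediction = "spam" if likelihood_ratio > 1 else "ham"

print(round(likelihood_ratio, 2), prediction)
```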
Naive Bayes Algorithm in Probabilistic Models
▪ It is based on the Bayes theorem of probability and assumes that the features are
conditionally independent of each other given the class
▪ The Naive Bayes Algorithm is used to calculate the probability of a given sample
belonging to a particular class.
▪ This is done by calculating the posterior probability of each class given the
sample and then selecting the class with the highest posterior probability as the
predicted class.
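A minimal Naive Bayes sketch: the binary word features mirror the Viagra/lottery example above, but the training e-mails are invented, and BernoulliNB is just one scikit-learn implementation of the algorithm.

```python
# Hedged sketch: Naive Bayes on binary word-occurrence features.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# columns: [contains "Viagra", contains "lottery"]
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [0, 0], [1, 1]])
y = np.array(["spam", "spam", "spam", "ham", "ham", "spam"])

model = BernoulliNB().fit(X, y)
print(model.predict([[0, 1]]))        # predicted class for an e-mail containing only "lottery"
print(model.predict_proba([[0, 1]]))  # posterior probability of each class
```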
Logical models
● Logical models use a logical expression to divide the instance space into segments and hence
construct grouping models.
● Once the data is grouped using a logical expression, the data is divided into homogeneous
groupings for the problem we are trying to solve. For example, for a classification problem, all
the instances in the group belong to one class.
● There are mainly two kinds of logical models: Tree models and Rule models.
● Rule based models:
○ Rule models consist of a collection of implications or IF-THEN rules
if Viagra = 1 then Class = Y = spam
if Viagra = 0 ^ lottery = 1 then Class = Y = spam
if Viagra = 0 ^ lottery = 0 then Class = Y = ham
Such rules are easily arranged in a tree structure, which we refer to as a feature tree.
Tree model
A feature tree:
(left) A feature tree combining two Boolean features. Each internal node or split is labelled with a
feature, and each edge emanating from a split is labelled with a feature value. Each leaf therefore
corresponds to a unique combination of feature values. Also indicated in each leaf is the class
distribution derived from the training set. (right) A feature tree partitions the instance space into
rectangular regions, one for each leaf. We can clearly see that the majority of ham lives in the lower
left-hand corner.
Labelling a feature tree
▪ The leaves of the tree in Figure could be labelled, from left to right, as ham – spam –
spam, employing a simple decision rule called majority class.
▪ Alternatively, we could label them with the proportion of spam e-mail occurring in each
leaf: from left to right, 1/3, 2/3, and 4/5.
▪ Or, if our task was a regression task, we could label the leaves with predicted real values
or even linear functions of some other, real-valued features.
▪ One of the most well-known logical models in machine learning is the decision tree.
Decision trees are a popular classification algorithm that uses a tree-like model of
decisions and their possible consequences to classify data points. Each internal node in the
decision tree represents a decision based on a feature value, and each leaf node represents
a class label.
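A minimal decision-tree sketch mirroring the Viagra/lottery feature tree; the training rows are invented for illustration.

```python
# Hedged sketch: a small decision tree over two Boolean features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: [Viagra, lottery]; labels: spam / ham
X = np.array([[1, 0], [1, 1], [0, 1], [0, 1], [0, 0], [0, 0], [0, 0]])
y = np.array(["spam", "spam", "spam", "ham", "ham", "ham", "ham"])

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Viagra", "lottery"]))   # internal nodes test features, leaves give classes
```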
Grouping and Grading model
The key difference between grouping and grading model is the way they handle the instance
space.
Grouping Model:
▪ Grouping model breaks the instance space into groups or segments, the number of which is
determined at training time.
▪ Grouping models have a fixed and finite resolution and cannot distinguish between individual
instances beyond this resolution.
▪ What grouping models do at this finest resolution is often something very simple, such as
assigning the majority class to all instances that fall into the segment.
▪ The main emphasis of training a grouping model is then on determining the right segments.
▪ Example : tree based models: They work by repeatedly splitting the instance space into
smaller subsets. The subsets at the leaves of the tree partition the instance space with some
finite resolution. Instances filtered into the same leaf of the tree are treated the same, regardless of any features not in the tree that might be able to distinguish them.
Grouping and Grading model
Grading Model:
▪ Grading models do not employ the notion of a segment.
▪ Rather than applying very simple, local models, they form one global model over the instance
space.
▪ Grading models are able to distinguish between arbitrary instances, no matter how similar
they are.
▪ Their resolution is, in theory, infinite, particularly when working in a Cartesian instance space.
▪ Example: Support vector machines and other geometric classifiers- because they work in a
Cartesian instance space, they are able to represent and exploit the minutest differences
between instances.
Parametric and Non-Parametric Models
Parametric Model:
A learning model that summarizes data with a set of parameters of fixed size
(independent of the number of training examples) is called a parametric model. No
matter how much data you throw at a parametric model, it won’t change its mind about
how many parameters it needs.
▪ Assumptions can greatly simplify the learning process, but can also limit what can be
learned
▪ Algorithms that simplify the function to a known form are called parametric machine
learning algorithms.
▪ The algorithms involve two steps:
▪ Select a form for the function.
▪ Learn the coefficients for the function from the training data.
▪ An easy to understand functional form for the mapping function is a line, as is used in
linear regression:
b0 + b1*x1 + b2*x2 = 0
Where b0, b1 and b2 are the coefficients of the line that control the intercept and
slope, and x1 and x2 are two input variables.
▪ All we need to do is estimate the coefficients of the line equation, and we have a predictive model for the problem.
▪ However, the actual unknown underlying function may not be a linear function like a line, in which case the assumption is wrong and the approach will produce poor results.
▪ Some more examples of parametric machine learning algorithms include: Logistic Regression, Linear Discriminant Analysis, Naive Bayes.
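A minimal sketch of a parametric model in code (linear regression with scikit-learn); the synthetic data and the true coefficients are assumptions for illustration.

```python
# Hedged sketch: a parametric model keeps a fixed number of coefficients regardless of data size.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))                          # two input variables x1, x2
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=100)

model = LinearRegression().fit(X, y)
print("b0 =", model.intercept_, " b1, b2 =", model.coef_)      # always exactly three parameters
```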
Benefits of Parametric Machine Learning Algorithms:
▪ Simpler: These methods are easier to understand and interpret results.
▪ Speed: Parametric models are very fast to learn from data.
▪ Less Data: They do not require as much training data and can work well even if the fit
to the data is not perfect.
Limitations of Parametric Machine Learning Algorithms:
▪ Constrained: By choosing a functional form these methods are highly constrained to
the specified form.
▪ Limited Complexity: The methods are more suited to simpler problems.
▪ Poor Fit: In practice the methods are unlikely to match the underlying mapping
function.
Non Parametric Model:
Nonparametric methods are good when you have a lot of data and no prior
knowledge, and when you don’t want to worry too much about choosing just the
right features.
▪ Algorithms that do not make strong assumptions about the form of the mapping
function are called nonparametric machine learning algorithms
▪ they are free to learn any functional form from the training data.
▪ They are also referred to as distribution-free methods, since no assumption about the data distribution (normal distribution, etc.) is made.
▪ Example: k-nearest neighbors algorithm that makes predictions based on the k
most similar training patterns for a new data instance. The method does not
assume anything about the form of the mapping function other than patterns that
are close are likely to have a similar output variable.
Some more examples of popular nonparametric machine learning algorithms are: k-Nearest Neighbors, Decision Trees like CART and C4.5, Support Vector Machines.
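A minimal nonparametric sketch using k-nearest neighbours; the dataset, the value of k, and the split are illustrative assumptions.

```python
# Hedged sketch: k-NN stores the training data itself instead of a fixed set of coefficients.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)  # predictions come from the 5 closest stored points
print("test accuracy:", knn.score(X_test, y_test))
```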
Benefits of Nonparametric Machine Learning Algorithms:
▪ Flexibility: Capable of fitting a large number of functional forms.
▪ Power: No assumptions (or weak assumptions) about the underlying function.
▪ Performance: Can result in higher performance models for prediction.
Limitations of Nonparametric Machine Learning Algorithms:
▪ More data: Require a lot more training data to estimate the mapping function.
▪ Slower: A lot slower to train as they often have far more parameters to train.
▪ Overfitting: More of a risk to overfit the training data and it is harder to explain why specific
predictions are made.
| Parametric Methods | Non-parametric Methods |
|---|---|
| Use a fixed number of parameters to build the model. | Use a flexible number of parameters to build the model. |
| Require less data than non-parametric methods. | Require much more data than parametric methods. |