
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

Vadapalani, Chennai – 600 026


Department of Computer Science and Engineering

21CSC305P MACHINE LEARNING

UNIT I

Course Learning Rationale (CLR):
CLR-1: Explore the fundamental mathematical concepts of machine learning algorithms.

Course Outcomes (CO):
CO-1: Understand the basics of machine learning using probability theory.

Unit 1 – Introduction (9 Hours)

Machine learning: what and why?; supervised and unsupervised learning; polynomial curve fitting; probability theory: discrete random variables, fundamental rules, Bayes' rule, independence and conditional independence, continuous random variables, quantiles, mean and variance, probability densities, expectation and covariance.

Practice:
1. Devise a program to import, load, and view a dataset (see the sketch below).
2. Create a program to display the summary and statistics of the dataset.
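A minimal sketch covering both practice exercises, assuming pandas is installed and a CSV file named data.csv exists (both the library choice and the file name are assumptions, not part of the course material):

import pandas as pd

# Practice 1: import, load, and view a dataset
df = pd.read_csv("data.csv")   # hypothetical file name
print(df.head())               # first five rows
print(df.shape)                # (rows, columns)

# Practice 2: summary and statistics of the dataset
df.info()                      # column types and non-null counts
print(df.describe())           # count, mean, std, min, quartiles, max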

Unit-1 - INTRODUCTION
Machine learning: what and why?
Machine Learning (ML) is a subfield of artificial intelligence (AI) that focuses on developing algorithms that allow computers to learn from data and make predictions or decisions based on it.
Definition: A well-known definition is due to Tom Mitchell:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

In short, the capability of AI systems to learn by extracting patterns from data is known as machine learning.

Why Machine Learning?

• Data Availability: The explosion of data in various forms (e.g., text, images, transactions) provides a rich resource for training machine learning models. The ability to automatically learn and adapt from this data is invaluable.
• Complex Problem Solving: Many problems are too complex to solve with traditional programming. ML can tackle tasks like natural language processing, image recognition, and autonomous driving.
• Automation and Efficiency: ML automates analytical model building, enabling computers to find hidden insights without being explicitly programmed for specific tasks.
• Continuous Improvement: ML models can improve over time with more data and retraining, leading to better performance and adaptability.
Key Components of Machine Learning

1. Data: The foundation of any ML system. It includes the historical data used for training the model and new data used for making predictions.
2. Algorithms: The set of rules or procedures the model follows to learn from data. Popular algorithms include decision trees, support vector machines, neural networks, and ensemble methods like boosting.
3. Model: The output of the machine learning process. It is a mathematical representation of the data that can make predictions or decisions.
4. Training: The process of feeding data into an algorithm to create the model. It involves adjusting the model parameters to minimize prediction error.
5. Evaluation: Assessing the model's performance using metrics such as accuracy, precision, recall, F1 score, and ROC-AUC.
Traditional programming vs. the machine learning approach

Traditional program: Input (data) + Rules -> Output
Machine learning program: Input (data) + Output (examples) -> Rules (the learned model)

In other words, a traditional program applies hand-written rules to input to produce output, whereas a machine learning program is given inputs together with their outputs and learns the rules itself.
Training phase and testing phase

Training phase: training data set -> feature vectors -> learning algorithm -> machine learning model
Testing phase: test data set -> feature vectors -> model -> prediction
Types of machine learning

Machine learning is commonly divided into four types: supervised learning, unsupervised learning, reinforcement learning, and semi-supervised learning.
• Types of Machine Learning
1. Supervised Learning: The model is trained on labeled data. It learns to map inputs to outputs based on the examples provided. Applications include classification and regression tasks.
   Example: Predicting house prices based on historical data of house features and prices.
2. Unsupervised Learning: The model is trained on unlabeled data. It tries to find patterns or structures in the data without explicit instructions on what to predict. Applications include clustering and association.
   Example: Grouping customers into segments based on purchasing behavior.
3. Reinforcement Learning: The model learns by interacting with an environment, receiving rewards or penalties for actions taken. It aims to maximize cumulative rewards over time.
   Example: Training a robot to navigate through a maze.
Supervised Learning

supervised and unsupervised learning:
Supervised Learning:
Supervised learning is a type of machine learning where the model is trained on a labeled dataset. This means that each training example is paired with an output label. The goal is to learn a mapping from inputs to outputs so that the model can predict the output for new, unseen inputs.
• Key Components:
1. Training Data: Consists of input-output pairs. For example, in a dataset for house prices, each data point could include features like the number of bedrooms, size of the house, etc., and the corresponding price.
2. Model: A function that maps input features to output labels. Common models include linear regression, logistic regression, decision trees, and neural networks.
3. Learning Algorithm: Optimizes the model parameters to minimize the difference between the predicted outputs and the actual outputs in the training data. Common algorithms include gradient descent, backpropagation (for neural networks), and others.
4. Loss Function: Measures the performance of the model by quantifying the difference between the predicted and actual outputs. Examples include mean squared error for regression and cross-entropy loss for classification.
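As a concrete illustration of these four components, here is a minimal supervised-learning sketch, assuming scikit-learn is installed; the tiny house-price data is invented for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Training data: input-output pairs (features: size in sq. ft, bedrooms; label: price)
X_train = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4]])
y_train = np.array([200000, 280000, 340000, 420000])

# Model + learning algorithm: linear regression fitted by least squares
model = LinearRegression()
model.fit(X_train, y_train)

# Loss function: mean squared error on the training data
print("Training MSE:", mean_squared_error(y_train, model.predict(X_train)))

# Prediction for a new, unseen house
print("Predicted price:", model.predict(np.array([[1800, 3]]))[0])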
Types of supervised learning

Supervised learning is divided into classification and regression.

Classification: a classification problem is when the output variable is a category, such as "red" or "blue", or "disease" and "no disease".
Regression: a regression problem is when the output variable is a real value, such as "dollars" or "weight".
– Applications:
• Classification: Predicting a discrete label for an input. Examples include email spam detection, handwriting recognition, and medical diagnosis.
• Regression: Predicting a continuous value for an input. Examples include predicting house prices, stock prices, and temperature forecasting.
CLASSIFICATION

Which of the following are classification tasks?

• Find the gender of a person by analyzing his writing style
• Predict the price of a house based on floor area, number of rooms, etc.
• Predict whether there will be abnormally heavy rainfall next year
• Predict the number of copies of a book that will be sold this month

Answer: the first and third are classification tasks, since the output is a category (male/female; heavy rainfall or not). Predicting a house price or the number of copies sold are regression tasks, since the outputs are numeric values.
REGRESSION

Which ONE of the following is a regression task?

• Predict the age of a person
• Predict the country from where the person comes from
• Predict whether the price of petroleum will increase tomorrow
• Predict whether a document is related to science

Answer: only predicting the age of a person is a regression task, since age is a continuous value. The other three have categorical outputs, so they are classification tasks.
Training phase - classification

Training data (feature vectors with labels):

SHAPE     COLOR    FRUIT
ROUND     RED      APPLE
CYLINDER  YELLOW   BANANA
ROUND     ORANGE   ORANGE
ROUND     RED      APPLE
CYLINDER  YELLOW   BANANA
ROUND     ORANGE   ORANGE
ROUND     RED      APPLE
CYLINDER  YELLOW   BANANA
ROUND     ORANGE   ORANGE

A supervised learning algorithm produces a model from this data:

1. If SHAPE=ROUND and COLOR=RED then FRUIT=APPLE
2. If SHAPE=CYLINDER and COLOR=YELLOW then FRUIT=BANANA
3. If SHAPE=ROUND and COLOR=ORANGE then FRUIT=ORANGE
Testing phase - classification

Test data (feature vector):

SHAPE     COLOR
ROUND     RED

Applying the learned model:
1. If SHAPE=ROUND and COLOR=RED then FRUIT=APPLE
2. If SHAPE=CYLINDER and COLOR=YELLOW then FRUIT=BANANA
3. If SHAPE=ROUND and COLOR=ORANGE then FRUIT=ORANGE

Prediction: Apple
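A minimal sketch of this training/testing workflow with a decision tree, assuming scikit-learn is available; the categorical features are integer-encoded by hand for brevity:

from sklearn.tree import DecisionTreeClassifier

# Encode SHAPE (ROUND=0, CYLINDER=1) and COLOR (RED=0, YELLOW=1, ORANGE=2)
X_train = [[0, 0], [1, 1], [0, 2]] * 3          # nine training examples
y_train = ["APPLE", "BANANA", "ORANGE"] * 3     # labels

# Training phase: the algorithm learns rules from the labeled data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Testing phase: predict the fruit for SHAPE=ROUND, COLOR=RED
print(model.predict([[0, 0]]))  # expected: ['APPLE']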
Regression
• Regression is a technique for investigating the relationship between independent variables (also called explanatory variables, covariates, domain points, predictors, or feature variables) and a dependent variable (also called the label, target, response, or outcome variable).
• The relationship between the dependent and independent variables can be represented as a function:
y = f(x)
• A line of the form y = ax + b can be fitted to the data points to indicate the relationship between x and y.
Unsupervised Learning
Unsupervised learning is a type of machine learning where the model is trained on unlabeled data. The goal is to find hidden patterns or intrinsic structures in the input data.

Key Components:
1. Training Data: Consists of input data without explicit output labels. For example, customer data with features like age, income, and purchasing behavior, but no predefined categories.
2. Model: Tries to learn the underlying structure of the data. Common models include clustering algorithms and dimensionality reduction techniques.
3. Learning Algorithm: Identifies patterns in the data. For clustering, it groups similar data points together, while for dimensionality reduction, it reduces the number of features while preserving important information.
• Applications:
• Clustering: Grouping similar data points together. Examples include customer segmentation, image segmentation, and grouping news articles.
• Dimensionality Reduction: Reducing the number of features in the data while retaining important information. Examples include Principal Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE).
Types of unsupervised learning

Unsupervised learning is commonly divided into clustering, association rule learning, and dimensionality reduction.
Training phase - clustering

Training data (feature vectors, no labels):

SHAPE     COLOR
ROUND     RED
CYLINDER  YELLOW
ROUND     ORANGE
ROUND     RED
CYLINDER  YELLOW
ROUND     ORANGE
ROUND     RED
CYLINDER  YELLOW
ROUND     ORANGE

An unsupervised learning algorithm produces a model that groups similar data points together.
Testing phase - clustering

Test data (feature vector):

SHAPE     COLOR
ROUND     RED

The model finds that this point is similar to one of the learned groups, so the data is assigned to that group.
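A minimal clustering sketch using k-means, assuming scikit-learn is installed; the features are integer-encoded as in the classification example:

from sklearn.cluster import KMeans

# Unlabeled training data: SHAPE (ROUND=0, CYLINDER=1), COLOR (RED=0, YELLOW=1, ORANGE=2)
X_train = [[0, 0], [1, 1], [0, 2]] * 3

# Training phase: group the points into three clusters
model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(X_train)

# Testing phase: assign a new point (ROUND, RED) to the nearest cluster
print(model.predict([[0, 0]]))  # cluster index of the group it belongs to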
Association rule learning

Transaction ID   Items purchased
1                Bread, Milk
2                Bread, Diaper, Jam, Eggs
3                Milk, Diaper, Jam, Coke
4                Bread, Milk, Diaper, Jam
5                Bread, Milk, Diaper, Coke

Rules learnt:
1. {Diaper} -> {Jam}
2. {Milk, Bread} -> {Eggs, Coke}
3. {Jam, Bread} -> {Milk}
Feature selection: choose a subset of the original features (for example, keeping Feature 2 and Feature 3 out of Features 1-3 and discarding Feature 1).

Feature extraction: transform the original features into a smaller set of new features (for example, combining Features 1-3 into New Feature 1 and New Feature 2).
• Differences Between Supervised and Unsupervised Learning
• Labeled Data: Supervised learning requires labeled data, while unsupervised learning works with unlabeled data.
• Goal: The goal of supervised learning is to make predictions for new data based on the learned mapping from inputs to outputs. The goal of unsupervised learning is to discover the underlying structure of the data.
• Applications: Supervised learning is used for tasks where the output is known and labeled, such as classification and regression. Unsupervised learning is used for tasks where the output is not known, such as clustering and dimensionality reduction.
• Example Scenarios
• Supervised Learning:
• Spam Detection: Given a dataset of emails labeled as spam or not spam, train a model to classify new emails.
• House Price Prediction: Given a dataset of house features and prices, train a model to predict the price of a new house based on its features.
• Unsupervised Learning:
• Customer Segmentation: Given a dataset of customer behaviors, identify distinct groups of customers for targeted marketing.
• Anomaly Detection: Given a dataset of network traffic, identify unusual patterns that might indicate security breaches.
Reinforcement Learning

• Reinforcement learning is an emerging and popular type of machine learning algorithm.
• It is used in various autonomous systems such as cars and industrial robotics.
• Reinforcement learning is used to improve a system's efficiency over time.
• The aim of the algorithm is to reach a goal in a dynamic environment.
• It can reach this goal based on rewards that are provided to it by the system.
Reinforcement Learning
The idea behind reinforcement learning is that an agent learns from the environment by interacting with it and receiving rewards for performing actions.

That is how humans learn: through interaction. Reinforcement learning is a computational approach to learning from action.
Reinforcement Learning
• Topics:
– Policies: what actions should an agent take in a particular situation
– Utility estimation: how good is a state (used by the policy)
• No supervised output, but a delayed reward
• Applications:
– Game playing
– Robot in a maze
– Multiple agents, partial observability, ...
Reinforcement learning

The agent receives input from the environment (the state), takes an action, and the environment returns a reward along with the next state. The model improves through this training loop: for example, given an input image the agent asks "Is this a banana?", acts, and the reward signal eventually teaches it to output "This is a banana."
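A minimal tabular Q-learning sketch on a toy one-dimensional world; the environment, states, and hyperparameters are invented for illustration and are not from the course material:

import numpy as np

# Toy environment: states 0..4 on a line; reaching state 4 yields reward 1.
n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))   # table of state-action values
alpha, gamma, eps = 0.5, 0.9, 0.1     # learning rate, discount factor, exploration rate

rng = np.random.default_rng(0)
for episode in range(200):
    s = 0
    while s != 4:
        # epsilon-greedy action selection
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s_next = max(0, s - 1) if a == 0 else min(4, s + 1)
        r = 1.0 if s_next == 4 else 0.0
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

# Learned policy: action 1 (move right) for states 0..3; state 4 is terminal
print(Q.argmax(axis=1))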
Semi-supervised learning

Supervised learning: all data are labelled -> model
Semi-supervised learning: a small amount of labelled data plus a larger amount of unlabelled data -> model
Unsupervised learning: all data are unlabelled -> model
Example:
• Supervised learning is like a student who is under the supervision of an instructor at home and at college.
• If that student analyzes the same concept on their own, without any help from the instructor, it comes under unsupervised learning.
• Under semi-supervised learning, the student first studies the concept under the guidance of an instructor at college and then revises it on their own.
Applications of Machine Learning
• Machine learning is applied in various domains, including:
• Healthcare: Diagnosing diseases from medical images, predicting patient outcomes, personalized treatment plans.
• Finance: Fraud detection, algorithmic trading, credit scoring.
• Marketing: Customer segmentation, personalized recommendations, sentiment analysis.
• Manufacturing: Predictive maintenance, quality control, supply chain optimization.
• Transportation: Autonomous vehicles, route optimization, traffic prediction.
Polynomial Curve Fitting
Why Polynomial Curve Fitting?
• The purpose of polynomial curve fitting in machine learning is to model complex relationships between the input features and the output target by fitting a polynomial equation to the data.
• This approach allows us to capture non-linear patterns that simple linear models might miss, leading to more accurate predictions and a better understanding of the underlying data trends.
• By fitting polynomials of appropriate degrees, we can create flexible models that generalize well to new data, balancing the trade-off between underfitting and overfitting.
Ex: Polynomial Curve Fitting
• Consider a scenario where we have data on the temperature variations throughout the day.
• Imagine you have data points representing the temperature at different times of the day. The temperature doesn't change linearly but follows a more complex pattern: it rises in the morning, peaks in the afternoon, and drops in the evening. A simple linear model cannot capture this pattern.
• A polynomial curve can model this relationship better than a straight line, fitting a curve that accurately represents these changes.
Polynomial curve fitting:
Polynomial curve fitting involves fitting a polynomial equation of degree n to a set of data points. The goal is to find the coefficients of the polynomial that minimize the difference between the predicted values and the actual data points.
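In symbols (a standard formulation of the model, stated here as a sketch): a degree-n polynomial model is

y(x, w) = w0 + w1*x + w2*x^2 + ... + wn*x^n

and the coefficients w = (w0, ..., wn) are chosen to minimize the sum-of-squares error over the N data points (xi, ti):

E(w) = (1/2) * sum over i of ( y(xi, w) - ti )^2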
Evaluating the Fit: Assess the performance of the fitted polynomial using metrics such as mean squared error (MSE) and visual inspection of the fitted curve against the data points.
• Overfitting and Underfitting

• Overfitting: When the degree of the polynomial is too high, the model may fit the training data very well but fail to generalize to new data. This results in a complex model that captures noise in the data.
• Underfitting: When the degree of the polynomial is too low, the model may be too simple to capture the underlying trend in the data, leading to poor performance on both training and new data.
Ex - Fit a polynomial to the data points

• Example: Fit a polynomial to the following data points
• Given data points: (1,1), (2,4), (3,9), (4,16), (5,25)
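A minimal sketch of the fit using numpy (not the original worked solution): these points lie exactly on y = x^2, so a degree-2 polynomial fits them perfectly.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 4, 9, 16, 25])

# Fit a degree-2 polynomial by least squares
coeffs = np.polyfit(x, y, deg=2)
print(coeffs)            # approximately [1, 0, 0], i.e. y = x^2

# Evaluate the fit: predictions and mean squared error
y_pred = np.polyval(coeffs, x)
print("MSE:", np.mean((y - y_pred) ** 2))  # ~0 for this exact fit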
Regularization
Why Regularization?
• When training a machine learning model, the model can easily become overfitted or underfitted.
• To avoid this, we use regularization in machine learning so that the model fits the training data while still generalizing well to unseen data.
• Regularization techniques help reduce the possibility of overfitting and help us obtain an optimal model.
• Regularization is a technique to prevent the model from overfitting by adding extra information to it.
Regularization
To prevent overfitting, regularization techniques such as Ridge Regression (L2 regularization) can be applied. This involves adding a penalty term to the loss function:
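A standard form of the penalized error (given here as a sketch) is

E(w) = (1/2) * sum over i of ( y(xi, w) - ti )^2 + (lambda/2) * ||w||^2

where lambda is the regularization coefficient: larger values of lambda shrink the coefficients more strongly and discourage overly complex fits.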
Polynomial curve fitting is a powerful technique when the relationship between the variables can be well-approximated by a polynomial. However, choosing the right degree and applying regularization are crucial to avoid overfitting and underfitting, ensuring the model generalizes well to new data.
Benefits and Challenges of Polynomial Curve Fitting
1. Capturing Non-Linear Relationships:
∙ Linear models assume a straight-line relationship between input and output, which is often too simplistic for real-world data.
∙ Polynomial models can represent curves, allowing them to fit more complex patterns. For example, a quadratic polynomial can model parabolic relationships, while higher-degree polynomials can capture even more intricate trends.
2. Improving Predictive Accuracy:
∙ By fitting a polynomial to the data, we can often achieve a closer match between the model predictions and actual data points, leading to lower prediction errors.
∙ This is particularly useful when the true relationship between variables is inherently non-linear.
3. Exploratory Data Analysis:
∙ Fitting polynomials can help visualize the relationship between variables. By plotting the fitted polynomial, we can see how well the model captures the data trends and identify potential areas where the model might need adjustments.
Challenges:
∙ Overfitting: Higher-degree polynomials can fit the training data too closely, capturing noise and leading to poor generalization to new data.
∙ Underfitting: A polynomial with too low a degree might not capture the underlying pattern in the data.

• Understanding the type of trend in your data helps in selecting the appropriate modeling technique and achieving more accurate predictions and insights.
Probability Theory: discrete random variables, fundamental rules, Bayes rule, independence and conditional independence, continuous random variables
Definitions
• Probability: the chance that an uncertain event will occur (always between 0 and 1)
• Sample Space: the collection of all possible outcomes
– ex. all 6 faces of a die: {1, 2, 3, 4, 5, 6}
Events in Sample Space
• Event: Each possible type of occurrence or outcome.
• Simple event
– An outcome from a sample space with one characteristic
– ex. a red card from a deck of cards
• Complement of an event A (denoted A′)
– All outcomes that are not part of event A
– ex. all cards that are not diamonds
• Joint event
– Involves two or more characteristics simultaneously
– ex. an ace that is also red from a deck of cards
• Mutually exclusive events
– Events that cannot occur together (simultaneously)
– ex. B = having a boy; G = having a girl
Events in Sample Space
• Collectively exhaustive events
– One of the events must occur
– The set of events covers the entire sample space

• Example:
– A = aces; B = black cards; C = diamonds; D = hearts
– Events A, B, C and D are collectively exhaustive (but not mutually exclusive – a selected ace may also be a heart)
– Events B, C and D are collectively exhaustive and also mutually exclusive
Visualizing Events in Sample Space
• Contingency table:

          Ace   Not Ace   Total
Black      2      24        26
Red        2      24        26
Total      4      48        52

• Tree diagram (full deck of 52 cards as the sample space):

Full deck -> Black card -> Ace (2) or Not an ace (24)
Full deck -> Red card  -> Ace (2) or Not an ace (24)
Probability theory:
• Key Concepts in Probability Theory
• Random Variables
• A random variable is a variable that can take on different values, each with a certain probability. There are two main types of random variables:
• Discrete Random Variables: Can take on a finite or countably infinite set of values (e.g., the roll of a die).
• Continuous Random Variables: Can take on any value within a range (e.g., the height of a person).
• Probability Distribution
• A probability distribution describes how the probabilities are distributed over the values of the random variable.
– PMF (probability mass function)
– PDF (probability density function)
Probability Distribution
• A probability mass function (PMF) is a function that describes the probability distribution of a discrete random variable. In simpler terms, it tells you the probability of each possible outcome of a random event.
PMF Example...

Roll a fair six-sided die and let the random variable X represent the number you roll. The PMF for this situation is:

X      1    2    3    4    5    6
P(X)  1/6  1/6  1/6  1/6  1/6  1/6

• Each column represents a possible outcome of rolling the die.
• The P(X) row gives the probability of each outcome.
• For example, P(X = 3) = 1/6 means the probability of rolling a 3 is 1/6.
• In mathematical terms, the PMF of a discrete random variable X is p(x) = P(X = x), with 0 <= p(x) <= 1 for every value x, and the probabilities sum to 1: sum over x of p(x) = 1.
Probability Distribution
• A probability density function (PDF) is a function that describes the probability distribution of a continuous random variable. Unlike a discrete random variable, which can take on a finite or countably infinite number of values, a continuous random variable can take on any value within a given range. Examples include height, weight, temperature, and time.
• Unlike a probability mass function (PMF) for discrete variables, the PDF describes the density of probability rather than the probability itself.
The PDF doesn't directly tell you the probability of a specific value. Instead, it gives you the probability density at that value. To find the probability of a range of values, you calculate the area under the PDF curve between those values:

P(a <= X <= b) = integral from a to b of f(x) dx
PDF Example...
Example:
• Let's say we want to find the probability that a person's height is between 170 cm and 180 cm, assuming their height follows a normal distribution with a mean of 175 cm and a standard deviation of 5 cm. We would need to:
1. Calculate the area under the PDF curve between 170 cm and 180 cm. This can be done using integration or by looking up the area in a table.
2. The area under the curve between 170 cm and 180 cm represents the probability that the person's height falls within that range.
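A quick numerical check using scipy (assuming scipy is installed); the interval here is the mean plus or minus one standard deviation, so the familiar ~68% appears:

from scipy import stats

# Height ~ Normal(mean=175, std=5); P(170 <= X <= 180) via the CDF
p = stats.norm.cdf(180, loc=175, scale=5) - stats.norm.cdf(170, loc=175, scale=5)
print(round(p, 4))  # ~0.6827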
Fundamental Rules – Basic Rules in Probability
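The basic rules can be summarized as follows (a standard statement, given here as a sketch):

• 0 <= P(A) <= 1 for any event A, and P(S) = 1 for the sample space S.
• Complement rule: P(A′) = 1 - P(A).
• Addition rule: P(A or B) = P(A) + P(B) - P(A and B); for mutually exclusive events, P(A or B) = P(A) + P(B).
• Product rule: P(A and B) = P(A|B) P(B).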
Visualizing probability
• A is a random variable that denotes an uncertain event
– Example: A = "I'll get an A in the final exam"
• P(A) is "the fraction of possible worlds where A is true"

Picture the event space of all possible worlds as a region of area 1; the worlds in which A is true form a circle inside it, and P(A) is the area of that circle.

Slide: Andrew W. Moore
Definitions
Simple vs. Joint Probability
• Simple (Marginal) Probability refers to the probability of a simple event.
– ex. P(King)

• Joint Probability refers to the probability of an occurrence of two or more events.
– ex. P(King and Spade)
Joint probabilities
• We define the probability of the joint event A and B as follows:
• p(A,B) = p(A ∧ B) = p(A|B)p(B)
• This is sometimes called the product rule.
• Given a joint distribution on two events p(A,B), we define the marginal distribution as follows:

p(A) = sum over b of p(A, B = b) = sum over b of p(A|B = b) p(B = b)

where we are summing over all possible states of B. We can define p(B) similarly.

Joint probability deals with the probability of two or more events happening simultaneously.
Joint Probability Example
Scenario: Suppose you have a bag containing 5 red balls, 3 blue balls, and 2 green balls. You randomly select two balls from the bag without replacement (meaning you don't put the first ball back in).
Events:
• Event A: Selecting a red ball on the first draw.
• Event B: Selecting a blue ball on the second draw.

Joint Probability: Find the probability of both event A and event B happening (i.e., selecting a red ball first and then a blue ball).
Solution:
1. Probability of Event A: The probability of selecting a red ball first is 5 (red balls) / 10 (total balls) = 1/2.
2. Probability of Event B after Event A: After removing a red ball, there are only 9 balls left. The probability of selecting a blue ball next is 3 (blue balls) / 9 (total balls) = 1/3.
3. Joint Probability: To get the joint probability of both events happening, we multiply the individual probabilities:
P(A and B) = P(A) * P(B|A) = (1/2) * (1/3) = 1/6

Interpretation: The probability of selecting a red ball first and then a blue ball is 1/6.
Real-world examples:
• Weather: The joint probability of it raining and being windy on a particular day.
• Health: The joint probability of a person having high blood pressure and diabetes.
• Finance: The joint probability of a stock price going up and the market index going down.
Marginal Probability Distribution Example
Imagine you're studying the relationship between a person's height and their shoe size. You collect data on a group of people, recording their height and shoe size. This data can be represented in a table, where each row corresponds to a person and the columns record that person's height and shoe size.

Now, you want to focus on the probability distribution of just one variable, say height. This is where the marginal probability distribution comes in.
Marginal probability distribution: It describes the probability of each value of a single variable, regardless of the values of other variables. In our example, the marginal probability distribution of height would tell us the probability of a person being 160cm tall, 170cm tall, or 180cm tall, regardless of their shoe size.
Solution:
1. Group the data by height: Count the number of people with each height:
– Height 160cm: 3 people
– Height 170cm: 3 people
– Height 180cm: 3 people
2. Calculate the probability of each height: Divide the count for each height by the total number of people (9 in this case):
– P(Height = 160cm) = 3/9 = 1/3
– P(Height = 170cm) = 3/9 = 1/3
– P(Height = 180cm) = 3/9 = 1/3

Interpretation: The marginal probability distribution of height tells us that the probability of a randomly selected person being 160cm tall is 1/3, and the same holds for 170cm and 180cm.
Real-world Example:
• Imagine a study examining the relationship between a student's GPA and their SAT score. The marginal probability distribution of GPA would tell us the probability of a student having a specific GPA, regardless of their SAT score.

Marginal probabilities help us understand the distribution of a single variable without considering the influence of other variables.
Joint Probability Using a Contingency Table

Event     B1              B2              Total
A1        P(A1 and B1)    P(A1 and B2)    P(A1)
A2        P(A2 and B1)    P(A2 and B2)    P(A2)
Total     P(B1)           P(B2)           1

The cells hold joint probabilities; the row and column totals are the marginal (simple) probabilities.
Example: Joint Probability

P(Red and Ace) = 2/52 = 1/26

          Ace   Not Ace   Total
Black      2      24        26
Red        2      24        26
Total      4      48        52
Example: Marginal (Simple) Probability

P(Ace) = 4/52 = 1/13

          Ace   Not Ace   Total
Black      2      24        26
Red        2      24        26
Total      4      48        52
Conditional Probability
• A conditional probability is the probability of one event, given that another event has occurred:

P(A|B) = P(A and B) / P(B)   (the conditional probability of A given that B has occurred)

P(B|A) = P(A and B) / P(A)   (the conditional probability of B given that A has occurred)

where P(A and B) = joint probability of A and B, P(A) = marginal probability of A, and P(B) = marginal probability of B.
Independence
• Independence and conditional independence
• The conditional probability of A given B is represented by P(A|B). The variables A and B are said to be independent if P(A) = P(A|B) (or equivalently if P(A,B) = P(A)P(B), by the formula for conditional probability).
• Example 1: Suppose Norman and Martin each toss separate coins. Let A represent the variable "Norman's toss outcome", and B represent the variable "Martin's toss outcome". Both A and B have two possible values (Heads and Tails). It would be uncontroversial to assume that A and B are independent. Evidence about B will not change our belief in A.
Conditional Independence
• Example 2: Now suppose both Martin and Norman toss the same coin. Again let A represent the variable "Norman's toss outcome", and B represent the variable "Martin's toss outcome". Assume also that there is a possibility that the coin is biased towards heads, but we do not know this for certain. In this case A and B are not independent. For example, observing that B is Heads causes us to increase our belief in A being Heads (in other words, P(a|b) > P(a) in the case when a = Heads and b = Heads).
• In Example 2 the variables A and B are both dependent on a separate variable C, "the coin is biased towards Heads" (which has the values True or False). Although A and B are not independent, it turns out that once we know for certain the value of C, then any evidence about B cannot change our belief about A. Specifically:
• P(A|C) = P(A|B,C)
• In such a case we say that A and B are conditionally independent given C.
• In many real-life situations, variables which are believed to be independent are actually only independent conditional on some other variable.
Bayes Theorem / Bayesian Inference
• Bayesian inference is a statistical method for reasoning about an unknown situation from evidence:

P(H | E) = P(E | H) * P(H) / P(E)
Bayes Theorem / Bayesian Inference

• P(H | E) – This is referred to as the posterior probability, i.e., the probability of the hypothesis derived from the given evidence. It denotes the conditional probability of H (hypothesis), given the evidence E.
• P(E | H) – This component of Bayes' Theorem denotes the likelihood. It is the conditional probability of the occurrence of the evidence, given the hypothesis. It calculates the probability of the evidence, assuming that the hypothesis holds true.
• P(H) – This is referred to as the prior probability. It denotes the original probability of the hypothesis H being true before any evidence is taken into account.
• P(E) – This is the probability of the occurrence of the evidence regardless of the hypothesis. It is called the marginal probability.
Bayes Theorem - Illustration

Consider the following data set of weather and a corresponding target variable 'Play' (suggesting possibilities of playing). Depending on the weather (sunny, rainy, or overcast), the children will play (Yes) or not play (No).

Problem: Find the probability that children will play given that the weather is sunny.
Step 1: Convert the data set into a frequency table.

Step 2: Create a likelihood table by finding the probabilities, e.g., the overcast probability is 0.29 and the probability of playing is 0.64.
The probability that children will play given that the weather is sunny:
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Here we have:
P(Sunny | Yes) = 3/9 = 0.33
P(Sunny) = 5/14 = 0.36
P(Yes) = 9/14 = 0.64

Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60
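A quick check of this arithmetic in Python, using the probabilities given above:

p_sunny_given_yes, p_yes, p_sunny = 3/9, 9/14, 5/14

# Bayes' rule: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
print(round(p_sunny_given_yes * p_yes / p_sunny, 2))  # 0.6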
Bayes' Theorem
Bayes' Theorem is used to revise previously calculated probabilities based on new information.

It is an extension of conditional probability:

P(Bi | A) = P(A | Bi) P(Bi) / [ P(A | B1) P(B1) + P(A | B2) P(B2) + ... + P(A | Bk) P(Bk) ]

where:
Bi = ith event of k mutually exclusive and collectively exhaustive events
A = new event that might impact P(Bi)
Bayes' Theorem Example
• A drilling company has estimated a 40% chance of striking oil for their new well.
• A detailed test has been scheduled for more information. Historically, 60% of successful wells have had detailed tests, and 20% of unsuccessful wells have had detailed tests.
• Given that this well has been scheduled for a detailed test, what is the probability that the well will be successful?
Bayes' Theorem Example

• Let S = successful well, U = unsuccessful well
• P(S) = .4, P(U) = .6 (prior probabilities)
• Define the detailed test event as D
• Conditional probabilities: P(D|S) = .6, P(D|U) = .2
• Goal: find P(S|D)
Bayes' Theorem Example
Apply Bayes' Theorem:

P(S|D) = P(D|S) P(S) / [ P(D|S) P(S) + P(D|U) P(U) ]
       = (.6)(.4) / [ (.6)(.4) + (.2)(.6) ]
       = .24 / .36 = .667

So the revised probability of success, given that this well has been scheduled for a detailed test, is .667.
Bayes Theorem
Does the patient have cancer or not?
A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have this cancer.

P(cancer) = 0.008          P(¬cancer) = 0.992
P(+|cancer) = 0.98         P(-|cancer) = 0.02
P(+|¬cancer) = 0.03        P(-|¬cancer) = 0.97

P(cancer|+) ∝ P(+|cancer) P(cancer) = 0.98 * 0.008 = 0.0078
P(¬cancer|+) ∝ P(+|¬cancer) P(¬cancer) = 0.03 * 0.992 = 0.0298
Normalizing: P(cancer|+) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21, so even after a positive test the patient most likely does not have cancer.
Statistical fundamentals and terminology for model building and validation

Quantiles
• Quantiles are specific points in a data set that partition the data into intervals of equal probability, or divide the ordered data set into equally sized subsets.
• These points are used to understand the spread and distribution of the data.
• These measures are calculated after arranging the data in ascending order.
Common Quantiles

1. Median (50th percentile): The value that divides the data into two equal halves.
2. Quartiles:
– First Quartile (Q1, 25th percentile): The value below which 25% of the data falls.
– Second Quartile (Q2, 50th percentile or median): The value below which 50% of the data falls.
– Third Quartile (Q3, 75th percentile): The value below which 75% of the data falls.
3. Deciles: Divide the data into 10 equal parts.
4. Percentiles: Divide the data into 100 equal parts.
By splitting the data at the 25th, 50th, and 75th percentiles, the quartiles divide the data into four equal parts.
• In a sample or dataset, the quartiles divide the data into four groups with equal numbers of observations.
• In a probability distribution, the quartiles divide the distribution's range into four intervals with equal probability.
• Interquartile range (IQR): This is the difference between the third quartile and the first quartile. It is effective in identifying outliers in data. The interquartile range describes the middle 50 percent of the data points.

IQR = Q3 − Q1

The IQR is a robust statistic that is less affected by outliers compared to the range, making it a valuable tool in descriptive statistics. To visualize the interquartile range, a box plot is commonly used.
Applications in Machine Learning:
1. Data Preprocessing: Quantiles can be used to normalize or standardize data, helping to scale features appropriately for algorithms.
2. Outlier Detection: By examining the quantiles, one can identify outliers that lie beyond a certain range (e.g., values lower than the 1st percentile or higher than the 99th percentile).
3. Visualization: Quantiles are helpful in creating box plots, which provide a visual summary of the data's distribution, interquartile range, and outliers.
4. Model Evaluation: Techniques like quantile regression allow models to predict different quantiles rather than just the mean, providing a more comprehensive view of the response variable's distribution.
Real-life Applications

• Finance: Identifying the top/bottom 10% of stocks based on performance.
• E-commerce: Recognizing the top 5% of customers based on purchase history.
• Income Distribution: Economists use income percentiles to study inequality within an economy.
• Clinical Trials: Researchers use percentiles to understand how a new drug's effect compares to existing treatments.
Expected Value (Mean)

The expected value (mean) of a discrete random variable X is the probability-weighted average of its values:

E[X] = sum over x of x * P(X = x)

Expected Value (Mean) of a Continuous Random Variable
For continuous random variables, we use integration instead of summation:

E[X] = integral of x * f(x) dx
Variance
• Variance is a measure of the spread or dispersion of a set of values in a probability distribution.
• It quantifies how much the data points in the distribution deviate from the mean (average).
• A higher variance indicates a wider spread of data, while a lower variance indicates data clustered closer to the mean.
• Computed as the average of the squared differences between each data point and the mean:

Var(X) = E[(X − E[X])^2] = E[X^2] − (E[X])^2

• Standard deviation is the square root of variance.
Applications:
Understanding variance is crucial in various fields, including:
• Finance: To assess the risk of investments.
• Statistics: To analyze data and draw meaningful conclusions.
• Machine Learning: To evaluate the performance of models.

In summary, variance provides valuable insight into the spread and variability of data in a probability distribution, helping us understand the uncertainty and risk associated with different scenarios.
Covariance
• Covariance measures the degree to which two random variables change together.
• A positive covariance indicates that the variables tend to move in the same direction; a negative covariance means they move in opposite directions.
• For two random variables X and Y with means E(X) and E(Y), the covariance is:
Cov(X, Y) = E[(X - E(X))(Y - E(Y))]
• Cov(X, Y) > 0: X and Y tend to increase or decrease together.
• Cov(X, Y) < 0: X and Y tend to move in opposite directions.
• Cov(X, Y) = 0: There's no linear relationship between X and Y.
Example:
• Imagine you're tracking the price of gasoline (X) and the number of people who drive to work (Y).
• You might find a negative covariance between these variables: as gas prices rise (X increases), people might opt for alternative transportation (Y decreases).
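A quick sketch of estimating covariance from sample data with numpy; the gas-price and commuter numbers below are invented for illustration:

import numpy as np

gas_price = np.array([3.0, 3.2, 3.5, 3.8, 4.1])   # X: price of gasoline
drivers   = np.array([980, 950, 900, 860, 800])    # Y: people driving to work

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(X, Y)
cov_xy = np.cov(gas_price, drivers)[0, 1]
print(cov_xy)  # negative: the variables move in opposite directions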
Activity 1 – UNIT 1
• Identify a domain: financial investments.
• How are various measures of central tendency (MCTs) and probability distributions used here, say for investments?
• Example: How is variance used to assess the risk of investments?
• List the ML models that can be used for this.
Activity 2 – UNIT 1
• Load any data set from the Kaggle / UCI Repository and perform various measures of central tendency, measures of dispersion, etc.
Python

1. Launch a command prompt if it isn't already open. To do so, open the Windows search bar, type cmd, and click on the icon.
2. Then, run the following command to download the get-pip.py file:
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
3. python get-pip.py
4. pip install numpy
The Python code for the calculation of mean, median, and mode using a numpy array and the stats package is as follows:

>>> import numpy as np
>>> from scipy import stats

>>> data = np.array([4,5,1,2,7,2,6,9,3])

# Calculate Mean
>>> dt_mean = np.mean(data); print ("Mean :", round(dt_mean, 2))

# Calculate Median
>>> dt_median = np.median(data); print ("Median :", dt_median)

# Calculate Mode (the .mode attribute works on both old and new SciPy versions)
>>> dt_mode = stats.mode(data); print ("Mode :", dt_mode.mode)
>>> from statistics import variance, stdev
>>> game_points = np.array([35,56,43,59,63,79,35,41,64,43,93,60,77,24,82])

# Calculate Variance
>>> dt_var = variance(game_points); print ("Sample variance:", round(dt_var, 2))

# Calculate Standard Deviation
>>> dt_std = stdev(game_points); print ("Sample std.dev:", round(dt_std, 2))

# Calculate Range
>>> dt_rng = np.max(game_points, axis=0) - np.min(game_points, axis=0); print ("Range:", dt_rng)

# Calculate percentiles
>>> print ("Quantiles:")
>>> for val in [20, 80, 100]:
...     dt_qntls = np.percentile(game_points, val)
...     print (str(val) + "%", dt_qntls)

# Calculate IQR
>>> q75, q25 = np.percentile(game_points, [75, 25]); print ("Inter quartile range:", q75 - q25)
