
Unit- 3

Statistical Inference

Populations and samples:

A population is a group of individuals, items, or things that share one or more common
characteristics and are of interest to a researcher. For example, a population could be all the
residents of a particular city, all the students enrolled in a university, or all the employees of
a company.

A sample is a subset of the population selected for research purposes. Researchers use
sampling techniques to select a representative sample from the population that can be
studied in order to make inferences about the population as a whole. A sample should be
selected in such a way that it accurately represents the population from which it was drawn.

Sampling is an important aspect of research because it is often impractical or impossible to study an entire population. By studying a sample, researchers can draw conclusions about the population without having to collect data from every member of the population. However, it is important to ensure that the sample is representative of the population in order to avoid bias and ensure the validity of the research findings.

Populations and samples are terms commonly used in statistics and research to describe
groups of individuals or objects that are being studied.

A population refers to the entire group of individuals or objects that meet a certain set of
criteria or characteristics. For example, the population of all adults over the age of 18 in the
United States would be a population.

A sample, on the other hand, refers to a smaller group of individuals or objects selected
from the larger population for the purpose of studying or analyzing the characteristics of
that group. For example, a researcher might select a random sample of 500 adults over the
age of 18 in the United States to study their opinions on a particular political issue.

Sampling is an important part of statistical analysis because it allows researchers to make inferences about the larger population based on data collected from a smaller sample. However, the accuracy of these inferences depends on the quality of the sample and the extent to which it represents the larger population. Therefore, it is important to use appropriate sampling methods and techniques to ensure the validity of statistical analysis.

Types of Statistical modelling:


Statistical modeling refers to the process of using mathematical and statistical techniques to
analyze data and make predictions about future outcomes. There are several types of
statistical modeling, including:

Regression analysis: Regression analysis is a statistical technique used to analyze the relationship between two or more variables. It is often used to predict the value of one variable based on the value of another variable, such as predicting a person's weight based on their height.

Time series analysis: Time series analysis is used to analyze data collected over time. It is
often used in finance, economics, and other fields to make predictions about future trends
and patterns.

Cluster analysis: Cluster analysis is a statistical technique used to group similar objects or
individuals together based on their characteristics. It is often used in market research to
identify consumer segments or in biology to classify different species.

Factor analysis: Factor analysis is used to identify underlying factors or dimensions that
explain the variation in a set of observed variables. It is often used in psychology to identify
underlying personality traits or in marketing to identify underlying brand attributes.

Bayesian analysis: Bayesian analysis is a statistical technique that uses prior knowledge or
assumptions about the data to update and revise predictions based on new data. It is often
used in medical research and other fields where prior knowledge can inform predictions.

Machine learning: Machine learning is a type of statistical modeling that uses algorithms to
identify patterns and make predictions based on data. It is often used in artificial
intelligence, robotics, and other fields where computers can learn from data to make
decisions.

These are just a few examples of the many types of statistical modeling used in research and
analysis. The choice of model will depend on the research question, the type of data being
analyzed, and the goals of the analysis. The most common of these techniques are described in more detail below.

Regression analysis: This technique is used to model the relationship between a dependent
variable and one or more independent variables. Linear regression is a commonly used
technique in which a straight line is used to model the relationship between the variables,
while non-linear regression can model more complex relationships.
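As a rough illustration, here is a minimal Python sketch of linear regression using scikit-learn; the synthetic height/weight data, coefficients, and variable names are made up for the example.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
height_cm = rng.uniform(150, 190, size=100).reshape(-1, 1)            # independent variable
weight_kg = 0.9 * height_cm.ravel() - 90 + rng.normal(0, 5, 100)      # dependent variable with noise

model = LinearRegression()
model.fit(height_cm, weight_kg)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted weight for 175 cm:", model.predict([[175.0]])[0])

The fitted slope and intercept are the model's estimate of the straight-line relationship between the two variables.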

Time series analysis: This technique is used to model data that is collected over time, such
as stock prices or weather patterns. Time series analysis can help identify trends, seasonal
patterns, and other patterns in the data.

Cluster analysis: This technique is used to group data into clusters based on similarities in
their characteristics. Cluster analysis is commonly used in market research to identify
customer segments based on their purchasing habits or preferences.
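As a simple sketch of cluster analysis, the following uses scikit-learn's k-means on invented "customer" features (annual spend and visits per month); the numbers and segment structure are purely illustrative.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
customers = np.vstack([
    rng.normal([200, 2], [30, 0.5], size=(50, 2)),   # a low-spend segment
    rng.normal([900, 8], [80, 1.0], size=(50, 2)),   # a high-spend segment
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.cluster_centers_)   # one centroid per discovered segment
print(kmeans.labels_[:10])       # cluster assignment of the first 10 customers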
Factor analysis: This technique is used to identify underlying factors that explain the
variability in a set of variables. Factor analysis can be used to reduce the number of
variables in a data set and identify the most important factors that explain the variation in
the data.

Decision trees: This technique is used to model decisions and their potential outcomes
based on a set of input variables. Decision trees can be used in fields such as finance,
marketing, and medicine to help make decisions based on the available data.
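A minimal decision-tree sketch with scikit-learn is shown below; the tiny toy dataset (income and credit score predicting loan approval) is invented for illustration.

from sklearn.tree import DecisionTreeClassifier

X = [[30_000, 580], [45_000, 640], [80_000, 700], [120_000, 760]]  # input variables
y = [0, 0, 1, 1]                                                   # decision outcome (0 = reject, 1 = approve)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[60_000, 680]]))   # predicted outcome for a new case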

Bayesian analysis: This technique is used to estimate the probability of an event occurring
based on prior knowledge and new data. Bayesian analysis can be used in a variety of fields,
such as medicine and engineering, to make predictions or make decisions based on
uncertain data.
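As a worked example of the Bayesian idea of updating a prior with new data, the short calculation below applies Bayes' theorem to a diagnostic-test scenario; the prevalence and test-accuracy numbers are made up.

prior = 0.01           # P(disease) before seeing the test result
sensitivity = 0.95     # P(positive test | disease)
false_positive = 0.05  # P(positive test | no disease)

evidence = sensitivity * prior + false_positive * (1 - prior)  # P(positive test)
posterior = sensitivity * prior / evidence                     # P(disease | positive test)
print(round(posterior, 3))  # ~0.161: a positive test raises the 1% prior to about 16%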

Types of probability distributions:


There are several types of probability distributions, each with its own characteristics and
applications in statistics and data analysis. Here are some of the most common ones:

Normal distribution: Also known as the Gaussian distribution, it is one of the most common
probability distributions. It is a continuous distribution with a bell-shaped curve that is
symmetrical around its mean. Many natural phenomena, such as height, weight, and IQ,
follow a normal distribution.
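As a small sketch, the following uses scipy.stats to sample from a normal distribution and evaluate its bell-shaped density; the mean and standard deviation are illustrative values.

import numpy as np
from scipy import stats

heights = stats.norm(loc=170, scale=10)       # mean 170, standard deviation 10
sample = heights.rvs(size=1000, random_state=0)

print(sample.mean(), sample.std())            # close to 170 and 10
print(heights.pdf(170))                       # the density peaks at the mean
print(heights.cdf(190) - heights.cdf(150))    # ~0.95 of the mass lies within two standard deviations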

Binomial distribution: It is a discrete probability distribution that models the probability of a certain number of successes in a fixed number of independent trials. It is used to model outcomes such as the number of heads when flipping a coin a certain number of times.

Poisson distribution: It is a discrete probability distribution that models the probability of a certain number of events occurring in a fixed interval of time or space. It is used to model phenomena such as the number of car accidents in a city over a given period of time.
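The sketch below evaluates both distributions with scipy.stats; the parameters (10 coin flips, an average of 3 accidents per week) are illustrative.

from scipy import stats

# Binomial: number of heads in 10 fair coin flips.
print(stats.binom.pmf(k=5, n=10, p=0.5))   # P(exactly 5 heads) ~= 0.246

# Poisson: number of accidents in a week, with an average rate of 3.
print(stats.poisson.pmf(k=0, mu=3))        # P(no accidents) ~= 0.050
print(stats.poisson.cdf(k=5, mu=3))        # P(at most 5 accidents) ~= 0.916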

Exponential distribution: It is a continuous probability distribution that models the time between two consecutive events in a Poisson process. It is used to model phenomena such as the time between customer arrivals at a service center.

Gamma distribution: It is a continuous probability distribution that generalizes the exponential distribution to allow for more complex shapes. It is used to model phenomena such as the time it takes for a machine to fail after a certain number of uses.

Uniform distribution: It is a continuous probability distribution that models a situation in which all outcomes in a given interval are equally likely. It is used to model phenomena such as a random number drawn from a fixed range.

These are just a few examples of probability distributions, and there are many more that are
used in different fields and applications. Understanding the characteristics and applications
of different probability distributions is important in statistical analysis and modeling.

Parametric and Non parametric models:

Parametric Models:
Parametric models are statistical models that assume a specific probability distribution for
the data being analyzed. These models use a fixed set of parameters to define the
distribution, such as the mean and variance for a normal distribution.

Parametric models are often used in situations where the data can be described by a known
probability distribution, and the goal is to estimate the values of the parameters that define
the distribution. This approach can be more efficient and powerful than non-parametric
models, which make fewer assumptions about the underlying data distribution but may
require larger sample sizes to achieve comparable accuracy.

Some examples of parametric models include:


Linear regression models: These models assume that the relationship between the
dependent variable and the independent variables is linear and can be described by a set of
fixed coefficients.

Logistic regression models: These models assume that the probability of an event occurring
can be modeled using a logistic function with fixed parameters.

Normal distribution models: These models assume that the data follows a normal
distribution with fixed mean and variance parameters.

Poisson distribution models: These models assume that the data follows a Poisson
distribution with a fixed rate parameter.
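A typical parametric step is estimating the fixed parameters of the assumed distribution from data. The sketch below fits a normal distribution to a synthetic sample with scipy; the true mean and standard deviation are invented for the example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=50, scale=5, size=500)   # synthetic sample

mu_hat, sigma_hat = stats.norm.fit(data)       # maximum-likelihood estimates of the parameters
print(mu_hat, sigma_hat)                       # close to the true 50 and 5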

Parametric models can be used in a wide range of applications, including finance, engineering, and social sciences. However, the accuracy of the models depends on the appropriateness of the underlying assumptions and the quality of the data being analyzed.

Nonparametric Models:

Non-parametric models are statistical models that make fewer assumptions about the underlying probability distribution of the data being analyzed. These models do not assume a specific functional form for the probability distribution and instead use flexible, data-driven approaches to estimate the distribution.

Non-parametric models are often used in situations where the underlying probability distribution is unknown, complex, or difficult to specify. They can be especially useful when dealing with small sample sizes or data that deviates significantly from normality.

Some examples of non-parametric models include:

Kernel density estimation: This is a technique used to estimate the probability density function of a random variable. It involves smoothing the data with a kernel function to estimate the underlying distribution.
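The short sketch below uses scipy's gaussian_kde on synthetic, clearly non-normal (bimodal) data; the sample itself is made up for illustration.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)])  # two bumps

kde = gaussian_kde(data)          # smooths the sample with Gaussian kernels
grid = np.linspace(-4, 10, 5)
print(kde(grid))                  # estimated density at a few grid points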

Rank-based methods: These methods use the order or rank of the data rather
than the numerical values to estimate the underlying distribution. Examples
include the Wilcoxon rank-sum test and the Kruskal-Wallis test.
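Both tests are available in scipy; the sketch below compares small made-up samples without assuming normality.

from scipy import stats

group_a = [12, 15, 14, 10, 18, 20, 11]
group_b = [22, 25, 19, 24, 30, 21, 23]

# Wilcoxon rank-sum test for two independent samples.
print(stats.ranksums(group_a, group_b))

# Kruskal-Wallis test generalizes the comparison to three or more groups.
group_c = [16, 17, 15, 19, 14, 18, 20]
print(stats.kruskal(group_a, group_b, group_c))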

Decision trees: These models do not assume a specific probability distribution but instead use a tree-like structure to represent the decision-making process based on the input variables.

Support vector machines: These models use a non-linear mapping of the data
to a higher-dimensional space to separate the classes of data. They do not
assume a specific probability distribution.

Non-parametric models have many advantages, including their flexibility and ability to handle complex data structures. However, they can be computationally intensive and may require larger sample sizes to achieve comparable accuracy to parametric models.

Distance Metrics:

In many real-world applications, we use machine learning algorithms for classifying or recognizing images and for retrieving information through an image's content, for example face recognition, censored images online, retail catalogs, recommendation systems, etc. Choosing a good distance metric becomes really important here. The distance metric helps algorithms to recognize similarities between the contents.

Basic mathematics definition (source: Wikipedia):

A distance metric uses a distance function which provides a relationship measure between the elements in a dataset.

Some of you might be thinking: what is this distance function? How does it work? How does it decide that a particular content or element in the data has any kind of relationship with another one? Well, let's try and find this out in the next couple of sections.
Distance Function
Do you remember studying the Pythagorean theorem? If you do, then you might remember calculating the distance between two data points using the theorem.

In order to calculate the distance between data points A and B, the Pythagorean theorem considers the lengths along the x and y axes:

d(A, B) = sqrt((xB - xA)^2 + (yB - yA)^2)

Many of you must be wondering: do we even use this theorem in machine learning algorithms to find the distance? To answer your question, yes we do. In many machine learning algorithms we use the above formula as a distance function. We will talk about the algorithms where it is used.

Now you probably have an idea of what a distance function is. Here is a simplified definition.

Basic definition from Math.net:

A distance function provides the distance between the elements of a set. If the distance is zero, the elements are equivalent; otherwise they are different from each other.

A distance function is nothing but a mathematical formula used by distance metrics. The distance function can differ across different distance metrics. Let's talk about different distance metrics and understand their role in machine learning modelling.

Distance Metrics
There are a number of distance metrics, but to keep this article
concise, we will only be discussing a few widely used distance
metrics. We will first try to understand the mathematics behind
these metrics and then we will identify the machine learning
algorithms where we use these distance metrics.

Below are the commonly used distance metrics -

Minkowski Distance:
Minkowski distance is a metric in a normed vector space. What is a normed vector space? A normed vector space is a vector space on which a norm is defined. Suppose X is a vector space; then a norm on X is a real-valued function ||x|| which satisfies the conditions below:

1. Zero Vector - The zero vector has zero length.

2. Scalar Factor - The direction of a vector does not change when you multiply it by a positive number, though its length is scaled: ||a·x|| = |a|·||x||.

3. Triangle Inequality - The norm satisfies ||x + y|| ≤ ||x|| + ||y||, so the direct distance between two points is never longer than a route through a third point.

You might be wondering why we need a normed vector space; can we not just go for simple metrics? Because a normed vector space has the properties above, which keep the norm-induced metric homogeneous and translation invariant.

The distance can be calculated using the formula below:

d(x, y) = ( Σ |xi − yi|^p )^(1/p), summed over i = 1, ..., n

Minkowski distance is the generalized distance metric. Here, generalized means that we can manipulate the value of p in the above formula to calculate the distance between two data points in different ways.

As mentioned above, we can manipulate the value of p and calculate the distance in three different ways:

p = 1, Manhattan Distance
p = 2, Euclidean Distance
p = ∞, Chebyshev Distance

We will discuss these distance metrics below in detail.
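The sketch below computes these three cases with scipy.spatial.distance; the two example points are arbitrary.

from scipy.spatial import distance

x = [1.0, 2.0, 3.0]
y = [4.0, 0.0, 3.0]

print(distance.minkowski(x, y, p=1))   # Manhattan distance: 3 + 2 + 0 = 5
print(distance.minkowski(x, y, p=2))   # Euclidean distance: sqrt(9 + 4) ~= 3.606
print(distance.chebyshev(x, y))        # p -> infinity: largest coordinate gap = 3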

Manhattan Distance:

We use Manhattan distance if we need to calculate the distance between two data points along a grid-like path. As mentioned above, we use the Minkowski distance formula to find the Manhattan distance by setting p's value to 1.

Let's say we want to calculate the distance, d, between two data points x and y, where x = (x1, x2, x3, ...) and y = (y1, y2, y3, ...) are vectors with n components each.

The distance d is calculated as the sum of the absolute differences between their Cartesian coordinates:

d = |x1 − y1| + |x2 − y2| + |x3 − y3| + … + |xn − yn|

If you try to visualize this distance calculation, it traces a path along the grid rather than a straight line, which is why Manhattan distance is also known as taxicab geometry, city block distance, etc.
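As a tiny worked example matching the formula above, the Python snippet below sums the absolute coordinate differences directly; the points are arbitrary.

def manhattan(x, y):
    # Sum of absolute differences between corresponding coordinates.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(manhattan([1, 2, 3], [4, 0, 3]))   # |1-4| + |2-0| + |3-3| = 5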

Euclidean Distance:

Euclidean distance is one of the most widely used distance metrics. It is calculated using the Minkowski distance formula by setting p's value to 2. This updates the distance 'd' formula to:

d = sqrt( (x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2 )

Let's stop for a while! Does this formula look familiar? Well yes, we just saw this formula above in this article while discussing the Pythagorean theorem.

The Euclidean distance formula can be used to calculate the distance between two data points in a plane.
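A one-line NumPy sketch of the same calculation, again with arbitrary example points:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(np.linalg.norm(x - y))   # sqrt(3^2 + 2^2 + 0^2) ~= 3.606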

Cosine Distance:
The cosine distance metric is mostly used to find similarities between different documents. In the cosine metric we measure the angle between two documents/vectors (for text, the term frequencies of the documents collected as vectors). This particular metric is used when the magnitude of the vectors does not matter, only their orientation.

The cosine similarity formula can be derived from the equation for the dot product:

cos(θ) = (x · y) / (||x|| · ||y||)

Now, you must be thinking which value of the cosine angle will be helpful in finding out the similarities. The cosine similarity ranges from -1 to 1, so we need to know what 1, 0 and -1 signify.

A cosine value of 1 is for vectors pointing in the same direction, i.e. the documents/data points are similar. A value of 0 is for orthogonal vectors, i.e. unrelated (no similarity). A value of -1 is for vectors pointing in opposite directions, i.e. completely dissimilar.
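The small sketch below computes cosine similarity and cosine distance for two term-frequency style vectors; the count vectors are invented for illustration.

import numpy as np

doc_a = np.array([3, 0, 2, 1], dtype=float)   # term counts in document A
doc_b = np.array([1, 0, 2, 3], dtype=float)   # term counts in document B

cos_sim = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(cos_sim)       # ~0.71: close to 1 = similar orientation, 0 = unrelated, -1 = opposite
print(1 - cos_sim)   # cosine distance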

Mahalanobis Distance:
Mahalanobis distance is used for calculating the distance between two data points in a multivariate space.

The Mahalanobis distance is a measure of the distance between a point P and a distribution D. The idea is to measure how many standard deviations away P is from the mean of D.

The benefit of using the Mahalanobis distance is that it takes covariance into account, which helps in measuring the strength/similarity between two different data objects. The distance between an observation x and the mean μ can be calculated as:

D(x) = sqrt( (x − μ)^T · S^(−1) · (x − μ) )

Here, S is the covariance matrix. We use the inverse of the covariance matrix to get a variance-normalized distance equation.
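The sketch below computes the Mahalanobis distance of a point from the mean of a synthetic 2-D dataset, using the inverse covariance matrix as in the formula above; the data and the chosen point are made up.

import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(4)
data = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 0.8], [0.8, 1.0]], size=500)

mu = data.mean(axis=0)
S_inv = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance matrix

point = np.array([2.0, 1.0])
print(distance.mahalanobis(point, mu, S_inv))       # distance expressed in "standard deviations"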
