
Unit- 3

Statistical Inference

Populations and samples:

A population is a group of individuals, items, or things that share one or more common
characteristics and are of interest to a researcher. For example, a population could be all the
residents of a particular city, all the students enrolled in a university, or all the employees of
a company.

A sample is a subset of the population selected for research purposes. Researchers use
sampling techniques to select a representative sample from the population that can be
studied in order to make inferences about the population as a whole. A sample should be
selected in such a way that it accurately represents the population from which it was drawn.

Sampling is an important aspect of research because it is often impractical or impossible to study an entire population. By studying a sample, researchers can draw conclusions about the population without having to collect data from every member of the population. However, it is important to ensure that the sample is representative of the population in order to avoid bias and ensure the validity of the research findings.

Populations and samples are terms commonly used in statistics and research to describe
groups of individuals or objects that are being studied.

A population refers to the entire group of individuals or objects that meet a certain set of
criteria or characteristics. For example, the population of all adults over the age of 18 in the
United States would be a population.

A sample, on the other hand, refers to a smaller group of individuals or objects selected
from the larger population for the purpose of studying or analyzing the characteristics of
that group. For example, a researcher might select a random sample of 500 adults over the
age of 18 in the United States to study their opinions on a particular political issue.

Sampling is an important part of statistical analysis because it allows researchers to make inferences about the larger population based on data collected from a smaller sample. However, the accuracy of these inferences depends on the quality of the sample and the extent to which it represents the larger population. Therefore, it is important to use appropriate sampling methods and techniques to ensure the validity of statistical analysis.

Types of Statistical modelling:


Statistical modeling refers to the process of using mathematical and statistical techniques to
analyze data and make predictions about future outcomes. There are several types of
statistical modeling, including:

Regression analysis: Regression analysis is a statistical technique used to analyze the relationship between two or more variables. It is often used to predict the value of one variable based on the value of another variable, such as predicting a person's weight based on their height.

Time series analysis: Time series analysis is used to analyze data collected over time. It is
often used in finance, economics, and other fields to make predictions about future trends
and patterns.

Cluster analysis: Cluster analysis is a statistical technique used to group similar objects or
individuals together based on their characteristics. It is often used in market research to
identify consumer segments or in biology to classify different species.

Factor analysis: Factor analysis is used to identify underlying factors or dimensions that
explain the variation in a set of observed variables. It is often used in psychology to identify
underlying personality traits or in marketing to identify underlying brand attributes.

Bayesian analysis: Bayesian analysis is a statistical technique that uses prior knowledge or
assumptions about the data to update and revise predictions based on new data. It is often
used in medical research and other fields where prior knowledge can inform predictions.

Machine learning: Machine learning is a type of statistical modeling that uses algorithms to
identify patterns and make predictions based on data. It is often used in artificial
intelligence, robotics, and other fields where computers can learn from data to make
decisions.

These are just a few examples of the many types of statistical modeling used in research and
analysis. The choice of model will depend on the research question, the type of data being
analyzed, and the goals of the analysis. The most common of these techniques are described in more detail below.

Regression analysis: This technique is used to model the relationship between a dependent
variable and one or more independent variables. Linear regression is a commonly used
technique in which a straight line is used to model the relationship between the variables,
while non-linear regression can model more complex relationships.
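As a rough illustration, here is a minimal Python sketch of linear regression using scikit-learn; the synthetic height/weight data, coefficients, and variable names are made up for the example.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
height_cm = rng.uniform(150, 190, size=100).reshape(-1, 1)            # independent variable
weight_kg = 0.9 * height_cm.ravel() - 90 + rng.normal(0, 5, 100)      # dependent variable with noise

model = LinearRegression()
model.fit(height_cm, weight_kg)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted weight for 175 cm:", model.predict([[175.0]])[0])

The fitted slope and intercept are the model's estimate of the straight-line relationship between the two variables.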

Time series analysis: This technique is used to model data that is collected over time, such
as stock prices or weather patterns. Time series analysis can help identify trends, seasonal
patterns, and other patterns in the data.

Cluster analysis: This technique is used to group data into clusters based on similarities in
their characteristics. Cluster analysis is commonly used in market research to identify
customer segments based on their purchasing habits or preferences.
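As a simple sketch of cluster analysis, the following uses scikit-learn's k-means on invented "customer" features (annual spend and visits per month); the numbers and segment structure are purely illustrative.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
customers = np.vstack([
    rng.normal([200, 2], [30, 0.5], size=(50, 2)),   # a low-spend segment
    rng.normal([900, 8], [80, 1.0], size=(50, 2)),   # a high-spend segment
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.cluster_centers_)   # one centroid per discovered segment
print(kmeans.labels_[:10])       # cluster assignment of the first 10 customers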
Factor analysis: This technique is used to identify underlying factors that explain the
variability in a set of variables. Factor analysis can be used to reduce the number of
variables in a data set and identify the most important factors that explain the variation in
the data.

Decision trees: This technique is used to model decisions and their potential outcomes
based on a set of input variables. Decision trees can be used in fields such as finance,
marketing, and medicine to help make decisions based on the available data.
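A minimal decision-tree sketch with scikit-learn is shown below; the tiny toy dataset (income and credit score predicting loan approval) is invented for illustration.

from sklearn.tree import DecisionTreeClassifier

X = [[30_000, 580], [45_000, 640], [80_000, 700], [120_000, 760]]  # input variables
y = [0, 0, 1, 1]                                                   # decision outcome (0 = reject, 1 = approve)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[60_000, 680]]))   # predicted outcome for a new case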

Bayesian analysis: This technique is used to estimate the probability of an event occurring
based on prior knowledge and new data. Bayesian analysis can be used in a variety of fields,
such as medicine and engineering, to make predictions or make decisions based on
uncertain data.
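As a worked example of the Bayesian idea of updating a prior with new data, the short calculation below applies Bayes' theorem to a diagnostic-test scenario; the prevalence and test-accuracy numbers are made up.

prior = 0.01           # P(disease) before seeing the test result
sensitivity = 0.95     # P(positive test | disease)
false_positive = 0.05  # P(positive test | no disease)

evidence = sensitivity * prior + false_positive * (1 - prior)  # P(positive test)
posterior = sensitivity * prior / evidence                     # P(disease | positive test)
print(round(posterior, 3))  # ~0.161: a positive test raises the 1% prior to about 16%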

Types of probability distributions:


There are several types of probability distributions, each with its own characteristics and
applications in statistics and data analysis. Here are some of the most common ones:

Normal distribution: Also known as the Gaussian distribution, it is one of the most common
probability distributions. It is a continuous distribution with a bell-shaped curve that is
symmetrical around its mean. Many natural phenomena, such as height, weight, and IQ,
follow a normal distribution.
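As a small sketch, the following uses scipy.stats to sample from a normal distribution and evaluate its bell-shaped density; the mean and standard deviation are illustrative values.

import numpy as np
from scipy import stats

heights = stats.norm(loc=170, scale=10)       # mean 170, standard deviation 10
sample = heights.rvs(size=1000, random_state=0)

print(sample.mean(), sample.std())            # close to 170 and 10
print(heights.pdf(170))                       # the density peaks at the mean
print(heights.cdf(190) - heights.cdf(150))    # ~0.95 of the mass lies within two standard deviations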

Binomial distribution: It is a discrete probability distribution that models the probability of a certain number of successes in a fixed number of independent trials. It is used to model outcomes such as the number of heads when flipping a coin a certain number of times.

Poisson distribution: It is a discrete probability distribution that models the probability of a certain number of events occurring in a fixed interval of time or space. It is used to model phenomena such as the number of car accidents in a city over a given period of time.
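The sketch below evaluates both distributions with scipy.stats; the parameters (10 coin flips, an average of 3 accidents per week) are illustrative.

from scipy import stats

# Binomial: number of heads in 10 fair coin flips.
print(stats.binom.pmf(k=5, n=10, p=0.5))   # P(exactly 5 heads) ~= 0.246

# Poisson: number of accidents in a week, with an average rate of 3.
print(stats.poisson.pmf(k=0, mu=3))        # P(no accidents) ~= 0.050
print(stats.poisson.cdf(k=5, mu=3))        # P(at most 5 accidents) ~= 0.916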

Exponential distribution: It is a continuous probability distribution that models the time between two consecutive events in a Poisson process. It is used to model phenomena such as the time between customer arrivals at a service center.

Gamma distribution: It is a continuous probability distribution that generalizes the exponential distribution to allow for more complex shapes. It is used to model phenomena such as the time it takes for a machine to fail after a certain number of uses.

Uniform distribution: It is a continuous probability distribution that models a situation in which all outcomes in a given interval are equally likely. It is used to model phenomena such as a random number drawn from a fixed range.

These are just a few examples of probability distributions, and there are many more that are
used in different fields and applications. Understanding the characteristics and applications
of different probability distributions is important in statistical analysis and modeling.

Parametric and Non parametric models:

Parametric Models:
Parametric models are statistical models that assume a specific probability distribution for
the data being analyzed. These models use a fixed set of parameters to define the
distribution, such as the mean and variance for a normal distribution.

Parametric models are often used in situations where the data can be described by a known
probability distribution, and the goal is to estimate the values of the parameters that define
the distribution. This approach can be more efficient and powerful than non-parametric
models, which make fewer assumptions about the underlying data distribution but may
require larger sample sizes to achieve comparable accuracy.

Some examples of parametric models include:


Linear regression models: These models assume that the relationship between the
dependent variable and the independent variables is linear and can be described by a set of
fixed coefficients.

Logistic regression models: These models assume that the probability of an event occurring
can be modeled using a logistic function with fixed parameters.

Normal distribution models: These models assume that the data follows a normal
distribution with fixed mean and variance parameters.

Poisson distribution models: These models assume that the data follows a Poisson
distribution with a fixed rate parameter.
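A typical parametric step is estimating the fixed parameters of the assumed distribution from data. The sketch below fits a normal distribution to a synthetic sample with scipy; the true mean and standard deviation are invented for the example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=50, scale=5, size=500)   # synthetic sample

mu_hat, sigma_hat = stats.norm.fit(data)       # maximum-likelihood estimates of the parameters
print(mu_hat, sigma_hat)                       # close to the true 50 and 5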

Parametric models can be used in a wide range of applications, including finance, engineering, and social sciences. However, the accuracy of the models depends on the appropriateness of the underlying assumptions and the quality of the data being analyzed.

Nonparametric Models:

Non-parametric models are statistical models that make fewer assumptions about the underlying probability distribution of the data being analyzed. These models do not assume a specific functional form for the probability distribution and instead use flexible, data-driven approaches to estimate the distribution.

Non-parametric models are often used in situations where the underlying probability distribution is unknown, complex, or difficult to specify. They can be especially useful when dealing with small sample sizes or data that deviates significantly from normality.

Some examples of non-parametric models include:

Kernel density estimation: This is a technique used to estimate the probability density function of a random variable. It involves smoothing the data with a kernel function to estimate the underlying distribution.
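The short sketch below uses scipy's gaussian_kde on synthetic, clearly non-normal (bimodal) data; the sample itself is made up for illustration.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)])  # two bumps

kde = gaussian_kde(data)          # smooths the sample with Gaussian kernels
grid = np.linspace(-4, 10, 5)
print(kde(grid))                  # estimated density at a few grid points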

Rank-based methods: These methods use the order or rank of the data rather
than the numerical values to estimate the underlying distribution. Examples
include the Wilcoxon rank-sum test and the Kruskal-Wallis test.
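Both tests are available in scipy; the sketch below compares small made-up samples without assuming normality.

from scipy import stats

group_a = [12, 15, 14, 10, 18, 20, 11]
group_b = [22, 25, 19, 24, 30, 21, 23]

# Wilcoxon rank-sum test for two independent samples.
print(stats.ranksums(group_a, group_b))

# Kruskal-Wallis test generalizes the comparison to three or more groups.
group_c = [16, 17, 15, 19, 14, 18, 20]
print(stats.kruskal(group_a, group_b, group_c))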

Decision trees: These models do not assume a specific probability distribution but instead use a tree-like structure to represent the decision-making process based on the input variables.

Support vector machines: These models use a non-linear mapping of the data
to a higher-dimensional space to separate the classes of data. They do not
assume a specific probability distribution.

Non-parametric models have many advantages, including their flexibility and ability to handle complex data structures. However, they can be computationally intensive and may require larger sample sizes to achieve comparable accuracy to parametric models.

Distance Metrics:

In many real-world applications, we use machine learning algorithms for classifying or recognizing images and for retrieving information through an image's content, for example face recognition, censored images online, retail catalogs, recommendation systems, etc. Choosing a good distance metric becomes really important here. The distance metric helps algorithms to recognize similarities between the contents.

Basic mathematics definition (source: Wikipedia):

A distance metric uses a distance function which provides a relationship measure between the elements in a dataset.

Some of you might be thinking: what is this distance function? How does it work? How does it decide that a particular content or element in the data has any kind of relationship with another one? Well, let's try and find this out in the next couple of sections.
Distance Function
Do you remember studying the Pythagorean theorem? If you do, then you might remember calculating the distance between two data points using the theorem.

In order to calculate the distance between data points A and B, the Pythagorean theorem considers the lengths along the x and y axes:

d(A, B) = sqrt((xB - xA)^2 + (yB - yA)^2)

Many of you must be wondering: do we even use this theorem in machine learning algorithms to find the distance? To answer your question, yes we do. In many machine learning algorithms we use the above formula as a distance function. We will talk about the algorithms where it is used.

Now you probably have an idea of what a distance function is. Here is a simplified definition.

Basic definition from Math.net:

A distance function provides the distance between the elements of a set. If the distance is zero, the elements are equivalent; otherwise they are different from each other.

A distance function is nothing but a mathematical formula used by distance metrics. The distance function can differ across different distance metrics. Let's talk about different distance metrics and understand their role in machine learning modelling.

Distance Metrics
There are a number of distance metrics, but to keep this article
concise, we will only be discussing a few widely used distance
metrics. We will first try to understand the mathematics behind
these metrics and then we will identify the machine learning
algorithms where we use these distance metrics.

Below are the commonly used distance metrics -

Minkowski Distance:
Minkowski distance is a metric in a normed vector space. What is a normed vector space? A normed vector space is a vector space on which a norm is defined. Suppose X is a vector space; then a norm on X is a real-valued function ||x|| which satisfies the conditions below:

1. Zero Vector - The zero vector has zero length.

2. Scalar Factor - The direction of a vector does not change when you multiply it by a positive number, though its length is scaled: ||a·x|| = |a|·||x||.

3. Triangle Inequality - The norm satisfies ||x + y|| ≤ ||x|| + ||y||, so the direct distance between two points is never longer than a route through a third point.

You might be wondering why we need a normed vector space; can we not just go for simple metrics? Because a normed vector space has the properties above, which keep the norm-induced metric homogeneous and translation invariant.

The distance can be calculated using the formula below:

d(x, y) = ( Σ |xi − yi|^p )^(1/p), summed over i = 1, ..., n

Minkowski distance is the generalized distance metric. Here, generalized means that we can manipulate the value of p in the above formula to calculate the distance between two data points in different ways.

As mentioned above, we can manipulate the value of p and calculate the distance in three different ways:

p = 1, Manhattan Distance
p = 2, Euclidean Distance
p = ∞, Chebyshev Distance

We will discuss these distance metrics below in detail.
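The sketch below computes these three cases with scipy.spatial.distance; the two example points are arbitrary.

from scipy.spatial import distance

x = [1.0, 2.0, 3.0]
y = [4.0, 0.0, 3.0]

print(distance.minkowski(x, y, p=1))   # Manhattan distance: 3 + 2 + 0 = 5
print(distance.minkowski(x, y, p=2))   # Euclidean distance: sqrt(9 + 4) ~= 3.606
print(distance.chebyshev(x, y))        # p -> infinity: largest coordinate gap = 3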

Manhattan Distance:

We use Manhattan distance if we need to calculate the distance between two data points along a grid-like path. As mentioned above, we use the Minkowski distance formula to find the Manhattan distance by setting p's value to 1.

Let's say we want to calculate the distance, d, between two data points x and y, where x = (x1, x2, x3, ...) and y = (y1, y2, y3, ...) are vectors with n components each.

The distance d is calculated as the sum of the absolute differences between their Cartesian coordinates:

d = |x1 − y1| + |x2 − y2| + |x3 − y3| + … + |xn − yn|

If you try to visualize this distance calculation, it traces a path along the grid rather than a straight line, which is why Manhattan distance is also known as taxicab geometry, city block distance, etc.
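As a tiny worked example matching the formula above, the Python snippet below sums the absolute coordinate differences directly; the points are arbitrary.

def manhattan(x, y):
    # Sum of absolute differences between corresponding coordinates.
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(manhattan([1, 2, 3], [4, 0, 3]))   # |1-4| + |2-0| + |3-3| = 5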

Euclidean Distance:

Euclidean distance is one of the most widely used distance metrics. It is calculated using the Minkowski distance formula by setting p's value to 2. This updates the distance 'd' formula to:

d = sqrt( (x1 − y1)^2 + (x2 − y2)^2 + … + (xn − yn)^2 )

Let's stop for a while! Does this formula look familiar? Well yes, we just saw this formula above in this article while discussing the Pythagorean theorem.

The Euclidean distance formula can be used to calculate the distance between two data points in a plane.
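A one-line NumPy sketch of the same calculation, again with arbitrary example points:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(np.linalg.norm(x - y))   # sqrt(3^2 + 2^2 + 0^2) ~= 3.606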

Cosine Distance:
The cosine distance metric is mostly used to find similarities between different documents. In the cosine metric we measure the angle between two documents/vectors (for text, the term frequencies of the documents collected as vectors). This particular metric is used when the magnitude of the vectors does not matter, only their orientation.

The cosine similarity formula can be derived from the equation for the dot product:

cos(θ) = (x · y) / (||x|| · ||y||)

Now, you must be thinking which value of the cosine angle will be helpful in finding out the similarities. The cosine similarity ranges from -1 to 1, so we need to know what 1, 0 and -1 signify.

A cosine value of 1 is for vectors pointing in the same direction, i.e. the documents/data points are similar. A value of 0 is for orthogonal vectors, i.e. unrelated (no similarity). A value of -1 is for vectors pointing in opposite directions, i.e. completely dissimilar.
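The small sketch below computes cosine similarity and cosine distance for two term-frequency style vectors; the count vectors are invented for illustration.

import numpy as np

doc_a = np.array([3, 0, 2, 1], dtype=float)   # term counts in document A
doc_b = np.array([1, 0, 2, 3], dtype=float)   # term counts in document B

cos_sim = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(cos_sim)       # ~0.71: close to 1 = similar orientation, 0 = unrelated, -1 = opposite
print(1 - cos_sim)   # cosine distance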

Mahalanobis Distance:
Mahalanobis distance is used for calculating the distance between two data points in a multivariate space.

The Mahalanobis distance is a measure of the distance between a point P and a distribution D. The idea is to measure how many standard deviations away P is from the mean of D.

The benefit of using the Mahalanobis distance is that it takes covariance into account, which helps in measuring the strength/similarity between two different data objects. The distance between an observation x and the mean μ can be calculated as:

D(x) = sqrt( (x − μ)^T · S^(−1) · (x − μ) )

Here, S is the covariance matrix. We use the inverse of the covariance matrix to get a variance-normalized distance equation.
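The sketch below computes the Mahalanobis distance of a point from the mean of a synthetic 2-D dataset, using the inverse covariance matrix as in the formula above; the data and the chosen point are made up.

import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(4)
data = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 0.8], [0.8, 1.0]], size=500)

mu = data.mean(axis=0)
S_inv = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance matrix

point = np.array([2.0, 1.0])
print(distance.mahalanobis(point, mu, S_inv))       # distance expressed in "standard deviations"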
