Unit 3 DSML
Statistical Inference
A population is a group of individuals, items, or things that share one or more common
characteristics and are of interest to a researcher. For example, a population could be all the
residents of a particular city, all the students enrolled in a university, or all the employees of
a company.
A sample is a subset of the population selected for research purposes. Researchers use
sampling techniques to select a representative sample from the population that can be
studied in order to make inferences about the population as a whole. A sample should be
selected in such a way that it accurately represents the population from which it was drawn.
Populations and samples are terms commonly used in statistics and research to describe
groups of individuals or objects that are being studied.
A population refers to the entire group of individuals or objects that meet a certain set of
criteria or characteristics. For example, all adults over the age of 18 in the United States
would constitute a population.
A sample, on the other hand, refers to a smaller group of individuals or objects selected
from the larger population for the purpose of studying or analyzing the characteristics of
that group. For example, a researcher might select a random sample of 500 adults over the
age of 18 in the United States to study their opinions on a particular political issue.
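To make the population/sample distinction concrete, here is a minimal sketch of drawing a simple random sample with NumPy. The population of ages is synthetic and purely illustrative; in practice the sample would come from a real sampling frame.

```python
# A minimal sketch of simple random sampling, assuming a synthetic
# "population" of ages for illustration.
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical population: ages of 100,000 adults (18-90).
population = rng.integers(18, 91, size=100_000)

# Draw a simple random sample of 500 individuals without replacement.
sample = rng.choice(population, size=500, replace=False)

print("Population mean age:", population.mean())
print("Sample mean age:    ", sample.mean())
```

With a well-drawn random sample, the sample mean should land close to the population mean, which is what lets us infer population characteristics from the sample.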
Types of Statistical Modeling:
Time series analysis: Time series analysis is used to analyze data collected over time. It is
often used in finance, economics, and other fields to make predictions about future trends
and patterns.
Cluster analysis: Cluster analysis is a statistical technique used to group similar objects or
individuals together based on their characteristics. It is often used in market research to
identify consumer segments or in biology to classify different species.
Factor analysis: Factor analysis is used to identify underlying factors or dimensions that
explain the variation in a set of observed variables. It is often used in psychology to identify
underlying personality traits or in marketing to identify underlying brand attributes.
Bayesian analysis: Bayesian analysis is a statistical technique that uses prior knowledge or
assumptions about the data to update and revise predictions based on new data. It is often
used in medical research and other fields where prior knowledge can inform predictions.
Machine learning: Machine learning is a type of statistical modeling that uses algorithms to
identify patterns and make predictions based on data. It is often used in artificial
intelligence, robotics, and other fields where computers can learn from data to make
decisions.
These are just a few examples of the many types of statistical modeling used in research and
analysis. The choice of model will depend on the research question, the type of data being
analyzed, and the goals of the analysis.
Regression analysis: This technique is used to model the relationship between a dependent
variable and one or more independent variables. Linear regression is a commonly used
technique in which a straight line is used to model the relationship between the variables,
while non-linear regression can model more complex relationships.
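As a rough illustration, the sketch below fits a straight line by least squares with NumPy; the data and the true slope and intercept are made up for the example.

```python
# A minimal linear-regression sketch on synthetic data; the slope (2.5)
# and intercept (1.0) below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)           # independent variable
y = 2.5 * x + 1.0 + rng.normal(0, 1, 50)  # dependent variable with noise

# Fit a straight line y = a*x + b by least squares.
a, b = np.polyfit(x, y, deg=1)
print(f"estimated slope={a:.2f}, intercept={b:.2f}")
```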
Time series analysis: This technique is used to model data that is collected over time, such
as stock prices or weather patterns. Time series analysis can help identify trends, seasonal
patterns, and other regularities in the data.
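A minimal time series sketch, assuming pandas and a synthetic daily "price" series: a 7-day rolling mean smooths the day-to-day noise so the trend is easier to see.

```python
# A small sketch of trend extraction with a rolling mean; the price
# series is a synthetic random walk, purely for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.date_range("2023-01-01", periods=120, freq="D")
prices = pd.Series(100 + np.cumsum(rng.normal(0, 1, 120)), index=dates)

# A 7-day rolling mean highlights the trend and damps daily noise.
trend = prices.rolling(window=7).mean()
print(trend.tail())
```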
Cluster analysis: This technique is used to group data into clusters based on similarities in
their characteristics. Cluster analysis is commonly used in market research to identify
customer segments based on their purchasing habits or preferences.
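As an illustrative sketch of clustering, the code below runs k-means from scikit-learn on synthetic two-feature "customer" data; the three underlying groups and the choice k = 3 are assumptions made for the example.

```python
# A minimal k-means sketch; the customer features and cluster count
# are illustrative assumptions, not from the text.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# e.g. [annual spend, visits per month] for 150 hypothetical customers
X = np.vstack([
    rng.normal([20, 2], 1.5, size=(50, 2)),
    rng.normal([50, 8], 1.5, size=(50, 2)),
    rng.normal([80, 4], 1.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # one centre per customer segment
```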
Factor analysis: This technique is used to identify underlying factors that explain the
variability in a set of variables. Factor analysis can be used to reduce the number of
variables in a data set and identify the most important factors that explain the variation in
the data.
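A small factor-analysis sketch with scikit-learn: six observed variables are generated (synthetically) from two latent factors, and FactorAnalysis tries to recover the loadings. The data-generating setup is an assumption for the example.

```python
# A minimal factor-analysis sketch on synthetic data driven by two
# hidden factors.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
latent = rng.normal(size=(200, 2))    # two hidden factors
loadings = rng.normal(size=(2, 6))    # how factors drive the variables
X = latent @ loadings + rng.normal(0, 0.3, size=(200, 6))

fa = FactorAnalysis(n_components=2).fit(X)
print(fa.components_.shape)  # (2, 6): estimated factor loadings
```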
Decision trees: This technique is used to model decisions and their potential outcomes
based on a set of input variables. Decision trees can be used in fields such as finance,
marketing, and medicine to help make decisions based on the available data.
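As a quick sketch, the code below trains a small decision tree with scikit-learn on the bundled iris data and prints the learned rules; the depth limit of 3 is an arbitrary illustrative choice.

```python
# A minimal decision-tree sketch; max_depth=3 is an assumption made
# to keep the printed rules short.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Print the learned decision rules.
print(export_text(tree, feature_names=list(load_iris().feature_names)))
```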
Bayesian analysis: This technique is used to estimate the probability of an event occurring
based on prior knowledge and new data. Bayesian analysis can be used in a variety of fields,
such as medicine and engineering, to make predictions or decisions based on uncertain
data.
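A minimal Bayesian-updating sketch: a Beta prior over an unknown success probability is revised after observing new binary data (the conjugate Beta-binomial update). The prior (2, 2) and the observed counts are illustrative assumptions.

```python
# A small conjugate Bayesian update: Beta prior + binomial data
# yields a Beta posterior. Prior and counts are made up.
from scipy import stats

prior_a, prior_b = 2, 2       # prior belief: roughly 50% success
successes, failures = 18, 7   # newly observed data

post = stats.beta(prior_a + successes, prior_b + failures)
print("posterior mean:", post.mean())
print("95% credible interval:", post.interval(0.95))
```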
Probability Distributions:
Normal distribution: Also known as the Gaussian distribution, it is one of the most common
probability distributions. It is a continuous distribution with a bell-shaped curve that is
symmetrical around its mean. Many natural phenomena, such as height, weight, and IQ,
follow a normal distribution.
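As a small illustration with SciPy, the sketch below queries a normal distribution; modelling adult heights with mean 170 cm and standard deviation 10 cm is an assumption made for the example.

```python
# A minimal normal-distribution sketch; the height parameters are
# illustrative assumptions.
from scipy import stats

heights = stats.norm(loc=170, scale=10)  # mean 170 cm, sd 10 cm

print(heights.pdf(170))        # density peaks at the mean
print(heights.cdf(180))        # P(height <= 180) ~ 0.84 (one sd above)
print(heights.interval(0.95))  # ~ (150.4, 189.6): central 95% range
```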
These are just a few examples of probability distributions, and there are many more that are
used in different fields and applications. Understanding the characteristics and applications
of different probability distributions is important in statistical analysis and modeling.
Parametric Models:
Parametric models are statistical models that assume a specific probability distribution for
the data being analyzed. These models use a fixed set of parameters to define the
distribution, such as the mean and variance for a normal distribution.
Parametric models are often used in situations where the data can be described by a known
probability distribution, and the goal is to estimate the values of the parameters that define
the distribution. This approach can be more efficient and powerful than non-parametric
models, which make fewer assumptions about the underlying data distribution but may
require larger sample sizes to achieve comparable accuracy.
Logistic regression models: These models assume that the probability of an event occurring
can be modeled using a logistic function with fixed parameters.
Normal distribution models: These models assume that the data follows a normal
distribution with fixed mean and variance parameters.
Poisson distribution models: These models assume that the data follows a Poisson
distribution with a fixed rate parameter.
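As a rough sketch of parametric estimation for the last two items, the code below fits the fixed parameters of a normal and a Poisson model to synthetic data; the true parameter values are assumptions for the example.

```python
# A minimal parametric-fitting sketch on synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
normal_data = rng.normal(loc=5.0, scale=2.0, size=1000)
counts = rng.poisson(lam=3.0, size=1000)

# Normal model: estimate the mean and standard deviation.
mu, sigma = stats.norm.fit(normal_data)
print(f"normal fit: mu={mu:.2f}, sigma={sigma:.2f}")

# Poisson model: the MLE of the rate parameter is the sample mean.
print(f"poisson rate estimate: {counts.mean():.2f}")
```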
Nonparametric Models:
Rank-based methods: These methods use the order or rank of the data rather
than the numerical values to estimate the underlying distribution. Examples
include the Wilcoxon rank-sum test and the Kruskal-Wallis test.
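A quick sketch of a rank-based comparison using SciPy's Wilcoxon rank-sum test; the two synthetic samples are assumed to differ slightly in location.

```python
# A minimal rank-sum test sketch: the test uses only the ranks of the
# pooled observations, not their raw values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group_a = rng.normal(0.0, 1.0, size=40)
group_b = rng.normal(0.5, 1.0, size=40)

stat, p_value = stats.ranksums(group_a, group_b)
print(f"statistic={stat:.2f}, p-value={p_value:.4f}")
```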
Support vector machines: These models use a non-linear mapping of the data
to a higher-dimensional space to separate the classes of data. They do not
assume a specific probability distribution.
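As an illustration, the sketch below fits an RBF-kernel SVM from scikit-learn to the classic two-moons data, which no straight line in the original space can separate; the kernel and C value are default-style illustrative choices.

```python
# A minimal kernel-SVM sketch: the RBF kernel maps the data
# non-linearly so a linear separator works in the mapped space.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-circles: not separable by a straight line.
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```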
Distance Metrics:
There are many distance metrics, but to keep this section concise we will discuss only a
few widely used ones. We will first look at the mathematics behind each metric and then
identify the machine learning algorithms where it is used.
Minkowski Distance:
Minkowski distance is a metric on a normed vector space. What is a normed vector space?
A normed vector space is a vector space on which a norm is defined. Suppose X is a vector
space; then a norm on X is a real-valued function ||x|| which satisfies the following
conditions:
1. ||x|| ≥ 0, and ||x|| = 0 if and only if x is the zero vector (non-negativity)
2. ||αx|| = |α| ||x|| for every scalar α (absolute homogeneity)
3. ||x + y|| ≤ ||x|| + ||y|| (triangle inequality)
The Minkowski distance of order p between two points x = (x1, ..., xn) and
y = (y1, ..., yn) is
d(x, y) = (Σ |xi − yi|^p)^(1/p)
Its special cases are:
p = 1, Manhattan Distance
p = 2, Euclidean Distance
p = ∞, Chebyshev Distance
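A small sketch computing the three special cases directly with NumPy on two example vectors (the vector values are arbitrary).

```python
# Minkowski distance for p = 1, 2, and infinity on example vectors.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

print("p=1 (Manhattan):", minkowski(x, y, 1))       # 5.0
print("p=2 (Euclidean):", minkowski(x, y, 2))       # ~3.61
print("p=inf (Chebyshev):", np.max(np.abs(x - y)))  # 3.0
```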
Manhattan Distance:
The Manhattan (L1) distance is the sum of the absolute differences of the coordinates,
d(x, y) = Σ |xi − yi|, i.e. the distance travelled along a city-block grid.
Euclidean Distance:
The Euclidean (L2) distance is the straight-line distance between two points,
d(x, y) = √(Σ (xi − yi)^2).
Cosine Distance:
Cosine distance is mostly used to find similarities between different documents. With the
cosine metric we measure the angle between two documents/vectors (the term frequencies
of the documents collected as vectors). This metric is used when the magnitude of the
vectors does not matter, only their orientation. Cosine similarity is
cos θ = (x · y) / (||x|| ||y||), and cosine distance is defined as 1 − cos θ.
A cosine value of 1 means the vectors point in the same direction, i.e. the
documents/data points are similar. A value of 0 means the vectors are orthogonal, i.e.
unrelated (no similarity). A value of −1 means the vectors point in opposite directions
(maximally dissimilar).
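A minimal cosine-similarity sketch on made-up term-frequency vectors; the second "document" is deliberately a scaled copy of the first, to show that magnitude is ignored.

```python
# Cosine similarity on toy term-frequency vectors over a shared
# vocabulary; counts are made up for illustration.
import numpy as np

doc_a = np.array([3, 0, 1, 2])  # term counts in document A
doc_b = np.array([6, 0, 2, 4])  # same direction, larger magnitude

cos_sim = doc_a @ doc_b / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print("cosine similarity:", cos_sim)      # 1.0: same orientation
print("cosine distance:  ", 1 - cos_sim)  # 0.0: magnitude is ignored
```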
Mahalanobis Distance:
Mahalanobis distance is used for calculating the distance between two data points in a
multivariate space, taking the correlations between the variables into account.
More precisely, the Mahalanobis distance is a measure of the distance between a point P
and a distribution D: it expresses how many standard deviations away P is from the mean
of D. With μ the mean of D and Σ its covariance matrix,
D_M(P) = √((P − μ)ᵀ Σ⁻¹ (P − μ)).
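As an illustrative sketch, the code below measures the Mahalanobis distance of a point P from a synthetic correlated 2-D distribution D using NumPy; the distribution's mean, covariance, and the point P are assumptions for the example.

```python
# Mahalanobis distance of a point from a synthetic 2-D distribution.
import numpy as np

rng = np.random.default_rng(6)
D = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=500)

mu = D.mean(axis=0)                             # mean of D
cov_inv = np.linalg.inv(np.cov(D, rowvar=False))  # inverse covariance

p = np.array([2.0, 2.0])
diff = p - mu
dist = np.sqrt(diff @ cov_inv @ diff)
print("Mahalanobis distance of P from D:", dist)
```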