
Chapter 2 Statistical Data Analysis

2.1 Role of statistics in data science

2.2 Descriptive statistics: Measuring the Frequency; Measuring the Central Tendency: Mean, Median, and Mode; Measuring the Dispersion: Range, Standard Deviation, Variance, Interquartile Range

2.3 Inferential statistics: Hypothesis testing, Multiple hypothesis testing, Parameter estimation methods

2.4 Measuring Data Similarity and Dissimilarity: Data Matrix versus Dissimilarity Matrix, Proximity Measures for Nominal Attributes, Proximity Measures for Binary Attributes, Dissimilarity of Numeric Data: Euclidean, Manhattan, and Minkowski distances, Proximity Measures for Ordinal Attributes

2.5 Concept of Outlier, types of outliers, outlier detection methods
Role of statistics in data science

"Statistics" means the practice or science of collecting and analyzing numerical data in large quantities.

The most important aspect of any Data Science approach is how the information is processed. Developing insights from data essentially means digging out the possibilities it contains, and in Data Science those possibilities are explored through Statistical Analysis.

Many of us wonder how data in the form of text, images, videos, and other highly unstructured formats can be processed so easily by Machine Learning models. The truth is that we first convert that data into a numerical form, which is not exactly our data but a numerical equivalent of it. This brings us to a very important aspect of Data Science.

With data in numerical format, we have many possibilities to understand the information it contains. Statistics acts as a pathway to understand your data and process it for successful results. The power of statistics is not limited to understanding the data: it also provides methods to measure the success of our insights, to compare different approaches to the same problem, and to choose the right mathematical approach for your data.

Some ways in which Statistics helps in Data Science are:

1. Prediction and Classification: Statistics helps in the prediction and classification of data, for example deciding whether content would be right for a client based on their previous usage of data.

2. Helps to create Probability Distributions and Estimation: Probability distributions (mathematical functions that give the probabilities of occurrence of the different possible outcomes of an experiment) and estimation are crucial to understanding the basics of machine learning and of algorithms like logistic regression (which gives the probability of an event occurring, such as whether a person voted or did not vote).

Cross-validation techniques are also inherently statistical tools that have been brought into the Machine Learning and Data Analytics world for inference-based research, A/B testing and hypothesis testing.
3. Pattern Detection and Grouping: Statistics helps in picking out the optimal data and weeding out the unnecessary dump of data for companies that like their work organised. It also helps spot anomalies, which further helps in processing the right data.

4. Powerful Insights: Dashboards, charts, reports and other types of data visualization, in the form of interactive and effective representations, give much more powerful insights than plain data, and they also make the data more readable and interesting.

5. Segmentation and Optimization: Statistics also segments the data according to the different demographic or psychographic factors that affect its processing, and it optimizes the data so as to minimize risk and maximize outputs.
What is Statistics?

Statistics is the science of collecting data and analyzing it so that properties of a representative sample can be used to infer properties of the population. In other words, statistics is interpreting data in order to make predictions about the population.

Branches of Statistics:

There are two branches of Statistics.

 DESCRIPTIVE STATISTICS: Descriptive statistics are statistics or measures that describe the data.

 INFERENTIAL STATISTICS: Using a random sample of data taken from a population to describe and make inferences (conclusions) about the population is called inferential statistics.

Descriptive Statistics

Descriptive Statistics is summarizing the data at hand through certain numbers like mean, median etc. so as to make the understanding of the data easier. It does not involve any generalization or inference beyond what is available. This means that descriptive statistics are just the representation of the data (sample) available and not based on any theory of probability.

Commonly Used Measures

1. Measures of Central Tendency

2. Measures of Dispersion (or Variability)

Measures of Central Tendency

A Measure of Central Tendency is a one-number summary of the data that typically describes the center of the data. This one-number summary is of three types.

1. Mean: Mean is defined as the ratio of the sum of all the observations in the data to the total number of observations. This is also known as the Average. Thus the mean is a number around which the entire data set is spread.

2. Median: Median is the point which divides the entire data into two equal halves. One half of the data is less than the median, and the other half is greater than it. The median is calculated by first arranging the data in either ascending or descending order.

 If the number of observations is odd, the median is given by the middle observation in the sorted form.

 If the number of observations is even, the median is given by the mean of the two middle observations in the sorted form.

An important point to note is that the order of the data (ascending or descending) does not affect the median.

3. Mode: Mode is the number which has the maximum frequency in the entire data set; in other words, the mode is the number that appears the maximum number of times. A data set can have one or more than one mode.

 If there is only one number that appears the maximum number of times, the data has one mode, and is called Uni-modal.

 If there are two numbers that appear the maximum number of times, the data has two modes, and is called Bi-modal.

 If there are more than two numbers that appear the maximum number of times, the data has more than two modes, and is called Multi-modal.

Example to compute the Measures of Central Tendency

Consider the following data points.


17, 16, 21, 18, 15, 17, 21, 19, 11, 23

 Mean: Mean is calculated as (17 + 16 + 21 + 18 + 15 + 17 + 21 + 19 + 11 + 23) / 10 = 178 / 10 = 17.8

 Median: To calculate the Median, let's arrange the data in ascending order.

11, 15, 16, 17, 17, 18, 19, 21, 21, 23

Since the number of observations is even (10), the median is given by the average of the two middle observations (the 5th and 6th here), i.e. (17 + 18) / 2 = 17.5.

 Mode: Mode is given by the number that occurs the maximum number of times. Here, 17 and 21 both occur twice. Hence, this is bimodal data and the modes are 17 and 21.
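The three measures above can be checked with a short Python sketch (not part of the original notes; it uses only the standard library):

import statistics

data = [17, 16, 21, 18, 15, 17, 21, 19, 11, 23]

print(statistics.mean(data))       # 17.8
print(statistics.median(data))     # 17.5, the average of the 5th and 6th sorted values
print(statistics.multimode(data))  # [17, 21], so the data is bimodal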

Note:
1. Since the Median and Mode do not take all the data points into account, they are robust to outliers, i.e. they are not affected by outliers.

2. At the same time, the Mean shifts towards outliers as it considers all the data points. This means that if the outlier is big, the mean overestimates the data, and if it is small, the data is underestimated.

3. If the distribution is symmetrical, Mean = Median = Mode. The normal distribution is an example.

Measures of Dispersion (or Variability)

Measures of Dispersion describe the spread of the data around the central value (or the Measures of Central Tendency).

1. Absolute Deviation from Mean: The Absolute Deviation from Mean, also called the Mean Absolute Deviation (MAD), describes the variation in the data set, in the sense that it tells the average absolute distance of each data point from the mean. It is calculated as MAD = (1/n) Σ |xi − x̄|.

2. Variance: Variance measures how far data points are spread out from the mean. A high variance indicates that the data points are spread widely and a small variance indicates that the data points are closer to the mean of the data set. It is calculated as Variance = (1/n) Σ (xi − x̄)².

3. Standard Deviation: The square root of the Variance is called the Standard Deviation. It is calculated as Standard Deviation = √Variance.

4. Range: Range is the difference between the Maximum value and the Minimum value in the data set. It is given as Range = Maximum − Minimum.

5. Quartiles: Quartiles are the points in the data set that divide the data set into four equal parts. Q1, Q2 and Q3 are the first, second and third quartiles of the data set.

 25% of the data points lie below Q1 and 75% lie above it.

 50% of the data points lie below Q2 and 50% lie above it. Q2 is nothing but the Median.

 75% of the data points lie below Q3 and 25% lie above it.

The Interquartile Range (IQR) is the difference Q3 − Q1 and describes the spread of the middle 50% of the data.
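A minimal Python sketch (assuming NumPy is available; the numbers reuse the ten data points from the example above) of these dispersion measures:

import numpy as np

data = np.array([17, 16, 21, 18, 15, 17, 21, 19, 11, 23])

mad = np.mean(np.abs(data - data.mean()))        # mean absolute deviation
variance = data.var()                            # population variance
std_dev = data.std()                             # population standard deviation
value_range = data.max() - data.min()            # range = maximum - minimum
q1, q2, q3 = np.percentile(data, [25, 50, 75])   # quartiles
iqr = q3 - q1                                    # interquartile range

print(mad, variance, std_dev, value_range, q1, q2, q3, iqr)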

6. Skewness: The measure of asymmetry in a probability distribution is defined by Skewness. It can be positive, negative or undefined.

 Positive Skew: This is the case when the tail on the right side of the curve is bigger than that on the left side. For these distributions, the mean is greater than the mode.

 Negative Skew: This is the case when the tail on the left side of the curve is bigger than that on the right side. For these distributions, the mean is smaller than the mode.

A commonly used way of calculating Skewness is Pearson's second coefficient of skewness: Skewness = 3 × (Mean − Median) / Standard Deviation.

If the skewness is zero, the distribution is symmetrical. If it is negative, the distribution is Negatively Skewed, and if it is positive, it is Positively Skewed.

7. Kurtosis: Kurtosis describes whether the data is light-tailed (lacking outliers; an outlier is a data point that differs significantly from other observations, or an observation that lies an abnormal distance from the other values in a random sample from a population) or heavy-tailed (outliers present) when compared to a Normal distribution. There are three kinds of Kurtosis:

 Mesokurtic: This is the case when the (excess) kurtosis is zero, similar to the normal distribution.

 Leptokurtic: This is when the tail of the distribution is heavy (outliers present) and the kurtosis is higher than that of the normal distribution.

 Platykurtic: This is when the tail of the distribution is light (no outliers) and the kurtosis is less than that of the normal distribution.
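As a quick illustration, SciPy (if installed) reports both measures directly; scipy.stats.kurtosis returns the excess kurtosis, so 0 corresponds to the mesokurtic case above:

from scipy.stats import skew, kurtosis

data = [17, 16, 21, 18, 15, 17, 21, 19, 11, 23]

print(skew(data))      # > 0: right tail longer, < 0: left tail longer, 0: symmetric
print(kurtosis(data))  # excess kurtosis: 0 mesokurtic, > 0 leptokurtic, < 0 platykurtic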
Frequency distribution curve
Inferential statistics
Most of the time, you can only acquire data from samples,
because it is too difficult or expensive to collect data from the
whole population that you’re interested in.

While descriptive statistics can only summarize a sample’s


characteristics, inferential statistics use your sample to make
reasonable guesses about the larger population.

With inferential statistics, it’s important to use random and


unbiased sampling methods. If your sample isn’t representative
of your population, then you can’t make valid statistical
inferences.

Example (inferential statistics): You randomly select a sample of 11th graders in your state and collect data on their SAT (Scholastic Assessment Test) scores and other characteristics.
You can use inferential statistics to make estimates and test hypotheses about the whole population of 11th graders in the state based on your sample data.
Sampling error in inferential statistics
Since the size of a sample is always smaller than the size of
the population, some of the population isn’t captured by sample
data. This creates sampling error, which is the difference
between the true population values (called parameters) and the
measured sample values (called statistics).

Sampling error arises any time you use a sample, even if your
sample is random and unbiased. For this reason, there is
always some uncertainty in inferential statistics.

Hypothesis testing
Hypothesis testing is a formal process of statistical analysis
using inferential statistics. The goal of hypothesis testing is to
compare populations or assess relationships between variables
using samples.

Hypotheses, or predictions, are tested using statistical tests.


Statistical tests also estimate sampling errors so that valid
inferences can be made.

Types of hypothesis

Null Hypothesis.
The null hypothesis, H0 is the commonly accepted
fact; it is the opposite of the alternate hypothesis.
Researchers work to reject, nullify or disprove the null
hypothesis. Researchers come up with an alternate
hypothesis, one that they think explains a
phenomenon, and then work to reject the null
hypothesis.
Why is it Called the "Null"?
The word "null" in this context means that it's a commonly accepted fact that researchers work to nullify. It doesn't mean that the statement is null (i.e. amounts to nothing) itself! (Perhaps the term should be called the "nullifiable hypothesis", as that might cause less confusion.)
Why Do I need to Test it? Why not just prove an alternate
one?
The short answer is: as a scientist, you are required to; it's part of the scientific process. Science uses a battery of processes to prove or disprove theories, making sure that any new hypothesis has no flaws. Including both a null and an alternate hypothesis is one safeguard to ensure your research isn't flawed. Not including the null hypothesis in your research is considered very bad practice by the scientific community. If you set out to prove an alternate hypothesis without considering it, you are likely setting yourself up for failure. At a minimum, your experiment will likely not be taken seriously.
Example
Not so long ago, people believed that the world was
flat.

 Null hypothesis: H0: The world is flat.


 Alternate hypothesis: The world is round.
Several scientists, including Copernicus, set out to disprove the null hypothesis. This eventually led to the rejection of the null and the acceptance of the alternate. Most people accepted it; the ones that didn't created the Flat Earth Society! What would have happened if Copernicus had not disproved the null and had merely proved the alternate? No one would have listened to him. In order to change people's thinking, he first had to prove that their thinking was wrong.

How to State the Null Hypothesis from a Word Problem


You’ll be asked to convert a word problem into
a hypothesis statement in statistics that will include a
null hypothesis and an alternate hypothesis.
Breaking your problem into a few small steps makes
these problems much easier to handle.
How to State the Null Hypothesis
Example Problem: A researcher thinks that if knee surgery patients go to physical therapy twice a week (instead of 3 times), their recovery period will be longer. The average recovery time for knee surgery patients is 8.2 weeks.

Hypothesis testing is vital to test patient outcomes.

Mathematical Symbols Used in H0 and Ha:

 H0: equal (=) / Ha: not equal (≠), greater than (>), or less than (<)
 H0: greater than or equal to (≥) / Ha: less than (<)
 H0: less than or equal to (≤) / Ha: more than (>)

Step 1: Figure out the hypothesis from the problem.


The hypothesis is usually hidden in a word problem,
and is sometimes a statement of what you expect to
happen in the experiment. The hypothesis in the above
question is “I expect the average recovery period to be
greater than 8.2 weeks.”
Step 2: Convert the hypothesis to math. Remember
that the average is sometimes written as μ.
H1: μ > 8.2
Broken down into (somewhat) English, that’s
H1 (The hypothesis): μ (the average) > (is greater than)
8.2
Step 3: State what will happen if the hypothesis
doesn’t come true. If the recovery time isn’t greater
than 8.2 weeks, there are only two possibilities, that
the recovery time is equal to 8.2 weeks or less than 8.2
weeks.
H0: μ ≤ 8.2
Broken down again into English, that’s H0 (The null
hypothesis): μ (the average) ≤ (is less than or equal to)
8.2

How to State the Null Hypothesis: Part Two


But what if the researcher doesn’t have any idea what will
happen?
Example Problem: A researcher is studying the
effects of radical exercise program on knee surgery
patients. There is a good chance the therapy will
improve recovery time, but there’s also the possibility it
will make it worse. Average recovery times for knee
surgery patients is 8.2 weeks.
Step 1: State what will happen if the experiment
doesn’t make any difference. That’s the null
hypothesis–that nothing will happen. In this
experiment, if nothing happens, then the recovery time
will stay at 8.2 weeks.
H0: μ = 8.2
Broken down into English, that’s H0 (The null
hypothesis): μ (the average) = (is equal to) 8.2
Step 2: Figure out the alternate hypothesis. The
alternate hypothesis is the opposite of the null
hypothesis. In other words, what happens if our
experiment makes a difference?
H1: μ ≠ 8.2
In English again, that’s H1 (The alternate hypothesis): μ
(the average) ≠ (is not equal to) 8.2
That’s How to State the Null Hypothesis!
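Once the hypotheses are stated, they can be tested against sample data. Below is a minimal sketch (assuming SciPy and a small hypothetical sample of recovery times, not real data) of testing H0: μ = 8.2 against H1: μ ≠ 8.2 with a one-sample t-test:

from scipy import stats

recovery_weeks = [8.4, 9.1, 8.9, 8.0, 9.5, 8.7, 9.0, 8.8]  # hypothetical sample

t_stat, p_value = stats.ttest_1samp(recovery_weeks, popmean=8.2)
print(t_stat, p_value)

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the mean recovery time differs from 8.2 weeks")
else:
    print("Do not reject H0")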

Alternate Hypothesis in Statistics: What is it?


In order to understand what an alternate hypothesis (also called an alternative hypothesis) is, you first need to understand what the null hypothesis means. The word hypothesis means a working statement. In statistics, we're interested in proving whether a working statement (the null hypothesis) is true or false. Usually, these working statements are things that are expected to be true, such as a known fact or perhaps a historical value. The word "null" can be thought of as "no change": with the null hypothesis, you get what you expect, from a historical point of view.

The Alternate Hypothesis


The alternate hypothesis is just an alternative to the
null. For example, if your null is “I’m going to win up to
$1,000” then your alternate is “I’m going to win $1,000
or more.” Basically, you’re looking at whether there’s
enough change (with the alternate hypothesis) to be
able to reject the null hypothesis.

In many cases, the alternate hypothesis will just be


the opposite of the null hypothesis. For example,
the null hypothesis might be “There was no change in
the water level this Spring,” and the alternative
hypothesis would be “There was a change in the water
level this Spring.”
In other cases, there might be a change in the amount of something. For example, let's say a Gallup poll predicts an election will re-elect a president with a 5 percent majority. However, you, the researcher, have uncovered a secret grassroots campaign composed of hundreds of thousands of minorities who are going to vote the opposite way from expected.
 Null hypothesis: President re-elected with a 5 percent majority.
 Alternate hypothesis: President re-elected with a 1-2 percent majority.
Although the outcome hasn't changed (the President is still re-elected), the majority percentage has changed, which may be important to an electoral campaign.
The alternate hypothesis is usually what you will be testing in hypothesis testing. It's a statement that you (or another researcher) think is true and one that can ultimately lead you to reject the null hypothesis and replace it with the alternate hypothesis.

Alternate Hypothesis Examples


Example 1: It's an accepted fact that ethanol boils at 173.1°F; you have a theory that ethanol actually has a different boiling point, of over 174°F. The accepted fact ("ethanol boils at 173.1°F") is the null hypothesis; your theory ("ethanol boils at temperatures over 174°F") is the alternate hypothesis.
Example 2: A classroom full of students at a certain
elementary school is performing at lower
than average levels on standardized tests. The low test
scores are thought to be due to poor teacher
performance. However, you have a theory that the
students are performing poorly because their classroom
is not as well ventilated as the other classrooms in the
school. The accepted theory (“low test scores are due
to poor teacher performance”) is the null hypothesis;
your theory (“low test scores are due to inadequate
ventilation in the classroom”) is the alternative
hypothesis.

Parametric Hypothesis Tests:


Hypothesis testing can be classified as parametric tests
and non-parametric tests.
1. In the case of parametric tests, information
about the population is completely known and
can be used for statistical inference.
2. In the case of non-parametric tests,
information about the population is unknown
and hence no assumptions can be made
regarding the population.

Let us discuss a few of the important most commonly


used parametric tests and their significance in various
statistical analyses.

In each of these parametric tests, there is a common


step of procedures.
The initial step is to state the null and alternate
hypotheses based on the given problem. The level of
significance is chosen based on the given problem.

The type of parametric test to be considered is an


important decision-making task for correct analysis of
the problem. Next, a decision rule is formulated to find
the critical values and the acceptance/rejection regions.
Lastly, the obtained value of the parametric test is
compared with the critical value to decide whether the
null hypothesis (Ho) is rejected or accepted.

The null hypothesis (Ho) and the alternate hypothesis


(Ha) are mutually exclusive.
At the beginning of any parametric test, the null hypothesis Ho is always assumed to be true, and the alternate hypothesis Ha carries the burden of proof. Before we perform any type of parametric test, let us try to understand some of the core terms related to parametric tests that need to be known:
1. Acceptance and Critical Regions:
The set of all possible values which a test statistic can take can be divided into two mutually exclusive groups:
The 1st group, called the acceptance region, consists of values that appear to be consistent with the null hypothesis.
The 2nd group, called the rejection region or the critical region, consists of values that are unlikely to occur if the null hypothesis is true.

The value(s) that separates the critical region


from the acceptance region is called the critical
value(s).

2. One-tailed Test and Two-tailed Test:


For some parametric tests like z-test, it is important to
decide if the test is one-tailed or two-tailed test.
If the specified problem has an equal sign, it is a case of a two-tailed test, whereas if the problem has a greater than (>) or less than (<) sign, it is a one-tailed test.
For example, let us consider the following three cases for a problem statement:
Case 1: A government school states that the dropout of
female students between ages 12 and 18 years is 28%.
Case 2: A government school states that the dropout of
female students between ages 12 and 18 years is
greater than 28%.
Case 3: A government school states that the dropout of
female students between ages 12 and 18 years is less
than 28%.

Case 1 is an example of the two-tailed test as it states that the dropout rate = 28%.
Case II and Case III are both examples of one-tailed tests, as Case II states that the dropout rate > 28% and Case III states that the dropout rate < 28%.
The alternate hypothesis can take one of three forms: either a parameter has increased, or it has decreased, or it has changed (it may increase or decrease). This can be illustrated as shown below:
Ha: μ > μ0: This type of test is called an upper-tailed test or right-tailed test.

Ha: μ < μ0: This type of test is called a lower-tailed test or left-tailed test.

Ha: μ ≠ μ0: This type of test is called a two-tailed test.

To summarize, while a one-tailed test checks for the


effect of a change only in one direction, a two-tailed
test checks for the effect of a change in both the
directions.
Thus, a two-tailed test considers both positive and
negative effects for a change that is being studied for
statistical analysis.

Significance Level (alpha):

The significance level, denoted by α, is the probability of the null hypothesis being rejected even though it is true. This is so because 100% accuracy is practically not possible when accepting or rejecting a hypothesis.
For example, a significance level of 0.03 indicates that a 3% risk is being taken of concluding that a difference in values exists when there is no difference.
Typical values of the significance level are 0.01, 0.05, and 0.1, which are small values chosen to control the probability of committing a Type I error.

Calculated Probability (p-value):

The p-value is a calculated probability that, when the null hypothesis is true, the statistical summary will be greater than or equal to the actual observed results.
In other words, it is the probability of finding the observed or more extreme results when the null hypothesis is true.
Low p-values indicate that the observed results are unlikely if the null hypothesis is true.
Some widely used hypothesis testing types are the t-test, z-test, ANOVA test and Chi-square test.
Let us see the above tests in detail.
1. t-test:
A t-test is a type of inferential statistic which is used to determine if there is a significant difference between the means of two groups which may be related in certain features.
It is mostly used when the data sets, like the set of data recorded as the outcome of flipping a coin 100 times, would follow a normal distribution and may have unknown variances.
The t-test is used as a hypothesis testing tool, which allows testing of an assumption applicable to a population.

The t-test has two types namely, one sampled t-test


and two-sampled t-test.

(i) One Sample t-test:


The one sample t-test determines whether the sample
mean is statistically different from a known
or hypothesized population mean. The one sample t-
test is a parametric test,

(ii) Two-sample t-test:

The independent samples t-test or 2-sample t-test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different.
The independent samples t-test is a parametric test. This test is also known as the independent t-test.

(iii) Paired sample t-test:
The paired sample t-test is also called the dependent sample t-test. It's a univariate test that tests for a significant difference between two related variables.
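A minimal sketch (assuming SciPy and small hypothetical samples) of the three t-test variants described above:

from scipy import stats

group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
group_b = [11.2, 11.5, 11.1, 11.6, 11.4, 11.3]
before  = [70, 68, 74, 71, 69, 73]
after   = [68, 66, 73, 69, 68, 70]

print(stats.ttest_1samp(group_a, popmean=12.0))  # one-sample t-test
print(stats.ttest_ind(group_a, group_b))         # independent two-sample t-test
print(stats.ttest_rel(before, after))            # paired (dependent) sample t-test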

2. z-test:
A z-test is mainly used when the data is normally distributed. The z-test is mainly used when the population mean and standard deviation are given.
The one-sample z-test is mainly used for comparing the mean of a sample to some hypothesized mean of a given population, or when the population variance is known.
The main analysis is to check whether the mean of a sample is reflective of the population being considered.
We would use a z-test if:
 Our sample size is greater than 30. Otherwise, use a t-test.
 Data points should be independent from each other. In other words, one data point isn't related to and doesn't affect another data point.
 Our data should be normally distributed. However, for large sample sizes (over 30) this doesn't always matter.
 Our data should be randomly selected from a population, where each item has an equal chance of being selected.
 Sample sizes should be equal if at all possible.
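A minimal sketch (with hypothetical numbers) of a one-sample z-test where the population mean and standard deviation are taken as known:

import math
from scipy.stats import norm

sample_mean = 52.0   # x-bar, computed from a sample of n observations
pop_mean = 50.0      # hypothesized population mean (mu)
pop_std = 6.0        # known population standard deviation (sigma)
n = 40               # sample size (> 30, so a z-test is appropriate)

z = (sample_mean - pop_mean) / (pop_std / math.sqrt(n))
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-tailed p-value
print(z, p_value)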

3. ANOVA Test (F-test):
The t-test works well when dealing with two groups, but sometimes we want to compare the means (averages) of more than two groups at the same time.
ANOVA determines whether all the groups are drawn from a common population or not.
The ANalysis Of VAriance, or ANOVA, is a statistical inference test that lets you compare multiple groups at the same time. It is based on the F-distribution.

F-statistic = variance between the sample means / variance within the samples

Here, the variation among the observations within each specific group is called its internal variation; the total of these internal variations is called the within-group variance, and the variation among the group means is called the between-group variance.

Unlike the z- and t-distributions, the F-distribution does not have any negative values, because the between-group and within-group variability are always positive due to squaring each deviation.
A one-way F-test (ANOVA) tells whether two or more groups are similar or not based on their mean similarity and F-score.
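A minimal sketch (assuming SciPy and hypothetical scores) of a one-way ANOVA comparing the means of three groups at once:

from scipy.stats import f_oneway

group1 = [85, 86, 88, 75, 78, 94]
group2 = [91, 92, 93, 85, 87, 84]
group3 = [79, 78, 88, 94, 92, 85]

f_stat, p_value = f_oneway(group1, group2, group3)
print(f_stat, p_value)   # a small p-value suggests at least one group mean differs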

4. Chi-square Test:
The test is applied when we have two categorical variables from a
single population. It is used to determine whether there is a
significant association between the two variables.

Introduction

Hypothesis testing is one of the most important


concepts in Statistics which is heavily used
by Statisticians, Machine Learning Engineers,
and Data Scientists.
In hypothesis testing, Statistical tests are used to check
whether the null hypothesis is rejected or not
rejected. These Statistical tests assume a
null hypothesis of no relationship or no difference
between groups.

So, In this article, we will be discussing the statistical


test for hypothesis testing including both parametric
and non-parametric tests.

Table of Contents

1. What are Parametric Tests?

2. What are Non-parametric Tests?

3. Parametric Tests for Hypothesis testing

 T-test
 Z-test
 F-test
 ANOVA

4. Non-parametric Tests for Hypothesis testing

 Chi-square
 Mann-Whitney U-test
 Kruskal-Wallis H-test

Let’s get started,

Parametric Tests

The basic principle behind the parametric tests is that


we have a fixed set of parameters that are used to
determine a probabilistic model that may be used in
Machine Learning as well.

Parametric tests are those tests for which we have prior knowledge of the population distribution (i.e., normal), or if not, then we can easily approximate it to a normal distribution, which is possible with the help of the Central Limit Theorem.

The parameters for using the normal distribution are:

 Mean
 Standard Deviation

Ultimately, whether a test is classified as parametric depends entirely on the assumptions made about the population. There are many parametric tests available, some of which are used as follows:

 To find the confidence interval for the population


means with the help of known standard deviation.
 To determine the confidence interval for population
means along with the unknown standard deviation.
 To find the confidence interval for the population
variance.
 To find the confidence interval for the difference of
two means, with an unknown value of standard
deviation.

Non-parametric Tests
In non-parametric tests, we don't make any assumptions about the parameters of the given population or the population we are studying. In fact, these tests don't depend on the population.
Hence, there is no fixed set of parameters, and no distribution (normal distribution, etc.) of any kind is assumed.

This is also the reason that non-parametric tests are referred to as distribution-free tests.
Nowadays, non-parametric tests are gaining popularity and influence; some reasons behind this are:

 The main reason is that there is no need to satisfy the strict assumptions that parametric tests require.
 The second reason is that we do not need to make assumptions about the given population (or the sample taken from it) on which we are doing the analysis.
 Most of the non-parametric tests available are very easy to apply and to understand, i.e. their complexity is very low.

T-Test

1. It is a parametric test of hypothesis testing based


on Student’s T distribution.

2. It essentially tests the significance of the difference of the mean values when the sample size is small (i.e., less than 30) and when the population standard deviation is not available.

3. Assumptions of this test:

 Population distribution is normal, and


 Samples are random and independent
 The sample size is small.
 Population standard deviation is not known.
4. Mann-Whitney ‘U’ test is a non-parametric
counterpart of the T-test.

A T-test can be a:

One Sample T-test: To compare a sample mean with the population mean. The test statistic is t = (x̄ − μ) / (s / √n),

where,

x̄ is the sample mean

s is the sample standard deviation

n is the sample size

μ is the population mean

Two-Sample T-test: To compare the means of two different samples. With equal group sizes n, the test statistic is t = (x̄1 − x̄2) / √((s1² + s2²) / n),

where,

x̄ 1 is the sample mean of the first group

x̄ 2 is the sample mean of the second group

S1 is the sample-1 standard deviation

S2 is the sample-2 standard deviation

n is the sample size

Conclusion:

 If the value of the test statistic is greater than the


table value -> Rejects the null hypothesis.
 If the value of the test statistic is less than the
table value -> Do not reject the null
hypothesis.

Z-Test

1. It is a parametric test of hypothesis testing.

2. It is used to determine whether the means are


different when the population variance is known and
the sample size is large (i.e, greater than 30).

3. Assumptions of this test:

 Population distribution is normal


 Samples are random and independent.
 The sample size is large.
 Population standard deviation is known.

A Z-test can be:


One Sample Z-test: To compare a sample mean with the population mean. The test statistic is z = (x̄ − μ) / (σ / √n).

Two Sample Z-test: To compare the means of two different samples. With equal group sizes n, the test statistic is z = (x̄1 − x̄2) / √((σ1² + σ2²) / n),

where,
x̄ 1 is the sample mean of 1st group

x̄ 2 is the sample mean of 2nd group

σ1 is the population-1 standard deviation

σ2 is the population-2 standard deviation

n is the sample size

F-Test

1. It is a parametric test of hypothesis testing based


on Snedecor F-distribution.

2. It is a test for the null hypothesis that two normal


populations have the same variance.

3. An F-test is regarded as a comparison of equality of


sample variances.

4. F-statistic is simply a ratio of two variances.

5. It is calculated as:

F = s1² / s2²
6. By changing the variance in the ratio, F-test has
become a very flexible test. It can then be used to:

 Test the overall significance for a regression


model.
 To compare the fits of different models and
 To test the equality of means.

7. Assumptions of this test:

 Population distribution is normal, and


 Samples are drawn randomly and independently.
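A minimal sketch (with hypothetical samples) of an F-test for the equality of two variances, computed as the ratio of the two sample variances:

import numpy as np
from scipy.stats import f

x = np.array([23.0, 25.1, 24.3, 26.2, 25.5, 24.8])
y = np.array([22.1, 29.4, 25.0, 27.8, 21.5, 28.3])

s1_sq = x.var(ddof=1)   # sample variance of group 1
s2_sq = y.var(ddof=1)   # sample variance of group 2
F = s1_sq / s2_sq
dfn, dfd = len(x) - 1, len(y) - 1
p_value = 2 * min(f.cdf(F, dfn, dfd), 1 - f.cdf(F, dfn, dfd))  # two-tailed
print(F, p_value)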

ANOVA

1. Also called Analysis of Variance, it is a parametric test of hypothesis testing.

2. It is an extension of the T-Test and Z-test.

3. It is used to test the significance of the differences in


the mean values among more than two sample groups.

4. It uses F-test to statistically test the equality of


means and the relative variance between them.

5. Assumptions of this test:

 Population distribution is normal, and


 Samples are random and independent.
 Homogeneity of sample variance.

6. One-way ANOVA and Two-way ANOVA are its types.


7. F-statistic = variance between the sample
means/variance within the sample

Chi-Square Test

1. It is a non-parametric test of hypothesis testing.

2. As a non-parametric test, chi-square can be used:

 as a test of goodness of fit.
 as a test of independence of two variables.

3. It helps in assessing the goodness of fit between a


set of observed and those expected theoretically.

4. It makes a comparison between the expected


frequencies and the observed frequencies.

5. The greater the difference, the greater the value of chi-square.

6. If there is no difference between the expected and


observed frequencies, then the value of chi-square is
equal to zero.

7. It is also known as the “Goodness of fit


test” which determines whether a particular
distribution fits the observed data or not.

8. It is calculated as χ² = Σ (O − E)² / E, where O is an observed frequency and E is the corresponding expected frequency.
9. Chi-square is also used to test the independence of
two variables.

10. Conditions for the chi-square test:

 The observations should be collected and recorded randomly.
 In the sample, all the entities must be independent.
 No group should contain very few items, say less than 10.
 The overall number of items should be reasonably large; normally it should be at least 50, however small the number of groups may be.

11. Chi-square as a parametric test is used as a test for


population variance based on sample variance.

12. If we take each one of a collection of sample


variances, divide them by the known population
variance and multiply these quotients by (n-1), where n
means the number of items in the sample, we get the
values of chi-square.
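A minimal sketch (assuming SciPy and hypothetical counts) of the two non-parametric uses of chi-square described above:

from scipy.stats import chisquare, chi2_contingency

# Goodness of fit: observed vs. expected frequencies (both sum to 100)
observed = [18, 22, 20, 25, 15]
expected = [20, 20, 20, 20, 20]
print(chisquare(f_obs=observed, f_exp=expected))

# Test of independence: a 2x2 contingency table of two categorical variables
table = [[30, 10],
         [20, 40]]
chi2, p_value, dof, expected_counts = chi2_contingency(table)
print(chi2, p_value, dof)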
Parametric tests assume a normal distribution of values, or a
“bell-shaped curve.” For example, height is roughly a normal
distribution in that if you were to graph height from a group of
people, one would see a typical bell-shaped curve. This
distribution is also called a Gaussian distribution. Parametric
tests are in general more powerful (require a smaller sample
size) than nonparametric tests.

Advantages of Parametric Tests

Advantage 1: Parametric tests can provide trustworthy results


with distributions that are skewed and nonnormal

Many people aren’t aware of this fact, but parametric analyses


can produce reliable results even when your continuous
data are nonnormally distributed. You just have to be sure that
your sample size meets the requirements for each analysis in
the table below. Simulation studies have identified these
requirements. Read here for more information about
these studies.

Sample size requirements for parametric analyses with nonnormal data:

 1-sample t-test: greater than 20 observations.
 2-sample t-test: each group should have more than 15 observations.
 One-Way ANOVA: for 2-9 groups, each group should have more than 15 observations; for 10-12 groups, each group should have more than 20 observations.

You can use these parametric tests with nonnormally


distributed data thanks to the central limit theorem.

Advantage 2: Parametric tests can provide trustworthy results


when the groups have different amounts of variability

It’s true that nonparametric tests don’t require data that are
normally distributed. However, nonparametric tests have the
disadvantage of an additional requirement that can be very
hard to satisfy. The groups in a nonparametric analysis typically
must all have the same variability (dispersion).
Nonparametric analyses might not provide accurate results
when variability differs between groups.

Conversely, parametric analyses, like the 2-sample t-test or


one-way ANOVA, allow you to analyze groups with unequal
variances. In most statistical software, it’s as easy as checking
the correct box! You don’t have to worry about groups having
different amounts of variability when you use a parametric
analysis.

Advantage 3: Parametric tests have greater statistical power

In most cases, parametric tests have more power. If


an effect actually exists, a parametric analysis is more likely to
detect it.

Nonparametric tests are used in cases where parametric tests


are not appropriate. Most nonparametric tests use some way of
ranking the measurements and testing for weirdness of the
distribution. Typically, a parametric test is preferred because it
has better ability to distinguish between the two arms. In other
words, it is better at highlighting the weirdness of the
distribution. Nonparametric tests are about 95% as powerful as
parametric tests.
However, nonparametric tests are often necessary. Some
common situations for using nonparametric tests are when the
distribution is not normal (the distribution is skewed), the
distribution is not known, or the sample size is too small (<30)
to assume a normal distribution. Also, if there are extreme
values or values that are clearly “out of range,” nonparametric
tests should be used.

Advantages of Nonparametric Tests

Advantage 1: Nonparametric tests assess the median, which can be better for some study areas. Now we're coming to my preferred reason for when to use a nonparametric test, the one that practitioners don't discuss frequently enough!
For some datasets, nonparametric analyses provide an advantage because they assess the median rather than the mean. The mean is not always the better measure of central tendency for a sample. Even though you can perform a valid parametric analysis on skewed data, that doesn't necessarily equate to being the better method. Let me explain using the distribution of salaries. Salaries tend to be a right-skewed distribution. The majority of wages cluster around the median, which is the point where half are above and half are below. However, there is a long tail that stretches into the higher salary ranges. This long tail pulls the mean far away from the central median value. The two distributions are typical for salary distributions.

These two distributions have roughly equal medians but different


means.
In these distributions, if several very high-income individuals join the sample, the mean increases by a significant amount despite the fact that incomes for most people don't change. They still cluster around the median.
In this situation, parametric and nonparametric test results can give you different results, and they both can be correct! For the two distributions, if you draw a large random sample from each population, the difference between the means is statistically significant. Despite this, the difference between the medians is not statistically significant. Here's how this works. For skewed distributions, changes in the tail affect the mean substantially. Parametric tests can detect this mean change. Conversely, the median is relatively unaffected, and a nonparametric analysis can legitimately indicate that the median has not changed significantly. You need to decide whether the mean or median is best for your study and which type of difference is more important to detect.

Advantage 2: Nonparametric tests are valid when our sample


size is small and your data are potentially nonnormal

Use a nonparametric test when your sample size isn’t large


enough to satisfy the requirements in the table above and
you’re not sure that your data follow the normal distribution.
With small sample sizes, be aware that normality tests can
have insufficient power to produce useful results.

This situation is difficult. Nonparametric analyses tend to have


lower power at the outset, and a small sample size only
exacerbates that problem.

Advantage 3: Nonparametric tests can analyze ordinal data, ranked data, and outliers. Parametric tests can analyze only continuous data, and the findings can be overly affected by outliers. Conversely, nonparametric tests can also analyze ordinal and ranked data, and are not tripped up by outliers. Sometimes you can legitimately remove outliers from your dataset if they represent unusual conditions. However, sometimes outliers are a genuine part of the distribution for a study area, and you should not remove them.
You should verify the assumptions for nonparametric analyses because the various tests can analyze different types of data and have differing abilities to handle outliers.

If your data use the ordinal Likert scale and you want to
compare two groups, read my post about which analysis you
should use to analyze Likert data.

Point Estimates

Point estimators are functions that are used to find an approximate


value of a population parameter from random samples of the
population. They use the sample data of a population to calculate a
point estimate or a statistic that serves as the best estimate of an
unknown parameter of a population.
Most often, the existing methods of finding the
parameters of large populations are unrealistic. For
example, when finding the average age of kids
attending kindergarten, it will be impossible to collect
the exact age of every kindergarten kid in the world.
Instead, a statistician can use the point estimator to
make an estimate of the population parameter.

A point estimate is a type of estimation that uses a


single value, oftentimes a sample statistic,
to infer information about the population
parameter.

Let’s go through some of the major point estimates


which include point estimates for the population mean,
the population variance and the population standard
deviation.

Point Estimate for the Population Mean

So let's say we've recently purchased 5,000 widgets to be consumed in our next manufacturing order, and we require that the average length of the 5,000 widgets is 2 inches.

Instead of measuring all 5,000 units, which would be


extremely time consuming and costly, and in other
cases possibly destructive, we can take a sample from
that population and measure the average length of the
sample.

As you know, the sample mean can be calculated by


simply summing up the individual values and dividing
by the number of samples measured.
Point Estimate for the Population Variance &
Standard Deviation

Similar to this example, you might want to estimate the


variance or standard deviation associated with a
population of product.

The point estimate of the population variance &


standard deviation is simply the sample variance &
sample standard deviation:

Example of Sample Standard Deviation

Let's find the sample standard deviation for the data set 16.5, 17.2, 14.5, 15.3, 16.1. The sample mean is 15.92, the sample variance (dividing by n − 1 = 4) is 1.102, and the sample standard deviation is therefore about 1.05.
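The point estimates above can be reproduced with a short Python sketch (standard library only; note that statistics.variance and statistics.stdev divide by n − 1, as sample estimators should):

import statistics

data = [16.5, 17.2, 14.5, 15.3, 16.1]

print(statistics.mean(data))      # point estimate of the population mean, about 15.92
print(statistics.variance(data))  # sample variance, about 1.102
print(statistics.stdev(data))     # sample standard deviation, about 1.05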
Confidence Interval (The Interval Estimate)

Interval estimation uses sample data to calculate an interval of possible values of an unknown parameter of a population. The interval is selected in such a way that it contains the parameter with a 95% or higher probability; this interval is known as the confidence interval. The confidence interval is used to indicate how reliable an estimate is, and it is calculated from the observed data. The endpoints of the interval are referred to as the upper and lower confidence limits.
An interval estimate is a type of estimation that uses
a range (or interval) of values, based on sampling
information, to “capture” or “cover” the true population
parameter being inferred / estimated.

Interval estimates are created using a confidence


level, which is the probability that your interval truly
captures the population parameter being estimated.

Because we use a confidence level, we often call these


interval estimates a confidence interval.
You can see an example of the confidence interval
below.

The image starts with the population distribution in


orange, and this distribution has an unknown
population mean, which we’re attempting to estimate.

Then we take a sample (of size n) from that population,


and that sample distribution is shown in purple.

We can then create our confidence interval based on


that sample data.

Let’s talk more about this confidence level before


jumping into the interval calculations.

What does a Confidence Level Mean?

The confidence level is an often misunderstood concept.

When we say that we have 95% confidence in our


interval estimate, we do not mean that 95% of the
overall population falls within the confidence interval.

The confidence level is the probability that your


confidence interval truly captures the population
parameter being estimated.

So if we have a 95% confidence level, we can be confident that 95% of the time (19 out of 20), our interval estimate will accurately capture the true parameter being estimated.

If you look at the graph below, the true population


parameter (μ in this case) is shown as the solid blue
line down the middle.

In 19 of the 20 intervals created, the true population


mean is captured within those 19 intervals. There is
only 1 interval that does not capture the true
population mean and it’s shown in red.

Up until this point I’ve only used the 95% confidence


level, but your confidence level can vary.

You can be 80% confident, 90%, 99% ,etc.


The confidence level you choose is based on risk –
specifically your alpha risk(α).

Alpha risk is also called your significance level and it


is the risk that you will not accurately capture the true
population parameter.

Your Confidence Level then is equal to 100%


minus your significance level (α).

Confidence Level = 100% – Significance Level (α)

So if your significance level is 0.05 (5%), then your


confidence level is 95%.

The confidence interval equation is comprised of 3


parts: a point estimate (also sometimes called a
sample statistic), a confidence level, and a margin
of error.

The point estimate, or statistic, is the most likely


value of the population parameter and the confidence
level & margin of error represents the amount of
uncertainty associated with the sample taken.

Confidence Interval = Point Estimate ± Margin of Error,

where the margin of error is determined by the confidence level and the standard error of the estimate.

When dealing with the population mean, the confidence interval looks like this: x̄ ± z · (σ / √n).
The margin of error is the maximum expected difference between the actual population parameter and a sample estimate of the parameter. In other words, it is the range of values above and below the sample statistic; it is obtained by multiplying the critical value z by the standard error of the mean.

Here,

Variables

 x: an individual value
 X̄: the sample mean, a point estimate for the population mean
 σ: the actual population standard deviation / the symbol for the measurement of dispersion in a population
 n: the number of data points in a sample (N is used for populations)
 X̿ (X double bar): the grand average of the subgroup averages, also known as X-bar bar or X-double bar
 s (or sd): the sample standard deviation, a point estimate for the population standard deviation / the dispersion statistic for samples
 µ: the central tendency (mean) parameter for populations
 z: the confidence coefficient (critical value)
 α (alpha): the significance level
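A minimal sketch (with hypothetical numbers) of a 95% confidence interval for the population mean when σ is known, following x̄ ± z · (σ / √n):

import math
from scipy.stats import norm

x_bar = 2.01    # sample mean (point estimate)
sigma = 0.03    # known population standard deviation
n = 50          # sample size
confidence = 0.95

z = norm.ppf(1 - (1 - confidence) / 2)        # about 1.96 for 95% confidence
margin_of_error = z * sigma / math.sqrt(n)
print(x_bar - margin_of_error, x_bar + margin_of_error)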
Measuring Data Similarity and Dissimilarity
In data mining applications such as clustering, outlier analysis, and nearest-neighbor classification, we need ways of assessing how alike objects are, or how unlike they are in comparison to one another. For example, a store may like to search for clusters of customer objects, resulting in groups of customers with similar characteristics (e.g., similar income, area of residence, and age). Such information can then be used for marketing. A cluster is a collection of data objects such that the objects within a cluster are similar to one another and dissimilar to the objects in other clusters. Outlier analysis also employs clustering-based techniques to identify potential outliers as objects that are highly dissimilar to others.
This section presents measures of similarity and
dissimilarity. Similarity and dissimilarity measures are
referred to as measures of proximity. Similarity and
dissimilarity are related. A similarity measure for two
objects, i and j, will typically return the value 0 if the
objects are unalike. The higher the similarity value, the
greater the similarity between objects. (Typically, a
value of 1 indicates complete similarity, that is, that the
objects are identical.) A dissimilarity measure works the
opposite way. It returns a value of 0 if the objects are
the same (and therefore, far from being dissimilar). The
higher the dissimilarity value, the more dissimilar the
two objects are.

Distance or similarity measures are essential in solving


many pattern recognition problems such as
classification and clustering. Various distance/similarity
measures are available in the literature to compare two
data distributions. As the names suggest, a similarity
measures how close two distributions are. For
multivariate data complex summary methods are
developed to answer this question.

Similarity measure

 is a numerical measure of how alike two data


objects are.
 higher when objects are more alike.
 often falls in the range [0,1]
Similarity might be used to identify

 duplicate data that may have differences due to typos
 equivalent instances from different data sets.
E.g. names and/or addresses that are the
same but have misspellings.
 groups of data that are very close (clusters)

Dissimilarity measure

 is a numerical measure of how different two data objects are
 lower when objects are more alike
 minimum dissimilarity is often 0, while the upper limit varies depending on how much variation is possible

Dissimilarity might be used to identify

 outliers
 interesting exceptions, e.g. credit card fraud
 boundaries to clusters

Proximity refers to either a similarity or dissimilarity

Single attribute similarity/dissimilarity measures

For nominal attributes, the measure is binary: whether the two values are equal or not.

For ordinal attributes, it is the difference between the two values, normalized by the maximum distance.

For quantitative attributes, dissimilarity is just a distance between the values; similarity attempts to scale that distance to [0,1].

Note:

A similarity is larger if the objects are more


similar.

A dissimilarity is larger if the objects are less


similar.

Data Matrix vs. Dissimilarity Matrix

In this section, we talk about objects described by multiple attributes. Therefore, we need a change in notation. Suppose that we have n objects (such as persons, items, or courses) described by p attributes (also called measurements or features), such as age, height, weight, or gender. The objects are x1 = (x11, x12, ..., x1p), x2 = (x21, x22, ..., x2p), and so on, where xij is the value for object xi of the jth attribute. For brevity, we hereafter refer to object xi as object i. The objects may be tuples in a relational database, and are also referred to as data samples or feature vectors.

Main memory-based clustering and nearest-neighbor


algorithms typically operate on either of the following
two data structures.

• Data matrix (or object-by-attribute structure): This


structure stores the n data objects in the form of a
relational table, or n-by-p matrix (n objects ×p
attributes):

Each row corresponds to an object. As part of our


notation, we may use f to index through the p
attributes.

• Dissimilarity matrix (or object-by-object structure):


This stores a collection of proximities that are available
for all pairs of n objects. It is often represented by an n-
by-n table:

where d(i, j) is the measured dissimilarity or


“difference” between objects i and j. In general, d(i, j) is
a nonnegative number that is close to 0 when objects i
and j are highly similar or “near” each other, and
becomes larger the more they differ. Note that

d(i, i) = 0, that is, the difference between an object and


itself is 0. Furthermore,
d(i, j) = d(j, i). (For readability, we do not show the d(j,
i) entries; the matrix is symmetric.)

Measures of similarity can often be expressed as a


function of measures of dissimilarity. For example, for
nominal data,

sim(i, j) = 1 − d(i, j) where sim(i, j) is the similarity


between objects i and j.

A data matrix is made up of two entities or "things", namely rows (for objects) and columns (for attributes). Therefore, the data matrix is often called a two-mode matrix. The dissimilarity matrix contains one kind of entity (dissimilarities) and so is called a one-mode matrix. Many clustering and nearest-neighbor algorithms operate on a dissimilarity matrix. Data in the form of a data matrix can be transformed into a dissimilarity matrix before applying such algorithms.
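A minimal sketch (assuming SciPy) of turning an n-by-p data matrix into the corresponding n-by-n dissimilarity matrix of pairwise Euclidean distances:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# 4 objects (rows) described by 3 numeric attributes (columns)
X = np.array([[22, 1.70, 60],
              [25, 1.65, 58],
              [47, 1.80, 90],
              [52, 1.75, 85]])

D = squareform(pdist(X, metric="euclidean"))
print(D)   # d(i, i) = 0 on the diagonal and d(i, j) = d(j, i), so D is symmetric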
Proximity Measures for Nominal Attributes
A nominal attribute can take on two or more states. For example, map color is a nominal attribute that may have, say, five states: red, yellow, green, pink, and blue.
Let the number of states of a nominal attribute be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, ..., M. The dissimilarity between two objects i and j can then be computed as the ratio of mismatches, d(i, j) = (p − m) / p, where m is the number of attributes on which i and j are in the same state and p is the total number of attributes describing the objects.
Proximity Measures for Binary Attributes
Let's look at dissimilarity and similarity measures for objects described by either symmetric or asymmetric binary attributes.
Recall that a binary attribute has only two states: 0 or 1, where 0 means that the attribute is absent, and 1 means that it is present. Given the attribute smoker describing a patient, for instance, 1 indicates that the patient smokes, while 0 indicates that the patient does not. Treating binary attributes as if they are numeric can be misleading. Therefore, methods specific to binary data are necessary for computing dissimilarities.
"So, how can we compute the dissimilarity between two binary attributes?" One way is shown in the sketch below.
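A minimal sketch (not from the original notes) of the usual contingency-table approach: for two binary vectors i and j, count q (1/1 matches), r (1/0), s (0/1) and t (0/0); the symmetric dissimilarity is (r + s) / (q + r + s + t), while the asymmetric (Jaccard-style) dissimilarity ignores the 0/0 matches:

def binary_dissimilarity(i, j, asymmetric=False):
    q = sum(a == 1 and b == 1 for a, b in zip(i, j))  # both present
    r = sum(a == 1 and b == 0 for a, b in zip(i, j))
    s = sum(a == 0 and b == 1 for a, b in zip(i, j))
    t = sum(a == 0 and b == 0 for a, b in zip(i, j))  # both absent
    denom = (q + r + s) if asymmetric else (q + r + s + t)
    return (r + s) / denom

jack = [1, 0, 1, 0, 0, 0]   # e.g. presence/absence of six symptoms
mary = [1, 0, 1, 0, 1, 0]
print(binary_dissimilarity(jack, mary, asymmetric=True))   # 1/3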
Proximity Measures for Ordinal Attributes

The values of an ordinal attribute have a meaningful order or ranking about them, yet the magnitude between successive values is unknown. An example is the sequence small, medium, large for a size attribute. Ordinal attributes may also be obtained from the discretization of numeric attributes by splitting the value range into a finite number of categories. These categories are organized into ranks. That is, the range of a numeric attribute can be mapped to an ordinal attribute f having Mf states. For example, the range of the interval-scaled attribute temperature (in Celsius) can be organized into the following states: −30 to −10, −10 to 10, and 10 to 30, representing the categories cold temperature, moderate temperature, and warm temperature, respectively. Let the number of possible states that an ordinal attribute can have be Mf. These ordered states define the ranking 1, . . . , Mf.

“How are ordinal attributes handled?” The treatment of ordinal attributes is quite similar to that of numeric attributes when computing the dissimilarity between objects. Suppose that f is an attribute from a set of ordinal attributes describing n objects. The dissimilarity computation with respect to f involves the following steps (a small sketch follows below):

1. The value of f for the ith object is xif, and f has Mf ordered states, representing the ranking 1, . . . , Mf. Replace each xif by its corresponding rank, rif ∈ {1, . . . , Mf}.

2. Since each ordinal attribute can have a different number of states, it is often necessary to map the range of each attribute onto [0.0, 1.0] so that each attribute has equal weight. We perform such data normalization by replacing the rank rif of the ith object in the fth attribute by

zif = (rif − 1) / (Mf − 1)

Dissimilarity can then be computed using any of the distance measures for numeric data described next, with zif representing the f value of the ith object.
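A minimal sketch of these two steps for a single ordinal attribute (the attribute and its states are hypothetical); the normalized values can then be fed into any numeric distance measure:

# Ordinal attribute "size" with Mf = 3 ordered states.
states = ["small", "medium", "large"]                    # ranking 1, 2, 3
rank = {value: position + 1 for position, value in enumerate(states)}
Mf = len(states)

def normalize(value):
    # Step 1: replace the value by its rank; step 2: map the rank onto [0.0, 1.0].
    return (rank[value] - 1) / (Mf - 1)

z_i = normalize("small")   # 0.0
z_j = normalize("large")   # 1.0

# Dissimilarity is then computed as for numeric data, e.g. absolute difference.
print(abs(z_i - z_j))      # 1.0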
The Dissimilarity of Numeric Data
One of the fundamental tasks in data mining is calculating the differences between objects. As with any other calculation and validation step, there are standard measures for the dissimilarity of numeric data. In this section, we discuss the Euclidean and Manhattan distances, the two most common distance measures for objects described by numeric attributes. There is also a separate part on the Minkowski distance, the generalization of the Euclidean and Manhattan distances.

1. Introduction

As a consequence of the rapid increase in internet data in recent years, users face difficulty finding what they search for. To address this issue, recommender systems, a subclass of information filtering systems, help users access what they need faster and more easily. The most common technique for building a recommender system is collaborative filtering (CF). To make the CF technique concrete, consider an example: user A watched and liked Harry Potter, while users B and C watched and liked both Harry Potter and Lord of the Rings. Based on CF, because user A shares the same opinion about Harry Potter as users B and C, there is a high probability that user A will also like Lord of the Rings. In this example, a collaborative filtering based recommender system will recommend Lord of the Rings to user A.

However, there is an essential question: how can we calculate the similarity or dissimilarity between objects? In the following paragraphs, we explain some of the most popular distance measures for objects described by numeric attributes, namely the Euclidean and Manhattan distances, as well as the Minkowski distance, the generalization of the first two.
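To connect this motivation to the distance measures that follow, here is a toy sketch (the user names and ratings are hypothetical) in which the similarity between users is judged by the distance between their rating vectors; the smaller the distance, the more similar the taste:

import numpy as np

# Hypothetical 1-5 ratings for three movies all four users have rated.
ratings = {
    "A": np.array([5.0, 2.0, 1.0]),
    "B": np.array([5.0, 1.0, 2.0]),
    "C": np.array([4.0, 2.0, 1.0]),
    "D": np.array([1.0, 5.0, 4.0]),
}

def euclidean(u, v):
    return np.sqrt(np.sum((u - v) ** 2))

# Users B and C, who also liked Lord of the Rings, are much closer to A
# than D is, so Lord of the Rings is a natural recommendation for A.
for user in ("B", "C", "D"):
    print(user, round(euclidean(ratings["A"], ratings[user]), 2))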

2. Dissimilarity measures in numerical data


2.1 Euclidean distance
In coordinate geometry, the Euclidean distance is the distance between two points. To find the distance between two points on a plane, the length of the segment connecting them is measured. The Euclidean distance formula can be derived from the Pythagorean theorem. Let us look at the formula along with an example.
What Is the Euclidean Distance Formula?

The Euclidean distance formula, as its name suggests, gives the distance between two points, that is, the straight-line distance. Let us assume that (x1, y1) and (x2, y2) are two points in a two-dimensional plane. The Euclidean distance formula says:

d = √[(x2 − x1)² + (y2 − y1)²]

where
 (x1, y1) are the coordinates of one point,
 (x2, y2) are the coordinates of the other point, and
 d is the distance between (x1, y1) and (x2, y2).

The first and most common measure for calculating the dissimilarity of numeric data is the Euclidean distance, also known as the “as the crow flies” distance.

Example of the Euclidean distance

The Euclidean distance is the length of the straight line between the starting point and the destination. If we consider two objects i and j, each described by p numeric attributes, the Euclidean distance between them can be calculated from the formula:

d(i, j) = √[(xi1 − xj1)² + (xi2 − xj2)² + ... + (xip − xjp)²]
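A minimal sketch of this computation with NumPy (the attribute values are hypothetical):

import numpy as np

# Two objects i and j described by p = 3 numeric attributes.
i = np.array([1.0, 2.0, 3.0])
j = np.array([4.0, 6.0, 3.0])

# Straight-line ("as the crow flies") distance.
print(np.sqrt(np.sum((i - j) ** 2)))   # 5.0
print(np.linalg.norm(i - j))           # same result via NumPy's norm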

2.2 Manhattan distance

Another well-known measure for calculating dissimilarity is the Manhattan distance, also known as the taxicab or city block distance. In contrast to the Euclidean distance, with the Manhattan distance we count the city blocks that have to be passed in moving from the starting point to the destination:

d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|

For instance, based on the map below, to reach the destination the taxi driver first has to move five blocks to the left and then three blocks toward the north.

Example of the Manhattan distance

Important tips about the Euclidean and Manhattan distances

Both the Euclidean and Manhattan distances have some important mathematical properties, illustrated in the short check below:

 Non-negativity: The distance between two objects i and j is always greater than or equal to zero: d(i, j) >= 0.

 Identity of indiscernibles: The distance from any object to itself is zero: d(i, i) = 0.

 Symmetry: Distance is a symmetric measure: d(i, j) = d(j, i).

 Triangle inequality: The distance from i to j cannot be greater than the distance covered by going from i to j with a detour through a third object k: d(i, j) <= d(i, k) + d(k, j). Equivalently, each side of a triangle is no longer than the sum of the other two sides and no shorter than their difference.
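A quick numeric check of these properties for the Manhattan distance (the points are hypothetical; the same checks hold for the Euclidean distance):

import numpy as np

def manhattan(u, v):
    return np.sum(np.abs(u - v))

i = np.array([1.0, 2.0])
j = np.array([4.0, 6.0])
k = np.array([2.0, 9.0])

assert manhattan(i, j) >= 0                                   # non-negativity
assert manhattan(i, i) == 0                                   # identity of indiscernibles
assert manhattan(i, j) == manhattan(j, i)                     # symmetry
assert manhattan(i, j) <= manhattan(i, k) + manhattan(k, j)   # triangle inequality
print("all four properties hold for this example")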

Minkowski distance

The Minkowski distance is the generalization of the Euclidean and Manhattan distances. In the formula below, h is a real number with h >= 1:

d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h)^(1/h)

For h = 1 the result is the same as the Manhattan distance, and for h = 2 it is equal to the Euclidean distance.
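A small sketch showing the generalization (the vectors are hypothetical; scipy.spatial.distance.minkowski offers an equivalent ready-made function):

import numpy as np

def minkowski(u, v, h):
    # d = (sum over attributes of |u_f - v_f|^h) ^ (1/h), with h >= 1.
    return np.sum(np.abs(u - v) ** h) ** (1.0 / h)

i = np.array([1.0, 2.0, 3.0])
j = np.array([4.0, 6.0, 3.0])

print(minkowski(i, j, h=1))   # 7.0 -> Manhattan distance
print(minkowski(i, j, h=2))   # 5.0 -> Euclidean distance
print(minkowski(i, j, h=3))   # ~4.5; tends toward the largest single difference as h grows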
Concept of Outlier, types of outliers, outlier detection methods

In statistics, an outlier is an observation point that is distant from other observations.

The definition suggests that an outlier is something that is the odd one out, the one that is different from the crowd. Some statisticians define outliers as ‘having a different underlying behavior than the rest of the data’. Alternatively, an outlier is a data point that is distant from other points. In the image below, the sample points in green are close to each other, whereas the two sample points in red are far apart from them. These red sample points are outliers.

How are outliers introduced into a dataset?

Now that we know what an outlier is, how are outliers introduced into a population or dataset? An outlier may be due simply to variability in the measurement, or it may indicate experimental error.

Outliers are first introduced to the population while gathering or collecting the data. Data can be collected in many ways, be it via interviews, questionnaires and surveys, observations, documents and records, focus groups, oral history, and so on; in this tech era, the internet, IT sensors, etc. are generating data for us.

Other possible causes of outliers are incorrect entry, misreporting of data or observations, sampling errors while doing the experiment, or an exceptional but true value. Often, you will not be aware of the outliers during the collection phase. The outliers can be the result of a mistake during data collection, or they can just be an indication of variance in your data.

If possible, outliers should be excluded from the data set. However, detecting anomalous instances might be difficult and is not always possible. Data science developers and statisticians do not like to declare outliers too quickly: a ‘too large’ number could be a data entry error, a scale problem, or just a really big number.

There are two main reasons why giving outliers special attention is a necessary aspect of the data analytics process:

1. Outliers may have a negative effect on the result of an analysis.
2. Outliers, or their behavior, may be the very information that a data analyst requires from the analysis.

Types of outliers
There are two kinds of outliers:

 A univariate outlier is an extreme value that relates to just one variable. For example, Sultan Kösen is currently the tallest man alive, with a height of 8 ft 2.8 in (251 cm). This case would be considered a univariate outlier, as it is an extreme case of just one factor: height.
 A multivariate outlier is a combination of unusual or extreme values for at least two variables. For example, if you are looking at both the height and weight of a group of adults, you might observe that one person in your dataset is 5 ft 9 in tall, a measurement that falls within the normal range for this variable. You may also observe that this person weighs 110 lbs; again, this observation alone falls within the normal range for the variable of interest, weight. However, when you consider these two observations together, you have an adult who is 5 ft 9 in tall and weighs 110 lbs, a surprising combination. That is a multivariate outlier.

Types of Outliers
There are mainly three types of outliers.

1. Point or global outliers: observations that are anomalous with respect to the majority of observations of a feature. In short, a data point is considered a global outlier if its value lies far outside the entirety of the data set in which it is found.
Example: In a class, all students' ages will be approximately similar, but if we see a record of a student with an age of 500, it is an outlier. It could be generated due to various reasons.

2. Contextual (conditional) outliers: observations considered anomalous in a specific context. A data point is considered a contextual outlier if its value significantly deviates from the rest of the data points in the same context. Note that this means the same value may not be considered an outlier if it occurs in a different context. If we limit our discussion to time series data, the “context” is almost always temporal, because time series data are records of a specific quantity over time. It is no surprise, then, that contextual outliers are common in time series data. In a contextual anomaly, the values are not outside the normal global range but are abnormal compared with the seasonal pattern.
If an individual data instance is anomalous in a specific context or condition (but not otherwise), it is termed a contextual outlier. The attributes of the data objects are accordingly divided into two groups:

⦁ Contextual attributes: define the context, e.g., time and location.

⦁ Behavioral attributes: characteristics of the object used in outlier evaluation, e.g., temperature.

Example: The world economy fell drastically in 2020 due to COVID-19, and the stock market crashed due to the scam in 1992. The usual data points are near each other, whereas the data points during such a specific period are either far above or far below them. This is not due to error; these are actual observed data points.

3. Collective outliers: a collection of observations that are anomalous but appear close to one another because they all have a similar anomalous value.

A subset of data points within a data set is considered anomalous if those values as a collection deviate significantly from the entire data set, even though the values of the individual data points are not themselves anomalous in either a contextual or a global sense. In time series data, one way this can manifest is as normal peaks and valleys occurring outside of the time frame when that seasonal sequence is normal, or as a combination of time series that is in an outlier state as a group. If a collection of data points is anomalous with respect to the entire data set, it is termed a collective outlier.

In the example, the anomalous drop in the number of successful purchases for three different product categories was discovered to be related across the categories, and the drops are combined into a single anomaly. For each time series the individual behavior does not deviate significantly from the normal range, but the combined anomaly indicates a bigger issue with payments.

Note: The Z-score (also called the standard score) is an important concept in statistics that indicates how far away a certain point is from the mean. By applying the Z-transformation we shift the distribution so that it has zero mean and unit standard deviation.
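A minimal sketch of Z-score based outlier flagging (the ages are hypothetical; on such a small, contaminated sample the outlier inflates the standard deviation, so a threshold below the usual |z| > 3 rule of thumb is used here):

import numpy as np

# The global-outlier example from above: one student recorded with age 500.
ages = np.array([19, 20, 20, 21, 19, 22, 20, 500], dtype=float)

z_scores = (ages - ages.mean()) / ages.std()

threshold = 2.5
print(ages[np.abs(z_scores) > threshold])   # -> [500.]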
Outlier detection methods

Outlier (anomaly) detection methods are broadly grouped by whether labeled examples of normal and anomalous points are available.

In supervised anomaly detection methods, the dataset has labels for normal and anomalous observations or data points, so supervised anomaly detection is a sort of binary classification problem. It should be noted that the datasets for anomaly detection problems are usually quite imbalanced, so it is important to use some resampling or data augmentation procedure (k-nearest-neighbor-based methods, ADASYN, SMOTE, random sampling, etc.) before applying supervised classification methods. Jordan Sweeney shows how to use the k-nearest-neighbor algorithm in a project on Education Ecosystem, Travelling Salesman - Nearest Neighbour.

Unsupervised anomaly detection is useful when there is no information about anomalies and related patterns. Isolation Forests, OneClassSVM, or k-means methods are used in this case. The main idea here is to divide all observations into several clusters and to analyze the structure and size of these clusters, as sketched below.
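A minimal unsupervised sketch using scikit-learn's IsolationForest (the data are synthetic, and the contamination value is an assumption about the expected fraction of outliers):

import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic 2-D data: a dense cluster plus three far-away points.
rng = np.random.RandomState(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
anomalies = np.array([[8.0, 8.0], [-7.0, 9.0], [9.0, -8.0]])
X = np.vstack([normal, anomalies])

model = IsolationForest(contamination=0.03, random_state=42)
labels = model.fit_predict(X)      # +1 = inlier, -1 = outlier

print(X[labels == -1])             # should include the three planted points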
