Chapter 2: Statistical Analysis
What is Statistics?
Statistics is the science of collecting, organizing, analyzing, and interpreting data.
Branches of Statistics:
Descriptive Statistics
Inferential Statistics
Note-
1. Since the Median and Mode do not take all the data points into account, they are robust to outliers, i.e. they are not affected by outliers.
Sampling error arises any time you use a sample, even if your
sample is random and unbiased. For this reason, there is
always some uncertainty in inferential statistics.
Hypothesis testing
Hypothesis testing is a formal process of statistical analysis
using inferential statistics. The goal of hypothesis testing is to
compare populations or assess relationships between variables
using samples.
Types of hypothesis
Null Hypothesis.
The null hypothesis, H0, is the commonly accepted fact; it is the opposite of the alternate hypothesis.
Researchers work to reject, nullify or disprove the null
hypothesis. Researchers come up with an alternate
hypothesis, one that they think explains a
phenomenon, and then work to reject the null
hypothesis.
Why is it Called the “Null”?
The word “null” in this context means that it’s a commonly accepted fact that researchers work to nullify. It doesn’t mean that the statement itself is null (i.e. amounts to nothing)! (Perhaps the term should be “the nullifiable hypothesis”, as that might cause less confusion.)
Why Do I Need to Test It? Why Not Just Prove an Alternate One?
The short answer is: as a scientist, you are required to; it’s part of the scientific process. Science uses a battery of processes to prove or disprove theories, making sure that any new hypothesis has no flaws. Including both a null and an alternate hypothesis is one safeguard to ensure your research isn’t flawed. Not including the null hypothesis in your research is considered very bad practice by the scientific community. If you set out to prove an alternate hypothesis without considering the null, you are likely setting yourself up for failure. At a minimum, your experiment will likely not be taken seriously.
Example
Not so long ago, people believed that the world was flat.
H0: The world is flat.
Ha: The world is round.
(Note that the null hypothesis always carries the equality side of a claim: =, ≤, or ≥.)
2. z-test:
A z-test is mainly used when the data is normally distributed, and when the population mean and standard deviation are given.
The one-sample z-test is mainly used for comparing the mean of a sample to some hypothesized mean of a given population when the population variance is known. The main analysis is to check whether the mean of the sample is reflective of the population being considered.
We would use a z-test if:
Our sample size is greater than 30. Otherwise, use a t-test.
Data points are independent of each other. In other words, one data point isn't related to and doesn't affect another data point.
Our data is normally distributed. However, for large sample sizes (over 30) this doesn't always matter.
Our data is randomly selected from the population, where each item has an equal chance of being selected.
Sample sizes are equal if at all possible.
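To make the procedure concrete, here is a minimal sketch of a one-sample z-test in Python; the sample values, the hypothesized mean of 50, and the known population standard deviation of 2.5 are all invented for illustration.

```python
# A minimal sketch of a one-sample z-test, assuming the population
# standard deviation is known; all numbers below are made up.
import numpy as np
from scipy import stats

sample = np.array([52.1, 48.3, 55.0, 50.7, 49.9, 53.4, 51.2, 47.8,
                   54.6, 50.1, 52.9, 48.7, 51.5, 49.2, 53.8, 50.4,
                   52.3, 49.6, 51.9, 50.8, 53.1, 48.9, 52.6, 50.2,
                   51.7, 49.4, 52.0, 50.6, 51.3, 49.8, 52.4])

mu0 = 50.0      # hypothesized population mean (H0: mu = 50)
sigma = 2.5     # known population standard deviation
n = len(sample) # 31 observations, so n > 30

# z = (x̄ − μ0) / (σ / √n)
z = (sample.mean() - mu0) / (sigma / np.sqrt(n))

# two-tailed p-value from the standard normal distribution
p = 2 * stats.norm.sf(abs(z))

print(f"z = {z:.3f}, p = {p:.4f}")
if p < 0.05:
    print("Reject H0: the sample mean differs from 50.")
else:
    print("Fail to reject H0.")
```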
Unlike the z- and t-distributions, the F-distribution does not have any negative values, because between-group and within-group variability are always positive due to squaring each deviation.
A one-way F-test (ANOVA) tells whether two or more groups are similar or not, based on the similarity of their means and the F-score.
4. Chi-square Test:
The test is applied when we have two categorical variables from a
single population. It is used to determine whether there is a
significant association between the two variables.
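As an illustration, here is a hedged sketch of this test using scipy.stats.chi2_contingency; the 2×2 table of counts is invented.

```python
# A sketch of a chi-square test of independence on a 2x2 contingency
# table; the counts below are invented for illustration.
import numpy as np
from scipy.stats import chi2_contingency

# rows: one categorical variable, columns: the other (hypothetical counts)
observed = np.array([[30, 10],
                     [15, 25]])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}, dof = {dof}")
```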
Introduction
Table of Contents
T-test
Z-test
F-test
ANOVA
Chi-square
Mann-Whitney U-test
Kruskal-Wallis H-test
Parametric Tests
Parametric tests make assumptions about the parameters of the population distribution, such as the Mean and Standard Deviation.
Non-parametric Tests
In Non-Parametric tests, we don’t make any assumption about the parameters of the population we are studying. In fact, these tests don’t depend on the population. Hence, there is no fixed set of parameters, and no distribution (normal distribution, etc.) of any kind is assumed.
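As a concrete example of a non-parametric test, here is a minimal sketch of a Mann-Whitney U test, which compares two independent samples without assuming any distribution; the data are invented.

```python
# A minimal sketch of a Mann-Whitney U test; the two groups
# of values are invented for illustration.
from scipy.stats import mannwhitneyu

group_a = [12, 15, 11, 18, 14, 13, 16]
group_b = [22, 19, 24, 21, 20, 23, 18]

u_stat, p = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p:.4f}")
```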
T-Test
A T-test can be a:
one-sample t-test, comparing the mean of a single sample to a hypothesized population mean;
two-sample (independent) t-test, comparing the means of two independent groups;
paired t-test, comparing the means of the same group measured at two different times.
The one-sample t-statistic is calculated as:
t = (x̄ − μ) / (s / √n)
where,
x̄ is the sample mean, μ is the hypothesized population mean, s is the sample standard deviation, and n is the sample size.
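To make the formula concrete, here is a minimal sketch of a one-sample t-test using scipy.stats.ttest_1samp; the sample values and the hypothesized mean of 100 are invented.

```python
# A minimal sketch of a one-sample t-test; the data and the
# hypothesized mean are invented for illustration.
from scipy.stats import ttest_1samp

sample = [98.2, 101.5, 99.8, 102.3, 97.6, 100.9, 99.1, 101.2, 98.7, 100.4]

t_stat, p = ttest_1samp(sample, popmean=100.0)
print(f"t = {t_stat:.3f}, p = {p:.4f}")
```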
Z-Test
The two-sample z-statistic is calculated as:
z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2)
where,
x̄1 is the sample mean of the 1st group, x̄2 is the sample mean of the 2nd group, σ1² and σ2² are the population variances, and n1 and n2 are the two sample sizes.
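As an illustration of this formula, here is a hedged sketch computing the two-sample z-statistic by hand; all means, variances, and sample sizes below are invented.

```python
# A sketch of the two-sample z-statistic, assuming the population
# variances are known; all numbers here are invented.
import math
from scipy.stats import norm

mean1, mean2 = 105.0, 100.0   # sample means of the two groups
var1, var2 = 16.0, 25.0       # known population variances
n1, n2 = 40, 50               # sample sizes

z = (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)
p = 2 * norm.sf(abs(z))       # two-tailed p-value
print(f"z = {z:.3f}, p = {p:.4f}")
```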
F-Test
5. It is calculated as:
F = s1² / s2²
where s1² and s2² are the two sample variances.
6. By changing the variances in the ratio, the F-test has become a very flexible test. It can then be used to test whether two population variances are equal or, as in ANOVA, to compare the variability between groups with the variability within groups.
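To make items 5 and 6 concrete, here is a minimal sketch of an F-test for the equality of two variances; the samples are invented, and the two-tailed p-value uses the common convention of doubling the upper-tail probability.

```python
# A minimal sketch of an F-test for equality of two variances;
# the samples are invented for illustration.
import numpy as np
from scipy.stats import f

a = np.array([21.5, 24.1, 19.8, 23.4, 22.0, 25.2, 20.7, 23.9])
b = np.array([22.3, 22.8, 21.9, 23.1, 22.5, 22.0, 23.4, 21.7])

s_a, s_b = np.var(a, ddof=1), np.var(b, ddof=1)

# put the larger sample variance in the numerator so that F >= 1
if s_a >= s_b:
    F, df1, df2 = s_a / s_b, len(a) - 1, len(b) - 1
else:
    F, df1, df2 = s_b / s_a, len(b) - 1, len(a) - 1

p = 2 * f.sf(F, df1, df2)   # two-tailed p-value
print(f"F = {F:.3f}, p = {p:.4f}")
```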
ANOVA
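ANOVA (Analysis of Variance) applies the F-test to compare the means of two or more groups by contrasting between-group with within-group variability. As an illustration, here is a minimal sketch of a one-way ANOVA with scipy.stats.f_oneway; the three groups of scores are invented.

```python
# A minimal sketch of a one-way ANOVA comparing three group means;
# the data are invented for illustration.
from scipy.stats import f_oneway

group1 = [85, 90, 88, 75, 95, 82]
group2 = [70, 65, 80, 72, 68, 74]
group3 = [88, 92, 85, 90, 87, 91]

F, p = f_oneway(group1, group2, group3)
print(f"F = {F:.3f}, p = {p:.4f}")
```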
Chi-Square Test
8. It is calculated as:
χ² = Σ (O − E)² / E
where O is the observed frequency and E is the expected frequency.
9. Chi-square is also used to test the independence of two variables.
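To show the formula in action, here is a small sketch applying χ² = Σ (O − E)² / E as a goodness-of-fit check; the observed frequencies (a die rolled 60 times) are invented.

```python
# A small sketch computing the chi-square statistic by hand;
# the frequencies are invented for illustration.
import numpy as np
from scipy.stats import chi2

observed = np.array([8, 12, 9, 11, 13, 7])
expected = np.full(6, 10.0)          # fair die: 60 rolls / 6 faces

chi_sq = np.sum((observed - expected) ** 2 / expected)
dof = len(observed) - 1
p = chi2.sf(chi_sq, dof)
print(f"chi2 = {chi_sq:.3f}, p = {p:.4f}")
```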
One-Way ANOVA
Sample size guideline: for 2-9 groups, each group should have at least 15 observations; for 10-12 groups, each group should have at least 20 observations.
It’s true that nonparametric tests don’t require data that are
normally distributed. However, nonparametric tests have the
disadvantage of an additional requirement that can be very
hard to satisfy. The groups in a nonparametric analysis typically
must all have the same variability (dispersion).
Nonparametric analyses might not provide accurate results
when variability differs between groups.
If your data use the ordinal Likert scale and you want to
compare two groups, read my post about which analysis you
should use to analyze Likert data.
Point Estimates
A point estimate is a single value computed from sample data that serves as the best guess for an unknown population parameter; for example, the sample mean is a point estimate of the population mean.
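As a small illustration, here is a sketch computing point estimates from an invented sample.

```python
# A minimal sketch of point estimation: the sample mean and sample
# standard deviation estimate the population parameters; data invented.
import numpy as np

sample = np.array([4.2, 5.1, 3.8, 4.9, 5.3, 4.4, 4.7, 5.0])

mean_hat = sample.mean()       # point estimate of the population mean
std_hat = sample.std(ddof=1)   # point estimate of the population std
print(f"estimated mean = {mean_hat:.3f}, estimated std = {std_hat:.3f}")
```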
Variables
Similarity measure: a numeric measure of how alike two data objects are (higher when objects are more alike).
Dissimilarity measure: a numeric measure of how different two data objects are (higher when objects are less alike).
Outliers:
interesting exceptions, e.g. credit card fraud
points at the boundaries of clusters
Note: before computing dissimilarities, numeric attributes are usually standardized, in the following steps: compute the mean of attribute f, m_f = (x_1f + x_2f + … + x_nf) / n; compute the mean absolute deviation, s_f = (|x_1f − m_f| + |x_2f − m_f| + … + |x_nf − m_f|) / n; then standardize the attribute by the z-score z_if = (x_if − m_f) / s_f.
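A hedged sketch of these standardization steps in Python, assuming the reconstruction above; the attribute values are invented.

```python
# A sketch of standardizing one numeric attribute with the mean
# absolute deviation; the values are invented for illustration.
import numpy as np

x = np.array([12.0, 15.0, 11.0, 20.0, 14.0])   # values of one attribute f

m_f = x.mean()                    # mean of attribute f
s_f = np.mean(np.abs(x - m_f))    # mean absolute deviation
z = (x - m_f) / s_f               # standardized (z-score) values
print(z)
```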
The Dissimilarity of Numeric Data
One of the common fundamental tasks in data mining is
calculating the differences between objects. Likewise, in
any other calculation and validation step, there are
some measures to calculate the dissimilarity of numeric
data. In this article, we will discuss the Euclidean and
Manhattan distance as the two most common distance
measures in the dissimilarity of objects described by
numeric attributes. Moreover, there is also a specific
section for the Minkowski distance as the generalization
of Euclidean and Manhattan distance.
1. Introduction
The Euclidean distance formula, as its name suggests, gives the distance between two points, i.e. the straight-line distance. Let us assume that (x1, y1) and (x2, y2) are two points in a two-dimensional plane. The Euclidean distance formula is:
d = √((x2 − x1)² + (y2 − y1)²)
Example of the Euclidean distance
Euclidean distance is the straight line between the starting point and the destination. If we consider two objects i and j, each described by p numeric attributes, then
d(i, j) = √((x_i1 − x_j1)² + (x_i2 − x_j2)² + … + (x_ip − x_jp)²)
Example of the Manhattan distance
Manhattan (city block) distance is the distance travelled along axis-aligned segments, as when walking a grid of city streets:
d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
Both measures satisfy the defining properties of a distance metric:
Non-negativity: d(i, j) ≥ 0, with d(i, j) = 0 only when i = j.
Symmetry: d(i, j) = d(j, i).
Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j).
Minkowski distance
The Minkowski distance is the generalization of both:
d(i, j) = (|x_i1 − x_j1|^h + |x_i2 − x_j2|^h + … + |x_ip − x_jp|^h)^(1/h), h ≥ 1
where h = 1 gives the Manhattan distance and h = 2 gives the Euclidean distance.
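To tie the three measures together, here is a minimal sketch computing them for two invented points i and j; Minkowski with h = 1 reproduces the Manhattan distance and h = 2 the Euclidean distance.

```python
# A minimal sketch of the Euclidean, Manhattan, and Minkowski
# distances; the two points are invented for illustration.
import numpy as np

i = np.array([1.0, 2.0])
j = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((i - j) ** 2))   # 5.0
manhattan = np.sum(np.abs(i - j))           # 7.0

def minkowski(a, b, h):
    return np.sum(np.abs(a - b) ** h) ** (1.0 / h)

print(euclidean, manhattan, minkowski(i, j, 1), minkowski(i, j, 2))
```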
Types of Outliers
There are mainly 3 types of outliers:
1. Global (Point) Outliers: observations that deviate significantly from the rest of the entire data set, regardless of any context.
2. Contextual (Conditional) Outliers: observations considered anomalous given a specific context. A data point is considered a contextual outlier if its value significantly deviates from the rest of the data points in the same context. Note that this means the same value may not be considered an outlier if it occurred in a different context. If we limit our discussion to time series data, the “context” is almost always temporal, because time series data are records of a specific quantity over time. It’s no surprise, then, that contextual outliers are common in time series data. In a contextual anomaly, values are not outside the normal global range but are abnormal compared to the seasonal pattern.
If an individual data instance is anomalous in a specific context or condition (but not otherwise), then it is termed a contextual outlier. Attributes of the data are accordingly split into contextual attributes, which define the context (e.g. time in a time series), and behavioral attributes, whose values are judged anomalous or not within that context.