
E Book - Unit 4

Uploaded by

65kzmdy4xr
Copyright
© © All Rights Reserved

DATA ANALYSIS & TESTING OF HYPOTHESIS

TOPICS TO BE COVERED

• Descriptive analysis/statistics and inferential analysis/statistics,
• Hypothesis testing (concept, types of error, steps, types),
• Parametric tests with SPSS (z-test, t-test, F-test),
• Non-parametric tests with SPSS (Chi-square, Mann-Whitney U test, Kruskal-Wallis test),
• Multivariate analysis (Factor Analysis, Regression Analysis).
Statistics is the study of the collection, organization, analysis, interpretation, and presentation of data.

[Diagram: MATHEMATICS → STATISTICS → DESCRIPTIVE, INFERENTIAL]

DESCRIPTIVE STATISTICS

• Measure of Central Tendency
• Measure of Spread (Variability)
• Measure of Shape

Descriptive Statistics - Measure of Central Tendency

• Mean: the arithmetic average of a data set, found by adding all the values in the set and dividing by the number of observations.
MERITS & DEMERITS OF MEAN
Merits:

• The arithmetic mean is rigidly defined.
• It is easy to calculate and simple to understand.
• It is based on all observations of the given data.
• It is suitable for further mathematical and statistical analysis.

Demerits:

• It can be determined neither by inspection nor by graph.
• The arithmetic mean cannot be computed for qualitative data.
• The arithmetic mean cannot be computed for open-ended class intervals.
• It cannot be computed if one or more observations are missing.
• It is highly affected by extreme values (values that are very high or very low compared to the other observations).
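The last demerit can be seen in a small sketch. The salary figures below are hypothetical and chosen only for illustration; the code uses Python's standard `statistics` module.

```python
from statistics import mean

# Hypothetical monthly salaries (in thousands); illustrative values only.
salaries = [30, 32, 35, 31, 33]
typical_mean = mean(salaries)          # (30 + 32 + 35 + 31 + 33) / 5

# One extreme value drags the mean far away from the typical observations.
with_outlier = salaries + [300]
shifted_mean = mean(with_outlier)      # 461 / 6

print(typical_mean, shifted_mean)
```

With a single extreme observation added, the mean more than doubles even though five of the six values are unchanged.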

• Median: the middle value of a data set when the observations are listed in ascending or descending order. With an even number of observations, it is the average of the two middle values.
Advantages of Median

(1) Simplicity: - In the case of a simple statistical series, just a glance at the data is enough to locate the median value.

(2) Free from the effect of extreme values: - Unlike the arithmetic mean, the median is not distorted by extreme values in the series.

(3) Certainty: - Certainty is another merit of the median. The median is always a definite, specific value.

(4) Real value: - The median is usually an actual value of the series (always so when the number of items is odd), which can make it a better representative than the arithmetic mean, whose value may not exist in the series at all.

(5) Graphic presentation: - Besides the algebraic approach, the median can also be estimated from a graphic presentation of the data.

(6) Possible even when data is incomplete: - The median can be estimated even for certain incomplete series. It is enough to know the number of items and the middle item of the series.
Demerits of median:

(1) Lack of representative character: - The median is of limited representative character, as it is not based on all the items in the series.

(2) Unrealistic: - When the median is located somewhere between the two middle values, it remains only an approximate measure, not a precise value.

(3) Lack of algebraic treatment: - The arithmetic mean is capable of further algebraic treatment, but the median is not. For example, multiplying the median by the number of items in the series will not give the sum of the values of the series.
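The points above can be checked with a short sketch using Python's standard `statistics` module. The series are made-up values for illustration.

```python
from statistics import median

odd_series = [2, 3, 5, 7, 20]          # sorted, odd count: the middle item is 5
even_series = [2, 3, 5, 7]             # even count: average of 3 and 5

median_odd = median(odd_series)        # an actual value of the series
median_even = median(even_series)      # 4.0, which does not appear in the series

# Replacing the largest value with an extreme one leaves the median unchanged.
median_with_outlier = median([2, 3, 5, 7, 2000])

# Lack of algebraic treatment: median * n is not the sum of the series.
product = median_odd * len(odd_series)  # 25
total = sum(odd_series)                 # 37

print(median_odd, median_even, median_with_outlier, product, total)
```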

• Mode: the value that occurs most frequently in a data set. A data set can have one mode, more than one mode, or no clear mode at all.
Advantages of Mode

(1) Simple and popular: - The mode is a very simple measure of central tendency. Sometimes just a glance at the series is enough to locate the modal value.

(2) Less effect of marginal values: - Compared to the mean, the mode is less affected by marginal values in the series. The mode is determined only by the value with the highest frequency.

(3) Graphic presentation: - The mode can be located graphically with the help of a histogram.

(4) Best representative: - The mode is the value that occurs most frequently in the series. Accordingly, it is regarded as the most typical value of the series.

(5) No need of knowing all the items or frequencies: - The calculation of the mode does not require knowledge of all the items and frequencies of a distribution. In a simple series, it is enough to know the items with the highest frequencies in the distribution.
Demerits of mode:

(1) Uncertain and vague: - The mode is an uncertain and vague measure of central tendency.

(2) Not capable of algebraic treatment: - Unlike the mean, the mode is not capable of further algebraic treatment.

(3) Difficult: - When the frequencies of all items are identical, it is difficult to identify the modal value.

(4) Complex procedure of grouping: - Calculation of the mode involves a cumbersome procedure of grouping the data. If the extent of grouping changes, the modal value changes as well.

(5) Ignores extreme marginal frequencies: - The mode ignores extreme marginal frequencies. To that extent the modal value is not representative of all the items in a series. Besides, one can question the representative character of the modal value, as its calculation does not involve all items of the series.
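A minimal sketch of locating the mode, including the awkward cases mentioned in demerits (1) and (3). The data are hypothetical; `mode` and `multimode` come from Python's standard `statistics` module (Python 3.8 or later).

```python
from statistics import mode, multimode

shoe_sizes = [7, 8, 8, 9, 8, 10, 9]     # 8 occurs three times
most_common = mode(shoe_sizes)

# Two values tie for the highest frequency: the series is bimodal.
bimodal = [1, 2, 2, 3, 3, 4]
modes_bimodal = multimode(bimodal)

# Every value occurs once, so every value is returned and no single
# modal value can be identified.
flat = [1, 2, 3, 4]
modes_flat = multimode(flat)

print(most_common, modes_bimodal, modes_flat)
```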
Descriptive Statistics - Measure of Variability
• Range: the spread of your data from the lowest to the highest value in the distribution.

R = H – L, where H is the highest value and L is the lowest value.
• Standard deviation: the average amount of variability in your data set. It tells you, on average, how far each value lies from the mean. A high standard deviation means that values are generally far from the mean, while a low standard deviation indicates that values are clustered close to the mean.

• Variance: reflects the degree of spread in the data set. The more spread out the data, the larger the variance is in relation to the mean. Variance is the square of the standard deviation, so it is expressed in squared units (for example, cm² for data measured in cm) rather than in the units of the data itself.
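The three measures can be computed as in the sketch below. The heights are hypothetical, and the population formulas (`pvariance`, `pstdev`, dividing by n) are an assumption made here; for sample data, `statistics.variance` and `statistics.stdev` divide by n - 1 instead.

```python
from statistics import pstdev, pvariance

# Hypothetical heights in centimetres.
heights_cm = [150, 160, 170, 180, 190]

data_range = max(heights_cm) - min(heights_cm)   # R = H - L
var_cm2 = pvariance(heights_cm)                  # in squared units (cm^2)
sd_cm = pstdev(heights_cm)                       # back in the data's units (cm)

print(data_range, var_cm2, sd_cm)
```

The squared deviations from the mean of 170 are 400, 100, 0, 100 and 400, so the variance is 1000 / 5 = 200 cm² and the standard deviation is its square root.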
Descriptive Statistics - Measure of Shape

SKEWNESS
• Skewness is a statistical number that tells us whether a distribution is symmetric or not.
• A distribution is symmetric if the right side of the distribution mirrors the left side.
• If a distribution is symmetric (for example, the normal distribution), then mean = median = mode and the skewness value is 0.
• If skewness is greater than 0, the distribution is called right-skewed: the right tail is longer than the left tail.
• If skewness is less than 0, the distribution is called left-skewed: the left tail is longer than the right tail.

Relationship between Skewness and Mean, Median & Mode

• The mode is the apex (highest point) of the curve.
• The median is the middle value.
• The mean tends to be located towards the tail of the distribution.
• The coefficient of skewness compares the mean and the median in the light of the magnitude of the standard deviation.
• If the distribution is symmetrical, the coefficient is equal to zero.

Here X̄ is the mean, S is the standard deviation and Md is the median.
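The formula itself did not survive in the text. Assuming it is Pearson's second (median) coefficient of skewness, Sk = 3(X̄ − Md) / S, which matches the symbols listed above, it could be sketched as follows. The function name and the data are illustrative, not taken from the source.

```python
from statistics import mean, median, stdev

def pearson_median_skewness(values):
    """Pearson's second coefficient: 3 * (mean - median) / standard deviation.

    Assumed form of the coefficient; uses the sample standard deviation.
    """
    return 3 * (mean(values) - median(values)) / stdev(values)

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]   # one large value pulls the mean up
symmetric = [1, 2, 3, 4, 5]                # mean equals median

sk_right = pearson_median_skewness(right_skewed)
sk_symmetric = pearson_median_skewness(symmetric)

print(sk_right, sk_symmetric)
```

In the right-skewed series the mean (4.75) lies above the median (3), so the coefficient is positive; in the symmetric series the two coincide and the coefficient is zero.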

Descriptive Statistics - Measure of Shape

KURTOSIS

• Kurtosis measures the peakedness of a distribution, together with the heaviness of its tails.
• If a distribution is similar to the normal distribution, the kurtosis value is 0 (measured as excess kurtosis, i.e. kurtosis minus 3).
• If kurtosis is greater than 0, the distribution has a higher peak than the normal distribution.
• If kurtosis is less than 0, the distribution is flatter than the normal distribution.
• There are three types of distributions:
  • Leptokurtic: sharply peaked with fat tails (kurtosis greater than 0).
  • Mesokurtic: medium peaked (kurtosis equal to 0).
  • Platykurtic: flattest peak and highly dispersed (kurtosis less than 0).
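As a rough illustration, excess kurtosis can be computed from the second and fourth central moments. This sketch uses the simple population (moment-based) definition; statistical packages often apply a small-sample correction, so their values may differ slightly. The function name and data are illustrative.

```python
from statistics import fmean

def excess_kurtosis(values):
    """Fourth central moment divided by the squared variance, minus 3."""
    m = fmean(values)
    m2 = fmean([(x - m) ** 2 for x in values])   # population variance
    m4 = fmean([(x - m) ** 4 for x in values])   # fourth central moment
    return m4 / m2 ** 2 - 3

# Most values sit at the centre with two far-out values: peaked, fat tails.
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]
# Values spread evenly with no peak: flat.
flat = [1, 2, 3, 4, 5, 6]

k_heavy = excess_kurtosis(heavy_tailed)   # positive: leptokurtic
k_flat = excess_kurtosis(flat)            # negative: platykurtic

print(k_heavy, k_flat)
```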
**********************************************************************

Inferential analysis

INFERENTIAL STATISTICS

• Confidence Interval
• Hypothesis Testing
• Regression Analysis

1. Confidence Interval

• A confidence interval uses the variability around a statistic to come up with an interval
estimate for a parameter.
• The confidence level is the percentage of such intervals that would contain the true parameter if the study were repeated many times.
• A 95% confidence level means that if you repeat your study with a new sample in exactly the same way 100 times, you can expect about 95 of the resulting intervals to contain the true parameter value.
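A minimal sketch of an interval estimate for a mean, assuming a reasonably large sample so that the normal approximation is acceptable (for small samples a t-based interval would be used instead). The function name and the synthetic scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval: mean +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    centre = mean(sample)
    margin = z * stdev(sample) / sqrt(len(sample))
    return centre - margin, centre + margin

# Synthetic test scores (42 values between 50 and 56), for illustration only.
scores = [50 + (i % 7) for i in range(42)]

low_95, high_95 = mean_confidence_interval(scores, 0.95)
low_99, high_99 = mean_confidence_interval(scores, 0.99)

print((low_95, high_95), (low_99, high_99))
```

A higher confidence level gives a wider interval: the 99% interval contains the 95% interval.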
2. Regression Analysis

• Regression tests show whether changes in predictor variables are associated with changes in an outcome variable. On its own, regression does not prove that one variable causes the other.
• The dependent variable's response to a unit change in the independent variable is examined through linear regression.
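The "response to a unit change" is the slope of the fitted line. A least-squares sketch for one predictor is shown below; the function name and the study-hours data are hypothetical and constructed so that the relationship is exactly linear.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx                     # change in y per unit change in x
    intercept = y_bar - slope * x_bar
    return slope, intercept

hours_studied = [1, 2, 3, 4, 5]
exam_scores = [52, 54, 56, 58, 60]        # built as 50 + 2 * hours

slope, intercept = fit_line(hours_studied, exam_scores)
print(slope, intercept)
```

Here each additional hour of study is associated with a 2-point higher score, and the intercept of 50 is the predicted score at zero hours.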

3. Hypothesis testing

Null Hypothesis (H0) – There is no significant relationship between two variables.

Alternate Hypothesis (Ha) – There is a significant relationship between the variables.

Ex:
H0 - There is no significant relationship between eating sugar and weight gain.
Ha - There is a significant relationship between eating sugar and weight gain.
PARAMETRIC TEST V/S NON-PARAMETRIC TEST

1. Parametric test – Z Test

• A z-test checks whether a population mean differs from a hypothesised value (one-sample test) or whether the means of two populations differ (two-sample test), provided the data follow a normal distribution.
• It is used for large samples when the population variance is known.
• The null hypothesis and the alternative hypothesis must be set.
• The sample size should be greater than 30.
• The z-test formula is used to set up the required hypothesis tests for a one-sample and a two-sample z-test.

Z-test formula for one sample:

$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} $$

where x̄ = sample mean, μ = population mean, σ = population standard deviation and n = the sample size.
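A minimal sketch of the one-sample z-test in Python, using the symbols defined above. The function name, the example figures and the two-sided p-value are illustrative additions, not part of the source.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, mu, sigma, n):
    """z = (x_bar - mu) / (sigma / sqrt(n)), with a two-sided p-value."""
    z = (sample_mean - mu) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical example: 100 observations with mean 52, tested against
# a population mean of 50 with known standard deviation 10.
z_one, p_one = one_sample_z(sample_mean=52, mu=50, sigma=10, n=100)
print(z_one, p_one)
```

The standard error is 10 / √100 = 1, so z = 2. The two-sided p-value is a little under 0.05, so H0 would be rejected at the 5% level.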
• A two-sample z-test is used to test whether two population means are equal.
• This test assumes that the standard deviation of each population is known.
• Both samples are assumed to come from normally distributed populations.
• H0: μ1 = μ2 (the two population means are equal)
• HA: μ1 ≠ μ2 (the two population means are not equal)

FORMULA –

$$ z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}} $$

where x̄1, x̄2 are the sample means, σ1, σ2 are the population standard deviations and n1, n2 are the sample sizes.
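The two-sample case can be sketched the same way. The function name, the example figures and the two-sided p-value are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2):
    """z for H0: mu1 = mu2 with known population standard deviations."""
    standard_error = sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z = (mean1 - mean2) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical groups: means 54 and 50, both sigmas 10, 50 observations each.
z_two, p_two = two_sample_z(54, 50, 10, 10, 50, 50)
print(z_two, p_two)
```

The standard error is √(100/50 + 100/50) = 2, so z = 4 / 2 = 2.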

2. Parametric test – T Test

• A t-test (also known as Student's t-test) is a tool for evaluating the means of one or two populations using hypothesis testing.
• When choosing a t-test, you will need to consider two things: whether the groups being compared come from a single population or two different populations, and whether you want to test the difference in a specific direction.

t-Test assumptions
• The data are continuous.
• The sample data have been randomly sampled from a population.
• There is homogeneity of variance (i.e., the variability of the data in each group is
similar).
• The distribution is approximately normal.
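Under these assumptions (in particular, homogeneity of variance), an independent two-sample t statistic can be sketched with a pooled variance. The function name and the data are hypothetical. Only the statistic and its degrees of freedom are computed here; the p-value would be read from a t table or from statistical software such as SPSS.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(group_a, group_b):
    """Independent two-sample t statistic assuming equal variances."""
    n1, n2 = len(group_a), len(group_b)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(group_a)
                  + (n2 - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

group_a = [12, 14, 16, 18, 20]   # mean 16, sample variance 10
group_b = [10, 12, 14, 16, 18]   # mean 14, sample variance 10

t_stat, df = pooled_t_statistic(group_a, group_b)
print(t_stat, df)
```

With a pooled variance of 10, the standard error is √(10 × 0.4) = 2, giving t = 1 on 8 degrees of freedom.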

3. Parametric test – F Test

• The F-test uses the F statistic to check whether the variances of two samples (or populations) are equal.
• The populations should have a normal distribution.
• The F-test is also commonly employed to assess the fit of a proposed regression model to the data, evaluating how well the model explains the variability in the data.
• The F statistic for comparing two variances is the ratio of the two sample variances:

$$ F = \frac{s_1^2}{s_2^2} $$

where s1² and s2² are the variances of the two samples.
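A sketch of the variance-ratio F statistic, taking F as the ratio of the two sample variances. Placing the larger variance in the numerator is a common convention assumed here, not something stated in the source; the function name and data are hypothetical.

```python
from statistics import variance

def f_statistic(sample_a, sample_b):
    """Ratio of sample variances, larger variance in the numerator.

    Returns (F, numerator degrees of freedom, denominator degrees of freedom).
    """
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1

sample_a = [12, 14, 16, 18, 20]   # sample variance 10
sample_b = [14, 15, 16, 17, 18]   # sample variance 2.5

f_value, df_num, df_den = f_statistic(sample_a, sample_b)
print(f_value, df_num, df_den)
```

An F value close to 1 is consistent with equal variances; here F = 10 / 2.5 = 4 on (4, 4) degrees of freedom, which would be compared with the critical value from an F table.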
