Descriptive Stats (Unit 1) New Syllabus

Uploaded by

Vishal Dagar
DESCRIPTIVE STATISTICS

Unit 1
SYLLABUS unit 1
Concepts

• “Statistics is the science of learning from data and making decisions using the wealth of
information available to us.”
- Nicholas Horton

• “Statistics is about the development of methods for the collection and analysis of data
in order to answer specific questions in an unbiased way, so that the conclusions
depend only on the data and not on any preconceived ideas.”
- Bryan Manly
Branches of Statistics
Population and Sample in Statistics
The difference between a population characteristic (parameter) and the corresponding sample
characteristic (statistic) is called the sampling error.
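A minimal sketch of this idea in Python. The population values and the way the sample is picked are made up purely for illustration; they are not from the slides.

```python
from statistics import mean

# Hypothetical population: marks of all 10 students in a class (made-up data).
population = [12, 15, 18, 20, 22, 25, 27, 30, 33, 38]

# Hypothetical sample: every other student, starting with the first.
sample = population[::2]

mu = mean(population)        # population parameter
x_bar = mean(sample)         # sample statistic
sampling_error = x_bar - mu  # statistic minus parameter

print(mu, x_bar, sampling_error)
```

A different sample from the same population would generally give a different sample mean and hence a different sampling error.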
Descriptive Statistics
Descriptive statistics is the term given to the analysis of data that helps describe, show or summarize data in
a meaningful way such that, for example, patterns might emerge from the data. Descriptive statistics do not,
however, allow us to make conclusions beyond the data we have analysed or reach conclusions regarding
any hypotheses we might have made. They are simply a way to describe our data.
Typically, there are two general types of statistic that are used to describe data:
- Measures of central tendency: try to locate a representative value around which the data are centred
- Measures of dispersion: try to measure the spread in the data

Descriptive statistics can be applied to both the population and the sample. The properties/characteristics of
populations, like the mean or standard deviation, are called parameters, as they represent the whole
population (i.e., everybody you are interested in). The properties/characteristics of the sample are
called sample statistics.

The branch of statistics that helps summarize the data (whether of a population or of a sample) is called
descriptive statistics.
If we use the sample attributes/statistics to predict the population parameters or infer about
the population, it is inferential statistics.
Other measures include quartiles, percentiles, skewness and kurtosis.
Data

• Data is any useful information that is capable of being processed, i.e. treated and
analyzed.
• Information or data can be quantitative or qualitative.
• It can be collected directly by the researcher (primary data) or picked up from
sources that already exist, i.e. it is pre-collected (secondary data).
• Secondary data can be collected from reliable sources such as the World Bank, IMF, RBI,
NSSO, Kaggle, etc.
Classification of Data

Data
• Qualitative (categorical): Nominal, Ordinal
• Quantitative (numerical): Interval, Ratio; either Discrete or Continuous


Frequency and relative frequency
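As a rough illustration of frequency and relative frequency, the sketch below counts how often each value occurs and divides each count by the number of observations. The data are made up for the example.

```python
from collections import Counter

# Hypothetical discrete data: number of siblings reported by 10 students.
data = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]

frequency = Counter(data)   # value -> number of times it occurs
n = len(data)

# Relative frequency = frequency / total number of observations.
relative_frequency = {value: count / n for value, count in frequency.items()}

for value in sorted(frequency):
    print(value, frequency[value], relative_frequency[value])
```

The relative frequencies always add up to 1, which makes distributions based on different numbers of observations comparable.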
Frequency Distributions
Continuous frequency distribution
Continuous Dataset
Bar Graphs
Histogram of Discrete Variable
Frequency Distribution of Continuous Variable
Histogram
Ex 1.2
Do Ques 23,29…. Ex 1.2
Measures of Central Tendency
• More formal data analysis often requires the calculation and
interpretation of numerical summary measures.

• Measurements that characterize the data set and convey some of its
salient features.

• These focus on numerical data primarily

• The important measures of central tendency are the mean, median and mode.
Prerequisites of a Good Average
Mean
Population Mean

Limitation of Mean
Mean
Median
The sample median is insensitive to outliers
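A small sketch of this point, using Python's standard `statistics` module. The marks are made-up values; the second list differs from the first only in its last observation, which is turned into an outlier.

```python
from statistics import mean, median, mode

# Hypothetical marks (made-up data).
marks = [10, 12, 12, 14, 17]
# Same data, but the largest value is replaced by an extreme outlier.
marks_with_outlier = [10, 12, 12, 14, 170]

print(mean(marks), median(marks), mode(marks))
print(mean(marks_with_outlier), median(marks_with_outlier), mode(marks_with_outlier))
```

The mean moves from 13 to 43.6, while the median and mode stay at 12 in both cases.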
Which measure of Central Tendency to use
Level of Measurement
Which measure of Central Tendency to Use
Shape of the Distribution : skewness
Shape of the distribution
Which measure of Central tendency to use?
Other Measures: Percentiles, Quartiles, Trimmed Mean
Trimmed Mean
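One common way to compute a trimmed mean is to drop the same number of observations from each end of the ordered data and average what remains. The sketch below assumes that convention; the function name `trimmed_mean` and the parameter `k` (observations dropped per end) are illustrative choices, and textbooks often state the trimming as a percentage instead.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest observations."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must satisfy 0 <= 2k < n")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [10, 12, 12, 14, 170]   # made-up data with one outlier
print(trimmed_mean(data, 0))   # ordinary mean
print(trimmed_mean(data, 1))   # one observation trimmed from each end
```

With `k = 0` this is the ordinary mean; as `k` grows, the result moves towards the median.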
QUARTILES
• Quartiles are numerical measures that divide the data into FOUR equal parts
• There are 3 quartiles
• The three quartiles are Q1(lower Quartile), Q2(median) and Q3(Upper Quartile)
• IQR=Q3-Q1
Calculation of Quartiles

• Since these are positional averages, arranging the data in order (preferably ascending) is a must.
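A minimal sketch of one way to compute the quartiles, assuming the "median of halves" convention (for an odd number of observations the middle value is left out of both halves). Other textbooks and software packages use slightly different rules, so results can differ a little. The data are made up.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)   # positional measures need ordered data
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n, skip the middle observation when forming the upper half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

data = [7, 3, 9, 1, 5, 11, 13, 15]   # made-up data, deliberately unsorted
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```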

BOX PLOTS TO IDENTIFY OUTLIERS
Box Plots
Box Plots and Skewness
Box Plots and Dispersion
Box Plots and Outliers
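A widely used rule for flagging outliers on a box plot places "fences" 1.5 × IQR below Q1 and above Q3. The sketch below assumes that rule; the function names, the made-up data and the quartile values passed in are illustrative only.

```python
def outlier_fences(q1, q3, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3, k=1.5):
    """Observations lying outside the fences."""
    low, high = outlier_fences(q1, q3, k)
    return [x for x in values if x < low or x > high]

data = [1, 3, 5, 7, 9, 11, 13, 40]   # made-up data
q1, q3 = 4, 12                        # quartiles assumed already computed for this data

print(outlier_fences(q1, q3))
print(find_outliers(data, q1, q3))
```

Here IQR = 8, so the fences are -8 and 24, and only the value 40 is flagged.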
8-3.2=4.8
12.53
OR
Measures of Variation/Dispersion
• At times, the measures of central tendency (mean, median and mode) are not sufficient to describe the
data, because series with the same mean can look very different. We therefore require
additional measures called the MEASURES OF DISPERSION.
• E.g. in this example, the mean of the three series is the same, but the scatter of marks is different.

• Measures of dispersion are values that tell us how the observations in a dataset scatter around a
central value.

• The greater the scatter, the higher the value of the measure and the lower the uniformity in the
dataset.
Commonly used measures of Dispersion are:
• Range
• Inter-Quartile Range or Quartile Deviation
• Standard Deviation
• Variance

• Range is the difference between the largest and smallest values of the dataset.

• Inter-Quartile Range is the difference between the upper quartile and the lower quartile.

• Quartile Deviation is also called the semi-interquartile range. It is IQR/2.
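The following sketch illustrates these measures on three made-up series that share the same mean but differ in scatter. The series are invented for the example and are not the ones shown on the slides.

```python
from statistics import mean, pvariance

# Three made-up series with the same mean but different scatter.
series = {
    "A": [30, 30, 30, 30, 30],
    "B": [20, 25, 30, 35, 40],
    "C": [0, 15, 30, 45, 60],
}

def data_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def quartile_deviation(q1, q3):
    """Semi-interquartile range: IQR / 2."""
    return (q3 - q1) / 2

for name, values in series.items():
    # Same mean for every series; range and variance grow with the scatter.
    print(name, mean(values), data_range(values), pvariance(values))

print(quartile_deviation(4, 12))   # quartiles 4 and 12 are illustrative inputs
```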


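$\sum (x_i - \bar{x}) = 0$: the sum of deviations from the mean is always zero.
$\sum (x_i - \bar{x})^2$ is minimum: the sum of squared deviations is smallest when the deviations are taken from the mean.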
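Both properties follow from the definition of the mean. A short derivation, where $a$ stands for any other reference value (a symbol introduced here only for the argument):

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0

\sum_{i=1}^{n} (x_i - a)^2
  = \sum_{i=1}^{n} (x_i - \bar{x})^2
    + 2(\bar{x} - a)\sum_{i=1}^{n} (x_i - \bar{x})
    + n(\bar{x} - a)^2
  = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - a)^2
  \ge \sum_{i=1}^{n} (x_i - \bar{x})^2
```

The middle term vanishes because of the first property, and equality holds only when $a = \bar{x}$.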
• Standard Deviation and Variance

Variance is the average of the squared deviations from the mean. Standard deviation is the square root of
the variance.
Why square the Differences?
Variance is comparatively difficult to interpret
For example, suppose the variable X mentioned alongside were the length of beams
measured in cm instead of marks.

Then the average length of beams in each of samples 1, 2 and 3 is 30 cm.
But the variance of lengths, i.e. the scatter of lengths around a central value, is 120 sq cm
for the first sample and 600 sq cm for the third sample (treating them as populations).

If we state the standard deviation of beam lengths, it is 10.95 cm for the first sample. This makes
much more sense, as we are measuring how much the length of beams in a sample differs from the
mean length, in the same unit as the data.
Variance, despite this limitation, is widely used in statistical analysis.
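The change of units can be checked directly from the two variances quoted above. The variable names are illustrative.

```python
from math import sqrt

# Population variances quoted in the text, in square centimetres.
variance_sample_1 = 120
variance_sample_3 = 600

# Taking the square root brings the measure back to centimetres.
sd_sample_1 = sqrt(variance_sample_1)
sd_sample_3 = sqrt(variance_sample_3)

print(round(sd_sample_1, 2), round(sd_sample_3, 2))
```

The first value reproduces the 10.95 cm stated in the text.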
Why n-1 for a sample?
• As population data is seldom available, we use the sample mean $\bar{x}$ as an estimator of the population mean μ.
• Likewise, we use the sample variance $s^2$ to estimate the population variance $\sigma^2$.
• Deviations of X are hence measured from the sample mean $\bar{x}$. Since the data points of a sample are
expected to be farther from μ than from $\bar{x}$, the squared deviations from the sample mean used in the
numerator underestimate the deviations from μ.
• To get an unbiased estimator, we reduce the denominator to balance the underestimation by the
numerator and therefore divide by n-1.
• n-1 is referred to as the degrees of freedom, i.e. the number of free observations in the dataset that
can be used for the desired computation, sample variance in this case.
• Though $s^2$ is based on n observations, i.e. n deviations $(x_i - \bar{x})$, the sum of these deviations is always 0.
Therefore, if we know the value of n-1 deviations, the last deviation can be calculated and is not free to take
up any random value.
• E.g. if the first three deviations of a dataset with n=4 are 8, -6 and -4, then the last deviation has
to be 2 so that the sum of these deviations is zero. That is, only n-1, i.e. 3, observations are freely
determined. df=3
The formula hence becomes: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
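A short sketch of the degrees-of-freedom argument and of the two divisors. The three deviations are the ones from the example above; the four-observation sample is made up so that its deviations from its own mean are exactly 8, -6, -4 and 2.

```python
from statistics import pvariance, variance

# Deviations from the sample mean must sum to zero, so the last one is fixed.
known_deviations = [8, -6, -4]            # the three deviations from the example
last_deviation = -sum(known_deviations)   # forced value, not free to vary

# Made-up sample with mean 20, so its deviations are 8, -6, -4 and 2.
sample = [28, 14, 16, 22]

divide_by_n = pvariance(sample)           # sum of squared deviations / n
divide_by_n_minus_1 = variance(sample)    # sum of squared deviations / (n - 1)

print(last_deviation, divide_by_n, divide_by_n_minus_1)
```

The sum of squared deviations is 120 here, so dividing by n gives 30 and dividing by n-1 gives 40: the smaller divisor produces the larger estimate.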
To Summarise
Find the other two values
Box Plots : Refer to Devore
Ques
Skewness and Kurtosis

Skewness= Kurtosis=
OR
Moment based measure of skewness
CENTRAL MOMENTS

Note: m1, m2, m3 and m4 are moments about the mean (central moments) for the
sample, and therefore their denominators should be n-1.
If kurtosis = 3 (mesokurtic); if kurtosis > 3 (leptokurtic); if kurtosis < 3 (platykurtic)
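The slide formulas did not survive extraction, so the sketch below assumes the usual moment-based definitions: skewness = m3 / m2^(3/2) and kurtosis = m4 / m2^2. The function names and the `ddof` parameter are illustrative; `ddof=0` divides by n, and `ddof=1` divides by n-1 as the note above suggests for a sample.

```python
def central_moment(values, r, ddof=0):
    """r-th moment about the mean, dividing by n - ddof."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** r for x in values) / (n - ddof)

def moment_skewness(values, ddof=0):
    """m3 / m2**1.5 (zero for a symmetric dataset)."""
    m2 = central_moment(values, 2, ddof)
    m3 = central_moment(values, 3, ddof)
    return m3 / m2 ** 1.5

def moment_kurtosis(values, ddof=0):
    """m4 / m2**2 (compared with 3 to classify the shape)."""
    m2 = central_moment(values, 2, ddof)
    m4 = central_moment(values, 4, ddof)
    return m4 / m2 ** 2

def classify_kurtosis(kurtosis, tol=1e-9):
    """Label a kurtosis value relative to the benchmark of 3."""
    if abs(kurtosis - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if kurtosis > 3 else "platykurtic"

data = [1, 2, 3, 4, 5]   # made-up symmetric data
k = moment_kurtosis(data)
print(moment_skewness(data), k, classify_kurtosis(k))
```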
Karl Pearson’s coeff of skewness
Population
For sample
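The formulas on these slides were not recovered. The sketch below assumes the standard textbook forms of Karl Pearson's coefficient: (mean - mode) / standard deviation, and the alternative 3 × (mean - median) / standard deviation. The function names and the `sample` flag (sample vs. population standard deviation) are illustrative.

```python
from statistics import mean, median, mode, pstdev, stdev

def pearson_skewness_mode(values, sample=True):
    """(mean - mode) / standard deviation."""
    sd = stdev(values) if sample else pstdev(values)
    return (mean(values) - mode(values)) / sd

def pearson_skewness_median(values, sample=True):
    """3 * (mean - median) / standard deviation."""
    sd = stdev(values) if sample else pstdev(values)
    return 3 * (mean(values) - median(values)) / sd

data = [2, 3, 3, 4, 8]   # made-up data with a longer right tail
print(pearson_skewness_mode(data), pearson_skewness_median(data))
```

Both coefficients are positive for this data, matching its longer right tail.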
Solution:

The coefficient indicates that the prices of the S&P 500 and Apple Inc. have a high positive
correlation. This means that their respective prices tend to move in the same direction.
Therefore, adding Apple to his portfolio would, in
Scatter diagram and
correlation coefficient
Coefficient of Determination
Properties of Correlation coefficient

i.e. $r_{xy} = r_{yx}$
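A minimal sketch of Pearson's correlation coefficient that illustrates the symmetry property and the coefficient of determination (r squared). The function name `pearson_r` and the paired data are made up for the example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Made-up paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_squared = r ** 2   # coefficient of determination
print(r, r_squared, pearson_r(y, x))
```

Swapping the two variables leaves the coefficient unchanged, and r always lies between -1 and 1.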
Ques

Ques

Coeff of variation = $\frac{\sigma}{\mu} \times 100$
OR $\frac{s}{\bar{x}} \times 100$
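A small sketch of the coefficient of variation, assuming the sample form (sample standard deviation divided by the mean, times 100). The function name and the data are made up; the second list is the first one expressed in millimetres instead of centimetres.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(values) / mean(values) * 100

# Made-up lengths: the same measurements in two different units.
lengths_cm = [28, 14, 16, 22]
lengths_mm = [280, 140, 160, 220]

cv_cm = coefficient_of_variation(lengths_cm)
cv_mm = coefficient_of_variation(lengths_mm)
print(cv_cm, cv_mm)
```

Because it is a ratio, the coefficient is unit-free: both lists give the same value, which is why it is used to compare variability across datasets measured on different scales.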

For a moderately skewed distribution:
Q.D. = 2/3 × S.D.
M.D. = 4/5 × S.D.
Not to be done
