Descriptive Stats (Unit 1) New Syllabus
Unit 1
SYLLABUS unit 1
Concepts
• Statistics is the science of learning from data and making decisions using the wealth of
information available to us.
-- Nicholas Horton
• “Statistics is about the development of methods for the collection and analysis of data
in order to answer specific questions in an unbiased way, so that the conclusions
depend only on the data and not on any preconceived ideas.”
—Bryan Manly
Branches of Statistics
Population and Sample in Statistics
The difference between a population characteristic (parameter) and the corresponding sample
characteristic (statistic) is the sampling error.
Descriptive Statistics
Descriptive statistics is the term given to the analysis of data that helps describe, show or summarize data in
a meaningful way such that, for example, patterns might emerge from the data. Descriptive statistics do not,
however, allow us to make conclusions beyond the data we have analysed or reach conclusions regarding
any hypotheses we might have made. They are simply a way to describe our data.
Typically, two general types of statistic are used to describe data:
- Measures of central tendency: locate a representative value around which the data are centred
- Measures of dispersion: measure the spread in the data
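As a minimal sketch of the two types of measure, the snippet below uses Python's standard `statistics` module on a small list of marks. The data are made up for illustration and are not taken from the slides.

```python
import statistics

# Hypothetical marks of nine students (illustrative data only)
marks = [42, 45, 45, 50, 55, 58, 60, 65, 75]

# Measures of central tendency: where the data are centred
mean_marks = statistics.mean(marks)      # arithmetic average
median_marks = statistics.median(marks)  # middle value of the sorted data
mode_marks = statistics.mode(marks)      # most frequent value

# Measures of dispersion: how spread out the data are
range_marks = max(marks) - min(marks)
sd_marks = statistics.pstdev(marks)      # population standard deviation

print(mean_marks, median_marks, mode_marks, range_marks, round(sd_marks, 2))
```

Here the mean and the median coincide, while the mode sits lower; the range and the standard deviation describe the spread around that centre.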
Descriptive statistics can be applied to both a population and a sample. The properties/characteristics of
a population, like the mean or standard deviation, are called parameters, as they represent the whole
population (i.e., everybody you are interested in), whereas the properties/characteristics of a sample are
called sample statistics.
The branch of statistics that helps summarize data (whether from a population or from a sample) is called
descriptive statistics.
If instead we use sample statistics to estimate the population parameters or to draw inferences about
the population, it is inferential statistics.
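The distinction between a parameter and a statistic can be sketched in a few lines. The population below is simulated (heights in cm, an assumption made purely for illustration), and the gap between the sample value and the population value is what is usually called the sampling error.

```python
import random
import statistics

# Hypothetical population: heights (cm) of 1,000 people, simulated for illustration
rng = random.Random(42)  # fixed seed so the run is repeatable
population = [rng.gauss(165, 10) for _ in range(1000)]

# Parameter: computed from the whole population
population_mean = statistics.mean(population)

# Statistic: computed from a random sample of 50 people
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)

# Gap between the statistic and the parameter
sampling_error = sample_mean - population_mean

print(f"parameter = {population_mean:.2f}")
print(f"statistic = {sample_mean:.2f}")
print(f"error     = {sampling_error:.2f}")
```

Describing either list on its own is descriptive statistics; using `sample_mean` to say something about `population_mean` is the inferential step.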
Other measures include quartiles, percentiles, skewness and kurtosis.
Data
• Data is any useful information that is capable of being processed, i.e. treated and
analyzed.
• Information or data can be quantitative or qualitative.
• It can be collected directly by the researcher (primary data) or taken from
existing sources, i.e. it is pre-collected (secondary data).
• Secondary data can be obtained from established sources such as the World Bank, IMF, RBI,
NSSO, Kaggle, etc.
Classification of Data
Data
- Qualitative (categorical)
- Quantitative (numerical)
  - Discrete
  - Continuous
• Measurements that characterize the data set and convey some of its
salient features.
Limitation of Mean
Mean
Median
The sample median is insensitive to outliers
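A quick way to see this is to add one extreme value to a small dataset and compare how far the mean and the median move. The salary figures below are hypothetical.

```python
import statistics

# Hypothetical salaries, in thousands (illustrative data only)
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]  # one extreme value added

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)

mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after:.2f}")
print(f"median: {median_before} -> {median_after}")
```

The single outlier pulls the mean from 35 to about 95.83, while the median only shifts from 35 to 36.5.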
Which measure of Central Tendency to use
Level of Measurement
Which measure of Central Tendency to Use
Shape of the Distribution : skewness
Shape of the distribution
Which measure of Central tendency to use?
Other Measures: Percentiles, Quartiles, Trimmed Mean
Trimmed Mean
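A trimmed mean drops a fixed proportion of the smallest and largest observations and averages what remains. The sketch below is one simple way to compute it; the helper name `trimmed_mean` and the data are hypothetical, and the rule used for rounding the number of trimmed values (truncation) is an assumption.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one large outlier
data = [2, 4, 5, 6, 7, 8, 9, 10, 11, 98]

plain_mean = sum(data) / len(data)
ten_percent_trimmed = trimmed_mean(data, 0.10)  # drops 2 and 98

print(plain_mean, ten_percent_trimmed)
```

With the outlier included the mean is 16.0; trimming 10% from each end gives 7.5, much closer to the bulk of the data.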
QUARTILES
• Quartiles are numerical measures that divide the data into FOUR equal parts
• There are 3 quartiles
• The three quartiles are Q1 (lower quartile), Q2 (median) and Q3 (upper quartile)
• Interquartile range: IQR = Q3 - Q1
Calculation of Quartiles
• Since these are positional averages, the data must first be arranged in order (preferably ascending).
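Several conventions exist for locating quartiles. The sketch below assumes the common positional rule in which Qk is the k(n+1)/4-th value of the sorted data; this is what `statistics.quantiles` computes with its default `method="exclusive"`. The formula used on the original slide is not visible, so this choice is an assumption, and the data are made up.

```python
import statistics

# Hypothetical observations (illustrative data only), deliberately unsorted
data = [13, 3, 22, 8, 5, 18, 11, 7, 20, 9, 15]

# Step 1: arrange the data in ascending order
ordered = sorted(data)

# Step 2: cut the ordered data into four equal parts
# n = 11, so the quartiles fall on the 3rd, 6th and 9th ordered values
q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")

# Step 3: interquartile range
iqr = q3 - q1

print(ordered)
print(q1, q2, q3, iqr)
```

With 11 observations the positions are whole numbers, so Q1 = 7, Q2 = 11 and Q3 = 18, giving IQR = 11. When the position is fractional the function interpolates between neighbouring values.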
BOX PLOTS TO IDENTIFY OUTLIERS
Box Plots
Box Plots and Skewness
Box Plots and Dispersion
Box Plots and Outliers
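A box plot is usually drawn from the five-number summary, and points beyond the whiskers are flagged as outliers. The sketch below assumes the widely used 1.5 × IQR rule for the fences; the slides' figure is not available, so both that rule and the data are assumptions.

```python
import statistics

# Hypothetical observations with one suspiciously large value
data = [3, 5, 7, 8, 9, 11, 13, 15, 18, 20, 60]

q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Fences under the assumed 1.5 * IQR convention
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is treated as an outlier
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

Here Q1 = 7 and Q3 = 18, so the fences are -9.5 and 34.5 and only the value 60 is flagged.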
Measures of Variation/Dispersion
• At times the measures of central tendency (mean, median and mode) are not sufficient to describe the
data: two series can have the same mean and still look very different. We therefore need
additional measures, called the MEASURES OF DISPERSION.
• E.g. (illustrative) the series 48, 50, 52 and 0, 50, 100 both have mean 50, but the second is far more spread out.
• Measures of dispersion are values that tell us how the observations in a dataset scatter around a
central value.
• The higher the scatter, the higher the value of the measure and the lower the uniformity in the
dataset.
Commonly used measures of Dispersion are:
Range
Interquartile Range (and Quartile Deviation, which is half the interquartile range)
Standard Deviation
Variance
Range is the difference between the largest and smallest values of the dataset.
Interquartile range is the difference between the upper quartile and the lower quartile.
Variance is the average of the squared deviations from the mean.
Standard deviation is the square root of the variance.
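The measures defined above can be computed step by step, as in the sketch below. It assumes the population form of the variance (dividing by n, matching "average squared deviation"); a sample variance would divide by n - 1 instead. The data are made up, and the quartiles use the default convention of `statistics.quantiles`, which is also an assumption.

```python
import math
import statistics

# Hypothetical dataset (illustrative data only)
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Interquartile range: upper quartile minus lower quartile
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Variance: average of the squared deviations from the mean
mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / n

# Standard deviation: square root of the variance
std_dev = math.sqrt(variance)

print(data_range, iqr, variance, std_dev)
```

For this dataset the mean is 5, the variance is 4 and the standard deviation is 2; the hand computation agrees with `statistics.pvariance(data)`.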
Why square the Differences?
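One way to see the need for squaring: raw deviations from the mean always cancel out, so their sum says nothing about spread. A minimal sketch with made-up data:

```python
# Hypothetical dataset (illustrative data only)
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

deviations = [x - mean for x in data]

# Positive and negative deviations cancel exactly
sum_of_deviations = sum(deviations)

# Squaring removes the sign and gives more weight to large deviations
mean_squared_deviation = sum(d ** 2 for d in deviations) / n

# Taking absolute values also removes the sign (the mean absolute deviation)
mean_absolute_deviation = sum(abs(d) for d in deviations) / n

print(sum_of_deviations, mean_squared_deviation, mean_absolute_deviation)
```

The deviations sum to 0, whereas the mean squared deviation (the variance) is 4.0 and the mean absolute deviation is 1.5.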
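Variance is comparatively difficult to interpret because it is expressed in squared units.
For example, if instead of marks the variable X
mentioned alongside were the length of beams
measured in cm, the variance would be in cm², whereas the standard deviation is in cm, the same unit as the data.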
Skewness and Kurtosis
Moment based measure of skewness
CENTRAL MOMENTS
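The formulas on the original slide are not recoverable, so the sketch below assumes the standard textbook definitions: the r-th central moment is m_r = (1/n) Σ (x_i - x̄)^r, moment-based skewness is m_3 / m_2^(3/2), and moment-based kurtosis is m_4 / m_2^2. The helper name `central_moment` and the data are hypothetical.

```python
def central_moment(values, r):
    """r-th central moment: the average of (x - mean) ** r."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


# Hypothetical dataset (illustrative data only)
data = [2, 4, 4, 4, 5, 5, 7, 9]

m2 = central_moment(data, 2)  # second central moment = population variance
m3 = central_moment(data, 3)
m4 = central_moment(data, 4)

skewness = m3 / m2 ** 1.5  # > 0 suggests a longer right tail
kurtosis = m4 / m2 ** 2    # equals 3 for a normal distribution

print(m2, m3, m4)
print(skewness, kurtosis)
```

For this dataset m2 = 4.0, m3 = 5.25 and m4 = 44.5, giving a skewness of about 0.656 (positively skewed) and a kurtosis of about 2.781.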
i.e. r_xy = r_yx: the correlation coefficient between x and y is symmetric.
Coefficient of variation: $CV = \frac{\sigma}{\mu} \times 100$ OR $CV = \frac{s}{\bar{x}} \times 100$
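The coefficient of variation expresses the standard deviation as a percentage of the mean, which makes it possible to compare the consistency of series with different scales. The sketch below assumes the population standard deviation; the function name and the two score lists are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined when the mean is 0")
    return statistics.pstdev(values) / mean * 100


# Hypothetical scores of two players with the same average (illustrative data only)
player_a = [50, 52, 48, 51, 49]
player_b = [10, 90, 50, 20, 80]

cv_a = coefficient_of_variation(player_a)
cv_b = coefficient_of_variation(player_b)

print(f"CV of A = {cv_a:.2f}%, CV of B = {cv_b:.2f}%")
```

Both players average 50, but A's coefficient of variation is about 2.83% against about 63.25% for B, so A is the more consistent of the two.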