Basic Concepts of Statistics


SEVERINO B. SALERA JR
ASSO. PROF. 5
BISU, BILAR
Basics of Statistics

Definition: Science of collection, presentation, analysis, and reasonable interpretation of data.
Statistics provides a rigorous scientific method for gaining insight into data.
For example, suppose we measure the weight of 100 patients in a study. With so
many measurements, simply looking at the data does not give an informative account.
However, statistics can give an instant overall picture of the data through graphical
presentation or numerical summarization, irrespective of the number of data points.
Besides data summarization, another important task of statistics is to make inferences
and predict relationships between variables.
What is Data?
Definition: Facts or figures, which are numerical or
otherwise, collected with a definite purpose are
called data.
Every day we come across a lot of information in the form of facts,
numerical figures, tables, graphs, etc.
These are provided by newspapers, television, magazines and
other means of communication.
These may relate to cricket batting or bowling averages, profits of
a company, temperatures of cities, expenditures in various sectors
of a five year plan, polling results, and so on.
These facts or figures, which are numerical or otherwise, collected
with a definite purpose are called data.
Primary Data Vs Secondary Data
Primary Data
Primary data is the data that is collected for the first time
through personal experiences or evidence, particularly for
research.
It is also described as raw data or first-hand information.
The mode of assembling the information is costly.
The data is mostly collected through observations, physical
testing, mailed questionnaires, surveys, personal interviews,
telephonic interviews, case studies, and focus groups, etc.
Secondary Data
Secondary data is second-hand data that has already been collected and recorded
by other researchers for their own purposes, not for the current research
problem.
It is accessible in the form of data collected from different sources such as
government publications, censuses, internal records of the organisation, books,
journal articles, websites and reports, etc.
Gathering data this way is affordable and saves cost
and time, because the data are readily available.
However, one disadvantage is that the information was assembled for some
other purpose and may not meet the present research purpose or may not be
accurate.
Discrete Vs continuous data

 Discrete data (countable) is information that can only take
certain values. These values don’t have to be whole numbers, but
they are fixed values, such as shoe size, number of teeth,
number of kids, etc.
 Discrete variables that arise from counting are numeric and take
non-negative integer values (5, 10, 15, and so on).
 Continuous data (measurable) is data that can take any value within a range.
Height, weight, temperature and length are all examples of
continuous data.
 A continuous quantity may also change over time and take different
values at different times, like the weight of a person.
Data Presentation
 There are two types of statistical presentation of data: graphical
and numerical.
 Graphical presentation: We look for the overall pattern and
for striking deviations from that pattern. The overall pattern is
usually described by the shape, center, and spread of the data.
An individual value that falls outside the overall pattern is
called an outlier.
 Bar diagrams and pie charts are used for categorical
variables.
 Histograms, stem-and-leaf plots and box plots are used for
numerical variables.
Histogram
 A histogram is a graphical display of data using bars of different
heights. In a histogram, each bar groups numbers into ranges.
Taller bars show that more data falls in that range.
 A histogram displays the shape and spread of continuous
sample data.
Box Plotting

 Box plots (also called box-and-whisker plots or box-whisker plots) give a good graphical
image of the concentration of the data.
 They also show how far the extreme values are from
most of the data.
 A box plot is constructed from five values: the
minimum value, the first quartile, the median, the third
quartile, and the maximum value.
A box plot is a standardized way of displaying the
distribution of data based on a five-number summary (“minimum”, first quartile (Q1),
median, third quartile (Q3), and “maximum”). It can tell you about your outliers and
what their values are. It can also tell you whether your data is symmetrical, how tightly your
data is grouped, and whether and how your data is skewed.
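The five-number summary behind a box plot can be sketched in a few lines of Python. This is only an illustration: the weights are made-up values, the "inclusive" quartile method is one of several conventions, and the 1.5 × IQR rule for flagging outliers is a common convention assumed here rather than something stated in these slides.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # With n=4, statistics.quantiles returns the three quartile cut points.
    # method="inclusive" is an assumed convention; others shift Q1 and Q3 slightly.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def flag_outliers(values, k=1.5):
    """Return values lying more than k * IQR beyond the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

weights = [52, 55, 58, 60, 61, 63, 65, 68, 70, 120]  # hypothetical weights in kg
summary = five_number_summary(weights)
outliers = flag_outliers(weights)

print("min, Q1, median, Q3, max:", summary)
print("flagged as outliers:", outliers)
```

In this made-up sample only the value 120 lies far enough from the box to be flagged.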
Statistical concepts of classification of Data
 Classification is the process of arranging data into
homogeneous (similar) groups according to their common
characteristics.
 Raw data cannot be easily understood, and it is not fit for
further analysis and interpretation. Arrangement of data helps
users in comparison and analysis. It is also important for
statistical sampling.
Classification of Data
There are four types of classification. They are:
 Geographical classification
When data are classified on the basis of location or area, it is called geographical
classification.
 Chronological classification
Chronological classification means classification on the basis of time, like months, years etc.
 Qualitative classification
In Qualitative classification, data are classified on the basis of some attributes or quality such
as gender, colour of hair, literacy and religion. In this type of classification, the attribute
under study cannot be measured. It can only be found out whether it is present or absent
in the units of study.
 Quantitative classification
Quantitative classification refers to the classification of data according to some
characteristics, which can be measured such as height, weight, income, profits etc.
Quantitative classification
 There are two types of quantitative classification of data:
Discrete frequency distribution and Continuous frequency
distribution.
 In this type of classification there are two elements:
 Variable
A variable is a characteristic that varies in magnitude or
quantity, e.g. the weight of the students. A variable may be discrete or
continuous.
 Frequency
Frequency refers to the number of times each value of the variable occurs.
For example, if 50 students each weigh 60 kg, then 50 is the
frequency of the value 60 kg.
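As a rough illustration of the two elements, the sketch below counts how often each value of a variable occurs. The weights are hypothetical, and `Counter` is just one convenient way to build a discrete frequency distribution.

```python
from collections import Counter

# Hypothetical weights (kg) of ten students; the values are illustrative only.
weights_kg = [60, 55, 60, 62, 55, 60, 58, 62, 60, 55]

# Counter maps each distinct value of the variable to its frequency.
frequency = Counter(weights_kg)

for value in sorted(frequency):
    print(f"{value} kg: {frequency[value]} students")
```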
Frequency distribution
 Frequency distribution refers to data classified on the basis of
some variable that can be measured such as prices, weight,
height, wages etc.
The following technical terms are important when a
continuous frequency distribution is formed:
Class limits: Class limits are the lowest and highest
values that can be included in a class. For example,
take the class 51-55. The lowest value of the class is
51 and the highest value is 55. In this class there can
be no value less than 51 or more than 55. 51 is the
lower class limit and 55 is the upper class limit.
Class interval: The difference between the upper
and lower limits of a class is known as the class interval of
that class.
Class frequency: The number of observations
falling in a particular class is known as the
frequency of that class.
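A grouped frequency table with inclusive class limits such as 51-55 might be built as follows. The marks are hypothetical, and the helper `class_frequencies` and its `width` parameter (the spacing between successive lower limits) are assumptions of this sketch, which also assumes whole-number observations.

```python
def class_frequencies(values, lower, width, n_classes):
    """Count observations in consecutive classes with inclusive limits.

    With lower=51 and width=5 the classes are 51-55, 56-60, and so on.
    Assumes integer-valued observations, so every value falls in one class.
    """
    table = []
    for i in range(n_classes):
        lo = lower + i * width
        hi = lo + width - 1  # upper class limit, e.g. 55 for the class 51-55
        count = sum(1 for v in values if lo <= v <= hi)
        table.append(((lo, hi), count))
    return table

marks = [51, 53, 55, 56, 58, 60, 61, 61, 64, 67, 70, 52]  # hypothetical data
table = class_frequencies(marks, lower=51, width=5, n_classes=4)

for (lo, hi), count in table:
    print(f"{lo}-{hi}: {count}")
```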
Measures of Central Tendency
 In statistics, a measure of central tendency is a descriptive summary of a
data set.
 It is a single value that reflects the centre of the
data distribution.
 It does not provide information about individual values
in the dataset; rather, it gives a summary of the dataset as a whole. Generally,
the central tendency of a dataset can be described using the mean,
the median or the mode.
Mean
 The mean represents the average value of the dataset.
 It can be calculated as the sum of all the values in the dataset
divided by the number of values. In general, it is considered as the
arithmetic mean.
 Some other measures of mean used to find the central tendency
are as follows:
 Geometric Mean (nth root of the product of n numbers)
 Harmonic Mean (the reciprocal of the average of the reciprocals)
 Weighted Mean (where some values contribute more than others)
 If all the values in the dataset are the same,
then the arithmetic, geometric and harmonic means are all the
same. If there is variability in the data, then the three means
differ; for positive values, harmonic mean ≤ geometric mean ≤ arithmetic mean.
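To see how these means compare on the same data, here is a minimal sketch using Python's standard `statistics` module. The data values and weights are hypothetical.

```python
import statistics

values = [2, 4, 8]  # hypothetical positive data with some variability

arithmetic = statistics.mean(values)
geometric = statistics.geometric_mean(values)
harmonic = statistics.harmonic_mean(values)

# Weighted mean: each value counts in proportion to its weight.
weights = [1, 1, 2]  # hypothetical weights
weighted = sum(v * w for v, w in zip(values, weights)) / sum(weights)

# When every value is identical, the three means coincide.
same = [5, 5, 5]
same_means = (
    statistics.mean(same),
    statistics.geometric_mean(same),
    statistics.harmonic_mean(same),
)

print(arithmetic, geometric, harmonic, weighted)
print(same_means)
```

For the varied data the harmonic mean comes out smallest and the arithmetic mean largest.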
Arithmetic Mean
The arithmetic mean is the number obtained by dividing the
sum of the elements of a set by the number of values in the set. In everyday
language it is called the average. If a data set consists of the values
b1, b2, b3, …, bn, then the arithmetic mean B is defined as:
B = (Sum of all observations) / (Total number of observations)

For example, the arithmetic mean of Virat Kohli’s batting scores, also called his batting
average, is:
Sum of runs scored / Number of innings = 661/10
The arithmetic mean of his scores in the last 10 innings is 66.1.
Harmonic Mean
A sequence is a Harmonic Progression if the reciprocals of its terms are in
Arithmetic Progression. The harmonic mean (HM for short) is
calculated by dividing the number of terms by the sum of the reciprocals of the terms.

In particular cases, especially those involving rates and ratios, the harmonic
mean gives the correct value of the mean. For example, if a vehicle
travels a specified distance at speed x (e.g. 60 km/h) and then travels the same distance again at
speed y (e.g. 40 km/h), the average speed is the harmonic mean of x and
y (i.e. 48 km/h).
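The 48 km/h figure can be checked by computing total distance over total time. In this sketch the 120 km leg length is an arbitrary assumption; any distance gives the same average.

```python
import statistics

distance_km = 120.0        # hypothetical length of each leg
speeds_kmh = [60.0, 40.0]  # the two speeds from the example

# Average speed is total distance divided by total time.
total_time_h = sum(distance_km / s for s in speeds_kmh)
average_speed = len(speeds_kmh) * distance_km / total_time_h

harmonic = statistics.harmonic_mean(speeds_kmh)
arithmetic = statistics.mean(speeds_kmh)  # overstates the true average speed

print(average_speed, harmonic, arithmetic)
```

The slower leg takes more time, which is why the average falls below the arithmetic mean of the two speeds.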
Geometric Mean
 The Geometric Mean (GM) is the average value or mean which
signifies the central tendency of a set of numbers by using
the product of their values.
 Basically, we multiply all the numbers together and take the
nth root of the product, where n is the total number
of values.
 For example: for a given set of two numbers such as 3 and 1, the
geometric mean is equal to √(3×1) = √3 ≈ 1.732.
Use of Geometric Mean
 For example, suppose you have an investment which earns 10%
the first year, 60% the second year, and 20% the third year. What
is its average rate of return?
 It is not the arithmetic mean, because what these numbers mean is
that in the first year your investment was multiplied (not added to)
by 1.10, in the second year it was multiplied by 1.60, and in the third
year it was multiplied by 1.20. The relevant quantity is the
geometric mean of these three numbers.
 The question about finding the average rate of return can be
rephrased as: "by what constant factor would your investment need
to be multiplied each year in order to achieve the same effect as
multiplying by 1.10 one year, 1.60 the next, and 1.20 the third?"
 If you calculate this geometric mean, (1.10 × 1.60 × 1.20)^(1/3),
you get approximately 1.283, so the average rate of return is about
28% (not 30%, which is what the arithmetic mean of 10%, 60%, and
20% would give you).
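The calculation can be reproduced with a short sketch that uses the yearly growth factors 1.10, 1.60 and 1.20 from the example. The variable names are illustrative.

```python
import math

growth_factors = [1.10, 1.60, 1.20]  # yearly multipliers from the example

total_growth = math.prod(growth_factors)
geometric_mean = total_growth ** (1 / len(growth_factors))
arithmetic_mean = sum(growth_factors) / len(growth_factors)

# Compounding the geometric mean for three years reproduces the total growth,
# which the arithmetic mean does not.
compounded_gm = geometric_mean ** 3
compounded_am = arithmetic_mean ** 3

print(round(geometric_mean, 3), round(arithmetic_mean, 3))
```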
Median
 Median is the middle value of the dataset in which
the dataset is arranged in the ascending order or in
descending order.
 When the dataset contains an even number of
values, then the median value of the dataset can
be found by taking the mean of the middle two
values.
 If you have a skewed distribution, the best measure
of central tendency is the median.
 The median is less sensitive to outliers (extreme
scores) than the mean and is thus a better measure
than the mean for highly skewed distributions, e.g.
family income. For example, the mean of 20, 30, 40,
and 990 is (20+30+40+990)/4 = 270. The median
of these four observations is (30+40)/2 = 35. Here 3
observations out of 4 lie between 20 and 40, so the
mean of 270 fails to give a realistic picture of
the major part of the data. It is influenced by
the extreme value 990.
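A quick check of this example in Python is sketched below. The second list, where 990 is swapped for 50, is a hypothetical comparison added to show how little the median moves.

```python
import statistics

observations = [20, 30, 40, 990]
mean_value = statistics.mean(observations)
median_value = statistics.median(observations)

# Hypothetical comparison: replace the extreme value with an ordinary one.
without_extreme = [20, 30, 40, 50]
mean_without = statistics.mean(without_extreme)
median_without = statistics.median(without_extreme)

print(mean_value, median_value)      # mean is pulled up by 990
print(mean_without, median_without)  # median is unchanged
```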
Mode

 The mode is the most frequently occurring value in
the dataset.
 Sometimes the dataset may contain multiple modes and
in some cases, it does not contain any mode at all.
 If you have categorical data, the mode is the best choice
to find the central tendency.
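The three situations above (one mode, several modes, no clear mode) can be illustrated with the standard `statistics` module. All of the data here are hypothetical.

```python
import statistics

shoe_sizes = [7, 8, 8, 9, 9, 10]           # hypothetical: two values tie
blood_groups = ["A", "O", "B", "O", "AB"]  # hypothetical categorical data
all_distinct = [1, 2, 3]                   # every value occurs once

modes_shoes = statistics.multimode(shoe_sizes)
mode_blood = statistics.mode(blood_groups)
# multimode returns every value when all are tied; such data are
# usually described as having no mode.
modes_distinct = statistics.multimode(all_distinct)

print(modes_shoes, mode_blood, modes_distinct)
```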
Measures of Dispersion
Dispersion is the state of being dispersed or spread out. Statistical dispersion
means the extent to which numerical data are likely to vary about an
average value. In other words, dispersion helps us understand the
distribution of the data.
Objectives of computing dispersion
Comparative study
Measures of dispersion give a single value indicating the degree of consistency or
uniformity of distribution. This single value helps us in making comparisons of
various distributions.
Reliability of an average
A small value of dispersion means low variation between the observations and the
average. It means that the average is a good representative of the observations and is very
reliable. A higher value of dispersion means greater deviation among the
observations, so the average is less representative.
Control the variability
Different measures of dispersion provide us data of variability from different
angles, and this knowledge can prove helpful in controlling the variation.
Basis for further statistical analysis
Measures of dispersion provide the basis for further statistical analysis like
computing correlation, regression, test of hypothesis, sampling etc.
Types of Measures of Dispersion
There are two main types of dispersion methods in statistics which
are:
 Absolute Measure of Dispersion
 Relative Measure of Dispersion
Absolute Measure of Dispersion
An absolute measure of dispersion is expressed in the same unit as the original data set.
Absolute measures express the variation in terms of the average of the
deviations of the observations, such as the standard deviation or mean deviation. They include the range,
standard deviation, quartile deviation, etc. The types of absolute measures of
dispersion are:

Range: It is simply the difference between the maximum value and the
minimum value in a data set. Example: 1, 3, 5, 6, 7 => Range = 7 − 1 = 6
Variance: Subtract the mean from each value in the set, square each
difference, add the squares, and finally divide the sum by the total number of values
in the data set. Variance (σ²) = ∑(X − μ)² / N
Standard Deviation: The square root of the variance is known as the standard
deviation, i.e. S.D. (σ) = √(σ²).
Quartiles and Quartile Deviation: The quartiles are values that divide a list of
numbers into quarters. The quartile deviation is half of the distance between the
third and the first quartile.
Mean and Mean Deviation: The average of numbers is known as the mean and
the arithmetic mean of the absolute deviations of the observations from a
measure of central tendency is known as the mean deviation (also called mean
absolute deviation).
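As a small illustration, the range and the mean deviation can be computed for the data set 1, 3, 5, 6, 7 used in the range example. Taking deviations about the mean (rather than the median) is an assumption of this sketch.

```python
import statistics

data = [1, 3, 5, 6, 7]

data_range = max(data) - min(data)
mean_value = statistics.mean(data)

# Mean (absolute) deviation, taken here about the arithmetic mean.
mean_deviation = sum(abs(x - mean_value) for x in data) / len(data)

print(data_range, mean_value, mean_deviation)
```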
Range
 It is the simplest method of measurement of dispersion.
 It is defined as the difference between the largest and the
smallest item in a given distribution.
 Range = Largest item (L) – Smallest item (S)
Interquartile Range
 It is defined as the difference between the Upper Quartile and
Lower Quartile of a given distribution.
 Interquartile Range = Upper Quartile (Q3) – Lower Quartile (Q1)
Variance
 Variance is a measure of how far a set of data (numbers) is spread
out from its mean (average) value.
 The larger the variance, the more scattered the data are about the
mean; the smaller the variance, the less
scattered they are. Therefore, it is called a measure of the spread of
data about the mean.
 The formula for variance is
Var(X) = E[(X − μ)²]
 The variance is the square of the standard deviation, i.e.,
Variance = (Standard deviation)² = σ²
Example: Find the variance of the numbers 3, 8, 6, 10, 12, 9, 11, 10, 12, 7.
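Step 1: Compute the mean of the 10 values given.
Mean (μ) = (3+8+6+10+12+9+11+10+12+7) / 10 = 88 / 10 = 8.8
Step 2: Square each deviation from the mean and add the squares.
∑(X − μ)² = 33.64 + 0.64 + 7.84 + 1.44 + 10.24 + 0.04 + 4.84 + 1.44 + 10.24 + 3.24 = 73.6
Step 3: Divide by the number of values.
Variance (σ²) = 73.6 / 10 = 7.36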
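The same example can be worked through in Python. This sketch divides by N, as in the variance formula given earlier in these slides, and compares the result with the standard library's population variance.

```python
import statistics

data = [3, 8, 6, 10, 12, 9, 11, 10, 12, 7]

mu = sum(data) / len(data)
squared_deviations = [(x - mu) ** 2 for x in data]
variance = sum(squared_deviations) / len(data)  # divide by N (population form)
std_dev = variance ** 0.5

# statistics.pvariance uses the same divide-by-N definition.
library_variance = statistics.pvariance(data)

print(mu, round(variance, 2), round(std_dev, 3))
```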
Coefficient of Variation
 The coefficient of variation (CV) is a relative measure of variability
that indicates the size of the standard deviation in relation to the mean:
CV = standard deviation / mean, often expressed as a percentage.
 It is a standardized, unitless measure that allows you to compare
variability between disparate groups and characteristics.
 It is also known as the relative standard deviation (RSD).
 The coefficient of variation facilitates meaningful comparisons in
scenarios where absolute measures cannot.
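A minimal sketch of such a comparison is shown below. The heights and weights are hypothetical, and using the population standard deviation is a choice made for this illustration.

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean, as a percentage."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

heights_cm = [160, 165, 170, 175, 180]  # hypothetical heights
weights_kg = [50, 60, 70, 80, 90]       # hypothetical weights

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

print(f"CV of heights: {cv_height:.2f}%")
print(f"CV of weights: {cv_weight:.2f}%")
```

Because the CV is unitless, the two results can be compared directly even though one variable is in centimetres and the other in kilograms.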
Quartile Deviation
 The Quartile Deviation (QD) is half of the
difference between the upper and lower quartiles.
 Mathematically: Quartile Deviation = (Q3 – Q1) / 2
 Quartile Deviation is an absolute measure of dispersion.
The corresponding relative measure is known as
the coefficient of QD, which is given by the
formula: Coefficient of Quartile Deviation = (Q3 – Q1) / (Q3 + Q1)
 The coefficient of QD is used to study and compare the degree of
variation in different situations.
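Both formulas can be tried out with a short sketch. The data are hypothetical, and the "inclusive" quartile method is one of several conventions, so other tools may report slightly different quartiles.

```python
import statistics

def quartile_deviation(values):
    """Return (QD, coefficient of QD) using inclusive quartiles (assumed method)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / 2, (q3 - q1) / (q3 + q1)

data = [2, 4, 6, 8, 10, 12, 14, 16, 18]  # hypothetical data
qd, coeff_qd = quartile_deviation(data)

print(qd, coeff_qd)
```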
Skewness
 Skewness is a measure of the degree of asymmetry of a
distribution.
 If the left tail (the tail at the small end of the distribution) is more
pronounced than the right tail (the tail at the large end of the
distribution), the distribution is said to have negative skewness.
 If the reverse is true, it has positive skewness. If the two are
equal, it has zero skewness.
Kurtosis
 Kurtosis is a measure of whether the data are heavy-tailed or
light-tailed relative to a normal distribution.
 That is, data sets with high kurtosis tend to have heavy tails, or
outliers. Data sets with low kurtosis tend to have light tails, or
lack of outliers.
 Significant skewness and kurtosis clearly indicate that data are
not normal.
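Skewness and kurtosis can be estimated from data in several ways. The sketch below uses the moment-based coefficients, a common choice that is assumed here because the slides do not give a formula. The data sets and function names are illustrative.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """m4 / m2 ** 2 minus 3, so that a normal distribution scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 20]      # hypothetical, long right tail
left_tailed = [-x for x in right_tailed]   # mirror image, long left tail

print(skewness(symmetric), skewness(right_tailed), skewness(left_tailed))
print(excess_kurtosis(symmetric), excess_kurtosis(right_tailed))
```

The symmetric sample has zero skewness, the long right tail gives a positive value, and its mirror image gives a negative one. The single extreme value also makes the kurtosis of the right-tailed sample higher than that of the symmetric one.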
Types of Distributions
Normal Distribution
 In probability theory and statistics, the Normal Distribution, also
called the Gaussian Distribution, is the most significant
continuous probability distribution.
 A large number of random variables are either nearly or exactly
represented by the normal distribution, throughout the physical sciences
and economics.
 In a normal distribution, the mean, median and mode are equal
(i.e., Mean = Median = Mode). The normal curve
is symmetric about the centre.
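These properties can be checked with `NormalDist` from Python's standard `statistics` module. The mean of 170 and standard deviation of 10 are hypothetical parameters chosen for illustration.

```python
from statistics import NormalDist

heights = NormalDist(mu=170, sigma=10)  # hypothetical parameters

# Mean, median and mode coincide for a normal distribution.
centre = (heights.mean, heights.median, heights.mode)

# Symmetry: equal density at equal distances on either side of the mean.
left, right = heights.pdf(160), heights.pdf(180)

# Half of the probability lies below the mean.
below_mean = heights.cdf(170)

print(centre, left, right, below_mean)
```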
SAS Exam papers

Paper No.  Name of paper                                             Sincere preparation  Normal preparation
PC 1       Language Skill                                            10                   6
PC 2       Logical, Analytical and Quantitative Abilities            9                    3
PC 3       Information Technology (Theory)                           7-8                  2
PC 4       Information Technology (Practical)                        10                   10
PC 5       Constitution of India, Statutes and Service Regulations   7                    2-3
PC 8       Financial Rules and Principles of Government Accounts     6-7                  0
PC 14      Financial Accounting with Elementary Costing              6-7                  0
PC 16      Public Works Accounts                                     4-5                  0
PC 22      Government Audit                                          6-7                  0
Thank you for giving this opportunity to interact with you, and please feel free to contact me in case of any doubt regarding the lecture.
Gaurav Kr. Prajapat
Mobile 9461588507
