Statistics
A PRESENTATION BY:
AHMED KOSAR
Basics of Statistics
Discrete data (countable) is information that can only take certain fixed values.
These values don't have to be whole numbers (shoe sizes include half sizes), but
there are gaps between them. Examples are shoe size, number of teeth, and number of kids.
Many discrete variables are counts: they are finite, numeric, countable,
non-negative integers (5, 10, 15, and so on).
Continuous data (measurable) is data that can take any value within a range. Height, weight,
temperature and length are all examples of continuous data.
A continuous variable can also change over time and take different values at different
times, like the weight of a person.
Data Presentation
There are two types of statistical presentation of data: graphical and numerical.
Graphical Presentation: We look for the overall pattern and for striking
deviations from that pattern. The overall pattern is usually described by the shape,
center, and spread of the data. An individual value that falls outside the
overall pattern is called an outlier.
Bar diagrams and pie charts are used for categorical variables.
Histograms, stem-and-leaf plots and box plots are used for numerical variables.
Histogram
A histogram is a graphical display of data using bars of different heights. In
a histogram, each bar groups numbers into ranges. Taller bars show that more
data falls in that range. A histogram displays the shape and spread of continuous
sample data.
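As a rough illustration of how a histogram groups numbers into ranges, the sketch below counts observations per range and prints one text bar per range. The sample heights and the bin width of 10 are assumptions chosen for the example, not values from the slides.

```python
from collections import Counter

# Hypothetical sample of heights in whole centimetres (illustrative only).
heights = [151, 154, 158, 160, 161, 163, 165, 165, 168, 170, 172, 179]

BIN_WIDTH = 10  # assumed width of each range


def bin_start(value, width=BIN_WIDTH):
    """Return the lower edge of the range that contains value."""
    return (value // width) * width


# Count how many observations fall into each range.
counts = Counter(bin_start(h) for h in heights)

for start in sorted(counts):
    # One '#' per observation, so a longer bar means more data in that range.
    print(f"{start}-{start + BIN_WIDTH - 1}: {'#' * counts[start]}")
```

The longest bar marks the range that holds the most observations, which is what the tallest bar in a drawn histogram shows.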
Box Plotting
A boxplot is a standardized way of displaying the distribution of data based on a
five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). It can
tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how
tightly your data is grouped, and if and how your data is skewed.
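The five number summary behind a boxplot can be sketched with Python's standard `statistics` module. The sample data are hypothetical, and the 1.5 × IQR rule used to flag outliers is a common convention assumed here; the slides do not say which rule they use.

```python
import statistics

# Hypothetical sample (illustrative only); 25 is deliberately far from the rest.
data = sorted([7, 2, 9, 4, 25, 6, 5, 10, 8])

# Quartiles; method="inclusive" treats the sample as the whole population.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: values beyond 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("minimum:", data[0])
print("Q1:", q1)
print("median:", median)
print("Q3:", q3)
print("maximum:", data[-1])
print("outliers:", outliers)
```

Other quartile conventions exist and can give slightly different Q1 and Q3 values for the same data.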
Statistical concepts of classification of Data
Classification is the process of arranging data into homogeneous (similar)
groups according to their common characteristics.
Raw data cannot be easily understood, and it is not fit for further analysis and
interpretation. Arrangement of data helps users in comparison and analysis. It
is also important for statistical sampling.
Classification of Data
There are four types of classification. They are:
Geographical classification
When data are classified on the basis of location or area, it is called geographical classification.
Chronological classification
Chronological classification means classification on the basis of time, such as months or years.
Qualitative classification
In Qualitative classification, data are classified on the basis of some attributes or quality such as gender,
colour of hair, literacy and religion. In this type of classification, the attribute under study cannot be
measured. It can only be found out whether it is present or absent in the units of study.
Quantitative classification
Quantitative classification refers to the classification of data according to characteristics that can
be measured, such as height, weight, income or profits.
There are two types of quantitative classification of data: Discrete frequency
distribution and Continuous frequency distribution.
In this type of classification there are two elements: variable and frequency.
Variable
Variable refers to the characteristic that varies in magnitude or quantity, e.g. the weight of the
students. A variable may be discrete or continuous.
Frequency
Frequency refers to the number of times each value of the variable occurs. For example, if there
are 50 students who weigh 60 kg, the frequency of 60 kg is 50.
Frequency distribution
Frequency distribution refers to data classified on the basis of some variable that
can be measured such as prices, weight, height, wages etc.
The following technical terms are important when a continuous
frequency distribution is formed:
Class limits: Class limits are the lowest and highest values that can
be included in a class. For example, take the class 51-55. The lowest
value of the class is 51 and the highest value is 55. In this class there
can be no value less than 51 or more than 55. 51 is the lower class
limit and 55 is the upper class limit.
Class interval: The difference between the upper and lower limit of
a class is known as the class interval of that class.
Class frequency: The number of observations corresponding to a
particular class is known as the frequency of that class.
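The terms above can be sketched as a small frequency table. The weights and the two extra classes (56-60 and 61-65) are hypothetical; only the class 51-55 comes from the text.

```python
# Hypothetical weights in kg (illustrative only).
weights = [51, 52, 55, 56, 58, 58, 60, 61, 63, 65]

# Inclusive class limits as (lower, upper); 51-55 is the class used in the text.
classes = [(51, 55), (56, 60), (61, 65)]

# Class frequency: number of observations that fall inside each class.
frequency = {
    limits: sum(1 for w in weights if limits[0] <= w <= limits[1])
    for limits in classes
}

for (lower, upper), count in frequency.items():
    # Class interval here follows the definition above: upper limit minus lower limit.
    print(f"{lower}-{upper}: interval = {upper - lower}, frequency = {count}")
```

Every observation lands in exactly one class, so the class frequencies add up to the number of observations.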
Measures of Central Tendency
In statistics, a measure of central tendency is a descriptive summary of a dataset.
It is a single value that reflects the centre of the data distribution.
It summarises the dataset as a whole and does not provide information about individual values.
The most common measures of central tendency are the mean, the median and the mode.
Mean
The mean represents the average value of the dataset.
It is calculated as the sum of all the values in the dataset divided by the number
of values. Unless stated otherwise, "mean" refers to the arithmetic mean.
Some other measures of mean used to find the central tendency are as follows:
Geometric Mean (nth root of the product of n numbers)
Harmonic Mean (the reciprocal of the average of the reciprocals)
Weighted Mean (where some values contribute more than others)
If all the values in the dataset are the same, then the arithmetic, geometric
and harmonic means are all equal. If the (positive) values vary, the three means
differ, with arithmetic mean ≥ geometric mean ≥ harmonic mean.
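A minimal sketch comparing the four means with Python's standard `statistics` module. The values and the weights are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import statistics

values = [2, 4, 8]  # hypothetical positive values

arithmetic = statistics.mean(values)
geometric = statistics.geometric_mean(values)  # cube root of 2 * 4 * 8
harmonic = statistics.harmonic_mean(values)    # 3 / (1/2 + 1/4 + 1/8)

# Weighted mean: each value counts in proportion to its (assumed) weight.
weights = [1, 1, 2]
weighted = sum(v * w for v, w in zip(values, weights)) / sum(weights)

print("arithmetic:", arithmetic)
print("geometric: ", geometric)
print("harmonic:  ", harmonic)
print("weighted:  ", weighted)

# When every value is identical, the three means coincide.
constant = [5, 5, 5]
print(statistics.mean(constant),
      statistics.geometric_mean(constant),
      statistics.harmonic_mean(constant))
```

For varying positive values the arithmetic mean comes out largest and the harmonic mean smallest.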
Arithmetic Mean
The arithmetic mean is the number obtained by dividing the sum of the elements of
a set by the number of values in the set. In everyday language it is called the average. If a data
set consists of the values b1, b2, b3, ..., bn, then the arithmetic mean B is defined as:
B = (Sum of all observations) / (Total number of observations)
For example, the arithmetic mean of Virat Kohli's batting scores, also called his batting average, is:
Sum of runs scored / Number of innings = 661 / 10
The arithmetic mean of his scores in the last 10 innings is 66.1.
Harmonic Mean
A sequence is a Harmonic Progression if the reciprocals of its terms are in Arithmetic Progression.
The harmonic mean (HM) is calculated by dividing the number of terms
by the sum of the reciprocals of the terms: HM = n / (1/x1 + 1/x2 + ... + 1/xn).
In particular cases, especially those involving rates and ratios, the harmonic mean gives the most
correct value of the mean. For example, if a vehicle travels a specified distance at speed x (e.g. 60
km/h) and then travels the same distance again at speed y (e.g. 40 km/h), the average speed is the
harmonic mean of x and y (i.e. 48 km/h).
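The speed example can be checked in two ways: with `statistics.harmonic_mean`, and by dividing total distance by total time. The leg length of 120 km is an assumption for illustration; any distance gives the same average speed.

```python
import statistics

speeds = [60, 40]  # km/h, from the example above

# Harmonic mean: n divided by the sum of the reciprocals.
hm = statistics.harmonic_mean(speeds)

# Cross-check with distance / time, using a hypothetical leg of 120 km.
leg_km = 120
total_time_h = leg_km / speeds[0] + leg_km / speeds[1]  # 2 h + 3 h
average_speed = 2 * leg_km / total_time_h

print("harmonic mean:", hm)
print("total distance / total time:", average_speed)
print("arithmetic mean (overstates the speed):", statistics.mean(speeds))
```

The arithmetic mean of 60 and 40 is 50 km/h, which is too high because the vehicle spends more time at the slower speed.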
Geometric Mean
The Geometric Mean (GM) is the average value or mean which signifies the
central tendency of a set of positive numbers by using the product of their values.
Basically, we multiply the numbers together and take the nth root of the
product, where n is the total number of values.
For example: for a given set of two numbers such as 3 and 1, the geometric mean
is equal to √(3×1) = √3 ≈ 1.732.
Use of Geometric Mean
For example, suppose you have an investment which earns 10% the first year, 60%
the second year, and 20% the third year. What is its average rate of return?
It is not the arithmetic mean, because what these numbers mean is that in the first
year your investment was multiplied (not added to) by 1.10, in the second year it
was multiplied by 1.60, and in the third year it was multiplied by 1.20. The relevant
quantity is the geometric mean of these three numbers.
The question about finding the average rate of return can be rephrased as: "by what
constant factor would your investment need to be multiplied each year in order to
achieve the same effect as multiplying by 1.10 one year, 1.60 the next, and 1.20 the
third?"
If you calculate this geometric mean, (1.10 × 1.60 × 1.20)^(1/3),
you get approximately 1.283, so the average rate of return is about 28% (not 30%,
which is what the arithmetic mean of 10%, 60%, and 20% would give you).
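The calculation can be reproduced with `statistics.geometric_mean`. The starting amount of 1000 is a hypothetical figure used only to show that the constant factor gives the same final value as the three yearly factors.

```python
import math
import statistics

# Yearly growth factors from the example: 1.10, 1.60 and 1.20.
factors = [1.10, 1.60, 1.20]

constant_factor = statistics.geometric_mean(factors)
average_return_pct = (constant_factor - 1) * 100

# Hypothetical starting amount, used only to compare the two growth paths.
start = 1000.0
actual_final = start * math.prod(factors)
equivalent_final = start * constant_factor ** len(factors)

print(f"constant yearly factor: {constant_factor:.3f}")
print(f"average rate of return: {average_return_pct:.1f}%")
print(f"final value, actual factors:  {actual_final:.2f}")
print(f"final value, constant factor: {equivalent_final:.2f}")
```

Both growth paths end at the same amount, which is why the geometric mean is the right average for multiplicative growth.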
Median
Median is the middle value of the dataset in which the dataset is
arranged in the ascending order or in descending order.
When the dataset contains an even number of values, then the
median value of the dataset can be found by taking the mean of
the middle two values.
If you have a skewed distribution, the best measure of
central tendency is the median.
The median is less sensitive to outliers (extreme scores) than the
mean and is thus a better measure than the mean for highly skewed
distributions, e.g. family income. For example, the mean of 20, 30,
40, and 990 is (20+30+40+990)/4 = 270. The median of these
four observations is (30+40)/2 = 35. Here 3 observations out of 4
lie between 20 and 40, so the mean of 270 fails to give a
realistic picture of the major part of the data. It is influenced by
the extreme value 990.
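The comparison can be reproduced with the standard `statistics` module, using the four observations from the example. The list without 990 is added here only to show how little the median moves.

```python
import statistics

observations = [20, 30, 40, 990]  # values from the example above

mean_value = statistics.mean(observations)
median_value = statistics.median(observations)  # mean of the two middle values

# Drop the extreme value to see how each measure reacts.
without_extreme = [x for x in observations if x != 990]

print("mean:", mean_value, "median:", median_value)
print("without 990 -> mean:", statistics.mean(without_extreme),
      "median:", statistics.median(without_extreme))
```

Removing 990 changes the mean from 270 to 30, while the median only moves from 35 to 30.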
Mode
The mode is the value that occurs most often in the dataset.
Measures of Dispersion
Range: It is simply the difference between the maximum value and the minimum value in a
data set. Example: 1, 3, 5, 6, 7 => Range = 7 - 1 = 6
Variance: Subtract the mean from each value in the set, square each difference, add the
squares, and divide the sum by the total number of values in the data set. Variance
(σ²) = ∑(X − μ)² / N
Standard Deviation: The square root of the variance is known as the standard deviation, i.e. S.D. =
σ = √(σ²).
Quartiles and Quartile Deviation: The quartiles are values that divide a list of numbers into quarters.
The quartile deviation is half of the distance between the third and the first quartile.
Mean and Mean Deviation: The average of numbers is known as the mean, and the arithmetic mean
of the absolute deviations of the observations from a measure of central tendency is known as the mean
deviation (also called mean absolute deviation).
Range
It is the simplest method of measurement of dispersion.
It is defined as the difference between the largest and the smallest item in a given
distribution.
Range = Largest item (L) – Smallest item (S)
Interquartile Range
It is defined as the difference between the Upper Quartile and Lower Quartile of a
given distribution.
Interquartile Range = Upper Quartile (Q3) – Lower Quartile (Q1)
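A minimal sketch of both measures, using the small data set 1, 3, 5, 6, 7 from the range example. The quartile convention (`method="inclusive"`) is an assumption; the slides do not specify one, and other conventions can give slightly different quartiles.

```python
import statistics

data = sorted([1, 3, 5, 6, 7])

# Range = Largest item (L) - Smallest item (S)
data_range = data[-1] - data[0]

# Quartiles under the assumed "inclusive" convention.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
interquartile_range = q3 - q1

print("range:", data_range)
print("Q1:", q1, "Q3:", q3)
print("interquartile range:", interquartile_range)
```

The range depends only on the two extreme items, while the interquartile range describes the spread of the middle half of the data.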
Variance
Variance is a measure of how far a set of data (numbers) is spread out from its mean
(average) value.
The higher the variance, the more scattered the data are around the mean; the lower
the variance, the less scattered they are. Therefore, it is called a
measure of the spread of data about the mean.
The formula for variance is
Var(X) = E[(X − μ)²]
The variance is the square of the standard deviation, i.e.,
Variance = (Standard deviation)² = σ²
Example: Find the variance of the numbers 3, 8, 6, 10, 12, 9, 11, 10, 12, 7.
Given,
3, 8, 6, 10, 12, 9, 11, 10, 12, 7
Step 1: Compute the mean of the 10 values given.
Mean (μ) = (3+8+6+10+12+9+11+10+12+7) / 10 = 88 / 10 = 8.8
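The remaining steps of this example can be sketched as below, assuming the population form of the variance (dividing by N rather than N − 1). The result is cross-checked against `statistics.pvariance`.

```python
import statistics

data = [3, 8, 6, 10, 12, 9, 11, 10, 12, 7]

# Step 1: mean of the 10 values.
mean = sum(data) / len(data)

# Step 2: squared deviation of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in data]

# Step 3: average of the squared deviations (population variance, divides by N).
population_variance = sum(squared_deviations) / len(data)

# Step 4: the standard deviation is the square root of the variance.
standard_deviation = population_variance ** 0.5

print("mean:", mean)
print("variance:", round(population_variance, 2))
print("standard deviation:", round(standard_deviation, 3))
print("statistics.pvariance:", statistics.pvariance(data))
```

If the ten values were treated as a sample from a larger population, `statistics.variance` (which divides by N − 1) would be the usual choice and would give a slightly larger value.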
Coefficient of variation
The coefficient of variation (CV) is a relative measure of variability that indicates the
size of a standard deviation in relation to its mean: CV = standard deviation / mean.
It is a standardized, unitless measure that allows you to compare variability between
disparate groups and characteristics.
It is also known as the relative standard deviation (RSD).
The coefficient of variation facilitates meaningful comparisons in scenarios where
absolute measures cannot.
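A small sketch of why a relative measure helps: the two hypothetical groups below have the same standard deviation but very different means, so their coefficients of variation differ. The helper name `coefficient_of_variation` and the data are assumptions for illustration.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (unitless)."""
    return statistics.pstdev(values) / statistics.mean(values)


# Hypothetical groups with the same spread around different means.
group_a = [10, 20, 30]
group_b = [100, 110, 120]

for name, group in (("A", group_a), ("B", group_b)):
    print(f"group {name}: sd = {statistics.pstdev(group):.3f}, "
          f"CV = {coefficient_of_variation(group):.3f}")
```

Group A varies much more relative to its mean than group B, even though the absolute spread is identical.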
Quartile Deviation
The Quartile Deviation (QD) is half of the difference between the
upper and lower quartiles.
Mathematically, we can define it as: Quartile Deviation = (Q3 – Q1) / 2
The Quartile Deviation is an absolute measure of dispersion. The
relative measure corresponding to QD is known as the coefficient of QD, which
is obtained with the formula: Coefficient of Quartile
Deviation = (Q3 – Q1) / (Q3 + Q1)
The coefficient of QD is used to study and compare the degree of variation in
different situations.
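Both formulas can be sketched as follows. The data are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since the slides do not name one.

```python
import statistics

# Hypothetical, evenly spaced sample (illustrative only).
data = [10, 20, 30, 40, 50, 60, 70, 80, 90]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Absolute measure: half the distance between Q3 and Q1.
quartile_deviation = (q3 - q1) / 2

# Relative measure: unitless, so it can be compared across data sets.
coefficient_of_qd = (q3 - q1) / (q3 + q1)

print("Q1:", q1, "Q3:", q3)
print("quartile deviation:", quartile_deviation)
print("coefficient of QD:", coefficient_of_qd)
```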
Skewness
Skewness is a measure of the degree of asymmetry of a distribution.
If the left tail (tail at the small end of the distribution) is more pronounced than the
right tail (tail at the large end of the distribution), the distribution is said to have
negative skewness.
If the reverse is true, it has positive skewness. If the two are equal, it has zero
skewness.
Kurtosis
Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative
to a normal distribution.
That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets
with low kurtosis tend to have light tails, or lack of outliers.
Significant skewness or excess kurtosis indicates that the data are not normally distributed.
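A minimal sketch of both measures, assuming the standard moment-based definitions (third and fourth central moments divided by powers of the second). The slides give no formulas, so these definitions and the sample data are assumptions. Under this definition a normal distribution has kurtosis 3.

```python
def central_moment(data, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** k for x in data) / len(data)


def skewness(data):
    """Third central moment scaled by the variance to the power 1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5


def kurtosis(data):
    """Fourth central moment scaled by the squared variance (normal = 3)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2


# Hypothetical samples (illustrative only).
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 3, 4, 20]    # long tail at the large end
left_tailed = [1, 17, 18, 19, 20]  # long tail at the small end

for name, sample in (("symmetric", symmetric),
                     ("right-tailed", right_tailed),
                     ("left-tailed", left_tailed)):
    print(f"{name}: skewness = {skewness(sample):.3f}, "
          f"kurtosis = {kurtosis(sample):.3f}")
```

The symmetric sample has zero skewness, the right-tailed sample has positive skewness, and the left-tailed sample has negative skewness.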
Types of Distributions
Normal Distribution
In probability theory and statistics, the Normal Distribution, also called the
Gaussian Distribution, is the most significant continuous probability distribution.
A large number of random variables in the physical sciences and economics are
either nearly or exactly represented by the normal distribution.
In a normal distribution, the mean, median and mode are equal (i.e., Mean =
Median = Mode). The normal curve is symmetric about the
centre.
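These properties can be checked with `statistics.NormalDist` from the Python standard library. The mean of 170 and standard deviation of 10 are hypothetical parameters (for example, heights in cm).

```python
from statistics import NormalDist

# Hypothetical parameters, e.g. heights in cm.
heights = NormalDist(mu=170, sigma=10)

# Mean, median and mode coincide for a normal distribution.
print("mean:", heights.mean, "median:", heights.median, "mode:", heights.mode)

# Symmetry: the density is the same at equal distances on either side of the centre.
left_density = heights.pdf(160)
right_density = heights.pdf(180)
print("pdf(160):", left_density, "pdf(180):", right_density)

# Half of the probability lies below the centre.
below_centre = heights.cdf(170)
print("P(X <= 170):", below_centre)
```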
SAS Exam papers
Paper No.   Name of paper                                    Sincere preparation   Normal preparation
PC 1        Language Skill                                   10                    6
PC 2        Logical, Analytical and Quantitative Abilities   9                     3