Statistics
What is Statistics?
Statistics is a branch of mathematics that deals with the collection, analysis, interpretation
and presentation of data in an understandable and useful form. It is used in many
fields, mainly to organize and analyze data. It is also used to test hypotheses and
to estimate the probability of outcomes.
Why is Statistics Important?
Data Interpretation:
Problem Solving
Informed decision making is a process in which a choice is based on the proper collection,
analysis and interpretation of data gathered for a specific purpose. Statistical tools and
methods transform unorganized raw data into an analyzable form from which insights can
be drawn. Analyzing data in this way helps in making better and more accurate decisions
about future events. It is used in areas such as business, healthcare, and public
policy, where decisions have significant and far-reaching consequences.
Risk Management and Assessment
Risk management and assessment involve identifying potential risks, evaluating their
likelihood and impact, and implementing strategies to mitigate or manage them. This
process is fundamental to maintaining organizational stability and achieving objectives. Key
components include risk identification, where potential issues are recognized; risk analysis,
which quantifies the probability and potential impact of these risks using statistical
methods such as probability distributions and regression analysis; and risk evaluation,
which prioritizes risks based on their significance. Techniques such as Monte Carlo
simulations, fault tree analysis, and sensitivity analysis are often used to model risk
scenarios and develop mitigation strategies. By systematically assessing and managing
risks, organizations can minimize negative outcomes and capitalize on opportunities,
ensuring long-term success and resilience.
Statistical Methods and Tools
The various methods and tools used in statistics are mentioned below:
Descriptive Statistics
Descriptive statistics is used to describe and summarize the main numerical features of
the data we have. This includes:
Measures of central tendency: These give the central point of the data set, using the
Mean, Median and Mode.
Measures of variability: These describe how spread out the data is, using the Variance
and Standard deviation.
Inferential Statistics
Inferential statistics is the other part of statistics, which enables us to make predictions or
inferences about a large population based on sample data taken from
that population. It includes methods like:
These methods are used for drawing conclusions that extend beyond the immediate data,
allowing for generalized findings and informed predictions.
In statistics, internal data comes from within an organization, while external data comes
from outside the organization:
Internal data
Data that is generated and used within a company or organization. This data can come
from areas such as operations, maintenance, personnel, and finance. Examples of
internal data include expense reports, cash flow reports, production reports, and
budget variance analysis. Internal data is usually stored in spreadsheets, databases, or
customer relationship management (CRM) systems.
External data
Data that is collected outside an organization from areas like press releases, statistics
departments, government databases, and market research. Examples of external data
include market research reports, social media data, and government data.
Internal data is free to the company, and it can be very relevant and telling. External data can
be purchased from third-party providers or gathered from publicly available sources.
In statistics, **internal** and **external sources of data** refer to the origin from where
data is collected for analysis.
These are data collected from within the organization or system that is being studied.
Internal sources tend to be more specific and relevant to the particular needs of the entity
collecting the data. Examples include:
These are data obtained from sources outside the organization. External data is often used
to complement or enrich internal data. Examples include:
5.**Public databases** – Open-source data from organizations like the World Bank,
WHO, or Eurostat.
6.**Social media and web data** – Data from online platforms, user behavior, and
engagement metrics.
Both internal and external sources play critical roles in forming a comprehensive dataset
for statistical analysis. Internal data provides direct insights from within the organization,
while external data helps in benchmarking and understanding broader trends.
Frequency distribution is a method of organizing and summarizing data to show the
frequency (count) of each possible outcome of a dataset. It is an essential tool in statistics
for understanding the distribution and pattern of data. There are several types of
frequency distributions used based on the nature of the data and the analysis
required.
It is not always possible for an investigator to easily measure the items of a series or set of
data. To make the data simple and easy to read and analyze, the items of the series are
placed within a range of values or limits. In other words, the given raw set of data is
categorized into different classes with a range, known as Class Intervals. Every item of
the given series is put against a class interval with the help of tally bars. The number of
items occurring in the specific range or class interval is shown under Frequency against
that particular class range to which the item belongs.
The marks of a class of 20 students are 11, 27, 18, 14, 28, 18, 2, 22, 11, 24, 22, 11, 8, 20,
25, 28, 30, 12, 11, 8. Prepare a frequency distribution table for the same.
Solution:
The marks of the students range from 2 to 30. Let us take class intervals 0-5, 5-10, 10-15,
15-20, 20-25, and 25-30.
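The tally for this example can be sketched in Python. This is only an illustration: the function name frequency_table is made up for this sketch, and it assumes that the top mark of 30 is counted in the last class 25-30, since no 30-35 class is listed.

```python
# Marks of the 20 students from the example above.
marks = [11, 27, 18, 14, 28, 18, 2, 22, 11, 24,
         22, 11, 8, 20, 25, 28, 30, 12, 11, 8]

# Class boundaries for 0-5, 5-10, ..., 25-30.
edges = [0, 5, 10, 15, 20, 25, 30]


def frequency_table(values, edges):
    """Count values per class: lower limit included, upper limit excluded.

    Assumption: the last class also includes its upper limit, so that a
    mark of 30 is counted in 25-30 instead of being dropped.
    """
    classes = list(zip(edges[:-1], edges[1:]))
    counts = [0] * len(classes)
    for v in values:
        for i, (lo, hi) in enumerate(classes):
            is_last = i == len(classes) - 1
            if lo <= v < hi or (is_last and v == hi):
                counts[i] += 1
                break
    return [(f"{lo}-{hi}", c) for (lo, hi), c in zip(classes, counts)]


table = frequency_table(marks, edges)
for label, count in table:
    # Print the class, its tally bars and its frequency.
    print(f"{label:>6}  {'|' * count:<8}  {count}")
```

Under that assumption the frequencies add up to 20, the number of students.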
1. Exclusive Series
2. Inclusive Series
3. Open End Series
The series with class intervals, in which all the items having the range from the lower
limit to the value just below its upper limit are included, is known as the Exclusive
Series. This kind of frequency distribution is known as an exclusive series because the
frequencies corresponding to the specific class interval do not include the value of its
upper limit. For example, if a class interval is 0-10, and the values of the given series are
4, 10, 2, 15, 8, and 9, then only 4, 2, 8, and 9 will be included in the 0-10 class interval. 10
and 15 will be included in the next class interval, i.e., 10-20. Also, the upper limit of a class
interval is the lower limit of the next class interval.
From the above table of exclusive series, it can be seen that the upper limit of the first class
interval is the lower limit of the second class interval, and so on. Also, as
discussed above, if the data includes the value 10, it will be included in the class interval
10-20, not in 0-10.
2. Inclusive Series
The series with class intervals, in which all the items ranging from the lower
limit up to and including the upper limit are included, is known as Inclusive Series. Unlike in an
exclusive series, the upper limit of one class interval does not repeat itself as the lower limit of the next
class interval. Therefore, there is a gap (between 0.1 and 1) between the upper limit of one
class interval and the lower limit of the next class interval. For example, class intervals of
an inclusive series can be 0-9, 10-19, 20-29, 30-39, and so on. In this case, the gap
between the upper limit of one class interval and the lower limit of the next class interval is
1, and the class intervals do not share a limit with each other as they do in an exclusive series.
Sometimes it gets difficult to perform statistical analysis with inclusive series. In those
cases, the inclusive series is converted into an exclusive series.
Frequency Distribution in Inclusive Series Example
From the above table of inclusive series, it can be seen that the upper limit of one class
interval (say, 9 of interval 0-9) is not the same as the lower limit of the next class interval (10
of interval 10-19). Also, all the values that come under 0-9, including 0 and 9 are included
in the frequency against 0-9.
For statistical calculation, it sometimes becomes necessary to convert the inclusive series into
an exclusive series. Suppose, in the above example, some students have
obtained marks such as 10.5, 40.5, etc. In this case, the series will be converted into
an exclusive series.
The steps for converting an inclusive series into exclusive series are:
In this first step, calculate the difference between the upper class limit of one class
interval and the lower limit of the next class interval.
The next step is to divide the difference by two and then add the resulting value to the upper
limit of every class interval and subtract it from the lower limit of every class interval.
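The two steps above can be sketched as a small Python function. The name to_exclusive and the sample intervals are chosen for illustration only, and the sketch assumes the gap between classes is the same throughout the series.

```python
def to_exclusive(intervals):
    """Convert inclusive class intervals to exclusive ones.

    intervals: list of (lower, upper) pairs in ascending order, with the
    same gap between every upper limit and the next lower limit.
    """
    # Step 1: gap between an upper limit and the next lower limit.
    gap = intervals[1][0] - intervals[0][1]
    # Step 2: halve the gap, add it to every upper limit and
    # subtract it from every lower limit.
    half = gap / 2
    return [(lo - half, hi + half) for lo, hi in intervals]


inclusive = [(0, 9), (10, 19), (20, 29), (30, 39)]
exclusive = to_exclusive(inclusive)
print(exclusive)
# [(-0.5, 9.5), (9.5, 19.5), (19.5, 29.5), (29.5, 39.5)]
```

After the conversion, the upper limit of each class equals the lower limit of the next one, which is the defining property of an exclusive series.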
Example:
The inclusive series of the above example is converted into exclusive series as under.
Difference between Inclusive and Exclusive Series
In Inclusive Series, the upper limit of one class interval is not the same as the
lower limit of the next class interval. There is a gap ranging from 0.1 to 1.0
between the upper class limit of one class interval and the lower class limit of the
next class interval. However, in the Exclusive Series, the upper limit of one class
interval is the same as the lower limit of the next class interval.
In the case of Inclusive Series, the value of the upper and the lower limit are included
in that class interval only. However, in the case of Exclusive Series, the value of upper
limit of a class interval is not included in that interval, instead, it is included in the next
class interval.
Inclusive Series is suitable for an investigator only if the values are whole numbers
and not in decimal form. However, an Exclusive Series is suitable for an investigator
whether the values are whole numbers or in decimal form.
Sometimes the lower limit of the first class interval and the upper limit of the last class
interval of a series are not available; instead, Less than or Below is mentioned in the former
case (in place of the lower limit of the first class interval), and More than or Above is
mentioned in the latter case (in place of the upper limit of the last class interval). These
types of series are known as Open End Series.
For statistical calculations, if one needs to change the first and last class open-end class
interval into limits, it can be done by the general practice of giving the same magnitude
or class size to these intervals as the class size of other class intervals. In the above
example, the magnitude of other class intervals is 5. Therefore, the open-end class intervals
can be written as 5-10 and 30-35, respectively.
4. Cumulative Frequency Series
A series whose frequencies are continuously added corresponding to the class intervals
is known as Cumulative Frequency Series.
A simple frequency series can be converted into a cumulative frequency series. There are
two ways through which it can be done. These are as follows:
Convert the following simple frequency series into a cumulative frequency series using both
ways.
Solution:
To attain the frequency against a specific class interval of a cumulative frequency series, it can
be converted into a simple frequency series.
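Both directions of the conversion can be sketched in Python. The class intervals and frequencies below are hypothetical, and the sketch assumes the two ways referred to above are the usual "less than" and "more than" cumulations.

```python
from itertools import accumulate

# Hypothetical simple frequency series (not taken from the text).
classes = ["0-10", "10-20", "20-30", "30-40"]
freq = [3, 7, 6, 4]

# "Less than" cumulation: running total starting from the first class.
less_than = list(accumulate(freq))

# "More than" cumulation: running total starting from the last class.
more_than = list(accumulate(reversed(freq)))[::-1]

# Back to a simple series: difference of consecutive "less than" totals.
recovered = [less_than[0]] + [b - a for a, b in zip(less_than, less_than[1:])]

print(less_than)   # [3, 10, 16, 20]
print(more_than)   # [20, 17, 10, 4]
print(recovered)   # [3, 7, 6, 4]
```

The last "less than" total and the first "more than" total both equal the total number of items.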
Example:
Solution:
5. Mid-Value Frequency Series
The series in which, instead of class intervals, their mid-values are given with the
corresponding frequencies, is known as Mid-Value Frequency Series.
The steps to convert a mid-value frequency series into a simple frequency series are as
follows:
The first step is to determine the mutual difference between the mid-values.
The second step is to divide this difference by two.
The last step of conversion is to subtract the resulting figure from the second
step from the mid-value to get the lower limit of the class interval, and add the
resulting figure from the second step to the mid-value to get the upper limit.
Lower Limit = m – i/2 and Upper Limit = m + i/2
Where, m = Mid-Value and i = difference between the mid-values
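The conversion can be sketched in Python as follows. The function name and the mid-values are hypothetical, and the sketch assumes the mid-values are equally spaced so that one difference applies to every class.

```python
def mid_values_to_intervals(mid_values):
    """Rebuild class intervals from equally spaced mid-values."""
    # Difference between consecutive mid-values.
    diff = mid_values[1] - mid_values[0]
    # Half of that difference.
    half = diff / 2
    # Lower limit = mid-value - half, upper limit = mid-value + half.
    return [(m - half, m + half) for m in mid_values]


mids = [5, 15, 25, 35]  # hypothetical mid-values
intervals = mid_values_to_intervals(mids)
print(intervals)
# [(0.0, 10.0), (10.0, 20.0), (20.0, 30.0), (30.0, 40.0)]
```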
Convert the following Mid-Value Frequency Series into Simple Frequency Series.
Solution:
Calculation:
When the classes of a series are of the same interval, it is known as Equal Class Interval
Series.
Following is the frequency distribution of marks of 25 students with equal class intervals.
When the classes of a series are of unequal interval, it is known as Unequal Class Interval
Series.
Simple Arithmetic Mean gives equal importance to all the variables in a series. However, in
some situations, a greater emphasis is given to one item and less to others, i.e., the
variables are ranked according to their significance in that situation. For example, during
inflation, the price of everything in an economy tends to rise, but households attach more
importance to the rise in the price of necessary food items than to the rise in the
price of clothes. In other words, more significance is given to the
price of food and less to the price of clothes. This is when Weighted Arithmetic Mean
comes into the picture.
When every item in a series is assigned some weight according to its significance, the average
of such series is called Weighted Arithmetic Mean.
Here, weight stands for the relative importance of the different variables. In simple words,
the Weighted Arithmetic Mean is the mean of weighted items and is also known as the
Weighted Average Mean.
Weighted Arithmetic Mean is calculated as the weighted sum of the items divided by the
sum of the weights.
Step-1: All the items (X) in a series are weighted according to their significance.
Weights are denoted as ‘W’.
Step-2: Add up all the values of weights ‘W’ to get the sum total of weights, i.e.,
∑W = W1 + W2 + W3 + ... + Wn
Step-3: Items (X) are multiplied by the corresponding weights (W) to get ‘XW’.
Step-4: Add up all the values of ‘XW’ to get the sum total of the product ‘XW’, i.e.,
∑XW = X1W1 + X2W2 + X3W3 + ... + XnWn
Step-5: To get the weighted mean, divide the weighted sum of the items ‘∑XW’ by
the sum of weights ‘∑W’, i.e.,
Weighted Mean = ∑XW / ∑W
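The five steps can be sketched as a short Python function. The function name and the marks and credit weights below are hypothetical values chosen for illustration.

```python
def weighted_mean(items, weights):
    """Weighted arithmetic mean: sum(X * W) / sum(W)."""
    if len(items) != len(weights):
        raise ValueError("items and weights must have the same length")
    total_weight = sum(weights)                             # Step 2: sum of W
    if total_weight == 0:
        raise ValueError("sum of weights must not be zero")
    weighted_sum = sum(x * w for x, w in zip(items, weights))  # Steps 3-4: sum of XW
    return weighted_sum / total_weight                      # Step 5


# Hypothetical example: marks in three subjects weighted by credits.
marks = [80, 90, 70]
credits = [3, 2, 5]
print(weighted_mean(marks, credits))  # (240 + 180 + 350) / 10 = 77.0
```

With equal weights the result reduces to the simple arithmetic mean.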
Example:
Items (X):    5   10   25   20   25   30
Weight (W):   8    4    5   10    7    6
Solution:
Items (X)    Weight (W)    XW
5            8             40
10           4             40
25           5             125
20           10            200
25           7             175
30           6             180
             ∑W = 40       ∑XW = 760
Weighted Mean = ∑XW / ∑W
= 760/40
= 19
Explanation:
1. Multiply each item with its corresponding weight to get XW, i.e.,
5 × 8 = 40, 10 × 4 = 40, 25 × 5 = 125, 20 × 10 = 200, 25 × 7 = 175, 30 × 6 = 180
2. Add up all the values of weight to get the sum of weights, i.e.,
∑W = 8 + 4 + 5 + 10 + 7 + 6 = 40
3. Add up all the values of the product of weight and items (XW) to get the sum of
the product, i.e.,
∑XW = 40 + 40 + 125 + 200 + 175 + 180 = 760
Mean, Median, and Mode are measures of central tendency. Each one describes a data set
with a single representative value. They give useful insights about the data studied and
can be applied to many kinds of data, such as the average salary of employees in an
organization, the median age of a class, or the most popular sport in a sports club.
A measure of central tendency is a single value that represents the values of the given data set.
There are various measures of central tendency, and the three most important
are:
Mean
Median
Mode
Mean, median, and mode are measures of central tendency used in statistics to summarize
a set of data.
Mean (x̅ or μ): The mean, or arithmetic average, is calculated by summing all the values
in a dataset and dividing by the total number of values. It’s sensitive to outliers and is
commonly used when the data is symmetrically distributed.
Median (M): The median is the middle value when the dataset is arranged in ascending
or descending order. If there’s an even number of values, it’s the average of the two
middle values. The median is robust to outliers and is often used when the data is
skewed.
Mode (Z): The mode is the value that occurs most frequently in the dataset. Unlike the
mean and median, the mode can be applied to both numerical and categorical data. It’s
useful for identifying the most common value in a dataset.
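As a quick illustration, the three measures can be computed with Python's standard statistics module. The data values below are hypothetical.

```python
import statistics

# Hypothetical numerical data set.
data = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(data)         # sum of values / number of values
median = statistics.median(data)     # average of the two middle values (even count)
modes = statistics.multimode(data)   # list of all most frequent values

print(mean, median, modes)           # 5 4.0 [3]

# The mode also works for categorical data, unlike the mean and median.
colours = ["red", "blue", "red", "green"]
print(statistics.mode(colours))      # red
```

statistics.multimode returns a list, so it also covers bimodal and multimodal data.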
What is Mean?
Mean is the sum of all the values in the data set divided by the number of values in the
data set. It is also called the Arithmetic Average. Mean is denoted as x̅ and is read as x
bar.
Formula of Mean
Mean Symbol
The symbol used to represent the mean, or arithmetic average, of a dataset is typically the
Greek letter “μ” (mu) when referring to the population mean, and “x̅” (x-bar) when
referring to the sample mean.
These symbols are commonly used in statistical notation to represent the average value
of a set of data points.
Mean Formula
If x1, x2, x3,……, xn are the values of a data set then the mean is calculated as:
x̅ = (x1 + x2 + x3 + . . . + xn) / n
Example: Find the mean of the data set 10, 30, 40, 20, and 50.
Solution:
Mean = (10 + 30 + 40 + 20 + 50) / 5
= 150 / 5
= 30
Mean for the grouped data can be calculated by using various methods. The most common
methods used are listed below. In each formula, xi is the mid-value of the ith class and fi
is its frequency.
Direct Method: x̅ = ∑fixi / ∑fi
Where, ∑fi is the sum of all frequencies
Assumed Mean Method: x̅ = a + ∑fidi / ∑fi
Where, a is Assumed mean, di is equal to xi – a, ∑fi is the sum of all frequencies
Step Deviation Method: x̅ = a + h∑fiui / ∑fi
Where, a is Assumed mean, h is Class size, ui = (xi – a)/h, ∑fi is the sum of all frequencies
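The following Python sketch checks that the direct, assumed mean and step deviation calculations give the same grouped mean. The mid-values, frequencies and the choice of assumed mean are hypothetical.

```python
# Hypothetical grouped data: class mid-values x_i with frequencies f_i.
x = [5, 15, 25, 35, 45]
f = [2, 3, 5, 3, 2]
a = 15   # assumed mean, chosen arbitrarily from the mid-values
h = 10   # class size

n = sum(f)  # sum of all frequencies

# Direct method: sum(f_i * x_i) / sum(f_i)
direct = sum(fi * xi for fi, xi in zip(f, x)) / n

# Assumed mean method: a + sum(f_i * d_i) / sum(f_i), with d_i = x_i - a
assumed = a + sum(fi * (xi - a) for fi, xi in zip(f, x)) / n

# Step deviation method: a + h * sum(f_i * u_i) / sum(f_i), with u_i = (x_i - a) / h
step = a + h * sum(fi * (xi - a) / h for fi, xi in zip(f, x)) / n

print(direct, assumed, step)  # 25.0 25.0 25.0
```

The assumed mean and step deviation methods only shift and rescale the values to make hand calculation easier, so the result does not depend on the choice of a.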
What is Median?
A Median is a middle value for sorted data. The sorting of the data can be done either in
ascending order or descending order. A median divides the data into two equal halves.
How the median is found depends on whether the number of terms is even or odd. If the
number of terms is odd, the median is the single middle term of the sorted data. If the
number of terms is even, the median is the average of the two middle terms.
Median Symbol
The letter “M” is commonly used to represent the median of a dataset, whether it’s for a
population or a sample. In Indian statistical practice, “M” is widely accepted and
understood as the symbol for the median.
Median Formula
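If the number of values (n value) in the data set is odd then the formula to calculate the
median is:
Median = ((n + 1)/2)th term
If the number of values (n value) in the data set is even then the formula to calculate the
median is:
Median = [(n/2)th term + ((n/2) + 1)th term] / 2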
Solution:
Step 2: Check whether n (the number of terms in the data set) is even or odd, and find the
median of the data for the respective ‘n’ value.
= 30
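The median of grouped data is calculated using the formula,
Median = l + [(n/2 – cf) / f] × h
where
l is lower limit of the median class
n is number of observations
cf is cumulative frequency of the class preceding the median class
f is frequency of the median class
h is class size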
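A minimal Python sketch of the grouped median calculation is given below. It assumes the standard formula Median = l + ((n/2 – cf) / f) × h, where l is the lower limit of the median class, cf is the cumulative frequency of the classes before it and f is its frequency. The function name and the data are hypothetical.

```python
def grouped_median(lower_limits, freqs, h):
    """Median of grouped data with equal class size h."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("no observations")
    cf = 0  # cumulative frequency of the classes before the current one
    for l, f in zip(lower_limits, freqs):
        # The median class is the first class whose cumulative
        # frequency reaches n/2.
        if cf + f >= n / 2:
            return l + ((n / 2 - cf) / f) * h
        cf += f


# Hypothetical classes 0-10, 10-20, 20-30, 30-40.
lower_limits = [0, 10, 20, 30]
freqs = [3, 5, 8, 4]
print(grouped_median(lower_limits, freqs, 10))
# n = 20, median class 20-30, cf = 8, f = 8: 20 + (2 / 8) * 10 = 22.5
```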
What is Mode?
A mode is the most frequent value or item of the data set. A data set can have
one or more than one mode. If the data set has one mode then it is called
“Unimodal”. Similarly, if the data set contains 2 modes then it is called “Bimodal” and if the
data set contains 3 modes then it is known as “Trimodal”. If the data set has
more than one mode then it is known as “Multimodal” (which includes bimodal and trimodal).
There is no mode for a data set if every value appears only once.
Formula of Mode
Symbol of Mode
In statistical notation, the symbol “Z” is commonly used to represent the mode of a
dataset. It indicates the value or values that occur most frequently within the dataset. This
symbol is widely utilised in statistical discourse to signify the mode, enhancing clarity
and precision in statistical discussions and analyses.
Mode Formula
Solution:
Mode = 2
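For grouped data, the mode is calculated using the formula,
Mode = l + [(f1 – f0) / (2f1 – f0 – f2)] × h
where,
f1 is the frequency of the modal class, f0 and f2 are the frequencies of the classes
preceding and succeeding the modal class,
h is the size of class intervals, and l is the lower limit of modal class.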
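A minimal Python sketch of the grouped mode calculation is given below. It assumes the standard formula Mode = l + ((f1 – f0) / (2f1 – f0 – f2)) × h, where f1 is the frequency of the modal class and f0 and f2 are the frequencies of the classes before and after it. The function name and the data are hypothetical.

```python
def grouped_mode(lower_limits, freqs, h):
    """Mode of grouped data with equal class size h."""
    # Modal class: the class with the highest frequency (first one on ties).
    i = max(range(len(freqs)), key=lambda k: freqs[k])
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0                # class before the modal class
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0   # class after the modal class
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("formula undefined: neighbouring classes have equal frequency")
    return lower_limits[i] + (f1 - f0) / denominator * h


# Hypothetical classes 0-10, 10-20, 20-30, 30-40.
lower_limits = [0, 10, 20, 30]
freqs = [2, 6, 10, 4]
print(grouped_mode(lower_limits, freqs, 10))
# modal class 20-30, f1 = 10, f0 = 6, f2 = 4: 20 + (4 / 10) * 10 = 24.0
```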
Relation between Mean, Median, and Mode
For moderately skewed data, the three measures of central tendency mean, median, and
mode are approximately related by:
Mode = 3 Median – 2 Mean
This is known as the empirical relationship between mean, median and mode.
When any two of the measures are known for a given set of data, it can be used to find
the third. The LHS and RHS can be rearranged to rewrite this relationship in various
ways.
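The rearrangements can be sketched in Python. This assumes the relation meant here is the usual empirical formula Mode = 3 Median – 2 Mean, which holds only approximately and for moderately skewed data. The function names and numbers are hypothetical.

```python
def empirical_mode(mean, median):
    """Mode = 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean


def empirical_mean(median, mode):
    """Rearranged for the mean: Mean = (3 * Median - Mode) / 2."""
    return (3 * median - mode) / 2


def empirical_median(mean, mode):
    """Rearranged for the median: Median = (2 * Mean + Mode) / 3."""
    return (2 * mean + mode) / 3


print(empirical_mode(mean=30, median=28))    # 3 * 28 - 2 * 30 = 24
print(empirical_mean(median=28, mode=24))    # (84 - 24) / 2 = 30.0
print(empirical_median(mean=30, mode=24))    # (60 + 24) / 3 = 28.0
```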
What is Range?
In a given data set, the difference between the largest value and the smallest value is
called the range of the data set. For example, suppose the heights (in cm) of 10 students in a
class, given in ascending order, are 160, 161, 167, 169, 170, 172, 174, 175, 177, and 181.
Then the range of the data set is (181 – 160) = 21 cm.
Range of Data
Range is the difference between the highest value and the lowest value. It is a way to
understand how the numbers are spread in a data set. The range of any data set is easily
calculated by using the formula given below.
Range Formula:
The formula to find the Range is:
Range = Highest Value – Lowest Value
Example: Find the range of the given data set 12, 19, 6, 2, 15, 4.
Solution:
Highest Value = 19, Lowest Value = 2
Range = 19 – 2 = 17
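As a short illustration, the range can be computed in Python with the built-in max and min functions. The function name data_range is chosen for this sketch, and the data is the student heights example from this section.

```python
def data_range(values):
    """Range = highest value - lowest value."""
    if not values:
        raise ValueError("range of an empty data set is undefined")
    return max(values) - min(values)


# Heights (in cm) of the 10 students from the example in this section.
heights = [160, 161, 167, 169, 170, 172, 174, 175, 177, 181]
print(data_range(heights))  # 181 - 160 = 21
```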
Differences between Mean, Median and Mode
Sensitivity
Mean: Mean is sensitive to outliers.
Median: Median is not sensitive to outliers.
Mode: Mode is not sensitive to outliers.
Calculation
Mean: Calculated by adding up all values of a dataset and dividing them by the total
number of values in the dataset.
Median: Calculated by finding the middle value in a sorted list of data.
Mode: Calculated by finding which value occurs the greatest number of times in a dataset.
Representation
Mean: Value of mean may or may not be in the dataset.
Median: Value of median is a value from the dataset when the number of values is odd;
when it is even, the median is the average of the two middle values and may not be in the
dataset.
Mode: Value of mode is always a value from the dataset.
Note: Mean gets easily affected by extreme values.
The difference between Mean and Median can be understood by the following example. In a
school, there are 8 teachers who each earn a salary of 20000 rupees and a principal with a
salary of 35000 rupees. Find their mean salary and median salary.
Mean = (20000+20000+20000+20000+20000+20000+20000+20000+35000)/9 =
195000/9 = 21666.67
For median, in ascending order: 20000, 20000, 20000, 20000, 20000, 20000, 20000,
20000, 35000.
n = 9,
Thus, (9 + 1)/2 = 5
Median = 5th term = 20000
Mode = 20,000.
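The salary example can be checked with Python's standard statistics module. The sketch below uses the nine salaries from the example; the variable names are chosen for illustration.

```python
import statistics

# Eight teachers at 20000 rupees each and one principal at 35000 rupees.
salaries = [20000] * 8 + [35000]

mean_salary = statistics.mean(salaries)      # 195000 / 9, pulled up by the principal
median_salary = statistics.median(salaries)  # 5th value of the sorted list
mode_salary = statistics.mode(salaries)      # most frequent value

print(round(mean_salary, 2), median_salary, mode_salary)
# 21666.67 20000 20000
```

Only the mean moves away from 20000, which shows how a single extreme value affects it while the median and mode stay unchanged.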
In our daily life we come across various instances where we have to use the concepts of
mean, median and mode. There are various applications of mean, median and mode; here’s
how they link to real life:
Mode: Mode represents the most frequently occurring value in a dataset and is
used in scenarios where identifying the most common value is important. For
example, in manufacturing, the mode may be used to identify the most common
defect in a production line to prioritize quality control efforts.
Solved Questions on Mean, Median, and Mode
Question 1: Study the bar graph given below and find the mean, median, and mode of
the given data set.
Solution:
Mean = (5 + 7 + 9 + 6) / 4
= 27 / 4
= 6.75
For the median, arrange the data in ascending order: 5, 6, 7, 9
n = 4 (which is even)
Median = (6 + 7) / 2
= 6.5
Range = 9 – 5
= 4
Question 2: Find the mean, median, mode, and range for the given data
190, 153, 168, 179, 194, 153, 165, 187, 190, 170, 165, 189, 185, 153, 147, 161, 127, 180
Solution:
For Mean:
190, 153, 168, 179, 194, 153, 165, 187, 190, 170, 165, 189, 185, 153, 147, 161, 127, 180
Number of observations = 18
Mean = (190+153+168+179+194+153+165+187+190+170+165+189+185+153+147
+161+127+180) / 18
= 3056/18
= 169.78 (approximately)
For Median:
Arranging the data in ascending order:
127, 147, 153, 153, 153, 161, 165, 165, 168, 170, 179, 180, 185, 187, 189, 190, 190, 194
Here, n = 18 (which is even)
Median = (9th term + 10th term) / 2
= (168 + 170) / 2
= 169
For Mode:
The value 153 occurs most often (3 times).
Thus, mode = 153
For Range:
Range = Highest value – Lowest value
= 194 – 127
= 67
Solution:
Step 2: Check whether n (the number of terms in the data set) is even or odd, and find the
median of the data for the respective ‘n’ value.
= (22+23) / 2
= 22.5
Question 4: Find the mode of given data 15, 42, 65, 65, 95.
Solution:
Mode = 65
Practice Questions on Mean, Median and Mode
Question 1: A company recorded the weekly sales (in dollars) of five salespersons as
follows: $450, $520, $480, $510, and $490. Find the mean sales value for this group.
Question 2: Find the median of the following data set: 12, 15, 20, 9, 17, 25, 10.
Question 3: A survey collected the number of books read by a group of 10 people last year:
5, 7, 6, 5, 9, 7, 8, 5, 10, 6. What is the mode of the data set?
Question 4: In a classroom, the scores (out of 100) for a test are: 56, 78, 67, 45, 56, 90,
56, 67, 78, 82. Find the mean, median, and mode of the scores.
Question 5: In a skewed distribution the mean of the data is 40 and median of the data
is 35. Calculate the mode of the data set.
Conclusion
Mean, Median and Mode are essential statistical measures of central tendency that
provide different perspectives on data sets. The mean provides a general average, making
it useful for evenly distributed data. The median gives a middle value, providing a better
view of central tendency when dealing with skewed distributions or extreme values. The
mode highlights the most frequent value, making it valuable in
categorical data analysis.