Elements of Statistics BCA Sem-I.
Introduction to statistics
The word statistics has been derived from the Latin word ‘status’, which means political state. In the plural sense
it means a set of numerical figures, called data, obtained by counting or measurement. In the singular sense, it
means the collection, organization, presentation, analysis and interpretation of data. It has been defined in different
ways by different authors.
Statistics is the study of collection, organization, analysis, interpretation and presentation of data with the use of
quantified models. In short, it is a mathematical tool that is used to collect and summarize data.
Croxton and Cowden defined it as “the science which deals with the collection, analysis and interpretation of
numerical data”.
Scope of statistics
Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, presentation, and
organization of data. It is a tool used to make sense of data and draw meaningful conclusions and inferences
about a population based on a sample of data.
The scope of statistics is broad and includes a wide range of applications in fields such as biology, social
sciences, engineering, economics, and many others. Some of the key areas in which statistics is used include:
1. Data collection and analysis: Statistics is used to design studies and experiments, collect data, and analyze it
to draw conclusions and make inferences about a population.
2. Descriptive statistics: This involves summarizing and describing data using measures such as mean, median,
mode, and standard deviation, as well as visualizing data using techniques such as histograms and box plots.
3. Inferential statistics: This involves using a sample of data to make inferences about a population. Inferential
statistics involves the use of statistical models and hypothesis testing to determine the likelihood of a relationship
between variables or to make predictions about future events.
4. Probability: Statistics also involves the study of probability, which is used to model and understand random
events and make predictions about the likelihood of certain outcomes.
5. Survey design and analysis: Statistics is used to design and analyze surveys, which are used to gather
information about a population, such as opinions or attitudes.
In summary, the scope of statistics is wide and involves the collection, analysis, interpretation, and presentation
of data for a variety of purposes and applications.
In statistical terms, population refers to the complete set of individuals or objects that have a certain
characteristic or attribute of interest. When conducting research or making statistical inferences, it is often not
possible or practical to study every member of the population. Instead, a sample of the population is selected
and studied in order to make inferences about the larger group.
The population can be described using various measures such as mean, median, mode, variance, and standard
deviation. These measures provide information about the central tendency, dispersion, and distribution of the
population data.
It’s important to note that when working with a sample rather than the complete population, the estimates
obtained from the sample may not perfectly reflect the characteristics of the population. This is due to sampling
error, which can be reduced by increasing the sample size or using a more representative sample.
In summary, in statistics, population refers to the complete group of individuals or objects of interest and the
study of the population helps to describe and understand the characteristics of the group.
Raw data refers to unprocessed data that has been collected from various sources, such as surveys, experiments,
or databases. Raw data is usually in its original form and hasn’t been manipulated or analyzed.
Attributes are characteristics or features of the data. For example, in a study of a population of individuals, the
attributes might include age, gender, education level, and income.
Variables are attributes that can take on different values. For example, in a study of a population of individuals,
the variable “age” can take on different values for each person in the population, such as 20, 25, 30, etc. In
statistical analysis, variables are used to answer questions and make predictions.
In summary, raw data is the starting point for any analysis, while attributes and variables are used to describe
and analyze the data.
A frequency distribution is one of the first methods used to organize data in an effective way. It supports the
systematic investigation of the raw data: the data are first arranged according to their frequencies and then
presented as a frequency table.
Frequency distribution is defined as the systematic representation of different values of variables along with the
corresponding frequencies; it is classified on the basis of class interval.
Class interval is defined as the size of each class into which a range of variables is divided and represented as
histogram or bar graph.
Class intervals are divided into two different categories: exclusive and inclusive class intervals. Here is an
example of each:
The class interval where the upper limit of one class is the same as the lower limit of the next class is
called an exclusive class interval. For example:
S. No.   Marks    No. of students
1        0-20     8
2        20-40    7
3        40-60    3
The class interval where the upper limit of one class is not the same as the lower limit of the next class (both
limits are included in the class) is called an inclusive class interval. For example:
S. No.   Marks    Number of students
1        1-20     7
2        21-40    9
3        41-60    8
Based on the class interval, frequency distributions are further classified into two types: the discrete
frequency table and the continuous frequency table. Here are the examples:
If the class interval of the data is not given, it is termed a discrete frequency distribution. For example:
S. no.   Number of items   Number of packets
1        1                 23
2        2                 12
3        3                 34
4        4                 20
5        5                 72
Total                      163
When the class intervals are available within the data, it is called a continuous frequency distribution. For
example:
S. No   Marks    Number of students
1       0-10     5
2       20-30    7
3       30-40    12
4       40-50    32
5       50-60    4
Total            60
As the name suggests, a grouped frequency distribution is well defined and distributed into groups. When the
variables are continuous, the data is gathered as a grouped frequency distribution. Different measures are taken
during data collection, such as age, salary, etc. The entire data is classified into class intervals. For
example:
Family Income     Number of persons
Below 20,000      52
20,001-30,000     14
30,001-40,000     6
40,001-50,000     8
As the name suggests, an ungrouped frequency distribution does not consist of well-distributed class intervals.
Ungrouped frequency distribution is applied to discrete data rather than continuous data. Examples of such data
usually include data related to gender, marital status, medical data, etc. For example:
Variable          Number of persons
GENDER
Female            19
Male              22
MARITAL STATUS
Single            32
Married           4
Divorced          4
Other Types of Frequency Distribution
A cumulative frequency distribution is often presented together with a percentage frequency distribution. The
percentage distribution reflects the percentage of samples whose scores fall in a specific group, while the
cumulative columns give the running totals of the frequencies and percentages.
This type of distribution is quite useful for comparison of data with the findings of other studies having different
sample sizes. In this type of distribution, frequencies and percentages, together with their cumulative totals, are
summed up in a single table. For example:
Score   Frequency   Percentage   Cumulative frequency   Cumulative percentage
1       4           8            4                      8
2       14          28           18                     36
4       6           12           24                     48
5       8           16           32                     64
7       8           16           40                     80
8       6           12           46                     92
9       4           8            50                     100
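As a small illustration, the Python sketch below (using a made-up list of scores, not the data in the table above) builds such a frequency, percentage, and cumulative table with collections.Counter:

```python
from collections import Counter

# Hypothetical raw scores; any list of observations works the same way.
scores = [1, 2, 2, 4, 5, 5, 7, 8, 9, 2, 5, 7]

counts = Counter(scores)                 # frequency of each distinct score
n = len(scores)

cumulative = 0
print(f"{'Score':>5} {'Freq':>5} {'Pct':>6} {'CumFreq':>8} {'CumPct':>7}")
for score in sorted(counts):
    freq = counts[score]
    pct = 100 * freq / n                 # percentage frequency
    cumulative += freq                   # running (cumulative) frequency
    cum_pct = 100 * cumulative / n       # cumulative percentage
    print(f"{score:>5} {freq:>5} {pct:>6.1f} {cumulative:>8} {cum_pct:>7.1f}")
```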
UNIT II
A measure of central tendency is a statistical calculation used to determine a single value that summarizes the
central location of a set of data. The central location of the data provides a quick summary of the main
characteristics of the data, which can be useful in making predictions or drawing conclusions about the data.
There are three common measures of central tendency: mean, median, and mode.
1. Mean: The mean, or average, is calculated by adding up all the values in the data set and then dividing
by the number of values. The mean provides a good representation of the central location of the data if
the data is evenly distributed. However, if there are extreme values in the data set, the mean can be
significantly impacted, making it a less reliable measure of central tendency.
2. Median: The median is the middle value of a data set when the values are ordered from smallest to
largest. It provides a good representation of the central location of the data if there are extreme values
in the data set, as it is not affected by outliers.
3. Mode: The mode is the value that occurs most frequently in a data set. The mode can provide a good
representation of the central location of the data if the data is not evenly distributed, such as in the case
of categorical data.
In conclusion, the appropriate measure of central tendency to use depends on the type and distribution of the
data, as well as the goals of the analysis.
Central tendency refers to a single value or typical value that summarizes the central location of a set of data.
The idea is to find a single value that best represents the “center” of the data and gives a quick summary of its
main characteristics. There are several measures of central tendency, including mean, median, and mode. Each
measure provides a different perspective on the central location of the data and may be more or less appropriate
depending on the type and distribution of the data, as well as the goals of the analysis.
Mean, median, and mode are three common measures of central tendency. Mean (average) is calculated by
summing up all the values in a data set and dividing by the number of values. Median is the middle value of a
data set when the values are ordered from smallest to largest. Mode is the value that occurs most frequently in a
data set.
In general, mean is a good representation of central tendency if the data is evenly distributed, but it can be
significantly impacted by extreme values (outliers). Median provides a good representation of central tendency
if there are outliers, as it is not affected by extreme values. Mode provides a good representation of central
tendency if the data is not evenly distributed, such as in the case of categorical data.
Central tendency is a useful tool in data analysis, as it provides a simple and quick summary of a large and
complex data set. It can also help identify patterns, make predictions, and draw conclusions about the data.
Requirements of good measures of central tendency
For a measure of central tendency to be considered “good”, it should satisfy the following requirements:
1. Uniqueness: There should be a single value that summarizes the central location of the data, not
multiple values or ranges.
2. Representativeness: The measure should provide an accurate representation of the central location of
the data, capturing its main characteristics.
3. Stability: The measure should not change significantly with small variations in the data set.
4. Insensitivity to extreme values: The measure should not be greatly affected by outliers or extreme
values in the data set.
For example, consider the data set {2, 3, 5, 7, 1000}. The mean is (2 + 3 + 5 + 7 + 1000)/5 = 203.4, the median
is 5, and no value repeats, so there is no mode. In this example, the median is a better measure of central
tendency as it is not greatly affected by the extreme value of 1000. The mean is significantly impacted by this
value and is not representative of the central location of the data. The absence of a mode shows that the mode is
not a useful measure of central tendency for this data set.
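A quick Python sketch, using only the standard statistics module (statistics.multimode requires Python 3.8 or later) and the illustrative data above, makes the contrast concrete:

```python
import statistics

data = [2, 3, 5, 7, 1000]            # one extreme value (outlier)

print(statistics.mean(data))         # 203.4 -> pulled strongly toward the outlier
print(statistics.median(data))       # 5     -> unaffected by the outlier
print(statistics.multimode(data))    # [2, 3, 5, 7, 1000]: every value ties, so no meaningful mode
```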
Arithmetic mean
The arithmetic mean, also known as the average, is calculated by summing up all the values in a data set and
dividing by the number of values. Here’s how to calculate the arithmetic mean with an example:
Consider the data set {2, 4, 6, 7, 9}.
Step 1: Add up all the values: 2 + 4 + 6 + 7 + 9 = 28.
Step 2: Divide the sum by the number of values: 28 / 5 = 5.6.
Step 3: The result, 5.6, is the arithmetic mean of the data set.
So, the average of the data set is 5.6. This value provides a single representation of the central location of the
data and can be used for making predictions or drawing conclusions about the data.
Median
Median is a measure of central tendency that represents the middle value of a set of data when the values are
ordered from smallest to largest. The median provides a good representation of the central location of the data if
there are extreme values or outliers, as it is not affected by these values.
Here’s how to calculate the median with an example. Consider the ordered data set {2, 5, 8, 9, 12, 15}.
Step 1: Arrange the values of the data set in order from smallest to largest.
Step 2: If the number of values in the data set is odd, the median is simply the middle value.
Step 3: If the number of values in the data set is even, as here (n = 6), the median is the average of the two
middle values. In this case, the median is (8 + 9) / 2 = 8.5.
So, the median of this data set is 8.5. This value provides a good representation of the central location of the
data, as it is not affected by the presence of extreme values.
The following steps are helpful while applying the median formula for ungrouped data.
The median formula of a given set of numbers, say having ‘n’ odd number of observations, can be expressed as:
Median = [(n + 1)/2]th term
The median formula of a given set of numbers say having ‘n’ even number of observations, can be expressed as:
Median = [(n/2)th term + ((n/2) + 1)th term]/2
Example: The age of the members of a weekend poker team has been listed below. Find the median of the
above set.
Solution:
Step 2: Count the number of observations. If the number of observations is odd, then we will use the following
formula: Median = [(n + 1)/2]th term
Median = 42
When the data is continuous and in the form of a frequency distribution, the median is calculated through the
following sequence of steps.
Let us use the above steps in the following practical illustration to understand the application of the median
formula.
Illustration: There are 5 top management employees in an organization. The salaries given to the employees are
$5,000, $6,000, $4,000, $8,000, and $7,500. Using the median formula, calculate the median salary.
Solution: We will follow the given steps to find the median salary.
▪ Step 1: Sorting the given data in increasing order, $4,000, $5,000, $6,000, $7,500, and $8,000.
▪ Step 2: Total number of observations = 5
▪ Step 3: The given number of observations is odd.
▪ Step 4: Using median formula for odd observation, Median = [(n + 1)/2] th term
▪ Median = [(5 + 1)/2]th term = 6/2 = 3rd term. The third term is $6,000, so the median salary is $6,000.
How to Find Median?
We use a median formula to find the median value of given data. For a set of ungrouped data, we can follow
the below-given steps to find the median value.
Example: The heights (in centimeters) of the members of a school football team are listed below. Find the median height.
Solution:
Step 1:
Ordered Set: {130, 132, 135, 140, 142, 150, 158, 160}
Step 2:
Number of observations, n = 8
Step 3:
Median = [(8/2)th term + ((8/2) + 1)th term]/2
= (140 + 142)/2
= 141
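A short Python sketch using the statistics module reproduces this result, and also handles the odd-n case from the salary example above:

```python
import statistics

heights = [130, 132, 135, 140, 142, 150, 158, 160]   # even number of observations

# statistics.median sorts the data and applies the odd/even rules automatically.
print(statistics.median(heights))                     # (140 + 142) / 2 = 141.0

# With an odd number of observations the middle value itself is returned.
print(statistics.median([4000, 5000, 6000, 7500, 8000]))   # 6000 (the salary example)
```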
Mode
The mode is the value that occurs most frequently in a data set. It provides a good representation of central
tendency for data sets that are not evenly distributed, such as categorical data.
For example, consider the data set {2, 4, 4, 6, 8, 8, 9}.
Step 1: Count how often each value occurs.
Step 2: Find the value(s) with the highest frequency. In this case, both 4 and 8 have a frequency of 2.
Step 3: Both 4 and 8 are the modes of the data set, as they both occur with the same highest frequency.
So, the modes of this data set are 4 and 8. These values provide a good representation of central tendency for
this data set, as they capture the most common values in the data. Note that it is possible for a data set to have
more than one mode, or no mode at all.
As another example, in the data set {1, 4, 4, 4, 6, 7, 9} the mode is 4, as it appears three times, which is more
than any other value.
Formula
Statisticians use the mode formula in statistics to find the most frequent value in a group of data or distribution.
They take the most repeated data value as the mode of the distribution. It is one of the three important measures
of central tendency, besides the mean and median.
For grouped data, the formula is:
Mode = L + [(f1 - f0) / (2f1 - f0 - f2)] × h
where L = the lower limit of the modal class (the class interval with the highest frequency), f1 = the frequency of
the modal class, f0 = the frequency of the class preceding it, f2 = the frequency of the class following it, and
h = the class size.
For ungrouped data, one first arranges the data in ascending or descending order of their values. After the
arranging, one marks the data values which are repeated more often. Amongst all the frequent data values, the
one having the highest frequency of occurrence in the data set is the modal value, or the most common value, for
the set.
To find the mode of grouped data, one has to identify the class interval with the highest frequency, known as the
modal class. After doing so, one calculates the class size by subtracting the lower limit from the upper limit.
Finally, one substitutes these values into the mode formula above, as shown in the example below.
Calculation Example
Here is a mode calculation example to understand the concept and its usage. Suppose the class intervals and
frequencies are:

Class interval   0-10   10-20   20-30   30-40   40-50
Frequency        8      5       10      4       7

Modal class = 20-30, as it is the interval with the highest frequency (f1 = 10, f0 = 5, f2 = 4, h = 10).
The lower limit of the above modal class is L = 20. Therefore,
Mode = 20 + [(10 - 5) / (2 × 10 - 5 - 4)] × 10
     = 20 + (10 × 5)/11
     = (220 + 50)/11
     = 270/11 ≈ 24.55
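The grouped-mode formula can be sketched in Python as below; the class intervals are the ones assumed above, and the function name grouped_mode is purely illustrative:

```python
def grouped_mode(lower_limits, frequencies, class_size):
    """Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h for grouped data."""
    i = frequencies.index(max(frequencies))              # position of the modal class
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0               # frequency of preceding class
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0
    L = lower_limits[i]
    return L + (f1 - f0) / (2 * f1 - f0 - f2) * class_size

# Class intervals 0-10, 10-20, 20-30, 30-40, 40-50 (assumed, width 10)
print(grouped_mode([0, 10, 20, 30, 40], [8, 5, 10, 4, 7], 10))   # 24.545...
```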
Example
As discussed below, the best way to understand the basics of this concept is through a mode example.
Let us assume an inventory manager has to know which stock is mostly purchased by the customers and must be
replenished accordingly. Therefore, the inventory manager prepares a list of the items in the warehouse with the
respective product type and purchase code as below:
For the most common value calculation, one should categorize the above data in the frequency of buying by the
customers as below:
TYPE             FREQUENCY
SUNGLASSES       3
KIDS GARMENTS    4
LAPTOPS          3
MOBILE           5
As a result of the table above, one finds that mobile is bought more frequently than other items in the warehouse
inventory. Therefore, mobile is the mode of the data set of our example. Thus, the mode in statistics helps
with inventory management.
FORMULA
ARITHMETIC MEAN: Mean = (sum of all observations) / (number of observations) = Σx / N
For a frequency distribution: Mean = Σfx / Σf
MEDIAN (ungrouped): Median = size of the [(N + 1)/2]th term
MEDIAN (grouped): Median = L + [(N/2 - c.f.) / f] × i
where L = lower limit of the median class, c.f. = cumulative frequency of the class preceding it, f = frequency of
the median class, and i = class width.
MODE (grouped): Mode = L + [(f1 - f0) / (2f1 - f0 - f2)] × i
where L = lower limit of the modal class, f1 = frequency of the modal class, f0 and f2 = frequencies of the classes
preceding and following it, and i = class width.
The Harmonic Mean and the Geometric Mean are two types of average measures used in statistics.
For ungrouped data, the Harmonic Mean is calculated as the reciprocal of the arithmetic mean of the reciprocals
of the individual values, and is used when finding the average rate, such as speed. The Geometric Mean, on the
other hand, is calculated as the nth root of the product of n values, and is used when finding the average growth
rate.
For grouped data, the Harmonic Mean is calculated by dividing the total frequency N by the sum of the ratios
f/x (each class frequency divided by the mid-value of that class), and the Geometric Mean is calculated as the
Nth root of the product of the mid-values each raised to the power of its frequency.
In general, the Harmonic Mean is a better measure of central tendency for data sets with extremely large or
small values, while the Geometric Mean is a better measure for data sets with values close to each other
Harmonic Mean is defined as the reciprocal of the arithmetic mean of the reciprocals of the observations.
Let x1, x2, …, xn be the n observations; then the harmonic mean is defined as
H.M. = n / (1/x1 + 1/x2 + … + 1/xn)
Example 5.11
A man travels from Jaipur to Agra by a car and takes 4 hours to cover the whole distance. In the first hour he
travels at a speed of 50 km/hr, in the second hour his speed is 64 km/hr, in third hour his speed is 80 km/hr and
in the fourth hour he travels at the speed of 55 km/hr. Find the average speed of the motorist.
Solution: Average speed (harmonic mean) = 4 / (1/50 + 1/64 + 1/80 + 1/55) = 4 / 0.0663 ≈ 60.3 km/hr.
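For illustration, Python's statistics.harmonic_mean (available from Python 3.6) gives the same result:

```python
import statistics

speeds = [50, 64, 80, 55]                       # km/hr in each of the four hours

hm = statistics.harmonic_mean(speeds)           # 4 / (1/50 + 1/64 + 1/80 + 1/55)
print(round(hm, 2))                             # ~60.33 km/hr

# Equivalent explicit computation:
print(len(speeds) / sum(1 / x for x in speeds))
```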
For a frequency distribution,
H.M. = N / Σ(fi / xi)
where xi is the mid-point of the class interval, fi is the corresponding frequency, and N = Σfi.
Geometric Mean
A geometric mean is a mean or average which shows the central tendency of a set of numbers by using the
product of their values. For a set of n observations, the geometric mean is the nth root of their product. The
geometric mean G.M. for a set of numbers x1, x2, …, xn is given as
G.M. = (x1 × x2 × … × xn)^(1/n)
The geometric mean of two numbers, say x and y, is the square root of their product, √(x × y). For three numbers,
it is the cube root of their product, i.e., (x × y × z)^(1/3).
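A brief Python sketch (statistics.geometric_mean requires Python 3.8 or later) illustrates both forms:

```python
import math
import statistics

print(statistics.geometric_mean([4, 9]))        # sqrt(4 * 9) = 6.0
print(statistics.geometric_mean([2, 4, 8]))     # cube root of 64 ≈ 4.0

# Explicit nth-root-of-product form:
values = [2, 4, 8]
print(math.prod(values) ** (1 / len(values)))   # ≈ 4.0
```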
UNIT III
Measure of dispersion
Dispersion is a statistical concept that refers to the extent to which data points in a set are spread out from each
other. There are several measures of dispersion, including:
1. Range: It’s the difference between the largest and the smallest value in a dataset.
2. Interquartile Range (IQR): It’s the difference between the third quartile and the first quartile, which
represents the range of the middle 50% of the data.
3. Variance: It’s a measure of the spread of a set of data around its mean. Variance is the average of the
squared differences between each data point and the mean.
4. Standard Deviation: It’s the square root of the variance and provides a measure of how far each data
point is from the mean.
5. Mean Absolute Deviation (MAD): It’s the average of the absolute differences between each data point
and the mean.
6. Coefficient of Variation (CV): It’s the ratio of the standard deviation to the mean, expressed as a
percentage, and provides a measure of relative dispersion.
These measures of dispersion help us to understand how much the data is spread out and how the data points are
distributed around the central value
Concept of Dispersion
Dispersion, also known as variability or scatter, is a statistical concept that measures how spread out the values
in a set of data are. It provides information about the distribution of the data, such as how much the data points
vary from the center of the distribution and how much they vary from each other.
In other words, dispersion reflects the degree of variation or spread in the data. A set of data with high
dispersion means that the data points are widely spread out, while a set of data with low dispersion means that
the data points are clustered closely together.
There are several measures of dispersion, including range, variance, standard deviation, interquartile range,
mean absolute deviation, and coefficient of variation, which are used to describe the spread of a set of data.
These measures help us to understand the shape of the distribution and the degree of variation in the data, and
provide important information for making decisions and predictions
Dispersion measures can be classified into two categories: absolute and relative measures.
1. Absolute Measures of Dispersion: These measures describe the spread of the data in absolute terms,
such as the difference between the largest and smallest values, or the average difference between each
data point and the mean.
▪ Range: The difference between the largest and smallest values in a set of data. For example, if the
largest value is 8 and the smallest value is 2, then the range is 8 – 2 = 6.
▪ Mean Absolute Deviation (MAD): The average of the absolute differences between each data point and
the mean. For example, if the data set is [1, 2, 3, 4, 5] and the mean is 3, the MAD would be (|1-3| + |2-
3| + |3-3| + |4-3| + |5-3|)/5 = (2 + 1 + 0 + 1 + 2)/5 = 1.2
▪ Variance: The average of the squared differences between each data point and the mean. For example,
if the data set is [1, 2, 3, 4, 5] and the mean is 3, the variance would be ((1-3)^2 + (2-3)^2 + (3-3)^2 +
(4-3)^2 + (5-3)^2)/5 = (4 + 1 + 0 + 1 + 4)/5 = 2.
▪ Standard Deviation: The square root of the variance. For example, if the variance is 2, the standard
deviation would be √2 ≈ 1.41. Variance and standard deviation are expressed in the (squared) units of
the data, so they are also absolute measures.
2. Relative Measures of Dispersion: These measures describe the spread of the data relative to the mean
or some other central value, expressed as a ratio or percentage rather than in the units of the data.
Example:
▪ Coefficient of Variation (CV): The ratio of the standard deviation to the mean, expressed as a
percentage. For example, if the mean is 100 and the standard deviation is 10, the CV would be
(10/100)*100 = 10%.
Relative measures of dispersion are particularly useful when comparing data sets with different units or scales,
as they provide a normalized measure of spread that is independent of the size of the data
Here are examples to illustrate the concepts of range, variance, standard deviation, and coefficient of variation:
1. Range: The range is the difference between the largest and smallest values in a set of data. For
example, consider the following set of numbers: [1, 2, 3, 4, 5]. The largest value is 5 and the smallest
value is 1, so the range is 5 – 1 = 4.
2. Variance: Variance is a measure of the spread of a set of data around its mean. It is the average of the
squared differences between each data point and the mean. For example, consider the data set [1, 2, 3,
4, 5] with a mean of 3. The variance would be calculated as ((1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 +
(5-3)^2)/5 = (4 + 1 + 0 + 1 + 4)/5 = 2.
3. Standard Deviation: Standard deviation is the square root of the variance and provides a measure of
how far each data point is from the mean. For the data set [1, 2, 3, 4, 5] with a variance of 2, the
standard deviation would be √2 ≈ 1.41.
4. Coefficient of Variation (CV): CV is the ratio of the standard deviation to the mean, expressed as a
percentage. It provides a measure of relative dispersion, allowing for comparisons between data sets
with different units or scales. For example, consider a data set with a mean of 100 and a standard
deviation of 10. The CV would be (10/100)*100 = 10%
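A compact Python sketch, using the population formulas (dividing by n) as in the examples above, computes these measures for [1, 2, 3, 4, 5]:

```python
import math

data = [1, 2, 3, 4, 5]
n = len(data)
mean = sum(data) / n

data_range = max(data) - min(data)                        # 4
variance = sum((x - mean) ** 2 for x in data) / n         # 2.0 (population variance)
std_dev = math.sqrt(variance)                             # ~1.414
mad = sum(abs(x - mean) for x in data) / n                # 1.2
cv = std_dev / mean * 100                                 # ~47.1 %

print(data_range, variance, round(std_dev, 3), mad, round(cv, 1))
```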
Standard deviation
Standard deviation measures the variation or dispersion that exists from the mean. A low standard
deviation indicates that the data points tend to be very close to the mean, whereas high standard
deviation indicates that the data points are spread over a larger range of values.
QUARTILE DEVIATION
For ungrouped observations:
Q1 = size of the [(N + 1)/4]th term
Q2 = size of the [2(N + 1)/4]th term
Q3 = size of the [3(N + 1)/4]th term
For a grouped frequency distribution:
Q1 = L1 + [(N/4 - C.F.) / F] × i
Q2 = L1 + [(2N/4 - C.F.) / F] × i
Q3 = L1 + [(3N/4 - C.F.) / F] × i
where L1 = lower limit of the quartile class, C.F. = cumulative frequency of the class preceding it, F = frequency
of the quartile class, and i = class width.
Quartile Deviation = (Q3 - Q1) / 2
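A rough Python sketch of the ungrouped quartile positions using the (N + 1)/4 rule above; the data list and the linear interpolation for fractional positions are illustrative assumptions:

```python
def quartile(sorted_data, k):
    """k-th quartile (k = 1, 2, 3) using the k*(N + 1)/4 position rule."""
    pos = k * (len(sorted_data) + 1) / 4          # 1-based position
    lower = int(pos)                              # whole part of the position
    frac = pos - lower                            # fractional part, used for interpolation
    if lower >= len(sorted_data):
        return sorted_data[-1]
    if frac == 0:
        return sorted_data[lower - 1]
    return sorted_data[lower - 1] + frac * (sorted_data[lower] - sorted_data[lower - 1])

data = sorted([12, 7, 3, 9, 15, 21, 5, 18, 10, 6, 14])    # N = 11 (made-up values)
q1, q3 = quartile(data, 1), quartile(data, 3)
print(q1, q3, (q3 - q1) / 2)                               # Q1 = 6, Q3 = 15, quartile deviation = 4.5
```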
UNIT IV
Statistical Quality Control (SQC) is a method used to monitor and control the quality of a product or service by
using statistical techniques and methods. It is a systematic approach to ensuring that the products manufactured
meet the specified quality standards and requirements. SQC helps organizations to identify and control the
sources of variability in the production process, which can lead to defects or nonconformance.
SQC involves collecting data, analyzing the data using statistical tools, and making decisions based on the
results of the analysis. The data collected can come from a variety of sources, including production processes,
inspection results, and customer feedback. The statistical techniques used in SQC can range from simple
statistical measures, such as mean and standard deviation, to more complex statistical models, such as control
charts and Design of Experiments (DOE).
The goal of SQC is to continuously improve the quality of products or services by reducing variability and
improving processes. This can lead to increased customer satisfaction, lower costs due to reduced waste and
rework, and improved competitiveness in the market.
Overall, SQC is a powerful tool for organizations that want to improve their quality and competitiveness by
using data and statistical analysis to make informed decisions about their processes and products
Control limits
Control limits, in the context of statistical quality control, are lines or boundaries that are plotted on a control
chart to distinguish between normal and abnormal behavior of a process. The control limits are calculated from
the data collected from the process and are used to determine if the process is in statistical control or not.
There are typically two types of control limits used in SQC: upper control limit (UCL) and lower control limit
(LCL). The UCL is the upper boundary or limit beyond which any data point is considered to be an out-of-
control point, indicating that the process has deviated from its normal behaviour. The LCL is the lower
boundary or limit below which any data point is considered to be an out-of-control point, indicating that the
process has deviated from its normal behaviour.
Control limits are important in SQC because they help to determine if a process is operating consistently and
within the expected limits. If a data point falls outside of the control limits, it can be an indicator of a problem
with the process and can trigger an investigation to identify and correct the root cause.
In summary, control limits provide a statistical framework for detecting and correcting variations in a process,
and are essential for continuous improvement and maintaining quality control in an organization
specification limits
Specification limits, in the context of statistical quality control, are the predetermined bounds that define the
acceptable range for a product characteristic or process output. The specification limits define the criteria for
conformance or non-conformance of a product to the established standards or customer requirements.
For example, in the manufacturing of a component, the specification limits may specify the acceptable range of
dimensions, weight, strength, or other characteristics of the finished product. If a product falls outside of the
specification limits, it is considered to be non-conforming and may be rejected or reworked.
Specification limits are established based on customer requirements, industry standards, and the manufacturer’s
own goals for quality and performance. They serve as the target for the process and are used as a basis for
setting control limits in statistical quality control.
In summary, specification limits are an essential component of quality control in an organization, as they
provide a clear definition of the acceptable quality criteria for products or services, and serve as a benchmark for
continuous improvement and process control
Tolerance limits
Tolerance limits, in the context of statistical quality control, are the acceptable bounds for deviation from the
target or specification limits for a product characteristic or process output. Tolerance limits define the range
within which a product or process can vary while still meeting the customer requirements and quality standards.
For example, in the manufacturing of a component, the tolerance limits may specify the acceptable range of
dimensions, weight, strength, or other characteristics of the finished product, which may vary slightly from the
target or specification limits. If a product falls within the tolerance limits, it is considered to be conforming,
even if it is not exactly the same as the target or specification limits.
Tolerance limits are established based on the customer requirements, industry standards, and the manufacturer’s
own goals for quality and performance. They serve as a flexible range for the process and help to account for
normal variations in the production process.
In summary, tolerance limits are an important component of quality control in an organization, as they provide a
level of flexibility for the production process and help to ensure that products or services meet the customer
requirements and quality standards, even if they are not exactly the same as the target or specification limits
Process and product control are two key concepts in statistical quality control (SQC) that are used to monitor
and improve the quality of a product or service.
Process control refers to the techniques and methods used to monitor and control the production process to
ensure that it is operating consistently and within the expected limits. The goal of process control is to detect
and correct variations in the process, so that the process remains in statistical control and produces products or
services that meet the established quality criteria.
Product control, on the other hand, refers to the techniques and methods used to monitor and control the quality
of the finished product or service. The goal of product control is to detect and correct defects or nonconformities
in the product, so that the product meets the established quality standards and customer requirements.
In SQC, both process control and product control are important for ensuring that the final product or service
meets the required quality criteria. Process control helps to maintain consistency in the production process,
while product control helps to detect and correct defects or nonconformities in the finished product
What is it?
An X-bar and R (range) chart is a pair of control charts used with processes that have a subgroup size of two or
more. The standard chart for variables data, X-bar and R charts help determine if a process is stable and
predictable. The X-bar chart shows how the mean or average changes over time and the R chart shows how the
range of the subgroups changes over time. It is also used to monitor the effects of process improvement theories.
As the standard, the X-bar and R chart will work in place of the X-bar and s or the median and R chart.
The X-bar chart, on top, shows the mean or average of each subgroup; it is used to analyze central location. The
range chart, on the bottom, shows how the data are spread; it is used to study system variability.
When is it used?
You can use X-bar and R charts for any process with a subgroup size greater than one. Typically, it is used when
the subgroup size falls between two and ten, and X-bar and s charts are used with subgroups of eleven or more.
Use X-bar and R charts when you can answer yes to these questions:
Getting the most
Collect as many subgroups as possible before calculating control limits. With smaller amounts of data, the
X-bar and R chart may not represent variability of the entire system. The more subgroups you use in control
limit calculations, the more reliable the analysis. Typically, twenty to twenty-five subgroups will be used in
control limit calculations.
X-bar and R charts have several applications. When you begin improving a system, use them to assess the
system’s stability .
After the stability has been assessed, determine if you need to stratify the data. You may find entirely
different results between shifts, among workers, among different machines, among lots of materials, etc. To see
if variability on the X-bar and R chart is caused by these factors, collect and enter data in a way that lets you
stratify by time, location, symptom, operator, and lots.
You can also use X-bar and R charts to analyze the results of process improvements. Here you would
consider how the process is running and compare it to how it ran in the past. Do process changes produce the
desired improvement?
Finally, use X-bar and R charts for standardization. This means you should continue collecting and analyzing
data throughout the process operation. If you made changes to the system and stopped collecting data, you
would have only perception and opinion to tell you whether the changes actually improved the system. Without
a control chart, there is no way to know if the process has changed or to identify sources of process variability.
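As a rough illustration only, the sketch below computes X-bar and R chart limits from a few made-up subgroups using the conventional table constants for subgroups of size 5 (A2 ≈ 0.577, D3 = 0, D4 ≈ 2.114); a real chart would use 20 to 25 subgroups, as noted above:

```python
# Control chart constants for subgroup size n = 5 (standard tables).
A2, D3, D4 = 0.577, 0.0, 2.114

# Hypothetical subgroups of 5 measurements each.
subgroups = [
    [10.1, 9.8, 10.0, 10.2, 9.9],
    [10.0, 10.3, 9.7, 10.1, 10.0],
    [9.9, 10.0, 10.2, 9.8, 10.1],
]

xbars = [sum(s) / len(s) for s in subgroups]          # subgroup means
ranges = [max(s) - min(s) for s in subgroups]         # subgroup ranges

xbar_bar = sum(xbars) / len(xbars)                    # grand average (centre line, X-bar chart)
r_bar = sum(ranges) / len(ranges)                     # average range (centre line, R chart)

ucl_x, lcl_x = xbar_bar + A2 * r_bar, xbar_bar - A2 * r_bar
ucl_r, lcl_r = D4 * r_bar, D3 * r_bar

print(f"X-bar chart: CL={xbar_bar:.3f}  UCL={ucl_x:.3f}  LCL={lcl_x:.3f}")
print(f"R chart:     CL={r_bar:.3f}  UCL={ucl_r:.3f}  LCL={lcl_r:.3f}")
```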
NP Control Chart
An np control chart is used to look at variation in yes/no type attributes data. There are only two possible
outcomes: either the item is defective or it is not defective. The np control chart is used to determine if the
number of defective items in a group of items is consistent over time. The subgroup size (the number of items in
the group) must be the same for each sample.
A product or service is defective if it fails, in some respect, to conform to specifications or a standard. For
example, customers like invoices to be correct. If you charge them too much, you will definitely hear about it
and it will take longer to get paid. If you charge them too little, you may never hear about it. As an organization,
it is important that your invoices be correct. Suppose you have decided that an invoice is defective if it has the
wrong item or wrong price on it. You could then take a random sample of invoices (e.g., 100 per week) and
check each invoice to see if it is defective. You could then use an np control chart to monitor the process.
You use an np control chart when you have yes/no type data. This type of chart involves counts. You are
counting items. To use an np control chart, the counts must also satisfy the following two conditions:
1. You are counting n items. A count is the number of items in those n items that fail to conform to
specification.
2. Suppose p is the probability that an item will fail to conform to the specification. The value of p must
be the same for each of the n items in a single sample.
If these two conditions are met, the binomial distribution can be used to estimate the distribution of the counts
and the np control chart can be used. The control limits equations for the np control chart are based on the
assumption that you have a binomial distribution. Be careful here because condition 2 does not always hold. For
example, some people use the p control chart to monitor on-time delivery on a monthly basis. A p control chart
is the same as the np control chart, but the subgroup size does not have to be constant. You can’t use the p
control chart unless the probability of each shipment during the month being on time is the same for all the
shipments. Big customers often get priority on their orders, so the probability of their orders being on time is
different from that of other customers and you can’t use the p control chart. If the conditions are not met,
consider using an individuals control chart.
The red bead experiment is a well-known example of yes/no data that can be tracked
using an np control chart. In this experiment, each worker is given a sampling device that can sample 50 beads
from a bowl containing white and red beads. The objective is to get all white beads. In this case, a bead is “in-
spec” if it is white. It is “out of spec” if it is red. So, we have yes/no data – only two possible outcomes. In
addition, the subgroup size is the same each time, so we can use an np control chart.
Data from one red bead experiment are shown below. The numbers represent the number of red beads each
person received in each of four samples of 50 beads.

Worker   Sample 1   Sample 2   Sample 3   Sample 4
Tom      12         8          6          9
David    5          8          6          13
Paul     12         9          8          9
Sally    9          12         10         6
Fred     10         10         11         10
Sue      10         16         9          11
The np control chart plots the number of defects (red beads) in each subgroup (sample number) of 50. The
center line is the average. The upper dotted line is the upper control limit. The lower dotted line is the lower control
limit. As long as all the points are inside the control limits and there are no patterns to the points, the process is
in statistical control. We know what it will produce in the future. While we don’t know the exact number of red
beads a person will draw the next time, we know it will be between about 2 and 17 (the control limits) and
average about 10.
The steps in constructing the np chart are given below. The data from above is used to demonstrate the
calculations.
1. Gather the data.
a. Select the subgroup size (n). Attributes data often require large subgroup sizes (50 – 200). The subgroup size
should be large enough to have several defective items. The subgroup size must be constant.
In the red bead example, the subgroup size is 50.
b. Select the frequency with which the data will be collected. Data should be collected in the order in which it is
generated.
c. Select the number of subgroups (k) to be collected before control limits are calculated. You can start a control
chart with as few as five to six points but you should recalculate the average and control limits until you have
about 20 subgroups.
d. Inspect each item in the subgroup and record the item as either defective or non-defective. If an item has
several defects, it is still counted as one defective item.
e. Determine np for each subgroup.
np = number of defective items found
f. Record the data.
2. Plot the data
3. Calculate the process average and control limits.
a. Calculate the process average number defective:
average number defective = (np1 + np2 + … + npk) / k
where np1, np2, etc. are the number of defective items in subgroups 1, 2, etc. and k is the number of subgroups.
In the red bead example, each of the six workers had 4 samples. So, k = 24. The total number of red beads
(summing all the data in the table above) is 229. Thus, the average number of defective items (red beads) in
each sample is 9.54.
b. Draw the process average number defective on the control chart as a solid line and label.
c. Calculate the control limits for the np chart. The upper control limit is given by UCLnp and the lower control
limit by LCLnp:
UCLnp = np̄ + 3√[np̄(1 - p̄)]
LCLnp = np̄ - 3√[np̄(1 - p̄)]
where np̄ is the average number of defective items per subgroup and p̄ = np̄ / n.
The control limits for the red bead data are calculated by substituting the value of 9.54 for the average number
defective and the value of 50 for the subgroup size in the equations above. This gives an upper control limit of
17.87 and a lower control limit of 1.20.
d. Draw the control limits on the control chart as dashed lines and label.
4. Interpret the chart for statistical control.
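A short Python sketch reproduces the red bead calculations (centre line and np chart control limits) from the data above:

```python
import math

# Number of red beads per sample of 50, for each of the six workers (4 samples each).
counts = [12, 8, 6, 9,   5, 8, 6, 13,   12, 9, 8, 9,
          9, 12, 10, 6,  10, 10, 11, 10,  10, 16, 9, 11]
n = 50                                     # subgroup size
k = len(counts)                            # 24 subgroups

npbar = sum(counts) / k                    # average number defective = 229/24 ≈ 9.54
pbar = npbar / n                           # average fraction defective

sigma = math.sqrt(npbar * (1 - pbar))
ucl = npbar + 3 * sigma                    # ≈ 17.87
lcl = max(0.0, npbar - 3 * sigma)          # ≈ 1.21 (never below zero)

print(round(npbar, 2), round(ucl, 2), round(lcl, 2))
```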
c-Chart
What is it?
A c-chart is an attributes control chart used with data collected in subgroups that are the same size. C-charts
show how the process, measured by the number of nonconformities per item or group of items, changes over
time. Nonconformities are defects or occurrences found in the sampled subgroup. They can be described as any
characteristic that is present but should not be, or any characteristic that is not present but should be. For
example a scratch, dent, bubble, blemish, missing button, and a tear would all be nonconformities. C-charts are
used to determine if the process is stable and predictable, as well as to monitor the effects of process
improvement theories. C-charts can be created using software products like SQCpack.
UNIT V
Probability
A sample space is a collection of all possible outcomes of a random experiment. It is the set of all possible
results of a random process, or a set of possible values of a random variable. The sample space provides a
framework for understanding probability and statistical analysis, as it represents all the possible outcomes of an
event. The elements of a sample space are known as sample points or outcomes. For example, if you roll a die,
the sample space would be the set {1, 2, 3, 4, 5, 6}.
In probability theory, an event is a set of outcomes of a random experiment. It is a collection of one or more
possible outcomes from a sample space. An event can be either simple, consisting of a single outcome, or it can
be complex, consisting of multiple outcomes. The probability of an event is a measure of the likelihood that the
event will occur, expressed as a number between 0 and 1, where 0 represents that the event is impossible and 1
represents that the event is certain to occur.
The probability of an event A is given by P(A) = n(A) / n(S), i.e., the number of outcomes favourable to A divided
by the total number of possible outcomes, where A is the event of interest and the numerator and denominator are
counted from the sample space.
There are two types of events: mutually exclusive and non-mutually exclusive events. Mutually exclusive events
are events that cannot occur at the same time, and their sample spaces do not overlap. For example, the event
“rolling a 4 on a die” and the event “rolling a 5 on a die” are mutually exclusive events because they cannot
occur simultaneously. On the other hand, non-mutually exclusive events are events that can occur
simultaneously. For example, the event “rolling an even number on a die” and the event “rolling a number
greater than 4 on a die” are non-mutually exclusive events because they can occur at the same time.
Example 1: Find the probability of getting a head in tossing a coin.
Solution: When a coin is tossed, we have the sample space {Head, Tail}. Therefore, the total number of
possible outcomes is 2.
The favourable number of outcomes is 1, that is, the head.
The required probability is ½.
Example 2: Find the probability of getting two tails in two tosses of a coin.
Solution: When two coins are tossed, we have the sample space {HH, HT, TH, TT}, where H represents
the outcome Head and T represents the outcome Tail.
The total number of possible outcomes is 4.
The favourable number of outcomes is 1, that is, TT.
The required probability is ¼.
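A tiny Python sketch enumerates the sample space for two coin tosses and counts the favourable outcomes:

```python
from itertools import product
from fractions import Fraction

sample_space = list(product("HT", repeat=2))         # [('H','H'), ('H','T'), ('T','H'), ('T','T')]
favourable = [o for o in sample_space if o == ("T", "T")]

print(Fraction(len(favourable), len(sample_space)))  # 1/4
```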
Some of the important probability events are:
If the probability of occurrence of an event is 0, such an event is called an impossible event and if the
probability of occurrence of an event is 1, it is called a sure event. In other words, the empty set ϕ is an
impossible event and the sample space S is a sure event.
Simple Events
Any event consisting of a single point of the sample space is known as a simple event in probability. For
example, if S = {56, 78, 96, 54, 89} and E = {78}, then E is a simple event.
Compound Events
Contrary to the simple event, if any event consists of more than one single point of the sample space then such
an event is called a compound event. Considering the same example again, if S = {56, 78, 96, 54, 89}, E1 = {56,
54}, E2 = {78, 56, 89}, then E1 and E2 represent two compound events.
If the occurrence of any event is completely unaffected by the occurrence of any other event, such events are
known as an independent event in probability and the events which are affected by other events are known
as dependent events.
If the occurrence of one event excludes the occurrence of another event, such events are mutually exclusive
events, i.e., the two events do not have any common point. For example, if S = {1, 2, 3, 4, 5, 6} and E1, E2 are two
events such that E1 consists of numbers less than 3 and E2 consists of numbers greater than 4, then E1 = {1, 2}
and E2 = {5, 6}, and E1 ∩ E2 = ϕ, so E1 and E2 are mutually exclusive.
Exhaustive Events
A set of events is called exhaustive if all the events together consume the entire sample space.
Complementary Events
For any event E1 there exists another event E1‘ which represents the remaining elements of the sample space S:
E1‘ = S − E1
If a die is rolled, the sample space S is given as S = {1, 2, 3, 4, 5, 6}. If event E1 represents all the
outcomes which are greater than 4, then E1 = {5, 6} and E1‘ = {1, 2, 3, 4}.
Similarly, the complement of E1, E2, E3……….En will be represented as E1‘, E2‘, E3‘……….En‘
If two events E1 and E2 are associated with OR then it means that either E1 or E2 or both. The union
symbol (∪) is used to represent OR in probability.
If we have exhaustive events E1, E2, E3, …, En associated with sample space S, then
E1 ∪ E2 ∪ E3 ∪ … ∪ En = S
If two events E1 and E2 are associated with AND then it means the intersection of elements which is common to
both the events. The intersection symbol (∩) is used to represent AND in probability.
Types of Events In Probability
It represents the difference between the two events. The event “E1 but not E2” consists of all the outcomes which
are present in E1 but not in E2. Thus, the event E1 but not E2 is represented as
E1 but not E2 = E1 − E2 = E1 ∩ E2‘
Conditional Probability
The probability of occurrence of any event A when another event B in relation to A has already occurred is
known as conditional probability. It is depicted by P(A|B).
Formula
When the intersection of two events happen, then the formula for conditional probability for the occurrence of
two events is given by;
P(A|B) = N(A∩B)/N(B)
Or
P(B|A) = N(A∩B)/N(A)
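A minimal Python sketch of this counting formula, using an assumed example (one die roll, A = "even number", B = "greater than 3"):

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}                      # sample space of one die roll
A = {x for x in S if x % 2 == 0}            # even numbers: {2, 4, 6}
B = {x for x in S if x > 3}                 # greater than 3: {4, 5, 6}

# P(A|B) = N(A ∩ B) / N(B)
p_a_given_b = Fraction(len(A & B), len(B))
print(p_a_given_b)                          # 2/3
```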
There are many formulas involved in permutation and combination concepts. The two key formulas are:
Permutation Formula
A permutation is the choice of r things from a set of n things without replacement and where the order matters.
nPr = n! / (n − r)!
Combination Formula
A combination is the choice of r things from a set of n things without replacement and where order does not
matter. The formula is:
nCr = n! / [r! (n − r)!]
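Python's math module provides both counts directly (math.perm and math.comb, available from Python 3.8); a small sketch:

```python
import math

n, r = 5, 2

print(math.perm(n, r))                      # 5!/(5-2)! = 20 ordered selections
print(math.comb(n, r))                      # 5!/(2! * 3!) = 10 unordered selections

# The same results from the factorial definitions:
print(math.factorial(n) // math.factorial(n - r))
print(math.factorial(n) // (math.factorial(r) * math.factorial(n - r)))
```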
Questions for Practice
UNIT-I
VERY SHORT TYPE
SHORT TYPE
LONG TYPE
UNIT-II
SHORT TYPE
LONG TYPE
UNIT-III
VERY SHORT TYPE
SHORT TYPE
LONG TYPE
UNIT-IV
VERY SHORT TYPE
SHORT TYPE
LONG TYPE
UNIT-V
VERY SHORT TYPE
SHORT TYPE
LONG TYPE