Data Analytics For Lean Six Sigma
This document is a summary of notes from the coursera.org certification course "Data
Analytics for 6 Sigma" presented by Inez Zwetsloot. All credit goes to her and her fantastic
lectures and presentations, which she was kind enough to share.
Lean Six Sigma is an evidence-based, data-driven approach to process improvement. Lean is
about improving processes, reducing waste, and making things easier and faster, mostly by
using soft tools. Six Sigma is about statistics and reducing variation.
DMAIC
DMAIC stands for the five phases of a Lean Six Sigma project:
Define - define the problem, the project goal and the scope
Measure - translate the problem into measurable CTQs and collect data on the current performance
Analyse - analyse the current situation and make a diagnosis (determine the nature of the problem)
Improve - design and implement improvement actions
Control - control the improved process performance and close the project
MEASURE
CTQ (Critical to Quality): a property of a process, product or service that is relevant to the
project objective. CTQs are also known as the Y-variable, dependent variable or outcome variable.
Example: the company head complained that sales had too few customers, while the sales
representatives felt pressured because they had too much administrative work. Possible CTQs:
No. of customers
Revenue increase per sales representative
Conversion of an offer into a deal
SAMPLING
The sample has to be representative and be based on a mechanism that is independent of the
question.
Example CTQ: First time fix rate (calls handled without involvement of other departments),
measured per person per day, as the percentage of calls that are handled right the first time
(without passing the call to another department).
Case of the speed camera: which cars do we choose? We have measured 5,000 cars; let's draw a
sample of 100 cars. This is best done automatically by software, because humans are too subjective.
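As a minimal sketch of how such a random sample could be drawn in software (pandas is assumed; the file and column layout below are hypothetical, with the 5,000 measured cars stored in a CSV file):

import pandas as pd

# Hypothetical file: 5000 measured cars, one row per car
cars = pd.read_csv("speed_measurements.csv")

# Draw a simple random sample of 100 cars; random_state makes the draw reproducible
sample = cars.sample(n=100, random_state=1)
print(sample.describe())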
Datasets
Units are the objects and Variables are the properties of these objects that we have measured.
UNITS should be placed in ROWS.
VARIABLES in the COLUMNS.
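For example, a small made-up data set laid out this way could look as follows, with each row one unit (a call) and each column one variable (a measured property); the column names are only illustrative:

import pandas as pd

# Each row is a unit (one call), each column is a variable
data = pd.DataFrame({
    "call_id": [1, 2, 3],
    "handling_time_sec": [180, 250, 95],
    "department": ["Sales", "Support", "Sales"],
    "first_time_fix": [True, False, True],
})
print(data)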
Discrete - variables that can only take integer values, such as counts (number of people,
number of failed pieces). Variables can also be discrete because of rounding.
Continuous - values not limited to integers, such as distance or temperature.
Ordinal - there is a well-defined order among the categories, e.g. good, acceptable, critical, reject.
Nominal - unordered categories, merely labels, e.g. XO, MM, SW, LS.
Binary - only two options are available (yes/no, good/bad).
CTQ (Y Variable)
We need to understand what type of data we have in front of us. There are two reasons for it:
- To know which type gives us more information
- To know which statistical technique to use
DESCRIPTIVE STATISTICS
They can be used only with numerical data, because they are based on numerical operations.
A low value of dispersion means that the values are close to the mean; a high value means that
most values are far away from the mean and spread out over a large interval. We measure
dispersion with the standard deviation.
To quantify dispersion we look at the distances to the mean, x − μ. The problem is that some of
these distances are negative, so their average is always (close to) zero. We therefore square the
distances before averaging them; this average of squared distances is the variance,
σ² = Σ(x − μ)² / N, and the standard deviation σ is its square root.
MEDIAN - if you order all your data from minimum to maximum (small to large), it is the middle
value that separates the smallest 50% of your data from the largest 50% of your data. Its
advantage is that it is not affected by outliers; its disadvantage is that it is just one number and
does not use all the data, so it is less precise than the mean.
Q1 - 1st QUARTILE separates the lowest 25% of the data from the highest 75%.
Q3 - 3rd QUARTILE separates the lowest 75% of the data from the highest 25%.
RANGE - distance between MINIMUM and MAXIMUM (outliers can strongly affect the range).
Therefore it is more common to use the INTERQUARTILE RANGE, IQR = Q3 - Q1.
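A small sketch of how these descriptive statistics could be computed with numpy (the throughput-time values below are made up, not the course data):

import numpy as np

# Made-up throughput times in days
x = np.array([0.5, 1.2, 2.0, 3.5, 4.1, 5.0, 7.8, 12.3, 20.0, 59.8])

mean = x.mean()                      # location: average
median = np.median(x)                # location: robust against outliers
std = x.std(ddof=1)                  # spread: sample standard deviation
q1, q3 = np.percentile(x, [25, 75])  # 1st and 3rd quartiles
iqr = q3 - q1                        # interquartile range
rng = x.max() - x.min()              # range (sensitive to outliers)

print(mean, median, std, q1, q3, iqr, rng)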
Histogram - the Y-axis depicts the frequency of occurrence (in the example, the frequency of
processing times).
A bimodal distribution occurs when two distinct data distributions are present in the same data set.
BOXPLOT
It shows the median, the 1st and 3rd quartiles (the box), the whiskers, and individual outliers.
Pie Chart
It answers questions such as: which category occurs most often? Who occurred most often?
Bar Chart
X-axis: the reclaim category. Y-axis: the count (how often has this reclaim occurred?).
Pareto analysis helps to concentrate on cases that contribute most to the problem
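A minimal sketch of a Pareto analysis on made-up reclaim counts, sorting the categories by frequency and computing the cumulative percentage (category names and counts are illustrative only):

import pandas as pd

# Made-up reclaim counts per category
counts = pd.Series({"Delivery": 40, "Invoice": 25, "Damage": 10, "Other": 5})

pareto = counts.sort_values(ascending=False).to_frame("count")
pareto["cum_pct"] = pareto["count"].cumsum() / pareto["count"].sum() * 100
print(pareto)  # shows which few categories cause most of the reclaims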
Statistics (Minitab descriptive statistics for the CTQ Total time):
Variable: Total time, N: 52, N*: 0, Mean: 9.422, SE Mean: 0.555, StDev: 12.66, Minimum: 0.000,
Q1: 1.04, Median: 4.070, Q3: 12.30, Maximum: 59.800
Average of TT = 9.422
StDev of TT = 12.66
Approximately 80% of reclaims will be addressed within 15 days (see Q3 = 12.30 days: 75% of
reclaims are already handled within about 12 days).
Is the distribution symmetrical? NO. It is SKEWED to the RIGHT.
Which type of reclaims occurs most often?
NOMINAL VARIABLES
A nominal variable is a type of variable that is used to name, label or categorize particular
attributes that are being measured. It takes qualitative values representing different categories,
and there is no intrinsic ordering of these categories. Some examples of nominal variables
include gender, name, phone number, etc.
ORDINAL VARIABLES
An ordinal variable is a type of measurement variable that takes values with an order or rank. It is
the 2nd level of measurement and an extension of the nominal variable.
! Work volume is the number of jobs completed per day. This is a number (a count), hence it is a
numerical variable.
POPULATION VS SAMPLE
● In order to have a representative subset of the population we need to select the sample
randomly (e.g. 100 people from 40k customers).
● We get a sample of 100 people (x1, x2,...,x99, x100).
● We analyze sample data.
● We make claims about the population.
● To make such claims we use the sample-statistics to estimate the
population-parameters.
Both parameters are unknown, because we have not measured the whole population. So how do
we learn what μ and σ are?
Estimation of:
Location - μ - can be estimated with the sample mean (average) x̄ or the median. The average is
more precise, because it takes all data into account, but the median is more robust, as it is
immune to outliers.
Spread - σ - dispersion - the most commonly used estimate is the standard deviation, but we can
also use the interquartile range (Q3 - Q1) or the range (max - min).
However, the estimates will never be exactly equal to the population parameter, because they are
based on a sample. To quantify the uncertainty around an estimate we use confidence intervals. A
confidence interval gives you boundaries within which the population parameter will lie with a
certain level of confidence. The most commonly used confidence level is 95%.
In the example, the mean estimated from a sample of 40 observations lies somewhere between
0.779 and 0.884 with 95% confidence. With a larger sample the interval would be narrower.
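A sketch of how such a 95% confidence interval for the mean could be computed (a t-interval with scipy; the sample below is randomly generated and is not the course data):

import numpy as np
from scipy import stats

# Made-up sample of 40 measurements
x = np.random.default_rng(1).normal(loc=0.83, scale=0.17, size=40)

mean = x.mean()
sem = stats.sem(x)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, len(x) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")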
NORMAL DISTRIBUTION (Bell-shaped Distribution)
LOGNORMAL DISTRIBUTION
The lognormal distribution is skewed to the right. It is often used for throughput times or
processing times.
PROBABILITY PLOT
We use the Probability Plot tool in Minitab and try different types of distribution (e.g. lognormal
and Weibull).
The lognormal distribution fits the data best: the data points lie close to a straight line and fall
between the two outer lines.
In the second step we can compare the AD value (Anderson-Darling value). It measures the
distance between the data and the theoretical distribution (normal, lognormal or Weibull). The
lower the value, the better the fit.
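Outside Minitab, a comparable check can be sketched with scipy: a lognormal fit can be assessed by taking the log of the data and testing it for normality with the Anderson-Darling test. The data below are randomly generated for illustration; a lower AD statistic means a better fit.

import numpy as np
from scipy import stats

# Made-up positive, right-skewed data (e.g. processing times)
x = np.random.default_rng(2).lognormal(mean=1.0, sigma=0.5, size=100)

# Anderson-Darling statistic for a normal fit on the raw data...
ad_normal = stats.anderson(x, dist="norm").statistic
# ...and for a lognormal fit (normality of the log-transformed data)
ad_lognormal = stats.anderson(np.log(x), dist="norm").statistic

print(f"AD normal: {ad_normal:.3f}, AD lognormal: {ad_lognormal:.3f}")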
Watch out for abnormalities! E.g. an S-shaped pattern on the probability plot indicates bimodality in the data.
EMPIRICAL CDF
The empirical CDF is used in the Analyze step of the DMAIC cycle and determines the
percentage of cases that meet an SLA (Service Level Agreement) or specifications.
Project goal is to improve handling time per call. CTQ is Total handling time (THT).
Question: How often are the calls handled within 240 seconds?
When we have a data set, we could simply count the percentage of calls that lasted less than or
equal to 240 seconds. However, percentages based on counting are only reliable when based on
a large data set (N > 300); otherwise it is better to work with a fitted distribution (see the
probability plot above).
Let's summarize. The probability plot was helpful to decide what distribution the data represents.
The empirical CDF helps to calculate percentages. In Minitab we can use crosshairs or
percentile lines to find percentages.
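A sketch of both approaches on made-up call data: counting directly (the empirical CDF evaluated at 240 s) and using a fitted lognormal CDF. The handling times below are randomly generated, and the course itself does this in Minitab rather than Python.

import numpy as np
from scipy import stats

# Made-up total handling times in seconds
tht = np.random.default_rng(3).lognormal(mean=5.2, sigma=0.4, size=500)

# Empirical CDF at 240 s: the fraction of calls handled within 240 seconds
pct_empirical = (tht <= 240).mean() * 100

# Alternative: fit a lognormal distribution and evaluate its CDF at 240 s
shape, loc, scale = stats.lognorm.fit(tht, floc=0)
pct_fitted = stats.lognorm.cdf(240, shape, loc=loc, scale=scale) * 100

print(f"Empirical: {pct_empirical:.1f}%  Fitted lognormal: {pct_fitted:.1f}%")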
One of the principles of the Lean Six Sigma method is data-based testing of ideas and
improvements, also called evidence-based improvement actions.
Data-based testing
E.g. caffeine percentage (Y or CTQ) - we want to determine whether there are any factors that
influence its level in the beans. We could consider the number of extractions (X), the duration of
an extraction (X) or the amount of solvent (X).
We need to determine which variable is the influence factor (X) and which is the CTQ (Y
variable). Then we can model the relationship using data.
The choice of modeling method depends on the type of data: whether it is categorical
(qualitative) or numerical (quantitative).
Hypothesis Testing
So far we have worked with estimations. We used parameters to describe the entire population.
The estimate is merely descriptive. Therefore, we often switch from estimation to testing. Testing
means that we claim something. And we use data to test this claim. This claim is called a
hypothesis.
Hypothesis testing
- is a procedure to arrive at a decision,
- is as objective as possible,
- deals with uncertainty in a rational manner,
- quantifies the risk of a false decision.
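As a minimal sketch of such a test (a two-sample t-test on made-up data; the course itself performs these tests in Minitab):

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
before = rng.normal(loc=300, scale=40, size=30)  # made-up handling times before an improvement
after = rng.normal(loc=280, scale=40, size=30)   # made-up handling times after

# H0: the two means are equal; a small p-value is evidence against H0
t_stat, p_value = stats.ttest_ind(before, after)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")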
Correlation does not imply causation.
ANOVA method
ANOVA groups the data according to the categorical influence factor and tests whether these
groups have equal means or not.
Can we generalize our conclusions to the entire population of batches? To answer this question
we will perform a statistical analysis to generalize our findings from the sample.
In ANOVA we compare the means of different groups. It tests whether these means are
significantly different. Significantly different implies that the categorical influence factor has a
significant effect on the numerical Y-variable. We perform this analysis in 3 steps:
- Organizing the data
- ANOVA
- Residuals
The following spreadsheet formula stacks the wide table (one column per machine in A7:D7, with
the values in A8:D17) into two long columns, which is the layout needed for the ANOVA:
=QUERY(ArrayFormula(SPLIT(TRANSPOSE(SPLIT(textjoin("-",TRUE,TRANSPOSE((A7:D7&","&A8:D17))),"-")),",")),)
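The same reshaping and test can be sketched in Python with pandas and scipy (the machine names and output values below are made up for illustration):

import pandas as pd
from scipy import stats

# Made-up output per machine, one column per machine (wide format)
wide = pd.DataFrame({
    "Machine A": [9.8, 10.1, 9.9, 10.0],
    "Machine B": [10.4, 10.6, 10.5, 10.3],
    "Machine C": [9.7, 9.9, 9.8, 10.0],
})

# Reshape to long format (one row per measurement), like the spreadsheet formula above
long = wide.melt(var_name="machine", value_name="output")

# One-way ANOVA: are the group means equal?
groups = [g["output"].values for _, g in long.groupby("machine")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")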
RESIDUAL ANALYSIS
Residual - difference between the measurement and estimated mean. The residuals are
calculated by subtracting the expected value from each observation. In the case of ANOVA, this
expected value is the mean output over the relevant machine.
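A short self-contained sketch of this residual calculation (made-up measurements; the residual of each observation is the observation minus the mean of its own machine):

import pandas as pd
from scipy import stats

# Made-up measurements per machine (long format)
df = pd.DataFrame({
    "machine": ["A", "A", "A", "B", "B", "B"],
    "output":  [9.8, 10.1, 9.9, 10.4, 10.6, 10.5],
})

# Residual = observation minus the mean of its own machine
df["residual"] = df["output"] - df.groupby("machine")["output"].transform("mean")
print(df)

# ANOVA assumes roughly normally distributed residuals; a quick check:
print(stats.shapiro(df["residual"]))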
Kruskal-Wallis Test
When to use it? Note that the Kruskal-Wallis test has less power than ANOVA: we need more
data than with ANOVA to demonstrate the same difference.
Example: Total Handling Time (THT) in the call centre, comparing employees who attended a
training with those who did not. Question: has the training influenced THT?
Y - Total Handling Time (numerical)
X - Trained or not? (categorical)
The Kruskal-Wallis procedure is a good alternative to ANOVA if the ANOVA assumptions are not
met. To interpret the output, you have to look at:
The median for each group
The p-value to determine if the results are significant
The p-value indicates whether the difference between the groups is coincidental or truly
significant.
This test is very useful for small samples.
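A sketch of the Kruskal-Wallis test on made-up handling times for trained vs untrained employees (scipy is assumed; the data are randomly generated):

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
tht_trained = rng.lognormal(mean=5.3, sigma=0.4, size=25)    # made-up THT, trained employees
tht_untrained = rng.lognormal(mean=5.5, sigma=0.4, size=25)  # made-up THT, untrained employees

# H0: the groups come from the same distribution (equal medians)
h_stat, p_value = stats.kruskal(tht_trained, tht_untrained)
print(f"H = {h_stat:.2f}, p = {p_value:.3f}")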
Test for equality of variances
Too much variation may cause defects, a lot of scrap and losses.
Summary:
The test for equal variances compares variances across groups. It can be used before an ANOVA
to determine whether the "assume equal variances" option can be used (which makes the p-value
more precise).
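A sketch of such a comparison using Levene's test, one common test for equal variances (scipy is assumed; the two groups below are randomly generated with deliberately different spreads):

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
group_a = rng.normal(loc=10, scale=1.0, size=30)  # made-up measurements, machine A
group_b = rng.normal(loc=10, scale=2.5, size=30)  # made-up measurements, machine B

# H0: the variances are equal; a small p-value suggests unequal variances
stat, p_value = stats.levene(group_a, group_b)
print(f"W = {stat:.2f}, p = {p_value:.3f}")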