
DATA ANALYTICS FOR LEAN SIX SIGMA

This document is a summary of notes from the coursera.org certification course "Data
Analytics for 6 Sigma" presented by Inez Zwetsloot. All credit goes to her and her
fantastic lectures and presentations, which she was kind enough to share.

Lean Six Sigma is a data-driven, evidence-based approach to process improvement. Lean is
about improving processes, reducing waste, and making things easier and faster, mostly by
using soft tools. Six Sigma is about statistics and reducing variation.

A process is a series of steps taken to make a product or provide a service. An organization is a
collection of processes. Organizations continually improve their processes in order to withstand
the competition, meet higher customer expectations and implement technological
innovations.

DMAIC

Define - select a project and establish objectives

Measure - make the problem quantifiable and measure its performance

Analyse - analyse the current situation and make a diagnosis (determine the nature of the
problem)

Improve - develop and implement improvement actions

Control - control the improved process performance and close the project

MEASURE

CTQ (Critical to Quality) - a property of a process, product or service that is relevant to the
project objective. CTQs are also known as the Y variable, dependent variable or outcome variable.

How to find CTQs:


- What's the metric I want to change/improve?
- Will the project be successful if I do that?

CTQs vs project objectives example:

The head of the company complained that sales had too few customers. The sales
representatives felt pressured and complained they had too much administrative work.

Project objective: Revenue increase
- CTQs: no. of customers; conversion of an offer to a deal
- Measurement unit: per sales representative

Project objective: Lower work pressure
- CTQs: time spent with new customers; time spent with existing customers; time spent on administrative tasks
- Measurement unit: days per sales representative

SAMPLING

The sample has to be representative and be based on a mechanism that is independent of the
question.

UNITS - (people, objects, phenomena) - the "things" you can collect data on


MEASUREMENT UNIT - (what a measurement means) - properties of the things you collect
data on

CTQ | Unit | Measurement unit
Speed | Car | km/h
Delay time | Arrival | minutes
First time fix rate (calls handled without involvement of other departments) | per person per day | % of calls that are handled right the first time (without passing to another department)

Case of a speed camera: which cars should we choose? Suppose 5,000 cars have been measured and
we want a sample of 100 cars. The selection should be done automatically by software, because
human selection is too subjective.

Datasets

Units are the objects and Variables are the properties of these objects that we have measured.
UNITS should be placed in ROWS.
VARIABLES in the COLUMNS.

UNIT - reclaims opened for bank transfers

Variables (columns): Reclaim No. | Total time | Type of reclaim | No. of interactions with other banks

CATEGORIES OF DATA

Numerical data (quantitative) - based on real-valued or whole numbers

Discrete: variables that can only take integer values, such as counts (number of people,
number of failed pieces). Variables can also be discrete because of rounding.

Continuous: values not limited to integers, such as distance or temperature.

Categorical Data (qualitative data)

Ordinal (order): there is a well-defined order among the categories, e.g. good, acceptable, critical, reject.

Nominal (nomen = name): unordered categories, merely labels, e.g. XO, MM, SW, LS.

Binary: only two options are available (yes/no, good/bad).

Numerical data / numerical CTQ - you need around 30 observations.

Categorical data / categorical CTQ - you need MORE, e.g. 300 observations.

Visualizing your data

CTQ (Y Variable)

Numerical - Histogram, Boxplot


Categorical - Pie chart, Bar chart, Pareto Chart

We need to understand which type of data we have in front of us, for two reasons:
- to know which type gives us more information
- to know which statistical technique to use

DESCRIPTIVE STATISTICS

Descriptive statistics can be used only with numerical data, because they are based on numerical operations.

N - number of values/measurements
N* - number of missing values
Mean = Sum / N (the average)
StDev - dispersion of the data

A low value of dispersion means that values are close to the mean; a high value means that
most values are far from the mean and spread out over a large interval. We measure
dispersion with the standard deviation.

We measure each observation's distance to the mean: (x - μ). If we simply averaged these
distances, Σ(x - μ)/N, the result would be close to ZERO, because positive and negative
distances cancel out. Therefore:

1. We square the distances to get positive numbers.
2. The average squared distance is called the VARIANCE: σ² = Σ(x - μ)²/N.
3. Taking the square root gives the STANDARD DEVIATION: σ = √(Σ(x - μ)²/N).

StDev = 0.016 means that, on average, the observations lie about 0.016 (in the units of the
data, here percent caffeine) away from the mean.
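As a minimal sketch of this calculation (in Python, with made-up measurements rather than the course data):

# Hypothetical caffeine percentages (illustrative values only)
data = [0.08, 0.10, 0.09, 0.12, 0.11, 0.10]

n = len(data)
mean = sum(data) / n                               # location: the average
variance = sum((x - mean) ** 2 for x in data) / n  # average squared distance to the mean
stdev = variance ** 0.5                            # square root returns to the original units

print(mean, variance, stdev)

Note that this divides by N, as in the text; statistical software usually reports the sample standard deviation, which divides by N - 1.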

MEDIAN - if you order all your data from minimum to maximum (small to large), it is the middle
value that separates the smallest 50% of your data from the largest 50% of your data. Its
advantage is that it is not affected by outliers; its disadvantage is that it is just one number and
does not use all the data, so it is less precise than the mean.

Q1 - 1st QUARTILE - separates the lowest 25% of the data from the highest 75%
Q3 - 3rd QUARTILE - separates the lowest 75% of the data from the highest 25%

RANGE - distance between MINIMUM and MAXIMUM (outliers can strongly affect the range).
Therefore it is more common to use the INTERQUARTILE RANGE, IQR = Q3 - Q1.

The MEAN and MEDIAN indicate LOCATION.

The STANDARD DEVIATION and VARIANCE indicate DISPERSION, as do the RANGE and IQR.
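A small sketch of these measures in Python (the data are made up; note that different software, e.g. Minitab, may interpolate quartiles slightly differently):

import numpy as np

data = [1, 2, 2, 3, 4, 5, 7, 9, 12, 60]   # hypothetical data with one outlier (60)

median = np.median(data)                   # robust location measure
q1, q3 = np.percentile(data, [25, 75])     # quartiles
iqr = q3 - q1                              # robust dispersion measure
full_range = max(data) - min(data)         # strongly affected by the outlier

print(median, q1, q3, iqr, full_range)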

VISUALISING NUMERICAL DATA

Histogram - the Y axis depicts the frequency of occurrence (in the example, the frequency of
each processing time). Around 130 data points have values close to 0.

A histogram shows how your data is distributed. A bimodal distribution occurs when two
distinct data distributions are present.
BOXPLOT

It shows:
- the median of the data
- the bulk of the data: the box ranges from Q1 to Q3
- the whiskers: indicate the tails of the distribution
- possible outliers

The HISTOGRAM gives information about the DISTRIBUTION of the data.
The BOXPLOT gives information about the SPREAD of the data.
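Both plots are easy to sketch in Python with matplotlib (the processing times below are simulated purely for illustration):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
times = rng.lognormal(mean=1.0, sigma=0.8, size=500)  # hypothetical processing times

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(times, bins=30)   # distribution of the data
ax1.set_title("Histogram")
ax2.boxplot(times)         # median, Q1-Q3 box, whiskers, outliers
ax2.set_title("Boxplot")
plt.show()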

VISUALISING CATEGORICAL DATA

These graphs show FREQUENCIES.

Pie Chart
Which type of reclaim occurs most often? Who occurred most often?

In the example pie chart of the total number of reclaims, "Matching" accounts for AROUND 25%
of the total number of reclaims.

A pie chart tells how often each category occurs, but it is not well suited to more detailed analysis.

Bar Chart
X axis - reclaim category
Y axis - count: how often has this reclaim occurred?

A bar chart makes it easier to see the differences in frequency between categories, and to tell
which category occurs most often.

Tally Table - shows the exact numbers.

Tally
Person    Count  Percent
Henk       17    13.93
Jan jr.    17    13.93
Jan sr.    14    11.48
Karel      15    12.30
Kees       15    12.30
Marcel     22    18.03
Margriet   22    18.03
N = 122
PARETO ANALYSIS - the 80/20 rule of the vital few and the trivial many

Pareto analysis helps you concentrate on the cases that contribute most to the problem. A
Pareto chart can help single out the vital few issues from the trivial many.

You can make a chart of the frequencies to single out the problem that occurs most often. You
can also make a chart of another variable, such as duration, to find out which problems take up
most of the time.
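As a minimal sketch, the Pareto ordering and cumulative percentages can be computed as follows (the reclaim categories and counts are made up; a full Pareto chart would add the cumulative line on a secondary axis):

from collections import Counter

# Hypothetical reclaim types, one entry per reclaim
reclaims = ["Matching"] * 30 + ["Routing"] * 22 + ["Duplicate"] * 9 + ["Other"] * 5

counts = Counter(reclaims).most_common()   # categories sorted by frequency
total = sum(count for _, count in counts)

cumulative = 0
for category, count in counts:
    cumulative += count
    print(f"{category:10s} {count:3d}  {100 * cumulative / total:5.1f}% cumulative")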
VISUALISING TWO VARIABLES

SCATTERPLOT - Y is numerical and X is numerical

In the example, the data are clustered together and more or less form a straight line: the
caffeine content decreases as the extraction time increases.
BOXPLOT - Y is numerical, but X is categorical

The middle lines are the MEDIANS. The WHISKERS show the minimum and the maximum of the
caffeine % for each extractor.

Such a chart is only valuable if it answers the question of interest, here whether the coffee is
caffeine free (< 0.1% caffeine content). In the example chart we can see that the 3rd extractor
produces coffee that is too strong.

BOXPLOT transposed - Y is categorical and X is numerical

Example: are the students going to pass, based on their math grades prior to the exam?
Exercise 1

Statistics
Variable    N   N*  Mean   SE Mean  StDev  Minimum  Q1    Median  Q3     Maximum
Total time  52  0   9.422  0.555    12.66  0.000    1.04  4.070   12.30  59.800

Average of total time = 9.422
StDev of total time = 12.66
Approximately 75% of reclaims are handled within 12.30 days (Q3 = 12.30), so roughly 80% are handled within 15 days.
Is the distribution symmetrical? NO. It is SKEWED to the RIGHT.
Which type of reclaim occurs most often?
NOMINAL VARIABLES
A nominal variable is a type of variable that is used to name, label or categorize the
attributes being measured. It takes qualitative values representing different categories,
and there is no intrinsic ordering of these categories. Some examples of nominal variables
include gender, name, phone number, etc.

Examples of Nominal Variables

● Personal biodata: the variables included in personal biodata are nominal, e.g. name, date of birth, gender:
Full name _____
Gender _____
Email address _____
● Customer feedback: organizations use this to get feedback about their product or service from customers, e.g.:
How long have you been using our product? (Less than 6 months / 6 months / 7 months+)
What do you think about our mobile app? _____

ORDINAL VARIABLES
An ordinal variable is a type of measurement variable that takes values with an order or rank. It is
the 2nd level of measurement and is an extension of the nominal variable.

Examples of Ordinal Variable


● Likert Scale: A Likert scale is a psychometric scale used by researchers to prepare
questionnaires and get people's opinions.

How satisfied are you with our service?


Very satisfied
Satisfied
Indifferent
Dissatisfied
Very dissatisfied

● Interval Scale: each response in an interval scale is an interval on its own.


How old are you?
13-19 years
20-30 years
31-50 years

Note: work volume is the number of jobs completed per day. This is a number (a count), hence
it is a numerical variable.

POPULATION VS SAMPLE

Population - the collection of units of interest

Sample - a subset of the population

● In order to have a representative subset of the population, we need to select the sample
randomly (e.g. 100 people from 40k customers; see the sketch below).
● We get a sample of 100 people (x1, x2, ..., x99, x100).
● We analyze the sample data.
● We make claims about the population.
● To make such claims we use the sample statistics to estimate the
population parameters.
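A minimal sketch of drawing such a random sample in Python (the customer IDs are hypothetical):

import random

population = [f"customer_{i}" for i in range(40_000)]  # hypothetical population of 40k customers

random.seed(42)                          # fixed seed so the example is reproducible
sample = random.sample(population, 100)  # simple random sample, without replacement

print(sample[:5])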

ESTIMATION AND CONFIDENCE INTERVALS

Estimates approximate unknown population parameters.

Histogram - distribution of the values in the sample.


The red line is the estimated ("fitted") population distribution. In this estimation we are
interested in two parameters:
- μ - population location
- σ - population spread

Both parameters are unknown, because we have not measured the entire population. So how do
we learn what μ and σ are?

Estimation of:
- Location (μ) - can be done with the mean (average, x̄) or the median. The average is more precise,
because it takes all data into account, but the median is more robust, as it is immune to outliers.
- Spread (σ, dispersion) - the most commonly used estimate is the standard deviation, but we can
also use the interquartile range (Q3 - Q1) or the range (max - min).

How precise are these estimates?

An estimate will never be exactly equal to the population parameter, because it is based on a
sample. We use confidence intervals to quantify the uncertainty around an estimate. A confidence
interval gives boundaries within which the population parameter will lie with a certain level of
confidence. The most commonly used confidence level is 95%.

In the example, based on a sample of 40 observations, the population mean lies somewhere
between 0.779 and 0.884. With a larger sample the interval would be narrower.
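A sketch of such a 95% confidence interval for the mean, using scipy (the measurements are illustrative, not the course data):

import numpy as np
from scipy import stats

data = np.array([0.81, 0.84, 0.79, 0.88, 0.83, 0.85, 0.80, 0.86])  # hypothetical sample

mean = data.mean()
sem = stats.sem(data)  # standard error of the mean
# t-interval, because the population standard deviation is unknown
lo, hi = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)

print(f"mean = {mean:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")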
NORMAL DISTRIBUTION (Bell-shaped Distribution)

WEIBULL DISTRIBUTION (Skewed Distribution)

In the example it is skewed to the right. The Weibull distribution is mostly used for throughput times or processing times.

LOGNORMAL DISTRIBUTION
PROBABILITY PLOT

Use a probability plot to decide which distribution fits your data best.

THT - total handling time of a customer call in the call center

We use the probability plot tool in Minitab and try out different types of distribution: normal,
lognormal and Weibull.

In the example, the lognormal distribution fits the data best: the points lie close to a straight
line and fall between the two outer lines.
In a second step we can compare the AD (Anderson-Darling) value. It measures the
distance between the data and the theoretical distribution (normal, lognormal or Weibull). The lower
the value, the better the fit.
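Outside Minitab, a similar comparison can be sketched with scipy (illustrative data; scipy's anderson test supports the normal distribution directly, so the lognormal fit is assessed by testing the log of the data for normality):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
tht = rng.lognormal(mean=4.0, sigma=0.5, size=200)  # hypothetical call handling times (seconds)

# Anderson-Darling statistic: the lower, the better the fit
ad_normal = stats.anderson(tht, dist='norm').statistic
ad_lognormal = stats.anderson(np.log(tht), dist='norm').statistic  # normality of log data = lognormality

print(f"AD normal: {ad_normal:.2f}, AD lognormal: {ad_lognormal:.2f}")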

Watch out for abnormalities! For example, an S-shape in the probability plot indicates
bimodality in the data, and visible steps indicate that the values have been rounded.

EMPIRICAL CDF

The empirical CDF is used in the Analyze phase of DMAIC to determine the percentage of
cases that meet an SLA (Service Level Agreement) or specifications.

Example of a call center

The project goal is to improve the handling time per call. The CTQ is the total handling time (THT).
Question: how often are calls handled within 240 seconds?
Given a data set, we could simply count the percentage of calls that lasted 240 seconds or less.
However, percentages based on counting are only reliable when based on a large data set (N > 300).

The blue curve represents the data from the sample. The red curve represents the probability
distribution that serves as a model for the population. If the empirical CDF of THT is fitted with
a NORMAL distribution while the data are not normal, the model has to be changed. With the
lognormal distribution the fit is correct, and the empirical CDF takes the skewness into account.

To summarize: the probability plot helps decide which distribution the data follow, and the
empirical CDF helps calculate percentages. In Minitab we can use crosshairs or percentile lines
to read off percentages.
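The same calculation can be sketched in Python (the handling times are simulated; the lognormal parameters are estimated from the sample, and the fitted CDF then gives the percentage of calls within 240 seconds):

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
tht = rng.lognormal(mean=5.0, sigma=0.6, size=120)  # hypothetical handling times (seconds)

# Counting estimate - only reliable for large data sets
empirical = np.mean(tht <= 240)

# Model-based estimate: fit a lognormal distribution and evaluate its CDF at 240 s
shape, loc, scale = stats.lognorm.fit(tht, floc=0)
model_based = stats.lognorm.cdf(240, shape, loc=loc, scale=scale)

print(f"counting: {empirical:.1%}, lognormal model: {model_based:.1%}")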


TESTING

One of the principles of the Lean Six Sigma method is data-based testing of ideas and
improvements, also called evidence-based improvement actions.

Data-based testing

Is your CTQ related to any influence factor?

E.g. caffeine percentage (Y, or CTQ) - we want to determine whether there are any factors that
influence its level in the beans. We could consider the number of extractions (X), the duration of
an extraction (X) or the amount of solvent (X).

We need to determine which variable is the influence factor (X) and which is the CTQ (Y
variable). Then we can model the relationship using data.
The choice of modeling method depends on the type of data: categorical (qualitative) or
numerical (quantitative).

Hypothesis Testing

So far we have worked with estimation: we used sample statistics to estimate parameters that
describe the entire population, and such an estimate is merely descriptive. Therefore, we often
switch from estimation to testing. Testing means that we claim something and use data to test
this claim. The claim is called a hypothesis.

P-value - the probability of obtaining a result at least as extreme as the one observed, assuming
the null hypothesis is true; a small p-value is evidence against the null hypothesis.

Hypothesis testing:
- is a procedure to arrive at a decision,
- is as objective as possible,
- deals with uncertainty in a rational manner,
- quantifies the risk of a false decision.
Correlation does not imply causation.

ANOVA method

ANOVA - Analysis of Variance

Example: coffee factory


Problem: Moisture percentage is not always within specifications
Question: Could the machine influence the moisture content of the coffee?

ANOVA forms groups according to the categorical influence factor and tests whether these
groups have equal means or not.
Can we generalize our conclusions to the entire population of batches? To answer this question
we perform a statistical analysis to generalize the findings from the sample.

In ANOVA we compare the means of different groups and test whether these means are
significantly different. A significant difference implies that the categorical influence factor has a
significant effect on the numerical Y variable. We perform this analysis in 3 steps (a code sketch
follows the Minitab notes below):
- organizing the data
- ANOVA
- residual analysis

Performing ANOVA in Minitab

Stack the data so that all measurements are in one column and the corresponding group labels in another (column C6-T). In Google Sheets the stacking can be done with:

=QUERY(ArrayFormula(SPLIT(TRANSPOSE(SPLIT(textjoin("-",TRUE,TRANSPOSE((A7:D7&","&A8:D17))),"-")),",")),)
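For readers without Minitab, here is a one-way ANOVA sketch in Python with scipy (the machines and moisture percentages are made up for illustration):

from scipy import stats

# Hypothetical moisture percentages per machine
machine_a = [10.2, 10.5, 10.1, 10.4, 10.3]
machine_b = [10.8, 11.0, 10.9, 11.2, 10.7]
machine_c = [10.3, 10.2, 10.6, 10.4, 10.5]

# H0: all machines have the same mean moisture content
f_stat, p_value = stats.f_oneway(machine_a, machine_b, machine_c)

print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) suggests the machine influences the moisture content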
RESIDUAL ANALYSIS

Residual - the difference between a measurement and its estimated mean. The residuals are
calculated by subtracting the expected value from each observation; in the case of ANOVA, this
expected value is the group mean for the relevant machine.
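Continuing the hypothetical example above, the residuals of one group are simply:

import numpy as np

machine_a = np.array([10.2, 10.5, 10.1, 10.4, 10.3])  # hypothetical measurements
residuals_a = machine_a - machine_a.mean()            # observation minus the group mean

print(residuals_a)
# For a valid ANOVA, the pooled residuals should look roughly normal
# (e.g. check them with a probability plot) and have similar spread per group.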
Kruskal-Wallis Test

When to use it? The Kruskal-Wallis procedure is a good alternative to the ANOVA when the
ANOVA assumptions are not met. The trade-off is that we need more data than with ANOVA to
show the same difference.

Example: total handling time (THT) in the call centre, comparing employees who attended a
training with those who did not.
Question: has the training influenced THT?
Y - total handling time (numerical)
X - trained or not? (categorical)

To interpret the output, you have to look at:
- the median of each group
- the p-value, to determine whether the results are significant
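A sketch of the test with scipy (the handling times in seconds are made up):

from scipy import stats

trained = [210, 195, 230, 205, 220, 199, 215]    # hypothetical THT after training
untrained = [250, 240, 265, 255, 245, 270, 260]  # hypothetical THT without training

# H0: both groups come from the same distribution
h_stat, p_value = stats.kruskal(trained, untrained)

print(f"H = {h_stat:.2f}, p = {p_value:.4f}")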

TWO-SAMPLE TEST (t-test, Student's t-test, due to William Gosset)

This test is used to compare the means of two groups.

In the example, the estimated difference between the fertilizers is 2.74 kg. The p-value tells us
whether the difference between the two means is coincidental or truly significant.
This test is very useful for small samples.
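A minimal two-sample t-test in Python (the yields in kg are illustrative, not the course data):

from scipy import stats

fertilizer_a = [20.1, 22.3, 19.8, 21.5, 20.9, 22.0]  # hypothetical yields (kg)
fertilizer_b = [23.0, 24.5, 22.8, 24.1, 23.6, 25.2]

# Welch's t-test, which does not assume equal variances
t_stat, p_value = stats.ttest_ind(fertilizer_a, fertilizer_b, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")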
Test for equality of variances

Law of variability: "Increasing variability always degrades the performance of a production
system" (Hopp & Spearman, 2008)

Too much variation may cause defects, scrap and losses.

Summary:
The test for equal variances compares variances across groups. It can be used before an ANOVA
analysis to determine whether the "assume equal variances" mode can be used (in which case the
p-value will be more precise).
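One common such test is Levene's test, sketched here with scipy (the data are made up; Levene's test is robust to non-normality):

from scipy import stats

line_1 = [5.1, 5.3, 5.0, 5.2, 5.4, 5.1]
line_2 = [4.2, 6.1, 5.0, 6.5, 3.9, 5.8]  # similar mean, much more variation

# H0: the groups have equal variances
stat, p_value = stats.levene(line_1, line_2)

print(f"W = {stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests unequal variances, so the
# "assume equal variances" option should not be used.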
