Interpret The Key Results For Attribute Agreement Analysis
Complete the following steps to interpret an attribute agreement analysis. Key output
includes kappa statistics, Kendall's statistics, and the attribute agreement graphs.
To determine the consistency of each appraiser's ratings, evaluate the Within Appraisers
graph. To determine the correctness of each appraiser's ratings, evaluate the Appraiser vs
Standard graph. In both graphs, compare the percentage matched (blue circle) with the
confidence interval for the percentage matched (red line) for each appraiser.
NOTE
Minitab displays the Within Appraisers graph only when you have multiple trials.
In this example, the Within Appraisers graph indicates that Amanda has the most consistent ratings and
Eric has the least consistent ratings. The Appraiser vs Standard graph indicates that Amanda has the
most correct ratings and Eric has the least correct ratings.
Use kappa statistics to assess the degree of agreement of the nominal or ordinal ratings
made by multiple appraisers when the appraisers evaluate the same samples.
Kappa values range from –1 to +1. The higher the value of kappa, the stronger the
agreement, as follows:
• When kappa = 1, perfect agreement exists.
• When kappa = 0, agreement is the same as would be expected by chance.
• When kappa < 0, agreement is weaker than expected by chance.
The AIAG suggests that a kappa value of at least 0.75 indicates good agreement. However,
larger kappa values, such as 0.90, are preferred.
When you have ordinal ratings, such as defect severity ratings on a scale of 1–5, Kendall's
coefficients, which account for ordering, are usually more appropriate statistics to determine
association than kappa alone.
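For a quick check of these statistics outside Minitab, the following sketch computes Cohen's kappa and Kendall's tau-b for two hypothetical appraisers using scikit-learn and SciPy. This is a simplified illustration rather than Minitab's implementation (Minitab reports Fleiss' kappa when there are multiple appraisers), and the ratings are made up.

# Minimal sketch: agreement between two hypothetical appraisers who rate the
# same 10 samples on an ordinal 1-5 severity scale.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import kendalltau

appraiser_a = [1, 2, 3, 3, 4, 5, 2, 1, 4, 5]   # hypothetical ratings
appraiser_b = [1, 2, 3, 4, 4, 5, 2, 2, 4, 5]

kappa = cohen_kappa_score(appraiser_a, appraiser_b)   # nominal agreement
tau, p_value = kendalltau(appraiser_a, appraiser_b)   # ordinal association

print(f"kappa = {kappa:.3f}, Kendall's tau-b = {tau:.3f} (p = {p_value:.4f})")

A kappa near 1 together with a large, significant tau-b indicates strong agreement that also respects the ordering of the scale; values near 0 indicate agreement no better than chance.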
NOTE
Remember that the Within Appraisers table indicates whether the appraisers' ratings are
consistent, but not whether the ratings agree with the reference values. Consistent ratings
aren't necessarily correct ratings.
Within Appraisers
Assessment Agreement
NOTE
Remember that the Between Appraisers table indicates whether the appraisers' ratings are
consistent, but not whether the ratings agree with the reference values. Consistent ratings
aren't necessarily correct ratings.
Between Appraisers
Assessment Agreement
Kendall's Coefficient of Concordance
Coef       Chi-Sq    DF   P
0.976681   382.859   49   0.0000
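To see how the chi-square statistic and p-value in a concordance table relate to the coefficient itself, the sketch below applies the standard approximation chi-square = m(n - 1)W with n - 1 degrees of freedom. The study sizes (50 samples, each rated 8 times, for example 4 appraisers with 2 trials each) are assumptions chosen only for illustration.

# Sketch of the chi-square approximation for Kendall's coefficient of concordance.
from scipy.stats import chi2

W = 0.976681   # Kendall's coefficient of concordance
n = 50         # number of samples rated
m = 8          # ratings per sample (appraisers x trials), assumed

chi_sq = m * (n - 1) * W      # chi-square statistic
df = n - 1                    # degrees of freedom
p = chi2.sf(chi_sq, df)       # upper-tail p-value

print(f"Chi-Sq = {chi_sq:.3f}, DF = {df}, P = {p:.4f}")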
Assessment Agreement
# Inspected   # Matched   Percent   95% CI
50            37          74.00     (59.66, 85.37)

Kendall's Correlation Coefficient
Coef       SE Coef     Z         P
0.965563   0.0345033   27.9817   0.0000
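The percent matched and its interval can be sanity-checked with an exact (Clopper-Pearson) binomial confidence interval, which is a reasonable stand-in for the interval shown above. The sketch assumes SciPy 1.7 or later.

# Exact binomial 95% CI for 37 matched ratings out of 50 inspected.
from scipy.stats import binomtest

matched, inspected = 37, 50
ci = binomtest(matched, inspected).proportion_ci(confidence_level=0.95,
                                                 method="exact")

print(f"Percent matched = {100 * matched / inspected:.2f}")
print(f"95% CI = ({100 * ci.low:.2f}, {100 * ci.high:.2f})")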
***************
The %Contribution table can be convenient because all sources of variability add up nicely to 100%.
The %Study Variation table doesn't have the advantage of having all sources add up nicely to 100%,
but it has other positive attributes. Because standard deviation is expressed in the same units as the
process data, it can be used to form other metrics, such as Study Variation (6*standard deviation),
%Tolerance (if you enter specification limits for your process), and %Process (if you enter a
historical standard deviation). Of course, there are guidelines for levels of acceptability from AIAG as
well:
If the Total Gage R&R contribution in the %Study Var column (%Tolerance, %Process) is:
• Less than 10%: the measurement system is acceptable.
• Between 10% and 30%: the measurement system may be acceptable, depending on the application, the cost of the measurement device, the cost of repair, and other factors.
• Greater than 30%: the measurement system is unacceptable and should be improved.
If you are looking at the %Contribution column, the corresponding standards are:
• Less than 1%: the measurement system is acceptable.
• Between 1% and 9%: the measurement system may be acceptable, depending on the application, the cost of the measurement device, the cost of repair, and other factors.
• Greater than 9%: the measurement system is unacceptable and should be improved.
We field a lot of questions about %Tolerance as well. %Tolerance is just comparing estimates of
variation (part-to-part, and total gage) to the spread of the tolerance.
When you enter a tolerance, the output from your gage study will be exactly the same as if you
hadn't entered a tolerance, with the exception that your output will now contain a %Tolerance
column. Your results will still be accurate if you don't put in a tolerance range; however, including
the tolerance will provide you more information.
For example, you could have a high percentage in %Study Var for part-to-part, and a high number of
distinct categories. However, when you compare the variation to your tolerance, it might show that
in reference to your spec limits, the variation due to gage is high. The %Tolerance column may be
more important to you than the %Study Var column, since the %Tolerance is more specific to your
product and its spec limits.
Think of it this way: your total variation comprises part-to-part variation and the gage (Reproducibility
and Repeatability). After adding a tolerance, we get to see what percentage of the variation really
dominates within the specified tolerance bounds. If the ratio of the Total Gage R&R to the tolerance is
high (%Tolerance > 30%), the measurement system cannot effectively distinguish good parts from bad,
because too much of the tolerance width is consumed by measurement system variation.
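As an illustration of the arithmetic only, the sketch below computes %Study Var and %Tolerance from the same Total Gage R&R standard deviation. The variance components, specification limits, and the 30% threshold check use hypothetical numbers, not output from a real study.

# %Study Var vs %Tolerance arithmetic with hypothetical values.
import math

var_gage = 0.04          # Total Gage R&R variance component (hypothetical)
var_part = 1.00          # Part-to-Part variance component (hypothetical)
lsl, usl = 2.0, 8.0      # specification limits (hypothetical)

sd_gage = math.sqrt(var_gage)
sd_total = math.sqrt(var_gage + var_part)
tolerance = usl - lsl

pct_study_var = 100 * sd_gage / sd_total           # %Study Var
pct_tolerance = 100 * (6 * sd_gage) / tolerance    # %Tolerance = 6*SD / tolerance

print(f"%Study Var (Total Gage R&R) = {pct_study_var:.1f}%")
print(f"%Tolerance (Total Gage R&R) = {pct_tolerance:.1f}%")
if pct_tolerance > 30:
    print("Gage variation consumes more than 30% of the tolerance")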
I hope the answers to these common questions help you next time you’re doing Gage R&R in
Minitab!
***************
SS
The sum of squares (SS) is the sum of squared distances and is a measure of the variability
that comes from different sources. Total SS indicates the amount of variability in the data from
the overall mean. SS Operator indicates the amount of variability between each operator's
average measurement and the overall mean.
MS
The mean squares (MS) are the variability in the data from different sources. MS accounts for
the fact that different sources have different numbers of levels or possible values.
MS = SS/DF for each source of variability
F
The F-statistic is used to determine whether the effects of Operator, Part, or Operator*Part
are statistically significant.
The larger the F statistic, the more likely it is that the factor contributes significantly to the
variability in the response or measurement variable.
P
The p-value is the probability of obtaining a test statistic (such as the F-statistic) that is at
least as extreme as the value that is calculated from the sample, if the null hypothesis is true.
Interpretation
Use the p-value in the ANOVA table to determine whether the average measurements are
significantly different. Minitab displays an ANOVA table only if you select the ANOVA option
for Method of Analysis.
A low p-value indicates that the assumption of all parts, operators, or interactions sharing
the same mean is probably not true.
To determine whether the average measurements are significantly different, compare the p-
value to your significance level (denoted as α or alpha) to assess the null hypothesis. The
null hypothesis states that the group means are all equal. Usually, a significance level of 0.05
works well. A significance level of 0.05 indicates a 5% risk of concluding that a difference
exists when it does not.
If the p-value is less than or equal to the significance level, you reject the null
hypothesis and conclude that at least one of the means is significantly different from
the others. For example, at least one operator measures differently.
If the p-value is greater than the significance level, you fail to reject the null
hypothesis because you do not have enough evidence to conclude that the means differ.
However, you also cannot conclude that the means are the same. A difference might
exist, but your test might not have enough power to detect it.
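If you want to reproduce this kind of ANOVA table outside Minitab, the sketch below fits a two-way crossed model with statsmodels. The file name and column names are hypothetical, and because Operator and Part are random effects in a gage study, the main-effect F statistics are recomputed against the interaction mean square rather than the error term (the default fixed-effects table tests against error).

# Two-way crossed ANOVA sketch for a gage study (hypothetical data file).
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy import stats

data = pd.read_csv("measurements.csv")   # columns: Part, Operator, Measurement

model = ols("Measurement ~ C(Part) * C(Operator)", data=data).fit()
aov = sm.stats.anova_lm(model, typ=2)    # SS, DF, F, p for each source

# Random-effects F tests: main effects tested against the interaction MS.
ms = aov["sum_sq"] / aov["df"]
df_inter = aov.loc["C(Part):C(Operator)", "df"]
for term in ["C(Part)", "C(Operator)"]:
    f = ms[term] / ms["C(Part):C(Operator)"]
    p = stats.f.sf(f, aov.loc[term, "df"], df_inter)
    print(f"{term}: F = {f:.2f}, p = {p:.4f}")   # compare p to alpha = 0.05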
VarComp
VarComp is the estimated variance components for each source in an ANOVA table.
Interpretation
Use the variance components to assess the variation for each source of measurement
error.
%Contribution (of VarComp)
%Contribution is the percentage of overall variation from each variance component. It is
calculated as the variance component for each source divided by the total variation, then
multiplied by 100.
Interpretation
Use the %Contribution to assess the variation for each source of measurement error.
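To make the link between the ANOVA table and the variance components concrete, this sketch applies the usual expected-mean-squares formulas for a crossed study with an Operator*Part interaction. The mean squares and study sizes are hypothetical, and negative estimates are truncated at zero, which is a common convention rather than necessarily Minitab's exact handling.

# Variance components from ANOVA mean squares (hypothetical values).
n_parts, n_operators, n_replicates = 10, 3, 3
ms_part, ms_operator, ms_interaction, ms_error = 3.20, 0.90, 0.05, 0.04

var_repeatability = ms_error
var_interaction = max((ms_interaction - ms_error) / n_replicates, 0.0)
var_operator = max((ms_operator - ms_interaction) / (n_parts * n_replicates), 0.0)
var_part = max((ms_part - ms_interaction) / (n_operators * n_replicates), 0.0)

var_reproducibility = var_operator + var_interaction
var_gage = var_repeatability + var_reproducibility
var_total = var_gage + var_part

for name, v in [("Total Gage R&R", var_gage),
                ("  Repeatability", var_repeatability),
                ("  Reproducibility", var_reproducibility),
                ("Part-to-Part", var_part),
                ("Total Variation", var_total)]:
    print(f"{name:18s} VarComp = {v:.5f}   %Contribution = {100 * v / var_total:6.2f}")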
StdDev (SD)
StdDev (SD) is the standard deviation for each source of variation. The standard
deviation is equal to the square root of the variance component for that source.
The standard deviation is a convenient measure of variation because it has the same
units as the part measurements and tolerance.
%Study Var (%SV)
%Study Var is the study variation for each source divided by the total study variation,
multiplied by 100. It is based on the standard deviation, which is the square root of the
variance component (VarComp) for that source. Thus, the %Contribution of VarComp values
sum to 100, but the %Study Var values do not.
Interpretation
Use %Study Var to compare the measurement system variation to the total variation.
If you use the measurement system to evaluate process improvements, such as
reducing part-to-part variation, %Study Var is a better estimate of measurement
precision. If you want to evaluate the capability of the measurement system to
evaluate parts compared to specification, %Tolerance is the appropriate metric.
%Tolerance (SV/Toler)
%Tolerance is calculated as the study variation for each source, divided by the
process tolerance and multiplied by 100.
%Process (SV/Proc)
If you enter a historical standard deviation but use the parts in the study to estimate
the process variation, then Minitab calculates %Process. %Process compares
measurement system variation to the historical process variation. %Process is
calculated as the study variation for each source, divided by the historical process
variation and multiplied by 100. By default, the process variation is equal to 6 times
the historical standard deviation.
If you use a historical standard deviation to estimate process variation, then Minitab
does not show %Process because %Process is identical to %Study Var.
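As a short, purely arithmetic illustration of %Process, the values below are hypothetical; the multiplier of 6 matches the default described above.

# %Process (SV/Proc) with hypothetical values.
historical_sd = 1.10   # historical process standard deviation (hypothetical)
sd_gage = 0.21         # Total Gage R&R standard deviation (hypothetical)

process_variation = 6 * historical_sd            # default: 6 * historical SD
pct_process = 100 * (6 * sd_gage) / process_variation
print(f"%Process (Total Gage R&R) = {pct_process:.1f}%")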
95% CI
95% confidence intervals (95% CI) are the ranges of values that are likely to contain
the true value of each measurement error metric.
Interpretation
Because samples of data are random, two gage studies are unlikely to yield identical
confidence intervals. But, if you repeat your studies many times, a certain percentage
of the resulting confidence intervals contain the unknown true measurement error.
The percentage of these confidence intervals that contain the parameter is the
confidence level of the interval.
For example, with a 95% confidence level, you can be 95% confident that the
confidence interval contains the true value. The confidence interval helps you assess
the practical significance of your results. Use your specialized knowledge to
determine whether the confidence interval includes values that have practical
significance for your situation. If the interval is too wide to be useful, consider
increasing your sample size.
Suppose that the VarComp for Repeatability is 0.044727 and the corresponding 95%
CI is (0.035, 0.060). The estimate of variation for repeatability is calculated from the
data to be 0.044727. You can be 95% confident that the interval of 0.035 to 0.060
contains the true variation for repeatability.
Number of distinct categories
Interpretation
The Measurement Systems Analysis Manual¹, published by the Automotive Industry
Action Group (AIAG), states that 5 or more categories indicates an acceptable
measurement system. If the number of distinct categories is less than 5, the
measurement system might not have enough resolution.
Usually, when the number of distinct categories is less than 2, the measurement
system is of no value for controlling the process, because it cannot distinguish
between parts. When the number of distinct categories is 2, you can split the parts
into only two groups, such as high and low. When the number of distinct categories
is 3, you can split the parts into 3 groups, such as low, middle, and high.
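The number of distinct categories is derived from the standard deviations already discussed. A common form of the calculation, used here with hypothetical values, is the square root of 2 (about 1.41) times the ratio of the Part-to-Part standard deviation to the Total Gage R&R standard deviation, truncated to a whole number.

# Number of distinct categories from hypothetical standard deviations.
import math

sd_part = 1.00   # Part-to-Part standard deviation (hypothetical)
sd_gage = 0.21   # Total Gage R&R standard deviation (hypothetical)

ndc = int(math.sqrt(2) * sd_part / sd_gage)               # truncate to an integer
print(f"Number of distinct categories = {max(ndc, 1)}")   # 5 or more is acceptable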
Probabilities of misclassification
When you specify at least one specification limit, Minitab can calculate the
probabilities of misclassifying product. Because of the gage variation, the measured
value of the part does not always equal the true value of the part. The discrepancy
between the measured value and the actual value creates the potential for
misclassifying the part.
Minitab calculates both the joint probabilities and the conditional probabilities of
misclassification.
Joint probability
Use the joint probability when you don't have prior knowledge about the
acceptability of the parts. For example, you are sampling from the line and don't
know whether each particular part is good or bad. There are two misclassifications
that you can make:
• The probability that the part is bad, and you accept it.
• The probability that the part is good, and you reject it.
Conditional probability
Use the conditional probability when you do have prior knowledge about the
acceptability of the parts. For example, you are sampling from a pile of rework or
from a pile of product that will soon be shipped as good. There are two
misclassifications that you can make:
• The probability that you accept a part that was sampled from a pile of bad
product that needs to be reworked (also called false accept).
• The probability that you reject a part that was sampled from a pile of good
product that is about to be shipped (also called false reject).
Interpretation
Consider an example in which three operators measure ten parts, three times per part. The
spread of the measurements relative to the specification limits determines how often parts
are misclassified: in general, the probabilities of misclassification are higher for a process
that has more variation and produces more parts close to the specification limits.
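One way to see where these probabilities come from is to model a part's true value and its measured value as jointly normal and estimate the four probabilities by simulation. Everything in the sketch below (process mean, standard deviations, specification limits) is hypothetical and is not the example study's data.

# Joint and conditional misclassification probabilities by simulation.
import numpy as np

rng = np.random.default_rng(1)
mu, sd_part, sd_gage = 5.0, 1.0, 0.25   # process mean, part-to-part SD, gage SD
lsl, usl = 2.5, 7.5                     # specification limits (hypothetical)

true = rng.normal(mu, sd_part, size=1_000_000)               # true part values
measured = true + rng.normal(0.0, sd_gage, size=true.size)   # measured values

good = (true >= lsl) & (true <= usl)
accepted = (measured >= lsl) & (measured <= usl)

joint_false_accept = np.mean(~good & accepted)   # part is bad and you accept it
joint_false_reject = np.mean(good & ~accepted)   # part is good and you reject it
cond_false_accept = joint_false_accept / np.mean(~good)   # given a bad part
cond_false_reject = joint_false_reject / np.mean(good)    # given a good part

print(f"joint false accept = {joint_false_accept:.4f}")
print(f"joint false reject = {joint_false_reject:.4f}")
print(f"conditional false accept = {cond_false_accept:.4f}")
print(f"conditional false reject = {cond_false_reject:.4f}")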
Example output (values abbreviated): a Two-Way ANOVA Table (Total: DF = 89, SS = 94.6471), the Gage R&R
Variance Components and %Contribution tables, the Gage Evaluation table, and the Probabilities of
Misclassification tables (Joint Probability and Conditional Probability).
The joint probability that a part is bad and you accept it is 0.037. The joint probability that a part is good
and you reject it is 0.055.
The conditional probability of a false accept, that you accept a part during reinspection when it is really
out-of-specification, is 0.151. The conditional probability of a false reject, that you reject a part during
reinspection when it is really in-specification, is 0.073.
• Total Gage R&R: The variability from the measurement system that includes multiple
operators using the same gage.
• Repeatability: The variability in measurements when the same operator measures the same
part multiple times.
• Reproducibility: The variability in measurements when different operators measure the same
part.
• Part-to-Part: The variability in measurements due to different parts.
Interpretation
In an acceptable measurement system, the largest component of variation is Part-to-Part
variation.
%Contribution
%Contribution is the contribution of each variance component to the total variation,
expressed as a percentage.
%Study Variation
%Study Variation is the percentage of study variation from each source. It is
calculated as the study variation for each source divided by the total study variation,
then multiplied by 100.
%Tolerance
%Tolerance compares measurement system variation to specifications. It is calculated
as the study variation for each source divided by the process tolerance, then
multiplied by 100.
Minitab calculates this value when you specify a process tolerance range or
specification limit.
%Process
%Process compares measurement system variation to the historical process variation. It is
calculated as the study variation for each source divided by the historical process variation,
then multiplied by 100.
Minitab calculates this value when you specify a historical standard deviation and
select Use parts in the study to estimate process variation.
R chart
The R chart is a control chart of ranges that displays operator consistency.
The R chart contains the following elements.
Plotted points
For each operator, the difference between the largest and smallest measurements of
each part. The R chart plots the points by operator so you can see how consistent
each operator is.
NOTE
If each operator measures each part 9 times or more, Minitab displays
an S chart instead of an R chart.
Interpretation
A small average range indicates that the measurement system has low
variation. A point that is higher than the upper control limit (UCL)
indicates that the operator does not measure parts consistently. The
calculation of the UCL includes the number of measurements per part
by each operator, and part-to-part variation. If the operators measure
parts consistently, then the range between the highest and lowest
measurements is small, relative to the study variation, and the points
should be in control.
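As a rough illustration only, the centre line and upper control limit of an R chart can be sketched as R-bar and D4 * R-bar. The ranges below are hypothetical, the constant D4 = 2.574 assumes subgroups of 3 measurements per part, and Minitab's limits for this graph may be computed somewhat differently.

# Simplified Shewhart-style R chart limits (hypothetical ranges).
ranges = [0.05, 0.08, 0.03, 0.06, 0.04, 0.07, 0.05, 0.06, 0.04, 0.12]
d4 = 2.574   # control chart constant for subgroups of size 3 (assumed)

r_bar = sum(ranges) / len(ranges)
ucl = d4 * r_bar
flagged = [r for r in ranges if r > ucl]   # points above the UCL

print(f"R-bar = {r_bar:.3f}, UCL = {ucl:.3f}, points above UCL = {len(flagged)}")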
Xbar chart
The Xbar chart compares the part-to-part variation to the repeatability component.
Plotted points
The average measurement of each part, plotted for each operator.
Interpretation
The parts that are chosen for a Gage R&R study should represent the entire range of
possible parts. Thus, this graph should indicate more variation between part averages than
what is expected from repeatability variation alone.
Ideally, the graph has narrow control limits with many out-of-control points that indicate a
measurement system with low variation.
By Part graph
This graph shows the differences between factor levels. Gage R&R studies usually arrange
measurements by part and by operator. However, with an expanded gage R&R study, you
can graph other factors.
In the graph, dots represent the measurements, and circle-cross symbols represent the
means. The connect line connects the average measurements for each factor level.
NOTE
If there are more than 9 observations per level, Minitab displays a boxplot instead of an
individual value plot.
Interpretation
Multiple measurements for each individual part that vary as minimally as possible (the dots
for one part are close together) indicate that the measurement system has low variation.
Also, the average measurements of the parts should vary enough to show that the parts are
different and represent the entire range of the process.
By Operator graph
The By Operator chart displays all the measurements that were taken in the study, arranged
by operator. This graph shows the differences between factor levels. Gage R&R studies
usually arrange measurements by part and by operator. However, with an expanded gage
R&R study, you can graph other factors.
NOTE
If there are fewer than 10 observations per operator, Minitab displays an individual value plot
instead of a boxplot.
Interpretation
A straight horizontal line across operators indicates that the mean measurements for each
operator are similar. Ideally, the measurements for each operator vary an equal amount.
Operator*Part Interaction graph
The Operator*Part Interaction graph displays the average measurements by each operator
for each part. Each line connects the averages for a single operator (or for a term that you
specify).
Interaction plots display the interaction between two factors. An interaction occurs when the
effect of one factor is dependent on a second factor. This plot is the graphical analog of the
F-test for an interaction term in the ANOVA table.
Interpretation
Lines that are coincident indicate that the operators measure similarly. Lines that are not
parallel or that cross indicate that an operator's ability to measure a part consistently
depends on which part is being measured. A line that is consistently higher or lower than
the others indicates that an operator adds bias to the measurement by consistently
measuring high or low.
1. Automotive Industry Action Group (AIAG) (2010). Measurement Systems Analysis Reference
Manual, 4th edition. Chrysler, Ford, General Motors Supplier Quality Requirements Task Force.