Developing an unsupervised classification algorithm for characterization of steel properties

International Journal of Quality & Reliability Management, Vol. 29 No. 4, 2012, pp. 368-383. DOI: 10.1108/02656711211224839
Prasun Das
SQC and OR Division, Indian Statistical Institute, Kolkata, India, and
Shubhabrata Datta
Production Engineering Department,
Birla Institute of Technology Extension Centre, Deoghar, India
Abstract
Purpose – The purpose of this paper is to develop an unsupervised classification algorithm
including feature selection for industrial product classification with the basic philosophy of a
supervised Mahalanobis-Taguchi System (MTS).
Design/methodology/approach – Two novel unsupervised classification algorithms, together called the Unsupervised Mahalanobis Distance Classifier (UNMDC), are developed based on the Mahalanobis distance for identifying “abnormals”, as individuals or as groups, including feature selection. The identification of “abnormals” is based on the concept of the threshold value in MTS and the distributional properties of the Mahalanobis D².
Findings – The performance of this algorithm, in terms of its efficiency and effectiveness, has been studied thoroughly for three different types of steel product on the basis of their composition and processing parameters. The performance of the new scheme in future diagnosis, on the basis of the selected useful features, is found to be quite satisfactory.
Research limitations/implications – This new algorithm is able to identify the set of significant features, which is consistently a larger set than that identified by MTS. In an industrial environment, this algorithm can be implemented for continuous monitoring of “abnormal” situations, along with the general concept of screening “abnormals” either as individuals or as groups during sampling.
Originality/value – The procedure for determining thresholds for diagnostic purposes is algorithm-driven and independent of domain knowledge, and is hence much more flexible across large domains. Multi-class separation and feature selection during detection of abnormals are the special merits of this algorithm.
Keywords Unsupervised learning, Mahalanobis distance, MTS philosophy, Feature selection,
Threshold, Severity level, Genetic algorithms, Classification, Steel
Paper type Research paper
1. Introduction
In a classification system built with various parametric and non-parametric methods, a classifier is designed to extract the distinguishing features and support decision making. A common approach in classification is to map the sparse high-dimensional attributes of objects into a dense low-dimensional space. Both types of learning mechanism, supervised and unsupervised, are applied on the given set of information for the purpose of classification. In unsupervised learning, information on input objects is used to fit a model without any knowledge of the output variable(s).
This method treats the input objects as a set of random variables and builds a joint density model for the data set. Another form of unsupervised learning is clustering of the input patterns. Given a particular set of patterns or a cost function, different clustering algorithms lead to different clusters (Hinton and Sejnowski, 1999), such that each cluster contains objects that share some important properties. In the recent past, different unsupervised classification methods have been exemplified by various statistical concepts. The most popular criteria include the within-group sum of squares underlying the single-stage method (Ward, 1963) and the iterative relocation method called k-means clustering (MacQueen, 1967). Other methods are hierarchical agglomeration based on the classification likelihood (Murtagh and Raftery, 1984; Banfield and Raftery, 1992) and the expectation-maximization (EM) algorithm for maximum likelihood estimation of multivariate mixture models (Celeux and Govaert, 1995).
In classification problems, whether supervised or unsupervised, identification of a subset of important input features plays a significant role in creating clusters. The selection of features depends on the application type and the kind of input data to be used for classification. The usual way of doing so is to use some indices to evaluate different feature subsets, which guide the selection of the best subset of features. Recently, feature selection has become popular in unsupervised learning. Notable methods include the sequential unsupervised feature selection algorithm, the wrapper approach based on the EM algorithm, the maximum entropy based method and the recently developed neuro-fuzzy approach. Mitra et al. (2002) propose an unsupervised algorithm which uses feature dependency/similarity for redundancy reduction but does not require any search; a new similarity measure, called the maximal information compression index, is used there for clustering.
The Mahalanobis-Taguchi system (MTS), a kind of supervised technique, uses the Mahalanobis distance (MD) as a multivariate measure (Mahalanobis, 1936) for prediction, diagnosis and pattern recognition in a multi-dimensional system, without any assumption of a statistical distribution (Taguchi and Jugulum, 2000, 2002), and attempts to find out the significant features for generalization. The primary concept behind the MTS
philosophy assumes only one group (cluster), namely “normal” and the corresponding
Mahalanobis space (MS) is obtained using the standardized variables of “normal” group
observations. This MS is used in identifying “abnormals”. Once this MS is established,
the number of features (variables) gets reduced using orthogonal array technique and
signal-to-noise (S/N) ratio by evaluating the contribution of individual feature. Recently,
some work has been done to compare MTS with neural network topologies
(Jugulum and Monplaisir, 2002; Hong et al., 2005; Das et al., 2006; Das and Datta,
2007). However, in all these works, the effectiveness of MTS in segregating “normal” and “abnormal” observations appears to depend heavily on selecting the “normal” group accurately. The determination of the threshold value for future diagnosis, as illustrated by its developers, is subjective and not clearly defined. Also, the method does not prescribe any standard procedure for distinguishing the severity levels of abnormality.
In this work, two novel unsupervised classification algorithms, together called the unsupervised Mahalanobis distance classifier (UNMDC), are developed based on MD for identifying “abnormals”, as individuals or as groups, including feature selection. The identification of “abnormals” is based on the concept of the threshold value in MTS and the distributional properties of the Mahalanobis D². The performance of this algorithm is studied from various perspectives in a real-life industrial environment, particularly for the characterization of steel properties on the basis of composition and processing parameters. The rest of the paper is organized as follows. Section 2 contains a general discussion of MD and its use in MTS. In Section 3, the proposed unsupervised classification scheme UNMDC is presented. Section 4 narrates the implementation of the developed algorithm in a real-life steel processing system and demonstrates the results of classification, including selection of useful features. Some other merits of the proposed algorithm are highlighted in Section 5. The paper is concluded in Section 6.
2. Mahalanobis distance and its use in MTS
The Mahalanobis distance used here is the generalized squared distance:

$$D_t^2(x) = (x - m_t)^T S_t^{-1} (x - m_t)$$

where:
$D_t^2$ is the generalized squared distance of each vector point from the t groups;
$S_t$ represents the within-group covariance matrix;
$m_t$ is the vector of the means of the variables of the t groups;
$x$ is the vector containing the values of the variables observed at the point in question.
When compared to other classical statistical approaches, the MD takes into account not only the average value but also the variance and the covariance of the variables measured. It accounts for ranges of acceptability (variance) between variables, it compensates for interactions (covariance) between variables, and it is dimensionless.
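As a quick numerical illustration of the $D^2$ formula above, the following Python sketch computes the distance of one observation from a small reference group; all values are invented for the example, and the pseudo-inverse stands in for $S_t^{-1}$ in case the sample covariance is ill-conditioned.

```python
import numpy as np

# Invented reference ("normal") group: five observations, three variables.
reference = np.array([[0.10, 1.20, 25.0],
                      [0.12, 1.10, 27.0],
                      [0.09, 1.30, 24.0],
                      [0.11, 1.22, 26.0],
                      [0.13, 1.18, 25.5]])
x = np.array([0.18, 0.90, 31.0])        # observation to be scored

m = reference.mean(axis=0)              # mean vector m_t
S = np.cov(reference, rowvar=False)     # within-group covariance S_t
d = x - m
D2 = d @ np.linalg.pinv(S) @ d          # D_t^2(x) = (x - m_t)' S_t^{-1} (x - m_t)
print(D2)
```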
The distributional aspect of MD ($D^2$) is that it underlies Hotelling’s $T^2$ statistic. If $\bar{X}$ and $S$ are the mean vector and covariance matrix of a sample of size $n$ from $N_p[\mu, \Sigma]$, then

$$(n-1)\,(X - \mu)^T S^{-1} (X - \mu) \sim T^2(p,\, n-1),$$

where $X_{p\times 1} \sim N_p[\mu, \Sigma]$ and $D^2_{X,\mu} = (X - \mu)^T S^{-1} (X - \mu)$.
Taguchi was the first to propose the use of MD (Taguchi and Jugulum, 2002) in a
multi-dimensional system for prediction, diagnosis and pattern recognition without any
assumption about the distribution, and attempted to find out useful variables for future
diagnosis, which has been referred to as MTS. In MTS, the MS (often called reference
group) is obtained using the standardized variables of “normal” data. The MS can be
used to discriminate between “normal” and “abnormal” data. Once this space is
established, the number of variables (or, attributes) is reduced, using orthogonal array
and S/N ratio, by evaluating the contribution of each attribute. Taguchi has also
incorporated the concept of a threshold for future diagnosis (Taguchi and Jugulum,
2000). The expected value of the MDi’s (MD for i-th pattern) for the “normal” items is
unity (Taguchi and Jugulum, 2002). This approximation is evidently based on
the $\chi^2$-distribution with $p$ degrees of freedom. This is the probability distribution of $p\,\mathrm{MD}_i$, provided the sampling is from a multivariate normal distribution with known mean and dispersion matrix.
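The scaled-MD cutoff implied by this distributional result is a one-liner; the 0.05 level below is an arbitrary choice for the sketch, not a value from the paper.

```python
from scipy.stats import chi2

p = 10        # number of features
alpha = 0.05  # illustrative significance level
# Since p * MD_i ~ chi-square(p) for "normal" items, E[MD_i] = 1 and a
# natural upper cutoff on the scaled MD is the chi-square quantile over p:
cutoff = chi2.ppf(1 - alpha, df=p) / p
print(cutoff)  # ~1.83 for p = 10
```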
3. The proposed algorithms
3.1 Algorithm: UNMDC-I
This algorithm is developed in line with the MTS philosophy; the extension of the MTS principle, using statistical concepts in an unsupervised manner, is discussed below.
Assume that we have an (n × p) dataset, where n is the number of observations and p is the number of variables.

Stage-1. Construction of a measurement scale with MS as reference
1. Initialize K = n; min. no. of patterns = (p + 1).
2. Initialize i = 1.
3. While (i ≤ K):
4. Take all K points except the i-th pattern from the dataset.
5. Find m and Σ of the selected dataset.
6. Compute MD_i for the i-th pattern based on m and Σ.
7. Set i = i + 1; go to Step 3.
8. Select pattern j such that MD_j = max_i (MD_i).
9. Set K = K − 1 and Dataset = previous dataset excluding the j-th pattern.
10. If K > min. no. of patterns, go to Step 2.
11. “Normal” group = Dataset.
12. Compute the threshold value depending on the “normal” group (see Section 4.3 for details).
13. Create the “normal” group with those patterns whose MDs are less than the first threshold.
14. If the number of points in the “normal” group changes, then go to Step 12.
15. Find m and Σ of the “normal” group.
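A compact Python rendering of Stage-1 may make the loop structure clearer. The threshold rule of Steps 12-14 is specified later in the paper (Section 4.3); the chi-square quantile used here is a stand-in assumption, and `max_iter` is a safety guard not present in the original steps.

```python
import numpy as np
from scipy.stats import chi2

def md_sq(x, mean, cov_inv, p):
    """Scaled squared Mahalanobis distance MD = D^2 / p."""
    d = x - mean
    return float(d @ cov_inv @ d) / p

def unmdc1_stage1(X, alpha=0.05, max_iter=100):
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    data = X.copy()
    # Steps 1-10: repeatedly drop the pattern with the largest
    # leave-one-out MD until only (p + 1) patterns remain.
    while len(data) > p + 1:
        md = np.empty(len(data))
        for i in range(len(data)):
            rest = np.delete(data, i, axis=0)         # all points except i-th
            mean = rest.mean(axis=0)
            cov_inv = np.linalg.pinv(np.cov(rest, rowvar=False))
            md[i] = md_sq(data[i], mean, cov_inv, p)
        data = np.delete(data, md.argmax(), axis=0)   # Steps 8-9
    # Steps 11-15: grow the "normal" group back until its size stabilises.
    normal = data
    for _ in range(max_iter):
        mean = normal.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(normal, rowvar=False))
        threshold = chi2.ppf(1 - alpha, df=p) / p     # stand-in for Section 4.3
        md_all = np.array([md_sq(x, mean, cov_inv, p) for x in X])
        new_normal = X[md_all < threshold]
        if len(new_normal) == len(normal):            # Step 14
            break
        normal = new_normal
    return normal, md_all, threshold
```

The peeling loop refits one covariance per candidate removal, so a practical implementation would update the mean and covariance incrementally rather than refitting from scratch.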
[Figure 1. Flowchart of UNMDC-II: outliers are iteratively detected and excluded from the dataset (with cluster count K, predefined number PN and iteration limit IN; counters INo and PNo are initialized to 1, and Σ_i is updated for i = 1(1)K on each pass) until the iteration and pattern limits are exceeded, after which the classification result is returned.]
The results below cover threshold values for UNMDC, feature selection for both the methods, and future diagnosis with UNMDC only.
Dataset-I: related to steel hardness (using UNMDC-I)
The demand for high strength steels combined with adequate ductility, formability and fracture toughness arises mainly in sectors like automobile, defense and naval applications. The alloy chemistry (C, Mn, Si, Cu, Ti and B), cold deformation, cooling rate (CRT), aging time and aging temperature have been taken as input parameters (features), whereas hardness is designated as the output variable. The purpose is to improve the hardenability, which in turn is expected to give rise to a dual phase microstructure of the steel depending on the finish rolling condition and CRT. The chemical analyses are done in an atomic spectrometer. The hardness testing has been carried out in a Vickers hardness testing machine. Information on the acceptable level of hardness values (“normal”) is known for 150 observations, and the remaining 92 patterns of input parameters yield nonconforming (“abnormal”) hardness only (Das and Datta, 2007).
The data are analyzed using MTS and UNMDC methods, considering Stage I and
Stage II first. The results on average MD values for “normal” as well as for “abnormal”
are reported in Table I along with threshold value(s) obtained in UNMDC and the
number of patterns (observations) falling in between the threshold values.
As seen from Table I, all 150 observations fall below the threshold value 6.6821 and are identified as “normal”. Out of the 92 abnormals, 56 fall within (6.6821, 0.2603 × 10^12), 21 lie between (0.2603 × 10^12, 7.4027 × 10^12), and 15 fall above 7.4027 × 10^12. The overall average MDs of “normal” and “abnormal” are found to be 0.9933 and 5.6262 × 10^12, respectively, for both MTS and UNMDC. The partition of the “abnormal” group into three sets (56, 21, 15) using UNMDC indicates different severity levels. Note that the threshold values for MTS are domain-knowledge dependent; it is therefore not possible to report them in Table I. The plot of scaled MD for both methods is shown in Figure 2 over the 242 points, along with horizontal lines indicating the threshold values. This plot indicates the level of separation of the abnormal points from the “normal” group based on severity level. Looking at the plots of the scaled MDs for both methods, it is observed that the proposed unsupervised algorithm maps the patterns quite consistently with MTS in terms of scaled MD values.
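Given ordered thresholds, assigning a severity level to a computed MD is a simple binning step. A sketch using the Dataset-I cutoffs quoted above:

```python
import numpy as np

# Thresholds for Dataset-I as quoted in the text, in increasing order.
cutoffs = np.array([6.6821, 0.2603e12, 7.4027e12])

def severity(md):
    """Level 1 = "normal"; levels 2-4 = increasingly severe abnormals."""
    return int(np.searchsorted(cutoffs, md) + 1)

print(severity(2.5), severity(1.0e12))  # -> 1 3
```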
We now show the results of the selection of useful variables (features), as described in Stage-III of both methods, for the steel data considered. The features are selected on the basis of positive S/N ratio values in the case of MTS; for UNMDC, the corresponding high partial F-values indicate the importance of the features to be selected (Table II).
Table I. Average MDs for “normal” and “abnormal” patterns and thresholds (UNMDC)

Pattern type | MTS (mean) | UNMDC (mean) | Threshold (UNMDC) | No. of patterns
Normal | 0.9933 | 0.9933 | 6.6821 | 150
Abnormal | 5.6262 × 10^12 | 5.6262 × 10^12 | (0.0188, 0.7403, 2.4588) × 10^13 | (56, 21, 15)
[Figure 2. Plot of scaled MD (MTS and UNMDC) for Dataset-I; x-axis: patterns (0-250), y-axis: scaled MD (0-1).]
Table II. Selection of useful variables (features)

Method | Selected features | Gain (S/N ratio) for MTS / partial F-values for UNMDC
MTS | C, Mn, Si, Ti, B, Cu, CRT, cold deformation, aging temperature and aging time | (−13.4377, −13.0483, −6.9996, −6.5296, −5.6647, −6.8327, −0.1486, −0.4957, 4.8658, 942.7802)
UNMDC | C, cold deformation and aging time | (5.2706, 0.1028, 0.9991, 0.6938, 1.4311, 2.1806, 0.0002, 38.2853, 0.6679, 35.3996)
Dataset-II: related to steel plates (using UNMDC-II)
The classification of finished plates on the basis of their mechanical properties in an industrial situation is considered. After the plates are manufactured, samples are taken for test-piece analysis with respect to their mechanical properties. Depending on the conformance of these sample plates to the specifications, the corresponding heat of steel making is certified as “Ok” or “Diverted”. A total of 69 observations coming from the test-piece analysis of a plate rolling mill contain information on carbon, manganese, sulphur, phosphorus and silicon, obtained from ladle sample analysis, along with the status of the steel plates (normal/“Ok” and abnormal/“Diverted”). No significant variation in the rolling parameters is observed for this particular data set. Out of the 69 heats, 41 are “Ok” and 28 are declared “Diverted” (Das et al., 2006).
The analysis of data using UNMDC-II yields the following result of classification
(Table III).
Partial-F statistic values for the five compositions C, Mn, S, P and Si are found as
9.0961, 62.5109, 0.5844, 2.1533 and 2.2699, respectively. The magnitude of these values
implies that Carbon (C) and Manganese (Mn) are the responsible (significant) features
for group separation (Figure 3).
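The paper does not spell out here how the partial F-values are computed; one standard construction is the extra-sum-of-squares partial F from a linear model of the (0/1) group label on all five compositions, dropping one composition at a time. A minimal sketch under that assumption:

```python
import numpy as np

def partial_f_values(X, y):
    """Extra-sum-of-squares partial F for each column of X, comparing the
    full linear model of y on X with the model that omits that column."""
    n, p = X.shape
    Xf = np.column_stack([np.ones(n), X])      # intercept + all features
    beta = np.linalg.lstsq(Xf, y, rcond=None)[0]
    rss_full = np.sum((y - Xf @ beta) ** 2)
    f_vals = []
    for j in range(p):
        Xr = np.delete(Xf, j + 1, axis=1)      # drop feature j
        br = np.linalg.lstsq(Xr, y, rcond=None)[0]
        rss_red = np.sum((y - Xr @ br) ** 2)
        f_vals.append((rss_red - rss_full) / (rss_full / (n - p - 1)))
    return f_vals
```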
The same dataset is used for comparison with other techniques. The results are
given in Table IV.
5. Merits of UNMDC scheme
We now discuss the advantages of the newly developed unsupervised algorithm, particularly with respect to domain knowledge extraction using the feature selection methodology, along with the determination of multiple threshold values representing different severity levels within the “abnormal” group of measurements. An algorithm for determining the different threshold levels is also described in this section. The practical implication of the merits of this unsupervised classification algorithm in an industrial environment is noteworthy.
5.1 Extraction of domain knowledge
5.1.1 Using UNMDC-I. To extract the domain knowledge based on the results of feature selection by the algorithm, the input set of ten features of Dataset-I is applied to this algorithm. The data set is classified into four groups (one normal; three abnormal with different severity/nonconforming levels), and for each group the basic statistics on the hardness measurements are given in Table V.
Table III. Misclassification details for steel plates

Direction | Number | Pattern no.
Ok → Diverted | 3 | 10, 25, 41
Diverted → Ok | 1 | 61
Overall | 4 (6%) |
[Figure 3. Performance classification based on MD values: standardized MD (0-1.2) for the 69 sample points, with the normal and abnormal groups marked.]
[Table IV. Misclassification statistics (Ok → Diverted and Diverted → Ok pattern numbers, totals and percentages) for the comparison with other techniques.]
5.2.2 Dataset-III: related to steel strength for multi-class classification. High strength low alloy (HSLA) steels are usually low-carbon (up to 0.15 percent) steels with up to 1.5 percent manganese and low levels (under 0.035 percent) of phosphorus and sulphur, strengthened by small additions of elements such as copper, nickel, niobium, nitrogen, vanadium, chromium, molybdenum, silicon or zirconium, and sometimes by special rolling and cooling techniques. The hot rolling parameters add to this complexity and also influence the final property. Proper adjustment of all these parameters can result in an improved strength-ductility balance of such steels. HSLA steels are much stronger and tougher than ordinary plain-carbon steels. The possibility of predicting the strength of the steel after hot rolling would clearly be desirable, and there have been some attempts in this respect (Ichikawa et al., 1996; Ray Chaudhury and Das, 1997; Datta and Banerjee, 2004; Das et al., 2006; Das and Datta, 2007).
To verify the performance of the UNMDC algorithm in determining multiple threshold values, the dataset related to the HSLA steel product is considered. This is because such a steel system is more complex, owing to its many input variables and their interactions. Moreover, the dataset is relatively large compared to the plain carbon steel system, so the chances of misclassification and the number of severity levels could be higher due to unstable processing behaviour and inferior decision making while characterizing the strength property. The dataset contains information on ten chemical compositions, namely, Carbon (C), Manganese (Mn), Silicon (Si), Chromium (Cr), Nickel (Ni), Molybdenum (Mo), Titanium (Ti), Niobium (Nb), Boron (B) and Copper (Cu), along with six rolling parameters: slab reheating temperature (SRT), deformation in three different temperature zones (D1, D2, D3), finish rolling temperature (FRT) and CRT. Out of a total of 117 observations, 75 plates are from the normal (high strength) group and 42 are from the abnormal (low strength) group, based on the strength property.
Using the developed algorithm (Section 5.3.1) we get the cutoffs shown in Table VII, where the first one is obtained from a statistical point of view and the others are data driven. Accordingly, applying the cutoff values to the computed MDs for all 117 data points, Table VIII yields different severity levels, ranging from 2 to 6, for the “abnormal” group.
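The cutoff-search algorithm of Section 5.3.1 is not reproduced in this extract. Purely as an illustration of a data-driven search, one could place cutoffs at the widest gaps in the sorted abnormal MDs; this stand-in is an assumption, not the paper’s procedure.

```python
import numpy as np

def gap_cutoffs(md_abnormal, k):
    """Place k cutoffs at the midpoints of the k widest gaps in the
    sorted abnormal MDs (illustrative stand-in for Section 5.3.1)."""
    md = np.sort(np.asarray(md_abnormal, dtype=float))
    gaps = np.diff(md)
    idx = np.sort(np.argsort(gaps)[-k:])   # positions of the k widest gaps
    return (md[idx] + md[idx + 1]) / 2.0   # midpoints serve as cutoffs
```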
The plot of MDs is shown in Figure 4 over the 117 points, along with horizontal lines indicating the threshold (cutoff) values. This plot indicates the level of separation of the abnormal points from the “normal” group based on severity level. From the figure, the existence of possible groups among the abnormals is also observed.
Now, we explore the useful set of variables, considering each abnormal group (2-6) vs the normal group, based on S/N ratio analysis in the same way as done in the Part-A work.
Table VIII. Severity/grade identification of 117 heats

Heat no. | Severity | Heat no. | Severity | Heat no. | Severity | Heat no. | Severity
1 | 1 | 31 | 1 | 61 | 1 | 91 | 3
2 | 1 | 32 | 1 | 62 | 1 | 92 | 2
3 | 2 | 33 | 1 | 63 | 1 | 93 | 2
4 | 1 | 34 | 1 | 64 | 1 | 94 | 3
5 | 1 | 35 | 1 | 65 | 1 | 95 | 4
6 | 1 | 36 | 1 | 66 | 1 | 96 | 4
7 | 1 | 37 | 1 | 67 | 1 | 97 | 3
8 | 2 | 38 | 1 | 68 | 1 | 98 | 3
9 | 1 | 39 | 1 | 69 | 1 | 99 | 3
10 | 1 | 40 | 1 | 70 | 1 | 100 | 4
11 | 1 | 41 | 1 | 71 | 1 | 101 | 3
12 | 1 | 42 | 1 | 72 | 1 | 102 | 4
13 | 1 | 43 | 2 | 73 | 2 | 103 | 3
14 | 1 | 44 | 1 | 74 | 1 | 104 | 6
15 | 1 | 45 | 1 | 75 | 1 | 105 | 5
16 | 1 | 46 | 1 | 76 | 2 | 106 | 5
17 | 1 | 47 | 1 | 77 | 2 | 107 | 5
18 | 1 | 48 | 1 | 78 | 1 | 108 | 6
19 | 2 | 49 | 1 | 79 | 2 | 109 | 4
20 | 1 | 50 | 1 | 80 | 2 | 110 | 4
21 | 1 | 51 | 2 | 81 | 2 | 111 | 6
22 | 1 | 52 | 1 | 82 | 2 | 112 | 6
23 | 1 | 53 | 1 | 83 | 2 | 113 | 6
24 | 1 | 54 | 1 | 84 | 2 | 114 | 6
25 | 1 | 55 | 1 | 85 | 2 | 115 | 6
26 | 1 | 56 | 1 | 86 | 2 | 116 | 6
27 | 1 | 57 | 1 | 87 | 2 | 117 | 4
28 | 2 | 58 | 1 | 88 | 3 | |
29 | 1 | 59 | 1 | 89 | 2 | |
30 | 1 | 60 | 1 | 90 | 2 | |
[Figure 4. MD values and the different cutoffs: left panel, MD (0-160) over all 117 observations; right panel, MD (0-3.5) over observations 10-90, with horizontal lines at the cutoff values.]
The purpose is to see the role of variables when groups of abnormals with different
severity exist. As before, based on process knowledge, the eight variables (C, Mn, Si, D1, D2, D3, SRT, FRT) are kept outside this analysis, as they should always be present. We have used the same L16 array, in which 16 trials were designed; for each trial, based on the combination of “present” and “absent” levels, MD values were calculated using the data of each abnormal group. The result, in terms of the gain for each (normal, abnormal with a particular severity) combination, is shown in Table IX, and a sketch of the computation follows. The relevant points are discussed in the next section.
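A sketch of that gain computation, assuming the larger-the-better S/N ratio commonly used in MTS on the abnormal-group MDs and a Sylvester-built L16 array; the paper’s exact column assignment is not reproduced:

```python
import numpy as np

def sn_larger_the_better(md):
    """Taguchi larger-the-better S/N over the abnormal-group MDs."""
    return -10.0 * np.log10(np.mean(1.0 / np.asarray(md)))

def l16():
    """16-run two-level array via Sylvester's Hadamard construction;
    the constant first column is dropped, leaving 15 usable columns."""
    h = np.array([[1]])
    for _ in range(4):
        h = np.block([[h, h], [h, -h]])
    return (h[:, 1:] > 0).astype(int)              # 1 = present, 0 = absent

def gains(normal, abnormal, always_idx, cand_idx):
    """Gain per candidate feature: mean S/N over runs with the feature
    present minus mean S/N over runs with it absent. The always-present
    variables (C, Mn, Si, D1, D2, D3, SRT, FRT here) enter every run."""
    oa = l16()[:, :len(cand_idx)]
    sn = np.empty(len(oa))
    for r, row in enumerate(oa):
        cols = list(always_idx) + [cand_idx[k] for k in np.flatnonzero(row)]
        sub = normal[:, cols]
        mean = sub.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(sub, rowvar=False))
        md = [(x - mean) @ cov_inv @ (x - mean) / len(cols)
              for x in abnormal[:, cols]]
        sn[r] = sn_larger_the_better(md)
    return [sn[oa[:, k] == 1].mean() - sn[oa[:, k] == 0].mean()
            for k in range(oa.shape[1])]
```

Calling `gains(...)` once per abnormal severity group (2-6) against the normal group would fill the corresponding column of Table IX.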
The multi-class classification scheme using the multiple cutoff search algorithm has quite successfully identified the role of the input variables for the steel under study. It clearly reveals that Ni, Cu and the post-rolling CRT, both individually and in combination, show a continuously increasing gain with an increase in the group number, i.e. with decreasing quality of the product. This suggests that these variables significantly control the final property of the steel at all levels of its quality. On the other hand, the carbide and carbonitride precipitate formers like Cr, Mo, Ti, Nb and B have an effect only on the higher quality grades of the product, as their gain is significant only when compared with the second and third groups. Among them, Cr and Mo have the stronger influence in converting the quality of the steel from the third group to the first and second groups, whereas Ti, Nb and B are important for converting from the second group to the first group. This finding also matches the metallurgical understanding that Nb, Ti and B are better strengtheners
[Table IX. Gain for each variable for each comparison of an abnormal group vs the normal group (columns: Variables, 2 vs 1, 3 vs 1, 4 vs 1, 5 vs 1, 6 vs 1).]