DataChallengePHME2022 Gaffet V2
1 Vitesco Technologies France SAS, 44 Avenue du Général de Croutte, F-31100 Toulouse, France
[email protected]
[email protected]
[email protected]

2 CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France
[email protected]
[email protected]

3 Univ. de Toulouse, UPS, LAAS, F-31400 Toulouse, France

4 Univ. de Toulouse, INSA, LAAS, F-31400 Toulouse, France
ABSTRACT

This paper presents XGBoost classifier-based methods to solve three tasks proposed by the European Prognostics and Health Management Society (PHME) 2022 conference. These tasks are based on real data from a Surface Mount Technologies line. Each of these tasks aims to improve the efficiency of the Printed Circuit Board (PCB) manufacturing process, facilitate the operator's work and minimize the cases of manual intervention. Due to the structured nature of the problems proposed for each task, an XGBoost method based on encoding and feature engineering is proposed. The proposed methods utilise the fusion of test values and system characteristics extracted from two different testing equipment of the Surface Mount Technologies lines. This work also explores the problem of generalising predictions at the system level using information from the subsystem data and, for this particular industrial case, the challenges raised by changes in the number of subsystems. For Industry 4.0, interpretability is very important; this is why the results of the models are analysed using Shapley values. With the proposed method, our team took first place, successfully detecting the defective components at an early stage for tasks 2 and 3.

Alexandre Gaffet et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 United States License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

1. INTRODUCTION

The 2022 PHME Data Challenge encourages participants to solve multiple classification problems for a real production line from Bitron Spa. The dataset includes data from Solder Paste Inspection (SPI) and Automatic Optical Inspection (AOI) equipment of a real industrial production line equipped with automated, integrated and fully connected machines (Industry 4.0). A detailed description of the dataset is given in Section 3. The challenge is to design an algorithm to predict test labels for the components. Specifically, the goal is to develop a hierarchical classification predicting: 1. whether the AOI classifies the component as defective; 2. in the case of a defect, the label applied by the operator; 3. in the case of confirmation of the defect by the operator, the repair label.

To tackle this challenge, we pursue the following steps: data exploration and domain knowledge extraction, data cleaning, data preparation (normalization and encoding), data modelling (model training and validation) and results analysis. The last four steps were carried out recursively while trying different approaches, as shown in Section 4. Data exploration allowed us to identify three main issues within the given dataset: missing information, highly imbalanced classes (for all tasks) and high cardinality of the categorical features. The latter is not necessarily an issue but implies that a special treatment needs to be applied to these features a priori. We elaborate on these issues in Section 4.1 and on the categorical encoding in Section 4.2.

To solve each task, we take different information units formed
2. RELATED WORK

Several scientific articles already present machine learning applications for Surface Mount Technology production lines. (Richter, Streitferdt, & Rozova, 2017) proposes a convolutional neural network deep learning application working on the AOI system to automatically detect defects. In (Tavakolizadeh, Soto, Gyulai, & Beecks, 2017), several binary classifiers are tested to detect defects inside products using simulated production data. These data are simulated from several SMT lines and give a good classification score. In (Parviziomran, Cao, Yang, Park, & Won, 2019), a component shift prediction method is proposed to predict the shift of the pad during the reflow process. In (Park, Yoo, Kim, Lee, & Kim, 2020), SPI data are used to predict potential defects at an early stage. This work is based on a dual-level defect detection method. In (Jabbar et al., 2018), some tree-based machine learning methods are used to predict the defects found in AOI using SPI data. (Gaffet, Ribot, Chanthery, Roa, & Merle, 2021) proposes an unsupervised univariate method for monitoring the In-Circuit Testing machine (located at the end of the Surface Mount Technology lines) and its components. Another large topic of interest for this study is prognosis and health management at different levels: system level or sub-system level. In our case, we have to use information from the pin level to retrieve the health at the system level, which is the product component. This topic is linked to the decentralized diagnosis approach. (Zhang, 2010) proposes a decentralized model-based approach with a simulation example of automated highway systems. (Ferdowsi, Raja, & Jagannathan, 2012) proposes a decentralized fault diagnosis and prognosis methodology for large-scale systems adapted for aircraft, train, automobile, power plant and chemical plant applications. (Tamssaouet, Nguyen, Medjaher, & Orchard, 2021) proposes a component interaction-based method to provide the prognosis of multi-sub-system models.
3. DATA DESCRIPTION

The PHME provides the dataset used in this article as part of the 2022 conference data challenge (PHM Society, 2022). The dataset includes measurement information from two different steps of the PCB production (see Figure 1). The first step is the SPI, in which each solder pad is checked to verify its compliance and, accordingly, a sanction is generated depending on the quality of the solder (evaluated on several physical aspects). The second step is the AOI. In addition to checking the finished solder pads, their position, shape, etc., this process also inspects the component itself, looking for defects like missing or misaligned components. The AOI inspection has two types of sanctions: the automatic sanction provided by the machine itself and the one given by an operator that verifies the first one in case of spotted defects.

Figure 1. Surface Mount Technologies Production Line. (PHM Society, 2022)

Simple data exploration allows us to discover anomalous entries and clean the datasets. A summary of the information found in each dataset is shown in Table 1. The cleaned SPI dataset is composed of 1921 panels, each with 3112 entries. A panel is an ensemble of 8 PCBs, also called figures, grouped together. Each figure is composed of 129 components. The component reference is found in the feature ComponentID. These components have several pins that can be identified using the PinNumber feature. The SPI test results provide the volume, area, height, size, shape and offset of each PadID as well as a final result flag. A PadID corresponds to a unique combination of {FigureID, ComponentID, PinNumber}. Each panel provided by the competition has at least one component detected as a defect by the AOI automatic sanction. Each line of the datasets describes a PadID of one electronic board.

Table 1. Summary characteristics of the datasets

Feature               | SPI        | AOI
Number of lines       | 5 985 382  | 31 617
Number of panels      | 1 924      | 1 924
Number of components  | 129        | 102
Figures/panel         | {1,8}      | {1,...,8}
Components/panel      | {128,129}  | {2,...,27}
Lines/panel           | {3112,...} | {2,...,203}

For the AOI, each line corresponds to a unique entry of the set {PanelID, FigureID, MachineID, ComponentID, PinNumber}, where the PinNumber can be filled as NaN. In such cases, we believe the AOI does not inspect the solder paste but the component itself.
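For illustration, the kind of summary reported in Table 1 can be reproduced with a few pandas aggregations. The sketch below is only an exploration aid, assuming the two challenge tables have been exported to the placeholder files SPI.csv and AOI.csv with the column names described above.

import pandas as pd

# Placeholder file names for the two challenge tables.
spi = pd.read_csv("SPI.csv")
aoi = pd.read_csv("AOI.csv")

def summarize(df, name):
    # One line of the SPI table describes a pad; one line of the AOI table
    # describes an inspected entry (pin- or component-level).
    print(f"--- {name} ---")
    print("Number of lines:      ", len(df))
    print("Number of panels:     ", df["PanelID"].nunique())
    print("Number of components: ", df["ComponentID"].nunique())
    print("Figures/panel:        ", sorted(df.groupby("PanelID")["FigureID"].nunique().unique()))
    print("Lines/panel:          ", sorted(df.groupby("PanelID").size().unique()))

summarize(spi, "SPI")
summarize(aoi, "AOI")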
As introduced before, the challenge is divided into three tasks. Task 1 is to predict whether or not a component will be classified as defective by the AOI using only the inputs provided by the SPI, i.e. the pad measurements. Task 2 is about predicting the operator's label using the SPI test results and the automatic defect classification provided by the AOI (AOILabel). Finally, task 3 concerns the reparation operation. Again, the objective is to predict whether the component detected as a defect by the operator can be repaired or not. For this task, the information used for the prediction is the SPI test results, the AOILabel and also the OperatorLabel.

4. METHODOLOGIES

4.1. Challenges

The exploratory analysis of the training data has revealed several issues that need to be tackled to correctly solve the different tasks:

1. Missing values: for three different PanelID, the provided data miss some information in the SPI dataset. We choose to exclude the lines with no information.
2. Class imbalance: the number of components detected as defects by the AOI is much lower than the number of components in the SPI dataset. Similarly, the number of components really classified as a fault by the operator is much lower than the number of components classified as correct.
3. High cardinality of the categorical features: the categorical feature PadID has more than 1000 modes. Without any sort of variable encoding, classifiers are very difficult to use for such variables. PadID is already encoded in a sense, because its values are ordered by FigureID; in a way, the variable is encoded by component area.
4. High bias in continuous features: the continuous features such as volume, area, height, etc. are highly correlated to the categorical feature PadID.
5. Level of prediction: the prediction has to be done at the component level, whereas the available data are given at the PadID level. This leads to a lot of issues in creating the training target and the prediction. Indeed, for instance, it is unclear whether the training target has to be created by component or by pad.
6. Different numbers of pins: the number of pins depends on the component. It is difficult to use all the pin results as input of a classifier for each component, as the number of pins, and therefore of features, varies. Generalising the training across components is very difficult.

4.2. XGBoost algorithm and categorical features encoding

For tabular data applications, Gradient Boosting Decision Trees (GBDT) are widely used, XGBoost (Chen et al., 2015), CatBoost (Prokhorenkova, Gusev, Vorobev, Dorogush, & Gulin, 2018) and LightGBM (Ke et al., 2017) being the algorithms with the best results. Among these algorithms, we decide to use XGBoost, which is a scalable, parallel and distributed implementation of the original gradient boosting tree algorithm. GBDT is an ensemble model algorithm, i.e. it combines several decision trees to perform a better prediction than a single model. XGBoost uses, in particular, the idea of boosting: it uses a collection of weak models to generate a strong model. In practice, for XGBoost, the idea is to use a gradient descent algorithm over a cost function to iteratively generate and improve weak models. On each iteration, a new weak decision tree is generated based on the error residual of the previous weak model. The final prediction is a weighted sum of all the iterated weak trees. Among ensemble methods, boosting can minimize the model's bias. We propose one model for each task. In this section, these models as well as the features used are presented.
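For reference, this boosting scheme can be summarised with the standard gradient-boosting notation (a generic formulation, not specific to our models): the prediction is an additive ensemble of K regression trees, and each iteration t adds a tree fitted on the gradient (residuals) of the loss of the current ensemble,

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta \, f_t(x_i),

where l(y_i, \hat{y}_i) is the training loss and \eta the learning rate; XGBoost additionally adds a regularisation term \Omega(f_k) on each tree to the objective \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k).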
XGBoost is a high-performing algorithm, although some caution has to be taken when categorical features are used. This is true for all tree-based or boosted tree methods. In particular, one-hot encoding can lead to very poor results when the categorical features have many levels. Indeed, a large number of levels leads to sparsity since, for each level, a new variable is created. Each of these new variables has only a small fraction of data points with the value 1, all the others having the value 0, which is a problem for tree-based methods because the tree split searches for the purest nodes. Indeed, a one-hot encoded variable is not very likely to lead to the purest nodes if it is very sparse, so the tree splits will not be done using this one-hot encoded variable, even if the original categorical feature is very important for the prediction. In our case, other encoding techniques should be used.

First, a common technique is Hash encoding. It is already present in our dataset with a numerical value for each PadID level; the PadID level values depend on the FigureID of the pad. Here the Hash encoding is done with only one feature, but in general the variable could be encoded into more features. One of the most used hashing methods is described in (Yong-Xia & Ge, 2010).

The next approach is the frequency-based encoding method: it uses the frequency of the levels as the label value. If the frequency is linked to the target, it will help the prediction of the variable. For instance, in tasks 2 and 3, the frequency of one component is probably linked to the issues that can exist for each component. This encoding can be useful in that case.

Finally, the last type of encoding is label-based. The idea of label encoding is to replace each categorical value with the conditional probability of the class to be predicted knowing the categorical feature. This can be done by several methods such as Leave One Out encoding and CatBoost encoding (Prokhorenkova et al., 2018). The main issue with this type of method is to learn the conditional probability without overfitting. This can be achieved by not taking the observation itself into account when learning the probability for each observation, as in Leave One Out encoding. This can also be done more efficiently using CatBoost encoding. For this case, we found that the CatBoost encoding performs best.

The frequency-based encoding and Hash encoding had less success.
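To make the retained encodings concrete, the short sketch below applies a frequency encoding and a CatBoost encoding to a toy categorical column, using the categorical encoders Python library mentioned above (presumably the category_encoders package); the column and target names are purely illustrative.

import pandas as pd
import category_encoders as ce

# Toy frame: one high-cardinality categorical feature and a binary target.
df = pd.DataFrame({
    "ComponentID": ["C1", "C2", "C1", "C3", "C2", "C1"],
    "target":      [0,    1,    0,    1,    0,    1],
})

# Frequency-based encoding: each level is replaced by its relative frequency.
freq = df["ComponentID"].value_counts(normalize=True)
df["ComponentID_freq"] = df["ComponentID"].map(freq)

# CatBoost (target) encoding: each level is replaced by an ordered estimate of
# the conditional probability of the target, which limits overfitting compared
# to a naive mean encoding.
encoder = ce.CatBoostEncoder(cols=["ComponentID"])
df["ComponentID_cb"] = encoder.fit_transform(df[["ComponentID"]], df["target"])["ComponentID"]

print(df)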
4.3. Solving the challenges

Task 1

The first task is the most challenging of all. The main difficulty arises from the fact that some defects are related to the pin, and others to the component itself. In our opinion, the most important question for each task is "Should we predict (model) by component or by pin?". For this first task, we decide to go for a per-pin prediction. This is mainly guided by the difficulty of generating coherent labels and features at the component level for this task. Indeed, a single pin can have an issue; it does not seem right to assign the same label to a component with only one pin detected as a defect by the AOI equipment and to another component with multiple defective pins. Moreover, the number of pins varies too much depending on the studied component. As a result, any aggregation of continuous variables will probably hide important information if only one pin has a defect. For instance, the aggregation of the Volume with the mean will not contain much information if there are many pins and only one defective pin for the considered component.

For each tuple, we want to predict whether the tuple is detected as a defect by the AOI equipment or not. Accordingly, the training target column is 1 if the tuple appears in the AOI dataset and 0 otherwise. It is worth noting that we do not consider as defective the tuples for which only PanelID, FigureID, ComponentID appear with PinNumber = NaN. Both categorical and continuous features are used as input. We use the following continuous variables: Volume(%), Area(%), OffsetX(%), OffsetY(%), Shape(um), PosX(mm), PosY(mm), SizeX, SizeY, which we simply call "numerical features" in the remainder of the article. As a categorical value, we only keep ComponentID, which we encode using a CatBoost encoding method (Prokhorenkova et al., 2018).
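A condensed sketch of this per-pin modelling is given below. The file names, the merge-based target construction and the hyper-parameter values are illustrative assumptions, not the exact pipeline used for the submission; the column names are those of the challenge data.

import pandas as pd
import category_encoders as ce
from xgboost import XGBClassifier

KEYS = ["PanelID", "FigureID", "ComponentID", "PinNumber"]
NUM_FEATURES = ["Volume(%)", "Area(%)", "OffsetX(%)", "OffsetY(%)",
                "Shape(um)", "PosX(mm)", "PosY(mm)", "SizeX", "SizeY"]

spi = pd.read_csv("SPI.csv")   # placeholder file names
aoi = pd.read_csv("AOI.csv")

# Target: 1 if the SPI tuple also appears in the AOI dataset with a real
# PinNumber, 0 otherwise. Component-only AOI entries (PinNumber = NaN) are
# not counted as defective pads. Key dtypes may need to be harmonised first.
aoi_pins = aoi.dropna(subset=["PinNumber"])[KEYS].drop_duplicates()
spi = spi.merge(aoi_pins.assign(target=1), on=KEYS, how="left")
spi["target"] = spi["target"].fillna(0).astype(int)

# CatBoost-encode the only categorical feature kept for task 1.
encoder = ce.CatBoostEncoder(cols=["ComponentID"])
X = spi[NUM_FEATURES].copy()
X["ComponentID"] = encoder.fit_transform(spi[["ComponentID"]], spi["target"])["ComponentID"]

# scale_pos_weight is one common way to handle the strong class imbalance.
model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                      scale_pos_weight=(spi["target"] == 0).sum() / (spi["target"] == 1).sum())
model.fit(X, spi["target"])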
Task 2

For this second task, we first split the AOI dataset into two parts according to whether or not PinNumber = NaN. In the case where PinNumber is specified, we can join the AOI and the SPI easily using the columns PanelID, FigureID, ComponentID, PinNumber of each dataset. From the joint dataset, we use the "numerical features" from the SPI test results as defined in task 1. For the categorical features, we keep three features: AOILabel, ComponentID and FigureID ComponentID, where the latter is the string concatenation of the variables with the same name. For these features, we use a CatBoost encoding based on the categorical encoders Python library. We believe a better result can be achieved with deeper work on the optimization of the encoder hyper-parameters. Finally, we also create two new meta-features (not encoded), Count Pin Component and Count Pin Figure. These two variables count the number of pins detected as a defect by the AOI, respectively for the component and the figure of the tuple PanelID, FigureID, ComponentID, PinNumber. The XGBoost classifier algorithm classes each tuple into the class "Bad" or "Good" of the OperatorLabel target column.

For the AOI defects without any PinNumber associated, we propose to also use an ensemble of categorical and continuous variables. We use the same categorical and meta-features, while for the numerical features we use the mean values per component of the following variables: Volume(%), Area(%), OffsetX(%), OffsetY(%), to keep the information only at the PanelID, FigureID, ComponentID tuple level. Finally, we also use an XGBoost classifier algorithm to class each tuple PanelID, FigureID, ComponentID, PinNumber, with PinNumber referenced as NaN, as described before.

For the final sanction (that must be given at the PanelID, FigureID, ComponentID three-tuple level), we use the following rule: if one of the four-tuple PanelID, FigureID, ComponentID, PinNumber entries is predicted as "Bad", then the associated three-tuple is also considered as "Bad". If not, the label "Good" is assigned to the three-tuple.
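The two meta-features and the final component-level rule can be expressed as simple group-by operations. The sketch below uses small toy frames and hypothetical variable names (aoi_pins for the AOI defect entries with a real PinNumber, pred for the per-tuple predictions) purely to illustrate the logic.

import pandas as pd

COMP_KEY = ["PanelID", "FigureID", "ComponentID"]
FIG_KEY = ["PanelID", "FigureID"]

# Toy stand-ins for the real tables.
aoi_pins = pd.DataFrame({"PanelID": [1, 1, 1], "FigureID": [2, 2, 3],
                         "ComponentID": ["C7", "C7", "C9"], "PinNumber": [1, 2, 5]})
pred = aoi_pins.assign(prediction=["Good", "Bad", "Good"])

# Count Pin Component / Count Pin Figure: number of pins flagged by the AOI
# per component and per figure.
count_pin_component = (aoi_pins.groupby(COMP_KEY).size()
                       .rename("Count_Pin_Component").reset_index())
count_pin_figure = (aoi_pins.groupby(FIG_KEY).size()
                    .rename("Count_Pin_Figure").reset_index())
features = (aoi_pins.merge(count_pin_component, on=COMP_KEY)
                    .merge(count_pin_figure, on=FIG_KEY))

# Final sanction: a three-tuple is "Bad" as soon as one of its four-tuples
# is predicted "Bad", otherwise "Good".
component_label = (pred.groupby(COMP_KEY)["prediction"]
                   .apply(lambda s: "Bad" if (s == "Bad").any() else "Good")
                   .rename("PredictedOperatorLabel").reset_index())

print(features)
print(component_label)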
Task 3

Task 3 is about the prediction of one categorical value presented in the AOI dataset, RepairLabel. This label can take two values, FalseScrap or NotPossibleToRepair. For this task, we tried the same approach as in task 2, predicting each four-tuple PanelID, FigureID, ComponentID, PinNumber and merging the results per component, but the approach was not successful. To improve the result, we choose to do the prediction for the PanelID, FigureID, ComponentID three-tuple directly. As input, we use the same method as for task 2, grouping the SPI values per three-tuple using the mean as an aggregation method for the "numerical features". The categorical variables used are ComponentID and FigureID ComponentID. As before, these variables were encoded using the CatBoost encoding method.
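Reusing the spi dataframe and the COMP_KEY and NUM_FEATURES lists from the previous sketches, this component-level aggregation reduces to a single group-by; again, only an illustrative sketch.

# Mean of the SPI "numerical features" per three-tuple, so that task 3 is
# modelled directly at the component level.
spi_component = spi.groupby(COMP_KEY)[NUM_FEATURES].mean().reset_index()

# Concatenated categorical feature FigureID ComponentID, as used in tasks 2 and 3.
spi_component["FigureID ComponentID"] = (spi_component["FigureID"].astype(str)
                                         + "_" + spi_component["ComponentID"].astype(str))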
The meta-features generated are Count Pin Component, Count Pin Figure and Count Pin Panel: respectively, the number of pins detected as defects by the AOI per component, figure and panel. We also created one-hot encoded features from the AOILabel. For each labelled type of error, the asso-
True labels \ Predicted Label | Defects | Not Defects
Defects                       | 8263    | 23345
Not Defects                   | 4204    | 1937814

True labels \ Predicted Label | Defects | Not Defects
Defects                       | 8256    | 1064
Not Defects                   | 4203    | 1922
REFERENCES

…formation technology (Vol. 2, pp. 271–273).

Zhang, X. (2010). Decentralized fault detection for a class of large-scale nonlinear uncertain systems. In Proceedings of the 2010 American Control Conference (pp. 5650–5655).