CLRiuS: Contrastive Learning For Intrinsically Unordered Steel Scrap
Keywords: Artificial intelligence, Self-supervised learning, Steel scrap

Abstract: There has been remarkable progress in the field of Deep Learning and Computer Vision, but there is a lack of freely available labeled data, especially when it comes to data for specific industrial applications. However, large volumes of structured, semi-structured and unstructured data are generated in industrial environments, from which meaningful representations can be learned. The effort required for manual labeling is extremely high and can often only be carried out by domain experts. Self-supervised methods have proven their effectiveness in recent years in a wide variety of areas such as natural language processing or computer vision. In contrast to supervised methods, self-supervised techniques are rarely used in real industrial applications. In this paper, we present a self-supervised contrastive learning approach that outperforms existing supervised approaches on the used scrap dataset. We use different types of augmentations to extract the fine-grained structures that are typical for this type of images of intrinsically unordered items. This extracts a wider range of features and encodes more aspects of the input image. This approach makes it possible to learn characteristics from images that are common for applications in the industry, such as quality control. In addition, we show that this self-supervised learning approach can be successfully applied to scene-like images for classification.
∗ Corresponding author at: KTH Royal Institute of Technology, Department of Materials Science and Engineering, Stockholm, 10044, Sweden.
E-mail addresses: [email protected] (M. Schäfer), [email protected] (U. Faltings), [email protected] (B. Glaser).
All authors reviewed the manuscript.
https://doi.org/10.1016/j.mlwa.2024.100573
Received 10 April 2024; Received in revised form 21 June 2024; Accepted 3 July 2024
Available online 8 July 2024
2666-8270/© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
M. Schäfer et al. Machine Learning with Applications 17 (2024) 100573
usually cannot be done by non-specialists, but requires well-trained domain experts, making it particularly expensive and often infeasible. In contrast, unlabeled image and video data are freely or easily available on a very large scale (Doersch & Zisserman, 2017). It is also easier for the industry to automatically collect unlabeled data directly in their processes. In general, there is a lack of such datasets (Caesar et al., 2018) available, whether annotated or not. Industrial or medical applications or studies that use datasets with domain-specific content (Baumert et al., 2008; Xu et al., 2023) only make them available on request, or the data is completely confidential.

One approach to learning from limited labeled data is supervised pre-training on large labeled datasets such as ImageNet (Deng et al., 2009) and subsequent supervised finetuning on a labeled target dataset (Azizi et al., 2021). However, as Schäfer et al. (2023) show, for intrinsically unstructured ‘‘stuff’’-category images like scrap (e.g. DOES, Schäfer & Faltings, 2023), this approach is not well suited, as the commonly available large labeled datasets tend to depict ‘‘objects’’ rather than ‘‘stuff’’, i.e. the domain shift between these different classes of images seems to be too large.

To counteract the problem of the unavailability of annotated data, SSL (self-supervised learning), ‘‘the dark matter of intelligence’’ (Balestriero et al., 2023), is a promising technology that has been intensively researched in recent years. Self-supervised learning uses correlations and similarities in the input data to learn generalizable features from unlabeled data (Hadsell et al., 2006). This knowledge learned from the unlabeled data is then applied in a supervised downstream task to develop new applications. The learned representations require only a small labeled dataset to solve complex tasks such as object detection or image segmentation (Rani et al., 2023). Contrastive learning is one self-supervised learning technique (Caron et al., 2021; Chen, Kornblith, Norouzi et al., 2020; He et al., 2020; van den Oord et al., 2019), which has the potential to match the performance of supervised learning methods.

Most approaches and results are based on natural object-centric image data such as ImageNet (Deng et al., 2009) or CIFAR-100 (Krizhevsky, 2012), with images of vehicles, animals, food, buildings, people etc. Similar to medical images (Ciga et al., 2022; Huang, Pareek et al., 2023; Manna et al., 2022) or technically created images like radar deformation pictures (Bountos et al., 2022), many images from the industrial environment, such as surface images of materials, structural images or microscope images, have a very different nature from that of common object-centric images. Intrinsically disordered stuff-like images have special semantic properties that are distinctly different from object-based image data. Not all SSL techniques and pretext tasks can be applied to this type of image data. For example, solving jigsaw puzzles (Kim et al., 2018; Noroozi & Favaro, 2016) or predicting rotations (Gidaris et al., 2018) makes little sense for image data with no semantic order.

In this paper, we propose a model for steel scrap detection based on SimCLR (Chen, Kornblith, Norouzi et al., 2020), extended with special augmentations (Mishra et al., 2022; Wang & Qi, 2023) for intrinsically disordered image data. This approach can hopefully help researchers, students and especially developers from industry by providing a basis for exploring SSL methods beyond traditional data and for developing new applications. The contributions of our work are as follows:

1. To the best of the authors’ knowledge, CLRiuS is the first self-supervised learning approach for classification of steel scrap images, thus avoiding an enormous labeling effort.
2. We propose a training pipeline based on SimCLR that does not rely on the generation of massive synthetic or manually annotated industrial scrap image data.
3. We show that training on a large unlabeled scrap dataset in a self-supervised approach provides better feature extraction than using pre-trained ResNet models based on labeled data or only labeled scrap image data. Experimental results show that the models trained with CLRiuS generalize better and outperform annotated baseline models.
4. We provide the first self-supervised learning model for scrap image data, which can be used for different downstream tasks and for building industrial applications.

The paper is organized as follows. Section 2 covers the related work that contributed to this paper. In Section 3 the used dataset is described, and the SSL approach and the experimental setup are introduced. In Section 4 various experiments are performed and the experimental results are analyzed and compared in detail. Finally, in Section 5 we summarize the paper and discuss the results and future work.

2. Related work

The classification and sorting of steel scrap has been investigated for several years using other technologies and traditional methods. Magnetic separation (Zhang & Forssberg, 1998) can be used to separate ferrous from non-ferrous scrap. Eddy current separation (Huang, Zhu et al., 2021; Jujun et al., 2014) is often used to separate non-ferrous material from waste. Sorting systems in which X-ray fluorescence detectors measure the energy of X-rays are available on the market and are used to determine the composition of steel (e.g. copper concentration) (Weiss, 2011). In laser-induced breakdown spectroscopy (LIBS), scrap pieces are irradiated by lasers, analyzed and separated on the basis of their various emission spectra. A combination of LIBS and classification algorithms shows very promising results for post-consumer scrap (den Eynde et al., 2022).

A great amount of progress has been made in the field of deep learning and computer vision in recent years. However, there is very little work that focuses on scrap sorting or visual scrap inspection. Scrap classification and sorting has been investigated or applied in other works with computer vision methods (Wieczorek & Pilarczyk, 2008) or machine learning methods with or without computer vision pre-processing (Baumert et al., 2008; Díaz-Romero et al., 2022; dos Santos et al., 2024; Xu et al., 2023). The AI-supported methods usually rely on supervised learning to achieve good classification results.

Self-supervised learning is a subclass of unsupervised learning. In contrast to supervised learning, SSL does not require huge amounts of annotated data to solve tasks. SSL frameworks are able to learn from unlabeled data (Misra & van der Maaten, 2019). In principle, labels already present in the data itself (for example in images) are used. SSL techniques have been successfully applied in recent years in the fields of natural language processing and computer vision, and also in the areas of time series, audio and video processing (Balestriero et al., 2023).

One method is context-based self-supervised learning. Information that can be obtained from the context of the data is used to train a neural network. For example, with an image divided into tiles, the relative position of tiles in images can be predicted (Doersch et al., 2015; Kim et al., 2018; Noroozi & Favaro, 2016). Context-based SSL is furthermore used to predict the rotation (Gidaris et al., 2018) of an image or to colorize (Leibe et al., 2016) an image. In all these examples, the aim is to restore the original data as far as possible. With this method, it is important to understand the semantics of the input data, and such a semantic structure must be present; e.g. the location of a cat’s nose also provides information about where the eyes should be expected. In contrast to natural images, this approach is not well suited for image data that have no intrinsic order or that provide little contextual information.

In computer vision tasks for non-natural scenes or medical images, very good results were achieved with contrastive approaches. For example, good results were achieved in the evaluation of digital histopathology (Ciga et al., 2022) images using contrastive-based self-supervised learning techniques. This method is used to learn visual representations of image data. When learning representations, similar
images are grouped together, but at the same time dissimilar images that show few similarities are separated. For non-natural scenes, i.e. data without intrinsic order such as scrap, the challenges are similar to those for medical images, making contrastive-based self-supervised learning techniques a promising approach for this kind of data as well.

SSL methods are not yet widely used in industry, particularly in the steel sector. However, these technologies are already being researched for special tasks such as surface inspection (Hu et al., 2023) or anomaly detection (Li et al., 2023).

Augmentations of the input data are used in many machine learning technologies to improve the results, make them more robust and avoid overfitting (Ciresan et al., 2012; Krizhevsky et al., 2017). Augmentations also play a very important role in contrastive-based methods and frameworks for learning useful representations (Bachman et al., 2019; Doersch et al., 2015; Tian et al., 2020). Augmentations can be grouped into weak and strong augmentations. Many current contrastive learning methods apply various weak augmentations – such as cropping, color jittering, Gaussian blurring, grayscale conversion, horizontal flipping, and color normalization – to the input image data (Wang & Qi, 2023). Varying or extending weak augmentation strategies very often shows a significant performance gain (Caron et al., 2021; Chen, Fan et al., 2020). Strong augmentations significantly change the structure of the image; this can mean geometric or non-geometric changes to the images (Wang & Qi, 2023). Work on the automated search for suitable augmentations (Cubuk et al., 2019) also uses the combination of strong and weak augmentations to achieve better results.

3. Methods

As a basis of our approach we use a contrastive learning method based on SimCLR. The representations are learned by maximizing the agreement between augmented image data derived from the same origin data, using a contrastive loss in a latent vector space. The original framework consists of four main components:

1. A module for stochastic data augmentation in which two correlated augmented images, x̃_i and x̃_j, are generated for each image x. These are regarded as a positive pair.
2. A base encoder f(·) with the common ResNet-18 (He et al., 2016a) architecture, which performs the extraction of representations from the augmented images. These extracted representations or the weights of the model are used to train the downstream task.
3. A projection head g(·), which is a multilayer perceptron (MLP) with one hidden layer and a ReLU activation function.
4. A contrastive loss function which is defined for a contrastive prediction task.

In this work, components 1 (data augmentations) and 3 (projection head g(·)) were altered compared to the original version.

• The original augmentations are extended in various experiments. A distinction was made between weak and strong augmentations. The weak and strong augmentations used are explained in more detail in Section 3.2.
• In SimCLRv2 (Chen, Kornblith, Swersky et al., 2020), the authors show that larger MLPs can significantly improve the results. Therefore, we use an MLP with four hidden layers and also a ReLU activation function.

The training pipeline is shown in Fig. 1. For each image x, we generate two augmentations x̃_i and x̃_j, whereby x may already have been strongly augmented in the CLRiuS approach. The similarity of x̃_i and x̃_j, which are represented as 1-dimensional vectors, is maximized, while the similarity to the other images of the cycle is minimized.

The extraction of the representations is performed by a base encoder network f(·). A ResNet-18 convolutional neural network is used for this task, where h_i and h_j represent the output. The projection head g(·) transforms the representation into a space in which we apply the contrastive loss to compare the similarities between the vectors. In contrast to the original architecture, we expand the hidden dimensions of the MLP by a factor of four. ReLU is also used as the activation function of the hidden layers. To maximize the similarity of the two representations z_i and z_j, the following loss function is used:

$$l_{i,j} = -\log \frac{\exp\left(\operatorname{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\operatorname{sim}(z_i, z_k)/\tau\right)}$$

where sim is the cosine similarity $\operatorname{sim}(z_i, z_j) = \frac{z_i^{\top} z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}$ and τ is the temperature parameter. Maximum similarity is achieved for a cosine of 1, maximum disagreement for a cosine of −1. In our approach, the minimum usually converges to 0 in the case of two different images.

With self-supervised methods, two tasks are usually performed: first, a pretext task on a larger unlabeled dataset to learn the representations, and then an application-specific downstream task with less labeled data. After fully training the model with the self-supervised learning approach on the DOES dataset (Schäfer & Faltings, 2023) without using the labels, we apply a supervised logistic regression model as a downstream task. This task is then performed with annotated data from the DOES dataset. The DOES dataset is described in more detail in Section 3.1. It has been shown that the learned representation of the projection head g(·) produces inferior results compared to that of the base encoder network f(·). The projection head g(·) can remove information that is required for the respective downstream task; this can be, for example, the color or orientation of the object (Chen, Kornblith, Norouzi et al., 2020). Therefore only the frozen results from the encoder network f(·) are used in the linear classifier to show the performance of the learned representations.

3.1. Dataset

Our approach and experiments are based on the freely available DOES dataset (Schäfer & Faltings, 2023), created by Faltings and Schäfer. DOES is a multimodal dataset for supervised and unsupervised analysis of steel scrap, collected with several cameras of different resolutions at different scrap yards, and which uses a special technique for extracting many overlapping rectangles (tiles) from raw images (Schäfer et al., 2023). This tiling technique consists of dividing a raw input image into a set of overlapping tiles, with the tile size chosen to maximize the number of resultant tiles while keeping the characteristics of each class still visible. The fixed overlap is chosen from within a certain range such that, for the used raw input data taken from cameras with different resolutions, no unnecessarily large section of the initial image gets lost. Finally, the extracted tiles are rescaled to a size of 256 × 256 pixels. Image quality of the dataset specimens varies, due to different resolutions, but also due to e.g. motion blur. However, there are no systematic differences in image quality between classes. Fig. 2 shows an overview of the dataset and the respective scrap classes. The raw images and the tiles in the dataset have a size of 256 × 256 pixels. The DOES dataset contains images of eight types of scrap and diverse background images:

• Background → background images, e.g. soil, dirt, sky, buildings
• E1 → used thin steel scrap
• E2 → new thick production steel scrap
• E3 → used thick steel scrap
• E5H → homogeneous lots of carbon steel turnings
• E6 → new thin production steel scrap, compressed or firmly baled
• E8 → new thin production steel scrap, non-baled
• E40 → shredded steel scrap
• EHRB → old and new steel scrap consisting mainly of rebars and merchant bars
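The contrastive loss above can be sketched in NumPy. The batch layout assumed here — rows 2k and 2k+1 of the embedding matrix holding the two augmented views of the same image — is a convention chosen for illustration, and `nt_xent_loss` is our own name, not code from the paper:

```python
import numpy as np

def nt_xent_loss(z, tau=0.07):
    """SimCLR-style contrastive loss over a batch of 2N projected embeddings.

    z   : array of shape (2N, d); rows 2k and 2k+1 form a positive pair
          (assumed batch layout for this sketch).
    tau : temperature parameter, defaulting to the 0.07 used in the paper.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit rows -> dot = cosine
    sim = (z @ z.T) / tau                              # sim(z_i, z_k) / tau
    np.fill_diagonal(sim, -np.inf)                     # implements the 1_[k != i] term
    pos = np.arange(len(z)) ^ 1                        # index of each row's partner
    log_prob = sim[np.arange(len(z)), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()
```

With identical positive pairs and dissimilar negatives the loss approaches 0, and it grows when the paired views do not match, mirroring the convergence behavior described above.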
Fig. 2. Overview of the different classes (E1, E2, E3, E5H, E6, E8, E40 and EHRB) of the DOES scrap image dataset.
Table 1 (Schäfer et al., 2023) shows the statistics of the used dataset. DOES contains a different number of raw images and tiles in the training and test data set. The data set is unbalanced, which the authors attribute to the different use and frequency of the respective scrap types in quality steel production. It can be difficult to classify the different types of scrap, as some class-pairs differ considerably in terms of shape or dimension, while others can e.g. be very similar in shape and only differ in dimension, such as E1 and E3. In some cases, individual items can even look identical although belonging to different classes, e.g. old or new rails (E3 resp. E2). Other class-pairs such as E2 and E6 are easily distinguishable.

The dataset only contains real images taken with cameras or extracted from videos; no synthetic images were used in the DOES dataset. Only the tiles (105 523 items) from the training dataset were used to train the self-supervised approach. CLRiuS was tested on either the tiles (8131 items) or the raw images (176 items) from the test dataset. The images from the test data set were taken independently of the training data set and are non-redundant with it (Schäfer et al., 2023). This ensures that no scenes or tiles occur in both data sets and that the results of the experiments are valid.

3.2. Experimental setup

During the implementation and for better assessment of the CLRiuS approach, we conducted various experiments with different parameters and settings, as detailed in Table 2. In particular, the impact of augmentations of the input images on the SSL training was evaluated. The working hypothesis is that strong augmentations particularly help with learning to extract the fine-grained structures typical for classes such as E40 or E5H, and that more weak augmentations should encourage an overall wider range of features to be extracted, encoding more aspects of the input image.
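The overlapping-tile extraction behind DOES can be illustrated with a minimal sketch. The tile size and overlap below are hypothetical placeholder values, and border handling is simplified; this is not the dataset's actual tile-size and overlap selection procedure:

```python
import numpy as np

def extract_tiles(image, tile=256, overlap=64):
    """Cut a raw image into overlapping square tiles (simplified sketch).

    Tiles are taken on a regular grid with stride (tile - overlap); border
    regions that do not fit a full tile are dropped, a simplification of
    the overlap-selection rule described in the paper.
    """
    step = tile - overlap
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, max(h - tile, 0) + 1, step):
        for x in range(0, max(w - tile, 0) + 1, step):
            tiles.append(image[y:y + tile, x:x + tile])
    return tiles
```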
Table 1
DOES statistics: Number of raw image instances and tiles per class and in total, both for the train set and the test set (BG — Background) (Schäfer et al., 2023).

Class   Train set                           Test set
        No. of raw instances  No. of tiles  No. of raw instances  No. of tiles
BG      0                     1 951         0                     0
E1      1026                  18 236        36                    1628
E2      232                   10 175        14                    642
E3      1358                  21 254        38                    1833
E5H     16                    856           14                    717
E6      1170                  8 841         26                    1069
E8      1466                  39 496        20                    927
E40     6                     317           14                    680
EHRB    746                   4 397         14                    635
Total   6020                  105 523       176                   8131

In all experiments, the SSL training was conducted with ResNet-18 (He et al., 2016a) as the backbone model, batch size 256, learning rate 5·10⁻⁴ maintained over all epochs, temperature 0.07, weight decay 10⁻⁴, 4·128 hidden dimensions, and a maximum of 500 epochs, with the best checkpoint saved during training.

The best checkpoint in the SSL task was selected according to the top-5 cosine similarity on the test set. No extra validation set was used at this stage: since not the SSL task itself but only the downstream task is evaluated as the actual performance metric, neither an additional validation set for checkpoint determination nor a separate test set for a fair evaluation of the SSL-phase model performance was used. To validate the working hypothesis mentioned above, a downstream task with a fixed model architecture was used and the performance of the resultant models on the individual classes was evaluated. Regarding the differences in classification results between different classes, one possible explanation aside from the quality of the extracted features in the SSL task could be systematic differences in image quality between the different classes. However, the DOES dataset provides a comparable spread of image qualities for all classes, making it indeed possible to gather insights on the quality of the SSL-task feature extraction via this approach. For the downstream task, a logistic regression was trained on the features extracted from the input images via the SSL-trained backbone, with Adam optimizer with weight decay, multistep learning rate, γ = 0.1, batch size 64 and over 100 epochs. For the test set, the full raw or tiled DOES test set was used, depending on the experiment. The motivation for this was to evaluate to a certain extent how well the SSL pre-trained and LR-finetuned models generalize to new data different from the data trained on.

For the downstream task train set, differently sized subsets of the tiled DOES train set were used, with the number of images per label chosen as min(n, |DOES_given label|), n ∈ {10, 20, 50, 100, 200, 500, |DOES|}, and the corresponding images per label chosen randomly from all images in the DOES train set with the given label. This random selection was done before any experiments were conducted and respective results collected, to ensure an unbiased foundation for the experiments. The aim of this sub-sampling is to evaluate whether classification results for the downstream task for n < |DOES| are at least comparable to using the full training set for the downstream task, demonstrating the viability of the SSL approach for dealing with a lack of annotated data while still gaining competitive accuracy results. As visible in Table 1, except for n = |DOES| or E40 and n = 500, this in particular results in a balanced train set for the downstream task, without under- or over-represented classes.

For the weak augmentations, experiments with three different compilations were performed:

• A — random resized crop, random color jitter, and gaussian blur (augmentations which are used in the original SimCLR framework)
• B — random resized crop, random color jitter, random horizontal flip, random grayscale and gaussian blur
• C — random resized crop, random color jitter, random horizontal flip, random grayscale, random sharpen, random clahe, random equalize, random piecewise affine, random downscale, and gaussian blur

The motivation for compilation B of the 5 weak augmentations was given by the original implementation of the SimCLR model (Chen, Saxena et al., 2020), which does not feature the three augmentations of compilation A cited in the original paper; instead, all five augmentations of compilation B are implemented. Finally, in compilation C, additional weak augmentations that seemed promising for encouraging the learning of structures and textures in images were added. The strong augmentations, where used, are otsu, canny edge, gaussian blur followed by sobel and otsu, chan vese, and emboss. At random, one of these or none was chosen for transforming the input image: either with equal probability of 1/6 for each of the six choices otsu, canny edge, gaussian blur followed by sobel and otsu, chan vese, emboss, or none (experiments III, VIII), or with a probability of 1/2 for none and of 1/10 for each of the other 5 possibilities (experiments IV, V, IX, X). All strong augmentations except for emboss require grayscale images, such that wherever one of these was used, the resultant augmented image is grayscale rather than colored. The motivation for these two weighting strategies was to evaluate in an experimental setup whether an equal distribution of differently strongly augmented images in the training setup might be too severe a distortion for a successful learning process, or whether the more frequent appearance of strongly augmented images helps learning more suitable features.

For evaluating model performance on the downstream task, top-1 overall accuracy and per-class accuracy of the trained logistic regression model on the DOES test set were used. As the proportions of the different scrap classes in the DOES dataset reflect the frequency of usage of the different scrap classes in a steel mill, the overall accuracy combined with a per-class accuracy was deemed more meaningful than a balanced accuracy as a suitable metric from a production site point of view, motivating this choice of metrics. Additionally, per-class F1-score, precision and recall metrics were calculated, and the confusion matrices for the model with the highest overall accuracy per experiment were analyzed in Section 4. Whether a false positive or a false negative for a particular class is worse than the other cannot be decided per se for the scrap classification task: if an item is misclassified, it depends on which other class it is mistaken for how bad this is. Some classes are more similar to one another from a metallurgic point of view than others, meaning the confusion of one for the other leads to fewer problems in the downstream tasks than for other classes. For classes that are metallurgically dissimilar, a confusion of one for the other could lead to ruining an entire batch of steel in production. For metallurgically similar classes, on the other hand, differences in classes rather concern market price and handling. This also motivates a detailed analysis of the confusion matrices in Section 4.

The results of the experiments are presented in Section 4.

4. Results and discussion

In order to better relate the SSL results, they are compared to the DOES baseline model presented in the original DOES paper, see Table 3. This DOES baseline model was trained in a supervised fashion, with a non-pretrained PreActResNet18 (He et al., 2016c), over 50 epochs with batch size 32, as detailed in the original DOES paper.

An overview of the performance of the individual experiments for the best respective saved checkpoint is given in Table 3, providing overall and per-class accuracy measures, and Table 4, providing additional per-class F1-score, precision and recall metrics.

The best performing model by overall accuracy on the tiled test set is model 4 in experiment IV, followed closely by model 2 in experiment II, as visible in Table 3, and model 2 on the raw test set in experiment VII. Generally, it is notable in Tables 3 and 4 that for all the trained models the performance on the raw test set is better than on the tiled
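The two weighting schemes for choosing a strong augmentation per input image can be sketched as follows. `pick_strong` and the short transform names are illustrative stand-ins for the actual otsu, canny edge, etc. implementations, not the paper's code:

```python
import random

# Stand-in names for the five strong transforms described above.
STRONG = ["otsu", "canny_edge", "blur_sobel_otsu", "chan_vese", "emboss"]

def pick_strong(equal_weights=True, rng=random):
    """Draw the strong augmentation to apply to one image, or None."""
    options = STRONG + [None]
    if equal_weights:                         # experiments III, VIII: 1/6 each
        weights = [1 / 6] * 6
    else:                                     # IV, V, IX, X: 1/2 none, 1/10 each
        weights = [1 / 10] * 5 + [1 / 2]
    return rng.choices(options, weights=weights, k=1)[0]
```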
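The per-label sub-sampling of the downstream-task train set (min(n, |class|) images per label, drawn once at random) can be sketched as below; `subsample_per_label`, its argument names and the fixed seed are hypothetical illustrations of the one-off random selection described above:

```python
import random
from collections import defaultdict

def subsample_per_label(samples, n, seed=42):
    """Pick min(n, |class|) items per label from (item, label) pairs."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)
    subset = []
    for label, items in by_label.items():
        chosen = rng.sample(items, min(n, len(items)))  # without replacement
        subset.extend((item, label) for item in chosen)
    return subset
```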
Table 2
Conducted experiments (WA — Weak Augmentations, SA — Strong Augmentations, DTrain — Downstream task train set, DTest — Downstream
task test set).
Experiment Model WA SA Epochs Batch size Backbone model DTrain DTest
I 1 A No 500 256 ResNet-18 Tiles Tiles
II 2 B No 500 256 ResNet-18 Tiles Tiles
III 3 B Yes, equally weighted 500 256 ResNet-18 Tiles Tiles
IV 4 B Yes, nonequally weighted 500 256 ResNet-18 Tiles Tiles
V 5 C Yes, nonequally weighted 500 256 ResNet-18 Tiles Tiles
VI 1 A No 500 256 ResNet-18 Tiles Raw
VII 2 B No 500 256 ResNet-18 Tiles Raw
VIII 3 B Yes, equally weighted 500 256 ResNet-18 Tiles Raw
IX 4 B Yes, nonequally weighted 500 256 ResNet-18 Tiles Raw
X 5 C Yes, nonequally weighted 500 256 ResNet-18 Tiles Raw
Table 3
Overall accuracy (O. Acc.) and per-class accuracy in % on test set for best saved checkpoint per model in SSL training (Ckpt — epochs for best
checkpoint in SSL training, MIL — Maximal number of images per label for best downstream task performance).
Experiment Ckpt MIL O. Acc. E1 E2 E3 E40 E5H E6 E8 EHRB
DOES baseline – – 68.09 57.80 32.24 89.20 48.97 47.98 82.13 86.62 62.20
I 477 500 64.20 59.40 55.14 82.71 4.12 43.24 80.17 78.21 72.91
II 482 10 75.12 69.53 76.79 86.52 29.85 55.65 85.31 91.69 83.94
III 494 200 73.72 58.05 82.24 85.38 23.68 58.58 97.29 82.63 89.61
IV 477 10 75.89 87.16 71.50 80.41 10.88 71.69 90.65 84.25 75.75
V 438 500 72.23 54.24 78.82 84.51 8.97 64.02 92.70 90.83 91.65
VI 477 200 78.98 75.00 57.14 81.58 42.86 57.14 100.00 95.00 100.00
VII 482 500 90.34 94.44 78.57 84.21 92.86 64.29 100.00 100.00 100.00
VIII 494 |DOES| 86.93 100.00 71.43 81.58 71.43 42.86 100.00 100.00 100.00
IX 477 100 87.50 83.33 78.57 73.68 85.71 92.86 100.00 100.00 100.00
X 438 50 87.50 100.00 85.71 76.32 50.00 71.43 100.00 100.00 100.00
Table 4
F1-score, recall and precision on the test set for the best saved checkpoint per model in SSL training (MIL — Maximal number of images per label for best downstream task performance; |DOES| — the entire DOES tiled train set).

Experiment      Metric     MIL     E1    E2    E3    E40   E5H   E6    E8    EHRB
DOES baseline   F1         –       0.63  0.40  0.71  0.62  0.58  0.87  0.71  0.75
                Recall     –       0.58  0.32  0.89  0.49  0.48  0.82  0.87  0.62
                Precision  –       0.70  0.53  0.59  0.86  0.72  0.93  0.59  0.96
I               F1         500     0.63  0.55  0.68  0.08  0.44  0.85  0.71  0.79
                Recall     500     0.59  0.55  0.83  0.04  0.43  0.80  0.78  0.73
                Precision  500     0.66  0.55  0.58  1.00  0.46  0.91  0.65  0.87
II              F1         10      0.67  0.77  0.78  0.44  0.60  0.91  0.91  0.89
                Recall     10      0.70  0.77  0.87  0.30  0.56  0.85  0.92  0.84
                Precision  10      0.65  0.77  0.70  0.85  0.65  0.98  0.90  0.94
III             F1         200     0.69  0.74  0.73  0.37  0.63  0.83  0.84  0.93
                Recall     200     0.58  0.82  0.85  0.24  0.59  0.97  0.83  0.90
                Precision  200     0.85  0.67  0.64  0.83  0.68  0.73  0.86  0.96
IV              F1         10      0.79  0.70  0.77  0.19  0.73  0.87  0.80  0.86
                Recall     10      0.87  0.71  0.80  0.11  0.72  0.91  0.84  0.76
                Precision  10      0.72  0.68  0.73  0.86  0.74  0.83  0.76  0.99
V               F1         500     0.64  0.76  0.72  0.16  0.61  0.89  0.82  0.94
                Recall     500     0.54  0.79  0.85  0.09  0.64  0.93  0.91  0.92
                Precision  500     0.77  0.73  0.62  0.92  0.59  0.86  0.75  0.96
VI              F1         200     0.77  0.48  0.83  0.60  0.59  0.96  0.86  1.00
                Recall     200     0.75  0.57  0.82  0.43  0.57  1.00  0.95  1.00
                Precision  200     0.79  0.42  0.84  1.00  0.62  0.93  0.79  1.00
VII             F1         500     0.94  0.71  0.86  0.84  0.75  1.00  1.00  1.00
                Recall     500     0.94  0.79  0.84  0.93  0.64  1.00  1.00  1.00
                Precision  500     0.94  0.65  0.89  0.76  0.90  1.00  1.00  1.00
VIII            F1         |DOES|  0.97  0.69  0.90  0.77  0.55  0.98  0.91  1.00
                Recall     |DOES|  1.00  0.71  0.82  0.71  0.43  1.00  1.00  1.00
                Precision  |DOES|  0.95  0.67  1.00  0.83  0.75  0.96  0.83  1.00
IX              F1         100     0.87  0.69  0.78  0.83  0.90  1.00  1.00  1.00
                Recall     100     0.83  0.79  0.74  0.86  0.93  1.00  1.00  1.00
                Precision  100     0.91  0.61  0.82  0.80  0.87  1.00  1.00  1.00
X               F1         50      0.97  0.73  0.87  0.61  0.69  1.00  1.00  1.00
                Recall     50      1.00  0.86  0.76  0.50  0.71  1.00  1.00  1.00
                Precision  50      0.95  0.63  1.00  0.78  0.67  1.00  1.00  1.00
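The per-class metrics reported in Table 4 follow directly from a confusion matrix. As a minimal sketch (with an illustrative 3-class matrix, not values from our tables), per-class precision, per-class recall (which, as discussed in the text, coincides with per-class accuracy) and F1 can be computed as:

```python
from typing import Dict, List

def per_class_metrics(cm: List[List[int]]) -> List[Dict[str, float]]:
    """Per-class precision, recall (= per-class accuracy) and F1 from a
    square confusion matrix with rows = true labels, cols = predictions."""
    n = len(cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true c, predicted as other
        fp = sum(cm[r][c] for r in range(n)) - tp  # other, predicted as c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics

# Illustrative 3-class matrix; 10 test images per true class.
cm = [[8, 1, 1],
      [2, 6, 2],
      [0, 2, 8]]
m = per_class_metrics(cm)
print(m[0])  # class 0: precision 8/10 = 0.8, recall 8/10 = 0.8
```

Note that high precision on a class says nothing about its recall (class 1 above has precision 6/9 but recall only 6/10), which mirrors the observation below that per-class precision alone is no reliable indicator of F1.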
M. Schäfer et al. Machine Learning with Applications 17 (2024) 100573
Table 5
Confusion matrix for the DOES baseline model (Schäfer et al., 2023) on tiled test set (BG — Background).

Table 6
Confusion matrix for model 1 on tiled test set (BG — Background).

test set. On the one hand, this indicates that the task on the raw test set is presumably easier. On the other hand, all models were trained – for the SSL as well as the downstream task – on a tiled train set. Thus one would expect the performance to be better on a test set more similar to the train set, such as the tiled test set, than on a less similar one, such as the raw test set. That the reverse is the case indicates that the tiling process for the dataset generation in DOES is very good at preserving the characteristics of a class per dataset instance while greatly increasing the overall number of dataset items, and, in particular, that the SSL pre-trained models seem to extract very effective and well-generalizing features from the input images without overfitting to the particular training data. This seems very promising for the application of SSL pre-trained CV models in industrial settings, as these settings are often characterized by non-static conditions and the need to quickly adapt to new circumstances. Also notable is that all models except model 1 perform better overall on the tiled test set than the baseline model from the DOES paper (Schäfer et al., 2023), indicating the viability of the SSL approach for achieving competitive performance results as compared to supervised approaches.
In particular, as Schäfer et al. (2023) write, the baseline model from the DOES paper, which was not pre-trained before finetuning on DOES, exceeds a pre-trained ResNet50 (He et al., 2016b) finetuned on DOES in terms of performance; consequently, the SSL pre-trained models presented in this paper exceed a pre-trained ResNet50 as well. As the performance of the baseline model on the raw test set is not reported in the original DOES paper, and the focus of this paper lies on the SSL approach and the effect of different augmentations,
no comparison of performances between the baseline model and our models on the raw test set is presented. It is also noteworthy that for almost all the experiments, and in particular also for the overall best-performing models, the best results in the downstream task were achieved with less than the entire DOES tiled train set. This shows the effectiveness of the SSL approach for avoiding the costs of acquiring large amounts of annotated data without having to accept subpar model performance in return.

Table 7
Confusion matrix for model 1 on raw test set (BG — Background).
When comparing the per-class F1-score, precision and recall metrics
for the individual models on both the raw and the tiled test set in
Table 4, it is notable that models that have high per-class accuracy on
a class (which in this case is the same as per-class recall) also tend to
perform well on the respective per-class F1-score and precision metric.
On the other hand high per-class precision is no reliable indicator of
high F1-score or recall on the respective class. This shows that relying on the accuracy metric as the relevant indicator is a valid approach for model evaluation in the conducted experiments. For this reason, the more detailed discussion of results for the individual models will not compare all three metrics (F1-score, precision, recall) yet again for each model but rather concentrate on the accuracy as the relevant metric, as well as on confusion matrices, to evaluate the models.

In the following, a more detailed discussion and comparison of the individual results is presented per model.

4.1. Baseline model

The confusion matrix in Table 5 for the baseline model is given for the sake of allowing a more detailed comparison with the SSL pre-trained models discussed in the following sections.

4.2. Model 1

Model 1 from the original SimCLR framework does not reach the performance of the supervised baseline model from the original DOES paper; as visible in Table 3, Figs. 3, 4 and in the confusion matrices Tables 5, 6, the performance on the weakest classes E40 and E5H in particular is far lower for all values of MIL. Even the performance on the raw test set, as visible in Table 3, Fig. 4 and the confusion matrix Table 7, is only marginally better than the performance of the baseline model on the tiled test set, and considerably lower than that of any of the other models in the performed experiments. This seems to indicate that using only the 3 augmentations from the original SimCLR framework as detailed in Section 3.2 does not encourage the model to learn to extract the right features from the input images during the SSL phase for a good classification performance in the downstream task. While the performance on classes with larger structures, such as e.g. large cubes for E6, is comparable to the baseline model, classes with very fine structures, such as the aforementioned E40 and E5H, which are also particularly different from the object-centric images in ImageNet for which the original framework was developed, seem to be represented especially poorly by the extracted features, resulting in the very low performance of the model on these classes.

4.3. Model 2

Model 2 shows considerably better performance on the raw as well as the tiled DOES test set as compared to the baseline model from the DOES paper as well as compared to model 1, as visible in Table 3, Figs. 5, 6 and the confusion matrices Tables 8, 9. The overall higher accuracy of model 2 as compared to model 1 indicates that more augmentations of the input images during the SSL training are beneficial for learning meaningful features for the classification of the ‘‘stuff’’-like scrap images. Aside from the overall better performance, there is a notable difference in how classes are misclassified, if they are misclassified,
Fig. 3. Overall accuracy and per-class accuracy on tiled test set for model 1 (Exp. I) (MIL — Maximal number of images per label for downstream task, OA — Overall Accuracy).
Fig. 4. Overall accuracy and per-class accuracy on raw test set for model 1 (Exp. VI) (MIL — Maximal number of images per label for downstream task, OA — Overall Accuracy).
Fig. 5. Overall accuracy and per-class accuracy on tiled test set for model 2 (Exp. II) (MIL — Maximal number of images per label for downstream task, OA — Overall Accuracy).
Fig. 6. Overall accuracy and per-class accuracy on raw test set for model 2 (Exp. VII) (MIL — Maximal number of images per label for downstream task, OA — Overall Accuracy).
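For reference, the contrastive objective underlying the SimCLR-style SSL training discussed above is the NT-Xent (normalized temperature-scaled cross entropy) loss (Chen, Kornblith, Norouzi, & Hinton, 2020). The following is a minimal pure-Python sketch on toy 2-D embeddings; the batch layout and values are illustrative and not taken from our actual training code:

```python
import math

def nt_xent(z, temperature=0.5):
    """NT-Xent loss over a batch of 2N embeddings, where z[2k] and z[2k+1]
    are the two augmented views of image k. All other 2N-2 embeddings in
    the batch act as negatives for each anchor."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    n = len(z)
    loss = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1  # index of the positive view
        pos = math.exp(cos(z[i], z[j]) / temperature)
        denom = sum(math.exp(cos(z[i], z[k]) / temperature)
                    for k in range(n) if k != i)
        loss += -math.log(pos / denom)
    return loss / n

# Two images, two views each; views of the same image point in similar
# directions, so the loss is low.
views = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
good = nt_xent(views)
# Pairing dissimilar embeddings as positives yields a much higher loss.
bad = nt_xent([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]])
print(good, bad)
```

The loss drops only when the two augmented views of the same image map to nearby embeddings, which is why the choice of augmentations directly shapes which image characteristics the encoder learns to preserve.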
Table 8
Confusion matrix for model 2 on tiled test set (BG — Background).

Table 9
Confusion matrix for model 2 on raw test set (BG — Background).

Table 10
Confusion matrix for model 3 on tiled test set (BG — Background).

Table 11
Confusion matrix for model 3 on raw test set (BG — Background).
Table 12
Confusion matrix for model 4 on tiled test set (BG — Background).

between the supervised baseline model and the SSL pre-trained models 1 and 2. Especially for the class E5H, the baseline model misclassified a considerable proportion as E1 on the tiled test set, as visible in confusion matrix Table 5, whereas both model 1 and model 2 misclassified far more in comparison as BG instead. It seems that the features learned in the SSL setting are actually different from the features learned in the supervised setting. When looking at the E5H examples in the tiled test set, a misclassification as BG is indeed easily understandable and seems more ‘‘natural’’ than a misclassification as E1. As shown in Fig. 7, at first sight, some examples of E5H do look almost like dirt or soil from the scrap yard. The tendency of how things are misclassified, if they are misclassified, is similar for model 1 and model 2, both on the tiled and the raw test set, as visible in the confusion matrices Tables 6, 7, 8, 9, except that model 2's overall performance is better. When regarding the evolution of the performance both on the raw and tiled test set for differently sized subsets of the train set for the downstream task, as detailed in Figs. 5, 6, it is noticeable that the performance on both E5H and E40 drops considerably for the full DOES, especially for the tiled test set, which is explainable by the skewed, unbalanced nature of the DOES dataset, which comes into effect when not restricting the downstream training to mostly balanced subsets of DOES' training set. Compared to model 1 though, it is notable that the overall accuracy for model 2 remains in a far tighter range both for the raw and tiled test set for differently sized subsets of the train set, indicating that the features learned in the more sophisticated SSL phase of model 2 are more suitable for differentiating the different classes as compared to model 1, meaning a near-optimal performance in the downstream task can be reached with considerably less annotated training data required for the downstream task. This again also shows the validity of an SSL approach for dealing with a lack of annotated training data in CV tasks.

4.4. Model 3

Compared to the baseline model from the original DOES publication, the performance is improved, but compared to model 2 without additional strong augmentations, there is a slight decline in overall performance, as visible in Table 3. However, when going into detail and comparing the performance on different classes as given in Figs. 5, 6, 8, 9 and the confusion matrices Tables 8, 10, one can notice an improvement on the tiled test set on the classes E40 and E5H. These classes are characterized by particularly fine structures and a particularly non-object-like quality, supporting the theory that using the strong augmentations during the SSL pre-training should help extract the type of features characterizing very finely structured, non-object-like classes. This is also not contradicted by the better performance on the raw test set of model 2 on the classes E40 and E5H as compared to model 3, as visible in the confusion matrices Tables 9, 11 and Figs. 6, 9, as the larger raw test set images also provide more less-fine-grained structures differentiating the different classes than the tiled images. Interestingly, as visible in the confusion matrix Table 10, model 3 also misclassifies tiled E40 or E5H far more seldom as BG than model 2, which could also be attributed to the effect of the strong augmentations on the features learned during the SSL phase. Similarly as described for model 2 in Section 4.3 and as visible in Figs. 8, 9, the range of the overall accuracy for differently sized subsets of the train set for the downstream task for model 3 is comparably narrow. This shows the suitability of the extracted features from the SSL phase for reaching near-optimal performance in the downstream task with only very little annotated training data.

4.5. Model 4

The overall performance of model 4 on the tiled test set is the best of all the models evaluated in this paper, as visible in Table 3, and only exceeded by model 2 on the raw test set. This seems to indicate that,
Fig. 8. Overall accuracy and per-class accuracy on tiled test set for model 3 (Exp. III) (MIL — Maximal number of images per label for downstream task, OA — Overall Accuracy).
Fig. 9. Overall accuracy and per-class accuracy on raw test set for model 3 (Exp. VIII) (MIL — Maximal number of images per label for downstream task, OA — Overall Accuracy).
Fig. 10. Overall accuracy and per-class accuracy on tiled test set for model 4 (Exp. IV) (MIL — Maximal number of images per label for downstream task, OA — Overall Accuracy).
Fig. 11. Overall accuracy and per-class accuracy on raw test set for model 4 (Exp. IX) (MIL — Maximal number of images per label for downstream task, OA — Overall Accuracy).
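The MIL-restricted train subsets used throughout the downstream experiments can be illustrated with a small sketch: the number of images per label is capped at MIL, which also counteracts the unbalanced class distribution of DOES. The helper name and the toy data below are hypothetical, not part of the DOES tooling:

```python
import random
from collections import defaultdict

def subsample_by_mil(samples, mil, seed=0):
    """Cap the number of images per label at MIL (maximal number of
    images per label). `samples` is a list of (image_id, label) pairs;
    a fixed seed keeps the subset reproducible."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for image_id, label in samples:
        by_label[label].append(image_id)
    subset = []
    for label, ids in sorted(by_label.items()):
        rng.shuffle(ids)
        subset.extend((i, label) for i in ids[:mil])
    return subset

# Unbalanced toy data: 30 E1 images but only 5 E40 images.
data = ([(f"img{i}", "E1") for i in range(30)]
        + [(f"img{i + 100}", "E40") for i in range(5)])
subset = subsample_by_mil(data, mil=10)
counts = {}
for _, lab in subset:
    counts[lab] = counts.get(lab, 0) + 1
print(counts)  # E1 capped at 10, E40 keeps all 5
```

Classes with fewer than MIL images simply keep everything, so the resulting subsets are mostly, but not perfectly, balanced, matching the description of the downstream training sets in the text.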
Table 13
Confusion matrix for model 4 on raw test set (BG — Background).

Table 14
Confusion matrix for model 5 on tiled test set (BG — Background).

Table 15
Confusion matrix for model 5 on raw test set (BG — Background).

viewed over all classes, a less frequent use of the strong augmentations in the SSL phase can encourage the learning of features well suited to differentiate the different classes of scrap. However, when regarding the class E40, the performance of model 4 on this class has dropped as compared to model 3, so possibly this class in particular profits from the strong augmentations in terms of classification accuracy, whereas the kind of features these encourage the model to learn are less beneficial for the other classes, although not detrimental either. As already explained in Section 4.4, the higher overall accuracy of model 2 on the raw test set does not contradict these findings, as the structures provided in the larger raw images are also at least partially less fine-grained, meaning the impact of encouraging the learning of fine features in the SSL phase does not provide the same support for the performance on the downstream task as on the tiled test set. When comparing Figs. 8, 9, 10, 11, and also the confusion matrices Tables 10, 12, 11, 13, it is notable that the overall performance for the differently sized subsets of the DOES tiled train set is actually quite similar between models 3 and 4, with the exception of the MIL = 10 case on the tiled test set for model 4, which has a higher overall accuracy; the accuracy on individual classes is generally slightly higher on classes E1, E2, E40, EHRB for model 3 and higher on classes E3, E5H, E8 for model 4. Thus, the less frequent use of the strong augmentations can possibly encourage the learning of features better suited for differentiating some classes and less well suited for the differentiation of other classes, but this effect is subtle. Again, as for models 2 and 3, the performance range for differently sized train sets for the downstream task is considerably more narrow than for model 1, indicating again that the more sophisticated SSL pre-training has a positive effect on the amount of annotated training data required for near-optimal performance in the downstream task.

4.6. Model 5

Contrary to the initial hypothesis when designing the experiments, the additional weak augmentations did not help model 5 to learn more meaningful features for the differentiation of the individual scrap classes as compared to the models 2, 3, or 4, as can be seen in Table 3 and Figs. 5, 6, 8, 9, 10, 11, 12, 13. While the overall accuracy is lower on the tiled test set, the performance on the raw test set is fairly similar. When regarding per-class performance on the tiled test set, the accuracy on some classes such as E5H or E8 is very similar to e.g. model 4, while in particular the performance on class E40 is significantly lower. For the raw test set, the performance on E40 and E5H is a bit lower, but on the other classes it is often comparable and sometimes better, such as e.g. on E2. When regarding the confusion matrix for the tiled test set, Table 14, it is particularly notable, when compared to the confusion matrix of e.g. model 4, Table 12, how far more misclassifications of other classes as E1 occur for model 5 on the tiled test set. It appears that the nature of the features learned in the SSL task for model 5 with the additional weak augmentations does discourage confusions of e.g. E5H for E6, but encourages confusions with E1, the most heterogeneous of the scrap classes. Possibly the additional augmentations provide too severe distortions of the input images for some classes, especially E40, bridging the gap in terms of representations to the most heterogeneous class E1 as well as to E5H, the further class featuring very small structures, and thus resulting in this increase of the respective misclassifications. The individual effect of the added augmentations on this could be investigated in more detail in further research.
Fig. 12. Overall accuracy and per-class accuracy on tiled test set for model 5 (Exp. V) (MIL — Maximal number of images per label for downstream task, OA — Overall Accuracy).
Fig. 13. Overall accuracy and per-class accuracy on raw test set for model 5 (Exp. X) (MIL — Maximal number of images per label for downstream task, OA — Overall Accuracy).
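The kind of misclassification comparison made in the following discussion (e.g., whether a class is predominantly confused with E1 or with BG) can be read directly off a confusion matrix. A small sketch with purely illustrative counts, not values from our tables:

```python
def dominant_confusion(cm, classes):
    """For each class, return the class it is most often misclassified as,
    or None if it is never misclassified. Rows of `cm` are true labels,
    columns are predictions."""
    result = {}
    for i, name in enumerate(classes):
        errors = [(count, classes[j])
                  for j, count in enumerate(cm[i]) if j != i]
        worst = max(errors, key=lambda t: t[0])
        result[name] = worst[1] if worst[0] > 0 else None
    return result

# Illustrative counts: E40 drifts towards E1, E5H towards BG.
classes = ["E1", "E40", "E5H", "BG"]
cm = [[9, 0, 0, 1],
      [4, 5, 1, 0],
      [1, 0, 7, 2],
      [0, 0, 0, 10]]
print(dominant_confusion(cm, classes))
# {'E1': 'BG', 'E40': 'E1', 'E5H': 'BG', 'BG': None}
```

Summarizing each row by its largest off-diagonal entry makes shifts in error structure between models, such as BG-confusions versus E1-confusions, easy to compare at a glance.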
Recalling the discussion in Section 4.2 on comparisons of the confusion matrices for model 1 and the baseline model, Tables 5, 6: there, the higher rate of misclassifications as class E1 for the baseline model was noted as compared to model 1, which misclassified images as BG instead, and this was particularly notable for class E5H. For model 5, the situation is different. For model 5, as discussed above and as visible in the confusion matrix Table 14, this concerns class E40 far more than E5H, and none of the other classes as strongly, so it seems valid to attribute this to the too strong distortions bridging a gap in representations rather than to a general failure of the SSL to extract more meaningful representations than the supervised approach of the baseline model. This explanation is further supported by the analysis of the confusion matrices for models 1, 2, 3, 4, Tables 6, 8, 10, 12. As discussed in Section 4.5, the strong augmentations added for models 3 and 4, and most frequently coming into play for model 3, seem to help extract features useful for discriminating class E40, whereas the additional weak augmentations of model 2 as compared to model 1 lead to an increase of misclassifications of E40 as E1.

The confusion matrix for the raw test set, Table 15, is fairly similar to the corresponding matrix for model 4, Table 13, as is to be expected from the similar performance metrics in Figs. 10, 11, 12, 13. This is also in alignment with the considerations on an explanation for the weaker performance on the tiled test set. As the raw test set images depict larger sections, a differentiation between E1 and the other classes becomes easier: E1 is fairly heterogeneous in detail, but the overall impression of a large chunk of E1 is more characteristic and dissimilar from the other classes.

As an overall conclusion, the models 2 and 4 show the best performance with regard to the overall accuracy, as visible in Table 3, with the per-class analysis and the overall performance on the tiled test set showing a favorable tendency for model 4. In particular, these self-supervised models perform better with regard to overall accuracy than the supervised baseline model, showing that a better performance can be achieved with less manual labeling effort. A more excessive use of either weak augmentations (model 5) or strong augmentations (model 3) results in a decline in overall accuracy, although a more aggressive use of strong augmentations can be beneficial if the to-be-discerned scrap mix contains high proportions of very finely structured material such as E40. The use of only very few augmentations, on the other hand (model 1), results in very poor overall accuracy, underperforming supervised approaches such as the baseline model. Concluding, a medium amount of augmentations, both weak and strong, seems to be the overall best choice for high overall accuracy.

Our study also has some limitations, which stem from the dataset used. The DOES dataset does not provide examples for the background class in the test set. No meta-data on the data collection site for individual dataset items is provided, and the class distribution is highly unbalanced. Future research should confirm these results by using other datasets from steel scrap studies or consider these points when creating new steel scrap datasets. Moreover, no k-fold splitting with an analysis of standard deviations was performed for the subsampling of train datasets in the downstream task, providing a limitation for the exact accuracy figures for the different models on these subsampled datasets. We present accuracy results for all differently sized subsets of the train set in the detailed discussion of results per model. Especially on the larger tiled test set, which is more suitable for quantitative analysis, a high consistency of accuracy result rankings between the different models is observable, indicating that the individual selection of subsampled train sets did not grossly influence the results. A k-fold splitting approach with standard deviation metrics would have been another, even more rigorous evaluation and could be pursued in further studies on this topic. The DOES dataset is fairly new and under-researched so far, providing little to compare our results to on that account.

5. Conclusion

Supervised machine learning methods have already been successfully established in industrial applications. In this work, we have shown
an alternative contrastive learning approach that outperforms proven supervised methods by using self-supervised pre-training and a simple supervised downstream task. The results of the various experiments have shown how strongly different augmentations influence the results. This means that huge labeled datasets are not always necessary, and industry can develop new applications significantly faster. This also allows a much faster response to process changes. A very interesting further application could be surface defects in various materials. For future work, it should be interesting to vary the different augmentations in their intensity or use auto-augmentation libraries. The next steps are to transfer this approach to other use cases or other intrinsically unordered datasets. Finetuning and changing the downstream task might be further interesting aspects that could be investigated too. However, the work has also shown that automatic visual inspection can be very difficult for classes such as E40. Depending on the area of application and external conditions, a combination with other technologies such as X-ray or LIBS, or the usage of multi-modal data combining color and depth information, could lead to significantly better results.

CRediT authorship contribution statement

Michael Schäfer: Conceived and designed this research and the final assembly and final quality control. Ulrike Faltings: Worked in concert with M.S., performed final quality control, industrial supervision. Björn Glaser: Academic supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The data used is publicly available. The link is provided in the manuscript.

Acknowledgments

This research has received funding from the European Union's Horizon-IA innovative program under grant agreement number 101058694. Open access funding provided by KTH - Royal Institute of Technology.

References

Azizi, S., Mustafa, B., Ryan, F., Beaver, Z., Freyberg, J., Deaton, J., Loh, A., Karthikesalingam, A., Kornblith, S., Chen, T., Natarajan, V., & Norouzi, M. (2021). Big self-supervised models advance medical image classification. In 2021 IEEE/CVF international conference on computer vision (pp. 3458–3468). http://dx.doi.org/10.1109/ICCV48922.2021.00346.
Bachman, P., Hjelm, R. D., & Buchwalter, W. (2019). Learning representations by maximizing mutual information across views. arXiv.org.
Balestriero, R., Ibrahim, M., Sobal, V., Morcos, A., Shekhar, S., Goldstein, T., Bordes, F., Bardes, A., Mialon, G., Tian, Y., Schwarzschild, A., Wilson, A. G., Geiping, J., Garrido, Q., Fernandez, P., Bar, A., Pirsiavash, H., LeCun, Y., & Goldblum, M. (2023). A cookbook of self-supervised learning. arXiv.org.
Baumert, J. C., Picco, M., Weiler, C., Wauters, M., Albart, P., & Nyssen, P. (2008). Automated assessment of scrap quality before loading into an EAF. Archives of Metallurgy and Materials, 53(2), 345–351.
Bountos, N. I., Papoutsis, I., Michail, D., & Anantrasirichai, N. (2022). Self-supervised contrastive learning for volcanic unrest detection. IEEE Geoscience and Remote Sensing Letters, 19, 1–5.
Caesar, H., Uijlings, J., & Ferrari, V. (2018). COCO-stuff: Thing and stuff classes in context. In 2018 IEEE/CVF conference on computer vision and pattern recognition (pp. 1209–1218). http://dx.doi.org/10.1109/CVPR.2018.00132.
Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., & Joulin, A. (2021). Unsupervised learning of visual features by contrasting cluster assignments. arXiv.org.
Chen, X., Fan, H., Girshick, R., & He, K. (2020). Improved baselines with momentum contrastive learning. arXiv.org.
Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. arXiv.org.
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., & Hinton, G. (2020). Big self-supervised models are strong semi-supervised learners. arXiv.org.
Chen, T., Saxena, S., & Falcon, W. (2020). SimCLR - a simple framework for contrastive learning of visual representations. https://github.com/google-research/simclr.
Ciga, O., Xu, T., & Martel, A. L. (2022). Self supervised contrastive learning for digital histopathology. Machine Learning with Applications, 7, Article 100198.
Ciresan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In 2012 IEEE conference on computer vision and pattern recognition (pp. 3642–3649). IEEE.
Colla, V., Pietrosanti, C., Malfa, E., & Peters, K. (2021). Environment 4.0: How digitalization and machine learning can improve the environmental footprint of the steel production processes. Matériaux & Techniques, 108. http://dx.doi.org/10.1051/mattech/2021007.
Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2019). AutoAugment: Learning augmentation strategies from data. In 2019 IEEE/CVF conference on computer vision and pattern recognition (pp. 113–123). IEEE.
den Eynde, S. V., Diaz-Romero, D. J., Engelen, B., Zaplana, I., & Peeters, J. R. (2022). Assessing the efficiency of laser-induced breakdown spectroscopy (LIBS) based sorting of post-consumer aluminium scrap. Procedia CIRP, 105, 278–283.
Deng, J., Russakovsky, O., Berg, A., Li, K., & Fei-Fei, L. (2009). Imagenet. figshare https://www.image-net.org.
Díaz-Romero, D. J., Van den Eynde, S., Sterkens, W., Eckert, A., Zaplana, I., Goedemé, T., & Peeters, J. (2022). Real-time classification of aluminum metal scrap with laser-induced breakdown spectroscopy using deep and other machine learning approaches. Spectrochimica Acta. Part B: Atomic Spectroscopy, 196, Article 106519.
Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. In 2015 IEEE international conference on computer vision (pp. 1422–1430). IEEE.
Doersch, C., & Zisserman, A. (2017). Multi-task self-supervised visual learning. In 2017 IEEE international conference on computer vision (pp. 2070–2079). IEEE.
dos Santos, P. H., Santos, V. d., & da Silva Luz, E. J. (2024). Towards robust ferrous scrap material classification with deep learning and conformal prediction. arXiv:2404.13002.
ESTEP (2020). Proposal for clean steel partnership under the horizon Europe programme. https://www.estep.eu/assets/Uploads/ec-rtd-he-partnerships-for-clean-steel-low-carbon-steelmaking.pdf.
Fang, X., Wang, H., Liu, G., Tian, X., Ding, G., & Zhang, H. (2022). Industry application of digital twin: from concept to implementation. International Journal of Advanced Manufacturing Technology, 121(7–8), 4289–4312.
Gidaris, S., Singh, P., & Komodakis, N. (2018). Unsupervised representation learning by predicting image rotations. arXiv.org.
Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. Vol. 2, In 2006 IEEE computer society conference on computer vision and pattern recognition (pp. 1735–1742). IEEE.
He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. arXiv.org.
He, K., Zhang, X., Ren, S., & Sun, J. (2016a). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition (pp. 770–778). IEEE.
He, K., Zhang, X., Ren, S., & Sun, J. (2016b). Deep residual learning for image recognition. In 2016 IEEE conference on computer vision and pattern recognition (pp. 770–778). http://dx.doi.org/10.1109/CVPR.2016.90.
He, K., Zhang, X., Ren, S., & Sun, J. (2016c). Identity mappings in deep residual networks. arXiv:1603.05027.
Hu, X., Yang, J., Jiang, F., Hussain, A., Dashtipour, K., & Gogate, M. (2023). Steel surface defect detection based on self-supervised contrastive representation learning with matching metric. Applied Soft Computing, 145, Article 110578.
Huang, S.-C., Pareek, A., Jensen, M., Lungren, M. P., Yeung, S., & Chaudhari, A. S. (2023). Self-supervised learning for medical image classification: a systematic review and implementation guidelines. NPJ Digital Medicine, 6(1), 74.
Huang, Z., Shen, Y., Li, J., Fey, M., & Brecher, C. (2021). A survey on AI-driven digital twins in industry 4.0: Smart manufacturing and advanced robotics. Sensors (Basel, Switzerland), 21(19), 6340.
Huang, J., Yang, X., Zhou, F., Li, X., Zhou, B., Lu, S., Ivashov, S., Giannakis, I., Kong, F., & Slob, E. (2023). A deep learning framework based on improved self-supervised learning for ground-penetrating radar tunnel lining inspection. Computer-Aided Civil and Infrastructure Engineering.
Huang, Z., Zhu, J., Wu, X., Qiu, R., Xu, Z., & Ruan, J. (2021). Eddy current separation can be used in separation of non-ferrous particles from crushed waste printed circuit boards. Journal of Cleaner Production, 312, Article 127755. http://dx.doi.org/10.1016/j.jclepro.2021.127755.
Jing, L., & Tian, Y. (2021). Self-supervised visual feature learning with deep neural networks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11), 4037–4058.
Jujun, R., Yiming, Q., & Zhenming, X. (2014). Environment-friendly technology for recovering nonferrous metals from e-waste: Eddy current separation. Resources, Conservation and Recycling, 87, 109–116. http://dx.doi.org/10.1016/j.resconrec.2014.03.017, URL https://www.sciencedirect.com/science/article/pii/S0921344914000846.
Kim, D., Cho, D., Yoo, D., & Kweon, I. S. (2018). Learning image representations by completing damaged jigsaw puzzles. arXiv.org.
Krizhevsky, A. (2012). Learning multiple layers of features from tiny images. University of Toronto.
Krizhevsky, A., Sutskever, I., & Hinton, G. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90.
Leibe, B., Matas, J., Sebe, N., & Welling, M. (2016). Colorful image colorization. In Lecture notes in computer science: vol. 9907, Computer vision – ECCV 2016 (pp. 649–666). Switzerland: Springer International Publishing AG.
Li, D., Lu, J., Zhang, T., & Ding, J. (2023). Self-supervised learning and multisource heterogeneous information fusion based quality anomaly detection for heavy-plate shape. IEEE Transactions on Automation Science and Engineering, 1–12.
Manna, S., Bhattacharya, S., & Pal, U. (2022). Self-supervised representation learning for detection of ACL tear injury in knee MR videos. Pattern Recognition Letters, 154, 37–43. http://dx.doi.org/10.1016/j.patrec.2022.01.008, URL https://www.sciencedirect.com/science/article/pii/S0167865522000149.
Mishra, A. K., Roy, P., Bandyopadhyay, S., & Das, S. K. (2022). CR-SSL: A closely related self-supervised learning based approach for improving breast ultrasound tumor segmentation. International Journal of Imaging Systems and Technology, 32(4), 1209–1220.
Misra, I., & van der Maaten, L. (2019). Self-supervised learning of pretext-invariant representations. arXiv.org.
Noroozi, M., & Favaro, P. (2016). Unsupervised learning of visual representations by solving jigsaw puzzles. In B. Leibe, J. Matas, N. Sebe, & M. Welling (Eds.), Computer vision – ECCV 2016 (pp. 69–84). Cham: Springer International Publishing.
Raabe, D., Ponge, D., Uggowitzer, P. J., Roscher, M., Paolantonio, M., Liu, C., Antrekowitsch, H., Kozeschnik, E., Seidmann, D., Gault, B., De Geuser, F., Deschamps, A., Hutchinson, C., Liu, C., Li, Z., Prangnell, P., Robson, J., Shanthraj, P., Vakili, S., ... Pogatscher, S. (2022). Making sustainable aluminum by recycling scrap: The science of ''dirty'' alloys. Progress in Materials Science, 128, Article 100947. http://dx.doi.org/10.1016/j.pmatsci.2022.100947, URL https://www.sciencedirect.com/science/article/pii/S0079642522000287.
Rani, V., Nabi, S. T., Kumar, M., Mittal, A., & Kumar, K. (2023). Self-supervised learning: A succinct review. Archives of Computational Methods in Engineering, 30(4), 2761–2775.
Schäfer, M., & Faltings, U. (2023). DOES - dataset of European scrap classes. http://dx.doi.org/10.5281/zenodo.8219163.
Schäfer, M., Faltings, U., & Glaser, B. (2023). DOES - A multimodal dataset for supervised and unsupervised analysis of steel scrap. Scientific Data, 10(1), 780.
Tian, Y., Chen, S., Poole, B., Krishnan, D., Schmid, C., & Isola, P. (2020). What makes for good views for contrastive learning? arXiv.org.
van den Oord, A., Li, Y., & Vinyals, O. (2019). Representation learning with contrastive predictive coding. arXiv.org.
Wang, X., & Qi, G.-J. (2023). Contrastive learning with stronger augmentations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5), 1–12.
Weiss, M. (2011). Resource recycling in waste management with X-ray fluorescence (Master's thesis). Montanuniversitaet Leoben, embargoed until 08-06-2016.
Wieczorek, T., & Pilarczyk, M. (2008). Classification of steel scrap in the EAF process using image analysis methods. Archives of Metallurgy and Materials, 53(2), 613–617.
Winning, M., Calzadilla, A., Bleischwitz, R., & Nechifor, V. (2017). Towards a circular economy: insights based on the development of the global ENGAGE-materials model and evidence for the iron and steel industry. International Economics and Economic Policy, 14(3), 383–407.
Xu, W., Xiao, P., Zhu, L., Zhang, Y., Chang, J., Zhu, R., & Xu, Y. (2023). Classification and rating of steel scrap using deep learning. Engineering Applications of Artificial Intelligence, 123, Article 106241.
Zhang, S., & Forssberg, E. (1998). Mechanical recycling of electronics scrap - the current status and prospects. Waste Management and Research.