


Received April 16, 2020, accepted May 8, 2020, date of publication May 27, 2020, date of current version June 15, 2020.
Digital Object Identifier 10.1109/ACCESS.2020.2998052

Adaptive Laser Welding Control: A Reinforcement Learning Approach

GIULIO MASINELLI 1,2 (Member, IEEE), TRI LE-QUANG 1, SILVIO ZANOLI 2, KILIAN WASMER 1 (Member, IEEE), AND SERGEY A. SHEVCHIK 1

1 Laboratory for Advanced Materials Processing, Swiss Federal Laboratories for Materials Science and Technology (EMPA), 3602 Thun, Switzerland
2 Embedded Systems Laboratory, Swiss Federal Institute of Technology in Lausanne (EPFL), 1015 Lausanne, Switzerland
Corresponding author: Kilian Wasmer ([email protected])
This work was supported by the Swiss Federal Laboratories for Materials Science and Technology (EMPA).

ABSTRACT Despite extensive research efforts in the field of laser welding, the imperfect repeatability of
the weld quality still represents an open topic. Indeed, the inherent complexity of the underlying physical
phenomena prevents the implementation of an effective controller using conventional regulators. To close
this gap, we propose the application of Reinforcement Learning for closed-loop adaptive control of welding
processes. The presented system is able to autonomously learn a control law that achieves a predefined weld
quality independently from the starting conditions and without prior knowledge of the process dynamics.
Specifically, our control unit influences the welding process by modulating the laser power and uses optical
and acoustic emission signals as sensory input. The algorithm consists of three elements: a smart agent
interacting with the process, a feedback network for quality monitoring, and an encoder that retains only
the quality-critical events from the sensory input. Based on the data representation provided by the encoder,
the smart agent decides the output laser power accordingly. The corresponding input signals are then analyzed
by the feedback network to determine the resulting process quality. Depending on the distance to the targeted
quality, a reward is given to the agent. The latter is designed to learn from its experience by taking the actions
that maximize not just its immediate reward, but the sum of all the rewards that it will receive from that
moment on. Two learning schemes were tested for the agent, namely Q-Learning and Policy Gradient. The
required training time to reach the targeted quality was 20 min for the former technique and 33 min for the
latter.

INDEX TERMS Laser welding, laser material processing, reinforcement learning, policy gradient,
Q-learning, closed-loop control.

I. INTRODUCTION

Laser welding (LW) is a crucial technology for many industrial sectors, including automotive production, maritime, medical, aerospace, and micromechanics [1]. On the one hand, its advantages are in non-contact processing — avoiding tool wear, ability to process refractory materials, and higher processing rate and joint quality compared to traditional welding processes [2]. On the other hand, LW's main disadvantages derive from the highly complex underlying physical phenomena involved in the process. Thus, despite many developments of this technology, LW still suffers from imperfect quality repeatability, limiting its applications in industrial production requiring high-quality standards.

In the literature, the most commonly reported approach to increase the repeatability of the weld quality is the application of traditional regulators, such as proportional-integral (PI) or proportional-integral-derivative (PID) controllers [3], [4]. These methods allow tracking the desired weld quality using measurements of the surface temperature or the surface shape of the process zone (PZ) as feedback. Unfortunately, since they are based on the linearization of the non-linear welding dynamics, they can only operate in a narrow range of the process parameters. This operating range, moreover, has to be established during a preliminary exhaustive experimental search, which is very time- and material-consuming, making the entire methodology undesirable in an industrial environment.

The associate editor coordinating the review of this manuscript and approving it for publication was Jianyong Yao.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

A less common approach, but one that is worth investigating, is based on more sophisticated regulators that rely on differential models of the process [5], [6]. But in the case of LW, a reliable model can be complicated to obtain, as it has to take into account many factors that can drastically vary the process, such as the heating and melting dynamics [5]. Nevertheless, a preliminary attempt can be found in Na et al. [7], where the authors presented an algorithm that automatically builds a model during the operation using the Hammerstein identification technique.

An example of the actual use of a model-based controller for laser processes was proposed by Song and Mazumder [6], where an experimentally identified model was involved for predictive control of laser cladding — a process that is closely related to LW. This technique heavily relies on its model for the choice of the actions to take according to their impact on the environment evaluated with the model itself. To be specific, a closed-loop process was used to steer the melt pool temperature to a reference temperature profile. In a real-life scenario, unfortunately, this approach has two major drawbacks. First, the temperature of the melt pool is not uniformly distributed over its surface [8]. Second, the optimal temperature profile can vary during the process, as it strictly depends on the geometry, e.g., on the proximity to the edges or the boundaries of the workpiece. Thus, the tracking of a single fixed target has a direct impact on the system performance and so on the desired result.

Similarly, Bollig et al. [9] showed promising results by modeling the non-linear process with an Artificial Neural Network and controlling the laser power with a linear model predictive algorithm based on the instantaneous linearization of the neural network itself. In this case, the regulator aimed to track a reference penetration depth detected from the intensity of the plasma's optical emission. However, the experimental calibration curve used to map the measured intensity to the penetration depth may diverge from its real-life values, limiting the application of the same methodology in broader scenarios.

In this context, there is a clear need for a widely applicable, robust, and cost-effective process control system that ensures high-quality standards. In particular, we focus on deep keyhole welding, where the process complexity is even higher compared to other welding regimes, such as conduction welding. This welding regime is indeed characterized by the co-existence — within a limited volume — of vapor, melt, and plasma phases of the processed material [10]. Moreover, it possesses an extremely complex energy-coupling mechanism that includes Fresnel absorption (due to multiple reflections inside the vapor channel) [11]. These complex phenomena generate many process instabilities, making keyhole welding prone to defects even under constant laser irradiation [10]. Specifically, one of the most critical defects is porosity. Pores are problematic since they are located inside the material and may substantially weaken the mechanical strength of the welding joint [12].

The design of a keyhole LW control system is made all the more challenging by the partial observability of the laser process. In fact, in-depth information of the PZ can only be indirectly obtained either by acoustic emission (AE) sensors or by surface measurements using optical emission (OE) sensors [12]. Consequently, it is difficult to provide effective feedback from the process to the control system, since it requires the correlation of the surface measurements with the sub-surface events (e.g., pore formation), which is not a trivial task [12]. Nevertheless, some pilot works in LW monitoring report successes in identifying quality-critical momentary events from the corresponding AE and OE signals from the processed zone [13], [14].

The present study starts from the aforementioned preliminary results of process monitoring and focuses on the use of Reinforcement Learning (RL) towards keyhole LW closed-loop control. RL appears to be an attractive approach since it enables a model-free learning scheme that is capable of solving complex problems and provides high adaptability to specific conditions through active interaction with a given process [15]. Moreover, we take advantage of recent advances in Deep Convolutional Neural Network (DCNN) developments [16], [17] to derive efficient representations of the laser process from the high-dimensional sensory input — the AE and OE signals from the PZ — and use them to generalize previous experiences to new situations [18]. In our case, indeed, the input data from the sensors do not contain an explicit representation of the physical state of the system, as they are just limited to the optical and acoustic emission. As shown by Mnih et al. [18], DCNNs can overcome — and even take advantage of — this condition, allowing the system to learn the meaningful position and scale of irregular structures in the data.

Concerning the recent advances of RL, its application towards LW was discussed in Günther et al. [19], where a dynamic model substituted the real laser process, and a camera-based system and photodiodes were used for process monitoring. RL was able to efficiently search for strategies for modulating the laser irradiation to compensate for the mentioned process instabilities. Despite the successes of this work, the efficiency of RL in more complex LW processes remains an open question. To close this gap, we inspected the performance of our methodology in the case of keyhole LW and evaluated its outcomes in terms of the evolution of the weld quality over time during training. Firstly, the AE and OE signatures of the desired weld quality were given to the algorithm, as well as several signatures of undesirable qualities, without any other prior information about the process dynamics. Further search for the optimal process control strategy was carried out in a completely autonomous way. Two RL techniques were investigated in this contribution: Q-Learning [20] and Policy Gradient [21], in order to analyze their strengths and weaknesses in this particular application.


This paper is divided into five sections. Section II describes the experimental setup and the hardware of the control system. Section III describes the developed algorithms, including details on signal dimensionality reduction and the feedback network used for process monitoring. Section IV presents and discusses the results. Finally, Section V concludes this work and gives the perspective of its further developments that would allow LW to operate autonomously and, thus, bring it closer to intelligent manufacturing within the Industry 4.0 framework [22].

II. EXPERIMENTAL PROCEDURE, MATERIALS, ACQUISITION, AND CONTROL

The experimental setup was similar to the one used in a previous work [14], and therefore just a summary is given in this contribution.

A. EXPERIMENTAL SETUP
A schematic representation of the setup is presented in Fig. 1, along with its picture. The main components were: a laser source, an optical laser head, a workpiece holder — mounted on a moving stage — and an AE sensor. The laser source was a fiber laser system StarFiber150P (Coherent Switzerland AG, Switzerland), with a maximum output power of 250 W, a wavelength of 1070 nm, and a diameter of the laser spot of 30 µm (within 2w0) at the workpiece surface. The source was operated in continuous-wave (CW) mode with the possibility to modulate the output laser power using an external voltage source within a voltage range of 0–5 V. More details are given in Le-Quang et al. [23].

The laser experiments were performed in air at atmospheric pressure. To prevent the potential oxidation of the weld, an adequate Ar flow was directed to the PZ via a nozzle. The flow was kept constant at a pressure of 1.5 atm during all experiments. In order to realize line welds, a workpiece was mounted on a linear stage M-663.5U (Physik Instrumente GmbH, Germany), and moved at a constant velocity of 10 mm/s during the process. The movement of the workpiece was synchronized with the laser source so that the irradiation started only when the stage had already reached the set velocity.

The aforementioned setup provided the realization of different LW regimes leading to various welding qualities [14], [24]–[26], including no illumination (laser power P = 0 W), conduction welding, keyhole without porosity, and keyhole with porosity. It must be emphasized that, in terms of process parameters, the weld quality also depends on the velocity of the workpiece. This work, however, was focused on the control of the laser power that, in our setup, can be dynamically modulated via the external voltage generator, as described. Consequently, this process parameter was considered as the sole control variable.

B. SENSORS
The laser head was equipped with a customized optical system that allowed delivering the back-reflected radiation from the PZ to three photodiodes. These sensors are based on Silicon (Si), Germanium (Ge), and InGaAs and are sensitive within the ranges of 450–850 nm, 1000–1200 nm, and 1250–1700 nm, respectively. The Ge sensor was equipped with a narrow bandpass optical filter (FB1070-10, Thorlabs Inc., USA) with a center wavelength of 1070 ± 2 nm to only sense the back-reflected laser radiation from the PZ. In addition to the optical sensors, an AE sensor PICO (Physical Acoustics, USA) was placed in tight contact with the workpiece, as shown in Fig. 1 (a). The sensor was sensitive within the range 500–1850 kHz. Its purpose is to detect the AE shockwaves generated inside the workpiece during welding.

C. MATERIAL
The workpieces were 2 mm thick plates of titanium alloy (Ti6Al4V, grade 5) with a melting temperature of 1,650 °C. This material was chosen due to its extensive industrial usage, including the medical sector. Additionally, its Heat-Affected Zone (HAZ) can be easily recognized in cross-sections due to the remarkable textural changes [27].

D. REFERENCE QUALITY DEFINITION
To meet the industrial demand for high-quality keyhole welding [12], we defined our reference weld as the one with the highest achievable penetration depth without the presence of pores. In addition to previous experiences [14], [24], several experiments were carried out, taking advantage of the well-controlled welding conditions of our setup that allowed us to reproduce different penetration depths precisely. Each experimental weld was verified by analyzing the cross-sections of the processed workpieces. Finally, the investigations led to a reference weld characterized by a laser power of 80 W and a resulting penetration depth of 150 µm. Every increment in laser power resulted in the introduction of porosity, whereas every decrement corresponded to shallower welds.

E. DATA ACQUISITION AND COMPUTATIONS
In order for the control system to reach a real-time response given the high-dimensional input from the sensors, a combination of specialized hardware and software was used. The hardware included a PC equipped with an Intel i7-8750H processor (Intel, USA) that operated at a frequency up to 4.1 GHz, and a Graphics Processing Unit NVidia GTX 2080 Ti (Nvidia, USA). The signals from all four sensors described in Section II-B were acquired with a high-speed DAQ card Advantech 1840 (Advantech, Taiwan) with four independent input ports for data digitalization. All signals were digitized with a sampling rate of 1 MHz, and their acquisition was triggered when the intensity of the back-reflected laser light detected by the Ge photodiode exceeded a fixed threshold (0.1 V).
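To make the interface operations described above concrete, the following is a minimal sketch (not the authors' code) of the two steps at the boundary of the setup: gating the acquisition on the Ge photodiode threshold and converting a laser-power command into the 0–5 V modulation voltage. The linear power-to-voltage mapping and all function names are illustrative assumptions.

# Sketch of the acquisition trigger and the laser-power command path.
# Assumptions (not from the paper): a linear 0-5 V <-> 0-250 W mapping; helper names are illustrative.
MAX_VOLTAGE_V = 5.0      # external modulation input range of the laser source
MAX_POWER_W = 250.0      # maximum output power of the StarFiber150P source
TRIGGER_LEVEL_V = 0.1    # Ge photodiode threshold that starts the acquisition

def acquisition_triggered(ge_photodiode_sample_v: float) -> bool:
    """Return True once the back-reflected laser light exceeds the trigger threshold."""
    return ge_photodiode_sample_v > TRIGGER_LEVEL_V

def power_to_control_voltage(power_w: float) -> float:
    """Map a requested laser power to the 0-5 V modulation voltage (assumed linear)."""
    power_w = min(max(power_w, 0.0), MAX_POWER_W)
    return MAX_VOLTAGE_V * power_w / MAX_POWER_W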


FIGURE 1. (a) Scheme of the experimental setup and (b) its picture. The labels of the individual components in (a) and
(b) correspond to each other.

The choice of the Ge sensor as a trigger for the acquisition is based upon the very high intensity of the back-reflected laser radiation at the beginning of the process, when the reflectivity of the workpiece is the highest [23]. To dynamically modulate the laser power, the control signal provided by the RL algorithm was transmitted to the laser source via an external USB unit Advantech 4751L (Advantech, Taiwan). The latter converted the digital values calculated by the RL models into a direct voltage value, which was then delivered to the laser source via a cable connection (see Fig. 1 for details). The time delay between the output from the USB unit and the laser response was experimentally measured to be 0.57 ± 0.25 ms.

The real-time acquisition routine of the input signals using the DAQ board, the data processing in the GPU, and the transmission of the computed control signal to the laser source were carried out with in-house custom-made software. In particular, the data acquisition program was coded in C# in Visual Studio 2017, Community edition. Conversely, the high-level data processing was realized in Python 3.7. Finally, the Deep Learning (DL) library involved was PyTorch (www.pytorch.org), version 1.1.0.

III. DATA PROCESSING

The structure of the developed data processing is schematically presented in Fig. 2. The entire control unit consists of three main building blocks: an encoder that processes the data from the measurements to retain only the quality-critical events, a smart agent interacting with the welding process, and a feedback network based on a DCNN for quality monitoring. Before even starting the interaction with the environment, the encoder and the feedback network were trained using a database consisting of 750 signals acquired from previous experiments covering the whole operating range of the laser process. The signals were divided into 5 categories according to the corresponding penetration depth identified with optical inspection of the cross-section of the processed material (more details in Section IV).

A. ENCODER
The encoder was used to reduce the dimensionality of the sensory input of the agent, preserving, at the same time, the structure of the original data while minimizing the computational time. The introduction of this unit was motivated by a resulting simplification of the search for the optimal control law for the smart agent. Indeed, the projection of the high-dimensional input data into a low-dimensional latent space allows capturing a "good" parametrization of the signal that focuses only on quality-critical events that the user can settle by carefully choosing the training data [28]. To be specific, we based our encoder on a DCNN due to the proven abilities of convolutional networks to explicitly model signals by finding their meaningful degrees of freedom [29], [30]. Indeed, DCNNs also exhibit excellent generative properties [31], which motivates their use as encoders.

Following traditional architectures [30], [32], our DCNN encoder included four convolution layers. Moreover, each convolution was enforced with a batch normalization layer to speed up the training [33]. The activation consisted of a rectified linear unit (ReLU) that is more efficient in multi-layer architectures, as it diminishes the gradient vanishing problem [34]. The summarization of the input information is achieved gradually through the convolutional layers by adopting strided convolutions [35].

As stated, the training of the encoder was carried out separately, prior to the interaction with the environment. During training, a decoder with a symmetrical structure was added to process the encoder output. Specifically, in the decoder, the convolutions of the encoder were replaced with their reciprocal transposed convolutions. The two models were then trained end-to-end to minimize the mean square error between the training input signals and the output of the decoder [30], [32]. After training, the decoder was removed, thus keeping the encoder standalone to provide a low-dimensional signal representation as input for the smart agent.
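As an illustration of this architecture, the following is a minimal PyTorch sketch of a convolutional encoder-decoder trained on the reconstruction loss, as described above. The exact kernel sizes, strides, channel counts, latent dimension, and window length used in the paper are not fully reported here, so the values below are assumptions chosen only to reproduce the structure (four strided convolutions with batch normalization and ReLU, a mirrored transposed-convolution decoder, and an MSE objective).

import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Four strided 1D convolutions with batch normalization and ReLU (channel counts assumed)."""
    def __init__(self, in_channels=4, latent_channels=16):
        super().__init__()
        chs = [in_channels, 16, 32, 64, latent_channels]
        layers = []
        for c_in, c_out in zip(chs[:-1], chs[1:]):
            layers += [nn.Conv1d(c_in, c_out, kernel_size=4, stride=4),
                       nn.BatchNorm1d(c_out),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (batch, sensor channels, samples)
        return self.net(x)

class ConvDecoder(nn.Module):
    """Mirror of the encoder: transposed convolutions back to the original resolution."""
    def __init__(self, out_channels=4, latent_channels=16):
        super().__init__()
        chs = [latent_channels, 64, 32, 16]
        layers = []
        for c_in, c_out in zip(chs[:-1], chs[1:]):
            layers += [nn.ConvTranspose1d(c_in, c_out, kernel_size=4, stride=4),
                       nn.BatchNorm1d(c_out),
                       nn.ReLU()]
        layers.append(nn.ConvTranspose1d(chs[-1], out_channels, kernel_size=4, stride=4))
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

encoder, decoder = ConvEncoder(), ConvDecoder()
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

# One training step on a stand-in batch: 4 sensor channels, window length padded to a
# stride-compatible size (the real 20 ms window holds 20,000 samples per sensor at 1 MHz).
x = torch.randn(8, 4, 20480)
reconstruction = decoder(encoder(x))
loss = nn.functional.mse_loss(reconstruction, x)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# After training, the decoder is discarded and encoder(x) serves as the agent's input.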


FIGURE 2. Structure of the complete control unit made up of three main building blocks: an encoder that processes the data from the sensory input to retain only the quality-critical events, a smart agent interacting with the welding process, and a feedback network based on a convolutional neural network for quality monitoring.

B. FEEDBACK NETWORK
As seen in the introduction, RL is a learning paradigm leading to the design of algorithms that directly interact with an environment and learn via trial and error. Nevertheless, learning by doing is only effective if we can define a notion of reward, something that motivates the intelligent system to behave appropriately. For this reason, the full setup depicted in Fig. 2 included a feedback network based on a DCNN classifier and a summation unit.

This unit is based on our previous work [13], where the AE and OE signals from the PZ were used to identify quality-critical momentary events. In this contribution, the output of the classifier is made up of labels that correspond to predefined welding qualities in terms of penetration depth and pore content. The DCNN classifier shares the initial two convolution layers with the encoder, as shown in Fig. 2. This detail allows the classifier to reuse the good feature representation learned by the encoder. The final decision on the quality is taken in two fully connected layers that are closed by a softmax layer. In analogy to the encoder, the training of the classifier was carried out prior to the operation of the entire system with the preliminarily collected signal database.

To provide the reward signal, the output of the classifier (i.e., the label of the current momentary quality) was compared with the label of the reference signal in the summation unit (see Fig. 2). In case of significant differences, the smart agent is granted negative rewards; otherwise, positive rewards are assigned (more details in Section IV).

C. SMART AGENT
The final building block is constituted by the smart agent, whose purpose is to interact with the environment — in this case, the laser process — by taking actions, i.e., modulating the laser power. Practically, the agent communicates with the output board that, in turn, delivers the control signal to the laser source (see Section II-E for more details).

The principle of operation is the following: based on the representation of the current sensory input provided by the encoder, the agent chooses an action, which leads to a change in the sensory input, and receives a reward from the feedback network. From this experience — made up of the past sensory input, the executed action, the current input, and the received reward — the agent tries to optimize the outcomes of its actions over time, i.e., to maximize the reward over a defined time horizon. In our case, the considered time horizon corresponds to the time required to perform a single line weld of 10 mm (1 s in this work, see Section II-A).

In the remainder of the article, we refer to this 10 mm line weld as an episode. Operating in an episodic fashion — i.e., by individually welding lines of 10 mm — permits the algorithm to update its parameters between one line and the next, and allows the stage to move to a new unprocessed position to be able to start over. For training the agent, two RL techniques were tested in this study, and their descriptions are given in the next subsections.
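To make the interplay of the three blocks in Fig. 2 concrete, the following is a minimal sketch of one episode of the loop described above. It is not the authors' implementation: the reward values are illustrative stand-ins for Table 1, and the helper objects (daq, laser, encoder, agent, classifier) are assumptions used only to show the data flow.

import torch

# Illustrative per-label rewards; labels far from the reference quality receive negative rewards.
REWARD_BY_LABEL = {0: -1.0, 1: -0.5, 2: -0.25, 3: 1.0, 4: -1.0}

def run_episode(daq, laser, encoder, agent, classifier, power_levels, steps=50):
    """One 10 mm line weld (50 windows of 20 ms): encode -> act -> acquire -> reward."""
    transitions = []
    with torch.no_grad():
        window = daq.read_window()                     # first 20 ms of AE/OE signals
        state = encoder(window)
        for _ in range(steps):
            action = agent.select_action(state)        # index into the allowed laser-power levels
            laser.set_power(power_levels[action])      # command sent through the output board
            window = daq.read_window()                 # signals produced under the new power
            next_state = encoder(window)
            label = int(classifier(window).argmax())   # momentary weld quality (feedback network)
            reward = REWARD_BY_LABEL.get(label, -1.0)  # summation unit vs. the reference label
            transitions.append((state, action, reward, next_state))
            state = next_state
    return transitions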


D. PARAMETER TUNING
Assuming the use of a conventional RL learning scheme, we can output a single action after the defined sensory input is available, i.e., after a predetermined number of data points is acquired from the AE and OE sensors. Hence, the length of the input window determines the operational frequency, that is, the rate at which the control unit can modify the laser power. A small window increases the system readiness to adapt to new welding conditions, but, unfortunately, it also raises the sensitivity to noise of the feedback network [14]. In contrast, a large window increases the monitoring accuracy and eases the internal timing constraints, but reduces the number of actions per unit time. In this sense, the window length is crucial, as it is a trade-off between system readiness to react to different stimuli and monitoring accuracy. A good compromise was found by fixing the window length to 20 ms, thus setting the operating frequency to 50 Hz.

The entire system was also sensitive to multiple other parameters, including the size of the convolutional kernels used in the DCNN and the dimensionality of the encoder output. The adjustments of these parameters were carried out through an exhaustive search, and the final set of parameters was established as follows. The optimal size of the convolutional kernel used in the very first layer of the feedback network (see Fig. 2) was found to be 5 ms. Taking into account the given stage velocity and the acquisition rate, the time span of this kernel corresponded to 50 µm in length of the weld joint, or, equivalently, to a signal sample of 5,000 sampling points obtained from each sensor.

Following the scheme in Fig. 2, the unification of all signals from the sensors in a time interval of 20 ms determines the dimension of the algorithm's input space, which amounts to 80,000 data points. As seen before, the agent does not receive this high-dimensional input, but its condensed representation from the encoder. The maximum possible dimensionality reduction achievable in our setup led to low-dimensional signals made up of 64 data points for every sensor (from the original 20,000). In our work, it was experimentally established that any further reduction harmed the algorithm's accuracy, provoking higher error rates for the autonomous learning controller.

E. REINFORCEMENT LEARNING
RL is inspired by human and animal behaviors, where experience and knowledge are acquired through active interaction with the environment by trying to maximize the rewards received [15], [18], [36]. Specifically, RL is the branch of Machine Learning (ML) that aims at designing agents capable of taking, in every moment, the action that maximizes not just the immediate reward, but the sum of all the rewards that will be received thenceforth. The agent chooses actions based on its sensory input that provides a momentary representation of the environment — the so-called states — and tries to optimize the outcomes of these actions over time in terms of reward.

In RL, this concept is formalized through a Markov Decision Process (MDP). An MDP is described by a quadruple {S, A, p, r}, where S and A are the state and action spaces and p(s_{t+1}|s_t, a_t) is the probability of the transition from state s_t ∈ S to state s_{t+1} ∈ S taking the action a_t ∈ A. Each change of state is rewarded according to r(s_t, a_t). The strategy of choosing an action a_t given the state s_t is known as the policy, and it is indicated by π(a_t|s_t) — denoting the probability of selecting the action a_t in state s_t.

The correctness of the choice of the actions is evaluated in terms of the rewards subsequently collected. Concretely, the quality of taking an action a_t given the actual state s_t, with the further choice of all remaining actions according to the policy π, can be quantified with the action-value function Q^π(s_t, a_t). Given an episode that includes T steps, it is defined as [15]:

Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[ \sum_{t'=t}^{T} r(s_{t'}, a_{t'}) \,\middle|\, s_t, a_t \right],   (1)

that is, the expected total reward from taking the action a_t in state s_t and then following the policy π.

The goal of RL is to approximate the optimal policy π* that returns, for every state, the best action to take in terms of total reward from that moment on. One approach consists of estimating the action-value function for π*. Indeed, in that case, the optimal action a to be taken in state s is the one that maximizes Q^{π*} for the given state [15]. The different RL algorithms differ in the way Q^π(s, a) or, alternatively, the policy parameters are iteratively updated. In this study, we have tested two of the most successful realizations of RL, namely Q-Learning and Policy Gradient. Both methods have pros and cons, which are discussed in the next two subsections.
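As a small numerical illustration of (1) — not taken from the paper — the action-value of a state-action pair under a fixed policy can be estimated by averaging the remaining episode reward over rollouts that start from that pair:

def episode_return(rewards, t):
    """Sum of rewards from step t to the end of the episode (the inner sum in (1))."""
    return sum(rewards[t:])

def monte_carlo_q_estimate(rollouts, t):
    """Average the tail return over several episodes that share (s_t, a_t)."""
    return sum(episode_return(r, t) for r in rollouts) / len(rollouts)

# Example: three hypothetical 5-step episodes of rewards collected under the same policy.
rollouts = [[1.0, -0.25, 1.0, 1.0, 1.0],
            [1.0, 1.0, -0.5, 1.0, 1.0],
            [1.0, 1.0, 1.0, 1.0, -1.0]]
print(monte_carlo_q_estimate(rollouts, t=2))   # expected remaining reward from step 2 onward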


F. DEEP Q-LEARNING
Q-Learning is one of the most popular RL algorithms and aims at estimating the Q^{π*} values for every state — hence the name of the technique. In the case of a high-dimensional state space (e.g., in laser welding), the traditional update methods for the Q^π values become inapplicable as they suffer from the curse of dimensionality [37]. Indeed, those methods require representing the Q^π values in tabular form — a table having as many entries as the ordered pairs (s, a) ∈ S × A [15], which is only feasible if the cardinalities of both S and A are small. The concept of DL allows overcoming those limits by using DCNNs to estimate the action-value function [38], exploiting the recent advances in ML where DCNNs proved to be excellent complex function approximators [39], [40].

In our work, the Fitted Q Iteration algorithm (FQI) was used as a basic learning scheme [41], and included the following steps:

(i) using some policy, collect a dataset of transitions:

\{(s_t, a_t, s_{t+1}, r_t)\}_{t=1,2,\dots}   (2)

(ii) for every transition, compute:

y_t = r_t + \gamma \max_{a} Q^{\pi}_{\theta}(s_{t+1}, a)   (3)

(iii) update the parameters θ:

\theta \leftarrow \arg\min_{\theta} \sum_{t} \left\| Q^{\pi}_{\theta}(s_t, a_t) - y_t \right\|^{2}   (4)

where Q^π_θ denotes the functional approximator of the function Q^π given by a parametric function with parameters θ. In this contribution, θ represents the weights and biases of a DCNN that takes as input the ordered pair (s_t, a_t) and outputs an estimate of Q^π(s_t, a_t). γ ∈ (0, 1) is a discount factor used to weigh future rewards less and immediate ones more, r_t is the reward collected at time t, and y_t is a momentary target for the computation of the so-called Bellman update in (4) [37]. The minimization problem in (4) can be solved using gradient descent methods. Therefore, it can be addressed using the techniques for loss minimization that are common in DL frameworks [42], [43].

In order to promote the exploration of the state space at the beginning of the training, we have used the so-called epsilon-greedy technique for step (i) of the FQI [15]. This strategy consists in the use of the following policy for the collection of the transitions:

\pi(a_t|s_t) = \begin{cases} 1 - \varepsilon, & \text{if } a_t = \arg\max_{a} Q^{\pi}_{\theta}(s_t, a) \\ \varepsilon / (|A| - 1), & \text{otherwise} \end{cases}   (5)

where |A| is the cardinality of the set A and ε ∈ (0, 1). Following (5), at each timestamp, the algorithm chooses either a random action with probability ε, or the best action according to the actual Q^π estimate with probability 1 − ε. As the training progresses, ε is progressively reduced. This procedure encourages the exploration of the environment at the very beginning of the training and the exploitation of the acquired knowledge at the end.

To reduce the oscillations or divergence of the policy, the momentary target y_t and the Q-value Q^π_θ(s_t, a_t) were estimated using two separate networks that are known as the target network (Q^π_{θ_t}) and the Q-network (Q^π_θ), respectively [18]. During the interaction with the environment, the parameters of the target network are cyclically updated with the parameters of the Q-network. Additionally, in our study, the Double Q-Learning technique was used [44]. It consists in using the Q-network to evaluate the action to take — using Q^π_θ in (5) — and the target network to evaluate the momentary target y_t — using Q^π_{θ_t} instead of Q^π_θ in (3). The reason was an efficient decorrelation between the noise in the action selection and the noise in the Q-values estimation, which is a common problem for standard Q-Learning realizations [44].

Moreover, to avoid bad local minima and to reduce the correlation between observations, a replay buffer B was introduced, as in Mnih et al. [18]. In particular, during step (i) in FQI, the collected transitions are added to B. During step (ii), we randomly sampled a batch of the accumulated transitions from B and used those to compute the targets y_t through the target network (see (3)). Finally, the updates of the parameters θ in the Q-network were carried out using (4). Here one of the key advantages of the introduction of the encoder manifests itself. Indeed, it allows a dimensionality reduction of the input — the reduction factor was 300 in our setup — allowing us to use a bigger buffer B while avoiding GPU memory saturation.

The advantages and disadvantages of Q-Learning can be explained by the way the targets are computed in FQI. As can be seen in (3), the observed reward in just one transition is used to calculate the targets y_t. In addition, the first term r_t in (3) is significant when the estimation of Q^π_θ is inaccurate, as it is a real reward and not an estimation. In contrast, the second term γ max_a Q^π_θ(s_{t+1}, a) in (3) is relevant only when the estimation of Q^π_θ is reliable, as it is an estimation of the total future reward that is supposed to be higher than the current one. Consequently, during the Bellman updates (see (4)), the algorithm relies more and more on the actual estimate of the Q-value as soon as it becomes sufficiently large. In Q-Learning, as a result, the strategy of sharply reducing the variance of the estimates (the Q-values) is adopted, to the detriment of high bias.
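For concreteness, the following is a minimal PyTorch sketch of the scheme described in this subsection: epsilon-greedy collection (5), a replay buffer, a periodically synchronized target network, and the Bellman update (3)-(4) with Double Q-Learning action selection. It is a sketch under assumptions, not the authors' implementation: the network sizes, hyperparameters, and the discrete set of laser-power levels are illustrative, and the Q-network here outputs one value per discrete action (a common variant of the Q-approximator described in the text).

import random
from collections import deque
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS, GAMMA = 256, 5, 0.99             # assumed encoded-state size, power levels, discount

def make_q_net():
    # Small fully connected stand-in for the DCNN Q-approximator used in the paper.
    return nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, N_ACTIONS))

q_net, target_net = make_q_net(), make_q_net()
target_net.load_state_dict(q_net.state_dict())          # target network starts as a copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay_buffer = deque(maxlen=10_000)                     # B: stores (s, a, r, s') transitions

def select_action(state, epsilon):
    """Epsilon-greedy collection policy of (5) over the discrete power levels."""
    if random.random() < epsilon:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(state).argmax())

def bellman_update(batch_size=32):
    """One update of (3)-(4) on a random batch from B, with Double Q-Learning targets."""
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)
    s = torch.stack([t[0] for t in batch])
    a = torch.tensor([t[1] for t in batch])
    r = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    s_next = torch.stack([t[3] for t in batch])
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)               # action picked by the Q-network
        y = r + GAMMA * target_net(s_next).gather(1, best_a).squeeze(1)  # evaluated by the target network
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

During training, transitions collected with select_action are appended to replay_buffer, epsilon is decayed between episodes, and target_net.load_state_dict(q_net.state_dict()) is repeated periodically to refresh the target network.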


G. POLICY GRADIENT
As mentioned above, the main limitation of Q-Learning is the high bias in the estimation of the Q-values. This bias originates from the single-step reward estimator for the targets y_t. The Policy Gradient (PG) approach [15], [45], [46] aims to overcome those limits by evaluating the total reward on an entire episode. Similarly to other RL algorithms, the objective of PG is to find the policy that maximizes the expected total reward in one episode that includes T steps. But contrary to Q-Learning, PG does not try to estimate the optimal Q-values, but the parameters of the policy approximating the optimal policy π*:

\theta^{*} = \arg\max_{\theta} J(\theta),   (6)

where

J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[ \sum_{t=1}^{T} r(s_t, a_t) \right],   (7)

and θ stands for the policy parameters. In our case, θ represents the weights and the biases of a DCNN that takes as input the current sensory representation provided by the encoder (see Section III-A) and outputs the action to be taken (e.g., the power of laser irradiation).

In PG, the functional J(θ) is estimated as:

J(\theta) \approx \hat{J}(\theta) = \sum_{t=1}^{T} r(s_t, a_t).   (8)

The optimization of the objective J(θ) is carried out by directly differentiating its estimate Ĵ(θ) and using gradient ascent to update the parameters as:

\theta \leftarrow \theta + \alpha \nabla_{\theta} \hat{J}(\theta).   (9)

In particular, the gradient of the objective in (8) is computed as [45], [46]:

\nabla_{\theta} \hat{J}(\theta) = \left( \sum_{t=1}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t|s_t) \right) \left( \sum_{t=1}^{T} r(s_t, a_t) \right).   (10)

Clearly, the entire approach relies on a single-sample estimate of the full expectation (cf. (8)) that, even if unbiased, has a very high variance. For this reason, even though this method is potentially able to provide better results compared to Q-Learning in terms of the learned policy, it surely requires more learning time.

The implementation of PG was carried out by firstly randomly initializing the parameters of the policy π_θ and then sampling a trajectory (i.e., collecting all the transitions (s_t, a_t, s_{t+1}, r_t) within a single episode). The logarithm of the action probabilities, as well as the rewards collected along the trajectory, were accumulated and used to calculate the policy's gradient according to (10). Finally, the parameters were updated following the direction of improvement indicated by the gradient (cf. (9)), as shown in the sketch below.
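A minimal PyTorch sketch of this episodic update, assuming a discrete set of laser-power levels and a small fully connected policy head on top of the encoder output (sizes and the learning rate are illustrative, not the values used in this work):

import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 256, 5                           # assumed encoded-state size and power levels
policy = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                       nn.Linear(128, N_ACTIONS), nn.LogSoftmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def select_action(state):
    """Sample a laser-power index from the current stochastic policy pi_theta."""
    with torch.no_grad():
        probabilities = policy(state).exp()
    return int(torch.multinomial(probabilities, 1))

def policy_gradient_update(states, actions, rewards):
    """One update per episode: gradient ascent on (9)-(10) via a surrogate loss."""
    log_probs = policy(torch.stack(states))                            # log pi_theta(.|s_t), shape (T, |A|)
    taken = log_probs.gather(1, torch.tensor(actions).unsqueeze(1)).squeeze(1)
    episode_return = float(sum(rewards))                               # total reward of the trajectory, (8)
    loss = -(taken.sum() * episode_return)                             # descending this loss ascends J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

During welding, select_action samples the laser-power index at each 20 ms step; after each 10 mm line, the stored states, actions, and rewards of that episode are passed to policy_gradient_update.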


FIGURE 3. Performance in terms of average reward per episode over time for Q-Learning and Policy Gradient. The red line represents the average reward over an episode, whereas the shaded area indicates the standard deviation. An episode corresponds to the weld of a 10 mm line and has a duration of 1 s. Between one line and the next, we wait for 10 s to permit the agent to update its parameters and to allow the stage to move to a new unprocessed position.

IV. RESULTS AND DISCUSSION

A. RESULTS
Prior to starting the interaction with the environment, the preparation of the algorithm included two stages, namely: i) collection of the signal database for training the classifier and the encoder, and ii) definition of a reward function.

The first step is motivated by the fact that the classifier and the encoder — to fulfill the role of guiding the smart agent during its learning process — have to learn to recognize not just the reference quality, but also several other counter-examples. For this reason, we collected the acoustic and optical signals from multiple weld experiments at various laser powers (20, 40, 60, 80, and 120 W). It must be emphasized that, theoretically, the weld quality depends not only on the laser power but also on the workpiece velocity and its physical properties, such as the optical and thermal ones [10]. But in this work, since the latter factors were invariable, the former one is used to define the weld quality. The sensors' signals were acquired during three weld experiments at each laser power, then partitioned in samples of 20 ms (see Section III-D for details), and finally grouped in 5 categories according to the weld quality in terms of penetration depth identified via optical inspection of both the surface and the cross-section of the workpieces. Based on the optical inspection, the categories were defined as insignificant penetration (achieved with a laser power of 20 W), poor penetration (40 W), medium penetration (60 W), highest penetration without pores (80 W), and porosity (120 W). In total, each category consisted of 150 samples.

The second stage concerns the definition of the reward function that determines the reward assignment from the feedback network to the smart agent. Considering that the agent is designed to act so as to maximize the collected rewards in the long run, the engineering of the reward is crucial since it influences the learning process. The reward assigned for every weld quality detected by the classifier used in our experiments is reported in Table 1.

TABLE 1. Rewards assigned for every category detected by the classifier.

After the preparation, we let the algorithm interact with the environment in a completely autonomous way without any further interventions. The performance for both Q-Learning and Policy Gradient is shown in Fig. 3, where the red line represents the average values of the rewards obtained in every episode, whereas the shaded area denotes the standard deviation. The average reward of Q-Learning reached a plateau after approximately 110 episodes, i.e., after performing 110 line welds of 10 mm. Taking into consideration the fact that we wait for 10 s after each line — to permit the agent to update its parameters and to allow the stage to move to a new unprocessed position — this learning period corresponds to about 20 minutes. In contrast, PG reached a plateau only after 180 episodes (33 minutes). In both cases, additional learning time had little effect in terms of increment of the quality, and it only increased the cost in terms of wasted materials and time.

The dynamics of the agent's adaptation to the given process can be vividly seen in the evolution of the welds using optical inspections of the surfaces and cross-sections of the processed material. Fig. 4 presents the optical images of the welds corresponding to the first, the 40th, the 80th, and the 110th episode of the Q-Learning training process. To be specific, Fig. 4 (a) shows the light microscope images of the top views of different episodes, whereas Fig. 4 (b) shows the corresponding cross-sections.

It has to be noted that the results in Fig. 4 show an evolution of the weld quality that is consistent with the increment of the reward observed in Fig. 3. Indeed, in Fig. 4 (a), episode 1 — i.e., the beginning of the training — signs of unstable controlled laser power can be seen on the weld surface. The black marks on the weld correspond to oxidation, which is also an indication of local overheating due to inaccurate laser control, leading to a poor weld quality in terms of mechanical properties [12]. This aspect is even more evident from the cross-sections (Fig. 4 (b), episode 1), which are characterized by rapid variations of the weld penetration depth along the line. In this specific case, the local overheating of the material was taking place due to the application of a too high level of laser power, generating a highly unstable keyhole that led to the trapping of pores inside the material during the keyhole collapse [10]. The red arrows highlight the pore locations in the magnification in Fig. 4 (b).

After 40 trials, i.e., about 7 min from the beginning of the training (Fig. 4, episode 40), the welds started to be characterized by smoother changes in surface textures and penetration depth. Confirming the positive trend, significant signs of progress are obtained after performing 40 more welds (Fig. 4 (a), episode 80, about 15 min from the beginning), when the texture of the weld surface started to present no perceivable non-uniformities. Nevertheless, some fluctuations in the penetration depth can still be observed (Fig. 4 (b), episode 80). Finally, a weld comparable to the reference one was only achieved after the completion of 30 more episodes — see Fig. 4 (a), episode 110 (about 20 min from the start), when the welds began to be characterized by uniform surface texture and constant penetration depth. Fig. 4 (c) also shows the light microscope images of the cross-sections for the trained controlled and reference welds, respectively. As described in Section II-D, the latter was realized after an exhaustive search of the laser parameters and achieved a weld depth of 150 µm, as shown in Fig. 4 (c), top image. As can be noticed, no measurable differences between the trained controlled weld and the reference one can be found. Similarly, PG showed identical results apart from a different convergence rate. Indeed, the convergence took about 1.6 times more time compared to Q-Learning (see Fig. 3).

B. DISCUSSION
While the classifier is of unquestionable fundamental importance, as it allows the monitoring of the process, the use of the encoder, on the other side, is debatable. The encoder has indeed some pros and cons that were not obvious before the experiments. As stated in Section III-A, its advantages consist of an effective reduction of the state space dimensionality that potentially simplifies the search for the optimal parameters of the smart agent by capturing a proper parametrization of the signal that can focus only on quality-critical events. In contrast, its drawbacks derive from its output representation, which may not be entirely suited for deriving the dynamics of the system, as its temporal resolution is non-uniform [47]. As a result, the sensitivity of the algorithm to some actions could be reduced, potentially leading to poor process control.

For the sake of verifying the effectiveness of the encoder, we have also tried to exclude it from the processing pipeline and directly provide the high-dimensional raw signals from the sensors as input to the agent. This resulted in a marginally slower convergence rate in terms of the number of episodes (in the order of tens of episodes), but the two strategies were able to achieve the same results. We believe that this behavior can be explained by the very first convolutional layer of the agent (see Fig. 2) that, if provided with raw signals, can take over the encoder's duty to deliver a good signal representation to the following layers. However, when excluding the encoder, the computations were slowed down due to the larger input quantities, and we had to increase the time between each episode.

It also has to be mentioned that the present work was realized using a well-controlled laboratory environment and with reliable custom equipment. These controlled conditions provided a more reproducible laser-material interaction during the welds, as they included the processing of always the same material with consistent material properties as well as flat surfaces with identical surface roughness.


FIGURE 4. Training dynamics of the Q-Learning algorithm in terms of welding quality. (a) light microscope pictures of the top
view of the welded surface at discrete time points of the algorithm’s training; (b) corresponding light microscope pictures of
the cross-section of the welds from (a). The magnification for the first episode is shown on the right. The red arrows indicate
the pores inside the material; (c) reference weld and controlled weld after the completion of the training procedure. The
numbering of the episodes started from the beginning of the training procedure and is indicated on the vertical axis. The arrow
at the bottom shows the direction of the laser scan. The white borders denote the boundary of the weld. The deep weld
penetration at the beginning of each line constitutes the initial condition from which the algorithm needs to regulate the
power.

The well-controlled environment could also be the reason for the small size of the database needed to train the encoder and classifier, and this detail may be significantly different in industrial conditions.

V. CONCLUSIONS

This work presents the first results of a study for adaptive closed-loop control of laser welding based on RL applied on a real-life setup. The developed system includes an encoder that derives efficient representations from the sensory input for the active unit, a feedback network, and a smart agent, which is the active unit itself, that can influence the laser process. The principle of operation is the following: based on the current sensory input provided by the encoder, the agent chooses an action, which leads to a change of its sensory input, and receives a reward — an indirect quality measure of the state the agent ends up in. From this experience — made up of the past sensory input, the executed action, the current input, and the received reward — the agent tries to optimize the outcomes of its actions over time.

In standard RL approaches, the reward signal is provided by the environment and is straightforward to derive. In laser welding, conversely, effective feedback is challenging to provide, as the process is only partially observable, since in-depth information of the PZ can be obtained only indirectly from conventional sensors. This reason motivates the introduction of the feedback network: a complete monitoring system based on a DCNN classifier capable of tracking the weld quality in real-time.

In the present work, the control unit was implemented to regulate the output laser power while using the acoustic and optical emission as sensory input. The potential of the system was demonstrated by its capability — without prior knowledge of the process dynamics — to reach a reference weld quality autonomously. The latter was chosen to be represented by the weld with the highest depth achievable without porosity in a Ti grade 5 workpiece, to meet the industrial demand for high-quality keyhole welding. This reference weld was determined experimentally and attained a weld depth of 150 µm without porosity with a laser power of 80 W.

To guide the smart agent, the feedback network and the encoder were trained to recognize not just the reference quality, but also several other counter-examples. For this reason, we collected the acoustic and optical signals from 15 weld experiments at various laser powers, namely 20, 40, 60, 80, and 120 W. The signals were then grouped in 5 categories according to the corresponding weld quality in terms of penetration depth, which were identified via optical inspection of both the surfaces and the cross-sections of the workpieces, and further partitioned in samples of 20 ms. This time span was chosen by taking into consideration the requirement of very high classification accuracy and a computation time within the range of 1–5 ms.

After the DCNN classifier and the encoder were trained, the smart agent started its interaction with the laser process by performing line welds with the output laser power being controlled autonomously. We tested two learning schemes — Q-Learning and Policy Gradient — and evaluated their performance both in terms of the evolution of rewards over time and of the resulting weld quality. The training time needed for the two algorithms to reach the reference quality was 20 minutes and 33 minutes, respectively. After that time, there was no additional observable increment of weld quality and rewards.

The present results demonstrate the ability of RL to learn a control law for laser welding processes autonomously.


This prospect is highly appealing for the industrial sector [17] Y. Bengio, ‘‘Learning deep architectures for AI,’’ Found. Trends Mach.
as the unit can deal with complex processes without costly Learn., vol. 2, pp. 1–27, Jan. 2009.
[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness,
simulation and computational tools. Furthermore, the sensor M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski,
technologies exploited in the present work are commercially S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran,
available and ready for industrial implementation. It must be D. Wierstra, S. Legg, and D. Hassabis, ‘‘Human-level control through
deep reinforcement learning,’’ Nature, vol. 518, no. 7540, pp. 529–533,
emphasized that the proposed framework can also operate Feb. 2015.
with other feedback sensor signals — pyrometer, micro- [19] J. Günther, P. M. Pilarski, G. Helfrich, H. Shen, and K. Diepold, ‘‘Intelli-
phones, or additional photodiodes — making it a rather gent laser welding through representation, prediction, and control learning:
An architecture with deep neural networks and reinforcement learning,’’
versatile tool. Further experiments are planned to explore the Mechatronics, vol. 34, pp. 1–11, Mar. 2016.
potential of this approach on more complex conditions, e.g., [20] C. J. Watkins and P. Dayan, ‘‘Technical note: Q-learning,’’ Mach. Learn.,
with surface irregularities or at the interface between two vol. 8, nos. 3–4, pp. 279–292, May 1992.
different materials. Additionally, we will increase the number [21] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, ‘‘Policy gradient
methods for reinforcement learning with function approximation,’’ in Proc.
of control variables, including the workpiece velocity and its 12th Int. Conf. Neural Inf. Process. Syst. Cambridge, MA, USA: MIT
distance from the laser source. Finally, the RL algorithms will Press, 1999, pp. 1057–1063.
be further enriched with techniques for faster convergence, [22] L. Bassi, ‘‘Industry 4.0: hope, hype or revolution?’’ in Proc. IEEE 3rd Int.
Forum Res. Technol. Soc. Ind. (RTSI), Sep. 2017, pp. 1–6.
higher operating frequency, better adaptation under changing [23] T. Le-Quang, S. A. Shevchik, B. Meylan, F. Vakili-Farahani,
materials, and varying noise levels. M. P. Olbinado, A. Rack, and K. Wasmer, ‘‘Why is in situ quality


GIULIO MASINELLI (Member, IEEE) received the B.Sc. degree in electrical engineering from the University of Bologna, Italy, in 2017, and the M.Sc. degree in electrical engineering (with data science specialization) from the Swiss Federal Institute of Technology in Lausanne (EPFL), Lausanne, Switzerland, in 2019. He is currently pursuing the Ph.D. degree with the Swiss Federal Laboratories for Materials Science and Technology (EMPA) and EPFL, mainly developing machine learning algorithms for data analysis and industrial automation. His research interests include signal processing and machine learning, with emphasis on deep learning.

TRI LE-QUANG received the B.S. degree in applied physics from Vietnam National University, Ho Chi Minh City, Vietnam, in 2007, the M.Sc. degree in optics from the Friedrich-Schiller-Universität Jena, Germany, in 2013, and the Ph.D. degree in materials engineering from the Instituto Superior Tecnico Lisboa, Portugal, in 2017. Since 2017, he has been working as a Postdoctoral Researcher with EMPA, Swiss Federal Laboratories for Materials Science and Technology, Laboratory of Advanced Materials Processing. His research interests include laser material processing, laser technology, and in situ monitoring.

SILVIO ZANOLI received the B.Sc. degree in electrical engineering from the University of Bologna, Italy, in 2017, and the M.Sc. degree in electrical engineering (with data science and IoT specialization) from the Swiss Federal Institute of Technology in Lausanne (EPFL), Lausanne, Switzerland, in 2019, where he is currently pursuing the Ph.D. degree in electrical engineering (with data science specialization). His research interests are in signal processing, machine learning, and the IoT, with particular attention to low-energy solutions.

KILIAN WASMER (Member, IEEE) received the B.S. degree in mechanical engineering from Applied University, Sion, Switzerland, and Applied University, Paderborn, Germany, in 1999, and the Ph.D. degree in mechanical engineering from Imperial College London, Great Britain, in 2003. He joined the Swiss Federal Laboratories for Materials Science and Technology (EMPA), Thun, Switzerland, in 2004, to work on the control of crack propagation in semiconductors. He currently leads the Group of Dynamical Processes, Laboratory for Advanced Materials Processing (LAMP). His research interests include materials deformation and wear, crack propagation prediction, and material-tool interaction. In recent years, he has focused his work on the in situ and real-time observation of complex processes using acoustic and optical sensors in fields such as tribology, fracture mechanics, and laser processing. He serves on the director committee for additive manufacturing of Swiss Engineering. He is also a member of Swiss Tribology, the European Working Group of Acoustic Emission (EWGAE), and Swissphotonics.

SERGEY A. SHEVCHIK received the M.Sc. degree in control from the Moscow Engineering Physics Institute, Russia, in 2003, and the Ph.D. degree in biophotonics from the General Physics Institute, Russia, in 2005, where he stayed as a Postdoctoral Researcher until 2009. From 2009 to 2012, he was with the Kurchatov Institute, Russia, developing human–machine interfaces. In 2012 and 2014, he was with the University of Bern, investigating multi-view geometry. Since 2014, he has been with the Swiss Federal Laboratories for Materials Science and Technology (EMPA), working on industrial automation. His current interest is in signal processing.
