Computers and Chemical Engineering: Jay H. Lee, Joohyun Shin, Matthew J. Realff
Article history: Received 22 May 2017; Accepted 10 October 2017; Available online 13 October 2017.

Keywords: Machine learning; Deep learning; Reinforcement learning; Process systems engineering; Stochastic decision problems

Abstract

Machine learning (ML) has recently gained in popularity, spurred by well-publicized advances like deep learning and widespread commercial interest in big data analytics. Despite the enthusiasm, some renowned experts of the field have expressed skepticism, which is justifiable given the disappointment with the previous wave of neural networks and other AI techniques. On the other hand, new fundamental advances like the ability to train neural networks with a large number of layers for hierarchical feature learning may present significant new technological and commercial opportunities. This paper critically examines the main advances in deep learning. In addition, connections with another ML branch of reinforcement learning are elucidated and its role in control and decision problems is discussed. Implications of these advances for the fields of process and energy systems engineering are also discussed.

https://doi.org/10.1016/j.compchemeng.2017.10.008
© 2017 Elsevier Ltd. All rights reserved.
Many researchers engaged in machine learning have promoted recent developments like Deep Learning as revolutionary inventions that will have a transformative effect. According to Andrew Ng at Stanford, "I have worked all my life in Machine Learning, and I've never seen one algorithm knock over benchmarks like Deep Learning." Geoffrey Hinton at Google boasts, "Deep Learning is an algorithm which has no theoretical limitations of what it can learn; the more data you give and the more computational time you provide, the better it is." On the other hand, some notable experts of the field have expressed concerns and skepticism over what they see as yet another wave of overhyped emerging technologies (Gomes, 2014). They point out that the idea of backpropagation is not new and that recent progress has not involved any major new conceptual breakthroughs, but rather a series of refinements of ideas that already existed in the 1970s and 1980s. In addition, the often-used parallel of Deep Learning with the human brain is a gross oversimplification. Deep Learning is just a tool which is successful in certain, previously very challenging domains like speech recognition, vision, and natural language processing. In addition, big data with a large number of features naturally leads to a very large hypothesis set, and overfitting (leading to many false positives) is inevitable; applications of ML techniques to big data without the theoretical backing of statistical analysis and validation are bound to fail.

In this paper, we will make a critical assessment of the recent developments in machine learning, Deep Learning and reinforcement learning in particular. We will see what motivates the use of a deep network architecture, why it has not seen use until now, and what recent progress has been made that enables its current success. We will also examine the impact of the recent developments in reinforcement learning using the example of Alpha-Go, which is a recent success story in automated game playing that caught the world's attention. We will then examine how these developments may impact the fields of process and energy systems engineering and point out some promising directions for their application.

The rest of the paper is organized as follows. In Section 2, we present a critical assessment of Deep Learning, including its motivation, early problems, and some recent resolutions as well as remaining challenges. In Section 3, the idea of reinforcement learning and its history are briefly introduced and the specific ML approaches used to train Alpha-Go are discussed. Some similarities between the Go game and the multi-stage decision problems in industries are brought out, pointing to the practical potential of the overall approach. In Section 4, some potential applications in the process and energy systems engineering domains are introduced. Naturally, this section carries the authors' own bias and opinions. Section 5 concludes the paper with a summarizing perspective.

2. Deep learning

2.1. Historical review

The neural network community reached the conclusion in the early 1990s that training multi-layered networks with backpropagation, or really any gradient-following algorithm, was essentially impossible, and the solutions obtained with deeper neural networks starting from random initialization performed worse than the solutions obtained for networks with 1 or 2 hidden layers (Bengio et al., 2007; Tesauro, 1992). Specifically, it was shown that the weights in multi-layer networks would tend to shrink to zero or grow without bound (Hochreiter, 1991) and that the networks would have a very high ratio of saddle points compared to local minima (Bengio et al., 1994). Confining neural networks to shallow architectures essentially meant that features had to be coded before the network input, or the specific kernel-type mappings had to be embedded into the shallow architecture. Time series data was particularly difficult to handle. There was a clear need to find an approach where the learning of features and the learning of the mapping of the features to classification or action could somehow be separated, as combinations of these two learning tasks led to poor performance.

There was a breakthrough in 2006 with regard to how to train multi-layer networks. The breakthrough was inspired by the concept of hierarchical feature extraction, where each "layer" of the network robustly extracted features based on the output representation of the layer closer to the data. This was said to be biologically motivated by the architecture of visual image processing. The training was essentially unsupervised (Hinton et al., 2006). One architecture was particularly successful, the Convolutional Neural Network (CNN) (LeCun et al., 1990; LeCun et al., 1998). This led to the term "deep learning," where "deep" referred to the number of layers in the network (LeCun et al., 2015).

Deep networks are therefore a neural network architecture in which the lower layers of the network engage in feature extraction and the last layer then maps these features to classification or action. This allows for the concept of basic features being combined into more complex features, and for enabling certain important invariances in features to be embedded into the way a layer operates on the inputs from the layer below (Hinton et al., 2006; LeCun et al., 2015). The stage was now set for dramatic improvement in multi-layer network performance, but there was one more innovation that had an important synergy: hardware development.

In addition to the advance in the layer-by-layer training of the network, the use of GPU computational architectures to carry out the training led to an order-of-magnitude speed-up. The combination of hardware and the explosion of data of various types, particularly visual images and text, provided an additional boost to the effectiveness of deep neural network learning. This has led to some substantial successes in image recognition, handwriting tasks, and in game playing (Mnih et al., 2013; Mnih et al., 2015). Game playing has proved to be a very fruitful area because a game has well-defined rules and hence can be simulated exactly; this allows for simulated play and hence the generation of data and the bootstrapping of performance.

PSE researchers engaged with neural networks in the 1980s, but the efforts died out in the early 1990s due to precisely the problems that the AI community had identified. There have been attempts to use neural network approaches for process control (Saint-Donat et al., 1991; Ydstie, 1990) and fault detection (Naidu et al., 1990). PSE researchers recognized some of the features of these approaches and made significant contributions: notably the auto-associative neural network (Kramer, 1992) and Wave-net (Bakshi and Stephanopoulos, 1993). The auto-associative neural network, which is composed of several layers (mapping, bottleneck and de-mapping), is a realization of a general nonlinear PCA (Kramer, 1991). These auto-associative structures can be stacked, such that the output of one is the input to the next triplet of layers, leading to the extraction and recombination of successively higher-level features, which is the general approach of deep learning networks. The hierarchical multiresolution wavelet-based network, named Wave-net, was first introduced by the chemical engineering community (Bakshi and Stephanopoulos, 1993), where the wavelet transform played the role of the feature extraction and the last layer combined the features.
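To make the mapping-bottleneck-de-mapping idea concrete, the following is a minimal sketch of an auto-associative network trained to reproduce its input through a low-dimensional bottleneck, in the spirit of the structure described above. It is illustrative only: the layer sizes, the synthetic data, and the use of PyTorch with an Adam optimizer are our own assumptions, not details taken from the cited works.

```python
import torch
import torch.nn as nn

# Auto-associative network: mapping -> bottleneck -> de-mapping layers, trained to
# reproduce its own input. The bottleneck activations then play the role of
# nonlinear principal component scores.
class AutoAssociativeNet(nn.Module):
    def __init__(self, n_inputs, n_mapping=16, n_bottleneck=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_inputs, n_mapping), nn.Tanh(),   # mapping layer
            nn.Linear(n_mapping, n_bottleneck),          # bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_bottleneck, n_mapping), nn.Tanh(),  # de-mapping layer
            nn.Linear(n_mapping, n_inputs),                 # reconstruction
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Synthetic data lying near a 2-D nonlinear manifold (stand-in for process data).
torch.manual_seed(0)
t = torch.rand(500, 2)
X = torch.cat([t, torch.sin(3 * t), t ** 2], dim=1) + 0.01 * torch.randn(500, 6)

model = AutoAssociativeNet(n_inputs=X.shape[1])
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for epoch in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X), X)   # reconstruction error
    loss.backward()
    opt.step()

scores = model.encoder(X)   # nonlinear "principal component" scores
print(scores.shape, float(loss))
```

Stacking several such triplets, with the bottleneck output of one feeding the next, reproduces the layered feature extraction and recombination discussed above.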
Fig. 1. The structure of a shallow neural network (left) and a deep neural network (right).
The PSE community continued to work on regression approaches that can be seen as shallow-architecture learning. Kernel PCA, LLE, etc. are all examples of this (Schölkopf and Smola, 2002). The main issue is whether the mapping has essentially a local field or a global field. For example, Gaussian kernels are local in the sense that the weights die away the further you are from the mean, which is often taken to be placed on the data itself (Bengio et al., 2006). There are problems with this, as it can be shown that although shallow architectures can in theory learn the same functions as deeper architectures, they do so at the expense of potentially requiring exponentially larger numbers of internal nodes (Bengio, 2009). A shallow network can be written as

f(x) = \sum_{e=1}^{E} \phi(x, \theta_e)\,(a_e^T x + b_e)    (1)

The basis function computes the weighting for each linear regression model and hence captures local weighted linear regression, or, with a_e^T = 0, this is a weighted basis-function network. Shallow networks can be conceptualized as using a combination of feature extraction, through the basis function or linear regression, and then weighting those features. Often the success of shallow networks has come through either already having appropriate features extracted and used as inputs to a relatively simple network that sums and thresholds the result, or through having basis functions that are effective for the feature extraction. For example, if θ_e = x_i, with x_i defined as one example of the training set, then φ(x, θ_e) = K(x, x_i) computes the kernel operation and efficiently allows high-dimensional feature spaces to be employed (Schölkopf and Smola, 2002). The training then adjusts the weights associated with each kernel function and we recover the Support Vector Machine (SVM) representation and training approach.

A problem with shallow networks is that domain-specific features require expertise and insight into the task to be extracted (hand-crafted), and they can require an exponentially larger number of nodes in one fewer layer to represent a function (Bengio and LeCun, 2007) compared to a deeper network. This means that the number of nodes, and hence parameters, is much greater for the shallower network, and the generalization properties can be expected to be poorer. On the other hand, deep networks can organize and store features in a hierarchical manner so as to allow reuse of those features that occur multiple times. Therefore, contrary to the common misconception, the number of nodes and parameters needed to represent a function can be much smaller for a deep network compared to a shallow network (Fig. 2). This means there is an optimal depth for each given function.

A second, unrelated issue is that most kernel functions embody a distance metric over the training set examples and hence are inherently local. The need for non-local feature extraction, and the inefficiency of shallow network architectures, motivates the use of deeper networks (Bengio, 2009), along with biomimicry of biological neural networks that appear to have layers in the range of 5–10 just for visual processing. The problem, however, is that deeper networks have more complex and non-convex error functions, which makes them potentially harder to train with standard backpropagation and other optimization techniques.
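To make Eq. (1) concrete, the short sketch below evaluates the shallow architecture with Gaussian basis functions centered on the training points; the basis choice, widths, and random parameters are illustrative assumptions rather than a prescription. Setting a_e = 0 recovers the weighted basis-function (RBF) network mentioned above, while nonzero a_e gives local weighted linear regression.

```python
import numpy as np

def gaussian_basis(x, center, width=1.0):
    """phi(x, theta_e) with theta_e = (center, width): a local, kernel-type basis."""
    return np.exp(-np.sum((x - center) ** 2) / (2.0 * width ** 2))

def shallow_net(x, centers, a, b, width=1.0):
    """Eq. (1): f(x) = sum_e phi(x, theta_e) * (a_e^T x + b_e)."""
    return sum(
        gaussian_basis(x, c, width) * (a_e @ x + b_e)
        for c, a_e, b_e in zip(centers, a, b)
    )

rng = np.random.default_rng(0)
X_train = rng.uniform(-2, 2, size=(20, 3))   # basis centers placed on the data itself
E, d = X_train.shape

a = rng.normal(size=(E, d))                  # slopes of the local linear models
b = rng.normal(size=E)                       # offsets / basis weights

x_new = rng.uniform(-2, 2, size=d)
print("local weighted linear regression:", shallow_net(x_new, X_train, a, b))
print("weighted basis-function (RBF) net:", shallow_net(x_new, X_train, np.zeros_like(a), b))
```

Because each Gaussian term dies away far from its center, every part of the input space must be covered by its own basis functions, which is exactly the locality and node-count issue raised in the text.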
2.3. Deep networks – early problems and resolution

In the early 1990s the theoretical underpinnings of the reason behind the difficult training problem were established, a summary of which can be found in Hochreiter and Schmidhuber (1997). Essentially this demonstrated that the error propagated backward a significant distance in the network either decayed to zero, making training very slow, or blew up, making the training unstable. Further theoretical analysis (Dauphin et al., 2014) showed that, for non-convex error functions in high-dimensional spaces, naïve gradient descent was hampered by the occurrence of large numbers of saddle points, not local minima, with high error plateaus. This makes naïve gradient descent slow and gives the illusion of having found a local minimum. This type of error surface has been demonstrated to exist in multilayer neural network architectures (Baldi and Hornik, 1989). An alternative view of the difficulty of training deep networks is presented in Erhan et al. (2010), where the difference in the objectives of training the layers was proposed as the key limitation. The layers nearest the data are posited to be extracting features, whereas the upper layers are combining those features for the final task. These two different objectives combine poorly during training through back-propagation, and in particular the upper layers may over-fit the training set, leading to poor generalization performance.

The poor practical performance of the multi-layered neural network (MLNN) architecture discouraged applications in the 1990s and early 2000s. However, there was a breakthrough in training multilayer architectures that occurred in 2006 (Hinton et al., 2006). This approach, called Deep Belief Networks (DBN), used unsupervised "pre-learning" of each layer, using a Restricted Boltzmann Machine (RBM), to extract useful features that were then fed to the next layer (Fig. 3). Here each layer is trained such that it reproduces its input on its output but compresses it through the node functions/weights (i.e., there are not enough weights, or the algorithm is specifically designed to avoid finding the identity function). This connects to early chemical engineering literature and specifically Mark Kramer's work (Kramer, 1992) on recognizing the connection between Hopfield-type networks (Hopfield, 1982) and PCA. The successively trained layers were then fine-tuned by the traditional supervised approach. The success of this technique has subsequently spurred successful applications of deep learning architectures to problems such as image classification (Huang et al., 2012; Lee et al., 2009), playing video games (Mnih et al., 2015) and the game of Go (Silver et al., 2016).
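The layer-by-layer procedure can be sketched as follows. Note that the original DBN work pre-trains each layer as a Restricted Boltzmann Machine using contrastive divergence; the sketch below substitutes a simple reconstruction-trained (autoencoder-style) layer purely to keep the example short, and the data, layer sizes, and optimizer settings are our own assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 20)                     # unlabeled inputs (synthetic stand-in)
y = (X[:, :5].sum(dim=1) > 0).long()         # labels used only in the fine-tuning step

sizes = [20, 12, 6]                          # two feature-extracting layers
encoders, data = [], X

# 1) Unsupervised, layer-by-layer "pre-learning": each layer is trained to reproduce
#    its own input through a compressing encoder (autoencoder stand-in for an RBM).
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    enc = nn.Sequential(nn.Linear(n_in, n_out), nn.Tanh())
    dec = nn.Linear(n_out, n_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(data)), data)
        loss.backward()
        opt.step()
    encoders.append(enc)
    data = enc(data).detach()                # this layer's output feeds the next layer

# 2) Supervised fine-tuning: stack the pre-trained layers, add a classifier on top,
#    and train the whole network with back-propagation.
model = nn.Sequential(*encoders, nn.Linear(sizes[-1], 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(X), y)
    loss.backward()
    opt.step()
print("fine-tuned training loss:", float(loss))
```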
The success of the pre-learning phase of deep learning was explored experimentally in Erhan et al. (2010). This proposes that pre-learning finds features that are predictive of the main variations in the input space, i.e., P(X), and that learning P(X) is helpful for learning the discriminative P(Y|X). Erhan observed that if the number of nodes in the early layers was not large enough then the pre-learning made the error worse, suggesting that although features may have been extracted, there were not enough of them to form good predictors for the final network output. The explicit focus on learning P(X) separates the objective of learning good features from the objective of learning how those features map to good classification. It has also been demonstrated that learning generative models, P(X,Y), has better generalization error than discriminative learning. This connects deep learning to shallow learning architectures where the input to the shallow network has undergone pre-processing to find robust features of the raw input space, such as PCA.

The recurrent neural network (RNN) architecture (Tsoi and Back, 1997) has been proposed as a way to learn models of time series. The main idea is that RNNs are like deep NNs because we can think of each iteration of the recurrence as if it were another layer in the network. Therefore, similar to multilayer neural networks, training RNNs has been hard using back-propagation and there are some fundamental limitations (Zhang et al., 2014). The key difficulty is discovering long-term dependencies, and the fixed points of the dynamic system are hard to converge to. Bengio et al. (1994) demonstrate that a system can be efficiently trainable by gradient descent but not simultaneously stable and resistant to noise, for a simple task that requires information to be stored for an arbitrary duration. This results from the fact that storing information resistant to noise makes the gradient with respect to past events small, and that small changes in the weights are felt by events that are close to the current time.

2.4. Extensions and open challenges

However, deep architectures can only go so far. When the dimension of the raw input space becomes too large, or the time delays that occur between the important information and the need to use it become long, we need to have a yet further refinement to the architecture that enables attention to be shifted to different parts of the input space (Ba et al., 2014), and to have long-term memory elements (Hochreiter and Schmidhuber, 1997).

The idea of shifting attention to different parts of the input space is essentially the exploration vs exploitation tradeoff that is seen in many problems. We need to balance looking hard in the current location for the solution versus moving to another part of the space in which we may find it. Exploitation vs exploration is a meta-level algorithmic component which must be explicitly designed into the agent, as essentially it implies a change in the objective function of the problem. If you have an optimization formulation, you cannot do this within the formulation; you have to do it by some procedure outside of the optimization algorithm itself. It is essentially saying "we need to change behavior because our current behavior is inappropriate to the situation." These meta-algorithmic decisions make us uncomfortable.

The implementation of long-term memory gates information into an element that preserves it, and the information must then be gated out when the conditions in the environment indicate that it needs to be used. This is a meta-architecture component. The inclusion of memory elements into the architecture of the neural network is something that has to be selected, similar to deciding to include convolution and down-selection layers in the network. These hyper-parameter and structural decisions make us uncomfortable.

3. Reinforcement learning

3.1. Introduction and history

Supervised or unsupervised learning methods learn to perceive/encode/predict/classify patterns, but they do not learn to act or make decisions. Reinforcement learning (RL), on the other hand, by actively interacting with an uncertain and dynamic environment, learns an optimal decision policy mapping the system state to the optimal action. RL does this by (1) observing the real-time responses of the environment when random or non-optimal actions are taken and (2) learning, either implicitly or explicitly, the cost associated with the given state (and possibly the action). The typical setting is one of dynamic decision-making, meaning a series of decisions are to be made in a dynamic, and possibly uncertain, environment to minimize overall cost (e.g., the sum of stage-wise costs). Such sequential decision-making problems are found in various fields such as automatic control, scheduling, planning, and logistics. This is a popular subject in PSE, as most problems in the control, real-time optimization, process scheduling, and supply chain operation areas share these features.

RL differs from SL in that it uses evaluative feedback from the environment to estimate real-valued rewards or costs, versus the instructive feedback used in SL through classification errors, and input events affect decisions made at subsequent times and the resulting output events, which is inherent in the nature of a dynamical system. An iterative step of RL is generally composed of two steps: policy evaluation, wherein the long-term consequences of current decisions are characterized by a critic, followed by policy improvement.
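For a small, fully specified MDP, this two-step iteration can be written down directly. The sketch below is a toy example with made-up transition probabilities and costs (not a model from this paper): the evaluation step plays the role of the critic, and the improvement step acts greedily against the evaluated values.

```python
import numpy as np

# Toy MDP: 3 states, 2 actions, known transition probabilities P[a, s, s'] and
# stage-wise costs U[s, a]; gamma is a discount factor (all numbers are made up).
P = np.array([[[0.8, 0.2, 0.0], [0.1, 0.7, 0.2], [0.0, 0.3, 0.7]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.2, 0.0, 0.8]]])
U = np.array([[1.0, 2.0], [0.5, 0.3], [2.0, 0.1]])
gamma, n_states, n_actions = 0.9, 3, 2

policy = np.zeros(n_states, dtype=int)
while True:
    # Policy evaluation (the critic): solve V = U_pi + gamma * P_pi V for the current policy.
    P_pi = P[policy, np.arange(n_states), :]
    U_pi = U[np.arange(n_states), policy]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, U_pi)

    # Policy improvement: act greedily (minimize cost) against the evaluated values.
    Q = U + gamma * np.einsum('ast,t->sa', P, V)
    new_policy = Q.argmin(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy

print("optimal policy:", policy, "state values:", np.round(V, 3))
```

Both steps use the transition probabilities explicitly, which is exactly what the model-free formulation discussed next dispenses with.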
Fig. 3. A typical DBN with one input layer and N hidden layers H_1, ..., H_N. The layer-wise reconstruction is implemented by a family of restricted Boltzmann machines (RBMs).

All these require a model of the system dynamics, but in many real problems the system dynamics may be incompletely known and difficult to identify empirically. To resolve this issue, a value function with the argument of a state-action pair, known as the Q (quality) function, was introduced (Watkins, 1989):

Q^*(X_t, u_t) = U(X_t, u_t) + \min_{u'} \mathbb{E}\left[ Q^*(X_{t+1}, u') \mid X_t, u_t \right]    (5)

\pi^*(X_t) = \arg\min_{u_t} Q^*(X_t, u_t)    (6)

Therefore, by introducing a more general form of the value function taking the additional argument (Werbos, 1992), model-free designs can be established which directly learn the optimal policy using observed data {X_t, X_{t+1}, U(X_t, π(X_t))}. Werbos called this action-dependent HDP (ADHDP), which is better known as Q-learning in the RL community. The NN used to approximate the Q function can itself be made deep, which is called a 'deep Q-network' (DQN) (Mnih et al., 2015). A DQN uses a representation generator capturing features from the observed states as well as an action scorer to produce scores for all actions and argument objects.
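A minimal tabular illustration of this model-free idea is sketched below: the environment's transition probabilities are used only to generate simulated experience, never by the learner, and the Q table is updated toward the sampled version of Eq. (5). The toy costs, the added discount factor, and the ε-greedy behavior policy are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha, eps = 4, 2, 0.95, 0.1, 0.2
U = rng.uniform(0.0, 2.0, size=(n_states, n_actions))             # stage-wise costs (made up)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # unknown to the learner

Q = np.zeros((n_states, n_actions))
x = 0
for t in range(20000):
    # epsilon-greedy behavior policy keeps every state-action pair visited
    u = rng.integers(n_actions) if rng.random() < eps else int(Q[x].argmin())
    x_next = rng.choice(n_states, p=P[x, u])
    # model-free update toward U(x,u) + gamma * min_u' Q(x',u'), cf. Eq. (5)
    Q[x, u] += alpha * (U[x, u] + gamma * Q[x_next].min() - Q[x, u])
    x = x_next

policy = Q.argmin(axis=1)      # greedy policy of Eq. (6)
print("learned Q:\n", np.round(Q, 2), "\npolicy:", policy)
```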
Another important question in RL is which decisions the RL agent (the decision-maker or the controller) should take to quickly learn the uncertain information (e.g., the value function parameters) that governs the behavior of the resulting policy. This leads to the classical trade-off between exploration and exploitation. To assure the convergence of the Q-function, all the state-action pairs should be visited infinitely often (in an asymptotic sense), which is only satisfied when all actions have a nonzero probability of being selected by the decision policy during the training. That is, from the viewpoint of learning, decisions should be made for the purpose of attaining more valuable information about the model and uncertainty at hand ('exploration'). On the other hand, to achieve good results now, one should choose the best option based on the current knowledge ('exploitation'). Optimal balancing between the two is needed to achieve the best overall performance.

The Bayes-Adaptive MDP (BAMDP) (Duff and Barto, 2002; Martin, 1967) is an MDP problem formulation that incorporates the knowledge on the uncertain parameters or states (e.g., their probability distributions) as state variables (referred to as the knowledge or belief state to distinguish them from the regular state). The knowledge state may acquire extra information given the data observed so far, and evolve to the posterior belief distribution over the dynamics. Suppose that the uncertain cost U is multi-variate, normally distributed, and the knowledge state with n measurements is the current mean and covariance representing the beliefs, i.e., X^n = (θ^n, Σ^n). The state is then updated using the Bayesian rule after taking a decision u^n and observing a measurement obtained from the corresponding decision: X^{n+1} ← Bay(X^n, u^n, Û^{n+1}_{u^n}). The objective is to find a decision policy that minimizes the expected sum of the cost over a time horizon:

\min_{\pi}\ \mathbb{E}\left[ \sum_{n=1}^{N} U^{\pi,n}(X^n) \right]    (7)
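For the normally distributed case described above, the knowledge-state update X^{n+1} ← Bay(X^n, u^n, ·) has a simple conjugate form. The sketch below applies one such update for a single measured decision; the prior values and the assumption of a known measurement-noise variance are illustrative.

```python
import numpy as np

def bayes_update(theta, Sigma, u, y_hat, noise_var):
    """One conjugate-normal update of the knowledge state X^n = (theta, Sigma)
    after measuring the cost of decision u once (measurement noise variance known)."""
    e_u = np.zeros_like(theta)
    e_u[u] = 1.0
    denom = noise_var + Sigma[u, u]
    gain = Sigma @ e_u / denom
    theta_new = theta + gain * (y_hat - theta[u])
    Sigma_new = Sigma - np.outer(gain, Sigma[u])   # = Sigma - Sigma e_u e_u^T Sigma / denom
    return theta_new, Sigma_new

# Prior beliefs over the unknown costs of three decisions (numbers are made up).
theta = np.array([1.0, 1.2, 0.8])
Sigma = np.array([[0.5, 0.2, 0.0],
                  [0.2, 0.5, 0.1],
                  [0.0, 0.1, 0.5]])

# Take decision u = 2, observe a noisy cost measurement, and update the belief state.
theta, Sigma = bayes_update(theta, Sigma, u=2, y_hat=1.1, noise_var=0.3)
print("posterior mean:", np.round(theta, 3))
print("posterior covariance:\n", np.round(Sigma, 3))
```

Because the covariance couples the decisions, a single measurement of one decision also sharpens the beliefs about the correlated ones, which is the feature exploited by the correlated-belief policies discussed below.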
Optimal learning (better known as the dual control problem in the control community (Feldbaum, 1960)) techniques have been developed to explicitly optimize the knowledge evolvement through the input decisions by modeling a trade-off between the objective to maximize the current reward and that to increase the value of information (Powell and Ryzhov, 2012). There are several heuristic algorithms for balancing the objectives of exploitation and exploration. In the simplest case, one explores with probability ε and exploits with probability 1 − ε. Here, random decisions are chosen with uniform probability (the ε-greedy policy) or with weighted probability (Boltzmann exploration, or a soft-max policy). Interval estimation can also be used for tuning an interval of possible (or probable) values of an unknown population parameter, in contrast to point estimation.
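The two simplest exploration rules mentioned here can be stated in a few lines. The sketch below compares how often an ε-greedy rule and a Boltzmann (soft-max) rule select each of three decisions given the same cost estimates; the estimates, ε, and the temperature are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([1.4, 0.9, 1.1])        # current cost estimates for three decisions (made up)

def epsilon_greedy(q, eps=0.1):
    """With probability eps pick uniformly at random (explore), else exploit."""
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(q.argmin())

def boltzmann(q, temperature=0.5):
    """Soft-max exploration: lower-cost decisions are chosen more often, but every
    decision keeps a nonzero selection probability."""
    p = np.exp(-q / temperature)
    p /= p.sum()
    return int(rng.choice(len(q), p=p))

counts_eg = np.bincount([epsilon_greedy(q) for _ in range(10000)], minlength=3)
counts_bz = np.bincount([boltzmann(q) for _ in range(10000)], minlength=3)
print("epsilon-greedy selection frequencies:", counts_eg / 10000)
print("Boltzmann selection frequencies:     ", counts_bz / 10000)
```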
The Gittins index policy (Gittins, 1979) is known as the optimal policy for an infinite-horizon on-line problem with independent beliefs (diagonal covariance), known as the 'multi-armed bandit'. The Gittins index policy is to select the option having the highest Gittins index, which is a measure of the reward that can be achieved by a random process evolving from its present state onward towards a termination state, under the option of terminating it at any later stage with the accrual of the probabilistic reward from that stage up to the termination state. For evaluating the Gittins index, proper approximation architectures have been studied (Brezzi and Lai, 2002; Chick and Gans, 2009).

The knowledge gradient (KG) algorithm is a one-step roll-out policy, first introduced by Gupta and Miescke (1996) and further developed by Frazier et al. (2008). The main idea is to use the marginal value of information gained by a measurement. The knowledge gradient of each decision at time n is defined as the expectation of the incremental random value of the newly made measurement,

v_u^{KG,n} = \mathbb{E}_u^n\left[ \min_{u'} \theta_{u'}^{n+1} - \min_{u'} \theta_{u'}^{n} \right],

and the decision is then made by optimizing both θ_u^n and an exploration bonus computed from v_u^{KG,n}.

Unlike the index policies, under the KG policy the value of choosing decision u depends on the knowledge about all other decisions u' ≠ u. Thus the KG policy does not have the theoretical strength of most index policies, but it is more suitable for a specific problem class since it can capture the correlation between different decisions (Ryzhov et al., 2012). The exact algorithm for computing v_u^{KG,n} is specified in Frazier et al. (2008, 2009) for independent and correlated belief models, respectively. Instead of a lookup representation of a belief model, a parametric belief model has been used to deal with the curse of dimensionality, as in the case of ADP (Negoescu et al., 2011).

For general MDP problems, physical state or exogenous information variables should be considered as well as the knowledge state. With the view of the value function capturing all the effects of uncertainty, a Bayesian prior belief is introduced on Q or V itself, rather than on specific parameters. Dearden et al. (1998) proposed a model-free BAMDP algorithm with a Q-value distribution, called Bayesian Q-learning (BQN), and they extended this work to the case where there is an explicit model with uncertain parameters (Dearden et al., 1999). Based on these works, there have been studies to apply optimal learning techniques to DP as exploration strategies, involving the Bayesian DP (Strens, 2000) and the BEETLE algorithm (Poupart et al., 2006). Recently, Ryzhov and Powell (2010, 2011) adapted the KG policy to problems with a physical state, in which the total infinite-horizon discounted reward has a normal distribution, which leads to relatively simple belief states capable of capturing correlations.

3.2. Alpha-Go and its impact

In 1997, IBM's Deep Blue defeated the then reigning world chess champion Garry Kasparov by 3.5–2.5 in a six-game match. It marked the first time that a computer defeated a reigning world champion in a standard time-controlled chess match. Deep Blue took advantage of brute-force computing power rather than computer intelligence; being one of the world's most powerful supercomputers at the time, it typically searched to a depth of six to eight moves, and to as many as twenty or more in some situations. Though successful in the game of chess, many experts doubted that such a brute-force approach would be extendable to more complex games like Go. In fact, the typical depth and number of candidate moves in the Go game are much higher, ∼150 and ∼250 respectively, compared to chess. This makes the number of possible threads to be approximately
Fig. 7. Typical structure of Wave-net (Bakshi and Stephanopoulos, 1993).

and Sung, 2009). The difficulty stems from the fact that the time scales involved in different hierarchies can be quite different. Hence, writing the problem of two or more adjacent layers into a single mathematical program can lead to a very large problem that is difficult to handle computationally. Considering the uncertainty exacerbates the issue significantly. The scenario branching used in stochastic programming can explode (in an exponential manner) as the number of time steps or stages increases. This is the reason why stochastic programming has seldom seen applications beyond two-stage problems.

In integrating planning and scheduling, the high-level decision problem (i.e., planning) typically has recurring dynamics and decisions and is dominated by uncertainties. Therefore, it is naturally expressed as a Markov decision process (MDP), for which (stochastic) dynamic programming provides a solution. On the other hand, the lower-level decision problem (i.e., scheduling) tends to have a more complex state representation and dynamics and is better suited for the mathematical programming approach; in some cases, certain heuristic rules exist that are already proven highly effective. Since RL or ADP solves an MDP by simulating the stochastic environment and iteratively improving a decision policy by learning, it is a very fitting approach for integrating the two layers. The whole integrated system can be simulated easily, regardless of whether a mathematical programming based solution or heuristic rules are used for the lower layer. Such approaches have been explored extensively by the authors over the last decade or so, and have shown some success and potential (Lee and Lee, 2006; Shin and Lee, 2016; Shin et al., 2017).

4.4. Control

Optimal control problems also employ Markov models (e.g., state-space models with random noise inputs) and stagewise cost functions, and therefore can be viewed as MDPs. The main difference from those studied in the operations research field is that the state and action spaces are continuous, leading to very high-dimensional state and action spaces when discretized. Hence, the curse of dimensionality is much worse for typical control problems. Nevertheless, RL can provide a useful route to solving stochastic optimal control problems, which are generally not solvable by other current techniques, including the ever popular model predictive control (MPC) technique. Efforts have been made in using ADP to solve control problems with stochastic uncertainty. In the PSE domain, the authors have used the approach to handle some challenging problems like the dual adaptive control problem and more general nonlinear stochastic optimal control problems, and have shown that it can provide better-performing solutions com-

Most deep networks inherently represent probabilistic input-output relationships. That is, these networks map the input to the probability distribution of the output (which is typically discrete) (Baum and Wilczek, 1987). Hence, they can be useful in representing complex uncertain relationships. In particular, such a feature can be used to identify Markov state transition dynamics that may be too complex or nonlinear to represent by a simple additive noise. For example, suppose we are interested in identifying the Markov transition mapping X_t → X_{t+1}, where X occupies a continuous state space and the transition is stochastic. One can use deep learning to create a mapping between X_t and the probability distribution of X_{t+1} in an appropriately discretized state space using simulation or real data. Such a mapping can naturally serve as a Markov transition model of the process dynamics.
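A minimal sketch of this idea is given below, assuming PyTorch and a toy scalar process in place of real plant data: a small network is trained on sampled (X_t, X_{t+1}) pairs to output a categorical distribution over a discretized next-state grid, which can then be read off as a Markov transition model. The dynamics, bin grid, and network sizes are all illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stochastic dynamics x_{t+1} = 0.8 x_t + noise, observed only through samples.
x_t = torch.empty(2000, 1).uniform_(-1, 1)
x_next = 0.8 * x_t + 0.1 * torch.randn_like(x_t)

# Discretize the continuous next state into bins; the network maps X_t to a
# categorical distribution over these bins (an identified Markov transition model).
bins = torch.linspace(-1.2, 1.2, 25)
labels = torch.bucketize(x_next.squeeze(1), bins)

model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(),
                      nn.Linear(32, 32), nn.ReLU(),
                      nn.Linear(32, len(bins) + 1))       # logits over discretized next states
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x_t), labels)
    loss.backward()
    opt.step()

# The softmax output is P(X_{t+1} in bin | X_t), usable as a transition model.
probs = torch.softmax(model(torch.tensor([[0.5]])), dim=1)
print("most probable next-state bin for X_t = 0.5:", int(probs.argmax()))
```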
5. Conclusion

Some significant recent developments in machine learning like deep learning and reinforcement learning open up new possibilities in many application domains. Deep learning enables efficient training of neural networks with a large number of hidden layers, which in turn allows for hierarchical feature learning of the input data. This will become increasingly effective as data streams are generated and curated at different scales and plant locations. Developments in reinforcement learning like approximate dynamic programming allow us to obtain optimal or near-optimal policies for multi-stage stochastic decision problems. Both of these learning techniques are embarrassingly parallel, and thus able to take advantage of the trends in hardware development that emphasize parallel multicore computation versus more capable single cores. In the PSE domain, these developments represent opportunities to push the technological boundaries in important problems like fault detection/diagnosis, nonlinear model reduction, integrated planning and scheduling, and stochastic optimal control.

So the question is whether it is time for the PSE community to re-engage in this space and with what research problems in mind. The following are a few that come to the authors' minds:

a. Exploration vs exploitation: this is an old problem, but one which has not attracted the community's attention except with the concepts of dual control. This is a problem that we need to return to with the idea that advances in computational power make it feasible. The problem is that we are unlikely to have crisp theoretical results beyond those already established, unless we can find specific sub-problem structures that are relevant to PSE.

b. Deep neural network architectures: is it time for the community to see if these can be usefully applied in PSE problems, and which types of problems would they seem best suited to? There are lots
of choices in the structure of networks and types of nodes to use in these networks; these decisions, much as the earlier, less complex decisions about the number of nodes and layers to include, are not easy to make and are disconcerting. Furthermore, what is the academic value?

c. Reinforcement learning: this has the exploitation vs exploration tradeoff built into its learning, but may require changing the architecture to explicitly recognize the decision of switching between the two modes. This uses a features-to-action (or features-to-value) mapping and hence can be combined with deep learning to come up with potentially interesting decision-making for stochastic multi-stage decision problems. The use of simulation, even with imperfect physical rules, can provide inputs that allow for better performance to be learned cheaply relative to experiments on real systems.

Our view is that investment in these areas by the PSE community will pay healthy dividends.

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. NRF-2015R1A2A2A01007102).

References

Ba, J., Mnih, V., Kavukcuoglu, K., 2014. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755.
Bakshi, B.R., Stephanopoulos, G., 1993. Wave-net: a multiresolution, hierarchical neural network with localized learning. AIChE J. 39 (1), 57–81.
Baldi, P., Hornik, K., 1989. Neural networks and principal component analysis: learning from examples without local minima. Neural Netw. 2 (1), 53–58.
Baum, E.B., Wilczek, F., 1987. Supervised Learning of Probability Distributions by Neural Networks. Paper Presented at the NIPS.
Bengio, Y., LeCun, Y., 2007. Scaling learning algorithms towards AI. Large-Scale Kernel Mach. 34 (5), 1–41.
Bengio, Y., Simard, P., Frasconi, P., 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5 (2), 157–166.
Bengio, Y., Delalleau, O., Le Roux, N., 2006. The curse of highly variable functions for local kernel machines. Adv. Neural Inf. Process. Syst. 18, 107.
Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., 2007. Greedy layer-wise training of deep networks. Adv. Neural Inf. Process. Syst. 19, 153.
Bengio, Y., 2009. Learning deep architectures for AI. Found. Trends Mach. Learn. 2 (1), 1–127.
Berkooz, G., Holmes, P., Lumley, J.L., 1993. The proper orthogonal decomposition in the analysis of turbulent flows. Ann. Rev. Fluid Mech. 25 (1), 539–575.
Bertsekas, D.P., Tsitsiklis, J.N., 1995. Neuro-dynamic programming: an overview. Paper Presented at the Decision and Control, 1995, Proceedings of the 34th IEEE Conference.
Brezzi, M., Lai, T.L., 2002. Optimal learning and experimentation in bandit problems. J. Econ. Dyn. Control 27 (1), 87–108.
Browne, C.B., Powley, E., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., ..., Colton, S., 2012. A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 4 (1), 1–43.
Chang, H.S., Fu, M.C., Hu, J., Marcus, S.I., 2005. An adaptive sampling algorithm for solving Markov decision processes. Oper. Res. 53 (1), 126–139.
Chang, H.S., Fu, M.C., Hu, J., Marcus, S.I., 2016. Google Deep Mind's AlphaGo. OR/MS Today.
Chaslot, G., De Jong, S., Saito, J.-T., Uiterwijk, J., 2006. Monte-Carlo tree search in production management problems. Paper Presented at the Proceedings of the 18th BeNeLux Conference on Artificial Intelligence.
Chatterjee, A., 2000. An introduction to the proper orthogonal decomposition. Curr. Sci. 78 (7), 808–817.
Chick, S.E., Gans, N., 2009. Economic analysis of simulation selection problems. Manage. Sci. 55 (3), 421–437.
Coulom, R., 2006. Efficient selectivity and backup operators in Monte-Carlo tree search. Paper Presented at the International Conference on Computers and Games.
Dauphin, Y.N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y., 2014. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Paper Presented at the Advances in Neural Information Processing Systems.
De Mesmay, F., Rimmel, A., Voronenko, Y., Püschel, M., 2009. Bandit-based optimization on graphs with application to library performance tuning. Paper Presented at the Proceedings of the 26th Annual International Conference on Machine Learning.
Dearden, R., Friedman, N., Russell, S., 1998. Bayesian Q-learning. Paper Presented at the AAAI/IAAI.
Dearden, R., Friedman, N., Andre, D., 1999. Model based Bayesian exploration. Paper Presented at the Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence.
Duff, M., Barto, A., 2002. Optimal Learning: Computational Procedures for Bayes-adaptive Markov Decision Processes. PhD Thesis. University of Massachusetts Amherst.
Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., Bengio, S., 2010. Why does unsupervised pre-training help deep learning? J. Mach. Learn. Res. 11 (February), 625–660.
Feldbaum, A., 1960. Dual control theory. I. Avtomatika i Telemekhanika 21 (9), 1240–1249.
Frazier, P.I., Powell, W.B., Dayanik, S., 2008. A knowledge-gradient policy for sequential information collection. SIAM J. Control Optim. 47 (5), 2410–2439.
Frazier, P., Powell, W., Dayanik, S., 2009. The knowledge-gradient policy for correlated normal beliefs. Informs J. Comput. 21 (4), 599–613.
Gaudel, R., Sebag, M., 2010. Feature selection as a one-player game. Paper Presented at the International Conference on Machine Learning.
Gittins, J.C., 1979. Bandit processes and dynamic allocation indices. J. R. Stat. Soc. Ser. B (Methodol.), 148–177.
Gomes, L., 2014. Machine-learning maestro Michael Jordan on the delusions of big data and other huge engineering efforts. IEEE Spectr. 20 (October).
Grossmann, I.E., 2012. Advances in mathematical programming models for enterprise-wide optimization. Comp. Chem. Eng. 47, 2–18.
Gupta, S.S., Miescke, K.J., 1996. Bayesian look ahead one-stage sampling allocations for selection of the best population. J. Stat. Plann. Inference 54 (2), 229–244.
Hinton, G.E., Osindero, S., Teh, Y.-W., 2006. A fast learning algorithm for deep belief nets. Neural Comput. 18 (7), 1527–1554.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9 (8), 1735–1780.
Hochreiter, S., 1991. Untersuchungen zu dynamischen neuronalen Netzen. Diploma Thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München.
Hopfield, J.J., 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. 79 (8), 2554–2558.
Huang, G.B., Lee, H., Learned-Miller, E., 2012. Learning hierarchical representations for face verification with convolutional deep belief networks. Paper Presented at the Computer Vision and Pattern Recognition (CVPR) 2012 IEEE Conference.
Kocsis, L., Szepesvári, C., 2006. Bandit based Monte-Carlo planning. Paper Presented at the European Conference on Machine Learning.
Kramer, M.A., 1991. Nonlinear principal component analysis using autoassociative neural networks. AIChE J. 37 (2), 233–243.
Kramer, M.A., 1992. Autoassociative neural networks. Comp. Chem. Eng. 16 (4), 313–328.
LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., Jackel, L., 1990. Handwritten digit recognition with a back-propagation network. Paper Presented at the Proceedings of Advances in Neural Information Processing Systems (NIPS).
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521 (7553), 436–444.
Lee, J.M., Lee, J.H., 2004. Approximate dynamic programming strategies and their applicability for process control: a review and future directions. Int. J. Control Autom. Syst. 2, 263–278.
Lee, J.M., Lee, J.H., 2005. Approximate dynamic programming-based approaches for input–output data-driven control of nonlinear processes. Automatica 41 (7), 1281–1288.
Lee, J.H., Lee, J.M., 2006. Approximate dynamic programming based approach to process control and scheduling. Comp. Chem. Eng. 30 (10), 1603–1618.
Lee, J.M., Lee, J.H., 2009. An approximate dynamic programming based approach to dual adaptive control. J. Process Control 19 (5), 859–864.
Lee, J.H., Wong, W., 2010. Approximate dynamic programming approach for process control. J. Process Control 20 (9), 1038–1048.
Lee, J.M., Kaisare, N.S., Lee, J.H., 2006. Choice of approximator and design of penalty function for an approximate dynamic programming based control approach. J. Process Control 16 (2), 135–156.
Lee, H., Grosse, R., Ranganath, R., Ng, A.Y., 2009. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. Paper Presented at the Proceedings of the 26th Annual International Conference on Machine Learning.
Lewis, F.L., Vrabie, D., 2009. Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst. Mag. 9 (3).
Li, Y., 2017. Deep reinforcement learning: an overview. arXiv preprint arXiv:1701.07274.
Littman, M.L., 1994. Markov games as a framework for multi-agent reinforcement learning. Paper Presented at the Proceedings of the Eleventh International Conference on Machine Learning.
Maravelias, C.T., Sung, C., 2009. Integration of production planning and scheduling: overview, challenges and opportunities. Comput. Chem. Eng. 33 (12), 1919–1930.
Martin, J.J., 1967. Bayesian Decision Problems and Markov Chains. Wiley.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M., 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., ..., Ostrovski, G., 2015. Human-level control through deep reinforcement learning. Nature 518 (7540), 529–533.
Naidu, S.R., Zafiriou, E., McAvoy, T.J., 1990. Use of neural networks for sensor failure detection in a control system. IEEE Control Systems Magazine 10 (3), 49–55.
Negoescu, D.M., Frazier, P.I., Powell, W.B., 2011. The knowledge-gradient algorithm for sequencing experiments in drug discovery. Informs J. Comput. 23 (3), 346–363.
Poupart, P., Vlassis, N., Hoey, J., Regan, K., 2006. An analytic solution to discrete Bayesian reinforcement learning. Paper Presented at the Proceedings of the 23rd International Conference on Machine Learning.
Powell, W.B., Ryzhov, I.O., 2012. Optimal Learning, vol. 841. John Wiley & Sons.
Powell, W.B., 2007. Approximate Dynamic Programming: Solving the Curses of Dimensionality, vol. 703. John Wiley & Sons.
Prokhorov, D.V., Wunsch, D.C., 1997. Adaptive critic designs. IEEE Trans. Neural Netw. 8 (5), 997–1007.
Puterman, M.L., 2014. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
Ryzhov, I.O., Powell, W.B., 2010. Approximate dynamic programming with correlated Bayesian beliefs. Paper Presented at the Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference.
Ryzhov, I.O., Powell, W.B., 2011. Information collection on a graph. Oper. Res. 59 (1), 188–201.
Ryzhov, I.O., Powell, W.B., Frazier, P.I., 2012. The knowledge gradient algorithm for a general class of online learning problems. Oper. Res. 60 (1), 180–195.
Saint-Donat, J., Bhat, N., McAvoy, T.J., 1991. Neural net based model predictive control. Int. J. Control 54 (6), 1453–1468.
Schölkopf, B., Smola, A.J., 2002. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press.
Schaul, T., Quan, J., Antonoglou, I., Silver, D., 2015. Prioritized experience replay. arXiv preprint arXiv:1511.05952.
Shin, J., Lee, J.H., 2016. Multi-time scale procurement planning considering multiple suppliers and uncertainty in supply and demand. Comp. Chem. Eng. 91, 114–126.
Shin, J., Lee, J.H., Realff, M.J., 2017. Operational planning and optimal sizing of microgrid considering multi-scale wind uncertainty. Appl. Energ. 195, 616–633.
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., ..., Lanctot, M., 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), 484–489.
Strens, M., 2000. A Bayesian framework for reinforcement learning. Paper Presented at the ICML.
Stulp, F., Sigaud, O., 2015. Many regression algorithms, one unified model: a review. Neural Netw. 69, 60–79.
Sutton, R.S., Barto, A.G., 1998. Reinforcement Learning: An Introduction, vol. 1. MIT Press, Cambridge.
Tesauro, G., 1992. Practical issues in temporal difference learning. Mach. Learn. 8 (3–4), 257–277.
Tsoi, A.C., Back, A., 1997. Discrete time recurrent neural network architectures: a unifying review. Neurocomputing 15 (3), 183–223.
Van Hasselt, H., Guez, A., Silver, D., 2016. Deep reinforcement learning with double Q-learning. Paper Presented at the AAAI.
Walsh, T.J., Goschin, S., Littman, M.L., 2010. Integrating sample-based planning and model-based reinforcement learning. Paper Presented at the AAAI.
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., de Freitas, N., 2015. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581.
Watkins, C.J.C.H., 1989. Learning from Delayed Rewards. University of Cambridge, England.
Werbos, P.J., 1977. Advanced forecasting methods for global crisis warning and models of intelligence. Gen. Syst. Yearbook 22 (12), 25–38.
Werbos, P.J., 1990. A menu of designs for reinforcement learning over time. In: Neural Networks for Control, pp. 67–95.
Werbos, P.J., 1992. Approximate dynamic programming for real-time control and neural modeling. In: Handbook of Intelligent Control.
Werbos, P.J., 2012. Reinforcement learning and approximate dynamic programming (RLADP): foundations, common misconceptions, and the challenges ahead. In: Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, pp. 1–30.
Ydstie, B., 1990. Forecasting and control using adaptive connectionist networks. Comput. Chem. Eng. 14 (4), 583–599.
Zhang, H., Wang, Z., Liu, D., 2014. A comprehensive review of stability analysis of continuous-time recurrent neural networks. IEEE Trans. Neural Networks Learn. Syst. 25 (7), 1229–1262.