Data Science in Chemical Engineering
Patrick M. Piccione ∗
Process Technology New Active Ingredient, Syngenta, Breitenloh 5, CH-4333 Münchwilen, Switzerland
Article info

Article history: Received 22 January 2019; Received in revised form 18 April 2019; Accepted 18 May 2019; Available online 27 May 2019

Keywords: Data science; Chemical engineering; Digital transformation; Decision; Interpretability; Industry 4.0

Abstract

Data science, digit(al)ization, Industry 4.0, smart manufacturing: all these terms are receiving heavy interest from industry and funding institutions. While some definitions remain hazy, the financial success of the digital giants has led to a bandwagon all too many companies are happy to jump on. Successes are also reported in the chemical and engineering sciences, driven by specific enablers as well as technical specificities of the application areas. High expectations for data science applications in chemical engineering have resulted, together with a loss of visibility of the limits of a purely data-centric approach. At the same time, chemical engineers may not be fully prepared to embrace the digital revolution in general, and data science in particular. This Short communication, aimed at all stakeholders of the digital transformation of the chemical industry, sets out an aspirational vision for the data science–chemical engineering interplay, together with needs, opportunities, and suggested approaches to address them. Several frameworks are also given to inform technical strategies: activity classes; workflows; and a decision tree to consciously assess what approaches to privilege.
© 2019 Institution of Chemical Engineers. Published by Elsevier B.V. All rights reserved.
∗ Correspondence to: F. Hoffmann-La Roche AG, Grenzacherstrasse 124, 4070 Basel, Switzerland.
E-mail address: [email protected]
https://doi.org/10.1016/j.cherd.2019.05.046
0263-8762/© 2019 Institution of Chemical Engineers. Published by Elsevier B.V. All rights reserved.
Chemical Engineering Research and Design 147 (2019) 668–675

1. Context of data science

1.1. The hype

It has become impossible to take an airplane without reading about the benefits of digitalization, data-driven decisions or Industry 4.0 in the inflight magazine’s business section. Businesses reportedly see no limit to the benefits thereof, with claims that data science can improve essentially all decision workflows (Pence and Williams, 2016). Senior leaders give strong steers to exploit “Big data”, a term sensitive to statisticians despite (because of?) aggressive marketing by data analytics companies (Center for Process Systems Engineering Workshop, 2019; Schutt and O’Neil, 2019). Corporations worldwide, “bewitched by data science” (Center for Process Systems Engineering Workshop, 2019), are thus engaging in various approaches to explore the possibilities offered by the new technologies. Internal efforts are supplemented by national and international collaborations, with or without public funding. These explorations can be driven by enthusiasts for a particular technology, sometimes resembling a solution looking for a problem.

Chemical engineers, in particular in manufacturing plants, are producers as well as consumers of data, which they are expected to handle to an ever-increasing degree (Beck et al., 2016). They therefore have an important role to play in co-creating a balanced view that obtains the most from data while not forgetting the verified fundamental theory of our discipline. This Short communication maps out the insights derived from leading the “Maths & Data Science” (MDS) strategy of Syngenta’s Technology & Engineering division (near 1000 people). The perspectives herein are intended for use by all the communities shaping the digital transformation of the chemical industry: practicing engineers, data scientists, software developers and architects, technology strategists, and others, whether in industry, academia or government agencies.

1.2. A definitional nightmare

Data science and digitalization can evoke passionate responses akin to infatuation or rejection, partly because of a lack of common definitions (Schutt and O’Neil, 2019). Having a strong technology component, “data science”, “autonomous” and similar terms inherently mean very different things for different industries (e.g., chemical vs automotive) (Center for Process Systems Engineering Workshop, 2019). Although many definitions are available (Schutt and O’Neil, 2019; IGI, 2019), they are not unified (Qi and Tao, 2018). For instance, Beck et al. (2016) divide data science into three areas: data management, statistics and machine learning, and visualization. The Syngenta MDS strategy deliberately separated traditional (relatively small) statistical experimental design and analysis from “data science and empirical modeling”. This division emphasized that while traditional statistics required mostly embedding and use, key novel differentiators to be explored within the Industry 4.0 and data science currents included new science, technology, and, importantly, conditions of use (e.g., tool-making). Mechanistic modeling is scientifically more clearly distinct, and formed the third pillar of the MDS strategy. Pantelides and Renfro (2013) have recently published a review on the online use of such models, and they will not be defined further here. Instrument automation for control and data retrieval is obviously highly valuable, but is not “new” to process control engineers (Daoutidis et al., 2018) and so is largely excluded from the present discussion.

Schutt and O’Neil found it easier to define a data scientist than data science (Schutt and O’Neil, 2019), an approach that fits the purposes of the present discussion. To summarize their definition, a data scientist will thus be taken to be “someone who knows how to [collect, clean,] extract meaning from and interpret data, which requires both tools and methods from statistics and [software engineering], as well as [domain expertise]. [. . .] A crucial part is exploratory data analysis, which combines visualization and data sense. She’ll find patterns, build models, and algorithms, some with the intention of understanding [. . .], and others to serve as prototypes [. . .]. She is a critical part of data-driven decision making, communicat[ing] with [stakeholders] in clear language and with data visualizations so that they will understand the implications [of the data]” (Schutt and O’Neil, 2019).

1.3. Successes

The technology giants such as Google, Facebook and Amazon are the most visible corporate representatives of the digital world of the 21st century. Digitalization is an intrinsic component of their business model (or even of their products), and they leverage their systems and data to advertise products and services. To this end, they have invested heavily in the required infrastructure (Lee et al., 2018). Their research demonstrates that computer programs can achieve superhuman performance at specific tasks (e.g., playing Go) (Ferguson, 2018). Some of the biggest commercial successes are advertising and other customer analyses, as well as natural language recognition (Halevy et al., 2009).

Success in these areas derives from specific problem features: first and foremost, the large data sets that are either available (nearly) for free in digital form (language, games (Lee et al., 2018; Venkatasubramanian, 2019)) or collected as part of the companies’ operations (e.g., customer selections and recommendations (Lee et al., 2018)). Very large data sets have been said to capture very rare behavior, and also to show limited vulnerability to random errors (Halevy et al., 2009). Some problems also have a critical characteristic in that the target is to improve the average behavior of a very large population rather than to guarantee error-free physical performance. It is thus feasible to experiment with various approaches, since the economic downside is low and there is no safety implication, unlike for a chemical plant (Clarke, 2019). A small average improvement over millions of customer transactions also needs no causal explanation to deliver significant economic returns (Clarke, 2019). This latter observation also helps explain the slower pace of change in the manufacturing sector compared to the consumer Internet (McKinsey Digital, 2015).

Chemical and engineering science has also seen benefits from data science, from artificial intelligence (AI) design of a much lighter (though more expensive) motorcycle (Stinson, 2018), to predictions of organic reaction products (Ferguson, 2018), and molecular sites for biological reactivity (Ferguson, 2018), or energy systems management (Beck et al., 2016). Machine learning algorithms allow computers to learn patterns purely from data so that they can perform tasks without human instructions (Beck et al., 2016), thus leading to the claim that it is possible to “engineer out the engineer” (Center for Process Systems Engineering Workshop, 2019). The statement is obviously meant to provoke: but what kind of engineer? Many machine learning tasks benefit from being supervised, i.e. having the examples supplied and curated by the user. Machine learning and artificial intelligence perspectives by chemical engineers have been published recently by Lee et al. (2018) and Venkatasubramanian (2019) respectively.

1.4. Enablers

The take-off of data science has been made possible by the convergence of several computational and informational advances:

• computational power, exemplified by the fact that a smartphone can be compared to a supercomputer 25 years older (Hall and Bianco, 2011).
• algorithmic advances (Pantelides, 2019) including in machine learning (Lee et al., 2018; Venkatasubramanian, 2019).
• the appearance of easy-to-use software (Venkatasubramanian, 2019), especially as part of rich open-source ecosystems.
• the availability of more, and more complex, data thanks to automation, sensors and networking (Beck et al., 2016), as well as the increasing availability of large sets of public data (such as PubChem and ChEMBL in the chemical sciences) (Pence and Williams, 2016).

The societal embrace of social networks, as well as its acceptance of data tracking, has also led to a large increase in the total amount of data available (Schutt and O’Neil, 2019).
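As a concrete illustration of the supervised learning mentioned in Section 1.3, where the user supplies and curates the examples, the following minimal Python sketch fits a straight line to a handful of labelled points by ordinary least squares. It is not taken from the original communication: the data values, the variable names, and the functions `fit_line` and `predict` are hypothetical and chosen only for illustration.

```python
# Hypothetical curated examples: (temperature in deg C, measured yield in %).
# The numbers are invented for illustration; they are not from the paper.
examples = [(50.0, 61.0), (60.0, 66.0), (70.0, 71.0), (80.0, 76.0)]


def fit_line(pairs):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Evaluate the fitted line at x."""
    slope, intercept = model
    return slope * x + intercept


# "Training" consists solely of learning the pattern from the supplied examples.
model = fit_line(examples)
print(model, predict(model, 65.0))
```

Predictions inside the sampled temperature range are interpolations. Anything outside it is an extrapolation that the data alone cannot justify, which is one of the limits of a purely data-centric approach that this communication draws attention to.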
Despite the revolutions in the digital space, the basic needs of the chemical manufacturing industry have not changed: safety, efficiency, product quality, process reliability and cost remain major drivers (Center for Process Systems Engineering Workshop, 2019), with sustainability (Daoutidis et al., 2018) emerging as a more overt driver in the working life of the author.

Within this context, some decisions can indeed be purely data-driven; for instance, process changes within a previously explored envelope. The main criterion is whether the system providing the data is the actual, only, one for which an

3.1. Aspiration

A vision is the key element for chemical engineering to engage with data science. The one proposed here is that the application of the concepts of data science allows computer power to augment brain power to design, transfer and implement right-first-time products and processes. This ambition can be transposed to other physical sciences beyond chemical engineering, and the desirability of merging domains of expertise to address the challenges of the future is well recognized (Daoutidis et al., 2018). For process and product developers,
this vision naturally links into existing control, automation and process analytical technologies to extract the maximum possible information from all operations at all sizes, from early research laboratories to established manufacturing plants. The successful corporation of the future must constantly make information, not just material. Therefore data is, and must be seen and managed as, an invaluable asset (McKinsey Digital, 2015; Deloitte and Verband der Chemischen Industrie, 2017), necessary to construct digital twins of manufacturing processes, themselves an essential enabler of smart manufacturing (Qi and Tao, 2018).

To turn this vision into reality, three aspects are examined here: an improved understanding between chemical engineers and data scientists of each other’s disciplines; conscious assessments of what approach to favor; and some specific technical areas to tackle jointly.

3.2. Better understanding

To enable the aspiration above, chemical engineers must engage with the data science movement to understand what the field offers, and incorporate its most critical elements into education and continuing training. In particular, engineers of the 21st century will need advanced data handling skills (Bogle, 2019). Such skills include data management, including metadata; data extraction; and structuring data for convenient analyses. The assumptions, areas of applicability, and limits of various empirical modeling techniques must also be appreciated.

The creation of a data science course for chemical engineers is thus highly desirable (Beck et al., 2016). Important aspects to cover include a general overview, philosophical principles, and a starting point to explore the main techniques and supporting software.

A common, accepted, language is also critical, with three aspects. First, canonical definitions of the field’s terms will ensure clarity of discussions and initiatives: data science, digitalization, machine learning, even if confined to chemical engineering applications. The draft for a data scientist adapted above can be adjusted further. Excluding tried-and-verified classical statistics from “data science” is beneficial for a precise focus.

It then becomes important to delineate the successive levels of information type that can be brought in by mathematical models (of any nature). Finally, for implementation the various roles to successfully develop tools must also be understood. For both aspects, many descriptions and frameworks are available, but the underlying principles are often quite similar. For technical strategy purposes, the workflow and roles of Figs. 1 and 2 proved quite informative at Syngenta. The successive steps from query (data retrieval) to descriptive analytics (to explore trends), then predictive and finally prescriptive analytics represent qualitative jumps in the type of human decision that they support. Each step thus also requires corresponding leaps in sophistication with respect to the underlying data, model and tool architecture. The approach can thus be consciously matched to a business problem’s needs. The activity classes of Fig. 1 are in good agreement with McKinsey’s recommendations on how to manage the digital thread (McKinsey Digital, 2015). Maturity levels of organizations can also be assessed to gauge readiness toward change, drawing from published methodologies such as the one from Warwick University for Industry 4.0 (Agca et al., 2018).

In the reverse direction, chemical engineers must also ensure they educate data scientists on the business context (and decisions), and also impart upon them an awareness of the natural science and engineering context and knowledge base. Joint project work, hosting of data scientists, and the pursuit of cross-team experiences will foster a mutual appreciation of each community’s strengths and weaknesses, and give the required understanding for healthy challenge and questioning of each other’s approaches and assumptions. In a related vein, alliances and strategic partnerships in the evolving ecosystem of digital centers of expertise and application will prove very helpful, as already alluded to in several white papers mentioned above (McKinsey Digital, 2015; UK Department of International Trade, 2018).

3.3. Conscious assessments

Following from the definition above for data science, an essential enabler for valuable, efficient application of data science in chemical engineering is to delineate the conditions of use: what sort of problems, data and applications will truly benefit from data science? A canonical, documented, and referenceable framework can be used for discussions, and to support academic as well as corporate initiatives. A decision tree is proposed in Fig. 3 to help guide decisions on the primary investigative approach. Note that for conceptual simplicity it does not include hybrid models; these are discussed further below.

The key questions are relatively few. First, and foremost, if the intent is to make a software tool, especially for a large user community, modern principles of data science are absolutely essential. The second question concerns the volume and heterogeneity of the data (and by extension the 4Vs of volume, variety, velocity and veracity (IBM, 2019)). If the data are well structured, and not too large a set, then traditional statistics will suffice, possibly supported by the data extraction part of data science but not the advanced modeling aspects of artificial intelligence, machine learning, etc. In all cases, another critical pair of questions concerns the availability, and application to date, of relevant science and engineering principles: if they have not been used but are available, this known science should be used first. The residual, unexplained phenomena, in other words the unknowns, can then benefit from traditional statistics or data science, depending on the data set. Early explorations of complex data sets will benefit from the visualization power and empirical insights from data science. Indeed many fundamental theories were discovered by careful observations and analysis of empirical data: for instance, Galileo’s arguments (Galilei, 1632). Finally, the last major question concerns whether the system under study is truly of ultimate interest, or whether it is just a case within a class, or even a model system like a scale-down engineering unit. If extrapolation outside the limits of the data set and known theory is required, it might even be the case that new physical science is needed. Of course, elementary statistics are expected to be applied routinely to science and engineering work to understand uncertainties, variability, and goodness of fit.

Along a different dimension, the more critical the decision and especially its possible downsides, the more critical it is that there be explanatory power attached to the output, which is where the incorporation of known chemical and engineering theory will prove essential.

To fully achieve the potential benefits of data science in the chemical industry, concrete technical projects and subject
Fig. 2 – Workflows/activities to make general use tools (data scientists overlap with all three coloured areas). Labels recoverable from the figure: Users; Define use condition; Deliver value.
areas need to be explored in addition to the more abstract engagement described so far. Such opportunities for valuable convergence of data science with chemical engineering are delineated below, in line with the decision tree of Fig. 3.

3.4. Opportunities

For context, general Industry 4.0 levers and value drivers to seed thought have been compiled in McKinsey’s extensive white paper (McKinsey Digital, 2015). Different schemes can accordingly be used to classify data science opportunities in engineering. Here problem characteristics of the intended application are emphasized, together with some specific natural science fields.

First, the technological possibilities offered by data science should continue to be exploited to make the extraction and use of production data as efficient and effective as possible, an area where supervisory control and data acquisition (SCADA) system providers are already very active (Rovaglio, 2019).

Second, the development of software tools today (including automated report generation) must include data science principles. Producing simple-to-use front-ends will greatly help disseminate and democratize all modeling methods and techniques. Process modeling software vendors illustrate this realization through their continuous construction and improvement of browser interfaces (Aspen Technology, 2019; Schulz, 2019). In this context, practicing process modelers stress two crucial points (Gavi, 2019): first, proper training on how to use such front-ends must be ensured; second, models must not be over-simplified to the extent that they produce inaccurate data and conclusions.

The third type of problem is one with little fundamental theory, yet plentiful but relatively unstructured or inhomogeneous data, such as: (1) long-term cost optimization of manufacturing operations after the physical science known to apply to the processes has been exhausted; (2) predictive maintenance beyond the limit of equipment knowledge; (3) technico-economic optimization after supply chain and logistics principles have been used to the full; (4) predictions of the behavior of individuals, including demand. These can be called theoryless, or perhaps limited-theory, problems (Center for Process Systems Engineering Workshop, 2019).

Energy systems and grids have also been emphasized as an application area (Lee et al., 2018). These are a subset of a more general family: complex network systems. Especially for ill-defined complex flows, e.g., for waste water networks, data science techniques that can handle highly uncertain and/or fluctuating input data will offer advantages for optimization. Some of these problems could be called (big-but-)limited-data problems.

The fourth type of application concerns joining the advantages of empirical models (optimization speed; lack of reliance on well-established theories) with the interpretability of mechanistic models. One emerging field consists of metamodels (simplified, empirical models of more complicated models), which can be used for tasks such as optimization of process conditions in Aspen (Becker et al., 2008). A research aspect thereof is how to ensure that the general mathematical structure of the metamodels is realistic enough for successful optimizations. Hybrid models make up another field; these incorporate explanatory power into empirical models. “Bak[ing] in an understanding” in data-driven models has been cited as an approach to reduce risk upon use (Clarke, 2019).

A last area to highlight is that data science could, a bit paradoxically, help bring experiments and theory closer together by setting an expectation of high-throughput-screening-and-analysis (Center for Process Systems Engineering Workshop, 2019). Further research is needed here as well to make the approach as routine as possible, and generally to maximize the information content and density of experimental campaigns. Such an optimization is particularly important when material or processing costs are critical, e.g., early in fine chemical/pharmaceutical development (Gavi, 2019).

Some chemical engineering areas seeing growing use of data science have been highlighted by Beck et al. (2016): computational molecular science and engineering; synthetic biology; and energy systems and management. The first of these fields, expanded to encompass material design, is of particular interest here due to its broad applicability and variety of techniques. It includes discovery informatics, including quantitative structure–property relationships (QSPR) but also unsupervised learning and the design of software tools (Piccione et al., 2019; Diorazio et al., 2016). Of great continuing interest is the “inverse problem” of material and formulation design to achieve specific target properties. Methods to solve this problem are particularly valuable in commercial contexts, leading to computer-aided product design, a subject in which process systems engineers are already active (Conte et al., 2011).

4. Conclusion

Data science, digit(al)ization, and Industry 4.0 are not concepts at odds with chemical engineering. These currents are in good alignment with the long-standing interest of chemical engineers in obtaining data from
Table 1 – Recommended activities to further connections between data science and chemical engineering.
Issue/opportunity Approaches
their operations; at the same time, our profession must recognize that the possibilities now on offer will lead to significant, disruptive shifts upon routine, endemic embedding of Industry 4.0 approaches. Data science and the digital transformation offer plenty of opportunities to assist and empower chemical engineers, and are not expected to supplant them, but getting the best from them will require work! At the recent EFCE Digitalisation in Chemical Engineering workshop, the summarized sentiment was that the potential is huge and the changes are coming (Bogle, 2019). This is in line with subjective experience: the biggest transformation in the author’s working life is undoubtedly the digital one, and it shows no sign of slowing down. Table 1 summarizes the key issues, and proposed approaches, to fruitfully bring data science and chemical engineering together.

Technical strategies in this emergent, fast-moving area will require clarity: of definitions, of workflows, of the resulting roles, of expectations, and of maturity including willingness to change.

Realistic assessments of problems and their characteristics must inform which approach to favor in the resolution of technical and scientific challenges. Critical aspects to consider include the availability of theory, the complexity of the data set, and the condition of use of any resulting mathematical models.

Theory-limited problems will benefit the most from a data-science-heavy approach, whereas for interpretability and extrapolation, more physical science is required for better results. Metamodels, the link to high-throughput experimentation, and the creation of tools for the general technologist will benefit from merging the approaches. Data science will also be a great help to ensure data are extracted and used efficiently and effectively, regardless of the models and techniques used for their analyses.

Most important of all, to help realize this bright potential, chemical engineers must engage with data science and data scientists, e.g., through training, shared experiences, and joint projects.

Acknowledgments

Patrick M. Piccione acknowledges Syngenta for supporting this work, as well as Gabriel Carré, Dirk de Bruyn Ouboter, Juan Luis Naveira and Tom Salvesen (Syngenta) and numerous members of the Center for Process Systems Engineering (Imperial College) and Kemiteknik (Technical University of Denmark) consortia, in particular Professors Claire Adjiman, Rafiqul Gani (now at PSE for SPEED), Georgios Kontogeorgis, and Nilay Shah, for stimulating conversations over the years.

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

References

Aaslaid K. 50 Examples of Corporations That Failed to Innovate. https://valuer.ai/blog/50-examples-of-corporations-that-failed-to-innovate-and-missed-their-chance/. (Retrieved 4 January 2019).
Agca, O., Gibson, J., Godsell, J., Ignatius, J., Davies, C.W., Xu, O. An Industry 4 Readiness Assessment Tool. Available at: https://warwick.ac.uk/fac/sci/wmg/research/scip/industry4report/final version of i4 report for use on websites.pdf. (Retrieved 20 December 2018).
Aspen Technology website: AspenTech Introduces Industry’s First Web-based Engineering and Manufacturing Software with the New aspenONE® Release. https://www.aspentech.com/en/resources/press-releases/aspentech-introduces-industrys-first-web-based-engineering-and-manufacturing-software15032388854. (Accessed 4 April 2019).
Atherton, J.H., Carpenter, K.J., 1999. Process Development: Physicochemical Concepts. Oxford University Press, Oxford.
Beck, D.A.C., Carothers, J.C., Subramanian, V.R., Pfaendtner, J., 2016. Data science: accelerating innovation and discovery in chemical engineering. AIChE J. 62, 1402–1416.
Becker, S., Nagl, M., Westfechtel, B., 2008. Incremental and interactive integrator tools for design product consistency. In: Nagl, M., Marquardt, W. (Eds.), Collaborative and Distributed Chemical Engineering. From Understanding to Substantial Design Process Support. Springer, Berlin.
Bogle, D., 2019. Report on 2nd European forum on new technologies, digitalisation in chemical engineering workshop. EFCE Newsletter (April), 6–8.
2018. Center for Process Systems Engineering Workshop. Imperial College, London, 07 December.
Clarke N. Analytics is Not Just About Patterns in Big Data. https://www.computerweekly.com/blog/Data-Matters/