


Issues in Synthetic Data Generation for Advanced Manufacturing

Don Libes
National Institute of Standards and Technology
Gaithersburg, MD, USA
[email protected]

Sanjay Jain
The George Washington University
Washington, DC, USA
[email protected]

David Lechevalier
Le2i, Université de Bourgogne
Dijon, France
[email protected]

Abstract— To have any chance of application in the real world, advanced manufacturing research in data analytics needs to explore and prove itself with real-world manufacturing data. Limited access to real-world data largely contrasts with the need for data of varied types and larger quantity for research. Use of virtual data is a promising approach to make up for the lack of access. This paper explores the issues, identifies challenges, and suggests requirements and desirable features in the generation of virtual data. These issues, requirements, and features can be used by researchers to build virtual data generators and gain experience that will provide data to data scientists while avoiding known or potential problems. This, in turn, will lead to better requirements and features in future virtual data generators.

Keywords- virtual data; synthetic data; data generation; smart manufacturing

I. INTRODUCTION

Much of the research in advanced manufacturing involves the creation of models and simulation-based experimentation. Simulation leads to significantly faster results than physical experiments that require use of real machine tools and materials in a physical shop floor [1]. Such simulations and models require data for carrying out those experiments. These data can be representative of a large number of sources such as machine tools, robots, suppliers, etc. The data types are also quite varied, such as material density and strength, or machine tool wear and energy usage.

The concept of simulated data has been referred to in multiple ways in the literature. Many people use terms such as artificial data, virtual data, generated data, fake data, and synthetic data to all mean the same thing. In contrast, people denote data generated by physical machines as real data, physically-generated data, or live data. We (the paper authors) use both sets of terms interchangeably in speech but prefer synthetic data and real data in the written word. The term live data is occasionally used to emphasize physically-generated data that is being consumed within a small window of its generation. However, this can be confusing, so we generally avoid that term or make clear in context what we mean.

A. Background

Synthetic data generation has received significant interest in recent years across a number of fields. A recent review [2] of the literature related to synthetic data identifies use in multiple fields including economics, urban planning, transportation planning, cyber security, weather forecasting, and bioinformatics. Interestingly, this review did not report any activity in manufacturing. The paper focused on synthetic data generation for learning analytics in the education environment and identifies some of the same issues that are discussed here in relation to advanced manufacturing, such as machine-learning training or extension of real data sets. Similarly, [3] focused on synthetic data generation for the Internet of Things (IoT) environment to address some of the same challenges, including the limited access to real-world data discussed here.

There have been papers that tangentially touch on synthetic data generation but with a primary focus on simulation. For example, [4] describes the European Virtual Factory Framework (VFF), an interoperability framework for factory modeling. The VFF includes a Virtual Factory Data Model (VFDM) for common representation of factory objects to evaluate performance of production systems and a Virtual Factory Manager to manage a shared data repository. Support for the external simulation tool, Arena, demonstrates the data generation capability and the potential for improved data interoperability [5].

The National Institute of Standards and Technology (NIST) has two efforts specifically in the area of data generation for manufacturing. The first is the STEP2M Simulator, which simulates machine monitoring data given process plans [6]. The process plans conform to the STEP-compliant data interface for Numeric Controls (STEP-NC) [7] and the resulting generated data are presented using MTConnect [8]. A second project defined a virtual machining model that simulates a 2-axis turning machine [9]. This model uses STEP-NC [10] commands and material properties as inputs, and uses equations to compute time and power consumption depending on the tool path. The simulation generates machine-monitoring data in MTConnect format usable for other simulations. In [11], the authors demonstrated the usability of these data with a 3-axis milling operation. They integrated the earlier work into an agent-based model to provide capabilities to create a shop floor model [12]. More complex shop floor models could be built combining different agent-based models representing different operations. Such a combination would lead to the generation of data at the machine level but also at the shop floor level, where data could be aggregated.
II. WHY IT IS SO HARD TO OBTAIN REAL DATA

There are many reasons why it can be difficult or impossible to get sufficient real data, both by type and quantity. Libes et al. describe how significant amounts of data exist but are nonetheless unusable [13]. For example:

• Data may be proprietary, preventing access.
• Data may have timestamps that are not synchronized, preventing data joining.
• Data frequency may be insufficient.
• Data may be formatted inadequately, leaving ambiguity.
• Data may have inexplicable gaps.
• Data may have been generated with different underlying goals.
• Metadata may be inadequate or non-standard, resulting in semantic confusion.
• Accuracy may be undocumented.
• Data provenance may be suspect (modified without documentation) or unknown.
• Data may be too costly to obtain.

We lack the space to explore all of these in detail, but we will consider one as an example: data formatting.

Enterprises sit atop a vast collection of disparate data, likely produced by a multitude of heterogeneous sensors, and often ultimately stored in files formatted according to a variety of standards, with varying degrees of compliance. Significant amounts of data may follow no standards whatsoever. Using standard specifications such as XML (eXtensible Markup Language) and JSON (JavaScript Object Notation) can help [14]. However, problems such as under-specification can remain and leave ambiguities. For example, using an XML attribute called "time" means little if there is no definition for how the string is to be interpreted: absolute with respect to UTC (Coordinated Universal Time) or local time zone? Is the time relative to the start of a process or something else?
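To make the ambiguity concrete, the sketch below (our illustration in Python, not part of any cited standard) enforces one unambiguous policy: accept only timestamps that carry an explicit UTC offset, and normalize them to UTC. Any producer that emits bare local times is rejected rather than silently misread.

```python
from datetime import datetime, timezone

# One way to remove the ambiguity: require producers to emit ISO 8601
# timestamps with an explicit UTC offset, and reject anything else.
def parse_strict(ts: str) -> datetime:
    """Accept only timestamps that carry their own UTC offset."""
    parsed = datetime.fromisoformat(ts)
    if parsed.tzinfo is None:
        raise ValueError(f"rejecting ambiguous timestamp: {ts!r}")
    return parsed.astimezone(timezone.utc)

print(parse_strict("2017-12-11T14:23:07-05:00"))   # unambiguous: becomes 19:23:07+00:00
try:
    parse_strict("2017-12-11T14:23:07")            # bare local time: rejected
except ValueError as err:
    print(err)
```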
While standards can be helpful, they are not panaceas. For example, equipment and software from different vendors may use different standards. Data, while still standard-compliant, can lose fidelity during interchange. Standards frequently have different levels of compliance that users may choose. Even highly specific standards do not guarantee data are usable. For example, the machine tool standard, MTConnect, covers only one direction of communication; so, correlation to commands may not be present or may need reconstruction with timing uncertainties. Equally important, MTConnect does not cover all possible types of data that a machine tool can generate. For example, MTConnect defines a fixed set of statistics for DataItems. Kurtosis, a measure of peakedness relative to a normal distribution, is in the set. Skewness, a measure of symmetry, is not in the set.

All of these choices of standards and specifications have reasons for existence. For example, different vendors may have different reasons for their choices. These can include historical issues, expense in tracking developing standards, and interactions with other software. For example, the choice of OWL (Web Ontology Language) variants depends on how much need there is for expressiveness – the breadth of concepts that can be represented [15]. But greater expressiveness brings with it a loss of computational guarantees [16].

Data standards may be descriptive (describing practices) or prescriptive (defining practices). Each carries with it downsides. For example, descriptive standards may prevent the use of innovative techniques that are too new to be incorporated in standards, while prescriptive standards may be ignored when better technology solutions are discovered. These dilemmas are particularly apparent in rapidly changing and highly competitive fields. To allow variations and technological advances, some standards intentionally leave areas of ambiguity, with a resulting ambiguity in the data.

III. USES OF SYNTHETIC DATA

Development and maintenance of data analytics applications, models, and real factories can all make use of synthetic data, albeit in different ways. For example, data analytics applications can use synthetic data to test that training algorithms perform adequately. Factories can also use the data to experiment with proposed changes. For instance, it is not possible to test a replacement machine tool before it is purchased and installed; however, data synthesized by a machine tool simulator can allow the factory to be modeled and tested as if the machine tool were present. Similarly, policy or algorithm changes can be tested before deployment in a real factory. Lastly, synthetic data can be a resource for test suites that exercise the full range of states of a system – including both normal operation and error conditions.

In short, synthesized data can be used throughout the lifecycle of a factory – from initial brainstorming to development to maintenance. We focus on several uses of synthetic data in this section.

A. Machine-Learning Training

Machine learning models are not explicitly programmed or based on a physical model but are built primarily by analyzing data. For example, neural networks are trained on data and typically produce networks that make no attempt to model the physics. There is no use of equations that correspond to the performance of any real machines. An appropriate amount of data is necessary to make sure that machine learning algorithms perform well.

While defining the right amount of data is one area of research, it is equally important to understand the limitations of synthetic data in this context. For instance, a neural network that is trained on synthetic data is unlikely to provide a better model than the model used to generate the synthetic data in the first place.
For this reason, users (e.g., neural networks) of machine-learning algorithms should track the provenance of any generated data used to train those algorithms. Without such provenance, mistaken assumptions may arise over the quality of decisions generated by machine-learning algorithms.
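Such a provenance record can be lightweight. The sketch below shows one minimal form it might take; the generator name, version, and seed fields are hypothetical, and the SHA-256 digest ties the record to the exact bytes used for training.

```python
import hashlib
import json
from datetime import datetime, timezone

# A minimal provenance record for a synthetic training set. The generator
# name, version, and seed are hypothetical stand-ins; the SHA-256 digest
# ties the record to the exact bytes that were used for training.
def provenance_record(data: bytes, generator: str, version: str, seed: int) -> dict:
    return {
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "generator": generator,
        "generator_version": version,
        "random_seed": seed,
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }

# Stored alongside the data set, the record lets a trained model be traced
# back to the generator (and configuration) that produced its training data.
record = provenance_record(b"1.02,1.07,0.98\n", "virtual-lathe-sim", "1.2.0", seed=42)
print(json.dumps(record, indent=2))
```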
B. Verification

The American Society of Mechanical Engineers (ASME) defines verification as the process of determining that a computational model accurately represents the underlying mathematical equations and their solution [17]. During development, verification is used to ensure that data analytics applications and models meet their design requirements and specifications. While direct methods are available in limited ways for verification, data can be used in verification, in some cases as input to be consumed and in some cases as output to be tested against.

Being able to generate indefinite amounts of data allows more extensive testing than would otherwise be possible through limited reference data produced by physical machines. In addition, parameters can be changed to generate different synthetic data that would be very expensive or impossible to obtain by using physical machines.

C. Validation

ASME defines validation as the process of determining the degree to which a computational model is an accurate representation of the real world from the perspective of the intended uses of the model [17]. Validation confirms that applications or models match the needs of the customer. Data analytics applications and models should be validated – ideally, during development and continuing through the lifecycle. Data are useful in this validation process, particularly in cases when physically realized applications, models, and physical instances will not yet be available. Having synthetic data – both input and output – as if the enterprise existed can be used to test that requirements are valid. Throughout the life of applications, models, and real factories, synthetic data can be used to validate that proposed changes continue to result in valid results.

D. Optimization

Optimization is the process of improving algorithms or applications to produce the best results and improve the system under study. (The term is often mistakenly used for the process of "improvement"; improvement is practical while optimality is aspirational.) Optimization is frequently performed by simulation.

Synthesized data can be used for optimization, such as in factory models. These models are not algorithmic machine-learning models (see earlier) but rather simulations of factories (or subsets) that can be tested for performance optimality. For example, work cell and machine tool arrangements of a factory floor can be simulated and tested prior to deployment in order to select the best layout. Since such simulations have no real counterpart, synthesized data can be used in the simulation. For example, part flow, machining time, machine breakdown, etc., can be used during simulation runs to produce and measure more realistic performance metrics.

E. Augmenting Real Data

For many reasons, real data can be missing or insufficient. For example, sensors may be incapable of collecting data quickly enough. Many techniques are possible to replace missing data values. For example, simple mathematical interpolation can produce data that fills gaps and reflects existing data. However, interpolation can distort data. This may sound counterintuitive, but it means that interpolated data give the mistaken impression that the data are smoother than they actually are.
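This distortion is easy to demonstrate. In the sketch below (an illustration with invented signal parameters), a gap in a noisy series is filled by linear interpolation; the filled segment varies far less from step to step than the real segment it stands in for.

```python
import numpy as np

# Illustration of the distortion described above: a gap filled by linear
# interpolation is much "smoother" (smaller step-to-step variation) than
# the real, noisy signal around it.
rng = np.random.default_rng(1)
signal = np.sin(np.linspace(0, 6, 60)) + rng.normal(0, 0.2, 60)  # noisy "real" series
gapped = signal.copy()
gapped[20:40] = np.nan                                           # simulated sensor outage

idx = np.arange(len(gapped))
known = ~np.isnan(gapped)
filled = np.interp(idx, idx[known], gapped[known])               # linear gap fill

def roughness(x):
    """Mean absolute step-to-step change."""
    return float(np.abs(np.diff(x)).mean())

print(f"roughness of real segment:   {roughness(signal[20:40]):.3f}")
print(f"roughness of filled segment: {roughness(filled[20:40]):.3f}")  # near-constant slope
```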
Analytic applications can automatically fill in missing data values by essentially running algorithms in reverse, producing data that is more representative of real data. For example, once a neural network has been trained on a data stream, the neural network can be used to produce more data. However, by its very nature, such synthetic data will not have any impact on analytics, rendering it of no intrinsic value except perhaps for pro forma purposes such as creating more complete visualizations.

Filling in missing data is useful during development (for example, for optimization) but also when a factory is in production. In that case, there may be instances or periods when data are unavailable, whether due to a broken sensor or a communications problem. In this case, missing data can be replaced with synthetic data in real time.
IV. CHALLENGES FOR DATA GENERATION AND COLLECTION

While synthetic data have many uses, challenges exist, many of which may not be obvious until encountered. Thus, we describe some of the most significant challenges in data generation.

A. Quantity Sufficiency

Algorithms for synthesizing data require some minimum amount of realistic data, although the data may be of a different type. For example, generation of energy data may require a machine specification and material data. Some algorithms may generate data by using data similar to what is desired – for example, by increasing or decreasing the variance of an existing dataset.

Data analytics algorithms generally increase accuracy with more data when that data cover more of the descriptive space to be analyzed. Of course, more data can slow down algorithms, worsening time performance. More data can also be unnecessary, essentially providing no new information. More data can even introduce artifacts that are irrelevant and hurt model performance. However, sufficient quantities of training data can be significant, for example during walk-forward testing while training neural networks. Knowledge of the quantity required for training can be as useful as the choice of the data itself and which data are more likely to be seen in the real system. Defining the right quantity might be possible by calculating the squared error of the data analytics model while training and testing to determine whether it is over- or under-fitted. Defining the right quantity of data before training is, however, a more complex issue.
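The squared-error check can be made concrete as follows (a sketch with a deliberately flexible polynomial model; data and model are invented for illustration). A training error far below the validation error signals overfitting, and the gap narrowing as the training set grows suggests the quantity is approaching sufficiency.

```python
import numpy as np

# Compare training and validation mean squared error as the training set
# grows; a large, shrinking gap is the overfitting signature.
rng = np.random.default_rng(0)

def mse(model, X, y):
    return float(np.mean((np.polyval(model, X) - y) ** 2))

X = rng.uniform(-1, 1, 400)
y = np.sin(3 * X) + rng.normal(0, 0.1, 400)          # stand-in synthetic data

for n in (20, 50, 100, 200):                          # growing training sets
    Xtr, ytr, Xva, yva = X[:n], y[:n], X[200:], y[200:]
    model = np.polyfit(Xtr, ytr, deg=9)               # deliberately flexible model
    print(f"n={n:3d}  train MSE={mse(model, Xtr, ytr):.4f}  "
          f"val MSE={mse(model, Xva, yva):.4f}")
```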
B. Timing

Closely related to quantity sufficiency are issues of timing. These issues can be complex and can be impacted by many parts of an enterprise, but sensor behavior is the simplest place to explore such issues.

Sensors operate in two modes. Some sensors run free, meaning that as soon as they have finished reporting data, they begin collecting or sending additional data. They are constantly busy. This can be useful for algorithms that want to consume as much data as possible. Other sensors may be synchronized to an internal or external source. Such sensors artificially discretize readings. Being able to synthetically simulate such data sources can be useful for properly modeling real-world factories.

Generators may need to be capable of generating both types of data sources, with the ability to generate data at arbitrarily large rates.
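Both modes are straightforward to mimic. The sketch below uses invented timing constants and a hypothetical temperature channel: the free-running stream spaces samples by read time plus jitter, while the synchronized stream discretizes readings to a fixed period.

```python
import itertools
import random

# Minimal sketches of the two sensor modes described above.
def free_running(read_ms=7, jitter_ms=2):
    """Free-running: a new sample begins as soon as the last one finishes."""
    t = 0.0
    while True:
        t += read_ms + random.uniform(-jitter_ms, jitter_ms)
        yield t, 20 + random.gauss(0, 0.5)   # hypothetical temperature reading

def synchronized(period_ms=50):
    """Clock-synchronized: readings are artificially discretized to a period."""
    for tick in itertools.count(1):
        yield tick * period_ms, 20 + random.gauss(0, 0.5)

for stamp, value in itertools.islice(free_running(), 3):
    print(f"free-running  t={stamp:6.1f} ms  value={value:.2f}")
for stamp, value in itertools.islice(synchronized(), 3):
    print(f"synchronized  t={stamp:6.1f} ms  value={value:.2f}")
```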
C. Dynamic Generation

Data analytics developers often do not need or want to store all data in advance of the need. The amount of data needed may be too large to store. Generating data dynamically can also better reflect the way data are consumed, as well as leaving open the possibility for feedback that can change the data generation process. The converse is also true. Consider data coherency. Data coherency can be disrupted by updating data at the same time it is being consumed but without synchronization. By avoiding the storage of large amounts of synthetic data, data coherency is maintained with lower costs (e.g., locking protocol overhead). For these reasons, it can be useful to generate and immediately consume data dynamically whenever possible.

On the other hand, storage of large data sets may be useful for certain types of analysis. When time and space are not significant factors, data can be evaluated more thoroughly. Resulting models that are relatively static or are already optimal do not need to be continuously or frequently retrained if there is no benefit to doing so.
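The generate-and-immediately-consume pattern maps naturally onto a generator function, as in the sketch below (the power-draw model is invented for illustration): values are produced one at a time and never stored, so there is no stored copy to drift out of coherence with the consumer.

```python
import random

# Generate-and-immediately-consume: no intermediate storage.
def power_draw_stream(n):
    """Yield one synthetic power reading (kW) at a time."""
    for _ in range(n):
        yield max(0.0, random.gauss(3.0, 0.4))

# The consumer pulls values as they are produced; only the running
# aggregate is kept in memory.
total = count = 0
for kw in power_draw_stream(10_000):
    total += kw
    count += 1
print(f"mean draw over {count} readings: {total / count:.2f} kW")
```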
1) Feedback and Control

Real-time performance applications make decisions that are fed back to the system, thereby affecting its performance. Naturally, changing the outcome of such decisions changes the performance of the system, which in turn changes any performance data produced by the system that would ordinarily be fed back to the analytics applications.

It is desirable for data generators to be able to incorporate feedback in order to control future data generation. This is not necessarily easy, as feedback control can introduce non-trivial synchronization issues. For example, lock-step synchronization is typically not necessary, as there are lags in real factories both in how quickly data-driven decisions affect controllers and in how quickly sensors can return feedback. Determining at what levels this is accounted for (e.g., generators, models, simulations) and with what accuracy can be challenging.
2) Speed

For physical data generation, data are generated at whatever speed the producers can generate them. For data analytics applications, that speed is generally not of interest. A data analytics developer is not consuming real data directly from a shop floor except in rare circumstances. Developers almost always use models or simulations that can run faster than real time. For that reason, when data are generated dynamically, it is desirable to have data generators that can run faster than real time, preferably as fast as the simulation itself.

In some contexts, data generation may need to be slowed down to real time to ensure that applications are able to provide feedback in a timely manner to impact factory performance. For example, if batch dispatching decisions occur every minute and an analytic application promises to improve system performance through better dispatching, the application should be able to execute and respond within sub-minute intervals. The capability of such an application to perform well in a real system should be tested with a manufacturing simulation that ensures real-time performance.

D. Data Hiding / Suppression

For a variety of reasons, it can be useful to suppress or hide (i.e., to not use) synthetic data during analysis (or during development and testing of analysis). It may be desirable to hide details of the implementation as well. Some of these considerations are presented in this section.

1) Intentional hiding

For various reasons, more data may be available than is desired. Thus, it may be useful to hide some of the data. For example, the analytic engines may be incapable of consuming all the data, particularly if time limits are an issue. This is a concern with the need for real-time results. Data sampling might be a solution to achieve this task without modifying the dimensions represented in the data set.

2) Walk-forward testing

Walk-forward testing consists of training a machine learning model with a subset of the data and testing the trained model with an unseen subset of the data. It is another example where data must be hidden – at least initially. This may be repeated on many quantities of unseen data in order to ensure that systems are not over-trained. The aim is to create a model that is not necessarily expert at recognizing only the training data but is capable of recognizing data that is also likely to be produced in similar scenarios.
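The splitting logic can be sketched schematically as follows (our illustration, assuming time-ordered samples): each round trains on everything seen so far and tests on the next, still-hidden slice.

```python
# A schematic walk-forward split over time-ordered samples: train on an
# expanding window, always test on the slice the model has not yet seen.
def walk_forward(n_samples, train_size, test_size):
    """Yield (train_indices, test_indices) pairs over a time-ordered set."""
    end = train_size
    while end + test_size <= n_samples:
        yield range(0, end), range(end, end + test_size)
        end += test_size

for train_idx, test_idx in walk_forward(n_samples=10, train_size=4, test_size=2):
    print(f"train on 0..{train_idx[-1]}, test on {test_idx[0]}..{test_idx[-1]}")
# train on 0..3, test on 4..5
# train on 0..5, test on 6..7
# train on 0..7, test on 8..9
```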
3) Filtered

Data may be hidden because they are withheld (filtered) for a variety of reasons. For example, data that are clearly wrong or exceed certain bounds may be suppressed, depending on the needs of the data analysis applications. An analytics application may be designed to run in a live system where data quality is guaranteed by database constraints. During development, those guarantees may be maintained even when the generator is providing the data without any database filtering. Describing what should be filtered can be arbitrarily complex, as there are many reasons for filtering and the reasons can be combined in complex ways. For instance, analytics software under development might be restricted from out-of-bounds data (the previous example) as well as from malfunctioning sensor readings that are within bounds but inaccurate.

4) Black box generators

A data generator or its underlying models and data may be treated as a black box. Hiding the implementation can prevent certain types of analytics that may give an unwarranted impression of a tool that is more universal than it is, simply because the implementer (or software) can "see" the implementation and use the implementation, rather than the data, as a basis for analytics.

Data hiding also prevents premature optimization as well as shortcuts that can overlook problematic data. For example, an optimizer that knows it will never see data outside a certain range may learn that it is not necessary to handle such data. When it is faced with such unexpected data, perhaps from a misbehaving process upstream, the optimizer is likely to behave inappropriately since it has not been trained on such data.
E. Data Quality

Sensors quantize data, have lags, fail, and have other issues. So, ideal data should never be expected from real enterprises. However, there is value in building ideal data generators. For example, certain types of algorithms require, or perform much better with, optimal goals being provided. This is the case with many non-heuristic approaches to nondeterministic polynomial (NP) problems, such as identifying optimal routing [18]. Nonetheless, most interest for data analytics is in creating more realistic data. The following sections describe data quality issues in data generation.

1) Reliability

Physical systems can be unreliable, so it is useful for synthetic data to be able to reflect that. This unreliability can be difficult to model, as misbehavior can come about in so many ways. For example, sensors can behave erratically, communication can encounter interference, or power can be dirty. Each causes reliability issues.

Within each type of problem, there is a spectrum of unreliability. Reliability can also change over time, typically increasing but occasionally decreasing. In short, reliability is complex, so generating realistic unreliable data is complex.

2) Accuracy

The accuracy of physical sensors must be accounted for by data generators. While an accuracy limit may suffice, it is more realistic to produce a range that models the physics of the sensor; however, this is often not possible. Instead, a variety of ranges are used, such as a Gaussian distribution, Poisson distribution, or Bayesian-based distribution.
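For example, a generator might perturb an ideal reading with noise scaled to a stated accuracy band, as in this sketch (the accuracy figure of ±0.1 units is invented):

```python
import random

# Perturb the ideal value with Gaussian noise scaled so the stated
# accuracy band (here an invented ±0.1 units) covers roughly three
# standard deviations.
def noisy_reading(true_value: float, accuracy: float = 0.1) -> float:
    return true_value + random.gauss(0.0, accuracy / 3.0)

random.seed(7)
print([round(noisy_reading(25.0), 3) for _ in range(5)])
```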
3) Uncertainty

There are many sources of uncertainty. For example, machine-tool manufacturers themselves may not have a usable mathematical model for their products. Even if they do have a model, it may specifically omit aspects that are difficult or entirely unknowable. For these and other reasons, physical device manufacturers often state only a range of accuracy, leaving questions of distributions a difficult challenge between the designer of a generator that mimics the device and the user who configures the data generator. Quantifying and aggregating the (epistemic and aleatory) uncertainty generated by different sources during the simulation is complex and requires clearly identifying and evaluating each source of uncertainty [19].

4) Adjusted

Sensor data can often be adjusted at intermediate processing nodes. For example, data can be joined at a network node that collects data from several sensors for each entry in a log. Adjustments can include normalization, scaling, and quantization. This means that analysis software may only see adjusted data rather than raw data. For this reason, data generators should be able to produce either raw data or data adjusted in a variety of ways.
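A generator supporting both views might expose something like the following sketch, where the scale factor and quantization step are invented:

```python
# Produce raw and adjusted views of the same synthetic readings; the
# scale factor and quantization step are illustrative stand-ins.
def adjusted(raw_values, scale=10.0, step=0.5):
    """Scale each reading, then quantize it to the nearest step."""
    return [round((v * scale) / step) * step for v in raw_values]

raw = [2.031, 2.047, 1.998, 2.112]
print("raw:     ", raw)
print("adjusted:", adjusted(raw))   # e.g. 2.031 -> 20.5
```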
F. High-Level Key Performance Indicators Without Low-Level Data

In many scenarios, the company is interested in high-level (i.e., enterprise or shop floor level) Key Performance Indicators (KPIs) or data and not in low-level (i.e., process or machine level) KPIs or data. For example, machine tool measurements may not be significant to higher-level processes such as machine-tool inventory predictions, despite being based on machine tool information. Many analyses only use high-level KPIs as input, such as finished product cost, inventory cost, and goals such as minimizing late orders. Once the KPIs are created, the low-level data are never again used. In such scenarios, the low-level data are not necessary if the KPIs can be created independently.

Data generators that provide high-level KPIs may not need to model the underlying physics if the KPIs are good enough. Of course, "good enough" is challenging to define. Generating KPIs in a top-down approach (for example, driven by organizational goals) may be harder than generating low-level data and doing a higher-level simulation to arrive at KPIs. Whether this is true depends on the fidelity requirements of the KPI-consuming applications [20].

G. Well-Described Scenarios

To generate data that are useful, scenarios must be identified to define the context in which the data apply. Finding and describing such scenarios can be difficult for several reasons. There are many variables that likely differ for every factory: products, machine tools, goals, and costs. These are almost always mixed. For example, a manufacturer may have a mix of machine tools, and the number, types, and layout will differ from one manufacturer to another. Similarly, one manufacturer's goals are likely to differ from another's. One manufacturer may have contracts with suppliers and utilities while another manufacturer will have different suppliers and other constraints. Machine tools will be of different ages and exhibit different performance characteristics. Where semi-automation is an issue, human factors will be unique as well.

To accommodate differences such as constraints or goals, data generator parameters can be adjusted; however, thought should be given to whether these parameters are intrinsic to data generation. For example, while both machine tool performance characteristics and goals (such as "use minimal power") will affect the generated data, the former is intrinsic to the generated data while the latter can be considered a dependent variable that is only meaningful to a higher level of control.

H. Manufacturing Levels

It is customary to organize a hierarchy of manufacturing at different levels of operation, e.g., models for machine tools, workcells, factories, and enterprises. Additionally, customers and supply chains may also be modeled. Simulators or analytics applications may be specific to a level or may include multiple interacting levels, such as a supply chain model and its interaction with the factory level [21].

Data at any one level are likely to have a strong dependence on other levels of a hierarchy. Obviously, a factory model depends greatly on the performance of the operations within. Less obviously, a machine tool depends on the goals of its workcells or even higher levels. For example, a machine tool may wear less by running at a slower speed, which is only acceptable if the goals of the factory or workcell permit.
I. Data Type Complexity

Many different types of data exist in a real factory. For example:

• Material Data: costs/characteristics/physics of material, water, energy
• Process Data: task time, customer demand, production schedules
• Product Quality: geometry, structural integrity, performance
• Manufacturing Equipment: efficiency, reliability, spare capacity
• Employee Data: salary, hours worked, employee skills

It is possible to generate all of these, but the more types of data generated, the more work is required. For practical purposes, many of these are unnecessary, depending on the particular goals of the analysis. Not all goals can be achieved simultaneously, as many will always conflict. For example, it is generally impossible to achieve both minimal energy usage and minimal time. There is always a tradeoff. For the same reason, there is no minimally optimal data set. The scope of data sets varies based on the objectives of the analysis and the level of abstraction desired by the analyst.

J. Repeatability, Reproducibility, and Provenance

Repeatability, reproducibility, and provenance are closely related, all having to do with the confidence in the ability to recreate generated data.

1) Repeatability

Repeatability of data generation from a physical enterprise is difficult. For synthetic data generation, repeatability is generally straightforward. Generators must be capable of publishing and accepting random number seeds. Generator code must not require anything else that could change the data output from one run to another. The code itself must be published in such a way that the code remains the same so that others can rerun the same data generator with identical results. This can be ensured with a signing procedure, such as providing the results of a cryptographic hash function.
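The mechanics are small: a generator that accepts and publishes its seed, plus a digest of the output that others can compare after rerunning it. The sketch below illustrates this with a stand-in generator; it is not the interface of any particular tool.

```python
import hashlib
import random

# Repeatable generation: the seed is an explicit, published input, and a
# digest of the output lets anyone verify an identical rerun.
def generate(seed: int, n: int = 1000) -> list[float]:
    rng = random.Random(seed)               # all randomness flows from the seed
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def digest(values) -> str:
    return hashlib.sha256(repr(values).encode()).hexdigest()

run1, run2 = generate(seed=42), generate(seed=42)
print(digest(run1) == digest(run2))          # True: identical seeds, identical data
```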
2) Reproducibility

While repeatability is straightforward, reproducibility can be more difficult, since underlying libraries and computer hardware can cause output differences despite identical high-level code. Languages such as C are particularly notorious for this. These differences are not necessarily a bad thing but are intended to give programmers more control over the efficiency of an implementation. However, programmers are free to ignore (or may be unaware of) such subtleties, which can lead to non-reproducibility. C is not alone; many higher-level languages have subtle dependencies as well. In [22], the authors review correct approaches to assess the consistency of a measuring process. These approaches can be similarly applied to data simulation to ensure consistent generation of data.

3) Provenance

Some data may not need reproducibility as long as they are suitable for their purpose and their provenance can be ascertained. Provenance through a digital signature provides a guarantee of the source of the data and that they have not changed. Only the signing of data and their metadata is necessary.

Closely related to provenance is traceability. While data may be provably shown to have come from one data supplier, they may be of no value if others cannot ever hope to build a factory that reproduces them.

K. Model Type Choice and Provenance

There are a variety of model types used for data generation. There may also be hybrid combinations of these models.

1) Physics-based Models

Physics-based models are based on equations that reflect our understanding of what physically happens in real life. In theory, physics-based models are the optimal way of modeling any manufacturing process, since they capture all the information and effectively produce our best understanding of a process.

However, these equations can be difficult to obtain, relying on, for example, a machine tool manufacturer that may have little incentive to provide specific types of equations or explanations. Alternatively, the equations may exist but fail, for a variety of reasons, to correlate with what is observable. For example, machine tool wear may be a factor in the equation but be effectively unmeasurable because measuring it requires destructive testing or machine disassembly that voids a warranty.

For practical reasons, it is rarely the case that we truly have accurate equations that hold in all situations. For example, it would not be sensible to have quantum-level models when Newtonian models are sufficient for almost all purposes. Similarly, while physics-based models are often implemented using continuous simulation, the ability to achieve arbitrary levels of resolution and scale is generally overkill for most data generation needs.

The result is that physics-based models are always idealized and require adaptation to be used for realistic data generation.
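To give a flavor of such a model, the sketch below estimates cutting power for a turning operation from the material removal rate and a specific cutting energy, a standard first-order machining relationship. All constants are illustrative and are not drawn from this paper or any machine specification.

```python
# First-order physics-based sketch: cutting power from material removal
# rate and specific cutting energy (P = u * MRR). All constants are
# illustrative; a real model would come from machine and material data.
SPECIFIC_CUTTING_ENERGY = {"aluminum": 0.7, "steel": 2.5}  # J/mm^3, rough textbook ranges

def cutting_power_watts(material: str, speed_m_min: float,
                        feed_mm_rev: float, depth_mm: float) -> float:
    mrr_mm3_s = (speed_m_min * 1000.0 * feed_mm_rev * depth_mm) / 60.0
    return SPECIFIC_CUTTING_ENERGY[material] * mrr_mm3_s

print(f"{cutting_power_watts('aluminum', 300, 0.2, 2.0):.0f} W")  # ~1400 W
```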
2) Empirical Models

Empirical models are based on data, ideally from physical systems. These data are then used to build machine-learning models using techniques such as neural networks and Gaussian networks. Such empirical models are generally used with discrete simulation techniques, although this is not mandatory.

The result is that empirical models are adequate to provide good data generation for some situations. However, for situations that they have not been trained for, such as edge cases and other unusual events, empirical models can fail to generate suitable data.

3) Special-purpose Models

Special-purpose models can be based on techniques other than those based in statistics or physics. For instance, a data generator could be used to produce intentionally bad data specifically to test the behavior of analysis tools. It may suffice to use a stream of zeros or just open a file of random data. Similarly, an ideal data generator (see Section IV.E) can be used to test minimal fitness or conformance. Such data can be based on published or nominal specifications from a manufacturer with no regard to a statistical or physical model. Data could also be aspirational, referring to a goal that no known algorithm can achieve but that represents a provable limit.
L. Integration, Interoperability & Standards

While it is important that data be meaningful, it is also important that data be in a form that allows them to be easily used.

1) Data Interchange Standards

It is desirable for data to be in a form that uses well-recognized standards. A collection of standards, often considered as a stack or hierarchy, may be used together, although rarely is the delineation clean. For example, standards such as XML or JSON may be used for low-level syntactic formatting, while International Standards Organization (ISO) 10303 for product manufacturing information and ISO 22400-2:2014 (KPIs) apply to higher-level manufacturing semantics [23], [24], [25]. Many standards exist to address other concerns. For example, CMSD (Core Manufacturing Simulation Data) and CSPI (COTS (Commercial Off-The-Shelf) Simulation Package Interoperability) are standards for facilitating the use of simulation models such as shop floor configurations [26], [27]. Standards such as MTConnect and Open Platform Communications Unified Architecture (OPC-UA) may be used to exchange manufacturing information from machine tools, while Representational State Transfer (REST) and Simple Object Access Protocol (SOAP) are examples of standards for carrying out communications [28], [29].

Data generators must be able to produce data in a form that either conforms to the relevant standards or is readily adaptable to them.

2) Plug-in Interoperability

Data generators should be able to serve as plugins to other systems, such as domain-specific testbeds and model design software, that may provide other services such as model transformation or optimization. By using standards, plug-in capability can be more easily supported and generators are more likely to be incorporated into additional systems. In the reverse sense, generators should also be able to use models based on a plugin architecture. For example, a data generator only capable of generating the results of a milling machine using aluminum would see limited use. By allowing models of tools and materials to be plugged in, and by parameterizing factors such as feed rates, a generator becomes much more useful and more widely applicable.
V. SUMMARY AND CONCLUDING NOTES

Limited access to real-world data is a significant impediment to advanced manufacturing. Use of virtual data is a promising approach to make up for the lack of access. We have presented issues, challenges, and desirable features in the generation of virtual data. These can be used by researchers to build virtual data generators and gain experience that will provide data to data scientists while avoiding known or potential problems. This, in turn, will lead to better requirements and features in future virtual data generators.

Two areas for future research and development are of particular interest and deserve increased attention and effort.

A. Test Data Repositories

Many people experimenting with data analytics would benefit from repositories of both real data and synthetic data. Such repositories would allow multiple academic researchers or commercial companies to be confident that they are using the same data in creating and testing software to deal with what are intended to be common scenarios. It is also desirable to provide the configuration data needed to reproduce the raw data in the repositories, so that the data can be reproduced as well as modified.

Data repositories should also have other aspects, such as areas for algorithms and models that have been proposed to address data sets in the same repository. Ideally, documentation areas and discussion forums would be helpful as well. For example, observations or questions about particular data or data configurations would enable others to make progress by more easily re-using earlier results.

Repositories have been established that incorporate some of these ideas. For example, Bosch has created a challenge that includes measurement data produced from production lines [30]. One example challenge is to "predict which parts will fail quality control." The Bosch data sets and competitions are hosted on Kaggle, a service for general data science challenges [31]. Another example is the NIST Smart Manufacturing Systems Test Bed, which makes available data from a manufacturing facility that resembles a small manufacturing shop [32]. Sets of data can be downloaded or queried. In addition, dynamic data streams can be monitored using MTConnect.

B. Standards

Standards development is already an area of intense interest. However, there are gaps in the standards that would facilitate data generation and publication of synthetic data. For example, PFA (Portable Format for Analytics) is a useful specification in which to express data analytics [33]; however, while a PFA-enabled host can generate data usable by other software, PFA lacks the ability to control the seeding of its random number generators, which limits its flexibility and repeatability. More work is needed on standards to better support data generation.

DISCLAIMER

No approval or endorsement of any commercial product by the National Institute of Standards and Technology (NIST) is intended or implied. Certain commercial software systems are identified in this paper to facilitate understanding. Such identification does not imply that these software systems are necessarily the best available for the purpose.

ACKNOWLEDGMENTS

David Lechevalier's work on this effort was supported by the National Institute of Standards and Technology's Foreign Guest Researcher Program.

REFERENCES

[1] M. Saadoun and V. Sandoval, "Virtual Manufacturing and its Implications," Virtual Reality and Prototyping, 1999.
[2] A. M. Berg, S. T. Mol, G. Kismihok, and N. Sclater, "The Role of a Reference Synthetic Data Generator within the Field of Learning Analytics," J. Learn. Anal., vol. 3, no. 1, pp. 107–128, 2016.
[3] J. W. Anderson, K. E. Kennedy, L. B. Ngo, A. Luckow, and A. W. Apon, "Synthetic data generation for the internet of things," presented at the 2014 IEEE International Conference on Big Data, 2014, pp. 171–176.
[4] W. Terkaj and M. Urgo, "Virtual factory data model to support performance evaluation of production systems," in Proceedings of the OSEMA 2012 workshop, Graz, Austria, 2012, pp. 24–27.
[5] "Arena Simulation Software." Rockwell Automation, 2017.
[6] G. Shao, S. Jain, and S.-J. Shin, "Data Analytics Using Simulation for Smart Manufacturing," in Proceedings of the 2014 Winter Simulation Conference, Savannah, GA, 2014.
[7] M. Albert, "STEP NC - The End of G-Codes?," Mod. Mach. Shop, Mar. 2006.
[8] P. Warndorf, "MTConnect Institute Releases Version 1.3.0 of the MTConnect Standard," 2014.
[9] S. Jain, D. Lechevalier, J. Woo, and S.-J. Shin, "Towards a virtual factory prototype," in Proceedings of the 2015 Winter Simulation Conference (WSC), 2015, pp. 2207–2218.
[10] "ISO 10303-238 (2007) Industrial automation systems and integration - Product data representation and exchange - Part 238: Application protocol: Application interpreted model for computerized numerical controllers." Geneva: International Organization for Standardization, May 2007.
[11] D. Lechevalier, S.-J. Shin, S. Rachuri, S. Foufou, Y. T. Lee, and A. Bouras, "Simulating a virtual machining model in an agent-based model for advanced analytics," Journal of Intelligent Manufacturing, Sep. 2017.
[12] S. Jain and D. Lechevalier, "Standards based generation of a virtual factory model," in Proceedings of the 2016 Winter Simulation Conference, pp. 2762–2773, IEEE Press, 2016.
[13] D. Libes, S. Shin, and J. Woo, "Considerations and recommendations for data availability for data analytics for manufacturing," in 2015 IEEE International Conference on Big Data (Big Data), pp. 68–75, 2015.
[14] "JSON: The Fat-Free Alternative to XML." JSON.org.
[15] "OWL Web Ontology Language Reference." World Wide Web Consortium.
[16] B. Motik, B. Grau, I. Horrocks, Z. Wu, A. Fokoue, and C. Lutz, "OWL 2 Web Ontology Language Profiles." W3C.
[17] "Guide for verification and validation in computational solid mechanics." ASME, 2009.
[18] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York: W.H. Freeman, 1979.
[19] W. L. Oberkampf, S. M. DeLand, B. M. Rutherford, K. V. Diegert, and K. F. Alvin, "Error and uncertainty in modeling and simulation," Reliability Engineering & System Safety, vol. 75, no. 3, pp. 333–357, 2002.
[20] D. Kibira, K. C. Morris, and S. Kumaraguru, "Methods and Tools for Performance Assurance of Smart Manufacturing Systems," J. Res. Natl. Inst. Stand. Technol., vol. 121, pp. 287–318, 2016.
[21] S. Jain, E. Lindskog, J. Andersson, and B. Johansson, "A Hierarchical Approach for Evaluating Energy Trade-offs in Supply Chains," Int. J. Prod. Econ., vol. 146, no. 2, pp. 411–422, 2013.
[22] P. F. Watson and A. Petrie, "Method agreement analysis: a review of correct methodology," Theriogenology, vol. 73, no. 9, pp. 1167–1179, 2010.
[23] A. Katzenbach, S. Handschuh, and S. Vettermann, "JT Format (ISO 14306) and AP 242 (ISO 10303): The Step to the Next Generation Collaborative Product Creation," NEW PROLAMAT, pp. 41–52, Jan. 2013.
[24] "ISO 10303-1:1994 Industrial automation systems and integration -- Product data representation and exchange -- Part 1: Overview and fundamental principles." ISO, Dec. 1994.
[25] "ISO 22400-2:2014 Automation systems and integration -- Key performance indicators (KPIs) for manufacturing operations management -- Part 2: Definitions and descriptions." ISO, Jan. 2014.
[26] Y.-T. T. Lee, "A Journey in Standard Development: The Core Manufacturing Simulation Data (CMSD) Information Model," J. Res. Natl. Inst. Stand. Technol., vol. 120, 2015.
[27] "SISO-STD-006-2010 Standard for Commercial-off-the-shelf Simulation Package Interoperability Reference Models." Simulation Interoperability Standards Organization (SISO), Inc., Mar. 2010.
[28] W. Mahnke and S.-H. Leitner, "OPC Unified Architecture - The future standard for communication and information modeling in automation," ABB Rev., vol. 3/2009, pp. 56–61, Mar. 2009.
[29] J. Mueller, "Understanding SOAP and REST Basics And Differences." Smartbear.com, Jan. 2013.
[30] "Bosch Production Line Performance." Bosch, Aug. 2016.
[31] "Kaggle: Your Home for Data Science." Kaggle, Inc.
[32] T. Hedberg and M. Helu, "Smart Manufacturing Systems (SMS) Test Bed." National Institute of Standards and Technology.
[33] J. Pivarski, "Portable Format for Analytics: moving models to production." KDnuggets, Jan. 2016.
