MDE - User Guide
System Configurations
Context
Manufacturing Data Engine is a packaged solution that works seamlessly with Manufacturing Connect, a marketplace product
provided by Litmus Automation. Multiple deployment combinations of these components are possible.
Once machine and process data is available in the Cloud, it becomes easier to leverage Cloud tools and
technologies to extract value from that data. Acquiring industrial data is traditionally a high-complexity, high-risk
process that adds unnecessary time and cost to any Cloud Industrial Information Management use case. MDE has been
designed to provide a generic, easy-to-use, easy-to-deploy solution that makes that process shorter, more efficient and
more predictable.
The Google Vertical Solutions team has developed the MDE Solution for Manufacturing in collaboration with Litmus
Automation. The end-to-end solution is composed of components built by Google and components built by Litmus
Automation for Google based on their current product line.
MDE is the acquisition, transformation and storage layer of the infrastructure. It acts as a data hub that all use cases can
connect to in order to access manufacturing information, and it provides a safe, efficient and available data lake containing all
manufacturing information.
All components of the Intelligent Manufacturing Suite have been designed to work seamlessly with each other. They all
share the same configuration and they are semantically interoperable.
MDE and the rest of the components are configurable. Users can define their specific data requirements and the system
will adjust to those specifications without having to modify the code underlying the solution. MDE will adapt and adjust
based on the data specification. Configuration can be updated using the MC user interface or the provided MDE
configuration API.
Capabilities
The capabilities that MC/MCe and MDE fulfill are:
● Ability to acquire data streams at the edge from multiple industrial controllers and machines, translating many of
the protocols and dialects available in the market.
● Ability to store and process those data streams locally.
● Ability to transform those data streams into MQTT and PubSub messages that can be sent to GCP from the edge
locations using a conventional Internet connection.
● Ability to map and transform any MQTT or PubSub payload structure into a predefined data schema based on the
user’s configuration.
● Ability to calculate streaming analytics and transformations based on the user’s configuration.
● Ability to store data in any of the mainstream database and storage solutions available in GCP (BQ, BT and GCS)
based on the user’s configuration.
● Ability to monitor and supervise the state of the end-to-end solution using a simple interface.
● Ability to set up a user's configuration using a simple, easy-to-use interface.
Components
The components of MDE are:
● Configuration Manager: the main Cloud component of MDE. It deploys as a container on GKE and is the element
managing the different message routing pipelines, message transformations and message storage. It contains the
user configuration.
● Dataflow components: different Dataflow jobs that are deployed with the solution, that route and transform the
messages from the edge and write the processed messages into the different databases based on the user
configuration. The Dataflow components communicate with the Configuration Manager to receive the specific
configuration for each message.
● PubSub Topics: PubSub is the main messaging backend that MDE uses to route messages between the different
components of the solution. Several topics and subscriptions are created to ensure the routing of the incoming
messages is done according to the user configuration.
● Databases: MDE creates a number of schemas in BigQuery, BigTable and CloudStorage where the data will be
stored. Those schemas are generic. MDE will route messages to the right tables and files based on the user
configuration.
● Federation API: MDE provides an API to access all data repositories using a common interface. This allows users
to query their data independently of where it is stored and enables them to use the same configuration language to
create specific queries against the manufacturing information.
● Integration layer
○ Looker Integration: MDE LookML component that allows Looker to natively explore all data contained in
the MDE solution, for those users using Looker.
○ AutoML integration: MDE component to integrate MDE data with the different ML tools available in GCP
to support the creation and training of Industrial ML models using MDE collected data.
○ Grafana and Data Studio integrations: provide native access to MDE data from Grafana and Data Studio.
All components are independent of each other and are able to operate as a unit or as independent units. The solution can
ingest and manage data streams coming from any other edge solution. The edge solution can also be integrated with
other types of cloud architecture. However, we believe that the value of an integrated solution is larger than the sum of its parts.
Overall architecture
The overall architecture of the solution (MDE and MC working together) is as follows:
1. MC edge and MC (cloud companion) handle the data acquisition at and from the edge
2. PubSub is the ingestion point for raw messages generated by Manufacturing Connect or any other MQTT broker or service
that needs to be integrated in the solution
3. All message routing and transformation is handled by DataFlow. Cloud DataFlow subscribes to the incoming
PubSub Topics and routes messages to the different storage solutions.
4. The Config Manager, running in GKE as a Kubernetes application, stores the system’s configuration (including
user preferences, metadata and system defaults) and orchestrates the different Dataflow jobs that route
messages to their rightful destinations and produce the on-the-fly transformations and validations.
5. Integration is provided with the main GCP BI and ML tools, including Looker, Data Studio and Vertex AI. The
engine powering that interaction is the MDE API, which delivers syntactically consistent data access that is
abstracted from the data model and technical architecture.
Information flow
The information flows through the MDE Solution following these steps:
1. Data points are generated at sensor level and digitized by the PLC
2. Manufacturing Connect creates a device connection to the PLC and a Tag connection reading the data value that
needs to be collected. Tags at Manufacturing Connect level are polled tags and are refreshed based on a given
refresh rate.
3. Data from the PLC is packaged in a standard Manufacturing Connect payload structure that includes:
a. deviceName: name of the device driver that originated the message
b. tagName: name assigned to the tag being acquired
c. deviceID: automatic ID assigned to the device once it is created
d. success: whether the read from the PLC was successful
e. dataType: type of data contained in the value field
f. timestamp: timestamp of the reading in msec
g. value: the actual value of the event
h. metadata: any user-defined metadata (key-value pairs) associated at the moment the tag was created
i. registerId: name of the register the value is linked with
j. description: a description of the value
MDE is designed to be used as a data repository of manufacturing information, including sensor data and metadata, and
the necessary relationships between them. The design criteria of MDE are to optimize the ingestion and query performance
of manufacturing sensor data and metadata, to provide easy access to that information and to support complex
calculations over those data repositories.
Manufacturing data and metadata are stored in a standardized set of data tables and schemas across the different
databases supported by the system. All MDE implementations share the same table names and schemas. Those data
structures have been created to optimize the stability and performance of the ingestion and data access processes. They
have been designed to get the most out of the platforms they are built on, are efficient and cost effective, and they
integrate Google’s best practices for those products.
To provide the necessary flexibility on how to organize and store tags and metadata, MDE provides a specific data model
that utilizes the standardized data schema but can present the information differently based on the user specific
configuration. The data model is based on the following concepts:
- Archetypes: define the overall characteristics of the data series to be stored. Tags sharing an archetype also
share similar characteristics: they have the same type of index or timestamp structure and a common payload type or
nature. Each archetype has a specific data table and schema where the information is stored. All tag information
of the same archetype is stored in the same table. The core characteristics of the database are configured based
on the nature of the archetype. Archetypes are immutable for a given version of MDE: they can only be extended by
generating a new version of the overall solution.
- Data types: define a sub-set of tags, of the same archetype, that share a common payload structure and common
payload qualifiers. They not only share the same payload characteristics but also have the same payload
components and structure. The user's configuration defines data types, which means that they can differ from
user to user for a given version of MDE.
- Tags: the individual streams of data ingested in the system. A tag represents a consistent set of
values that reflect the same measurement over time. Tags belong to a given archetype and data type.
Each value received for a given tag is stored as an individual record in any of the supported databases. Tags can
also be associated with metadata to describe the nature of the collection of values, and individual tag records can
be further qualified by metadata using payload qualifiers.
- Metadata instances: each tag can be associated with a metadata instance. A metadata instance expresses a set
of consistent context characteristics that can be grouped together logically. For instance, a metadata instance
can be used to reflect the asset associated with a given tag and may contain the plant name, line number, machine
name and sensor ID.
- Metadata schemas: define a set of fields that can be materialized in a given metadata instance. For example, two
different metadata instances qualifying two different tags can be related and compared by associating them with the
same metadata schema.
- Metadata buckets: define a combination of one or more metadata schemas that reflect the metadata
specifications required to qualify a certain aspect of the context of a given tag. Metadata buckets can be associated with
tags or types. When associated with a tag, they create a certain ‘contract’ or ‘promise’ that the tag is
expected to fulfill: for example, we may want all sensor readings coming from physical sensors to be
described with the type of measurement they are providing (physical property, units, resolution, etc). To do so, we
could define a certain metadata structure under a metadata schema and associate it with a metadata bucket
containing all those information elements. That metadata specification can then be assigned to all tags sharing
the same data type used to collect all sensor values from the edge. Metadata from the same bucket is
comparable and can be aggregated across multiple tags. This enables an easy and quick way to create a semantic
structure describing tags that share a common context (i.e., all tags belonging to a certain machine, all tags
measuring the same physical property, etc).
- Payload qualifiers: a payload qualifier is a metadata instance associated with a specific tag record instead of the whole
stream. It can share the same metadata schema with the other records of the tag but it is stored together with the values of
the tag as a ‘qualifier’ or context description of the moment when the value was generated, such as the shift, the operator
or the production order active at that time.
- Metadata providers: MDE provides an internal metadata repository where users can create and store Metadata
schemas and Metadata buckets that the different Tags and Types can use. The default MDE metadata provider,
managed by the Config Manager, is referred to as the “local provider”. However, MDE also supports remote
metadata providers (such as MDM or ERP systems) that can publish a similar Metadata model using a REST API.
Those API endpoints can be registered in MDE as “remote metadata providers” and used in the configuration of
Metadata entities that are provided by those third party systems.
- Transformations: predefined functions, bound to certain input and output archetypes, that transform in real time
any tag matching the input archetype into a new tag of the output archetype. The
transformation is applied to every value of the selected tag.
Archetypes
To ensure that information representing different physical variables (which can be infinitely diverse) can be stored using a
generic, finite and optimized schema, MDE maps the different measurement types captured (represented by a
message type and a payload) to specific data categories called archetypes:
● Numeric Data Series (NDS): a timestamped series of numerical values, e.g., a temperature sensor sending data
every second to the Cloud.
● Discrete Event Series (DES): a specific piece of information associated with a single timestamp, e.g., an operator-driven
parameter change in a specific machine of the process that needs to be recorded.
● Continuous Event Series (CES): a series of consecutive states defined by a specific piece of information, a start time
and an end time, e.g., the operating state of a given machine or the recipe of a production line.
MDE will match any incoming message to one of these archetypes. All values ingested will be stored as an archetype. All
values stored for a given archetype share a common database schema and specific metadata requirements.
Data Types
Each archetype can be further classified based on Types that specify the archetype in more detail. For instance, a single
archetype such as continuous events could be subdivided into machine operation state and production program state.
The first subtype would be associated with a payload containing one String value from the list “Running”, “Idle”, “Scheduled
Maintenance” and “Unscheduled Maintenance”, while the second one could be associated with a complex schema
containing “Brand”, “Size”, “RecipeID” and “Recipe Description”. Some types are available by default in MDE. The user
can create new types. All types can be associated with a schema defining the payload structure and timestamp structure.
Types can be user defined. The Configuration Manager UI can be used to define new types. New types are defined by the
following characteristics:
1. The archetype they belong to, which determines the data schema required to store the data and the payload
complexity
2. The payload structure, defining the different elements contained in the payload message for each instance of the
event belonging to this type
3. The event identification structure (typically timestamp related), defining the structure of the element identifying a
record from the others
4. The metadata configuration, defining the requirements for metadata associated with the type, typically which
metadata elements and schemas need to be completed to define a Tag as part of this type
The default types provided with the solution include:
- Discrete Event Series (DES), with 1 timestamp and 1 complex payload structure (JSON):
  - DES Binary: EventState as binary value and EventStateLabel as String; timestamp in msec
  - DES Default: any JSON payload; timestamp in msec
- Continuous Event Series (CES), with 2 timestamps (start and end) and 1 complex payload structure (JSON):
  - CES Default: any JSON payload; Timestamp Start and Timestamp End in msec
Transformations
MDE defines standard transformations between archetypes that work for each archetype implementation. MDE defines
specific transformations for given archetype sub-types. Some of these transformations are generic and can be applied to
similar types using parameters.
- Value change monitor: creates a continuous event based on any element of the payload of a discrete event series
when the value switches to a new one. Input: Discrete Event Series, Numeric Event Series. Output: Continuous Event Series.
All message types or tags will be mapped to a specific archetype or archetype sub-type. When a new message for a given
type or a given tag is received by MDE it will be converted to an archetype specific message structure and payload.
Streaming ingestion
The solution will ingest any message that is received at the PubSub landing topic:
“input-messages”
Any message received in the topic will be parsed with the existing pipelines available in the default configuration. The
default configuration has been implemented to understand and be able to decode message payloads generated by
Manufacturing Connect devices. These messages have a structure similar to:
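For reference, a minimal illustration of such a payload, using the fields described in the Information flow section (all values below are examples only, not taken from a real system), could look like:
{
  "deviceName": "example-device-driver",
  "tagName": "example-temperature-tag",
  "deviceID": "example-device-id",
  "success": true,
  "dataType": "float",
  "timestamp": 1642506148000,
  "value": 73.4,
  "metadata": { "unit": "degC" },
  "registerId": "example-register",
  "description": "Example temperature reading"
}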
Depending on the data type of the value field in the payload, the message will be parsed to a numerical type or an event
type. The parsing transformations implemented in the default configuration define these associations.
Users are allowed to change the parser configuration to update those associations and to change the parsing process.
Please refer to the configuration user guide to find out how.
Based on the type that is mapped to the incoming message a default storage configuration will be assigned to the
message. The default configuration of the system defines the following storage specifications for the default data types:
- Default numerical data series: data type for payloads consisting of a numerical value, a timestamp, any payload
qualifiers and any metadata buckets. Storage: BT, BQ and CS.
- Default discrete event series: data type for complex non-numerical payloads with a timestamp, any payload
qualifiers and any metadata buckets. Storage: BQ and CS.
- Default continuous data series: data type for complex non-numerical payloads with a StartTime timestamp and
an EndTime timestamp. Storage: BQ and CS.
Payload Qualifiers
In addition to the payload value, MDE supports adding a dynamic metadata JSON object that can be stored beside
each value. The idea is to qualify a specific value, such as a sensor reading, with relevant
context information, such as the shift or the machine cycle that was active when the reading was taken.
Those payload qualifiers can be added to the incoming payload message and parsed in the pipeline into the payload
qualifier section of the Type. The content of the payload qualifiers is inserted into the different storage solutions and
exposed as metadata.
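One possible layout, assuming the qualifier is carried in the user-defined metadata section of the incoming payload (the field names and values below are hypothetical and depend on your parser configuration), could be:
{
  "tagName": "example-temperature-tag",
  "timestamp": 1642506148000,
  "value": 73.4,
  "metadata": {
    "shift": "night-shift",
    "operator": "operator-0042",
    "production_order": "PO-000123"
  }
}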
Batch ingestion
Batch loading is done by uploading data to a GCS bucket set up by MDE, named <project-id>-batch-ingestion, in
one of the supported formats (JSON, AVRO or CSV). Whenever a new file is uploaded to this bucket, it will be detected by
the batch ingestion Dataflow job, which will send each message individually to the input-messages PubSub topic. The
way messages are read varies from format to format.
In order for MDE to recognize which filetype is being ingested and to provide configuration options, each filetype
needs to be registered in the Configuration Manager before it can be successfully ingested. The FILE INGESTION section
of the UI provides access to the configuration of the batch loading feature. When selecting the File Ingestion tab
the following information appears:
This is the list of registered Ingestion Specifications available in the current system. From this UI users can edit any of
the current specifications or create a new one. All specifications can be enabled or disabled, which means that a certain
specification can be created and then disabled immediately after the files have been uploaded, remaining in the system
configuration for future use.
To select a specific specification, simply click on the Actions column and select ‘View / Edit’:
The most complex type is CSV. It requires a number of extra configuration parameters, such as the character used as
separator and the schema of the file. Specifically, the additional parameters required are:
- Separator: the character used to delimit the columns in the file
- Skip Rows: the number of rows to skip from the file (in case there is a header)
- Use header row as field names: the system will use a certain row in the file as the names for the columns
All this configuration can also be implemented using the configuration API of the ConfigManager, specifically using the
REST API call below, which shows an example Ingestion Configuration (the name and values are illustrative):
POST http://{{hostname}}:{{port}}/api/v1/ingestions
{
  "name": "csv-input-1",
  "enabled": true,
  "source": "CSV",
  "filePattern": "(.*)-input-1.csv",
  "separator": ",",
  "skipRows": 1,
  "dateTimeColumns": [
    {
      "index": 7,
      "format": "dd-M-yyyy hh:mm:ss"
    }
  ],
  "schema": {
    "registerId": 0,
    "success": 1,
    "description": 2,
    "tagName": 3,
    "value": 4,
    "deviceName": 5,
    "deviceID": 6,
    "timestamp": 7,
    "event": [
      {
        "label": 9,
        "description": 10
      }
    ]
  }
}
Common fields
● source (required): one of CSV, JSON, AVRO, AVRO_RAW_WRITER, OPERATIONS_REPROCESSED_AVRO
● filePattern (required): a regular expression matching the location/name of the file without the bucket name (i.e.,
for files <bucket>/loadData/batch1/batchmessages1.csv you could use
"loadData/batch1/batchmessages\.*.csv")
● name (required): name to identify this file ingestion configuration.
● enabled (required): boolean to enable/disable this configuration.
CSV fields
● separator (optional, comma as default): the character that delimits the records.
● skipRows (optional): the number of rows to skip before starting to read the CSV file.
● dateTimeColumns (optional): use this if you want to parse any of the columns as a datetime. It expects an array where
each element consists of an index and a format. The index is a 0-based column number and the format is a date format
as defined by the Java DateTimeFormatter.
● inferredSchemaHeaderRow (optional): you need to set either this field or the schema field; both can't be empty. This
field sets the row number to use as the header row for parsing the CSV; to use the first row, set it to 0.
● schema (optional): you need to specify either this field or inferredSchemaHeaderRow; both can't be empty. This field
lets you define a schema to parse your CSV. It accepts a JSON object where each key defines a field of the schema and
the value is the column index to use for that field, as shown in the example above.
Format details
JSON
MDE supports JSON in the same way BigQuery does: JSON data must be newline delimited. Each JSON
object must be on a separate line in the file. Each line will be sent as a separate message to PubSub. E.g.:
{
  "name": "numeric-input-json",
  "enabled": true,
  "source": "JSON",
  "filePattern": "numeric-input.*.json"
}
AVRO
The AVRO schema used should be convertible to JSON, since each message will be converted before being sent to PubSub.
Each record will be sent as an individual message. E.g.:
{
  "name": "avro-1",
  "enabled": true,
  "source": "AVRO",
  "filePattern": "/avo/testMessage(.*)"
}
CSV
CSV files will be transformed into JSON either by using the provided schema or by inferring a schema from a header row. If
you use a header, the resulting JSON will be flat, as nested schema inference is not supported. Each line of the CSV will be
sent as a separate message to PubSub. E.g.:
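A minimal CSV specification of this kind, assuming the schema is inferred from the header row (the name and file pattern below are illustrative), could look like:
{
  "name": "csv-header-input-1",
  "enabled": true,
  "source": "CSV",
  "filePattern": "loadData/csv/(.*).csv",
  "separator": ",",
  "inferredSchemaHeaderRow": 0
}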
AVRO_RAW_WRITER
This configuration is used to re-ingest messages that have been produced by the gcs-writer Dataflow pipeline. It expects
only files produced by that pipeline with the schema below, and it will resend each message to the pipeline as if it were sent
for the first time. If your messages were processed correctly the first time, this will cause data duplication:
● String messageId
● Long timestamp
● Map<String, String> attributes (PubSub headers)
● Byte[] message
{
  "name": "reprocess-raw-avro-1",
  "enabled": true,
  "source": "AVRO_RAW_WRITER",
  "filePattern": "testMessageReprocessing1.avro"
}
OPERATIONS_REPROCESSED_AVRO
This configuration is used to reprocess messages that failed during ingestion:
1. Go to BigQuery and identify the failed messages in the Operations Dashboard table.
2. Once you have a query that identifies them, go to Query Settings and save the results of the query as a new table.
3. Once the new table is created, select it and click EXPORT -> Export to GCS, then select a GCS bucket, select
AVRO as the format and select a compression (whichever you prefer).
4. Once it has been exported, create an ingestion specification with OPERATIONS_REPROCESSED_AVRO as the
source and a filePattern that matches the filename and the path where you will copy it.
5. Finally, copy the file to that path.
This process will take longer than the other batch ingestions since it will first read the records in this new file, try to
find each original message in the GCS raw bucket and join both data streams. E.g.:
{
  "name": "reprocess-messages1",
  "enabled": true,
  "source": "OPERATIONS_REPROCESSED_AVRO",
  "filePattern": "payloadresolutionfailedmessages.avro"
}
Each data destination can be helpful for a given use case. Understanding which storage solution is optimal for each data
stream needs to be determined by the user. However, as an overall guidance, here are some tips on using the right storage
solution for a given data stream:
● Cloud Storage is the lowest cost solution. Accessing data in Cloud Storage has a higher latency than any other
destination. Cloud storage can be useful when users need to store large amounts of data for a longer time but
they don’t have an immediate need to explore or navigate the data right away. Data in cloud storage can be
exported and inserted in any DW tools or solutions in the future when that need arises.
● BigQuery is our main analytical DW and BI product. It can store large amounts of data (PB scale) and provides
access via SQL with second-level response times. Inserting data in BQ is ideal to support analytical workloads
such as reporting via Looker. BQ cost (insert, store and access) is significantly higher than Cloud Storage.
● BigTable is our non-relational database and is ideal for sub-second latency streams, where high-performing
inserts and queries are required. BigTable is ideal in near-real-time use cases, where sub-second access to the
edge values is required. BT cost is significantly higher than BQ and Cloud Storage cost.
BigQuery pipeline
When BigQuery storage is active, MDE will insert a new row each time a new value is received for a given Tag. The
solution provides a default schema in BQ where it stores all incoming values. The default schema consists of one table
for each data archetype. The data is stored in BQ under the “sfp_data” dataset in the project's BQ explorer.
The ingestion to BigQuery is done via BigQuery Streaming mode to ensure low latency ingestion.
The different tables available match the default data archetypes provided with the default version of the solution. All
Types from a given archetype are stored in the same tables.
All BQ tables are standard and can be queried and explored using the BQ tool.
For those fields that are stored in JSON format, MDE relies on the BQ JSON extension. This extension allows users
to natively query fields in a JSON column almost as native table columns, reducing query complexity and improving query
efficiency for those fields. For those projects where that extension is not available, MDE generates an equivalent key-value
(KV) pair column that can be queried without the need for that extension. MDE detects whether the extension is active and
leverages the JSON fields if that is the case. If MDE detects that the extension is unavailable, it will automatically fall back to
the KV pair fields.
When ingestion fails and data can't be inserted into the expected canonical table, MDE will insert those messages in a dead
letter table in BQ called InsertErrors. This table holds the message and reports the error generated when trying to ingest
that message.
The time-series API also supports multiple resolutions and aggregations when you retrieve the data. As the ingested data
usually has millisecond resolution, retrieving the data might produce many records, which might not be as useful
for a dashboard like Grafana. In that case, you can specify an aggregation and a different resolution during the retrieval of
the tag values.
After a successful deployment, the time-series will have an internal endpoint exposed using an internal load-balancer. You
can retrieve the IP from the Cloud console network section. Alternatively, if you’re using the terraform deployment for the
POST http://<service-ip>/api/v1/records
{
  "tagName": "TEST-Machine",
  "timestamp": 1642506148000,
  "value": 20
}
In the above example, the time-series service will insert a new value of 20 for a tag named TEST-Machine at the timestamp
specified by the unix epoch. You don't normally use this API directly, as it is automatically invoked by the solution's data-lake
workflows.
To retrieve time-series data for a particular tag, you can use the following query API, which will retrieve the records of the
specified tag name for the specified duration:
POST http://<service-ip>/api/v1/records/query
{
  "startTimestamp": 1642506147000,
  "endTimestamp": 1642506149000,
  "tagName": "TEST-Machine"
}
Alternatively, you can specify downsampling logic that aggregates the milliseconds records and provides a more
coarse-grained value. In that case, you’ll need to specify the downsample section of the query.
POST http://<service-ip>/api/v1/records/query
{
  "startTimestamp": 1642506147000,
  "endTimestamp": 1642506149000,
  "tagName": "TEST-Machine",
  "downsample": {
    "duration": 5,
    "unit": "SECONDS",
    "aggregation": "MEAN"
  }
}
The above request will aggregate the records over a 5-second interval, perform an average calculation on the values
and then return the aggregated results. Typically, you'll use the Federation API to query the time-series data through a
visualization tool like Grafana.
● messageId: Message Id assigned by PubSub in the input-messages topic. It's used to be able to reprocess failed
messages since it's propagated across all the processing steps.
● timestamp: Ingestion timestamp
● attributes: PubSub attributes.
● message: PubSub message data.
The files are emitted every 10 minutes with a name containing the datetime, like
raw-messages2022-02-08T18:00:00.000Z-2022-02-08T18:10:00.000Z-pane-0-last-00-of-10.
This raw storage serves several purposes:
● Uniquely identify each incoming message to manage lineage across the system
● Provide safe storage for all edge messages, where they can be kept raw for a long period of time cost effectively
● Offer the possibility to re-run older messages through the ingestion process and configuration in case that
configuration has changed and it is necessary to re-process the data differently
Ingestion errors
The ingestion process can fail at different stages if the configuration is not set correctly for the payloads ingested in the
landing topic. For example, an incoming message could contain a payload structure that is not recognized by the MDE
configuration. In that case, those messages can't be parsed to the destination data model and are dropped from the main
flow and redirected to the ops-writer pipeline to be written into the OperationsDashboard table in BigQuery, making
troubleshooting and reprocessing easier.
The configuration interface allows users to pre-create a given configuration for a specific Tag. If the Tag configuration has
already been created when the first value is received, the stored configuration will be used instead of creating a default
configuration. This allows users to create specific configurations for metadata, storage profiles or transformations for
specific tags in the system. It is important to remark that once a Tag has been assigned to a given type it can't be
assigned to a different one. Types are associated with a Tag configuration; this link can only be changed by deleting the
Tag configuration and creating a new one from scratch.
To create Tag configurations or to edit existing tag configurations, please refer to the MDE configuration guide.
The list of Tags supports several navigation options such as search and filtering to make finding the right Tag in the
overall configuration easier.
The transformed values are computed in real time every time a new value is received from the edge for the original tag.
However, considering some transformations might change the time base for a data series, it is possible that new values
are inserted in the data lake with a different frequency than the original tag that triggered its calculation.
There are a number of available transformations that are included in the default configuration package that is deployed
with the solution. These transformations are designed to work with specific Types only. Also, depending on the design of
the transformation, the incoming or outgoing Type or Archetype might be the same or different.
Managing Metadata
Any Tag can be associated with many different metadata instances. Each instance represents a snippet of information
that describes a certain area of the tag context. Metadata instances are JSON objects that follow a Metadata bucket
schema. Metadata Buckets are groups of Metadata Schemas that are semantically related (i.e., they describe a certain
aspect of the context). Buckets can be associated with tags and types, so all tags created based on that Type will be
automatically associated with that metadata specification.
To generate a given metadata snippet or instance for a tag we need to follow a number of steps:
Schemas define the structure of the Metadata instance object that will characterize tags. Schemas can also be
associated with a Domain file that defines the possible values of that object. Metadata schemas obey the JSON Schema
specification.
A typical JSON schema defining a certain Metadata for a tag could be as follows:
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "sampleContext1",
  "title": "tag_sample_context",
  "description": "Sample context schema for qualifying tags",
  "type": "object",
  "properties": {
    "tag_device_name": {
      "description": "Name of the device generating tag values",
      "type": "string"
    },
    "tag_device_id": {
      "description": "ID of the device generating tag values",
      "type": "string"
    },
    "tag_measurement_type": {
      "description": "Description of the measurement",
      "type": "string"
    }
  },
  "required": [ "tag_device_name" ]
}
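For illustration, a metadata instance conforming to the schema above could look like the following (the values are hypothetical):
{
  "tag_device_name": "example-plc-driver",
  "tag_device_id": "device-001",
  "tag_measurement_type": "Temperature in degrees Celsius"
}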
To define a new metadata schema, the user simply needs to select the Add New Schema button in the UI.
A contextual menu will open where all required Schema parameters (Provider, Name, Id, JSON schema and Description) can be
added and saved into the current configuration model:
To create a new Metadata bucket users need to select the Buckets section of the metadata menu in MDE's MC UI:
The list of available Metadata Schemas appears under the Schema section of the Bucket. Once selected, Schemas are
added to the bucket. Once added, a Schema can be selected and its details are displayed as part
of the Bucket information. Schemas are copied into the Bucket, which means that if a Schema evolves, the copy
available in the Bucket will not change.
The other two remaining options to configure for each Schema added to the Bucket are the Required and Dynamic flags:
- Required: Schemas marked as required must be implemented for a message to be inserted in the databases.
- Dynamic: Dynamic Schemas are provided by remote Metadata providers and are queried from the provider API every
time a new message is received. Use this flag only if the bucket is associated with a remote provider.
If we open the Metadata section of the configuration, we can see that the list of available Metadata Buckets is displayed under
a selector. To associate the bucket with the tag, we simply select it from the list:
To create a Metadata Implementation, we just need to click on the Schema that we need to fill in, and the content of the schema
will open below in a form format:
Now the user can complete the form and click the "Save" button. The tag will now be associated with the values entered
in the form. Those values are now available as metadata.
The metadata section can specify a bucket ID, bucket version and schema so the metadata information is parsed into the correct
metadata bucket of the tag. If the bucket is not specified, the information will be mapped to a "default" bucket that has a
generic JSON schema. That information will appear in the tag section of the MC UI under the default bucket.
Configuration Bucket
Once you deploy MDE v1.2.0+, you'll have a new GCS bucket named ${PROJECT_ID}-config-manager-jobs, which
contains a couple of predefined directories that are created once import/export jobs are submitted via the
config-manager (except for the import directory). The staging directory is private to the config-manager and is mainly used
to track the progress of the submitted jobs in the background.
Once the job finishes successfully, the bucket will contain the exported files in a data directory under the export folder, as shown
in the screenshot below. Notice that each job has its own UUID and output data directory, which enables running the export jobs
multiple times without overwriting the old exported files.
To submit an export job, use the export API shown below; the response includes the job UUID:
POST http://{{hostname}}:{{port}}/api/v1/admin/export
{
  "status": 201,
  "message": "Export job is submitted and it will start in the background",
  "timestamp": "07-11-2022 01:58:37",
  "id": "a26a50d4-e8a9-49f6-943d-175590636968"
}
To submit an import job, you need to use the corresponding import API, shown below. The same restriction of having only one
import/export job at a time applies here too. The API accepts an overwrite flag, which defaults to true; the overwrite behavior
will replace an existing configuration item if it is found in the database. If it is set to false, the import job will skip it.
POST http://{{hostname}}:{{port}}/api/v1/admin/import?overwrite=true
{
  "status": 201,
  "message": "Import job is submitted and it will start in the background",
  "timestamp": "07-11-2022 01:58:37",
  "id": "a26a50d4-e8a9-49f6-943d-175501636969"
}
At any time, you can get the active jobs using http://{{hostname}}:{{port}}/api/v1/admin/jobs/active. You can
also get the full history of submitted jobs along with their start/end times and statuses.
Lastly, you can deactivate any running job by using the deactivation API at
http://{{hostname}}:{{port}}/api/v1/admin/jobs/deactivate/{uuid}, where uuid is the job UUID.
To enable anomaly detection, you need to enable the alpha features by contacting the solution team. Once the alpha features
are enabled, you will see a new transformation in the MC user interface that represents the anomaly detection registration. You
can enable it for a given type as described above.
The current release offers two configuration endpoints to specify the anomaly detection subsystem endpoint, the page size
for the multi-page endpoints and the import/export batch frequency. There are two main APIs for this: one that completely
overwrites the configuration settings with what you provide in the REST endpoint body, and another one that only updates
portions of the configuration based on what you send. They are described below.
POST http://{{hostname}}:{{port}}/api/v1/configs/write
{
  "anomalyDetection": {
    "properties": {
      "mde.subsystem.ad.config-endpoint": "https://anomaly-detection-subsystem.a.run.app"
    }
  }
}
This will wipe out any existing values of any other configurations and just add one configuration element for the
anomaly detection. The response will display the total system configurations currently stored in MDE.
The second endpoint will amend/patch the system configuration using only the part that you specify in the body. Here's an
example:
PATCH http://{{hostname}}:{{port}}/api/v1/configs/patch
{
  "api": {
    "properties": {
      "mde.api.page.max-size": "100"
    }
  }
}
Assuming that the above two calls run in sequence, the final result of the system configurations will be as follows:
GET http://{{hostname}}:{{port}}/api/v1/configs
{
  "anomalyDetection": {
    "id": "7d172502-aa2f-42aa-97af-f2dd27a829f4",
    "properties": {
      "mde.subsystem.ad.config-endpoint": "https://admin-arzpme4m3a-uc.a.run.app"
    }
  },
Depending on the configuration you are providing, there will be specific validation before saving it. For example, to add an
anomaly-detection endpoint, there must be a valid, up-and-running anomaly-detection instance that can be contacted at the
time the configuration is saved.
In addition to the above technical metrics, the solution also exports the status of each incoming message as it passes through
the various processing steps of the solution. The export destination is a BigQuery table called OperationsDashboard, which
has the following structure:
You can create Data Studio dashboards that utilize this table; you can ask the solutions team to share a sample dashboard with
you.