Production ML Pipelines With TensorFlow Extended - TFX - Presentation
Aurélien Géron
Consultant
@aureliengeron
What are we doing here?
What does it all mean?
In addition to training an amazing model ...
… a production solution requires so much more: the modeling code is only a small piece, surrounded by configuration, data collection, feature extraction, analysis tools, process management tools, machine resource management, and serving infrastructure.
Production Machine Learning
“Hidden Technical Debt in Machine Learning Systems”
NIPS 2015
http://bit.ly/ml-techdebt
Production Machine Learning
Machine Learning Development + Modern Software Development

Machine Learning Development:
● Labeled data
● Feature space coverage
● Minimal dimensionality
● Maximum predictive data
● Fairness
● Rare conditions
● Data lifecycle management

Modern Software Development:
● Extensibility
● Configuration
● Consistency & Reproducibility
● Modularity
● Best Practices
● Testability
● Monitoring
● Safety & Security
TensorFlow Extended (TFX)
Powers Alphabet’s most important bets and products
… and some of Google’s most important partners.
What we’re doing
We’re following a typical ML development process
● Understanding our data
● Feature engineering
● Training
● Analyze model performance
● Lather, rinse, repeat
● Ready to deploy to production
Data ingestion
Data validation (e.g., catch that age is missing)
Data transform (e.g., age ➜ normalized_age; China ➜ [1, 0, 0], India ➜ [0, 1, 0], USA ➜ [0, 0, 1])
Model training
Model analysis (track accuracy across model versions and compare each new version against the last)
Model serving (receive requests, return predictions)

Each step is backed by a TFX library: TF Data Validation (TFDV) for data validation, TF Transform (TFT) for data transform, and TF Serving (TFS) for model serving.
TFDV, TFT, TFMA, and TFS all work on shared data & models, with pipeline metadata kept in a SQL database.

The stack underneath:
Orchestration: in these labs, Manual w/ InteractiveContext
Metadata Store: SQL
Processing API: Apache Beam
Beam Runner: Google Cloud Dataflow, ... or, in these labs, the Local Runner
Exercise 1
Prerequisites
● Linux / MacOS
● Python 3.6
● Virtualenv
● Git
Step 1: Set up your environment
% sudo apt-get update
% cd
% virtualenv -p python3.6 tfx_env    # create the virtualenv (step implied by the prerequisites)
% source tfx_env/bin/activate
TFX End-to-End Example
Features
n_imgs
global_subjectivity
self_reference_avg_shares
global_sentiment_polarity
Label = n_shares
...
                 Parses   Transforms   Expects Label
train_input_fn   No       No           Yes
eval_input_fn    No       No           Yes
ML Coding vs ML engineering
ML Code is only a small box in the middle, surrounded by Data Collection, Data Verification, Configuration, Machine Resource Management, Analysis Tools, Serving Infrastructure, and Monitoring.
Adapted from: Sculley et al.: Hidden Technical Debt in Machine Learning Systems
Writing Software (Programming)
... ...
Writing ML Software (The “Code” view)
... ...
This slide is, not surprisingly, the same as the previous one; however, it is only half the story :)
Engineering
// Strong Contracts.
Output Program(Inputs) {
... Human authored and peer reviewed code ...
}
TestProgramEdgeCase1...N() {
EXPECT_EQ(..., Program(...))
}
BenchmarkProgramWorstCase1...N {
...
}
Engineering vs ML Engineering
● Monolithic code, fixed datasets ➜ evolving datasets (data, features, ...) and objectives
● Non-reusable code, unmergeable artifacts ➜ reusable models aka modules, mergeable statistics, ...
● Untested code, non-validated datasets and models ➜ expectations, data validation, model validation, ...
● Unbenchmarked or hack-optimized-once code and models ➜ quality- and performance-benchmarked models, ...
● ...
This is the remaining half!
Introduction to Apache Beam
What is Apache Beam?
Write a pipeline once, in the SDK of your choice, e.g.

Python:  input | Sum.PerKey()
SQL:     SELECT key, SUM(value) FROM input GROUP BY key

... and run it on many runners: Apache Apex, Apache Samza, Apache Nemo (incubating), ⋮
Beam Portability Framework
● Currently most runners support the Java SDK only
● Portability framework (https://beam.apache.org/roadmap/portability/) aims to
provide full interoperability across the Beam ecosystem
● Portability API
○ Protobufs and gRPC for broad language support
○ Job submission and management: The Runner API
○ Job execution: The SDK harness
● Python Flink and Spark runners use Portability Framework
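In practice you target a specific runner through the pipeline options; a minimal sketch (DirectRunner shown here; FlinkRunner, SparkRunner, or DataflowRunner are specified the same way, each with its own extra options):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Select the runner explicitly; additional options depend on the runner.
options = PipelineOptions(runner='DirectRunner')
with beam.Pipeline(options=options) as pipeline:
    pass  # build the pipeline here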
Beam Portability Support Matrix
Hello World Example

import apache_beam as beam

pipeline = beam.Pipeline()
lines = (pipeline
         | beam.Create(["Hello", "World", "!!!"]))  # in-memory source (example data)
result = pipeline.run()
result.state

Hello World Example

# Equivalent, using the pipeline as a context manager:
with beam.Pipeline() as pipeline:
    lines = (pipeline
             | beam.Create(["Hello", "World", "!!!"]))
# pipeline.run() is implicit when the with-block exits
PCollection
● A distributed dataset your Beam pipeline operates on.
● The dataset can be bounded (from fixed source) or
unbounded (from a continuously updating source).
● The pipeline typically creates a source PCollection by
reading data from an external data source
○ But you can also create a PCollection from in-memory data within your driver program.
pipeline = beam.Pipeline()
lines = (pipeline
         | beam.Create(["Hello", "World", "!!!"]))  # lines is a PCollection
result = pipeline.run()
result.state
PTransform
● A PTransform represents a data processing operation,
or a step, in your pipeline.
● Every PTransform takes one or more PCollection
objects as input
● It performs a processing function that you provide on
the elements of that PCollection.
● It produces zero or more output PCollection objects.
PTransform

pipeline = beam.Pipeline()
lines = (pipeline
         | beam.Create(["Hello", "World", "!!!"])
         | beam.Map(print))   # Map is a PTransform applied to each element
result = pipeline.run()
result.state

Hello
World
!!!
I/O Transforms
● Beam comes with a number of “IO” library PTransforms.
● They read or write data to various external storage
systems.
I/O Transforms
with beam.Pipeline() as pipeline:
lines = (pipeline
| beam.io.ReadFromTFRecord("test_in.tfrecord")
| beam.Map(lambda line: line + b' processed')
| beam.io.WriteToTFRecord("test_out.tfrecord"))
Lab 2
Orchestration: Orchestrator (the pipeline defined in code below)
Metadata Store: SQLite
Processing API: Apache Beam
Beam Runner: Local Runner
Exercise 3
from tfx.orchestration import pipeline
from tfx.orchestration.metadata import sqlite_metadata_connection_config
pipeline.Pipeline(
pipeline_name=pipeline_name,
pipeline_root=pipeline_root,
components=[
example_gen, statistics_gen, infer_schema, validate_stats,
transform, trainer, model_analyzer, model_validator, pusher
],
enable_cache=True,
metadata_connection_config=sqlite_metadata_connection_config(
metadata_path),
additional_pipeline_args={},
)
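To execute this pipeline outside the notebook, hand it to an orchestrator. A minimal sketch using TFX's Beam-based runner (assuming the Pipeline object above is bound to a variable, here called tfx_pipeline):

from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

# Run the whole TFX pipeline locally on Beam's direct runner.
BeamDagRunner().run(tfx_pipeline)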
Lab 3
Data Exploration & Cleanup
The first task in any data science or ML project is to understand and clean the data
import tensorflow_data_validation as tfdv
train_stats = tfdv.generate_statistics_from_csv(
data_location=_train_data_filepath)
tfdv.visualize_statistics(train_stats)
tfdv.visualize_statistics(
lhs_statistics=eval_stats,
rhs_statistics=train_stats,
lhs_name='EVAL_DATASET',
rhs_name='TRAIN_DATASET')
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
anomalies = tfdv.validate_statistics(
statistics=eval_stats,
schema=schema)
tfdv.display_anomalies(anomalies)
# Relax the minimum fraction of values that must come from
# the domain for feature company.
company = tfdv.get_feature(schema, 'company')
company.distribution_constraints.min_domain_mass = 0.9
serving_anomalies_with_env = tfdv.validate_statistics(
serving_stats, schema, environment='SERVING')
# Add skew comparator for 'weekday' feature.
weekday = tfdv.get_feature(schema, 'weekday')
weekday.skew_comparator.infinity_norm.threshold = 0.01
skew_anomalies = tfdv.validate_statistics(
train_stats,
schema,
previous_statistics=eval_stats,
serving_statistics=serving_stats)
Lab 4
Data Preprocessing
The raw data usually needs to be prepared before being fed to a Machine Learning model. This may involve several transformations: for example, normalizing numeric features, mapping categories to a vocabulary or one-hot vectors, or bucketizing continuous values.
Training/Serving Skew
● Preprocessing data before training
● Same preprocessing required at serving time
● Possibly with multiple serving environments
● Risk of discrepancy
In-model preprocessing
● If we include the preprocessing steps in the TensorFlow graph, the
problem is solved
● Except training is slow
○ Preprocessing runs once per epoch instead of just once
Training Serving
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

RAW_DATA_FEATURE_SPEC = {
    "name": tf.io.FixedLenFeature([], tf.string)
}
RAW_DATA_METADATA = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec(RAW_DATA_FEATURE_SPEC))
{
'_schema': feature {
name: "name"
type: BYTES
presence {
min_fraction: 1.0
}
shape {}
}
}
data_coder = tft.coders.ExampleProtoCoder(
RAW_DATA_METADATA.schema)
encoded = data_coder.encode({"name": "café"})
b'\n\x13\n\x11\n\x04name\x12\t\n\x07\n\x05caf\xc3\xa9'
decoded = data_coder.decode(encoded)
{'name': b'caf\xc3\xa9'}
tmp_dir = tempfile.mkdtemp(prefix="tft-data")
train_path = os.path.join(tmp_dir, "train.tfrecord")
eval_path = os.path.join(tmp_dir, "eval.tfrecord")

Resulting files (once the train and eval datasets are written):
/tmp/tft-datac1z2ichz/train.tfrecord-00000-of-00001
/tmp/tft-datac1z2ichz/eval.tfrecord-00000-of-00001
with beam.Pipeline() as pipeline:
_ = (pipeline
| "Read" >> beam.io.ReadFromTFRecord(f"{train_path}*")
| "Decode" >> beam.Map(data_coder.decode)
| "Print" >> beam.Map(print)
)
{'name': b'Alice'}
{'name': b'Bob'}
{'name': b'Cathy'}
{'name': b'Alice'}
with beam.Pipeline() as pipeline:
_ = (pipeline
| "Read" >> beam.io.ReadFromTFRecord(f"{eval_path}*")
| "Decode" >> beam.Map(data_coder.decode)
| "Print" >> beam.Map(print)
)
{'name': b'Denis'}
{'name': b'Alice'}
def preprocessing_fn(inputs):
outputs = {}
lower = tf.strings.lower(inputs["name"])
outputs["name_xf"] = tft.compute_and_apply_vocabulary(lower)
return outputs
https://www.tensorflow.org/tfx/transform/api_docs
➔ Buckets
◆ apply_buckets()
◆ apply_buckets_with_interpolation()
◆ bucketize()
◆ bucketize_per_key()
➔ Text & Categories
◆ apply_vocabulary()
◆ bag_of_words()
◆ compute_and_apply_vocabulary()
◆ hash_strings()
◆ ngrams()
◆ vocabulary()
◆ word_count()
◆ tfidf()
➔ Math
◆ covariance()
◆ max()
◆ mean()
◆ min()
◆ pca()
◆ quantiles()
◆ scale_by_min_max()
◆ scale_by_min_max_per_key()
◆ scale_to_0_1()
◆ scale_to_0_1_per_key()
◆ scale_to_z_score()
◆ scale_to_z_score_per_key()
◆ size()
◆ sum()
◆ var()
➔ Misc
◆ deduplicate_tensor_per_row()
◆ get_analyze_input_columns()
◆ get_transform_input_columns()
◆ segment_indices()
◆ sparse_tensor_to_dense_with_shape()
➔ Apply arbitrary transformations
◆ apply_function_with_checkpoint()
◆ apply_pyfunc()
◆ apply_saved_model()
◆ ptransform_analyzer()
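An illustrative preprocessing_fn combining a few of these analyzers (the feature names here are hypothetical, not from the workshop dataset):

def preprocessing_fn(inputs):
    return {
        # Analyze mean/stddev over the whole dataset, then normalize.
        "age_z": tft.scale_to_z_score(inputs["age"]),
        # Compute quantile boundaries, then bucketize each value.
        "fare_bucket": tft.bucketize(inputs["fare"], num_buckets=4),
        # Build a vocabulary and map each category to its index.
        "country_id": tft.compute_and_apply_vocabulary(inputs["country"]),
    }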
with beam.Pipeline() as pipeline:
train_data = (pipeline
| "ReadTrain" >> beam.io.ReadFromTFRecord(f"{train_path}*")
| "DecodeTrain" >> beam.Map(data_coder.decode)
)
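The transformed examples below ({'name_xf': 0}, ...) come from running the preprocessing_fn through TF Transform's analyze-and-transform step; a minimal sketch of that step (assuming the tft_beam module and the RAW_DATA_METADATA defined earlier), placed inside the same Beam pipeline as the ReadTrain/DecodeTrain steps above:

import tensorflow_transform.beam as tft_beam

with tft_beam.Context(temp_dir=tmp_dir):
    # Analyze the training data (e.g., build the vocabulary) and transform it.
    (transformed_train, transformed_metadata), transform_fn = (
        (train_data, RAW_DATA_METADATA)
        | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))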
{'name_xf': 0}
{'name_xf': 2}
{'name_xf': 1}
{'name_xf': 0}
with beam.Pipeline() as pipeline:
_ = (pipeline
| "Read" >> beam.io.ReadFromTFRecord(f"{eval_xf_path}*")
| "Decode" >> beam.Map(data_xf_coder.decode)
| "Print" >> beam.ParDo(print)
)
{'name_xf': -1}
{'name_xf': 0}
metadata_xf.schema
feature {
name: "name_xf"
type: INT
int_domain {
is_categorical: true
}
presence {
min_fraction: 1.0
}
shape {}
}
/tmp/tft-data0o6lwwt0/graph/
transform_fn/
assets/
vocab_compute_and_apply_vocabulary_vocabulary
variables
saved_model.pb
transformed_metadata/
schema.pbtxt
/tmp/tft-data0o6lwwt0/graph/
transform_fn/
assets/
vocab_compute_and_apply_vocabulary_vocabulary
variables
saved_model.pb
transformed_metadata/ alice
schema.pbtxt cathy
bob
/tmp/tft-data0o6lwwt0/graph/
transform_fn/
assets/
vocab_compute_and_apply_vocabulary_vocabulary
variables/
saved_model.pb
transformed_metadata/
schema.pbtxt
/tmp/tft-data0o6lwwt0/graph/
transform_fn/
assets/
vocab_compute_and_apply_vocabulary_vocabulary
variables/
saved_model.pb
transformed_metadata/
schema.pbtxt
tft_output = tft.TFTransformOutput(graph_dir)
@tf.function
def transform_raw_features(example):
return tft_output.transform_raw_features(example)
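A hypothetical call, assuming a batch of raw features matching the RAW_DATA_FEATURE_SPEC defined earlier:

raw = {"name": tf.constant(["Alice", "Denis"])}
print(transform_raw_features(raw))   # e.g. {'name_xf': <tf.Tensor ... [0, -1]>}, -1 being out-of-vocabulary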
{
'transform_output': Channel(
type_name: TransformPath
artifacts: [Artifact([...])]),
'transformed_examples': Channel(
type_name: ExamplesPath
artifacts: [Artifact([...], split: train),
Artifact([...], split: eval)])
}
transform = Transform(
input_data=example_gen.outputs['examples'],
schema=infer_schema.outputs['output'],
module_file='my_transform.py')
context.run(transform)
my_transform.py
import tensorflow as tf
import tensorflow_transform as tft
def preprocessing_fn(inputs):
outputs = {}
outputs["name_xf"] = tft.compute_[...](inputs["name"])
[...]
return outputs
Lab 5
Preprocessing Data
with TF Transform (TFT)
Analyzing Model Results
Understanding more than just the top-level metrics
eval_model = tfma.default_eval_shared_model(
eval_saved_model_path='eval/run0/eval_model/0')
slices = [tfma.slicer.SingleSliceSpec(columns=['trip_start_hour']),
tfma.slicer.SingleSliceSpec(columns=['trip_start_day'])]
eval_result = tfma.run_model_analysis(
eval_shared_model=eval_model,
data_location='data.tfrecord',
file_format='tfrecords',
slice_spec=slices,
output_path='output/run0')
tfma.view.render_slicing_metrics(
eval_result,
slicing_spec=slices[0])
eval_model = tfma.default_eval_shared_model(
eval_saved_model_path='eval/run0/eval_model/0')
slices = [tfma.slicer.SingleSliceSpec(
columns=['trip_start_day'],
features=[('trip_start_hour', 12)])]
eval_result = tfma.run_model_analysis(
eval_shared_model=eval_model,
data_location='data.tfrecord',
file_format='tfrecords',
slice_spec=slices,
output_path='output/run0')
tfma.view.render_slicing_metrics(
eval_result,
slicing_spec=slices[0])
eval_model = tfma.default_eval_shared_model(
eval_saved_model_path='eval/run0/eval_model/0')
slices = [tfma.slicer.SingleSliceSpec(
columns=['trip_start_day', 'trip_start_hour'])]
eval_result = tfma.run_model_analysis(
eval_shared_model=eval_model,
data_location='data.tfrecord',
file_format='tfrecords',
slice_spec=slices,
output_path='output/run0')
tfma.view.render_slicing_metrics(
eval_result,
slicing_spec=slices[0])
output_dirs = [os.path.join("output", run_name)
for run_name in ("run_0", "run_1", "run_2")]
eval_results_from_disk = tfma.load_eval_results(
output_dirs, tfma.constants.MODEL_CENTRIC_MODE)
tfma.view.render_time_series(
eval_results_from_disk,
slices[0])
Lab 6
Without a model server, each application embeds its own copy of the model: Application 1, Application 2, and Application 3 each start on Model v1, and rolling out Model v2 means updating the applications one at a time.

With TF Serving, the applications all call a shared serving instance, so moving from Model v1 to Model v2 happens in one place.
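A minimal sketch of how an application might query a TF Serving instance over its REST API (host, port, model name, and feature values are placeholders):

import json
import requests

# TF Serving exposes POST /v1/models/<model_name>:predict
instances = [{"trip_start_hour": 12, "trip_start_day": 3}]
response = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    data=json.dumps({"instances": instances}))
print(response.json())   # {"predictions": [...]}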
Fairness
Lab 7
Fairness
from tfx.types import ComponentSpec
from tfx.types.component_spec import ChannelParameter
from tfx.types.component_spec import ExecutionParameter
from tfx.types.standard_artifacts import Examples
class DataAugmentationComponentSpec(ComponentSpec):
PARAMETERS = {
'max_rotation_angle': ExecutionParameter(type=float)
}
INPUTS = {
'input_data': ChannelParameter(type=Examples)
}
OUTPUTS = {
'augmented_data': ChannelParameter(type=Examples)
}
from tfx.components.base.base_executor import BaseExecutor
from tfx.types.artifact_utils import get_split_uri
class DataAugmentationExecutor(BaseExecutor):
def Do(self, input_dict, output_dict, exec_properties):
input_examples_uri = get_split_uri(
input_dict['input_data'], 'train')
output_examples_uri = get_split_uri(
output_dict['augmented_data'], 'train')
max_rotation_angle = exec_properties['max_rotation_angle']
[...]
[...]
decoder = tfdv.TFExampleDecoder()
with beam.Pipeline() as pipeline:
_ = (pipeline
| 'ReadTrainData' >> beam.io.ReadFromTFRecord(input_examples_uri)
| 'ParseExample' >> beam.Map(decoder.decode)
| 'Augmentation' >> beam.ParDo(_augment_image, **exec_properties)
| 'DictToExample' >> beam.Map(_dict_to_example)
| 'SerializeExample' >> beam.Map(lambda x: x.SerializeToString())
| 'WriteAugmentedData' >> beam.io.WriteToTFRecord(
os.path.join(output_examples_uri, "data_tfrecord"),
file_name_suffix='.gz'))
[...]
from tfx.components.base.base_component import BaseComponent
from tfx.components.base.executor_spec import ExecutorClassSpec
class DataAugmentationComponent(BaseComponent):
SPEC_CLASS = DataAugmentationComponentSpec
EXECUTOR_SPEC = ExecutorClassSpec(DataAugmentationExecutor)
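Hypothetical wiring of the custom component into the pipeline, assuming its constructor accepts the spec's inputs and parameters:

# Instantiate the custom component like any built-in TFX component.
augment = DataAugmentationComponent(
    input_data=example_gen.outputs['examples'],
    max_rotation_angle=15.0)
context.run(augment)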
transform = Transform(...)
trainer1 = Trainer(
trainer_fn='trainer.trainer_fn1',
transformed_examples=transform.outputs.transformed_examples,
[...])
trainer2 = Trainer(
trainer_fn='trainer.trainer_fn2',
transformed_examples=transform.outputs.transformed_examples,
[...])
Two components of the same type in one pipeline need distinct instance names:
transform = Transform(...)
trainer1 = Trainer(
trainer_fn='trainer.trainer_fn1',
transformed_examples=transform.outputs.transformed_examples,
instance_name='Trainer1',
[...])
trainer2 = Trainer(
trainer_fn='trainer.trainer_fn2',
transformed_examples=transform.outputs.transformed_examples,
instance_name='Trainer2',
[...])
Arjun Gopalan
Software Engineer
How a Typical Neural Net Works
Train on individual (input, label) samples, e.g., images labeled Cat or Dog.
Neural Structured Learning (NSL)
Concept: train a neural net using structure among samples, in addition to (input, label) pairs.
Structure Among Samples
[Source: graph concept is from Juan et al., arXiv’19. Original images are from pixabay.com]
NSL: Advantages of Learning with Structure
Less Labeled Data Required (Neural Graph Learning)
Scenario I: Not Enough Labeled Data
Example task: Document Classification
Lots of samples
NSL: Advantages of Learning with Structure
Less Labeled Data Required
NSL Resource: Tutorials
Scenario II: Model Robustness Required
Example task: Image Classification
NSL: Advantages of Learning with Structure
Robust Model
Use implicit structure derived from “adversarial” examples: an original image labeled Panda, and a perturbed image that should still be classified as Panda.
NSL Resource: Tutorials
NSL Framework
NSL: Neural Graph Learning
NSL: Neural Graph Learning
The training objective combines a supervised loss on the example features with a neighbor loss over graph edges.
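Schematically (not the exact notation from the slides), with α the neighbor-loss multiplier, w_ij the edge weights, and d a distance between the representations of neighboring samples:

$$\mathcal{L} \;=\; \sum_{i} \mathcal{L}_{\text{sup}}\big(y_i, f(x_i)\big) \;+\; \alpha \sum_{(i,j)\in\mathcal{E}} w_{ij}\, d\big(h(x_i), h(x_j)\big)$$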
NSL: Neural Graph Learning Training Workflow
NSL: Adversarial Learning
Adversarial neighbors x’i, x’j are generated by perturbing the inputs xi, xj.
Paper: Goodfellow, et al. [ICLR’15]
Libraries, Tools, and Trainers

Standalone Tool:
◆ build_graph
◆ pack_nbrs

Lib:
◆ unpack_neighbor_features
◆ adversarial_neighbor
◆ replicate_embeddings
◆ utils

Graph Functions:
◆ build_graph
◆ pack_nbrs
◆ read_tsv_graph
◆ write_tsv_graph
◆ add_edge
◆ add_undirected_edges

Keras:
◆ graph_regularization
◆ adversarial_regularization
◆ Layers

Estimator:
◆ add_graph_regularization
◆ add_adversarial_regularization
Web: tensorflow.org/neural_structured_learning
pip install neural-structured-learning
import neural_structured_learning as nsl
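Once installed, a Keras model can be wrapped with one of the trainers above; a minimal sketch using adversarial regularization (the configuration values are illustrative, and graph regularization is wrapped analogously):

import tensorflow as tf
import neural_structured_learning as nsl

# A plain Keras model...
base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2)])

# ...wrapped so that adversarial perturbations are generated and penalized during training.
adv_config = nsl.configs.make_adv_reg_config(multiplier=0.2, adv_step_size=0.05)
adv_model = nsl.keras.AdversarialRegularization(base_model, adv_config=adv_config)
adv_model.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])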
What If No Explicit Structure or Graph?
Construct a graph via preprocessing: compute an embedding for each sample, compare pairs of embeddings, and connect the samples whose embeddings are similar.
"""Generate embeddings."""
import tensorflow as tf
import tensorflow_hub as hub

# Load data
imdb = tf.keras.datasets.imdb
(pp_train_data, pp_train_labels), (pp_test_data, pp_test_labels) = (
    imdb.load_data(num_words=10000))

# Pre-trained embedding from TF Hub
pretrained_embedding = 'https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1'
hub_layer = hub.KerasLayer(
    pretrained_embedding, input_shape=[], dtype=tf.string,
    trainable=True)

# Generate embeddings.
record_id = int(0)
with tf.io.TFRecordWriter('/tmp/imdb/embeddings.tfr') as writer:
  for word_vector in pp_train_data:
    text = decode_review(word_vector)                              # word ids -> text
    sentence_embedding = hub_layer(tf.reshape(text, shape=[-1,]))  # text -> embedding
    sentence_embedding = tf.reshape(sentence_embedding, shape=[-1])
    write_embedding_example(sentence_embedding, record_id)         # helper defined in the tutorial
    record_id += 1
"""Build graph and prepare graph input for NSL."""

# Build a graph from embeddings: connect samples whose embeddings are similar.
nsl.tools.build_graph(['/tmp/imdb/embeddings.tfr'],
                      '/tmp/imdb/graph_99.tsv',
                      similarity_threshold=0.8)
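The graph is then used to augment the training examples with their neighbors; a minimal sketch using pack_nbrs (the input/output file names are assumptions following the /tmp/imdb convention above):

# Join each labeled training example with up to 3 of its graph neighbors.
nsl.tools.pack_nbrs(
    '/tmp/imdb/train_data.tfr',     # labeled examples (assumed path)
    '',                             # no separate unlabeled examples in this sketch
    '/tmp/imdb/graph_99.tsv',       # the graph built above
    '/tmp/imdb/nsl_train_data.tfr',
    add_undirected_edges=True,
    max_nbrs=3)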
Thank You!
Web: tensorflow.org/neural_structured_learning
Repo: github.com/tensorflow/neural-structured-learning
Survey: cutt.ly/nsl2019
Special acknowledgment:
Google Expander team
Arjun Gopalan
[email protected]
Up Next: Hands-on TFX+NSL Tutorial
IMDB Reviews
Task: is a review POSITIVE? (e.g., Label = True)
Pipeline components for the TFX + NSL tutorial:
ExampleGen
IdentifyExamples
StatisticsGen ➜ statistics
SchemaGen ➜ schema
ExampleValidator ➜ blessing
SynthesizeGraph ➜ synthesized_graph
  (uses TF Hub: 1. text ➜ embeddings, 2. embeddings ➜ synthesized_graph via NSL)
Transform ➜ examples (text_xf + label_xf)
GraphAugmentation ➜ examples (augmented with neighbors)
                Parses   Transforms   Expects Label   Expects augmented
eval_input_fn   No       No           Yes             No
NSL in TFX
Thank You!