
Deep Learning

Lecture 4 – Loading and Preprocessing Data with TensorFlow

Maxime Bourliatoux

02.03.2022
Outline

1 THE DATA API
2 THE TFRECORD FORMAT
3 PREPROCESSING THE INPUT FEATURES
4 TF TRANSFORM
5 THE TENSORFLOW DATASETS (TFDS) PROJECT
Part 1
The Data API
#1 THE DATA API
So far we have used only datasets that fit in memory, but Deep Learning systems are often trained on very large
datasets that will not fit in RAM. Ingesting a large dataset and preprocessing it efficiently can be tricky to implement
with other Deep Learning libraries, but TensorFlow makes it easy thanks to the Data API: you just create a dataset
object, and tell it where to get the data and how to transform it. TensorFlow takes care of all the implementation
details, such as multithreading, queuing, batching, and prefetching. Moreover, the Data API works seamlessly with
tf.keras!

The whole Data API revolves around the concept of a dataset:
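For example, here is a minimal sketch building a dataset from an in-memory tensor with from_tensor_slices():

import tensorflow as tf

X = tf.range(10)                                  # any data tensor
dataset = tf.data.Dataset.from_tensor_slices(X)   # one item per slice of X

for item in dataset:
    print(item)   # tensors 0, 1, ..., 9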


#1 THE DATA API
Chaining Transformations
Once you have a dataset, you can apply all sorts of transformations to it by calling its transformation methods. Each method returns a new dataset, so you can chain transformations: apply a transformation to each item (map()), transform the dataset as a whole (e.g., repeat() or batch()), filter the data (filter()), and have a look at a few items (take()), as sketched below.
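A minimal sketch, continuing from the dataset created above:

dataset = dataset.repeat(3).batch(7)        # transforms the dataset as a whole
dataset = dataset.map(lambda x: x * 2)      # applies a transformation to each item (here, each batch)
dataset = dataset.filter(lambda x: tf.reduce_sum(x) > 50)   # filters the data
for item in dataset.take(2):                # have a look at (up to) the first two items
    print(item)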


#1 THE DATA API
Shuffling the Data

Gradient Descent works best when the instances in the training set are independent and identically distributed, so it generally helps to shuffle the training data. You can shuffle a dataset in memory with the shuffle() method, or shuffle at the source by interleaving lines read from several files.
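A sketch of both approaches (train_filepaths is an assumed list of training CSV file paths):

# In memory: fill a buffer, then draw items from it at random
dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=5, seed=42).batch(7)

# From several files: read lines from n_readers files at once, interleaved
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)
n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),   # skip the header row
    cycle_length=n_readers)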


#1 THE DATA API
Preprocessing the Data

The data comes from CSV files, where each line holds the input features followed by the target value (as in the California housing dataset used later in this lecture). Let's write a function to parse and preprocess each line.
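A minimal sketch, assuming eight numerical input features, one target column, and X_mean / X_std computed ahead of time on the training set (e.g., with Scikit-Learn):

n_inputs = 8   # number of input features on each CSV line

def preprocess(line):
    # default value (and thus type) for each field; the target field gets no
    # default, so parsing fails loudly if it is missing
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])        # the input features, as a 1D tensor
    y = tf.stack(fields[-1:])        # the target, as a 1D tensor of size 1
    return (x - X_mean) / X_std, y   # standardized features, target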
#1 THE DATA API
Putting Everything Together
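A sketch of a helper combining the previous steps (listing the files, interleaving and shuffling their lines, parsing them with preprocess(), batching, and prefetching):

def csv_reader_dataset(filepaths, repeat=1, n_readers=5,
                       n_read_threads=None, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers, num_parallel_calls=n_read_threads)
    dataset = dataset.shuffle(shuffle_buffer_size).repeat(repeat)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)   # always keep one batch ready in advance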
#1 THE DATA API
Prefetching
By calling prefetch(1) at the end, we are creating a dataset that will do its best to always be one batch ahead. In other
words, while our training algorithm is working on one batch, the dataset will already be working in parallel on getting
the next batch ready.
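In code it is a single call at the end of the pipeline (the csv_reader_dataset() sketch above already ends with it):

dataset = dataset.prefetch(1)   # prepare batch N+1 while the model works on batch N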
#1 THE DATA API
Using the Dataset with tf.keras

Now we just have to apply the csv_reader_dataset() function to build a dataset for each split, then create and train our model. We can evaluate it and make predictions the same way, as sketched below.
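A sketch, assuming train_filepaths, valid_filepaths, and test_filepaths (illustrative names) hold the CSV paths of each split, and reusing csv_reader_dataset() and n_inputs from above:

train_set = csv_reader_dataset(train_filepaths)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(30, activation="relu", input_shape=[n_inputs]),
    tf.keras.layers.Dense(1),
])
model.compile(loss="mse", optimizer="sgd")
model.fit(train_set, epochs=10, validation_data=valid_set)

model.evaluate(test_set)
new_set = test_set.map(lambda X, y: X)   # keep only the features
model.predict(new_set)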


Part 2
The TFRecord Format
#2 THE TFRECORD FORMAT

The TFRecord format is TensorFlow’s preferred format for storing large amounts of data and reading it efficiently.
It is a very simple binary format that just contains a sequence of binary records of varying sizes (each record consists of a length, a CRC checksum to check that the length was not corrupted, then the actual data, and finally a CRC checksum for the data).

You can easily create a TFRecord file using the tf.io.TFRecordWriter class:
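A minimal sketch writing two records and reading them back:

with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

# reading it back with a TFRecordDataset
dataset = tf.data.TFRecordDataset(["my_data.tfrecord"])
for item in dataset:
    print(item)   # the two byte strings, in order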
#2 THE TFRECORD FORMAT

Compressed TFRecord Files

You can create a compressed TFRecord file by setting the options argument

When reading a compressed TFRecord file, you need to specify the compression type:
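For example, a sketch using GZIP compression for both writing and reading:

options = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
    f.write(b"Compress, compress, compress!")

dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"],
                                  compression_type="GZIP")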

To go further on the subject, read about protocol buffers (protobufs), the serialization format TensorFlow uses for structured records.


Part 3
Preprocessing the Input Features
#3 PREPROCESSING THE INPUT FEATURES

Preparing your data for a neural network requires converting all features into numerical features, generally
normalizing them, and more. In particular, if your data contains categorical features or text features, they need to
be converted to numbers.

This can be done ahead of time when preparing your data files, using any tool you like
(e.g., NumPy, pandas, or Scikit-Learn). Alternatively, you can preprocess your data on the fly when loading it with
the Data API (e.g., using the dataset’s map() method, as we saw earlier), or you can include a preprocessing layer
directly in your model.

Let’s look at this last option now.


#3 PREPROCESSING THE INPUT FEATURES

Creating a layer for standardization with a Lambda layer

If you want a nice, self-contained layer instead, you can write a custom layer; you will need to adapt it (on a data sample) before using it. Both options are sketched below.
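A sketch of both options; X_train and data_sample are assumed to be NumPy arrays of training data:

import numpy as np

# Option 1: a Lambda layer using statistics computed ahead of time
means = np.mean(X_train, axis=0, keepdims=True)
stds = np.std(X_train, axis=0, keepdims=True)
eps = tf.keras.backend.epsilon()   # tiny constant to avoid division by zero
std_layer = tf.keras.layers.Lambda(lambda inputs: (inputs - means) / (stds + eps))

# Option 2: a self-contained custom layer with an adapt() method
class Standardization(tf.keras.layers.Layer):
    def adapt(self, data_sample):
        # learn the statistics from a (large enough) sample of the data
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)

    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + tf.keras.backend.epsilon())

std_layer = Standardization()
std_layer.adapt(data_sample)   # must be called before training the model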
#3 PREPROCESSING THE INPUT FEATURES

Encoding Categorical Features Using One-Hot Vectors


Remember the ocean_proximity feature in the California housing dataset we explored?

It is a categorical feature with five possible values: "<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", and "ISLAND".

We need to encode this feature before we feed it to a neural network.


Since there are very few categories, we can use one-hot encoding.
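A sketch using a lookup table with a couple of out-of-vocabulary (oov) buckets for categories unseen at training time:

vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)   # "DESERT" falls into an oov bucket
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)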

oov = out-of-vocabulary

Rule of thumb:
• If the number of categories is lower than 10, then one-hot encoding is generally the way to go.
• If the number of categories is greater than 50, then embeddings are usually preferable.
• Between 10 and 50 categories, you may want to experiment with both options and see which one works best for your use case.
#3 PREPROCESSING THE INPUT FEATURES

Encoding Categorical Features Using Embeddings

An embedding is a trainable dense vector that represents a category. By default, embeddings are initialized randomly,
so for example the "NEAR BAY" category could be represented initially by a random vector such as [0.131, 0.890], while
the "NEAR OCEAN" category might be represented by another random vector such as [0.631, 0.791].

Word embeddings of similar words tend to be close, and some axes seem to encode meaningful concepts.
#3 PREPROCESSING THE INPUT FEATURES

Encoding Categorical Features Using Embeddings

Keras provides an Embedding layer:
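A sketch, with the same vocabulary and oov buckets as before:

embedding = tf.keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets,
                                      output_dim=embedding_dim)
embedding(cat_indices)   # looks up one trainable 2D vector per category index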


Part 4
TF Transform
#4 TF TRANSFORM
Preprocessing the data ahead of training is great, but the same transformations also have to be applied before making predictions, in production, and changing the code in every pipeline, app, etc. is time-consuming.

This is why TF Transform was designed: you write the preprocessing logic once, as a preprocess() function.
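A sketch of such a preprocess() function for the California housing data (tft is the tensorflow_transform package):

import tensorflow_transform as tft

def preprocess(inputs):   # inputs is a batch of input features
    median_age = inputs["housing_median_age"]
    ocean_proximity = inputs["ocean_proximity"]
    standardized_age = tft.scale_to_z_score(median_age)
    ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
    return {
        "standardized_median_age": standardized_age,
        "ocean_proximity_id": ocean_proximity_id,
    }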

TF Transform lets you apply this preprocess() function to the whole training set using Apache Beam (it provides an AnalyzeAndTransformDataset class that you can use for this purpose in your Apache Beam pipeline). In the process, it will also compute all the necessary statistics over the whole training set: in this example, the mean and standard deviation of the housing_median_age feature, and the vocabulary for the ocean_proximity feature. The components that compute these statistics are called analyzers.

Importantly, TF Transform will also generate an equivalent TensorFlow Function that you can plug into the model you deploy. This TF Function includes some constants that correspond to all the necessary statistics computed by Apache Beam (the mean, standard deviation, and vocabulary).
Part 5
The TensorFlow Datasets
(TFDS) Project
#5 THE TENSORFLOW DATASETS (TFDS) PROJECT

The TensorFlow Datasets project makes it very easy to download common datasets, from small ones like
MNIST or Fashion MNIST to huge datasets like ImageNet (you will need quite a bit of disk space!). The list
includes image datasets, text datasets (including translation datasets), and audio and video datasets. You
can visit https://www.tensorflow.org/datasets/catalog/overview#all_datasets to view the full list, along
with a description of each dataset.
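Usage is a one-liner plus the usual Data API plumbing, for example with MNIST:

import tensorflow_datasets as tfds

datasets = tfds.load(name="mnist")   # downloads the data, returns one dataset per split
mnist_train, mnist_test = datasets["train"], datasets["test"]
mnist_train = mnist_train.shuffle(10000).batch(32)
mnist_train = mnist_train.map(lambda items: (items["image"], items["label"]))
mnist_train = mnist_train.prefetch(1)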
