Course 4 - Loading and Preprocessing Data With TensorFlow
02.03.2022
21.10.2021
1 THE DATA API
2 THE TFRECORD FORMAT
3 PREPROCESSING THE INPUT FEATURES
4 TF TRANSFORM
5 TENSORFLOW DATASETS (TFDS)
Part 1
The Data API
#1 THE DATA API
So far we have used only datasets that fit in memory, but Deep Learning systems are often trained on very large
datasets that will not fit in RAM. Ingesting a large dataset and preprocessing it efficiently can be tricky to implement
with other Deep Learning libraries, but TensorFlow makes it easy thanks to the Data API: you just create a dataset
object, and tell it where to get the data and how to transform it. TensorFlow takes care of all the implementation
details, such as multithreading, queuing, batching, and prefetching. Moreover, the Data API works seamlessly with
tf.keras!
Example: creating a dataset from data that fits in memory, and iterating over it to see the output.
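A minimal sketch of such a pipeline (the tensor values are illustrative; the first printed batch is shown as a comment):

    import tensorflow as tf

    # Build a dataset from an in-memory tensor: each element is one slice (here, one integer).
    X = tf.range(10)                                  # toy data: 0, 1, ..., 9
    dataset = tf.data.Dataset.from_tensor_slices(X)

    # Chain transformations: repeat the data three times, then group elements into batches of 7.
    dataset = dataset.repeat(3).batch(7)

    for item in dataset:
        print(item)
    # First batch: tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)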
#1 THE DATA API
Putting Everything Together
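A sketch of what putting everything together might look like, assuming the data is split across several CSV files; the preprocess parsing function, the filepaths argument, and the default hyperparameters are placeholders:

    import tensorflow as tf

    def csv_reader_dataset(filepaths, preprocess, n_readers=5,
                           shuffle_buffer_size=10_000, batch_size=32):
        # List the CSV files and read n_readers of them in parallel,
        # interleaving their lines (skip each file's header line).
        dataset = tf.data.Dataset.list_files(filepaths)
        dataset = dataset.interleave(
            lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
            cycle_length=n_readers)
        # Shuffle the records, parse them, batch them, and prefetch the next batch.
        dataset = dataset.shuffle(shuffle_buffer_size)
        dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
        return dataset.batch(batch_size).prefetch(1)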
#1 THE DATA API
Prefetching
By calling prefetch(1) at the end, we are creating a dataset that will do its best to always be one batch ahead. In other
words, while our training algorithm is working on one batch, the dataset will already be working in parallel on getting
the next batch ready.
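For example, assuming dataset already yields parsed training examples:

    # Batch the examples and keep one batch ready in advance while the model trains.
    dataset = dataset.batch(32).prefetch(1)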
#1 THE DATA API
Using the Dataset with tf.keras
Now we just have to apply this function to build a dataset for each split (training, validation, and test):
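A sketch, assuming the csv_reader_dataset helper above and that the file path lists, the preprocess function, and the model are already defined:

    train_set = csv_reader_dataset(train_filepaths, preprocess)
    valid_set = csv_reader_dataset(valid_filepaths, preprocess)
    test_set = csv_reader_dataset(test_filepaths, preprocess)

    # tf.keras accepts datasets directly: no need to pass features and labels separately.
    model.compile(loss="mse", optimizer="sgd")
    model.fit(train_set, epochs=10, validation_data=valid_set)
    model.evaluate(test_set)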
#2 THE TFRECORD FORMAT
The TFRecord format is TensorFlow’s preferred format for storing large amounts of data and reading it efficiently.
It is a very simple binary format that just contains a sequence of binary records of varying sizes (each record
consists of a length, a CRC checksum to check that the length was not corrupted, then the actual data, and
finally a CRC checksum for the data).
You can easily create a TFRecord file using the tf.io.TFRecordWriter class:
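A minimal sketch (the file name and record contents are placeholders):

    import tensorflow as tf

    # Write two raw byte records to a TFRecord file.
    with tf.io.TFRecordWriter("my_data.tfrecord") as f:
        f.write(b"This is the first record")
        f.write(b"And this is the second record")

    # Read the records back with a TFRecordDataset.
    dataset = tf.data.TFRecordDataset(["my_data.tfrecord"])
    for item in dataset:
        print(item)  # each item is a scalar tf.string tensor holding one record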
#2 THE TFRECORD FORMAT
You can create a compressed TFRecord file by setting the options argument:
When reading a compressed TFRecord file, you need to specify the compression type:
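For example, with GZIP compression (the file name and content are placeholders):

    # Write a GZIP-compressed TFRecord file by passing TFRecordOptions.
    options = tf.io.TFRecordOptions(compression_type="GZIP")
    with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
        f.write(b"Compress me")

    # When reading, the compression type must be given explicitly.
    dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"],
                                      compression_type="GZIP")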
#3 PREPROCESSING THE INPUT FEATURES
Preparing your data for a neural network requires converting all features into numerical features, generally
normalizing them, and more. In particular, if your data contains categorical features or text features, they need to
be converted to numbers.
This can be done ahead of time when preparing your data files, using any tool you like
(e.g., NumPy, pandas, or Scikit-Learn). Alternatively, you can preprocess your data on the fly when loading it with
the Data API (e.g., using the dataset’s map() method, as we saw earlier), or you can include a preprocessing layer
directly in your model.
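For instance, here is a sketch of on-the-fly standardization with map(), assuming X_mean and X_std were precomputed on the training set and each dataset element is a (features, label) pair:

    def standardize(features, label):
        # Scale each feature to zero mean and unit variance using the training statistics.
        return (features - X_mean) / X_std, label

    dataset = dataset.map(standardize, num_parallel_calls=tf.data.AUTOTUNE)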
If you want a nice, self-contained preprocessing layer instead, you will need to adapt it to the training data before using it:
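One way to do this is with a Keras preprocessing layer such as Normalization (available as tf.keras.layers.Normalization in recent TensorFlow versions); data_sample is assumed to be a representative array of training examples:

    norm_layer = tf.keras.layers.Normalization()
    norm_layer.adapt(data_sample)   # computes the per-feature means and variances

    model = tf.keras.Sequential([
        norm_layer,                 # the layer standardizes the inputs inside the model
        tf.keras.layers.Dense(30, activation="relu"),
        tf.keras.layers.Dense(1),
    ])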
#3 PREPROCESSING THE INPUT FEATURES
Consider the ocean_proximity feature from the California housing dataset. It is a categorical feature with five possible values: "<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", and "ISLAND".
oov = out-of-vocabulary
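A sketch of one-hot encoding this feature with a lookup table and a couple of oov buckets (the number of buckets and the sample categories are illustrative):

    vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
    indices = tf.range(len(vocab), dtype=tf.int64)
    table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
    num_oov_buckets = 2  # unknown categories are hashed into these extra buckets
    table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

    categories = tf.constant(["NEAR BAY", "DESERT", "INLAND"])  # "DESERT" is out-of-vocabulary
    cat_indices = table.lookup(categories)
    cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)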
Rule of thumb :
• If the number of categories is lower than 10, then one-hot encoding is generally the
way to go.
• If the number of categories is greater than 50, then embeddings are usually
preferable.
• In between 10 and 50 categories, you may want to experiment with both options and
see which one works best for your use case.
#3 PREPROCESSING THE INPUT FEATURES
An embedding is a trainable dense vector that represents a category. By default, embeddings are initialized randomly,
so for example the "NEAR BAY" category could be represented initially by a random vector such as [0.131, 0.890], while
the "NEAR OCEAN" category might be represented by another random vector such as [0.631, 0.791].
Word embeddings of similar words tend to be close, and some axes seem to encode meaningful concepts.
#5 TENSORFLOW DATASETS (TFDS)
The TensorFlow Datasets project makes it very easy to download common datasets, from small ones like
MNIST or Fashion MNIST to huge datasets like ImageNet (you will need quite a bit of disk space!). The list
includes image datasets, text datasets (including translation datasets), and audio and video datasets. You
can visit https://www.tensorflow.org/datasets/catalog/overview#all_datasets to view the full list, along
with a description of each dataset.
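For example, loading MNIST with the tensorflow_datasets package (the split handling shown is a sketch):

    import tensorflow_datasets as tfds

    # Download MNIST (on first use) and get it as tf.data datasets.
    datasets = tfds.load(name="mnist")
    mnist_train, mnist_test = datasets["train"], datasets["test"]

    # Each item is a dict with "image" and "label"; shuffle, batch, and prefetch as usual.
    mnist_train = mnist_train.shuffle(10_000).batch(32).prefetch(1)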