Course 4 - Loading and Preprocessing Data With TensorFlow
02.03.2022
21.10.2021
1 THE DATA API
2 THE TFRECORD FORMAT
3 PREPROCESSING THE INPUT FEATURES
4 TF TRANSFORM
5 TENSORFLOW DATASETS (TFDS)
Part 1
The Data API
#1 THE DATA API
So far we have used only datasets that fit in memory, but Deep Learning systems are often trained on very large
datasets that will not fit in RAM. Ingesting a large dataset and preprocessing it efficiently can be tricky to implement
with other Deep Learning libraries, but TensorFlow makes it easy thanks to the Data API: you just create a dataset
object, and tell it where to get the data and how to transform it. TensorFlow takes care of all the implementation
details, such as multithreading, queuing, batching, and prefetching. Moreover, the Data API works seamlessly with
tf.keras!
Example: creating a dataset from data that fits in memory, and iterating over it to see the output.
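A minimal sketch of such a pipeline (the tensor values are illustrative; the first printed batch is shown as a comment):

    import tensorflow as tf

    # Build a dataset from an in-memory tensor: each element is one slice (here, one integer).
    X = tf.range(10)                                  # toy data: 0, 1, ..., 9
    dataset = tf.data.Dataset.from_tensor_slices(X)

    # Chain transformations: repeat the data three times, then group elements into batches of 7.
    dataset = dataset.repeat(3).batch(7)

    for item in dataset:
        print(item)
    # First batch: tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)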
#1 THE DATA API
Putting Everything Together
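A sketch of what putting everything together might look like, assuming the data is split across several CSV files; the preprocess parsing function, the filepaths argument, and the default hyperparameters are placeholders:

    import tensorflow as tf

    def csv_reader_dataset(filepaths, preprocess, n_readers=5,
                           shuffle_buffer_size=10_000, batch_size=32):
        # List the CSV files and read n_readers of them in parallel,
        # interleaving their lines (skip each file's header line).
        dataset = tf.data.Dataset.list_files(filepaths)
        dataset = dataset.interleave(
            lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
            cycle_length=n_readers)
        # Shuffle the records, parse them, batch them, and prefetch the next batch.
        dataset = dataset.shuffle(shuffle_buffer_size)
        dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
        return dataset.batch(batch_size).prefetch(1)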
#1 THE DATA API
Prefetching
By calling prefetch(1) at the end, we are creating a dataset that will do its best to always be one batch ahead. In other
words, while our training algorithm is working on one batch, the dataset will already be working in parallel on getting
the next batch ready.
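For example, assuming dataset already yields parsed training examples:

    # Batch the examples and keep one batch ready in advance while the model trains.
    dataset = dataset.batch(32).prefetch(1)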
#1 THE DATA API
Using the Dataset with tf.keras
Now we just have to apply this function to build a dataset for each split (training, validation, and test):
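A sketch, assuming the csv_reader_dataset helper above and that the file path lists, the preprocess function, and the model are already defined:

    train_set = csv_reader_dataset(train_filepaths, preprocess)
    valid_set = csv_reader_dataset(valid_filepaths, preprocess)
    test_set = csv_reader_dataset(test_filepaths, preprocess)

    # tf.keras accepts datasets directly: no need to pass features and labels separately.
    model.compile(loss="mse", optimizer="sgd")
    model.fit(train_set, epochs=10, validation_data=valid_set)
    model.evaluate(test_set)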
#2 THE TFRECORD FORMAT
The TFRecord format is TensorFlow’s preferred format for storing large amounts of data and reading it efficiently.
It is a very simple binary format that just contains a sequence of binary records of varying sizes (each record
consists of a length, a CRC checksum to check that the length was not corrupted, then the actual data, and
finally a CRC checksum for the data).
You can easily create a TFRecord file using the tf.io.TFRecordWriter class:
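A minimal sketch (the file name and record contents are placeholders):

    import tensorflow as tf

    # Write two raw byte records to a TFRecord file.
    with tf.io.TFRecordWriter("my_data.tfrecord") as f:
        f.write(b"This is the first record")
        f.write(b"And this is the second record")

    # Read the records back with a TFRecordDataset.
    dataset = tf.data.TFRecordDataset(["my_data.tfrecord"])
    for item in dataset:
        print(item)  # each item is a scalar tf.string tensor holding one record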
#2 THE TFRECORD FORMAT
You can create a compressed TFRecord file by setting the options argument:
When reading a compressed TFRecord file, you need to specify the compression type:
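For example, with GZIP compression (the file name and content are placeholders):

    # Write a GZIP-compressed TFRecord file by passing TFRecordOptions.
    options = tf.io.TFRecordOptions(compression_type="GZIP")
    with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
        f.write(b"Compress me")

    # When reading, the compression type must be given explicitly.
    dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"],
                                      compression_type="GZIP")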
#3 PREPROCESSING THE INPUT FEATURES
Preparing your data for a neural network requires converting all features into numerical features, generally
normalizing them, and more. In particular, if your data contains categorical features or text features, they need to
be converted to numbers.
This can be done ahead of time when preparing your data files, using any tool you like
(e.g., NumPy, pandas, or Scikit-Learn). Alternatively, you can preprocess your data on the fly when loading it with
the Data API (e.g., using the dataset’s map() method, as we saw earlier), or you can include a preprocessing layer
directly in your model.
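For instance, here is a sketch of on-the-fly standardization with map(), assuming X_mean and X_std were precomputed on the training set and each dataset element is a (features, label) pair:

    def standardize(features, label):
        # Scale each feature to zero mean and unit variance using the training statistics.
        return (features - X_mean) / X_std, label

    dataset = dataset.map(standardize, num_parallel_calls=tf.data.AUTOTUNE)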
If you want a nice, self-contained preprocessing layer instead, you will need to adapt it to the training data before using it:
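One way to do this is with a Keras preprocessing layer such as Normalization (available as tf.keras.layers.Normalization in recent TensorFlow versions); data_sample is assumed to be a representative array of training examples:

    norm_layer = tf.keras.layers.Normalization()
    norm_layer.adapt(data_sample)   # computes the per-feature means and variances

    model = tf.keras.Sequential([
        norm_layer,                 # the layer standardizes the inputs inside the model
        tf.keras.layers.Dense(30, activation="relu"),
        tf.keras.layers.Dense(1),
    ])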
#3 PREPROCESSING THE INPUT FEATURES
Consider the ocean_proximity feature from the California housing dataset. It is a categorical feature with five possible values: "<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", and "ISLAND".
oov = out-of-vocabulary
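A sketch of one-hot encoding this feature with a lookup table and a couple of oov buckets (the number of buckets and the sample categories are illustrative):

    vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
    indices = tf.range(len(vocab), dtype=tf.int64)
    table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
    num_oov_buckets = 2  # unknown categories are hashed into these extra buckets
    table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

    categories = tf.constant(["NEAR BAY", "DESERT", "INLAND"])  # "DESERT" is out-of-vocabulary
    cat_indices = table.lookup(categories)
    cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)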
Rule of thumb :
• If the number of categories is lower than 10, then one-hot encoding is generally the
way to go.
• If the number of categories is greater than 50, then embeddings are usually
preferable.
• In between 10 and 50 categories, you may want to experiment with both options and
see which one works best for your use case.
#3 PREPROCESSING THE INPUT FEATURES
An embedding is a trainable dense vector that represents a category. By default, embeddings are initialized randomly,
so for example the "NEAR BAY" category could be represented initially by a random vector such as [0.131, 0.890], while
the "NEAR OCEAN" category might be represented by another random vector such as [0.631, 0.791].
Word embeddings of similar words tend to be close, and some axes seem to encode meaningful concepts.
#5 TENSORFLOW DATASETS (TFDS)
The TensorFlow Datasets project makes it very easy to download common datasets, from small ones like
MNIST or Fashion MNIST to huge datasets like ImageNet (you will need quite a bit of disk space!). The list
includes image datasets, text datasets (including translation datasets), and audio and video datasets. You
can visit https://www.tensorflow.org/datasets/catalog/overview#all_datasets to view the full list, along
with a description of each dataset.
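For example, loading MNIST with the tensorflow_datasets package (the split handling shown is a sketch):

    import tensorflow_datasets as tfds

    # Download MNIST (on first use) and get it as tf.data datasets.
    datasets = tfds.load(name="mnist")
    mnist_train, mnist_test = datasets["train"], datasets["test"]

    # Each item is a dict with "image" and "label"; shuffle, batch, and prefetch as usual.
    mnist_train = mnist_train.shuffle(10_000).batch(32).prefetch(1)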