
Dr. Adrian Rosebrock
Raspberry Pi for Computer Vision
Hacker Bundle - v1.0.1

Adrian Rosebrock, PhD


Dave Hoffman, MSc
David McDuffee
Abhishek Thanki
Sayak Paul

pyimagesearch
The contents of this book, unless otherwise indicated, are Copyright ©2019 Adrian Rosebrock,
PyImageSearch.com. All rights reserved. Books like this are made possible by the time invested
by the authors. If you received this book and did not purchase it, please consider making
future books possible by buying a copy at https://www.pyimagesearch.com/raspberry-pi-for-computer-vision/ today.

© 2019 PyImageSearch
Contents

Contents 3

1 Introduction 15

2 Deep Learning on Resource Constrained Devices Outline 17


2.1 Chapter Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 The Challenge of DL on Embedded Devices . . . . . . . . . . . . . . . . . . . . . 17
2.3 Can we Train Models on the RPi? . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Faster Inference with Coprocessors . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 Movidius Neural Compute Stick (NCS) . . . . . . . . . . . . . . . . . . . . 20
2.4.2 Google Coral TPU USB Accelerator . . . . . . . . . . . . . . . . . . . . 21
2.5 Dedicated Development Boards . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.1 Google Coral TPU Dev Board . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 NVIDIA Jetson Nano . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.3 My Recommendations for Devices . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Is the RPi Irrelevant for Deep Learning? . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3 Multiple Pis, Message Passing, and ImageZMQ 27


3.1 Chapter Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 Projects Involving Multiple Raspberry Pis . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Project Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Sockets, Message Passing, and ZMQ . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.1 What is Message Passing? . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.2 What is ZMQ? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.3 Example Message Passing using Client/Server . . . . . . . . . . . . . . . 31
3.4.4 Running Our Message Passing Example . . . . . . . . . . . . . . . . . . 35
3.5 ImageZMQ for Video Streaming . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36


3.5.1 Why Stream Video Frames Over a Network? . . . . . . . . . . . . . . . . 36


3.5.2 What is the ImageZMQ Library? . . . . . . . . . . . . . . . . . . . . . . . 37
3.5.3 Installing ImageZMQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.5.4 Preparing Clients for ImageZMQ with Custom Hostnames . . . . . . . . . 38
3.5.5 Defining the Client/Server Relationship . . . . . . . . . . . . . . . . . . . 40
3.5.6 The ImageZMQ Client . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5.7 The ImageZMQ Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.8 Results of Streaming Video with ImageZMQ . . . . . . . . . . . . . . . . . 44
3.5.9 Factors Impacting ImageZMQ Performance . . . . . . . . . . . . . . . . . 45
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4 Advanced Security with YOLO Object Detection 49


4.1 Chapter Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Object Detection with YOLO and Deep Learning . . . . . . . . . . . . . . . . . . 50
4.3 An Overview of Our Security Application . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 Building a Security Application with Deep Learning . . . . . . . . . . . . . . . . . 52
4.4.1 Project Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.4.2 Our Configuration File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.3 Implementing the Client . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.4 Implementing the Server . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.4.1 Parsing YOLO Output . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5 Running Our YOLO Security Application . . . . . . . . . . . . . . . . . . . . . . . 65
4.6 How Do I Define My Own Unauthorized Zones? . . . . . . . . . . . . . . . . . . . 67
4.6.1 Step #1: Capture the Image/Frame . . . . . . . . . . . . . . . . . . . . . . 67
4.6.2 Step #2a: Define Coordinates with Photoshop/GIMP . . . . . . . . . . . . 68
4.6.3 Step #2b: Define Coordinates with OpenCV . . . . . . . . . . . . . . . . . 69
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5 Face Recognition on the RPi 71


5.1 Chapter Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 Our Face Recognition System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3 Getting Started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3.1 Project Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3.2 Our Configuration File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4 Deep Learning for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4.1 Understanding Deep Learning and Face Recognition Embeddings . . . . 78
5.4.2 Step #1: Gather Your Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.4.2.1 Use OpenCV and Webcam for Face Detection . . . . . . . . . . 80


5.4.2.2 Download Images Programmatically . . . . . . . . . . . . . . . . 81
5.4.2.3 Manual Collection of Images . . . . . . . . . . . . . . . . . . . . 81
5.4.3 Step #2: Extract Face Embeddings . . . . . . . . . . . . . . . . . . . . . . 82
5.4.4 Step #3: Train the Face Recognition Model . . . . . . . . . . . . . . . . . 87
5.5 Text-to-Speech on the Raspberry Pi . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.6 Face Recognition: Putting the Pieces Together . . . . . . . . . . . . . . . . . . . 91
5.7 Face Recognition Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.8 Room for improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6 Building a Smart Attendance System 105


6.1 Chapter Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.2 Overview of Our Smart Attendance System . . . . . . . . . . . . . . . . . . . . . 106
6.2.1 What is a Smart Attendance System? . . . . . . . . . . . . . . . . . . . . 106
6.2.2 Project Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.2.3 Our Configuration File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.3 Step #1: Creating our Database . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3.1 What is TinyDB? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3.2 Our Database Schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
6.3.3 Implementing the Initialization Script . . . . . . . . . . . . . . . . . . . . . 114
6.3.4 Initializing the Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.4 Step #2: Enrolling Faces in the System . . . . . . . . . . . . . . . . . . . . . . . 116
6.4.1 Implementing Face Enrollment . . . . . . . . . . . . . . . . . . . . . . . . 116
6.4.2 Enrolling Faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.4.3 Implementing Face Un-enrollment . . . . . . . . . . . . . . . . . . . . . . 122
6.4.4 Un-enrolling Faces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.5 Step #3: Training the Face Recognition Component . . . . . . . . . . . . . . . . 124
6.5.1 Implementing Face Embedding Extraction . . . . . . . . . . . . . . . . . . 124
6.5.2 Extracting Face Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.5.3 Implementing the Training Script . . . . . . . . . . . . . . . . . . . . . . . 126
6.5.4 Running the Training Script . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.6 Step #4: Implementing the Attendance Script . . . . . . . . . . . . . . . . . . . . 128
6.7 Smart Attendance System Results . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

7 Building a Neighborhood Vehicle Speed Monitor 137



7.1 Chapter Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138


7.2 Neighborhood Vehicle Speed Estimation . . . . . . . . . . . . . . . . . . . . . . . 138
7.2.1 What is VASCAR and How Is It Used to Measure Speed? . . . . . . . . . 138
7.2.2 Project Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
7.2.3 Speed Estimation Config File . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.2.4 Camera Positioning and Constants . . . . . . . . . . . . . . . . . . . . . . 144
7.2.5 Centroid Tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2.6 Trackable Object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2.7 Speed Estimation with Computer Vision . . . . . . . . . . . . . . . . . . . 147
7.2.8 Deployment and Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.2.9 Calibrating for Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

8 Deep Learning and Multiple RPis 169


8.1 Chapter Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
8.2 An ImageZMQ Client/Server Application for Monitoring a Home . . . . . . . . . . 169
8.2.1 Project Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
8.2.2 Implementing the Client OpenCV Video Streamer . . . . . . . . . . . . . 171
8.2.3 Implementing the OpenCV Video Server . . . . . . . . . . . . . . . . . . . 173
8.2.4 Streaming Video Over Your Network with OpenCV and ImageZMQ . . . . 180
8.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

9 Training a Custom Gesture Recognition Model 183


9.1 Chapter Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
9.2 Getting Started with Gesture Recognition . . . . . . . . . . . . . . . . . . . . . . 184
9.2.1 What is Gesture Recognition? . . . . . . . . . . . . . . . . . . . . . . . . 184
9.2.2 Project Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
9.2.3 Our Configuration File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
9.3 Gathering Gesture Training Examples . . . . . . . . . . . . . . . . . . . . . . . . 191
9.3.1 Implementing the Dataset Gathering Script . . . . . . . . . . . . . . . . . 191
9.3.2 Running the Dataset Gathering Script . . . . . . . . . . . . . . . . . . . . 194
9.4 Gesture Recognition with Deep Learning . . . . . . . . . . . . . . . . . . . . . . 196
9.4.1 Implementing the GestureNet CNN Architecture . . . . . . . . . . . . . . 196
9.4.2 Implementing the Training Script . . . . . . . . . . . . . . . . . . . . . . . 198
9.4.3 Examining Training Results . . . . . . . . . . . . . . . . . . . . . . . . . . 202
9.5 Implementing the Complete Gesture Recognition Pipeline . . . . . . . . . . . . . 203
9.6 Gesture Recognition Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

9.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

10 Vehicle Recognition with Deep Learning 217


10.1 Chapter Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
10.2 What is Vehicle Recognition? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
10.3 Getting Started with Vehicle Recognition . . . . . . . . . . . . . . . . . . . . . . . 219
10.3.1 Our Vehicle Recognition Project . . . . . . . . . . . . . . . . . . . . . . . 219
10.3.2 Project Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
10.3.3 Our Configuration File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
10.4 Phase #1: Creating Our Training Dataset . . . . . . . . . . . . . . . . . . . . . . 225
10.4.1 Gathering Vehicle Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
10.4.2 Detecting Vehicles with YOLO Object Detector . . . . . . . . . . . . . . . 227
10.4.3 Building Our Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
10.5 Phase #2: Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
10.5.1 Implementing Deep Learning Feature Extraction . . . . . . . . . . . . . . 232
10.5.2 Extracting Features with ResNet . . . . . . . . . . . . . . . . . . . . . . . 237
10.5.3 Implementing the Training Script . . . . . . . . . . . . . . . . . . . . . . . 237
10.5.4 Training Our Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
10.6 Phase #3: Implementing the Vehicle Recognition Pipeline . . . . . . . . . . . . . 240
10.6.1 Implementing the Client (Raspberry Pi) . . . . . . . . . . . . . . . . . . . 241
10.6.2 Implementing the Server (Host Machine) . . . . . . . . . . . . . . . . . . 242
10.7 Vehicle Recognition Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
10.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254

11 What is the Movidius NCS 255


11.1 Chapter learning objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
11.2 What is the Intel Movidius Neural Compute Stick? . . . . . . . . . . . . . . . . . 256
11.3 What can the Movidius NCS do? . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
11.4 Intel Movidius’ NCS History Lesson . . . . . . . . . . . . . . . . . . . . . . . . . . 258
11.4.1 Product Launch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
11.4.2 Meet OpenVINO and the NCS2 . . . . . . . . . . . . . . . . . . . . . . . . 259
11.4.3 Raspberry Pi 4 released (with USB 3.0 support) . . . . . . . . . . . . . . 260
11.5 What are the alternatives to the Movidius NCS? . . . . . . . . . . . . . . . . . . . 261
11.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261

12 Image Classification with the Movidius NCS 263


12.1 Chapter Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263

12.2 Image Classification with the Movidius NCS . . . . . . . . . . . . . . . . . . . . . 263


12.2.1 Project Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
12.2.2 Our Configuration File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
12.2.3 Image Classification with the Movidius NCS and OpenVINO . . . . . . . . 266
12.2.4 Minor Changes for CPU Classification . . . . . . . . . . . . . . . . . . . . 270
12.2.5 Image Classification with Movidius NCS Results . . . . . . . . . . . . . . 272
12.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

13 Object Detection with the Movidius NCS 277


13.1 Chapter Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
13.2 Object Detection with the Movidius NCS . . . . . . . . . . . . . . . . . . . . . . . 278
13.2.1 Project Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
13.2.2 A Brief Review of Object Counting . . . . . . . . . . . . . . . . . . . . . . 279
13.2.3 Object Counting with OpenVINO . . . . . . . . . . . . . . . . . . . . . . . 280
13.2.4 Movidius Object Detection and Footfall Counting Results . . . . . . . . . 292
13.3 Pre-trained Models and Custom Training with the Movidius NCS . . . . . . . . . 294
13.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295

14 Fast, Efficient Face Recognition with the Movidius NCS 297


14.1 Chapter Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
14.2 Fast, Efficient Face Recognition with the Movidius NCS . . . . . . . . . . . . . . 298
14.2.1 Project Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
14.2.2 Our Environment Setup Script . . . . . . . . . . . . . . . . . . . . . . . . 299
14.2.3 Extracting Facial Embeddings with Movidius NCS . . . . . . . . . . . . . 300
14.2.4 Training an SVM model on Top of Facial Embeddings . . . . . . . . . . . 306
14.2.5 Real-Time Face Recognition in Video Streams with Movidius NCS . . . . 309
14.2.6 Face Recognition with Movidius NCS Results . . . . . . . . . . . . . . . . 314
14.3 How to Obtain Higher Face Recognition Accuracy . . . . . . . . . . . . . . . . . 315
14.3.1 You May Need More Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
14.3.2 Perform Face Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
14.3.3 Tune Your Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . 319
14.3.4 Use Dlib’s Embedding Model . . . . . . . . . . . . . . . . . . . . . . . . . 319
14.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320

15 Recognizing Objects with IoT Pi-to-Pi Communication 321


15.1 Chapter Learning Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
15.2 How this Chapter is Organized . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322

15.3 The Case for a Complex Security System . . . . . . . . . . . . . . . . . . . . . . 323


15.3.1 Your Family, Your Belongings . . . . . . . . . . . . . . . . . . . . . . . . . 323
15.3.2 The Importance of Video Evidence . . . . . . . . . . . . . . . . . . . . . . 324
15.3.3 How the RPi Shines for Security — Flexibility and Affordability . . . . . . 324
15.4 A Fully-Fledged Project Involving Multiple IoT Devices . . . . . . . . . . . . . . . 325
15.5 Reinforcing Concepts Covered in Previous Chapters . . . . . . . . . . . . . . . . 326
15.5.1 Message Passing and Sockets . . . . . . . . . . . . . . . . . . . . . . . . 326
15.5.2 Face Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
15.5.3 Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
15.5.4 Background Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
15.5.5 Object Detection with MobileNet SSD . . . . . . . . . . . . . . . . . . . . 329
15.5.6 Twilio SMS Alerts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
15.6 New Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
15.6.1 IoT Smart Lighting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
15.6.2 State Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
15.7 Our IoT Case Study Project . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
15.7.1 Independent State Machines . . . . . . . . . . . . . . . . . . . . . . . . . 335
15.7.2 Project Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
15.7.2.1 Pi #1: “driveway-pi” . . . . . . . . . . . . . . . . . . . . . . . . . 337
15.7.2.2 Pi #2: “home-pi” . . . . . . . . . . . . . . . . . . . . . . . . . . . 338
15.7.3 Config Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
15.7.3.1 Pi #1: “driveway-pi” . . . . . . . . . . . . . . . . . . . . . . . . . 339
15.7.3.2 Pi #2: “home-pi” . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
15.7.4 Driver Script for Pi #1: “driveway-pi” . . . . . . . . . . . . . . . . . . . . . 343
15.7.5 Driver Script for Pi #2: “home-pi” . . . . . . . . . . . . . . . . . . . . . . . 350
15.7.6 Deploying Our IoT Project . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
15.8 Suggestions and Scope Creep . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
15.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367

16 Your Next Steps 369


16.1 So, What’s next? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
To the PyImageSearch team;
the Raspberry Pi community;
and all the PyImageSearch readers who
have made this book possible.
Companion Website

Thank you for picking up a copy of Raspberry Pi for Computer Vision! To accompany this book
I have created a companion website which includes:

• Up-to-date installation instructions on how to configure your Raspberry Pi development environment

• Instructions on how to use the pre-configured Raspbian .img file(s)

• Supplementary material that I could not fit inside this book

• Frequently Asked Questions (FAQs) and their suggested fixes and solutions

Additionally, you can use the “Issues” feature inside the companion website to report any
bugs, typos, or problems you encounter when working through the book. I don’t expect many
problems; however, this is a brand new book, so other readers and I would appreciate you
reporting any issues you run into. From there, I can keep the book updated and bug free.

To create your companion website account, just use this link:

http://pyimg.co/qnv89

Take a second to create your account now so you’ll have access to the supplementary
materials as you work through the book.

Chapter 1

Introduction

Welcome to the Hacker Bundle of Raspberry Pi for Computer Vision!

Inside the Hobbyist Bundle you learned about the Raspberry Pi, an affordable, yet powerful
embedded device given the size and cost of the machine.

You then discovered how to apply computer vision, image processing, and OpenCV to the
RPi to build real-world applications, including:

• Building a remote wildlife detector

• Creating a video surveillance system (and streaming the result to a webpage)

• Detecting tired, drowsy drivers behind the wheel

• Face tracking with pan/tilt servos

• Creating a traffic counting/footfall application

• Building an automatic prescription pill recognition system

You should be proud of your accomplishments thus far — some of those projects were not
easy, but you rolled up your sleeves, put your head down, and learned how to apply computer
vision to the RPi.

That said, the Hobbyist Bundle did not touch on deep learning, one of the most important
advances in computer science in the past decade.

Deep learning has impacted nearly every facet of computer science, including computer vision,
Natural Language Processing (NLP), speech recognition, social network filtering, bioinformatics,
drug design, and more — basically, if there is enough labeled data, deep learning
has likely (successfully) been applied to the field in some manner.

Computer vision is no different.


Today we see deep learning applied to computer vision tasks including image classification,
object detection, semantic/instance segmentation, face recognition, gait recognition, and much
more.

In the Hobbyist Bundle we didn’t touch on deep learning.

We instead focused on the fundamentals of applying fairly standard computer vision and
image processing algorithms to the RPi via the OpenCV library. Placing emphasis on the
basics enabled us to:

i. Learn how to use the OpenCV library on the Raspberry Pi.

ii. Better understand the computational limitations of embedded devices even with basic
algorithms.

Deep learning algorithms, by their very nature, are incredibly computationally expensive
— and in order to apply them to the RPi (or other embedded devices) we need to be extremely
thoughtful regarding:

• The number of model parameters/size (ensuring it can fit into RAM).

• Our model’s computational requirements (to ensure inference can be made in a rea-
sonable amount of time)

• Library optimizations, such as NEON, VFPV3, and OpenCL, which can be used to
improve inference time.

• Hardware add-ons, including the Movidius NCS or Google Coral USB Accelerator, which
can be used to push model computation to an optimized USB compute stick.

Inside the chapters in this bundle, we’ll take a deeper dive into the world of computer vision
on embedded devices. You’ll learn about more advanced computer vision algorithms, ma-
chine learning techniques, and how to apply deep learning on embedded devices (including
optimization tips, suggestions, and best practices). The techniques covered in this bundle are
much more advanced than what is covered in the Hobbyist Bundle — this is where you’ll start
to graduate from hobbyist to true embedded device practitioner.

Flip the page to get started.


Chapter 2

Deep Learning on Resource Constrained Devices Outline

Before we start writing code to perform deep learning inference (including classification, detec-
tion, and segmentation) on the RPi, we first need to take a step back and understand why it’s
such a challenge to perform deep learning on resource constrained devices.

Having this perspective is not only educational, but it will better enable us to assess the right
libraries and tools for the job when building an application that leverages deep learning.

2.1 Chapter Learning Objectives

Inside this chapter you will:

i. Learn why performing deep learning on embedded devices is so challenging.

ii. Discover coprocessor devices, such as the Movidius NCS and Google Coral TPU Accel-
erator.

iii. Learn about dedicated development boards, including Google’s TPU Dev Board and the
NVIDIA Jetson Nano.

iv. Discuss whether or not the RPi is relevant for deep learning.

2.2 The Challenge of DL on Embedded Devices

Deep learning algorithms have facilitated unprecedented levels of accuracy in computer vision
— but that accuracy comes at a price — namely, computational resources that embedded
devices tend to lack.


Model Memory FLOPs


AlexNet 233 MB 727 MFLOPs
SqueezeNet 30 MB 837 MFLOPs
VGG16 528 MB 16 GFLOPs
GoogLeNet 51 MB 2 GFLOPs
ResNet-18 45 MB 2 GFLOPs
ResNet-50 83 MB 4 GFLOPs
ResNet-101 170 MB 8 GFLOPs
Inception V3 91 MB 6 GFLOPs
DenseNet-121 126 MB 3 GFLOPs

Table 2.1: Estimates of memory consumption and FLOP counts for seminal Convolutional Neural
Networks [1].

It’s no secret that training even modest deep learning models requires a GPU. Trying to train
even a small model on a CPU can take multiple orders of magnitude longer. And even at
inference, when the more computationally expensive backpropagation phase is not required, a
GPU is still often required to obtain real-time performance. These computational requirements
put resource constrained devices, such as the RPi, at a serious disadvantage — how are they
supposed to leverage deep learning if they are so comparatively underpowered?

There are a number of problems at work here, and the first, which we touched on above, is
the complexity of the model and the required computation. Samuel Albanie, a researcher
at the prestigious University of Oxford, studied the amount of required computation for popular
CNNs (Table 2.1).

AlexNet [2], the seminal CNN architecture that helped jumpstart the latest resurgence in
deep learning research after its performance in the ImageNet 2012 competition, requires 727
MFLOPs.

The VGG family of networks [3] requires anywhere from 727 MFLOPs to 20 GFLOPs, de-
pending on which specific architecture is used.

ResNet [4, 5], arguably one of the most powerful and accurate CNNs, requires 2 GFLOPs
to 16 GFLOPs, depending on how deep the model is.

The original Raspberry Pi was released with a 700 MHz processor capable of approximately
0.041 GFLOPS [6]. The RPi v1.1 included a quad-core Cortex-A7 running at 900 MHz and
1GB of RAM. It’s estimated that the RPi v1.1 is approximately 4x faster than the original RPi,
bringing computational power up to 0.164 GFLOPs.

The Raspberry Pi 3 was upgraded further, including an ARM Cortex-A53 running at 1.2 GHz,
having 10x the performance of the original RPi [6], giving us approximately 0.41 GFLOPs.

Now, take that 0.41 GFLOPs and compare it to Table 2.1 — note how the RPi is computationally
limited compared to the amount of operations required by state-of-the-art architectures
such as ResNet (2-16 GFLOPs). In order to successfully perform deep learning on the RPi, we’ll
need some serious levels of optimization.
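To make that gap concrete, here is a rough back-of-the-envelope calculation of my own (it ignores memory bandwidth, caching, and library optimizations such as NEON) comparing the RPi 3’s approximate throughput to the cost of a single ResNet-50 forward pass:

# idealized estimate only -- real inference time also depends on memory
# bandwidth, caching, and how well the library exploits NEON/VFPV3
rpi3_gflops = 0.41       # approximate RPi 3 throughput (GFLOPS)
resnet50_gflops = 4.0    # approximate FLOPs for one ResNet-50 forward pass

seconds_per_frame = resnet50_gflops / rpi3_gflops
print("~{:.1f} sec/frame ({:.2f} FPS)".format(seconds_per_frame,
    1.0 / seconds_per_frame))
# prints roughly ~9.8 sec/frame (0.10 FPS), nowhere near real-time

Even under these idealized assumptions, an unoptimized ResNet-50 forward pass would take on the order of ten seconds per frame on the RPi 3 CPU.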

So far we’ve discussed only computational requirements, but we haven’t assessed the
memory requirements (RAM).

The RPi 3 has 1GB of RAM while the RPi 4 has 1-4GB, depending on the model. However,
keep in mind that this RAM is responsible for not only your code and deep learning models, but
also the system operations on the RPi as well. While the amount of RAM tends to be less of
a limitation than the CPU, it’s still worth considering when developing your own deep learning
applications on the RPi.

Finally, you must consider the power draw of the RPi.

Embedded devices naturally do not draw as much power as laptops, desktops, or GPU-
enabled deep learning rigs — embedded machines are not designed to draw as much
power. The less power there is, the less computationally powerful the machine.

For comparison purposes, a RPi 3B+ will draw about 3.5W and a full blown desktop GPU
will draw 250W (Titan X) [7]. That’s not to mention the other peripherals drawing even more
power in the GPU machine — most Power Supply Units (PSUs) would be capable of 800W at
a minimum. For reference, my NVIDIA DIGITS DevBox has a 1350W power supply to power
4x TitanX GPUs and all the other components (http://pyimg.co/hsozv) [8].

As you can see, we need to make some special accommodations to accomplish any deep
learning on the RPi.

2.3 Can we Train Models on the RPi?

A common misconception I see from deep learning practitioners new to the world of embedded
devices is that they (incorrectly) think they can/should train their models on the RPi.

In short — don’t train your deep learning models on the RPi (or other embedded
devices). Instead, you should:

i. Train the model on your desktop/GPU-enabled machine.

ii. Serialize the model to disk.

iii. Transfer the model to the RPi (e.g., FTP, SFTP, etc.).

iv. Use the RPi to perform inference.



Using the steps above you can alleviate the need to actually train the model on the RPi and
instead use the RPi for just making predictions.
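As a minimal sketch of that workflow (assuming a Keras model and hypothetical filenames such as model.h5 and example.jpg; your preprocessing will depend on how the model was trained):

# on your desktop/GPU machine, after training:
#   model.save("model.h5")
# then transfer model.h5 to the RPi (e.g., via SFTP)

# on the RPi, load the serialized model and perform inference only
from tensorflow.keras.models import load_model
import numpy as np
import cv2

model = load_model("model.h5")                 # hypothetical filename
image = cv2.imread("example.jpg")              # hypothetical input image
image = cv2.resize(image, (224, 224))          # match the model's input size
image = np.expand_dims(image.astype("float32") / 255.0, axis=0)
preds = model.predict(image)
print("predicted class index: {}".format(np.argmax(preds[0])))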

All that said, there are certain situations where training a model on the RPi can make
practical sense — most of these use cases involve taking a pre-trained model and then fine-
tuning it on a small amount of data available on the embedded device itself. Those situations
are few and far between though — my personal recommendation would be to operate under
the assumption that you should not be training on the RPi unless you have a very explicit
reason for doing so.

2.4 Faster Inference with Coprocessors

There are times when the Raspberry Pi CPU itself will not be sufficient for deep learning and
computer vision tasks. If and when those situations arise, we can utilize coprocessors to
perform deep learning inference.

Our pipeline then becomes:

i. Loading a deep learning model into memory and onto the coprocessor

ii. Using the RPi CPU for polling frames from a video stream.

iii. Utilizing the CPU for any preprocessing (resizing, channel order swapping, etc.)

iv. Passing the image to the coprocessor for inference.

v. Post-processing the results from the coprocessor and then continuing the process with
the CPU.

Coprocessors are designed specifically with deep learning inference in mind; the two most
popular are Intel’s Movidius Neural Compute Stick (NCS) and Google Coral’s TPU
USB Accelerator.

2.4.1 Movidius Neural Compute Stick (NCS)

Intel’s NCS is a USB thumb drive sized deep learning coprocessor (Figure 2.1). You plug the
USB stick into your RPi and then access it via the NCS2 SDK and/or the OpenVINO toolkit,
the latter of which is recommended as it can be used directly inside OpenCV with only one or
two extra function calls.
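As a rough sketch of what those extra calls look like (assuming an OpenVINO-enabled OpenCV build and hypothetical Caffe model filenames), routing inference to the NCS boils down to setting the backend and target on the network:

import cv2

# load a (hypothetical) Caffe model with OpenCV's dnn module
net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "model.caffemodel")

# the "one or two extra function calls": route inference to the Movidius NCS
# (requires OpenCV compiled with OpenVINO's Inference Engine support)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

image = cv2.imread("example.jpg")
blob = cv2.dnn.blobFromImage(image, 1.0, (224, 224), (104, 117, 123))
net.setInput(blob)
preds = net.forward()    # the forward pass now runs on the coprocessor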

The NCS can run between 80-150 GFLOPs using just over 1W of power [9], enabling embedded
devices, such as the RPi, to run state-of-the-art neural networks.

Later in this text we’ll learn how to use the NCS for deep learning inference on the RPi.

Figure 2.1: Left: Intel’s Movidius Neural Compute Stick. Right: Google Coral USB Accelerator.

2.4.2 Google Coral TPU USB Accelerator

Similar to the Movidius NCS, the Google Coral USB Accelerator (Figure 2.1, right) is also
a coprocessor device that plugs into the RPi via a USB port — and just like the NCS, it is
designed only for inference (i.e., you wouldn’t train a model with either the Coral or NCS,
although Google’s documentation does show how to fine-tune models on small datasets using
the Coral [10]).

Google reports that their Coral line of products is over 10x faster than the NCS; however,
there is a bit of a caveat when using the Coral USB Accelerator with the Raspberry Pi — to
obtain such speeds, the Coral USB Accelerator requires USB 3. The problem is that the RPi 3B+
only has USB 2!

Unfortunately, having USB 2 instead of USB 3 reduces our inference speed by roughly 10x, which
essentially means that the NCS and Coral USB Accelerator will perform very similarly on the
Raspberry Pi 3/3B+.

However, with that said, the Raspberry Pi 4 does have USB 3 — using USB 3, the
Coral USB Accelerator obtains much faster inference than the current iteration of the
NCS.

2.5 Dedicated Development Boards

Just like there are times when you may need a coprocessor to obtain adequate throughput
speeds for your deep learning and computer vision pipeline, there are also times where you
may need to abandon the Raspberry Pi altogether and instead utilize a dedicated develop-
ment board.
22 CHAPTER 2. DEEP LEARNING ON RESOURCE CONSTRAINED DEVICES OUTLINE

Currently, there are two frontrunners in the dedicated deep learning board market — the
Google Coral TPU Dev Board and the NVIDIA Jetson Nano.

Figure 2.2: Left: Google Coral Dev Board. Right: NVIDIA Jetson Nano.

2.5.1 Google Coral TPU Dev Board

Unlike the TPU USB Accelerator, the Google Coral TPU Dev Board (Figure 2.2, left) is actually
a dedicated board capable of up to 32-64 GFLOPs, far more powerful than both the TPU USB
Accelerator and the Movidius NCS.

The device itself utilizes the highly optimized implementation of TensorFlow Lite, capable
of running object detection models such as MobileNet V2 at 100+ FPS in a power efficient
manner.

The downside of the TPU Dev Board is that it can only run TensorFlow Lite models at these
speeds — you won’t be able to take existing, off-the-shelf pre-trained models (such as Caffe
models) and then directly use them with the TPU Dev Board.

Instead, you would first need to convert these models (which may or may not be possible)
and then deploy them to the TPU Dev Board.
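As a rough sketch of that conversion step (TensorFlow 2.x API; the exact calls depend on your TensorFlow version, and the filename here is hypothetical):

import tensorflow as tf

# convert a trained Keras model to TensorFlow Lite
model = tf.keras.models.load_model("model.h5")        # hypothetical filename
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable quantization

with open("model.tflite", "wb") as f:
    f.write(converter.convert())

# the resulting .tflite model must then be compiled with Google's
# edgetpu_compiler tool before it can run on the Edge TPU itself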

I’ll be covering the TPU Dev Board in depth inside the Complete Bundle of this text.

2.5.2 NVIDIA Jetson Nano

Competing with Google Coral’s TPU Dev Board is the NVIDIA Jetson Nano (Figure 2.2, right).

The Jetson Nano includes a 128-core Maxwell GPU, a quad-core ARM Cortex-A57 processor
running at 1.43 GHz, and 4GB of 64-bit LPDDR4 RAM. The Nano can provide 472 GFLOPs with
only 5-10W of power.

What I really like about the Nano is that you are not restricted in terms of deep learning
packages and libraries — there are already TensorFlow and Keras libraries that are optimized
for the Nano, as I discuss in this PyImageSearch tutorial: (http://pyimg.co/yb72g) [11].

At this point it’s impossible to tell which dev board is going to win out in the market — it’s
highly dependent on not only the marketing of each respective company, but more importantly,
their documentation and support as well. NVIDIA’s Jetson TX1 and TX2 series, while powerful,
were incredibly challenging to utilize from a user experience perspective. I believe the Nano is
correcting many of the mistakes from the TX series and will ultimately make for a much more
pleasant user experience, provided NVIDIA continues down the road it is on now.

2.5.3 My Recommendations for Devices

If you decide you would like to explore coprocessor additions to the Raspberry Pi, then I would
definitely take a look at the Movidius NCS and Google Coral USB Accelerator.

The NCS has come a long way since its original v1 release, and with the OpenVINO toolkit,
it’s incredibly easy to use with your own applications. That said, OpenVINO does lock you
down a bit to using OpenCV, so if you’re looking to utilize TensorFlow or Keras models, I would
instead recommend the Google Coral USB Accelerator.

As for dedicated dev boards, right now I’m partial to NVIDIA Jetson Nano. While the Coral
Dev Board is extremely fast, I found the Jetson Nano easier to use. I also enjoyed not being
locked down to just TensorFlow and Keras models.

2.6 Is the RPi Irrelevant for Deep Learning?

At this point you may be pondering two questions:

i. “Is the RPi just too underpowered for deep learning?”

ii. "And if so, why would Adrian ever write a book about the RPi?”

Those are two fair questions — and if you know me at all, you know my answer is that it’s
all about bringing the right tool to the job.

Running deep learning models on embedded devices is computationally expensive, and in some
cases even computationally prohibitive — expecting to (1) run state-of-the-art
models on the RPi CPU, and (2) obtain real-time performance out-of-the-box just isn’t going to
happen. However, there are a variety of techniques we can utilize to optimize our code and
models for the RPi, making deep learning inference on the RPi CPU possible.

When we encounter situations where the RPi CPU just isn’t enough, we’ll then be able to
lean on dedicated coprocessors such as the Google Coral USB TPU Accelerator and Intel’s
Movidius NCS. And when we need a dedicated device that’s both faster than the RPi and more
suitable for deep learning, we have both the Google Coral Dev board and NVIDIA Jetson Nano.

As I said, performing deep learning on the RPi is far from irrelevant — we just need to
understand the limitations of what we can and cannot do.

The rest of this text (as well as the Complete Bundle) will show you both practical projects
and the limitations of the RPi through real-world applications, including situations where you
should utilize either a USB coprocessor or a dedicated dev board.

2.7 Summary

In this chapter you learned about some of the problems surrounding deep learning on resource
constrained devices, namely:

• The immense amount of computation required by state-of-the-art networks (and how the
RPi is quite limited in terms of computation).

• Limited memory/RAM.

• Power draw and supply.

Deep learning practitioners new to the world of embedded devices may be tempted to ac-
tually train their models on the RPi. Instead, what you should do is train your model on a
desktop/GPU-enabled machine, and then after training, transfer it to your embedded de-
vice for inference. Unless you have very explicit reasons for doing so, you should not train
a model on the RPi (or other embedded device) — not only are they computational resources
limited, but electrically the RPi is underpowered as well.

You will invariably run into situations where a particular deep learning model/computer vision
pipeline is too computationally expensive for your Raspberry Pi.

If and when that situation arises, you can leverage USB coprocessors such as the Google
Coral USB Accelerator and Intel’s Movidius NCS. These USB sticks are optimized for deep
learning inference and allow you to push computation from the CPU to the USB stick, allowing
you to obtain better performance.

Alternatively, you may choose to switch embedded devices entirely and go with Google
Coral’s TPU development board or NVIDIA’s Jetson Nano — both of these devices are faster
than the Raspberry Pi and more optimized for deep learning.

Depending on your project needs, you may even elect to use a REST API service to process
images/videos in the cloud and receive the results back on your Raspberry Pi. Obviously this
isn’t ideal (or even possible) for real-time needs.
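A minimal sketch of that approach, assuming a hypothetical cloud endpoint that accepts an uploaded image and returns JSON predictions, might look like this:

import requests

# hypothetical endpoint -- substitute your own cloud inference API here
URL = "https://api.example.com/v1/detect"

# the RPi only uploads the frame and parses the JSON response; all of the
# heavy deep learning inference happens in the cloud
with open("frame.jpg", "rb") as f:
    r = requests.post(URL, files={"image": f})

print(r.json())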

All that said, the Raspberry Pi itself is not irrelevant for deep learning. Instead, it’s all
about bringing the right tool for the job.

The remainder of this text (as well as the Complete Bundle) will show you how and when to
perform deep learning on the Raspberry Pi, as well as the situations where you should consider
using a coprocessor or dedicated development board.
Chapter 3

Multiple Pis, Message Passing, and ImageZMQ

The PyImageSearch blog quite often receives questions, comments, and emails that go something like this:

“Hi Adrian, I’m working on a project where I need to stream frames from a client
camera to a server for processing using OpenCV. Should I use an IP camera? Would
a Raspberry Pi work? What about RTSP streaming? Have you tried using FFMPEG
or GStreamer? How do you suggest I approach the problem?”

It’s a great question — and if you’ve ever attempted live video streaming with OpenCV then
you know there are a ton of different options.

You could go with the IP camera route. But IP cameras can be a pain to work with. Some
IP cameras don’t even allow you to access the RTSP (Real-time Streaming Protocol) stream.
Other IP cameras simply don’t work with OpenCV’s cv2.VideoCapture function. An IP
camera may be too expensive for your budget as well.

In those cases, you are left with using a standard webcam — the question then becomes,
how do you stream the frames from that webcam using OpenCV?

Using FFMPEG or GStreamer is definitely an option. But both of those can be a royal pain to
work with. In fact, they are so much of a pain that we removed our original FFMPEG streaming
content and example code from this book. Quite simply, it was going to be nearly impossible
to support and it honestly would have led readers like you down the wrong path.

In this chapter we’ll review the preferred solution using message passing libraries, specif-
ically ZMQ and ImageZMQ.

ZMQ, or “ZeroMQ”, makes working with communication sockets very simple.


As an introductory example, we’ll learn to use ZMQ to send text strings between clients and
servers.

We’ll also use a package called ImageZMQ (http://pyimg.co/fthtd) [12] to send video frames
from a client to a server — this has become my preferred way of streaming video from Raspberry
Pis, essentially turning them into inexpensive wireless IP cameras that you have full
control over.

PyImageSearch Guru and PyImageConf 2018 speaker Jeff Bass has made his personal
ImageZMQ project public (http://pyimg.co/lau8p). His system allows for streaming video frames
across your network using minimal lines of code. The project is very well polished and is
another tool you can put in your tool chest for your own projects.

As you’ll see, this method of OpenCV video streaming is not only reliable but incredibly easy
to use, requiring only a few lines of code.

3.1 Chapter Learning Objectives

In this chapter we will learn:

i. How multiple Pis can communicate

ii. How sockets and ZMQ work

iii. How to use the ImageZMQ library

3.2 Projects Involving Multiple Raspberry Pis

There are a number of use cases where you may wish to integrate multiple Raspberry Pis. The
first two are related to security.

Chapter 15’s IoT Case Study project involving face recognition, door monitoring, message
passing, and IoT lights is a great example. Inside that chapter we utilize two RPis – one is
responsible for detecting people and vehicles as they come down our driveway. The first RPi
sends a message to the second RPi which starts performing face recognition at our front door.

You could extend that project and apply face recognition to every door of a corporate build-
ing or campus.

The next idea is related to factory automation. Maybe you have an automation line with mul-
tiple cameras (RPis) doing different tasks and passing information downstream (and upstream)
the automation line.

Let’s brainstorm a serial number example. Maybe one camera grabs the serial number
from a barcode [13] or OCR [14] and sends that serial number to the rest of the computers
downstream so they will be expecting it. Perhaps one RPi finds a non-compliant measurement
on a part and it needs to inform other RPis to discard the part matching a certain serial number.
Each RPi takes a picture of the part and sends the frame somewhere with ImageZMQ.

Another idea is swarm robots playing a soccer game. The robots tell each other when and
where they see the soccer ball and other players on the field.

3.3 Project Structure

The directory structure for this project is as follows:

|-- messagepassing_example
|-- client.py
|-- server.py
|-- imagezmq_example
|-- client.py
|-- server.py

Our first demonstration will be using sockets and ZMQ for a simple messagepassing_example/ using a client/server approach.

From there, we’ll walk through an imagezmq_example/ for sending video frames from a
client to a server.

3.4 Sockets, Message Passing, and ZMQ

3.4.1 What is Message Passing?

Message passing is a programming paradigm/concept typically used in multiprocessing, distributed,
and/or concurrent applications. Using message passing, one process can communicate
with one or more other processes, typically using a message broker. Whenever a process
wants to communicate with another process, including all other processes, it must first send its
request to the message broker.

The message broker receives the request and then handles sending the message to the
other process(es). If necessary, the message broker also sends a response to the originating
process.

Figure 3.1: Illustrating the concept of sending a message from a process, through a message
broker, to other processes.

As an example of message passing, let’s consider a tremendous life event, such as a mother
giving birth to a newborn child (process communication depicted in Figure 3.1). Process A, the
mother, wants to announce to all other processes (i.e., the family), that she had a baby. To do
so, Process A constructs the message and sends it to the message broker.

The message broker then takes that message and broadcasts it to all processes. All other
processes then receive the message from the message broker. These processes want to
show their support and happiness to Process A, so they construct a message saying their
congratulations as shown in Figure 3.2.

Figure 3.2: Each process sends an acknowledgment (ACK) message back through the message
broker to notify Process A that the message is received.

These responses are sent to the message broker which in turn sends them back to Process
A.

This example is a dramatic simplification of message passing and message broker systems,
but should help you understand the general algorithm and the type of communication
the processes are performing. You can very easily get into the weeds studying these topics,
including various distributed programming paradigms and types of messages/communication
(1:1 communication, 1:many, broadcasts, centralized, distributed, broker-less etc.).

As long as you understand the basic concept that message passing allows processes to
communicate (including processes on different machines) then you will be able to follow along
with the rest of this chapter.

3.4.2 What is ZMQ?

ZeroMQ [15], or simply ZMQ for short, is a high-performance asynchronous message passing
library used in distributed systems.

Both RabbitMQ [16] and ZeroMQ are two of the most widely used message passing systems.
However, ZeroMQ specifically focuses on high throughput and low latency applications
— which is exactly how you can frame live video streaming.

When building a system to stream live videos over a network using OpenCV, you would
want a system that focuses on:

• High throughput: There will be new frames from the video stream coming in quickly.

• Low latency: As we’ll want the frames distributed to all nodes on the system as soon as
they are captured from the camera.

ZeroMQ also has the benefit of being extremely easy to both install and use.

Jeff Bass, the creator of ImageZMQ (which builds on ZMQ) [17], chose to use ZMQ as the
message passing library for these reasons.

3.4.3 Example Message Passing using Client/Server

In this client/server message passing example, our client is going to ask the user a question
and then query the server to see if the answer is correct or incorrect.

Textual messages are passed between the client and server in order to answer the question:

“What is the best type of pie?”

We can implement this example in minimal lines of code on both the client and server.

Let’s implement the server first. Go ahead and create a new file named server.py and
insert the following code:

1 # import necessary packages
2 import argparse
3 import sys
4 import zmq
5
6 # construct the argument parser and parse the arguments
7 ap = argparse.ArgumentParser()
8 ap.add_argument("-p", "--server-port", required=True, type=str,
9 help="server's port")
10 args = vars(ap.parse_args())

We begin with imports where Line 4 imports zmq for message passing.

From there we process one command line argument, the --server-port.

Let’s go ahead and establish our connection:

12 # create a container for all sockets in this process
13 context = zmq.Context()
14
15 # establish a socket for incoming connections
16 print("[INFO] creating socket...")
17 socket = context.socket(zmq.REP)
18 socket.bind("tcp://*:{}".format(args["server_port"]))

Line 13 creates a container for all socket connections.

We then bind a socket connection with zmq on our --server-port via Lines 17 and 18.

From here let’s loop over incoming messages:

20 while True:
21 # receive a message, decode it, and convert to lowercase
22 message = socket.recv().decode("ascii").lower()
23 print("[INFO] received message `{}`".format(message))

Line 22 receives a string message from our socket and converts it to lowercase.

The message is echoed to the terminal (Line 23).

Let’s handle the message parsing in a few case statements:

25 # check if the correct message, *raspberry*, is received and then
26 # send the return message accordingly
27 if "raspberry" in message:
28 print("[INFO] correct message, so sending 'correct'")
29 returnMessage = "correct"
30 socket.send(returnMessage.encode("ascii"))
31
32 # kill the server if the *quit* message is received and let the
33 # client know that it should exit as well
34 elif message == "quit":
35 returnMessage = "quitting server..."
36 socket.send(returnMessage.encode("ascii"))
37 print("[INFO] terminating the server")
38 sys.exit(0)
39
40 # otherwise send the default reply
41 else:
42 print("[INFO] incorrect message, so requesting again")
43 returnMessage = "try again!"
44 socket.send(returnMessage.encode("ascii"))

If the client says "raspberry" is the best type of pie, we let the user know the message
is correct by sending "correct" as our returnMessage (Lines 27-30).

Or, if the message is "quit", then we send "quitting server..." as our returnMessage
(Lines 34-36). We also go ahead and call sys.exit since the client wants to quit (Line 38).

For any other message, we’ll ask the client to "try again!" (Lines 41-44).

Now that our server is ready to go, let’s implement a client. Create a new file named
client.py and insert the following code:

1 # import the necessary packages
2 import argparse
3 import sys
4 import zmq
5
6 # construct the argument parser and parse the arguments
7 ap = argparse.ArgumentParser()
8 ap.add_argument("-ip", "--server-ip", required=True, type=str,
9 help="server's ip address")
10 ap.add_argument("-p", "--server-port", required=True, type=str,
11 help="server's port")
12 args = vars(ap.parse_args())
13
14 # create a container for all sockets in this process
15 context = zmq.Context()
16
17 # establish a socket to talk to server
18 print("[INFO] connecting to the server...")
19 socket = context.socket(zmq.REQ)
20 socket.connect("tcp://{}:{}".format(args["server_ip"],
21 args["server_port"]))

We begin with the same imports as our server (Lines 2-4).

Our client requires two command line arguments (Lines 7-12):

• --server-ip: The IP address of the server we are connecting to.

• --server-port: The port for the application running on the server.

Lines 15-21 then connect to the server via the IP and port.

Once we are connected, we’ll begin our question/response logic:

23 # loop indefinitely
24 while True:
25 # ask the user to type a message
26 message = input("[INPUT] What is the best type of pie? ")
27
28 # send a text message over the socket connection
29 print("[INFO] sending '{}'".format(message))
30 socket.send(message.encode("ascii"))

First, the client poses a question, "What is the best type of pie?". The question
is wrapped in an input() statement, pausing execution until the user types their answer (of
course, we know the answer is “raspberry” or “raspberry pie”, but it is up to the server to be the
judge).

The client will then send the message to the server to see if it is correct (Line 30).

Let’s receive the server’s response:

32 # receive a reply text message
33 response = socket.recv().decode("ascii")
34 print("[INFO] received reply '{}'".format(response))
35
36 # quit client if necessary
37 if response == "quitting server...":
38 print("[INFO] the server is shut down, so exiting the client")
39 sys.exit(0)

Line 33 receives a response from the server and Line 34 echos it to the user.

In the event that the response is equal to "quitting server...", our client will exit.

Let’s put the client and server to use in the next section.

3.4.4 Running Our Message Passing Example

With our client and server coded up, now it’s time to put them to work.

First, go ahead and start your server on a machine such as your laptop or a RPi:

$ python server.py --server-port 5555
[INFO] creating socket...

Then, on a separate machine start your client (technically you could run the server and
client on the same machine if you use a loopback IP address such as 127.0.0.1 or localhost
instead of another machine’s IP on your network):

$ python client.py --server-ip localhost --server-port 5555
[INFO] connecting to the server...

Next, you will be prompted for [INPUT].

Of course we all like “Raspberry” flavored pie, but just for the sake of this example, let’s
enter “Apple” at the prompt:

[INPUT] What is the best type of pie? Apple
[INFO] sending 'Apple'
[INFO] received reply 'try again!'

It looks like the server has asked us to try again!

Okay, so try “Raspberry” this time:

[INPUT] What is the best type of pie? Raspberry
[INFO] sending 'Raspberry'
[INFO] received reply 'correct'

Perfect! We received a message indicating our answer is correct!

At this point, you can chat with the server and try some other types of pie.

When you’re ready, you can tell the server that you’re ready to quit:

[INPUT] What is the best type of pie? quit
[INFO] sending 'quit'
[INFO] received reply 'quitting server...'
[INFO] the server is shut down, so exiting the client

If you weren’t watching the server, you can go back and inspect its output too. Here’s what
the server’s output looked like:

$ python server.py --server-port 5555
[INFO] creating socket...
[INFO] received message `apple`
[INFO] incorrect message, so requesting again
[INFO] received message `raspberry`
[INFO] correct message, so sending 'correct'
[INFO] received message `quit`
[INFO] terminating the server

As you can see, both our client and server are operating properly.

As a challenge for you, could you implement more features for this program? Perhaps add
more questions and handle answer responses? Or maybe you’d like to create a client/server
chat program using Python and zmq.

You could implement such a chat system that handles one client in a matter of minutes, and keep track of multiple client connections for a multi-user chat in about thirty minutes. Prompt for the person’s chat username/alias when they connect to the server, then broadcast each incoming message to all clients with the alias at the beginning of the string (just like the good ol’ days of ICQ).
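
If you want to experiment with that idea, here is a minimal sketch of one possible broadcast design. This is not code from this book: the PUSH/PULL plus PUB/SUB pattern, the port numbers (5556 and 5557), and the alias handling are choices made purely for illustration.

# chat_server.py -- relay every incoming chat message to all connected clients
import zmq

# create the ZMQ context and a PULL socket that clients PUSH
# "alias: message" strings to
context = zmq.Context()
pull = context.socket(zmq.PULL)
pull.bind("tcp://*:5556")

# create a PUB socket used to re-broadcast every message to all
# subscribed clients
pub = context.socket(zmq.PUB)
pub.bind("tcp://*:5557")

# loop forever, relaying each message as it arrives
while True:
    message = pull.recv_string()
    print("[INFO] broadcasting '{}'".format(message))
    pub.send_string(message)

Each client would then open a PUSH socket to port 5556 for sending (prefixing its alias to every outgoing message) and a SUB socket to port 5557, subscribed with an empty prefix, for receiving the broadcasts.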

3.5 ImageZMQ for Video Streaming

3.5.1 Why Stream Video Frames Over a Network?

There are a number of reasons why you may want to stream frames from a video stream over
a network with OpenCV.

To start, you could be building a security application that requires all frames to be sent to a
central hub for additional processing and logging.

Or, your client machine may be highly resource constrained (such as a Raspberry Pi) and
lack the necessary computational horsepower required to run computationally expensive algo-
rithms (such as deep neural networks, for example).

In these cases, you need a method to take input frames captured from a webcam with
OpenCV and then pipe them over the network to another system.

There are a variety of methods to accomplish this task such as those mentioned in the intro-
duction of this chapter, but right now we’re going to continue our focus on message passing.

Figure 3.3: A great application of video streaming with OpenCV is a security camera system. You
could use Raspberry Pis and ImageZMQ to stream from the Pi (client) to the server.

3.5.2 What is the ImageZMQ Library?

Jeff Bass is the owner of Yin Yang Ranch [18], a permaculture farm in Southern California. He
was one of the first people to join PyImageSearch Gurus (http://pyimg.co/gurus), my flagship
computer vision course. In the course and community he has been an active participant in
many discussions around the Raspberry Pi.

Jeff has found that Raspberry Pis are perfect for computer vision and other tasks on his
farm. They are inexpensive, readily available, and astoundingly resilient/reliable.

At PyImageConf 2018 [19], Jeff spoke about his farm and more specifically how he used
Raspberry Pis and a central computer to manage data collection and analysis. The heart of
his project is a library that he put together called ImageZMQ [20].

ImageZMQ solves the problem of real-time streaming from the Raspberry Pis on his farm.
It is based on ZMQ and works really well with OpenCV.

Plain and simple, ImageZMQ just works. And it works really reliably.

I’ve found it to be more reliable than alternatives such as GStreamer or FFMPEG streams.
I’ve also had better luck with it than using RTSP streams.

Be sure to refer to the following links:

• You can learn the details of ImageZMQ by studying Jeff’s code on GitHub:
http://pyimg.co/lau8p

Figure 3.4: ImageZMQ is a video streaming library developed by PyImageSearch Guru, Jeff Bass.
It is available on GitHub: http://pyimg.co/lau8p

• If you’d like to learn more about Jeff, be sure to refer to his interview here:
http://pyimg.co/sr2gj

• Jeff’s slides from PyImageConf 2018 are also available here: http://pyimg.co/f7jsc

In the coming sections, we’ll install ImageZMQ, implement the client + server, and put the
system to work.

3.5.3 Installing ImageZMQ

ImageZMQ is preinstalled on the Raspbian and Nano .imgs included with this book. Refer to
the companion website associated with this text to learn more about the preconfigured .img
files.

If you prefer to install ImageZMQ from scratch, refer to this article on PyImageSearch:
http://pyimg.co/fthtd [12].

3.5.4 Preparing Clients for ImageZMQ with Custom Hostnames

ImageZMQ must be installed on each client and the central server.



Figure 3.5: Changing a Raspberry Pi hostname is as simple as entering the raspi-config interface from a terminal or SSH connection.

In this section, we’ll cover one important difference for clients.

Our code is going to use the hostname of the client to identify it. You could use the IP
address in a string for identification, but setting a client’s hostname allows you to more easily
identify the purpose of the client.

In this example, we’ll assume you are using a Raspberry Pi running Raspbian. Of course,
your client could run Windows Embedded, Ubuntu, macOS, etc., but since our demo uses
Raspberry Pis, let’s learn how to change the hostname on the RPi.

To change the hostname on your Raspberry Pi, fire up a terminal (this could be over an
SSH connection if you’d like).

Then run the raspi-config command and navigate to network options as shown in Fig-
ure 3.5. From there, change the hostname to something unique. I recommend a naming
convention such as pi-garage, pi-frontporch, etc.

After changing the hostname you’ll need to save and reboot your Raspberry Pi.

On some networks, you won’t even need an IP address to SSH into your Pi. You can SSH via the hostname:

$ ssh pi-livingroom

3.5.5 Defining the Client/Server Relationship

Figure 3.6: The client/server relationship for ImageZMQ video streaming with OpenCV.

Before we actually implement network video streaming with OpenCV, let’s first define the
client/server relationship to ensure we’re on the same page and using the same terms:

• Client: Responsible for capturing frames from a webcam using OpenCV and then send-
ing the frames to the server.

• Server: Accepts frames from all input clients.

We could argue back and forth as to which system is the client and which is the server. For
example, a system that is capturing frames via a webcam and then sending them elsewhere
could be considered a server — the system is undoubtedly serving up frames. Similarly, a
system that accepts incoming data could very well be the client.

However, we are assuming:

i. There is at least one (and likely many more) system responsible for capturing frames.

ii. There is only a single system used for actually receiving and processing those frames.

For these reasons, we’ll prefer to think of the system sending the frames as the client and
the system receiving/processing the frames as the server. Thus, Figure 3.6 demonstrates the
relationship and responsibilities of both the client and server.

You may disagree, but that is the client-server terminology we’ll be using throughout
the remainder of this chapter.

3.5.6 The ImageZMQ Client

Let’s go ahead and implement our ImageZMQ client inside of client.py:

1 # import the necessary packages


2 from imutils.video import VideoStream
3 import imagezmq
4 import argparse
5 import socket
6 import time
7
8 # construct the argument parser and parse the arguments
9 ap = argparse.ArgumentParser()
10 ap.add_argument("-s", "--server-ip", required=True,
11 help="ip address of the server to which the client will connect")
12 args = vars(ap.parse_args())
13
14 # initialize the ImageSender object with the socket address of the
15 # server
16 sender = imagezmq.ImageSender(connect_to="tcp://{}:5555".format(
17 args["server_ip"]))

We begin with our imports and command line arguments. On Line 3 we import imagezmq.

We have one command line argument, the --server-ip which is the server’s IP address
or hostname.

Lines 16 and 17 initialize our ImageSender as sender, connecting to port 5555 on the server (the port is hardcoded in the connection string).

The --server-ip is passed as a command line argument to the connect_to parameter.

Moving on, we’ll initialize our video stream and start sending frames:

19 # get the host name, initialize the video stream, and allow the
20 # camera sensor to warmup
21 rpiName = socket.gethostname()

22 #vs = VideoStream(src=0).start()
23 vs = VideoStream(usePiCamera=True).start()
24 time.sleep(2.0)
25
26 # loop over frames from the camera
27 while True:
28 # read the frame from the camera and send it to the server
29 frame = vs.read()
30 sender.send_image(rpiName, frame)

We grab the hostname of the client on Line 21 – refer back to Section 3.5.4 to learn how to
set each of your Raspberry Pi hostnames.

Our video stream is launched on Line 23.

Lines 27-30 start our infinite while loop to both read a frame from the VideoStream and
then send it with send_image. We pass the rpiName string (the hostname) as well as the
frame itself.

Jeff Bass’ ImageZMQ library takes care of the rest for us.

With our 30-line client implemented, let’s move on to the server.

3.5.7 The ImageZMQ Server

The server is equally easy to implement thanks to Jeff’s API.

Go ahead and open server.py and insert the following lines of code:

1 # import the necessary packages


2 from imagezmq import imagezmq
3 import imutils
4 import cv2
5
6 # initialize the ImageHub object
7 imageHub = imagezmq.ImageHub()

We begin by importing packages. Line 2 imports imagezmq.

From there, Line 7 initializes our imageHub. The imageHub manages connections to our
clients.

Now let’s loop over incoming frames from our client(s):

9 # start looping over all the frames


10 while True:

11 # receive RPi name and frame from the RPi and acknowledge
12 # the receipt
13 (rpiName, frame) = imageHub.recv_image()
14 imageHub.send_reply(b'OK')
15 print("[INFO] receiving data from {}...".format(rpiName))

Line 13 receives a tuple containing a client’s rpiName and frame.

We will use this unique rpiName (i.e. the client’s hostname) as our GUI window name. We
will also annotate the frame itself with the rpiName. This allows us to have many clients each
with their own GUI window on our server. Take the time now to set each of the Pis around your
house with a unique hostname following the guide in Section 3.5.4.

Line 14 sends an “ack” (acknowledgement) message back to the client that we received the
frame successfully.

The final code block processes and displays our frame:

17 # resize the frame to have a maximum width of 400 pixels


18 frame = imutils.resize(frame, width=400)
19
20 # draw the sending device name on the frame
21 cv2.putText(frame, rpiName, (10, 25),
22 cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 2)
23
24 # display and capture keypresses
25 cv2.imshow(rpiName, frame)
26 key = cv2.waitKey(1) & 0xFF
27
28 # if the `q` key was pressed, break from the loop
29 if key == ord("q"):
30 break
31
32 # do a bit of cleanup
33 cv2.destroyAllWindows()

Line 18 resizes the frame — resizing is optional.

Lines 21 and 22 annotate the frame with the rpiName in the top left corner.

Line 25 then creates a GUI window based on the rpiName with the frame inside of it.

We handle the q keypress via Lines 26-30. When we break out of the loop, all windows
are destroyed.

Our server couldn’t be any easier!



3.5.8 Results of Streaming Video with ImageZMQ

Let’s put ImageZMQ to the test.

For this section, grab one or more Raspberry Pis along with your laptop. Your laptop will be
the server. Each RPi will be a client.

ImageZMQ must be installed on the server and all clients (refer to Section 3.5.3).

Once the clients and server are ready to go, we’ll first start the server:

$ # on your laptop
$ python server.py

From there, go ahead and start just one client (you can do so via SSH so you don’t have to
break out a screen/keyboard/mouse on the Pi-side):

$ # on the pi
$ python client.py --server-ip 192.168.1.113

In Figure 3.7 we go ahead and launch our ImageZMQ server. Once the server is running,
you can start each client (Figure 3.8). It is recommended to start a screen session on your
clients if you connect via SSH. Screen sessions persist even when the SSH connection drops.

Your server will begin displaying frames as connections are made from each RPi as shown
in Figure 3.9. As you add more client connections to the server, additional windows will appear.

For a full demo, please refer to this screencast: http://pyimg.co/p86ek.

Notice that each OpenCV window on the host is named via the hostname of the RPi. The
hostname of the RPi is also annotated in the top left corner of each OpenCV frame. To learn
how to configure custom hostnames for your RPis, refer to Section 3.5.4. It is essential to
configure the hostname of each RPi as it represents the unique identifier of the RPi as
the code is written. Another option would be to query and send either the MAC or IP address
on the client side to the server.

Once you start streaming, you’ll notice that the latency is quite low and the image quality is
perfect (we didn’t compress the images). Be sure to see the next section about the impact of
your network hardware/technology on performance.

Figure 3.7: Start the ImageZMQ server so that clients have a server to connect to. The server
should run on a desktop/laptop and it will be responsible for displaying frames (and any image
processing done on the server-side). Note: High resolution image can be found here for easier
viewing: http://pyimg.co/frt0a

3.5.9 Factors Impacting ImageZMQ Performance

There are a number of factors which will impact performance (i.e. frames per second through-
put rate) for network video streaming projects. Ask yourself the following questions and keep
in mind that most of them are inter-related.

How many clients are connected?

The more clients, the longer it will be until each client frame is displayed.

The example in this chapter processed clients serially. Multithreading/multiprocessing will help if you want to manage connections and display video frames in parallel.

How large is the frame you are sending?

By default, the imutils.video module paired with the PiCamera will send 320x240
frames. If you send larger frames, your FPS will go down. Full resolution frames require
more data packets to get to the server.

If you have a high resolution USB webcam and forget to manually insert a frame = imutils.resize(frame, width=400) call (substitute 400 for a width of your choosing) between Lines 28 and 29, you’ll be sending full resolution HD frames. HD frames will quite literally choke your streaming applications unless you apply compression. Compression is certainly possible but is outside the scope of this book.

Figure 3.8: After the ImageZMQ server is running, launch each client while providing the
--server-ip via command line argument. The clients can run in SSH screen sessions so
that they continue to run even when you close the terminal window. Note: High resolution image
can be found here for easier viewing: http://pyimg.co/3yuej

What server processing are you doing?

If the server is only displaying frames, the overall FPS will be quite high. However, if you
are running a complex computer vision or deep learning pipeline, performance will definitely
be impacted.

What client-side processing is happening?

In general, you can take the load off the client and just treat the client as an IP camera. But
you may wish to do some of the frame processing on the Pi (or other client) itself to take a load
off the server. Be creative!

Do you need to send every frame to the server?



Figure 3.9: Server will begin to display the frames coming from each client in separate windows.
Note: High resolution image can be found here for easier viewing: http://pyimg.co/3yuej

Maybe you only need to send frames containing motion to the server for a security application. Try implementing the idea of only sending motion frames to the server using this chapter’s example code. Simply apply background subtraction on the clients (see Chapter 4, Section 4.3.9 of the Hobbyist Bundle for an example of motion detection). You may wish to send a “keep alive” frame every so often.
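
Here is a minimal sketch of what that client-side gating could look like. The accumulation weight, pixel-count threshold, and 10 second keep-alive interval are arbitrary values chosen for illustration, and YOUR_SERVER_IP is a placeholder for your server’s IP or hostname:

# client-side motion gating: only send frames that contain motion
import socket
import time
import cv2
import imagezmq
from imutils.video import VideoStream

# connect to the ImageZMQ server (placeholder IP) and start the camera
sender = imagezmq.ImageSender(connect_to="tcp://YOUR_SERVER_IP:5555")
rpiName = socket.gethostname()
vs = VideoStream(usePiCamera=True).start()
time.sleep(2.0)

avg = None
lastSent = 0

while True:
    # read and prepare the current frame
    frame = vs.read()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (7, 7), 0)

    # lazily initialize, then update, the running-average background model
    if avg is None:
        avg = gray.astype("float")
        continue
    cv2.accumulateWeighted(gray, avg, 0.1)

    # count how many pixels differ substantially from the background
    delta = cv2.absdiff(gray, cv2.convertScaleAbs(avg))
    thresh = cv2.threshold(delta, 25, 255, cv2.THRESH_BINARY)[1]
    motion = cv2.countNonZero(thresh) > 500

    # send the frame when motion occurs, or as a keep-alive every 10 seconds
    if motion or time.time() - lastSent > 10:
        sender.send_image(rpiName, frame)
        lastSent = time.time()

The server from this chapter would work unchanged; it would simply receive frames less often.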

What WiFi technology does your network support?

Discussing wireless technology is outside the scope of the book. That said, keep the fol-
lowing points in mind:

i. 802.11b is slow.

ii. 802.11g is slow.

iii. 802.11ac is fast.

iv. 802.11ac 80MHz bandwidth is even faster.

v. 5GHz (higher frequency) is good for speed and short distances.

vi. 2.4GHz (lower frequency) is good for longer ranges, but is slower than 5GHz.

What WiFi technology does your Raspberry Pi support?

Be sure to look up the specifications. In general, here’s what you need to know:

• The RPi Zero W only supports 2.4GHz.

• The RPi 3B doesn’t support 802.11ac.

• The RPi 3B+ and RPi 4B support 802.11ac.

Similarly, check the network specifications of your laptop/desktop.

Do you have a mix of fast and slow clients on your network?

There’s no problem with mixing clients unless you have an issue with speed. If you are
using serial processing (as we demonstrated in this chapter), then an RPi Zero W could be a
bottleneck, for example.

3.6 Summary

In this chapter we learned about message passing via sockets, ZMQ, and ImageZMQ.

The Raspberry Pi is a true Internet of Things (IoT) device. You should use its communication
capabilities to your advantage when designing an interconnected computer vision system.

In the next chapter, we’ll build a home security system with a central server that processes
frames from multiple Pis around your home to find people and other objects with a deep learn-
ing object detector.

Flip the page to continue learning more about ImageZMQ with a real world example.
Chapter 4

Advanced Security with YOLO Object Detection

In Chapter 14 of the Hobbyist Bundle you learned how to build a basic home security applica-
tion using motion detection/background subtraction.

In this chapter we are going to extend the basic system to include more advanced computer
vision algorithms, including deep learning, object detection, and the ability to monitor specific
zones.

Additionally, you’ll also learn how to stream frames directly from your Raspberry Pi to a
central host (such as a laptop/desktop) where more computationally expensive operations can
be performed.

4.1 Chapter Learning Objectives

In this chapter you will:

i. Learn about YOLO, a deep learning-based object detector.

ii. Efficiently apply a cascade of background subtraction and object detection (to save CPU
cycles).

iii. Define unauthorized security zones and monitor them for intruders.

iv. Use ImageZMQ to stream frames from a Pi to a central computer (i.e., your laptop, desk-
top, or GPU rig) running YOLO.


4.2 Object Detection with YOLO and Deep Learning

When it comes to deep learning-based object detection, there are three primary object detec-
tors you’ll encounter:

• R-CNNs and their variants, including the original R-CNN, Fast R-CNN, and Faster R-CNN
[21, 22, 23].

• Single Shot Detectors (SSDs) [24]

• YOLO [25, 26, 27]

R-CNNs are one of the first deep learning-based object detectors and are an example of a
two-stage detector. In the original R-CNN publication, Girshick et al. proposed:

i. Using an algorithm such as Selective Search [28] (or equivalent) to propose candidate
bounding boxes that could contain objects.

ii. These regions were then passed into a CNN for classification.

The problem with the standard R-CNN method was that it was painfully slow and not a
complete end-to-end object detector. Two followup papers were published (Fast R-CNN and
Faster R-CNN, respectively), leading to a complete end-to-end trainable object detector that
automatically proposed regions of an image that could potentially contain objects.

While R-CNNs did tend to be very accurate, the biggest problem was their speed —
they were incredibly slow, obtaining only 5 FPS on a GPU.

To help increase the speed of deep learning-based object detectors, both Single Shot De-
tectors (SSDs) and YOLO use a one-stage detector strategy. These algorithms treat object
detection as a regression problem, taking a given input image and simultaneously learning
bounding box coordinates and corresponding class label probabilities.

In general, single-stage detectors tend to be less accurate than two-stage detectors, but
are significantly faster.

Since the original YOLO publication the framework has gone through a number of itera-
tions, including YOLO9000 and YOLOv2 [26]. The performance of both these updates was
a bit underwhelming and it wasn’t until the 2018 publication of YOLOv3 [27] that prediction
performance improved.

We’ll be using YOLOv3 for this chapter but feel free to swap out the model for another
object detection method — you are not limited to just YOLO. As long as your object detector
can produce bounding box coordinates, you can use it in this chapter as a starting point for
your own projects.

4.3 An Overview of Our Security Application

Figure 4.1: Flowchart of steps when building our RPi home security system. First, our client
RPi reads a frame from its camera sensor. The frame is sent (via ImageZMQ) to a server for
processing. The server checks to see if motion has occurred in the frame, and if so, applies YOLO
object detection. A video clip containing the action is then generated and saved to disk.

Before we get started building our deep learning-based security application, let’s first ensure
we understand the general steps (Figure 4.1). You’ll note that our Raspberry Pi is meant to be
a “camera only” — this means that the RPi will not be performing any on-board processing.
Therefore, the RPi will be responsible for capturing frames and then sending them to a more
powerful server for additional processing.

The server will apply a computationally expensive object detector to locate objects in our
video feed. However, our server will utilize a two-stage process, called a cascade (not to be
confused with Haar cascade object detectors):

i. First, we monitor the frame for motion.

ii. If, and only if, motion is found, we then apply the YOLO object detector.

Even on a laptop/desktop CPU, the YOLO object detector is very computationally expensive
— we should only apply it when we need to. Since a scene without motion is presumed not to
have any objects we’re interested in, it doesn’t make sense to apply YOLO to an area that has
no objects!

Once motion is detected we’ll run YOLO and continue to monitor the video feed. If a person,
animal, or any other object we define enters any unauthorized zone, we’ll record a video clip of
the intrusion.

Remark. These “unauthorized zones” must be provided before we launch our security appli-
cation. You’ll learn how to find these coordinates inside Section 4.6.

4.4 Building a Security Application with Deep Learning

In this section we will implement our actual security application. This system will follow the
flowchart depicted and described in Section 4.3.

4.4.1 Project Structure

Before we start writing any code let’s first take a look at our directory structure for the project:

|-- config
| |-- config.json
|-- output
|-- pyimagesearch
| |-- keyclipwriter
| | |-- __init__.py
| | |-- keyclipwriter.py
| |-- motion_detection
| | |-- __init__.py
| | |-- singlemotiondetector.py
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
| |-- parseyolooutput.py
|-- yolo-coco
| |-- coco.names
| |-- yolov3.cfg
| |-- yolov3.weights
|-- client.py
|-- server.py

Inside the config/ directory we will store our config.json file, containing our various
configurations.

The pyimagesearch module contains four important Python files:

• keyclipwriter.py: Used to write key events to disk (i.e., video clips).

• singlemotiondetector.py: Our motion detection implementation from Chapter 14 of the Hobbyist Bundle.

• conf.py: Used to load our JSON configuration file.



• parseyolooutput.py: Helper class used to parse the output of the YOLO object de-
tection network.

The yolo-coco/ directory contains the YOLO object detector (pre-trained on COCO).
This object detector is capable of recognizing 80 common object classes, including people,
cars/trucks, animals (dogs, cats, etc.).

The client.py will run on our Raspberry Pi — this script will be responsible for capturing frames from the RPi camera and then streaming them back to server.py, which applies our actual computer vision pipeline.

4.4.2 Our Configuration File

With our directory structure reviewed, let’s now take a look at config.json:

1 {
2 // number of frames required to construct a reasonable background
3 // model
4 "frame_count": 32,
5
6 // path to yolo weights and labels
7 "yolo_path": "yolo-coco",
8
9 // minimum confidence to filter out weak detections
10 "confidence": 0.5,
11
12 // non-maxima suppression threshold
13 "threshold": 0.3,

The "frame_count" defines the minimum number of required frames in order to construct
a model for background subtraction. If you recall from Chapter 14 of the Hobbyist Bundle,
we must first ensure that our background subtractor knows what the background “looks like”,
thereby allowing it to detect when motion takes place in the scene.

The "yolo_path" configuration points to the yolo-coco/ directory where the pre-trained
YOLO model lives. This directory includes the YOLO model architecture, pre-trained weights,
and a .names file with the names of the labels YOLO can detect.

Our "confidence" is the minimum required probability to filter out weak detections. Any
given detection must have a predicted probability larger than "confidence", otherwise we’ll
throw out the detection, assuming it’s a false-positive.

The "threshold" is used for non-maxima suppression (NMS) [29, 30]. NMS is used to
suppress overlapping bounding boxes, collapsing the boxes into a single object. You can learn
more about NMS in the following tutorial: http://pyimg.co/fz1ak.
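
To make the effect of this threshold concrete, here is a toy example with made-up boxes and confidences (not part of the project’s code):

# toy illustration of non-maxima suppression
import cv2

# two nearly identical boxes plus one separate box, in (x, y, w, h) format
boxes = [[10, 10, 100, 100], [12, 12, 100, 100], [300, 300, 50, 50]]
confidences = [0.9, 0.8, 0.7]

# keep detections with confidence >= 0.5 and suppress boxes whose overlap
# with a higher-confidence box exceeds 0.3
idxs = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.3)
print(idxs.flatten())   # [0 2] -- the second box is suppressed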

Let’s now define our unauthorized zones and the objects we want to look for:

15 // coordinates of unauthorized zones in (x1, y1, x2, y2) format


16 "unauthorized_zone": [[0, 0, 100, 200], [250, 0, 375, 200]],
17
18 // list of classes being considered in the surveillance (all
19 // other classes the object detector was trained on will be
20 // ignored)
21 "consider": ["person", "car", "truck", "dog", "knife", "cat",
22 "bird", "bicycle", "motorbike"],

Line 16 initializes "unauthorized_zone", a list of bounding box coordinates defining the unauthorized zones in our stream. You can learn how to define your own bounding boxes for unauthorized zones in Section 4.6.

Lines 21 and 22 construct the list of objects we want to look for in our scene. Here we have supplied a list of common objects you’ll likely want to monitor for in a security application; however, feel free to add or remove any of the object classes from the yolo-coco/coco.names file if you wish.

Our final set of configurations handle writing video clips to disk:

24 // output video codec, output video FPS, and path to the output
25 // videos
26 "codec": "MJPG",
27 "fps": 10,
28 "output_path": "output",
29
30 // key clip writer buffer size
31 "buffer_size": 32
32 }

The "codec" controls our video codec while the "fps" dictates the playback frame rate of
the video path stored in the "output_path" directory.

We’ll allow a total of 32 frames to be stored in our "buffer_size", enabling us to write


these frames back to disk when an unauthorized zone breach occurs.

4.4.3 Implementing the Client

Our client is a Raspberry Pi. The Pi will be responsible for one task, and one task only —
capturing frames from a video stream and then sending those frames to our server for
processing.

Let’s implement the client now. Open up client.py and insert the following code:

1 # import the necessary packages


2 from imutils.video import VideoStream
3 import imagezmq
4 import argparse
5 import socket
6 import time
7
8 # construct the argument parser and parse the arguments
9 ap = argparse.ArgumentParser()
10 ap.add_argument("-s", "--server-ip", required=True,
11 help="ip address of the server to which the client will connect")
12 args = vars(ap.parse_args())
13
14 # initialize the ImageSender object with the socket address of the
15 # server
16 sender = imagezmq.ImageSender(connect_to="tcp://{}:5555".format(
17 args["server_ip"]))

Lines 2-6 import our required Python packages while Lines 9-12 parse our command line
arguments. Only a single argument is required, --server-ip, which is the IP address of the
server running the ImageZMQ hub.

Be sure to refer to Chapter 3 to learn how to use ImageZMQ.

Lines 16 and 17 then initialize the ImageSender used to send frames from our video
stream via ImageZMQ to our central server.

Overall, this code should look very similar to the ImageZMQ streaming code from the previous chapter (whereas the plain ZMQ example in that chapter sent text content, here we are sending frames).

The next step is to launch our VideoStream:

19 # get the host name, initialize the video stream, and allow the
20 # camera sensor to warmup
21 rpiName = socket.gethostname()
22 #vs = VideoStream(src=0).start()
23 vs = VideoStream(usePiCamera=True).start()
24 time.sleep(2.0)
25
26 # start looping over all the frames
27 while True:
28 # read the frame from the pi camera and send it to the server
29 frame = vs.read()
30 sender.send_image(rpiName, frame)

On Line 21 we grab the hostname of the RPi — we’ll eventually extend this implementation to handle multiple RPis streaming to a single server (Chapter 8). When we do extend our implementation we need each RPi to have a unique ID. Grabbing the hostname ensures each RPi is uniquely identified. To set the hostname of your Raspberry Pi, refer to Section 3.5.4.

Remark. Be sure to refer to the book’s companion website to learn how to set the hostname of
your Raspberry Pi. A link to access the companion website can be found in the first few pages
of this text.

Line 27 starts a while loop that loops over frames from our VideoStream. We then call
the send_image method to send the frame to the central server.

4.4.4 Implementing the Server

Now that we’ve created the client, let’s implement the server. Open up server.py and insert
the following code:

1 # import the necessary packages


2 from pyimagesearch.motion_detection import SingleMotionDetector
3 from pyimagesearch.parseyolooutput import ParseYOLOOutput
4 from pyimagesearch.keyclipwriter import KeyClipWriter
5 from pyimagesearch.utils.conf import Conf
6 from datetime import datetime
7 import numpy as np
8 import imagezmq
9 import argparse
10 import imutils
11 import cv2
12 import os

Lines 2-12 handle importing our required Python packages, but most notably, take a look
at Lines 2-4 — these imports will facilitate building the computer vision pipeline discussed in
Section 4.3 earlier in this chapter.

As we work through this script keep in mind that our security application consists of two
stages:

i. Stage #1: First perform the less computationally expensive background subtraction/mo-
tion detection to the frame.

ii. Stage #2: If, and only if, motion is found, utilize the more computationally expensive
YOLO object detector.

In order to determine if an object has entered an unauthorized zone, we need to define an overlap function:

14 def overlap(rectA, rectB):


15 # check if x1 of rectangle A is greater than x2 of rectangle B or if
16 # x2 of rectangle A is less than x1 of rectangle B, if so, then
17 # both of them do not overlap and return False
18 if rectA[0] > rectB[2] or rectA[2] < rectB[0]:
19 return False
20
21 # check if y1 of rectangle A is greater than y2 of rectangle B or if
22 # y2 of rectangle A is less than y1 of rectangle B, if so, then
23 # both of them do not overlap and return False
24 if rectA[1] > rectB[3] or rectA[3] < rectB[1]:
25 return False
26
27 # otherwise the two rectangles overlap and hence return True
28 return True

The overlap method requires that we supply the bounding box coordinates of two rectan-
gles, rectA and rectB, respectively. Line 18 checks to see if:

• The first x-coordinate of rectA is greater than the second x-coordinate of rectB.

• The second x-coordinate of rectA is less than the first x-coordinate of rectB.

If any of these cases hold, then the two rectangles do not overlap.

Line 24 makes a similar check, only this time for the respective y-coordinates. Again, if the check passes, then the two rectangles do not overlap. Finally, if both checks fail then we know that the rectangles do overlap and return True.
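
As a quick sanity check, here is how overlap behaves with the first zone from our configuration file and a made-up detection box in (x1, y1, x2, y2) format:

# hypothetical detection box tested against the two configured zones
print(overlap((50, 150, 120, 220), (0, 0, 100, 200)))     # True  -- the boxes intersect
print(overlap((50, 150, 120, 220), (250, 0, 375, 200)))   # False -- the box lies entirely to the left of this zone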

Next, let’s parse our command line arguments and perform a few initializations:

30 # construct the argument parser and parse the arguments


31 ap = argparse.ArgumentParser()
32 ap.add_argument("-c", "--conf", required=True,
33 help="Path to the input configuration file")
34 args = vars(ap.parse_args())
35
36 # load the configuration file and initialize the ImageHub object
37 conf = Conf(args["conf"])
38 imageHub = imagezmq.ImageHub()
39
40 # initialize the motion detector, the total number of frames read
41 # thus far, and the spatial dimensions of the frame
42 md = SingleMotionDetector(accumWeight=0.1)
43 total = 0
44 (W, H) = (None, None)

Lines 31-34 parse our command line arguments. We only need a single argument, --conf,
the path to our configuration file. Lines 37 and 38 load our conf and initialize the ImageHub.

Lines 42-44 then instantiate the motion detector and initialize the total number of frames read thus far and the frame width and height, respectively.

We can now load the YOLO object detector from disk:

46 # load the COCO class labels our YOLO model was trained on
47 labelsPath = os.path.sep.join([conf["yolo_path"], "coco.names"])
48 LABELS = open(labelsPath).read().strip().split("\n")
49
50 # initialize a list of colors to represent each possible class label
51 np.random.seed(42)
52 COLORS = np.random.randint(0, 255, size=(len(LABELS), 3),
53 dtype="uint8")
54
55 # derive the paths to the YOLO weights and model configuration
56 weightsPath = os.path.sep.join([conf["yolo_path"], "yolov3.weights"])
57 configPath = os.path.sep.join([conf["yolo_path"], "yolov3.cfg"])
58
59 # load our YOLO object detector trained on COCO dataset (80 classes)
60 # and determine only the *output* layer names that we need from YOLO
61 print("[INFO] loading YOLO from disk...")
62 net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)
63 ln = net.getLayerNames()
64 ln = [ln[i[0] - 1] for i in net.getUnconnectedOutLayers()]
65
66 # initialize the YOLO output parsing object
67 pyo = ParseYOLOOutput(conf)

Here we construct the path to the COCO class labels file (i.e., the list of objects YOLO can detect). We then load this file into the LABELS list. We also derive random COLORS that we can use to visualize each label (Lines 47-53).

In order to load YOLO from disk we must first derive the paths to the weights and model
configuration (Lines 56 and 57). Given these paths we can load YOLO from disk on Line 62.
We also determine the output layer names from YOLO on Lines 63 and 64, enabling us to
extract the object predictions from the network.

Line 67 instantiates ParseYOLOOutput, used to actually parse the output of the network.
We’ll implement that class in Section 4.4.4.1.

Let’s start looping over frames received from our Raspberry Pi via ImageZMQ:

69 # initialize key clip writer and the consecutive number of


70 # frames that have *not* contained any action
71 kcw = KeyClipWriter(bufSize=conf["buffer_size"])

72 consecFrames = 0
73 print("[INFO] starting advanced security surveillance...")
74
75 # start looping over all the frames
76 while True:
77 # receive RPi name and frame from the RPi and acknowledge
78 # the receipt
79 (rpiName, frame) = imageHub.recv_image()
80 imageHub.send_reply(b'OK')
81
82 # resize the frame, convert it to grayscale, and blur it
83 frame = imutils.resize(frame, width=400)
84 gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
85 gray = cv2.GaussianBlur(gray, (7, 7), 0)
86
87 # grab the current timestamp and draw it on the frame
88 timestamp = datetime.now()
89 cv2.putText(frame, timestamp.strftime(
90 "%A %d %B %Y %I:%M:%S%p"), (10, frame.shape[0] - 10),
91 cv2.FONT_HERSHEY_SIMPLEX, 0.35, (0, 0, 255), 1)
92
93 # if we do not already have the dimensions of the frame,
94 # initialize it
95 if H is None and W is None:
96 (H, W) = frame.shape[:2]

Line 71 instantiates our KeyClipWriter so we can write video clips of intruders to disk.

We then start looping over frames from the RPi on Line 76.

Lines 79 and 80 grab the latest frame from the RPi and then acknowledge receipt of the
frame. We preprocess the frame by resizing it to have a width of 400px, convert it to grayscale,
and then blur it to reduce noise (Lines 83-85).

We also draw the current timestamp on the frame (Lines 88-91) and grab the frame di-
mensions (Lines 95 and 96).

In order to apply motion detection we must first ensure that we have sufficiently modeled
the background, assuming that no motion has taken place during the first "frame_count"
frames:

98 # if the total number of frames has reached a sufficient


99 # number to construct a reasonable background model, then
100 # continue to process the frame
101 if total > conf["frame_count"]:
102 # detect motion in the frame and set the update consecutive
103 # frames flag as True
104 motion = md.detect(gray)
105 updateConsecFrames = True
106

107 # if the motion object is not None, then motion has


108 # occurred in the image
109 if motion is not None:
110 # set the update consecutive frame flag as false and
111 # reset the number of consecutive frames with *no* action
112 # to zero
113 updateConsecFrames = False
114 consecFrames = 0
115
116 # if we are not already recording, start recording
117 if not kcw.recording:
118 # store the day's date and check if output directory
119 # exists, if not, then create
120 date = timestamp.strftime("%Y-%m-%d")
121 os.makedirs(os.path.join(conf["output_path"], date),
122 exist_ok=True)
123
124 # build the output video path and start recording
125 p = "{}/{}/{}.avi".format(conf["output_path"], date,
126 timestamp.strftime("%H%M%S"))
127 kcw.start(p, cv2.VideoWriter_fourcc(*conf["codec"]),
128 conf["fps"])

Line 101 checks to ensure that we have received a sufficient number of frames to build
an adequate background model. Provided we have, we attempt to detect motion in the gray
frame (Line 104).

Line 109 checks to see if motion has been found. If motion has been found and we are not recording (Line 117), we create an output directory, build the output video file path, and start recording (Lines 120-128).

Since we know motion has taken place, we can now apply the YOLO object detector:

130 # construct a blob from the input frame and then perform
131 # a forward pass of the YOLO object detector, giving us
132 # our bounding boxes and associated probabilities
133 blob = cv2.dnn.blobFromImage(frame, 1 / 255.0,
134 (416, 416), swapRB=True, crop=False)
135 net.setInput(blob)
136 layerOutputs = net.forward(ln)
137
138 # parse YOLOv3 output
139 (boxes, confidences, classIDs) = pyo.parse(layerOutputs,
140 LABELS, H, W)
141
142 # apply non-maxima suppression to suppress weak,
143 # overlapping bounding boxes
144 idxs = cv2.dnn.NMSBoxes(boxes, confidences,
145 conf["confidence"], conf["threshold"])

Lines 133-136 construct a blob from the input frame and then pass it through the YOLO
object detector.

Given the layerOutputs we parse them on Lines 139 and 140 and then apply non-
maxima suppression to suppress weak, overlapping bounding boxes, keeping only the most
confident ones. NMS also ensures that we do not have any redundant or extraneous bounding
boxes in our results. If you’re interested, you can learn more about NMS, including how the
underlying algorithm works, in the following tutorial: http://pyimg.co/fz1ak [29].

Moving on, let’s first ensure that at least one detection was found in the frame:

147 # ensure at least one detection exists


148 if len(idxs) > 0:
149 # loop over the indexes we are keeping
150 for i in idxs.flatten():
151 # extract the bounding box coordinates
152 (x, y) = (boxes[i][0], boxes[i][1])
153 (w, h) = (boxes[i][2], boxes[i][3])
154
155 # loop over all the unauthorized zones
156 for zone in conf["unauthorized_zone"]:
157 # store the coordinates of the detected
158 # object in (x1, y1, x2, y2) format
159 obj = (x, y, x + w, y + h)
160
161 # check if there is NOT a overlap between the
162 # object and the zone, if so, then skip this
163 # iteration
164 if not overlap(obj, zone):
165 continue
166
167 # otherwise there is overlap between the
168 # object and the zone
169 else:
170 # draw a bounding box rectangle and label on the
171 # frame
172 color = [int(c) for c in COLORS[classIDs[i]]]
173 cv2.rectangle(frame, (x, y), (x + w, y + h),
174 color, 2)
175 text = "{}: {:.4f}".format(LABELS[classIDs[i]],
176 confidences[i])
177 y = (y - 15) if (y - 15) > 0 else h - 15
178 cv2.putText(frame, text, (x, y),
179 cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
180
181 # break out of the loop since the object
182 # overlaps with at least one zone
183 break

Provided at least one detection was found we can start looping over the indexes of the objects kept by NMS (Line 150).

Lines 152 and 153 extract the bounding box coordinates of the current bounding box. Given
the bounding box coordinates we need to check and see if they overlap with an unauthorized
zone. To accomplish this task we start looping over the unauthorized zones on Line 156 and
construct a bounding box for the zone (Line 159).

Lines 164 and 165 check to see if there is no overlap between the two boxes. If so, we
continue looping over the unauthorized zone coordinates.

Otherwise, there is an overlap between the object and the zone (Line 169), so we draw both the label text and the bounding box on the frame (Lines 172-179). If you wish to raise an alarm or send a text message alert, for example, this else code block is where you would want to insert your logic.
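
For example, if you wanted a text message alert at that point, a short sketch using the Twilio Python helper library might look like the following. The account SID, auth token, and phone numbers are placeholders you would replace with your own (the Chapter 5 project wraps similar logic in its twilionotifier.py helper):

# hedged sketch only: send an SMS alert via Twilio when a zone is breached
from twilio.rest import Client

# placeholder credentials -- substitute your own Twilio account values
accountSID = "YOUR_ACCOUNT_SID"
authToken = "YOUR_AUTH_TOKEN"

client = Client(accountSID, authToken)
client.messages.create(
    to="+15551234567",       # your cell phone number (placeholder)
    from_="+15557654321",    # your Twilio number (placeholder)
    body="Intruder detected in an unauthorized zone!")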

We can now visualize the unauthorized zones themselves on the frame:

185 # loop over the unauthorized zones


186 for (x1, y1, x2, y2) in conf["unauthorized_zone"]:
187 # draw the zone on the frame
188 cv2.rectangle(frame, (x1, y1), (x2, y2), (255, 0, 0), 2)
189
190 # otherwise, no action has taken place in this frame, so
191 # increment the number of consecutive frames that contain
192 # no action
193 if updateConsecFrames:
194 consecFrames += 1
195
196 # update the key frame clip buffer
197 kcw.update(frame)
198
199 # if we are recording and reached a threshold on consecutive
200 # number of frames with no action, stop recording the clip
201 if kcw.recording and consecFrames == conf["buffer_size"]:
202 kcw.finish()

Lines 186-188 loop over each of the unauthorized zones and draw them on our frame, enabling us to visualize them.

Provided that no motion has taken place, Lines 193 and 194 update our consecFrames
count.

Line 197 updates the key clip writer while Lines 201 and 202 check to see if we should
stop recording.

Finally, let’s update our background model and then display the output frame to our screen:

204 # update the background model and increment the total number
205 # of frames read thus far

206 md.update(gray)
207 total += 1
208
209 # show the frame
210 cv2.imshow("{}".format(rpiName), frame)
211 key = cv2.waitKey(1) & 0xFF
212
213 # if the `q` key was pressed, break from the loop
214 if key == ord("q"):
215 break
216
217 # if we are in the middle of recording a clip, wrap it up
218 if kcw.recording:
219 kcw.finish()
220
221 # do a bit of cleanup
222 cv2.destroyAllWindows()

At this point we’re almost finished — we have one more Python class to define.

4.4.4.1 Parsing YOLO Output

Inside the server.py script we utilized a class named ParseYOLOOutput — let’s define that
class now.

Open up the parseyolooutput.py file and insert the following code:

1 # import the necessary package


2 import numpy as np
3
4 class ParseYOLOOutput:
5 def __init__(self, conf):
6 # store the configuration file
7 self.conf = conf

On Line 5 we create the constructor to the class. We only need a single argument, our
configuration, which we store on Line 7.

Next, let’s define the parse method:

9 def parse(self, layerOutputs, LABELS, H, W):


10 # initialize our lists of detected bounding boxes,
11 # confidences, and class IDs, respectively
12 boxes = []
13 confidences = []
14 classIDs = []

The parse method requires that we supply four parameters to the function:

• layerOutputs: The output of making predictions with the YOLO object detector.

• LABELS: The list of class labels YOLO was trained to detect.

• H: The height of the frame.

• W: The width of the frame.

We then initialize three lists used to store the detected bounding box coordinates, confi-
dences (i.e., predicted probabilities), and class label IDs. We can now start populating these
lists by looping over the layerOutputs:

16 # loop over each of the layer outputs


17 for output in layerOutputs:
18 # loop over each of the detections
19 for detection in output:
20 # extract the class ID
21 scores = detection[5:]
22 classID = np.argmax(scores)
23
24 # check if the class detected should be considered,
25 # if not, then skip this iteration
26 if LABELS[classID] not in self.conf["consider"]:
27 continue

Inside this block we start looping over each of the layerOutputs.

For each of the layerOutputs we then loop over each of the detections. Lines 21 and
22 extract the predicted class ID with the largest corresponding predicted probability. We then
make a check on Line 26 to ensure that the class label exists in the set of class labels we want
to "consider" (defined in Section 4.4.2) — if not, we continue looping over detections.

Otherwise, we have found a class label we are interested in so let’s process it:

29 # retrieve the confidence (i.e., probability) of the


30 # current object detection
31 confidence = scores[classID]
32
33 # filter out weak predictions by ensuring the
34 # detected probability is greater than the minimum
35 # probability
36 if confidence > self.conf["confidence"]:
37 # scale the bounding box coordinates back
38 # relative to the size of the image, keeping in
39 # mind that YOLO actually returns the center

40 # (x, y)-coordinates of the bounding box followed


41 # by the boxes' width and height
42 box = detection[0:4] * np.array([W, H, W, H])
43 box = box.astype("int")
44 (centerX, centerY, width, height) = box
45
46 # use the center (x, y)-coordinates to derive the
47 # top and left corner of the bounding box
48 x = int(centerX - (width / 2))
49 y = int(centerY - (height / 2))
50
51 # update our list of bounding box coordinates,
52 # confidences, and class IDs
53 boxes.append([x, y, int(width), int(height)])
54 confidences.append(float(confidence))
55 classIDs.append(classID)
56
57 # return the detected boxes and their corresponding
58 # confidences and class IDs
59 return (boxes, confidences, classIDs)

Line 31 extracts the probability for the class label from the scores list — we then ensure
that a minimum confidence is met on Line 36. Ensuring that a prediction meets a minimum
predicted probability helps filter out false-positive detections.

Line 42 scales the bounding box coordinates back relative to the size of the original image.

Lines 43 and 44 extract the dimensions of the bounding box, while keeping in mind that
YOLO returns bounding box coordinates in the following order: (centerX, centerY, width,
height). We use this information to derive the top-left (x, y)-coordinates of the bounding box
(Lines 48 and 49).
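
To make the scaling and center-to-corner conversion concrete, here is a small numeric example with made-up values:

import numpy as np

# made-up values for illustration
(W, H) = (400, 300)
detection = np.array([0.5, 0.5, 0.2, 0.4])    # normalized centerX, centerY, width, height
box = (detection[0:4] * np.array([W, H, W, H])).astype("int")
(centerX, centerY, width, height) = box       # (200, 150, 80, 120)
x = int(centerX - (width / 2))                # 200 - 40 = 160
y = int(centerY - (height / 2))               # 150 - 60 = 90
print([x, y, int(width), int(height)])        # [160, 90, 80, 120]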

Finally, we update the boxes, confidences, and classIDs lists on Lines 53-55 and
return the three respective lists as a tuple to the calling function (Line 59).

4.5 Running Our YOLO Security Application

Now that we’ve coded up our server.py script, let’s put it to work.

Open up a terminal and execute the following command:

$ python server.py --conf config/config.json


[INFO] loading YOLO from disk...
[INFO] starting advanced security surveillance...

From there, on your Raspberry Pi (via SSH or VNC), start the client:

$ python client.py --server-ip 172.20.10.10

Figure 4.2: Objects such as people are annotated/recorded (red) only when they enter unautho-
rized zones (blue).

Your server will then come alive with the feed from the RPi. In Figures 4.2 and 4.3 you can
see my macOS desktop. People (red) are detected via YOLO, but they are only annotated and
recorded if/when they enter unauthorized zones (blue). Notice that the objects that are outside
the blue boxes are not detected. Our key clip writer will store clips only as people enter the
unauthorized zones.

You can verify that the video clip was written to disk by checking the contents of your
output/ directory:

$ ls output/2019-11-15/
151118.avi 151544.avi 152020.avi 153044.avi 153351.avi
151505.avi 151601.avi 152624.avi 153240.avi 153846.avi

Figure 4.3: Objects such as people are annotated/recorded (red) only when they enter unautho-
rized zones (blue).

4.6 How Do I Define My Own Unauthorized Zones?

In this chapter you learned how to monitor unauthorized zones for access; however, how do
you actually define these unauthorized zones? In general, there are two methods I recom-
mend: using an image processing tool (ex., Photoshop, GIMP, etc.) or utilizing OpenCV’s
mouse click events.

4.6.1 Step #1: Capture the Image/Frame

Initially, determining the bounding box (x, y)-coordinates of an unauthorized zone is a manual process. The good news is that these coordinates only need to be determined once.

The first step is to actually capture your image/frame. You can use the exact code from
Section 4.4.4, but you should insert a cv2.imwrite call at the bottom of the while loop used
to loop over frames, like this:

Figure 4.4: Using Photoshop to derive the (x, y)-coordinates of a region in an image.

209 # show the frame


210 cv2.imshow("{}".format(rpiName), frame)
211 key = cv2.waitKey(1) & 0xFF
212
213 # if the `q` key was pressed, break from the loop
214 if key == ord("q"):
215 cv2.imwrite("frame.png", frame)
216 break

Notice how I’m writing the current frame to disk with the filename frame.png (Line 215).
We’ll now examine this frame in the following two sections.

4.6.2 Step #2a: Define Coordinates with Photoshop/GIMP

If you have Photoshop or GIMP installed on your machine then you can simply open up
frame.png in the editor and look at the (x, y)-coordinates of the frame, an example of which
can be seen in Figure 4.4.

As I move my mouse, Photoshop will update the “Info” section with the current (x, y)-
coordinates. To define my unauthorized zone coordinates I can simply move my mouse to
the four corners of the rectangle I want to monitor and jot them down. Given the coordinates I
can go back to my config.json file and update the "unauthorized_zone" list.

4.6.3 Step #2b: Define Coordinates with OpenCV

An alternative option to defining your unauthorized zone coordinates is to use OpenCV and
mouse click events (http://pyimg.co/fmckl) [31].

Figure 4.5: Using OpenCV’s mouse click events to print and display (x, y)-coordinates.

Using OpenCV you can capture mouse clicks on any window opened by cv2.imshow.
Figure 4.5 demonstrates how I’ve opened a frame, moved my mouse to a location on the
frame I want to monitor, and then clicked my mouse — the (x, y)-coordinates of the click are
then printed to my terminal. I can then repeat this process for the remaining three vertices of
the bounding box rectangle.
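
Here is a minimal sketch of that idea, assuming the frame.png from Step #1 lives in your current directory (the window name and callback function are my own choices here):

import cv2

def on_click(event, x, y, flags, param):
    # print the coordinates of every left mouse click
    if event == cv2.EVENT_LBUTTONDOWN:
        print("clicked at (x={}, y={})".format(x, y))

# load the frame captured in Step #1 and register the click callback
image = cv2.imread("frame.png")
cv2.namedWindow("frame")
cv2.setMouseCallback("frame", on_click)

# keep the window open until the `q` key is pressed
while True:
    cv2.imshow("frame", image)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cv2.destroyAllWindows()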

From there, I would take the (x, y)-coordinates I just found and update the "unauthorized_zone"
list in the config file.

4.7 Summary

In this tutorial you learned how to build a more advanced home security application.

Since OpenCV’s YOLO object detector is incredibly computationally demanding, it is unable to run in real-time on the RPi; therefore, we decided to stream the frames directly from the RPi to a central host for additional processing.

While streaming may feel like “cheating” to a certain degree, it’s a perfectly acceptable ap-
proach provided you have enough network bandwidth. If you do not have the network strength
to facilitate real-time streaming then you’ll need to apply your deep learning algorithms directly
on the RPi itself — the rest of this text is dedicated to techniques you can apply to do so.

Furthermore, in Chapter 8 you will learn how to extend the method we covered here in
this chapter to build an even more advanced home security application capable of leveraging
multiple Raspberry Pis.
Chapter 5

Face Recognition on the RPi

Inside this chapter you will learn how to build a complete face recognition system.

Such a system could be used to recognize faces at the front of your house and unlock your
front door, monitor your desk and capture anyone who is trying to use your workstation, or even
build something fun like a “smart treasure chest” that automatically unlocks when the correct
face is identified.

Regardless of what you’re building, this chapter will serve as a template you can use
to build your own face recognition systems.

5.1 Chapter Learning Objectives

In this chapter you will learn:

i. About “embeddings” and how deep learning can be used for face recognition.

ii. Three methods that can be used to gather and build your face recognition dataset.

iii. How to use Python to extract face embeddings from your custom dataset.

iv. How to train a Support Vector Machine (SVM) classifier on top of the embeddings.

v. How to use text-to-speech with Python.

vi. How to put all the pieces together and create a complete face recognition pipeline.

5.2 Our Face Recognition System

Our face recognition system is a multi-step process (Figure 5.1).


Figure 5.1: In order to build a face recognition system we must: create a dataset of example
faces (i.e., our training data), extract features from our face dataset, and train a machine learning
classifier model on top of the face features. From there we can recognize faces in images and
video streams.

First, we must gather our dataset of face images.

Section 5.4.2 will show you three methods that I recommend for building your own custom
face dataset (or you can simply follow along with this chapter using the dataset provided for
you in the downloads associated with the text).

After building your face dataset you must:

i. Quantify each face using dlib and face_recognition.

ii. Train a Support Vector Machine (SVM) on the face quantifications.

It’s recommended that you perform both of these steps on a laptop, desktop, or GPU rig. A laptop/desktop will have more RAM and a faster CPU, enabling you to more easily work with your data.

If you choose to use your RPi for these steps you may find (1) it takes substantially longer to
quantify each face and train the model and/or (2) the scripts error out completely due to lack of
memory. Therefore, use your laptop/desktop to train the model and then transfer the resulting
models to your RPi.
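
To make those two training steps concrete before we dig into the chapter’s scripts, here is a condensed sketch of what they boil down to. The book’s own encode_faces.py and train_model.py (covered in Section 5.4.3) are the reference implementation, and the real pipeline serializes the embeddings to encodings.pickle in between; the HOG face detector, linear SVM parameters, and output filenames below simply mirror the project structure and are otherwise illustrative:

# condensed sketch: quantify each face, then train an SVM on the embeddings
from imutils import paths
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
import face_recognition
import pickle
import cv2
import os

knownEncodings = []
knownNames = []

# step 1: quantify every face in the dataset as a 128-d embedding
for imagePath in paths.list_images("face_recognition/dataset"):
    name = imagePath.split(os.path.sep)[-2]
    image = cv2.imread(imagePath)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    boxes = face_recognition.face_locations(rgb, model="hog")
    encodings = face_recognition.face_encodings(rgb, boxes)
    for encoding in encodings:
        knownEncodings.append(encoding)
        knownNames.append(name)

# step 2: train an SVM on top of the embeddings
le = LabelEncoder()
labels = le.fit_transform(knownNames)
recognizer = SVC(C=1.0, kernel="linear", probability=True)
recognizer.fit(knownEncodings, labels)

# serialize the model and label encoder so they can be transferred to the RPi
with open("output/recognizer.pickle", "wb") as f:
    f.write(pickle.dumps(recognizer))
with open("output/le.pickle", "wb") as f:
    f.write(pickle.dumps(le))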

Once your models have been trained and transferred, we’ll perform face recognition on the Raspberry Pi itself.

The Raspberry Pi is therefore responsible for inference (i.e., making predictions) but not the
actual training process.

5.3 Getting Started

Before we can get started building our face recognition project we first need to review the key
components, namely the project structure and the associated configuration file.

5.3.1 Project Structure

This chapter is meant to serve as a complete review of building a face recognition system
— this means that we’ll not only be covering how to actually deploy face recognition to the
Raspberry Pi, but also how to create your own custom faces dataset and then train a face
recognizer on this data.

Since we’ll be covering so many techniques in this chapter, we also have significantly more
Python scripts/files to review.

Let’s take a look at these files now:

|-- cascade
| |-- haarcascade_frontalface_default.xml
|-- config
| |-- config.json
|-- face_recognition
| |-- dataset
| | |-- abhishek [5 entries]
| | |-- adrian [5 entries]
| | |-- dave [5 entries]
| | |-- mcCartney [5 entries]
| | |-- unknown [6 entries]
| |-- encode_faces.py
| |-- train_model.py
|-- messages
| |-- abhishek.mp3
| |-- adrian.mp3
| |-- dave.mp3
| |-- mcCartney.mp3
| |-- unknown.mp3
|-- output
| |-- encodings.pickle
| |-- le.pickle
| |-- recognizer.pickle
|-- pyimagesearch
| |-- notifications
| | |-- __init__.py
| | |-- twilionotifier.py
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
|-- create_voice_msgs.py
|-- door_monitor.py

The cascade/ directory contains our Haar cascade for face detection. Face recognition
is a computationally demanding task, and given the resource limited nature of the RPi, any
speedup we can obtain is worth it.

Here we’ll be using Haar cascades for face detection, but later in the Hacker Bundle, I’ll be
showing you how to use the Movidius NCS for faster, more efficient face detection as well.

The config/ directory contains our config.json file. We’ll be reviewing this file in Sec-
tion 5.3.2 below.

All scripts that will be used to train our actual face recognizer are contained in the face_rec
ognition/ directory.

The dataset/ directory is where we’ll store our dataset of face images. You’ll learn how
to build your own face recognition dataset inside Section 5.4.2.

The encode_faces.py and train_model.py scripts will be used to train the face recognition
model. These scripts are covered in Sections 5.4.3 and 5.4.4, respectively.

The output/ directory will then contain the output files from encode_faces.py and
train_model.py.

We’ll be using text-to-speech (TTS) to announce each recognized (or unidentified) face,
but in order to utilize TTS, we first need to generate .mp3 files for each person’s name. The
messages/ directory will contain these .mp3 files after they have been generated by the
create_voice_msgs.py script.

Inside the pyimagesearch module you’ll find utilities to load our configuration file and send
Twilio text message notifications.
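
The Conf class itself isn’t listed in this chapter; conceptually, it strips the //-style comments
from config.json, parses the JSON, and exposes dictionary-style lookups. A minimal sketch of
that idea is shown below (the bundled pyimagesearch.utils.Conf implementation may differ):

# minimal sketch of a commented-JSON config loader -- the bundled
# pyimagesearch.utils.Conf class may be implemented differently
import json
import re

class Conf:
    def __init__(self, confPath):
        # read the raw file, strip // line comments, then parse the JSON
        # (note: this naive regex would also mangle any string values that
        # contain //, which the config in this chapter does not)
        raw = open(confPath).read()
        self.conf = json.loads(re.sub(r"//.*", "", raw))

    def __getitem__(self, key):
        # allow dictionary-style access, e.g. conf["cascade_path"]
        return self.conf[key]

Loading the file then becomes conf = Conf("config/config.json") followed by lookups such
as conf["threshold"].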

Finally, door_monitor.py will put all the pieces together, and as the name suggests, will
build a face recognition application for the purpose of monitoring our front door.

5.3.2 Our Configuration File

Our configuration file for this project is longer than the vast majority of configuration files in this
book. I would suggest you review both the config file, along with the rest of the code in this
chapter once, then go back and read the code separately from the text. Doing so will enable
you to understand the context in which each configuration is used (and how different Python
scripts can utilize the same config values, such as the path to our Haar cascade or trained face
recognition model).

With that said, open up the config.json file now and let’s review it:

1 {
2 // path to OpenCV's face detector cascade file
3 "cascade_path": "cascade/haarcascade_frontalface_default.xml",
4
5 // path to extracted face encodings file
6 "encodings_path": "output/encodings.pickle",
7
8 // path to face recognizer model
9 "recognizer_path": "output/recognizer.pickle",
10
11 // path to label encoder file
12 "le_path": "output/le.pickle",

Line 3 defines the path to our Haar cascade used for face detection (i.e., finding the
bounding box coordinates of each face in an image).

For face recognition (the actual person identification) we then have three paths:

• "encodings_path": Where we’ll store our face quantifications after our deep learning
model has extracted embeddings from the face ROI.

• "recognizer_path": The path to the trained face recognition model.

• "le_path": Path to the LabelEncoder object for the face recognizer.

We’ll be reviewing each of these paths in detail inside Section 5.4.3 and Section 5.4.4, but
now simply make note of them.

Let’s continue working through the config.json file:

14 // boolean variable used to indicate if frames should be displayed


15 // to our screen
16 "display": true,
17
18 // variable used to store the area threshold percentage
19 "threshold": 0.12,

The "display" config controls whether our driver script, door_monitor.py, should dis-
play any output frames via cv2.imshow. If our face recognition application is meant to run in
the background then you can set "display" to false, ensuring no output frames are shown
via cv2.imshow, and therefore allowing your script to run faster.

The "threshold" parameter controls the threshold percentage for our background sub-
tractor.

As Line 19 specifies, if more than 12% of any given frame is occupied with motion,
we’ll start running our face detector on each frame of the video stream, which leads us to the
"look_for_a_face" parameter:

21 // number of consecutive frames to look for a face once the door


22 // is open
23 "look_for_a_face": 60,
24
25 // number of consecutive frames to recognize a particular person
26 // to reach a conclusion
27 "consec_frames": 1,
28
29 // number of frames to skip once a person is recognized
30 "n_skip_frames": 900,

Once both (1) motion is found and (2) more than "threshold" percent of the frame con-
tains motion, we’ll start applying face detection to each frame of the video stream for a total of
"look_for_a_face" frames.

Since face detection is more computationally expensive than motion detection via back-
ground subtraction, we wish to only apply face detection occasionally — setting a limit on the
number of frames to sequentially apply face detection to enables us to save CPU cycles and
ensure our face recognition system runs faster.

In the event that a face is found in a frame, we then apply face recognition. If a given
person is identified in at least "consec_frames" frames, we can label the face as a positive
recognition.

Smaller values of "consec_frames" will allow your pipeline to run faster as face recogni-
tion only has to be applied for a total of "consec_frames"; however, smaller values may lead
to false-positive recognitions. You may need to balance this parameter in your own applications
to reach a satisfactory level of accuracy.

After a face has been successfully recognized we skip a total of "n_skip_frames" to


ensure the same person is not accidentally recognized again.

The following configurations are used with Google’s TTS library:

32 // see available languages in your terminal with `gtts-cli --all`


33 // shows accents/dialects after the hyphen.
34 // ex: en-gb would be the English language with British accent


35 "lang": "en",
36 "accent": "us",
37
38 // path to messages directory
39 "msgs_path": "messages",
40
41 // two variables, one to store the message used for homeowner(s),
42 // and a second to store the message used for intruders
43 "welcome_sound": "Welcome home ",
44 "intruder_sound": "I don't recognize you, I am alerting the
45 homeowners that you are in their house.",

Line 35 defines the name of the language used by the TTS package while "accent" (Line
36) controls the particular accent of the language.

The msgs_path controls the path to the output messages/ directory where we’ll later gen-
erate and store .mp3 files for each person’s name. Lines 43-45 then specify the "welcome_
sound", used to welcome a user to their home, while "intruder_sound" contains the text
for an unauthorized user entering the premises.

Our face recognition system can also send text message notifications to the administrator
by specifying the relevant AWS S3 and Twilio API keys/credentials:

47 // variables to store your aws account credentials


48 "aws_access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
49 "aws_secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
50 "s3_bucket": "YOUR_S3_BUCKET_NAME",
51
52 // variables to store your twilio account credentials
53 "twilio_sid": "YOUR_TWILIO_SID",
54 "twilio_auth": "YOUR_TWILIO_AUTH_ID",
55 "twilio_to": "YOUR_PHONE_NUMBER",
56 "twilio_from": "YOUR_TWILIO_PHONE_NUMBER",
57
58 // message sent to the owners when an intruder is detected
59 "message_body": "There is an intruder in your house."
60 }

If you need a review of these parameters please refer to Chapter 10 of the Hobbyist Bundle
where they are covered in detail.

5.4 Deep Learning for Face Recognition

In the first part of this section we’ll briefly review how deep learning algorithms can facilitate
accurate face recognition via face embeddings.

From there you’ll learn how to gather and build a dataset that can be used for face recogni-
tion, followed by extracting face embeddings from the dataset, and then training the actual face
recognition model.

5.4.1 Understanding Deep Learning and Face Recognition Embeddings

Figure 5.2: Facial recognition via deep metric learning involves a “triplet training step.” The triplet
consists of 3 unique face images — 2 of the 3 are the same person. The NN generates a 128-d
vector for each of the 3 face images. For the 2 face images of the same person, we tweak the
neural network weights to make the vector closer via distance metric. Image credit: [32]

Face recognition via deep learning hinges on a technique called deep metric learning. If
you have any prior experience with deep learning you know that we typically train a network to:

i. Accept a single input image

ii. Output a classification/label for that image

Deep metric learning is a bit different. Instead of trying to output a single label (or even the
coordinates/bounding box of objects in an image), we instead output a real-valued feature
vector. For the dlib facial recognition network (the library we’ll be using for face recognition), the
output feature vector is 128-d (i.e., a list of 128 real-valued numbers) that are used to quantify
the face.

Training the network is done using triplets (Figure 5.2). Here we provide three images to
the network:

• Two of these images are example faces of the same person.

• The third image is a random face from our dataset and is not the same person as the
other two images.

As an example, let’s again consider Figure 5.2 where we provided three images: one of
Chad Smith (drummer of the Red Hot Chili Peppers) and two of Will Ferrell (a famous actor).
Our network quantifies these faces, constructing a 128-d embedding (quantification) for each.

From there, the general idea is that we’ll tweak the weights of our neural network so that
the 128-d measurements of the two Will Ferrell images will be closer to each other and further
from the measurements of Chad Smith.
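
To make the triplet idea concrete, below is a toy NumPy sketch (using random vectors rather
than real face embeddings, with an arbitrary margin) of the property the training procedure
tries to enforce, namely that the anchor/positive distance ends up smaller than the
anchor/negative distance by at least some margin:

# toy illustration of the triplet objective -- these are random vectors,
# not real face embeddings, and the margin value is arbitrary
import numpy as np

np.random.seed(42)
anchor = np.random.rand(128)                     # e.g., Will Ferrell image #1
positive = anchor + 0.05 * np.random.rand(128)   # Will Ferrell image #2 (close by construction)
negative = np.random.rand(128)                   # Chad Smith image

dPos = np.linalg.norm(anchor - positive)
dNeg = np.linalg.norm(anchor - negative)

# a well-trained embedding network yields dPos << dNeg; the triplet loss
# penalizes the network whenever dPos + margin > dNeg
margin = 0.6
loss = max(dPos - dNeg + margin, 0.0)
print("positive distance: {:.3f}".format(dPos))
print("negative distance: {:.3f}".format(dNeg))
print("triplet loss: {:.3f}".format(loss))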

Our network architecture for face recognition is based on ResNet-34 from the Deep Resid-
ual Learning for Image Recognition paper by He et al. [4], but with fewer layers and the number
of filters reduced by half. The network was trained by Davis King [33] on a dataset of approximately
3 million images. On the Labeled Faces in the Wild (LFW) dataset, a popular benchmark for
face recognition/verification, the network compares favorably to other state-of-the-art methods,
reaching 99.38% accuracy.

Both Davis King (the creator of dlib) and Adam Geitgey (the author of the face_recogniti
on module [34] which wraps around dlib) have written detailed articles on how deep learning-
based facial recognition works [32, 35] — I would highly encourage you to read them if you
would like more details on how (1) these networks are trained and (2) how the networks pro-
duce 128-d embeddings used to quantify a face.
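
If you would like to poke at the embeddings yourself before continuing, the face_recognition
library makes the experiment a quick exercise. A short sketch follows; the image filenames are
placeholders for any two face photos you have on hand, and it assumes a face is found in each:

# quick experiment: compute and compare 128-d embeddings for two images
# (person_a.jpg and person_b.jpg are placeholder filenames)
import face_recognition

imageA = face_recognition.load_image_file("person_a.jpg")
imageB = face_recognition.load_image_file("person_b.jpg")

# each call detects faces and returns one 128-d vector per detected face
encodingA = face_recognition.face_encodings(imageA)[0]
encodingB = face_recognition.face_encodings(imageB)[0]

# smaller distances imply the two faces are more likely the same person;
# 0.6 is the library's default tolerance for a "match"
distance = face_recognition.face_distance([encodingA], encodingB)[0]
print("distance: {:.3f}".format(distance))
print("same person: {}".format(distance < 0.6))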

5.4.2 Step #1: Gather Your Dataset

Before we can recognize faces, we first need to build a dataset of faces to recognize. For
example, your dataset could consist of:

• Famous actors and athletes

• Members of your family

• Colleagues at your work

• Fellow students in your classroom



Exactly who is in the images doesn’t matter as much as ensuring you have enough images per
person to obtain an accurate face recognition model.

Keep in mind that machine learning algorithms are not magic — you need to supply enough
example images for them to learn patterns that can discriminate between faces and accurately
recognize them. In general, there are three ways to build a face recognition dataset:

i. Use OpenCV and a webcam

ii. Download images programmatically (such as via an API)

iii. Manually collect images

5.4.2.1 Use OpenCV and Webcam for Face Detection

Figure 5.3: Using OpenCV and a webcam it’s possible to detect faces in a video stream and save
the examples to disk. This process can be used to create a face recognition dataset on premises.

The first method is appropriate when you are building an “on-site” face recognition system
and you need to have physical access to a particular person to gather example images of their
face. Such a system would be typical for companies, schools, or other organizations where
people need to physically show up and attend every day.

To gather example faces of these people, you may escort them to a special room where a
video camera is setup to (1) detect the (x, y)-coordinates of their face in a video stream and (2)
write the frames containing their face to disk (Figure 5.3). We may even perform this process
over multiple days or weeks to gather examples of their face in different lighting conditions,
times of day, and varying moods or emotional states.

To learn how to build a face recognition dataset using OpenCV and a webcam, you
can follow this PyImageSearch tutorial which includes detailed code: http://pyimg.co/v9evi [36].
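
The linked tutorial contains the complete implementation; the sketch below compresses the
idea into its essentials, reusing the same Haar cascade that ships with this chapter (the
output directory is a placeholder for whichever person you are capturing; press the k key to
save the current frame and q to quit):

# condensed sketch of the OpenCV + webcam capture approach -- see the
# tutorial linked above for the complete version
import cv2
import os

output = "dataset/adrian"    # one sub-directory per person (placeholder)
os.makedirs(output, exist_ok=True)
detector = cv2.CascadeClassifier(
    "cascade/haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture(0)
total = 0

while True:
    (grabbed, frame) = cap.read()
    if not grabbed:
        break

    # keep a clean copy for saving, then draw detections for the operator
    orig = frame.copy()
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    rects = detector.detectMultiScale(gray, scaleFactor=1.1,
        minNeighbors=5, minSize=(30, 30))
    for (x, y, w, h) in rects:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)

    cv2.imshow("capture", frame)
    key = cv2.waitKey(1) & 0xFF

    # `k` writes the current (clean) frame to disk, `q` exits the loop
    if key == ord("k"):
        p = os.path.sep.join([output, "{:05d}.png".format(total)])
        cv2.imwrite(p, orig)
        total += 1
    elif key == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()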

5.4.2.2 Download Images Programmatically

However, there are cases where you may not have access to the physical person and/or they
are a public figure with a strong online presence — in those cases you can programmatically
download example images of their faces via APIs on varying platforms.

Exactly which API you choose here depends dramatically on the person you are attempting
to gather example face images of.

For example, if a person consistently posts on Twitter or Instagram, you may want to lever-
age one of their (or other) social media APIs to scrape face images.

Another option would be to leverage a search engine, such as Google or Bing. If you
decide you want to use a search engine to build a face recognition dataset, be sure to refer to the
following two guides:

• How to create a deep learning dataset using Google Images: http://pyimg.co/rdyh0

• How to (quickly) build a deep learning image dataset: http://pyimg.co/vgcns

Both of the guides linked to above will enable you to build a custom dataset using Google
and Bing, respectively.

5.4.2.3 Manual Collection of Images

The final method for creating your own face recognition dataset, and also the least desirable
one, is to manually find and save example face images yourself (Figure 5.4).

This method is obviously the most tedious and requires the most man-hours — typically we
would prefer a more “automatic” solution, but in some cases, you’ll need to resort to it. Using
this method you will need to manually inspect:

• Search engine results (ex., Google, Bing, etc.)

• Social media profiles (Facebook, Twitter, Instagram, etc.)

• Photo sharing services (Google Photos, Flickr, etc.)

Then, for each service, you’ll need to manually save the images to disk.

In these types of scenarios the user often has a public profile of some sort but has signifi-
cantly fewer images to crawl programmatically.

Figure 5.4: Manually downloading face images to create a face recognition dataset is the least
desirable option but one that you should not forget about. Use this method if the person doesn’t
have (as large of) an online presence or if the images aren’t tagged.

5.4.3 Step #2: Extract Face Embeddings

Before we can recognize faces with our Raspberry Pi, we first need to quantify the faces in our
training set.

Keep in mind that we are not actually training a network here — the network has already
been trained to create 128-d embeddings on a dataset of approximately 3 million images. We certainly
could train a network from scratch or even fine-tune the weights of an existing model, but that
is likely overkill for many projects. Furthermore, you would need a lot of images to train the
network from scratch.

Instead, it’s easier to use a pre-trained network and use it to extract the 128-d embeddings
for each face in our dataset (Figure 5.5).

We’ll then take the extracted embeddings and train a SVM classifier on top of them in
Section 5.4.4.

At this point I’ll assume you have either (1) chosen to use the example images included
with the source code associated with this text, or (2) created your own example image dataset
using the instructions detailed above.

In either case, your dataset should have the following directory structure:

Figure 5.5: Facial recognition via deep learning and Python using the face_recognition mod-
ule generates a 128-d real-valued feature vector per face.

|-- dataset
| |-- abhishek [5 entries]
| |-- adrian [5 entries]
| |-- dave [5 entries]
| |-- mcCartney [5 entries]
| |-- unknown [6 entries]

Notice how I have placed all example faces inside the dataset/ directory. Inside the
dataset/ directory there are subdirectories for each person that I want to recognize (i.e.,
each person has their own corresponding directory). Then, inside the subdirectory for each
person, I put example faces of that person, and that person alone.

For example, you should not place example faces of “Adrian” in the "Abhishek" directory (or
vice versa).

Using a directory structure such as the one proposed here forces organization and ensures
your images are organized on disk.
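
As a quick sanity check before moving on, you can count how many example images were found
under each person’s sub-directory (a small throwaway snippet, assuming the layout shown above):

# sanity check: count the example images found for each person
from imutils import paths
import os

counts = {}
for imagePath in paths.list_images("dataset"):
    # the person's name is the sub-directory the image lives in
    name = imagePath.split(os.path.sep)[-2]
    counts[name] = counts.get(name, 0) + 1

for (name, total) in sorted(counts.items()):
    print("{}: {} images".format(name, total))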

Now that our directory structure is organized, open up the encode_faces.py file for this
project and let’s get to work:

1 # import the necessary packages


2 from imutils import paths
3 import face_recognition
4 import argparse
5 import pickle
6 import cv2
7 import os
8 import imutils

First, we need to import our required packages. Most notably, the face_recognition
package, a wrapper around the dlib library, provides us with the ability to easily and conve-
niently extract 128-d face embeddings from our image.

Next, we need to parse a few command line arguments:

10 # construct the argument parser and parse the arguments


11 ap = argparse.ArgumentParser()
12 ap.add_argument("-i", "--dataset", required=True,
13 help="path to input directory of faces + images")
14 ap.add_argument("-e", "--encodings", required=True,
15 help="path to serialized db of facial encodings")
16 ap.add_argument("-d", "--detection-method", type=str, default="cnn",
17 help="face detection model to use: either `hog` or `cnn`")
18 args = vars(ap.parse_args())

The three arguments include:

• --dataset : The path to our dataset of person names and example images.

• --encodings: Our face encodings are written to the file that this argument points to.

• --detection-method: Before we can encode faces in images we first need to detect


them. Our two face detection methods include either hog or cnn. These two values are
the only ones that will work for this flag.

Now that we’ve defined our arguments, let’s grab the paths to our files in our --dataset
directory:

20 # grab the paths to the input images in our dataset


21 print("[INFO] quantifying faces...")
22 imagePaths = list(paths.list_images(args["dataset"]))
23
24 # initialize the list of known encodings and known names
25 knownEncodings = []
26 knownNames = []

Line 22 uses the path to our input directory to build a list of all imagePaths contained
therein.

We also need to initialize two lists before our loop, knownEncodings and knownNames,
respectively. These two lists will contain the face encodings and corresponding names for each
person in the dataset (Lines 25 and 26).

Let’s now enter the main for loop of our script:



28 # loop over the image paths


29 for (i, imagePath) in enumerate(imagePaths):
30 # extract the person name from the image path
31 print("[INFO] processing image {}/{}".format(i + 1,
32 len(imagePaths)))
33 name = imagePath.split(os.path.sep)[-2]
34
35 # load the input image and convert it from RGB (OpenCV ordering)
36 # to dlib ordering (RGB)
37 image = cv2.imread(imagePath)
38 rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

On Line 29 we loop over each imagePath in the imagePaths list. For each image, we
extract the name of the person (i.e., subdirectory name), based on our assumption of how our
dataset is organized on disk, discussed earlier in this section.

We then load the current image from disk on Line 37. OpenCV orders channels in BGR
order; however, face_recognition and dlib expect RGB — we therefore swap from BGR
to RGB color channel ordering on Line 38.

We are now ready to extract face embeddings from the image:

40 # detect the (x, y)-coordinates of the bounding boxes


41 # corresponding to each face in the input image
42 boxes = face_recognition.face_locations(rgb,
43 model=args["detection_method"])
44
45 # compute the facial embedding for the face
46 encodings = face_recognition.face_encodings(rgb, boxes)
47
48 # loop over the encodings
49 for encoding in encodings:
50 # add each encoding + name to our set of known names and
51 # encodings
52 knownEncodings.append(encoding)
53 knownNames.append(name)

Lines 42 and 43 find/localize the face in the image resulting in a list of boxes. We pass
two parameters to the face_locations method, including:

• rgb: Our RGB image.

• model: Either cnn or hog, based on the value of the --detection-method command
line argument.

The CNN method is much more accurate but far slower than the HOG face detector. The
HOG face detector is faster but less accurate than the CNN. In either case, you should be
running this script on a laptop/desktop and not a Raspberry Pi (due to the computational
and memory requirements of the face detector and embedding network), so feel free to play
around with both face detectors.

Given the bounding boxes used to localize each face in the image, we pass them into the
face_encodings function which, internally (1) loops over each of the bounding box loca-
tions, and (2) quantifies the face and returns a 128-d vector used to represent the face. The
face_encodings function then returns a list of encodings, one 128-d vector per face de-
tected. We loop over each of the encodings on Line 49 and update the knownEncodings
and knownNames lists, respectively. The for loop back on Line 29 repeats this process for all
images in our dataset.

The final step is to serialize our knownEncodings and knownNames to disk, enabling us
to train a face recognition model over top of them in Section 5.4.4:

55 # dump the facial encodings + names to disk


56 print("[INFO] serializing encodings...")
57 data = {"encodings": knownEncodings, "names": knownNames}
58 f = open(args["encodings"], "wb")
59 f.write(pickle.dumps(data))
60 f.close()

Line 57 constructs a dictionary with two keys — "encodings" and "names".

From there, Lines 58-60 dump the names and encodings to disk so we can use them in the
following section.

To extract face embeddings for your own dataset, execute the following command:

$ cd face_recognition
$ python encode_faces.py \
--dataset ../datasets/face_recognition_dataset \
--encodings ../output/encodings.pickle
[INFO] quantifying faces...
[INFO] processing image 1/29
[INFO] processing image 2/29
[INFO] processing image 3/29
[INFO] processing image 4/29
[INFO] processing image 5/29
...
[INFO] processing image 25/29
[INFO] processing image 26/29
[INFO] processing image 27/29
[INFO] processing image 28/29
[INFO] processing image 29/29
[INFO] serializing encodings...

As the output demonstrates, I have successfully extracted embeddings for each of the 29
faces in my dataset. Looking at my directory structure you can also see that the encodings.pi
ckle file is now present on disk. The entire process took approximately 10m30s using the CNN method on
my laptop CPU (i.e., no GPU was used here).

I also want to again reiterate that this script should only be run on your laptop/desk-
top. The face detection methods we’re using in this script, while accurate, are very slow on
the Raspberry Pi.

5.4.4 Step #3: Train the Face Recognition Model

At this point we have extracted 128-d embeddings for each face — but how do we actually
recognize a person based on these embeddings?

The answer is that we need to train a “standard” machine learning model (such as SVM,
k-NN classifier, Random Forest, etc.) on top of the embeddings. We’ll be using a Linear SVM
here as the model is fast to train and can provide probabilities when making a prediction.

Open up the train_model.py file and insert the following code:

1 # import the necessary packages


2 from sklearn.preprocessing import LabelEncoder
3 from sklearn.svm import SVC
4 import argparse
5 import pickle
6
7 # construct the argument parser and parse the arguments
8 ap = argparse.ArgumentParser()
9 ap.add_argument("-e", "--encodings", required=True,
10 help="path to serialized db of facial encodings")
11 ap.add_argument("-r", "--recognizer", required=True,
12 help="path to output model trained to recognize faces")
13 ap.add_argument("-l", "--le", required=True,
14 help="path to output label encoder")
15 args = vars(ap.parse_args())

We import our packages and modules on Lines 2-5. We’ll be using scikit-learn’s implemen-
tation of SVM.

From there, we parse three command line arguments:

• --encodings: The path to the serialized face encodings (we exported these encodings
by running the encode_faces.py script in the previous section).

• --recognizer: The path to the trained SVM model that actually recognizes faces.

• --le: Our LabelEncoder file path. We’ll serialize our label encoder to disk so that we
can use both it and the recognizer model to perform face recognition.

Let’s load our facial embeddings from disk and encode our labels:

17 # load the face encodings


18 print("[INFO] loading face encodings...")
19 data = pickle.loads(open(args["encodings"], "rb").read())
20
21 # encode the labels
22 print("[INFO] encoding labels...")
23 le = LabelEncoder()
24 labels = le.fit_transform(data["names"])

Here we load our embeddings from Section 5.4.3 on Line 19. We won’t be generating
any embeddings in this training script — we’ll use the embeddings previously generated and
serialized.

We then initialize our LabelEncoder and encode our labels (Lines 23 and 24).

We can now train our SVM for face recognition:

26 # train the model used to accept the 128-d encodings of the face and
27 # then produce the actual face recognition
28 print("[INFO] training model...")
29 recognizer = SVC(C=1.0, kernel="linear", probability=True)
30 recognizer.fit(data["encodings"], labels)

On Line 29 we initialize our SVM with a linear kernel and then on Line 30 we train
the model. Again, we are using a Linear SVM as the model is fast to train and capable of
producing a probability for each prediction, but you can try experimenting with other machine
learning models if you wish.
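
For example, swapping in scikit-learn’s k-NN classifier is a two-line change, shown below as a
hypothetical variation rather than part of the chapter’s code; k-NN needs essentially no training
time, although predictions can be slower and its probabilities come from neighbor voting rather
than a fitted probability model:

# hypothetical variation: replace Lines 29 and 30 with a k-NN classifier;
# `data` and `labels` are the variables already loaded in train_model.py
from sklearn.neighbors import KNeighborsClassifier

recognizer = KNeighborsClassifier(n_neighbors=3)
recognizer.fit(data["encodings"], labels)

# predict_proba still works (probabilities are the fraction of neighbors
# voting for each class), so the rest of the pipeline is unchanged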

After training the model we output the model and label encoder to disk as two pickle files:

32 # write the actual face recognition model to disk


33 f = open(args["recognizer"], "wb")
34 f.write(pickle.dumps(recognizer))
35 f.close()
36
37 # write the label encoder to disk
38 f = open(args["le"], "wb")
39 f.write(pickle.dumps(le))
40 f.close()

Again, both the trained SVM (used to make predictions) and the label encoder (used to
convert SVM predictions to actual person names) are serialized to disk for use in Section 5.6.

Let’s go ahead and train the SVM now; however, make sure you run this command on
your laptop/desktop and not your Raspberry Pi:

$ python train_model.py --encodings ../output/encodings.pickle \


--recognizer ../output/recognizer.pickle --le ../output/le.pickle
[INFO] loading face encodings...
[INFO] encoding labels...
[INFO] training model...

As you can see, the SVM has been trained on the embeddings and both the (1) SVM itself
and (2) label encoder have been written to disk, enabling us to utilize them in Section 5.6.

Again, I strongly encourage you to run both the encode_faces.py and train_model.py
scripts on a laptop or desktop — between the face detector and face embedding neural net-
work, your RPi could easily run out of memory. I would therefore suggest you:

i. Run encode_faces.py on your laptop, desktop, or GPU machine.

ii. Train the model on your laptop/desktop.

iii. Then transfer the resulting files/model to your RPi via FTP, SFTP, USB thumb drive, email-
ing to yourself, etc.

From there you’ll be able to perform inference (i.e., make predictions) on the RPi where
there is sufficient computational horsepower to do so.

Refer to Chapter 14 to learn how to harness the power of the Movidius NCS for face recog-
nition on the Raspberry Pi.

5.5 Text-to-Speech on the Raspberry Pi

We’ll be using Google’s TTS library, gTTS, to announce the name of a recognized person or
alert when an intruder is detected.

To make our driver script easier to follow, we’ll actually pre-record .mp3 files for each per-
son. This is not a requirement for our script but does make it easier to follow along.

Open up the create_voice_msgs.py script and insert the following code:

1 # import the necessary packages


2 from pyimagesearch.utils import Conf
3 from gtts import gTTS


4 import argparse
5 import pickle
6 import os
7
8 # construct the argument parser and parse the arguments
9 ap = argparse.ArgumentParser()
10 ap.add_argument("-c", "--conf", required=True,
11 help="Path to the input configuration file")
12 args = vars(ap.parse_args())
13
14 # load the configuration file and label encoder
15 conf = Conf(args["conf"])
16 le = pickle.loads(open(conf["le_path"], "rb").read())
17 print("[INFO] creating mp3 files...")

Lines 2-6 import our required Python packages. The gTTS library on Line 3 will be used to
perform TTS and generate our resulting .mp3 files.

Lines 9-12 parse our command line arguments. We only need a single argument here,
--conf, the path to our configuration file, which is then loaded on Line 15.

We also load our label encoder (Line 16) which contains the names of each person our
face recognition model was trained on, including the “unknown” class.

Next, let’s loop over each of the labels:

19 # loop over all class labels (i.e., names)


20 for name in le.classes_:
21 # display which name we're creating the MP3 for
22 print("[INFO] creating {}.mp3...".format(name))
23
24 # if the name is unknown then it's a intruder
25 if name == "unknown":
26 # initialize the Google Text To Speech object with the
27 # message for a intruder
28 tts = gTTS(text=conf["intruder_sound"], lang="{}-{}".format(
29 conf["lang"], conf["accent"]))
30
31 # otherwise, it's a legitimate person name
32 else:
33 # initialize the Google Text To Speech object with a welcome
34 # message for the person
35 tts = gTTS(text="{} {}.".format(conf["welcome_sound"], name),
36 lang="{}-{}".format(conf["lang"], conf["accent"]))
37
38 # save the speech generated as a mp3 file
39 p = os.path.sep.join([conf["msgs_path"], "{}.mp3".format(name)])
40 tts.save(p)

Line 25 makes a check to see if we are currently examining the “unknown” (i.e., intruder)
class. If so, we record a special message for the intruder using the "intruder_sound" text
(Lines 28 and 29). Otherwise, we’re examining a legitimate user, so we use the "welcome_so
und" text (Lines 32-36).

Lines 39 and 40 then save the resulting .mp3 file to disk inside the "msgs_path" direc-
tory.

To build the .mp3 files for each person, just execute the following command:

$ python create_voice_msgs.py --conf config/config.json


[INFO] creating mp3 files...
[INFO] creating abhishek.mp3...
[INFO] creating adrian.mp3...
[INFO] creating dave.mp3...
[INFO] creating mcCartney.mp3...
[INFO] creating unknown.mp3...

Let’s now inspect the messages/ directory:

$ ls messages
abhishek.mp3 adrian.mp3 dave.mp3 mcCartney.mp3 unknown.mp3

Examining the messages/ directory you can see five .mp3 files: one for each of the four
named people our face recognition model was trained on, along with an additional unknown.mp3
file for the “unknown” class.

5.6 Face Recognition: Putting the Pieces Together

We’re almost there!

It’s time to put all the pieces together and build our complete face recognition pipeline. Open
up the door_monitor.py file and insert the following code:

1 # import the necessary packages


2 from pyimagesearch.notifications import TwilioNotifier
3 from imutils.video import VideoStream
4 from pyimagesearch.utils import Conf
5 from datetime import datetime
6 import face_recognition
7 import numpy as np
8 import argparse
9 import imutils
10 import pickle
11 import signal
12 import time
13 import cv2
14 import sys
15 import os
16
17 # function to handle keyboard interrupt
18 def signal_handler(sig, frame):
19 print("[INFO] You pressed `ctrl + c`! Closing face recognition" \
20 " door monitor application...")
21 sys.exit(0)
22
23 # construct the argument parser and parse the arguments
24 ap = argparse.ArgumentParser()
25 ap.add_argument("-c", "--conf", required=True,
26 help="Path to the input configuration file")
27 args = vars(ap.parse_args())

Lines 2-15 handle importing our required Python packages. We have a number of im-
ports for this project, including our TwilioNotifier to send text message notifications, the
VideoStream to access our webcam/RPi camera module via OpenCV, and face_recogniti
on to facilitate face recognition.

Lines 18-21 handle creating a signal_handler to detect if/when ctrl + c is pressed,


enabling us to gracefully exit our script.

Lines 24-27 parse our command line arguments. The only argument we need is --conf,
the path to our configuration file.

Next, we can load our models:

29 # load the configuration file and initialize the Twilio notifier


30 conf = Conf(args["conf"])
31 tn = TwilioNotifier(conf)
32
33 # load the actual face recognition model, label encoder, and face
34 # detector
35 recognizer = pickle.loads(open(conf["recognizer_path"], "rb").read())
36 le = pickle.loads(open(conf["le_path"], "rb").read())
37 detector = cv2.CascadeClassifier(conf["cascade_path"])
38
39 # initialize the MOG background subtractor object
40 mog = cv2.bgsegm.createBackgroundSubtractorMOG()

Line 30 loads our configuration file while Line 31 instantiates the TwilioNotifier used
to send text message notifications.

We then load our face recognition models on Lines 35-37, including the trained face recog-
nition model (SVM), label/name encoder, and Haar cascade for face detection, respectively.

Line 40 then instantiates the MOG background subtractor. Background subtraction is com-
putationally efficient, so we’ll apply motion detection to each frame of the video stream rather
than applying a more computationally expensive face detector to every frame. Provided suffi-
cient motion has taken place, we’ll then trigger the face detection and recognition models.

Our next code block exclusively handles key initializations:

42 # initialize the frame area and boolean used to determine if the door
43 # is open or closed
44 frameArea = None
45 doorOpen = False
46
47 # initialize previous and current person name to None, then set the
48 # consecutive recognition count to zero
49 prevPerson = None
50 curPerson = None
51 consecCount = 0
52
53 # initialize the skip frames boolean and skip frame counter
54 skipFrames = False
55 skipFrameCount = 0

Line 44 initializes frameArea, which will store the area (i.e., width x height) of the frame
once the first frame has been read and resized.

The doorOpen boolean (Line 45) is used to indicate whether a door is open or closed
based on the results of our background subtraction algorithm. We’ll only be applying face
detection and recognition if doorOpen is equal to True, thereby saving precious CPU cycles.

Line 49 initializes the name of the previous person identified while Line 50 initializes the
current person. Recall from our config.json file that we need at least "consec_frames"
frames where prevPerson equals curPerson to indicate a successful recognition — the
actual consecutive count is initialized on Line 51.

Lines 54 and 55 indicate whether or not we are performing skip-frames, and if so, how
many frames have been skipped.

With our initializations out of the way, let’s access our VideoStream and allow the camera
sensor to warmup:

57 # signal trap to handle keyboard interrupt


58 signal.signal(signal.SIGINT, signal_handler)
59 print("[INFO] Press `ctrl + c` to exit, or 'q' to quit if you have" \
60 " the display option on...")


61
62 # initialize the video stream and allow the camera sensor to warmup
63 print("[INFO] warming up camera...")
64 # vs = VideoStream(src=0).start()
65 vs = VideoStream(usePiCamera=True).start()
66 time.sleep(2.0)

We can now start looping over frames of the vs:

68 # loop over the frames of the stream


69 while True:
70 # grab the next frame from the stream
71 frame = vs.read()
72
73 # check to see if skip frames is set and the skip frame count is
74 # less than the threshold set
75 if skipFrames and skipFrameCount < conf["n_skip_frames"]:
76 # increment the skip frame counter and continue
77 skipFrameCount += 1
78 continue
79
80 # if the required number of frames have been skipped then reset
81 # skip frames boolean and skip frame counter, and reinitialize
82 # MOG object
83 elif skipFrameCount == conf["n_skip_frames"]:
84 skipFrames = False
85 skipFrameCount = 0
86 mog = cv2.bgsegm.createBackgroundSubtractorMOG()
87
88 # resize the frame
89 frame = imutils.resize(frame, width=500)
90
91 # if we haven't calculated the frame area yet, calculate it
92 if frameArea == None:
93 frameArea = (frame.shape[0] * frame.shape[1])

Line 71 reads a frame from our video stream.

Line 75 then makes a check to see if (1) we are performing skip frames and if so, (2) how
many frames have been skipped so far.

In the event that we are still under "n_skip_frames" we increment our skipFrameCount
and continue looping.

Otherwise, we have reached our "n_skip_frames" count (Line 83) and need to reset our
skipFrames boolean, skipFrameCount integer, and re-instantiate our background subtrac-
tor.

Line 89 resizes our frame to have a width of 500px while Lines 92 and 93 compute the
frame area (width x height) if it hasn’t been computed already.

Next, let’s make a check to see if our door is closed, implying that we are currently performing
background subtraction only:

95 # if the door is closed, monitor the door using background


96 # subtraction
97 if not doorOpen:
98 # convert the frame to grayscale and smoothen it using a
99 # gaussian kernel
100 gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
101 gray = cv2.GaussianBlur(gray, (13, 13), 0)
102
103 # calculate the mask using MOG background subtractor
104 mask = mog.apply(gray)
105
106 # find contours in the mask
107 cnts = cv2.findContours(mask.copy(), cv2.RETR_EXTERNAL,
108 cv2.CHAIN_APPROX_SIMPLE)
109 cnts = imutils.grab_contours(cnts)
110
111 # check to see if at least one contour is found
112 if len(cnts) >= 1:
113 # sort the contours in descending order based on their
114 # area and grab the largest one
115 c = sorted(cnts, key=cv2.contourArea, reverse=True)[0]
116
117 # if the *percentage* of contour area w.r.t. frame is
118 # greater than the threshold set then set the door as
119 # open and record the start time of this event
120 if (cv2.contourArea(c) / frameArea) >= conf["threshold"]:
121 print("[INFO] door is open...")
122 doorOpen = True
123 startTime = datetime.now()

Line 97 checks to see if the door is closed (i.e., not doorOpen), and if so, converts the
frame to grayscale and smooths it (Lines 100 and 101).

We then apply background subtraction to the gray frame and detect contours in the binary
mask. Provided at least one contour is found (Line 112) we sort the contours by their area,
largest to smallest, and grab the largest one (Line 115).

Line 120 makes a check to see if the area of the contour is greater than threshold
percent of the frame area (i.e., width x height). Provided the area is sufficiently large we know
that motion has occurred, in which case we mark doorOpen as True and grab the current
timestamp.

Our next code block handles what happens when motion has occurred:

125 # if the door is open then:


126 # 1) run face recognition for a pre-determined period of time
127 # 2) if no face is detected in step 1 then it's a intruder
128 elif doorOpen:
129 # compute the number of seconds difference between the current
130 # timestamp and when the motion threshold was triggered
131 delta = (datetime.now() - startTime).seconds
132
133 # run face recognition for pre-determined period of time
134 if delta <= conf["look_for_a_face"]:
135 # convert the input frame from (1) BGR to grayscale (for
136 # face detection) and (2) from BGR to RGB (for face
137 # recognition)
138 gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
139 rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
140
141 # detect faces in the grayscale frame
142 rects = detector.detectMultiScale(gray, scaleFactor=1.1,
143 minNeighbors=5, minSize=(30, 30))
144
145 # OpenCV returns bounding box coordinates in (x, y, w, h)
146 # order but we need them in (top, right, bottom, left)
147 # order for dlib, so we need to do a bit of reordering
148 box = [(y, x + w, y + h, x) for (x, y, w, h) in rects]

Line 128 checks to see if motion has occurred, and provided it has, we compute the delta
time difference between the current time and the startTime (Line 131).

Line 134 then starts an if statement that will perform face recognition for a total of "look_
for_a_face" seconds.

We start by converting the frame to grayscale (for face detection) and then changing the
channel ordering from BGR to RGB (for face recognition).

Lines 142 and 143 detect faces in the image while Line 148 reorders the (x, y)-coordinates
of the bounding boxes for dlib.

From here, we need to ensure that at least one face was detected:

150 # check if a face has been detected


151 if len(box) > 0:
152 # compute the facial embedding for the face
153 encodings = face_recognition.face_encodings(rgb, box)
154
155 # perform classification to recognize the face
156 preds = recognizer.predict_proba(encodings)[0]
157 j = np.argmax(preds)
158 curPerson = le.classes_[j]
159
160 # draw the bounding box of the face predicted name on
161 # the image


162 (top, right, bottom, left) = box[0]
163 cv2.rectangle(frame, (left, top), (right,
164 bottom), (0, 255, 0), 2)
165 y = top - 15 if top - 15 > 15 else top + 15
166 cv2.putText(frame, curPerson, (left, y),
167 cv2.FONT_HERSHEY_SIMPLEX, 0.75, (0, 255, 0), 2)
168
169 # if the person recognized is the same as in the
170 # previous frame then increment the consecutive count
171 if prevPerson == curPerson:
172 consecCount += 1
173
174 # otherwise, a different name was predicted so reset
175 # the counter
176 else:
177 consecCount = 0
178
179 # set current person to previous person for the next
180 # iteration
181 prevPerson = curPerson

Provided at least one face was detected (Line 151) we then use the face_encodings
function to compute 128-d embeddings for each face. The encodings are then passed into
our face recognition model and the label with the largest corresponding probability is extracted
(Lines 156-158).

We then use OpenCV to draw a bounding box surrounding the face along with the name of
the person on the frame (Lines 162-167).

Line 171 checks to see if the previous identification matches the current identification, and
if so, increments the consecCount. Otherwise, we reset the consecCount as the identified
names do not match between consecutive frames (Lines 176 and 177).

Line 181 then sets the prevPerson to the curPerson for the next iteration of the loop.

Let’s now check to see if we should apply an audio message and/or alert the home owner:

183 # if a particular person is recognized for a given


184 # number of consecutive frames, we have reached a
185 # conclusion and alert/greet the person accordingly
186 if consecCount == conf["consec_frames"]:
187 # play the MP3 file according to the recognized
188 # person
189 print("[INFO] recognized {}...".format(curPerson))
190 os.system("mpg321 --stereo {}/{}.mp3".format(
191 conf["msgs_path"], curPerson))
192
193 # check if the person is an intruder
194 if curPerson == "unknown":
195 # send the frame via Twilio to the home owner


196 tn.send(frame)
197
198 # mark the door as closed and now we start skipping
199 # next few frames
200 print("[INFO] door is closed...")
201 doorOpen = False
202 skipFrames = True
203
204 # otherwise, no face was detected and the door was closed
205 else:
206 # indicate the door is not open and then start skipping
207 # frames
208 print("[INFO] no face detected...")
209 doorOpen = False
210 skipFrames = True

Line 186 checks to see if the consecCount meets the minimum number of required frames
("consec_frames") for a person identification to be “positive”. Provided it has, we use the
mpg321 command line tool to play the corresponding audio message for the user (Lines 189-
191).

If the identified person is “unknown” we’ll also send a text message alert, including the
frame, to the home owner (Lines 194-196).

Remark. Ensure that your Raspberry Pi has the mpg321 apt-get package installed and can
play .mp3 files. The package is installed by default on the Raspberry Pi .img that accompanies
this book.

We then reset our doorOpen boolean and indicate that we should start skipping frames,
ensuring that the user is not accidentally “re-identified” by our system (Lines 201 and 202).

The else block beginning on Line 205 closes the if delta <= conf["look_for_a_
face"]: on Line 134. If a face could not be detected within the specified number of seconds
then we’ll assume the motion that was detected is not something we are interested in (such as
a house pet passing by the camera).

Our final code block handles displaying our frame to the screen, but only if the display
configuration is set:

212 # check to see if we should display the frame to our screen


213 if conf["display"]:
214 # show the frame and record any keypresses
215 cv2.imshow("frame", frame)
216 key = cv2.waitKey(1) & 0xFF
217
218 # if the `q` key is pressed, break from the loop
219 if key == ord("q"):


220 break
221
222 # do a bit of cleanup
223 vs.stop()

If your goal is to run the face recognition system as a background process then you can
set "display" to false. While it may not seem like it, I/O operations such as cv2.imshow
can cause latency which in turn slows down your FPS throughput rate. When possible, leave
out calls to cv2.imshow to improve your pipeline speed.

Additionally, be sure to refer to Chapter 23 of the Hobbyist Bundle to learn about bench-
marking and profiling your computer vision scripts.
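
As a quick experiment before diving into that chapter, you can wrap your frame loop with the
FPS helper that ships with imutils and compare a run with the display enabled against one with
it disabled (a rough sketch, assuming the PiCamera module is being used):

# rough FPS benchmark sketch -- run it once with the cv2.imshow lines
# enabled and once with them commented out, then compare the numbers
from imutils.video import VideoStream
from imutils.video import FPS
import imutils
import time
import cv2

vs = VideoStream(usePiCamera=True).start()
time.sleep(2.0)
fps = FPS().start()

# process a fixed number of frames so the two runs are comparable
for _ in range(200):
    frame = vs.read()
    frame = imutils.resize(frame, width=500)
    cv2.imshow("frame", frame)
    cv2.waitKey(1)
    fps.update()

fps.stop()
print("[INFO] elapsed time: {:.2f}s".format(fps.elapsed()))
print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
cv2.destroyAllWindows()
vs.stop()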

5.7 Face Recognition Results

Figure 5.6: Our camera has been mounted with a clear view of the back door and just above eye
level so that it can perform face recognition.

Wow! We put in a lot of work in this chapter to build our face recognition pipeline! Take a
second to congratulate yourself on all that you’ve accomplished so far.

But now it’s time for the fun part — running the face recognition system.

Open up a terminal and execute the following command:

$ python door_monitor.py --conf config/config.json


[INFO] Press 'ctrl + c' to exit or 'q' to quit if you have the display
option on...
[INFO] warming up camera...
[INFO] door is open...
[INFO] recognized adrian...
High Performance MPEG 1.0/2.0/2.5 Audio Player for Layer 1, 2, and 3.
Version 0.3.2-1 (2012/03/25). Written and copyrights by Joe Drew,
now maintained by Nanakos Chrysostomos and others.
Uses code from various people. See 'README' for more!
THIS SOFTWARE COMES WITH ABSOLUTELY NO WARRANTY! USE AT YOUR OWN RISK!

Directory: messages
Playing MPEG stream from adrian.mp3 ...
MPEG 2.0 layer III, 32 kbit/s, 24000 Hz mono
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.front

[0:01] Decoding of adrian.mp3 finished.


[INFO] door is closed...

Figure 5.6 shows the positioning of the camera relative to the doorway which is about 10ft
away (not pictured).

Figure 5.7 demonstrates face recognition frames captured by our system.

In the top-left our Raspberry Pi detects motion from the front door opening, triggering face
detection to be applied (top-right). The face is marked as "unknown" until our face recognition
algorithm runs on the detected face.

The bottom-left shows that my face is correctly recognized as "Adrian" while the bottom-
right indicates that an "intruder" has entered the house.

When testing the fully integrated door monitor aspect of this chapter, we found that lighting
makes a big difference for two aspects of the system.

Recall from working through the Hobbyist Bundle computer vision scripts that lighting can
make or break a computer vision application. You may realize that you need to make some
additional tweaks for your home, or you might just need more training data. Be sure to review
the next section for ideas on how to make the system reliable in most conditions.

Figure 5.7: Door monitoring security with face recognition. Top-left: Our RPi detects that motion
has occurred (i.e., the door has opened) and starts applying face detection. Top-right: Once a
face is detected it’s marked as "unknown" until the face recognition verifies the face. Bottom-left:
Correctly recognizing my face as "Adrian". Bottom-right: The face recognition system marking
the person as an "intruder" as they are not a member of the household.

5.8 Room for Improvement

Facial recognition using controlled lighting conditions is great when you’re hacking, experi-
menting, or researching. When you take the plunge from image processing to computer vision,
environmental factors come into play. The number one environmental factor we are concerned
with is light.

In this section we’ll review tips and suggestions for making light your friend and not your foe.
At the bottom of the section, there is a reference to an additional section of this book which
has extra tips on achieving higher accuracy (i.e. reducing false positives and false negatives)
when performing face recognition.

First, there is the lighting from the door. The motion detection via background subtraction
uses a change in light to detect when the door is open or closed. Sounds perfect, right? In
theory, yes, but consider the camera’s point of view. The camera is aimed directly at the door
and a person will be coming through the doorway when it opens.

If it is semi-dark inside, and extremely bright outside (such as when the sun is shining), it is
possible for the camera sensor to become flooded with light. This light makes the doorway and
other nearby areas of the frame experience washout. As for the face in front of all the light?
It might experience reflections or it might just appear as a dark blob with no facial features
making it hard to recognize. The doorOpen boolean coupled with the "look_for_a_face"
timeout helps to ensure we give the camera’s “auto white balance” feature time to compensate.
Depending on your system’s physical environment, you may need to work on some hacks in
that section of the code.

The second lighting issue is related, but separate, from the first one. If you’re like me,
you probably enter your house sometimes during the day and sometimes at night. Sometimes,
your entry way light bulb will be left on. Other times not. Your face training data should be
captured for occupants of your house under various circumstances. When you capture
images (using the method from Section 5.4.2.1), it would be best if you capture them under
multiple natural and artificial lighting conditions. Take your face photos at different times of
the day. Take photos with the ceiling or lamp lights on and off. Capture photos with the
blinds/curtains opened and closed. You get the idea.

An easy solution is to just mount your camera and log face images by the doorway
for a period of a week knowing that different lighting will be present over the course of
this time. Once you’ve logged the data, you can manually sort it and then follow the training
procedure detailed in Section 5.4 prior to deploying your system.

Another idea is to add an IoT light to the room and send commands to it to control the lighting
from your RPi. Internet of Things lights come in the form of smart plugs, smart switches, and
smart bulbs. You can control a light from your RPi with simple APIs. In Chapter 15, we’ll use
an inexpensive smart plug to turn on a lamp near the door. This serves two purposes. First it
serves to illuminate the face and room where facial recognition is to be performed. Secondly,
a light that automatically turns on in the house is usually a deterrent to intruders/thieves.

For additional face recognition tips, be sure to refer to Section 14.3 of the “Fast, Effi-
cient Face Recognition with the Movidius NCS” chapter. In that chapter as a whole, you’ll
learn how to use the Movidius NCS coprocessor to speed up your face recognition pipelines.

5.9 Summary

In this chapter you learned how to build a complete, end-to-end face recognition system, in-
cluding:

• Gathering your faces dataset.



• Training a face recognition model.

• Deploying the face recognition system to the RPi.

We created our face recognition system in the context of building a “front door monitor”,
capable of monitoring our home for intruders; however, you can use this code as a template for
whatever facial recognition applications you can imagine.

For added security and working in dark environments, I would suggest extending this project
by incorporating IoT components, including SMS notifications or smart bulbs. Refer to Chapter
15 for more details on building such a project.
Chapter 6

Building a Smart Attendance System

In this chapter you will learn how to build a smart attendance system used in school and
classroom applications.

Using this system, you, or a teacher/school administrator, can take attendance for a class-
room using only face recognition — no manual intervention of the instructor is required.

To build such an application, we’ll be using computer vision algorithms and concepts we’ve
learned throughout the text, including accessing our video stream, detecting faces, extract-
ing face embeddings, training a face recognizer, and then finally putting all the components
together to create the final smart classroom application.

Since this chapter references techniques used in so many previous chapters, I highly rec-
ommend that you read all preceding chapters in the book before you proceed.

6.1 Chapter Learning Objectives

In this chapter you will:

i. Learn about smart attendance applications and why they are useful

ii. Discover TinyDB and how it can be used in real-world applications

iii. Implement and perform face enrollment/registration

iv. Train a face recognition model

v. Implement the final smart classroom application


Figure 6.1: You can envision your attendance system being placed near where students enter the
classroom at the front doorway. You will need a screen with adjacent camera along with a speaker
for audible alerts.

6.2 Overview of Our Smart Attendance System

We’ll start this section with a brief review of what a smart attendance application is and why
we may want to implement our own smart attendance system.

From there we’ll review the directory structure for the project and review the configuration
file.

6.2.1 What is a Smart Attendance System?

The goal of a smart attendance system is to automatically recognize students and take attendance without requiring the instructor to manually intervene. Freeing the instructor from
having to take attendance gives the teacher more time to interact with the students and do
what they do best — teach rather than administer.

An example of a working smart attendance system can be seen in Figures 6.1 and 6.2.
Notice how as the student walks into a classroom they are automatically recognized. This

positive recognition is then logged to a database, marking the student as “present” for the
given session.

Figure 6.2: An example of a smart attendance system in action. Face detection is performed to
detect the presence of a face. Next, the detected face is recognized. Once the person is identified
the student is logged as "present" in the database.

We’ll be building our own smart attendance system in the remainder of this chapter. The
application will have multiple steps and components, each detailed below:

i. Step #1: Initialize the database (only needs to be done once)

ii. Step #2: Face enrollment (needs to be performed for each student in the class)

iii. Step #3: Train the face recognition model (needs to be performed once, and then again
if a student is ever enrolled or un-enrolled).

iv. Step #4: Take attendance (once per classroom session).

Before we start implementing these steps, let’s first review our directory structure for the
project.

6.2.2 Project Structure

Let’s go ahead and review our directory structure for the project:

|-- config
| |-- config.json
|-- database
| |-- attendance.json
|-- dataset
| |-- pyimagesearch_gurus
| |-- S1901
| | |-- 00000.png
...
| | |-- 00009.png
| |-- S1902
| |-- 00000.png
...
| |-- 00009.png
|-- output
| |-- encodings.pickle
| |-- le.pickle
| |-- recognizer.pickle
|-- pyimagesearch
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
|-- initialize_database.py
|-- enroll.py
|-- unenroll.py
|-- encode_faces.py
|-- train_model.py
|-- attendance.py

The config/ directory will store our config.json configurations for the project.

The database/ directory will store our attendance.json file which is the serialized
JSON output from TinyDB, the database we’ll be using for this project.

The dataset/ directory (not to be confused with the database/ folder) will store all ex-
ample faces of each student captured via the enroll.py script.

We’ll then train a face recognition model on these captured faces via both encode_faces.py
and train_model.py — the output of these scripts will be stored in the output/ directory.

Our pyimagesearch module is quite simplistic, requiring only our Conf class used to load
our configuration file from disk.

Before building our smart attendance system we must first initialize our database via the
initialize_database.py script.

Once we have captured example faces for each student, extracted face embeddings, and
then trained our face recognition model, we can use attendance.py to take attendance. The

attendance.py script is the final script meant to be run in the actual classroom. It takes
all of the individual components implemented in the project and combines them into the final
smart attendance system.

If a student ever needs to leave the class (such as them dropping out of the course), we can
run unenroll.py.

6.2.3 Our Configuration File

Let’s now review our config.json file in the config directory:

1 {
2 // text to speech engine language and speech rate
3 "language": "english-us",
4 "rate": 175,
5
6 // path to the dataset directory
7 "dataset_path": "../datasets/face_recognition_dataset",
8
9 // school/university code for the class
10 "class": "pyimagesearch_gurus",
11
12 // timing of the class
13 "timing": "14:05",

Lines 3 and 4 define the language and rate of speech for our Text-to-Speech (TTS) engine.
We’ll be using the pyttsx3 library in this project — if you need to change the "language"
you can refer to the documentation for your specific language value [37].
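
If you are unsure which voice identifiers your platform supports, the minimal sketch below (assuming pyttsx3 and a speech backend such as espeak are installed) lists the voices exposed by the engine so you can pick a valid "language" value:

import pyttsx3

# list the voice identifiers exposed by the local TTS backend so you can
# choose a valid value for the "language" configuration
engine = pyttsx3.init()

for voice in engine.getProperty("voices"):
    print(voice.id, voice.name)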

The "dataset_path" points to the "dataset" directory. Inside this directory we’ll store
all captured face ROIs via the enroll.py script. We’ll then later read all images from this
directory and train a face recognition model on top of the faces via encode_faces.py and
train_model.py.

Line 10 sets the "class" key which, as the name suggests, is the title of the class.
We’re calling this class "pyimagesearch_gurus".

We then have the "timing" of the class (Line 13). This value is the time of day the class
actually starts. Our attendance.py script will monitor the time and ensure that attendance
can only be taken in an N-second window once the class actually starts.

Our next set of configurations is for face detection and face recognition:

15 // number of face detections before actually saving the faces to
16 // disk
17 "n_face_detection": 10,
18
19 // number of images required per person in the dataset
20 "face_count": 10,
21
22 // maximum time limit (in seconds) to take the attendance once
23 // the class has started
24 "max_time_limit": 300,
25
26 // number of consecutive recognitions to be made to mark a person
27 // recognized
28 "consec_count": 3,

The "n_face_detection" value controls the number of subsequent frames where a face
must be detected before we save the face ROI to disk. Enforcing at least ten consecutive
frames with a face detected prior to saving the face ROI to disk helps reduce false-positive
detections.

The "face_count" parameter sets the minimum number of face examples per student.
Here we are requiring that we capture ten total face examples per student.

We then have "max_time_limit" — this value sets the maximum time limit (in seconds)
to take attendance for once the class has started. Here we have a value of 300 seconds (five
minutes). Once class starts, the students have a total of five minutes to make their way to the
classroom, verify that they are present with our smart attendance system, and take their seats
(otherwise they will be marked as “absent”).

To reduce the chance of misrecognizing a face, a given face must be recognized "consec_count" times
in a row. If our face recognizer is "flickering" between two names, this check ensures that a
person is only counted as "Adrian" after that name is recognized multiple times in succession,
even if the recognizer incorrectly reported "Abhishek" for a single frame. Again, this behavior is
used to reduce false-positive recognitions.

Our final code block handles setting file paths:

30 // path to the database


31 "db_path": "database/attendance.json",
32
33 // paths to the encodings, recognizer, and label encoder
34 "encodings_path": "output/encodings.pickle",
35 "recognizer_path": "output/recognizer.pickle",
36 "le_path": "output/le.pickle",
37
38 // dlib face detection to be used
39 "detection_method": "hog"
40 }

Line 31 defines the "db_path" which is the output path to our serialized TinyDB file.

Lines 34-36 set additional file paths:

• "encodings_path": The path to the output 128-d extracted embeddings/quantifications


for each face from the dataset/ directory.

• "recognizer_path": The path to the trained SVM from train_model.py.

• "le_path": The serialized LabelEncoder object uses to transform predictions to human-


readable labels (in this case, the names of the students).

Finally, Line 39 sets our "detection_method". We’ll be using the HOG + Linear SVM
detector from dlib via the face_recognition library. Haar cascades would be faster than
HOG + Linear SVM, but less accurate. Similarly, deep learning-based face detectors would
be much more accurate but far too slow to run in real-time (especially since we’re not only
performing face detection but face recognition as well).

If you are using a co-processor such as the Movidius NCS or Google Coral USB Accelerator
I would suggest using the deep learning face detector (as it will be more accurate), but if you’re
using just the Raspberry Pi CPU, stick with either HOG + Linear SVM or Haar cascades as
these methods are fast enough to run in (near) real-time on the RPi.
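
If you'd like to compare the detectors yourself before committing to a configuration value, the short sketch below (the image path is hypothetical) runs the HOG + Linear SVM detector via face_recognition; swapping model="hog" for model="cnn" switches to the slower, more accurate deep learning detector:

import cv2
import face_recognition

# load a test image (hypothetical path) and convert BGR -> RGB for dlib
image = cv2.imread("test_face.jpg")
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# "hog" runs in near real-time on the RPi CPU; "cnn" is more accurate but
# far slower without a coprocessor
boxes = face_recognition.face_locations(rgb, model="hog")
print("{} face(s) detected".format(len(boxes)))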

6.3 Step #1: Creating our Database

Before we can enroll faces in our system and take attendance, we first need to initialize the
database that will store information on the class (name of the class, date/time class starts,
etc.) and the students (student ID, name, etc.).

We’ll be using TinyDB [38] for all database operations. TinyDB is small, efficient, and implemented in pure Python. We’re using TinyDB for this project as it allows database operations
to “get out of the way”, ensuring we can focus on implementing the actual computer vision
algorithms rather than CRUD (Create, Read, Update, Delete) operations.

6.3.1 What is TinyDB?

TinyDB is a “tiny, document oriented database” (http://pyimg.co/eowhr) [38]. The library is
written in pure Python, is simple to use, and allows us to store any object represented as a
Python dict data type.

For example, the following code snippet loads a serialized database from disk, inserts a
record for a student, and then demonstrates how to query for that record:

>>> from tinydb import TinyDB
>>> from tinydb import Query
>>> db = TinyDB("path/to/db.json")
>>> Student = Query()
>>> db.insert({"name": "Adrian", "age": 30})
>>> db.search(Student.name == "Adrian")
[{"name": "Adrian", "age": 30}]

As you can see, TinyDB allows us to focus less on the actual database code and more on
the embedded computer vision/deep learning concepts (which is what this book is about, after
all).

If you do not already have the tinydb Python package installed on your system, you can
install it via the following command:

$ pip install tinydb

I would recommend you stick with TinyDB to build your own proof-of-concept smart attendance system. Once you’re happy with the system you can then try more advanced,
feature-rich databases including MySQL, PostgreSQL, MongoDB, Firebase, etc.

6.3.2 Our Database Schema

Internally, TinyDB represents a database as a Python dictionary. Data is stored in (nested)
Python dictionaries. When presented with a query, TinyDB scans the Python dictionary objects
and returns all items that match the query parameters.

Our database will have three top-level dictionaries:

1 {
2 "_default": {
3 "1": {
4 "class": "pyimagesearch_gurus"
5 }
6 },
7 "attendance": {
8 "2": {
9 "2019-11-13": {
10 "S1901": "08:01:15",
11 "S1903": "08:03:41",
12 "S1905": "08:04:19"
13 }
14 },
15 "1": {

16 "2019-11-14": {
17 "S1904": "08:02:22",
18 "S1902": "08:02:49",
19 "S1901": "08:04:27"
20 }
21 }
22 },
23 "student": {
24 "1": {
25 "S1901": [
26 "Adrian",
27 "enrolled"
28 ]
29 },
30 "2": {
31 "S1902": [
32 "David",
33 "enrolled"
34 ]
35 },
36 "3": {
37 "S1903": [
38 "Dave",
39 "enrolled"
40 ]
41 },
42 "4": {
43 "S1904": [
44 "Abhishek",
45 "enrolled"
46 ]
47 },
48 "5": {
49 "S1905": [
50 "Sayak",
51 "enrolled"
52 ]
53 }
54 }
55 }

The class key (Line 4) contains the name of the class where our smart attendance system
will be running. Here you can see that the name of the class, for this example, is “pyimage-
search_gurus”.

The student dictionary (Line 23) stores information for all students in the database. Each
student must have a name and a status flag used to indicate if they are enrolled or un-enrolled
in a given class. The actual student ID can be whatever you want, but I’ve chosen the format:

• S: Indicating “student”

• 19: The current year.

• 01: The first student to be registered.

The next student registered would then be S1902, and so on. You can choose to keep this
same ID structure or define your own — the actual ID is entirely arbitrary provided that the
ID is unique.
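
If you'd like to generate these IDs automatically rather than typing them by hand, a small hypothetical helper along these lines would work (it naively assumes the student table only ever grows):

from datetime import datetime

def next_student_id(studentTable):
    # build an ID such as S1901, S1902, ... from the two-digit year and a
    # running count of documents in the TinyDB "student" table
    year = datetime.now().strftime("%y")
    return "S{}{:02d}".format(year, len(studentTable) + 1)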

Finally, the attendance dictionary stores the attendance record for each session of the
class. For each session, we store both (1) the student ID for each student who attended the
class, along with (2) the timestamp of when each student was successfully recognized. By
recording both of these values we can then determine which students attended a class and
whether or not they were late for class.

Keep in mind this database schema is meant to be the bare minimum of what’s required to
build a smart attendance application. Feel free to add in additional information on the student,
including age, address, emergency contact, etc.

Secondly, we’re using TinyDB here out of simplicity. When building your own smart atten-
dance application you may wish to use another database — I’ll leave that as an exercise to
you to implement as the point of this text is to focus on computer vision algorithms rather than
database operations.

6.3.3 Implementing the Initialization Script

Our first script, initialize_database.py, is a utility script used to create our initial TinyDB
database. This script only needs to be executed once but it has to be executed before you
start enrolling faces.

Let’s take a look at the script now:

1 # import the necessary packages


2 from pyimagesearch.utils import Conf
3 from tinydb import TinyDB
4 import argparse
5
6 # construct the argument parser and parse the arguments
7 ap = argparse.ArgumentParser()
8 ap.add_argument("-c", "--conf", required=True,
9 help="Path to the input configuration file")
10 args = vars(ap.parse_args())
11
12 # load the configuration file
13 conf = Conf(args["conf"])
14
15 # initialize the database

16 db = TinyDB(conf["db_path"])
17
18 # insert the details regarding the class
19 print("[INFO] initializing the database...")
20 db.insert({"class": conf["class"]})
21 print("[INFO] database initialized...")
22
23 # close the database
24 db.close()

Lines 2-4 import our required Python packages. Line 3 imports the TinyDB class used to
interface with our database.

Our only command line argument, --conf, is parsed on Lines 7-10. We then load the
conf on Line 13 and use the "db_path" to initialize the TinyDB instance.

Once we have the db object instantiated we use the insert method to add data to the
database. Here we are adding the class name from the configuration file. We need to insert
a class so that students can be enrolled in the class in Section 6.4.2.

Finally, we close the database on Line 24 which triggers the TinyDB library to serialize the
database back out to disk as a JSON file.

6.3.4 Initializing the Database

Let’s initialize our database now.

Open up a terminal and issue the following command:

$ python initialize_database.py --conf config/config.json


[INFO] initializing the database...
[INFO] database initialized...

If you check the database/ directory you’ll now see a file named attendance.json:

$ ls database/
attendance.json

The attendance.json file is our actual TinyDB database. The TinyDB library will read,
manipulate, and save the data inside this file.
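
If you open the file in a text editor (or cat it), the freshly initialized database should look something like the following, matching the "_default" table from the schema in Section 6.3.2:

$ cat database/attendance.json
{"_default": {"1": {"class": "pyimagesearch_gurus"}}}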

6.4 Step #2: Enrolling Faces in the System

Now that we have our database initialized we can move on to face enrollment and un-enrollment.

During enrollment a student will stand in front of a camera. Our system will access the
camera, perform face detection, extract the ROI of the face and then serialize the ROI to disk.
In Section 6.5 we’ll take these ROIs, extract face embeddings, and then train an SVM on top of
the embeddings.

A teacher or school administrator can perform un-enrollment as well (Section 6.4.4). We
may want to un-enroll a student if they leave a class or when the semester ends.

6.4.1 Implementing Face Enrollment

Before we can recognize students in a classroom we first need to “enroll” them in our face
recognition system. Enrollment is a two step process:

i. Step #1: Capture faces of each individual and record them in our database (covered in
this section).

ii. Step #2: Train a machine learning model to recognize each individual (covered in Section
6.5).

The first phase of face enrollment will be accomplished via the enroll.py script. Open up
that script now and insert the following code:

1 # import the necessary packages


2 from pyimagesearch.utils import Conf
3 from imutils.video import VideoStream
4 from tinydb import TinyDB
5 from tinydb import where
6 import face_recognition
7 import argparse
8 import imutils
9 import pyttsx3
10 import time
11 import cv2
12 import os

Lines 2-12 handle importing our required Python packages. The tinydb imports on Lines
4 and 5 will interface with our database. The where function will be used to perform SQL-
like “where” clauses to search our database. Line 6 imports the face_recognition library
which will be used to facilitate face detection (in this section) and face recognition (in Section
6.6). The pyttsx3 import is our Text-to-Speech (TTS) library. We’ll be using this package
whenever we need to generate speech and play it through our speakers.

Next, let’s parse our command line arguments:

14 # construct the argument parser and parse the arguments


15 ap = argparse.ArgumentParser()
16 ap.add_argument("-i", "--id", required=True,
17 help="Unique student ID of the student")
18 ap.add_argument("-n", "--name", required=True,
19 help="Name of the student")
20 ap.add_argument("-c", "--conf", required=True,
21 help="Path to the input configuration file")
22 args = vars(ap.parse_args())

We require three command line arguments:

• --id: The unique ID of the student.

• --name: The name of the student.

• --conf: The path to our configuration file.

Let’s use TinyDB and query if a student with the given --id already exists in our database:

24 # load the configuration file


25 conf = Conf(args["conf"])
26
27 # initialize the database and student table objects
28 db = TinyDB(conf["db_path"])
29 studentTable = db.table("student")
30
31 # retrieve student details from the database
32 student = studentTable.search(where(args["id"]))

Line 25 loads our configuration file from disk.

Using this configuration we then load the TinyDB and grab a reference to the student table
(Line 29). The where method is used to search the studentTable for all records which have
the supplied --id.

If there are no existing records with the supplied ID then we know the student has not been
enrolled yet:

34 # check if an entry for the student id does *not* exist, if so, then
35 # enroll the student

36 if len(student) == 0:
37 # initialize the video stream and allow the camera sensor to warmup
38 print("[INFO] warming up camera...")
39 # vs = VideoStream(src=0).start()
40 vs = VideoStream(usePiCamera=True).start()
41 time.sleep(2.0)
42
43 # initialize the number of face detections and the total number
44 # of images saved to disk
45 faceCount = 0
46 total = 0
47
48 # initialize the text-to-speech engine, set the speech language, and
49 # the speech rate
50 ttsEngine = pyttsx3.init()
51 ttsEngine.setProperty("voice", conf["language"])
52 ttsEngine.setProperty("rate", conf["rate"])
53
54 # ask the student to stand in front of the camera
55 ttsEngine.say("{} please stand in front of the camera until you" \
56 "receive further instructions".format(args["name"]))
57 ttsEngine.runAndWait()

Line 36 makes a check to ensure that no existing students have the same --id we are
using. Provided that check passes we access our video stream and initialize two integers:

• faceCount: The number of consecutive frames with a face detected.

• total: The total number of faces saved for the current student.

Lines 50-52 initialize the TTS engine by setting the speech language and the speech rate.

We then instruct the student (via the TTS engine) to stand in front of the camera (Lines
55-57). With the student now in front of the camera we can capture faces of the individual:

59 # initialize the status as detecting


60 status = "detecting"
61
62 # create the directory to store the student's data
63 os.makedirs(os.path.join(conf["dataset_path"], conf["class"],
64 args["id"]), exist_ok=True)
65
66 # loop over the frames from the video stream
67 while True:
68 # grab the frame from the threaded video stream, resize it (so
69 # face detection will run faster), flip it horizontally, and
70 # finally clone the frame (just in case we want to write the
71 # frame to disk later)
72 frame = vs.read()

73 frame = imutils.resize(frame, width=400)


74 frame = cv2.flip(frame, 1)
75 orig = frame.copy()

Line 60 sets our current "status" to "detecting". Later this status will be updated to
"saving" once we start writing example face ROIs to disk.

Lines 63 and 64 create a subdirectory in our dataset/ directory. This subdirectory is
named based on the supplied --id.

We then start looping over frames of our video stream on Line 67. We preprocess the
frame by resizing it to have a width of 400px (for faster processing) and then horizontally
flipping it (to remove the mirror effect).

Let’s now perform face detection on the frame:

77 # convert the frame from BGR (OpenCV ordering) to dlib
78 # ordering (RGB) and detect the (x, y)-coordinates of the
79 # bounding boxes corresponding to each face in the input image
80 rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
81 boxes = face_recognition.face_locations(rgb,
82 model=conf["detection_method"])
83
84 # loop over the face detections
85 for (top, right, bottom, left) in boxes:
86 # draw the face detections on the frame
87 cv2.rectangle(frame, (left, top), (right, bottom),
88 (0, 255, 0), 2)
89
90 # check if the total number of face detections are less
91 # than the threshold, if so, then skip the iteration
92 if faceCount < conf["n_face_detection"]:
93 # increment the detected face count and set the
94 # status as detecting face
95 faceCount += 1
96 status = "detecting"
97 continue
98
99 # save the frame to correct path and increment the total
100 # number of images saved
101 p = os.path.join(conf["dataset_path"], conf["class"],
102 args["id"], "{}.png".format(str(total).zfill(5)))
103 cv2.imwrite(p, orig[top:bottom, left:right])
104 total += 1
105
106 # set the status as saving frame
107 status = "saving"

We use the face_recognition library to perform face detection using the HOG + Linear
SVM method on Lines 81 and 82.

The face_locations function returns a list of bounding boxes, one per detected face, where
each box is a 4-tuple containing the top, right, bottom, and left coordinates of the face in the image.

On Line 85 we loop over the detected boxes and use the cv2.rectangle function to
draw the bounding box of the face.

Line 92 makes a check to see if the faceCount is still below the number of required
consecutive frames with a face detected (used to reduce false-positive detections). If our
faceCount is below the threshold we increment the counter and continue looping.

Once we have reached the threshold we derive the path to the output face ROI (Lines 101
and 102) and then write the face ROI to disk (Line 103). We then increment our total face
ROI count and update the status.

We can then draw the status on the frame and visualize it on our screen:

109 # draw the status on to the frame


110 cv2.putText(frame, "Status: {}".format(status), (10, 20),
111 cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 2)
112
113 # show the output frame
114 cv2.imshow("Frame", frame)
115 cv2.waitKey(1)
116
117 # if the required number of faces are saved then break out from
118 # the loop
119 if total == conf["face_count"]:
120 # let the student know that face enrolling is over
121 ttsEngine.say("Thank you {} you are now enrolled in the {} " \
122 "class.".format(args["name"], conf["class"]))
123 ttsEngine.runAndWait()
124 break
125
126 # insert the student details into the database
127 studentTable.insert({args["id"]: [args["name"], "enrolled"]})
128
129 # print the total faces saved and do a bit of cleanup
130 print("[INFO] {} face images stored".format(total))
131 print("[INFO] cleaning up...")
132 cv2.destroyAllWindows()
133 vs.stop()
134
135 # otherwise, a entry for the student id exists
136 else:
137 # get the name of the student
138 name = student[0][args["id"]][0]
139 print("[INFO] {} has already already been enrolled...".format(
140 name))
141

142 # close the database


143 db.close()

If our total reaches the maximum number of face_count images needed to train our
face recognition model (Line 119), we use the TTS engine to tell the user enrollment for them
is now complete (Lines 121-123). The student is then inserted into the TinyDB, including the
ID, name, and enrollment status.

The else statement on Line 136 closes the if statement back on Line 36. As a reminder,
that if statement checks whether the student has not yet been enrolled — the else statement
therefore catches the case where the student is already in the database and is trying to enroll
again. If that happens we simply inform the user that they have already been enrolled and skip
any face detection and localization.

6.4.2 Enrolling Faces

To enroll faces in our database, open up a terminal and execute the following command:

$ python enroll.py --id S1902 --name David --conf config/config.json


[INFO] warming up camera...
...
[INFO] 10 face images stored
[INFO] cleaning up

Figure 6.3 (left) shows the “face detection” status. During this phase our enrollment software
is running face detection on each and every frame. Once we have reached a sufficient number
of consecutive frames detected with a face, we change the status to “saving” (right) and begin
saving face images to disk. After we reach the required number of face images an audio
message is played over the speaker and a notification is printed in the terminal.

Upon executing the script you can check the dataset/pyimagesearch_gurus/S1902/
directory and see that there are ten face examples:

$ ls dataset/pyimagesearch_gurus/S1902
00000.png 00002.png 00004.png 00006.png 00008.png
00001.png 00003.png 00005.png 00007.png 00009.png

You can repeat the process of face enrollment via enroll.py for each student that is
registered to the class.
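
For example, to enroll the remaining students from the sample database schema in Section 6.3.2, you would run commands along these lines (the names and IDs are simply the examples used in that schema):

$ python enroll.py --id S1903 --name Dave --conf config/config.json
$ python enroll.py --id S1904 --name Abhishek --conf config/config.json
$ python enroll.py --id S1905 --name Sayak --conf config/config.json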

Once you have face images for each student we can then train a face recognition model in
Section 6.5.

Figure 6.3: Step #2: Enrolling faces in our attendance system via enroll.py.

6.4.3 Implementing Face Un-enrollment

If a student decides to drop out of the class we need to un-enroll them from both our (1)
database and (2) face recognition model. To accomplish both these tasks we’ll be using the
unenroll.py script — open up that file now and insert the following code:

1 # import the necessary packages


2 from pyimagesearch.utils import Conf
3 from tinydb import TinyDB
4 from tinydb import where
5 import argparse
6 import shutil
7 import os
8
9 # construct the argument parser and parse the arguments
10 ap = argparse.ArgumentParser()
11 ap.add_argument("-i", "--id", required=True,
12 help="Unique student ID of the student")
13 ap.add_argument("-c", "--conf", required=True,
14 help="Path to the input configuration file")
15 args = vars(ap.parse_args())

Lines 2-7 import our required Python packages while Lines 10-15 parse our command line
arguments. We need two command line arguments here: --id, the ID of the student we are
un-enrolling, and --conf, the path to our configuration file.

We can then load the Conf and then access the students table:

17 # load the configuration file


18 conf = Conf(args["conf"])
19
20 # initialize the database and student table objects
21 db = TinyDB(conf["db_path"])
22 studentTable = db.table("student")
23
24 # retrieve the student document from the database, mark the student
25 # as unenrolled, and write back the document to the database
26 student = studentTable.search(where(args["id"]))
27 student[0][args["id"]][1] = "unenrolled"
28 studentTable.write_back(student)
29
30 # delete the student's data from the dataset
31 shutil.rmtree(os.path.join(conf["dataset_path"], conf["class"],
32 args["id"]))
33 print("[INFO] Please extract the embeddings and re-train the face" \
34 " recognition model...")
35
36 # close the database
37 db.close()

Line 26 searches the studentTable using the supplied --id.

Once we find the record we update the enrollment status to "unenrolled" and write the
document back to the table (Lines 27 and 28). We then delete the student's face images from
our dataset directory (Lines 31 and 32). The db is then serialized back out to disk when it is
closed on Line 37.

6.4.4 Un-enrolling Faces

Let’s go ahead and un-enroll a student (in this example, the student with ID S1901):

$ python unenroll.py --id S1901 --conf config/config.json


[INFO] Please extract the embeddings and re-train the face recognition
model...

Checking the contents of dataset/pyimagesearch_gurus/ you will no longer find an
S1901 directory — this is because the S1901 student's face images have been deleted and the
student has been marked as "unenrolled" in our database.

You can use this script whenever you need to un-enroll a student from the database, but
before you continue on to the next section, make sure you use the enroll.py script
to register at least two students (we need at least two students in the
database to train our model).

Once you have done so you can move on to training the actual face recognition component
of the smart attendance system.

6.5 Step #3: Training the Face Recognition Component

Now that we have example images for each student in the class we can move on to training
the face recognition component of the project.

6.5.1 Implementing Face Embedding Extraction

The encode_faces.py script we’ll be reviewing in this section is essentially identical to the
script covered in Chapter 5. We’ll still review the file here as a matter of completeness, but
make sure you refer to Chapter 5 for more details on how this script works.

Open up the encode_faces.py file and insert the following code:

1 # import the necessary packages


2 from pyimagesearch.utils import Conf
3 from imutils import paths
4 import face_recognition
5 import argparse
6 import pickle
7 import cv2
8 import os
9
10 # construct the argument parser and parse the arguments
11 ap = argparse.ArgumentParser()
12 ap.add_argument("-c", "--conf", required=True,
13 help="Path to the input configuration file")
14 args = vars(ap.parse_args())
15
16 # load the configuration file
17 conf = Conf(args["conf"])
18
19 # grab the paths to the input images in our dataset
20 print("[INFO] quantifying faces...")
21 imagePaths = list(paths.list_images(
22 os.path.join(conf["dataset_path"], conf["class"])))
23
24 # initialize the list of known encodings and known names
25 knownEncodings = []
26 knownNames = []

Lines 2-8 import our required Python packages. The face_recognition library, in
conjunction with dlib, will be used to quantify each of the faces in our dataset/ directory.

We then parse the path to our --conf file (Lines 11-14). The configuration itself is
loaded on Line 17.

Lines 21 and 22 grab the paths to all images inside the dataset/ directory. We then initialize
two lists, one to store the quantifications of each face followed by a second list to store the
actual names of each face (Lines 25 and 26).

Let’s loop over each of the imagePaths:

28 # loop over the image paths


29 for (i, imagePath) in enumerate(imagePaths):
30 # extract the person name from the image path
31 print("[INFO] processing image {}/{}".format(i + 1,
32 len(imagePaths)))
33 name = imagePath.split(os.path.sep)[-2]
34
35 # load the input image and convert it from BGR (OpenCV ordering)
36 # to dlib ordering (RGB)
37 image = cv2.imread(imagePath)
38 rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
39
40 # compute the facial embedding for the face
41 encodings = face_recognition.face_encodings(rgb)
42
43 # loop over the encodings
44 for encoding in encodings:
45 # add each encoding + name to our set of known names and
46 # encodings
47 knownEncodings.append(encoding)
48 knownNames.append(name)

Line 33 extracts the name of the student from the imagePath. In this case the name is the
ID of the student.

Lines 37 and 38 read our input image from disk and convert it from BGR to RGB channel
ordering (the channel ordering that the face_recognition library expects when performing
face quantification).

A call to the face_encodings method uses a neural network to compute a list of 128 float-
ing point values used to quantify the face in the image. We then update our knownEncodings
with each encoding and the knownNames list with the name of the person.

Finally, we can serialize both the knownEncodings and knownNames to disk:

50 # dump the facial encodings + names to disk


51 print("[INFO] serializing encodings...")
52 data = {"encodings": knownEncodings, "names": knownNames}
53 f = open(conf["encodings_path"], "wb")

54 f.write(pickle.dumps(data))
55 f.close()

Again, for more details on how the face encoding process works, refer to Chapter 5.

6.5.2 Extracting Face Embeddings

To quantify each student face in the dataset/ directory, open up a terminal and execute the
following command:

$ python encode_faces.py --conf config/config.json


[INFO] quantifying faces...
[INFO] processing image 1/50
[INFO] processing image 2/50
[INFO] processing image 3/50
...
[INFO] processing image 48/50
[INFO] processing image 49/50
[INFO] processing image 50/50
[INFO] serializing encodings...

It is recommended to execute this script on your laptop, desktop, or GPU-enabled machine.
It is possible to run the script on an RPi 3B+, 4B, or higher, although it is not recommended. Again,
use a separate system to compute these embeddings and train the face recognition model in
Section 6.5 — you can then take the resulting model file and transfer it back to the RPi.

6.5.3 Implementing the Training Script

Just like encode_faces.py, the train_model.py file is essentially identical to the
script in Chapter 5. I’ll review train_model.py here as well, but again, you should refer to
Chapter 5 if you require more detail on what this script is doing.

Let’s review train_model.py now:

1 # import the necessary packages


2 from pyimagesearch.utils import Conf
3 from sklearn.preprocessing import LabelEncoder
4 from sklearn.svm import SVC
5 import argparse
6 import pickle
7
8 # construct the argument parser and parse the arguments
9 ap = argparse.ArgumentParser()

10 ap.add_argument("-c", "--conf", required=True,


11 help="Path to the input configuration file")
12 args = vars(ap.parse_args())
13
14 # load the configuration file
15 conf = Conf(args["conf"])
16
17 # load the face encodings
18 print("[INFO] loading face encodings...")
19 data = pickle.loads(open(conf["encodings_path"], "rb").read())
20
21 # encode the labels
22 print("[INFO] encoding labels...")
23 le = LabelEncoder()
24 labels = le.fit_transform(data["names"])

Lines 9-12 parse the --conf switch. We then load the associated Conf file on Line 15.

Line 19 loads the serialized data from disk. The data includes both the (1) 128-d quantifi-
cations for each face, and (2) the names of each respective individual. We take the names and
then pass them through a LabelEncoder, ensuring that each name (string) is represented by
a unique integer.
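
If label encoding is new to you, the following toy example (using made-up student IDs) shows exactly what this step does:

from sklearn.preprocessing import LabelEncoder

# each unique student ID (a string) is mapped to an integer class index
le = LabelEncoder()
labels = le.fit_transform(["S1901", "S1902", "S1901", "S1903"])

print(labels)       # [0 1 0 2]
print(le.classes_)  # ['S1901' 'S1902' 'S1903']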

We can now train an SVM on the associated data and labels:

26 # train the model used to accept the 128-d encodings of the face and
27 # then produce the actual face recognition
28 print("[INFO] training model...")
29 recognizer = SVC(C=1.0, kernel="linear", probability=True)
30 recognizer.fit(data["encodings"], labels)
31
32 # write the actual face recognition model to disk
33 print("[INFO] writing the model to disk...")
34 f = open(conf["recognizer_path"], "wb")
35 f.write(pickle.dumps(recognizer))
36 f.close()
37
38 # write the label encoder to disk
39 f = open(conf["le_path"], "wb")
40 f.write(pickle.dumps(le))
41 f.close()

After training is complete we serialize both the face recognizer model and the LabelEncoder to disk.

Again, for more details on this script and how it works, make sure you refer to Chapter 5.

6.5.4 Running the Training Script

To train our face recognition model, execute the following command:

$ python train_model.py --conf config/config.json


[INFO] loading face encodings...
[INFO] encoding labels...
[INFO] training model...
[INFO] writing the model to disk...

Training should only take a few minutes. After training is complete you should have two
new files in your output/ directory, recognizer.pickle and le.pickle, in addition to
encodings.pickle from previously:

$ ls output
encodings.pickle le.pickle recognizer.pickle

The recognizer.pickle file is your actual trained SVM. The SVM model will be used to accept
the 128-d face encoding inputs and then predict the probability of each student based on the face
quantification.

We then take the prediction with the highest probability and pass it through our serialized
LabelEncoder (i.e., le.pickle) to convert the prediction to a human-readable name (i.e.,
the unique ID of the student).
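
As a minimal sketch of how these two serialized files fit together at prediction time (attendance.py performs essentially the same steps on real face embeddings), consider the following; the random vector is just a stand-in for an actual 128-d encoding:

import numpy as np
import pickle

# load the serialized SVM and label encoder produced by train_model.py
recognizer = pickle.loads(open("output/recognizer.pickle", "rb").read())
le = pickle.loads(open("output/le.pickle", "rb").read())

encoding = np.random.rand(1, 128)  # stand-in for a real 128-d face embedding
preds = recognizer.predict_proba(encoding)[0]
j = np.argmax(preds)
print("predicted: {} (probability: {:.2f})".format(le.classes_[j], preds[j]))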

6.6 Step #4: Implementing the Attendance Script

We now have all the pieces of the puzzle — it’s time to assemble them and create our smart
attendance system.

Open up the attendance.py file and insert the following code:

1 # import the necessary packages


2 from pyimagesearch.utils import Conf
3 from imutils.video import VideoStream
4 from datetime import datetime
5 from datetime import date
6 from tinydb import TinyDB
7 from tinydb import where
8 import face_recognition
9 import numpy as np
10 import argparse

11 import imutils
12 import pyttsx3
13 import pickle
14 import time
15 import cv2
16
17 # construct the argument parser and parse the arguments
18 ap = argparse.ArgumentParser()
19 ap.add_argument("-c", "--conf", required=True,
20 help="Path to the input configuration file")
21 args = vars(ap.parse_args())

On Lines 2-15 we import our required packages. Notable imports include tinydb used to
interface with our attendance.json database, face_recognition to facilitate both face
detection and face identification, and pyttsx3 used for Text-to-Speech.

Lines 18-21 parse our command line arguments.

We can now move on to loading the configuration and accessing individual tables via
TinyDB:

23 # load the configuration file


24 conf = Conf(args["conf"])
25
26 # initialize the database, student table, and attendance table
27 # objects
28 db = TinyDB(conf["db_path"])
29 studentTable = db.table("student")
30 attendanceTable = db.table("attendance")
31
32 # load the actual face recognition model along with the label encoder
33 recognizer = pickle.loads(open(conf["recognizer_path"], "rb").read())
34 le = pickle.loads(open(conf["le_path"], "rb").read())

Lines 29 and 30 grab a reference to the student and attendance tables, respectively.
We then load the trained face recognizer model and LabelEncoder on Lines 33 and 34.

Let’s access our video stream and perform a few more initializations:

36 # initialize the video stream and allow the camera sensor to warmup
37 print("[INFO] warming up camera...")
38 # vs = VideoStream(src=0).start()
39 vs = VideoStream(usePiCamera=True).start()
40 time.sleep(2.0)
41
42 # initialize previous and current person to None
43 prevPerson = None
44 curPerson = None

45
46 # initialize consecutive recognition count to 0
47 consecCount = 0
48
49 # initialize the text-to-speech engine, set the speech language, and
50 # the speech rate
51 print("[INFO] taking attendance...")
52 ttsEngine = pyttsx3.init()
53 ttsEngine.setProperty("voice", conf["language"])
54 ttsEngine.setProperty("rate", conf["rate"])
55
56 # initialize a dictionary to store the student ID and the time at
57 # which their attendance was taken
58 studentDict = {}

Lines 43 and 44 initialize two variables, prevPerson, the ID of the previous person rec-
ognized in the video stream, and curPerson, the ID of the current person identified in the
stream. In order to reduce false-positive identifications we’ll ensure that the prevPerson and
curPerson match for a total of consec_count frames (defined inside our config.json file
from Section 6.2.3).

The consecCount integer keeps track of the number of consecutive frames with the same
person identified.

Lines 52-54 initialize our ttsEngine, used to generate speech and play it through our
speakers.

We then initialize studentDict, a dictionary used to map a student ID to when their re-
spective attendance was taken.

We’re now entering the body of our script:

60 # loop over the frames from the video stream


61 while True:
62 # store the current time and calculate the time difference
63 # between the current time and the time for the class
64 currentTime = datetime.now()
65 timeDiff = (currentTime - datetime.strptime(conf["timing"],
66 "%H:%M")).seconds
67
68 # grab the next frame from the stream, resize it and flip it
69 # horizontally
70 frame = vs.read()
71 frame = imutils.resize(frame, width=400)
72 frame = cv2.flip(frame, 1)

Line 64 grabs the current time. We then take this value and compute the difference between
the current time and when class officially starts (Lines 65 and 66). We'll use this timeDiff
value to determine whether class has already started and the window to take attendance has closed.

Line 70 reads a frame from our video stream which we then preprocess by resizing to have
a width of 400px and then flipping horizontally.

Let’s check to see if the maximum time limit to take attendance has been reached:

74 # if the maximum time limit to record attendance has been crossed


75 # then skip the attendance taking procedure
76 if timeDiff > conf["max_time_limit"]:
77 # check if the student dictionary is not empty
78 if len(studentDict) != 0:
79 # insert the attendance into the database and reset the
80 # student dictionary
81 attendanceTable.insert({str(date.today()): studentDict})
82 studentDict = {}
83
84 # draw info such as class, class timing, and current time on
85 # the frame
86 cv2.putText(frame, "Class: {}".format(conf["class"]),
87 (10, 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
88 cv2.putText(frame, "Class timing: {}".format(conf["timing"]),
89 (10, 25), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
90 cv2.putText(frame, "Current time: {}".format(
91 currentTime.strftime("%H:%M:%S")), (10, 40),
92 cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
93
94 # show the frame
95 cv2.imshow("Attendance System", frame)
96 key = cv2.waitKey(1) & 0xFF
97
98 # if the `q` key was pressed, break from the loop
99 if key == ord("q"):
100 break
101
102 # skip the remaining steps since the time to take the
103 # attendance has ended
104 continue

Provided that (1) class has already started, and (2) the maximum time limit for attendance
has passed (Line 76), we make a second check on Line 78 to see if the attendance record has
already been added to our database. If we have not yet added the results of taking attendance,
we insert a new record into the database indicating that each of the students in studentDict
is in attendance. The teacher or principal can then audit the attendance results at their leisure.

Lines 86-92 draw class information on our frame, including the name of the class, when
class starts, and the current timestamp.

The remaining code blocks assume that we are still taking attendance, implying that we’re
not past the attendance time limit:

106 # convert the frame from BGR (OpenCV ordering) to dlib
107 # ordering (RGB)
108 rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
109
110 # detect the (x, y)-coordinates of the bounding boxes
111 # corresponding to each face in the input image
112 boxes = face_recognition.face_locations(rgb,
113 model=conf["detection_method"])
114
115 # loop over the face detections
116 for (top, right, bottom, left) in boxes:
117 # draw the face detections on the frame
118 cv2.rectangle(frame, (left, top), (right, bottom),
119 (0, 255, 0), 2)
120
121 # calculate the time remaining for attendance to be taken
122 timeRemaining = conf["max_time_limit"] - timeDiff
123
124 # draw info such as class, class timing, current time, and
125 # remaining attendance time on the frame
126 cv2.putText(frame, "Class: {}".format(conf["class"]), (10, 10),
127 cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
128 cv2.putText(frame, "Class timing: {}".format(conf["timing"]),
129 (10, 25), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
130 cv2.putText(frame, "Current time: {}".format(
131 currentTime.strftime("%H:%M:%S")), (10, 40),
132 cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
133 cv2.putText(frame, "Time remaining: {}s".format(timeRemaining),
134 (10, 55), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)

Line 108 swaps our frame to RGB ordering so we can perform both face detection and
recognition via dlib and face_recognition.

Lines 112 and 113 perform face detection. We then loop over each of the detected bounding
boxes and draw them on our frame.

Lines 126-134 draw class information on our screen, most importantly, the amount of time
remaining to register yourself as having “attended” the class.

Let’s now check and see if any faces were detected in the current frame:

136 # check if at least one face has been detected


137 if len(boxes) > 0:
138 # compute the facial embedding for the face
139 encodings = face_recognition.face_encodings(rgb, boxes)
140
141 # perform classification to recognize the face

142 preds = recognizer.predict_proba(encodings)[0]


143 j = np.argmax(preds)
144 curPerson = le.classes_[j]
145
146 # if the person recognized is the same as in the previous
147 # frame then increment the consecutive count
148 if prevPerson == curPerson:
149 consecCount += 1
150
151 # otherwise, these are two different people so reset the
152 # consecutive count
153 else:
154 consecCount = 0
155
156 # set current person to previous person for the next
157 # iteration
158 prevPerson = curPerson

Provided at least one face was detected in the frame, Line 139 takes all detected faces
and then extracts 128-d embeddings used to quantify each face. We take these face embed-
dings and pass them through our recognizer, finding the index of the label with the largest
corresponding probability (Lines 142-144).

Line 148 checks to see if the prevPerson prediction matches the curPerson predic-
tion, in which case we increment the consecCount. Otherwise, we do not have matching
consecutive predictions so we reset the consecCount (Lines 153 and 154).

Once consecCount reaches a suitable threshold, indicating a positive identification, we
can process the student:

160 # if a particular person is recognized for a given


161 # number of consecutive frames, we have reached a
162 # positive recognition and alert/greet the person accordingly
163 if consecCount >= conf["consec_count"]:
164 # check if the student's attendance has been already
165 # taken, if not, record the student's attendance
166 if curPerson not in studentDict.keys():
167 studentDict[curPerson] = datetime.now().strftime("%H:%M:%S")
168
169 # get the student's name from the database and let them
170 # know that their attendance has been taken
171 name = studentTable.search(where(
172 curPerson))[0][curPerson][0]
173 ttsEngine.say("{} your attendance has been taken.".format(
174 name))
175 ttsEngine.runAndWait()
176
177 # construct a label saying the student has their attendance
178 # taken and draw it on to the frame
179 label = "{}, you are now marked as present in {}".format(

180 name, conf["class"])


181 cv2.putText(frame, label, (5, 175),
182 cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 2)
183
184 # otherwise, we have not reached a positive recognition and
185 # ask the student to stand in front of the camera
186 else:
187 # construct a label asking the student to stand in front
188 # of the camera and draw it on to the frame
189 label = "Please stand in front of the camera"
190 cv2.putText(frame, label, (5, 175),
191 cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 2)

Line 163 ensures that the consecutive prediction count has been satisfied (used to reduce
false-positive identifications). We then check to see if the student’s attendance has already
been taken (Line 166), and if not, we update the studentDict to include (1) the ID of the
person, and (2) the timestamp at which attendance was taken.

Lines 171-175 lookup the name of the student via the ID and then use the TTS engine to
let the student know their attendance has been taken.

Line 186 handles when the consecCount threshold has not been met — in that case we
tell the user to stand in front of the camera until their attendance has been taken.

Our final code block displays the output frame to our screen and performs a few tidying up
operations:

193 # show the frame


194 cv2.imshow("Attendance System", frame)
195 key = cv2.waitKey(1) & 0xFF
196
197 # check if the `q` key was pressed
198 if key == ord("q"):
199 # check if the student dictionary is not empty, and if so,
200 # insert the attendance into the database
201 if len(studentDict) != 0:
202 attendanceTable.insert({str(date.today()): studentDict})
203
204 # break from the loop
205 break
206
207 # clean up
208 print("[INFO] cleaning up...")
209 vs.stop()
210 db.close()

In the event the q key is pressed, we check to see if there are any students in studentDict
that need to have their attendance recorded, and if so, we insert them into our TinyDB — after
which we break from the loop (Lines 198-205).

6.7 Smart Attendance System Results

It’s been quite a journey to get here, but we are now ready to run our smart attendance system
on the Raspberry Pi!

Open up a terminal and execute the following command:

$ python attendance.py --conf config/config.json


[INFO] warming up camera...
[INFO] taking attendance...
[INFO] cleaning up...

Figure 6.4: An example of a student enrolled in the PyImageSearch Gurus course being marked
as "present" in the TinyDB database. Face detection and face recognition have recognized this
student while sounding an audible message and printing a text-based annotation on the screen.

Figure 6.4 demonstrates our smart attendance system in action. As students enter the
classroom, attendance is taken until the time expires. Each result is saved to our TinyDB
database. The instructor can query the database at a later date to determine which students
have attended (or not attended) certain class sessions throughout the semester.
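
A minimal sketch of such an audit query is shown below; the session date is hypothetical and should be replaced with the date you wish to inspect:

from tinydb import TinyDB
from tinydb import where

db = TinyDB("database/attendance.json")
attendanceTable = db.table("attendance")

# find the session record(s) for a given (hypothetical) date and report
# which student IDs were marked present, along with their arrival times
for record in attendanceTable.search(where("2019-11-13").exists()):
    for (studentID, timestamp) in sorted(record["2019-11-13"].items()):
        print("[INFO] {} arrived at {}".format(studentID, timestamp))

db.close()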

6.8 Summary

In this chapter you learned how to implement a smart attendance application from scratch.

This system is capable of running in real-time on the Raspberry Pi, despite using more
advanced computer vision and deep learning techniques.

I know this has been a heavily requested topic on the PyImageSearch blog, particularly
among students working on their final projects before graduation, so if you use any of the
concepts/code in this chapter, please don’t forget to cite it in your final reports. You can find
citation/reference instructions here on the PyImageSearch blog: http://pyimg.co/hwovx.
Chapter 7

Building a Neighborhood Vehicle Speed Monitor

This chapter is inspired by PyImageSearch readers who have emailed me asking for speed
estimation computer vision solutions.

As pedestrians taking the dog for a walk, escorting our kids to school, or marching to our
workplace in the morning, we’ve all experienced unsafe, fast-moving vehicles operated by
inattentive drivers that nearly mow us down.

Many of us live in apartment complexes or housing neighborhoods where ignorant drivers
disregard safety and zoom by, going way too fast.

We feel almost powerless. These drivers disregard speed limits, crosswalk areas, school
zones, and “children at play” signs altogether. When there is a speed bump, they speed up
almost as if they are trying to catch some air!

Is there anything we can do?

In most cases, the answer is unfortunately “no” — we have to look out for ourselves and our
families by being careful as we walk in the neighborhoods we live in.

But what if we could catch these reckless neighborhood miscreants in action and
provide video evidence of the vehicle, speed, and time of day to local authorities?

In fact, we can.

In this tutorial, we’ll build an OpenCV project that:

i. Detects vehicles in video using a MobileNet SSD and Intel Movidius Neural Compute
Stick (NCS)

ii. Tracks the vehicles


iii. Estimates the speed of a vehicle and stores the evidence in the cloud (specifically in a
Dropbox folder).

Once in the cloud, you can provide the shareable link to anyone you choose. I sincerely
hope it will make a difference in your neighborhood.

Let’s take a ride of our own and learn how to estimate vehicle speed using a Raspberry Pi
and Movidius NCS.

7.1 Chapter Learning Objectives

In this chapter we will:

i. Review the physics formula for calculating speed

ii. Discover the VASCAR approach that police use to measure speed

iii. Understand the human component that leads to inaccurate speeds with police VASCAR
electronics

iv. Build a Python computer vision app based on object detection/tracking and use the VAS-
CAR approach to automatically determine the speed of vehicles moving though the FOV
of a single camera

v. Utilize the Movidius NCS to ensure our system runs in real-time on the RPi

vi. Tune and calibrate our system for accurate speed readings

7.2 Neighborhood Vehicle Speed Estimation

In the first part of this chapter, we’ll review the concept of VASCAR, a method for measuring
speed of moving objects using distance and timestamps. From there, we’ll review our Python
project structure and config file including key configuration settings.

We’ll then implement our computer vision app and deploy it. We’ll also review a method for
tuning your speed estimation system by adjusting one of the constants.

7.2.1 What is VASCAR and How Is It Used to Measure Speed?

Visual Average Speed Computer and Recorder (VASCAR) is a method for calculating the
speed of vehicles. It does not rely on RADAR or LIDAR, but it borrows from those acronyms.
Instead, VASCAR is a simple timing device relying on the following equation:

Figure 7.1: Visual Average Speed Computer and Recorder (VASCAR) devices calculate speed
based on Equation 7.1. A police officer must press a button each time a vehicle crosses two
waypoints relying on their eyesight and reaction time. There is potential for significant human
error. Police use VASCAR where RADAR/LIDAR is illegal or where drivers use RADAR/LIDAR
detectors. In this chapter we will build a computer vision speed measurement system based on
VASCAR that eliminates the human component. Figure credits: [39, 40]

speed = distance/time (7.1)

Police use VASCAR where RADAR and LIDAR are illegal or when they don't want to be
detected by drivers' RADAR/LIDAR detectors.

Police must know the distance between two fixed points on the road (such as signs, lines,
trees, bridges, or other reference points). When a vehicle passes the first reference point,
they press a button to start the timer. When the vehicle passes the second point, the timer
is stopped. The speed is automatically computed because the computer already knows the
distance per Equation 7.1.
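
As a toy example of the arithmetic (the distance and elapsed time below are made up), suppose the reference points are 50 feet apart and the timer reads 0.85 seconds:

# VASCAR-style calculation: speed = distance / time
distanceFeet = 50.0
elapsedSeconds = 0.85

speedFps = distanceFeet / elapsedSeconds   # feet per second
speedMph = speedFps * 3600.0 / 5280.0      # convert to miles per hour
print("{:.2f} mph".format(speedMph))       # ~40.11 mph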

Speed measured by VASCAR is obviously severely limited by the human factor. What if
the police officer has poor eyesight or a slow reaction time? If they press the button late (first
reference point) and then early (second reference point), your speed will be calculated as
faster than you are actually going since the time component is smaller. If you are ever issued a
ticket by a police officer and it says VASCAR on it, you have a very good chance of getting
out of the ticket in a courtroom. You can (and should) fight it. Be prepared with Equation 7.1
above and be ready to explain how significant the human component is.

Our project relies on a VASCAR approach, but with four reference points. We will average
the speed between all four points with a goal of having a better estimate of the speed. Our
system is also dependent upon the distance and time components.

For further reading about VASCAR, please refer to the Wikipedia article: http://pyimg.co/91s1o [41].
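To make Equation 7.1 concrete, here is a minimal sketch of the VASCAR arithmetic, assuming a hypothetical 50 meter gap between two reference points and a 2.5 second crossing time (both values are made up for illustration):

from datetime import datetime, timedelta

# hypothetical reference points measured to be 50 meters apart
distance_meters = 50.0

# hypothetical timestamps taken as the vehicle crosses each point
t_start = datetime(2019, 10, 1, 14, 30, 0)
t_end = t_start + timedelta(seconds=2.5)

# speed = distance / time (Equation 7.1), then convert units
elapsed = (t_end - t_start).total_seconds()
speed_mps = distance_meters / elapsed
speed_kmph = speed_mps * 3.6
speed_mph = speed_kmph * 0.621371

print("{:.1f} km/hr, {:.1f} MPH".format(speed_kmph, speed_mph))
# 50 m in 2.5 s is 20 m/s, i.e. 72 km/hr or roughly 44.7 MPH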

7.2.2 Project Structure

|-- config
| |-- config.json
|-- pyimagesearch
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
| |-- centroidtracker.py
| |-- trackableobject.py
|-- sample_data
| |-- cars.mp4
|-- output
| |-- log.csv
|-- MobileNetSSD_deploy.caffemodel
|-- MobileNetSSD_deploy.prototxt
|-- speed_estimation_dl.py
|-- speed_estimation_dl_video.py

Our config.json file holds all the project settings — we will review these configurations
in the next section.

We will be taking advantage of both the CentroidTracker and TrackableObject
classes in this project. The centroid tracker is identical to our previous people/vehicle counting
projects in the Hobbyist Bundle (Chapters 19 and 20) and Hacker Bundle (Chapter 13). Our
trackable object class, on the other hand, includes additional attributes that we will keep track
of including timestamps, positions, and speeds.

A sample video compilation from vehicles passing in front of Dave Hoffman’s house is in-
cluded (cars.mp4). Take note that you should not rely on video files for accurate speeds
— the FPS of the video, in addition to the speed at which frames are read from the file, will
impact speed readouts. Videos like the one provided are great for ensuring that the program
functions as intended, but again, accurate speed readings from video files are not likely.

The output/ folder will hold a log file, log.csv, which includes the timestamps and speeds
of vehicles that have passed the camera.

Our pretrained Caffe MobileNet SSD object detector files are included in the root of the
project.

The driver script, speed_estimation_dl.py, interacts with the live video stream, object
detector, and calculates the speeds of vehicles using the VASCAR approach. It is one of the
longer scripts we cover in this book.

Remark. Also included in the project is speed_estimation_dl_video.py, a script which
is capable of using the sample_data/cars.mp4 video file. Take note that this is just for
your own personal development. As you may know, OpenCV is not capable of throttling a
video automatically based on its recorded framerate. Therefore speeds for this video will be
inaccurately calculated and reported. You should perform live testing with a camera and real
traffic flow to calculate speeds and calibrate your system. Refer to Section 7.2.9 to learn about
calibration.

7.2.3 Speed Estimation Config File

Let’s review config.json, our configuration settings file:

1 {
2 // maximum consecutive frames a given object is allowed to be
3 // marked as "disappeared" until we need to deregister the object
4 // from tracking
5 "max_disappear": 10,
6
7 // maximum distance between centroids to associate an object --
8 // if the distance is larger than this maximum distance we'll
9 // start to mark the object as "disappeared"
10 "max_distance": 175,
11
12 // number of frames to perform object tracking instead of object
13 // detection
14 "track_object": 4,
15
16 // minimum confidence
17 "confidence": 0.4,
18
19 // frame width in pixels
20 "frame_width": 400,
21
22 // dictionary holding the different speed estimation columns
23 "speed_estimation_zone": {"A": 120, "B": 160, "C": 200, "D": 240},
24
25 // real world distance in meters
26 "distance": 16,
27
28 // speed limit in mph
29 "speed_limit": 15,

The "max_disappear" frame count signals to our centroid tracker when to mark an object
as disappeared (Line 5). The "max_distance" value is the maximum Euclidean distance in
pixels for which we’ll associate object centroids (Line 10) — if they exceed this distance we
mark the object as disappeared.

Our "track_object" value represents the number of frames to perform object tracking
rather than object detection (Line 14).

The "confidence" value is the probability threshold for object detection with MobileNet
SSD — objects (i.e. cars) that don’t meet the threshold are skipped to avoid false-positive
detections (Line 17).

The frame will be resized to a "frame_width" of 400 pixels (Line 20). We have four
speed estimation zones. Line 23 holds a dictionary of the frame's columns (x-coordinates)
separating the zones. These columns are dependent upon the "frame_width", so take care
while updating them.

Figure 7.2: The camera’s FOV is measured at the roadside carefully. Oftentimes calibration is
required. Refer to Section 7.2.9 to learn about the calibration procedure.

Line 26 is the most important value in this configuration. You will have to physically measure
the "distance" on the road from one side of the frame to the other side. It will be best if you
have a helper to make the measurement.

Have the helper watch the screen and tell you when you are standing at the very edge of
the frame. Put the tape down on the ground at that point. Stretch the tape to the other side of
the frame until your helper tells you that they see you at the very edge of the frame in the video
stream. Take note of the distance in meters that all your calculations will be based upon.

As shown in Figure 7.2, there are 49 feet between the edges of where cars will travel in the
frame relative to the positioning on my camera. The conversion of 49 feet to meters is 14.94
meters.

So why does Line 26 reflect "distance": 16?

The value has been tuned for system calibration. See Section 7.2.9 to learn how to test and
calibrate your system. Secondly, had the measurement been made at the center of the street
(i.e. further from the camera), the distance would have been longer. The measurement was
taken next to the street by Dave Hoffman so he would not get run over by a car!

Our speed_limit in this example is 15mph (Line 29). Vehicles traveling less than this
speed will not be logged. Vehicles exceeding this speed will be logged. If you want all speeds
to be logged, you can set the value to 0.

The remaining configuration settings are for display, Dropbox upload, and important file
paths:

31 // flag indicating if the frame must be displayed
32 "display": true,
33
34 // path to the object detection model
35 "model_path": "MobileNetSSD_deploy.caffemodel",
36
37 // path to the prototxt file of the object detection model
38 "prototxt_path": "MobileNetSSD_deploy.prototxt",
39
40 // flag used to check if dropbox is to be used and dropbox access
41 // token
42 "use_dropbox": false,
43 "dropbox_access_token": "YOUR_DROPBOX_APP_ACCESS_TOKEN",
44
45 // output directory and csv file name
46 "output_path": "output",
47 "csv_name": "log.csv"
48 }

If you set "display" to true on Line 32, an OpenCV window is displayed on your Rasp-
berry Pi desktop.

Lines 35-38 specify our Caffe model and prototxt paths.

If you elect to "use_dropbox", then you must set the value on Line 42 to true and fill
in your access token on Line 43. Images of vehicles passing the camera will be uploaded to
Dropbox. Ensure that your account has enough storage quota for these images as well.

To create/find your Dropbox API key, you can create an app on the app creation page:
http://pyimg.co/tcvd1. Once you have an app created, the API key may be generated under
the OAuth section of the app’s page on the App Console: http://pyimg.co/ynxh8. On the App
Console page, simply click the “Generate” button and copy/paste the key into the configuration
file.

Lines 46 and 47 specify the "output_path" for the log file.
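Note that standard JSON does not allow // comments; the Conf utility takes care of stripping them before parsing. Purely as an illustrative sketch (not the exact implementation), a comment-tolerant loader could be as simple as:

import json
import re

class SimpleConf:
    def __init__(self, confPath):
        # read the raw file, drop any // comments, then parse the JSON;
        # this naive regex assumes no string value itself contains "//"
        raw = open(confPath).read()
        self.conf = json.loads(re.sub(r"//.*", "", raw))

    def __getitem__(self, key):
        # allow dictionary-style access, e.g. conf["frame_width"]
        return self.conf[key]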

7.2.4 Camera Positioning and Constants

Figure 7.3: This project assumes the camera is aimed perpendicular to the road. Timestamps
of a vehicle are collected at waypoints ABCD or DCBA. From there, Equation 7.1 is put to use to
calculate 3 speeds among the 4 waypoints. Speeds are averaged together and converted to km/hr
and miles/hr. As you can see, the distance measurement is different depending on where (edges
or centerline) the tape is laid on the ground. We will account for this by calibrating our system in
Section 7.2.9.

Figure 7.3 shows an overhead view of how the project is laid out. In the case of Dave
Hoffman’s house, the RPi and camera are sitting in his road-facing window. The measurement
for the "distance" was taken at the side of the road on the far edges of the FOV lines for the
camera. Points A, B, C, and D mark the columns in a frame. They should be equally spaced
in your video frame.

Cars pass through the FOV in either direction and MobileNet SSD combined with an object
tracker assist in grabbing timestamps at points ABCD (left-to-right) or DCBA (right-to-left).

7.2.5 Centroid Tracker

Object tracking is a concept we have already covered in this book, however let’s take a moment
to review.

A simple object tracking algorithm relies on keeping track of the centroids of objects.

Typically an object tracker works hand-in-hand with a less-efficient object detector. The
object detector is responsible for localizing an object. The object tracker is responsible for
keeping track of which object is which by assigning and maintaining identification numbers
(IDs).

This object tracking algorithm we’re implementing is called centroid tracking as it relies on
the Euclidean distance between (1) existing object centroids (i.e., objects the centroid tracker
has already seen before) and (2) new object centroids between subsequent frames in a video.
The centroid tracking algorithm is a multi-step process. The five steps include:

i. Step #1: Accept bounding box coordinates and compute centroids

ii. Step #2: Compute Euclidean distance between new bounding boxes and existing objects

iii. Step #3: Update (x, y)-coordinates of existing objects

iv. Step #4: Register new objects

v. Step #5: Deregister old objects

The CentroidTracker class was covered in Chapters 19 and 20 of the Hobbyist Bundle
in addition to Chapter 13 of this volume. Please take the time now to review the class in any of
those chapters.
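As a compact refresher, the association at the heart of Step #2 can be sketched with SciPy's distance utilities. This is an illustrative, condensed version of the idea (the full CentroidTracker also handles registration, deregistration, and the max distance/disappear logic):

import numpy as np
from scipy.spatial import distance as dist

# hypothetical centroids: two existing objects and two new detections
existingCentroids = np.array([(100, 120), (300, 115)])
newCentroids = np.array([(112, 121), (290, 118)])

# pairwise Euclidean distances between existing and new centroids
D = dist.cdist(existingCentroids, newCentroids)

# greedily pair each existing object with its closest new centroid
rows = D.min(axis=1).argsort()
cols = D.argmin(axis=1)[rows]

for (row, col) in zip(rows, cols):
    print("object {} -> new centroid {}".format(row, newCentroids[col].tolist()))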

7.2.6 Trackable Object

In order to track and calculate the speed of objects in a video stream, we need an easy way to
store information regarding the object itself, including:

• Its object ID.

• Its previous centroids (so we can easily compute the direction the object is moving).

• A dictionary of timestamps corresponding to each of the four columns in our frame.

• A dictionary of x-coordinate positions of the object. These positions reflect the actual
position in which the timestamp was recorded so speed can accurately be calculated.

• A "last point boolean" serves as a flag to indicate that the object has passed the last
waypoint (i.e. column) in the frame.

• The calculated speed in MPH and KMPH. We calculate both, and the user can choose
which they prefer to use with a small modification to the driver script.

• A boolean to indicate if the speed has been estimated (i.e. calculated) yet.

• A boolean indicating if the speed has been logged in the .csv log file.

• The direction through the FOV the object is traveling (left-to-right or right-to-left).

To accomplish all of these goals we can define an instance of TrackableObject — open
up the trackableobject.py file and insert the following code:

1 # import the necessary packages
2 import numpy as np
3
4 class TrackableObject:
5 def __init__(self, objectID, centroid):
6 # store the object ID, then initialize a list of centroids
7 # using the current centroid
8 self.objectID = objectID
9 self.centroids = [centroid]
10
11 # initialize dictionaries to store the timestamp and
12 # position of the object at various points
13 self.timestamp = {"A": 0, "B": 0, "C": 0, "D": 0}
14 self.position = {"A": None, "B": None, "C": None, "D": None}
15 self.lastPoint = False
16
17 # initialize the object speeds in MPH and KMPH
18 self.speedMPH = None
19 self.speedKMPH = None
20
21 # initialize two booleans, (1) used to indicate if the
22 # object's speed has already been estimated or not, and (2)
23 # used to indicate if the object's speed has been logged or
24 # not
25 self.estimated = False
26 self.logged = False
27
28 # initialize the direction of the object
29 self.direction = None

The TrackableObject constructor accepts an objectID and centroid. The centroids
list will contain an object's centroid location history.

We will have multiple trackable objects — one for each car that is being tracked in the frame.
Each object will have the attributes shown on Lines 8-29.
7.2. NEIGHBORHOOD VEHICLE SPEED ESTIMATION 147

Lines 18 and 19 hold the speed in MPH and KMPH. We need a function to calculate the
speed, so let’s define the function now:

31 def calculate_speed(self, estimatedSpeeds):
32 # calculate the speed in KMPH and MPH
33 self.speedKMPH = np.average(estimatedSpeeds)
34 MILES_PER_ONE_KILOMETER = 0.621371
35 self.speedMPH = self.speedKMPH * MILES_PER_ONE_KILOMETER

Line 33 calculates the speedKMPH attribute as an average of the three estimatedSpeeds
between the four points (passed as a parameter to the function).

There are 0.621371 miles in one kilometer (Line 34). Knowing this, Line 35 calculates the
speedMPH attribute.
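As a quick, hypothetical illustration of the class in action (the speed values below are made up):

from pyimagesearch.trackableobject import TrackableObject

# create a trackable object with ID 1 and an initial centroid
to = TrackableObject(1, (150, 200))

# average three example per-zone speed estimates (in KMPH)
to.calculate_speed([24.0, 25.5, 24.9])

print("{:.2f} KMPH / {:.2f} MPH".format(to.speedKMPH, to.speedMPH))
# the average is 24.80 KMPH, which converts to roughly 15.41 MPH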

7.2.7 Speed Estimation with Computer Vision

Before we begin working on our driver script, let’s review our algorithm at a high level:

• Our speed formula is speed = distance / time (Equation 7.1).

• We have a known distance constant measured by a tape at the roadside. The camera
will face the road, perpendicular to the distance measurement, with an unobstructed view.

• Meters per pixel are calculated by dividing the distance constant by the frame width in
pixels (Equation 7.2).

• Distance in pixels is calculated as the difference between the centroids as they pass by
the columns for the zone (Equation 7.3). Distance in meters is then calculated for the
particular zone (Equation 7.4).

• Four timestamps (t) will be collected as the car moves through the FOV past four waypoint
columns of the video frame.

• Three pairs of the four timestamps will be used to determine three Δt values.

• We will calculate three speed values (e.g., d_ab / Δt_ab in the case of the speed between
columns A and B) for each of the pairs of timestamps and estimated distances.

• The three speed estimates will be averaged for an overall speed (Equation 7.5).

• The speed is converted and made available in the TrackableObject class as speedMPH
or speedKMPH. We will display speeds in miles per hour. Minor changes to the script
are required if you prefer to have the kilometers per hour logged and displayed — be sure
to read the remarks as you follow along in the chapter.

The following equations represent our algorithm:

meters per pixel = mpp = distance constant / frame width (7.2)

distance in pixels = p_ab = |col_B − col_A| (7.3)

distance in meters for zone ab = d_ab = p_ab × mpp (7.4)

average speed = (d_ab / Δt_ab + d_bc / Δt_bc + d_cd / Δt_cd) / 3 (7.5)
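To put some numbers behind these equations, consider a worked example using the configuration values from Section 7.2.3 (a 16 meter distance constant and a 400 pixel frame width) and a hypothetical 0.25 second crossing time per zone:

# worked example: distance = 16 m, frame_width = 400 px (from config.json),
# with a hypothetical 0.25 second crossing time for each 40 pixel zone
frame_width = 400
distance_constant = 16.0
mpp = distance_constant / frame_width      # Equation 7.2: 0.04 m/pixel

zone_pixels = 40                           # e.g. |col_B - col_A| = 160 - 120
zone_meters = zone_pixels * mpp            # Equations 7.3 and 7.4: 1.6 m

dt_hours = 0.25 / (60 * 60)                # 0.25 s expressed in hours
zone_speed_kmph = (zone_meters / 1000.0) / dt_hours

# if all three zones are crossed at the same pace, the average in
# Equation 7.5 equals the per-zone speed
print("{:.2f} KMPH, {:.2f} MPH".format(zone_speed_kmph,
    zone_speed_kmph * 0.621371))
# prints roughly 23.04 KMPH, 14.32 MPH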

Now that (1) we understand the methodology for calculating speeds of vehicles and (2)
we have defined the CentroidTracker and TrackableObject classes, let’s work on our
speed estimation driver script.

Open a new file named speed_estimation_dl.py and insert the following lines:

1 # import the necessary packages
2 from pyimagesearch.centroidtracker import CentroidTracker
3 from pyimagesearch.trackableobject import TrackableObject
4 from pyimagesearch.utils import Conf
5 from imutils.video import VideoStream
6 from imutils.io import TempFile
7 from imutils.video import FPS
8 from datetime import datetime
9 from threading import Thread
10 import numpy as np
11 import argparse
12 import dropbox
13 import imutils
14 import dlib
15 import time
16 import cv2
17 import os

Lines 2-17 handle our imports including our CentroidTracker and TrackableObject
for object tracking. The correlation tracker from Davis King’s dlib is also part of our object
tracking method. We’ll use the dropbox API to upload data to the cloud in a separate Thread
so as not to block the main thread of execution.

Let’s implement the upload_file function now:



19 def upload_file(tempFile, client, imageID):
20 # upload the image to Dropbox and cleanup the temporary image
21 print("[INFO] uploading {}...".format(imageID))
22 path = "/{}.jpg".format(imageID)
23 client.files_upload(open(tempFile.path, "rb").read(), path)
24 tempFile.cleanup()

Our upload_file function will run in one or more separate threads. This method accepts
the tempFile object, Dropbox client object, and imageID as parameters. Using these
parameters, it builds a path and then uploads the file to Dropbox (Lines 22 and 23). From
there, Line 24 then removes the temporary file from local storage.

Let’s go ahead and load our configuration:

26 # construct the argument parser and parse the arguments
27 ap = argparse.ArgumentParser()
28 ap.add_argument("-c", "--conf", required=True,
29 help="Path to the input configuration file")
30 args = vars(ap.parse_args())
31
32 # load the configuration file
33 conf = Conf(args["conf"])

Lines 27-33 parse the --conf command line argument and load the contents of the con-
figuration into the conf dictionary.

We’ll then initialize our pretrained MobileNet SSD CLASSES and Dropbox client if re-
quired:

35 # initialize the list of class labels MobileNet SSD was trained to
36 # detect
37 CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
38 "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
39 "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
40 "sofa", "train", "tvmonitor"]
41
42 # check to see if the Dropbox should be used
43 if conf["use_dropbox"]:
44 # connect to dropbox and start the session authorization process
45 client = dropbox.Dropbox(conf["dropbox_access_token"])
46 print("[SUCCESS] dropbox account linked")

And from there, we’ll load our object detector and initialize our video stream:

48 # load our serialized model from disk
49 print("[INFO] loading model...")
50 net = cv2.dnn.readNetFromCaffe(conf["prototxt_path"],
51 conf["model_path"])
52 net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)
53
54 # initialize the video stream and allow the camera sensor to warmup
55 print("[INFO] warming up camera...")
56 #vs = VideoStream(src=0).start()
57 vs = VideoStream(usePiCamera=True).start()
58 time.sleep(2.0)
59
60 # initialize the frame dimensions (we'll set them as soon as we read
61 # the first frame from the video)
62 H = None
63 W = None

Lines 50-52 load the MobileNet SSD net and set the target processor to the Movidius NCS
Myriad. Using the Movidius NCS coprocessor ensures that our FPS is high enough for
accurate speed calculations. In other words, if we have a lag between frame captures, our
timestamps can become out of sync and lead to inaccurate speed readouts.
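If you are experimenting without a Movidius NCS attached, one possible fallback (a sketch, not part of the chapter's intended setup) is to target the CPU instead; just be aware the lower FPS will degrade timestamp accuracy:

# sketch: fall back to the CPU if no Movidius NCS is available; the
# reduced FPS makes the recorded timestamps (and thus speeds) less
# reliable, so treat this as a development-only option
net = cv2.dnn.readNetFromCaffe(conf["prototxt_path"], conf["model_path"])
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)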

Lines 57-63 initialize the PiCamera video stream and frame dimensions.

We have a handful more initializations to take care of:

65 # instantiate our centroid tracker, then initialize a list to store
66 # each of our dlib correlation trackers, followed by a dictionary to
67 # map each unique object ID to a TrackableObject
68 ct = CentroidTracker(maxDisappeared=conf["max_disappear"],
69 maxDistance=conf["max_distance"])
70 trackers = []
71 trackableObjects = {}
72
73 # keep the count of total number of frames
74 totalFrames = 0
75
76 # initialize the log file
77 logFile = None
78
79 # initialize the list of various points used to calculate the avg of
80 # the vehicle speed
81 points = [("A", "B"), ("B", "C"), ("C", "D")]
82
83 # start the frames per second throughput estimator
84 fps = FPS().start()

For object tracking purposes, Lines 68-71 initialize our CentroidTracker, trackers list,
and trackableObjects dictionary.

Line 74 initializes a totalFrames counter which will be incremented each time a frame
is captured. We’ll use this value to calculate when to perform object detection versus object
tracking.

Our logFile object will be opened later on (Line 77).

Our speed will be based on the ABCD column points in our frame. Line 81 initializes a list
of pairs of points for which speeds will be calculated. Given our four points, we can calculate
three speeds and then average them.

Line 84 initializes our FPS counter.

With all of our initializations taken care of, let’s begin looping over frames:

86 # loop over the frames of the stream
87 while True:
88 # grab the next frame from the stream, store the current
89 # timestamp, and store the new date
90 frame = vs.read()
91 ts = datetime.now()
92 newDate = ts.strftime("%m-%d-%y")
93
94 # check if the frame is None, if so, break out of the loop
95 if frame is None:
96 break
97
98 # if the log file has not been created or opened
99 if logFile is None:
100 # build the log file path and create/open the log file
101 logPath = os.path.join(conf["output_path"], conf["csv_name"])
102 logFile = open(logPath, mode="a")
103
104 # set the file pointer to end of the file
105 pos = logFile.seek(0, os.SEEK_END)
106
107 # if we are using dropbox and this is an empty log file then
108 # write the column headings
109 if conf["use_dropbox"] and pos == 0:
110 logFile.write("Year,Month,Day,Time,Speed (in MPH),ImageID\n")
111
112 # otherwise, we are not using dropbox and this is an empty log
113 # file then write the column headings
114 elif pos == 0:
115 logFile.write("Year,Month,Day,Time,Speed (in MPH)\n")

Our frame processing loop begins on Line 87. We begin by grabbing a frame and taking
our first timestamp (Lines 90-92).

Lines 99-115 initialize our logFile and write the column headings. Notice that if we are
using Dropbox, one additional column is present in the CSV — the image ID.

Remark. If you prefer to log speeds in kilometers per hour, be sure to update the CSV column
headings on Line 110 and Line 115.

Let’s preprocess our frame and perform a couple initializations:

117 # resize the frame
118 frame = imutils.resize(frame, width=conf["frame_width"])
119 rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
120
121 # if the frame dimensions are empty, set them
122 if W is None or H is None:
123 (H, W) = frame.shape[:2]
124 meterPerPixel = conf["distance"] / W
125
126 # initialize our list of bounding box rectangles returned by
127 # either (1) our object detector or (2) the correlation trackers
128 rects = []

Line 118 resizes our frame to a known width directly from the "frame_width" value held
in the config file.

Remark. If you change "frame_width" in the config, be sure to update the
"speed_estimation_zone" columns as well.

Line 119 converts the frame to RGB format for dlib’s correlation tracker.

Lines 122-124 initialize the frame dimensions and calculate meterPerPixel. The meters
per pixel value helps to calculate our three estimated speeds among the four points.

Remark. If your lens introduces distortion (i.e. a wide area lens or fisheye), you should con-
sider a proper camera calibration (via intrinsic/extrinsic parameters) so that the meterPerPixel
value is more accurate.
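If you do perform such a calibration, the idea is to undistort each frame before it enters the rest of the pipeline. A minimal sketch, assuming you already have a camera matrix and distortion coefficients from a prior cv2.calibrateCamera run (the numbers below are placeholders, not real measurements), might look like this inside the frame loop:

# hypothetical intrinsics from a prior chessboard calibration; replace
# these placeholder values with your own cv2.calibrateCamera output
cameraMatrix = np.array([[600.0, 0.0, 200.0],
    [0.0, 600.0, 150.0],
    [0.0, 0.0, 1.0]])
distCoeffs = np.array([-0.25, 0.08, 0.0, 0.0, 0.0])

# undistort the frame before it is used for tracking/measurement
frame = cv2.undistort(frame, cameraMatrix, distCoeffs)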

Line 128 initializes an empty list to hold bounding box rectangles returned by either (1) our
object detector or (2) the correlation trackers.

At this point, we’re ready to perform object detection to update our trackers:

130 # check to see if we should run a more computationally expensive
131 # object detection method to aid our tracker
132 if totalFrames % conf["track_object"] == 0:
133 # initialize our new set of object trackers
134 trackers = []
135
136 # convert the frame to a blob and pass the blob through the
137 # network and obtain the detections
138 blob = cv2.dnn.blobFromImage(frame, size=(300, 300),
139 ddepth=cv2.CV_8U)
140 net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5,
141 127.5, 127.5])
142 detections = net.forward()

Object detection will only occur on multiples of "track_object" per Line 132. Perform-
ing object detection only every N frames reduces the expensive inference operations. We’ll
perform object tracking instead most of the time.

Line 134 initializes our new list of object trackers to update with accurate bounding box
rectangles so that correlation tracking can do its job later.

Lines 138-142 perform inference using the Movidius NCS.

Let’s loop over the detections and update our trackers:

144 # loop over the detections
145 for i in np.arange(0, detections.shape[2]):
146 # extract the confidence (i.e., probability) associated
147 # with the prediction
148 confidence = detections[0, 0, i, 2]
149
150 # filter out weak detections by ensuring the `confidence`
151 # is greater than the minimum confidence
152 if confidence > conf["confidence"]:
153 # extract the index of the class label from the
154 # detections list
155 idx = int(detections[0, 0, i, 1])
156
157 # if the class label is not a car, ignore it
158 if CLASSES[idx] != "car":
159 continue
160
161 # compute the (x, y)-coordinates of the bounding box
162 # for the object
163 box = detections[0, 0, i, 3:7] * np.array([W, H, W, H])
164 (startX, startY, endX, endY) = box.astype("int")
165
166 # construct a dlib rectangle object from the bounding
167 # box coordinates and then start the dlib correlation
168 # tracker
169 tracker = dlib.correlation_tracker()
170 rect = dlib.rectangle(startX, startY, endX, endY)
171 tracker.start_track(rgb, rect)
172
173 # add the tracker to our list of trackers so we can
174 # utilize it during skip frames
175 trackers.append(tracker)

Line 145 begins a loop over detections.

Lines 148-159 filter the detection based on the "confidence" threshold and CLASSES
type. We only look for the “car” class using our pretrained MobileNet SSD.

Lines 163 and 164 calculate the bounding box of an object.

We then initialize a dlib correlation tracker and begin tracking the rect ROI found by our
object detector (Lines 169-171). Line 175 adds the tracker to our trackers list.

Now let’s handle the event in which we’ll be performing object tracking rather than object
detection:

177 # otherwise, we should utilize our object *trackers* rather than
178 # object *detectors* to obtain a higher frame processing
179 # throughput
180 else:
181 # loop over the trackers
182 for tracker in trackers:
183 # update the tracker and grab the updated position
184 tracker.update(rgb)
185 pos = tracker.get_position()
186
187 # unpack the position object
188 startX = int(pos.left())
189 startY = int(pos.top())
190 endX = int(pos.right())
191 endY = int(pos.bottom())
192
193 # add the bounding box coordinates to the rectangles list
194 rects.append((startX, startY, endX, endY))
195
196 # use the centroid tracker to associate the (1) old object
197 # centroids with (2) the newly computed object centroids
198 objects = ct.update(rects)

Object tracking is less of a computational load on our RPi, so most of the time (i.e., on all
frames except every Nth "track_object" frame) we will perform tracking.

Lines 182-185 loop over the available trackers and update the position of each object.
Lines 188-194 add the bounding box coordinates of the object to the rects list.

Line 198 then updates the CentroidTracker’s objects using either the object detection
or object tracking rects.

Let’s loop over the objects now and take steps towards calculating speeds:

200 # loop over the tracked objects
201 for (objectID, centroid) in objects.items():
202 # check to see if a trackable object exists for the current
203 # object ID
204 to = trackableObjects.get(objectID, None)
205
206 # if there is no existing trackable object, create one
207 if to is None:
208 to = TrackableObject(objectID, centroid)

Each trackable object has an associated objectID. Lines 204-208 create a trackable
object (with ID) if necessary.

From here we’ll check if the speed has been estimated for this trackable object yet:

210 # otherwise, if there is a trackable object and its speed has
211 # not yet been estimated then estimate it
212 elif not to.estimated:
213 # check if the direction of the object has been set, if
214 # not, calculate it, and set it
215 if to.direction is None:
216 y = [c[0] for c in to.centroids]
217 direction = centroid[0] - np.mean(y)
218 to.direction = direction

If the speed has not been estimated (Line 212), then we first need to determine the direction
in which the object is moving (Lines 215-218). Positive direction values indicate left-to-right
movement and negative values indicate right-to-left movement. Knowing the direction is impor-
tant so that we can estimate our speed between the points properly.
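For example, if an object's previous centroid x-coordinates average 150 and its current x-coordinate is 200, the computed direction is +50 (left-to-right); a current x-coordinate of 110 would instead yield -40 (right-to-left). In code, using made-up values:

import numpy as np

# hypothetical history of the object's previous centroid x-coordinates
previous_x = [140, 148, 155, 157]

# current centroid x-coordinate
current_x = 200

# positive means left-to-right, negative means right-to-left
direction = current_x - np.mean(previous_x)
print(direction)   # 50.0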

With the direction in hand, now let’s collect our timestamps:

220 # if the direction is positive (indicating the object
221 # is moving from left to right)
222 if to.direction > 0:
223 # check to see if timestamp has been noted for
224 # point A
225 if to.timestamp["A"] == 0 :
226 # if the centroid's x-coordinate is greater than
227 # the corresponding point then set the timestamp
228 # as current timestamp and set the position as the
229 # centroid's x-coordinate
230 if centroid[0] > conf["speed_estimation_zone"]["A"]:
231 to.timestamp["A"] = ts
232 to.position["A"] = centroid[0]
233
234 # check to see if timestamp has been noted for
235 # point B
236 elif to.timestamp["B"] == 0:
237 # if the centroid's x-coordinate is greater than
238 # the corresponding point then set the timestamp
239 # as current timestamp and set the position as the
240 # centroid's x-coordinate
241 if centroid[0] > conf["speed_estimation_zone"]["B"]:
242 to.timestamp["B"] = ts
243 to.position["B"] = centroid[0]
244
245 # check to see if timestamp has been noted for
246 # point C
247 elif to.timestamp["C"] == 0:
248 # if the centroid's x-coordinate is greater than
249 # the corresponding point then set the timestamp
250 # as current timestamp and set the position as the
251 # centroid's x-coordinate
252 if centroid[0] > conf["speed_estimation_zone"]["C"]:
253 to.timestamp["C"] = ts
254 to.position["C"] = centroid[0]
255
256 # check to see if timestamp has been noted for
257 # point D
258 elif to.timestamp["D"] == 0:
259 # if the centroid's x-coordinate is greater than
260 # the corresponding point then set the timestamp
261 # as current timestamp, set the position as the
262 # centroid's x-coordinate, and set the last point
263 # flag as True
264 if centroid[0] > conf["speed_estimation_zone"]["D"]:
265 to.timestamp["D"] = ts
266 to.position["D"] = centroid[0]
267 to.lastPoint = True

Lines 222-267 collect timestamps for cars moving from left-to-right for each of our columns,
A, B, C, and D.

Let’s inspect the calculation for column A:

i. Line 225 checks to see if a timestamp has been made for point A — if not, we’ll proceed
to do so.

ii. Line 230 checks to see if the current x-coordinate centroid is greater than column A.

iii. If so, Lines 231 and 232 record a timestamp and the exact x-position of the centroid.

iv. Columns B, C, and D use the same method to collect timestamps and positions with one
exception. For column D, the lastPoint is marked as True. We’ll use this flag later to
indicate that it is time to perform our speed formula calculations.

Now let’s perform the same timestamp, position, and last point updates for right-to-left
traveling cars (i.e. direction < 0):

269 # if the direction is negative (indicating the object
270 # is moving from right to left)
271 elif to.direction < 0:
272 # check to see if timestamp has been noted for
273 # point D
274 if to.timestamp["D"] == 0 :
275 # if the centroid's x-coordinate is lesser than
276 # the corresponding point then set the timestamp
277 # as current timestamp and set the position as the
278 # centroid's x-coordinate
279 if centroid[0] < conf["speed_estimation_zone"]["D"]:
280 to.timestamp["D"] = ts
281 to.position["D"] = centroid[0]
282
283 # check to see if timestamp has been noted for
284 # point C
285 elif to.timestamp["C"] == 0:
286 # if the centroid's x-coordinate is lesser than
287 # the corresponding point then set the timestamp
288 # as current timestamp and set the position as the
289 # centroid's x-coordinate
290 if centroid[0] < conf["speed_estimation_zone"]["C"]:
291 to.timestamp["C"] = ts
292 to.position["C"] = centroid[0]
293
294 # check to see if timestamp has been noted for
295 # point B
296 elif to.timestamp["B"] == 0:
297 # if the centroid's x-coordinate is lesser than
298 # the corresponding point then set the timestamp
299 # as current timestamp and set the position as the
300 # centroid's x-coordinate
301 if centroid[0] < conf["speed_estimation_zone"]["B"]:
302 to.timestamp["B"] = ts
303 to.position["B"] = centroid[0]
304
305 # check to see if timestamp has been noted for
306 # point A
307 elif to.timestamp["A"] == 0:
308 # if the centroid's x-coordinate is lesser than
309 # the corresponding point then set the timestamp
310 # as current timestamp, set the position as the
311 # centroid's x-coordinate, and set the last point
312 # flag as True
313 if centroid[0] < conf["speed_estimation_zone"]["A"]:
314 to.timestamp["A"] = ts
315 to.position["A"] = centroid[0]
316 to.lastPoint = True

Lines 271-316 grab timestamps and positions for cars as they pass by columns D, C, B,
and A. For A the lastPoint is marked as True.

Now that a car’s lastPoint is True, we can calculate the speed:

318 # check to see if the vehicle is past the last point and
319 # the vehicle's speed has not yet been estimated, if yes,
320 # then calculate the vehicle speed and log it if it's
321 # over the limit
322 if to.lastPoint and not to.estimated:
323 # initialize the list of estimated speeds
324 estimatedSpeeds = []
325
326 # loop over all the pairs of points and estimate the
327 # vehicle speed
328 for (i, j) in points:
329 # calculate the distance in pixels
330 d = to.position[j] - to.position[i]
331 distanceInPixels = abs(d)
332
333 # check if the distance in pixels is zero, if so,
334 # skip this iteration
335 if distanceInPixels == 0:
336 continue
337
338 # calculate the time in hours
339 t = to.timestamp[j] - to.timestamp[i]
340 timeInSeconds = abs(t.total_seconds())
341 timeInHours = timeInSeconds / (60 * 60)
342
343 # calculate distance in kilometers and append the
344 # calculated speed to the list
345 distanceInMeters = distanceInPixels * meterPerPixel
346 distanceInKM = distanceInMeters / 1000
347 estimatedSpeeds.append(distanceInKM / timeInHours)
348
349 # calculate the average speed
350 to.calculate_speed(estimatedSpeeds)
351
352 # set the object as estimated
353 to.estimated = True
354 print("[INFO] Speed of the vehicle that just passed"\
355 " is: {:.2f} MPH".format(to.speedMPH))
356
357 # store the trackable object in our dictionary
358 trackableObjects[objectID] = to

When the trackable object’s (1) last point timestamp and position has been recorded, and
(2) the speed has not yet been estimated (Line 322) we’ll proceed to estimate speeds.

Line 324 initializes a list to hold three estimatedSpeeds. Let’s calculate the three esti-
mates now.

Line 328 begins a loop over our pairs of points. We calculate the distanceInPixels
using the position values (Lines 330-331). If the distance is 0, we’ll skip this pair (Lines
335 and 336).

Next we calculate the elapsed time between two points in hours (Lines 339-341). We need
the time in hours as we are calculating kilometers per hour and miles per hour.

We then calculate the distance in kilometers by multiplying the pixel distance by the esti-
mated meterPerPixel value (Lines 345 and 346). Recall that meterPerPixel is based
on (1) the width of the FOV at roadside and (2) the width of the frame.

The speed is calculated by Equation 7.1 (distance over time) and added to the estimated
Speeds list.

Line 350 makes a call to the TrackableObject class method calculate_speed to
average out our three estimatedSpeeds in both miles per hour and kilometers per hour.

Line 353 marks the speed as estimated. Lines 354 and 355 then print the speed in
the terminal.

Remark. If you prefer to print the speed in km/hr be sure to update both the string to KMPH
and the format variable to to.speedKMPH.

Line 358 stores the trackable object in the trackableObjects dictionary.

Phew! The hard part is out of the way in this script. Let’s wrap up, first by annotating the
centroid and ID on the frame:

360 # draw both the ID of the object and the centroid of the
361 # object on the output frame
362 text = "ID {}".format(objectID)
363 cv2.putText(frame, text, (centroid[0] - 10, centroid[1] - 10)
364 , cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
365 cv2.circle(frame, (centroid[0], centroid[1]), 4,
366 (0, 255, 0), -1)

A small dot is drawn on the centroid of the moving car with the ID number next to it.

Next we’ll go ahead and update our log file and store vehicle images in Dropbox (i.e., the
cloud):

368 # check if the object has not been logged


369 if not to.logged:
370 # check if the object's speed has been estimated and it
371 # is higher than the speed limit
372 if to.estimated and to.speedMPH > conf["speed_limit"]:
373 # set the current year, month, day, and time
374 year = ts.strftime("%Y")
375 month = ts.strftime("%m")
376 day = ts.strftime("%d")
377 time = ts.strftime("%H:%M:%S")
378
379 # check if dropbox is to be used to store the vehicle
380 # image
381 if conf["use_dropbox"]:
382 # initialize the image id, and the temporary file
383 imageID = ts.strftime("%H%M%S%f")
384 tempFile = TempFile()
385 cv2.imwrite(tempFile.path, frame)
386
387 # create a thread to upload the file to dropbox
388 # and start it
389 t = Thread(target=upload_file, args=(tempFile,
390 client, imageID,))
391 t.start()
392
393 # log the event in the log file
394 info = "{},{},{},{},{},{}\n".format(year, month,
395 day, time, to.speedMPH, imageID)
396 logFile.write(info)
397
398 # otherwise, we are not uploading vehicle images to
399 # dropbox
400 else:
401 # log the event in the log file
402 info = "{},{},{},{},{}\n".format(year, month,
403 day, time, to.speedMPH)
404 logFile.write(info)
405
406 # mark the object as logged
407 to.logged = True

At a minimum every vehicle that exceeds the speed limit will be logged in the CSV file.
Optionally Dropbox will be populated with images of the speeding vehicles.

Lines 369-372 check to see if the trackable object has been logged, speed estimated, and
if the car was speeding.

If so Lines 374-377 extract the year, month, day, and time from the timestamp.

If an image will be logged in Dropbox, Lines 381-391 store a temporary file and spawn a
thread to upload the file to Dropbox. Using a separate thread for a potentially time-consuming
upload is critical so that our main thread doesn’t block, impacting FPS and speed calculations.
The filename will be the imageID on Line 383 so that it can easily be found later if it is
associated in the log file.

Lines 394-404 write the CSV data to the logFile. If Dropbox is used, the imageID is the
last value.

Remark. If you prefer to log the kilometers per hour speed, simply update to.speedMPH to
to.speedKMPH on Line 395 and Line 403.

Line 407 marks the trackable object as logged.

Let’s wrap up:

409 # if the *display* flag is set, then display the current frame
410 # to the screen and record if a user presses a key
411 if conf["display"]:
412 cv2.imshow("frame", frame)
413 key = cv2.waitKey(1) & 0xFF
414
415 # if the `q` key is pressed, break from the loop
416 if key == ord("q"):
417 break
418
419 # increment the total number of frames processed thus far and
420 # then update the FPS counter
421 totalFrames += 1
422 fps.update()
423
424 # stop the timer and display FPS information
425 fps.stop()
426 print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
427 print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
428
429 # check if the log file object exists, if it does, then close it
430 if logFile is not None:
431 logFile.close()
432
433 # close any open windows
434 cv2.destroyAllWindows()
435
436 # clean up
437 print("[INFO] cleaning up...")
438 vs.stop()

Lines 411-417 display the annotated frame and look for the q keypress in which case we’ll
quit (break).

Lines 421 and 422 increment totalFrames and update our FPS counter.

When we have broken out of the frame processing loop we perform housekeeping including
printing FPS stats, closing our log file, destroying GUI windows, and stopping our video stream
(Lines 425-438).

7.2.8 Deployment and Calibration

Now that our code is implemented, we’ll deploy and test our system.

I highly recommend that you conduct a handful of controlled drive-bys and tweak the vari-
ables in the config file until you are achieving accurate speed readings. Prior to any fine-tuning
or calibration, we'll just ensure that the program is working. Be sure you have met the following
requirements prior to trying to run the application:

• Position and aim your camera perpendicular to the road as per Figure 7.3.

• Ensure your camera has a clear line of sight with limited obstructions — our object detec-
tor must be able to detect a vehicle at multiple points as it crosses through the camera’s
field of view (FOV).

• It is best if your camera is positioned far from the road. The further points A and D are
from each other at the point at which cars pass on the road, the better the distance / time
calculations will average out and produce more accurate speed readings. If your camera
is close to the road, a wide-angle lens is an option, but then you’ll need to perform camera
calibration (a future PyImageSearch blog topic).

• If you are using Dropbox functionality, ensure that your RPi has a solid WiFi, Ethernet, or
even cellular connection.

• Ensure that you have set all constants in the config file. We may elect to fine tune the
constants in the next section.

Assuming you have met each of the requirements, you are now ready to deploy your pro-
gram.

Enter the following command to start the program and begin logging speeds:

$ python speed_estimation_dl.py --conf config/config.json
[INFO] loading model...
[INFO] warming up camera...
[INFO] Speed of the vehicle that just passed is: 26.08 MPH
[INFO] Speed of the vehicle that just passed is: 22.26 MPH
[INFO] Speed of the vehicle that just passed is: 17.91 MPH
[INFO] Speed of the vehicle that just passed is: 15.73 MPH
[INFO] Speed of the vehicle that just passed is: 41.39 MPH
[INFO] Speed of the vehicle that just passed is: 35.79 MPH
[INFO] Speed of the vehicle that just passed is: 24.10 MPH
[INFO] Speed of the vehicle that just passed is: 20.46 MPH
[INFO] Speed of the vehicle that just passed is: 16.02 MPH

Figure 7.4: Deployment of our neighborhood speed system. Vehicle speeds are calculated after
they leave the viewing frame. Speeds are logged to .csv and images are stored in Dropbox.

As shown in Figure 7.4, our system is measuring speeds of vehicles traveling in both di-
rections. In the next section, we will perform drive-by tests to ensure our system is reporting
accurate speeds.

To see a video of the system in action, be sure to follow this link: http://pyimg.co/n3zu9

On occasions when multiple cars are passing through the frame at the same time, speeds
will be reported inaccurately. This can occur when our centroid tracker mixes up centroids, and
it is a known drawback of our algorithm. Solving the issue will require additional algorithm
engineering on your part as the reader. One suggestion would be to perform instance
segmentation (http://pyimg.co/lqzvq [42]) to accurately segment each vehicle.

Per the remark in Section 7.2.2, you may also execute a separate script for use with the
sample_data/cars.mp4 video file as follows:

$ python speed_estimation_dl_video.py --conf config/config.json \
--input sample_data/cars.mp4
[INFO] NOTE: When using an input video file, speeds will be inaccurate
because OpenCV can't throttle FPS according to the framerate of the
video. This script is for development purposes only.
[INFO] loading model...
[INFO] warming up camera...
[INFO] Speed of the vehicle that just passed is: 43.78 MPH

Figure 7.5: Images of vehicles exceeding the speed limit are stored in Dropbox.

[INFO] Speed of the vehicle that just passed is: 26.85 MPH
[INFO] Speed of the vehicle that just passed is: 25.04 MPH
[INFO] Speed of the vehicle that just passed is: 26.08 MPH
[INFO] Speed of the vehicle that just passed is: 28.58 MPH
[INFO] elapsed time: 22.88
[INFO] approx. FPS: 30.78
[INFO] cleaning up...

Note that those calculated and reported speeds are inaccurate since OpenCV can’t throttle
a video file per its FPS and instead processes it as fast as possible.
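If you would like the video-file script to at least approximate real-time pacing during development, one rough workaround (a sketch, and still not suitable for accurate speeds) is to sleep between frames based on the file's recorded framerate:

import time
import cv2

cap = cv2.VideoCapture("sample_data/cars.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
frame_delay = 1.0 / fps

while True:
    (grabbed, frame) = cap.read()
    if not grabbed:
        break

    # ... detection/tracking/speed logic would go here ...

    # crude throttle: sleep one frame period (ignores processing time)
    time.sleep(frame_delay)

cap.release()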

7.2.9 Calibrating for Accuracy

You may find that the system produces slightly inaccurate readouts of the vehicle speeds going
by. Do not disregard the project just yet. You can tweak the config file to get closer and closer
to accurate readings.

We used the following approach to calibrate our system until our readouts were spot-on:

• Begin recording a screencast of the RPi desktop showing both the video stream and
terminal. This screencast should record throughout testing.

• Meanwhile, record a voice memo on your smartphone throughout testing, stating your
speedometer reading on each drive-by.

Figure 7.6: Calibration involves drive-testing. This figure shows how the "distance" constant
in your config file affects the outcome of the speed calculation.

• Drive by the computer-vision-based VASCAR system in both directions at predetermined
speeds. We chose 10mph, 15mph, 20mph, and 25mph to compare our speed to the
VASCAR calculated speed. Your neighbors might think you’re weird as you drive back
and forth past your house, but just give them a nice smile!

• Sync the screencast to the audio file so that it can be played back.

• The speed +/- differences can be jotted down as you play back your video with the synced
audio file.

• With this information, tune the constants: (1) If your speed readouts are a little high, then
decrease the "distance" constant, or (2) Conversely, if your speed readouts are slightly
low, then increase the "distance" constant.

• Rinse and repeat until you are satisfied. Don’t worry, you won’t burn too much fuel in the
process.

Based on extensive testing and adjustments, Dave Hoffman and Abhishek Thanki found
that Dave needed to increase his distance constant from the original 14.94m to 16.00m. The
final testing table is shown in Figure 7.7.
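Because the computed speed scales linearly with the "distance" constant (it appears only in the numerator via Equations 7.2-7.4), a handy rule of thumb is to scale the constant by the ratio of true speed to reported speed. The snippet below uses hypothetical drive-by numbers to show the kind of correction Dave ended up making:

# hypothetical calibration: a controlled 25 mph drive-by is reported
# as roughly 23.3 mph, so we scale the taped distance accordingly
measured_distance = 14.94       # meters, as taped at the roadside
true_speed = 25.0               # mph, read from the speedometer
reported_speed = 23.3           # mph, printed by the system

calibrated_distance = measured_distance * (true_speed / reported_speed)
print("{:.2f} m".format(calibrated_distance))   # roughly 16.03 m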

Figure 7.7: Drive testing results. Click here for a high resolution version of the table:
http://pyimg.co/y7o5z

If you care to watch and listen to Dave Hoffman’s final testing and verification video, you
can click this link: http://pyimg.co/tx49e.

With a calibrated system, you’re now ready to let it run for a full day. Your system is likely
only configured for daytime use unless you have streetlights on your road.

Remark. For nighttime use (outside the scope of this chapter), you may need infrared cameras
and infrared lights and/or adjustments to your camera parameters (refer to the Hobbyist Bundle
Chapters 6, 12, and 13 for these topics).

7.3 Summary

In this chapter we built a system to monitor the speeds of moving vehicles with just a camera
and well-crafted software.

Rather than relying on expensive RADAR or LIDAR sensors, we used timestamps, a known
distance, and a simple physics equation to calculate speeds. In the police world this is known
as Visual Average Speed Computer and Recorder (VASCAR). Police rely on their eyesight
and button-pushing reaction time to collect timestamps — a method that barely holds up in court
in comparison to RADAR and LIDAR.

But of course, we are engineers so our system eliminates the human component to calcu-
late speeds automatically with computer vision. Using both object detection and object tracking
we coded a method to calculate four timestamps. We then let the math do the talking: We know
that speed equals distance over time. Three speeds were calculated among the three pairs of
points and averaged for a solid estimate.

One drawback to our automated system is that it is only as good as the key distance con-
stant.

To combat this, we measured carefully and then conducted drive-bys while looking at our
speedometer to verify operation. Adjustments to the distance constant were made if needed.

Yes, there is a human component in this verification method. If you have a cop friend that
can help you verify with their RADAR gun that would be even better.

We hope you enjoyed this chapter, and more importantly, we hope that you can apply it to
detect drivers speeding in your neighborhood.
Chapter 8

Deep Learning and Multiple RPis

In Chapter 3, we learned about message passing, ZMQ, and ImageZMQ.

ImageZMQ made throwing frames around a network dead simple. In this chapter we’ll take
it a step further by monitoring a home with deep learning and multiple Raspberry Pis.

Clients will send video frames to a central server via ImageZMQ. Our server will run an
object detector to find people and animals in the incoming frames from our clients. The results
will be displayed in a montage. You can extend this chapter to make your own security digital
video recorder.

Let’s go ahead and get started.


Remark. Portions of this chapter include republished content from my blog. Be sure to give the
original article on PyImageSearch a read and refer to the animated GIFs (http://pyimg.co/fthtd)
[12].

8.1 Chapter Learning Objectives

In this chapter we will:

i. Reinforce video streaming over a network with ImageZMQ.

ii. Use an object detector to detect objects in realtime.

iii. Build and annotate a montage of the results.

8.2 An ImageZMQ Client/Server Application for Monitoring a Home

Imagine for thirty seconds that you are fabulously rich and that you have an extravagant 10,000
sq. foot house. Your house has multiple rooms and you may even have a guest house or a
pool house. Your beloved dog (or cat) is free to roam the property inside and out. Sometimes
she sleeps in the den. Other times she’s sprawled out by the pool. Clearly your dog has a
wonderful life.

Only there’s a little problem.

Today your dog is due to go to the vet. You’ve looked in the usual places, but she’s not
turning up. The top of the hour is creeping up and you’re going to be late. Being late is the
worst feeling – you are impacting someone else’s schedule, not to mention your own.

Finally you find her, and make it to the vet appointment just five minutes late.

Your appointment was bumped to the next slot, so while you’re waiting, you think of ways so
that this won’t happen again.

Computer vision to the rescue!

What if you could put cameras around your house so that you could monitor each room
much like a security guard would?

Better yet, what if a deep learning object detector finds your dog automatically in any of the
camera’s video streams and you know exactly where she is?

We’ll implement such a system in this chapter based on:

• Client to server video streaming with ImageZMQ.

• Pretrained MobileNet SSD object detection to find people, dogs, and cats.

• And an OpenCV montage so that you can visualize all the feeds in one (or more) conve-
nient windows on a large screen.

We’ll begin this chapter by implementing our client. From there, we’ll import the server
which handles object detection and display. We’ll wrap up by analyzing our results.

By the end of the chapter, you’ll have a system you can deploy to find your dog, cat, or
partner in your home and even your vehicles outside your home. You could extend it to include
Digital Video Recording (DVR) functionality if you want to use it for a security purpose.

8.2.1 Project Structure

Our project structure is as follows:

|-- MobileNetSSD_deploy.caffemodel
|-- MobileNetSSD_deploy.prototxt
|-- client.py
|-- server.py

The first two files listed in the project are the pre-trained Caffe MobileNet SSD object de-
tection files. The server (server.py) will take advantage of these Caffe files using OpenCV’s
DNN module to perform object detection. The pre-trained Caffe model supports 20 classes –
we’ll configure it to filter for people, dogs, cats, and vehicles.

The client.py script will reside on each device, sending a video stream to the server.
Later on, we’ll upload client.py onto each of the Pis (or another machine) on your network
so they can send video frames to the central location.

8.2.2 Implementing the Client OpenCV Video Streamer

Let’s start by implementing the client which will be responsible for:

• Capturing frames from the camera (either USB or the RPi camera module).

• Sending the frames over the network via ImageZMQ.

We reviewed this script in Chapter 3, but let's review it again to reinforce the concept.

Open up the client.py file and insert the following code:

1 # import the necessary packages
2 from imutils.video import VideoStream
3 import imagezmq
4 import argparse
5 import socket
6 import time
7
8 # construct the argument parser and parse the arguments
9 ap = argparse.ArgumentParser()
10 ap.add_argument("-s", "--server-ip", required=True,
11 help="ip address of the server to which the client will connect")
12 args = vars(ap.parse_args())
13
14 # initialize the ImageSender object with the socket address of the
15 # server
16 sender = imagezmq.ImageSender(connect_to="tcp://{}:5555".format(
17 args["server_ip"]))

We start off by importing packages and modules on Lines 2-6. We’re importing imagezmq
so we can send frames from our client to our server (Line 3).

The server’s IP address (--server-ip) is the only command line argument parsed on
Lines 9-12. The socket module of Python is simply used to grab the hostname of the RPi.

Lines 16 and 17 create the ImageZMQ sender object and specify the IP address and port
of the server. The IP address will come from the command line argument that we already
established. I’ve found that port 5555 doesn’t usually have conflicts, so it is hardcoded. You
could easily turn it into a command line argument if you need to as well.

Let’s initialize our video stream and start sending frames to the server:

19 # get the host name, initialize the video stream, and allow the
20 # camera sensor to warmup
21 rpiName = socket.gethostname()
22 #vs = VideoStream(src=0).start()
23 vs = VideoStream(usePiCamera=True).start()
24 time.sleep(2.0)
25
26 # loop over frames from the camera
27 while True:
28 # read the frame from the camera and send it to the server
29 frame = vs.read()
30 sender.send_image(rpiName, frame)

Line 21 grabs the hostname, storing the value as rpiName. Refer to Section 3.5.4 to set
hostnames on your Raspberry Pis.

Our VideoStream is created via Line 22 or 23 depending on whether you are using a
PiCamera or USB camera. This is the point where you should also set your camera resolution
and other camera parameters (Hobbyist Bundle Chapters 5 and 6).

For this project, we are just going to use the default PiCamera resolution so the argu-
ment is not provided, but if you find that there is a lag, you are likely sending too many pixels.
If that is the case, you may reduce your resolution quite easily. Just pick from one of the reso-
lutions available for the PiCamera V2 here: http://pyimg.co/mo5w0 [43] (the second table is for
PiCamera V2).

Once you’ve chosen the resolution, edit Line 23 like so:

23 vs = VideoStream(usePiCamera=True, resolution=(320, 240)).start()

Remark. The resolution argument won’t make a difference for USB cameras since they are all
implemented differently. As an alternative, you can insert frame = imutils.resize(frame,
width=320) between Lines 29 and 30 to resize the frame manually (beware that doing so
will require the CPU to resize the image, thus slightly slowing down your pipeline).
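
For USB cameras, a rough sketch of that manual-resize variant of the client (assuming a camera on src=0 and a placeholder server IP) might look like this:

# sketch of the USB-camera client variant with a manual resize
# (the server IP below is a placeholder -- use your own)
from imutils.video import VideoStream
import imagezmq
import imutils
import socket
import time

serverIP = "192.168.1.5"
sender = imagezmq.ImageSender(connect_to="tcp://{}:5555".format(serverIP))
rpiName = socket.gethostname()
vs = VideoStream(src=0).start()
time.sleep(2.0)

while True:
    # read a frame, shrink it on the CPU to cut bandwidth, then send it
    frame = vs.read()
    frame = imutils.resize(frame, width=320)
    sender.send_image(rpiName, frame)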

Finally, our while loop on Lines 27-30 grabs and sends the frames.

As you can see, the client is quite simple and straightforward! Let’s move on to the server,
where the heart of this project lives.

8.2.3 Implementing the OpenCV Video Server

Figure 8.1: The Graphical User Interface (GUI) concept drawing for our ImageZMQ server that
performs object detection on frames incoming from client Raspberry Pis.

The live video server will be responsible for:

i. Accepting incoming frames from multiple clients.

ii. Applying object detection to each of the incoming frames.

iii. Maintaining an “object count” for each of the frames (i.e., count the number of objects).

iv. Displaying a montage of the processed frames in a single OpenCV window.

Figure 8.1 shows the initial GUI concept for the ImageZMQ server application we are build-
ing.

Let’s go ahead and implement the server — open up the server.py file and insert the
following code:

1 # import the necessary packages


2 from imutils import build_montages

3 from datetime import datetime


4 import numpy as np
5 import imagezmq
6 import argparse
7 import imutils
8 import cv2
9
10 # construct the argument parser and parse the arguments
11 ap = argparse.ArgumentParser()
12 ap.add_argument("-p", "--prototxt", required=True,
13 help="path to Caffe 'deploy' prototxt file")
14 ap.add_argument("-m", "--model", required=True,
15 help="path to Caffe pre-trained model")
16 ap.add_argument("-c", "--confidence", type=float, default=0.2,
17 help="minimum probability to filter weak detections")
18 ap.add_argument("-mW", "--montageW", required=True, type=int,
19 help="montage frame width")
20 ap.add_argument("-mH", "--montageH", required=True, type=int,
21 help="montage frame height")
22 args = vars(ap.parse_args())

On Lines 2-8 we import packages and libraries. Most notably, we’ll be using build_montages
to build a montage of all incoming frames. For more details on building montages with OpenCV,
refer to this PyImageSearch blog post: http://pyimg.co/vquhs [44].

We’ll use imagezmq for video streaming. OpenCV’s DNN module will be utilized to perform
inference with our pretrained Caffe object detector.

Are you wondering where imutils.video.VideoStream is? We usually use my
VideoStream class to read frames from a webcam. It isn’t actually necessary for the server –
don’t forget that we’re using imagezmq for streaming frames from clients! The server doesn’t
have a camera directly wired to it.

Let’s process five command line arguments:

• --prototxt: The path to our Caffe deep learning prototxt file.

• --model: The path to our pre-trained Caffe deep learning model. I’ve provided a pre-
trained MobileNet SSD in the project folder, but with some minor changes, you could elect
to use an alternative model.

• --confidence: Our confidence threshold to filter weak detections.

• --montageW: Number of columns for our montage (this is not the width in pixels). We’re
going to stream from four Raspberry Pis, so you could do 2x2, 4x1, or 1x4. You could
also do, for example, 3x3 for nine clients, provided you have nine client RPis.

• --montageH: The number of rows for your montage.



Let’s initialize our ImageHub object along with our deep learning object detector:

24 # initialize the ImageHub object


25 imageHub = imagezmq.ImageHub()
26
27 # initialize the list of class labels MobileNet SSD was trained to
28 # detect, then generate a set of bounding box colors for each class
29 CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
30 "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
31 "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
32 "sofa", "train", "tvmonitor"]
33
34 # load our serialized model from disk
35 print("[INFO] loading model...")
36 net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])

The ImageHub is initialized on Line 25. Now we’ll be able to receive frames from clients.

Our MobileNet SSD object CLASSES are specified on Lines 29-32. Later, we will filter
detections for only the classes we wish to consider. From there we instantiate our Caffe object
detector on Line 36.

Initializations come next:

38 # initialize the consider set (class labels we care about and want
39 # to count), the object count dictionary, and the frame dictionary
40 CONSIDER = set(["dog", "person", "car"])
41 objCount = {obj: 0 for obj in CONSIDER}
42 frameDict = {}
43
44 # initialize the dictionary which will contain information regarding
45 # when a device was last active, then store the last time the check
46 # was made as the current time
47 lastActive = {}
48 lastActiveCheck = datetime.now()
49
50 # stores the estimated number of Pis, active checking period, and
51 # calculates the duration seconds to wait before making a check to
52 # see if a device was active
53 ESTIMATED_NUM_PIS = 4
54 ACTIVE_CHECK_PERIOD = 10
55 ACTIVE_CHECK_SECONDS = ESTIMATED_NUM_PIS * ACTIVE_CHECK_PERIOD
56
57 # assign montage width and height so we can view all incoming frames
58 # in a single "dashboard"
59 mW = args["montageW"]
60 mH = args["montageH"]
61 print("[INFO] detecting: {}...".format(", ".join(obj for obj in
62 CONSIDER)))

In this chapter’s example, we’re only going to CONSIDER three types of objects from the
MobileNet SSD list of CLASSES. We’re considering (1) dogs, (2) people, and (3) cars (Line
40).

We’ll soon use this CONSIDER set to filter out other classes that we don’t care about such
as chairs, plants, monitors, or sofas, which don’t typically move and aren’t interesting for this
type of security project.

Line 41 initializes a dictionary for our object counts to be tracked in each video feed. Each
count is initialized to zero.

A separate dictionary, frameDict, is initialized on Line 42. The frameDict dictionary will
contain the hostname key and the associated latest frame value from the respective RPi.

Lines 47 and 48 are variables which help us determine when a Pi last sent a frame to the
server. If it has been a while (i.e. there is a problem, such as the RPi freezing or crashing), we
can get rid of the static, out of date image in our montage. The lastActive dictionary will
have hostname keys and timestamps for values.

Lines 53-55 are constants which help us to calculate whether a Pi is active. Line 55 itself
calculates that our check for activity will be 40 seconds. You can reduce this period of time by
adjusting ESTIMATED_NUM_PIS and ACTIVE_CHECK_PERIOD on Lines 53 and 54.

Our mW and mH variables on Lines 59 and 60 represent the number of columns and rows
for our montage. These values are pulled directly from the command line args dictionary.

Let’s loop over incoming streams from our clients and process the data:

64 # start looping over all the frames


65 while True:
66 # receive RPi name and frame from the RPi and acknowledge
67 # the receipt
68 (rpiName, frame) = imageHub.recv_image()
69 imageHub.send_reply(b'OK')
70
71 # if a device is not in the last active dictionary then it means
72 # that it's a newly connected device
73 if rpiName not in lastActive.keys():
74 print("[INFO] receiving data from {}...".format(rpiName))
75
76 # record the last active time for the device from which we just
77 # received a frame
78 lastActive[rpiName] = datetime.now()

We begin looping on Line 65.

Lines 68 and 69 grab an image from the imageHub and send an ACK message. The result
of imageHub.recv_image is rpiName (the hostname) and the video frame itself. Be sure
to refer to Chapter 3, Section 3.5.4 to learn how to set your RPi hostname.

The remainder of our while loop processes the incoming frames.

Lines 73-78 perform housekeeping duties to determine when a Raspberry Pi was lastActive.

Let’s perform inference on a given incoming frame:

80 # resize the frame to have a maximum width of 400 pixels, then


81 # grab the frame dimensions and construct a blob
82 frame = imutils.resize(frame, width=400)
83 (h, w) = frame.shape[:2]
84 blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
85 0.007843, (300, 300), 127.5)
86
87 # pass the blob through the network and obtain the detections and
88 # predictions
89 net.setInput(blob)
90 detections = net.forward()
91
92 # reset the object count for each object in the CONSIDER set
93 objCount = {obj: 0 for obj in CONSIDER}

Lines 82-90 perform object detection on the frame. First, the frame dimensions are com-
puted. Then, a blob is created from the image. The blob is then passed through the neural
net.

Remark. If you need a refresher on how the blobFromImage function works, refer to this
tutorial: http://pyimg.co/c4gws [45].
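
As a rough illustration of what those parameters do (a sketch using a hypothetical image file, not part of server.py), blobFromImage with a mean of 127.5 and a scale factor of 0.007843 (roughly 1/127.5) subtracts the mean, scales the result into approximately the [-1, 1] range, and reorders the image into NCHW format:

import cv2
import numpy as np

# hypothetical input image for illustration
image = cv2.imread("example.jpg")
resized = cv2.resize(image, (300, 300)).astype("float32")

# manual approximation of the preprocessing: mean subtraction, scaling,
# then reordering from HxWxC to 1xCxHxW
manual = (resized - 127.5) * 0.007843
manual = manual.transpose((2, 0, 1))[np.newaxis, ...]

blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)),
    0.007843, (300, 300), 127.5)
print(manual.shape, blob.shape)  # both (1, 3, 300, 300)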

On Line 93 we reset the object counts to zero (we will be populating the dictionary with
fresh count values shortly).

Let’s loop over the detections with the goal of (1) counting, and (2) drawing boxes around
objects that we are considering:

95 # loop over the detections


96 for i in np.arange(0, detections.shape[2]):
97 # extract the confidence (i.e., probability) associated with
98 # the prediction
99 confidence = detections[0, 0, i, 2]
100
101 # filter out weak detections by ensuring the confidence is
102 # greater than the minimum confidence
103 if confidence > args["confidence"]:
104 # extract the index of the class label from the
105 # detections

106 idx = int(detections[0, 0, i, 1])


107
108 # check to see if the predicted class is in the set of
109 # classes that need to be considered
110 if CLASSES[idx] in CONSIDER:
111 # increment the count of the particular object
112 # detected in the frame
113 objCount[CLASSES[idx]] += 1
114
115 # compute the (x, y)-coordinates of the bounding box
116 # for the object
117 box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
118 (startX, startY, endX, endY) = box.astype("int")
119
120 # draw the bounding box around the detected object on
121 # the frame
122 cv2.rectangle(frame, (startX, startY), (endX, endY),
123 (255, 0, 0), 2)

On Line 96 we begin looping over each of the detections. Inside the loop, we proceed to
extract the object confidence and filter out weak detections (Lines 99-103). We then grab
the label idx (Line 106) and ensure that the label is in the CONSIDER set (Line 110).

For each detection that has passed the two checks (confidence threshold and in CONSIDER),
we will (1) increment the objCount for the respective object, and (2) draw a rectangle
around the object (Lines 113-123).

Next, let’s annotate each frame with the hostname and object counts. We’ll also build a
montage to display them in:

125 # draw the sending device name on the frame


126 cv2.putText(frame, rpiName, (10, 25),
127 cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 2)
128
129 # draw the object count on the frame
130 label = ", ".join("{}: {}".format(obj, count) for (obj, count) in
131 objCount.items())
132 cv2.putText(frame, label, (10, h - 20),
133 cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255,0), 2)
134
135 # update the new frame in the frame dictionary
136 frameDict[rpiName] = frame
137
138 # build a montage using images in the frame dictionary
139 montages = build_montages(frameDict.values(), (w, h), (mW, mH))
140
141 # display the montage(s) on the screen
142 for (i, montage) in enumerate(montages):
143 cv2.imshow("Home pet location monitor ({})".format(i),
144 montage)

145
146 # detect any keypresses
147 key = cv2.waitKey(1) & 0xFF

On Lines 126-133 we make two calls to cv2.putText to draw the Raspberry Pi hostname
and object counts.

From there we update our frameDict with the frame corresponding to the RPi hostname.

Lines 139-144 create and display a montage of our client frames. The montage will be
mW frames wide and mH frames tall (there is the possibility that multiple montages of equal
dimensions will be displayed if you have more clients than available "tiles" for output in a single
montage).
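
As a quick sketch of that behavior (the client count below is hypothetical), the number of montage windows is just the number of frames divided by the tiles per montage, rounded up:

import math

# hypothetical example: five clients streaming into a 2x2 montage layout
numClients = 5
mW, mH = 2, 2

numMontages = math.ceil(numClients / (mW * mH))
print(numMontages)  # 2 -> one full 2x2 montage plus one partially filled one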

Keypresses are captured via Line 147.

The last block is responsible for checking our lastActive timestamps for each client feed
and removing frames from the montage that have stalled. Let’s see how it works:

149 # if current time *minus* last time when the active device check
150 # was made is greater than the threshold set then do a check
151 if (datetime.now() - lastActiveCheck).seconds > ACTIVE_CHECK_SECONDS:
152 # loop over all previously active devices
153 for (rpiName, ts) in list(lastActive.items()):
154 # remove the RPi from the last active and frame
155 # dictionaries if the device hasn't been active recently
156 if (datetime.now() - ts).seconds > ACTIVE_CHECK_SECONDS:
157 print("[INFO] lost connection to {}".format(rpiName))
158 lastActive.pop(rpiName)
159 frameDict.pop(rpiName)
160
161 # set the last active check time as current time
162 lastActiveCheck = datetime.now()
163
164 # if the `q` key was pressed, break from the loop
165 if key == ord("q"):
166 break
167
168 # do a bit of cleanup
169 cv2.destroyAllWindows()

There’s a lot going on in Lines 151-162. Let’s break it down.

We only perform a check if at least ACTIVE_CHECK_SECONDS have passed (Line 151).


We loop over each key-value pair in lastActive (Line 153). If the device hasn’t been active
recently (Line 156) we need to remove stale data (Lines 158 and 159). First we remove (pop)
the rpiName and timestamp from lastActive. Then the rpiName and frame are removed
from the frameDict. From there, the lastActiveCheck is updated to the current time on
Line 162.

Effectively, this implementation enables getting rid of expired frames (i.e. frames that
are no longer real-time). This is really important if you are using the ImageHub server for a
security application. Perhaps you are saving key motion events like a Digital Video Recorder
(DVR) (we covered the Key Clip Writer in Section 9.2.2 of the Hobbyist Bundle and in a 2016
PyImageSearch tutorial titled Saving key event video clips with OpenCV (http://pyimg.co/hvskf)
[46]). The worst thing that could happen if you don’t get rid of expired frames is that an intruder
kills power to a client and you don’t realize the frame isn’t updating. Think James Bond or
Jason Bourne sort of spy techniques.

Last in the loop is a check to see if the q key has been pressed — if so we break from the
loop and destroy all active montage windows (Lines 165-169).

8.2.4 Streaming Video Over Your Network with OpenCV and ImageZMQ

Now that we’ve implemented both the client and the server, let’s put them to the test.

First, let’s fire up the server:

$ python server.py --prototxt MobileNetSSD_deploy.prototxt \


--model MobileNetSSD_deploy.caffemodel --montageW 2 --montageH 2

Once your server is running, go ahead and start each client pointing to the server. On
each client, follow these steps:

i. Open an SSH connection to the client: ssh pi@<client-ip-address> (inserting your own IP
address, of course)

ii. Start screen on the client: screen

iii. Activate your environment: workon py3cv4

iv. If you are not using the book’s accompanying preconfigured Raspbian .img, then you’ll
need to install ImageZMQ: http://pyimg.co/fthtd [12]

Finally, run the client in your screen session:

$ python client.py --server-ip 192.168.1.5



As an alternative to the steps above, you may start the client on reboot (as we learned to do in
Chapter 8 of the Hobbyist Bundle).

Once frames roll in from the clients, your server will come alive! Each frame received is
passed through the MobileNet SSD, annotated, and added to the montage. Figure 8.2 shows
a screenshot of the resulting streams annotated and arranged in a montage.

Figure 8.2: Streams from multiple client Raspberry Pis, annotated with object detections by our
ImageZMQ server and arranged in a montage.

You shouldn’t observe much, if any, lag — be sure to review the earlier guidance on factors
impacting ImageZMQ performance (Section 3.5.9).

8.3 Summary

In this chapter, we learned how to stream video over a network using OpenCV and the Im-
ageZMQ library.

Instead of relying on IP cameras or FFMPEG/GStreamer, we used a Raspberry Pi to capture
input frames and ImageZMQ to stream them to a more powerful machine for additional
processing.

ImageZMQ relies on a distributed systems concept called message passing. Thanks to the
hard work of Jeff Bass, the creator of ImageZMQ (http://pyimg.co/fthtd [12]), our implementation
required only a few lines of code. ImageZMQ is easier to install, more reliable, and faster
than the alternatives.

If you are ever in a situation where you need to stream live video over a network, definitely
give ImageZMQ a try — you’ll find it super intuitive.
Chapter 9

Training a Custom Gesture Recognition Model

Our hands, gestures, and body language can communicate just as much information as the
words coming out of our mouths, if not more. Based on our posture and stance we, consciously
or unconsciously, communicate signals about our level of comfort, including whether
we are relaxed, agitated, nervous, or threatened. In fact, our bodies communicate so much
information that studies have found gait recognition (i.e., how someone walks) to be more
accurate than face recognition for person identification tasks!

In this chapter we are going to focus on recognizing hand gestures. We’ll utilize computer
vision and deep learning to build a custom hand gesture recognition system on the Raspberry
Pi. This system will be capable of running in real-time on the RPi, despite leveraging deep
learning.

You can use this project as a template when building your own security applications (e.g., re-
placing the keypad on your home security alarm with gesture recognition), accessibility projects
(e.g., enabling disabled users to more easily access a computer), or smart home applications
(e.g., replacing your remote with your hand).

9.1 Chapter Learning Objectives

In this chapter you will:

i. Learn about gesture recognition (specifically hand gesture recognition).

ii. Implement a Python script to gather hand gesture example images.

iii. Train a custom CNN to recognize hand gestures.

iv. Create a Python script to take our trained model and then recognize gestures in real-time.


9.2 Getting Started with Gesture Recognition

In this section we’ll look at a high level overview of our hand gesture recognition pipeline,
ensuring we understand the goal of the project before we start diving into code. From there
we’ll look at the directory structure for our project and then implement our configuration
file.

9.2.1 What is Gesture Recognition?

The goal of gesture recognition is to interpret human gestures via systemized algorithms. Most
gesture recognition algorithms focus on hand or face gestures; however, there is an increasing
body of work surrounding gait recognition [47, 48, 49] which can be used to identify a person
strictly by how they are walking.

In this chapter we’ll focus on hand gesture recognition, that is, recognizing specific ges-
tures such as stop, fist/close, peace, etc. (Figure 9.1, top).

Figure 9.1: Top: An example of the hand gesture system we’ll be creating. Bottom: The steps
involved in creating our hand gesture implementation.

To build our gesture recognition application we’ll need to utilize both traditional computer
vision and deep learning, the complete pipeline of which is depicted in Figure 9.1 (bottom).

First, we’ll utilize thresholding to segment the foreground hand from the background im-
age/frame. The end result is a binary image depicting which pixels belong to the hand and
which ones are the background (and thus uninteresting to our application).

Given sufficient examples of each gesture (after thresholding/binarization), we can train a
shallow CNN on these gestures. We want a shallow CNN here to ensure that (1) the CNN can
run in real-time on the RPi and (2) it does not overfit on our limited training data. The CNN can
then be used in conjunction with our hand segmentation method to recognize gestures in input
frames from a video stream.
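
To make the first step of the pipeline concrete, here is a minimal sketch of the thresholding-based segmentation (the file name and the threshold value of 75 are illustrative assumptions; the actual values used by the chapter's scripts appear later):

import cv2

# minimal sketch: segment a light hand from a dark background
# (file name and threshold value are illustrative assumptions)
roi = cv2.imread("hand_example.png")
gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)

# pixels brighter than the threshold become white (hand),
# everything else becomes black (background)
mask = cv2.threshold(gray, 75, 255, cv2.THRESH_BINARY)[1]

cv2.imshow("Binary hand mask", mask)
cv2.waitKey(0)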

In the context of this chapter, we’ll be framing gesture recognition as a home security
application.

Perhaps you are interested in replacing or augmenting your existing home alarm keypad
(i.e., the keypad where you enter your code to arm/disarm your alarm) with gesture recognition.
When you (or an intruder) enter your residence you will need to enter a “four digit” code that is
based on your hand gestures — if the gesture code is incorrect, or sufficient time passes without
a gesture being entered, the alarm will sound.

The benefit of augmenting an existing home security keypad with gesture recognition is that
most burglars know about keypad-based inputs — but gesture recognition? That’s something
new and something likely to throw them off their game (if they even recognize/know what the
screen and camera mounted on your wall is supposed to do).

9.2.2 Project Structure

Let’s analyze our directory structure for the project:

|-- assets
| |-- correct.wav
| |-- fist.png
| |-- hang_loose.png
| |-- incorrect.wav
| |-- peace.png
| |-- stop.png
|-- config
| |-- config.json
|-- output
| |-- gesture_reco.h5
| |-- lb.pickle
|-- pyimagesearch
| |-- nn
| | |-- __init__.py
| | |-- gesturenet.py
| |-- notifications
| | |-- __init__.py
| | |-- twilionotifier.py
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
|-- gather_examples.py
|-- train_model.py
|-- recognize.py

Inside the assets/ directory we store audio and image files which will be used when
building a simple user interface with OpenCV. Each of the .png files provides a visualization
for each of the hand gestures we’ll be recognizing. The .wav files provide audio for “correct”
and “incorrect” gesture inputs to our home security application.

The config/ directory stores the config.json file that we’ll be utilizing for this project
while the output/ directory will store the trained gesture recognition deep learning model along
with the label binarizer (so we can encode/decode gesture labels).

The pyimagesearch module contains three classes:

• Conf: Our configuration loader.

• TwilioNotifier: Used to send text message notifications if an incorrect gesture code is entered.

• GestureNet: Our gesture recognition CNN implementation.

The gather_examples.py script is used to access the video stream on our camera —
from there we provide examples of each gesture that we wish our CNN to recognize.

These example frames are saved to disk for the train_model.py script, which as the
name suggests, trains GestureNet on the example gestures.

Remark. If you do not wish to gather your own example gestures, I have provided a sample
of the gesture recognition dataset I gathered inside the data directory. You can use those
example images if you simply want to run the scripts and get a feel for how the project functions.

Finally, the recognize.py script puts all the pieces together and:

i. Segments the hand region from the input image

ii. Classifies the hand gesture (if any)

iii. Authenticates and checks to see if the entered gesture is correct



iv. Raises an alarm/sends a text message notification if the gesture passcode is incorrect

Additionally, the recognize.py script includes extra functionality, demonstrating how to
access the GPIO pins on the RPi and light up colored lights based on whether the input gesture
was correct or not (i.e., “green” for correct, “red” for incorrect).

9.2.3 Our Configuration File

Let’s now take a look at our config.json file:

1 {
2 // define the top-left and bottom-right coordinates of the gesture
3 // capture area
4 "top_left": [20, 10],
5 "bot_right": [220, 210],

Figure 9.2: Hand gesture recognition will be performed when the user places their hand within the
"black square" region which we have programmatically highlighted with a red rectangle.

In order to make our gesture recognition pipeline more accurate, we’ll only perform gesture
recognition within a specific ROI of the frame (Figure 9.2). Here you can see a rectangle has
been defined via Lines 4 and 5 — any hands/gestures within this rectangular region will be
captured and identified. Any hands/gestures outside of this region will be ignored.

Our foreground hands should have sufficient contrast with the background, enabling
basic image processing (i.e., thresholding) to perform the segmentation. Since I have light skin
I have chosen to use a dark background — in this manner there will be enough contrast between
my hand and the background. Conversely, if you have dark skin you should utilize a light
background, again ensuring there is enough contrast for thresholding to be applied accurately.

We implement our gesture recognition method this way to make it easier for us to segment
foreground hands from the background, as we assume we can control the background of the
ROI we are monitoring.

Next, let’s name the gestures that we’ll be identifying and enable “hot keys” that can be used
to save examples of each of these gestures to disk:

7 // create the key mappings, where a key on the keyboard maps to a


8 // gesture name -- these mappings will be used to organize training
9 // data on disk
10 "mappings": {
11 "i": "ignore",
12 "f": "fist",
13 "h": "hang_loose",
14 "p": "peace",
15 "s": "stop"
16 },
17
18 // path to where captured training data will be stored
19 "dataset_path": "../datasets/hand_gesture_dataset",

Line 10 defines the "mappings" dictionary, which maps a key on your keyboard to a par-
ticular gesture name.

Here you can see that we’ll be recognizing four gestures: "fist", "hang loose", "peace",
and "stop" — examples of these gestures can be seen in Figure 9.3.

Inside the gather_examples.py script (Section 9.3) we’ll be accessing our video stream
and providing examples of each of these gestures so we can later train our CNN to recognize
them (Section 9.4).

In order to save each of the gesture recognition examples we define “hot keys”. For exam-
ple, if we press the f key on our keyboard then the gather_examples.py script assumes
we are making a “fist” in which case the current frame is saved to disk and labeled as a “fist”.
Similarly, if we make a peace sign and then press the p key, then the frame is labeled as a
“peace sign” and saved to our hard drive.

The “ignore” class (captured via the i key) is a special case — we assume that frames with
the “ignore” label have no gestures and are instead just the background and nothing else. We
capture such a class so that we can train our CNN to “ignore” any frames that do not have any
gestures, thereby reducing false-positive gesture recognitions (i.e., the CNN thinking there is a
gesture in the frame when in reality nothing is present).

Figure 9.3: The four signs our hand gesture recognition system will recognize: fist (top-left), hang
loose (top-right), peace (bottom-left), and stop (bottom-right). You can recognize additional ges-
tures by providing training examples for each sign.

Finally, Line 19 defines the "dataset_path", the path to where each of our gesture
classes and their associated examples will be saved.

Next, let’s take a look at parameters when training our GestureNet model:

21 // define the initial learning rate, batch size, and number of


22 // epochs to train for
23 "init_lr": 1e-3,
24 "bs": 8,
25 "num_epochs": 75,
26
27 // path to the trained gesture recognition model and the label
28 // binarizer
29 "model_path": "output/gesture_reco.model",
30 "lb_path": "output/lb.pickle",

Lines 23-25 define our initial learning rate, batch size, and number of training epochs. We
then define the path to the output serialized model (after training) along with the label binarizer,
used to encode/decode class labels (Lines 29 and 30).

The following code block defines the path to our assets used to build the simple frontend
GUI along with our actual passcode:

32 // path to the assets directory


33 "assets_path": "assets",
34
35 // define the correct pass code
36 "passcode": ["peace", "stop", "fist", "hang_loose"],

The "passcode" consists of four gestures but you could modify it to use one, two, three, or
even one hundred gestures — the actual length of the list is arbitrary. What is not arbitrary is
the contents of the "passcode" list. You’ll notice that each of the entries in the "passcode"
list maps to a gesture in the "mappings" dictionary back on Line 10.

You can make your passcode whatever you want provided that every entry in "passcode"
exists in "mappings" — the GestureNet CNN will only recognize gestures it was trained
to identify.
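
If you want to guard against a typo here, a small sanity-check sketch (hypothetical, not part of the chapter's scripts, and assuming the project's Conf loader handles the comments in config.json) could be:

from pyimagesearch.utils import Conf

# hypothetical sanity check: every entry in "passcode" must be a gesture
# name that also appears in the "mappings" dictionary
conf = Conf("config/config.json")
gestures = set(conf["mappings"].values())
unknown = [g for g in conf["passcode"] if g not in gestures]

if unknown:
    print("[WARNING] passcode uses untrained gestures: {}".format(unknown))
else:
    print("[INFO] passcode only uses gestures defined in 'mappings'")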

The following two configurations handle (1) how many consecutive frames a given gesture
needs to be identified for before we consider it a positive recognition and (2) the number of sec-
onds to show a correct/incorrect message after a gesture code has been input:

38 // number of consecutive frames a gesture needs to be successfully


39 // classified until updating the gestures list
40 "consec_frames": 30,
41
42 // number of seconds to show the status message after a correct or
43 // incorrect pass code entry
44 "num_seconds": 10,

The final code block in our configuration defines the paths to our correct/incorrect audio
files along with any optional Twilio API information used to send text message notifications if
an input gesture code is incorrect:

46 // path to the audio files that will play for correct and incorrect
47 // pass codes
48 "correct_audio": "assets/correct.wav",
49 "incorrect_audio": "assets/incorrect.wav",
50
51 // variables to store your twilio account credentials
52 "twilio_sid": "YOUR_TWILIO_SID",

53 "twilio_auth": "YOUR_TWILIO_AUTH_ID",
54 "twilio_to": "YOUR_PHONE_NUMBER",
55 "twilio_from": "YOUR_TWILIO_PHONE_NUMBER",
56 "address_id": "YOUR_ADDRESS"
57 }

We are now ready to start implementing our hand gesture recognition system!

9.3 Gathering Gesture Training Examples

In order to train our GestureNet architecture we first need to gather training examples of each
hand gesture we wish to recognize. Once we have the dataset we can train the model.

9.3.1 Implementing the Dataset Gathering Script

Open up the gather_examples.py file in your project structure and insert the following code:

1 # import the necessary packages


2 from pyimagesearch.utils import Conf
3 from imutils.video import VideoStream
4 import argparse
5 import imutils
6 import time
7 import cv2
8 import os
9
10 # construct the argument parser and parse the arguments
11 ap = argparse.ArgumentParser()
12 ap.add_argument("-c", "--conf", required=True,
13 help="path to the input configuration file")
14 args = vars(ap.parse_args())
15
16 # load the configuration file
17 conf = Conf(args["conf"])
18
19 # grab the top-left and bottom-right (x, y)-coordinates for the
20 # gesture capture area
21 TOP_LEFT = tuple(conf["top_left"])
22 BOT_RIGHT = tuple(conf["bot_right"])

Lines 2-8 handle importing our required Python packages while Lines 11-14 parse our
command line arguments. We only need a single argument, --conf, the path to our input
configuration file which we load from disk on Line 17.

Lines 21 and 22 grab the top-left and bottom-right (x, y)-coordinates for our gesture recog-
nition capture area. Our hand must be placed within this region before either (1) gathering
examples of a particular gesture or (2) recognizing a gesture (which we’ll do later in the
recognize.py script).

Next, let’s map the names of each gesture to a key on our keyboard:

24 # grab the key => class label mappings from the configuration
25 MAPPINGS = conf["mappings"]
26
27 # loop over the mappings
28 for (key, label) in list(MAPPINGS.items()):
29 # update the mappings dictionary to use the ordinal value of the
30 # key (the key value will be different on varying operating
31 # systems)
32 MAPPINGS[ord(key)] = label
33 del MAPPINGS[key]
34
35 # grab the set of valid keys from the mappings dictionary
36 validKeys = set(MAPPINGS.keys())
37
38 # initialize the counter dictionary used to count the number of times
39 # each key has been pressed
40 keyCounter = {}

Line 25 grabs our MAPPINGS from the configuration. The MAPPINGS variable is a dictionary
with the keys being a given letter on our keyboard and the value being the name of the gesture.
For example, the s key on our keyboard maps to the stop gesture (as defined in Section 9.2.3
above).

However, we have a bit of extra work to do. We’ll be using the cv2.waitKey function to
capture keypresses — this function requires that we have the ordinal value of the key rather
than the string value. Therefore, we must:

i. Loop over all mappings on Line 28.

ii. Update the MAPPINGS dictionary to use the ord value of the key (Line 32).

iii. Delete the original string key value from the dictionary (Line 33).

Given this updated dictionary we then grab the set of validKeys that can be pressed
when gathering training examples (Line 36).
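
As a tiny illustration of why the conversion is needed (the values below are hypothetical), cv2.waitKey returns an integer key code, so the lookup must be performed on ordinals rather than single-character strings:

# illustration only: convert string keys to their ordinal values so they
# match what cv2.waitKey returns
MAPPINGS = {"s": "stop", "f": "fist"}            # hypothetical subset
MAPPINGS = {ord(k): v for (k, v) in MAPPINGS.items()}

key = ord("s")                                   # what waitKey would return
print(MAPPINGS[key])                             # "stop"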

Let’s now move on to accessing our video stream:

38 # initialize the counter dictionary used to count the number of times


39 # each key has been pressed
40 keyCounter = {}

41
42 # start the video stream thread
43 print("[INFO] starting video stream thread...")
44 vs = VideoStream(src=0).start()
45 # vs = VideoStream(usePiCamera=True).start()
46 time.sleep(2.0)
47
48 # loop over frames from the video stream
49 while True:
50 # grab the frame from the threaded video file stream
51 frame = vs.read()
52
53 # resize the frame and then flip it horizontally
54 frame = imutils.resize(frame, width=500)
55 frame = cv2.flip(frame, 1)

Line 40 initializes a counter dictionary to count the number of times a given key on the
keyboard has been pressed (thus counting the total number of gathered training examples per
class).

Lines 44-46 access our video stream.

We start looping over frames from the video stream on Line 49. We preprocess the frame
by (1) reducing the frame size and then (2) flipping the frame horizontally. We flip the frame
horizontally since our frame is mirrored to us. Flipping “un-mirrors” the frame.

Next, we can extract the gesture capture roi from the frame:

57 # extract the ROI from the frame, convert it to grayscale,


58 # and threshold the ROI to obtain a binary mask where the
59 # foreground (white) is the hand area and the background (black)
60 # should be ignored
61 roi = frame[TOP_LEFT[1]:BOT_RIGHT[1], TOP_LEFT[0]:BOT_RIGHT[0]]
62 roi = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
63 roi = cv2.threshold(roi, 75, 255, cv2.THRESH_BINARY)[1]
64
65 # clone the original frame and then draw the gesture capture area
66 clone = frame.copy()
67 cv2.rectangle(clone, TOP_LEFT, BOT_RIGHT, (0, 0, 255), 2)
68
69 # show the output frame and ROI, and then record if a user presses
70 # a key
71 cv2.imshow("Frame", clone)
72 cv2.imshow("ROI", roi)
73 key = cv2.waitKey(1) & 0xFF

Line 61 uses NumPy array slicing and the supplied gesture capture (x, y)-coordinates to
extract the roi. We then convert the roi to grayscale and threshold it, leaving us with a binary
representation of the image. Ideally, our hand will show up as foreground (white) pixels while
the background remains dark (black).
Remark. You may need to tune the parameters to cv2.threshold to obtain an adequate
segmentation of the hand. See Section 9.5 for more details on how this method can be made
more robust.
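
One option worth experimenting with (an alternative sketch, not what this chapter's script uses) is Otsu's method, which selects the threshold automatically from the ROI's grayscale histogram instead of relying on the hardcoded value of 75:

# alternative to Lines 62 and 63: let Otsu's method pick the threshold
# automatically (assumes `roi` holds the color ROI from Line 61)
roi = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
(T, roi) = cv2.threshold(roi, 0, 255,
    cv2.THRESH_BINARY | cv2.THRESH_OTSU)
print("[INFO] Otsu's threshold: {:.2f}".format(T))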

Line 66 clones the frame so we can draw on it, while Line 67 draws the gesture capture
region. We then display the frame and roi on our screen on Lines 71 and 72.

Line 73 checks to see if any keys are pressed. If a key is pressed, we need to check and
see which one:

75 # if the `q` key was pressed, break from the loop


76 if key == ord("q"):
77 break
78
79 # otherwise, check to see if a key was pressed that we are
80 # interested in capturing
81 elif key in validKeys:
82 # construct the path to the label subdirectory
83 p = os.path.sep.join([conf["dataset_path"], MAPPINGS[key]])
84
85 # if the label subdirectory does not already exist, create it
86 if not os.path.exists(p):
87 os.mkdir(p)
88
89 # construct the path to the output image
90 p = os.path.sep.join([p, "{}.png".format(
91 keyCounter.get(key, 0))])
92 keyCounter[key] = keyCounter.get(key, 0) + 1
93
94 # save the ROI to disk
95 print("[INFO] saving ROI: {}".format(p))
96 cv2.imwrite(p, roi)

If the q key is pressed then we have finished gathering gesture examples and can safely
exit the script.

Otherwise, we check to see if the key pressed exists inside our validKeys set (Line 81). If
so, we construct the path to the output label subdirectory (Line 83), create the output directory
if necessary (Lines 86 and 87), and finally save the output roi to disk (Lines 90-96).

9.3.2 Running the Dataset Gathering Script

To gather your own gesture recognition dataset, open up a terminal and execute the following
command:

$ python gather_examples.py --conf config/config.json


[INFO] starting video stream thread...
[INFO] saving ROI: ../datasets/hand_gesture_data/fist/0.png
[INFO] saving ROI: ../datasets/hand_gesture_data/hang_loose/0.png
[INFO] saving ROI: ../datasets/hand_gesture_data/fist/1.png
[INFO] saving ROI: ../datasets/hand_gesture_data/stop/0.png
[INFO] saving ROI: ../datasets/hand_gesture_data/peace/0.png
[INFO] saving ROI: ../datasets/hand_gesture_data/ignore/0.png
...

Figure 9.4: Left: An example of the cv2.threshold operation used to binarize my hand as a
white foreground on a black background. This example "hang loose" gesture will be logged to disk
once I press the h key on my keyboard. We can then train a CNN to recognize the gesture. Right:
Just like we need examples for each gesture we want to recognize, we also need examples of "no
gestures", ensuring that our model knows the difference between a person performing a sign and
when to simply ignore the frame as it has no semantic content.

Here you can see that I am making a “hang loose” sign (Figure 9.4, left). I then press the h
key on my keyboard, which saves the hang loose example to the hand_gesture_data/hang_loose/
directory.

Figure 9.4 (right) shows an example of the “ignore” class. Note how there are no gestures
in the gesture capture region — this is done purposely so that we can train our CNN to recog-
nize the lack of a gesture in the frame (otherwise our CNN may report false-positive gesture
classifications). To capture the “ignore” class I press the i key on my keyboard.

Examining the output of the hand_gesture_data/ directory you can see I have gathered
approximately 100 examples per class:

$ tree ../datasets/hand_gesture_data/ --dirsfirst --filelimit 50


data/
|-- fist [101 entries]
|-- hang_loose [101 entries]
|-- ignore [101 entries]
|-- peace [101 entries]
|-- stop [101 entries]

We’ll train a CNN to recognize each of these gestures in the following section.

Remark. I have included my own example gesture dataset in the downloads associated with
this book. Feel free to use this dataset to continue following along or stop now to gather your
own dataset.

9.4 Gesture Recognition with Deep Learning

In the first part of this section we’ll implement a CNN architecture to recognize gestures based
on the example dataset we gathered in the previous section. We’ll then train the CNN and
examine the results.

9.4.1 Implementing the GestureNet CNN Architecture

The CNN we’ll be implementing in this chapter is called GestureNet — I created this archi-
tecture specifically for the task of recognizing gestures.

The architecture has both AlexNet and VGGNet-like characteristics, including (1) a larger
CONV kernel size in the very first layer of the network (AlexNet-like [2]), and (2) 3x3 filter sizes
throughout the rest of the network (VGGNet-like [3]).

Since we are working with binary blobs as our training data, the larger filter sizes can capture
more “structural information”, such as the size and shape of the binary blob, before switching
to more "standard" 3x3 convolutions. As we’ll see in Section 9.4.3, this combination of a large
filter size early in the network followed by smaller filter sizes leads to an accurate hand gesture
recognition model.

Let’s go ahead and implement GestureNet now:

1 # import the necessary packages


2 from tensorflow.keras.models import Sequential
3 from tensorflow.keras.layers import BatchNormalization
4 from tensorflow.keras.layers import Conv2D
5 from tensorflow.keras.layers import MaxPooling2D
6 from tensorflow.keras.layers import Activation
7 from tensorflow.keras.layers import Flatten
8 from tensorflow.keras.layers import Dropout
9 from tensorflow.keras.layers import Dense
10 from tensorflow.keras import backend as K
11
12 class GestureNet:
13 @staticmethod
14 def build(width, height, depth, classes):
15 # initialize the model along with the input shape to be
16 # "channels last" and the channels dimension itself
17 model = Sequential()
18 inputShape = (height, width, depth)

19 chanDim = -1
20
21 # if we are using "channels first", update the input shape
22 # and channels dimension
23 if K.image_data_format() == "channels_first":
24 inputShape = (depth, height, width)
25 chanDim = 1

Lines 1-10 import our required Python packages while Line 14 defines the build method
used to construct our CNN architecture. The build method requires four parameters:

• width: The width (in pixels) of the input images in our dataset.

• height: The height of the images in our dataset.

• depth: The number of channels in the image (3 for RGB images, 1 for grayscale/single
channel images).

• classes: The total number of class labels in our dataset (4 gestures to recognize plus
an “ignore” class, so 5 classes total).

Line 17 initializes the model while Lines 18-25 initialize the inputShape and channel
dimension based on whether we are using channels-first or channels-last ordering.

Let’s move on to the body of the CNN:

27 # first CONV => RELU => CONV => RELU => POOL layer set
28 model.add(Conv2D(16, (7, 7), padding="same",
29 input_shape=inputShape))
30 model.add(Activation("relu"))
31 model.add(BatchNormalization(axis=chanDim))
32 model.add(MaxPooling2D(pool_size=(2, 2)))
33 model.add(Dropout(0.25))

Line 28 defines the first layer of the network, a CONV layer that will learn 16 filters, each
with a filter size of 7x7. As mentioned above, we use a larger filter size to capture more
structural/shape information of the binarized shape in the image. We then apply batch normal-
ization, max-pooling (to reduce volume size), and dropout (to reduce overfitting).

We then stack two more CONV layer sets, this time each with 3x3 filter sizes:

35 # second CONV => RELU => CONV => RELU => POOL layer set
36 model.add(Conv2D(32, (3, 3), padding="same"))
37 model.add(Activation("relu"))
38 model.add(BatchNormalization(axis=chanDim))

39 model.add(MaxPooling2D(pool_size=(2, 2)))
40 model.add(Dropout(0.25))
41
42 # third CONV => RELU => CONV => RELU => POOL layer set
43 model.add(Conv2D(64, (3, 3), padding="same"))
44 model.add(Activation("relu"))
45 model.add(BatchNormalization(axis=chanDim))
46 model.add(MaxPooling2D(pool_size=(2, 2)))
47 model.add(Dropout(0.25))

Note how (1) the volume size is reduced via max-pooling the deeper we go into the
network, while (2) simultaneously the number of filters learned by CONV layers increases the
deeper we go. This behavior is very typical and you’ll see it in nearly every CNN you encounter.

Finally, let’s add our fully-connected layers:

49 # first (and only) set of FC => RELU layers


50 model.add(Flatten())
51 model.add(Dense(128))
52 model.add(Activation("relu"))
53 model.add(BatchNormalization())
54 model.add(Dropout(0.5))
55
56 # softmax classifier
57 model.add(Dense(classes))
58 model.add(Activation("softmax"))
59
60 # return the constructed network architecture
61 return model

Lines 50-54 add a single fully-connected layer with 128 neurons. Another FC layer is added
on Line 57, this time containing the total number of classes. The final, fully constructed model
is returned to the calling function on Line 61.
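
If you'd like to verify the architecture before training, a quick sketch (assuming the package layout from Section 9.2.2) is to build the model for 64x64 grayscale inputs and five classes, then print its summary:

from pyimagesearch.nn.gesturenet import GestureNet

# build GestureNet for 64x64 single-channel inputs and 5 classes
# (4 gestures plus the "ignore" class), then inspect the layers
model = GestureNet.build(width=64, height=64, depth=1, classes=5)
model.summary()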

9.4.2 Implementing the Training Script

With the model implemented we can move on to our training script. Open up the train_model
.py file and we’ll get to work:

1 # import the necessary packages


2 from sklearn.preprocessing import LabelBinarizer
3 from sklearn.model_selection import train_test_split
4 from sklearn.metrics import classification_report
5 from tensorflow.keras.preprocessing.image import ImageDataGenerator
6 from tensorflow.keras.optimizers import Adam
7 from pyimagesearch.nn.gesturenet import GestureNet

8 from pyimagesearch.utils import Conf


9 from imutils import paths
10 import numpy as np
11 import argparse
12 import pickle
13 import cv2
14 import os
15
16 # construct the argument parser and parse the arguments
17 ap = argparse.ArgumentParser()
18 ap.add_argument("-c", "--conf", required=True,
19 help="path to the input configuration file")
20 args = vars(ap.parse_args())
21
22 # load the configuration file
23 conf = Conf(args["conf"])
24
25 # grab the list of images in our dataset directory, then initialize
26 # the list of data (i.e., images) and class labels
27 print("[INFO] loading images...")
28 imagePaths = list(paths.list_images(conf["dataset_path"]))
29 data = []
30 labels = []

Lines 2-14 import our required Python packages. Note how we import GestureNet on Line 7.

Lines 16-20 parse our command line arguments. Again, the only switch we need is --conf,
the path to our configuration file which we load on Line 23.

Line 28 grabs the paths to all example training images in the "dataset_path" from our
configuration while Lines 29 and 30 initialize the data and labels lists, respectively.

We can now load our training data from disk:

32 # loop over the image paths


33 for imagePath in imagePaths:
34 # extract the class label from the filename
35 label = imagePath.split(os.path.sep)[-2]
36
37 # load the image, convert it to grayscale, and resize it to be a
38 # fixed 64x64 pixels, ignoring aspect ratio
39 image = cv2.imread(imagePath)
40 image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
41 image = cv2.resize(image, (64, 64))
42
43 # update the data and labels lists, respectively
44 data.append(image)
45 labels.append(label)

On Line 33 we loop over all imagePaths in our dataset. For each image path, we:

i. Grab the label by parsing the class label from the filename on Line 35 (the label is the
subdirectory name of where the image lives, which in this case, is the name of the label
itself — see Section 9.3 for more details).

ii. Load the input image from disk (Line 39).

iii. Convert the image to grayscale (Line 40).

iv. Resize the image to 64x64 pixels on Line 41.

From there we add the image and label to our data and labels lists, respectively (Lines
44 and 45).

Let’s prepare our data for training:

47 # convert the data into a NumPy array, then preprocess it by scaling


48 # all pixel intensities to the range [0, 1]
49 data = np.array(data, dtype="float") / 255.0
50
51 # reshape the data matrix so that it explicitly includes a channel
52 # dimension
53 data = data.reshape((data.shape[0], data.shape[1], data.shape[2], 1))
54
55 # one-hot encode the labels
56 lb = LabelBinarizer()
57 labels = lb.fit_transform(labels)
58
59 # partition the data into training and testing splits using 75% of
60 # the data for training and the remaining 25% for testing
61 (trainX, testX, trainY, testY) = train_test_split(data, labels,
62 test_size=0.25, stratify=labels, random_state=42)

Line 49 converts the data list to a proper NumPy array while simultaneously scaling all
pixel values from the range [0, 255] to [0, 1].

We then reshape the data matrix to include a channel dimension (Line 53), otherwise Keras
will not understand that we are working with grayscale, single channel images.
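
As a quick sketch of what that reshape accomplishes (using an illustrative random array rather than the real dataset):

import numpy as np

# illustrative only: 10 grayscale 64x64 images without a channel axis
data = np.random.rand(10, 64, 64).astype("float32")
print(data.shape)   # (10, 64, 64)

# add the trailing channel dimension Keras expects for grayscale input
data = data.reshape((data.shape[0], data.shape[1], data.shape[2], 1))
print(data.shape)   # (10, 64, 64, 1)

# np.expand_dims(original_data, axis=-1) would accomplish the same thing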

Lines 56 and 57 one-hot encode our labels while Lines 61 and 62 construct the training
and testing split.

We are now ready to train GestureNet:

64 # construct the training image generator for data augmentation


65 aug = ImageDataGenerator(rotation_range=20, zoom_range=0.15,

66 width_shift_range=0.2, height_shift_range=0.2, shear_range=0.15,


67 horizontal_flip=True, fill_mode="nearest")
68
69 # initialize our gesture recognition CNN and compile it
70 model = GestureNet.build(64, 64, 1, len(lb.classes_))
71 opt = Adam(lr=conf["init_lr"],
72 decay=conf["init_lr"] / conf["num_epochs"])
73 model.compile(loss="categorical_crossentropy", optimizer=opt,
74 metrics=["accuracy"])
75
76 # train the network
77 H = model.fit_generator(
78 aug.flow(trainX, trainY, batch_size=conf["bs"]),
79 validation_data=(testX, testY),
80 steps_per_epoch=len(trainX) // conf["bs"],
81 epochs=conf["num_epochs"])

Lines 65-67 initialize our data augmentation object, used to randomly translate, rotate,
shear, etc. our input images. Data augmentation is used to reduce overfitting and improve the
ability of our model to generalize. You can read more about data augmentation inside Deep
Learning for Computer Vision with Python (http://pyimg.co/dl4cv [50]) and in this tutorial:
http://pyimg.co/pedyk [51].

Line 70 initializes the GestureNet model, instructing it to accept 64x64 input images with
only a single channel. The total number of classes is equal to the number of unique classes
found by the LabelBinarizer.

From there we compile the model and then begin training (Lines 77-81).
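
Note that on newer TensorFlow/Keras releases (2.1 and later), fit_generator is deprecated in favor of fit, which accepts generators directly, and Adam's lr argument has been renamed learning_rate (the original decay argument can be replaced with a learning rate schedule if you need it). A rough sketch of the equivalent calls under those assumptions:

# rough TF 2.x sketch (not the book's exact script); reuses the variables
# already defined in train_model.py (aug, model, conf, trainX, etc.)
opt = Adam(learning_rate=conf["init_lr"])
model.compile(loss="categorical_crossentropy", optimizer=opt,
    metrics=["accuracy"])

H = model.fit(
    aug.flow(trainX, trainY, batch_size=conf["bs"]),
    validation_data=(testX, testY),
    steps_per_epoch=len(trainX) // conf["bs"],
    epochs=conf["num_epochs"])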

After training is complete we can make predictions on our testing data:

83 # evaluate the network


84 print("[INFO] evaluating network...")
85 predictions = model.predict(testX, batch_size=conf["bs"])
86 print(classification_report(testY.argmax(axis=1),
87 predictions.argmax(axis=1), target_names=lb.classes_))
88
89 # serialize the model
90 print("[INFO] saving model...")
91 model.save(str(conf["model_path"]))
92
93 # serialize the label encoder
94 print("[INFO] serializing label encoder...")
95 f = open(str(conf["lb_path"]), "wb")
96 f.write(pickle.dumps(lb))
97 f.close()

We also serialize both the model and lb to disk so we can use them later in the recognize.py
script.

9.4.3 Examining Training Results

With our train_model.py file implemented we can now train GestureNet!

Open up a terminal and execute the following command:

$ python train_model.py --conf config/config.json


Using TensorFlow backend.
[INFO] loading images...
Epoch 1/75
47/47 [==============================] - 3s 71ms/step - loss: 1.7041 -
accuracy: 0.4628 - val_loss: 0.7240 - val_accuracy: 0.7323
Epoch 2/75
47/47 [==============================] - 2s 48ms/step - loss: 1.2165 -
accuracy: 0.5264 - val_loss: 0.8599 - val_accuracy: 0.6142
Epoch 3/75
47/47 [==============================] - 2s 47ms/step - loss: 0.9044 -
accuracy: 0.6383 - val_loss: 0.5436 - val_accuracy: 0.7323
...
Epoch 73/75
47/47 [==============================] - 2s 44ms/step - loss: 0.2467 -
accuracy: 0.9152 - val_loss: 0.0082 - val_accuracy: 1.0000
Epoch 74/75
47/47 [==============================] - 2s 45ms/step - loss: 0.1834 -
accuracy: 0.9388 - val_loss: 0.0083 - val_accuracy: 1.0000
Epoch 75/75
47/47 [==============================] - 2s 45ms/step - loss: 0.2579 -
accuracy: 0.9124 - val_loss: 0.0054 - val_accuracy: 1.0000
[INFO] evaluating network...
precision recall f1-score support

fist 1.00 1.00 1.00 25


hang_loose 1.00 1.00 1.00 26
ignore 1.00 1.00 1.00 25
peace 1.00 1.00 1.00 25
stop 1.00 1.00 1.00 26

micro avg 1.00 1.00 1.00 127


macro avg 1.00 1.00 1.00 127
weighted avg 1.00 1.00 1.00 127

[INFO] saving model...


[INFO] serializing label encoder...

By the end of our 75th epoch we are obtaining 100% accuracy on our testing data!

9.5 Implementing the Complete Gesture Recognition Pipeline

We are now ready to put all the pieces together and finish implementing our hand gesture
recognition pipeline!

Inside this script we will:

i. Load our trained GestureNet model.

ii. Access our video stream.

iii. Classify each frame from the video stream as either (1) an identified gesture or (2) back-
ground, in which case we ignore the frame.

iv. Compare the input gesture passcode to the correct passcode defined in Section 9.2.3.

v. Either welcome the user (correct passcode) or alert the home owner of an intruder (incor-
rect passcode).

Open recognize.py in your favorite code editor and we’ll get started:

1 # import the necessary packages


2 from pyimagesearch.utils import Conf
3 from pyimagesearch.notifications import TwilioNotifier
4 from imutils.video import VideoStream
5 from imutils import paths
6 from threading import Thread
7 from tensorflow.keras.preprocessing.image import img_to_array
8 from tensorflow.keras.models import load_model
9 from datetime import datetime
10 from datetime import date
11 import RPi.GPIO as GPIO
12 import numpy as np
13 import argparse
14 import imutils
15 import pickle
16 import time
17 import cv2
18 import os

Lines 2-18 handle our imports. We’ll be using the TwilioNotifier to send text message
notifications of intruders. The GPIO package can also be used to access the GPIO pins on a
Pi Traffic Light if you are using one.

Let’s initialize the GPIO pins now:



20 # set the Pi Traffic Light GPIO pins


21 red = 9
22 yellow = 10
23 green = 11
24
25 # setup the Pi Traffic Light
26 GPIO.setmode(GPIO.BCM)
27 GPIO.setup(red, GPIO.OUT)
28 GPIO.setup(yellow, GPIO.OUT)
29 GPIO.setup(green, GPIO.OUT)

Lines 21-23 define the integer values of the red, yellow, and green lights on the Traffic Light
HAT. We then set up the Pi Traffic Light on Lines 26-29.

Remark. If you wish to eliminate the Pi Traffic Light, be sure to comment out all code lines that
begin with GPIO.

We’ll now define four helper utility functions, the first of which, correct_passcode, is
defined below:

31 def correct_passcode(p):
32 # actuate a lock or (in our case) turn on the green light
33 GPIO.output(green, True)
34
35 # print status and play the correct sound file
36 print("[INFO] everything is okay :-)")
37 play_sound(p)

The correct_passcode function, as the name suggests, is called when the user has
input a correct hand gesture code. This method accepts a single parameter, p, the path to
the input audio file to play when the correct passcode has been entered. Line 33 turns on the
green light via the GPIO library while Line 37 plays the “correct passcode” audio file.

Similarly, we also have an incorrect_passcode function:

39 def incorrect_passcode(p, tn):
40 # turn on the red light
41 GPIO.output(red, True)
42
43 # print status and play the incorrect sound file
44 print("[INFO] security breach!")
45 play_sound(p)
46
47 # alert the homeowner
48 hhmmss = (datetime.now()).strftime("%I:%M%p")
49 today = date.today().strftime("%A, %B %d %Y")
50 msg = "An incorrect passcode was entered at " \
51 "{} on {} at {}.".format(conf["address_id"], today, hhmmss)
52 tn.send(msg)

This method accepts two parameters, the first of which is p, the path to the audio file to be
played if an incorrect passcode is entered. The second argument, tn, is our TwilioNotifier
object used to send a text message alert to the home owner.

Line 41 displays the “red” light on our Traffic Light HAT while Line 45 plays the “incorrect
passcode” audio file. Lines 48-52 then send a text message to the home owner, indicating that
a potential intruder has entered the house and entered the incorrect passcode.

The reset_lights function is used to turn off all lights on the Traffic Light HAT:

54 def reset_lights():
55 # turn off the lights
56 GPIO.output(red, False)
57 GPIO.output(yellow, False)
58 GPIO.output(green, False)

The final utility function, play_sound, plays an audio file via the built-in aplay command
on the Raspberry Pi:

60 def play_sound(p):
61 # construct the command to play a sound, then execute the command
62 command = "aplay {}".format(p)
63 os.system(command)
64 print(command)
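If you prefer not to shell out with os.system, the same behavior can be obtained with the standard library's subprocess module. A minimal, functionally equivalent sketch (not the chapter's code):

import subprocess

def play_sound(p):
    # run aplay directly (no shell); check=True raises if aplay fails
    subprocess.run(["aplay", p], check=True)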

With our helper utilities defined, we can move on to the body of the script:

66 # construct the argument parser and parse the arguments
67 ap = argparse.ArgumentParser()
68 ap.add_argument("-c", "--conf", required=True,
69 help="path to the input configuration file")
70 args = vars(ap.parse_args())
71
72 # load the configuration file and initialize the Twilio notifier
73 conf = Conf(args["conf"])
74 tn = TwilioNotifier(conf)

Lines 67-70 parse our command line arguments. Just like all other scripts in this chapter,
we need only the --conf switch.

Lines 73 and 74 instantiate the Conf and TwilioNotifier classes, respectively.

If you recall from Section 9.2.2, where we reviewed the project/directory structure, we have
an assets/ directory which contains emoji-like visualizations for each of the gestures.

The following code block loads each of these emoji images from disk:

76 # grab the paths to gesture icon images and then initialize the icons
77 # dictionary where the key is the gesture name (derived from the image
78 # path) and the value is the actual icon image
79 print("[INFO] loading icons...")
80 imagePaths = paths.list_images(conf["assets_path"])
81 icons = {}
82
83 # loop over the image paths
84 for imagePath in imagePaths:
85 # extract the gesture name (label) the icon represents from the
86 # filename, load the icon, and then update the icons dictionary
87 label = imagePath.split(os.path.sep)[-1].split(".")[0]
88 icon = cv2.imread(imagePath)
89 icons[label] = icon

Line 80 grabs all imagePaths inside the "assets_path" directory while Line 81 initial-
izes the icons dictionary. The key to the dictionary is the name of the label while the value is
the icon itself.

Lines 84-89 then loop over each of the icon paths, extract the name of the gesture from
the filename, load the icon, and store it in the icons dictionary.

Our next code block prepares to access our video stream:

91 # grab the top-left and and bottom-right (x, y)-coordinates for the
92 # gesture capture area
93 TOP_LEFT = tuple(conf["top_left"])
94 BOT_RIGHT = tuple(conf["bot_right"])
95
96 # load the trained gesture recognizer model and the label binarizer
97 print("[INFO] loading model...")
98 model = load_model(str(conf["model_path"]))
99 lb = pickle.loads(open(str(conf["lb_path"]), "rb").read())
100
101 # start the video stream thread
102 print("[INFO] starting video stream thread...")
103 # vs = VideoStream(src=0).start()
104 vs = VideoStream(usePiCamera=True).start()
105 time.sleep(2.0)

Lines 93 and 94 grab the top-left and bottom-right (x, y)-coordinates for our gesture recognition area, just like we did in the gather_examples.py script from Section 9.3.

We then load both the trained GestureNet model and serialized LabelBinarizer from
disk on Lines 98 and 99. Lines 102-105 then access our video stream.

We only have a few more initializations to go before we start looping over frames:

107 # initialize the current gesture, a bookkeeping variable used to keep
108 # track of the number of consecutive frames a given gesture has been
109 # classified as
110 currentGesture = [None, 0]
111
112 # initialize the list of input gestures recognized from the user
113 # along with the timestamp of when all four gestures were entered
114 gestures = []
115 enteredTS = None
116
117 # initialize two booleans used to indicate (1) whether or not the
118 # alarm has been raised and (2) if the correct pass code was entered
119 alarm = False
120 correct = False

Line 110 initializes currentGesture, a list containing two values: (1) the current recog-
nized gesture, and (2) the total number of consecutive frames that GestureNet has reported
the current gesture as the classification. By keeping track of the number of consecutive frames a given gesture has been recognized for, we can reduce the likelihood of a false-positive classification from the network.

Line 114 initializes gestures, a list of hand gestures input from the user. We’ll compare
gestures with the passcode in the configuration file to determine if the user has input the
correct passcode. We’ll also grab the timestamp of when the gestures have been input to the
system (Line 115).

We then have two booleans on Lines 119 and 120:

• alarm: Used to indicate if the alarm has been raised.

• correct: Indicates whether or not the correct passcode was entered.

Let’s start looping over frames from the VideoStream:

122 # loop over frames from the video stream
123 while True:
124 # grab the frame from the threaded video file stream and grab the
125 # current timestamp
126 frame = vs.read()
127 timestamp = datetime.now()
128
129 # resize the frame and then flip it horizontally
130 frame = imutils.resize(frame, width=500)
131 frame = cv2.flip(frame, 1)
132
133 # clone the original frame and then draw the gesture capture area
134 clone = frame.copy()
135 cv2.rectangle(clone, TOP_LEFT, BOT_RIGHT, (0, 0, 255), 2)

Line 126 reads a frame from our stream while Line 127 grabs the current timestamp.

We then resize the frame and horizontally flip it, just like we did in Section 9.3. Line 135
visualizes the gesture capture region, ensuring we know where to place our hands for our ges-
tures to be recognized.

Let’s now make a check to see how many gestures have been input by the user:

137 # only perform hand gesture classification if the current gestures
138 # list is not already full
139 if len(gestures) < 4:
140 # extract the hand gesture capture ROI from the frame, convert
141 # the ROI to grayscale, and then threshold it to reveal a
142 # binary mask of the hand
143 roi = frame[TOP_LEFT[1]:BOT_RIGHT[1], TOP_LEFT[0]:BOT_RIGHT[0]]
144 roi = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
145 roi = cv2.threshold(roi, 75, 255, cv2.THRESH_BINARY)[1]
146 visROI = roi.copy()
147
148 # now that we have the hand region we need to resize it to be
149 # the same dimensions as what our model was trained on, scale
150 # it to the range [0, 1], and prepare it for classification
151 roi = cv2.resize(roi, (64, 64))
152 roi = roi.astype("float") / 255.0
153 roi = img_to_array(roi)
154 roi = np.expand_dims(roi, axis=0)
155
156 # classify the input image
157 proba = model.predict(roi)[0]
158 label = lb.classes_[proba.argmax()]

For this application, we’ll assume that each passcode has four gestures (you can increase
or decrease this value as you see fit). Line 139 checks to see if fewer than four gestures have been input, implying that the end user is still entering their gestures.

Lines 143-145 extract the hand gesture recognition roi and then threshold it, just like we
did in Section 9.3.2. After thresholding, our hand now appears as white foreground on a black
background (Figure 9.4).

Lines 151-154 preprocess the roi for classification, just as we did in the train_model.py
script from Section 9.4. Lines 157 and 158 classify the input roi to obtain the gesture predic-
tion.

Our next code block handles updating the currentGesture bookkeeping variable:

160 # check to see if the label from our model matches the label
161 # from the previous classification
162 if label == currentGesture[0] and label != "ignore":
163 # increment the current gesture count
164 currentGesture[1] += 1
165
166 # check to see if the current gesture has been recognized
167 # for the past N consecutive frames
168 if currentGesture[1] == conf["consec_frames"]:
169 # update the gestures list with the predicted label
170 # and then reset the gesture counter
171 gestures.append(label)
172 currentGesture = [None, 0]
173
174 # otherwise, reset the current gesture count
175 else:
176 currentGesture = [label, 0]

Line 162 checks to see if (1) the current predicted label matches whatever the previous
predicted label was, and (2) that the predicted label is not the “ignore” class. Provided this
passes, we increment the number of consecutive frames the current gesture was predicted
as (Line 164).

If we have reached the "consec_frames" threshold, we add the current label to the
gestures list and then reset the currentGesture. Using a consecutive count helps reduce
false-positive classifications.

Lines 175 and 176 handle the case where the current label does not match the previous
label in currentGesture, in which case we reset currentGesture.

In the event we are entering a gesture we’ll change the color of the Traffic Light HAT to
“yellow”, indicating that a passphrase is being entered:

178 # turn on the yellow light if we have at least 1 gesture
179 if len(gestures) >= 1 and len(gestures) < 4:
180 GPIO.output(yellow, True)
181
182 # otherwise, turn the yellow light off
183 else:
184 GPIO.output(yellow, False)

If there are no gestures or we already have a full gestures list, we’ll turn off the light on
the HAT (Lines 183 and 184).

Let’s now start building a basic GUI visualization for the end user:

186 # initialize the canvas used to draw recognized gestures
187 canvas = np.zeros((250, 425, 3), dtype="uint8")
188
189 # loop over the number of hand gesture input keys
190 for i in range(0, 4):
191 # compute the starting x-coordinate of the entered gesture
192 x = (i * 100) + 25
193
194 # check to see if an input gesture exists for this index, in
195 # which case we should display an icon
196 if len(gestures) > i:
197 canvas[65:165, x:x + 75] = icons[gestures[i]]
198
199 # otherwise, there has not been an input gesture for this icon
200 else:
201 # draw a white box on the canvas, indicating that a
202 # gesture has not been entered yet
203 cv2.rectangle(canvas, (x, 65), (x + 75, 165),
204 (255, 255, 255), -1)

Line 187 initializes a 425 × 250 pixel canvas that we'll be drawing on.

We then loop over the number of input gestures on Line 190. For each gesture we compute
the starting x-coordinate of the emoji. If there exists an entry for the i-th gesture, we use the
icons dictionary to draw it on the canvas (Lines 196 and 197); otherwise, we draw a white
rectangle, indicating a gesture has not yet been input (Lines 200-204). An example of this
simple interface can be seen in Figure 9.5.

Figure 9.5: Left: An example of our GUI interface waiting for a gesture recognition input. Right:
As gestures are input the white rectangles are replaced with the emoji corresponding to the rec-
ognized gesture.

Our next code block handles checking to see if the input list of gestures is correct or not:

206 # initialize the status as "waiting" (implying that we're waiting
207 # for the user to input four gestures) along with the color of the
208 # status text
209 status = "Waiting"
210 color = (255, 255, 255)
211
212 # check to see if there are four gestures in the list, implying
213 # that we need to check the pass code
214 if len(gestures) == 4:
215 # if the timestamp of when the four gestures has been entered
216 # has not been initialized, initialize it
217 if enteredTS is None:
218 enteredTS = timestamp
219
220 # initialize our status, color, and sound path for the
221 # "correct" pass code
222 status = "Correct"
223 color = (0, 255, 0)
224 audioPath = conf["correct_audio"]
225
226 # check to see if the input gesture pass code is correct
227 if gestures == conf["passcode"]:
228 # if we have not taken action for a correct pass code,
229 # take the action
230 if not correct:
231 t = Thread(target=correct_passcode, args=(audioPath,))
232 t.daemon = True
233 t.start()
234 correct = True

Line 214 checks to see if four gestures have been entered, indicating that we need to
check if the passcode is correct or not. Lines 217 and 218 then grab the timestamp of when
the check took place, just in case we need to alert the home owner.

We initialize our status, color, and audioPath variables, operating under the assump-
tion that the entered passcode is correct (Lines 222-224). Line 227 then checks to see if the
passcode is correct.

Provided the correct gesture code has been entered we then make another check on Line
230. If correct is still set to False then we know that we have not played a “welcome”
message to the user — in that case, we create a Thread to call the correct_passcode
function and set the correct boolean to True.

Let’s now handle if the input gesture code is incorrect:

236 # otherwise, the pass code is incorrect
237 else:
238 # update the status, color and audio path
239 status = "Incorrect"
240 color = (0, 0, 255)
241 audioPath = conf["incorrect_audio"]
242
243 # if the alarm has not already been raised, raise it
244 if not alarm:
245 t = Thread(target=incorrect_passcode,
246 args=(audioPath, tn,))
247 t.daemon = True
248 t.start()
249 alarm = True
250
251 # after a correct/incorrect pass code we will show the status
252 # for N seconds
253 if (timestamp - enteredTS).seconds > conf["num_seconds"]:
254 # reset the gestures list, timestamp, and alarm/correct
255 # booleans
256 gestures = []
257 enteredTS = None
258 alarm = False
259 correct = False
260 reset_lights()

Line 237 handles the case in which the entered passcode is incorrect. In this case we update our status, color, and audioPath variables, respectively (Lines 239-241). If the alarm has not been raised, we raise it on Lines 244-249.

Lines 253-260 allow our simple GUI to display the correct/incorrect passcode for a total of
"num_seconds" before we reset and allow the user to try a different gesture passcode.

The final code block in the script handles visualizing the output images to our screen:

262 # draw the timestamp and status on the canvas
263 ts = timestamp.strftime("%A %d %B %Y %I:%M:%S%p")
264 status = "Status: {}".format(status)
265 cv2.putText(canvas, ts, (10, canvas.shape[0] - 10),
266 cv2.FONT_HERSHEY_SIMPLEX, 0.35, (0, 0, 255), 1)
267 cv2.putText(canvas, status, (10, 25), cv2.FONT_HERSHEY_SIMPLEX,
268 0.6, color, 2)
269
270 # show ROI we're monitoring, the output frame, and passcode info
271 cv2.imshow("ROI", visROI)
272 cv2.imshow("Security Feed", clone)
273 cv2.imshow("Passcode", canvas)
274 key = cv2.waitKey(1) & 0xFF
275
276 # if the `q` key was pressed, break from the loop
277 if key == ord("q"):
278 reset_lights()
279 break
280
281 # do a bit of cleanup

282 cv2.destroyAllWindows()
283 vs.stop()

Lines 265-268 draw the current timestamp, ts, and status on the output canvas. We
then visualize the ROI we’re monitoring, output frame, and passcode information to our screen
(Lines 271-273). Finally, if the q key is pressed on our keyboard we gracefully exit our script.

9.6 Gesture Recognition Results

With our gesture recognition pipeline complete we can put it to work!

Open up a terminal and execute the following command:

$ python recognize.py --conf config/config.json
[INFO] loading icons...
[INFO] loading model...
[INFO] starting video stream thread...

Figure 9.6: The correct hand gesture passcode has been entered.

Figure 9.6 visualizes the output of our script. In the bottom-right you can see my hand
gesture sequence (i.e. passcode). The passcode has been correctly recognized as “peace”,
“stop”, “fist”, and “hang_loose”, after which the correct.wav audio file is played and “Correct”
is displayed on the output frame (this passcode matches the "passcode" configuration setting
from Section 9.2.3).

Figure 9.7: The incorrect hand gesture passcode has been entered by an intruder.

Figure 9.7 shows the output of an incorrect passcode attempt. Notice how the screenshot clearly shows an incorrect passcode, which causes the incorrect.wav audio file to play and "Incorrect" to be displayed on the frame. Additionally, a text message notification is then sent to my smartphone, alerting me to the intruder.

Keep in mind that hand gesture recognition is not limited to security applications.

You can also utilize gesture recognition to build accessibility programs, enabling disabled
users to more easily access a computer. You could also use hand gesture recognition to build
smart home systems, such as accessing your TV without a remote.

When developing your own hand gesture recognition applications you should use this chap-
ter as a template and starting point — from there you can extend it to work with your own
projects!

9.7 Summary

In this chapter you learned how to perform hand gesture recognition on the Raspberry Pi.

Our gesture recognition pipeline combined traditional computer vision techniques with deep learning algorithms.

In order to ensure our method was capable of running in real-time on the RPi, we needed to
utilize background subtraction and thresholding to first segment the hand from the rest of the
image. Our CNN was then able to recognize the hand with a high level of accuracy.

The biggest problem with this approach is that it hinges on being able to accurately segment
the foreground hand from the background — if there is too much noise or if the segmentation
is not reasonably accurate, then the CNN will incorrectly classify the hand region. A more
advanced approach would be to utilize a deep learning-based object detector to first detect the
hand [52] in the image and then apply a hand keypoint detector [53] to localize each of the
fingers on the hand. Using this information we could more easily segment the hand from the
image and thereby increase the robustness of our system.

However, such a system would be too computationally expensive to run on the RPi alone
— we would need to utilize coprocessors such as the Movidius NCS or Google Coral USB
Accelerator, both of which are covered in the Complete Bundle of this text.
Chapter 10

Vehicle Recognition with Deep Learning

Package theft has become a massive problem in the United States, especially surrounding
major holidays.

Interestingly, it’s been reported that rural areas have had a higher package rate theft (per
population) than major cities, thus demonstrating the problem is not limited to areas with just
high populations of people [54].

Using computer vision we can help combat package theft, ensuring your package arrives safely whether you're awaiting the arrival of the hottest video game just released or simply sending a care package for your grandmother's 80th birthday.

Inside this chapter we’ll explore package theft through vehicle identification, and more
specifically, recognizing various delivery vehicles. You can use the techniques covered in this chapter to recognize other types of vehicles as well.

10.1 Chapter Learning Objectives

In this chapter you will:

i. Build a dataset of vehicle images

ii. Utilize the (pre-trained) YOLO object detector to detect trucks in your vehicle dataset

iii. Perform transfer learning via feature extraction to extract features from the detected vehi-
cles

iv. Train a Logistic Regression model on top of the extracted features

v. Recognize vehicles in real-time video using a Raspberry Pi, ImageZMQ, and deep learn-
ing


10.2 What is Vehicle Recognition?

Vehicle recognition is the process of identifying the means of transport a person is operating.
In its simplest form, vehicle recognition labels a vehicle as car, truck, bus, etc. More advanced
forms of vehicle recognition may provide more specific information, including the make, model,
and color of the vehicle.

Figure 10.1: An example of: (1) detecting the presence of a vehicle in an image, (2) localizing
where in the image the vehicle is, and (3) correctly identifying the type of vehicle.

We’ll be examining vehicle recognition through our project on delivery truck identification.
We will learn how to utilize deep learning and transfer learning to create a computer vision
system capable of recognizing various types of trucks (ex., UPS, FedEx, DHL, USPS, etc.).

An example output of our delivery truck identification project can be seen in Figure 10.1.
Notice how we have correctly:

i. Detected the presence of the truck in the image

ii. Localized where in the image the truck is

iii. Correctly identified the truck (USPS)

The project we’ll be building here in this chapter is one of the more advanced projects in
the Hacker Bundle and will require more Python files and code than previous chapters. Take
10.3. GETTING STARTED WITH VEHICLE RECOGNITION 219

your time when working through this chapter and slowly work through it. I would also suggest
reading through the chapter once to obtain a higher level understanding of what we’re building
and then going back to read it a second time, this time paying closer attention to the details.

10.3 Getting Started with Vehicle Recognition

In the first part of this chapter we’ll review the general flow of our algorithm, including each of
the components we are going to implement. From there we’ll review our project structure and
then dive into our configuration file.

10.3.1 Our Vehicle Recognition Project

Our vehicle recognition project has three phases:

i. Phase #1: Building our dataset of vehicles and extracting them from images.

ii. Phase #2: Training the vehicle recognition model via transfer learning.

iii. Phase #3: Detecting and classifying vehicles in real-time video using our trained model.

The steps of Phase #1 can be seen in Figure 10.2 (top). We'll start by assuming we have a
dataset of example delivery trucks, including UPS, USPS, DHL, etc.

Remark. The delivery truck dataset is provided for you in the downloads associated with this
text. You can also use the instructions in Section 10.4 to build your own dataset as well.

However, simply having example images of various delivery trucks is not enough — we also
need to know where in the input images the vehicle is. To localize where the delivery truck is,
we’ll utilize the YOLO object detector. YOLO will find the truck in the image and then write
the bounding box (x, y)-coordinates of the truck to a CSV file.

Given the CSV file, we can move on to Phase #2 (Figure 10.2, middle). In this phase we
take the detected trucks and perform transfer learning via deep learning. We start by looping
over each of the detections in the CSV file, loading the corresponding image for the current
detection, and then extract the vehicle from the image using the (x, y)-coordinates provided by
the CSV file — the vehicle ROI.

Instead of trying to train a vehicle recognition model from scratch, we can instead apply
transfer learning via feature extraction using deep learning (http://pyimg.co/r0rgh) [55].

Figure 10.2: Our vehicle recognition project consists of three phases. In Phase #1 we build our
training dataset by taking an input set of vehicle images, performing object detection, and storing
the (x, y)-coordinates of where each vehicle resides in an image. Phase #2 consists of taking the
detected vehicle regions, extracting features from the vehicle ROIs using ResNet, and then training
a Logistic Regression model on the features. Finally, Phase #3 utilizes an RPi to capture frames,
passes them to a central server using ImageZMQ, applies object detection to detect vehicles,
extracts features from the vehicle ROI, and then identifies the vehicle using the extracted features
and our Logistic Regression model.

Pre-trained deep neural networks learn discriminative, robust features that can be used to recognize classes the network was never trained on. We'll take advantage of the robust nature of CNNs and use ResNet (pre-trained on ImageNet) to extract features from the vehicle ROIs. These features are then written to disk as an HDF5 file.

Remark. If you are new to the concept of transfer learning, feature extraction, and fine-tuning,
you’ll want to refer to Deep Learning for Computer Vision with Python (http://pyimg.co/dl4cv)
[50], as well as the following tutorials on the PyImageSearch blog:

i. Transfer learning: http://pyimg.co/r0rgh [55]

ii. Feature extraction: http://pyimg.co/1j05z [56]

iii. Fine-tuning: http://pyimg.co/rqtlj [57]



Finally, we arrive at Phase #3 (Figure 10.2, bottom). This phase puts all the pieces together,
arriving at a fully functioning vehicle recognition system.

However, there is a bit of a problem we need to address first — the Raspberry Pi does not
have the computational resources to run all of the following models at the same time:

• Object/vehicle detection via YOLO

• Feature extraction via ResNet

• Final vehicle classification via Logistic Regression

Instead of trying to utilize the underpowered RPi for these operations (or using a copro-
cessor such as the Movidius NCS or Google Coral), we’ll instead treat the Raspberry Pi as a
simple network camera and then use ImageZMQ to stream the results back to a more power-
ful host machine for processing (similar to Chapter 4 on Advanced Security Applications with
YOLO Object Detection). The results of the vehicle identification are then sent back to the RPi.

Take a second now to review the steps of Figure 10.2 to ensure you understand each of
the phases of this project. From here, we’ll move on to reviewing the directory structure for our
project.

10.3.2 Project Structure

Let’s start by reviewing our directory structure for the project:

|-- config
| |-- truckid.json
|-- pyimagesearch
| |-- io
| | |-- __init__.py
| | |-- hdf5datasetwriter.py
| |-- keyclipwriter
| | |-- __init__.py
| | |-- keyclipwriter.py
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
|-- yolo-coco
| |-- coco.names
| |-- yolov3.cfg
| |-- yolov3.weights
|-- output
| |-- detections.csv
| |-- truckid.model
|-- videos
| |-- cars.mp4
|-- client.py
|-- build_training_data.py
|-- extract_features.py
|-- train_model.py
|-- truck_id.py

Inside the config/ directory we’ll store our config.json file which we’ll implement in
Section 10.3.3.

The pyimagesearch module also contains an implementation of a class named HDF5DatasetWriter. This class enables us to write data to disk in a special dataset format called HDF5
[58]. We won’t be covering this implementation in the text, but if you’re interested, you can (1)
take a look at the code included in the downloads of this book and/or (2) refer to Deep Learning
for Computer Vision with Python [50] where the implementation is covered in detail.

The yolo-coco/ directory contains three files, all for the YOLO object detector:

• coco.names: A plaintext file containing the names of the objects/labels YOLO was
trained on (i.e., the labels from the COCO dataset [59]).

• yolov3.cfg: The configuration file for YOLO.

• yolov3.weights: The weights for the YOLO model.

We’ll be using YOLO in Phase #1 (Section 10.4) to detect vehicles in our training set and
then again in Phase #3 (Section 10.5), where we detect vehicles in a real-time video stream.

The vehicle recognition dataset stores the images of vehicles we’ll be training a machine
learning model to recognize. The build_training_data.py script applies the YOLO object
detector to every image in the dataset directory, yielding the detections.csv file in the
output/ directory.

We then take the detections.csv file and use extract_features.py file to extract
deep learning features (via ResNet, pre-trained on ImageNet) and write them to disk in output/
features.hdf5. Given the extracted features we use train_model.py to train a Lo-
gistic Regression model on top of the features — the output model is serialized to disk in
output/truckid.model.

Finally, truck_id.py uses the truckid.model file to classify vehicles in an input video stream.

10.3.3 Our Configuration File

With our directory structure reviewed, let’s move on to the configuration file:

1 {
2 // path to the input directory containing our truck training data
3 "dataset_path": "../datasets/vehicle_recognition_dataset",
4
5 // path to the CSV file containing the bounding box detections after
6 // applying the YOLO object detector
7 "detections_path": "output/detections.csv",
8
9 // base path to YOLO directory
10 "yolo_path": "yolo-coco",
11
12 // minimum probability to filter weak detections and threshold
13 // when applying non-maxima suppression
14 "yolo_confidence": 0.5,
15 "yolo_threshold": 0.3,
16
17 // list of class labels we're interested in from the YOLO object
18 // detector (we're adding "bus" since YOLO will sometimes
19 // misclassify a delivery truck as a bus)
20 "classes": ["truck", "bus"],

Line 3 defines the "dataset_path", which is the path to our directory containing the
input vehicle images. We assume that the "dataset_path" is a root directory and then
contains subdirectories for each class label. For example, in our project structure we have a
subdirectory named fedex/ — this directory lives inside dataset/. When we go to detect
objects in this image, and later extract features from the ROI, we can use the subdirectory
name to easily derive the label name.
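For reference, a dataset organized this way might look like the following tree (the class directories match the labels described in Section 10.4.1; the image filenames are placeholders):

vehicle_recognition_dataset/
|-- fedex/
|   |-- 00000001.jpg
|   |-- 00000002.jpg
|   |-- ...
|-- ups/
|-- usps/
|-- ignore/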

After applying the YOLO object detector we use the "detections_path" to write the
bounding box coordinates to disk (Line 7).

Lines 10-15 initialize configurations for the YOLO object detector:

• "yolo_path": Path to the root directory containing the YOLO files.

• "yolo_confidence": Minimum confidence used to filter out false-positive detections.

• "yolo_threshold": Threshold value for non-maxima suppression [60].

Line 20 defines the list of class labels we’re interested in detecting. Sometimes the YOLO
object detector will misclassify a “truck” as “bus” — therefore, we’ll include the “bus” class to
make sure we find all truck-like vehicles in our input images.

The following configurations are used when applying transfer learning in Section 10.5:

22 // path to output HDF5 file after feature extraction
23 "features_path": "output/features.hdf5",
24
25 // define the batch size and buffer size during feature extraction
26 "batch_size": 16,
27 "buffer_size": 1000,
28
29 // path to the label encoder
30 "label_encoder_path": "output/le.pickle",
31
32 // number of jobs to run while grid-searching the best parameters
33 "n_jobs": 1,
34
35 // path to the output model after training
36 "model_path": "output/truckid.model",

Line 23 sets the path to where extracted features will be stored. Lines 26 and 27 set the
batch size and buffer size for feature extraction. Under the vast majority of circumstances you will not have to adjust these parameters.

Line 30 sets the path to the output, serialized LabelEncoder file while Line 36 does the
same for the output model after training.

Line 33 defines the number of parallel jobs to run while grid-searching for the best hyper-
parameters. Typically you’ll want to leave this value at 1 as grid-searching requires quite a bit
of RAM — the exception is if your machine has a large amount of RAM (over 64GB). In that
case you can increase the number of parallel jobs and have the grid-search run faster.
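To make the role of "n_jobs" concrete, here is a minimal sketch of the kind of grid search it feeds into. The toy data and the hyperparameter grid are assumptions for illustration — they are not necessarily what train_model.py uses:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# toy data standing in for the extracted ResNet features
(X, y) = make_classification(n_samples=200, n_features=64,
    n_informative=16, n_classes=4, random_state=42)

# hypothetical grid over the regularization strength C
params = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), params,
    cv=3, n_jobs=1)  # n_jobs plays the role of the "n_jobs" config value
search.fit(X, y)
print("[INFO] best hyperparameters: {}".format(search.best_params_))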

The next code block handles configurations from Phase #3 (Section 10.6) where we actually
deploy the trained vehicle recognition model:

38 // MOG background subtraction parameters
39 "mog_history": 500,
40 "mog_nmixtures": 5,
41 "mog_bg_ratio": 0.7,
42 "mog_noise_sigma": 0,
43
44 // motion area in pixels
45 "min_area_divisor": 80,
46 "max_area_divisor": 2,

Applying the YOLO object detector is very computationally expensive, so we shouldn't run the detector on each and every frame — instead, we can use the same trick we did in Chapter 5 and apply a two-stage process:

i. Stage #1: First background subtraction/motion detection is applied.

ii. Stage #2: If sufficient foreground is determined, we can then apply the YOLO object
detector to find any vehicles.

In order to apply motion detection we’ll be using OpenCV’s built-in MOG method. Lines 39-
42 define our parameters to MOG. Refer to Chapter 5 where these parameters are explained
in detail.
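As a rough sketch of how these configuration values plug into the two-stage process — the exact logic lives in the Phase #3 script, and the pixel threshold below is an assumption:

import cv2

def build_mog(conf):
    # MOG background subtractor built from the chapter's configuration keys
    # (cv2.bgsegm requires the opencv-contrib-python package)
    return cv2.bgsegm.createBackgroundSubtractorMOG(
        history=conf["mog_history"], nmixtures=conf["mog_nmixtures"],
        backgroundRatio=conf["mog_bg_ratio"],
        noiseSigma=conf["mog_noise_sigma"])

def has_motion(mog, frame, minPixels=500):
    # Stage #1: count foreground pixels; only if enough of the frame is
    # moving do we pay the cost of running YOLO (Stage #2)
    mask = mog.apply(frame)
    return cv2.countNonZero(mask) > minPixels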

Lines 45 and 46 define the "min_area_divisor" and "max_area_divisor", respec-


tively. These values are used to compute the relative area a given vehicle bounding box oc-
cupies (compared to the dimensions of the entire frame). Imposing constraints on vehicle
detection size helps reduce false-positives.
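A small sketch of how such divisor-based constraints can be applied to a detection (the exact formula in the chapter's Phase #3 script may differ slightly):

def plausible_vehicle(boxW, boxH, frameW, frameH, conf):
    # compare the detection's area against bounds derived from the frame
    # area and the configured divisors
    frameArea = frameW * frameH
    minArea = frameArea / conf["min_area_divisor"]
    maxArea = frameArea / conf["max_area_divisor"]
    return minArea <= boxW * boxH <= maxArea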

Our final set of configurations handle KeyClipWriter parameters, the path to any output
video files, along with any Dropbox API authentications:

48 // key clip writer buffer
49 "kcw_buffer_size": 20,
50
51 // path to output directory
52 "output_videos_path": "videos",
53 "output_fps": 4,
54 "codec": "MJPG",
55
56 // dropbox settings for storing videos
57 "use_dropbox": true,
58 "dropbox_access_token": "YOUR_DROPBOX_ACCESS_KEY",
59 "dropbox_base_path": "YOUR_DROPBOX_APP_BASEPATH",
60 "delete_local_video": false
61 }

Let’s move on to Phase #1!

10.4 Phase #1: Creating Our Training Dataset

I have already gathered the vehicle/truck dataset we’ll be using in this chapter (which is also
included in the accompanying downloads associated with this text), but I’ll show you the method
that I used to curate the dataset, just in case you want to build your own.

Given our dataset of vehicles, we then need to detect and localize each delivery truck.

Simply knowing that a vehicle exists in an image is not enough — we instead need to
use an object detector to detect the bounding box coordinates of a vehicle in an image.
Having the bounding box coordinates of the vehicle will enable us to (1) extract it from the input
image, (2) perform transfer learning via feature extraction on the ROI, and finally (3) train a
Logistic Regression model on top of the data. These three tasks will take place in Phase
#2 (Section 10.5), but before we can get there, we first need to build our dataset and obtain the
vehicle detections.

10.4.1 Gathering Vehicle Data

Figure 10.3: Example montage of delivery trucks for training our vehicle recognition system.

The dataset we’ll be using for delivery truck identification contains three types of trucks
along with a final “ignore” class used to filter out non-delivery truck vehicles:

• FedEx: 298 images

• UPS: 245 images

• USPS: 313 images

• Ignore: 332 images

FedEx, UPS, and USPS are all examples of delivery vehicles. The “ignore” class contains
vehicles that are not FedEx, UPS, or USPS trucks, including school buses, garbage trucks,
pickup trucks, etc.

The goal of Phase #1 is to (1) loop over all 1,178 input images in our dataset, (2) detect
the bounding box coordinates of each truck/vehicle in the current image, and (3) write the
coordinates back out to disk. Once we have the bounding box coordinates of each truck in an
image, we can then utilize transfer learning to actually recognize the vehicle detected by our
object detector.

Again, I have already curated the vehicle dataset for this chapter and provided it for you in
the downloads associated with this text. To create the dataset I downloaded images for each
class using Google Images. For example, I searched for "usps truck" in Google and then programmatically downloaded each of the image results (http://pyimg.co/rdyh0) [61]. As an alternative, you may use Bing Images to download results programmatically (http://pyimg.co/vgcns) [62].

If you would like to build your own vehicle recognition dataset (or any other image dataset),
be sure to follow those instructions.

10.4.2 Detecting Vehicles with YOLO Object Detector

We have downloaded/organized our dataset of vehicles on disk in the previous section, but
that’s only the first step — simply having the vehicle images is not enough.

The next step is to apply object detection to localize where each of the vehicles resides in the image. We'll then write these locations to disk (in CSV file format) so we can use them for feature extraction in Section 10.5.

Open up the build_training_data.py file and we’ll get started:

1 # import the necessary packages
2 from pyimagesearch.utils import Conf
3 from imutils import paths
4 import numpy as np
5 import argparse
6 import random
7 import cv2
8 import os
9
10 # construct the argument parser and parse the arguments
11 ap = argparse.ArgumentParser()
12 ap.add_argument("-c", "--conf", required=True,
13 help="path to the input configuration file")
14 args = vars(ap.parse_args())

Lines 2-8 import our required Python packages while Lines 11-14 parse our command line
arguments. The only command line switch we need is --conf, the path to our configuration
file.

Next, let’s load the YOLO object detector:

16 # load the configuration file
17 conf = Conf(args["conf"])
18
19 # load the COCO class labels our YOLO model was trained on
20 labelsPath = os.path.sep.join([conf["yolo_path"], "coco.names"])
21 LABELS = open(labelsPath).read().strip().split("\n")
22
23 # derive the paths to the YOLO weights and model configuration
24 weightsPath = os.path.sep.join([conf["yolo_path"], "yolov3.weights"])
25 configPath = os.path.sep.join([conf["yolo_path"], "yolov3.cfg"])
26
27 # load our YOLO object detector trained on COCO dataset (80 classes)
28 print("[INFO] loading YOLO from disk...")
29 net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)
30
31 # determine only the *output* layer names that we need from YOLO
32 ln = net.getLayerNames()
33 ln = [ln[i[0] - 1] for i in net.getUnconnectedOutLayers()]

Line 20 derives the path to the coco.names file based on our "yolo_path" from the
configuration. That path is used to load the labels plaintext file into a list named LABELS.

Lines 24 and 25 derive the paths to the YOLO weights and configuration files, respectively.
Once we have the paths we load the YOLO model itself on Line 29.

Lines 32 and 33 determine the output names of the YOLO layers (i.e., the layers that
contain our detected objects).
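One caveat worth noting: newer OpenCV releases return a flat array from getUnconnectedOutLayers() rather than a column vector, in which case the i[0] indexing on Line 33 raises an error. A version-agnostic variant (an alternative sketch, not the chapter's code):

# works whether getUnconnectedOutLayers() returns [[200], [227]] or [200, 227]
ln = net.getLayerNames()
ln = [ln[int(i) - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]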

Let’s now grab the paths to our input images in the "dataset_path" and start looping
over them:

35 # grab the input image paths and open the output CSV file for writing
36 imagePaths = list(paths.list_images(conf["dataset_path"]))
37 random.shuffle(imagePaths)
38 csv = open(conf["detections_path"], "w")
39
40 # loop over the input images
41 for (i, imagePath) in enumerate(imagePaths):
42 # load the input image and grab its spatial dimensions
43 print("[INFO] processing image {} of {}".format(i + 1,
44 len(imagePaths)))
45 image = cv2.imread(imagePath)
46 (H, W) = image.shape[:2]
47
48 # construct a blob from the input image and then perform a
49 # forward pass of the YOLO object detector, giving us our
50 # bounding boxes and associated probabilities
51 blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416),
52 swapRB=True, crop=False)
53 net.setInput(blob)
54 layerOutputs = net.forward(ln)
55
56 # initialize our lists of detected bounding boxes, confidences,
57 # and class IDs, respectively
58 boxes = []
59 confidences = []
60 classIDs = []

Lines 36 and 37 grab the paths to all of our input imagePaths and shuffle them. Line 38
opens our csv file for writing.

We then start looping over each of the imagePaths on Line 41. For each image we load it
from disk and then grab its spatial dimensions. Lines 51-54 construct a blob from the image,
which we then pass through the YOLO object detector. Lines 58-60 initialize three lists used to store our detected bounding boxes, corresponding detection probabilities, and class label IDs.

Given our detections we loop over them:

62 # loop over each of the layer outputs
63 for output in layerOutputs:
64 # loop over each of the detections
65 for detection in output:
66 # extract the class ID and confidence (i.e., probability)
67 # of the current object detection
68 scores = detection[5:]
69 classID = np.argmax(scores)
70 confidence = scores[classID]
71
72 # filter out weak predictions by ensuring the detected
73 # probability is greater than the minimum probability
74 if confidence > conf["yolo_confidence"]:
75 # scale the bounding box coordinates back relative to
76 # the size of the image, keeping in mind that YOLO
77 # actually returns the center (x, y)-coordinates of
78 # the bounding box followed by the boxes' width and
79 # height
80 box = detection[0:4] * np.array([W, H, W, H])
81 (centerX, centerY, width, height) = box.astype("int")
82
83 # use the center (x, y)-coordinates to derive the top
84 # and and left corner of the bounding box
85 x = int(centerX - (width / 2))
86 y = int(centerY - (height / 2))
87
88 # update our list of bounding box coordinates,
89 # confidences, and class IDs
90 boxes.append([x, y, int(width), int(height)])
91 confidences.append(float(confidence))
92 classIDs.append(classID)

For each output layer, and for each detection, we grab the scores (probabilities),
classID (the index with the largest corresponding predicted probability), and confidence
(the probability itself).

We make a check on Line 74 to ensure that the confidence meets our minimum prob-
ability. Performing this check helps reduce false-positive detections. Lines 80-86 derive the
bounding box coordinates of the detected object. We then update the boxes, confidences,
and classIDs appropriately on Lines 90-92.

At this point we have now looped over all detections from the YOLO network and our boxes,
confidences, and classIDs lists have been populated — the next step is to (1) apply NMS
and then (2) filter only the truck/bus class:

94 # apply non-maxima suppression to suppress weak, overlapping
95 # bounding boxes
96 idxs = cv2.dnn.NMSBoxes(boxes, confidences,
97 conf["yolo_confidence"], conf["yolo_threshold"])
98
99 # ensure at least one detection exists
100 if len(idxs) > 0:
101 # initialize a bookkeeping variable to maintain the area of
102 # the largest bounding box rectangle found thus far
103 keep = (None, None)
104
105 # loop over the indexes we are keeping
106 for i in idxs.flatten():
107 # if the predicted class label is not a class we are
108 # interested in, ignore it
109 if LABELS[classIDs[i]] not in conf["classes"]:
110 continue
111
112 # grab the width and height of the bounding box and then
113 # compute the area of the bounding box
114 (w, h) = (boxes[i][2], boxes[i][3])
115 area = w * h
116
117 # check to see if the largest bounding box bookkeeping
118 # variable should be updated
119 if keep[0] is None or area > keep[0]:
120 keep = (area, i)

Lines 96 and 97 apply non-maxima suppression [29] to suppress weak, overlapping bound-
ing boxes.

We then ensure that at least one detection was found on Line 100. Provided that we have
at least one detection, we initialize keep, a bookkeeping variable used to keep track of the
area of the largest bounding box rectangle we've found thus far. The keep tuple stores two values: (1) the area of the bounding box rectangle, and (2) the index into the idxs list for that rectangle.

Line 106 starts looping over all remaining idxs after NMS. Line 109 checks to see if the
current class label for index i does not exist in our classes configuration (i.e., either “truck”
or “bus”).

If the label is not “truck” or “bus” we ignore the detection and keep looping (Line 110).
Otherwise, we can safely assume the label is either “truck” or “bus” so we compute the area
of the bounding box (Lines 114 and 115) and then check to see if we should update our
bookkeeping variable, keeping track of the largest bounding box found thus far (Lines 119 and
120).

Our final code block handles writing the largest bounding box to disk:

122 # ensure at least one bounding box that we are interested in
123 # was found
124 if keep[0] != None:
125 # extract the bounding box coordinates
126 i = keep[1]
127 (x, y) = (boxes[i][0], boxes[i][1])
128 (w, h) = (boxes[i][2], boxes[i][3])
129
130 # write the image path and bounding box coordinates to
131 # the CSV file
132 csv.write("{}\n".format(",".join([imagePath, str(x),
133 str(y), str(w), str(h)])))
134
135 # close the output CSV file
136 csv.close()

Provided at least one truck/bus detection was found (Line 124), we write the following to the output CSV file (see the sample row after this list):

• Image file path

• Starting x-coordinate of bounding box

• Starting y-coordinate of bounding box

• Width of bounding box

• Height of bounding box
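For example, a single row of detections.csv might look like the following (the image path and coordinates are purely illustrative):

../datasets/vehicle_recognition_dataset/usps/00000042.jpg,104,87,412,298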

Let’s now move on to running the script.

10.4.3 Building Our Dataset

To apply the YOLO object detector to our dataset of vehicle images, open up a terminal and
execute the following command:

$ time python build_training_data.py --conf config/truckid.json
[INFO] loading YOLO from disk...
[INFO] processing image 1 of 1178
[INFO] processing image 2 of 1178
[INFO] processing image 3 of 1178
...
[INFO] processing image 1176 of 1178
[INFO] processing image 1177 of 1178
[INFO] processing image 1178 of 1178

On my 3 GHz Intel Xeon W processor, the YOLO object detector took 3m41s to run on the
input dataset. After the script finishes executing, you will find a file named detections.csv
in your output/ directory:

$ ls output/
detections.csv

Examining the detections.csv you’ll find that each row contains the image file path and
bounding box coordinates of the largest truck/bus class found in the image. In the following
section we’ll take this information, use it to extract the vehicle ROI for the input image, and then
apply transfer learning via feature extraction to train a model to correctly recognize the vehicle.

10.5 Phase #2: Transfer Learning

In this section you will use transfer learning via feature extraction to extract features from each
of the detected vehicle ROIs and then train a Logistic Regression model on top of the extracted
features. The output Logistic Regression model will be capable of recognizing vehicles in
images and video streams.

I’ll be making the assumption that you have (1) read the transfer learning chapters in Deep
Learning for Computer Vision with Python [50], and/or (2) read the PyImageSearch tutorials
on transfer learning (http://pyimg.co/r0rgh [55]), feature extraction (http://pyimg.co/1j05z [56]),
and fine-tuning (http://pyimg.co/rqtlj [57]).

If you haven’t read those chapters or tutorials, stop now and go read them before continuing.

10.5.1 Implementing Deep Learning Feature Extraction

In Section 10.4 we applied the YOLO object detector to localize vehicles in our input dataset of
trucks/buses. We will now use those locations to:

i. Extract the ROI of the vehicle using the bounding box coordinates.

ii. Pass the ROI through the ResNet network (without the FC layer head).

iii. Obtain the output activations from ResNet.

iv. Treat these activations as features and write them to disk in HDF5 format.

The output features will serve as input to our Logistic Regression model in Section 10.5.3.

Open up the extract_features.py file and we’ll get started:

1 # import the necessary packages
2 from pyimagesearch.io import HDF5DatasetWriter
3 from pyimagesearch.utils import Conf
4 from tensorflow.keras.applications import ResNet50
5 from tensorflow.keras.applications import imagenet_utils
6 from sklearn.preprocessing import LabelEncoder
7 import numpy as np
8 import progressbar
9 import argparse
10 import pickle
11 import cv2
12 import os
13
14 # construct the argument parse and parse the arguments
15 ap = argparse.ArgumentParser()
16 ap.add_argument("-c", "--conf", required=True,
17 help="path to the input configuration file")
18 args = vars(ap.parse_args())

Lines 2-12 import our required Python packages.

You’ll notice that the HDF5DatasetWriter class is being imported on Line 2. We won’t
be reviewing this class as it’s outside the scope of this chapter, but all you need to know is
that this class allows us to store data on disk in HDF5 format.

HDF5 is a binary data format created by the HDF5 group [58] to store gigantic numerical
datasets on disk (far too large to fit into memory) while at the same time facilitating easy access
and computation on the rows in the datasets. Feature extraction via deep learning tends to lead
to very large feature vectors so storing the data in HDF5 tends to be a natural choice.

Lines 15-18 parse our command line arguments. The only switch we need is --conf, the
path to our input configuration file.

Let’s now load our detections CSV file from disk and process it:

20 # load the configuration file along with the contents of the
21 # detections CSV file
22 conf = Conf(args["conf"])
23 rows = open(conf["detections_path"]).read().strip().split("\n")

24
25 # extract the class labels from the image paths then encode the
26 # labels
27 labels = [r.split(",")[0].split(os.path.sep)[-2] for r in rows]
28 le = LabelEncoder()
29 labels = le.fit_transform(labels)
30
31 # load the ResNet50 network
32 print("[INFO] loading network...")
33 model = ResNet50(weights="imagenet", include_top=False)

Line 23 loads the contents of the "detections_path" CSV file, breaking it into a list, one
row per line.

We then extract the class labels from the image paths in each row (Line 27), using each image's parent subdirectory name as its label. These labels are then used to fit a LabelEncoder (Lines 28 and 29), which transforms our labels from strings to integers.
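As a quick illustration of what the LabelEncoder does (the class names below are our dataset's labels; the encoder assigns integers in alphabetical order):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# string labels derived from the subdirectory names
print(le.fit_transform(["fedex", "ups", "usps", "fedex", "ignore"]))
# => [0 2 3 0 1]  (le.classes_ is ["fedex", "ignore", "ups", "usps"])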

Line 33 loads the ResNet50 architecture from disk with weights pre-trained on ImageNet.
We leave off the top fully-connected layer from ResNet so we can perform transfer learning via
feature extraction.

Next, let’s initialize our HDF5DatasetWriter class:

35 # initialize the HDF5 dataset writer, then store the class label
36 # names in the dataset
37 dataset = HDF5DatasetWriter((len(rows), 100352),
38 conf["features_path"], dataKey="features",
39 bufSize=conf["buffer_size"])
40 dataset.storeClassLabels(le.classes_)
41
42 # initialize the progress bar
43 widgets = ["Extracting Features: ", progressbar.Percentage(), " ",
44 progressbar.Bar(), " ", progressbar.ETA()]
45 pbar = progressbar.ProgressBar(maxval=len(rows),
46 widgets=widgets).start()

The final output volume of ResNet50, without the fully-connected layers, is 7 × 7 × 2048,
thus implying that our output feature dimension is 100,352-d, as indicated when initializing the
HDF5DatasetWriter class on Lines 37-39.

Lines 43-46 initialize a progress bar widget we can use to estimate how long it will take for
the feature extraction process to complete.

We can now start looping over the rows in batches of size "batch_size":

48 # loop over the images in batches
49 for i in np.arange(0, len(rows), conf["batch_size"]):
50 # extract the batch of data and labels, then initialize the list
51 # of actual images that will be passed through the network
52 # for feature extraction
53 batchData = rows[i:i + conf["batch_size"]]
54 batchLabels = labels[i:i + conf["batch_size"]]
55 batchImages = []
56
57 # loop over the data and labels in the current batch
58 for (j, row) in enumerate(batchData):
59 # unpack the row
60 (imagePath, x, y, w, h) = row.split(",")
61 (x, y) = (int(x), int(y))
62 (w, h) = (int(w), int(h))
63
64 # load the input image and grab its spatial dimensions
65 image = cv2.imread(imagePath)
66 (imgH, imgW) = image.shape[:2]

Line 53 grabs the next batch of detected objects from the rows array. We then grab the
accompanying labels on Line 54. Line 55 initializes an empty list, batchImages, which will
store the images corresponding to each of the batchLabels.

We start looping over each of the batched rows on Line 58. For each row, we unpack it,
obtaining the imagePath and bounding box coordinates for the largest vehicle in the input
image (Lines 60-62). Line 65 loads the image from disk and grabs its dimensions.

At this point we need to extract the region of the image containing the vehicle:

68 # truncate any bounding box coordinates that may fall
69 # outside the boundaries of the image
70 xMin = max(0, x)
71 yMin = max(0, y)
72 xMax = min(imgW, x + w)
73 yMax = min(imgH, y + h)
74
75 # extract the ROI of the image
76 image = image[yMin:yMax, xMin:xMax]
77
78 # convert our image from BGR to RGB channel ordering, then
79 # resize the image to 224x224
80 image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
81 image = cv2.resize(image, (224, 224))
82
83 # preprocess the image by (1) expanding the dimensions and
84 # (2) subtracting the mean RGB pixel intensity from the
85 # ImageNet dataset
86 image = np.expand_dims(image, axis=0)
87 image = imagenet_utils.preprocess_input(image)
88
89 # add the image to the batch
90 batchImages.append(image)

Lines 70-73 truncate any bounding box coordinates which may fall outside the boundaries
of the input image. We then use NumPy array slicing on Line 76 to obtain the region of
interest that contains the vehicle. The ROI is processed by converting from BGR to RGB
color channel ordering, resizing to 224x224 pixels (the input dimensions for ResNet), and then
performing mean subtraction (Lines 80-87).

After all preprocessing steps are complete, we add the image to the batchImages list.
Given our batch of images we can pass them through our CNN to obtain the output features:

92 # pass the images through the network and use the outputs as
93 # our actual features
94 batchImages = np.vstack(batchImages)
95 features = model.predict(batchImages,
96 batch_size=conf["batch_size"])
97
98 # reshape the features so that each image is represented by
99 # a flattened feature vector
100 features = features.reshape((features.shape[0], 100352))
101
102 # add the features and labels to our HDF5 dataset
103 dataset.add(features, batchLabels)
104 pbar.update(i)

Lines 95 and 96 send the current batchImages through our ResNet network, obtaining
the activations from the network.

Again, recall that without the fully-connected layer head, ResNet produces an output volume
of size 7 × 7 × 2048. When flattened (Line 100), that leads to a total of 100,352 values used to
quantify the vehicle region. The combination of features and corresponding batchLabels
are then added to the HDF5 dataset on Line 103.

We wrap up the extract_features.py script by closing the dataset and then serializing
the LabelEncoder to disk:

106 # close the dataset
107 dataset.close()
108 pbar.finish()
109
110 # write the label encoder to disk
111 f = open(conf["label_encoder_path"], "wb")
112 f.write(pickle.dumps(le))
113 f.close()

If you’re interested in learning more about transfer learning, feature extraction, and fine-
tuning, make sure you refer to the following resources:

• Deep Learning for Computer Vision with Python: http://pyimg.co/dl4cv [50]

• Transfer Learning with Keras and Deep Learning: http://pyimg.co/r0rgh [55]

• Keras: Feature extraction on large datasets with Deep Learning: http://pyimg.co/1j05z [56]

• Fine-tuning with Keras and Deep Learning: http://pyimg.co/rqtlj [57]

10.5.2 Extracting Features with ResNet

To extract features from our dataset of vehicle images, open up a terminal and execute the
following command:

$ python extract_features.py --conf config/truckid.json
[INFO] loading network...
Extracting Features: 100% |############################| Time: 0:01:48

On my machine the entire feature extraction process took 1m48s. You could also run this
script on a machine with a GPU to dramatically speed up the process.

After the script has run you can check the output/ directory to verify that the features.hdf5
file has been created:

$ ls output/
detections.csv features.hdf5

In the next section you will train a Logistic Regression model on these features, giving us
our final vehicle identification model.

10.5.3 Implementing the Training Script

Now that we have a dataset of extracted features, we can move on to training our machine
learning model to (1) accept the input features (extracted via ResNet), and (2) actually
identify and recognize the vehicle.

Open up the train_model.py file and insert the following code:


1 # import the necessary packages
2 from pyimagesearch.utils import Conf
3 from sklearn.linear_model import LogisticRegression
4 from sklearn.model_selection import GridSearchCV
5 from sklearn.metrics import classification_report
6 from sklearn.metrics import accuracy_score
7 import argparse
8 import pickle
9 import h5py
10
11 # construct the argument parser and parse the arguments
12 ap = argparse.ArgumentParser()
13 ap.add_argument("-c", "--conf", required=True,
14 help="path to the input configuration file")
15 args = vars(ap.parse_args())

Lines 2-9 handle our imports. You’ll want to specifically make note of the LogisticRegression
class, an instance of which we’ll be training later in this script, along with h5py, a Python
package used to create, modify, and access HDF5 datasets.

Lines 12-15 then parse our command line arguments.

The next step is to load our LabelEncoder (created in Section 10.5.2 after running the
extract_features.py script) and then open the HDF5 database for reading:

17 # load the configuration file and label encoder
18 conf = Conf(args["conf"])
19 le = pickle.loads(open(conf["label_encoder_path"], "rb").read())
20
21 # open the HDF5 database for reading then determine the index of
22 # the training and testing split, provided that this data was
23 # already shuffled *prior* to writing it to disk
24 db = h5py.File(conf["features_path"], "r")
25 i = int(db["labels"].shape[0] * 0.75)

Line 25 computes an index, i, used to determine the training/testing split. Here we’ll be
using 75% of the data for training and the remaining 25% for testing.

We can now define a set of hyperparameters we want to tune and then apply the GridSearchCV
class to perform a cross-validated search across each choice of these hyperparameters:

27 # define the set of parameters that we want to tune then start a
28 # grid search where we evaluate our model for each value of C
29 print("[INFO] tuning hyperparameters...")
30 params = {"C": [0.0001, 0.001, 0.01, 0.1, 1.0]}
31 model = GridSearchCV(LogisticRegression(solver="lbfgs",
32 multi_class="multinomial"), params, cv=3, n_jobs=conf["n_jobs"])
33 model.fit(db["features"][:i], db["labels"][:i])
34 print("[INFO] best hyperparameters: {}".format(model.best_params_))

Here we’ll be tuning the C value which controls the “strictness” of the Logistic Regression
model:

• A larger value of C is more rigid and will try to make the model make fewer mistakes on
the training data.

• A smaller value of C is less rigid and will allow for some mistakes on the training data.

The benefit of a larger value of C is that you may be able to obtain higher training accuracy.
The downside is that you may overfit your model to the training data.

Conversely, a smaller value of C may lead to lower training accuracy, but could potentially
lead to better model generalization. At the same time though, too small of a C value could
make the model effectively useless and incapable of making meaningful predictions.

The goal of the GridSearchCV is to identify which value of C will perform best on our data.
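
To make the grid search concrete, here is a minimal, self-contained sketch (using synthetic data
from scikit-learn's make_classification rather than our actual vehicle features) showing how
GridSearchCV tries each value of C, reports the best one, and exposes the refit model via
best_estimator_:

# minimal sketch of tuning C with GridSearchCV -- the synthetic data
# from make_classification is a stand-in for our extracted features
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

(X, y) = make_classification(n_samples=500, n_features=20,
	n_informative=10, n_classes=4, random_state=42)

params = {"C": [0.0001, 0.001, 0.01, 0.1, 1.0]}
model = GridSearchCV(LogisticRegression(solver="lbfgs",
	multi_class="multinomial", max_iter=1000), params, cv=3)
model.fit(X, y)

# the C value that performed best across the cross-validation folds
print(model.best_params_)

# best_estimator_ is refit on all of the supplied data using that C --
# this is the object we serialize to disk in train_model.py
print(model.best_estimator_)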

After we’ve performed a grid-search we evaluate our model:

36 # generate a classification report for the model
37 print("[INFO] evaluating...")
38 preds = model.predict(db["features"][i:])
39 print(classification_report(db["labels"][i:], preds,
40 target_names=le.classes_))
41
42 # compute the raw accuracy with extra precision
43 acc = accuracy_score(db["labels"][i:], preds)
44 print("[INFO] score: {}".format(acc))

We then serialize the model to disk so we can use it in the truck_id.py script in Section
10.6:

46 # serialize the model to disk
47 print("[INFO] saving model...")
48 f = open(conf["model_path"], "wb")
49 f.write(pickle.dumps(model.best_estimator_))
50 f.close()
51
52 # close the database
53 db.close()

We’re almost complete with Phase #2!


10.5.4 Training Our Model

Let’s go ahead and train our Logistic Regression model for vehicle recognition. Open up a
terminal and execute the following command:

$ python train_model.py --conf config/truckid.json
[INFO] tuning hyperparameters...
[INFO] best hyperparameters: {'C': 0.1}
[INFO] evaluating...
              precision    recall  f1-score   support

       fedex       0.94      0.95      0.95        65
      ignore       0.98      1.00      0.99        58
         ups       0.97      0.97      0.97        59
        usps       0.97      0.95      0.96        79

    accuracy                           0.97       261
   macro avg       0.97      0.97      0.97       261
weighted avg       0.97      0.97      0.97       261

[INFO] score: 0.9655172413793104
[INFO] saving model...

As you can see, we are obtaining 96% accuracy. Knowing that we now have an accurate
truck identification model, we can move on to Phase #3 where we’ll build out our computer
vision app.

10.6 Phase #3: Implementing the Vehicle Recognition Pipeline

We are now ready to finish implementing the vehicle recognition pipeline! In this phase we
need two machines:

• The client, presumed to be a Raspberry Pi, which acts as an IP camera and is responsible
only for capturing frames and streaming them to the host.

• The server/host, assumed to be a more powerful laptop, desktop, or GPU machine, which
accepts frames from the RPi; runs motion detection, object detection, and vehicle
recognition; and then writes the identification results to disk as a video clip.

We’ll be implementing both of these scripts in this section.


10.6. PHASE #3: IMPLEMENTING THE VEHICLE RECOGNITION PIPELINE 241

10.6.1 Implementing the Client (Raspberry Pi)

Our client script, running on the Raspberry Pi, is responsible for capturing frames from the
video stream and then sending the frames to the server/host for processing via ImageZMQ.
The client.py script is identical to the one in Chapter 4, but we’ll review it here for
completeness:

1 # import the necessary packages
2 from imutils.video import VideoStream
3 import imagezmq
4 import argparse
5 import socket
6 import time
7
8 # construct the argument parser and parse the arguments
9 ap = argparse.ArgumentParser()
10 ap.add_argument("-s", "--server-ip", required=True,
11 help="ip address of the server to which the client will connect")
12 args = vars(ap.parse_args())
13
14 # initialize the ImageSender object with the socket address of the
15 # server
16 sender = imagezmq.ImageSender(connect_to="tcp://{}:5555".format(
17 args["server_ip"]))
18
19 # get the host name, initialize the video stream, and allow the
20 # camera sensor to warmup
21 rpiName = socket.gethostname()
22 #vs = VideoStream(src=0).start()
23 vs = VideoStream(usePiCamera=True).start()
24 time.sleep(2.0)
25
26 # loop over frames from the camera
27 while True:
28 # read the frame from the camera and send it to the server
29 frame = vs.read()
30 sender.send_image(rpiName, frame)

Line 2 imports our VideoStream which we’ll use to access our video stream, whether that
be a USB camera or a Raspberry Pi camera module. The imagezmq import on Line 3 is used
to interface with ImageZMQ for sending frames across the wire.

Lines 9-12 parse our command line arguments. The only argument we need is --server-ip,
the IP address of the server to which the client will connect. Lines 16 and 17 initialize the
sender used to send frames via ImageZMQ.

Line 23 then accesses our video stream. Lines 27-30 loop over frames from the camera
and then send them to the server for processing.

10.6.2 Implementing the Server (Host Machine)

The server script is considerably more complex than the client script as it’s responsible for:

i. Accepting frames from the RPi video stream

ii. Applying motion detection to the frame

iii. If motion is detected, applying object detection to the frame

iv. Extracting any vehicles from the frame

v. Using our trained vehicle recognition model to actually identify the vehicle

It is technically possible to run all of these operations on the RPi, but you would need a
dedicated coprocessor such as a Movidius NCS or Google Coral USB Accelerator to handle
the object detection and feature extraction components.

Let’s dive into the truck_id.py script now:

1 # import the necessary packages
2 from pyimagesearch.keyclipwriter import KeyClipWriter
3 from pyimagesearch.utils import Conf
4 from tensorflow.keras.applications import ResNet50
5 from tensorflow.keras.applications import imagenet_utils
6 from datetime import datetime
7 import numpy as np
8 import argparse
9 import imagezmq
10 import dropbox
11 import imutils
12 import pickle
13 import time
14 import cv2
15 import os
16
17 # construct the argument parser and parse the arguments
18 ap = argparse.ArgumentParser()
19 ap.add_argument("-c", "--conf", required=True,
20 help="path to the input configuration file")
21 args = vars(ap.parse_args())

Lines 2-15 import our required Python packages. You’ll want to take note that the ResNet50
model is being imported — that is the same model we used for Section 10.5.1 on feature ex-
traction. We’ll need to use ResNet to quantify the vehicle ROIs prior to passing them through
our Logistic Regression model for classification.

Lines 18-21 parse our command line arguments. Just as in the previous scripts in this chapter,
we only need to supply --conf, the path to our configuration file.

We can then load the configuration file, initialize the KeyClipWriter, and connect to Drop-
box (if necessary):

23 # load our configuration file
24 conf = Conf(args["conf"])
25
26 # initialize key clip writer and the consecutive number of
27 # frames that have *not* contained any action
28 kcw = KeyClipWriter(bufSize=conf["kcw_buffer_size"])
29 consecFrames = 0
30
31 # initialize the Dropbox client
32 client = None
33
34 # check to see if the Dropbox should be used
35 if conf["use_dropbox"]:
36 # connect to Dropbox and start the session authorization process
37 client = dropbox.Dropbox(conf["dropbox_access_token"])
38 print("[SUCCESS] dropbox account linked")

Take note of Line 29 where we initialize consecFrames — this variable is used to count the
total number of consecutive frames that have not contained any action. Once consecFrames
reaches a certain threshold we’ll know to stop recording the video clip.

Let’s now load our trained model files:

40 # load the COCO class labels our YOLO model was trained on
41 labelsPath = os.path.sep.join([conf["yolo_path"], "coco.names"])
42 LABELS = open(labelsPath).read().strip().split("\n")
43
44 # load the truck ID model and label encoder
45 print("[INFO] loading truck ID model...")
46 truckModel = pickle.loads(open(conf["model_path"], "rb").read())
47 le = pickle.loads(open(conf["label_encoder_path"], "rb").read())
48
49 # load the ResNet50 network
50 print("[INFO] loading ResNet...")
51 model = ResNet50(weights="imagenet", include_top=False)
52
53 # derive the paths to the YOLO weights and model configuration
54 weightsPath = os.path.sep.join([conf["yolo_path"], "yolov3.weights"])
55 configPath = os.path.sep.join([conf["yolo_path"], "yolov3.cfg"])
56
57 # load our YOLO object detector and determine only the *output*
58 # layer names that we need from YOLO
59 print("[INFO] loading YOLO from disk...")
60 net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)
61 ln = net.getLayerNames()
62 ln = [ln[i[0] - 1] for i in net.getUnconnectedOutLayers()]

This code block loads a handful of files, including:

• Our trained truckModel and LabelEncoder (Lines 46 and 47).

• The ResNet50 model pre-trained on ImageNet (Line 51).

• The YOLO object detector pre-trained on COCO (Lines 54-62).

We have one final code block of initializations before we can start looping over frames from
ImageZMQ:

64 # initialize the ImageZMQ image hub along with the frame dimensions
65 # from our input video stream
66 imageHub = imagezmq.ImageHub()
67 (W, H) = (None, None)
68
69 # initialize the MOG foreground background subtractor and
70 # morphological kernels
71 fgbg = cv2.bgsegm.createBackgroundSubtractorMOG(
72 history=conf["mog_history"], nmixtures=conf["mog_nmixtures"],
73 backgroundRatio=conf["mog_bg_ratio"],
74 noiseSigma=conf["mog_noise_sigma"])
75 eKernel = np.ones((3, 3), "uint8")
76 dKernel = np.ones((5, 5), "uint8")
77
78 # initialize the motion status flag
79 motion = False
80
81 # initialize the label, filename, and path
82 label = None
83 filename = None
84 localPath = None

Line 66 initializes the ImageZMQ imageHub while Line 67 initializes the spatial dimensions
of our input frame (which we’ll populate once we’re inside the while loop).

Lines 71-74 initialize the MOG background subtractor used to detect motion/foreground in
the input frames. We also initialize two kernels, one for erosion (eKernel) and another for
dilation (dKernel) — we’ll be using these kernels to cleanup our foreground segmentation.

Line 79 initializes motion, a flag used to indicate whether or not motion has occurred in
the frame.

Finally, Lines 82-84 initialize variables used to save our output vehicle identification clips to
disk.

We can now enter the body of the while loop used to process frames from the RPi client:

86 # loop over frames from the video stream
87 while True:
88 # read the next frame from the stream
89 (_, frame) = imageHub.recv_image()
90 imageHub.send_reply(b'OK')
91
92 # if the frame dimensions are empty, grab them
93 if W is None or H is None:
94 (H, W) = frame.shape[:2]
95
96 # reset the motion flag and apply background subtraction
97 motion = False
98 mask = fgbg.apply(frame)
99
100 # erode the mask to eliminate noise and then dilate the mask to
101 # fill in holes
102 mask = cv2.erode(mask, eKernel, iterations=2)
103 mask = cv2.dilate(mask, dKernel, iterations=5)
104
105 # find contours in the mask
106 cnts = cv2.findContours(mask.copy(), cv2.RETR_EXTERNAL,
107 cv2.CHAIN_APPROX_SIMPLE)
108 cnts = imutils.grab_contours(cnts)

Line 89 reads the next frame from the ImageZMQ video stream. If we haven’t already
grabbed the dimensions of the frame, we do so on Lines 93 and 94.

Line 97 sets motion to False, indicating there is no motion in the frame.

We then apply the MOG background subtractor on Line 98 to obtain the foreground/back-
ground mask. We apply a series of erosions and dilations to cleanup the mask (Lines 102 and
103) and then apply contour detection (Lines 106-108).

The following code block processes the contours to determine if motion has taken place:

110 # only proceed if at least one contour was found
111 if len(cnts) > 0:
112 # find the largest contour in the mask and calculate
113 # the area and bounding rectangle
114 c = max(cnts, key=cv2.contourArea)
115 (x, y, w, h) = cv2.boundingRect(c)
116
117 # compute the area of the bounding box and determine the
118 # relative area the bounding box occupies
119 area = w * h
120 minArea = (W * H) / conf["min_area_divisor"]
121 maxArea = (W * H) / conf["max_area_divisor"]
122
123 # check if the motion area is within range, and if so,
124 # indicate that motion was found
125 if area > minArea and area < maxArea:
126 motion = True
127
128 # otherwise there is no motion
129 else:
130 motion = False

Line 111 ensures that at least one contour was found. Provided we have at least one
contour, we find the largest one (Line 114) and compute its bounding box dimensions (Line
115).

Lines 119-121 compute the area of the largest bounding box and determine its relative
area (compared to the frame dimensions). Provided that the size of the motion area is greater
than the minArea and smaller than the maxArea, we set motion equal to True. Otherwise,
motion is set to False. We perform this check to help reduce false-positive detections.
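
As a quick worked example (the divisor values below are hypothetical placeholders; use whatever
values are in your truckid.json), here is how the area check plays out for a 640x480 frame:

# worked example of the motion area check, assuming a 640x480 frame and
# hypothetical config values min_area_divisor=80 and max_area_divisor=2
(W, H) = (640, 480)
minArea = (W * H) / 80    # 3840 pixels -- smaller regions are treated as noise
maxArea = (W * H) / 2     # 153600 pixels -- larger regions are likely a lighting
                          # change rather than a passing vehicle

# bounding box of a hypothetical passing truck
(w, h) = (320, 180)
area = w * h              # 57600 pixels

# prints True: large enough to matter, small enough to be believable
print(area > minArea and area < maxArea)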

Now that motion has been appropriately set, we should check to see if (1) motion was
detected or (2) we are already recording:

132 # check to see if (1) there was motion detected or (2) we are
133 # already recording (just in case the delivery truck stops and
134 # delivers to our house)
135 if motion or kcw.recording:
136 # construct a blob from the input frame and then perform a
137 # forward pass of the YOLO object detector, giving us our
138 # bounding boxes and associated probabilities
139 blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
140 swapRB=True, crop=False)
141 net.setInput(blob)
142 layerOutputs = net.forward(ln)
143
144 # otherwise, there is no motion or we are not recording, so
145 # indicate there were no outputs from YOLO (since we didn't have
146 # to run it)
147 else:
148 layerOutputs = []

Provided that motion was found or we are already recording, we need to apply YOLO object
detection to find the vehicle in the frame (Lines 139-142). Otherwise, if there is no motion
and we are not recording, we can simply set layerOutputs to an empty list (implying that no
vehicles needed to be detected).

The following code block is identical to the YOLO output processing from Section 10.4.2:

150 # initialize our lists of detected bounding boxes, confidences,
151 # and class IDs, respectively
152 boxes = []
153 confidences = []
154 classIDs = []
155
156 # loop over each of the layer outputs
157 for output in layerOutputs:
158 # loop over each of the detections
159 for detection in output:
160 # extract the class ID and confidence (i.e., probability)
161 # of the current object detection
162 scores = detection[5:]
163 classID = np.argmax(scores)
164 confidence = scores[classID]
165
166 # filter out weak predictions by ensuring the detected
167 # probability is greater than the minimum probability
168 if confidence > conf["yolo_confidence"]:
169 # scale the bounding box coordinates back relative to
170 # the size of the image, keeping in mind that YOLO
171 # actually returns the center (x, y)-coordinates of
172 # the bounding box followed by the boxes' width and
173 # height
174 box = detection[0:4] * np.array([W, H, W, H])
175 (centerX, centerY, width, height) = box.astype("int")
176
177 # use the center (x, y)-coordinates to derive the top
178 # and and left corner of the bounding box
179 x = int(centerX - (width / 2.0))
180 y = int(centerY - (height / 2.0))
181
182 # update our list of bounding box coordinates,
183 # confidences, and class IDs
184 boxes.append([x, y, int(width), int(height)])
185 confidences.append(float(confidence))
186 classIDs.append(classID)

Here we are looping over the outputs of the YOLO model, filtering out low probability detec-
tions, computing the bounding box coordinates of the object, and then populating our boxes,
confidences, and classIDs lists, respectively.

The next code block is also near identical to the YOLO processing in Section 10.4.2:

188 # apply non-maxima suppression to suppress weak, overlapping
189 # bounding boxes
190 idxs = cv2.dnn.NMSBoxes(boxes, confidences,
191 conf["yolo_confidence"], conf["yolo_threshold"])
192
193 # ensure at least one detection exists
194 if len(idxs) > 0:
195 # loop over the indexes we are keeping
196 for i in idxs.flatten():
197 # extract the bounding box coordinates
198 (x, y) = (boxes[i][0], boxes[i][1])
199 (w, h) = (boxes[i][2], boxes[i][3])
200
201 # if the label is not not one we are interested in, then
202 # ignore it
203 if LABELS[classIDs[i]] not in conf["classes"]:
204 continue

Lines 190 and 191 apply NMS to suppress weak, overlapping bounding boxes.

We then start looping over all kept idxs on Line 196, extracting the bounding box coordinates
of the current object on Lines 198 and 199. Lines 203 and 204 then check that the current
object belongs to a class we are interested in (i.e., either a “truck” or “bus”).

We are now ready to perform vehicle recognition:

206 # truncate any bounding box coordinates that may fall
207 # outside the boundaries of the image
208 xMin = max(0, x)
209 yMin = max(0, y)
210 xMax = min(W, x + w)
211 yMax = min(H, y + h)
212
213 # extract the ROI of the frame
214 roi = frame[yMin:yMax, xMin:xMax]
215
216 # convert our ROI from BGR to RGB channel ordering, then
217 # resize the ROI to 224x224
218 roi = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)
219 roi = cv2.resize(roi, (224, 224))
220
221 # preprocess the ROI by (1) expanding the dimensions and
222 # (2) subtracting the mean RGB pixel intensity from the
223 # ImageNet dataset
224 roi = np.expand_dims(roi, axis=0)
225 roi = imagenet_utils.preprocess_input(roi)
226
227 # pass the ROI through our feature extractor network and
228 # reshape them
229 features = model.predict(roi)
230 features = features.reshape((features.shape[0], 100352))
231
232 # take the features and use the truck ID model to classify
233 # the ROI
234 preds = truckModel.predict(features)
235 label = le.inverse_transform(preds)[0]

Lines 208-211 truncate any bounding box coordinates that may fall outside the range of
the image. We use these coordinates to perform NumPy array slicing to extract the roi of the
vehicle (Line 214).

This roi is then converted from BGR to RGB channel ordering and resized to 224x224
pixels, followed by preprocessing the roi via mean subtraction (Lines 218-225). You’ll notice
that these are the exact preprocessing steps we performed in Section 10.5.1 when performing
feature extraction on our original vehicle dataset.

Line 229 passes the roi through ResNet50 (without the FC layer head), yielding our
feature vector. This feature vector is then supplied to the truckModel (i.e., the trained Logistic
Regression model) which gives us the final vehicle identification (Lines 234 and 235).

We can then draw the bounding box rectangle and label for the vehicle on the frame:

237 # draw a bounding box rectangle and label on the frame
238 cv2.rectangle(frame, (x, y), (x + w, y + h),
239 (0, 255, 0), 2)
240 text = "{}: {:.4f}".format(label, confidences[i])
241 cv2.putText(frame, text, (x, y - 5),
242 cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
243
244 # check to see if a recording has not been started
245 if not kcw.recording:
246 # start recording
247 print("[INFO] recording...")
248 timestamp = datetime.now()
249 filename = "{}.avi".format\
250 (timestamp.strftime("%Y%m%d-%H%M%S"))
251 localPath = "{}/{}/{}".format(
252 conf["output_videos_path"], label, filename)
253 kcw.start(localPath, cv2.VideoWriter_fourcc(
254 *conf["codec"]), conf["output_fps"])
255
256 # restart the consecutive frames we've been recording
257 consecFrames = 0

Line 245 checks to see if we are not recording a video clip, and if not, we start recording
(Lines 248-254).

Our next code block handles the case where (1) we are currently recording and (2) we have
reached a threshold number of consecutive frames without motion (indicating the vehicle has
passed out of the view of our camera):

259 # if we are recording and reached a threshold on consecutive
260 # number of frames with no action, stop recording the clip
261 if kcw.recording and consecFrames >= conf["kcw_buffer_size"]:
262 # finish the recording
263 print("[INFO] ...stopped recording!")
264 kcw.finish()
265
266 # optionally upload the clip to Dropbox
267 if conf["use_dropbox"]:
268 # upload the video to Dropbox
269 print("[INFO] uploading... {}".format(kcw.outputPath))
270 db_path = "/{}/{}/{}".format(
271 conf["dropbox_base_path"], label, filename)
272 client.files_upload(open(kcw.outputPath, "rb").read(),
273 db_path)
274 print("[INFO] ...done uploading!")
275
276 # optionally delete the local video
277 if conf["delete_local_video"]:
278 os.remove(kcw.outputPath)
279
280 # reset the filename, path, and consecutive motion frames
281 filename = None
282 localPath = None
283 consecFrames = 0

Line 264 finishes up the recording of the key event clip. We then check to see if the video
clip should be uploaded to Dropbox on Lines 267-278. We wrap up the video recording by
resetting the filename, localPath, and consecFrames count for the next time a vehicle enters
the scene.

The final step is to update our KeyClipWriter with the current frame, update the
consecFrames count (if necessary), and display the output frame to our screen:

285 # update the key frame clip buffer
286 kcw.update(frame)
287
288 # increment the consecutive frames if necessary
289 if kcw.recording:
290 consecFrames += 1
291
292 # display the output frame
293 frameLarge = imutils.resize(frame.copy(), width=900)
294 cv2.imshow("frame", frameLarge)
295 key = cv2.waitKey(1)
296
297 # if the `q` key was pressed, break from the loop
298 if key == ord("q"):
299 break
300
301 # if we are in the middle of recording a clip, wrap it up
302 if kcw.recording:
303 kcw.finish()

That was quite a lot of work, but we’re done now!

10.7 Vehicle Recognition Results

Figure 10.4: Montage of vehicle recognition results using a tiered deep learning model approach.
Each vehicle is correctly recognized.

With our truck_id.py file complete, let’s now put vehicle recognition to the test!

Make sure you have ZMQ installed, then open up a terminal and execute the following
command to start the server on your laptop, desktop, or host machine:

$ python truck_id.py --conf config/truckid.json
[SUCCESS] dropbox account linked
[INFO] loading truck ID model...
[INFO] loading ResNet...
[INFO] loading YOLO from disk...
[INFO] recording...
[INFO] ...stopped recording!
[INFO] uploading... 20190215-105621.avi
[INFO] ...done uploading!
[INFO] recording...
[INFO] ...stopped recording!
[INFO] uploading... 20190215-110347.avi
[INFO] ...done uploading!

Figure 10.5: Montage of vehicle recognition results using a tiered deep learning model approach.
The vehicle on the bottom-left is incorrectly recognized as "usps" but then correctly recognized as
"ignore" on the bottom-right.

Then, open up a new terminal, and then launch the Raspberry Pi client script:

$ python client.py --server-ip 192.168.1.113

Once executed, your Raspberry Pi will act as a “network camera” and start streaming from
the RPi to the server. The server will then process the frames, apply object detection, locate
any vehicles, identify them, and then write the results to disk as a video clip.

Figure 10.6: Montage of vehicle recognition results using a tiered deep learning model approach.
On the bottom-right, the image experiences "washout" from the sun aiming into the camera sensor.

Examples of our vehicle recognition results can be seen in Figures 10.4-10.6. While our
system performs well, referring to the figures and their captions, you can see that it is not
100% reliable. On some occasions, the truck is incorrectly recognized. On other occasions,
camera sensor washout causes the vehicle to not be recognized at all. The washout problem
is not the deep learning recognizer's fault; perhaps the image quality would be better with a
polarizing lens filter.

To improve accuracy, I would suggest capturing more examples of trucks to include in the
"ignore" folder for the training set and balancing the dataset as needed.

When building your own vehicle recognition system I suggest you follow the recipe in this
chapter. Additionally, you may be able to run the entire vehicle detection, feature extraction,
and identification models on the Raspberry Pi itself provided that you use a coprocessor such
as the Movidius NCS or Google Coral USB Accelerator to offload the object detector and ideally
feature extractor (the Logistic Regression model can easily run on the CPU). This would also
be a case where you definitely need at least a Raspberry Pi 4B.

10.8 Summary

In this chapter you learned how to build a vehicle recognition system using computer vision
and deep learning.

We framed our vehicle recognition project as delivery truck identification, capable of log-
ging when delivery trucks stop at your house. Using such a system you can monitor your home
for deliveries (and ideally help reduce the risk of package theft).

To build such a system we divided it into three phases:

i. Phase #1: Building the dataset

ii. Phase #2: Utilizing transfer learning to build the vehicle recognition model

iii. Phase #3: Applying vehicle recognition via the RPi and ImageZMQ

As our results demonstrated, our system is quite accurate.

You should use this project as a starting point when developing your own vehicle identifica-
tion projects.

Finally, if you are interested in learning more about vehicle recognition, including how to
recognize the make (ex., “Tesla”) and model (ex., “Model-S”) of vehicle, be sure to refer to Deep
Learning for Computer Vision with Python (http://pyimg.co/dl4cv) [50] which can recognize
make/model with over 96% accuracy.
Chapter 11

What is the Movidius NCS

One of the biggest challenges of developing computer vision and deep learning applications
on the Raspberry Pi is trying to balance speed with computational complexity.

The Raspberry Pi, by definition, is an embedded device with limited computational re-
sources; however, we know that computationally intensive deep learning algorithms are re-
quired in order to build robust, highly accurate computer vision systems.

Trying to balance speed with computational complexity is like trying to push together identi-
cal poles of a magnet — they repel instead of attract.

So, what do we do?

Have we reached an impasse?

Are we limited to having to choose between speed and computational complexity on em-
bedded devices?

Prior to 2017 the answer would have been a resounding “Yes”.

But just as deep learning has enabled software to obtain unprecedented accuracy on com-
puter vision tasks, hardware is now enabling better, more efficient deep learning, creating a
feedback loop where one empowers the other.

Intel’s Movidius Neural Compute Stick (NCS) is one of the first pieces of hardware to bring
real-time computer vision and deep learning to the Raspberry Pi. The NCS is actually a co-
processor — a USB stick that plugs into your embedded device.

When you’re ready to apply deep learning to the NCS, you simply load an NCS-optimized
model into your software, and pass data through the network. The onboard co-processor
handles the computation in an efficient manner, enabling you to obtain faster performance
than using your CPU alone.


In this chapter, as well as the following chapters, you’ll discover how to use the Movidius
NCS in your own embedded deep learning applications, ensuring you can have your cake and
eat it too.

11.1 Chapter learning objectives

In this chapter, you will learn about:

i. The Intel Movidius Neural Compute Stick including its capabilities, initial rollout, and
promising technology.

ii. OpenVINO, an Intel device-optimized flavor of OpenCV.

You’ll also briefly be introduced to alternative/competing products.

Let’s get started.

11.2 What is the Intel Movidius Neural Compute Stick?

Figure 11.1: The Intel Movidius Neural Compute Stick is a USB-based deep learning coprocessor.
It is designed around the Myriad processor and is geared for single board computers like the
Raspberry Pi.

Marketed as the “Vision Processing Unit (VPU) with a dedicated Neural Compute Engine”
(NCE), the Intel Movidius Neural Compute Stick (NCS) is a USB stick which is optimized for
deep learning inference.

The Movidius company (founded in 2005) designed the Myriad X processor with the VPU/
NCE capability and Intel quickly bought the technology in September 2016 [9]. Intel/Movid-
ius has brought the technology to many embedded camera devices, drones such as the DJI
Tello, and the hobbyist community in the form of a USB stick that pairs well with Single Board
Computers (SBCs) like the Raspberry Pi.

According to Tractica market research, “the total revenue of the AI-related deep learning
chip market is forecast to rise from $500M USD in 2016 to $12.2B USD in 2025” [63]. Of
course 2016 was a few years ago and now we have competing products such as the Google
Coral Tensor Processing Unit (TPU) which is covered in the Complete Bundle.

In the remainder of this chapter, we’ll learn about what the Movidius NCS is capable of.

We’ll answer the question, “Is the Movidius NCS a GPU?”

From there, we’ll learn about the Movidius NCS product’s history and promising future with
the advent of the OpenVINO library.

Is the Movidius NCS a GPU?

No — far from it.

The NCS is small, USB-based, and draws only around 2.5W, in comparison to a 300W
NVIDIA K80 GPU. A full-blown GPU like a K80 actually requires an even bigger Power Supply
Unit (PSU) to power the entire motherboard, more GPUs, and other peripherals.

The NCS is designed to work in tandem with your Single Board Computer (SBC) CPU while
being very efficient for deep learning tasks.

Generally, it cannot be used for training like your GPU is geared for — rather, the NCS is
designed for deep learning inference.

The Raspberry Pi makes the perfect companion for the NCS — the deep learning stick
allows for much faster inference and allows the RPi CPU to work on other tasks.

Do I need a GPU to train my model prior to deployment with the NCS?

In most cases, "Yes."

The NCS is not designed to be used for training a deep learning model.

You’ll find yourself using a full sized computer with a more powerful setup to train a deep
learning model. This could be a laptop with CPU for a small model. Or it could be a full sized
deep learning rig desktop outfitted with lots of memory, powerful CPU(s), and powerful GPU(s).
From there you’d transfer your trained model to the target device (i.e. the Raspberry Pi with the
Movidius plugged in).

11.3 What can the Movidius NCS do?

The Movidius NCS is capable of deep learning classification, object detection, and feature
extraction. Deep learning segmentation is also possible.

A selection of Movidius NCS-compatible models is available at the following links:

• OpenVINO Model Zoo: http://pyimg.co/dc8ck [64]

• Model Documentation: http://pyimg.co/78cqs [65]

If you have the option, I highly recommend working with pretrained models that you find in
this book and in online examples.

You should exhaust all possible pretrained models before you embark on training and de-
ploying your own model to the Movidius NCS.

Thus far, the Movidius best supports Caffe and TensorFlow models. Unfortunately Keras
models are yet to be supported, although contacts on the product team at Intel tell us that
Keras support is highly requested and in their roadmap.

11.4 Intel Movidius’ NCS History Lesson

As with the advent of any new product, the Movidius NCS product had a tough start, but now
that the product is backed by Intel, it is set up for success. In this section we’ll briefly discuss
the history of the NCS and how we got to where we are today.

11.4.1 Product Launch

The NCS launched in 2017 and PyImageSearch was quick to get our hands on one to write
two blog posts about the shiny blue USB stick [66, 67].

The blog posts were a huge hit — they exceeded traffic expectations. It was clear that the
Raspberry Pi community had both (1) the need for such a product, and (2) high expectations
for the product.

Movidius/Intel certainly brought their product to the market at the right time to ride the deep
learning wave.

The NCS was launched with the Neural Compute Software Development Kit (NCSDK) Appli-
cation Program Interface version 1 (APIv1). The APIv1 left a lot to be desired, but it functioned
well enough to convince the community of the speed and promise of Myriad and the NCS.

Part of the SDK included a challenging, hard-to-use tool for converting deep learning models
into “graph files.” These graph files were required in order to use your own models with the
Movidius NCS. The tool only supported Caffe and TensorFlow. Furthermore, not all CNN
architectures were supported, which frustrated many deep learning practitioners.

Early technology adopters are used to these challenges.

We all knew that Intel likely had something in development to address both the APIv1 and
graph file tool.

The question was: When will working with Movidius become easier?

Some of the issues were fixed with the release of APIv2 in mid-2018, which brought virtual
environment and Docker support (alleviating the need for an Ubuntu VM to convert models to
graph files).

Some were quick to port their code and others continued to watch from the sidelines.

The Raspberry Pi communities online persevered by posting examples on GitHub and blogs
highlighting the breakthroughs, limitations, and wishes.

Finally, one day in late-2018, Intel announced OpenVINO to the public.

11.4.2 Meet OpenVINO and the NCS2

Figure 11.2: Transitioning from the NCSDK to the OpenVINO Toolkit. OpenVINO is far easier to
use and work with than the NCSDK. Intel recommends using OpenVINO instead of the NCSDK
for all projects [68].

The OpenVINO framework is an absolute game-changer.

OpenVINO presents a much simpler API and makes the Movidius a lot easier to work with.
OpenVINO is essentially an Intel device-optimized version of OpenCV supporting deep learn-
ing on a range of Intel hardware (GPUs, CPUs, Coprocessors, FPGAs, and more).

Deep learning inference with the Movidius now requires little to no code updates to perform
inference (in comparison to a standard OpenCV DNN module inference script).

Simply set the target processor (either the Raspberry Pi CPU or the Myriad NCS processor)
and the rest of the code is mostly the same.
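
As a rough illustration (a minimal sketch assuming an OpenVINO-enabled build of OpenCV and
placeholder Caffe model files on disk, not code from this chapter), switching targets with
OpenCV's dnn module looks like this:

# minimal sketch of switching inference targets with OpenCV's dnn
# module -- assumes an OpenVINO-enabled OpenCV build and placeholder
# Caffe model files on disk
import cv2

net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "model.caffemodel")

# target the Raspberry Pi CPU...
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

# ...or target the Myriad processor on the Movidius NCS (the Inference
# Engine backend is required for the Myriad target) -- the remaining
# inference code (blobFromImage, setInput, forward) stays the same
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)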

Be sure to refer to PyImageSearch's first article highlighting the benefits of OpenVINO:
http://pyimg.co/vrln6 [69].

OpenVINO supports the NCS1 (with no firmware upgrade) and the newer, faster NCS2
(announced on November 14th 2018 [9]).

The Raspberry Pi community was thrilled about OpenVINO and its long term viability. No
longer were we to rely on a challenging, non-Pythonic API. In fact, APIv1 and APIv2 were laid
to rest in mid-2019; they are no longer supported by Intel, but there are some legacy models
in the ModelZoo and the code is out there if you need it: http://pyimg.co/dc8ck [64].

11.4.3 Raspberry Pi 4 released (with USB 3.0 support)

Figure 11.3: The Raspberry Pi 4B has USB 3.0 capability, unlocking the full potential of the Intel
Movidius NCS2 deep learning stick.

Shortly after OpenVINO and the NCS2 were released, the Raspberry Pi 4B line was re-
leased with highly anticipated USB 3.0 support.

Why does USB 3.0 matter?

The NCS has supported USB 3.0 since the beginning, but the Raspberry Pi hardware (3B
and 3B+) was stuck on USB 2.0. The lack of USB 3.0 added significant data transfer overhead,
severely limiting overall inference speed when using a Raspberry Pi. Some users switched to
alternative hardware (i.e., non-RPi SBCs).

Apart from the Raspberry Pi 4B being 2x faster altogether, the time it takes to perform
inference on a frame is also reduced as we can transfer the data back and forth much more
quickly using USB 3.0.

11.5 What are the alternatives to the Movidius NCS?

The Movidius was the first product in its class to make a significant dent in the marketplace.
The chip is present in many cameras and devices.

Of course there are competing chips, but are there competing USB-based devices?

The answer is "Yes."

The main competitor is the Google Coral Tensor Processing Unit (TPU) with the benefit
being that it is backed by the Google behemoth. The Google Coral is covered in the Complete
Bundle of this text.

As computer vision tasks move to IoT/Edge devices, it will not be uncommon for new prod-
ucts to enter the marketplace. Our recommendation is not to jump on new products too
quickly. If you do find yourself evaluating a new coprocessor product, be sure to conduct a
suite of benchmarks (refer to Section 23.3 of the Hobbyist Bundle) across a variety of CNN
architectures. Always put the device through the wringer by processing video files for extended
periods of time so that the device warms up and the processor really gets a good workout.
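
If you need a starting point for such a benchmark, the sketch below (the video filename and
Caffe model files are placeholders for whatever device and model you are evaluating) simply
loops over a video file, runs inference on every frame, and reports the average throughput:

# rough benchmarking sketch: run inference over every frame of a video
# file and report the average FPS -- "benchmark.mp4" and the model
# files are placeholders for whatever you are evaluating
import time
import cv2

net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "model.caffemodel")
cap = cv2.VideoCapture("benchmark.mp4")
(frames, start) = (0, time.time())

while True:
	# read the next frame, stopping when the video is exhausted
	(grabbed, frame) = cap.read()
	if not grabbed:
		break

	# preprocessing values depend on the model being benchmarked
	blob = cv2.dnn.blobFromImage(cv2.resize(frame, (224, 224)),
		1.0, (224, 224), (104.0, 117.0, 123.0))
	net.setInput(blob)
	net.forward()
	frames += 1

elapsed = time.time() - start
print("[INFO] {:.2f} FPS over {} frames".format(frames / elapsed, frames))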

11.6 Summary

When building deep learning applications on the Raspberry Pi, you’ll find yourself trying to
balance speed with computational complexity. The problem is, by nature, deep learning
models are incredibly computationally hungry, and the underpowered CPU on your embedded
device won’t be able to keep up with its appetite.

So, what do you do in those situations?

The answer is to turn to a co-processor device such as Intel’s Movidius NCS or the Google
Coral TPU USB Accelerator. These devices are essentially hardware-optimized USB sticks that
plug into your RPi and then handle the heavy lifting of deep learning inference on your device.

In the next few chapters in the text you’ll learn how to utilize the Movidius NCS to give your
deep learning applications a much needed speedup on the RPi.
Chapter 12

Image Classification with the Movidius NCS

As discussed in Chapter 11, the Intel Movidius NCS is a deep learning coprocessor designed
for Single Board Computers (SBCs) like the Raspberry Pi.

While the Intel Movidius NCS is not a fully-fledged GPU, it packs a punch that might be
just what you’re looking for to gain a few FPS in your Raspberry Pi classification (or object
detection) project.

In this chapter we’ll learn how to deploy pretrained classification models to your RPi with
the Intel Movidius NCS and OpenVINO.

If you want to train your own model for the NCS, be sure to refer to the RPi4CV Complete
Bundle.

12.1 Chapter Learning Objectives

In this chapter we will:

i. Perform image classification with the Movidius NCS and OpenVINO

ii. See how Raspberry Pi CPU classification is similar

iii. Analyze and compare classification and benchmarking results

12.2 Image Classification with the Movidius NCS

This chapter is organized into six sections.


First, we’ll review the project structure. From there, we’ll review our configuration file which
makes working with our CPU and Movidius classification scripts easy.

We’ll then dive into our Movidius image classification. We’ll review the differences between
the CPU and Movidius image classification scripts; they are nearly identical, a testament to
how much work Intel has put into making the OpenVINO API seamless and easy to use.

Finally, we’ll compare results and calculate the Movidius classification speedup versus the
CPU for SqueezeNet and GoogLeNet.

Let’s begin!

12.2.1 Project Structure

|-- config
|   |-- config.json
|-- images
|   |-- beer.png
|   |-- brown_bear.png
|   |-- dog_beagle.png
|   |-- keyboard.png
|   |-- monitor.png
|   |-- space_shuttle.png
|   |-- steamed_crab.png
|-- models
|   |-- googlenetv4
|   |   |-- caffe
|   |   |   |-- googlenet-v4.caffemodel
|   |   |   |-- googlenet-v4.prototxt
|   |   |-- ir
|   |       |-- googlenet-v4.bin
|   |       |-- googlenet-v4.mapping
|   |       |-- googlenet-v4.xml
|   |-- squeezenet
|       |-- caffe
|       |   |-- squeezenet1.1.caffemodel
|       |   |-- squeezenet1.1.prototxt
|       |-- ir
|           |-- squeezenet1.1.bin
|           |-- squeezenet1.1.mapping
|           |-- squeezenet1.1.xml
|-- pyimagesearch
|   |-- utils
|   |   |-- __init__.py
|   |   |-- conf.py
|   |-- __init__.py
|-- imagenet_class_index.json
|-- imagenet_classes.pickle
|-- classify_cpu.py
|-- classify_movidius.py

Our scripts for this project are nearly identical. We will review classify_movidius.py
in detail. The classify_cpu.py file will not be reviewed in its entirety. Rather, we will only
review specific lines where the changes are. As you follow along, you should open both scripts
side-by-side on your computer screen. Both scripts include timestamps such that we can
benchmark the Movidius NCS vs. the CPU.

12.2.2 Our Configuration File

Our project settings are organized in a JSON file for convenience to eliminate the need for a
metric ton of command line arguments or even separate scripts per GoogLeNet and SqueezeNet.

Let’s inspect our configuration file, config.json, now:

1 {
2 // paths to the model files
3 "model_paths": {
4 "ir": {
5 "squeezenet": {
6 "xml": "models/squeezenet/ir/squeezenet1.1.xml",
7 "bin": "models/squeezenet/ir/squeezenet1.1.bin"
8 },
9 "googlenet": {
10 "xml": "models/googlenetv4/ir/googlenet-v4.xml",
11 "bin": "models/googlenetv4/ir/googlenet-v4.bin"
12 }
13 },
14 "caffe": {
15 "squeezenet": {
16 "prototxt": "models/squeezenet/caffe/squeezenet1.1.prototxt",
17 "caffemodel": "models/squeezenet/caffe/squeezenet1.1.caffemodel"
18 },
19 "googlenet": {
20 "prototxt": "models/googlenetv4/caffe/googlenet-v4.prototxt",
21 "caffemodel": "models/googlenetv4/caffe/googlenet-v4.caffemodel"
22 }
23 }
24 },

Our "model_paths" are organized by "ir" (Movidius NCS graph files) and "caffe"
(Caffe files for the CPU). The pre-trained Movidius-compatible files came from OpenVINO/
OpenCV’s ModelZoo (http://pyimg.co/dc8ck [64]). The pre-trained CPU-compatible files came
from Intel’s official pre-trained model page (http://pyimg.co/78cqs [65]). These files are included
in the project folder so that you don’t have to go searching for them.

Let’s review our preprocessing settings:

26 // preprocessing instructions for each model
27 "preprocess": {
28 "squeezenet":
29 {
30 "input_size": [227, 227],
31 "mean": [104.0, 117.0, 123.0],
32 "scale": 1.0
33 },
34 "googlenet":
35 {
36 "input_size": [299, 299],
37 "mean": 128.0,
38 "scale": 128.0
39 }
40 },

Preprocessing settings are organized on Lines 27-40 under "preprocess". SqueezeNet and
GoogLeNet each require different input dimensions ("input_size"), mean subtraction values
("mean"), and scaling factors ("scale").

Our CPU script performs both mean subtraction and scaling, but the Movidius NCS script
does not since the preprocessing steps are baked into the .bin and .xml files (the "ir"
paths above).

Finally, our ImageNet dataset class labels are specified in the .pickle file on Line 43:

42 // path to the imagenet labels file
43 "labels_path": "imagenet_classes.pickle"
44 }

12.2.3 Image Classification with the Movidius NCS and OpenVINO

If you’ve never used the old Movidius NCS APIs (APIv1 and APIv2), you really dodged a bullet.
Intel’s newer OpenVINO-compatible release of OpenCV now takes care of a lot of the hard
work for us.

As we’ll see in this section, as well as the following one, the scripts have very minor differ-
ences between deploying a model for the Movidius NCS or CPU to perform inference.

Go ahead and open a new file named classify_movidius.py and insert the following
code:

1 # import the necessary packages
2 from openvino.inference_engine import IENetwork
3 from openvino.inference_engine import IEPlugin
4 from pyimagesearch.utils import Conf
5 import numpy as np
6 import argparse
7 import pickle
8 import time
9 import cv2

Lines 2-9 import our packages. Our openvino imports are on Lines 2 and 3.

Let’s go ahead and parse command line arguments:

11 # construct the argument parser and parse the arguments
12 ap = argparse.ArgumentParser()
13 ap.add_argument("-i", "--image", required=True,
14 help="path to the input image")
15 ap.add_argument("-m", "--model", required=True,
16 choices=["squeezenet","googlenet"],
17 help="model to be used for classify")
18 ap.add_argument("-c", "--conf", required=True,
19 help="path to the input configuration file")
20 args = vars(ap.parse_args())

Our Movidius NCS classification script requires three command line arguments:

• --image: The path to the input image.

• --model: The deep learning classification model — either squeezenet or googlenet.

• --conf: The path to the configuration file we reviewed in the previous section.

When these arguments are provided via the terminal, the script will handle loading the
image, model, and configuration so that we can conduct classification with the Movidius NCS.

Let’s go ahead and load the configuration and classes now:

22 # load the configuration file
23 conf = Conf(args["conf"])
24
25 # load the imagenet class labels
26 classes = pickle.loads(open(conf["labels_path"], "rb").read())

Line 23 loads the JSON-based configuration.

Then, Line 26 loads the ImageNet classes using the "labels_path" (path to the
.pickle file) contained within the conf dictionary.

Now we’ll setup the Intel Movidius NCS with our pretrained model:

28 # initialize the plugin for specified device
29 plugin = IEPlugin(device="MYRIAD")
30
31 # read the IR generated by the Model Optimizer (.xml and .bin files)
32 print("[INFO] loading models...")
33 net = IENetwork(
34 model=conf["model_paths"]["ir"][args["model"]]["xml"],
35 weights=conf["model_paths"]["ir"][args["model"]]["bin"])
36
37 # prepare input and output blobs
38 print("[INFO] preparing inputs...")
39 inputBlob = next(iter(net.inputs))
40 outputBlob = next(iter(net.outputs))
41
42 # set the default batch size as 1 and grab the number of input blobs,
43 # number of channels, the height, and width of the input blob
44 net.batch_size = 1
45 (n, c, h, w) = net.inputs[inputBlob].shape
46
47 # load model to the plugin
48 print("[INFO] Loading model to the plugin...")
49 execNet = plugin.load(network=net)

Line 29 initializes the OpenVINO plugin. The "MYRIAD" device is the Movidius NCS pro-
cessor.

Lines 33-35 load the CNN. The model and weights file paths are provided directly from
the config file. Either SqueezeNet or GoogLeNet is loaded via the --model switch. Here we
are grabbing the "ir" paths which include .xml and .bin files.

Lines 39 and 40 prepare our inputs and outputs.

We will be performing inference on only one image at a time, so our net.batch_size is
set to 1 (Line 44).

Line 45 then grabs the batch size (n), the number of channels (c), and the required height
and width of the input blob (h and w).

From there, Line 49 goes ahead and loads the net onto the Movidius NCS. This step
only needs to be completed once, but it can take some time. If you were to be performing
classification on a video stream, rest assured that from here, inference will be as fast as it can
be.

Next, we will load and preprocess our input image:

51 # load the input image and resize input frame to network size
52 orig = cv2.imread(args["image"])
53 frame = cv2.resize(orig, (w, h))
54
55 # change data layout from HxWxC to CxHxW
56 frame = frame.transpose((2, 0, 1))
57 frame = frame.reshape((n, c, h, w))

Lines 52-57 load our image from the path provided in --image and resize/reshape it.

Mean subtraction and image scaling preprocessing is baked into the .bin and .xml files
(the "ir" paths in our config). Be sure to refer to OpenVINO's Optimization Guide
(http://pyimg.co/hyleq [70]) and search the page for "image mean/scale parameters".

We’re now ready to perform classification inference with the Movidius NCS:

59 # perform inference and retrieve final layer output predictions
60 print("[INFO] classifying image...")
61 start = time.time()
62 results = execNet.infer({inputBlob: frame})
63 results = results[outputBlob]
64 end = time.time()
65 print("[INFO] classification took {:.4f} seconds...".format(
66 end - start))

Lines 62 and 63 perform inference on the frame.

Timestamps are taken before and after inference — the elapsed inference time is printed
via Lines 65 and 66 for benchmarking purposes.

Let’s process the top-five results:

68 # sort the indexes of the probabilities in descending order (higher
69 # probability first) and grab the top-5 predictions
70 idxs = np.argsort(results.reshape((1000)))[::-1][:5]
71
72 # loop over the top-5 predictions and display them
73 for (i, idx) in enumerate(idxs):
74 # check if the model type is SqueezeNet, and if so, retrieve
75 # the probability using special indexing as the output format of
76 # SqueezeNet is (1, 1000, 1, 1)
77 if args["model"] == "squeezenet":
78 proba = results[0][idx][0][0]
79
80 # otherwise, it's a GoogLeNet model so retrieve the probability
81 # from the output which is of the form (1, 1000)
82 else:
83 proba = results[0][idx]

Line 70 sorts and grabs the top-five results indexes.


Using the idxs, we loop over them (Line 73). Inside the loop, we grab the classification
probability (proba). The results have different shapes depending on whether we used the
squeezenet (Lines 77 and 78) or googlenet (Lines 82 and 83) model.

From here we will (1) annotate our output image with the top result, and (2) print the top-five
results to the terminal:

85 # draw the top prediction on the input image
86 if i == 0:
87 text = "Label: {}, {:.2f}%".format(classes[idx],
88 proba * 100)
89 cv2.putText(orig, text, (5, 25), cv2.FONT_HERSHEY_SIMPLEX,
90 0.7, (0, 0, 255), 2)
91
92 # display the predicted label + associated probability to the
93 # console
94 print("[INFO] {}. label: {}, probability: {:.2f}%".format(i + 1,
95 classes[idx], proba * 100))
96
97 # display the output image
98 cv2.imshow("Result", orig)
99 cv2.waitKey(0)

Lines 86-90 annotate our orig input image with the top classification result and probability.

Lines 94 and 95 print out the top-five classification results and probabilities as the loop
iterates.

Finally, the freshly annotated image is displayed on screen until a key is pressed (Lines 98
and 99).

12.2.4 Minor Changes for CPU Classification

Raspberry Pi image classification with the CPU is nearly identical for this project. In this
section, we will inspect the minor differences.

Go ahead and open classify_cpu.py.

Let’s review the differences, while paying attention to the line numbers since identical
blocks will not be reviewed:

1 # import the necessary packages
2 from pyimagesearch.utils import Conf
3 import numpy as np
4 import argparse
5 import pickle
6 import time
7 import cv2

Lines 2-7 import our packages; for CPU-based classification, we do not need the openvino
imports.

Now let’s jump to Line 26 where the model is loaded:

26 # load our serialized model from disk, set the preferable backend and
27 # set the preferable target
28 print("[INFO] loading model...")
29 net = cv2.dnn.readNetFromCaffe(
30 conf["model_paths"]["caffe"][args["model"]]["prototxt"],
31 conf["model_paths"]["caffe"][args["model"]]["caffemodel"])
32 net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
33 net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

For classification with the CPU, we use OpenCV’s standard dnn module (Lines 29-31).

Notice that we are grabbing the "caffe" paths (the "prototxt" and "caffemodel").
This is as opposed to our Movidius NCS "ir" paths (the "xml" model and "bin" weights).

Additionally, take note of the backend and target (Lines 32 and 33). We are telling the
OpenVINO implementation of OpenCV that we will use the OpenCV backend and CPU (rather
than the Myriad processor on the Movidius NCS).

Preprocessing comes next and it is slightly different than our Movidius script:

35 # retrieve various preprocessing values such as input height/width,


36 # channel mean, and scale factor
37 (H, W) = conf["preprocess"][args["model"]]["input_size"]
38 mean = conf["preprocess"][args["model"]]["mean"]
39 scale = 1.0 / conf["preprocess"][args["model"]]["scale"]
40
41 # load the input image and resize input frame to network size
42 orig = cv2.imread(args["image"])
43 frame = cv2.resize(orig, (W, H))

Lines 38 and 39 set up our mean subtraction value and pixel scaling factor. We didn’t need to perform mean subtraction or pixel scaling previously because those steps are baked into the .bin and .xml files (generated using the OpenVINO model optimizer). Here, however, we are using the original Caffe model, so we need to perform these preprocessing steps manually.

The orig image is loaded and the frame is resized (Lines 42 and 43) prior to us creating
a blob for classification inference:

45 # convert the frame to a blob and perform a forward pass through the
46 # model and obtain the classification result
47 blob = cv2.dnn.blobFromImage(frame, scale, (W, H), mean)
48 print("[INFO] classifying image...")
49 start = time.time()
50 net.setInput(blob)
51 results = net.forward()
52 end = time.time()
53 print("[INFO] classification took {:.4f} seconds...".format(
54 end - start))

Our image is converted to a blob (Line 47) so that we can send it through the neural net. Be sure to refer to my blog post on How OpenCV’s blobFromImage works (http://pyimg.co/c4gws) [45].
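
If you prefer to see the preprocessing spelled out, the following sketch approximates what that single-image blobFromImage call does with the scale and mean values we loaded from the config (a rough NumPy equivalent of the documented behavior, not OpenCV’s actual implementation):

import cv2
import numpy as np

def manual_blob(frame, scale, size, mean):
    # resize, subtract the mean, then apply the scale factor
    resized = cv2.resize(frame, size).astype("float32")
    resized -= np.array(mean, dtype="float32")
    resized *= scale

    # reorder from HxWxC to the 4-dimensional NCHW layout the dnn
    # module expects (batch size of 1)
    return np.expand_dims(resized.transpose(2, 0, 1), axis=0)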

Classification inference takes place on Lines 50 and 51. Again, timestamps are collected
and the elapsed time is computed and printed to the terminal for benchmarking purposes.

From here we process the results — Lines 56-87 are identical to our previous Movidius script (Lines 68-99).

12.2.5 Image Classification with Movidius NCS Results

Now that we have implemented classification scripts for both the (1) Movidius NCS, and (2)
CPU, let’s put them to the test and compare results.

If you are using the .img that accompanies the book, you will need to use the openvino
virtual environment.

I recommend initiating a VNC or SSH (with X forwarding) session for running this example.

Remark. Remote development on the Raspberry Pi was covered in the Hobbyist Bundle of
this text, but if you need a refresher, refer to this tutorial: http://pyimg.co/tunq7 [71].

When you’re ready, open a terminal on your Raspberry Pi and run the following script to start the OpenVINO environment:

$ source start_openvino.sh
Starting Python 3.7 with OpenCV-OpenVINO 4.1.1 bindings...
[setupvars.sh] OpenVINO environment initialized

Remark. When using the openvino virtual environment, avoid relying solely on the workon openvino command. If you source start_openvino.sh as shown, another Intel-provided script is also sourced in the process to set up key environment variables. I recommend that you inspect the start script on the .img via the following command: cat ~/start_openvino.sh.

Figure 12.1: Comparing image classification on both the Raspberry Pi CPU and Movidius NCS
using the SqueezeNet CNN pretrained on ImageNet.

From there, let’s fire up the CPU-based classification script using (1) SqueezeNet, and (2)
an image of a beer glass:

$ python classify_cpu.py --conf config/config.json --model squeezenet \
--image images/beer.png
[INFO] loading model...
[INFO] classifying image...
[INFO] classification took 0.1049 seconds...
[INFO] 1. label: beer_glass, probability: 97.69%
[INFO] 2. label: eggnog, probability: 1.72%
[INFO] 3. label: goblet, probability: 0.09%
[INFO] 4. label: vase, probability: 0.09%
[INFO] 5. label: beer_bottle, probability: 0.05%

Let’s compare the results to our Movidius based script:

$ python classify_movidius.py --conf config/config.json --model squeezenet \
--image images/beer.png
[INFO] loading models...
[INFO] preparing inputs...
[INFO] Loading model to the plugin...
[INFO] classifying image...
[INFO] classification took 0.0125 seconds...
[INFO] 1. label: beer_glass, probability: 97.41%
[INFO] 2. label: eggnog, probability: 1.90%
[INFO] 3. label: goblet, probability: 0.11%
[INFO] 4. label: vase, probability: 0.10%
[INFO] 5. label: beer_bottle, probability: 0.05%

As you can see in the terminal output and in Figure 12.1, CPU inference took 0.1049 sec-
onds whereas Movidius inference required only 0.0125 seconds, a speedup of 8.39X.

Figure 12.2: Comparing image classification on both the Raspberry Pi CPU and Movidius NCS
using the GoogLeNet CNN pretrained on ImageNet.

Now let’s run (1) GoogLeNet with (2) an image of a brown bear on both the CPU and
Movidius:

$ python classify_cpu.py --conf config/config.json --model googlenet \
--image images/brown_bear.png
[INFO] loading model...
[INFO] classifying image...
[INFO] classification took 3.1376 seconds...
[INFO] 1. label: brown_bear, probability: 90.43%
[INFO] 2. label: American_black_bear, probability: 0.41%
[INFO] 3. label: sloth_bear, probability: 0.18%
[INFO] 4. label: ice_bear, probability: 0.16%
[INFO] 5. label: lesser_panda, probability: 0.07%

$ python classify_movidius.py --conf config/config.json --model googlenet \
--image images/brown_bear.png
[INFO] loading models...
[INFO] preparing inputs...
[INFO] Loading model to the plugin...
[INFO] classifying image...
[INFO] classification took 0.1624 seconds...
[INFO] 1. label: brown_bear, probability: 87.99%
[INFO] 2. label: American_black_bear, probability: 0.25%
[INFO] 3. label: sloth_bear, probability: 0.15%
[INFO] 4. label: ice_bear, probability: 0.14%
[INFO] 5. label: lesser_panda, probability: 0.07%

Figure 12.3: Comparing image classification on both the Raspberry Pi CPU and Movidius NCS
using SqueezeNet and GoogLeNet models pretrained on ImageNet.

As you can see in the terminal output and in Figure 12.2, CPU inference took 3.1376 sec-
onds whereas Movidius inference required only 0.1624 seconds, a speedup of 19.32X.

The results in Figures 12.1 and 12.2 were collected with a Raspberry Pi 4B using a Movidius
NCS2 and OpenVINO version 4.1.1. A summary of the results is shown in the table in Figure
12.3. As the results show, the Movidius NCS2 paired with a Raspberry Pi 4B is 8x faster for
SqueezeNet classification, and a whopping 19x faster for GoogLeNet classification. Results
will not be as good using a Raspberry Pi 3B+ which does not have USB 3.0.

I would highly recommend giving this Intel OpenVINO document (http://pyimg.co/hyleq, [70])
a read regarding performance comparisons if you are conducting your own benchmarking.

12.3 Summary

In this chapter, we learned how to perform image classification with the Movidius NCS using
OpenVINO’s implementation of OpenCV.

As we discovered, the CPU and Movidius classification scripts are nearly identical minus
the initial setup.

The Movidius NCS2 on the Raspberry Pi 4 is a whopping 19x faster using GoogLeNet than
using only the Raspberry Pi CPU. This makes the Movidius and OpenVINO a great companion
for your Raspberry Pi deep learning projects.

In the next chapter, we’ll perform object detection with the Movidius NCS.

Chapter 13

Object Detection with the Movidius NCS

In Chapter 19 of the Hobbyist Bundle we discovered how to develop a people/footfall counter optimized for the resource-constrained Raspberry Pi.

In this chapter we’ll build upon the project.

Rather than using a background subtraction method, which leads to inaccurate counts when people are close together, we’ll use object detection.

When this code was originally written, the Raspberry Pi 3B+ was the best RPi hardware
available, so we needed to add additional horsepower. The horsepower comes in the form of
the Intel Movidius NCS with the OpenVINO software.

The example we will cover in this chapter will be similar to my original “OpenCV People
Counter” tutorial on PyImageSearch from August 2018 (http://pyimg.co/vgak6) [72], with the
main exception being that we will dispatch the Movidius NCS for the heavy lifting. Our ex-
ample also uses our new DirectionCounter class which was not included in the blog post
example.

Let’s dig in!

13.1 Chapter Learning Objectives

In this chapter we will learn how to:

i. Perform object detection with OpenVINO/Movidius.

ii. Use the MobileNet SSD with either a (1) CPU, or (2) Myriad/Movidius processor.

iii. Count objects (people) using object detection, correlation tracking, and centroid tracking.


13.2 Object Detection with the Movidius NCS

In this chapter, we begin by reviewing our project structure. From there, we’ll briefly review
object counting which was covered in the Hobbyist Bundle.

We’ll then dive right into object counting with OpenVINO/Movidius. In our results section,
we will benchmark the CPU vs. the Movidius NCS. Now that the RPi 4B is available with USB
3.0 support, the results are quite impressive.

13.2.1 Project Structure

|-- config
| |-- config.json
|-- mobilenet_ssd
| |-- MobileNetSSD_deploy.caffemodel
| |-- MobileNetSSD_deploy.prototxt
|-- output
| |-- output_01.avi
| |-- output_02.avi
|-- pyimagesearch
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
| |-- centroidtracker.py
| |-- directioncounter.py
| |-- trackableobject.py
|-- videos
| |-- example_01.mp4
| |-- example_02.mp4
|-- people_counter_openvino.py

Input videos for testing are included in the videos/ directory. Both example_01.mp4 and
example_02.mp4 are provided by David McDuffee.

Our output/ folder will be where we’ll store processed videos. One example output video
is included.

The mobilenet_ssd/ directory contains our pretrained Caffe-based object detector files.

The pyimagesearch module contains our Conf class for parsing JSON configs. Addi-
tionally three classes related to counting objects are included: (1) TrackableObject, (2)
CentroidTracker, and (3) DirectionCounter. We will briefly review the purpose of these
three classes later in this chapter. A full line-by-line review can be found in Chapter 19 of the
Hobbyist Bundle.

Our driver script for object detection and people counting is contained within people_counter_openvino.py. This script takes advantage of all the aforementioned classes as well as a MobileNet SSD in order to count people using either the Movidius coprocessor or the Raspberry Pi CPU. The CPU option is only recommended if you are using a Raspberry Pi 4 — it is mainly for benchmarking purposes.

13.2.2 A Brief Review of Object Counting

Figure 13.1: An example of our people/footfall counter in action. Our algorithm detects people in
a video stream, determines the direction they are going (up/down or left/right), and then counts
them appropriately once they cross the center line.

Our Movidius NCS, object detection, and people counting implementation uses three classes developed in Chapter 19 of the Hobbyist Bundle.

The CentroidTracker class manages an ordered dictionary of objects, assigning each one an objectID. The objects dictionary contains a list of current/historical (x, y)-locations (centroids). The three methods in this class, register, deregister, and update, work together to associate/dissociate objects (people) that come into view of the camera, all based on a list of bounding boxes provided by our MobileNet SSD object detector.

The TrackableObject class only holds/stores data and does not have any methods.
Every person in our frame is a trackable object and has an objectID, list of centroids, and
a boolean indicating if it is counted or not.

Our DirectionCounter class is responsible for counting either the totalUp/totalDown or totalLeft/totalRight people counts. When a person walks through the camera’s field of view, we use the find_direction method to determine whether they are moving up/down or left/right. From there, the count_object method checks to see if the person has crossed the “counting line” and increments the respective counter.

If you need a full review of the three classes, be sure to refer to Chapter 19 of the Hobbyist
Bundle now before moving on.
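
Before moving on, here is a minimal sketch of the simplest of the three classes, TrackableObject, consistent with how it is used later in this chapter (the full implementation lives in pyimagesearch/trackableobject.py and is reviewed in the Hobbyist Bundle):

class TrackableObject:
    def __init__(self, objectID, centroid):
        # unique ID assigned to the object by the centroid tracker
        self.objectID = objectID

        # history of (x, y)-centroids for this object
        self.centroids = [centroid]

        # flag indicating whether the object has already been counted
        self.counted = False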

13.2.3 Object Counting with OpenVINO

Now let’s put the Movidius Neural Compute Stick to work using OpenVINO.

To demonstrate the power of OpenVINO on the Raspberry Pi with Movidius, we’re going
to perform real-time deep learning object detection along with people counting. The Mo-
vidius/Myriad coprocessor will perform the actual deep learning inference, reducing the load
on the Pi’s CPU. We’ll use the Raspberry Pi CPU to process the results and tell the Movidius
what to do, but we’re reserving object detection for the Myriad as its hardware is optimized and
designed for deep learning inference.

This script is especially convenient because a single function call determines whether the Movidius NCS Myriad or the CPU is used for inference. The .setPreferableTarget call tells OpenVINO which processor to use for deep learning inference, and it demonstrates the power of OpenVINO: it is quite simple to have either the CPU or the coprocessor handle deep learning inference. Keep in mind, however, that not all OpenVINO scripts are this convenient as of the time of this writing.
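
Stripped down to its essence, the switch looks something like the following sketch (the file paths are illustrative; the full driver script below wires the same logic up to the --target command line argument):

import cv2

# load the pretrained Caffe-based MobileNet SSD
net = cv2.dnn.readNetFromCaffe("mobilenet_ssd/MobileNetSSD_deploy.prototxt",
    "mobilenet_ssd/MobileNetSSD_deploy.caffemodel")

useMyriad = True  # flip to False to run inference on the Pi's CPU

if useMyriad:
    # offload deep learning inference to the Movidius NCS
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)
else:
    # keep inference on the CPU with the stock OpenCV backend
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)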

In the results section, we will benchmark both CPU and Movidius people counting on a
Raspberry Pi 4B.

Go ahead and open people_counter_openvino.py and insert the following lines:

1 # import the necessary packages
2 from pyimagesearch.directioncounter import DirectionCounter
3 from pyimagesearch.centroidtracker import CentroidTracker
4 from pyimagesearch.trackableobject import TrackableObject
5 from pyimagesearch.utils import Conf
6 from multiprocessing import Process
7 from multiprocessing import Queue
8 from multiprocessing import Value
9 from imutils.video import VideoStream
10 from imutils.video import FPS
11 import numpy as np
12 import argparse
13 import imutils
14 import time
15 import dlib
16 import cv2

Lines 2-16 import packages and modules. As you can see, we begin by importing our cus-
tom classes including DirectionCounter, CentroidTracker, TrackableObject, and
Conf.

We’ll be using multiprocessing for writing to output video files so as not to slow down the main functionality of the script. We’ll use a process-safe Queue of frames and a Value variable to indicate whether we should be writing video.

We will also use the dlib correlation tracker (a new addition compared to the implementa-
tion in Chapter 19 of the Hobbyist Bundle).

Let’s review our process for writing to output video files:

18 def write_video(outputPath, writeVideo, frameQueue, W, H):
19 # initialize the FourCC and video writer object
20 fourcc = cv2.VideoWriter_fourcc(*"MJPG")
21 writer = cv2.VideoWriter(outputPath, fourcc, 30, (W, H), True)
22
23 # loop while the write flag is set or the output frame queue is
24 # not empty
25 while writeVideo.value or not frameQueue.empty():
26 # check if the output frame queue is not empty
27 if not frameQueue.empty():
28 # get the frame from the queue and write the frame
29 frame = frameQueue.get()
30 writer.write(frame)
31
32 # release the video writer object
33 writer.release()

The write_video function will run in an independent process so that our main thread
of execution isn’t bogged down with the time-consuming blocking operation of writing video
frames to disk.

The main process will simply append frames to the FIFO frameQueue and the write_video process will handle writing the video to disk more efficiently.

The function accepts five parameters: (1) outputPath, the filepath to the output video
file, (2) writeVideo, a flag indicating if video writing should be ongoing, (3) frameQueue,
a process-safe Queue holding our frames to be written to disk in the video file, and (4/5) the
video file dimensions.

Lines 20 and 21 initialize the video writer.

From there, a loop starts on Line 25 — it will continue to write to the video until writeVideo
is False. The frames are written as they become available in the frameQueue. When the
video is finished, the output file pointer is released (Line 33).

With our video writing process out of the way, let’s define our command line arguments:

35 # construct the argument parser and parse the arguments
36 ap = argparse.ArgumentParser()
37 ap.add_argument("-t", "--target", type=str, required=True,
38 choices=["myriad", "cpu"],
39 help="target processor for object detection")
40 ap.add_argument("-m", "--mode", type=str, required=True,
41 choices=["horizontal", "vertical"],
42 help="direction in which people will be moving")
43 ap.add_argument("-c", "--conf", required=True,
44 help="Path to the input configuration file")
45 ap.add_argument("-i", "--input", type=str,
46 help="path to optional input video file")
47 ap.add_argument("-o", "--output", type=str,
48 help="path to optional output video file")
49 args = vars(ap.parse_args())

Our driver script accepts five command line arguments:

• --target: The target processor for object detection, either myriad or cpu.

• --mode: The direction (either horizontal or vertical) in which people will be moving through the frame.

• --conf: The path to the JSON configuration file.

• --input: The path to an optional input video file.

• --output: The path to an optional output video file. When an output video filepath is
provided, the write_video process will come to life.

From there, we’ll parse our config and list our MobileNet SSD CLASSES:

51 # load the configuration file
52 conf = Conf(args["conf"])
53
54 # initialize the list of class labels MobileNet SSD detects
55 CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
56 "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
57 "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
58 "sofa", "train", "tvmonitor"]

We are only concerned with people counting, so later in our frame processing loop, we will
filter for the "person" class from the CLASSES list (Lines 55-58).

Let’s go ahead and load our serialized object detection model from disk and set the prefer-
able processor:

60 # load our serialized model from disk
61 print("[INFO] loading model...")
62 net = cv2.dnn.readNetFromCaffe(conf["prototxt_path"],
63 conf["model_path"])
64
65 # check if the target processor is myriad, if so, then set the
66 # preferable target to myriad
67 if args["target"] == "myriad":
68 net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)
69
70 # otherwise, the target processor is CPU
71 else:
72 # set the preferable target processor to CPU and preferable
73 # backend to OpenCV
74 net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
75 net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)

Lines 62 and 63 load our Caffe model from the .prototxt and .caffemodel files (the paths are specified in config.json).

The power of OpenVINO lies in the ability to set the target processor for inference. Lines
67-75 set the OpenVINO target processor based on the --target processor provided via
command line argument.

Next we will initialize our video stream:

77 # if a video path was not supplied, grab a reference to the webcam
78 if not args.get("input", False):
79 print("[INFO] starting video stream...")
80 # vs = VideoStream(src=0).start()
81 vs = VideoStream(usePiCamera=True).start()
82 time.sleep(2.0)
83
84 # otherwise, grab a reference to the video file
85 else:
86 print("[INFO] opening video file...")
87 vs = cv2.VideoCapture(args["input"])

And then perform the remaining initializations:

89 # initialize the video writer process (we'll instantiate later if
90 # need be) along with the frame dimensions
91 writerProcess = None
92 W = None
93 H = None
94
95 # instantiate our centroid tracker, then initialize a list to store
96 # each of our dlib correlation trackers, followed by a dictionary to
97 # map each unique object ID to a trackable object
98 ct = CentroidTracker(maxDisappeared=20, maxDistance=30)
99 trackers = []
100 trackableObjects = {}
101
102 # initialize the direction info variable (used to store information
103 # such as up/down or left/right people count) and a variable to store
104 # the total number of frames processed thus far
105 directionInfo = None
106 totalFrames = 0
107
108 # start the frames per second throughput estimator
109 fps = FPS().start()

Lines 91-93 initialize our writerProcess and output video frame dimensions.

Next, we initialize our CentroidTracker (Line 98), trackers to hold our dlib correlation
trackers (Line 99), and our trackableObjects ordered dictionary (Line 100).

The directionInfo is initialized (Line 105) and will later hold a dictionary of our object
counts to be annotated on the screen.

Object detection is a resource-hungry task. Therefore, we will only perform object detection
every N skip-frames. The totalFrames variable is initialized to 0 for now and it will increment
upon each iteration of our while loop. When totalFrames % conf["skip_frames"]
== 0, we will perform object detection so that we have an accurate position of the people
walking through the frame.
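
The cadence is easiest to see with a tiny standalone example (not project code), assuming a skip_frames value of 30 in config.json:

skip_frames = 30

for totalFrames in range(91):
    if totalFrames % skip_frames == 0:
        # frames 0, 30, 60, and 90: run the (expensive) object detector
        print("frame {}: detect".format(totalFrames))
    # on every other frame we only update the cheaper correlation trackers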

From here we’ll begin our frame processing loop:

111 # loop over frames from the video stream
112 while True:
113 # grab the next frame and handle if we are reading from either
114 # VideoCapture or VideoStream
115 frame = vs.read()
116 frame = frame[1] if args.get("input", False) else frame
117
118 # if we are viewing a video and we did not grab a frame then we
119 # have reached the end of the video
120 if args["input"] is not None and frame is None:
121 break
122
123 # convert the frame from BGR to RGB for dlib
124 rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

We begin looping on Line 112. At the top of the loop we grab the next frame (Lines 115 and 116). In the event that we’ve reached the end of the video, we’ll break out of the loop (Lines 120 and 121).

On Line 124 we swap color channels as dlib requires an RGB-ordered image.

Next, we’ll initialize our DirectionCounter and writerProcess (if necessary):

126 # check to see if the frame dimensions are not set
127 if W is None or H is None:
128 # set the frame dimensions and instantiate our direction
129 # counter
130 (H, W) = frame.shape[:2]
131 dc = DirectionCounter(args["mode"], H, W)
132
133 # begin writing the video to disk if required
134 if args["output"] is not None and writerProcess is None:
135 # set the value of the write flag (used to communicate when
136 # to stop the process)
137 writeVideo = Value('i', 1)
138
139 # initialize a frame queue and start the video writer
140 frameQueue = Queue()
141 writerProcess = Process(target=write_video, args=(
142 args["output"], writeVideo, frameQueue, W, H))
143 writerProcess.start()

On the first iteration of our loop, our frame dimensions will still be None. Lines 127-131
set the frame dimensions and initialize our DirectionCounter as dc while providing the
direction mode (vertical/horizontal).

If we will be writing a processed output video to disk, Lines 134-143 initialize the frameQueue
and start the writerProcess.

Let’s now detect people using the SSD:

145 # initialize the current status along with our list of bounding
146 # box rectangles returned by either (1) our object detector or
147 # (2) the correlation trackers
148 status = "Waiting"
149 rects = []
150
151 # check to see if we should run a more computationally expensive
152 # object detection method to aid our tracker
153 if totalFrames % conf["skip_frames"] == 0:
154 # set the status and initialize our new set of object
155 # trackers
156 status = "Detecting"
157 trackers = []
158
159 # convert the frame to a blob and pass the blob through the
160 # network and obtain the detections
161 blob = cv2.dnn.blobFromImage(frame, size=(300, 300),
162 ddepth=cv2.CV_8U)
163 net.setInput(blob, scalefactor=1.0/127.5, mean=[127.5,
164 127.5, 127.5])
165 detections = net.forward()

We initialize a status as "Waiting" on Line 148. Possible states include:

• Waiting: In this state, we’re waiting on people to be detected and tracked.

• Detecting: We’re actively in the process of detecting people using the MobileNet SSD.

• Tracking: People are being tracked in the frame and we’re counting the totalUp and
totalDown.

Our rects list will be populated either via detection or tracking. We go ahead and initialize
rects on Line 149.

Deep learning object detectors are very computationally expensive, especially if you are
running them on your CPU (and even if you use your Movidius NCS).

To avoid running our object detector on every frame, and to speed up our tracking pipeline, we’ll only run detection every N frames (N is set by the "skip_frames" value in our JSON configuration file).

Only every N frames will we exercise our SSD for object detection. Otherwise, we’ll simply
be tracking moving objects in-between. Using the modulo operator on Line 153 we ensure that
we’ll only execute the code in the if-statement every N frames.

Assuming we’ve landed on a multiple of "skip_frames", we’ll update the status to "Detecting" (Line 156).

We then initialize our new list of dlib correlation trackers (Line 157).

Next, we’ll perform inference via object detection. We begin by creating a blob from the
frame, followed by passing the blob through the net to obtain detections (Lines 161-165).

OpenVINO will seamlessly use either the (1) Myriad in the Movidius NCS, or (2) CPU,
depending on the preferred target processor we set via command line arguments.

Now we’ll loop over each of the detections in hopes of finding objects belonging to the
“person” class:

167 # loop over the detections
168 for i in np.arange(0, detections.shape[2]):
169 # extract the confidence (i.e., probability) associated
170 # with the prediction
171 confidence = detections[0, 0, i, 2]
172
173 # filter out weak detections by requiring a minimum
174 # confidence
175 if confidence > conf["confidence"]:
176 # extract the index of the class label from the
177 # detections list
178 idx = int(detections[0, 0, i, 1])
179
180 # if the class label is not a person, ignore it
181 if CLASSES[idx] != "person":
182 continue

Looping over detections on Line 168, we proceed to grab the confidence (Line 171)
and filter out weak results and those that don’t belong to the "person" class (Lines 175-182).

We can now compute a bounding box for each person and begin correlation tracking:

184 # compute the (x, y)-coordinates of the bounding box
185 # for the object
186 box = detections[0, 0, i, 3:7] * np.array(
187 [W, H, W, H])
188 (startX, startY, endX, endY) = box.astype("int")
189
190 # construct a dlib rectangle object from the bounding
191 # box coordinates and then start the dlib correlation
192 # tracker
193 tracker = dlib.correlation_tracker()
194 rect = dlib.rectangle(startX, startY, endX, endY)
195 tracker.start_track(rgb, rect)
196
197 # add the tracker to our list of trackers so we can
198 # utilize it during skip frames
199 trackers.append(tracker)

Computing our bounding box takes place on Lines 186-188.

We then instantiate this particular person’s dlib.correlation_tracker as tracker on Line 193, followed by passing the object’s bounding box coordinates to dlib.rectangle, storing the result as rect (Line 194).

Subsequently, we start tracking on Line 195 and add the tracker to the trackers list on
Line 199.

Again, we will have one dlib correlation tracker per person in the frame. Correlation tracking is computationally expensive, but since we’ve offloaded the deep learning object detection aspect of our system to the Movidius, the RPi CPU is able to handle multiple correlation trackers within reason, even on an RPi 3B+.

Obviously the more people that are present in the frame, the lower our FPS will be — just
keep that in mind if you are counting people over a large area.

That’s a wrap for all operations we do every N skip-frames (Lines 153-199)!

Let’s take care of the typical operations where tracking (not object detection) is taking place
in the else block:

201 # otherwise, we should utilize our object *trackers* rather than
202 # object *detectors* to obtain a higher frame processing
203 # throughput
204 else:
205 # loop over the trackers
206 for tracker in trackers:
207 # set the status of our system to be 'tracking' rather
208 # than 'waiting' or 'detecting'
209 status = "Tracking"
210
211 # update the tracker and grab the updated position
212 tracker.update(rgb)
213 pos = tracker.get_position()
214
215 # unpack the position object
216 startX = int(pos.left())
217 startY = int(pos.top())
218 endX = int(pos.right())
219 endY = int(pos.bottom())
220
221 # add the bounding box coordinates to the rectangles list
222 rects.append((startX, startY, endX, endY))

Most of the time, we aren’t landing on a skip-frame multiple. During these iterations of the
loop we’ll utilize our trackers to track our object rather than applying detection.

We begin looping over the available trackers on Line 206. Inside the loop, we proceed to update the status to "Tracking" (Line 209) and grab the updated object position (Lines 212 and 213). From there we extract the position coordinates (Lines 216-219), followed by populating our rects list (Line 222).

Now let’s draw a line that people must cross in order to be tracked:

224 # check if the direction is *vertical*
225 if args["mode"] == "vertical":
226 # draw a horizontal line in the center of the frame -- once an
227 # object crosses this line we will determine whether they were
228 # moving 'up' or 'down'
229 cv2.line(frame, (0, H // 2), (W, H // 2), (0, 255, 255), 2)
230
231 # otherwise, the direction is *horizontal*
232 else:
233 # draw a vertical line in the center of the frame -- once an
234 # object crosses this line we will determine whether they were
235 # moving 'left' or 'right'
236 cv2.line(frame, (W // 2, 0), (W // 2, H), (0, 255, 255), 2)

Depending on the --mode of operation, either "vertical" or "horizontal", we’ll draw a horizontal or vertical line, respectively, for the people to cross (Lines 225-236).

Next, we’ll update the centroid tracker with our fresh object rects:

238 # use the centroid tracker to associate the (1) old object
239 # centroids with (2) the newly computed object centroids
240 objects = ct.update(rects)
241
242 # loop over the tracked objects
243 for (objectID, centroid) in objects.items():
244 # grab the trackable object via its object ID
245 to = trackableObjects.get(objectID, None)
246
247 # create a new trackable object if needed
248 if to is None:
249 to = TrackableObject(objectID, centroid)
250
251 # otherwise, there is a trackable object so we can utilize it
252 # to determine direction
253 else:
254 # find the direction and update the list of centroids
255 dc.find_direction(to, centroid)
256 to.centroids.append(centroid)
257
258 # check to see if the object has been counted or not
259 if not to.counted:
260 # find the direction of motion of the people
261 directionInfo = dc.count_object(to, centroid)
262
263 # store the trackable object in our dictionary
264 trackableObjects[objectID] = to

Our centroid tracker will associate object IDs with object locations.

Line 243 begins a loop over our objects to (1) track, (2) determine direction, and (3) count
them if they cross the counting line.

On Line 245 we attempt to fetch a TrackableObject for the current objectID. If the
trackable object doesn’t exist for this particular ID, we create one (Lines 248 and 249).

Otherwise, there is already an existing TrackableObject, so we need to figure out if the object (person) is moving up/down or left/right. To do so, we make a call to find_direction and update the list of centroids (Lines 255 and 256).

If the person has not been counted, we go ahead and count them (Lines 259-261). Behind
the scenes in the DirectionCounter class, the object will not be counted if it has yet to
cross the counting line.

Finally, we store the trackable object in our trackableObjects dictionary (Line 264) so
we can grab and update it when the next frame is captured.

We’re on the home-stretch!

The next three code blocks handle:

i. Display (drawing and writing text to the frame).

ii. Writing frames to a video file on disk (if the --output command line argument is present).

iii. Capturing keypresses.

iv. Cleanup.

First we’ll draw some information on the frame for visualization:

266 # draw both the ID of the object and the centroid of the
267 # object on the output frame
268 text = "ID {}".format(objectID)
269 color = (0, 255, 0) if to.counted else (0, 0, 255)
270 cv2.putText(frame, text, (centroid[0] - 10, centroid[1] - 10),
271 cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
272 cv2.circle(frame, (centroid[0], centroid[1]), 4, color, -1)
273
274 # check if there is any direction info available
275 if directionInfo is not None:
276 # construct a list of information as a combination of
277 # direction info and status info
278 info = directionInfo + [("Status", status)]
279
280 # otherwise, there is no direction info available yet
281 else:
282 # construct a list of information as status info since we
283 # don't have any direction info available yet
284 info = [("Status", status)]
285
286 # loop over the info tuples and draw them on our frame
287 for (i, (k, v)) in enumerate(info):
288 text = "{}: {}".format(k, v)
289 cv2.putText(frame, text, (10, H - ((i * 20) + 20)),
290 cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)

The person is annotated with a dot and an ID number where red means not counted and
green means counted (Lines 268-272).

We build our text info via Lines 275-284. It contains the (1) object counts, and (2) status.

Lines 287-290 annotate the corner of the frame with the text-based info.

Let’s wrap up the while loop:

292 # put frame into the shared queue for video writing
293 if writerProcess is not None:
294 frameQueue.put(frame)
295
296 # show the output frame
297 cv2.imshow("Frame", frame)
298 key = cv2.waitKey(1) & 0xFF
299
300 # if the `q` key was pressed, break from the loop
301 if key == ord("q"):
302 break
303
304 # increment the total number of frames processed thus far and
305 # then update the FPS counter
306 totalFrames += 1
307 fps.update()

Lines 293 and 294 put a frame in the frameQueue for the writerProcess to consume.

Lines 297-302 display the frame to the screen and capture keypresses (the q key quits
the frame processing loop).

Our totalFrames counter is incremented and our fps counter is updated (Lines 306 and
307). The totalFrames are used in our modulo operation to check to see if we should skip
object detection (i.e. "skip-frames").

Finally, we’ll print FPS statistics and perform cleanup:

309 # stop the timer and display FPS information
310 fps.stop()
311 print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
312 print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
313
314 # terminate the video writer process
315 if writerProcess is not None:
316 writeVideo.value = 0
317 writerProcess.join()
318
319 # if we are not using a video file, stop the camera video stream
320 if not args.get("input", False):
321 vs.stop()
322
323 # otherwise, release the video file pointer
324 else:
325 vs.release()
326
327 # close any open windows
328 cv2.destroyAllWindows()

Lines 315-317 stop the video writerProcess — our output video will be ready for us in
the output/ folder.

13.2.4 Movidius Object Detection and Footfall Counting Results

Great job implementing your first OpenVINO/Movidius object detection script.

Let’s proceed to count people via object detection, correlation tracking, and centroid tracking
using both the (1) CPU, and (2) Myriad/Movidius processors.

If you are using the .img provided with this book, fire up the openvino virtual environment:

$ source ~/start_openvino.sh
Starting Python 3.7 with OpenCV-OpenVINO 4.1.1 bindings...
[setupvars.sh] OpenVINO environment initialized

Remark. When using the openvino virtual environment, it is recommended to avoid relying on the workon openvino command alone. If you source start_openvino.sh as shown, another Intel-provided script is also sourced in the process to set up key environment variables. I recommend that you inspect the start script on the .img via the following command: cat ~/start_openvino.sh.

From there, be sure to source the setup file in the project folder:

$ source setup.sh

Go ahead and execute the following command to use the CPU:

$ python people_counter_openvino.py --target cpu --mode vertical \
--conf config/config.json --input videos/example_01.mp4 \
--output output/output_01.avi
[INFO] loading model...
[INFO] opening video file...
[INFO] elapsed time: 48.81
[INFO] approx. FPS: 26.28

Figure 13.2: People counting via object detection with the Raspberry Pi 4B 4GB CPU.

As you can see, the RPi 4B 4GB achieved 26 FPS using only the CPU for object detection.

Let’s see how much of an improvement the Intel Movidius Neural Compute Stick coupled
with OpenVINO yields using the Myriad processor. Ensure that your Movidius NCS2 is plugged
into your RPi 4B 4GB’s USB 3.0 port and update the --target:

$ python people_counter_openvino.py --target myriad --mode vertical \
--conf config/config.json --input videos/example_01.mp4 \
--output output/output_01.avi
[INFO] loading model...
[INFO] opening video file...
[INFO] elapsed time: 37.98
[INFO] approx. FPS: 33.78

Using the Movidius NCS coprocessor, we achieve roughly 33 FPS, an increase of about 7.5 FPS over the CPU-only run (put differently, the CPU-only pipeline is about 22% slower). Results may vary when using a webcam (these benchmarks were collected using a prerecorded video file).

Be sure to review Chapters 19 and 20 of the Hobbyist Bundle for more details on algorithm optimizations. In particular, be sure to review Section 20.4, "Leading up to a successful project requires multiple revisions".

Figure 13.3: People counting via object detection with the Raspberry Pi 4B 4GB and Intel Movidius NCS2 plugged into the USB 3.0 port.

13.3 Pre-trained Models and Custom Training with the Movidius NCS

In this chapter, we used a pre-trained Caffe-based MobileNet SSD for object detection.

Other pre-trained models compatible with OpenVINO and the Movidius NCS are available
at the following resources:

• OpenCV Model Zoo: http://pyimg.co/dc8ck

• OpenVINO Model Zoo: http://pyimg.co/78cqs

• Model Documentation: http://pyimg.co/m0n0w

To learn how to train your own models for the Movidius NCS, be sure to refer to the Complete
Bundle volume of this book.

13.4 Summary

In this chapter, we learned how to build a people counter using object detection with the Intel
Movidius NCS.

Our implementation:

• Utilizes deep learning object detectors for improved person detection accuracy.

• Leverages two separate object tracking algorithms, including both centroid tracking and
correlation filters for improved tracking accuracy.

• Applies both a “detection” and “tracking” phase, making it capable of (1) detecting new
people, and (2) finding people that may have been “lost” during the tracking phase.

• Is capable of running in real time at roughly 33 FPS using the Movidius NCS, but only about 26 FPS using the CPU alone.

As you can see from the images, this type of system would be especially useful to a store
owner to track the number of people that go in and out of the store at various times of day.
Using the data, the store owner could determine how much staff is needed to tend to the
customers, potentially saving the owner money in the long run.

If you enjoyed this object detection chapter using the Movidius NCS, you’re going to love
the Google Coral TPU alternative presented in the Complete Bundle of this book. Once you
become familiar with the Coral, a great homework assignment would be to implement people
counting as we did in this chapter, but replace the Intel Movidius coprocessor with the Google
Coral TPU coprocessor and benchmark your results.

Chapter 14

Fast, Efficient Face Recognition with the Movidius NCS

When we built our face recognition door monitor in Chapter 5, you may have noticed that you could enter your doorway quickly enough to avoid being recognized by the door monitor.

Is there a problem with the face detection or face recognition models themselves?

No, absolutely not.

The problem is that your Raspberry Pi CPU simply can’t process the frames quickly enough.
You need more computational horsepower.

As the title to this chapter suggests, we’re going to pair our RPi with the Movidius Neural
Compute Stick coprocessor. The NCS Myriad processor will handle both face detection and
extracting face embeddings. The RPi CPU processor will handle the final machine learning
classification using the results from the face embeddings.

The process of offloading deep learning tasks to the Movidius NCS frees up the Raspberry
Pi CPU to handle the non-deep learning tasks. Each processor is then doing what it is designed
for. We are certainly pushing our Raspberry Pi to the limit, but we don’t have much choice short
of using a completely different single board computer such as an NVIDIA Jetson Nano.

By the end of this chapter you’ll have a fully functioning face recognition script running at 6.29 FPS on the RPi and Movidius NCS, a 243% speedup compared to using just the RPi alone!

Remark. This chapter includes a selection of reposted content from the following two blog posts: Face recognition with OpenCV, Python, and deep learning (http://pyimg.co/oh21b [73]) and OpenCV Face Recognition (http://pyimg.co/i39fy [74]). The content in this chapter, however, is optimized for the Movidius NCS.


14.1 Chapter Learning Objectives

In this chapter, we will build upon Chapter 5 but with a simpler example which demonstrates:

i. How to work with the Movidius NCS for face recognition.

ii. Drawbacks and limitations of face recognition.

iii. How to obtain higher face recognition accuracy.

14.2 Fast, Efficient Face Recognition with the Movidius NCS

Prior to reading this chapter, be sure to read/review Chapter 5 in which face recognition was
first presented in this book. Specifically, you should review Section 5.4, “Deep learning for face
recognition” to ensure you understand modern face recognition concepts.

You can reuse your face dataset that you may have developed for that chapter; alternatively
take the time now to develop a face dataset for this chapter (Section 5.4.2).

Upon your understanding of this chapter, you will be able to revisit Chapter 5 and integrate
the Movidius into the door monitor project both (1) with the Movidius NCS for faster speeds,
and (2) with higher accuracy.

In the remainder of this chapter, we’ll begin by reviewing the process of extracting embed-
dings for/with the NCS. From there, we’ll train a model upon the embeddings.

Finally we’ll develop a quick demo script to ensure that our faces are being recognized
properly.

14.2.1 Project Structure

Our project is organized in the following manner:

|-- face_detection_model
| |-- deploy.prototxt
| |-- res10_300x300_ssd_iter_140000.caffemodel
|-- face_embedding_model
| |-- openface_nn4.small2.v1.t7
|-- output
| |-- embeddings.pickle
| |-- le.pickle
| |-- recognizer.pickle
|-- setupvars.sh
|-- extract_embeddings.py
|-- train_model.py
|-- recognize_video.py

Our face detector will localize a face in the image to be recognized. The pre-trained Caffe
face detector files (provided by OpenCV) are included inside the face_detection_model/
directory. Be sure to refer to this deep learning face detection blog post to learn more about
the detector and how it can be put to use: http://pyimg.co/l9v8e [75].

We will extract face embeddings with a pre-trained OpenFace PyTorch model included in the
face_embedding_model/ directory. The openface_nn4.small2.v1.t7 file was trained
by the team at Carnegie Mellon University as part of the OpenFace project [76].

When we execute extract_embeddings.py, the embeddings.pickle file will be generated and stored inside of the output/ directory. The embeddings consist of a 128-d vector for each face in the dataset.

We’ll then train a Support Vector Machine (SVM) machine learning model on top of the embeddings by executing the train_model.py script. The trained SVM will be serialized to recognizer.pickle in the output/ directory, along with the le.pickle label encoder.

Remark. You should delete the files included in the output/ directory and generate new files
associated with your own face dataset.

The recognize_video.py script simply activates your camera and detects plus recog-
nizes faces in each frame.

14.2.2 Our Environment Setup Script

Our Movidius face recognition system will not work properly unless a system environment
variable is set.

This may change in future revisions of OpenVINO, but for now a shell script is provided in
the project associated with this chapter.

Open up setup.sh and inspect the script:

1 #!/bin/sh
2
3 export OPENCV_DNN_IE_VPU_TYPE=Myriad2

The “shebang” (#!) on Line 1 indicates which interpreter (here, /bin/sh) should be used to execute the script.

Line 3 sets the environment variable using the export command. You could, of course, manually type the command in your terminal, but this shell script alleviates you from having to memorize the variable name and setting.

Let’s go ahead and execute the shell script:

$ source setup.sh

Provided that you have executed this script, you shouldn’t see any strange OpenVINO-related errors with the rest of the project.
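
If you’d like to double check from Python that the variable is visible to your session, a quick sanity check (a small sketch, not part of the project) is:

import os

# the setup script exports this variable; it should print "Myriad2"
print(os.environ.get("OPENCV_DNN_IE_VPU_TYPE"))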

If you encounter the following error message in the next section, be sure to execute setup.sh:

Traceback (most recent call last):
File "extract_embeddings.py", line 108 in <module>
cv2.error: OpenCV(4.1.1-openvino) /home/jenkins/workspace/OpenCV/
OpenVINO/build/opencv/modules/dnn/src/op_inf_engine.cpp:477
error: (-215:Assertion failed) Failed to initialize Inference Engine
backend: Can not init Myriad device: NC_ERROR in function 'initPlugin'

14.2.3 Extracting Facial Embeddings with Movidius NCS

Recall from Section 5.4.1 that in order to perform deep learning face recognition, we need
real-valued feature vectors to train a model upon. The script in this section serves the purpose
of extracting 128-d feature vectors for all faces in our dataset.

Let’s open extract_embeddings.py and review:

1 # import the necessary packages
2 from imutils import paths
3 import numpy as np
4 import argparse
5 import imutils
6 import pickle
7 import cv2
8 import os
9
10 # construct the argument parser and parse the arguments
11 ap = argparse.ArgumentParser()
12 ap.add_argument("-i", "--dataset", required=True,
13 help="path to input directory of faces + images")
14 ap.add_argument("-e", "--embeddings", required=True,
15 help="path to output serialized db of facial embeddings")
16 ap.add_argument("-d", "--detector", required=True,
17 help="path to OpenCV's deep learning face detector")
18 ap.add_argument("-m", "--embedding-model", required=True,
19 help="path to OpenCV's deep learning face embedding model")
20 ap.add_argument("-c", "--confidence", type=float, default=0.5,
21 help="minimum probability to filter weak detections")
22 args = vars(ap.parse_args())

Lines 2-8 import the necessary packages for extracting face embeddings.

Lines 11-22 parse five command line arguments:

• --dataset: The path to our input dataset of face images.

• --embeddings: The path to our output embeddings file. Our script will compute face
embeddings which we’ll serialize to disk.

• --detector: Path to OpenCV’s Caffe-based deep learning face detector used to actu-
ally localize the faces in the images.

• --embedding-model: Path to the OpenCV deep learning Torch embedding model. This
model will allow us to extract a 128-D facial embedding vector.

• --confidence: Optional threshold for filtering weak face detections.

24 # load our serialized face detector from disk
25 print("[INFO] loading face detector...")
26 protoPath = os.path.sep.join([args["detector"], "deploy.prototxt"])
27 modelPath = os.path.sep.join([args["detector"],
28 "res10_300x300_ssd_iter_140000.caffemodel"])
29 detector = cv2.dnn.readNetFromCaffe(protoPath, modelPath)
30 detector.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)
31
32 # load our serialized face embedding model from disk and set the
33 # preferable target to MYRIAD
34 print("[INFO] loading face recognizer...")
35 embedder = cv2.dnn.readNetFromTorch(args["embedding_model"])
36 embedder.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

Here we load the face detector and embedder:

• detector: Loaded via Lines 26-29. We’re using a Caffe based DL face detector to
localize faces in an image.

• embedder: Loaded on Line 35. This model is Torch-based and is responsible for extracting facial embeddings via deep learning feature extraction.

Notice that we’re using the respective cv2.dnn functions to load the two separate models.
The dnn module is optimized by the Intel OpenVINO developers.

As you can see on Line 30 and Line 36 we call setPreferableTarget and pass the Myr-
iad constant setting. These calls ensure that the Movidius Neural Compute Stick will conduct
the deep learning heavy lifting for us.

Moving forward, let’s grab our image paths and perform initializations:

38 # grab the paths to the input images in our dataset
39 print("[INFO] quantifying faces...")
40 imagePaths = list(paths.list_images(args["dataset"]))
41
42 # initialize our lists of extracted facial embeddings and
43 # corresponding people names
44 knownEmbeddings = []
45 knownNames = []
46
47 # initialize the total number of faces processed
48 total = 0

The imagePaths list, built on Line 40, contains the path to each image in the dataset. The imutils function paths.list_images automatically traverses the directory tree to find all image paths.

Our embeddings and corresponding names will be held in two lists: (1) knownEmbeddings,
and (2) knownNames (Lines 44 and 45).

We’ll also keep track of how many faces we’ve processed via the total variable (Line 48).

Let’s begin looping over the imagePaths — this loop will be responsible for extracting
embeddings from faces found in each image:

50 # loop over the image paths
51 for (i, imagePath) in enumerate(imagePaths):
52 # extract the person name from the image path
53 print("[INFO] processing image {}/{}".format(i + 1,
54 len(imagePaths)))
55 name = imagePath.split(os.path.sep)[-2]
56
57 # load the image, resize it to have a width of 600 pixels (while
58 # maintaining the aspect ratio), and then grab the image
59 # dimensions
60 image = cv2.imread(imagePath)
61 image = imutils.resize(image, width=600)
62 (h, w) = image.shape[:2]

We begin looping over imagePaths on Line 51.

First, we extract the name of the person from the path (Line 55). To explain how this works,
consider the following example in a Python shell:

$ python
>>> from imutils import paths
>>> import os
>>> datasetPath = "../datasets/face_recognition_dataset"
>>> imagePaths = list(paths.list_images(datasetPath))
>>> imagePath = imagePaths[0]
>>> imagePath
'dataset/adrian/00004.jpg'
>>> imagePath.split(os.path.sep)
['dataset', 'adrian', '00004.jpg']
>>> imagePath.split(os.path.sep)[-2]
'adrian'
>>>

Notice how by using imagePath.split and providing the split character (the OS path separator — “/” on Unix and “\” on Windows), the function produces a list of folder/file names (strings) which walk down the directory tree. We grab the second-to-last index, the person’s name, which in this case is adrian.

Finally, we wrap up the above code block by loading the image and resizing it to a known
width (Lines 60 and 61).

Let’s detect and localize faces:

64 # construct a blob from the image
65 imageBlob = cv2.dnn.blobFromImage(
66 cv2.resize(image, (300, 300)), 1.0, (300, 300),
67 (104.0, 177.0, 123.0), swapRB=False, crop=False)
68
69 # apply OpenCV's deep learning-based face detector to localize
70 # faces in the input image
71 detector.setInput(imageBlob)
72 detections = detector.forward()

On Lines 65-67, we construct a blob. A blob packages an image into a data structure
compatible with OpenCV’s dnn module. To learn more about this process, please read Deep
learning: How OpenCV’s blobFromImage works (http://pyimg.co/c4gws [45]).

From there we detect faces in the image by passing the imageBlob through the detector
network (Lines 71 and 72).

Let’s process the detections:

74 # ensure at least one face was found
75 if len(detections) > 0:
76 # we're making the assumption that each image has only ONE
77 # face, so find the bounding box with the largest probability
78 j = np.argmax(detections[0, 0, :, 2])
79 confidence = detections[0, 0, j, 2]
80
81 # ensure that the detection with the largest probability also
82 # meets our minimum probability test (thus helping filter out
83 # weak detections)
84 if confidence > args["confidence"]:
85 # compute the (x, y)-coordinates of the bounding box for
86 # the face
87 box = detections[0, 0, j, 3:7] * np.array([w, h, w, h])
88 (startX, startY, endX, endY) = box.astype("int")
89
90 # extract the face ROI and grab the ROI dimensions
91 face = image[startY:endY, startX:endX]
92 (fH, fW) = face.shape[:2]
93
94 # ensure the face width and height are sufficiently large
95 if fW < 20 or fH < 20:
96 continue

The detections list contains probabilities and bounding box coordinates to localize faces
in an image. Assuming we have at least one detection, we’ll proceed into the body of the if-
statement (Line 75). We make the assumption that there is only one face in the image, so we
extract the detection with the highest confidence and check to make sure that the confidence
meets the minimum probability threshold used to filter out weak detections (Lines 78-84).

When we’ve met that threshold, we extract the face ROI and grab/check dimensions to
make sure the face ROI is sufficiently large (Lines 87-96).

From there, we’ll take advantage of our embedder CNN and extract the face embeddings:

98 # construct a blob for the face ROI, then pass the blob
99 # through our face embedding model to obtain the 128-d
100 # quantification of the face
101 faceBlob = cv2.dnn.blobFromImage(face, 1.0 / 255,
102 (96, 96), (0, 0, 0), swapRB=True, crop=False)
103 embedder.setInput(faceBlob)
104 vec = embedder.forward()
105
106 # add the name of the person + corresponding face
107 # embedding to their respective lists
108 knownNames.append(name)
109 knownEmbeddings.append(vec.flatten())
110 total += 1

We construct another blob, this time from the face ROI (not the whole image as we did before) on Lines 101 and 102.

Subsequently, we pass the faceBlob through the embedder CNN (Lines 103 and 104).
This generates a 128-D vector (vec) which quantifies the face. We’ll leverage this data to
recognize new faces via machine learning.
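
To build some intuition for these embeddings: two 128-D vectors belonging to the same person should lie closer together than vectors belonging to different people. A minimal sketch with random stand-in vectors (not real embeddings) comparing them via Euclidean distance and cosine similarity:

# stand-ins for two embedder.forward().flatten() outputs
import numpy as np

vec_a = np.random.rand(128)
vec_b = np.random.rand(128)

dist = np.linalg.norm(vec_a - vec_b)   # smaller distance means more similar faces
cos = np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
print(dist, cos)                       # cosine closer to 1.0 means more similar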

And then we simply add the name and embedding vec to knownNames and
knownEmbeddings, respectively (Lines 108 and 109).

We also can’t forget about the variable we set to track the total number of faces either —
we go ahead and increment the value on Line 110.

We continue this process of looping over images, detecting faces, and extracting face em-
beddings for each and every image in our dataset.

All that’s left when the loop finishes is to dump the data to disk:

112 # dump the facial embeddings + names to disk
113 print("[INFO] serializing {} encodings...".format(total))
114 data = {"embeddings": knownEmbeddings, "names": knownNames}
115 f = open(args["embeddings"], "wb")
116 f.write(pickle.dumps(data))
117 f.close()

We add the name and embedding data to a dictionary and then serialize it into a pickle file
on Lines 113-117.
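
If you ever want to sanity check the serialized file, you can load it back and inspect its contents. A minimal sketch, assuming the default output path used later in this section:

# load the embeddings back from disk and report some basic stats
import pickle

data = pickle.loads(open("output/embeddings.pickle", "rb").read())
print("{} embeddings across {} people".format(
    len(data["embeddings"]), len(set(data["names"]))))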

At this point we’re ready to extract embeddings by running our script. Prior to running the
embeddings script, be sure to setup environmental variables via our script if you did not do so
in the previous section:

$ source setup.sh

From there, open up a terminal and execute the following command to compute the face
embeddings with OpenCV and Movidius:

$ python extract_embeddings.py \
--dataset ../datasets/face_recognition_dataset \
--embeddings output/embeddings.pickle \
--detector face_detection_model \
--embedding-model face_embedding_model/openface_nn4.small2.v1.t7
[INFO] loading face detector...
[INFO] loading face recognizer...
[INFO] quantifying faces...
[INFO] processing image 1/120
[INFO] processing image 2/120
[INFO] processing image 3/120
[INFO] processing image 4/120
[INFO] processing image 5/120
...
[INFO] processing image 116/120
[INFO] processing image 117/120
[INFO] processing image 118/120
[INFO] processing image 119/120
[INFO] processing image 120/120
[INFO] serializing 116 encodings...

This process completed in 57s on a RPi 4B with an NCS2 plugged into the USB 3.0 port.
You may notice a delay at the beginning as the model is being loaded. From there, each image
will process very quickly.

As you can see we’ve extracted 120 embeddings for each of the 120 face photos in our
dataset. The embeddings.pickle file is now available in the output/ folder as well:

$ ls -lh output/*.pickle
-rw-r--r-- 1 pi pi 66K Nov 20 14:35 output/embeddings.pickle

The serialized embeddings filesize is 66KB — embeddings files grow linearly according
to the size of your dataset. Be sure to review Section 14.3.1 later in this chapter about the
importance of an adequately large dataset for achieving high accuracy.

14.2.4 Training an SVM Model on Top of Facial Embeddings

At this point we have extracted 128-d embeddings for each face — but how do we actually
recognize a person based on these embeddings? The answer is that we need to train a
“standard” machine learning model (such as an SVM, k-NN classifier, Random Forest, etc.) on
top of the embeddings.

For small datasets a k-Nearest Neighbor (k-NN) approach can be used for face recognition
on 128-d embeddings created via the dlib [33] and face_recognition [34] libraries.

However, in this chapter, we will build a more powerful classifier (Support Vector Machines)
on top of the embeddings — you’ll be able to use this same method in your dlib-based face
recognition pipelines as well if you are so inclined.
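
For reference, here is a minimal sketch of that k-NN alternative using scikit-learn (the arrays below are random stand-ins for the embeddings and encoded labels, not real data):

# k-NN trained directly on the 128-D embeddings, a simple drop-in
# alternative to the SVM we build in train_model.py
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

X = np.random.rand(40, 128)            # stand-in for data["embeddings"]
y = np.random.randint(0, 4, size=40)   # stand-in for encoded labels

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
print(knn.predict(X[:1]))              # predicted label for one embedding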

Open up the train_model.py file and insert the following code:

1 # import the necessary packages
2 from sklearn.model_selection import GridSearchCV
3 from sklearn.preprocessing import LabelEncoder
4 from sklearn.svm import SVC
5 import argparse
6 import pickle
7
8 # construct the argument parser and parse the arguments
9 ap = argparse.ArgumentParser()
10 ap.add_argument("-e", "--embeddings", required=True,
11 help="path to serialized db of facial embeddings")
12 ap.add_argument("-r", "--recognizer", required=True,
13 help="path to output model trained to recognize faces")
14 ap.add_argument("-l", "--le", required=True,
15 help="path to output label encoder")
16 args = vars(ap.parse_args())

We import our packages and modules on Lines 2-6. We’ll be using scikit-learn’s implemen-
tation of Support Vector Machines (SVM), a common machine learning model.

Lines 9-16 parse three required command line arguments:

• --embeddings: The path to the serialized embeddings (we saved them to disk by run-
ning the previous extract_embeddings.py script).

• --recognizer: This will be our output model that recognizes faces. We’ll be saving it
to disk so we can use it in the next two recognition scripts.

• --le: Our label encoder output file path. We’ll serialize our label encoder to disk so that
we can use it and the recognizer model in our image/video face recognition scripts.

Let’s load our facial embeddings and encode our labels:

18 # load the face embeddings
19 print("[INFO] loading face embeddings...")
20 data = pickle.loads(open(args["embeddings"], "rb").read())
21
22 # encode the labels
23 print("[INFO] encoding labels...")
24 le = LabelEncoder()
25 labels = le.fit_transform(data["names"])

Here we load our embeddings from our previous section on Line 20. We won’t be gen-
erating any embeddings in this model training script — we’ll use the embeddings previously
generated and serialized.

Then we initialize our scikit-learn LabelEncoder and encode our name labels (Lines 24
and 25).

Now it’s time to train our SVM model for recognizing faces:

27 # train the model used to accept the 128-d embeddings of the face and
28 # then produce the actual face recognition
29 print("[INFO] training model...")
30 params = {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
31 "gamma": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]}
32 model = GridSearchCV(SVC(kernel="rbf", gamma="auto",
33 probability=True), params, cv=3, n_jobs=-1)
34 model.fit(data["embeddings"], labels)
35 print("[INFO] best hyperparameters: {}".format(model.best_params_))

We are using a machine learning Support Vector Machine (SVM) Radial Basis Function
(RBF) kernel [77] which is typically harder to tune than a linear kernel. Therefore, we will
undergo a process known as “gridsearching”, a method to find the optimal machine learning
hyperparameters for a model.

Lines 30-33 set our grid search parameters and launch the search. Notice that n_jobs=-1,
which tells scikit-learn to use all available cores to run the grid search in parallel. If your
Raspberry Pi becomes unresponsive during training, drop this to n_jobs=1 so only a single
worker is used.
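
Once the search finishes, the GridSearchCV object exposes the winning hyperparameters, the mean cross-validated score, and the refit estimator we serialize below. A minimal sketch with random stand-in data (not our real embeddings):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import numpy as np

X = np.random.rand(60, 128)            # stand-in for the face embeddings
y = np.random.randint(0, 3, size=60)   # stand-in for the encoded labels

params = {"C": [0.1, 1.0, 10.0], "gamma": [1e-2, 1e-3]}
search = GridSearchCV(SVC(kernel="rbf", probability=True), params, cv=3)
search.fit(X, y)

print(search.best_params_)      # the combination that scored highest
print(search.best_score_)       # its mean cross-validated accuracy
model = search.best_estimator_  # the refit model we write to disk below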

Line 34 handles training our face recognition model on the face embeddings vectors.

Remark. You can and should experiment with alternative machine learning classifiers. The
PyImageSearch Gurus course [78] covers popular machine learning algorithms in depth. To
learn more about the course use this link: http://pyimg.co/gurus

From here we’ll serialize our face recognizer model and label encoder to disk:

47 # write the actual face recognition model to disk
48 f = open(args["recognizer"], "wb")
49 f.write(pickle.dumps(model.best_estimator_))
50 f.close()
51
52 # write the label encoder to disk
53 f = open(args["le"], "wb")
54 f.write(pickle.dumps(le))
55 f.close()

To execute our training script, enter the following command in your terminal:

$ python train_model.py --embeddings output/embeddings.pickle \
--recognizer output/recognizer.pickle --le output/le.pickle
[INFO] loading face embeddings...
[INFO] encoding labels...
[INFO] training model...
[INFO] best hyperparameters: {'C': 100.0, 'gamma': 0.1}

Let’s check the output/ folder now:

$ ls -lh output/*.pickle
-rw-r--r-- 1 pi pi 66K Nov 20 14:35 output/embeddings.pickle
-rw-r--r-- 1 pi pi 470 Nov 20 14:55 output/le.pickle
-rw-r--r-- 1 pi pi 97K Nov 20 14:55 output/recognizer.pickle

With our serialized face recognition model and label encoder, we’re ready to recognize faces
in images or video streams.

14.2.5 Real-Time Face Recognition in Video Streams with Movidius NCS

In this section we will code a quick demo script to recognize faces using your PiCamera or
USB webcam. Go ahead and open recognize_video.py and insert the following code:

1 # import the necessary packages
2 from imutils.video import VideoStream
3 from imutils.video import FPS
4 import numpy as np
5 import argparse
6 import imutils
7 import pickle
8 import time
9 import cv2
10 import os
11
12 # construct the argument parser and parse the arguments
13 ap = argparse.ArgumentParser()
14 ap.add_argument("-d", "--detector", required=True,
15 help="path to OpenCV's deep learning face detector")
16 ap.add_argument("-m", "--embedding-model", required=True,
17 help="path to OpenCV's deep learning face embedding model")
18 ap.add_argument("-r", "--recognizer", required=True,
19 help="path to model trained to recognize faces")
20 ap.add_argument("-l", "--le", required=True,
21 help="path to label encoder")
22 ap.add_argument("-c", "--confidence", type=float, default=0.5,
23 help="minimum probability to filter weak detections")
24 args = vars(ap.parse_args())

Our imports should be familiar at this point.

Our five command line arguments are parsed on Lines 12-24:


• --detector: The path to OpenCV’s deep learning face detector. We’ll use this model
to detect where in the image the face ROIs are.

• --embedding-model: The path to OpenCV’s deep learning face embedding model.
We’ll use this model to extract the 128-D face embedding from the face ROI — we’ll feed
the data into the recognizer.

• --recognizer: The path to our recognizer model. We trained our SVM recognizer in
Section 14.2.4. This model will actually determine who a face is.

• --le: The path to our label encoder. This contains our face labels such as adrian or
unknown.

• --confidence: The optional threshold to filter weak face detections.

Be sure to study these command line arguments — it is critical that you know the difference
between the two deep learning models and the SVM model. If you find yourself confused later
in this script, you should refer back to here.

Now that we’ve handled our imports and command line arguments, let’s load the three
models from disk into memory:

26 # load our serialized face detector from disk
27 print("[INFO] loading face detector...")
28 protoPath = os.path.sep.join([args["detector"], "deploy.prototxt"])
29 modelPath = os.path.sep.join([args["detector"],
30 "res10_300x300_ssd_iter_140000.caffemodel"])
31 detector = cv2.dnn.readNetFromCaffe(protoPath, modelPath)
32 detector.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)
33
34 # load our serialized face embedding model from disk and set the
35 # preferable target to MYRIAD
36 print("[INFO] loading face recognizer...")
37 embedder = cv2.dnn.readNetFromTorch(args["embedding_model"])
38 embedder.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)
39
40 # load the actual face recognition model along with the label encoder
41 recognizer = pickle.loads(open(args["recognizer"], "rb").read())
42 le = pickle.loads(open(args["le"], "rb").read())

We load three models in this block. At the risk of being redundant, here is a brief summary
of the differences among the models:

i. detector: A pre-trained Caffe DL model to detect where in the image the faces are
(Lines 28-32).

ii. embedder: A pre-trained Torch DL model to calculate our 128-D face embeddings (Lines
37 and 38).

iii. recognizer: Our SVM face recognition model (Line 41).

The first two are pre-trained deep learning models, meaning that they are provided to you
as-is by OpenCV. The Movidius NCS will perform inference using each of these models.

The third recognizer model is not a form of deep learning. Rather, it is our SVM machine
learning face recognition model. The RPi CPU will have to handle making face recognition
predictions using it.

We also load our label encoder which holds the names of the people our model can recog-
nize (Line 42).

Let’s initialize our video stream:

44 # initialize the video stream, then allow the camera sensor to warm up
45 print("[INFO] starting video stream...")
46 #vs = VideoStream(src=0).start()
47 vs = VideoStream(usePiCamera=True).start()
48 time.sleep(2.0)
49
50 # start the FPS throughput estimator
51 fps = FPS().start()

Line 47 initializes and starts our VideoStream object. We wait for the camera sensor to
warm up on Line 48.

Line 51 initializes our FPS counter for benchmarking purposes.

Frame processing begins with our while loop:

53 # loop over frames from the video file stream
54 while True:
55 # grab the frame from the threaded video stream
56 frame = vs.read()
57
58 # resize the frame to have a width of 600 pixels (while
59 # maintaining the aspect ratio), and then grab the image
60 # dimensions
61 frame = imutils.resize(frame, width=600)
62 (h, w) = frame.shape[:2]
63
64 # construct a blob from the image
65 imageBlob = cv2.dnn.blobFromImage(
66 cv2.resize(frame, (300, 300)), 1.0, (300, 300),
67 (104.0, 177.0, 123.0), swapRB=False, crop=False)
68
69 # apply OpenCV's deep learning-based face detector to localize
70 # faces in the input image
71 detector.setInput(imageBlob)
72 detections = detector.forward()

We grab a frame from the webcam on Line 56. We resize the frame (Line 61) and then
construct a blob prior to detecting where the faces are (Lines 65-72).

Given our new detections, let’s recognize faces in the frame. But first, we need to filter
weak detections and extract the face ROI:

74 # loop over the detections
75 for i in range(0, detections.shape[2]):
76 # extract the confidence (i.e., probability) associated with
77 # the prediction
78 confidence = detections[0, 0, i, 2]
79
80 # filter out weak detections
81 if confidence > args["confidence"]:
82 # compute the (x, y)-coordinates of the bounding box for
83 # the face
84 box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
85 (startX, startY, endX, endY) = box.astype("int")
86
87 # extract the face ROI
88 face = frame[startY:endY, startX:endX]
89 (fH, fW) = face.shape[:2]
90
91 # ensure the face width and height are sufficiently large
92 if fW < 20 or fH < 20:
93 continue

You’ll recognize this block from extracting embeddings in Section 14.2.3.

We loop over the detections on Line 75 and extract the confidence of each on Line 78.

Then we compare the confidence to the minimum probability detection threshold contained
in our command line args dictionary, ensuring that the computed probability is larger than the
minimum probability (Line 81).

From there, we extract the face ROI (Lines 84-89) and ensure its spatial dimensions
are sufficiently large (Lines 92 and 93).

Recognizing the name of the face ROI requires just a few steps:

95 # construct a blob for the face ROI, then pass the blob
96 # through our face embedding model to obtain the 128-d
97 # quantification of the face
98 faceBlob = cv2.dnn.blobFromImage(cv2.resize(face,
99 (96, 96)), 1.0 / 255, (96, 96), (0, 0, 0),
100 swapRB=True, crop=False)
101 embedder.setInput(faceBlob)
102 vec = embedder.forward()
103
104 # perform classification to recognize the face
105 preds = recognizer.predict_proba(vec)[0]
106 j = np.argmax(preds)
107 proba = preds[j]
108 name = le.classes_[j]

First, we construct a faceBlob (from the face ROI) and pass it through the embedder to
generate a 128-D vector which quantifies the face (Lines 98-102).

Then, we pass the vec through our SVM recognizer model (Line 105), the result of which
is our predictions for who is in the face ROI.

We take the highest probability index and query our label encoder to find the name (Lines
106-108).

Remark. You can further filter out weak face recognitions by applying an additional threshold
test on the probability. For example, inserting a check such as if proba < T (where T is a
threshold you define) and labeling the face as “unknown” when the test passes can provide an
additional layer of filtering to ensure there are fewer false-positive face recognitions.
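
A minimal sketch of that extra check (T and the values below are hypothetical; proba and name stand in for the variables computed above):

T = 0.60                        # hypothetical minimum recognition probability
proba, name = 0.42, "adrian"    # example values for illustration

# if the SVM is not confident enough, fall back to an "unknown" label
if proba < T:
    name = "unknown"
print(name)                     # -> "unknown" since 0.42 < 0.60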

Now, let’s display face recognition results for this particular frame:

110 # draw the bounding box of the face along with the
111 # associated probability
112 text = "{}: {:.2f}%".format(name, proba * 100)
113 y = startY - 10 if startY - 10 > 10 else startY + 10
114 cv2.rectangle(frame, (startX, startY), (endX, endY),
115 (0, 0, 255), 2)
116 cv2.putText(frame, text, (startX, y),
117 cv2.FONT_HERSHEY_SIMPLEX, 0.45, (0, 0, 255), 2)
118
119 # update the FPS counter
120 fps.update()
121
122 # show the output frame
123 cv2.imshow("Frame", frame)
124 key = cv2.waitKey(1) & 0xFF
125
126 # if the `q` key was pressed, break from the loop
127 if key == ord("q"):
128 break
129
130 # stop the timer and display FPS information
131 fps.stop()
132 print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
133 print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
134
135 # do a bit of cleanup
136 cv2.destroyAllWindows()
137 vs.stop()

To close out the script, we:

• Draw a bounding box around the face and annotate it with the person’s name and
corresponding predicted probability (Lines 112-117).

• Update our fps counter (Line 120).

• Display the annotated frame (Line 123) and wait for the q key to be pressed at which
point we break out of the loop (Lines 124-128).

• Stop our fps counter and print statistics in the terminal (Lines 131-133).

• Cleanup by closing windows and releasing pointers (Lines 136 and 137).

14.2.6 Face Recognition with Movidius NCS Results

Now that we have (1) extracted face embeddings, (2) trained a machine learning model on the
embeddings, and (3) written our face recognition in video streams driver script, let’s see the
final result.

$ python recognize_video.py --detector face_detection_model \
--embedding-model face_embedding_model/openface_nn4.small2.v1.t7 \
--recognizer output/recognizer.pickle \
--le output/le.pickle
[INFO] loading face detector...
[INFO] loading face recognizer...
[INFO] starting video stream...
[INFO] elapsed time: 60.30
[INFO] approx. FPS: 6.29

As you can see, faces have correctly been identified. What’s more, we are achieving 6.29
FPS using the Movidius NCS in comparison to 2.59 FPS using strictly the CPU, which works
out to roughly a 2.43x speedup on the RPi 4B with the Movidius NCS2.

Figure 14.1: Face recognition with the Raspberry Pi and Intel Movidius Neural Compute Stick.

14.3 How to Obtain Higher Face Recognition Accuracy

Inevitably, you’ll run into a situation where OpenCV does not recognize a face correctly.

What do you do in those situations? And how do you improve your OpenCV face recognition
accuracy?

In this section, I’ll detail a few of the suggested methods to increase the accuracy of your
face recognition pipeline.

Remark. This section includes reposted content from my OpenCV Face Recognition blog post
http://pyimg.co/6hwuu [74].

14.3.1 You May Need More Data

My first suggestion is likely the most obvious one, but it’s worth sharing.

In this tutorial on face recognition (http://pyimg.co/oh21b [73]), a handful of PyImageSearch
readers asked why their face recognition accuracy was low and faces were being misclassified
— the conversation went something like this (paraphrased):

Them: Hey Adrian, I am trying to perform face recognition on a dataset of my classmates’
faces, but the accuracy is really low. What can I do to increase face recognition accuracy?

Figure 14.2: All face recognition systems are error-prone. There will never be a 100% accurate
face recognition system.

Me: How many face images do you have per person?

Them: Only one or two.

Me: Gather more data.

I get the impression that most readers already know they need more face images when they
only have one or two example faces per person, but I suspect they are hoping for me to pull a
computer vision technique out of my bag of tips and tricks to solve the problem.

It doesn’t work like that.

If you find yourself with low face recognition accuracy and only have a few example faces
per person, gather more data — there are no “computer vision tricks” that will save you from
the data gathering process.

Invest in your data and you’ll have a better OpenCV face recognition pipeline. In
general, I would recommend a minimum of 10-20 faces per person.

Figure 14.3: Most people aren’t training their OpenCV face recognition models with enough data.
(image source: [79])

Remark. You may be thinking, “But Adrian, you only gathered 20 images per person for this
chapter!” Yes, you are right — and that is to prove a point. The face recognition system we
discussed in this chapter worked but can always be improved. There are times when smaller
datasets will give you your desired results, and there’s nothing wrong with trying a small dataset
— but when you don’t achieve your desired accuracy you’ll want to gather more data.

14.3.2 Perform Face Alignment

The face recognition model OpenCV uses to compute the 128-d face embeddings comes from
the OpenFace project [76].

The OpenFace model will perform better on faces that have been aligned. Face alignment
is the process of (1) identifying the geometric structure of faces in images and (2) attempting to
obtain a canonical alignment of the face based on translation, rotation, and scale.

As you can see from Figure 14.4 we have:

i. Detected faces in the image and extracted the ROIs (based on the bounding box coordi-
nates).

ii. Applied facial landmark detection (http://pyimg.co/x0f5r [80]) to extract the coordinates of
the eyes.

Figure 14.4: Performing face alignment for OpenCV facial recognition can dramatically improve
face recognition performance.

iii. Computed the centroid for each respective eye along with the midpoint between the eyes.

iv. And based on these points, applied an affine transform to resize the face to a fixed size
and dimension.

If we apply face alignment to every face in our dataset, then in the output coordinate space,
all faces should:

i. Be centered in the image.

ii. Be rotated such that the eyes lie on a horizontal line (i.e., the face is rotated such that the
eyes lie along the same y-coordinates).

iii. Be scaled such that the size of the faces is approximately identical.

Applying face alignment to our OpenCV face recognition pipeline is outside the scope of
this chapter, but if you would like to further increase your face recognition accuracy using
OpenCV and OpenFace, I would recommend you apply face alignment using the method in
this PyImageSearch article: http://pyimg.co/tnbzf [81].
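
For a flavor of what the alignment step does, here is a minimal sketch (not from this chapter's code) that levels the eyes using an affine rotation; the eye coordinates are hypothetical values you would normally obtain from a facial landmark detector:

import cv2
import numpy as np

face = np.zeros((256, 256, 3), dtype="uint8")   # stand-in for a face ROI
left_eye, right_eye = (80, 120), (176, 104)     # hypothetical eye centers

# angle of the line connecting the eyes, and the point to rotate around
dY = right_eye[1] - left_eye[1]
dX = right_eye[0] - left_eye[0]
angle = np.degrees(np.arctan2(dY, dX))
eyes_center = ((left_eye[0] + right_eye[0]) / 2.0,
    (left_eye[1] + right_eye[1]) / 2.0)

# rotate so the eyes lie on a horizontal line; a full aligner would also
# scale and translate the face to canonical eye positions
M = cv2.getRotationMatrix2D(eyes_center, angle, 1.0)
aligned = cv2.warpAffine(face, M, (256, 256), flags=cv2.INTER_CUBIC)
print(aligned.shape)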

14.3.3 Tune Your Hyperparameters

My next suggestion is for you to attempt to tune your hyperparameters on whatever machine
learning model you are using (i.e., the model trained on top of the extracted face embeddings).

For this chapter’s tutorial, we used an SVM with a Radial Basis Function (RBF) kernel. To
tune the hyperparameters, we performed a grid search over the C value, which is typically the
most important value of an SVM to tune.

The C value is a “strictness” parameter and controls how much you want to avoid misclassi-
fying each data point in the training set. Larger values of C will be more strict, causing the SVM
to try harder to classify every input data point correctly, even at the risk of overfitting. Smaller
values of C will be more “soft”, allowing some misclassifications in the training data, but ideally
generalizing better to testing data.

You may also want to consider tuning the gamma value. The default scikit-learn implemen-
tation will attempt to automatically set the gamma value for you; however, the result may not be
satisfactory. The following example from the scikit-learn documentation shows you how to tune
both the C and gamma parameters to a RBF SVM: http://pyimg.co/qz3pq [82].

It’s interesting to note that according to one of the classification examples in the OpenFace
GitHub [83], they actually recommend to not tune the hyperparameters if you are using a
linear SVM, as, from their experience, they found that setting C=1 obtains satisfactory face
recognition results in most settings.

That said, RBF SVMs tend to be significantly harder to tune, so if your face recognition
accuracy is not sufficient, it may be worth the extra effort and computational cost of tuning your
hyperparameters via either a grid search or random search.
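
As a starting point, here is a minimal sketch of the random search alternative using scikit-learn's RandomizedSearchCV (the candidate values and data below are illustrative stand-ins, not our real embeddings):

from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
import numpy as np

X = np.random.rand(60, 128)            # stand-in for face embeddings
y = np.random.randint(0, 3, size=60)   # stand-in for encoded labels

dist = {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
    "gamma": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]}
search = RandomizedSearchCV(SVC(kernel="rbf", probability=True), dist,
    n_iter=10, cv=3, random_state=42)
search.fit(X, y)
print(search.best_params_)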

14.3.4 Use Dlib’s Embedding Model

In my experience using both OpenCV’s face recognition model along with dlib’s face recogni-
tion model [35], I’ve found that dlib’s face embeddings are more discriminative, especially for
smaller datasets.

Furthermore, I’ve found that dlib’s model is less dependent on (1) preprocessing steps such
as face alignment, and (2) using a more powerful machine learning model on top of extracted
face embeddings

If you take a look at my first face recognition article on PyImageSearch (http://pyimg.co/oh21b
[73]), you’ll notice that we utilized dlib with a simple k-NN algorithm for face recognition (with a
small modification to throw out nearest neighbor votes whose distance was above a threshold).
The k-NN model worked well, but as we know, more powerful machine learning models exist.

To improve accuracy further, you may want to use dlib’s embedding model, and then instead
of applying k-NN, follow Section 14.2.4 from this chapter and train a more powerful classifier
on the face embeddings.

14.4 Summary

In this chapter, we used OpenVINO and our Movidius NCS to perform face recognition.

Our face recognition pipeline was created using a four-stage process:

i. Create your dataset of face images. You can, of course, swap in your own face dataset
provided you follow the directory structure of the project (covered in Chapter 5).

ii. Extract face embeddings for each face in the dataset.

iii. Train a machine learning model (Support Vector Machines) on top of the face embed-
dings.

iv. Utilize OpenCV and our Movidius NCS to recognize faces in video streams.

We put our Movidius NCS to work for the following deep learning tasks:

• Face detection: localizing faces in an image

• Extracting face embeddings: generating 128-D vectors which quantify a face numerically

We then used the Raspberry Pi CPU to handle the non-DL machine learning classifier used
to make predictions on the 128-D embeddings.

This process of separating responsibilities allowed the CPU to call the shots, while employing
the NCS for the heavy lifting. We achieved a roughly 2.43x speedup using the Movidius NCS
for face recognition in video streams.
Chapter 15

Recognizing Objects with IoT Pi-to-Pi Communication

Now that you’ve read all of the Hobbyist Bundle, and most of the Hacker Bundle, imagine that
you’re brainstorming with me on a security-related project to build upon Chapter 5 — the face
recognition door monitor project.

Let’s consider the drawbacks of many alarm systems available today for homes and busi-
nesses.

Commercial off-the-shelf (COTS) systems are expensive in comparison to the cheap and
affordable Raspberry Pi. Service contracts, installation fees, and false-alarm police call fees
come to mind.

COTS security systems typically cannot be customized. Despite being an IoT device, rarely
does a COTS security system integrate and interoperate with other devices in your home.

Alarm systems sometimes tend to be a nuisance and people end up not using them in
the first place. Don’t you hate that your motion sensor isn’t smart enough to ignore your dog
moving through the home? Many people with animals simply do not arm their home for fear
of a false alarm.

As the Hacker Bundle comes to a close with this final chapter, what better way to end
than by putting all the pieces together and creating a real-world IoT security project you can
use in your home?

In this chapter, we will arm our home with multiple Raspberry Pis that communicate among
themselves to accomplish a common goal. In a sense, this chapter represents the culmination
of all the topics we have learned and discussed so far in the first two volumes of this book.
We will focus on learning by doing and using the concepts that we have learned along the
way. There will certainly be room for you to hack the project to your needs and create entirely


different spin-off projects as well.

Let’s go ahead and review our learning objectives.

15.1 Chapter Learning Objectives

Admittedly, this chapter has a lot of moving parts. Keeping that in mind, we have actually
covered most of the topics in previous chapters of the Hobbyist and Hacker Bundle.

In this chapter we will:

i. Reinforce and refer to previously covered topics including:

a. Message passing and sockets


b. Twilio SMS alerts
c. Face detection/recognition
d. Background subtraction
e. Object detection with MobileNet SSD
f. Text-to-speech

ii. Cover new topics including:

a. Internet of Things (IoT) smart lighting


b. State machines

iii. And bring it all together for a multiple Pi Internet of Things project.

15.2 How this Chapter is Organized

In this chapter, we’ll begin by building a case for security with an emphasis on the lack of
flexibility of commercially available systems — that’s where we bring in Raspberry Pis to the
rescue. From there we’ll review the concept of what we’re building in this chapter. Our interop-
erable system will involve at least two Raspberry Pis and all the concepts listed in the objectives
section.

Next, we’ll review concepts that we have already covered in this book including links to
chapters which you should be familiar with prior to reading this chapter and hacking with the
code.

We’ll then introduce two new concepts: IoT lighting and state machines. We’ll review the
IoT lights that are recommended for compatibility with this chapter including a quick demo of

the API. If you have a computer science or computer engineering background, you’re likely
already familiar with state machines. If not, you’ll pick up the concept very quickly.

From there we’ll get into our IoT Case Study project including discussing our state machines,
project structure, config files, and driver scripts. We’ll bring it all together for deployment. And
finally, we’ll review suggestions and ideas for this project and spin-off projects that you dream
up.

Be sure to budget some time for this chapter as there are a lot of moving parts. For de-
ployment you may have to spend some time tweaking your physical environment including
positioning of your cameras and other hardware as well as testing.

15.3 The Case for a Complex Security System

In this section, we’ll discuss security systems, why they are important, and how networked
Raspberry Pis can help with the equation.

15.3.1 Your Family, Your Belongings

We all know way too many people who have had belongings stolen.

This leaves someone with a fear for their safety, their family’s safety, and a fear of their pre-
cious belongings being vulnerable to loss. Protecting your family and belongings is important
to you — theft and vandalism are on the rise in many areas and are hard for communities to
counteract.

Many break-ins even happen in broad daylight, so what can we do?

Relying on law enforcement to pick up the pieces of an insecure home or car that gets
broken into doesn’t always lead to putting people behind bars.

Law enforcement needs evidence to act, and in many cases, they don’t have time to dust
for fingerprints for the theft of an expensive laptop or jewelry.

Thanks to GPS and connectivity in laptops and smartphones, thieves are beginning to leave
them behind. Check out Apple’s latest “Find My” system that relies on nearby bluetooth devices
to find phones, watches, laptops, and headphones [84]. Apple claims the system is secure and
will not impact battery life of your devices. Considering you’re nearly always within 30 feet of
an iPhone, thieves may begin to think twice about stealing an Apple product!

Instead, they go for items that can’t be electronically tracked — currency, tools from your
garage/shed, and jewelry. For these items, it is essential to have a guarded home with a
security system to include video evidence if you want any hope of the person serving time
behind bars and leaving your belongings alone.

15.3.2 The Importance of Video Evidence

Let’s face it. In most cases a camera or motion sensitive light won’t deter a thief. The thief or
vandal will sometimes wear a mask making it nearly impossible for identification.

That’s not an excuse to skip installing cameras – cameras and storage are cheap. If you
don’t have a video to provide to police, you certainly aren’t helping them on their quest to track
down the bad guy. Video evidence provided to both the police and posted in community social
groups like the NextDoor app [85] may lead to action being taken.

15.3.3 How the RPi Shines for Security — Flexibility and Affordability

There are countless off-the-shelf systems for security surveillance, many with a local DVR or
cloud storage option. These systems are great for the “average Joe” homeowner. As Raspberry
Pi tinkerers, we tend to overlook the COTS (Commercial Off-the-Shelf) solution. There’s
actually a lot to learn from these systems, so browsing product pages and talking to home
security experts is definitely worth our time.

But normally, COTS systems don’t provide much room for tinkering and custom develop-
ment. We’re hackers, so we say “boring!” We also become frustrated that we can’t make these
expensive products and services operate as we desire:

• That camera can’t pan/tilt on the cheap.

• Nor can it activate all the lights in the house to really throw the bad guy off his game at
night.

• Nor can it send SMS messages to your smart phone (and your neighbors).

• Can it borrow a gigabyte of storage from Dropbox or Box that you already pay for to store
motion or pedestrian clips? I doubt it — you’re locked in to an expensive storage solution
possibly with a proprietary video file format.

• Does the camera interface with your alarm system? Some do, but you may pay a pre-
mium.

• Pedestrian and face detection? Yeah right.

• Full control over your motion detection algorithm? I didn’t think so.

• Is that COTS system less than a one time fee of $100USD and some sweat equity on the
keyboard? No, you’re locked into a recurring monthly fee.

The Raspberry Pi (coupled with Python, OpenCV, and other libraries) can complete all the
above tasks as we’ve learned in the Hobbyist Bundle and so far in the Hacker Bundle.

Our solution is flexible, affordable, and interoperable with other IoT devices and services that
are worth paying for (there are plenty of poor products and services that won’t interoperate as
well).

Flexibility is both a pro and a con. On one hand, the possibilities are endless. On the other
hand, it requires time and development. But you’ll learn new skills in the process which is a
win-win in my mind coming from the computer vision education perspective.

You may even find yourself working for a security company developing computer vision
applications and algorithms now that you’re armed with knowledge you gained in this book!

15.4 A Fully-Fledged Project Involving Multiple IoT Devices

Before we begin implementing our IoT security system, let’s break it down and understand how
it will actually work.

Our system has the goals of (1) monitoring our driveway outside the house for motion and
people/vehicles, (2) turning on a light inside the home to deter bad guys and help with face
detection/recognition at night, and (3) performing face recognition and alerting the homeowners
if someone shouldn’t be inside (i.e. not you or your family members).

We can accomplish this proof of concept IoT case study system with a minimum of three
devices:

• Two Raspberry Pis, each with its own camera

• IoT light bulb, switch, or plug

Each Raspberry Pi will be running a separate program; however, the Raspberry Pis will pass
messages back and forth to convey the state of our system. This communication/message
passing paradigm allows for each RPi’s Python script to update its own state so that each Pi
smartly does the correct task (i.e. waits, monitors driveway, detects the door as opened/closed,
or performs face recognition, turns on a light, etc.).

Only one of the Raspberry Pis will communicate with our IoT lighting via an API. Arguably,
either or both Raspberry Pis could control the lights, but for simplicity only one RPi will be
responsible in this example.

Before we design our system, we need to review (1) previously covered topics, (2) state
machine basics, and (3) IoT lighting via APIs.

15.5 Reinforcing Concepts Covered in Previous Chapters

This case study builds upon many concepts previously covered in this book. The chapter
is admittedly lengthy, so this section serves as your starting point with pointers to other
chapters/sections and outside resources.

Be sure to refer to previous chapters as needed while you read the rest of this chapter.
Previous chapters include more detailed code explanations that you should be familiar with at
this point.

Let’s review the concepts you should know for understanding this chapter.

15.5.1 Message Passing and Sockets

Any Internet communications system you use and rely on utilizes sockets which in programmer-
speak is a word for a connection. A single program can have many connections to services
such as REST APIs, databases, websites, SFTP, applications residing on servers or other
systems, etc.

We will rely on simple message passing sockets in this project. Be sure to refer to Chapter
3 where we reviewed message passing applications by example, ZMQ, and ImageZMQ.

15.5.2 Face Detection

Figure 15.1: Left: Face detection localizes a face in an image. Right: Face recognition determines
who is in the detected face ROI.

Face detection involves localizing a face in an image (i.e. finding the bounding box (x,
y)-coordinates of all faces). Face detection algorithms rely on Haar Cascades, Histogram
of Oriented Gradients (HOG) detectors, or deep learning object detectors. Each has its own
tradeoffs in speed/accuracy performance. Knowing which type of face detector to use is key to
a project’s success, especially on a resource-constrained device like the Raspberry Pi. The
following chapters of this book implement face detection:

• Hobbyist Bundle:

• Chapter 18 utilizes face detection for pan/tilt camera tracking.

• Hacker Bundle:

• Chapter 5 uses face detection prior to recognition in a video stream of your doorway
to monitor and alert you of people entering your home.
• Chapter 6 utilizes face detection to find faces in a frame prior to recognition for class-
room attendance purposes.
• Chapter 14 utilizes a deep learning face detection model prior to applying deep learn-
ing/machine learning based face recognition using the Movidius NCS.

15.5.3 Face Recognition

Face recognition involves discerning the difference between faces in an image or video feed
(i.e. who is who?).

Local Binary Patterns and Eigenfaces were successful algorithms used for facial recognition
in the early days of the art.

These days, modern face recognition systems employ deep learning and machine learning
to accomplish face recognition. We compute “face embeddings”, a form of a feature vector, for
faces in a dataset. From there we can train a machine learning model on top of the extracted
face embeddings.

We then load the serialized model to recognize fresh faces presented to a camera as either
recognized or unknown.

Alternatively a k-Nearest Neighbor approach could be used, which is arguably not machine
learning. Rather, k-NN relies on computing the distance between face embeddings and finding
the closest match(es).

The following chapters demonstrate facial recognition in the Hacker Bundle:

• Chapter 5 makes use of facial recognition to recognize inhabitants and intruders entering
your home.

• Chapter 6 performs facial recognition for classroom attendance.



• Chapter 14 applies face recognition using the Movidius NCS.

I also recommend reading the following articles:

• Face Recognition with OpenCV, Python, and Deep Learning: http://pyimg.co/oh21b [73]

• OpenCV Face Recognition: http://pyimg.co/i39fy [74]

Face recognition is a hot topic and you can find all relevant topics on PyImageSearch via
this "faces" category link: http://pyimg.co/yhluw.

15.5.4 Background Subtraction

Figure 15.2: Left: Background subtraction for motion detection. Right: Object detection for local-
izing and determining types of objects.

Background subtraction is a method that can help find motion areas in a video stream. In
order to successfully apply background subtraction, we need to make the assumption that our
background is mostly static and unchanging over consecutive frames of a video. Then, we can
model the background and monitor it for substantial changes. The changes are detected and
marked as motion. You can observe background subtraction in action in the following chapters:

• Hobbyist Bundle:

• Chapter 9 first introduces background subtraction as a method for monitoring birds.


• Chapter 13 uses background subtraction with an infrared camera video stream to
monitor for nocturnal wildlife.
• Chapter 14 demonstrates background subtraction in a web streaming application.
• Chapter 19 takes advantage of background subtraction for people counting. This
method, while less accurate than HOG-based or deep learning-based object detec-
tion, was efficient enough to run in realtime on a Raspberry Pi 3B+.

• Chapter 20 utilizes background subtraction for traffic counting.

• Hacker Bundle:

• Chapter 9 performs background subtraction to segment hand gestures from a static
background.

15.5.5 Object Detection with MobileNet SSD

Object detection with the pre-trained MobileNet SSD enables localization and recognition of 20
everyday classes.

In this chapter we will use the pretrained model to detect people and vehicles.

Object detection with the MobileNet SSD is covered in the following chapters of the Hacker
Bundle:

• Chapter 8 uses MobileNet SSD to find people and animals in multiple RPi client video
streams so you can locate them in any frame streamed via ImageZMQ.

• Chapter 13 improves upon Chapter 19 of the Hobbyist Bundle using MobileNet SSD for
accurate and fast people counting with the addition of OpenVINO and the Movidius NCS
for inference.

I’ve written about the MobileNet SSD a number of times on PyImageSearch, so be sure to
refer to these articles for more practical usages of MobileNet: http://pyimg.co/o64vu

If you are interested in training your own Faster R-CNNs, SSDs, or RetinaNet object detec-
tion models, you may refer to the ImageNet Bundle of Deep Learning for Computer Vision with
Python (http://pyimg.co/dl4cv) [50].

15.5.6 Twilio SMS Alerts

Text message alerts are not only useful, but are a lot of fun and are a great way to show off
your projects while you’re out having drinks with friends. Be sure to check out the following
chapters involving Twilio SMS/MMS alerts:

• Hobbyist Bundle:

• Chapter 10 introduces the code templates for working with Twilio notifications.
• Chapter 11 includes a project that sends SMS notifications when your mailbox is
opened.

• Hacker Bundle:

• Chapter 5 alerts you via MMS when an unknown face has entered the door of your
home.
• Chapter 9 sends alerts to your phone when someone enters the wrong hand gesture
code in front of your camera.

15.6 New Concepts

15.6.1 IoT Smart Lighting

Figure 15.3: Left: Smart plug. Center: Smart light bulb. Right: Smart switch.

In this chapter we will put IoT lighting devices to work for us. A Raspberry Pi will activate a
light in the home so that the camera can see our face for facial recognition. It may also serve
as a deterrent to an unknowing intruder who thinks a person turned on the light.

There are many IoT lights on the market, but a lot of them are so secure that you can’t
easily work with them using Python. The TP-Link brand of IoT lights has a known security
vulnerability [86] and the folks at SoftSCheck reverse engineered the communication protocol.
SoftSCheck discovered the vulnerability, responsibly disclosed it to the TP-Link engineering
team, and published the WireShark capture files and Python scripts on GitHub [87].

Later, the GadgetReactor user on GitHub posted an API that supports more devices [88].
Luckily for us, the lights are easily turned on and off using this Python API. TP-Link’s “Kasa” line
of lighting products (Figure 15.3) with the vulnerability (i.e. compatible with pyHS100) include:

• Smart plugs (HS series)

• Smart switches (HS series)

• Smart bulbs (LB and KL series)

You may purchase any of the lighting devices listed on the companion website — they all
work with the API, though some may require slight modifications.

We will use the pyHS100 project to interface with our lights by IP address. The Python
package is already installed on the preconfigured .img that accompanies this book [89]. If you
are not using the PyImageSearch RPi .img, you may simply install the package via pip in your
virtual environment:

$ pip install pyHS100

The following demo program, iot_light_demo.py, will toggle your light on and off:

1 # import necessary packages
2 from pyHS100 import SmartBulb
3 from pyHS100 import SmartPlug
4 import argparse
5 import sys
6
7 # construct the argument parser and parse the arguments
8 ap = argparse.ArgumentParser()
9 ap.add_argument("-i", "--ip-address", required=True,
10 help="IP address of the smart device")
11 ap.add_argument("-t", "--type", required=True,
12 choices=["plug", "switch", "bulb"], help="type of smart device")
13 args = vars(ap.parse_args())

Lines 2-5 import packages, namely pyHS100. The preconfigured Raspbian .img that
accompanies this book includes the package. It is also pip-installable via pip install
pyHS100.

Lines 8-13 parse the --ip-address and smart device --type. You’ll need to follow the
instructions that come with your TP-LINK product to connect it to your network. You can find
the IP address by looking at your DHCP client table on your router. The type can be any one
of the choices listed on Line 12 ("plug", "switch", or "bulb" as shown in Figure 15.3).
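
If you would rather not dig through your router's DHCP table, pyHS100 also ships a network discovery helper. A minimal sketch, assuming your installed version includes the Discover class (check the project's README if the import fails):

# broadcast a discovery query and print the IP address of each Kasa device
from pyHS100 import Discover

for ip, device in Discover.discover().items():
    print(ip, device.alias)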

With our known IP address and device type, we’re now ready to instantiate the device:

15 # check if the device type is plug or switch
16 if args["type"] in ("plug", "switch"):
17 # instantiate a smart plug/switch object
18 device = SmartPlug(args["ip_address"])
19
20 # otherwise, we are using a bulb
21 elif args["type"] == "bulb":
22 # instantiate a smart bulb object
23 device = SmartBulb(args["ip_address"])

If the --type is either a plug or a switch, Lines 16-18 initialize the device. Otherwise, a
bulb is initialized via Lines 21-23; bulbs use a slightly different API for a different feature set.

From here, we’ll query the device status and request user input:

25 # print the current state of the device
26 print("[INFO] current state of the device: {}".format(device.state))
27
28 # begin a loop over user input
29 while True:
30 # grab user input
31 val = input("[INPUT] Enter 'on', 'off', or 'exit' here: ")

Line 26 prints the device state (either on or off) to the terminal.

Line 29 begins a user input loop. Inside the loop, we first request user input via the
terminal. The input prompt asks the user to specify “on”, “off”, or “exit”.

Next we’ll process the user input and print the device status:

33 # check if the user wants to turn on the device
34 if val == "on":
35 device.turn_on()
36
37 # or the user wants to turn off the device
38 elif val == "off":
39 device.turn_off()
40
41 # or the user wants to exit the program
42 elif val == "exit":
43 print("[INFO] exiting the application (the light remains in" \
44 " the current state")
45 sys.exit(0)
46
47 # otherwise, the user has entered invalid input
48 else:
49 print("[ERROR] invalid input received, please try again")
50 continue
51
52 # print a success message if the device state matches desired state
53 if device.state == val.upper():
54 print("[SUCCESS] device state equals desired state of {}".format(
55 val.upper()))
56
57 # otherwise print a failure message when the state does not match
58 else:
59 print("[FAILURE] device state not equal to desired state of" \
60 " {}".format(val.upper()))

If the input val is "on" we turn_on the device (Lines 34 and 35). Conversely, if the
val is "off" we turn the device off (Lines 38 and 39).

Upon the "exit" command, the Python program will print a message and quit (Lines
42-45).

Lines 48-50 continue in the event of invalid input.

Lines 53-60 print a success or failure message depending on if the desired on/off state
matches the actual state of the device.

To run the program, simply pass the IP address and type via command line argument:

$ python iot_light_demo.py --ip-address 192.168.1.20 --type plug
[INFO] current state of the device: OFF
[INPUT] Enter 'on', 'off', or 'exit' here: on
[SUCCESS] device state equals desired state of ON
[INPUT] Enter 'on', 'off', or 'exit' here: off
[SUCCESS] device state equals desired state of OFF
[INPUT] Enter 'on', 'off', or 'exit' here: on
[SUCCESS] device state equals desired state of ON
[INPUT] Enter 'on', 'off', or 'exit' here: exit
[INFO] exiting the application (the light remains in the current state)

15.6.2 State Machines

State machines, also known as “finite-state machines”, often simplify computer application
logic and design. During the design phase of a system that has multiple states of operation,
the developers and designers are forced to consider the various modes of operation (i.e. states)
a system can be in.

Consider matter. Matter can be in any one of four states: solid, liquid, gas, or plasma.
There’s nothing in between. The matter is able to transition between the four states, but can
only actually be in any one state at any given time.

Figure 15.4: State machine which mimics a common traffic light (not the entire intersection). All
state machines have initial states. Some state machines have final states. You could argue either
way whether a traffic light has a final state, but consider power failure, a final state in which all
lights are off. When power comes back, the light will go to its initial state of all lights off until the
controller is ready to route traffic through the intersection. At that point its next state would either
be "Red" (as shown) or "Green".

Now consider a computer program such as one that controls a traffic light. We know that
a standard traffic light has three states: “go” (green), “stop” (red), or “caution” (yellow/amber).
Sometimes there may be a protected turn arrow as well. Those states represent the three/four
possible states from one perspective of a traffic intersection. You will never see both the red
and green lights on at the same time in the case of an operational traffic light as shown in
Figure 15.4. The state can transition from "stop" to "go", however.

Let’s take this example a step further. Take into account the entire intersection. The system
controlling the intersection usually has a minimum of four incoming lines of traffic to route
through. Induction sensors in the road and/or cameras mounted above the intersection monitor
for cars waiting to pass through the intersection. There are also crosswalks and buttons for
pedestrians to request to cross the intersection. The state machine quickly grows to handle
all the lanes of traffic and directions vehicles can go in as well as crosswalks. There are also
individual states of each array of lights just as in Figure 15.4.

Without a logical state machine, not only would you struggle to understand all the conditional
statements in your code, but the next traffic engineer would have no idea how to read the program
just to make a change so that the intersection routes vehicles more efficiently after an additional
turn lane has been built.
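
To make the idea concrete, here is a minimal sketch (not the chapter's code) of the Figure 15.4 traffic light as a tiny state machine built from an Enum and a transition table:

from enum import Enum

class LightState(Enum):
    OFF = 0      # initial state (e.g., right after a power failure)
    RED = 1
    GREEN = 2
    YELLOW = 3

# each state maps to the single state that follows it
TRANSITIONS = {
    LightState.OFF: LightState.RED,
    LightState.RED: LightState.GREEN,
    LightState.GREEN: LightState.YELLOW,
    LightState.YELLOW: LightState.RED,
}

state = LightState.OFF
for _ in range(5):
    state = TRANSITIONS[state]
    print(state)     # RED, GREEN, YELLOW, RED, GREEN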

In this chapter, we have two Raspberry Pis. Each RPi will operate its own state machine.
Information will be exchanged via message passing sockets when one Pi is alerting the other
Pi that it has completed an action. The other Pi then jumps into a new state and begins a
separate task.

We’ll review the state machine for each of our RPis in the next section.

15.7 Our IoT Case Study Project

In this section, we will begin by reviewing each of our two RPis’ state machines and how
they interact. From there we’ll dive into each separate project structure. We’ll then review the
separate JSON configurations.

Next, we’ll dive into the driver script for the Pi aimed out the window at our driveway (i.e.
“driveway-pi”). Similarly we’ll walk through the driver script for the face recognition door
monitor Pi (i.e. “home-pi”).

We’ll wrap up by learning how to execute and test our system and reviewing tips and sug-
gestions for developing similar projects to this one.

15.7.1 Independent State Machines

As shown in Figure 15.5 our state machines are simplified as compared to Figure 15.4. To
simplify the drawing, each initial state is marked as state 0. There is no final state — it is
expected that this program will run forever.

The state machines run independently in their own Python scripts on separate Raspberry
Pis.

But how does one Python application trigger a state machine on a separate system?

To accomplish state change triggers, we will implement a message passing function/process
into both driver scripts. The implementation is near-identical; however, the difference will be in
which messages are sent/received and which state transitions they trigger. We will name the
function exchange_messages and the process messageProcess.
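
Here is a hedged sketch (not the chapter's actual implementation) of what such a process could look like, using a ZMQ PULL socket for incoming messages and a process-safe multiprocessing.Value to hold the current state; the port, message string, and state numbering are assumptions for illustration:

import multiprocessing
import zmq

def exchange_messages(state, bind_address="tcp://*:5566"):
    # hypothetical PULL socket; the other Pi would PUSH its messages here
    context = zmq.Context()
    socket = context.socket(zmq.PULL)
    socket.bind(bind_address)

    while True:
        message = socket.recv_string()

        # a detected person/vehicle moves us out of the waiting state
        if message == "object_detected":
            state.value = 1   # e.g., the "1: Face Recognition" state

if __name__ == "__main__":
    # process-safe integer holding the current state (0 = waiting)
    state = multiprocessing.Value("i", 0)
    p = multiprocessing.Process(target=exchange_messages, args=(state,))
    p.daemon = True
    p.start()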

For the "driveway-pi" (Figure 15.5, left), the heart of the program is in the "0: Looking For
Objects" state. When the "driveway-pi" finds a person or car, it will send a message to the
"home-pi" and wait until the "home-pi" sends further instructions.

Similarly, the bulk of functionality for the "home-pi" (Figure 15.5, right) lies in the "1: Face
Recognition" state. The "home-pi" starts in a "0: Waiting for Object" state in which it needs a
message from the "driveway-pi" indicating that either a person or car has been detected. At
that point it will turn on the light, wait for the door to open, and perform face recognition (the "1:
Face Recognition" state).

Figure 15.5: Left: The "driveway-pi" state machine. Right: The "home-pi" state machine. The
state machines run independently in their own Python scripts on separate Raspberry Pis. A
near-identical function/process will run on each RPi called exchange_messages, responsible
for sending/receiving messages and changing states accordingly.

Figure 15.6 demonstrates the exchange_messages function/process and the two types
of messages our Pis will be configured to send/receive. To keep our code blocks short, no
validation is performed on the messages. If you were to add more types of messages,
you would, of course, need validation (i.e. conditional if/elif/else statements) to determine
which message is received and what action to take (i.e. changing to a different state).

If you are adding functionality to this system (i.e. additional Pis with new responsibilities), I
highly encourage you to think about the states of the system and how states will transition.

You should sketch your flowchart/state machine and the messages your RPis and any other
computers will exchange on a blank sheet of paper. Make iterations until you are comfortable
working on the driver scripts.

As an example, maybe you’ll have a third Pi that monitors when your home’s garage door is
open/closed as Jeff Bass discussed at PyImageConf 2018 [20]. Or maybe you’ll have a Pi that
monitors for dogs roaming on your property when you are not home — if a picture is delivered

Figure 15.6: Each Raspberry Pi in our security system has an exchange_messages function/pro-
cess. This process is able to send/receive messages and change states at any time. All states
are process-safe variables.

to your smartphone, maybe you can let your neighbor know that their dog escaped!

In any of these cases, you may need more RPis, more states, and more types of messages
exchanged among the RPis. You would also add message validation as we did in our message
passing example (Section 3.4.3).
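
For reference, a minimal sketch of what that validation could look like inside an
exchange_messages-style loop is shown below. The "garage door open" message, the
STATE_CHECK_GARAGE state, and the immediate "ack" reply are hypothetical simplifications
and are not part of this chapter's code:

from multiprocessing import Value
import zmq

# hypothetical states used only for this sketch
STATE_WAITING_FOR_OBJECT = 0
STATE_FACE_RECOGNITION = 1
STATE_CHECK_GARAGE = 2

def exchange_messages_with_validation(port, STATE):
    # bind a reply socket, just as the receiving Pi does
    context = zmq.Context()
    socket = context.socket(zmq.REP)
    socket.bind("tcp://*:{}".format(port))

    while True:
        # receive a message and validate it before changing state
        message = socket.recv().decode("ascii")

        if message == "start face reco":
            STATE.value = STATE_FACE_RECOGNITION
        elif message == "garage door open":
            STATE.value = STATE_CHECK_GARAGE
        else:
            # unexpected message, so ignore it rather than changing state
            print("[INFO] ignoring unrecognized message: '{}'".format(message))

        # acknowledge so the REQ client is free to send its next message
        socket.send(b"ack")

if __name__ == "__main__":
    STATE = Value("i", STATE_WAITING_FOR_OBJECT)
    exchange_messages_with_validation(5558, STATE)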

15.7.2 Project Structure

Each Pi will run an independent but interworking Python application. Inside the chapter code
folder are two subdirectories: (1) driveway-pi/, and (2) home-pi/. Let’s review the con-
tents of each subdirectory now.

15.7.2.1 Pi #1: “driveway-pi”

|-- config
| |-- config.json
|-- pyimagesearch
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
|-- MobileNetSSD_deploy.caffemodel
|-- MobileNetSSD_deploy.prototxt
|-- detect.py

Our configuration settings for “driveway-pi” are stored in config.json and parsed by the
Conf class in conf.py.
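
The Conf class itself is not reviewed in this chapter. Conceptually it behaves like a read-only
dictionary built from a JSON file that may contain // comments. The following is a simplified
sketch of that idea only; the actual implementation in pyimagesearch/utils/conf.py may differ
(for example, it could rely on a JSON-minifying helper rather than a regular expression):

import json
import re

class Conf:
    def __init__(self, conf_path):
        # strip // comments before handing the text to the JSON parser
        # (note: this naive regex would also break string values that
        # contain "//", which is fine for a sketch but not for production)
        raw = open(conf_path).read()
        raw = re.sub(r"//.*", "", raw)
        self.conf = json.loads(raw)

    def __getitem__(self, key):
        # allow dictionary-style access, e.g. conf["confidence"]
        return self.conf.get(key, None)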

The detect.py script uses MobileNet SSD to detect people and vehicles that are present
in your driveway. Depending on when these objects are detected, a message is sent to the
“home-pi” to turn on lights and perform facial recognition.

15.7.2.2 Pi #2: “home-pi”

|-- cascade
| |-- haarcascade_frontalface_default.xml
|-- config
| |-- config.json
|-- face_recognition
| |-- encode_faces.py
| |-- train_model.py
|-- messages
| |-- abhishek.mp3
| |-- adrian.mp3
| |-- dave.mp3
| |-- mcCartney.mp3
| |-- unknown.mp3
|-- output
| |-- encodings.pickle
| |-- le.pickle
| |-- recognizer.pickle
|-- pyimagesearch
| |-- notifications
| | |-- __init__.py
| | |-- twilionotifier.py
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
|-- create_voice_msgs.py
|-- door_monitor.py

As you can see, there is a lot going on in the “home-pi” project tree. Let’s break it down.

Facial recognition consists of a step to encode faces (encode_faces.py generates
encodings.pickle). From there, a machine learning model is trained on top of the encodings
via train_model.py which produces the recognizer and label encoder (recognizer.pickle
and le.pickle).

Creating voice messages is a process that is conducted after you have trained your facial
recognition model. The create_voice_msgs.py script reads the label encoder file to grab
the names of the individuals the system can recognize. This script produces .mp3 text-to-
speech files for each person in the messages/ directory.
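
Under the hood, a script like create_voice_msgs.py boils down to looping over the names
stored in the label encoder and calling the Google text-to-speech engine once per person. The
following is a hypothetical, simplified sketch; the file paths, the "unknown" handling, and the
language handling are assumptions, and the actual script is the reference:

from gtts import gTTS
import pickle
import os

# load the label encoder to discover which names the recognizer knows about
le = pickle.loads(open("output/le.pickle", "rb").read())
os.makedirs("messages", exist_ok=True)

# generate a welcome message for each known person
for name in le.classes_:
    # the "unknown" class (if present) gets its own warning message below
    if name == "unknown":
        continue
    tts = gTTS(text="Welcome home {}".format(name), lang="en")
    tts.save(os.path.join("messages", "{}.mp3".format(name)))

# generate the warning played for unrecognized faces
warning = ("I don't recognize you, I am alerting the homeowners "
           "that you are in their house.")
gTTS(text=warning, lang="en").save(os.path.join("messages", "unknown.mp3"))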

Door monitoring (door_monitor.py) is very similar to Chapter 5, but with added fea-
tures including a state machine, socket connection for reading and sending status messages,
and IoT light control. Existing features include detecting when the door is opened, performing
face detection (haarcascade_frontalface_default.xml), face recognition
(recognizer.pickle), playing audio files, and sending SMS messages via Twilio (twilionotifier.py)
to the homeowner when an intruder is detected.

15.7.3 Config Files

Each project has its own respective config file as mentioned in the previous section. Let's go
ahead and review both of them now.

15.7.3.1 Pi #1: “driveway-pi”

Open up driveway-pi/config/config.json and inspect the contents:

1 {
2 // home pi ip address and port number
3 "home_pi": {
4 "ip": "HOME_PI_IP_ADDRESS",
5 "port": 5558
6 },
7
8 // path to the object detection model
9 "model_path": "MobileNetSSD_deploy.caffemodel",
10
11 // path to the prototxt file of the object detection model
12 "prototxt_path": "MobileNetSSD_deploy.prototxt",
13
14 // variable indicating minimum threshold confidence
15 "confidence": 0.5,
16
17 // boolean variable used to indicate if frames should be displayed
18 // to our screen
19 "display": true
20 }

The “driveway-pi” must know the IP address and port of the “home-pi” (Lines 3-6). Be
sure to replace "HOME_PI_IP_ADDRESS" with the IP of the “home-pi” which is set up for face
recognition.

The pretrained MobileNet SSD object detector file paths are shown on Lines 9-12. The
"confidence" threshold is currently set to 50% via Line 15.

It is highly recommended that you set "display": true and set up a VNC connection
while you’re testing your system. Once you are satisfied with the operation of the system, you
can set "display": false. To learn more about working remotely with your RPi, including
VNC, be sure to refer to this article: http://pyimg.co/tunq7 [71].

15.7.3.2 Pi #2: “home-pi”

The majority of settings that you need to tune are held inside the “home-pi” configuration. Go
ahead and inspect home-pi/config/config.json now:

1 {
2 // type of smart device being used
3 "smart_device": "smart_plug",
4
5 // smart device ip address
6 "smart_device_ip": "YOUR_SMART_DEVICE_IP_ADDRESS",
7
8 // port number used to communicate with driveway pi
9 "port": 5558,

Line 3 is the type of smart IoT lighting device that you have on your network. Provided
you are using a TP-Link Kasa device, your options are "smart_plug", "smart_switch",
or "smart_bulb".

Line 9 holds the "port" number used to communicate with the “driveway-pi”. This num-
ber should match the port number in the “driveway-pi” configuration. The “driveway-pi” will
be responsible for connecting to our “home-pi”, so the “home-pi” does not need to know the
“driveway-pi”’s IP address.

Now we’ll review a selection of file paths:

11 // path to OpenCV's face detector cascade file
12 "cascade_path": "cascade/haarcascade_frontalface_default.xml",
13
14 // path to extracted face encodings file
15 "encodings_path": "output/encodings.pickle",
16
17 // path to face recognizer model
18 "recognizer_path": "output/recognizer.pickle",
19
20 // path to label encoder file
21 "le_path": "output/le.pickle",

Lines 12-21 include the paths to our Haar cascade face detector, face encodings file, facial
recognition model, and label encoder.

The majority of the following settings are related to the door status and face recognition:

23 // boolean variable used to indicate if frames should be displayed
24 // to our screen
25 "display": true,
26
27 // variable used to store the door contour area threshold percentage
28 "threshold": 0.12,
29
30 // number of consecutive frames to look for a face once the door
31 // is open
32 "look_for_a_face": 60,
33
34 // number of consecutive frames to recognize a particular person
35 // to reach a conclusion
36 "consec_frames": 1,
37
38 // number of frames to skip once a person is recognized
39 "n_skip_frames": 900,
40
41 // amount of time for which the home pi sleeps before messaging
42 // driveway pi to start detecting objects
43 "sleep_time": 300,

Again, for testing and demonstration purposes, it is recommended that you set "display":
true (Line 25) and establish a VNC connection to this Pi. You can and should establish two
VNC connections — one from your laptop to the “driveway-pi” and one from your laptop to the
“home-pi”. Being able to see the output of these two video feeds allows for monitoring each
system’s state machine and other frame annotations.

The door area "threshold" is set to 12% of the frame (Line 28). You can adjust this
value depending on (1) how close your camera is to the door, and (2) the frame resolution you
are using. Both of these factors impact the relative size of the doorway in the frame.

Facial recognition settings are shown on Lines 32-39:

• "look_for_a_face": The number of consecutive frames to look for a face after which
it has been determined that the door is open.

• "consec_frames": The number of consecutive frames required for a person to be
recognized. In other words, if a face is recognized as “Adrian” for 3 consecutive frames,
our system will confidently assume that "Adrian" is entering the house.

• "n_skip_frames": Once a person is recognized we will skip this number of frames
prior to checking for the door to be opened when the occupant is leaving the house.

On Line 43 the "sleep_time" represents the number of seconds for which the “home-pi”
sleeps before messaging the “driveway-pi” to begin detecting objects. It is currently set to 300
seconds (5 minutes).

Let’s go ahead and review our text-to-speech settings:

45 // see available languages in your terminal with `gtts-cli --all`
46 // shows accents/dialects after the hyphen.
47 // ex: en-gb would be the English language with British accent
48 "lang": "en",
49 "accent": "us",
50
51 // path to messages directory
52 "msgs_path": "messages",
53
54 // two variables, one to store the message used for homeowner(s),
55 // and a second to store the message used for intruders
56 "welcome_sound": "Welcome home ",
57 "intruder_sound": "I don't recognize you, I am alerting the" \
58 "homeowners that you are in their house.",

I’ve chosen the English language and United States accent (Lines 48 and 49) for the
Google text-to-speech engine. Follow the instructions in the comment on Lines 45-47 to see
the available languages and accents/dialects.

All text-to-speech files should be stored in the path specified by "msgs_path" (Line 52).

Our "welcome_sound" and "intruder_sound" configurations on Lines 56-58 include
the strings that will be used to generate our text-to-speech audio files.

The remaining configurations should look quite familiar to you for S3 file storage and Twilio
MMS settings:

60 // variables to store your aws account credentials
61 "aws_access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
62 "aws_secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
63 "s3_bucket": "YOUR_S3_BUCKET_NAME",
64
65 // variables to store your twilio account credentials
66 "twilio_sid": "YOUR_TWILIO_SID",
67 "twilio_auth": "YOUR_TWILIO_AUTH_ID",
68 "twilio_to": "YOUR_PHONE_NUMBER",
69 "twilio_from": "YOUR_TWILIO_PHONE_NUMBER",
70
71 // message sent to the owners when a intruder is detected
72 "message_body": "There is an intruder in your house."
73 }

S3 will be used for temporary storage of the image file that will be included with an MMS
message (Lines 61-63). Twilio settings including your ID numbers, phone number, and desti-
nation number must be populated on Lines 66-69. The text message body is shown on Line
72.

Be sure to spend a few minutes familiarizing yourself with all configurations. You may need
to make adjustments later.

15.7.4 Driver Script for Pi #1: “driveway-pi”

Take a moment now to refer to Section 15.7.1 and the figures in that section in which we
learned about the responsibilities and states for Pi #1, known as the “driveway-pi”.

The “driveway-pi” has one main responsibility: to monitor the area outside your home
where cars or people will be present.

Of course, if you don’t have a driveway, it could monitor a walkway or hallway by your
apartment. Maybe you are using this project at your business to monitor employees coming in
and to validate that they are not intruders attempting to steal physical or electronic goods and
records.

Recall that the secondary responsibility of the “driveway-pi” is to inform the “home-pi” (Pi
#2) that a person or vehicle has been detected. It is up to the “home-pi” what to do with that
information (we know it involves turning on a light and performing facial recognition when the
door opens).

Without further ado, let’s dive into the code.

Go ahead and open a new file named detect.py in the driveway-pi directory and insert
the following code:

1 # import the necessary packages
2 from pyimagesearch.utils import Conf
3 from imutils.video import VideoStream
4 from multiprocessing import Process
5 from multiprocessing import Value
6 import numpy as np
7 import argparse
8 import signal
9 import time
10 import cv2
11 import sys
12 import zmq

Lines 2-12 import our packages and modules. Namely, we will use Process and Value
from Python’s multiprocessing module. Additionally, we will utilize zmq for message pass-
ing via sockets.

Let’s define our global states:

14 # set variables representing different application states as global
15 global STATE_LOOKING_FOR_OBJECTS
16 global STATE_SENDING_MESSAGE
17 global STATE_WAITING_FOR_MESSAGE

Lines 15-17 indicate that each of our states are global variables. We’ll refer to these states
a number of times, so become familiar with them and refer to Section 15.7.1 as needed.

Now we’ll define our exchange_messages function which we’ll later implement as a sep-
arate Python process:

19 def exchange_messages(conf, STATE):
20 # create a container for all sockets in this process
21 context = zmq.Context()
22
23 # establish a socket to talk to home pi
24 print("[INFO] connecting to the home pi...")
25 socket = context.socket(zmq.REQ)
26 socket.connect("tcp://{}:{}".format(conf["home_pi"]["ip"],
27 conf["home_pi"]["port"]))

Line 19 defines the exchange_messages function which accepts two parameters:

• conf: Our configuration dictionary is passed directly to this function/process so that all
configuration variables can be accessed.

• STATE: Depending on the current state of the “driveway-pi” we’ll either be sending a mes-
sage or waiting to receive a message. This exchange_messages function will handle
communication depending on either of those conditions.

Lines 21-27 initialize our socket connection to the “home-pi” server via its IP address and
port.

Let’s create an infinite loop for exchanging our messages:

29 # tell the home pi to start face recognition application
30 message = "start face reco"
31
32 # loop until the quit flag is set
33 while True:
34 # if this is a message sending state
35 if STATE.value == STATE_SENDING_MESSAGE:
36 # notify the home pi and set the skip flag
37 print("[INFO] notifying home pi...")
38 socket.send(message.encode("ascii"))
39 STATE.value = STATE_WAITING_FOR_MESSAGE
40 print("[INFO] sending '{}' message to home pi...".format(
41 message))
42
43 # otherwise, current state is to wait for a message
44 elif STATE.value == STATE_WAITING_FOR_MESSAGE:
45 # wait for a response from the home pi and reset the
46 # object detected flag
47 print("[INFO] waiting for message from home pi...")
48 response = socket.recv().decode("ascii")
49 print("[INFO] received '{}' from home pi...".format(
50 response))
51 STATE.value = STATE_LOOKING_FOR_OBJECTS

The only message that “driveway-pi” will ever send to “home-pi” is "start face reco"
on Line 30. This message indicates both (1) “driveway-pi” has detected a person or vehicle,
and (2) it is time for “home-pi” to begin face recognition.

Line 33 begins an infinite loop to monitor the STATE and act accordingly. Inside the loop, if
the current STATE is:

• STATE_SENDING_MESSAGE, then we go ahead and send the message over the socket
connection to “home-pi” followed by changing the state to STATE_WAITING_FOR_MESSAGE
(Lines 35-41).

• STATE_WAITING_FOR_MESSAGE, then we’ll receive a response when it is available and


change the current state to STATE_LOOKING_FOR_OBJECTS (Lines 44-51).

• STATE_LOOKING_FOR_OBJECTS, then neither conditional is met so nothing happens in


this process.

With our message exchanging process coded and ready, now let’s define our signal handler
and parse command line arguments:

53 # function to handle keyboard interrupt
54 def signal_handler(sig, frame):
55 print("[INFO] You pressed `ctrl + c`! Closing face recognition" \
56 " door monitor application...")
57 sys.exit(0)
58
59 # construct the argument parser and parse the arguments
60 ap = argparse.ArgumentParser()
61 ap.add_argument("-c", "--conf", required=True,
62 help="Path to the input configuration file")
63 args = vars(ap.parse_args())
64
65 # load the configuration file
66 conf = Conf(args["conf"])

The signal_handler simply monitors for ctrl + c keypresses from the user upon
which the application exits (Lines 54-57).

All of our settings are in our configuration file; Lines 60-63 parse the command line argu-
ments for the path to the config (--conf). Line 66 then loads the configuration into memory.

With our configuration in hand, now let’s perform initializations related to our MobileNet SSD
object detector:

68 # initialize the list of class labels MobileNet SSD was trained to
69 # detect, then generate a set of bounding box colors for each class
70 CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
71 "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
72 "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
73 "sofa", "train", "tvmonitor"]
74 COLORS = np.random.uniform(0, 255, size=(len(CLASSES), 3))
75
76 # load our serialized model from disk
77 print("[INFO] loading model...")
78 net = cv2.dnn.readNetFromCaffe(conf["prototxt_path"],
79 conf["model_path"])
80
81 # initialize the frame dimensions (we'll set them as soon as we read
82 # the first frame from the video)
83 H = None
84 W = None

Lines 70-84 initialize our pretrained MobileNet SSD's CLASSES and random annotation
COLORS, load the serialized model from disk, and initialize the frame dimensions.

Next we will assign finite constant values to our global states:

86 # initialize variables representing different application states
87 STATE_LOOKING_FOR_OBJECTS = 0
88 STATE_SENDING_MESSAGE = 1
89 STATE_WAITING_FOR_MESSAGE = 2
90
91 # state description lookup table
92 DRIVEWAY_PI_STATES = {
93 STATE_LOOKING_FOR_OBJECTS: "looking for a person/car",
94 STATE_SENDING_MESSAGE: "sending message to home pi",
95 STATE_WAITING_FOR_MESSAGE: "waiting for message from home pi"
96 }
97
98 # initialize shared variable used to set the STATE
99 STATE = Value("i", STATE_LOOKING_FOR_OBJECTS)

Our FSM consists of three states (we reviewed them in Section 15.7.1). Lines 87-89 assign
each state an integer value.

Lines 92-96 create a dictionary of DRIVEWAY_PI_STATES which is a string lookup table.
The strings will be annotated in the top corner of the video frame so you can monitor the state
of the system.

Line 99 sets the initial value of STATE to STATE_LOOKING_FOR_OBJECTS. Our “driveway-
pi” will begin its job by looking out the window for objects including people and cars.

It is now time to initialize our message exchanging process:

101 # start the message process
102 messageProcess = Process(target=exchange_messages,
103 args=(conf, STATE))
104 messageProcess.daemon = True
105 messageProcess.start()

Lines 102-105 initialize our messageProcess with the exchange_messages function as
the target. Both our configuration (conf) and current STATE will always be available to the
messageProcess as “process-safe” variables. This means that if STATE changes in the main
Python process, the fresh value is also available in messageProcess.
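
If the idea of a "process-safe" variable is new to you, the following standalone toy example
(not part of the chapter's code) shows that a multiprocessing.Value updated in the main
process is immediately visible inside a child process, which is exactly how STATE is shared with
messageProcess:

from multiprocessing import Process, Value
import time

def watcher(state):
    # poll the shared state and report whenever it changes
    last = state.value
    while True:
        if state.value != last:
            print("[watcher] state changed to {}".format(state.value))
            last = state.value
        time.sleep(0.1)

if __name__ == "__main__":
    state = Value("i", 0)
    p = Process(target=watcher, args=(state,))
    p.daemon = True
    p.start()

    # the child process sees each of these updates without any explicit messaging
    for new_state in (1, 2, 0):
        time.sleep(1.0)
        state.value = new_state

    time.sleep(1.0)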

Let’s initialize our camera stream and begin looping over frames.

107 # initialize the video stream and allow the camera sensor to warmup
108 print("[INFO] warming up camera...")
109 # vs = VideoStream(src=0).start()
110 vs = VideoStream(usePiCamera=True, resolution=(608, 512)).start()
111 time.sleep(2.0)
112
113 # signal trap to handle keyboard interrupt
114 signal.signal(signal.SIGINT, signal_handler)
115 print("[INFO] Press `ctrl + c` to exit, or 'q' to quit if you have" \
116 " the display option on...")
117
118 # loop over the frames of the stream
119 while True:
120 # grab the next frame from the stream
121 frame = vs.read()
122
123 # if the frame dimensions are empty, set them
124 if W is None or H is None:
125 (H, W) = frame.shape[:2]

Line 110 initializes our PiCamera video stream with the specified resolution.

Line 114 sets the signal trap to capture ctrl + c events.

Beginning on Line 119, we loop over incoming video frames. Lines 124 and 125 extract
frame dimensions.

The main process of execution handles a single state and sets the message sending state
when appropriate for messageProcess to take care of. Let’s see what happens when the
STATE.value == STATE_LOOKING_FOR_OBJECTS:

127 # check if object detection process must be skipped
128 if STATE.value == STATE_LOOKING_FOR_OBJECTS:
129 # convert the frame to a blob and pass the blob through the
130 # network and obtain the detections
131 blob = cv2.dnn.blobFromImage(frame, 0.007843, (300, 300),
132 127.5)
133 net.setInput(blob)
134 detections = net.forward()
135
136 # loop over the detections
137 for i in np.arange(0, detections.shape[2]):
138 # extract the confidence (i.e., probability) associated
139 # with the prediction
140 confidence = detections[0, 0, i, 2]
141
142 # filter out weak detections by ensuring the `confidence`
143 # is greater than the minimum confidence
144 if confidence > conf["confidence"]:
145 # extract the index of the class label from the
146 # `detections`, then compute the (x, y)-coordinates of
147 # the bounding box for the object
148 idx = int(detections[0, 0, i, 1])
149 box = detections[0, 0, i, 3:7] * np.array([W, H, W, H])
150 (startX, startY, endX, endY) = box.astype("int")

When our “driveway-pi” is looking for objects in the driveway, we will perform three basic
tasks:

i. Object detection to find people and cars.

ii. Change of the state to STATE_SENDING_MESSAGE so that messageProcess can act to
send a message to “home-pi”.

iii. Annotation of the objects in the frame.

Lines 131-134 perform inference with MobileNet SSD. Lines 137-150 loop over the detec-
tions, ensure the confidence threshold is met and extract the bounding box coordinates of the
object.

Let’s check the class of the object we have found:

152 # check if the detected object is a person or a car
153 if CLASSES[idx] in ["person", "car"]:
154 print("[INFO] {} has been detected...".format(
155 CLASSES[idx]))
156
157 # check if we are still in the looking for
158 # objects state
159 if STATE.value == STATE_LOOKING_FOR_OBJECTS:
160 # change the state to sending message
161 STATE.value = STATE_SENDING_MESSAGE
162
163 # draw the prediction on the frame
164 label = "{}: {:.2f}%".format(CLASSES[idx],
165 confidence * 100)
166 cv2.rectangle(frame, (startX, startY), (endX,
167 endY), COLORS[idx], 2)
168 y = startY - 15 if startY - 15 > 15 else startY + 15
169 cv2.putText(frame, label, (startX, y),
170 cv2.FONT_HERSHEY_SIMPLEX, 0.5, COLORS[idx], 2)

If either a "person" or "car" is detected (Line 153), we have an important message to
send, so Line 161 updates the STATE accordingly.

Lines 164-170 annotate the class label and bounding box of the person or car on the
frame.

Let’s close out our “driveway-pi” detection script:

172 # draw the state information on to the frame
173 info = "STATE: {}".format(DRIVEWAY_PI_STATES[STATE.value])
174 cv2.putText(frame, info, (10, 15), cv2.FONT_HERSHEY_SIMPLEX, 0.6,
175 (0, 255, 0), 2)
176
177 # display the frame and record keypresses, if required
178 if conf["display"]:
179 cv2.imshow("frame", frame)
180 key = cv2.waitKey(1) & 0xFF
181
182 # if the `q` key is pressed, break from the loop
183 if key == ord("q"):
184 break
185
186 # terminate the message process
187 messageProcess.terminate()
188 messageProcess.join()
189
190 # do a bit of cleanup
191 vs.stop()

Lines 173-175 annotate the corner of the frame with our STATE string which comes directly
from the DRIVEWAY_PI_STATES dictionary.

Lines 178-184 display our frame and capture keypresses. If q is pressed, we break out of
the frame processing loop and perform cleanup.

15.7.5 Driver Script for Pi #2: “home-pi”

Take a moment now to recall the operation of the “home-pi”. The “home-pi” is positioned such
that when a known person or intruder enters the door of your home they are recognized.

The “home-pi” is responsible for communicating with the “driveway-pi” to know when it
should begin face recognition. The “home-pi” is dormant when it is assumed that no occu-
pants are inside the home (i.e. it has detected via background subtraction that a person has
left the home).

The “home-pi” also has full control over lights in the house. In this project, we only turn on
and off a single light; however, you should feel free to hack the code to turn on multiple lights if
you so choose.

With all that said, it is now time to code up the driver script for our "home-pi". Go ahead and
open a new file named door_monitor.py in the home-pi directory and insert the following
code:

1 # import the necessary packages
2 from pyimagesearch.notifications import TwilioNotifier
3 from pyimagesearch.utils import Conf
4 from imutils.video import VideoStream
5 from multiprocessing import Process
6 from multiprocessing import Value
7 from datetime import datetime
8 from pyHS100 import SmartPlug
9 from pyHS100 import SmartBulb
10 import face_recognition
11 import numpy as np
12 import argparse
13 import imutils
14 import pickle
15 import signal
16 import time
17 import cv2
18 import sys
19 import zmq
20 import os

Lines 2-20 import our packages. Namely, we will use our custom TwilioNotifier to
send SMS messages, Process and Value to perform multiprocessing, and pyHS100
to work with smart plugs and smart bulbs. We’ll also take advantage of Adam Geitgey’s
face_recognition package. The zmq library will enable message passing between our
“home-pi” and “driveway-pi”.

Now that our imports are taken care of, let’s make our four states global:

22 # set variables representing different application states as global
23 global STATE_WAITING_FOR_OBJECT
24 global STATE_FACE_RECOGNITION
25 global STATE_SKIP_FRAMES
26 global STATE_PERSON_LEAVES

Be sure to review Section 15.7.1 for a diagram and explanation of the “home-pi” state ma-
chine.

Let’s define our exchange_messages function — it is quite similar to the sister function in
the “driveway-pi” script:

28 def exchange_messages(conf, STATE):
29 # create a container for all sockets in this process
30 context = zmq.Context()
31
32 # establish a socket for incoming connections
33 print("[INFO] creating a socket...")
34 socket = context.socket(zmq.REP)
35 socket.bind("tcp://*:{}".format(conf["port"]))
36
37 # loop until the quit flag is set
38 while True:
39 # if current state is waiting for the driveway pi to detect an
40 # object and notify the home pi
41 if STATE.value == STATE_WAITING_FOR_OBJECT:
42 # receive a message from the driveway pi and set the state
43 print("[INFO] waiting for message from driveway pi...")
44 message = socket.recv().decode("ascii")
45 STATE.value = STATE_FACE_RECOGNITION
46 print("[INFO] received '{}' message from driveway" \
47 "pi...".format(message))
48
49 # if the door was opened and no face was detected then the
50 # person has left the house
51 elif STATE.value == STATE_PERSON_LEAVES:
52 # sleep for the set period and message the driveway pi
53 # to start detecting objects
54 time.sleep(conf["sleep_time"])
55 reply = "start monitoring driveway"
56 socket.send(reply.encode("ascii"))
57 print("[INFO] sent '{}' message to driveway pi...".format(
58 reply))

Lines 28-35 begin our exchange_messages function/process. Two process safe vari-
ables, conf and STATE, are passed as parameters. The server connection is bound to the
"port" specified in the config. Once the client (i.e. the “driveway-pi”) connects, we’ll fall into
the while loop beginning on Line 38.

Inside our loop we will either be:

• STATE_WAITING_FOR_OBJECT: Waiting for the "start face reco" message from
"driveway-pi" at which point we'll set the STATE to STATE_FACE_RECOGNITION (Lines
41-47). No validation is conducted on the message since there is only one possible
message to receive. Yes, this does make our server vulnerable to attack since anyone
can connect to it and begin sending the message, causing the application to go haywire.
Security is beyond the scope of our server, enabling us to focus strictly on the Computer
Vision algorithms and IoT communication. It is assumed that you and only you have control
over the devices on your network.

• STATE_PERSON_LEAVES: When the homeowner leaves the house, we send a reply
message to “driveway-pi” to "start monitoring driveway" (Lines 51-58). No state
changes are made by this process at this point. Rather, the main process of execution
will handle turning off the light and setting the next state.

Next, we’ll set up our signal handler, load our config, and initialize our Twilio notifier:

60 # function to handle keyboard interrupt
61 def signal_handler(sig, frame):
62 print("[INFO] You pressed `ctrl + c`! Closing face recognition" \
63 " door monitor application...")
64 sys.exit(0)
65
66 # construct the argument parser and parse the arguments
67 ap = argparse.ArgumentParser()
68 ap.add_argument("-c", "--conf", required=True,
69 help="Path to the input configuration file")
70 args = vars(ap.parse_args())
71
72 # load the configuration file and initialize the Twilio notifier
73 conf = Conf(args["conf"])
74 tn = TwilioNotifier(conf)

Lines 61-64 define our signal_handler function to handle ctrl + c keypresses.

Lines 67-70 parse the --conf command line argument and Line 73 loads the configura-
tion.

Line 74 initializes our TwilioNotifier object, tn.
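
The TwilioNotifier class lives in pyimagesearch/notifications/twilionotifier.py and
is not reviewed here. Conceptually, it writes the frame to a temporary image, uploads it to the
S3 bucket from the config, and then sends an MMS whose media URL points at that upload. The
following is a rough, hypothetical sketch of that idea only and is not the book's implementation;
the class name and details are assumptions:

from twilio.rest import Client
import tempfile
import boto3
import cv2
import os

class SimpleTwilioNotifier:
    def __init__(self, conf):
        # store the config and create the S3 and Twilio clients
        self.conf = conf
        self.s3 = boto3.client("s3",
            aws_access_key_id=conf["aws_access_key_id"],
            aws_secret_access_key=conf["aws_secret_access_key"])
        self.client = Client(conf["twilio_sid"], conf["twilio_auth"])

    def send(self, frame):
        # write the frame to a temporary JPEG file
        path = os.path.join(tempfile.gettempdir(), "intruder.jpg")
        cv2.imwrite(path, frame)

        # upload the image to S3 and make it publicly readable so that
        # Twilio can fetch it as the MMS media attachment
        key = "intruder.jpg"
        self.s3.upload_file(path, self.conf["s3_bucket"], key,
            ExtraArgs={"ACL": "public-read", "ContentType": "image/jpeg"})
        url = "https://{}.s3.amazonaws.com/{}".format(
            self.conf["s3_bucket"], key)

        # send the MMS to the homeowner
        self.client.messages.create(
            to=self.conf["twilio_to"], from_=self.conf["twilio_from"],
            body=self.conf["message_body"], media_url=[url])

In the chapter's code, all the driver script needs to do is call tn.send(frame).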

With our configuration loaded, we can instantiate our face detector and face recognizer
objects:

76 # load the actual face recognition model, label encoder, and face
77 # detector
78 recognizer = pickle.loads(open(conf["recognizer_path"], "rb").read())
79 le = pickle.loads(open(conf["le_path"], "rb").read())
80 detector = cv2.CascadeClassifier(conf["cascade_path"])

Our face recognizer consists of a Support Vector Machine (SVM) model serialized as a
.pickle file. Our label encoder contains the names of the house occupants our recognizer
can distinguish between.

Be sure to refer to Section 15.5.3 which refers to the face recognition chapters in this
volume so that you can train your face recognizer. We will not be reviewing
face_recognition/encode_faces.py or face_recognition/train_model.py in this chapter as they
have already been reviewed.
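
As a quick refresher, the core of train_model.py boils down to fitting a label encoder and a
probability-enabled SVM on the 128-d face encodings and then serializing both to disk. The
following condensed sketch assumes encodings.pickle stores a dictionary with "encodings"
and "names" keys; the script in face_recognition/ is the reference implementation:

from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
import pickle

# load the face encodings and names produced by encode_faces.py
data = pickle.loads(open("output/encodings.pickle", "rb").read())

# encode the person names as integer labels
le = LabelEncoder()
labels = le.fit_transform(data["names"])

# train an SVM with probability estimates enabled so that we can call
# predict_proba() in the door monitor
recognizer = SVC(C=1.0, kernel="linear", probability=True)
recognizer.fit(data["encodings"], labels)

# serialize the recognizer and label encoder to disk
with open("output/recognizer.pickle", "wb") as f:
    f.write(pickle.dumps(recognizer))
with open("output/le.pickle", "wb") as f:
    f.write(pickle.dumps(le))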

We will use Haar cascades as our face detector (Line 80). If you prefer to use a deep
learning face detector, you should add a coprocessor such as the Movidius NCS or Google
Coral to conduct face detection. Be sure to refer to Section 15.5.2 for referrals to chapters
involving face detection alternatives.

Let's go ahead and initialize a handful of housekeeping (pun intended) variables which are
key to door open/closed detection, face recognition, and control flow of our door monitor:

82 # initialize the MOG background subtractor object
83 mog = cv2.bgsegm.createBackgroundSubtractorMOG()
84
85 # initialize the frame area and boolean used to determine if the door
86 # is open or closed
87 frameArea = None
88 doorOpen = False
89
90 # initialize previous and current person name to None, then set the
91 # consecutive recognition count to zero
92 prevPerson = None
93 curPerson = None
94 consecCount = 0
95
96 # initialize skip frame counter
97 skipFrameCount = 0

Line 83 initializes the MOG background subtractor object. We will use background sub-
traction as the first step in determining doorOpen status. Line 87 initializes the frameArea
constant which will be calculated once we know our frame dimensions. Depending on the ratio
of the door contour to the frameArea we will be able to determine doorOpen (Line 88) status.

Lines 92-94 initialize the previous and current person names to None and set the consecu-
tive same person count to 0. When the curPerson == prevPerson we will be incrementing
consecCount; the consecutive count will be compared to the threshold set in our configuration
file ("consec_frames").

Line 97 initializes skipFrameCount, a counter which will be compared to "n_skip_frames"
in the configuration for purposes of resetting variables and updating states.

From there, we’ll initialize our IoT lighting device:

99 # check if the device type is set to smart plug or smart switch
100 if conf["smart_device"] == "smart_plug" or \
101 conf["smart_device"] == "smart_switch":
102 # instantiate a smart plug/switch object
103 device = SmartPlug(conf["smart_device_ip"])
104
105 # otherwise, we are using a smart bulb
106 elif conf["smart_device"] == "smart_bulb":
107 # instantiate a smart bulb object
108 device = SmartBulb(conf["smart_device_ip"])

Lines 100-108 initialize a smart lighting device. Refer to Section 15.6.1 for a full guide on
turning smart lights on and off.
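
If you would like to sanity check your smart device on its own before running the full
application, a quick interactive test along these lines should do (assuming a TP-Link Kasa plug
reachable at the IP address from your config):

from pyHS100 import SmartPlug
import time

# connect to the plug using the IP address from config.json
plug = SmartPlug("YOUR_SMART_DEVICE_IP_ADDRESS")
print("[INFO] device is currently on: {}".format(plug.is_on))

# toggle the device so you can visually confirm the light responds
plug.turn_on()
time.sleep(3.0)
plug.turn_off()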

Let’s establish our Finite State Machine (FSM):

110 # initialize variables representing different application states
111 STATE_WAITING_FOR_OBJECT = 0
112 STATE_FACE_RECOGNITION = 1
113 STATE_SKIP_FRAMES = 2
114 STATE_PERSON_LEAVES = 3
115
116 # state description lookup table
117 HOME_PI_STATES = {
118 STATE_WAITING_FOR_OBJECT: "waiting for objects",
119 STATE_FACE_RECOGNITION: "turn on light and perform face recognition",
120 STATE_SKIP_FRAMES: "skip frames and reset background subtraction model",
121 STATE_PERSON_LEAVES: "home owner has left the house"
122 }
123
124 # initialize shared variable used to set the STATE
125 STATE = Value("i", STATE_WAITING_FOR_OBJECT)

Lines 111-114 assign unique integer values to our four states. Lines 117-122 define
HOME_PI_STATES, a dictionary holding a string description associated with each state.

Line 125 then initializes the state machine to the STATE_WAITING_FOR_OBJECT state.
Be sure to refer to Section 15.7.1 for an explanation and diagram showing the overview of our
“home-pi” state machine.

Let’s kick off our messageProcess (just like the one we set up for “driveway-pi”):

127 # start the message process
128 messageProcess = Process(target=exchange_messages,
129 args=(conf, STATE))
130 messageProcess.daemon = True
131 messageProcess.start()

And from there we’ll initialize our video stream, signal trap, and begin looping over frames:

133 # initialize the video stream and allow the camera sensor to warmup
134 print("[INFO] warming up camera...")
135 # vs = VideoStream(src=0).start()
136 vs = VideoStream(usePiCamera=True).start()
137 time.sleep(2.0)
138
139 # signal trap to handle keyboard interrupt
140 signal.signal(signal.SIGINT, signal_handler)
141 print("[INFO] Press `ctrl + c` to exit, or 'q' to quit if you have" \
142 " the display option on...")
143
144 # loop over the frames of the stream
145 while True:
146 # grab the next frame from the stream and resize it
147 frame = vs.read()
148 frame = imutils.resize(frame, width=600)
149
150 # if we haven't calculated the frame area yet, calculate it
151 if frameArea == None:
152 frameArea = (frame.shape[0] * frame.shape[1])

Line 136 initializes our PiCamera video stream. Line 140 sets our signal trap to handle
keyboard interrupt.

We then begin looping over frames on Line 145. A frame is grabbed and resized (Lines
147 and 148).

Lines 151 and 152 then calculate the frameArea, a key component for calculating our
door ratio soon.

Now it’s time to handle the face recognition state:

154 # check if the state is face recognition state
155 if STATE.value == STATE_FACE_RECOGNITION:
156 # check if the device is off
157 if not device.is_on:
158 # turn on the device and sleep for 5 seconds (so that BS
159 # doesn't start until the camera has adjusted to the
160 # change in lighting)
161 print("[INFO] smart device turned ON...")
162 device.turn_on()
163 time.sleep(5.0)

The first task in the face recognition state is to turn_on the smart device (i.e. our light)
if it isn’t already on (Lines 157-163). The light will (1) allow the camera in our house to “see”
someone’s face, and (2) deter intruders from entering the home in the first place.

Let’s try to determine if our door is open:

165 # if the door is closed, monitor the door using background
166 # subtraction
167 elif not doorOpen:
168 # convert the frame to grayscale and smoothen it using a
169 # gaussian kernel
170 gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
171 gray = cv2.GaussianBlur(gray, (13, 13), 0)
172
173 # calculate the mask using MOG background subtractor
174 mask = mog.apply(gray)
175
176 # find contours in the mask
177 cnts = cv2.findContours(mask.copy(), cv2.RETR_EXTERNAL,
178 cv2.CHAIN_APPROX_SIMPLE)
179 cnts = imutils.grab_contours(cnts)
180
181 # check to see if at least one contour is found
182 if len(cnts) >= 1:
183 # sort the contours in descending order based on their
184 # area and calculate the percentage of contour area
185 # w.r.t to frame area
186 cnts = sorted(cnts, key=cv2.contourArea,
187 reverse=True)[0]
188 contourToFrame = (cv2.contourArea(cnts) / frameArea)
189
190 # if the *percentage* of contour area w.r.t. frame is
191 # greater than the threshold set then set the door as
192 # open and record the start time of this event
193 if contourToFrame >= conf["threshold"]:
194 print("[INFO] door is open...")
195 doorOpen = True
196 startTime = datetime.now()

Lines 167-196 monitor the frame to determine the door open/closed status using the fol-
lowing method:

• Background subtraction is used to detect motion (Lines 170-179).

• We grab the largest motion contour and calculate the contourToFrame ratio (Lines
182-188).

• If the contourToFrame ratio exceeds the "threshold" (in our config), then the door
is marked as open (doorOpen) and a timestamp is made (Lines 193-196).

Otherwise the door is already open, so let’s handle that case:

198 # if the door is open then:
199 # 1) run face recognition for a pre-determined period of time.
200 # 2) if no face is detected in step 1 then it's an intruder
201 elif doorOpen:
202 # compute the number of seconds difference between the
203 # current timestamp and when the motion threshold was
204 # triggered
205 delta = (datetime.now() - startTime).seconds
206
207 # run face recognition for pre-determined period of time
208 if delta <= conf["look_for_a_face"]:
209 # convert the input frame from (1) BGR to grayscale
210 # (for face detection) and (2) from BGR to RGB (for
211 # face recognition)
212 gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
213 rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
214
215 # detect faces in the grayscale frame
216 rects = detector.detectMultiScale(gray,
217 scaleFactor=1.1, minNeighbors=5, minSize=(30, 30))
218
219 # OpenCV returns bounding box coordinates in (x,y,w,h)
220 # order but we need them in (top, right, bottom, left)
221 # order for dlib, so we need to do a bit of reordering
222 box = [(y, x + w, y + h, x) for (x, y, w, h) in rects]

If the door is open then we will run face recognition for a predetermined amount of time. It
may be a person we recognize or it may be someone “unknown” (i.e. an intruder). Or if no face
is detected, then it is an intruder (possibly wearing a mask so the face isn’t detected at all).

Line 201 begins the case where the door is already open and it extends until Line 289.

First, the delta time is calculated from when the door was initially opened (Line 205). If
delta is less than our config, "look_for_a_face", then we’ll perform face detection using
Haar cascades (Lines 208-222).

Now let’s perform face recognition:

224 # check if a face has been detected
225 if len(box) > 0:
226 # compute the facial embedding for the face
227 encodings = face_recognition.face_encodings(rgb,
228 box)
229
230 # perform classification to recognize the face
231 preds = recognizer.predict_proba(encodings)[0]
232 j = np.argmax(preds)
233 curPerson = le.classes_[j]
234
235 # draw the predicted face name on the image
236 (top, right, bottom, left) = box[0]
237 cv2.rectangle(frame, (left, top), (right,
238 bottom), (0, 255, 0), 2)
239 y = top - 15 if top - 15 > 15 else top + 15
240 cv2.putText(frame, curPerson, (left, y),
241 cv2.FONT_HERSHEY_SIMPLEX, 0.75,
242 (0, 255, 0), 2)

Line 225 ensures that at least one face box is found.

Lines 227 and 228 begin face recognition by extracting encodings for all faces in the image.
Lines 231-233 determine the current person’s name for the highest confidence face in the
image. We then annotate the frame with the person’s name and a box around their face
(Lines 236-242).

Let’s perform some housekeeping:

244 # if the person recognized is the same as in the
245 # previous frame then increment the consecutive
246 # count
247 if prevPerson == curPerson:
248 consecCount += 1
249
250 # otherwise, reset the consecutive count as there
251 # was a difference between the class labels
252 else:
253 consecCount = 0
254
255 # set current person to previous person for the
256 # next iteration
257 prevPerson = curPerson

Lines 247-253 update the consecCount depending on if the previously recognized person
matches the current person or not.

Line 257 goes ahead and sets the prevPerson name for the next iteration.

Let’s wrap up the case that the door was already open:

259 # if a particular person is recognized for a given
260 # number of consecutive frames, we have reached a
261 # conclusion and alert/greet the person
262 # accordingly
263 if consecCount == conf["consec_frames"]:
264 # play the mp3 file according to the
265 # recognized person
266 print("[INFO] recognized {}...".format(
267 curPerson))
268 os.system("mpg321 --stereo {}/{}.mp3".format(
269 conf["msgs_path"], curPerson))
270
271 # check if the person is an intruder
272 if curPerson == "unknown":
273 # send the frame via Twilio to the home
274 # owner
275 tn.send(frame)
276
277 # reset the consecutive count and mark the
278 # door as close and now we start skipping
279 # next few frames
280 print("[INFO] door is closed...")
281 consecCount = 0
282 doorOpen = False
283 STATE.value = STATE_SKIP_FRAMES
284
285 # otherwise, no face was detected
286 else:
287 print("[INFO] no face detected...")
288 doorOpen = False
289 STATE.value = STATE_PERSON_LEAVES

If the person (either recognized or "unknown") meets the "consec_frames" threshold, we make a system
call to play either the “welcome” or “I don’t recognize you, I am alerting the homeowners” sound
wave (Lines 263-269). The sound played is based on the curPerson name and associated
.mp3 file.

Additionally, if the person is "unknown", Lines 272-275 send a Twilio MMS message to the
homeowner to alert them of the intruder.

Lines 281-283 then update the consecCount and doorOpen in addition to setting the
STATE to STATE_SKIP_FRAMES.

The final else block handles the case when no face was detected. Lines 286-289 mark
the door as closed and update the state to STATE_PERSON_LEAVES.

Line 289 concludes state machine logic for STATE_FACE_RECOGNITION.

Next we’ll handle both STATE_SKIP_FRAMES and STATE_PERSON_LEAVES:

291 # check if the state is either skip frames or person leaves
292 elif STATE.value in [STATE_SKIP_FRAMES, STATE_PERSON_LEAVES]:
293 # check if the skip frame count is less than the threshold set
294 if skipFrameCount < conf["n_skip_frames"]:
295 # increment the skip frame counter
296 skipFrameCount += 1
297
298 # if the required number of frames have been skipped then
299 # reset skip frame counter and reinitialize MOG
300 elif skipFrameCount >= conf["n_skip_frames"]:
301 skipFrameCount = 0
302 mog = cv2.bgsegm.createBackgroundSubtractorMOG()
303
304 # check if we are in the frame skipping state
305 if STATE.value == STATE_SKIP_FRAMES:
306 # set the next state
307 STATE.value = STATE_FACE_RECOGNITION
308
309 # otherwise, it's the person leaves state
310 elif STATE.value == STATE_PERSON_LEAVES:
311 # turn off the light and set the next state
312 print("[INFO] smart device turned OFF...")
313 device.turn_off()
314 STATE.value = STATE_WAITING_FOR_OBJECT

Lines 294-296 increment skipFrameCount if it is less than the threshold.

Otherwise, if the required number of frames have been skipped, then we reset
skipFrameCount and reinitialize our MOG background subtractor (Lines 300-302). If the state is
STATE_SKIP_FRAMES we update it to STATE_FACE_RECOGNITION (Lines 305-307). Or if the state is
STATE_PERSON_LEAVES, then we turn_off the light and set the state to
STATE_WAITING_FOR_OBJECT (Lines 310-314).

Let’s wrap up:

316 # draw the state information on to the frame
317 info = "STATE: {}".format(HOME_PI_STATES[STATE.value])
318 cv2.putText(frame, info, (10, 15), cv2.FONT_HERSHEY_SIMPLEX, 0.6,
319 (0, 255, 0), 2)
320
321 # if the *display* flag is set, then display the current frame
322 # to the screen and record if a user presses a key
323 if conf["display"]:
324 cv2.imshow("frame", frame)
325 key = cv2.waitKey(1) & 0xFF
326
327 # if the `q` key is pressed, break from the loop
328 if key == ord("q"):
329 break
330
331 # terminate the message process
332 messageProcess.terminate()
333 messageProcess.join()
334
335 # do a bit of cleanup
336 vs.stop()

Lines 317-319 annotate the STATE string in the top left corner of the frame.

Lines 323-325 display the frame and capture keypresses. If the q key is pressed, we
break from the loop and perform cleanup (Lines 328-336).

15.7.6 Deploying Our IoT Project

Prior to deployment, be sure to run through this checklist:

• “driveway-pi”:

Figure 15.7: IoT security system running and monitoring via my macOS desktop. Green: Separate
VNC sessions with OpenCV windows and terminals so I can monitor states. Red: Laptop webcam
feed so that I can record the live action despite a VNC delay. Blue: IP addresses of both RPis and
the smart plug for lamp control.

• Place one Pi and camera aimed outside where people and/or cars will trigger the
light. For me, this was a “low traffic” area where only the occupants of my home
typically park and enter. If someone else is there, they are likely an intruder.
• Ensure your Pi can see people and vehicles at all times of day. This can be accom-
plished with outdoor flood lighting, possibly infrared lighting, and potentially hacks to
the script to set camera parameters (Chapter 6 of the Hobbyist Bundle) based on
time of day.

• “home-pi”:

• Mount the Pi facing the doorway where the IoT light will illuminate the person’s face.
• Attach a speaker to the Pi for text-to-speech.
• Configure and test the smart IoT lighting device.

• Config files:

• Insert the IP address of the “home-pi” server into the “driveway-pi” config.
• Insert the IP address of the Smart IoT light device into the “home-pi” config.
• Other configuration settings can be made during testing/tuning.

• Face recognition:

• Refer to Chapters 5 and 14.
• Gather face image examples for you and the other occupants of your home.
• Encode faces.
• Train the face recognition model on the encodings.

Figure 15.8: Demonstration of the "driveway-pi" application running while aimed out the window at
the driveway. High resolution flowchart can be found here: http://pyimg.co/so3c2

From there, I recommend opening two VNC sessions from your laptop/desktop as shown
in Figure 15.7. One connection should be to the “driveway-pi” and the other connection should
be to the “home-pi”. The VNC connections allow you to easily see each terminal and OpenCV
video stream window. You could always hack the scripts to use ImageZMQ at another time.
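
Before launching the full applications, it can also help to verify that the two Pis can actually
reach each other over the configured port. The following is a minimal, hypothetical REQ/REP
smoke test (these helper scripts are not part of the chapter's code). Run the first snippet on the
"home-pi" and the second on the "driveway-pi", replacing HOME_PI_IP_ADDRESS with the real
address:

# rep_test.py (run on the "home-pi")
import zmq

context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:5558")  # same port as in the configs
print("[INFO] waiting for a test message...")
print("[INFO] received: {}".format(socket.recv().decode("ascii")))
socket.send(b"pong")

# req_test.py (run on the "driveway-pi")
import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://HOME_PI_IP_ADDRESS:5558")
socket.send(b"ping")
print("[INFO] reply: {}".format(socket.recv().decode("ascii")))

If you see "pong" printed on the "driveway-pi", message passing between the two devices is
working and you can move on to the driver scripts.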

We’re now ready to execute the driver scripts.

Run the “home-pi” first (from VNC) as shown:

$ python door_monitor.py --conf config/config.json
[INFO] warming up camera...
[INFO] creating a socket...
[INFO] Press 'ctrl + c' to exit, or 'q' to quit if you have the
display option on...
[INFO] door is open...
[INFO] recognized unknown...
...audio message is played
[INFO] door is closed...

Next, open the VNC window for your “driveway-pi” and start the application there:

$ python detect.py --conf config/config.json
[INFO] loading model...
[INFO] warming up camera...
[INFO] Press 'ctrl + c' to exit, or 'q' to quit if you have the
display option on...
[INFO] person has been detected...
[INFO] notifying home pi...
[INFO] sending 'start face reco' message to home pi...
[INFO] waiting for message from home pi...
[INFO] received 'start monitoring driveway' message from home pi...

As you can see in Figures 15.8 and 15.9, the states are being updated based on both what
is happening in the frame and the messages exchanged between the Pis.

The system works fairly reliably, but could be improved by faster object detection (i.e. Mo-
vidius NCS or Google Coral). Additionally the system does not work well when two people
have entered the house and one person later leaves the house. The system was designed
with simplicity in mind and will work best if only one of the house occupants is present in the
house at any given time. With additional logic these edge cases could be handled.

Ultimately, we need to ask ourselves:

Can an intruder circumvent the system?

Yes. If there is a will, there is a way. Intruders can fool any system, but typically they need
prior knowledge of its inner workings. For COTS systems, they can look up how systems work
online and/or purchase the devices to tinker with. For a custom Raspberry Pi system they are
unfamiliar with, it will be more challenging. If they pick up a copy of this book and learn about
how the system works, they may try to:

• Disconnect the power from the house by pulling the meter off the side of your house.

Most COTS security systems these days have battery backup and a cellular modem to
counteract this move.

• Wear a mask and slip in undetected (the “home-pi” only performs face recognition and
not “person” detection).

• Hack into your network to send your Pis messages (although they would first need
knowledge of the messages the Pis accept).

• Hack into your network and tinker with the lighting controls so they can enter in the dark-
ness.

• ...and the list goes on.

In other words, if an intruder wants your stuff or wants you, they will likely get what they
want if they are determined. That shouldn’t restrict you from working to engineer something
unique and great though!

15.8 Suggestions and Scope Creep

As with any case study and fully-fledged Raspberry Pi security project, you should be thinking
about how to ensure that your project is reliable and that it interoperates with other projects in
your home.

But before you consider adding features, ensure that a minimum set of features is working:
the Minimum Viable Product (MVP). The MVP should consist of components that you have
near full control over (i.e. any code you, yourself develop with Python for your Raspberry Pi).

As you integrate other APIs and devices, consider that those devices will undergo software
updates that may be out of your control – they could impact the operation of your system.

When you have your project working, it is time to integrate it with other IoT devices in
your home such as COTS systems including Amazon Alexa, thermostats, door lock actuators,
window/door/motion sensors, and more.

You may find that your options with Z-Wave wireless devices outnumber WiFi devices. Here
are some questions for you to consider as you build upon this project:

• Does your home alarm system have an API?

• Will facial recognition automatically disarm the alarm? Or is that too risky?

• What sensors and actuators can you integrate with easily?

Figure 15.9: Demonstration of the "home-pi" application running by an entry door. High resolution
flowchart can be found here: http://pyimg.co/y1e59

• Do they have a well documented API or are you relying on a vulnerability that may be
patched in the future? Remember, this chapter relies on an unofficial API for integrating
with TP-Link Kasa products. TP-Link could send security updates to their devices at any
time and then you will no longer have control over your lights.

• What other chapters of this book can you integrate into the codebase?

• Do you need to provide guests (i.e. non-intruders, but people who don't have their faces
trained in your model) with a gesture recognition password? Refer to Chapter 9 of the
Hacker Bundle on building a gesture recognition system. How will you integrate message
passing with your gesture recognition system? Will the gesture recognition system turn
off your home alarm system?

• Do you need to capture license plates of cars that enter your driveway? Refer to the
Gurus Module 6 for license plate detection and recognition.

• Do you have multiple outside zones to monitor?

• Can you use multiple cameras and image stitching? Refer to Chapter 15 of the Hobbyist
Bundle.

• Do you need two or three “driveway-pis” on different sides of your home? If so, how will
the “home-pi” handle potential messages from each of these Pis?

• Or will there be multiple “home-pis” each aimed at a different entry door of a large home?
If so, will you perform facial recognition at each one?

• Will your project be “night-ready”? Maybe you can rely on motion sensor flood lights
outside your home for the “driveway-pi”. Perhaps you can schedule camera parameter
updates to account for changes in ambient light. Or maybe you could implement a sec-
ondary camera specifically for night vision (i.e. infrared or thermal).

15.9 Summary

In this case study chapter, we brought together a number of concepts to build a multiple Pi
security system.

We worked with message passing via sockets to send commands between the Pis. We used
the pretrained MobileNet SSD object detector to find people and cars in frames. Our TP-Link
smart lighting devices were controlled with a simple API based on the state of our system. We
developed a state machine on each Pi to make our software easier to understand and easier
for future developers to add features. One of the states involved face recognition to see who
is entering the home. First we determined if the door is open via background subtraction and
contour processing. We welcomed the known home dweller or scared the unknown intruder
with an audible text-to-speech message. In the event that an intruder was there, we alerted the
homeowner with a Twilio text message.

This chapter served as the culmination of nearly all concepts covered in the first two vol-
umes of this book. Great job implementing this project!
Chapter 16

Your Next Steps

Congratulations! You have just finished the Hacker Bundle of Raspberry Pi for Computer Vi-
sion. Let’s take a second and review the skills you’ve acquired.

Inside this book you have:

• Discovered why applying deep learning to resource constrained devices is challenging.

• Learned how to use ImageZMQ and message passing programming paradigms to stream
frames from RPi clients to a central server for processing.

• Utilized message passing for non-image data, enabling you to build IoT applications.

• Built a security system capable of monitoring "restricted zones" using a cascade of back-
ground subtraction, the YOLO object detector, and ImageZMQ.

• Created a smart classroom attendance project capable of automatically taking attendance
for a class.

• Monitored the front door of your home and performed face recognition to identify the
people entering your house.

• Utilized the RPi, OpenVINO, and the Movidius NCS to detect vehicles in video streams,
track them, and apply speed estimation to compute each moving vehicle's speed in
MPH/KPH.

• Used deep learning on the RPi to recognize hand gestures.

• Helped prevent package theft by automatically detecting and recognizing delivery vehi-
cles.

• Discovered what the Intel Movidius NCS is and how we can use it to improve the speed
of deep learning inference/prediction on the RPi.

• Learned how to perform image classification using the Movidius NCS.


• Applied deep learning-based object detection via the Movidius NCS.

• Learned how to apply both deep learning-based face detectors and face recognizers at
the same time on a single NCS stick.

• Created a multi-RPi IoT project capable of monitoring your driveway for vehicles and then
communicating with a second RPi to monitor your front door for people entering your
home.
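
As a quick refresher on the frame-streaming pattern mentioned above, here is a minimal client sketch using the imagezmq and imutils packages, with the matching server shown in comments. The server address is a placeholder for your own network, and this is a simplified stand-in rather than the chapter's code.

# Client (runs on each RPi): stream frames to the central server.
import socket
import time

import imagezmq
from imutils.video import VideoStream

sender = imagezmq.ImageSender(connect_to="tcp://192.168.1.100:5555")
rpi_name = socket.gethostname()             # identify this Pi to the server
vs = VideoStream(usePiCamera=True).start()
time.sleep(2.0)                             # camera warm-up

while True:
    sender.send_image(rpi_name, vs.read())

# Server (runs on the central machine), shown as comments so the client
# above remains a single runnable script:
#   import cv2, imagezmq
#   hub = imagezmq.ImageHub()
#   while True:
#       rpi_name, frame = hub.recv_image()
#       cv2.imshow(rpi_name, frame)
#       cv2.waitKey(1)
#       hub.send_reply(b"OK")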

16.1 So, What's Next?

At this point you understand the fundamentals of applying deep learning on resource con-
strained devices such as the Raspberry Pi.

However, if you want to take a deeper dive into deep learning on embedded devices, I would
suggest you move on to the Complete Bundle.

The Complete Bundle covers how to:

• Train custom deep learning models with Caffe

• Deploy Caffe models to the Movidius NCS

• Train custom TensorFlow/Keras models

• Deploy your TensorFlow/Keras models to the NCS

• Utilize TensorFlow Lite in your own projects

• Apply Human Pose Estimation with TensorFlow Lite

• Perform image classification using pre-trained models with the Google Coral

• Utilize pre-trained object detection models with the Google Coral

• Train custom models for the Google Coral

• Deploy your custom models to the Google Coral

• Configure the NVIDIA Jetson Nano for embedded deep learning

• Use pre-trained image classifiers on the Jetson Nano

• Apply pre-trained object detectors to the Jetson Nano

• Train custom deep learning models for the Jetson Nano

• Deploy your custom models to the Jetson Nano



• Decide between the RPi, Movidius NCS, Google Coral, and Jetson Nano when confronted
with a new project

I hope you’ll allow me to continue to guide you on your journey. If you haven’t already picked
up a copy of the Complete Bundle, you can do so here:

http://pyimg.co/rpi4cv

And if you have any questions, feel free to contact me:

http://pyimg.co/contact

Cheers,

-Adrian Rosebrock