Adrian Rosebrock
Raspberry Pi for Computer Vision
Hacker Bundle - v1.0.1
pyimagesearch
The contents of this book, unless otherwise indicated, are Copyright ©2019 Adrian Rosebrock,
PyImageSearch.com. All rights reserved. Books like this are made possible by the time in-
vested by the authors. If you received this book and did not purchase it, please consider
making future books possible by buying a copy at https://www.pyimagesearch.com/raspberry-pi-for-computer-vision/ today.
© 2019 PyImageSearch
Thank you for picking up a copy of Raspberry Pi for Computer Vision! To accompany this book
I have created a companion website which includes:
• Frequently Asked Questions (FAQs) and their suggested fixes and solutions
Additionally, you can use the “Issues” feature inside the companion website to report any
bugs, typos, or problems you encounter when working through the book. I don’t expect many
problems; however, this is a brand new book, so I and the other readers would appreciate you
reporting any issues you run into. From there, I can keep the book updated and bug free.
http://pyimg.co/qnv89
Take a second to create your account now so you’ll have access to the supplementary
materials as you work through the book.
Chapter 1
Introduction
Inside the Hobbyist Bundle you learned about the Raspberry Pi, an affordable, yet powerful
embedded device given the size and cost of the machine.
You then discovered how to apply computer vision, image processing, and OpenCV to the
RPi to build real-world applications, including:
You should be proud of your accomplishments thus far — some of those projects were not
easy, but you rolled up your sleeves, put your head down, and learned how to apply computer
vision to the RPi.
That said, the Hobbyist Bundle did not touch on deep learning, one of the most important
advances in computer science in the past decade.
Deep learning has impacted nearly every facet of computer science, including computer vi-
sion, Natural Language Processing (NLP), speech recognition, social network filtering, bioinfor-
matics, drug design, and more — basically, if there is enough labeled data, deep learning
has likely (successfully) been applied to the field in some manner.
Today we see deep learning applied to computer vision tasks including image classification,
object detection, semantic/instance segmentation, face recognition, gait recognition, and much
more.
We instead focused on the fundamentals of applying fairly standard computer vision and
image processing algorithms to the RPi via the OpenCV library. We placed emphasis on the
basics to enable us to:
ii. Better understand the computational limitations of embedded devices even with basic
algorithms.
Deep learning algorithms, by their very nature, are incredibly computationally expensive
— and in order to apply them to the RPi (or other embedded devices) we need to be extremely
thoughtful regarding:
• Our model’s computational requirements (to ensure inference can be made in a rea-
sonable amount of time)
• Library optimizations, such as NEON, VFPV3, and OpenCL, which can be used to
improve inference time.
• Hardware add-ons, including the Movidius NCS or Google Coral USB Accelerator, which
can be used to push model computation to an optimized USB compute stick.
Inside the chapters in this bundle, we’ll take a deeper dive into the world of computer vision
on embedded devices. You’ll learn about more advanced computer vision algorithms, ma-
chine learning techniques, and how to apply deep learning on embedded devices (including
optimization tips, suggestions, and best practices). The techniques covered in this bundle are
much more advanced than what is covered in the Hobbyist Bundle — this is where you’ll start
to grow from a hobbyist into a true embedded device practitioner.
Before we start writing code to perform deep learning inference (including classification, detec-
tion, and segmentation) on the RPi, we first need to take a step back and understand why it’s
such a challenge to perform deep learning on resource constrained devices.
Having this perspective is not only educational, but it will better enable us to assess the right
libraries and tools for the job when building an application that leverages deep learning.
ii. Discover coprocessor devices, such as the Movidius NCS and Google Coral TPU Accel-
erator.
iii. Learn about dedicated development boards, including Google’s TPU Dev Board and the
NVIDIA Jetson Nano.
iv. Discuss whether or not the RPi is relevant for deep learning.
Deep learning algorithms have facilitated unprecedented levels of accuracy in computer vision
— but that accuracy comes at a price — namely, computational resources that embedded
devices tend to lack.
Table 2.1: Estimates of memory consumption and FLOP counts for seminal Convolutional Neural
Networks [1].
It’s no secret that training even modest deep learning models requires a GPU. Trying to train
even a small model on a CPU can take multiple orders of magnitude longer. And even at
inference, when the more computationally expensive backpropagation phase is not required, a
GPU is still often needed to obtain real-time performance. These computational requirements
put resource constrained devices, such as the RPi, at a serious disadvantage — how are they
supposed to leverage deep learning if they are so comparatively underpowered?
There are a number of problems at work here, and the first, which we touched on above, is
the complexity of the model and the required computation. Samuel Albanie, a researcher
at the prestigious University of Oxford, studied the amount of required computation for popular
CNNs (Table 2.1).
AlexNet [2], the seminal CNN architecture that helped jumpstart the latest resurgence in
deep learning research after its performance in the ImageNet 2012 competition, requires 727
MFLOPs.
The VGG family of networks [3] requires anywhere from 727 MFLOPs to 20 GFLOPs, de-
pending on which specific architecture is used.
ResNet [4, 5], arguably one of the most powerful and accurate CNNs, requires 2 GFLOPs
to 16 GFLOPs, depending on how deep the model is.
The original Raspberry Pi was released with a 700 MHz processor capable of approximately
0.041 GFLOPs [6]. The RPi 2 (Model B v1.1) included a quad-core Cortex-A7 running at 900 MHz and
1GB of RAM. It’s estimated that the RPi 2 is approximately 4x faster than the original RPi,
bringing computational power up to 0.164 GFLOPs.
The Raspberry Pi 3 was upgraded further, including an ARM Cortex-A53 running at 1.2 GHz,
giving it 10x the performance of the original RPi [6], or approximately 0.41 GFLOPs.
Now, take that 0.41 GFLOPs and compare it to Table 2.1 — note how the RPi is computa-
tionally limited compared to the number of operations required by state-of-the-art architectures
such as ResNet (2-16 GFLOPs). In order to successfully perform deep learning on the RPi, we’ll
need some serious levels of optimization.
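To put those numbers in perspective, here is a rough back-of-the-envelope sketch in Python that simply divides the FLOP counts quoted above by each Pi's approximate peak throughput. The figures are the same rough estimates cited in this section; real inference times will be considerably worse, since peak FLOPS ignore memory bandwidth, cache behavior, and how well the underlying kernels are optimized.

# Rough estimate: seconds per forward pass if the Pi could sustain its
# peak throughput (approximate figures quoted earlier in this chapter)
device_gflops = {
    "original RPi": 0.041,
    "RPi 2": 0.164,
    "RPi 3": 0.41,
}
model_gflops = {
    "AlexNet": 0.727,        # 727 MFLOPs
    "VGG (largest)": 20.0,
    "ResNet (deepest)": 16.0,
}

for device, peak in device_gflops.items():
    for model, cost in model_gflops.items():
        # time (seconds) = required GFLOPs / available GFLOPS
        print("{} on {}: ~{:.1f}s per inference".format(
            model, device, cost / peak))

Even under these optimistic assumptions, the deepest ResNet variant would need on the order of 40 seconds per frame on the RPi 3, which is exactly why the optimizations and coprocessors discussed below matter.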
So far we’ve discussed only computational requirements, but we haven’t assessed the
memory requirements (RAM).
The RPi 3 has 1GB of RAM while the RPi 4 has 1-4GB, depending on the model. However,
keep in mind that this RAM must hold not only your code and deep learning models, but
also the operating system and other processes running on the RPi. While the amount of RAM tends to be less of
a limitation than the CPU, it’s still worth considering when developing your own deep learning
applications on the RPi.
Embedded devices naturally do not draw as much power as laptops, desktops, or GPU-
enabled deep learning rigs — they simply are not designed to. And the less power a machine
draws, the less computationally powerful it tends to be.
For comparison purposes, an RPi 3B+ will draw about 3.5W while a full-blown desktop GPU
will draw 250W (Titan X) [7]. That’s not to mention the other peripherals drawing even more
power in the GPU machine — most Power Supply Units (PSUs) in such rigs are rated for 800W at
a minimum. For reference, my NVIDIA DIGITS DevBox has a 1350W power supply to power
4x Titan X GPUs and all the other components (http://pyimg.co/hsozv) [8].
As you can see, we need to make some special accommodations to accomplish any deep
learning on the RPi.
A common misconception I see from deep learning practitioners new to the world of embedded
devices is that they (incorrectly) think they can/should train their models on the RPi.
In short — don’t train your deep learning models on the RPi (or other embedded
devices). Instead, you should:
i. Train the model on a more powerful machine (e.g., your laptop, desktop, or GPU rig).
ii. Serialize the trained model weights to disk.
iii. Transfer the model to the RPi (e.g., FTP, SFTP, etc.).
Using the steps above you can alleviate the need to actually train the model on the RPi and
instead use the RPi for just making predictions.
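As a concrete (if minimal) sketch of that inference-only workflow, the snippet below assumes a model has already been trained elsewhere, serialized as a Caffe prototxt/caffemodel pair, and copied to the Pi (e.g., via SFTP). The file names, input size, and mean values are placeholders; substitute your own model's details.

import cv2

# the model was trained and serialized on a desktop/GPU machine, then
# copied to the Pi -- the Pi only runs the forward pass
net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "model.caffemodel")

image = cv2.imread("test_image.jpg")

# preprocess into a blob (the size and mean values must match how the
# model was trained; the values here are only an example)
blob = cv2.dnn.blobFromImage(cv2.resize(image, (224, 224)), 1.0,
    (224, 224), (104.0, 117.0, 123.0))

# a single forward pass: no backpropagation, no training on the Pi
net.setInput(blob)
preds = net.forward()
print("[INFO] top prediction index: {}".format(preds[0].argmax()))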
All that said, there are certain situations where training a model on the RPi can make
practical sense — most of these use cases involve taking a pre-trained model and then fine-
tuning it on a small amount of data available on the embedded device itself. Those situations
are few and far between though — my personal recommendation would be to operate under
the assumption that you should not be training on the RPi unless you have a very explicit
reason for doing so.
There are times when the Raspberry Pi CPU itself will not be sufficient for deep learning and
computer vision tasks. If and when those situations arise, we can utilize coprocessors to
perform deep learning inference.
i. Loading a deep learning model into memory and onto the coprocessor
ii. Using the RPi CPU for polling frames from a video stream.
iii. Utilizing the CPU for any preprocessing (resizing, channel order swapping, etc.).
iv. Passing the preprocessed frame to the coprocessor for inference.
v. Post-processing the results from the coprocessor and then continuing the process with
the CPU.
Coprocessors are designed specifically with deep learning inference in mind; the two most
popular are Intel’s Movidius Neural Compute Stick (NCS) and Google Coral’s TPU
USB Accelerator.
Intel’s NCS is a USB thumb drive sized deep learning coprocessor (Figure 2.1). You plug the
USB stick into your RPi and then access it via the NCS2 SDK and/or the OpenVINO toolkit,
the latter of which is recommended as it can be used directly inside OpenCV with only one or
two extra function calls.
The NCS can run between 80-150 GFLOPs while drawing just over 1W of power [9], enabling embedded
devices, such as the RPi, to run state-of-the-art neural networks.
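To give a sense of what "one or two extra function calls" looks like, here is a minimal sketch assuming an OpenVINO-enabled build of OpenCV and a Caffe detection model (the file names and the 300x300 input size are placeholders):

import cv2

# load a (placeholder) detection model with OpenCV's dnn module
net = cv2.dnn.readNetFromCaffe("deploy.prototxt", "model.caffemodel")

# these two calls route the forward pass through the Inference Engine
# backend and onto the Myriad VPU inside the NCS
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_INFERENCE_ENGINE)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

image = cv2.imread("test_image.jpg")
blob = cv2.dnn.blobFromImage(image, size=(300, 300), swapRB=True)
net.setInput(blob)
detections = net.forward()

Everything else in your pipeline stays the same; remove the two setPreferable* calls and the exact same code falls back to the RPi CPU.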
Later in this text we’ll learn how to use the NCS for deep learning inference on the RPi.
Figure 2.1: Left: Intel’s Movidius Neural Compute Stick. Right: Google Coral USB Accelerator.
Similar to the Movidius NCS, the Google Coral USB Accelerator (Figure 2.1, right) is also
a coprocessor device that plugs into the RPi via a USB port — and just like the NCS, it is
designed only for inference (i.e., you wouldn’t train a model with either the Coral or NCS,
although Google’s documentation does show how to fine-tune models on small datasets using
the Coral [10]).
Google reports that its Coral line of products is over 10x faster than the NCS; however,
there is a bit of a caveat when using the Coral USB Accelerator with the Raspberry Pi — to
obtain such speeds the Coral USB Accelerator requires USB 3. The problem is that the RPi 3B+ only
has USB 2!
Unfortunately, having USB 2 instead of USB 3 reduces our inference speed by roughly 10x, which
essentially means that the NCS and Coral USB Accelerator will perform very similarly on the
Raspberry Pi 3/3B+.
However, with that said, the Raspberry Pi 4 does have USB 3 — using USB 3, the
Coral USB Accelerator obtains much faster inference than the current iteration of the
NCS.
Just like there are times when you may need a coprocessor to obtain adequate throughput
speeds for your deep learning and computer vision pipeline, there are also times where you
may need to abandon the Raspberry Pi altogether and instead utilize a dedicated develop-
ment board.
Currently, there are two frontrunners in the dedicated deep learning board market — the
Google Coral TPU Dev Board and the NVIDIA Jetson Nano.
Figure 2.2: Left: Google Coral Dev Board. Right: NVIDIA Jetson Nano.
Unlike the TPU USB Accelerator, the Google Coral TPU Dev Board (Figure 2.2, left) is actually
a dedicated board capable of up to 32-64 GFLOPs, far more powerful than both the TPU USB
Accelerator and the Movidius NCS.
The device itself utilizes the highly optimized implementation of TensorFlow Lite, capable
of running object detection models such as MobileNet V2 at 100+ FPS in a power efficient
manner.
The downside of the TPU Dev Board is that it can only run TensorFlow Lite models at these
speeds — you won’t be able to take existing, off-the-shelf pre-trained models (such as Caffe
models) and then directly use them with the TPU Dev Board.
Instead, you would first need to convert these models (which may or may not be possible)
and then deploy them to the TPU Dev Board.
I’ll be covering the TPU Dev Board in depth inside the Complete Bundle of this text.
Competing with Google Coral’s TPU Dev Board is the NVIDIA Jetson Nano (Figure 2.2, right).
The Jetson Nano includes a 128-core Maxwell GPU, a quad-core ARM Cortex-A57 processor run-
ning at 1.43 GHz, and 4GB of 64-bit LPDDR4 RAM. The Nano can provide 472 GFLOPs with
only 5-10W of power.
What I really like about the Nano is that you are not restricted in terms of deep learning
packages and libraries — there are already TensorFlow and Keras builds that are optimized
for the Nano.
At this point it’s impossible to tell which dev board is going to win out in the market — it’s
highly dependent on not only the marketing of each respective company, but more importantly,
their documentation and support as well. NVIDIA’s Jetson TX1 and TX2 series, while powerful,
were incredibly challenging to utilize from a user experience perspective. I believe the Nano is
correcting many of the mistakes of the TX series and will ultimately make for a much more
pleasant user experience, provided NVIDIA continues down the road it is on now.
If you decide you would like to explore coprocessor additions to the Raspberry Pi, then I would
definitely take a look at the Movidius NCS and Google Coral USB Accelerator.
The NCS has come a long way since its original v1 release, and with the OpenVINO toolkit,
it’s incredibly easy to use with your own applications. That said, OpenVINO does lock you
down a bit to using OpenCV, so if you’re looking to utilize TensorFlow or Keras models, I would
instead recommend the Google Coral USB Accelerator.
As for dedicated dev boards, right now I’m partial to the NVIDIA Jetson Nano. While the Coral
Dev Board is extremely fast, I found the Jetson Nano easier to use. I also enjoyed not being
locked down to just TensorFlow and Keras models.
ii. "And if so, why would Adrian ever write a book about the RPi?”
Those are two fair questions — and if you know me at all, you know my answer is that it’s
all about bringing the right tool to the job.
When we encounter situations where the RPi CPU just isn’t enough, we’ll then be able to
lean on dedicated coprocessors such as the Google Coral USB TPU Accelerator and Intel’s
Movidius NCS. And when we need a dedicated device that’s both faster than the RPi and more
suitable for deep learning, we have both the Google Coral Dev board and NVIDIA Jetson Nano.
As I said, performing deep learning on the RPi is far from irrelevant — we just need to
understand the limitations of what we can and cannot do.
The rest of this text (as well as the Complete Bundle) will show you both practical projects
and the limitations of the RPi through real-world applications, including situations where you
should utilize either a USB coprocessor or a dedicated dev board.
2.7 Summary
In this chapter you learned about some of the problems surrounding deep learning on resource
constrained devices, namely:
• The immense amount of computation required by state-of-the-art networks (and how the
RPi is quite limited in terms of computation).
• Limited memory/RAM.
Deep learning practitioners new to the world of embedded devices may be tempted to ac-
tually train their models on the RPi. Instead, what you should do is train your model on a
desktop/GPU-enabled machine, and then after training, transfer it to your embedded de-
vice for inference. Unless you have very explicit reasons for doing so, you should not train
a model on the RPi (or other embedded device) — not only are their computational resources
limited, but electrically the RPi is underpowered as well.
You will invariably run into situations where a particular deep learning model/computer vision
pipeline is too computationally expensive for your Raspberry Pi.
If and when that situation arises, you can leverage USB coprocessors such as the Google
Coral USB Accelerator and Intel’s Movidius NCS. These USB sticks are optimized for deep
learning inference and allow you to push computation from the CPU to the USB stick, allowing
you to obtain better performance.
Alternatively, you may choose to switch embedded devices entirely and go with Google
Coral’s TPU development board or NVIDIA’s Jetson Nano — both of these devices are faster
than the Raspberry Pi and more optimized for deep learning.
Depending on your project needs, you may even elect to use a REST API service to process
images/videos in the cloud and receive the results back on your Raspberry Pi. Obviously this
isn’t ideal (or even possible) for real-time needs.
All that said, the Raspberry Pi itself is not irrelevant for deep learning. Instead, it’s all
about bringing the right tool for the job.
The remainder of this text (as well as the Complete Bundle) will show you how and when to
perform deep learning on the Raspberry Pi, as well as the situations where you should consider
using a coprocessor or dedicated development board.
Chapter 3
Multiple Pis, Message Passing, and ImageZMQ
The PyImageSearch blog quite often receives questions, comments, and emails that go some-
thing like this:
“Hi Adrian, I’m working on a project where I need to stream frames from a client
camera to a server for processing using OpenCV. Should I use an IP camera? Would
a Raspberry Pi work? What about RTSP streaming? Have you tried using FFMPEG
or GStreamer? How do you suggest I approach the problem?”
It’s a great question — and if you’ve ever attempted live video streaming with OpenCV then
you know there are a ton of different options.
You could go with the IP camera route. But IP cameras can be a pain to work with. Some
IP cameras don’t even allow you to access the RTSP (Real-time Streaming Protocol) stream.
Other IP cameras simply don’t work with OpenCV’s cv2.VideoCapture function. An IP
camera may be too expensive for your budget as well.
In those cases, you are left with using a standard webcam — the question then becomes,
how do you stream the frames from that webcam using OpenCV?
Using FFMPEG or GStreamer is definitely an option. But both of those can be a royal pain to
work with. In fact, they are so much of a pain that we removed our original FFMPEG streaming
content and example code from this book. Quite simply, it was going to be near impossible
to support and it honestly would have led readers like you down the wrong path.
In this chapter we’ll review the preferred solution using message passing libraries, specif-
ically ZMQ and ImageZMQ.
As an introductory example, we’ll learn to use ZMQ to send text strings between clients and
servers.
We’ll also use a package called ImageZMQ (http://pyimg.co/fthtd) [12] to send video frames
from a client to a server — this has become my preferred way of streaming video from Rasp-
berry Pis, essentially turning them into an inexpensive wireless IP camera that you have full
control over.
PyImageSearch Guru and PyImageConf 2018 speaker Jeff Bass has made his personal
ImageZMQ project public (http://pyimg.co/lau8p). His system allows for streaming video frames
across your network using minimal lines of code. The project is very well polished and is
another tool you can put in your tool chest for your own projects.
As you’ll see, this method of OpenCV video streaming is not only reliable but incredibly easy
to use, requiring only a few lines of code.
There are a number of use cases where you may wish to integrate multiple Raspberry Pis. The
first two are related to security.
Chapter 15’s IoT Case Study project involving face recognition, door monitoring, message
passing, and IoT lights is a great example. Inside that chapter we utilize two RPis – one is
responsible for detecting people and vehicles as they come down our driveway. The first RPi
sends a message to the second RPi which starts performing face recognition at our front door.
You could extend that project and apply face recognition to every door of a corporate build-
ing or campus.
The next idea is related to factory automation. Maybe you have an automation line with mul-
tiple cameras (RPis) doing different tasks and passing information downstream (and upstream)
the automation line.
Let’s brainstorm a serial number example. Maybe one camera grabs the serial number
from a barcode [13] or via OCR [14] and sends that serial number to the rest of the computers
downstream so they will be expecting it. Perhaps one RPi finds a non-compliant measurement
on a part and it needs to inform other RPis to discard the part matching a certain serial number.
Each RPi takes a picture of the part and sends the frame somewhere with ImageZMQ.
Another idea is swarm robots playing a soccer game. The robots tell each other when and
where they see the soccer ball and other players on the field.
|-- messagepassing_example
|-- client.py
|-- server.py
|-- imagezmq_example
|-- client.py
|-- server.py
Our first demonstration will be a simple message passing example (messagepassing_example/)
using sockets and ZMQ with a client/server approach.
From there, we’ll walk through imagezmq_example/ for sending video frames from a
client to a server.
The message broker receives the request and then handles sending the message to the
other process(es). If necessary, the message broker also sends a response to the originating
process.
As an example of message passing, let’s consider a tremendous life event, such as a mother
giving birth to a newborn child (process communication depicted in Figure 3.1).
Figure 3.1: Illustrating the concept of sending a message from a process, through a message
broker, to other processes.
Process A, the mother, wants to announce to all other processes (i.e., the family) that she had a baby. To do
so, Process A constructs the message and sends it to the message broker.
The message broker then takes that message and broadcasts it to all processes. All other
processes then receive the message from the message broker. These processes want to
show their support and happiness to Process A, so they construct a message saying their
congratulations as shown in Figure 3.2.
Figure 3.2: Each process sends an acknowledgment (ACK) message back through the message
broker to notify Process A that the message is received.
These responses are sent to the message broker which in turn sends them back to Process
A.
This example is a dramatic simplification of message passing and message broker systems
but should help you understand the general algorithm and the type of communication
the processes are performing. You can very easily get into the weeds studying these topics,
including various distributed programming paradigms and types of messages/communication
(1:1 communication, 1:many, broadcasts, centralized, distributed, broker-less etc.).
As long as you understand the basic concept that message passing allows processes to
communicate (including processes on different machines) then you will be able to follow along
with the rest of this chapter.
ZeroMQ [15], or simply ZMQ for short, is a high-performance asynchronous message passing
library used in distributed systems.
RabbitMQ [16] and ZeroMQ are two of the most widely used message passing sys-
tems. However, ZeroMQ specifically focuses on high throughput and low latency appli-
cations — which is exactly how you can frame live video streaming.
When building a system to stream live videos over a network using OpenCV, you would
want a system that focuses on:
• High throughput: There will be new frames from the video stream coming in quickly.
• Low latency: We’ll want the frames distributed to all nodes on the system as soon as
they are captured from the camera.
ZeroMQ also has the benefit of being extremely easy to both install and use.
Jeff Bass, the creator of ImageZMQ (which builds on ZMQ) [17], chose to use ZMQ as the
message passing library for these reasons.
In this client/server message passing example, our client is going to ask the user a question
and then query the server to see if the answer is correct or incorrect.
Textual messages are passed between the client and server in order to answer the question:
We can implement this example in minimal lines of code on both the client and server.
Let’s implement the server first. Go ahead and create a new file named server.py and
insert the following code:
We begin with imports where Line 4 imports zmq for message passing.
We then bind a socket connection with zmq on our --server-port via Lines 17 and 18.
20 while True:
21 # receive a message, decode it, and convert it to lowercase
22 message = socket.recv().decode("ascii").lower()
23 print("[INFO] received message `{}`".format(message))
Line 22 receives a string message from our socket and converts it to lowercase.
If the client says "raspberry" is the best type of pie, we let the user know the message
is correct by sending "correct" as our returnMessage (Lines 27-30).
Or, if the message is "quit", then we send "quitting server..." as our returnMessage
(Lines 34-36). We also go ahead and call sys.exit since the client wants to quit (Line 38).
For any other message, we’ll ask the client to "try again!" (Lines 41-44).
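Since only fragments of server.py appear above, here is a condensed, hedged sketch of the same logic (a ZMQ REP socket is assumed, and the exact line numbers will differ from the full listing):

import argparse
import sys
import zmq

# the server listens on a single port for incoming questions
ap = argparse.ArgumentParser()
ap.add_argument("-p", "--server-port", type=int, default=5555,
    help="port the server should listen on")
args = vars(ap.parse_args())

# bind a REP (reply) socket so we can answer each incoming request
context = zmq.Context()
socket = context.socket(zmq.REP)
socket.bind("tcp://*:{}".format(args["server_port"]))

while True:
    # receive a message, decode it, and convert it to lowercase
    message = socket.recv().decode("ascii").lower()
    print("[INFO] received message `{}`".format(message))

    # judge the answer and build the reply
    if "raspberry" in message:
        returnMessage = "correct"
    elif message == "quit":
        returnMessage = "quitting server..."
    else:
        returnMessage = "try again!"

    # send the reply, then shut down if the client asked us to quit
    socket.send(returnMessage.encode("ascii"))
    if message == "quit":
        sys.exit(0)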
Now that our server is ready to go, let’s implement a client. Create a new file named
client.py and insert the following code:
Lines 15-21 then connect to the server via the IP and port.
23 # loop indefinitely
24 while True:
25 # ask the user to type a message
26 message = input("[INPUT] What is the best type of pie? ")
27
28 # send a text message over the socket connection
29 print("[INFO] sending '{}'".format(message))
30 socket.send(message.encode("ascii"))
First, the client poses a question, "What is the best type of pie?". The question
is wrapped in an input() statement, pausing execution until the user types their answer (of
course, we know the answer is “raspberry” or “raspberry pie”, but it is up to the server to be the
judge).
The client will then send the message to the server to see if it is correct (Line 30).
Line 33 receives a response from the server and Line 34 echos it to the user.
In the event that the response is equal to "quitting server...", our client will exit.
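For completeness, here is a matching sketch of the client side (a REQ socket is assumed so that it pairs with the server's REP socket; again, line numbers will not match the full listing):

import argparse
import zmq

# the client needs to know where the server lives
ap = argparse.ArgumentParser()
ap.add_argument("-s", "--server-ip", required=True,
    help="IP address or hostname of the server")
ap.add_argument("-p", "--server-port", type=int, default=5555,
    help="port the server is listening on")
args = vars(ap.parse_args())

# connect to the server via its IP/hostname and port
context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://{}:{}".format(args["server_ip"], args["server_port"]))

while True:
    # ask the user to type a message and send it to the server
    message = input("[INPUT] What is the best type of pie? ")
    print("[INFO] sending '{}'".format(message))
    socket.send(message.encode("ascii"))

    # receive the server's verdict and echo it to the user
    response = socket.recv().decode("ascii")
    print("[INFO] server says '{}'".format(response))

    # the server told us it is shutting down, so exit as well
    if response == "quitting server...":
        break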
Let’s put the client and server to use in the next section.
With our client and server coded up, now it’s time to put them to work.
First, go ahead and start your server on a machine such as your laptop or a RPi:
Then, on a separate machine start your client (technically you could run the server and
client on the same machine if you use a loopback IP address such as 127.0.0.1 or localhost
instead of another machine’s IP on your network):
Of course we all like “Raspberry” flavored pie, but just for the sake of this example, let’s
enter “Apple” at the prompt:
At this point, you can chat with the server and try some other types of pie.
When you’re ready, you can tell the server that you’re ready to quit:
If you weren’t watching the server, you can go back and inspect its output too. Here’s what
the server’s output looked like:
As you can see, both our client and server are operating properly.
As a challenge for you, could you implement more features for this program? Perhaps add
more questions and handle answer responses? Or maybe you’d like to create a client/server
chat program using Python and zmq.
You could implement such a chat system that can handle one client in a matter of minutes.
You could keep track of multiple client connections in a multi-chat system in about thirty
minutes. Prompt for each person’s chat username/alias when they connect to the server. Then
broadcast each incoming message to all clients with the alias at the beginning of the string
(just like the good ol’ days of ICQ).
There are a number of reasons why you may want to stream frames from a video stream over
a network with OpenCV.
To start, you could be building a security application that requires all frames to be sent to a
central hub for additional processing and logging.
Or, your client machine may be highly resource constrained (such as a Raspberry Pi) and
lack the necessary computational horsepower required to run computationally expensive algo-
rithms (such as deep neural networks, for example).
In these cases, you need a method to take input frames captured from a webcam with
OpenCV and then pipe them over the network to another system.
There are a variety of methods to accomplish this task such as those mentioned in the intro-
duction of this chapter, but right now we’re going to continue our focus on message passing.
Figure 3.3: A great application of video streaming with OpenCV is a security camera system. You
could use Raspberry Pis and ImageZMQ to stream from the Pi (client) to the server.
Jeff Bass is the owner of Yin Yang Ranch [18], a permaculture farm in Southern California. He
was one of the first people to join PyImageSearch Gurus (http://pyimg.co/gurus), my flagship
computer vision course. In the course and community he has been an active participant in
many discussions around the Raspberry Pi.
Jeff has found that Raspberry Pis are perfect for computer vision and other tasks on his
farm. They are inexpensive, readily available, and astoundingly resilient/reliable.
At PyImageConf 2018 [19], Jeff spoke about his farm and more specifically how he used
Raspberry Pis and a central computer to manage data collection and analysis. The heart of
his project is a library that he put together called ImageZMQ [20].
ImageZMQ solves the problem of real-time streaming from the Raspberry Pis on his farm.
It is based on ZMQ and works really well with OpenCV.
Plain and simple, ImageZMQ just works. And it works really reliably.
I’ve found it to be more reliable than alternatives such as GStreamer or FFMPEG streams.
I’ve also had better luck with it than using RTSP streams.
• You can learn the details of ImageZMQ by studying Jeff’s code on GitHub:
http://pyimg.co/lau8p
Figure 3.4: ImageZMQ is a video streaming library developed by PyImageSearch Guru, Jeff Bass.
It is available on GitHub: http://pyimg.co/lau8p
• If you’d like to learn more about Jeff, be sure to refer to his interview here:
http://pyimg.co/sr2gj
• Jeff’s slides from PyImageConf 2018 are also available here: http://pyimg.co/f7jsc
In the coming sections, we’ll install ImageZMQ, implement the client + server, and put the
system to work.
ImageZMQ is preinstalled on the Raspbian and Nano .imgs included with this book. Refer to
the companion website associated with this text to learn more about the preconfigured .img
files.
If you prefer to install ImageZMQ from scratch, refer to this article on PyImageSearch:
http://pyimg.co/fthtd [12].
Figure 3.5: Changing a Raspberry Pi hostname is as simple as entering the raspi-config inter-
face from a terminal or SSH connection.
Our code is going to use the hostname of the client to identify it. You could use the IP
address in a string for identification, but setting a client’s hostname allows you to more easily
identify the purpose of the client.
In this example, we’ll assume you are using a Raspberry Pi running Raspbian. Of course,
your client could run Windows Embedded, Ubuntu, macOS, etc., but since our demo uses
Raspberry Pis, let’s learn how to change the hostname on the RPi.
To change the hostname on your Raspberry Pi, fire up a terminal (this could be over an
SSH connection if you’d like).
Then run the raspi-config command and navigate to network options as shown in Fig-
ure 3.5. From there, change the hostname to something unique. I recommend a naming
convention such as pi-garage, pi-frontporch, etc.
After changing the hostname you’ll need to save and reboot your Raspberry Pi.
On some networks, you won’t even need an IP address to SSH into your Pi. You can
SSH via the hostname:
$ ssh pi-livingroom
Figure 3.6: The client/server relationship for ImageZMQ video streaming with OpenCV.
Before we actually implement network video streaming with OpenCV, let’s first define the
client/server relationship to ensure we’re on the same page and using the same terms:
• Client: Responsible for capturing frames from a webcam using OpenCV and then send-
ing the frames to the server.
• Server: Responsible for receiving the frames from each client and processing them.
We could argue back and forth as to which system is the client and which is the server. For
example, a system that is capturing frames via a webcam and then sending them elsewhere
could be considered a server — the system is undoubtedly serving up frames. Similarly, a
system that accepts incoming data could very well be the client.
i. There is at least one (and likely many more) system responsible for capturing frames.
ii. There is only a single system used for actually receiving and processing those frames.
For these reasons, we’ll prefer to think of the system sending the frames as the client and
the system receiving/processing the frames as the server. Thus, Figure 3.6 demonstrates the
relationship and responsibilities of both the client and server.
You may disagree, but that is the client-server terminology we’ll be using throughout
the remainder of this chapter.
We begin with our imports and command line arguments. On Line 3 we import imagezmq.
We have one command line argument, the --server-ip which is the server’s IP address
or hostname.
By default, we’ll connect on ImageZMQ’s standard port; Lines 16 and 17 initialize
our ImageSender as sender.
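The import and initialization block referenced above is not reproduced here, so the following is a hedged sketch of what it looks like (newer ImageZMQ releases use import imagezmq, older ones use from imagezmq import imagezmq; the default port of 5555 is also assumed):

# imports for this sketch (socket, time, and VideoStream are used by the
# rest of the listing shown below)
from imutils.video import VideoStream
import imagezmq
import argparse
import socket
import time

# parse the single required command line argument
ap = argparse.ArgumentParser()
ap.add_argument("-s", "--server-ip", required=True,
    help="IP address or hostname of the server running the ImageZMQ hub")
args = vars(ap.parse_args())

# initialize the ImageSender and point it at the server
sender = imagezmq.ImageSender(
    connect_to="tcp://{}:5555".format(args["server_ip"]))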
Moving on, we’ll initialize our video stream and start sending frames:
19 # get the host name, initialize the video stream, and allow the
20 # camera sensor to warmup
21 rpiName = socket.gethostname()
22 #vs = VideoStream(src=0).start()
23 vs = VideoStream(usePiCamera=True).start()
24 time.sleep(2.0)
25
26 # loop over frames from the camera
27 while True:
28 # read the frame from the camera and send it to the server
29 frame = vs.read()
30 sender.send_image(rpiName, frame)
We grab the hostname of the client on Line 21 – refer back to Section 3.5.4 to learn how to
set each of your Raspberry Pi hostnames.
Lines 27-30 start our infinite while loop to both read a frame from the VideoStream and
then send it with send_image. We pass the rpiName string (the hostname) as well as the
frame itself.
Jeff Bass’ ImageZMQ library takes care of the rest for us.
Go ahead and open server.py and insert the following lines of code:
From there, Line 7 initializes our imageHub. The imageHub manages connections to our
clients.
11 # receive RPi name and frame from the RPi and acknowledge
12 # the receipt
13 (rpiName, frame) = imageHub.recv_image()
14 imageHub.send_reply(b'OK')
15 print("[INFO] receiving data from {}...".format(rpiName))
We will use this unique rpiName (i.e. the client’s hostname) as our GUI window name. We
will also annotate the frame itself with the rpiName. This allows us to have many clients each
with their own GUI window on our server. Take the time now to set each of the Pis around your
house with a unique hostname following the guide in Section 3.5.4.
Line 14 sends an “ack” (acknowledgement) message back to the client that we received the
frame successfully.
Lines 21 and 22 annotate the frame with the rpiName in the top left corner.
Line 25 then creates a GUI window based on the rpiName with the frame inside of it.
We handle the q keypress via Lines 26-30. When we break out of the loop, all windows
are destroyed.
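Only part of this server.py listing appears above, so here is a condensed sketch of the receive/annotate/display loop just described (the font, text position, and colors are arbitrary choices):

import cv2
import imagezmq

# the hub accepts connections from any number of clients
imageHub = imagezmq.ImageHub()

while True:
    # receive the client hostname and frame, then acknowledge receipt
    (rpiName, frame) = imageHub.recv_image()
    imageHub.send_reply(b'OK')

    # annotate the frame with the sending Pi's hostname (top-left corner)
    cv2.putText(frame, rpiName, (10, 25),
        cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 0, 255), 2)

    # one window per client, keyed on the hostname
    cv2.imshow(rpiName, frame)
    key = cv2.waitKey(1) & 0xFF

    # quit when the `q` key is pressed
    if key == ord("q"):
        break

cv2.destroyAllWindows()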
For this section, grab one or more Raspberry Pis along with your laptop. Your laptop will be
the server. Each RPi will be a client.
ImageZMQ is required to be installed on the server and all clients (refer to Section 3.5.3).
Once the clients and server are ready to go, we’ll first start the server:
$ # on your laptop
$ python server.py
From there, go ahead and start just one client (you can do so via SSH so you don’t have to
break out a screen/keyboard/mouse on the Pi-side):
$ # on the pi
$ python client.py --server-ip 192.168.1.113
In Figure 3.7 we go ahead and launch our ImageZMQ server. Once the server is running,
you can start each client (Figure 3.8). It is recommended to start a screen session on your
clients if you connect via SSH. Screen sessions persist even when the SSH connection drops.
Your server will begin displaying frames as connections are made from each RPi as shown
in Figure 3.9. As you add more client connections to the server, additional windows will appear.
Notice that each OpenCV window on the host is named via the hostname of the RPi. The
hostname of the RPi is also annotated in the top left corner of each OpenCV frame. To learn
how to configure custom hostnames for your RPis, refer to Section 3.5.4. It is essential to
configure the hostname of each RPi as it represents the unique identifier of the RPi as
the code is written. Another option would be to query and send either the MAC or IP address
on the client side to the server.
Once you start streaming, you’ll notice that the latency is quite low and the image quality is
perfect (we didn’t compress the images). Be sure to see the next section about the impact of
your network hardware/technology on performance.
Figure 3.7: Start the ImageZMQ server so that clients have a server to connect to. The server
should run on a desktop/laptop and it will be responsible for displaying frames (and any image
processing done on the server-side). Note: High resolution image can be found here for easier
viewing: http://pyimg.co/frt0a
There are a number of factors which will impact performance (i.e. frames per second through-
put rate) for network video streaming projects. Ask yourself the following questions and keep
in mind that most of them are inter-related.
The more clients, the longer it will be until each client frame is displayed.
The example in this chapter was very serial. Multithreading/multiprocessing will help if you
want to manage connections and displaying video frames in parallel.
By default, the imutils.video module paired with the PiCamera will send 320x240
frames. If you send larger frames, your FPS will go down. Full resolution frames require
more data packets to get to the server.
Figure 3.8: After the ImageZMQ server is running, launch each client while providing the
--server-ip via command line argument. The clients can run in SSH screen sessions so
that they continue to run even when you close the terminal window. Note: High resolution image
can be found here for easier viewing: http://pyimg.co/3yuej
If you have a high resolution USB webcam and forget to manually insert a frame = imutils.resize(frame,
width=400) (substitute 400 for a width of your choosing) between Lines 28 and 29, you’ll
be sending full resolution HD frames. HD frames will quite literally choke your streaming ap-
plications unless you apply compression. Compression is certainly possible but is outside the
scope of this book.
If the server is only displaying frames, the overall FPS will be quite high. However, if you
are running a complex computer vision or deep learning pipeline, performance will definitely
be impacted.
In general, you can take the load off the client and just treat the client as an IP camera. But
you may wish to do some of the frame processing on the Pi (or other client) itself to take a load
off the server. Be creative!
Figure 3.9: Server will begin to display the frames coming from each client in separate windows.
Note: High resolution image can be found here for easier viewing: http://pyimg.co/3yuej
Maybe you only need to send frames containing motion to the server for a security applica-
tion. Try implementing the idea of only sending motion frames to the server using this chapter’s
example code (a rough sketch of the idea follows below). Simply apply background subtraction on the clients
(see Chapter 4, Section 4.3.9 of the Hobbyist Bundle for an example of motion detection). You may wish
to send a “keep alive” frame every so often.
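Here is one rough, hedged sketch of that idea: OpenCV's built-in MOG2 background subtractor stands in for the Hobbyist Bundle's motion detector, and the server address, hostname string, and pixel threshold are placeholders you would tune for your own scene.

from imutils.video import VideoStream
import imagezmq
import imutils
import time
import cv2

# placeholder server address and client name -- substitute your own
sender = imagezmq.ImageSender(connect_to="tcp://192.168.1.113:5555")
rpiName = "pi-frontporch"

# a simple background subtractor to decide whether anything moved
subtractor = cv2.createBackgroundSubtractorMOG2(history=100)

vs = VideoStream(usePiCamera=True).start()
time.sleep(2.0)

while True:
    frame = vs.read()
    frame = imutils.resize(frame, width=320)

    # count the foreground pixels in the motion mask
    mask = subtractor.apply(frame)
    motionPixels = cv2.countNonZero(mask)

    # only ship the frame if enough of the scene changed
    if motionPixels > 500:
        sender.send_image(rpiName, frame)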
Discussing wireless technology is outside the scope of the book. That said, keep the fol-
lowing points in mind:
i. 802.11b is slow.
vi. 2.4GHz (lower frequency) is good for longer ranges, but is slower than 5GHz.
Be sure to look up the specifications. In general, here’s what you need to know:
There’s no problem with mixing clients unless you have an issue with speed. If you are
using serial processing (as we demonstrated in this chapter), then an RPi Zero W could be a
bottleneck, for example.
3.6 Summary
In this chapter we learned about message passing via sockets, ZMQ, and ImageZMQ.
The Raspberry Pi is a true Internet of Things (IoT) device. You should use its communication
capabilities to your advantage when designing an interconnected computer vision system.
In the next chapter, we’ll build a home security system with a central server that processes
frames from multiple Pis around your home to find people and other objects with a deep learn-
ing object detector.
Flip the page to continue learning more about ImageZMQ with a real world example.
Chapter 4
Advanced Security with YOLO Object Detection
In Chapter 14 of the Hobbyist Bundle you learned how to build a basic home security applica-
tion using motion detection/background subtraction.
In this chapter we are going to extend the basic system to include more advanced computer
vision algorithms, including deep learning, object detection, and the ability to monitor specific
zones.
Additionally, you’ll also learn how to stream frames directly from your Raspberry Pi to a
central host (such as a laptop/desktop) where more computationally expensive operations can
be performed.
ii. Efficiently apply a cascade of background subtraction and object detection (to save CPU
cycles).
iii. Define unauthorized security zones and monitor them for intruders.
iv. Use ImageZMQ to stream frames from a Pi to a central computer (i.e., your laptop, desk-
top, or GPU rig) running YOLO.
When it comes to deep learning-based object detection, there are three primary object detec-
tors you’ll encounter:
• R-CNN and its variants, including the original R-CNN, Fast R-CNN, and Faster R-CNN
[21, 22, 23].
• Single Shot Detectors (SSDs).
• You Only Look Once (YOLO).
R-CNNs are one of the first deep learning-based object detectors and are an example of a
two-stage detector. In the original R-CNN publication, Girshick et al. proposed:
i. Using an algorithm such as Selective Search [28] (or equivalent) to propose candidate
bounding boxes that could contain objects.
ii. These regions were then passed into a CNN for classification.
The problem with the standard R-CNN method was that it was painfully slow and not a
complete end-to-end object detector. Two followup papers were published (Fast R-CNN and
Faster R-CNN, respectively), leading to a complete end-to-end trainable object detector that
automatically proposed regions of an image that could potentially contain objects.
While R-CNNs did tend to be very accurate, the biggest problem was their speed —
they were incredibly slow, obtaining only 5 FPS on a GPU.
To help increase the speed of deep learning-based object detectors, both Single Shot De-
tectors (SSDs) and YOLO use a one-stage detector strategy. These algorithms treat object
detection as a regression problem, taking a given input image and simultaneously learning
bounding box coordinates and corresponding class label probabilities.
In general, single-stage detectors tend to be less accurate than two-stage detectors, but
are significantly faster.
Since the original YOLO publication the framework has gone through a number of itera-
tions, including YOLO9000 and YOLOv2 [26]. The performance of both these updates was
a bit underwhelming and it wasn’t until the 2018 publication of YOLOv3 [27] that prediction
performance improved.
We’ll be using YOLOv3 for this chapter but feel free to swap out the model for another
object detection method — you are not limited to just YOLO. As long as your object detector
can produce bounding box coordinates, you can use it in this chapter as a starting point for
your own projects.
Figure 4.1: Flowchart of steps when building our RPi home security system. First, our client
RPi reads a frame from its camera sensor. The frame is sent (via ImageZMQ) to a server for
processing. The server checks to see if motion has occurred in the frame, and if so, applies YOLO
object detection. A video clip containing the action is then generated and saved to disk.
Before we get started building our deep learning-based security application, let’s first ensure
we understand the general steps (Figure 4.1). You’ll note that our Raspberry Pi is meant to be
a “camera only” — this means that the RPi will not be performing any on-board processing.
Therefore, the RPi will be responsible for capturing frames and then sending them to a more
powerful server for additional processing.
The server will apply a computationally expensive object detector to locate objects in our
video feed. However, our server will utilize a two-stage process, called a cascade (not to be
confused with Haar cascade object detectors):
i. First, apply computationally inexpensive motion detection/background subtraction to the frame.
ii. If, and only if, motion is found, we then apply the YOLO object detector.
Even on a laptop/desktop CPU, the YOLO object detector is very computationally expensive
— we should only apply it when we need to. Since a scene without motion is presumed not to
have any objects we’re interested in, it doesn’t make sense to apply YOLO to an area that has
no objects!
Once motion is detected we’ll run YOLO and continue to monitor the video feed. If a person,
animal, or any other object we define enters any unauthorized zone, we’ll record a video clip of
the intrusion.
Remark. These “unauthorized zones” must be provided before we launch our security appli-
cation. You’ll learn how to find these coordinates inside Section 4.6.
In this section we will implement our actual security application. This system will follow the
flowchart depicted and described in Section 4.3.
Before we start writing any code let’s first take a look at our directory structure for the project:
|-- config
| |-- config.json
|-- output
|-- pyimagesearch
| |-- keyclipwriter
| | |-- __init__.py
| | |-- keyclipwriter.py
| |-- motion_detection
| | |-- __init__.py
| | |-- singlemotiondetector.py
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
| |-- parseyolooutput.py
|-- yolo-coco
| |-- coco.names
| |-- yolov3.cfg
| |-- yolov3.weights
|-- client.py
|-- server.py
Inside the config/ directory we will store our config.json file, containing our various
configurations.
• parseyolooutput.py: Helper class used to parse the output of the YOLO object de-
tection network.
The yolo-coco/ directory contains the YOLO object detector (pre-trained on COCO).
This object detector is capable of recognizing 80 common object classes, including people,
cars/trucks, animals (dogs, cats, etc.).
The client.py will run on our Raspberry Pi — this script will be responsible for capturing
frames from the RPi camera and then streaming them back to server.py, which applies our
actual computer vision pipeline.
With our directory structure reviewed, let’s now take a look at config.json:
1 {
2 // number of frames required to construct a reasonable background
3 // model
4 "frame_count": 32,
5
6 // path to yolo weights and labels
7 "yolo_path": "yolo-coco",
8
9 // minimum confidence to filter out weak detections
10 "confidence": 0.5,
11
12 // non-maxima suppression threshold
13 "threshold": 0.3,
The "frame_count" defines the minimum number of required frames in order to construct
a model for background subtraction. If you recall from Chapter 14 of the Hobbyist Bundle,
we must first ensure that our background subtractor knows what the background “looks like”,
thereby allowing it to detect when motion takes place in the scene.
The "yolo_path" configuration points to the yolo-coco/ directory where the pre-trained
YOLO model lives. This directory includes the YOLO model architecture, pre-trained weights,
and a .names file with the names of the labels YOLO can detect.
Our "confidence" is the minimum required probability to filter out weak detections. Any
given detection must have a predicted probability larger than "confidence", otherwise we’ll
throw out the detection, assuming it’s a false-positive.
The "threshold" is used for non-maxima suppression (NMS) [29, 30]. NMS is used to
suppress overlapping bounding boxes, collapsing the boxes into a single object. You can learn
more about NMS in the following tutorial: http://pyimg.co/fz1ak.
Let’s now define our unauthorized zones and the objects we want to look for:
Lines 21 and 22 construct the list of objects we want to look for in our scene. Here we have
supplied a list of common objects you’ll likely want to monitor for in a security application; how-
ever, feel free to add or remove any of the object classes from the yolo-coco/coco.names
file if you wish.
24 // output video codec, output video FPS, and path to the output
25 // videos
26 "codec": "MJPG",
27 "fps": 10,
28 "output_path": "output",
29
30 // key clip writer buffer size
31 "buffer_size": 32
32 }
The "codec" controls our video codec while the "fps" dictates the playback frame rate of
the video path stored in the "output_path" directory.
Our client is a Raspberry Pi. The Pi will be responsible for one task, and one task only —
capturing frames from a video stream and then sending those frames to our server for
processing.
Let’s implement the client now. Open up client.py and insert the following code:
Lines 2-6 import our required Python packages while Lines 9-12 parse our command line
arguments. Only a single argument is required, --server-ip, which is the IP address of the
server running the ImageZMQ hub.
Lines 16 and 17 then initialize the ImageSender used to send frames from our video
stream via ImageZMQ to our central server.
Overall, this code should look very similar to the ImageZMQ streaming code from the previ-
ous chapter (only now we’re sending frames instead of text content).
19 # get the host name, initialize the video stream, and allow the
20 # camera sensor to warmup
21 rpiName = socket.gethostname()
22 #vs = VideoStream(src=0).start()
23 vs = VideoStream(usePiCamera=True).start()
24 time.sleep(2.0)
25
26 # start looping over all the frames
27 while True:
28 # read the frame from the pi camera and send it to the server
29 frame = vs.read()
30 sender.send_image(rpiName, frame)
On Line 21 we grab the hostname of the RPi — we’ll eventually extend this implementation
to handle multiple RPis streaming to a single server (Chapter 8). When we do extend our
implementation we need each RPi to have a unique ID. Grabbing the hostname ensures each
RPi is uniquely identified. To set the hostname of your Raspberry Pi, refer to Section 3.5.4
Remark. Be sure to refer to the book’s companion website to learn how to set the hostname of
your Raspberry Pi. A link to access the companion website can be found in the first few pages
of this text.
Line 27 starts a while loop that loops over frames from our VideoStream. We then call
the send_image method to send the frame to the central server.
Now that we’ve created the client, let’s implement the server. Open up server.py and insert
the following code:
Lines 2-12 handle importing our required Python packages, but most notably, take a look
at Lines 2-4 — these imports will facilitate building the computer vision pipeline discussed in
Section 4.3 earlier in this chapter.
As we work through this script keep in mind that our security application consists of two
stages:
i. Stage #1: First perform the less computationally expensive background subtraction/mo-
tion detection to the frame.
ii. Stage #2: If, and only if, motion is found, utilize the more computationally expensive
YOLO object detector.
The overlap method requires that we supply the bounding box coordinates of two rectan-
gles, rectA and rectB, respectively. Line 18 checks to see if:
• The first x-coordinate of rectA is greater than the second x-coordinate of rectB.
• The second x-coordinate of rectA is less than the first x-coordinate of rectB.
If any of these cases hold, then the two rectangles do not overlap.
Line 24 makes a similar check, only this time for the respective y-coordinates. Again, if
the check passes, then the two rectangles do not overlap. Finally, if both checks fail then we
know that the rectangles do overlap and return True.
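The overlap function itself is not reproduced above, so here is a minimal sketch of the check just described, assuming each rectangle is given as a (startX, startY, endX, endY) tuple:

def overlap(rectA, rectB):
    # if rectA lies entirely to the right of rectB, or entirely to its
    # left, the rectangles cannot overlap along the x-axis
    if rectA[0] > rectB[2] or rectA[2] < rectB[0]:
        return False

    # likewise, if rectA lies entirely below or entirely above rectB,
    # the rectangles cannot overlap along the y-axis
    if rectA[1] > rectB[3] or rectA[3] < rectB[1]:
        return False

    # both checks failed, so the rectangles must overlap
    return True

# example: a detection overlapping an unauthorized zone
print(overlap((10, 10, 50, 50), (40, 40, 100, 100)))  # True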
Next, let’s parse our command line arguments and perform a few initializations:
Lines 31-34 parse our command line arguments. We only need a single argument, --conf,
the path to our configuration file. Lines 37 and 38 load our conf and initialize the ImageHub.
Line 44 then instantiates the motion detector; we also initialize the total number of frames read
thus far and the frame width and height, respectively.
46 # load the COCO class labels our YOLO model was trained on
47 labelsPath = os.path.sep.join([conf["yolo_path"], "coco.names"])
48 LABELS = open(labelsPath).read().strip().split("\n")
49
50 # initialize a list of colors to represent each possible class label
51 np.random.seed(42)
52 COLORS = np.random.randint(0, 255, size=(len(LABELS), 3),
53 dtype="uint8")
54
55 # derive the paths to the YOLO weights and model configuration
56 weightsPath = os.path.sep.join([conf["yolo_path"], "yolov3.weights"])
57 configPath = os.path.sep.join([conf["yolo_path"], "yolov3.cfg"])
58
59 # load our YOLO object detector trained on COCO dataset (80 classes)
60 # and determine only the *output* layer names that we need from YOLO
61 print("[INFO] loading YOLO from disk...")
62 net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)
63 ln = net.getLayerNames()
64 ln = [ln[i[0] - 1] for i in net.getUnconnectedOutLayers()]
65
66 # initialize the YOLO output parsing object
67 pyo = ParseYOLOOutput(conf)
Here we construct the path to the COCO class labels file (i.e., the list of objects YOLO can
detect). We then load this file into the LABELS list. We also derive random COLORS that we
can use to visualize each label (Lines 43-53).
In order to load YOLO from disk we must first derive the paths to the weights and model
configuration (Lines 56 and 57). Given these paths we can load YOLO from disk on Line 62.
We also determine the output layer names from YOLO on Lines 63 and 64, enabling us to
extract the object predictions from the network.
Line 67 instantiates ParseYOLOOutput, used to actually parse the output of the network.
We’ll implement that class in Section 4.4.4.1.
Let’s start looping over frames received from our Raspberry Pi via ImageZMQ:
72 consecFrames = 0
73 print("[INFO] starting advanced security surveillance...")
74
75 # start looping over all the frames
76 while True:
77 # receive RPi name and frame from the RPi and acknowledge
78 # the receipt
79 (rpiName, frame) = imageHub.recv_image()
80 imageHub.send_reply(b'OK')
81
82 # resize the frame, convert it to grayscale, and blur it
83 frame = imutils.resize(frame, width=400)
84 gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
85 gray = cv2.GaussianBlur(gray, (7, 7), 0)
86
87 # grab the current timestamp and draw it on the frame
88 timestamp = datetime.now()
89 cv2.putText(frame, timestamp.strftime(
90 "%A %d %B %Y %I:%M:%S%p"), (10, frame.shape[0] - 10),
91 cv2.FONT_HERSHEY_SIMPLEX, 0.35, (0, 0, 255), 1)
92
93 # if we do not already have the dimensions of the frame,
94 # initialize it
95 if H is None and W is None:
96 (H, W) = frame.shape[:2]
Line 71 instantiates our KeyClipWriter so we can write video clips of intruders to disk.
We then start looping over frames from the RPi on Line 76.
Lines 79 and 80 grab the latest frame from the RPi and then acknowledge receipt of the
frame. We preprocess the frame by resizing it to have a width of 400px, converting it to grayscale,
and then blurring it to reduce noise (Lines 83-85).
We also draw the current timestamp on the frame (Lines 88-91) and grab the frame di-
mensions (Lines 95 and 96).
In order to apply motion detection we must first ensure that we have sufficiently modeled
the background, assuming that no motion has taken place during the first "frame_count"
frames:
Line 101 checks to ensure that we have received a sufficient number of frames to build
an adequate background model. Provided we have, we attempt to detect motion in the gray
frame (Line 104).
Line 109 checks to see if motion has been found. If motion has been found and we are
not recording (Line 117), we create an output directory, build the output video file path, and start
recording (Lines 120-128).
Since we know motion has taken place, we can now apply the YOLO object detector:
130 # construct a blob from the input frame and then perform
131 # a forward pass of the YOLO object detector, giving us
132 # our bounding boxes and associated probabilities
133 blob = cv2.dnn.blobFromImage(frame, 1 / 255.0,
134 (416, 416), swapRB=True, crop=False)
135 net.setInput(blob)
136 layerOutputs = net.forward(ln)
137
138 # parse YOLOv3 output
139 (boxes, confidences, classIDs) = pyo.parse(layerOutputs,
140 LABELS, H, W)
141
142 # apply non-maxima suppression to suppress weak,
143 # overlapping bounding boxes
144 idxs = cv2.dnn.NMSBoxes(boxes, confidences,
145 conf["confidence"], conf["threshold"])
Lines 133-136 construct a blob from the input frame and then pass it through the YOLO
object detector.
Given the layerOutputs we parse them on Lines 139 and 140 and then apply non-
maxima suppression to suppress weak, overlapping bounding boxes, keeping only the most
confident ones. NMS also ensures that we do not have any redundant or extraneous bounding
boxes in our results. If you’re interested, you can learn more about NMS, including how the
underlying algorithm works, in the following tutorial: http://pyimg.co/fz1ak [29].
Moving on, let’s first ensure that at least one detection was found in the frame:
Provided at least one detection was found, we can start looping over the indexes of the detections.
Lines 152 and 153 extract the bounding box coordinates of the current detection. Given
the bounding box coordinates we need to check and see if they overlap with an unauthorized
zone. To accomplish this task we start looping over the unauthorized zones on Line 156 and
construct a bounding box for the zone (Line 159).
Lines 164 and 165 check to see if there is no overlap between the two boxes. If so, we
continue looping over the unauthorized zone coordinates.
Otherwise, there is an overlap between the zones (Line 169) so we draw both label text
and bounding box coordinates on Lines 172-179. If you wish to raise an alarm or send a text
message alert, for example, this else code block is where you would want to insert your logic.
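A minimal sketch of that loop, using the overlap helper sketched earlier in this section along with illustrative stand-in values, might look like the following (the variable names mirror the surrounding discussion):

# illustrative stand-ins: one (x, y, w, h) detection kept after NMS and one
# (startX, startY, endX, endY) unauthorized zone from the configuration
detections = [(120, 80, 60, 140)]
zones = [(100, 50, 300, 260)]

for (x, y, w, h) in detections:
    # convert the detection to corner coordinates
    rectA = (x, y, x + w, y + h)

    # check the detection against every unauthorized zone
    for rectB in zones:
        # if there is no overlap, move on to the next zone
        if not overlap(rectA, rectB):
            continue

        # otherwise the object has entered an unauthorized zone -- this is
        # where drawing, alarms, or text message alerts would go
        print("[ALERT] object entered unauthorized zone: {}".format(rectB))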
Lines 186-188 loop over each of the unauthorized zones and draw them on our frame,
enabling us to visualize them.
Provided that no motion has taken place, Lines 193 and 194 update our consecFrames
count.
Line 197 updates the key clip writer while Lines 201 and 202 check to see if we should
stop recording.
Finally, let’s update our background model and then display the output frame to our screen:
204 # update the background model and increment the total number
205 # of frames read thus far
206 md.update(gray)
207 total += 1
208
209 # show the frame
210 cv2.imshow("{}".format(rpiName), frame)
211 key = cv2.waitKey(1) & 0xFF
212
213 # if the `q` key was pressed, break from the loop
214 if key == ord("q"):
215 break
216
217 # if we are in the middle of recording a clip, wrap it up
218 if kcw.recording:
219 kcw.finish()
220
221 # do a bit of cleanup
222 cv2.destroyAllWindows()
At this point we’re almost finished — we have one more Python class to define.
Inside the server.py script we utilized a class named ParseYOLOOutput — let’s define that
class now.
On Line 5 we create the constructor for the class. We only need a single argument, our
configuration, which we store on Line 7.
The parse method requires that we supply four parameters to the function:
• layerOutputs: The output of making predictions with the YOLO object detector.
• LABELS: The list of COCO class labels the model was trained on.
• H: The height of the input frame.
• W: The width of the input frame.
We then initialize three lists used to store the detected bounding box coordinates, confi-
dences (i.e., predicted probabilities), and class label IDs. We can now start populating these
lists by looping over the layerOutputs:
For each of the layerOutputs we then loop over each of the detections. Lines 21 and
22 extract the predicted class ID with the largest corresponding predicted probability. We then
make a check on Line 26 to ensure that the class label exists in the set of class labels we want
to "consider" (defined in Section 4.4.2) — if not, we continue looping over detections.
Otherwise, we have found a class label we are interested in so let’s process it:
Line 31 extracts the probability for the class label from the scores list — we then ensure
that a minimum confidence is met on Line 36. Ensuring that a prediction meets a minimum
predicted probability helps filter out false-positive detections.
Line 42 scales the bounding box coordinates back relative to the size of the original image.
Lines 43 and 44 extract the dimensions of the bounding box, while keeping in mind that
YOLO returns bounding box coordinates in the following order: (centerX, centerY, width,
height). We use this information to derive the top-left (x, y)-coordinates of the bounding box
(Lines 48 and 49).
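As a small worked example of that conversion (with illustrative values, not the chapter's code), YOLO's relative (centerX, centerY, width, height) output can be turned into a top-left corner box like so:

import numpy as np

# illustrative frame dimensions and a single YOLO box given as fractions
# of the image size: (centerX, centerY, width, height)
(H, W) = (416, 416)
detection = np.array([0.5, 0.5, 0.2, 0.4])

# scale the box back to pixel coordinates relative to the image size
box = detection[0:4] * np.array([W, H, W, H])
(centerX, centerY, width, height) = box.astype("int")

# derive the top-left (x, y)-coordinates from the center coordinates
x = int(centerX - (width / 2))
y = int(centerY - (height / 2))
print(x, y, int(width), int(height))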
Finally, we update the boxes, confidences, and classIDs lists on Lines 53-55 and
return the three respective lists as a tuple to the calling function (Line 59).
Now that we’ve coded up our server.py script, let’s put it to work.
From there, on your Raspberry Pi (via SSH or VNC), start the client:
Figure 4.2: Objects such as people are annotated/recorded (red) only when they enter unautho-
rized zones (blue).
Your server will then come alive with the feed from the RPi. In Figures 4.2 and 4.3 you can
see my macOS desktop. People (red) are detected via YOLO, but they are only annotated and
recorded if/when they enter unauthorized zones (blue). Notice that objects outside the blue
boxes are not annotated. Our key clip writer will store clips only as people enter the
unauthorized zones.
You can verify that the video clip was written to disk by checking the contents of your
output/ directory:
$ ls output/2019-11-15/
151118.avi 151544.avi 152020.avi 153044.avi 153351.avi
151505.avi 151601.avi 152624.avi 153240.avi 153846.avi
Figure 4.3: Objects such as people are annotated/recorded (red) only when they enter unautho-
rized zones (blue).
4.6 How Do I Define My Own Unauthorized Zones?

In this chapter you learned how to monitor unauthorized zones for access; however, how do
you actually define these unauthorized zones? In general, there are two methods I recom-
mend: using an image processing tool (ex., Photoshop, GIMP, etc.) or utilizing OpenCV’s
mouse click events.
Initially, determining the bounding box (x, y)-coordinates of an unauthorized zone is a manual
process. The good news is that these coordinates only need to be determined once.
The first step is to actually capture your image/frame. You can use the exact code from
Section 4.4.4, but you should insert a cv2.imwrite call at the bottom of the while loop used
to loop over frames, like this:
Figure 4.4: Using Photoshop to derive the (x, y)-coordinates of a region in an image.
Notice how I’m writing the current frame to disk with the filename frame.png (Line 215).
We’ll now examine this frame in the following two sections.
If you have Photoshop or GIMP installed on your machine then you can simply open up
frame.png in the editor and look at the (x, y)-coordinates of the frame, an example of which
can be seen in Figure 4.4.
As I move my mouse, Photoshop will update the “Info” section with the current (x, y)-
coordinates. To define my unauthorized zone coordinates I can simply move my mouse to
the four corners of the rectangle I want to monitor and jot them down. Given the coordinates I
can go back to my config.json file and update the "unauthorized_zone" list.
An alternative option to defining your unauthorized zone coordinates is to use OpenCV and
mouse click events (http://pyimg.co/fmckl) [31].
Figure 4.5: Using OpenCV’s mouse click events to print and display (x, y)-coordinates.
Using OpenCV you can capture mouse clicks on any window opened by cv2.imshow.
Figure 4.5 demonstrates how I’ve opened a frame, moved my mouse to a location on the
frame I want to monitor, and then clicked my mouse — the (x, y)-coordinates of the click are
then printed to my terminal. I can then repeat this process for the remaining three vertices of
the bounding box rectangle.
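A minimal sketch of such a script might look like the following (frame.png is the frame saved earlier; the window name is arbitrary):

import cv2

def click_handler(event, x, y, flags, param):
    # print the (x, y)-coordinates whenever the left mouse button is clicked
    if event == cv2.EVENT_LBUTTONDOWN:
        print("clicked at (x={}, y={})".format(x, y))

# load the frame we saved to disk and register the mouse callback
image = cv2.imread("frame.png")
cv2.namedWindow("frame")
cv2.setMouseCallback("frame", click_handler)

# display the frame and wait for a keypress before exiting
cv2.imshow("frame", image)
cv2.waitKey(0)
cv2.destroyAllWindows()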
From there, I would take the (x, y)-coordinates I just found and update the "unauthorized_zone"
list in the config file.
4.7 Summary
In this chapter you learned how to build a more advanced home security application.
While streaming may feel like “cheating” to a certain degree, it’s a perfectly acceptable ap-
proach provided you have enough network bandwidth. If you do not have the network strength
to facilitate real-time streaming then you’ll need to apply your deep learning algorithms directly
on the RPi itself — the rest of this text is dedicated to techniques you can apply to do so.
Furthermore, in Chapter 8 you will learn how to extend the method we covered here in
this chapter to build an even more advanced home security application capable of leveraging
multiple Raspberry Pis.
Chapter 5

Face Recognition on the RPi
Inside this chapter you will learn how to build a complete face recognition system.
Such a system could be used to recognize faces at the front of your house and unlock your
front door, monitor your desk and capture anyone who is trying to use your workstation, or even
build something fun like a “smart treasure chest” that automatically unlocks when the correct
face is identified.
Regardless of what you’re building, this chapter will serve as a template you can use
to build your own face recognition systems.
i. About “embeddings” and how deep learning can be used for face recognition.
ii. Three methods that can be used to gather and build your face recognition dataset.
iii. How to use Python to extract face embeddings from your custom dataset.
iv. How to train a Support Vector Machine (SVM) classifier on top of the embeddings.
vi. How to put all the pieces together and create a complete face recognition pipeline.
Figure 5.1: In order to build a face recognition system we must: create a dataset of example
faces (i.e., our training data), extract features from our face dataset, and train a machine learning
classifier model on top of the face features. From there we can recognize faces in images and
video streams.
Section 5.4.2 will show you three methods that I recommend for building your own custom
face dataset (or you can simply follow along with this chapter using the dataset provided for
you in the downloads associated with the text).
It’s recommended that you perform both of these steps on a laptop, desktop, or GPU
rig. A laptop/desktop will have more RAM and a faster CPU, enabling you to more easily work
with your data.
If you choose to use your RPi for these steps you may find (1) it takes substantially longer to
quantify each face and train the model and/or (2) the scripts error out completely due to lack of
memory. Therefore, use your laptop/desktop to train the model and then transfer the resulting
models to your RPi.
Once your models have been trained and transferred, we’ll perform face recognition on the
Raspberry Pi itself.
The Raspberry Pi is therefore responsible for inference (i.e., making predictions) but not the
actual training process.
Before we can get started building our face recognition project we first need to review the key
components, namely the project structure and the associated configuration file.
This chapter is meant to serve as a complete review of building a face recognition system
— this means that we’ll not only be covering how to actually deploy face recognition to the
Raspberry Pi, but also how to create your own custom faces dataset and then train a face
recognizer on this data.
Since we’ll be covering so many techniques in a single chapter, we also have significantly more
Python scripts/files to review.
|-- cascade
| |-- haarcascade_frontalface_default.xml
|-- config
| |-- config.json
|-- face_recognition
| |-- dataset
| | |-- abhishek [5 entries]
| | |-- adrian [5 entries]
| | |-- dave [5 entries]
| | |-- mcCartney [5 entries]
| | |-- unknown [6 entries]
| |-- encode_faces.py
| |-- train_model.py
|-- messages
| |-- abhishek.mp3
| |-- adrian.mp3
| |-- dave.mp3
| |-- mcCartney.mp3
| |-- unknown.mp3
|-- output
| |-- encodings.pickle
| |-- le.pickle
| |-- recognizer.pickle
|-- pyimagesearch
| |-- notifications
| | |-- __init__.py
| | |-- twilionotifier.py
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
|-- create_voice_msgs.py
|-- door_monitor.py
The cascade/ directory contains our Haar cascade for face detection. Face recognition
is a computationally demanding task, and given the resource limited nature of the RPi, any
speedup we can obtain is worth it.
Here we’ll be using Haar cascades for face detection, but later in the Hacker Bundle, I’ll be
showing you how to use the Movidius NCS for faster, more efficient face detection as well.
The config/ directory contains our config.json file. We’ll be reviewing this file in Sec-
tion 5.3.2 below.
All scripts that will be used to train our actual face recognizer are contained in the
face_recognition/ directory.
The dataset/ directory is where we’ll store our dataset of faces images. You’ll learn how
to build your own face recognition dataset inside Section 5.4.2.
The encode_faces.py and train_model.py scripts will be used to train the face recognition
model. These scripts are covered in Sections 5.4.3 and 5.4.4, respectively.
The output/ directory will then contain the output files from encode_faces.py and
train_model.py.
We’ll be using text-to-speech (TTS) to announce each recognized (or unidentified) face,
but in order to utilize TTS, we first need to generate .mp3 files for each person’s name. The
messages/ directory will contain these .mp3 files after they have been generated by the
create_voice_msgs.py script.
Inside the pyimagesearch module you’ll find utilities to load our configuration file and send
Twilio text message notifications.
Finally, door_monitor.py will put all the pieces together, and as the name suggests, will
build a face recognition application for the purpose of monitoring our front door.
Our configuration file for this project is longer than the vast majority of configuration files in this
book. I would suggest you review the config file along with the rest of the code in this chapter
once, then go back and read the code separately from the text. Doing so will enable
you to understand the context in which each configuration is used (and how different Python
scripts can utilize the same config values, such as the path to our Haar cascade or trained face
recognition model).
With that said, open up the config.json file now and let’s review it:
1 {
2 // path to OpenCV's face detector cascade file
3 "cascade_path": "cascade/haarcascade_frontalface_default.xml",
4
5 // path to extracted face encodings file
6 "encodings_path": "output/encodings.pickle",
7
8 // path to face recognizer model
9 "recognizer_path": "output/recognizer.pickle",
10
11 // path to label encoder file
12 "le_path": "output/le.pickle",
Line 3 defines the path to our Haar cascade used for face detection (i.e., finding the
bounding box coordinates of each face in an image).
For face recognition (the actual person identification) we then have three paths:
• "encodings_path": Where we’ll store our face quantifications after our deep learning
model has extracted embeddings from the face ROI.
• "recognizer_path": The path to the trained face recognizer (the SVM model we’ll train in Section 5.4.4).
• "le_path": The path to the serialized label encoder, used to convert model predictions back to actual person names.
We’ll be reviewing each of these paths in detail inside Section 5.4.3 and Section 5.4.4, but
for now, simply make note of them.
The "display" config controls whether our driver script, door_monitor.py, should dis-
play any output frames via cv2.imshow. If our face recognition application is meant to run in
the background then you can set "display" to false, ensuring no output frames are shown
via cv2.imshow, and therefore allowing your script to run faster.
The "threshold" parameter controls the threshold percentage for our background sub-
tractor.
As Line 19 specifies, if more than 12% of any given frame is occupied by motion,
we’ll start running our face detector on each frame of the video stream, which leads us to the
"look_for_a_face" parameter:
Once both (1) motion is found and (2) more than "threshold" percent of the frame con-
tains motion, we’ll start applying face detection to each frame of the video stream for a total of
"look_for_a_face" frames.
Since face detection is more computationally expensive than motion detection via back-
ground subtraction, we wish to only apply face detection occasionally — setting a limit on the
number of frames to sequentially apply face detection to enables us to save CPU cycles and
ensure our face recognition system runs faster.
In the event that a face is found in a frame, we then apply face recognition. If a given
person is identified in at least "consec_frames" frames, we can label the face as a positive
recognition.
Smaller values of "consec_frames" will allow your pipeline to run faster as face recogni-
tion only has to be applied for a total of "consec_frames" frames; however, smaller values may lead
to false-positive recognitions. You may need to balance this parameter in your own applications
to reach a satisfactory level of accuracy.
Line 35 defines the name of the language used by the TTS package, while accent (Line
36) controls the particular accent of the language.
The "msgs_path" controls the path to the output messages/ directory where we’ll later generate
and store .mp3 files for each person’s name. Lines 43-45 then specify the
"welcome_sound", used to welcome a user to their home, while "intruder_sound" contains the text
for an unauthorized user entering the premises.
Our face recognition system can also send text message notifications to the administrator
by specifying the relevant AWS S3 and Twilio API keys/credentials:
If you need a review of these parameters please refer to Chapter 10 of the Hobbyist Bundle
where they are covered in detail.
In the first part of this section we’ll briefly review how deep learning algorithms can facilitate
accurate face recognition via face embeddings.
From there you’ll learn how to gather and build a dataset that can be used for face recogni-
tion, followed by extracting face embeddings from the dataset, and then training the actual face
recognition model.
Figure 5.2: Facial recognition via deep metric learning involves a “triplet training step.” The triplet
consists of 3 unique face images — 2 of the 3 are the same person. The NN generates a 128-d
vector for each of the 3 face images. For the 2 face images of the same person, we tweak the
neural network weights to make the vector closer via distance metric. Image credit: [32]
Face recognition via deep learning hinges on a technique called deep metric learning. If
you have any prior experience with deep learning you know that we typically train a network to
accept a single input image and output a classification/label for that image.
Deep metric learning is a bit different. Instead of trying to output a single label (or even the
coordinates/bounding box of objects in an image), we instead output a real-valued feature
vector. For the dlib facial recognition network (the library we’ll be using to perform face recognition),
the output feature vector is 128-d (i.e., a list of 128 real-valued numbers) used to quantify
the face.
Training the network is done using triplets (Figure 5.2). Here we provide three images to
the network:
• Two of the images are example faces of the same person.
• The third image is a random face from our dataset and is not the same person as the
other two images.
As an example, let’s again consider Figure 5.2 where we provided three images: one of
Chad Smith (drummer of the Red Hot Chili Peppers) and two of Will Ferrell (a famous actor).
Our network quantifies these faces, constructing a 128-d embedding (quantification) for each.
From there, the general idea is that we’ll tweak the weights of our neural network so that
the 128-d measurements of the two Will Ferrell images will be closer to each other and further
from the measurements of Chad Smith.
Our network architecture for face recognition is based on ResNet-34 from the Deep Residual
Learning for Image Recognition paper by He et al. [4], but with fewer layers and the number
of filters reduced by half. The network was trained by Davis King [33] on a dataset of ≈3
million images. On the Labeled Faces in the Wild (LFW) dataset, a popular benchmark for
face recognition/verification, the network compares favorably to other state-of-the-art methods,
reaching 99.38% accuracy.
Both Davis King (the creator of dlib) and Adam Geitgey (the author of the face_recognition
module [34] which wraps around dlib) have written detailed articles on how deep learning-
based facial recognition works [32, 35] — I would highly encourage you to read them if you
would like more details on how (1) these networks are trained and (2) how the networks pro-
duce 128-d embeddings used to quantify a face.
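As a small, self-contained illustration of the idea (the image paths here are placeholders), the face_recognition module can be used to compute two embeddings and measure how far apart they are:

import face_recognition
import numpy as np

# load two images, each assumed to contain exactly one face
imageA = face_recognition.load_image_file("person_a.jpg")
imageB = face_recognition.load_image_file("person_b.jpg")

# compute the 128-d embedding for the first face found in each image
encodingA = face_recognition.face_encodings(imageA)[0]
encodingB = face_recognition.face_encodings(imageB)[0]

# faces of the same person should lie close together in the 128-d space,
# so a small Euclidean distance implies a likely match
distance = np.linalg.norm(encodingA - encodingB)
print("distance: {:.4f}".format(distance))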
Before we can recognize faces, we first need to build a dataset of faces to recognize, such as
images of yourself, your family members, or your coworkers.
Exactly who is in the images doesn’t matter as much as having enough images per
person to obtain an accurate face recognition model.
Keep in mind that machine learning algorithms are not magic — you need to supply enough
example images for them to learn patterns that can discriminate between faces and accurately
recognize them. In general, there are three ways to build a face recognition dataset:
Figure 5.3: Using OpenCV and a webcam it’s possible to detect faces in a video stream and save
the examples to disk. This process can be used to create a face recognition dataset on premises.
The first method is appropriate when you are building an “on-site” face recognition system
and you need to have physical access to a particular person to gather example images of their
face. Such a system would be typical for companies, schools, or other organizations where
people need to physically show up and attend every day.
To gather example faces of these people, you may escort them to a special room where a
video camera is setup to (1) detect the (x, y)-coordinates of their face in a video stream and (2)
write the frames containing their face to disk (Figure 5.3). We may even perform this process
over multiple days or weeks to gather examples of their face in different lighting conditions,
times of day, and varying moods or emotional states.
To learn how to build a face recognition dataset using OpenCV and your webcam, you
can follow this PyImageSearch tutorial which includes detailed code: http://pyimg.co/v9evi [36].
However, there are cases where you may not have access to the physical person and/or they
are a public figure with a strong online presence — in those cases you can programmatically
download example images of their faces via APIs on varying platforms.
Exactly which API you choose here depends dramatically on the person you are attempting
to gather example face images of.
For example, if a person consistently posts on Twitter or Instagram, you may want to lever-
age one of their (or other) social media APIs to scrape face images.
Another option would be to leverage a search engine, such as Google or Bing. If you
decide you want to use a search engine to build a face recognition dataset, be sure to refer to the
following two guides:
Both of the guides linked to above will enable you to build a custom dataset using Google
and Bing, respectively.
The final method to creating your own face recognition dataset, and also the least desirable
one, is to manually find and save example face images yourself (Figure 5.4).
This method is obviously the most tedious and requires the most man-hours — typically we
would prefer a more “automatic” solution, but in some cases, you’ll need to resort to it. Using
this method you will need to manually inspect each online service where the person has photos (e.g., their social media profiles, personal website, or photo sharing accounts).
Then, for each service, you’ll need to manually save the images to disk.
In these types of scenarios the user often has a public profile of some sort but has signifi-
cantly fewer images to crawl programmatically.
Figure 5.4: Manually downloading face images to create a face recognition dataset is the least
desirable option but one that you should not forget about. Use this method if the person doesn’t
have (as large of) an online presence or if the images aren’t tagged.
Before we can recognize faces with our Raspberry Pi, we first need to quantify the faces in our
training set.
Keep in mind that we are not actually training a network here — the network has already
been trained to create 128-d embeddings on a dataset of ≈3 million images. We certainly
could train a network from scratch or even fine-tune the weights of an existing model, but that
is likely overkill for many projects. Furthermore, you would need a lot of images to train the
network from scratch.
Instead, it’s easier to use a pre-trained network and use it to extract the 128-d embeddings
for each face in our dataset (Figure 5.5).
We’ll then take the extracted embeddings and train a SVM classifier on top of them in
Section 5.4.4.
At this point I’ll assume you have either (1) chosen to use the example images included
with the source code associated with this text, or (2) created your own example image dataset
using the instructions detailed above.
In either case, your dataset should have the following directory structure:
Figure 5.5: Facial recognition via deep learning and Python using the face_recognition
module generates a 128-d real-valued feature vector per face.
|-- dataset
| |-- abhishek [5 entries]
| |-- adrian [5 entries]
| |-- dave [5 entries]
| |-- mcCartney [5 entries]
| |-- unknown [6 entries]
Notice how I have placed all example faces inside the dataset/ directory. Inside the
dataset/ directory there are subdirectories for each person that I want to recognize (i.e.,
each person has their own corresponding directory). Then, inside the subdirectory for each
person, I put example faces of that person, and that person alone.
For example, you should not place example faces of “Adrian” in the "Abhishek" directory (or
vice versa).
Using a directory structure such as the one proposed here forces organization and ensures
your images are organized on disk.
Now that our directory structure is organized, open up the encode_faces.py file for this
project and let’s get to work:
First, we need to import our required packages. Most notably, the face_recognition
package, a wrapper around the dlib library, provides us with the ability to easily and conve-
niently extract 128-d face embeddings from our image.
• --dataset : The path to our dataset of person names and example images.
• --encodings: Our face encodings are written to the file that this argument points to.
Now that we’ve defined our arguments, let’s grab the paths to our files in our --dataset
directory:
Line 22 uses the path to our input directory to build a list of all imagePaths contained
therein.
We also need to initialize two lists before our loop, knownEncodings and knownNames,
respectively. These two lists will contain the face encodings and corresponding names for each
person in the dataset (Lines 25 and 26).
On Line 29 we loop over each imagePath in the imagePaths list. For each image, we
extract the name of the person (i.e., subdirectory name), based on our assumption of how our
dataset is organized on disk, discussed earlier in this section.
We then load the current image from disk on Line 37. OpenCV orders channels in BGR
order; however, face_recognition and dlib expect RGB — we therefore swap from BGR
to RGB color channel ordering on Line 38.
Lines 42 and 43 find/localize the face in the image, resulting in a list of boxes. We pass
two parameters to the face_locations method, including:
• rgb: The RGB image/frame in which faces will be localized.
• model: Either cnn or hog, based on the value of the --detection-method command
line argument.
The CNN method is much more accurate but far slower than the HOG face detector. The
HOG face detector is faster but less accurate than the CNN. In either case, you should be
running this script on a laptop/desktop and not a Raspberry Pi (due to the computational
and memory requirements of the face detector and embedding network), so feel free to play
around with both face detectors.
Given the bounding boxes used to localize each face in the image, we pass them into the
face_encodings function which, internally (1) loops over each of the bounding box loca-
tions, and (2) quantifies the face and returns a 128-d vector used to represent the face. The
face_encodings function then returns a list of encodings, one 128-d vector per face de-
tected. We loop over each of the encodings on Line 49 and update the knownEncodings
and knownNames lists, respectively. The for loop back on Line 29 repeats this process for all
images in our dataset.
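To summarize the detection and encoding step just described, here is a minimal, self-contained sketch for a single image (the path and derived person name are illustrative; the chapter's script loops over the entire dataset):

import cv2
import face_recognition

knownEncodings = []
knownNames = []

# an illustrative image path; the person name is derived from its subdirectory
imagePath = "dataset/adrian/00000.png"
name = imagePath.split("/")[-2]

# load the image and convert from BGR (OpenCV ordering) to RGB (dlib ordering)
image = cv2.imread(imagePath)
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# detect the (top, right, bottom, left) box of each face, then compute a
# 128-d embedding for each detected face
boxes = face_recognition.face_locations(rgb, model="hog")
encodings = face_recognition.face_encodings(rgb, boxes)

# record each embedding along with the person's name
for encoding in encodings:
    knownEncodings.append(encoding)
    knownNames.append(name)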
The final step is to serialize our knownEncodings and knownNames to disk, enabling us
to train a face recognition model over top of them in Section 5.4.4:
From there, Lines 58-60 dump the names and encodings to disk so we can use them in the
following section.
To extract face embeddings for your own dataset, execute the following command:
$ cd face_recognition
$ python encode_faces.py \
--dataset ../datasets/face_recognition_dataset \
--encodings ../output/encodings.pickle
[INFO] quantifying faces...
[INFO] processing image 1/29
[INFO] processing image 2/29
[INFO] processing image 3/29
[INFO] processing image 4/29
[INFO] processing image 5/29
...
[INFO] processing image 25/29
[INFO] processing image 26/29
[INFO] processing image 27/29
[INFO] processing image 28/29
[INFO] processing image 29/29
[INFO] serializing encodings...
As the output demonstrates, I have successfully extracted embeddings for each of the 29
faces in my dataset. Looking at my directory structure you can also see that the
encodings.pickle file is now present on disk. The entire process took ≈10m30s using the CNN method on
my laptop CPU (i.e., no GPU was used here).
I also want to again reiterate that this script should only be run on your laptop/desk-
top. The face detection methods we’re using in this script, while accurate, are very slow on
the Raspberry Pi.
At this point we have extracted 128-d embeddings for each face — but how do we actually
recognize a person based on these embeddings?
The answer is that we need to train a “standard” machine learning model (such as SVM,
k-NN classifier, Random Forest, etc.) on top of the embeddings. We’ll be using a Linear SVM
here as the model is fast to train and can provide probabilities when making a prediction.
We import our packages and modules on Lines 2-5. We’ll be using scikit-learn’s implemen-
tation of SVM.
• --embeddings: The path to the serialized embeddings (we exported these embeddings
by running the encode_faces.py script in the previous section).
• --recognizer: The path to the trained SVM model that actually recognizes faces.
• --le: Our LabelEncoder file path. We’ll serialize our label encoder to disk so that we
can use both it and the recognizer model to perform face recognition.
Let’s load our facial embeddings from disk and encode our labels:
Here we load our embeddings from Section 5.4.3 on Line 19. We won’t be generating
any embeddings in this training script — we’ll use the embeddings previously generated and
serialized.
We then initialize our LabelEncoder and encode our labels (Lines 23 and 24).
26 # train the model used to accept the 128-d encodings of the face and
27 # then produce the actual face recognition
28 print("[INFO] training model...")
29 recognizer = SVC(C=1.0, kernel="linear", probability=True)
30 recognizer.fit(data["encodings"], labels)
On Line 29 we initialize our SVM with a linear kernel and then on Line 30 we train
the model. Again, we are using a Linear SVM as the model is fast to train and capable of
producing a probability for each prediction, but you can try experimenting with other machine
learning models if you wish.
After training the model we output the model and label encoder to disk as two pickle files:
Again, both the trained SVM (used to make predictions) and the label encoder (used to
convert SVM predictions to actual person names) are serialized to disk for use in Section 5.6.
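As a quick sanity check of the two serialized files (and a preview of how they will be used in Section 5.6), a minimal sketch like the following loads them back and predicts a name for a single 128-d encoding. The zero vector here is only a stand-in for a real face embedding:

import pickle
import numpy as np

# load the serialized SVM recognizer and label encoder from disk
recognizer = pickle.loads(open("output/recognizer.pickle", "rb").read())
le = pickle.loads(open("output/le.pickle", "rb").read())

# a stand-in 128-d embedding (in practice this comes from face_recognition)
encoding = np.zeros((128,))

# compute class probabilities, then grab the most likely name
preds = recognizer.predict_proba([encoding])[0]
j = np.argmax(preds)
print("{}: {:.2f}%".format(le.classes_[j], preds[j] * 100))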
Let’s go ahead and train the SVM now; however, make sure you run this command on
your laptop/desktop and not your Raspberry Pi:
As you can see, the SVM has been trained on the embeddings and both the (1) SVM itself
and (2) label encoder have been written to disk, enabling us to utilize them in Section 5.6.
Again, I strongly encourage you to run both the encode_faces.py and train_model.py
scripts on a laptop or desktop — between the face detector and face embedding neural net-
work, your RPi could easily run out of memory. I would therefore suggest you:
i. Run encode_faces.py on your laptop/desktop to extract the face embeddings.
ii. Run train_model.py, again on your laptop/desktop, to train the face recognizer.
iii. Then transfer the resulting files/models to your RPi via FTP, SFTP, USB thumb drive, emailing
them to yourself, etc.
From there you’ll be able to perform inference (i.e., make predictions) on the RPi where
there is sufficient computational horsepower to do so.
Refer to Chapter 14 to learn how to harness the power of the Movidius NCS for face recog-
nition on the Raspberry Pi.
We’ll be using Google’s TTS library, gTTS, to announce the name of a recognized person or
alert when an intruder is detected.
To make our driver script easier to follow, we’ll actually pre-record .mp3 files for each per-
son. This is not a requirement for our script but does make it easier to follow along.
Lines 2-6 import our required Python packages. The gTTS library on Line 3 will be used to
perform TTS and generate our resulting .mp3 files.
Lines 9-12 parse our command line arguments. We only need a single argument here,
--conf, the path to our configuration file, which is then loaded on Line 15.
We also load our label encoder (Line 16) which contains the names of each person our
face recognition model was trained on, including the “unknown” class.
Line 24 makes a check to see if we are currently examining the “unknown” (i.e., intruder)
class. If so, we record a special message for the intruder using the "intruder_sound" text
(Lines 27 and 28). Otherwise, we’re examining a legitimate user, so we use the
"welcome_sound" text (Lines 31-35).
Lines 38 and 39 then save the resulting .mp3 file to disk inside the "msgs_path" directory.
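For reference, generating a single message with gTTS boils down to just a couple of calls; a minimal sketch (with an illustrative message and output path) looks like this:

from gtts import gTTS

# generate a welcome message and save it as an .mp3 file
tts = gTTS(text="Welcome home, Adrian!", lang="en")
tts.save("messages/adrian.mp3")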
To build the .mp3 files for each person, run the create_voice_msgs.py script and then list the contents of the messages/ directory to verify they were created:
$ ls messages
abhishek.mp3 adrian.mp3 dave.mp3 mcCartney.mp3 unknown.mp3
Examining the messages/ directory you can see a total of four .mp3 files, one for each
person our face recognition model was trained on, along with an additional unknown.mp3 file
for the “unknown” class.
It’s time to put all the pieces together and build our complete face recognition pipeline. Open
up the door_monitor.py file and insert the following code:
10 import pickle
11 import signal
12 import time
13 import cv2
14 import sys
15 import os
16
17 # function to handle keyboard interrupt
18 def signal_handler(sig, frame):
19 print("[INFO] You pressed `ctrl + c`! Closing face recognition" \
20 " door monitor application...")
21 sys.exit(0)
22
23 # construct the argument parser and parse the arguments
24 ap = argparse.ArgumentParser()
25 ap.add_argument("-c", "--conf", required=True,
26 help="Path to the input configuration file")
27 args = vars(ap.parse_args())
Lines 2-16 handle importing our required Python packages. We have a number of im-
ports for this project, including our TwilioNotifier to send text message notifications, the
VideoStream to access our webcam/RPi camera module via OpenCV, and
face_recognition to facilitate face recognition.
Lines 24-27 parse our command line arguments. The only argument we need is --conf,
the path to our configuration file.
Line 31 loads our configuration file while Line 32 instantiates the TwilioNotifier used
to send text message notifications.
We then load our face recognition models on Lines 36-38, including the trained face recog-
nition model (SVM), label/name encoder, and Haar cascade for face detection, respectively.
Line 41 then instantiates the MOG background subtractor. Background subtraction is com-
putationally efficient, so we’ll apply motion detection to each frame of the video stream rather
than applying a more computationally expensive face detector to every frame. Provided suffi-
cient motion has taken place, we’ll then trigger the face detection and recognition models.
42 # initialize the frame area and boolean used to determine if the door
43 # is open or closed
44 frameArea = None
45 doorOpen = False
46
47 # initialize previous and current person name to None, then set the
48 # consecutive recognition count to zero
49 prevPerson = None
50 curPerson = None
51 consecCount = 0
52
53 # initialize the skip frames boolean and skip frame counter
54 skipFrames = False
55 skipFrameCount = 0
Line 44 initializes frameArea, which will hold the area of the frame (i.e., width x height)
once the frame’s spatial dimensions are known.
The doorOpen boolean (Line 45) is used to indicate whether a door is open or closed
based on the results of our background subtraction algorithm. We’ll only be applying face
detection and recognition if doorOpen is equal to True, thereby saving precious CPU cycles.
Line 49 initializes the name of the previous person identified while Line 50 initializes the
current person. Recall from our config.json file that we need at least "consec_frames"
frames where prevPerson equals curPerson to indicate a successful recognition — the
actual consecutive count is initialized on Line 51.
Lines 54 and 55 indicate whether or not we are performing skip-frames, and if so, how
many frames have been skipped.
With our initializations out of the way, let’s access our VideoStream and allow the camera
sensor to warmup:
Line 75 then makes a check to see if (1) we are performing skip frames and if so, (2) how
many frames have been skipped so far.
In the event that we are still under "n_skip_frames" we increment our skipFrameCount
and continue looping.
Otherwise, we have reached our "n_skip_frames" count (Line 83) and need to reset our
skipFrames boolean, skipFrameCount integer, and re-instantiate our background subtrac-
tor.
Line 89 resizes our frame to have a width of 500px while Lines 92 and 93 grab the spatial
dimensions after resizing.
Next, let’s make a check to see if our door is closed, implying that we are currently performing
background subtraction only:
Line 97 checks to see if the door is closed (i.e., not doorOpen), and if so, converts the
frame to grayscale and smooths it (Lines 100 and 101).
We then apply background subtraction to the gray frame and detect contours in the binary
mask. Provided at least one contour is found (Line 112) we sort the contours by their area,
largest to smallest, and grab the largest one (Line 115).
Line 120 makes a check to see if the area of the contour is greater than "threshold"
percent of the frame area (i.e., width x height). Provided the area is sufficiently large we know
that motion has occurred, in which case we mark doorOpen as True and grab the current
timestamp.
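A self-contained sketch of that motion check is shown below. Note that MOG2 (available in core OpenCV) is used here as a stand-in background subtractor, and the 12% threshold mirrors the configuration discussed earlier:

import cv2
import imutils

# a stand-in background subtractor and the fraction of the frame that must
# contain motion before we consider the door "open"
mog = cv2.createBackgroundSubtractorMOG2()
threshold = 0.12

vs = cv2.VideoCapture(0)

while True:
    (grabbed, frame) = vs.read()
    if not grabbed:
        break

    # resize, convert to grayscale, and blur the frame
    frame = imutils.resize(frame, width=500)
    (H, W) = frame.shape[:2]
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (13, 13), 0)

    # apply background subtraction and find contours in the binary mask
    mask = mog.apply(gray)
    mask = cv2.threshold(mask, 128, 255, cv2.THRESH_BINARY)[1]
    cnts = cv2.findContours(mask.copy(), cv2.RETR_EXTERNAL,
        cv2.CHAIN_APPROX_SIMPLE)
    cnts = imutils.grab_contours(cnts)

    # if the largest contour occupies enough of the frame, report motion
    if len(cnts) > 0:
        c = max(cnts, key=cv2.contourArea)
        if cv2.contourArea(c) > threshold * W * H:
            print("[INFO] sufficient motion detected -- door likely opened")

vs.release()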
Our next code block handles what happens when motion has occurred:
Line 128 checks to see if motion has occurred, and provided it has, we compute the delta
time difference between the current time and the startTime (Line 131).
Line 134 then starts an if statement that will perform face recognition for a total of
"look_for_a_face" seconds.
We start by converting the frame to grayscale (for face detection) and then changing the
channel ordering from BGR to RGB (for face recognition).
Lines 142 and 143 detect faces in the image while Line 148 reorders the (x, y)-coordinates
of the bounding boxes for dlib.
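A minimal, standalone sketch of that detection and reordering step (using the project's Haar cascade and an illustrative image path) looks like this:

import cv2

# load the Haar cascade face detector and an example frame
detector = cv2.CascadeClassifier("cascade/haarcascade_frontalface_default.xml")
image = cv2.imread("example.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# detect faces -- the cascade returns (x, y, w, h) bounding boxes
rects = detector.detectMultiScale(gray, scaleFactor=1.1,
    minNeighbors=5, minSize=(30, 30))

# reorder the boxes into the (top, right, bottom, left) format dlib expects
boxes = [(y, x + w, y + h, x) for (x, y, w, h) in rects]
print(boxes)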
From here, we need to ensure that at least one face was detected:
Provided at least one face was detected (Line 151) we then use the face_encodings
function to compute 128-d embeddings for each face. The encodings are then passed into
our face recognition model and the label with the largest corresponding probability is extracted
(Lines 156-158).
We then use OpenCV to draw a bounding box surrounding the face along with the name of
the person on the frame (Lines 162-167).
Line 171 checks to see if the previous identification matches the current identification, and
if so, increments the consecCount. Otherwise, we reset the consecCount as the identified
names do not match between consecutive frames (Lines 176 and 177).
Line 181 then sets the prevPerson to the curPerson for the next iteration of the loop.
Let’s now check to see if we should apply an audio message and/or alert the home owner:
Line 186 checks to see if the consecCount meets the minimum number of required frames
("consec_frames") for a person identification to be “positive”. Provided it has, we use the
mpg321 command line tool to play the corresponding audio message for the user (Lines 189-
191).
If the identified person is “unknown” we’ll also send a text message alert, including the
frame, to the home owner (Lines 194-196).
Remark. Ensure that your Raspberry Pi has the mpg321 apt-get package installed and can
play .mp3 files. The package is installed by default on the Raspberry Pi .img that accompanies
this book.
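For reference, playing one of the pre-generated messages from Python is as simple as shelling out to mpg321 (the path here is illustrative):

import os

# play the recognized person's pre-generated .mp3 message via mpg321
os.system("mpg321 messages/adrian.mp3")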
We then reset our doorOpen boolean and indicate that we should start skipping frames,
ensuring that the user is not accidentally “re-identified” by our system (Lines 201 and 202).
The else block beginning on Line 205 closes the if delta <= conf["look_for_a_face"]:
statement on Line 134. If a face could not be detected within the specified number of seconds
then we’ll assume the motion that was detected is not something we are interested in (such as
a house pet passing by the camera).
Our final code block handles displaying our frame to the screen, but only if the display
configuration is set:
If your goal is to run the face recognition system as a background process then you can
leave "display" set to false. While it may not seem like it, I/O operations such as cv2.imshow
can cause latency which in turn slows down your FPS throughput rate. When possible, leave
out calls to cv2.imshow to improve your pipeline speed.
Additionally, be sure to refer to Chapter 23 of the Hobbyist Bundle to learn about bench-
marking and profiling your computer vision scripts.
Figure 5.6: Our camera has been mounted with a clear view of the back door and just above eye
level so that it can perform face recognition.
Wow! We put in a lot of work in this chapter to build our face recognition pipeline! Take a
second to congratulate yourself on all that you’ve accomplished so far.
But now it’s time for the fun part — running the face recognition system.
Directory: messages
Playing MPEG stream from dave.mp3 ...
MPEG 2.0 layer III, 32 kbit/s, 24000 Hz mono
ALSA lib pcm.c:2565:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.front
Figure 5.6 shows the positioning of the camera relative to the doorway which is about 10ft
away (not pictured).
In the top-left our Raspberry Pi detects motion from the front door opening, triggering face
detection to be applied (top-right). The face is marked as "unknown" until our face recognition
algorithm runs on the detected face.
The bottom-left shows that my face is correctly recognized as "Adrian" while the bottom-
right indicates that an "intruder" has entered the house.
When testing the fully integrated door monitor aspect of this chapter, we found that lighting
makes a big difference for two aspects of the system.
Recall from working through the Hobbyist Bundle computer vision scripts that lighting can
make or break a computer vision application. You may realize that you need to make some
additional tweaks for your home, or you might just need more training data. Be sure to review
the next section for ideas on how to make the system reliable in most conditions.
Figure 5.7: Door monitoring security with face recognition. Top-left: Our RPi detects that motion
has occurred (i.e., the door has opened) and starts applying face detection. Top-right: Once a
face is detected it’s marked as "unknown" until the face recognition verifies the face. Bottom-left:
Correctly recognizing my face as "Adrian". Bottom-right: The face recognition system marking
the person as an "intruder" as they are not a member of the household.
5.8 Room for Improvement

Facial recognition using controlled lighting conditions is great when you’re hacking, experi-
menting, or researching. When you take the plunge from image processing to computer vision,
environmental factors come into play. The number one environmental factor we are concerned
with is light.
In this section we’ll review tips and suggestions for making light your friend and not your foe.
At the bottom of the section, there is a reference to an additional section of this book which
has extra tips on achieving higher accuracy (i.e. reducing false positives and false negatives)
when performing face recognition.
First, there is the lighting from the door. The motion detection via background subtraction
uses a change in light to detect when the door is open or closed. Sounds perfect, right? In
theory, yes, but consider the camera’s point of view. The camera is aimed directly at the door.
If it is semi-dark inside, and extremely bright outside (such as when the sun is shining), it is
possible for the camera sensor to become flooded with light. This light makes the doorway and
other nearby areas of the frame experience washout. As for the face in front of all the light?
It might experience reflections or it might just appear as a dark blob with no facial features
making it hard to recognize. The doorOpen boolean coupled with the "look_for_a_face"
timeout helps to ensure we give the camera’s “auto white balance” feature time to compensate.
Depending on your system’s physical environment, you may need to work on some hacks in
that section of the code.
The second lighting issue is related to, but separate from, the first one. If you’re like me,
you probably enter your house sometimes during the day and sometimes at night. Sometimes,
your entryway light bulb will be left on. Other times not. Your face training data should be
captured for occupants of your house under various circumstances. When you capture
images (using the method from Section 5.4.2.1), it would be best if you capture them under
multiple natural and artificial lighting conditions. Take your face photos at different times of
the day. Take photos with the ceiling or lamp lights on and off. Capture photos with the
blinds/curtains opened and closed. You get the idea.
An easy solution is to just mount your camera and log face images by the doorway
for a period of a week knowing that different lighting will be present over the course of
this time. Once you’ve logged the data, you can manually sort it and then follow the training
procedure detailed in Section 5.4 prior to deploying your system.
Another idea is to add an IoT light to the room and send commands to it to control the lighting
from your RPi. Internet of Things lights come in the form of smart plugs, smart switches, and
smart bulbs. You can control a light from your RPi with simple APIs. In Chapter 15, we’ll use
an inexpensive smart plug to turn on a lamp near the door. This serves two purposes. First it
serves to illuminate the face and room where facial recognition is to be performed. Secondly,
a light that automatically turns on in the house is usually a deterrent to intruders/thieves.
For additional face recognition tips, be sure to refer to Section 14.3 of the “Fast, Effi-
cient Face Recognition with the Movidius NCS” chapter. In that chapter as a whole, you’ll
learn how to use the Movidius NCS coprocessor to speed up your face recognition pipelines.
5.9 Summary
In this chapter you learned how to build a complete, end-to-end face recognition system, including:
• Gathering your own dataset of example faces
• Extracting a 128-d embedding for each face using deep metric learning
• Training a Support Vector Machine (SVM) on top of the extracted embeddings
• Deploying the trained models to the Raspberry Pi to recognize faces in a live video stream
We created our face recognition system in the context of building a “front door monitor”,
capable of monitoring our home for intruders; however, you can use this code as a template for
whatever facial recognition applications you can imagine.
For added security and working in dark environments, I would suggest extending this project
by incorporating IoT components, including SMS notifications or smart bulbs. Refer to Chapter
15 for more details on building such a project.
Chapter 6

Building a Smart Attendance System
In this chapter you will learn how to build a smart attendance system used in school and
classroom applications.
Using this system, you, or a teacher/school administrator, can take attendance for a class-
room using only face recognition — no manual intervention of the instructor is required.
To build such an application, we’ll be using computer vision algorithms and concepts we’ve
learned throughout the text, including accessing our video stream, detecting faces, extract-
ing face embeddings, training a face recognizer, and then finally putting all the components
together to create the final smart classroom application.
Since this chapter references techniques used in so many previous chapters, I highly rec-
ommend that you read all preceding chapters in the book before you proceed.
i. Learn about smart attendance applications and why they are useful
Figure 6.1: You can envision your attendance system being placed near where students enter the
classroom at the front doorway. You will need a screen with adjacent camera along with a speaker
for audible alerts.
We’ll start this section with a brief review of what a smart attendance application is and why
we may want to implement our own smart attendance system.
From there we’ll review the directory structure for the project and review the configuration
file.
The goal of a smart attendance system is to automatically recognize students and take atten-
dance without the instructor having to manually intervene. Freeing the instructor from
having to take attendance gives the teacher more time to interact with the students and do
what they do best — teach rather than administer.
An example of a working smart attendance system can be seen in Figures 6.1 and 6.2.
Notice how as the student walks into a classroom they are automatically recognized. This
positive recognition is then logged to a database, marking the student as “present” for the
given session.
Figure 6.2: An example of a smart attendance system in action. Face detection is performed to
detect the presence of a face. Next, the detected face is recognized. Once the person is identified
the student is logged as "present" in the database.
We’ll be building our own smart attendance system in the remainder of this chapter. The
application will have multiple steps and components, each detailed below:
i. Step #1: Database initialization (needs to be performed only once, via initialize_database.py)
ii. Step #2: Face enrollment (needs to be performed for each student in the class)
iii. Step #3: Train the face recognition model (needs to be performed once, and then again
if a student is ever enrolled or un-enrolled).
Before we start implementing these steps, let’s first review our directory structure for the
project.
Let’s go ahead and review our directory structure for the project:
|-- config
| |-- config.json
|-- database
| |-- attendance.json
|-- dataset
| |-- pyimagesearch_gurus
| |-- S1901
| | |-- 00000.png
...
| | |-- 00009.png
| |-- S1902
| |-- 00000.png
...
| |-- 00009.png
|-- output
| |-- encodings.pickle
| |-- le.pickle
| |-- recognizer.pickle
|-- pyimagesearch
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
|-- initialize_database.py
|-- enroll.py
|-- unenroll.py
|-- encode_faces.py
|-- train_model.py
|-- attendance.py
The config/ directory will store our config.json configurations for the project.
The database/ directory will store our attendance.json file which is the serialized
JSON output from TinyDB, the database we’ll be using for this project.
The dataset/ directory (not to be confused with the database/ folder) will store all ex-
ample faces of each student captured via the enroll.py script.
We’ll then train a face recognition model on these captured faces via both encode_faces.py
and train_model.py — the output of these scripts will be stored in the output/ directory.
Our pyimagesearch module is quite simple, requiring only our Conf class used to load
our configuration file from disk.
Before building our smart attendance system we must first initialize our database via the
initialize_database.py script.
Once we have captured example faces for each student, extracted face embeddings, and
then trained our face recognition model, we can use attendance.py to take attendance. The
attendance.py script is the final script meant to be run in the actual classroom. It takes
all of the individual components implemented in the project and combines them into the final smart
attendance system.
If a student ever needs to leave the class (such as them dropping out of the course), we can
run unenroll.py.
1 {
2 // text to speech engine language and speech rate
3 "language": "english-us",
4 "rate": 175,
5
6 // path to the dataset directory
7 "dataset_path": "../datasets/face_recognition_dataset",
8
9 // school/university code for the class
10 "class": "pyimagesearch_gurus",
11
12 // timing of the class
13 "timing": "14:05",
Lines 3 and 4 define the language and rate of speech for our Text-to-Speech (TTS) engine.
We’ll be using the pyttsx3 library in this project — if you need to change the "language"
you can refer to the documentation for your specific language value [37].
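As a quick, standalone sketch of how these two values map onto pyttsx3 (the exact voice ID depends on your TTS backend, so treat "english-us" here as an assumption that matches espeak on the RPi):

import pyttsx3

# initialize the TTS engine and apply the configured speech rate and voice
engine = pyttsx3.init()
engine.setProperty("rate", 175)
engine.setProperty("voice", "english-us")

# speak a test message and block until playback finishes
engine.say("Good morning, and welcome to PyImageSearch Gurus.")
engine.runAndWait()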
The "dataset_path" points to the "dataset" directory. Inside this directory we’ll store
all captured face ROIs via the enroll.py script. We’ll then later read all images from this
directory and train a face recognition model on top of the faces via encode_faces.py and
train_model.py.
Line 10 sets the "class" value which, as the name suggests, is the title of the class.
We’re calling this class "pyimagesearch_gurus".
We then have the "timing" of the class (Line 13). This value is the time of day the class
actually starts. Our attendance.py script will monitor the time and ensure that attendance
can only be taken in an N-second window once class actually starts.
Our next set of configurations is for face detection and face recognition:
17 "n_face_detection": 10,
18
19 // number of images required per person in the dataset
20 "face_count": 10,
21
22 // maximum time limit (in seconds) to take the attendance once
23 // the class has started
24 "max_time_limit": 300,
25
26 // number of consecutive recognitions to be made to mark a person
27 // recognized
28 "consec_count": 3,
The "n_face_detection" value controls the number of subsequent frames where a face
must be detected before we save the face ROI to disk. Enforcing at least ten consecutive
frames with a face detected prior to saving the face ROI to disk helps reduce false-positive
detections.
The "face_count" parameter sets the minimum number of face examples per student.
Here we are requiring that we capture ten total face examples per student.
We then have "max_time_limit" — this value sets the maximum time limit (in seconds)
to take attendance for once the class has started. Here we have a value of 300 seconds (five
minutes). Once class starts, the students have a total of five minutes to make their way to the
classroom, verify that they are present with our smart attendance system, and take their seats
(otherwise they will be marked as “absent”).
Line 31 defines the "db_path" which is the output path to our serialized TinyDB file.
Finally, Line 39 sets our "detection_method". We’ll be using the HOG + Linear SVM
detector from the dlib and face_recognition libraries. Haar cascades would be faster than
HOG + Linear SVM, but less accurate. Similarly, deep learning-based face detectors would
be much more accurate but far too slow to run in real-time (especially since we’re not only
performing face detection but face recognition as well).
If you are using a co-processor such as the Movidius NCS or Google Coral USB Accelerator
I would suggest using the deep learning face detector (as it will be more accurate), but if you’re
using just the Raspberry Pi CPU, stick with either HOG + Linear SVM or Haar cascades as
these methods are fast enough to run in (near) real-time on the RPi.
Before we can enroll faces in our system and take attendance, we first need to initialize the
database that will store information on the class (name of the class, date/time class starts,
etc.) and the students (student ID, name, etc.).
We’ll be using TinyDB [38] for all database operations. TinyDB is small, efficient, and imple-
mented in pure Python. We’re using TinyDB for this project as it allows for database operations
to “get out of the way”, ensuring we can focus on implementing the actual computer vision algo-
rithms rather than CRUD (Create, Read, Update, Delete) operations.
For example, the following code snippet loads a serialized database from disk, inserts a
record for a student, and then demonstrates how to query for that record:
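A minimal sketch of such a snippet, using the database path and the record format that appear later in this section, looks like this:
from tinydb import TinyDB, where

# load (or create) the serialized JSON database from disk
db = TinyDB("database/attendance.json")

# insert a record for a student into the "student" table
studentTable = db.table("student")
studentTable.insert({"S1901": ["Adrian", "enrolled"]})

# query the table for the record we just inserted
print(studentTable.search(where("S1901").exists()))
# => [{'S1901': ['Adrian', 'enrolled']}]

# close the database, serializing it back out to disk
db.close()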
As you can see, TinyDB allows us to focus less on the actual database code and more on
the embedded computer vision/deep learning concepts (which is what this book is about, after
all).
If you do not already have the tinydb Python package installed on your system, you can
install it via the following command:
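$ pip install tinydb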
I would recommend you stick with TinyDB to build your own proof-of-concept smart attendance
system. Once you’re happy with the system you can then try more advanced, feature-rich
databases including MySQL, PostgreSQL, MongoDB, Firebase, etc.
1 {
2 "_default": {
3 "1": {
4 "class": "pyimagesearch_gurus"
5 }
6 },
7 "attendance": {
8 "2": {
9 "2019-11-13": {
10 "S1901": "08:01:15",
11 "S1903": "08:03:41",
12 "S1905": "08:04:19"
13 }
14 },
15 "1": {
16 "2019-11-14": {
17 "S1904": "08:02:22",
18 "S1902": "08:02:49",
19 "S1901": "08:04:27"
20 }
21 }
22 },
23 "student": {
24 "1": {
25 "S1901": [
26 "Adrian",
27 "enrolled"
28 ]
29 },
30 "2": {
31 "S1902": [
32 "David",
33 "enrolled"
34 ]
35 },
36 "3": {
37 "S1903": [
38 "Dave",
39 "enrolled"
40 ]
41 },
42 "4": {
43 "S1904": [
44 "Abhishek",
45 "enrolled"
46 ]
47 },
48 "5": {
49 "S1905": [
50 "Sayak",
51 "enrolled"
52 ]
53 }
54 }
55 }
The class key (Line 4) contains the name of the class where our smart attendance system
will be running. Here you can see that the name of the class, for this example, is “pyimage-
search_gurus”.
The student dictionary (Line 23) stores information for all students in the database. Each
student must have a name and a status flag used to indicate if they are enrolled or un-enrolled
in a given class. The actual student ID can be whatever you want, but I’ve chosen the format:
• S: Indicating “student”
• 19: The year of enrollment (2019 in this example)
• 01: An incrementing counter for each registered student
The next student registered would then be S1902, and so on. You can choose to keep this
same ID structure or define your own — the actual ID is entirely arbitrary provided that the
ID is unique.
Finally, the attendance dictionary stores the attendance record for each session of the
class. For each session, we store both (1) the student ID for each student who attended the
class, along with (2) the timestamp of when each student was successfully recognized. By
recording both of these values we can then determine which students attended a class and
whether or not they were late for class.
Keep in mind this database schema is meant to be the bare minimum of what’s required to
build a smart attendance application. Feel free to add in additional information on the student,
including age, address, emergency contact, etc.
Secondly, we’re using TinyDB here for simplicity. When building your own smart attendance
application you may wish to use another database — I’ll leave that as an exercise for you to
implement, as the point of this text is to focus on computer vision algorithms rather than
database operations.
Our first script, initialize_database.py, is a utility script used to create our initial TinyDB
database. This script only needs to be executed once but it has to be executed before you
start enrolling faces.
16 db = TinyDB(conf["db_path"])
17
18 # insert the details regarding the class
19 print("[INFO] initializing the database...")
20 db.insert({"class": conf["class"]})
21 print("[INFO] database initialized...")
22
23 # close the database
24 db.close()
Lines 2-4 import our required Python packages. Line 3 imports the TinyDB class used to
interface with our database.
Our only command line argument, --conf, is parsed on Lines 7-10. We then load the
conf on Line 13 and use the "db_path" value to initialize the TinyDB instance.
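For reference, here is a minimal sketch of those earlier lines (the exact import path for Conf is an assumption based on the pyimagesearch/utils module in our project structure):
# sketch of the head of initialize_database.py
from pyimagesearch.utils import Conf
from tinydb import TinyDB
import argparse

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-c", "--conf", required=True,
    help="Path to the input configuration file")
args = vars(ap.parse_args())

# load the configuration file
conf = Conf(args["conf"])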
Once we have the db object instantiated we use the insert method to add data to the
database. Here we are adding the class name from the configuration file. We need to insert
a class so that students can be enrolled in the class in Section 6.4.2.
Finally, we close the database on Line 24 which triggers the TinyDB library to serialize the
database back out to disk as a JSON file.
If you check the database/ directory you’ll now see a file named attendance.json:
$ ls database/
attendance.json
The attendance.json file is our actual TinyDB database. The TinyDB library will read,
manipulate, and save the data inside this file.
Now that we have our database initialized we can move on to face enrollment and un-enrollment.
During enrollment a student will stand in front of a camera. Our system will access the
camera, perform face detection, extract the ROI of the face and then serialize the ROI to disk.
In Section 6.5 we’ll take these ROIs, extract face embeddings, and then train a SVM on top of
the embeddings.
Before we can recognize students in a classroom we first need to “enroll” them in our face
recognition system. Enrollment is a two step process:
i. Step #1: Capture faces of each individual and record them in our database (covered in
this section).
ii. Step #2: Train a machine learning model to recognize each individual (covered in Section
6.5).
The first phase of face enrollment will be accomplished via the enroll.py script. Open up
that script now and insert the following code:
Lines 2-12 handle importing our required Python packages. The tinydb imports on Lines
4 and 5 will interface with our database. The where function will be used to perform SQL-
like “where” clauses to search our database. Line 6 imports the face_recognition library
which will be used to facilitate face detection (in this section) and face recognition (in Section
6.6). The pyttsx3 import is our Text-to-Speech (TTS) library. We’ll be using this package
whenever we need to generate speech and play it through our speakers.
Let’s use TinyDB and query if a student with the given --id already exists in our database:
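A minimal sketch of this lookup, assuming --id has already been parsed into args and the configuration loaded into conf, and using the student table name from our schema:
from tinydb import TinyDB, where

# load the TinyDB database and grab a reference to the "student" table
db = TinyDB(conf["db_path"])
studentTable = db.table("student")

# search for any existing record keyed by the supplied student ID
student = studentTable.search(where(args["id"]).exists())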
Using this configuration we then load the TinyDB and grab a reference to the student table
(Line 29). The where method is used to search the studentTable for all records which have
the supplied --id.
If there are no existing records with the supplied ID then we know the student has not been
enrolled yet:
34 # check if an entry for the student id does *not* exist, if so, then
35 # enroll the student
36 if len(student) == 0:
37 # initialize the video stream and allow the camera sensor to warmup
38 print("[INFO] warming up camera...")
39 # vs = VideoStream(src=0).start()
40 vs = VideoStream(usePiCamera=True).start()
41 time.sleep(2.0)
42
43 # initialize the number of face detections and the total number
44 # of images saved to disk
45 faceCount = 0
46 total = 0
47
48 # initialize the text-to-speech engine, set the speech language, and
49 # the speech rate
50 ttsEngine = pyttsx3.init()
51 ttsEngine.setProperty("voice", conf["language"])
52 ttsEngine.setProperty("rate", conf["rate"])
53
54 # ask the student to stand in front of the camera
55 ttsEngine.say("{} please stand in front of the camera until you" \
56 "receive further instructions".format(args["name"]))
57 ttsEngine.runAndWait()
Line 36 makes a check to ensure that no existing students have the same --id we are
using. Provided that check passes we access our video stream and initialize two integers:
• faceCount: The number of consecutive frames in which a face has been detected.
• total: The total number of face images saved to disk for the current student.
Lines 50-52 initialize the TTS engine by setting the speech language and the speech rate.
We then instruct the student (via the TTS engine) to stand in front of the camera (Lines
55-57). With the student now in front of the camera we can capture faces of the individual:
Line 60 sets our current "status" to "detecting". Later this status will be updated to
saving once we start writing example face ROIs to disk.
We then start looping over frames of our video stream on Line 67. We preprocess the
frame by resizing it to have a width of 400px (for faster processing) and then horizontally
flipping it (to remove the mirror effect).
We use the face_recognition library to perform face detection via the HOG + Linear SVM
method. The face_locations function returns a list of tuples, one per detected face, each
containing the top, right, bottom, and left coordinates of the face’s bounding box in the image.
On Line 85 we loop over the detected boxes and use the cv2.rectangle function to
draw the bounding box of the face.
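A minimal sketch of this detection-and-drawing step (the rgb and frame variable names are assumed from the description above; face_recognition and cv2 are imported at the top of enroll.py):
# detect faces using the configured method (HOG + Linear SVM)
boxes = face_recognition.face_locations(rgb,
    model=conf["detection_method"])

# loop over the (top, right, bottom, left) tuples and draw each bounding box
for (top, right, bottom, left) in boxes:
    cv2.rectangle(frame, (left, top), (right, bottom), (0, 255, 0), 2)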
Line 92 makes a check to see if the faceCount is still below the number of required
consecutive frames with a face detected (used to reduce false-positive detections). If our
faceCount is below the threshold we increment the counter and continue looping.
Once we have reached the threshold we derive the path to the output face ROI (Lines 101
and 102) and then write the face ROI to disk (Line 103). We then increment our total face
ROI count and update the status.
We can then draw the status on the frame and visualize it on our screen:
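A minimal sketch of the overlay and display step (the status variable holds either "detecting" or "saving", as described above):
# draw the current status on the frame and show it on screen
cv2.putText(frame, "Status: {}".format(status), (10, 25),
    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 2)
cv2.imshow("Frame", frame)
key = cv2.waitKey(1) & 0xFF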
If our total reaches the maximum number of face_count images needed to train our
face recognition model (Line 119), we use the TTS engine to tell the user enrollment for them
is now complete (Lines 121-123). The student is then inserted into the TinyDB, including the
ID, name, and enrollment status.
The else statement on Line 136 closes the if statement back on Line 36. As a reminder,
this if statement checks to see if the student has already been enrolled — the else statement
therefore catches if the student is already in the database and trying to enroll again. If that
case happens we simply inform the user that they have already been enrolled and skip any
face detection and localization.
To enroll faces in our database, open up a terminal and execute the following command:
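Assuming the --id, --name, and --conf switches referenced throughout this section, the invocation looks something like this (the example ID and name match the S1902/David record from our database):
$ python enroll.py --id S1902 --name David --conf config/config.json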
Figure 6.3 (left) shows the “face detection” status. During this phase our enrollment software
is running face detection on each and every frame. Once we have reached a sufficient number
of consecutive frames detected with a face, we change the status to “saving” (right) and begin
saving face images to disk. After we reach the required number of face images an audio
message is played over the speaker and a notification is printed in the terminal.
$ ls dataset/pyimagesearch_gurus/S1902
00000.png 00002.png 00004.png 00006.png 00008.png
00001.png 00003.png 00005.png 00007.png 00009.png
You can repeat the process of face enrollment via enroll.py for each student that is
registered to the class.
Once you have face images for each student we can then train a face recognition model in
Section 6.5.
Figure 6.3: Step #2: Enrolling faces in our attendance system via enroll.py.
If a student decides to drop out of the class we need to un-enroll them from both our (1)
database and (2) face recognition model. To accomplish both these tasks we’ll be using the
unenroll.py script — open up that file now and insert the following code:
Lines 2-7 import our required Python packages while Lines 10-15 parse our command line
arguments. We need two command line arguments here, --id, the ID of the student we are
un-enrolling, and --conf, the path to our configuration file.
We can then load the Conf and access the student table:
Once we find the row we update the enrollment status to be “unenrolled”. We then delete
the student’s face images from our dataset directory (Lines 31 and 32). The db is then
serialized back out to disk on Line 37.
Let’s go ahead and un-enroll the “Dave” student that we enrolled from Section 6.4.2:
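Using the --id and --conf switches described above (Dave is student S1903 in our database):
$ python unenroll.py --id S1903 --conf config/config.json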
You can use this script whenever you need to un-enroll a student from a database, but
before you continue on to the next section, make sure you use the enroll.py script
to register at least two students in the database (as we need at least two students in the
database to train our model).
Once you have done so you can move on to training the actual face recognition component
of the smart attendance system.
Now that we have example images for each student in the class we can move on to training
the face recognition component of the project.
The encode_faces.py script we’ll be reviewing in this section is essentially identical to the
script covered in Chapter 5. We’ll still review the file here as a matter of completeness, but
make sure you refer to Chapter 5 for more details on how this script works.
Lines 2-8 import our required Python packages. The face_recognition library, in conjunction
with dlib, will be used to quantify each of the faces in our dataset/ directory.
We then parse the path to our --conf file on Lines 11-14. The configuration itself is
loaded on Line 17.
Lines 21 and 22 grab the paths to all images inside the dataset/ directory. We then initialize
two lists, one to store the quantifications of each face and a second to store the
actual names of each face (Lines 25 and 26).
Line 33 extracts the name of the student from the imagePath. In this case the name is the
ID of the student.
Lines 37 and 38 read our input image from disk and convert it from BGR to RGB channel
ordering (the channel ordering that the face_recognition library expects when performing
face quantification).
A call to the face_encodings method uses a neural network to compute a list of 128 float-
ing point values used to quantify the face in the image. We then update our knownEncodings
with each encoding and the knownNames list with the name of the person.
54 f.write(pickle.dumps(data))
55 f.close()
Again, for more details on how the face encoding process works, refer to Chapter 5.
To quantify each student face in the dataset/ directory, open up a terminal and execute the
following command:
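Assuming the single --conf switch parsed by the script:
$ python encode_faces.py --conf config/config.json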
Moving on to train_model.py, Lines 9-12 parse the --conf switch. We then load the associated Conf file on Line 15.
Line 19 loads the serialized data from disk. The data includes both the (1) 128-d quantifi-
cations for each face, and (2) the names of each respective individual. We take the names and
then pass them through a LabelEncoder, ensuring that each name (string) is represented by
a unique integer.
26 # train the model used to accept the 128-d encodings of the face and
27 # then produce the actual face recognition
28 print("[INFO] training model...")
29 recognizer = SVC(C=1.0, kernel="linear", probability=True)
30 recognizer.fit(data["encodings"], labels)
31
32 # write the actual face recognition model to disk
33 print("[INFO] writing the model to disk...")
34 f = open(conf["recognizer_path"], "wb")
35 f.write(pickle.dumps(recognizer))
36 f.close()
37
38 # write the label encoder to disk
39 f = open(conf["le_path"], "wb")
40 f.write(pickle.dumps(le))
41 f.close()
After training is complete we serialize both the face recognizer model and the LabelEncoder
to disk.
Again, for more details on this script and how it works, make sure you refer to Chapter 5.
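To train the model, run train_model.py (assuming the same --conf pattern as the other scripts in this project):
$ python train_model.py --conf config/config.json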
Training should only take a few minutes. After training is complete you should have two
new files in your output/ directory, recognizer.pickle and le.pickle, in addition to
encodings.pickle from previously:
$ ls output
encodings.pickle le.pickle recognizer.pickle
The recognizer.pickle file is your actual trained SVM. The SVM model will be used to accept
the 128-d face encoding inputs and then predict the probability of the student based on the face
quantification.
We then take the prediction with the highest probability and pass it through our serialized
LabelEncoder (i.e., le.pickle) to convert the prediction to a human-readable name (i.e.,
the unique ID of the student).
We now have all the pieces of the puzzle — it’s time to assemble them and create our smart
attendance system.
11 import imutils
12 import pyttsx3
13 import pickle
14 import time
15 import cv2
16
17 # construct the argument parser and parse the arguments
18 ap = argparse.ArgumentParser()
19 ap.add_argument("-c", "--conf", required=True,
20 help="Path to the input configuration file")
21 args = vars(ap.parse_args())
On Lines 2-15 we import our required packages. Notable imports include tinydb used to
interface with our attendance.json database, face_recognition to facilitate both face
detection and face identification, and pyttsx3 used for Text-to-Speech.
We can now move on to loading the configuration and accessing individual tables via
TinyDB:
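A minimal sketch of this block (the table names follow the schema from Section 6.3, and the recognizer_path and le_path keys come from config.json):
# load the configuration, open the database, and grab our two tables
conf = Conf(args["conf"])
db = TinyDB(conf["db_path"])
studentTable = db.table("student")
attendanceTable = db.table("attendance")

# load the trained SVM face recognizer and label encoder from disk
recognizer = pickle.loads(open(conf["recognizer_path"], "rb").read())
le = pickle.loads(open(conf["le_path"], "rb").read())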
Lines 29 and 30 grab a reference to the student and attendance tables, respectively.
We then load the trained face recognizer model and LabelEncoder on Lines 33 and 34.
Let’s access our video stream and perform a few more initializations:
36 # initialize the video stream and allow the camera sensor to warmup
37 print("[INFO] warming up camera...")
38 # vs = VideoStream(src=0).start()
39 vs = VideoStream(usePiCamera=True).start()
40 time.sleep(2.0)
41
42 # initialize previous and current person to None
43 prevPerson = None
44 curPerson = None
45
46 # initialize consecutive recognition count to 0
47 consecCount = 0
48
49 # initialize the text-to-speech engine, set the speech language, and
50 # the speech rate
51 print("[INFO] taking attendance...")
52 ttsEngine = pyttsx3.init()
53 ttsEngine.setProperty("voice", conf["language"])
54 ttsEngine.setProperty("rate", conf["rate"])
55
56 # initialize a dictionary to store the student ID and the time at
57 # which their attendance was taken
58 studentDict = {}
Lines 43 and 44 initialize two variables: prevPerson, the ID of the previous person rec-
ognized in the video stream, and curPerson, the ID of the current person identified in the
stream. In order to reduce false-positive identifications we’ll ensure that the prevPerson and
curPerson match for a total of consec_count frames (defined inside our config.json file
from Section 6.2.3).
The consecCount integer keeps track of the number of consecutive frames with the same
person identified.
Lines 52-54 initialize our ttsEngine, used to generate speech and play it through our
speakers.
We then initialize studentDict, a dictionary used to map a student ID to when their re-
spective attendance was taken.
Line 63 grabs the current time. We then take this value and compute the difference between
the current time and when class officially starts. We’ll use this timeDiff value to determine
whether class has already started and whether the window to take attendance has closed.
Line 70 reads a frame from our video stream which we then preprocess by resizing to have
a width of 400px and then flipping horizontally.
Let’s check to see if the maximum time limit to take attendance has been reached:
Provided that (1) class has already started, and (2) the maximum time limit for attendance
has been passed (Line 76), we make a second check on Line 78 to see if the attendance
record has been added to our database. If we have not added the results of taking attendance
to our database, we insert a new set of records to the database, indicating that each of the
individual students in studentDict are in attendance. The teacher or principal can then
audit the attendance results at their leisure.
Lines 86-92 draw class information on our frame, including the name of the class, when
class starts, and the current timestamp.
The remaining code blocks assume that we are still taking attendance, implying that we’re
still within the maximum time limit window.
Line 108 swaps our frame to RGB ordering so we can perform both face detection and
recognition via dlib and face_recognition.
Lines 112 and 113 perform face detection. We then loop over each of the detected bounding
boxes and draw them on our frame.
Lines 126-134 draw class information on our screen, most importantly, the amount of time
remaining to register yourself as having “attended” the class.
Let’s now check and see if any faces were detected in the current frame:
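A minimal sketch of this recognition step (the boxes, rgb, recognizer, and le variables are assumed from the surrounding description; face_recognition is imported at the top of attendance.py):
import numpy as np

# only attempt recognition if at least one face was detected
if len(boxes) > 0:
    # compute a 128-d embedding for each detected face
    encodings = face_recognition.face_encodings(rgb, boxes)

    # classify the first embedding and grab the most probable student ID
    preds = recognizer.predict_proba(encodings)[0]
    j = np.argmax(preds)
    curPerson = le.classes_[j]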
Provided at least one face was detected in the frame, Line 139 takes all detected faces
and then extracts 128-d embeddings used to quantify each face. We take these face embed-
dings and pass them through our recognizer, finding the index of the label with the largest
corresponding probability (Lines 142-144).
Line 148 checks to see if the prevPerson prediction matches the curPerson predic-
tion, in which case we increment the consecCount. Otherwise, we do not have matching
consecutive predictions so we reset the consecCount (Lines 153 and 154).
Line 163 ensures that the consecutive prediction count has been satisfied (used to reduce
false-positive identifications). We then check to see if the student’s attendance has already
been taken (Line 166), and if not, we update the studentDict to include (1) the ID of the
person, and (2) the timestamp at which attendance was taken.
Lines 171-175 look up the name of the student via the ID and then use the TTS engine to
let the student know their attendance has been taken.
Line 186 handles when the consecCount threshold has not been met — in that case we
tell the user to stand in front of the camera until their attendance has been taken.
Our final code block displays the output frame to our screen and performs a few tidying up
operations:
In the event the q key is pressed, we check to see if there are any students in studentDict
that need to have their attendance recorded, and if so, we insert them into our TinyDB before
performing our final cleanup.
It’s been quite a journey to get here, but we are now ready to run our smart attendance system
on the Raspberry Pi!
Figure 6.4: An example of a student enrolled in the PyImageSearch Gurus course being marked
as "present" in the TinyDB database. Face detection and face recognition have recognized this
student while sounding an audible message and printing a text-based annotation on the screen.
Figure 6.4 demonstrates our smart attendance system in action. As students enter the
classroom, attendance is taken until the time expires. Each result is saved to our TinyDB
database. The instructor can query the database at a later date to determine which students
have attended/not attended certain class sessions throughout the semester.
6.8 Summary
In this chapter you learned how to implement a smart attendance application from scratch.
This system is capable of running in real-time on the Raspberry Pi, despite using more
advanced computer vision and deep learning techniques.
I know this has been a heavily requested topic on the PyImageSearch blog, particularly
among students working on their final projects before graduation, so if you use any of the
concepts/code in this chapter, please don’t forget to cite it in your final reports. You can find
citation/reference instructions here on the PyImageSearch blog: http://pyimg.co/hwovx.
Chapter 7
Building a Neighborhood Vehicle Speed Monitor
This chapter is inspired by PyImageSearch readers who have emailed me asking for speed
estimation computer vision solutions.
As pedestrians taking the dog for a walk, escorting our kids to school, or marching to our
workplace in the morning, we’ve all experienced unsafe, fast-moving vehicles operated by
inattentive drivers that nearly mow us down.
We feel almost powerless. These drivers disregard speed limits, crosswalk areas, school
zones, and “children at play” signs altogether. When there is a speed bump, they speed up
almost as if they are trying to catch some air!
Can anything be done about these drivers? In most cases, the answer is unfortunately “no” — we have to look out for ourselves and our
families by being careful as we walk in the neighborhoods we live in.
But what if we could catch these reckless neighborhood miscreants in action and
provide video evidence of the vehicle, speed, and time of day to local authorities?
In fact, we can. In this chapter we’ll build a system that:
i. Detects vehicles in video using a MobileNet SSD and Intel Movidius Neural Compute Stick (NCS)
ii. Tracks each vehicle with an object tracker as it moves through the camera’s field of view
iii. Estimates the speed of a vehicle and stores the evidence in the cloud (specifically in a
Dropbox folder).
Once in the cloud, you can provide the shareable link to anyone you choose. I sincerely
hope it will make a difference in your neighborhood.
Let’s take a ride of our own and learn how to estimate vehicle speed using a Raspberry Pi
and Movidius NCS.
ii. Discover the VASCAR approach that police use to measure speed
iii. Understand the human component that leads to inaccurate speeds with police VASCAR
electronics
iv. Build a Python computer vision app based on object detection/tracking and use the VAS-
CAR approach to automatically determine the speed of vehicles moving through the FOV
of a single camera
v. Utilize the Movidius NCS to ensure our system runs in real-time on the RPi
vi. Tune and calibrate our system for accurate speed readings
In the first part of this chapter, we’ll review the concept of VASCAR, a method for measuring
speed of moving objects using distance and timestamps. From there, we’ll review our Python
project structure and config file including key configuration settings.
We’ll then implement our computer vision app and deploy it. We’ll also review a method for
tuning your speed estimation system by adjusting one of the constants.
Visual Average Speed Computer and Recorder (VASCAR) is a method for calculating the
speed of vehicles. It does not rely on RADAR or LIDAR, but it borrows from those acronyms.
Instead, VASCAR is a simple timing device relying on the following equation:
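$$\text{speed} = \frac{\text{distance}}{\text{time}} \tag{7.1}$$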
Figure 7.1: Visual Average Speed Computer and Recorder (VASCAR) devices calculate speed
based on Equation 7.1. A police officer must press a button each time a vehicle crosses two
waypoints relying on their eyesight and reaction time. There is potential for significant human
error. Police use VASCAR where RADAR/LIDAR is illegal or where drivers use RADAR/LIDAR
detectors. In this chapter we will build a computer vision speed measurement system based on
VASCAR that eliminates the human component. Figure credits: [39, 40]
Police use VASCAR where RADAR and LIDAR is illegal or when they don’t want to be
detected by RADAR/LIDAR detectors.
Police must know the distance between two fixed points on the road (such as signs, lines,
trees, bridges, or other reference points). When a vehicle passes the first reference point,
they press a button to start the timer. When the vehicle passes the second point, the timer
is stopped. The speed is automatically computed because the computer already knows the
distance per Equation 7.1.
Speed measured by VASCAR is obviously severely limited by the human factor. What if
the police officer has poor eyesight or a poor reaction time? If they press the button late (first
reference point) and then early (second reference point), then your speed will be calculated
faster than you are actually going since the time component is smaller. If you are ever issued a
ticket by a police officer and it says VASCAR on it, then you have a very good chance of getting
out of the ticket in a court room. You can (and should) fight it. Be prepared with Equation 7.1
above and to explain how significant the human component is.
Our project relies on a VASCAR approach, but with four reference points. We will average
the speed between all four points with a goal of having a better estimate of the speed. Our
system is also dependent upon the distance and time components.
For further reading about VASCAR, please refer to the Wikipedia article: http://pyimg.co/91
s1o [41].
|-- config
| |-- config.json
|-- pyimagesearch
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
| |-- centroidtracker.py
| |-- trackableobject.py
|-- sample_data
| |-- cars.mp4
|-- output
| |-- log.csv
|-- MobileNetSSD_deploy.caffemodel
|-- MobileNetSSD_deploy.prototxt
|-- speed_estimation_dl.py
|-- speed_estimation_dl_video.py
Our config.json file holds all the project settings — we will review these configurations
in the next section.
A sample video compilation from vehicles passing in front of Dave Hoffman’s house is in-
cluded (cars.mp4). Take note that you should not rely on video files for accurate speeds
— the FPS of the video, in addition to the speed at which frames are read from the file, will
impact speed readouts. Videos like the one provided are great for ensuring that the program
functions as intended, but again, accurate speed readings from video files are not likely.
The output/ folder will hold a log file, log.csv, which includes the timestamps and speeds of
vehicles that have passed the camera.
Our pretrained Caffe MobileNet SSD object detector files are included in the root of the
project.
The driver script, speed_estimation_dl.py, interacts with the live video stream, object
detector, and calculates the speeds of vehicles using the VASCAR approach. It is one of two
driver scripts in the project; the other, speed_estimation_dl_video.py, operates on the
included sample video file rather than a live stream. Let’s now review the key settings in
config.json:
1 {
2 // maximum consecutive frames a given object is allowed to be
3 // marked as "disappeared" until we need to deregister the object
4 // from tracking
5 "max_disappear": 10,
6
7 // maximum distance between centroids to associate an object --
8 // if the distance is larger than this maximum distance we'll
9 // start to mark the object as "disappeared"
10 "max_distance": 175,
11
12 // number of frames to perform object tracking instead of object
13 // detection
14 "track_object": 4,
15
16 // minimum confidence
17 "confidence": 0.4,
18
19 // frame width in pixels
20 "frame_width": 400,
21
22 // dictionary holding the different speed estimation columns
23 "speed_estimation_zone": {"A": 120, "B": 160, "C": 200, "D": 240},
24
25 // real world distance in meters
26 "distance": 16,
27
28 // speed limit in mph
29 "speed_limit": 15,
The "max_disappear" frame count signals to our centroid tracker when to mark an object
as disappeared (Line 5). The "max_distance" value is the maximum Euclidean distance in
pixels for which we’ll associate object centroids (Line 10) — if they exceed this distance we
mark the object as disappeared.
Our "track_object" value represents the number of frames to perform object tracking
rather than object detection (Line 14).
The "confidence" value is the probability threshold for object detection with MobileNet
SSD — objects (i.e. cars) that don’t meet the threshold are skipped to avoid false-positive
detections (Line 17).
The frame will be resized to a "frame_width" of 400 pixels (Line 20). We have four
speed estimation zones. Line 23 holds a dictionary of the frame’s columns (x-coordinates)
separating the zones. These columns are dependent upon the "frame_width", so take care
while updating them.
Figure 7.2: The camera’s FOV is measured at the roadside carefully. Oftentimes calibration is
required. Refer to Section 7.2.9 to learn about the calibration procedure.
Line 26 is the most important value in this configuration. You will have to physically measure
the "distance" on the road from one side of the frame to the other side. It will be best if you
have a helper to make the measurement.
Have the helper watch the screen and tell you when you are standing at the very edge of
the frame. Put the tape down on the ground at that point. Stretch the tape to the other side of
the frame until your helper tells you that they see you at the very edge of the frame in the video
stream. Take note of the distance in meters that all your calculations will be based upon.
As shown in Figure 7.2, there are 49 feet between the edges of where cars will travel in the
frame relative to the positioning on my camera. The conversion of 49 feet to meters is 14.94
meters.
The value has been tuned for system calibration. See Section 7.2.9 to learn how to test and
calibrate your system. Secondly, had the measurement been made at the center of the street
(i.e. further from the camera), the distance would have been longer. The measurement was
taken next to the street by Dave Hoffman so he would not get run over by a car!
Our speed_limit in this example is 15mph (Line 29). Vehicles traveling less than this
speed will not be logged. Vehicles exceeding this speed will be logged. If you want all speeds
to be logged, you can set the value to 0.
The remaining configuration settings are for display, Dropbox upload, and important file
paths:
If you set "display" to true on Line 32, an OpenCV window is displayed on your Rasp-
berry Pi desktop.
If you elect to "use_dropbox", then you must set the value on Line 42 to true and fill
in your access token on Line 43. Videos of vehicles passing the camera will be logged to
Dropbox. Ensure that you have the quota for the videos as well.
To create/find your Dropbox API key, you can create an app on the app creation page:
http://pyimg.co/tcvd1. Once you have an app created, the API key may be generated under
the OAuth section of the app’s page on the App Console: http://pyimg.co/ynxh8. On the App
Console page, simply click the “Generate” button and copy/paste the key into the configuration
file.
Figure 7.3: This project assumes the camera is aimed perpendicular to the road. Timestamps
of a vehicle are collected at waypoints ABCD or DCBA. From there, Equation 7.1 is put to use to
calculate 3 speeds among the 4 waypoints. Speeds are averaged together and converted to km/hr
and miles/hr. As you can see, the distance measurement is different depending on where (edges
or centerline) the tape is laid on the ground. We will account for this by calibrating our system in
Section 7.2.9.
Figure 7.3 shows an overhead view of how the project is laid out. In the case of Dave
Hoffman’s house, the RPi and camera are sitting in his road-facing window. The measurement
for the "distance" was taken at the side of the road on the far edges of the FOV lines for the
camera. Points A, B, C, and D mark the columns in a frame. They should be equally spaced
in your video frame.
Cars pass through the FOV in either direction and MobileNet SSD combined with an object
tracker assist in grabbing timestamps at points ABCD (left-to-right) or DCBA (right-to-left).
Object tracking is a concept we have already covered in this book, however let’s take a moment
to review.
A simple object tracking algorithm relies on keeping track of the centroids of objects.
Typically an object tracker works hand-in-hand with a less-efficient object detector. The
object detector is responsible for localizing an object. The object tracker is responsible for
keeping track of which object is which by assigning and maintaining identification numbers
(IDs).
This object tracking algorithm we’re implementing is called centroid tracking as it relies on
the Euclidean distance between (1) existing object centroids (i.e., objects the centroid tracker
has already seen before) and (2) new object centroids between subsequent frames in a video.
The centroid tracking algorithm is a multi-step process. The five steps include:
i. Step #1: Accept bounding box coordinates and compute centroids
ii. Step #2: Compute Euclidean distance between new bounding boxes and existing objects
iii. Step #3: Update the (x, y)-coordinates of existing objects
iv. Step #4: Register new objects
v. Step #5: Deregister old objects that have disappeared from the frame
The CentroidTracker class was covered in Chapters 19 and 20 of the Hobbyist Bundle
in addition to Chapter 13 of this volume. Please take the time now to review the class in any of
those chapters.
In order to track and calculate the speed of objects in a video stream, we need an easy way to
store information regarding the object itself, including:
• Its previous centroids (so we can easily compute the direction the object is moving).
• A dictionary of x-coordinate positions of the object. These positions reflect the actual
position in which the timestamp was recorded so speed can accurately be calculated.
• A "last point boolean" serves as a flag to indicate that the object has passed the last
waypoint (i.e. column) in the frame.
• The calculated speed in MPH and KMPH. We calculate both and the user can choose
which they prefer to use via a small modification to the driver script.
• A boolean to indicate if the speed has been estimated (i.e. calculated) yet.
• A boolean indicating if the speed has been logged in the .csv log file.
• The direction through the FOV the object is traveling (left-to-right or right-to-left).
We will have multiple trackable objects — one for each car that is being tracked in the frame.
Each object will have the attributes shown on Lines 8-29.
Lines 18 and 19 hold the speed in MPH and KMPH. We need a function to calculate the
speed, so let’s define the function now:
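The method is only a few lines long; here is a minimal sketch (averaging via np.average is an assumption, while the speedMPH and speedKMPH attribute names follow the description above):
import numpy as np

class TrackableObject:
    # (the constructor and remaining attributes are defined in
    # pyimagesearch/trackableobject.py and are omitted here)

    def calculate_speed(self, estimatedSpeeds):
        # average the per-zone speed estimates (in km/hr)
        self.speedKMPH = np.average(estimatedSpeeds)

        # there are 0.621371 miles in one kilometer, so convert to MPH
        self.speedMPH = self.speedKMPH * 0.621371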
There are 0.621371 miles in one kilometer (Line 34). Knowing this, Line 35 calculates the
speedMPH attribute.
Before we begin working on our driver script, let’s review our algorithm at a high level:
• We have a known distance constant measured with a tape at the roadside. The camera
will face the road, perpendicular to the distance measurement, with its view unobstructed
by obstacles.
• Meters per pixel are calculated by dividing the distance constant by the frame width in
pixels (Equation 7.2).
• Distance in pixels is calculated as the difference between the centroids as they pass by
the columns for the zone (Equation 7.3). Distance in meters is then calculated for the
particular zone (Equation 7.4).
• Four timestamps (t) will be collected as the car moves through the FOV past four waypoint
columns of the video frame.
• Three pairs of the four timestamps will be used to determine three Δt values.
• The three speed estimates will be averaged for an overall speed (Equation 7.5).
• The speed is converted and made available in the TrackableObject class as speedMPH
or speedKMPH. We will display speeds in miles per hour. Minor changes to the script
are required if you prefer to have the kilometers per hour logged and displayed — be sure
to read the remarks as you follow along in the chapter.
$$\text{meters per pixel} = mpp = \frac{\text{distance constant}}{\text{frame width}} \tag{7.2}$$

$$\text{average speed} = \frac{d_{ab}/\Delta t_{ab} + d_{bc}/\Delta t_{bc} + d_{cd}/\Delta t_{cd}}{3} \tag{7.5}$$
Now that we (1) understand the methodology for calculating speeds of vehicles and (2)
have defined the CentroidTracker and TrackableObject classes, let’s work on our
speed estimation driver script.
Open a new file named speed_estimation_dl.py and insert the following lines:
Lines 2-17 handle our imports including our CentroidTracker and TrackableObject
for object tracking. The correlation tracker from Davis King’s dlib is also part of our object
tracking method. We’ll use the dropbox API to upload data to the cloud in a separate Thread
so as not to block the main thread of execution.
Our upload_file function will run in one or more separate threads. This method accepts
the tempFile object, Dropbox client object, and imageID as parameters. Using these
parameters, it builds a path and then uploads the file to Dropbox (Lines 22 and 23). From
there, Line 24 then removes the temporary file from local storage.
Lines 27-33 parse the --conf command line argument and load the contents of the con-
figuration into the conf dictionary.
We’ll then initialize our pretrained MobileNet SSD CLASSES and Dropbox client if re-
quired:
And from there, we’ll load our object detector and initialize our video stream:
50 net = cv2.dnn.readNetFromCaffe(conf["prototxt_path"],
51 conf["model_path"])
52 net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)
53
54 # initialize the video stream and allow the camera sensor to warmup
55 print("[INFO] warming up camera...")
56 #vs = VideoStream(src=0).start()
57 vs = VideoStream(usePiCamera=True).start()
58 time.sleep(2.0)
59
60 # initialize the frame dimensions (we'll set them as soon as we read
61 # the first frame from the video)
62 H = None
63 W = None
Lines 50-52 load the MobileNet SSD net and set the target processor to the Movidius NCS
Myriad. Using the Movidius NCS coprocessor ensures that our FPS is high enough for
accurate speed calculations. In other words, if we have a lag between frame captures, our
timestamps can become out of sync and lead to inaccurate speed readouts.
Lines 57-63 initialize the PiCamera video stream and frame dimensions.
For object tracking purposes, Lines 68-71 initialize our CentroidTracker, trackers list,
and trackableObjects dictionary.
Line 74 initializes a totalFrames counter which will be incremented each time a frame
is captured. We’ll use this value to calculate when to perform object detection versus object
tracking.
Our speed will be based on the ABCD column points in our frame. Line 81 initializes a list
of pairs of points for which speeds will be calculated. Given our four points, we can calculate
three speeds and then average them.
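A sketch of that list (the zone letters mirror the "speed_estimation_zone" keys in our configuration):
# pairs of waypoint columns between which a speed estimate is computed
points = [("A", "B"), ("B", "C"), ("C", "D")]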
With all of our initializations taken care of, let’s begin looping over frames:
Our frame processing loop begins on Line 87. We begin by grabbing a frame and taking
our first timestamp (Lines 90-92).
Lines 99-115 initialize our logFile and write the column headings. Notice that if we are
using Dropbox, one additional column is present in the CSV — the image ID.
Remark. If you prefer to log speeds in kilometers per hour, be sure to update the CSV column
headings on Line 110 and Line 115.
Line 118 resizes our frame to a known width directly from the "frame_width" value held
in the config file.
Remark. If you change "frame_width" in the config, be sure to update the
"speed_estimation_zone" columns as well.
Line 119 converts the frame to RGB format for dlib’s correlation tracker.
Lines 122-124 initialize the frame dimensions and calculate meterPerPixel. The meters
per pixel value helps to calculate our three estimated speeds among the four points.
Remark. If your lens introduces distortion (i.e. a wide area lens or fisheye), you should con-
sider a proper camera calibration (via intrinsic/extrinsic parameters) so that the meterPerPixel
value is more accurate.
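A sketch of the meters-per-pixel computation (Equation 7.2), where W is the resized frame width in pixels:
# compute the (approximate) real-world meters represented by one pixel
meterPerPixel = conf["distance"] / W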
Line 128 initializes an empty list to hold bounding box rectangles returned by either (1) our
object detector or (2) the correlation trackers.
At this point, we’re ready to perform object detection to update our trackers:
Object detection will only occur on frames that are multiples of "track_object" (Line 132).
Performing object detection only every N frames reduces the expensive inference operations;
we’ll perform object tracking instead most of the time.
Line 134 initializes our new list of object trackers to update with accurate bounding box
rectangles so that correlation tracking can do its job later.
Lines 148-159 filter the detection based on the "confidence" threshold and CLASSES
type. We only look for the “car” class using our pretrained MobileNet SSD.
We then initialize a dlib correlation tracker and begin tracking the rect ROI found by our
object detector (Lines 169-171). Line 175 adds the tracker to our trackers list.
Now let’s handle the event in which we’ll be performing object tracking rather than object
detection:
Object tracking is less of a computational load on our RPi, so most of the time (i.e. except
every N "track_object" frames) we will perform tracking.
Lines 182-185 loop over the available trackers and update the position of each object.
Lines 188-194 add the bounding box coordinates of the object to the rects list.
Line 198 then updates the CentroidTracker’s objects using either the object detection
or object tracking rects.
Let’s loop over the objects now and take steps towards calculating speeds:
Each trackable object has an associated objectID. Lines 204-208 create a trackable
object (with ID) if necessary.
From here we’ll check if the speed has been estimated for this trackable object yet:
If the speed has not been estimated (Line 212), then we first need to determine the direction
in which the object is moving (Lines 215-218). Positive direction values indicate left-to-right
movement and negative values indicate right-to-left movement. Knowing the direction is impor-
tant so that we can estimate our speed between the points properly.
Lines 222-267 collect timestamps for cars moving from left-to-right for each of our columns,
A, B, C, and D.
i. Line 225 checks to see if a timestamp has been made for point A — if not, we’ll proceed
to do so.
ii. Line 230 checks to see if the current x-coordinate centroid is greater than column A.
iii. If so, Lines 231 and 232 record a timestamp and the exact x-position of the centroid.
iv. Columns B, C, and D use the same method to collect timestamps and positions with one
exception. For column D, the lastPoint is marked as True. We’ll use this flag later to
indicate that it is time to perform our speed formula calculations.
Now let’s perform the same timestamp, position, and last point updates for right-to-left
traveling cars (i.e. direction < 0):
Lines 271-316 grab timestamps and positions for cars as they pass by columns D, C, B,
and A. For A the lastPoint is marked as True.
318 # check to see if the vehicle is past the last point and
319 # the vehicle's speed has not yet been estimated, if yes,
320 # then calculate the vehicle speed and log it if it's
321 # over the limit
322 if to.lastPoint and not to.estimated:
323 # initialize the list of estimated speeds
324 estimatedSpeeds = []
325
326 # loop over all the pairs of points and estimate the
327 # vehicle speed
328 for (i, j) in points:
329 # calculate the distance in pixels
330 d = to.position[j] - to.position[i]
331 distanceInPixels = abs(d)
332
333 # check if the distance in pixels is zero, if so,
334 # skip this iteration
335 if distanceInPixels == 0:
336 continue
337
338 # calculate the time in hours
339 t = to.timestamp[j] - to.timestamp[i]
340 timeInSeconds = abs(t.total_seconds())
341 timeInHours = timeInSeconds / (60 * 60)
342
343 # calculate distance in kilometers and append the
344 # calculated speed to the list
345 distanceInMeters = distanceInPixels * meterPerPixel
346 distanceInKM = distanceInMeters / 1000
347 estimatedSpeeds.append(distanceInKM / timeInHours)
348
349 # calculate the average speed
350 to.calculate_speed(estimatedSpeeds)
351
352 # set the object as estimated
353 to.estimated = True
354 print("[INFO] Speed of the vehicle that just passed"\
355 " is: {:.2f} MPH".format(to.speedMPH))
356
357 # store the trackable object in our dictionary
358 trackableObjects[objectID] = to
When the trackable object’s (1) last point timestamp and position has been recorded, and
(2) the speed has not yet been estimated (Line 322) we’ll proceed to estimate speeds.
Line 324 initializes a list to hold three estimatedSpeeds. Let’s calculate the three esti-
mates now.
Line 328 begins a loop over our pairs of points. We calculate the distanceInPixels
using the position values (Lines 330-331). If the distance is 0, we’ll skip this pair (Lines
335 and 336).
Next we calculate the elapsed time between two points in hours (Lines 339-341). We need
the time in hours as we are calculating kilometers per hour and miles per hour.
We then calculate the distance in kilometers by multiplying the pixel distance by the esti-
mated meterPerPixel value (Lines 345 and 346). Recall that meterPerPixel is based
on (1) the width of the FOV at roadside and (2) the width of the frame.
The speed is calculated by Equation 7.1 (distance over time) and added to the
estimatedSpeeds list.
Line 353 marks the speed as estimated. Lines 354 and 355 then print the speed in
the terminal.
Remark. If you prefer to print the speed in km/hr be sure to update both the string to KMPH
and the format variable to to.speedKMPH.
Phew! The hard part is out of the way in this script. Let’s wrap up, first by annotating the
centroid and ID on the frame:
360 # draw both the ID of the object and the centroid of the
361 # object on the output frame
362 text = "ID {}".format(objectID)
363 cv2.putText(frame, text, (centroid[0] - 10, centroid[1] - 10)
364 , cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
365 cv2.circle(frame, (centroid[0], centroid[1]), 4,
366 (0, 255, 0), -1)
A small dot is drawn on the centroid of the moving car with the ID number next to it.
Next we’ll go ahead and update our log file and store vehicle images in Dropbox (i.e., the
cloud):
At a minimum every vehicle that exceeds the speed limit will be logged in the CSV file.
Optionally Dropbox will be populated with images of the speeding vehicles.
Lines 369-372 check to see if the trackable object has been logged, speed estimated, and
if the car was speeding.
If so Lines 374-377 extract the year, month, day, and time from the timestamp.
If an image will be logged in Dropbox, Lines 381-391 store a temporary file and spawn a
thread to upload the file to Dropbox. Using a separate thread for a potentially time-consuming
upload is critical so that our main thread doesn’t block, impacting FPS and speed calculations.
The filename will be the imageID on Line 383 so that it can easily be found later if it is
associated in the log file.
Lines 394-404 write the CSV data to the logFile. If Dropbox is used, the imageID is the
last value.
Remark. If you prefer to log the kilometers per hour speed, simply update to.speedMPH to
to.speedKMPH on Line 395 and Line 403.
409 # if the *display* flag is set, then display the current frame
410 # to the screen and record if a user presses a key
411 if conf["display"]:
412 cv2.imshow("frame", frame)
413 key = cv2.waitKey(1) & 0xFF
414
415 # if the `q` key is pressed, break from the loop
416 if key == ord("q"):
417 break
418
419 # increment the total number of frames processed thus far and
420 # then update the FPS counter
421 totalFrames += 1
422 fps.update()
423
424 # stop the timer and display FPS information
425 fps.stop()
426 print("[INFO] elapsed time: {:.2f}".format(fps.elapsed()))
427 print("[INFO] approx. FPS: {:.2f}".format(fps.fps()))
428
429 # check if the log file object exists, if it does, then close it
430 if logFile is not None:
431 logFile.close()
432
433 # close any open windows
434 cv2.destroyAllWindows()
435
436 # clean up
437 print("[INFO] cleaning up...")
438 vs.stop()
Lines 411-417 display the annotated frame and look for the q keypress in which case we’ll
quit (break).
Lines 421 and 422 increment totalFrames and update our FPS counter.
When we have broken out of the frame processing loop we perform housekeeping including
printing FPS stats, closing our log file, destroying GUI windows, and stopping our video stream
(Lines 425-438).
Now that our code is implemented, we’ll deploy and test our system.
I highly recommend that you conduct a handful of controlled drive-bys and tweak the vari-
ables in the config file until you are achieving accurate speed readings. Prior to any fine tuning
calibration, we’ll just ensure that the program is working. Be sure you have met the following
requirements prior to trying to run the application:
• Position and aim your camera perpendicular to the road as per Figure 7.3.
• Ensure your camera has a clear line of sight with limited obstructions — our object detec-
tor must be able to detect a vehicle at multiple points as it crosses through the camera’s
field of view (FOV).
• It is best if your camera is positioned far from the road. The further points A and D are
from each other at the point at which cars pass on the road, the better the distance / time
calculations will average out and produce more accurate speed readings. If your camera
is close to the road, a wide-angle lens is an option, but then you’ll need to perform camera
calibration (a future PyImageSearch blog topic).
• If you are using Dropbox functionality, ensure that your RPi has a solid WiFi, Ethernet, or
even cellular connection.
• Ensure that you have set all constants in the config file. We may elect to fine tune the
constants in the next section.
Assuming you have met each of the requirements, you are now ready to deploy your pro-
gram.
Enter the following command to start the program and begin logging speeds:
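Assuming the --conf switch parsed at the top of the driver script:
$ python speed_estimation_dl.py --conf config/config.json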
Figure 7.4: Deployment of our neighborhood speed system. Vehicle speeds are calculated after
they leave the viewing frame. Speeds are logged to .csv and images are stored in Dropbox.
As shown in Figure 7.4, our system is measuring speeds of vehicles traveling in both di-
rections. In the next section, we will perform drive-by tests to ensure our system is reporting
accurate speeds.
To see a video of the system in action, be sure to visit the following link: http://pyimg.co/n3zu9
On occasions when multiple cars are passing through the frame at any given time, speeds
will be reported inaccurately. This can occur when our centroid tracker mixes up centroids.
This is a known drawback of our algorithm. To solve the issue, additional algorithm engineering
will need to be conducted by you as the reader. One suggestion would be to perform instance
segmentation (http://pyimg.co/lqzvq [42]) to accurately segment each vehicle.
Per the remark in Section 7.2.2, you may also execute a separate script for use with the
sample_data/cars.mp4 video file as follows:
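The exact switches for the video driver are not shown in this chapter; assuming it mirrors the live script with an additional input switch for the video file, the invocation would look something like:
$ python speed_estimation_dl_video.py --conf config/config.json \
    --input sample_data/cars.mp4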
Figure 7.5: Images of vehicles exceeding the speed limit are stored in Dropbox.
[INFO] Speed of the vehicle that just passed is: 26.85 MPH
[INFO] Speed of the vehicle that just passed is: 25.04 MPH
[INFO] Speed of the vehicle that just passed is: 26.08 MPH
[INFO] Speed of the vehicle that just passed is: 28.58 MPH
[INFO] elapsed time: 22.88
[INFO] approx. FPS: 30.78
[INFO] cleaning up...
Note that those calculated and reported speeds are inaccurate since OpenCV can’t throttle
a video file per its FPS and instead processes it as fast as possible.
You may find that the system produces slightly inaccurate readouts of the vehicle speeds going
by. Do not disregard the project just yet. You can tweak the config file to get closer and closer
to accurate readings.
We used the following approach to calibrate our system until our readouts were spot-on:
• Begin recording a screencast of the RPi desktop showing both the video stream and
terminal. This screencast should record throughout testing.
• Meanwhile, record a voice memo on your smartphone throughout testing, stating your speedometer reading during each drive-by.
Figure 7.6: Calibration involves drive-testing. This figure shows how the "distance" constant in your config file affects the outcome of the speed calculation.
• Sync the screencast to the audio file so that it can be played back.
• Jot down the speed +/- differences as you play back your video with the synced audio file.
• With this information, tune the constants: (1) If your speed readouts are a little high, then
decrease the "distance" constant, or (2) Conversely, if your speed readouts are slightly
low, then increase the "distance" constant.
• Rinse and repeat until you are satisfied. Don’t worry, you won’t burn too much fuel in the
process.
Based on extensive testing and adjustments, Dave Hoffman and Abhishek Thanki found
that Dave needed to increase his distance constant from the original 14.94m to 16.00m. The
final testing table is shown in Figure 7.7.
Figure 7.7: Drive testing results. Click here for a high resolution version of the table:
http://pyimg.co/y7o5z
If you care to watch and listen to Dave Hoffman’s final testing and verification video, you
can click this link: http://pyimg.co/tx49e.
With a calibrated system, you’re now ready to let it run for a full day. Your system is likely
only configured for daytime use unless you have streetlights on your road.
Remark. For nighttime use (outside the scope of this chapter), you may need infrared cameras
and infrared lights and/or adjustments to your camera parameters (refer to the Hobbyist Bundle
Chapters 6, 12, and 13 for these topics).
7.3 Summary
In this chapter we built a system to monitor the speeds of moving vehicles with just a camera
and well-crafted software.
Rather than relying on expensive RADAR or LIDAR sensors, we used timestamps, a known
distance, and a simple physics equation to calculate speeds. In the police world this is known
as Vehicle Average Speed Computer and Recorder (VASCAR). Police rely on their eyesight
and button-pushing reaction time to collect timestamps — a method that barely holds up in court in comparison to RADAR and LIDAR.
But of course, we are engineers so our system eliminates the human component to calcu-
late speeds automatically with computer vision. Using both object detection and object tracking
we coded a method to calculate four timestamps. We then let the math do the talking: We know
that speed equals distance over time. Three speeds were calculated among the three pairs of timestamped points and then averaged to produce the final speed estimate.
One drawback to our automated system is that it is only as good as the key distance con-
stant.
To combat this, we measured carefully and then conducted drive-bys while looking at our
speedometer to verify operation. Adjustments to the distance constant were made if needed.
Yes, there is a human component in this verification method. If you have a cop friend who can help you verify speeds with their RADAR gun, that would be even better.
We hope you enjoyed this chapter, and more importantly, we hope that you can apply it to
detect drivers speeding in your neighborhood.
Chapter 8
Deep Learning and Multiple RPis
ImageZMQ made throwing frames around a network dead simple. In this chapter we’ll take
it a step further by monitoring a home with deep learning and multiple Raspberry Pis.
Clients will send video frames to a central server via ImageZMQ. Our server will run an
object detector to find people and animals in the incoming frames from our clients. The results
will be displayed in a montage. You can extend this chapter to make your own security digital
video recorder.
Imagine for thirty seconds that you are fabulously wealthy and that you have an extravagant 10,000 sq. foot house. Your house has multiple rooms and you may even have a guest house or a
pool house. Your beloved dog (or cat) is free to roam the property inside and out. Sometimes
she sleeps in the den. Other times she’s sprawled out by the pool. Clearly your dog has a
wonderful life.
Today your dog is due to go to the vet. You’ve looked in the usual places, but she’s not
turning up. The top of the hour is creeping up and you’re going to be late. Being late is the
worst feeling – you are impacting someone else’s schedule, not to mention your own.
Finally you find her, and make it to the vet appointment just five minutes late.
Your appointment was bumped to the next slot, so while you're waiting, you think of ways to ensure this won't happen again.
What if you could put cameras around your house so that you could monitor each room
much like a security guard would?
Better yet, what if a deep learning object detector finds your dog automatically in any of the
camera’s video streams and you know exactly where she is?
In this chapter we'll build exactly that, using:
• ImageZMQ to stream frames from each client Raspberry Pi to a central server.
• Pretrained MobileNet SSD object detection to find people, dogs, and cats.
• And an OpenCV montage so that you can visualize all the feeds in one (or more) convenient windows on a large screen.
We’ll begin this chapter by implementing our client. From there, we’ll implement the server, which handles object detection and display. We’ll wrap up by analyzing our results.
By the end of the chapter, you’ll have a system you can deploy to find your dog, cat, or
partner in your home and even your vehicles outside your home. You could extend it to include
Digital Video Recording (DVR) functionality if you want to use it for a security purpose.
|-- MobileNetSSD_deploy.caffemodel
|-- MobileNetSSD_deploy.prototxt
|-- client.py
|-- server.py
The first two files listed in the project are the pre-trained Caffe MobileNet SSD object de-
tection files. The server (server.py) will take advantage of these Caffe files using OpenCV’s
DNN module to perform object detection. The pre-trained caffe model supports 20 classes –
we’ll configure it to filter for people, dogs, cats, and vehicles.
The client.py script will reside on each device, sending a video stream to the server.
Later on, we’ll upload client.py onto each of the Pis (or another machine) on your network
so they can send video frames to the central location.
The client is responsible for:
• Capturing frames from the camera (either USB or the RPi camera module).
• Sending those frames over the network to the server via ImageZMQ.
We reviewed this script in the last chapter, but let’s review it again to reinforce the concept.
We start off by importing packages and modules on Lines 2-6. We’re importing imagezmq
so we can send frames from our client to our server (Line 3).
The server’s IP address (--server-ip) is the only command line argument parsed on
Lines 9-12. The socket module of Python is simply used to grab the hostname of the RPi.
Lines 16 and 17 create the ImageZMQ sender object and specify the IP address and port
of the server. The IP address will come from the command line argument that we already
established. I’ve found that port 5555 doesn’t usually have conflicts, so it is hardcoded. You
could easily turn it into a command line argument if you need to as well.
Let’s initialize our video stream and start sending frames to the server:
19 # get the host name, initialize the video stream, and allow the
20 # camera sensor to warmup
21 rpiName = socket.gethostname()
22 #vs = VideoStream(src=0).start()
23 vs = VideoStream(usePiCamera=True).start()
24 time.sleep(2.0)
25
26 # loop over frames from the camera
27 while True:
28 # read the frame from the camera and send it to the server
29 frame = vs.read()
30 sender.send_image(rpiName, frame)
Line 21 grabs the hostname, storing the value as rpiName. Refer to Section 3.5.4 to set
hostnames on your Raspberry Pis.
Our VideoStream is created via Line 22 or 23 depending on whether you are using a
PiCamera or USB camera. This is the point where you should also set your camera resolution
and other camera parameters (Hobbyist Bundle Chapters 5 and 6).
For this project, we are just going to use the default PiCamera resolution so the argu-
ment is not provided, but if you find that there is a lag, you are likely sending too many pixels.
If that is the case, you may reduce your resolution quite easily. Just pick from one of the reso-
lutions available for the PiCamera V2 here: http://pyimg.co/mo5w0 [43] (the second table is for
PiCamera V2).
Remark. The resolution argument won’t make a difference for USB cameras since they are all
implemented differently. As an alternative, you can insert a frame = imutils.resize(frame,
width=320) between Lines 28 and 29 to resize the frame manually (beware that doing so
will require the CPU to resize the image, thus slightly slowing down your pipeline).
Finally, our while loop on Lines 26-29 grabs and sends the frames.
As you can see, the client is quite simple and straightforward! Let’s move on to the server
where the heart of today’s project lives.
Figure 8.1: The Graphical User Interface (GUI) concept drawing for our ImageZMQ server that
performs object detection on frames incoming from client Raspberry Pis.
Our server is responsible for:
i. Accepting incoming frames from multiple client Raspberry Pis.
ii. Performing object detection on each incoming frame using MobileNet SSD.
iii. Maintaining an “object count” for each of the frames (i.e., count the number of objects).
iv. Building a montage of the annotated frames and displaying it on screen.
Figure 8.1 shows the initial GUI concept for the ImageZMQ server application we are build-
ing.
Let’s go ahead and implement the server — open up the server.py file and insert the
following code:
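As a sketch of what those opening import lines contain (based on the packages discussed in this section — build_montages from imutils, ImageZMQ, NumPy for the detection math, argparse, imutils, and OpenCV — the exact ordering in the book's listing may differ):

# import the necessary packages
from imutils import build_montages
from datetime import datetime
import numpy as np
import imagezmq
import argparse
import imutils
import cv2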
On Lines 2-8 we import packages and libraries. Most notably, we’ll be using build_montages to build a montage of all incoming frames. For more details on building montages with OpenCV, refer to this PyImageSearch blog post: http://pyimg.co/vquhs [44].
We’ll use imagezmq for video streaming. OpenCV’s DNN module will be utilized to perform
inference with our pretrained Caffe object detector.
• --model: The path to our pre-trained Caffe deep learning model. I’ve provided a pre-
trained MobileNet SSD in the project folder, but with some minor changes, you could elect
to use an alternative model.
• --montageW: Number of columns for our montage (this is not the width in pixels). We’re going to stream from four Raspberry Pis, so you could do 2x2, 4x1, or 1x4. You could also do, for example, 3x3 for nine clients, provided you have nine client RPis.
• --montageH: Number of rows for our montage (again, not the height in pixels) — this value is paired with --montageW to lay out the incoming frames.
Let’s initialize our ImageHub object along with our deep learning object detector:
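A minimal sketch of that initialization follows. The 20 class labels are the standard ones shipped with this MobileNet SSD Caffe model; a --prototxt switch is assumed to accompany --model in the argument parser:

# initialize the ImageHub object so we can receive frames from clients
imageHub = imagezmq.ImageHub()

# the class labels MobileNet SSD was trained to detect
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
    "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
    "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
    "sofa", "train", "tvmonitor"]

# load our serialized Caffe object detector from disk
print("[INFO] loading model...")
net = cv2.dnn.readNetFromCaffe(args["prototxt"], args["model"])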
The ImageHub is initialized on Line 25. Now we’ll be able to receive frames from clients.
Our MobileNet SSD object CLASSES are specified on Lines 29-32. Later, we will filter
detections for only the classes we wish to consider. From there we instantiate our Caffe object
detector on Line 36.
38 # initialize the consider set (class labels we care about and want
39 # to count), the object count dictionary, and the frame dictionary
40 CONSIDER = set(["dog", "person", "car"])
41 objCount = {obj: 0 for obj in CONSIDER}
42 frameDict = {}
43
44 # initialize the dictionary which will contain information regarding
45 # when a device was last active, then store the last time the check
46 # was made was now
47 lastActive = {}
48 lastActiveCheck = datetime.now()
49
50 # stores the estimated number of Pis, active checking period, and
51 # calculates the duration seconds to wait before making a check to
52 # see if a device was active
53 ESTIMATED_NUM_PIS = 4
54 ACTIVE_CHECK_PERIOD = 10
55 ACTIVE_CHECK_SECONDS = ESTIMATED_NUM_PIS * ACTIVE_CHECK_PERIOD
56
57 # assign montage width and height so we can view all incoming frames
58 # in a single "dashboard"
59 mW = args["montageW"]
60 mH = args["montageH"]
61 print("[INFO] detecting: {}...".format(", ".join(obj for obj in
62 CONSIDER)))
In this chapter’s example, we’re only going to CONSIDER three types of objects from the
MobileNet SSD list of CLASSES. We’re considering (1) dogs, (2) people, and (3) cars (Line
40).
We’ll soon use this CONSIDER set to filter out other classes that we don’t care about such
as chairs, plants, monitors, or sofas which don’t typically move and aren’t interesting for this
security type project.
Line 41 initializes a dictionary for our object counts to be tracked in each video feed. Each
count is initialized to zero.
A separate dictionary, frameDict, is initialized on Line 42. The frameDict dictionary will
contain the hostname key and the associated latest frame value from the respective RPi.
Lines 47 and 48 are variables which help us determine when a Pi last sent a frame to the
server. If it has been a while (i.e. there is a problem, such as the RPi freezing or crashing), we
can get rid of the static, out of date image in our montage. The lastActive dictionary will
have hostname keys and timestamps for values.
Lines 53-55 are constants which help us to calculate whether a Pi is active. Line 55 itself
calculates that our check for activity will be 40 seconds. You can reduce this period of time by
adjusting ESTIMATED_NUM_PIS and ACTIVE_CHECK_PERIOD on Lines 53 and 54.
Our mW and mH variables on Lines 59 and 60 represent the number of columns and rows
for our montage. These values are pulled directly from the command line args dictionary.
Let’s loop over incoming streams from our clients and process the data:
Lines 68 and 69 grab an image from the imageHub and send an ACK message. The result
of imageHub.recv_image is rpiName (the hostname), and the video frame itself. Be sure
to refer to Chapter 3, Section 3.5.4 to learn how to set your RPi hostname.
Lines 73-78 perform housekeeping duties to determine when a Raspberry Pi was lastActive.
Lines 82-90 perform object detection on the frame. First, the frame dimensions are com-
puted. Then, a blob is created from the image. The blob is then passed through the neural
net.
Remark. If you need a refresher on how the blobFromImage function works, refer to this
tutorial: http://pyimg.co/c4gws [45].
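A sketch of that detection step is shown below — the 300x300 input size, 0.007843 scale factor, and 127.5 mean value are the preprocessing parameters typically used with this MobileNet SSD model and are assumptions here:

# grab the frame dimensions and construct a blob from the frame
(h, w) = frame.shape[:2]
blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
    0.007843, (300, 300), 127.5)

# pass the blob through the network and obtain the detections
net.setInput(blob)
detections = net.forward()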
On Line 93 we reset the object counts to zero (we will be populating the dictionary with
fresh count values shortly).
Let’s loop over the detections with the goal of (1) counting, and (2) drawing boxes around
objects that we are considering:
On Line 96 we begin looping over each of the detections. Inside the loop, we proceed to
extract the object confidence and filter out weak detections (Lines 99-103). We then grab
the label idx (Line 106) and ensure that the label is in the CONSIDER set (Line 110).
For each detection that has passed the two checks (confidence threshold and in CONSIDER), we will (1) increment the objCount for the respective object, and (2) draw a rectangle around the object (Lines 113-123).
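A sketch of that loop follows — the confidence threshold is assumed to come from a --confidence command line argument (it could just as easily be hardcoded), and the drawing color is arbitrary:

# loop over the detections
for i in np.arange(0, detections.shape[2]):
    # extract the confidence (i.e., probability) of the prediction
    confidence = detections[0, 0, i, 2]

    # filter out weak detections
    if confidence > args["confidence"]:
        # extract the index of the predicted class label
        idx = int(detections[0, 0, i, 1])

        # only proceed if the label is in our CONSIDER set
        if CLASSES[idx] in CONSIDER:
            # increment the count for the detected object
            objCount[CLASSES[idx]] += 1

            # compute the bounding box coordinates and draw the box
            box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
            (startX, startY, endX, endY) = box.astype("int")
            cv2.rectangle(frame, (startX, startY),
                (endX, endY), (255, 0, 0), 2)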
Next, let’s annotate each frame with the hostname and object counts. We’ll also build a
montage to display them in:
145
146 # detect any keypresses
147 key = cv2.waitKey(1) & 0xFF
On Lines 126-133 we make two calls to cv2.putText to draw the Raspberry Pi hostname
and object counts.
From there we update our frameDict with the frame corresponding to the RPi hostname.
Lines 139-144 create and display a montage of our client frames. The montage will be
mW frames wide and mH frames tall (there is the possibility that multiple montages of equal
dimensions will be displayed if you have more clients than available "tiles" for output in a single
montage).
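As a sketch of that montage step (the window title is arbitrary; build_montages comes from imutils and expects an image list, a per-tile (width, height), and a (columns, rows) layout):

# build a montage using the latest frame from each client
(h, w) = frame.shape[:2]
montages = build_montages(list(frameDict.values()), (w, h), (mW, mH))

# display the montage(s) on our screen
for (i, montage) in enumerate(montages):
    cv2.imshow("Monitor ({})".format(i), montage)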
The last block is responsible for checking our lastActive timestamps for each client feed
and removing frames from the montage that have stalled. Let’s see how it works:
149 # if current time *minus* last time when the active device check
150 # was made is greater than the threshold set then do a check
151 if (datetime.now() - lastActiveCheck).seconds > ACTIVE_CHECK_SECONDS:
152 # loop over all previously active devices
153 for (rpiName, ts) in list(lastActive.items()):
154 # remove the RPi from the last active and frame
155 # dictionaries if the device hasn't been active recently
156 if (datetime.now() - ts).seconds > ACTIVE_CHECK_SECONDS:
157 print("[INFO] lost connection to {}".format(rpiName))
158 lastActive.pop(rpiName)
159 frameDict.pop(rpiName)
160
161 # set the last active check time as current time
162 lastActiveCheck = datetime.now()
163
164 # if the `q` key was pressed, break from the loop
165 if key == ord("q"):
166 break
167
168 # do a bit of cleanup
169 cv2.destroyAllWindows()
Line 151 checks whether enough time has passed since the last active-device check; if so, Lines 153-159 loop over the previously active clients and remove any that haven't sent a frame recently from the lastActive dictionary, along with their stale frames from the frameDict. From there, the lastActiveCheck is updated to the current time on Line 162.
Effectively, this implementation enables getting rid of expired frames (i.e. frames that
are no longer real-time). This is really important if you are using the ImageHub server for a
security application. Perhaps you are saving key motion events like a Digital Video Recorder
(DVR) (we covered the Key Clip Writer in Section 9.2.2 of the Hobbyist Bundle and in a 2016
PyImageSearch tutorial titled Saving key event video clips with OpenCV (http://pyimg.co/hvskf)
[46]). The worst thing that could happen if you don’t get rid of expired frames is that an intruder
kills power to a client and you don’t realize the frame isn’t updating. Think James Bond or
Jason Bourne sort of spy techniques.
Last in the loop is a check to see if the q key has been pressed — if so we break from the
loop and destroy all active montage windows (Lines 165-169).
8.2.4 Streaming Video Over Your Network with OpenCV and ImageZMQ
Now that we’ve implemented both the client and the server, let’s put them to the test.
Once your server is running, go ahead and start each client pointing to the server. On
each client, follow these steps:
i. Open an SSH connection to the client: ssh [email protected] (inserting your own IP
address, of course)
iv. If you are not using the book’s accompanying preconfigured Raspbian .img, then you’ll
need to install ImageZMQ: http://pyimg.co/fthtd [12]
As an alternative to Steps 1-6, you may start the client on reboot (as we learned to do in
Chapter 8 of the Hobbyist Bundle).
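For reference, launching the server on the central machine and the client on each RPi looks roughly like the following — the server IP address, montage dimensions, and any switch names beyond --model, --montageW/--montageH, and --server-ip are assumptions:

$ python server.py --prototxt MobileNetSSD_deploy.prototxt \
    --model MobileNetSSD_deploy.caffemodel --montageW 2 --montageH 2

$ python client.py --server-ip 192.168.1.5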
Once frames roll in from the clients, your server will come alive! Each frame received is
passed through the MobileNet SSD, annotated, and added to the montage. Figure 8.2 shows
a screenshot of the resulting streams annotated and arranged in a montage.
Figure 8.2: Our ImageZMQ server performing object detection on frames incoming from multiple client Raspberry Pis, with the annotated streams arranged in a montage.
You shouldn’t observe much, if any, lag — be sure to review the previous chapter’s guidance on performance with ImageZMQ (Section 3.5.9).
8.3 Summary
In this chapter, we learned how to stream video over a network using OpenCV and the Im-
ageZMQ library.
If you are ever in a situation where you need to stream live video over a network, definitely
give ImageZMQ a try — you’ll find it super intuitive.
Chapter 9
Training a Custom Gesture Recognition Model
Our hands, gestures, and body language can communicate just as much information, if not
more, than the words coming out of our mouths. Based on our posture and stance we, con-
sciously or unconsciously, communicate signals about our level of comfort, including whether
we are relaxed, agitated, nervous, or threatened. In fact, our bodies communicate so much
information that studies have found gait recognition (i.e., how someone walks) to be more
accurate than face recognition for person identification tasks!
In this chapter we are going to focus on recognizing hand gestures. We’ll utilize computer
vision and deep learning to build a custom hand gesture recognition system on the Raspberry
Pi. This system will be capable of running in real-time on the RPi, despite leveraging deep
learning.
You can use this project as a template when building your own security applications (i.e., re-
placing the keypad on your home security alarm with gesture recognition), accessibility projects
(i.e., enabling disabled users to more easily access a computer), or smart home applications
(i.e., replacing your remote with your hand).
Inside this chapter we will:
i. Gather a dataset of example images for each hand gesture we wish to recognize.
ii. Implement GestureNet, a Convolutional Neural Network designed to recognize each gesture.
iii. Train GestureNet on our example gesture dataset.
iv. Create a Python script to take our trained model and then recognize gestures in real-time.
In this section we’ll look at a high-level overview of our hand gesture recognition pipeline, ensuring we understand the goal of the project before we start diving into code. From there we’ll look at the directory structure for our project and then implement our configuration file.
The goal of gesture recognition is to interpret human gestures via systemized algorithms. Most
gesture recognition algorithms focus on hand or face gestures; however, there is an increasing
body of work surrounding gait recognition [47, 48, 49] which can be used to identify a person
strictly by how they are walking.
In this chapter we’ll focus on hand gesture recognition, that is, recognizing specific ges-
tures such as stop, fist/close, peace, etc. (Figure 9.1, top).
Figure 9.1: Top: An example of the hand gesture system we’ll be creating. Bottom: The steps
involved in creating our hand gesture implementation.
To build our gesture recognition application we’ll need to utilize both traditional computer
vision and deep learning, the complete pipeline of which is depicted in Figure 9.1 (bottom).
First, we’ll utilize thresholding to segment the foreground hand from the background im-
age/frame. The end result is a binary image depicting which pixels belong to the hand and
which ones are the background (and thus uninteresting to our application).
In the context of this chapter, we’ll be framing gesture recognition as a home security
application.
Perhaps you are interested in replacing or augmenting your existing home alarm keypad
(i.e., the keypad where you enter your code to arm/disarm your alarm) with gesture recognition.
When you (or an intruder) enter your residence, you will need to enter a “four digit” code based on your hand gestures — if the gesture code is incorrect, or if sufficient time passes without a gesture being entered, the alarm will sound.
The benefit of augmenting an existing home security keypad with gesture recognition is that
most burglars know about keypad-based inputs — but gesture recognition? That’s something
new and something likely to throw them off their game (if they even recognize/know what the
screen and camera mounted on your wall is supposed to do).
|-- assets
| |-- correct.wav
| |-- fist.png
| |-- hang_loose.png
| |-- incorrect.wav
| |-- peace.png
| |-- stop.png
|-- config
| |-- config.json
|-- output
| |-- gesture_reco.h5
| |-- lb.pickle
|-- pyimagesearch
| |-- nn
| | |-- __init__.py
| | |-- gesturenet.py
| |-- notifications
| | |-- __init__.py
| | |-- twilionotifier.py
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
|-- gather_examples.py
|-- train_model.py
|-- recognize.py
Inside the assets/ directory we store audio and image files which will be used when
building a simple user interface with OpenCV. Each of the .png files provides a visualization
for each of the hand gestures we’ll be recognizing. The .wav files provide audio for “correct”
and “incorrect” gesture inputs to our home security application.
The config/ directory stores the config.json file that we’ll be utilizing for this project while the output/ directory will store the trained gesture recognition deep learning model along with the label binarizer (so we can encode/decode gesture labels).
The gather_examples.py script is used to access the video stream on our camera —
from there we provide examples of each gesture that we wish our CNN to recognize.
These example frames are saved to disk for the train_model.py script, which as the
name suggests, trains GestureNet on the example gestures.
Remark. If you do not wish to gather your own example gestures, I have provided a sample
of the gesture recognition dataset I gathered inside the data directory. You can use those
example images if you simply want to run the scripts and get a feel for how the project functions.
Finally, the recognize.py script puts all the pieces together, performing gesture recognition in real time and comparing the input gesture sequence against our passcode.
Let’s now review our configuration file — open up config.json and insert the following code:
1 {
2 // define the top-left and bottom-right coordinates of the gesture
3 // capture area
4 "top_left": [20, 10],
5 "bot_right": [220, 210],
Figure 9.2: Hand gesture recognition will be performed when the user places their hand within the
"black square" region which we have programmatically highlighted with a red rectangle.
In order to make our gesture recognition pipeline more accurate, we’ll only perform gesture
recognition with a specific ROI of the frame (Figure 9.2). Here you can see a rectangle has
been defined via Lines 4 and 5 — any hands/gestures captured within this rectangular region
will be captured and identified. Any hands/gestures outside of this region will be ignored.
Our foreground hands should have sufficient contrast with the background, enabling basic image processing (i.e., thresholding) to perform the segmentation. Since I have light skin, I have chosen to use a dark background — in this manner there will be enough contrast between my hand and the background. Conversely, if you have dark skin you should utilize a light background, again ensuring there is enough contrast for thresholding to be accurately utilized.
We implement our gesture recognition method this way to make it easier to segment foreground hands from the background, as we assume we can control the background of the ROI we are monitoring.
Next, let’s name the gestures that we’ll be identifying and enable “hot keys” that can be used
to save examples of each of these gestures to disk:
Line 10 defines the "mappings" dictionary, which maps a key on your keyboard to a particular gesture name.
Here you can see that we’ll be recognizing four gestures: "fist", "hang loose", "peace", and "stop" — examples of these gestures can be seen in Figure 9.3.
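A sketch of that "mappings" entry is shown below — the specific key-to-gesture pairs are inferred from the hot keys used later in this chapter (f, h, p, s, and i), so verify them against the config.json included with your code download:

// mappings of keyboard keys to gesture class names (the "ignore"
// class captures frames that contain no gesture at all)
"mappings": {
    "f": "fist",
    "h": "hang_loose",
    "p": "peace",
    "s": "stop",
    "i": "ignore"
},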
Inside the gather_examples.py script (Section 9.3) we’ll be accessing our video stream
and providing examples of each of these gestures so we can later train our CNN to recognize
them (Section 9.4).
In order to save each of the gesture recognition examples we define “hot keys”. For exam-
ple, if we press the f key on our keyboard then the gather_examples.py script assumes
we are making a “fist” in which case the current frame is saved to disk and labeled as a “fist”.
Similarly, if we make a peace sign and then press the p key, then the frame is labeled as a
“peace sign” and saved to our hard drive.
The “ignore” class (captured via the i key) is a special case — we assume that frames with
the “ignore” label have no gestures and are instead just the background and nothing else. We
capture such a class so that we can train our CNN to “ignore” any frames that do not have any
gestures, thereby reducing false-positive gesture recognitions (i.e., the CNN thinking there is a gesture present when there really isn’t one).
Figure 9.3: The four signs our hand gesture recognition system will recognize: fist (top-left), hang
loose (top-right), peace (bottom-left), and stop (bottom-right). You can recognize additional ges-
tures by providing training examples for each sign.
Finally, Line 19 defines the "dataset_path", the path to where each of our gestures
classes and associated examples will be saved, respectively.
Next, let’s take a look at parameters when training our GestureNet model:
Lines 23-25 define our initial learning rate, batch size, and number of training epochs. We then define the path to the output serialized model (after training) along with the label binarizer, used to encode/decode class labels (Lines 29 and 30).
The following code block defines the path to our assets used to build the simple frontend
GUI along with our actual passcode:
The "passcode" consists of four gestures but you could modify it to use one, two, three, or
even one hundred gestures — the actual length of the list is arbitrary. What is not arbitrary is
the contents of the "passcode" list. You’ll notice that each of the entries in the "passcode"
list maps to a gesture in the "mappings" dictionary back on Line 10.
You can make your passcode whatever you want provided that every entry in "passcode"
exists in "mappings" — the GestureNet CNN will only recognize gestures it was trained
to identify.
The following two configurations handle (1) how many consecutive frames a given gesture needs to be identified for before we consider it a positive recognition, and (2) the number of seconds to show a correct/incorrect message after a gesture code has been input:
The final code block in our configuration defines the paths to our correct/incorrect audio
files along with any optional Twilio API information used to send text message notifications if
an input gesture code is incorrect:
46 // path to the audio files that will play for correct and incorrect
47 // pass codes
48 "correct_audio": "assets/correct.wav",
49 "incorrect_audio": "assets/incorrect.wav",
50
51 // variables to store your twilio account credentials
52 "twilio_sid": "YOUR_TWILIO_SID",
53 "twilio_auth": "YOUR_TWILIO_AUTH_ID",
54 "twilio_to": "YOUR_PHONE_NUMBER",
55 "twilio_from": "YOUR_TWILIO_PHONE_NUMBER",
56 "address_id": "YOUR_ADDRESS"
57 }
We are now ready to start implementing our hand gesture recognition system!
In order to train our GestureNet architecture we first need to gather training examples of each
hand gesture we wish to recognize. Once we have the dataset we can train the model.
Open up the gather_examples.py file in your project structure and insert the following code:
Lines 2-8 handle importing our required Python packages while Lines 11-14 parse our
command line arguments. We only need a single argument, --conf, the path to our input
configuration file which we load from disk on Line 17.
Lines 21 and 22 grab the top-left and bottom-right (x, y)-coordinates for our gesture recog-
nition capture area. Our hand must be placed within this region before either (1) gather-
ing examples of a particular gesture or (2) recognizing a gesture (which we’ll do later in the
recognize.py script).
Next, let’s map the names of each gesture to a key on our keyboard:
24 # grab the key => class label mappings from the configuration
25 MAPPINGS = conf["mappings"]
26
27 # loop over the mappings
28 for (key, label) in list(MAPPINGS.items()):
29 # update the mappings dictionary to use the ordinal value of the
30 # key (the key value will be different on varying operating
31 # systems)
32 MAPPINGS[ord(key)] = label
33 del MAPPINGS[key]
34
35 # grab the set of valid keys from the mappings dictionary
36 validKeys = set(MAPPINGS.keys())
37
38 # initialize the counter dictionary used to count the number of times
39 # each key has been pressed
40 keyCounter = {}
Line 25 grabs our MAPPINGS from the configuration. The MAPPINGS variable is a dictionary
with the keys being a given letter on our keyboard and the value being the name of the gesture.
For example, the s key on our keyboard maps to the stop gesture (as defined in Section 9.2.3
above).
However, we have a bit of extra work to do. We’ll be using the cv2.waitKey function to
capture keypresses — this function requires that we have the ordinal value of the key rather
than the string value. Therefore, we must:
i. Loop over the original string keys in the MAPPINGS dictionary (Line 28).
ii. Update the MAPPINGS dictionary to use the ord value of the key (Line 32).
iii. Delete the original string key value from the dictionary (Line 33).
Given this updated dictionary we then grab the set of validKeys that can be pressed
when gathering training examples (Line 36).
41
42 # start the video stream thread
43 print("[INFO] starting video stream thread...")
44 vs = VideoStream(src=0).start()
45 # vs = VideoStream(usePiCamera=True).start()
46 time.sleep(2.0)
47
48 # loop over frames from the video stream
49 while True:
50 # grab the frame from the threaded video file stream
51 frame = vs.read()
52
53 # resize the frame and then flip it horizontally
54 frame = imutils.resize(frame, width=500)
55 frame = cv2.flip(frame, 1)
Line 40 initializes a counter dictionary to count the number of times a given key on the
keyboard has been pressed (thus counting the total number of gathered training examples per
class).
We start looping over frames from the video stream on Line 49. We preprocess the frame
by (1) reducing the frame size and then (2) flipping the frame horizontally. We flip the frame
horizontally since our frame is mirrored to us. Flipping “un-mirrors” the frame.
Next, we can extract the gesture capture roi from the frame:
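A sketch of that extraction and thresholding step follows — the variable names mirror those used in recognize.py later in the chapter, and the threshold value of 75 is illustrative only (see the remark below about tuning cv2.threshold):

# extract the gesture capture ROI, convert it to grayscale, and
# threshold it to obtain a binary mask of the hand
roi = frame[TOP_LEFT[1]:BOT_RIGHT[1], TOP_LEFT[0]:BOT_RIGHT[0]]
roi = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
roi = cv2.threshold(roi, 75, 255, cv2.THRESH_BINARY)[1]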
Line 61 uses NumPy array slicing and the supplied gesture capture (x, y)-coordinates to
extract the roi. We then convert the roi to grayscale and threshold it, leaving us with a binary
representation of the image. Ideally, our hand will show up as foreground (white) pixels while
the background remains dark (black).
Remark. You may need to tune the parameters to cv2.threshold to obtain an adequate
segmentation of the hand. See Section 9.5 for more details on how this method can be made
more robust.
Line 66 clones the frame so we can draw on it while Line 70 visualizes the gesture capture region. We then display the frame and roi to our screen on Lines 71 and 72.
Line 73 checks to see if any keys are pressed. If a key is pressed, we need to check and
see which one:
If the q key is pressed then we have finished gathering gesture examples and can safely
exit the script.
Otherwise, we check to see if the key pressed exists inside our validKeys set (Line 81). If
so, we construct the path to the output label subdirectory (Line 83), create the output directory
if necessary (Lines 86 and 87), and finally save the output roi to disk (Lines 90-96).
To gather your own gesture recognition dataset, open up a terminal and execute the following
command:
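Assuming you launch from the project root, the command should look like the following (gather_examples.py takes only the --conf switch described above):

$ python gather_examples.py --conf config/config.json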
Figure 9.4: Left: An example of the cv2.threshold operation used to binarize my hand as a
white foreground on a black background. This example "hang loose" gesture will be logged to disk
once I press the h key on my keyboard. We can then train a CNN to recognize the gesture. Right:
Just like we need examples for each gesture we want to recognize, we also need examples of "no
gestures", ensuring that our model knows the difference between a person performing a sign and
when to simply ignore the frame as it has no semantic content.
Here you can see that I am making a “hang loose” sign (Figure 9.4, left). I then press the h key on my keyboard, which saves the hang loose example to the hand_gesture_data/hang_loose/ directory.
Figure 9.4 (right) shows an example of the “ignore” class. Note how there are no gestures
in the gesture capture region — this is done purposely so that we can train our CNN to recog-
nize the lack of a gesture in the frame (otherwise our CNN may report false-positive gesture
classifications). To capture the “ignore” class I press the i key on my keyboard.
Examining the output hand_gesture_data/ directory, you can see I have gathered approximately 100 examples per class.
We’ll train a CNN to recognize each of these gestures in the following section.
Remark. I have included my own example gesture dataset in the downloads associated with
this book. Feel free to use this dataset to continue following along or stop now to gather your
own dataset.
In the first part of this section we’ll implement a CNN architecture to recognize gestures based
on the example dataset we gathered in the previous section. We’ll then train the CNN and
examine the results.
The CNN we’ll be implementing in this chapter is called GestureNet — I created this archi-
tecture specifically for the task of recognizing gestures.
The architecture has both AlexNet and VGGNet-like characteristics, including (1) a larger
CONV kernel size in the very first layer of the network (AlexNet-like [2]), and (2) 3x3 filter sizes
throughout the rest of the network (VGGNet-like [3]).
Since we are working with binary blobs as our training data, the larger filter sizes can capture
more “structural information”, such as the size and shape of the binary blob, before switching
to more "standard" 3x3 convolutions. As we’ll see in Section 9.4.3, this combination of a large
filter size early in the network followed by smaller filter sizes leads to an accurate hand gesture
recognition model.
17 model = Sequential()
18 inputShape = (height, width, depth)
19 chanDim = -1
20
21 # if we are using "channels first", update the input shape
22 # and channels dimension
23 if K.image_data_format() == "channels_first":
24 inputShape = (depth, height, width)
25 chanDim = 1
Lines 1-10 import our required Python packages while Line 14 defines the build method
used to construct our CNN architecture. The build method requires four parameters:
• width: The width (in pixels) of the input images in our dataset.
• height: The height (in pixels) of the input images in our dataset.
• depth: The number of channels in the image (3 for RGB images, 1 for grayscale/single
channel images).
• classes: The total number of class labels in our dataset (4 gestures to recognize plus an “ignore” class, so 5 classes total).
Line 17 initializes the model while Lines 18-25 initialize the inputShape and channel dimension based on whether we are using channels-first or channels-last ordering.
27 # first CONV => RELU => CONV => RELU => POOL layer set
28 model.add(Conv2D(16, (7, 7), padding="same",
29 input_shape=inputShape))
30 model.add(Activation("relu"))
31 model.add(BatchNormalization(axis=chanDim))
32 model.add(MaxPooling2D(pool_size=(2, 2)))
33 model.add(Dropout(0.25))
Line 28 defines the first layer of the network, a CONV layer that will learn 16 filters, each
with a filter size of 7x7. As mentioned above, we use a larger filter size to capture more
structural/shape information of the binarized shape in the image. We then apply batch normal-
ization, max-pooling (to reduce volume size), and dropout (to reduce overfitting).
We then stack two more CONV layers, this time each with 3x3 filter sizes:
35 # second CONV => RELU => CONV => RELU => POOL layer set
36 model.add(Conv2D(32, (3, 3), padding="same"))
37 model.add(Activation("relu"))
38 model.add(BatchNormalization(axis=chanDim))
39 model.add(MaxPooling2D(pool_size=(2, 2)))
40 model.add(Dropout(0.25))
41
42 # third CONV => RELU => CONV => RELU => POOL layer set
43 model.add(Conv2D(64, (3, 3), padding="same"))
44 model.add(Activation("relu"))
45 model.add(BatchNormalization(axis=chanDim))
46 model.add(MaxPooling2D(pool_size=(2, 2)))
47 model.add(Dropout(0.25))
Note how (1) the volume size is reduced via max-pooling the deeper we go in the network, while (2) the number of filters learned by the CONV layers simultaneously increases the deeper we go. This behavior is very typical and you’ll see it in nearly every CNN you encounter.
Lines 50-54 add a single fully-connected layer with 128 neurons. Another FC layer is added
on Line 57, this time containing the total number of classes. The final, fully constructed model
is returned to the calling function on Line 61.
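A sketch of that fully-connected head is shown below — only the 128-neuron FC layer, the classes-sized FC layer, and the final return are stated explicitly in the text; the Flatten, batch normalization, dropout, and softmax placements are assumptions modeled on the CONV blocks above:

# flatten the volume, then apply the FC => RELU layer set
model.add(Flatten())
model.add(Dense(128))
model.add(Activation("relu"))
model.add(BatchNormalization())
model.add(Dropout(0.5))

# softmax classifier with one output per class
model.add(Dense(classes))
model.add(Activation("softmax"))

# return the constructed network architecture
return model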
With the model implemented we can move on to our training script. Open up the train_model
.py file and we’ll get to work:
Lines 2-14 import our required Python packages. Note how we import GestureNet on Line 7.
Lines 16-20 parse our command line arguments. Again, the only switch we need is --conf,
the path to our configuration file which we load on Line 23.
Line 28 grabs the paths to all example training images in the "dataset_path" from our
configuration while Lines 29 and 30 initialize the data and labels lists, respectively.
On Line 33 we loop over all imagePaths in our dataset. For each image path, we:
i. Grab the label by parsing the class label from the filename on Line 35 (the label is the subdirectory name of where the image lives, which in this case, is the name of the label itself — see Section 9.3 for more details).
ii. Load the image from disk and resize it to 64x64 pixels, the input dimensions expected by GestureNet.
From there we add the image and label to our data and labels lists, respectively (Lines
44 and 45).
Line 49 converts the data list to a proper NumPy array while simultaneously scaling all
pixel values from the range [0, 255] to [0, 1].
We then reshape the data matrix to include a channel dimension (Line 53), otherwise Keras
will not understand that we are working with grayscale, single channel images.
Lines 56 and 57 one-hot encode our labels while Lines 61 and 62 construct the training
and testing split.
Lines 65-67 initialize our data augmentation object, used to randomly translate, rotate, shear, etc. our input images. Data augmentation is used to reduce overfitting and improve the ability of our model to generalize. You can read more about data augmentation inside Deep Learning for Computer Vision with Python (http://pyimg.co/dl4cv [50]) and in this tutorial:
http://pyimg.co/pedyk [51].
Line 70 initializes the GestureNet model, instructing it to accept 64x64 input images with
only a single channel. The total number of classes is equal to the number of unique classes
found by the LabelBinarizer.
From there we compile the model and then begin training (Lines 77-81).
We also serialize both the model and lb to disk so we can use them later in the recognize.py
script.
By the end of our 75th epoch we are obtaining 100% accuracy on our testing data!
We are now ready to put all the pieces together and finish implementing our hand gesture
recognition pipeline!
The recognize.py script will:
i. Load our trained GestureNet model and label binarizer from disk, then access our video stream.
ii. Extract and threshold the gesture capture ROI from each frame.
iii. Classify each frame from the video stream as either (1) an identified gesture or (2) background, in which case we ignore the frame.
iv. Compare the input gesture passcode to the correct passcode defined in Section 9.2.3.
v. Either welcome the user (correct passcode) or alert the home owner of an intruder (incorrect passcode).
Open recognize.py in your favorite code editor and we’ll get started:
Lines 2-18 handle our imports. We’ll be using the TwilioNotifier to send text message
notifications of intruders. The GPIO package can also be used to access the GPIO pins on a
Pi Traffic Light if you are using one.
Lines 21-23 define the integer values of the red, yellow, and green lights on the Traffic Light
HAT. We then setup the Pi Traffic Light on Lines 26-29.
Remark. If you wish to eliminate the Pi Traffic Light, be sure to comment out all code lines that
begin with GPIO.
We’ll now define four helper utility functions, the first of which, correct_passcode, is
defined below:
31 def correct_passcode(p):
32 # actuate a lock or (in our case) turn on the green light
33 GPIO.output(green, True)
34
35 # print status and play the correct sound file
36 print("[INFO] everything is okay :-)")
37 play_sound(p)
The correct_passcode function, as the name suggests, is called when the user has
input a correct hand gesture code. This method accepts a single parameter, p, the path to
the input audio file to play when the correct passcode has been entered. Line 33 turns on the
green light via the GPIO library while Line 37 plays the “correct passcode” audio file.
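A sketch of the companion incorrect_passcode helper described next is shown below — the red light, audio playback, and Twilio text alert come from the description, while the function name, the TwilioNotifier send method, and the message wording are assumptions:

def incorrect_passcode(p, tn):
    # turn on the red light to signal an incorrect passcode
    GPIO.output(red, True)

    # print status and play the incorrect sound file
    print("[INFO] incorrect passcode entered!")
    play_sound(p)

    # send a text message alert to the home owner via Twilio
    msg = "Incorrect gesture passcode entered, possible intruder!"
    tn.send(msg)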
This method accepts two parameters, the first of which is p, the path to the audio file to be
played if an incorrect passcode is entered. The second argument, tn, is our TwilioNotifier
object used to send a text message alert to the home owner.
Line 41 displays the “red” light on our Traffic Light HAT while Line 45 plays the “incorrect
passcode” audio file. Lines 48-52 then send a text message to the home owner, indicating that
a potential intruder has entered the house and entered the incorrect passcode.
The reset_lights function is used to turn off all lights on the Traffic Light HAT:
54 def reset_lights():
55 # turn off the lights
56 GPIO.output(red, False)
57 GPIO.output(yellow, False)
58 GPIO.output(green, False)
The final utility function, play_sound, plays an audio file via the built-in aplay command
on the Raspberry Pi:
60 def play_sound(p):
61 # construct the command to play a sound, then execute the command
62 command = "aplay {}".format(p)
63 os.system(command)
64 print(command)
With our helper utilities defined, we can move on to the body of the script:
Lines 67-70 parse our command line arguments. Just like all other scripts in this chapter,
we need only the --conf switch.
If you recall from Section 9.2.2, where we reviewed the project/directory structure, we have
an assets/ directory which contains emoji-like visualizations for each of the gestures.
The following code block loads each of these emoji images from disk:
76 # grab the paths to gesture icon images and then initialize the icons
77 # dictionary where the key is the gesture name (derived from the image
78 # path) and the value is the actual icon image
79 print("[INFO] loading icons...")
80 imagePaths = paths.list_images(conf["assets_path"])
81 icons = {}
82
83 # loop over the image paths
84 for imagePath in imagePaths:
85 # extract the gesture name (label) the icon represents from the
86 # filename, load the icon, and then update the icons dictionary
87 label = imagePath.split(os.path.sep)[-1].split(".")[0]
88 icon = cv2.imread(imagePath)
89 icons[label] = icon
Line 80 grabs all imagePaths inside the "assets_path" directory while Line 81 initial-
izes the icons dictionary. The key to the dictionary is the name of the label while the value is
the icon itself.
Lines 84-89 then loop over each of the icon paths, extract the name of the gesture from the filename, load the icon, and then store it in the icons dictionary.
91 # grab the top-left and and bottom-right (x, y)-coordinates for the
92 # gesture capture area
93 TOP_LEFT = tuple(conf["top_left"])
94 BOT_RIGHT = tuple(conf["bot_right"])
95
96 # load the trained gesture recognizer model and the label binarizer
97 print("[INFO] loading model...")
98 model = load_model(str(conf["model_path"]))
99 lb = pickle.loads(open(str(conf["lb_path"]), "rb").read())
100
101 # start the video stream thread
102 print("[INFO] starting video stream thread...")
103 # vs = VideoStream(src=0).start()
104 vs = VideoStream(usePiCamera=True).start()
105 time.sleep(2.0)
Lines 93 and 94 grab the top-left and bottom-right (x, y)-coordinates for our gesture recog-
nition area, just like we did in the gather_examples.py script from Section 9.3.
We then load both the trained GestureNet model and serialized LabelBinarizer from
disk on Lines 98 and 99. Lines 102-105 then access our video stream.
We only have a few more initializations to go before we start looping over frames:
Line 110 initializes currentGesture, a list containing two values: (1) the current recognized gesture, and (2) the total number of consecutive frames that GestureNet has reported the current gesture as the classification. By keeping track of the number of consecutive frames a given gesture has been recognized for, we can reduce the likelihood of a false-positive classification from the network.
Line 114 initializes gestures, a list of hand gestures input from the user. We’ll compare
gestures with the passcode in the configuration file to determine if the user has input the
correct passcode. We’ll also grab the timestamp of when the gestures have been input to the
system (Line 115).
Line 126 reads a frame from our stream while Line 127 grabs the current timestamp.
We then resize the frame and horizontally flip it, just like we did in Section 9.3. Line 135 visualizes the gesture capture region, ensuring we know where to place our hands for our gestures to be recognized.
Let’s now make a check to see how many gestures have been input by the user:
For this application, we’ll assume that each passcode has four gestures (you can increase
or decrease this value as you see fit). Line 139 checks to see if there are less than four
gestures input, implying that the end user is still entering their gestures.
Lines 143-145 extract the hand gesture recognition roi and then threshold it, just like we
did in Section 9.3.2. After thresholding, our hand now appears as white foreground on a black
background (Figure 9.4).
Lines 151-154 preprocess the roi for classification, just as we did in the train_model.py
script from Section 9.4. Lines 157 and 158 classify the input roi to obtain the gesture predic-
tion.
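A sketch of that preprocessing and prediction step follows — it mirrors the training-time preprocessing (64x64 input, pixels scaled to [0, 1], plus batch and channel dimensions); the exact variable names and NumPy calls are assumptions:

# resize the ROI to the CNN input dimensions, scale it to [0, 1],
# and add the batch and channel dimensions Keras expects
roi = cv2.resize(roi, (64, 64))
roi = roi.astype("float") / 255.0
roi = np.expand_dims(roi, axis=0)
roi = np.expand_dims(roi, axis=-1)

# classify the ROI and decode the prediction via the label binarizer
proba = model.predict(roi)[0]
label = lb.classes_[np.argmax(proba)]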
Our next code block handles updating the currentGesture bookkeeping variable:
160 # check to see if the label from our model matches the label
161 # from the previous classification
162 if label == currentGesture[0] and label != "ignore":
163 # increment the current gesture count
164 currentGesture[1] += 1
165
166 # check to see if the current gesture has been recognized
167 # for the past N consecutive frames
168 if currentGesture[1] == conf["consec_frames"]:
169 # update the gestures list with the predicted label
170 # and then reset the gesture counter
171 gestures.append(label)
172 currentGesture = [None, 0]
173
174 # otherwise, reset the current gesture count
175 else:
176 currentGesture = [label, 0]
Line 162 checks to see if (1) the current predicted label matches the previous predicted label, and (2) the predicted label is not the “ignore” class. Provided this check passes, we increment the number of consecutive frames the current gesture has been predicted for (Line 164).
If we have reached the "consec_frames" threshold, we add the current label to the
gestures list and then reset the currentGesture. Using a consecutive count helps reduce
false-positive classifications.
Lines 175 and 176 handle the case where the current label does not match the previous
label in currentGesture, in which case we reset currentGesture.
In the event we are entering a gesture we’ll change the color of the Traffic Light HAT to
“yellow”, indicating that a passphrase is being entered:
If there are no gestures or we already have a full gestures list, we’ll turn off the light on
the HAT (Lines 183 and 184).
Let’s now start building a basic GUI visualization for the end user:
Line 187 initializes a 425x250 pixel canvas that we’ll be drawing on.
We then loop over the number of input gestures on Line 190. For each gesture we compute
the starting x-coordinate of the emoji. If there exists an entry for the i-th gesture, we use the
icons dictionary to draw it on the canvas (Lines 196 and 197); otherwise, we draw a white
rectangle, indicating a gesture has not yet been input (Lines 200-204). An example of this
simple interface can be seen in Figure 9.5.
Figure 9.5: Left: An example of our GUI interface waiting for a gesture recognition input. Right:
As gestures are input the white rectangles are replaced with the emoji corresponding to the rec-
ognized gesture.
Our next code block handles checking to see if the input list of gestures is correct or not:
Line 214 checks to see if four gestures have been entered, indicating that we need to
check if the passcode is correct or not. Lines 217 and 218 then grab the timestamp of when
the check took place, just in case we need to alert the home owner.
We initialize our status, color, and audioPath variables, operating under the assump-
tion that the entered passcode is correct (Lines 222-224). Line 227 then checks to see if the
passcode is correct.
Provided the correct gesture code has been entered we then make another check on Line
230. If correct is still set to False then we know that we have not played a “welcome”
message to the user — in that case, we create a Thread to call the correct_passcode
function and set the correct boolean to True.
Line 237 handles the case in which the entered passcode is incorrect. In this case, we update our status, color, and audioPath variables, respectively (Lines 239-241). If the alarm has not been raised, we raise it on Lines 244-249.
Lines 253-260 allow our simple GUI to display the correct/incorrect passcode for a total of
"num_seconds" before we reset and allow the user to try a different gesture passcode.
The final code block in the script handles visualizing the output images to our screen:
282 cv2.destroyAllWindows()
283 vs.stop()
Lines 265-268 draw the current timestamp, ts, and status on the output canvas. We
then visualize the ROI we’re monitoring, output frame, and passcode information to our screen
(Lines 271-273). Finally, if the q key is pressed on our keyboard we gracefully exit our script.
Figure 9.6: The correct hand gesture passcode has been entered.
Figure 9.6 visualizes the output of our script. In the bottom-right you can see my hand
gesture sequence (i.e. passcode). The passcode has been correctly recognized as “peace”,
“stop”, “fist”, and “hang_loose”, after which the correct.wav audio file is played and “Correct”
is displayed on the output frame (this passcode matches the "passcode" configuration setting
from Section 9.2.3).
Figure 9.7 shows the output of what an incorrect passcode would look like. Notice how the
screenshot clearly shows an incorrect passcode, which triggers the incorrect.wav audio
Figure 9.7: The incorrect hand gesture passcode has been entered by an intruder.
file to play and display "Incorrect" on the frame. Additionally, a text message notification is then
sent to my smartphone, alerting me to the intruder.
Keep in mind that hand gesture recognition is not limited to security applications.
You can also utilize gesture recognition to build accessibility programs, enabling disabled
users to more easily access a computer. You could also use hand gesture recognition to build
smart home systems, such as accessing your TV without a remote.
When developing your own hand gesture recognition applications you should use this chap-
ter as a template and starting point — from there you can extend it to work with your own
projects!
9.7 Summary
In this chapter you learned how to perform hand gesture recognition on the Raspberry Pi.
Our gesture recognition pipeline combined traditional computer vision techniques with deep learning algorithms.
In order to ensure our method was capable of running in real-time on the RPi, we needed to utilize background subtraction and thresholding to first segment the hand from the rest of the image. Our CNN was then able to recognize the hand gesture with a high level of accuracy.
The biggest problem with this approach is that it hinges on being able to accurately segment
the foreground hand from the background — if there is too much noise or if the segmentation
is not reasonably accurate, then the CNN will incorrectly classify the hand region. A more
advanced approach would be to utilize a deep learning-based object detector to first detect the
hand [52] in the image and then apply a hand keypoint detector [53] to localize each of the
fingers on the hand. Using this information we could more easily segment the hand from the
image and thereby increase the robustness of our system.
However, such a system would be too computationally expensive to run on the RPi alone
— we would need to utilize coprocessors such as the Movidius NCS or Google Coral USB
Accelerator, both of which are covered in the Complete Bundle of this text.
Chapter 10
Vehicle Recognition with Deep Learning
Package theft has become a massive problem in the United States, especially surrounding
major holidays.
Interestingly, it’s been reported that rural areas have had a higher package theft rate (per capita) than major cities, demonstrating the problem is not limited to just areas with high populations of people [54].
Using computer vision we can help combat package theft, ensuring that your package arrives
safely, whether you’re awaiting the arrival of the hottest just-released video game or simply
sending a care package for your grandmother’s 80th birthday.
Inside this chapter we’ll explore package theft through vehicle identification, and more
specifically, recognizing various delivery vehicles. You can use the techniques covered in
this chapter to recognize other types of vehicles as well.
ii. Utilize the (pre-trained) YOLO object detector to detect trucks in your vehicle dataset
iii. Perform transfer learning via feature extraction to extract features from the detected vehi-
cles
v. Recognize vehicles in real-time video using a Raspberry Pi, ImageZMQ, and deep learn-
ing
Vehicle recognition is the process of identifying the means of transport a person is operating.
In its simplest form, vehicle recognition labels a vehicle as car, truck, bus, etc. More advanced
forms of vehicle recognition may provide more specific information, including the make, model,
and color of the vehicle.
Figure 10.1: An example of: (1) detecting the presence of a vehicle in an image, (2) localizing
where in the image the vehicle is, and (3) correctly identifying the type of vehicle.
We’ll be examining vehicle recognition through our project on delivery truck identification.
We will learn how to utilize deep learning and transfer learning to create a computer vision
system capable of recognizing various types of trucks (ex., UPS, FedEx, DHL, USPS, etc.).
An example output of our delivery truck identification project can be seen in Figure 10.1.
Notice how we have correctly (1) detected the presence of the vehicle, (2) localized where in
the image the vehicle is, and (3) identified the type of delivery truck.
The project we’ll be building in this chapter is one of the more advanced projects in the
Hacker Bundle and will require more Python files and code than previous chapters. Take
your time and work through it slowly. I would also suggest reading through the chapter once
to obtain a high-level understanding of what we’re building and then going back to read it a
second time, this time paying closer attention to the details.
In the first part of this chapter we’ll review the general flow of our algorithm, including each of
the components we are going to implement. From there we’ll review our project structure and
then dive into our configuration file.
i. Phase #1: Building our dataset of vehicles and extracting them from images.
ii. Phase #2: Training the vehicle recognition model via transfer learning.
iii. Phase #3: Detecting and classifying vehicles in real-time video using our trained model.
The steps of Phase #1 can be seen in Figure 10.2 (top). We’ll start by assuming we have a
dataset of example delivery trucks, including UPS, USPS, DHL, etc.
Remark. The delivery truck dataset is provided for you in the downloads associated with this
text. You can also use the instructions in Section 10.4 to build your own dataset as well.
However, simply having example images of various delivery trucks is not enough — we also
need to know where in the input images the vehicle is. To localize where the delivery truck is,
we’ll utilize the YOLO object detector. YOLO will find the truck in the image and then write
the bounding box (x, y)-coordinates of the truck to a CSV file.
Given the CSV file, we can move on to Phase #2 (Figure 10.2, middle). In this phase we
take the detected trucks and perform transfer learning via deep learning. We start by looping
over each of the detections in the CSV file, loading the corresponding image for the current
detection, and then extracting the vehicle from the image using the (x, y)-coordinates provided by
the CSV file — the vehicle ROI.
Instead of trying to train a vehicle recognition model from scratch, we can instead apply
transfer learning via feature extraction using deep learning (http://pyimg.co/r0rgh) [55].
Pre-trained deep neural networks learn discriminative, robust features that can be used to
recognize classes the network was never trained on. We’ll take advantage of the robust nature
Figure 10.2: Our vehicle recognition project consists of three phases. In Phase #1 we build our
training dataset by taking an input set of vehicle images, performing object detection, and storing
the (x, y)-coordinates of where each vehicle resides in an image. Phase #2 consists of taking the
detected vehicle regions, extracting features from the vehicle ROIs using ResNet, and then training
a Logistic Regression model on the features. Finally, Phase #3 utilizes a RPi to capture frames,
passes them to a central server using ImageZMQ, applies object detection to detect vehicles,
extracts features from the vehicle ROI, and then identifies the vehicle using the extracted features
and our Logistic Regression model.
of CNNs and use ResNet (pre-trained on ImageNet) to extract features from the vehicle ROIs.
These features are then written to disk as an HDF5 file.
Remark. If you are new to the concept of transfer learning, feature extraction, and fine-tuning,
you’ll want to refer to Deep Learning for Computer Vision with Python (http://pyimg.co/dl4cv)
[50], as well as the following tutorials on the PyImageSearch blog:
Finally, we arrive at Phase #3 (Figure 10.2, bottom). This phase puts all the pieces together,
arriving at a fully functioning vehicle recognition system.
However, there is a bit of a problem we need to address first — the Raspberry Pi does not
have the computational resources to run all of the following models at the same time:
• The YOLO object detector (to localize vehicles)
• The ResNet feature extractor
• The Logistic Regression vehicle identification model
Instead of trying to utilize the underpowered RPi for these operations (or using a copro-
cessor such as the Movidius NCS or Google Coral), we’ll instead treat the Raspberry Pi as a
simple network camera and then use ImageZMQ to stream the results back to a more power-
ful host machine for processing (similar to Chapter 4 on Advanced Security Applications with
YOLO Object Detection). The results of the vehicle identification are then sent back to the RPi.
Take a second now to review the steps of Figure 10.2 to ensure you understand each of
the phases of this project. From here, we’ll move on to reviewing the directory structure for our
project.
|-- config
| |-- truckid.json
|-- pyimagesearch
| |-- io
| | |-- __init__.py
| | |-- hdf5datasetwriter.py
| |-- keyclipwriter
| | |-- __init__.py
| | |-- keyclipwriter.py
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
|-- yolo-coco
| |-- coco.names
| |-- yolov3.cfg
| |-- yolov3.weights
|-- output
| |-- detections.csv
| |-- truckid.model
|-- videos
| |-- cars.mp4
|-- client.py
|-- build_training_data.py
|-- extract_features.py
|-- train_model.py
|-- truck_id.py
Inside the config/ directory we’ll store our truckid.json configuration file which we’ll implement
in Section 10.3.3.
The yolo-coco/ directory contains three files, all for the YOLO object detector:
• coco.names: A plaintext file containing the names of the objects/labels YOLO was
trained on (i.e., the labels from the COCO dataset [59]).
• yolov3.cfg: The YOLO model configuration/architecture definition.
• yolov3.weights: The pre-trained YOLO model weights.
We’ll be using YOLO in Phase #1 (Section 10.4) to detect vehicles in our training set and
then again in Phase #3 (Section 10.5), where we detect vehicles in a real-time video stream.
The vehicle recognition dataset stores the images of vehicles we’ll be training a machine
learning model to recognize. The build_training_data.py script applies the YOLO object
detector to every image in the dataset directory, yielding the detections.csv file in the
output/ directory.
We then take the detections.csv file and use the extract_features.py script to extract
deep learning features (via ResNet, pre-trained on ImageNet) and write them to disk in output/
features.hdf5. Given the extracted features we use train_model.py to train a Lo-
gistic Regression model on top of the features — the output model is serialized to disk in
output/truckid.model.
With our directory structure reviewed, let’s move on to the configuration file:
1 {
2 // path to the input directory containing our truck training data
3 "dataset_path": "../datasets/vehicle_recognition_dataset",
4
5 // path to the CSV file containing the bounding box detections after
6 // applying the YOLO object detector
7 "detections_path": "output/detections.csv",
8
9 // base path to YOLO directory
10 "yolo_path": "yolo-coco",
11
12 // minimum probability to filter weak detections and threshold
13 // when applying non-maxima suppression
14 "yolo_confidence": 0.5,
15 "yolo_threshold": 0.3,
16
17 // list of class labels we're interested in from the YOLO object
18 // detector (we're adding "bus" since YOLO will sometimes
19 // misclassify a delivery truck as a bus)
20 "classes": ["truck", "bus"],
Line 3 defines the "dataset_path", which is the path to our directory containing the
input vehicle images. We assume that the "dataset_path" is a root directory and then
contains subdirectories for each class label. For example, in our project structure we have a
subdirectory named fedex/ that lives inside the dataset root directory. When we go to detect
objects in an image, and later extract features from the ROI, we can use the subdirectory
name to easily derive the label name.
After applying the YOLO object detector we use the "detections_path" to write the
bounding box coordinates to disk (Line 7).
Line 20 defines the list of class labels we’re interested in detecting. Sometimes the YOLO
object detector will misclassify a “truck” as “bus” — therefore, we’ll include the “bus” class to
make sure we find all truck-like vehicles in our input images.
The following configurations are used when applying transfer learning in Section 10.5:
24
25 // define the batch size and buffer size during feature extraction
26 "batch_size": 16,
27 "buffer_size": 1000,
28
29 // path to the label encoder
30 "label_encoder_path": "output/le.pickle",
31
32 // number of jobs to run while grid-searching the best parameters
33 "n_jobs": 1,
34
35 // path to the output model after training
36 "model_path": "output/truckid.model",
Line 23 sets the path to where extracted features will be stored. Lines 26 and 27 set the
batch size and buffer size for feature extraction. Under the vast majority of circumstances you
will not have to adjust these parameters.
Line 30 sets the path to the output, serialized LabelEncoder file while Line 36 does the
same for the output model after training.
Line 33 defines the number of parallel jobs to run while grid-searching for the best hyper-
parameters. Typically you’ll want to leave this value at 1 as grid-searching requires quite a bit
of RAM — the exception is if your machine has a large amount of RAM (over 64GB). In that
case you can increase the number of parallel jobs and have the grid-search run faster.
The next code block handles configurations from Phase #3 (Section 10.6) where we actually
deploy the trained vehicle recognition model:
Applying the YOLO object detector is very computationally expensive, therefore we shouldn’t
run the detector on each and every frame — instead, we can utilize the same trick we did in
Chapter 5 and utilize a two-stage process:
i. Stage #1: Apply background subtraction/motion detection to determine if there is sufficient
foreground (i.e., motion) in the current frame.
ii. Stage #2: If sufficient foreground is determined, we can then apply the YOLO object
detector to find any vehicles.
In order to apply motion detection we’ll be using OpenCV’s built-in MOG method. Lines 39-
42 define our parameters to MOG. Refer to Chapter 5 where these parameters are explained
in detail.
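For reference, a sketch of what these four entries might look like in the JSON configuration is shown
below. The key names follow the conf[...] lookups used later in truck_id.py, but the values shown
here are purely illustrative, not the book's defaults:

// MOG background subtraction parameters (illustrative values only)
"mog_history": 200,
"mog_nmixtures": 5,
"mog_bg_ratio": 0.7,
"mog_noise_sigma": 0,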
Our final set of configurations handle KeyClipWriter parameters, the path to any output
video files, along with any Dropbox API authentications:
I have already gathered the vehicle/truck dataset we’ll be using in this chapter (which is also
included in the accompanying downloads associated with this text), but I’ll show you the method
that I used to curate the dataset, just in case you want to build your own.
Given our dataset of vehicles, we then need to detect and localize each delivery truck.
Simply knowing that a vehicle exists in an image is not enough — we instead need to
use an object detector to detect the bounding box coordinates of a vehicle in an image.
Having the bounding box coordinates of the vehicle will enable us to (1) extract it from the input
image, (2) perform transfer learning via feature extraction on the ROI, and finally (3) train a
Logistic Regression model on top of the data. These three tasks will take place in Phase
#2 (Section 10.5), but before we can get there, we first need to build our dataset and obtain the
vehicle detections.
Figure 10.3: Example montage of delivery trucks for training our vehicle recognition system.
The dataset we’ll be using for delivery truck identification contains three types of trucks
along with a final “ignore” class used to filter out non-delivery truck vehicles:
FedEx, UPS, and USPS are all examples of delivery vehicles. The “ignore” class contains
vehicles that are not FedEx, UPS, or USPS trucks, including school buses, garbage trucks,
pickup trucks, etc.
The goal of Phase #1 is to (1) loop over all 1,178 input images in our dataset, (2) detect
the bounding box coordinates of each truck/vehicle in the current image, and (3) write the
coordinates back out to disk. Once we have the bounding box coordinates of each truck in an
image, we can then utilize transfer learning to actually recognize the vehicle detected by our
object detector.
Again, I have already curated the vehicle dataset for this chapter and provided it for you in
the downloads associated with this text. To create the dataset I downloaded images for each
class using Google Images. For example, I searched for “usps truck” in Google and then
programmatically downloaded each of the image results
(http://pyimg.co/rdyh0) [61]. As an alternative, you may use Bing Images to download results
programmatically (http://pyimg.co/vgcns) [62].
If you would like to build your own vehicle recognition dataset (or any other image dataset),
be sure to follow those instructions.
We have downloaded/organized our dataset of vehicles on disk in the previous section, but
that’s only the first step — simply having the vehicle images is not enough.
The next step is to apply object detection to localize where each of the vehicles resides in the
image. We’ll then write these locations to disk (in CSV file format) so we can use them for
feature extraction in Section 10.5.
Lines 2-8 import our required Python packages while Lines 11-14 parse our command line
arguments. The only command line switch we need is --conf, the path to our configuration
file.
Line 20 derives the path to the coco.names file based on our "yolo_path" from the
configuration. That path is used to load the labels plaintext file into a list named LABELS.
Lines 24 and 25 derive the paths to the YOLO weights and configuration files, respectively.
Once we have the paths we load the YOLO model itself on Line 29.
Lines 32 and 33 determine the output names of the YOLO layers (i.e., the layers that
contain our detected objects).
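For reference, a minimal sketch of the initialization described above might look like the following. It
assumes the imports from Lines 2-8 (os, cv2, etc.) and uses OpenCV's standard Darknet loader; note
that the indexing of getUnconnectedOutLayers() differs slightly across OpenCV versions:

# derive the path to the COCO labels file and load the class labels
labelsPath = os.path.sep.join([conf["yolo_path"], "coco.names"])
LABELS = open(labelsPath).read().strip().split("\n")

# derive the paths to the YOLO weights and model configuration, then
# load the YOLO object detector trained on the COCO dataset
weightsPath = os.path.sep.join([conf["yolo_path"], "yolov3.weights"])
configPath = os.path.sep.join([conf["yolo_path"], "yolov3.cfg"])
net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)

# determine only the *output* layer names that we need from YOLO
ln = net.getLayerNames()
ln = [ln[i[0] - 1] for i in net.getUnconnectedOutLayers()]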
Let’s now grab the paths to our input images in the "dataset_path" and start looping
over them:
35 # grab the input image paths and open the output CSV file for writing
36 imagePaths = list(paths.list_images(conf["dataset_path"]))
37 random.shuffle(imagePaths)
38 csv = open(conf["detections_path"], "w")
39
40 # loop over the input images
41 for (i, imagePath) in enumerate(imagePaths):
42 # load the input image and grab its spatial dimensions
43 print("[INFO] processing image {} of {}".format(i + 1,
44 len(imagePaths)))
45 image = cv2.imread(imagePath)
46 (H, W) = image.shape[:2]
47
48 # construct a blob from the input image and then perform a
49 # forward pass of the YOLO object detector, giving us our
50 # bounding boxes and associated probabilities
51 blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416),
52 swapRB=True, crop=False)
53 net.setInput(blob)
54 layerOutputs = net.forward(ln)
55
56 # initialize our lists of detected bounding boxes, confidences,
57 # and class IDs, respectively
58 boxes = []
59 confidences = []
60 classIDs = []
Lines 36 and 37 grab the paths to all of our input imagePaths and shuffle them. Line 38
opens our csv file for writing.
We then start looping over each of the imagePaths on Line 41. For each image we load it
from disk and then grab its spatial dimensions. Lines 51-54 construct a blob from the image,
which we then pass through the YOLO object detector. Lines 58-60 initialize three lists used
to store our detected bounding boxes, corresponding probability of detection, and class label
IDs.
For each output layer, and for each detection, we grab the scores (probabilities),
classID (the index with the largest corresponding predicted probability), and confidence
(the probability itself).
We make a check on Line 74 to ensure that the confidence meets our minimum prob-
ability. Performing this check helps reduce false-positive detections. Lines 80-86 derive the
bounding box coordinates of the detected object. We then update the boxes, confidences,
and classIDs appropriately on Lines 90-92.
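A sketch of that inner loop, following the prose above, is shown below. It assumes the layerOutputs,
the frame dimensions (W, H), NumPy imported as np, the configuration dictionary conf, and the three
empty lists initialized in the previous block:

# loop over each of the layer outputs and each detection
for output in layerOutputs:
    for detection in output:
        # extract the class ID and confidence (i.e., probability) of
        # the current object detection
        scores = detection[5:]
        classID = np.argmax(scores)
        confidence = scores[classID]

        # filter out weak predictions
        if confidence > conf["yolo_confidence"]:
            # scale the bounding box coordinates back relative to the
            # size of the image (YOLO returns the center (x, y) of the
            # box followed by its width and height)
            box = detection[0:4] * np.array([W, H, W, H])
            (centerX, centerY, width, height) = box.astype("int")

            # derive the top-left corner of the bounding box
            x = int(centerX - (width / 2))
            y = int(centerY - (height / 2))

            # update our lists of bounding boxes, confidences, and
            # class IDs
            boxes.append([x, y, int(width), int(height)])
            confidences.append(float(confidence))
            classIDs.append(classID)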
At this point we have now looped over all detections from the YOLO network and our boxes,
confidences, and classIDs lists have been populated — the next step is to (1) apply NMS
and then (2) filter only the truck/bus class:
Lines 96 and 97 apply non-maxima suppression [29] to suppress weak, overlapping bound-
ing boxes.
We then ensure that at least one detection was found on Line 100. Provided that we have
at least one detection, we initialize keep, a bookkeeping variable used to keep track of the
area of the largest bounding box rectangle we’ve found thus far. The keep tuple will store two
values (1) the area of the bounding box rectangle, and (2) the index into the idxs list for that
rectangle.
Line 106 starts looping over all remaining idxs after NMS. Line 109 checks to see if the
current class label for index i does not exist in our classes configuration (i.e., either “truck”
or “bus”).
If the label is not “truck” or “bus” we ignore the detection and keep looping (Line 110).
Otherwise, we can safely assume the label is either “truck” or “bus” so we compute the area
of the bounding box (Lines 114 and 115) and then check to see if we should update our
bookkeeping variable, keeping track of the largest bounding box found thus far (Lines 119 and
120).
Our final code block handles writing the largest bounding box to disk:
Provided at least one truck/bus class was found (Line 124), we write the following to the
output CSV file:
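Putting the pieces together, a sketch of this final block might look like the following. The exact CSV
column layout is an assumption based on the fields described above (the image path followed by the
bounding box coordinates):

# apply non-maxima suppression to suppress weak, overlapping boxes
idxs = cv2.dnn.NMSBoxes(boxes, confidences, conf["yolo_confidence"],
    conf["yolo_threshold"])

# ensure at least one detection exists
if len(idxs) > 0:
    # bookkeeping variable: (area, index) of the largest box so far
    keep = None

    # loop over the indexes we are keeping
    for i in idxs.flatten():
        # ignore any detection that is not in our classes list
        if LABELS[classIDs[i]] not in conf["classes"]:
            continue

        # compute the area of the bounding box and update our
        # bookkeeping variable if it's the largest found thus far
        (x, y, w, h) = boxes[i]
        area = w * h

        if keep is None or area > keep[0]:
            keep = (area, i)

    # provided a truck/bus was found, write the image path and its
    # bounding box coordinates to the output CSV file
    if keep is not None:
        (x, y, w, h) = boxes[keep[1]]
        row = [imagePath, str(x), str(y), str(x + w), str(y + h)]
        csv.write("{}\n".format(",".join(row)))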
To apply the YOLO object detector to our dataset of vehicle images, open up a terminal and
execute the following command:
On my 3 GHz Intel Xeon W processor, the YOLO object detector took 3m41s to run on the
input dataset. After the script finishes executing, you will find a file named detections.csv
in your output/ directory:
$ ls output/
detections.csv
Examining the detections.csv file, you’ll find that each row contains the image file path and
bounding box coordinates of the largest truck/bus class found in the image. In the following
section we’ll take this information, use it to extract the vehicle ROI for the input image, and then
apply transfer learning via feature extraction to train a model to correctly recognize the vehicle.
In this section you will use transfer learning via feature extraction to extract features from each
of the detected vehicle ROIs and then train a Logistic Regression model on top of the extracted
features. The output Logistic Regression model will be capable of recognizing vehicles in
images and video streams.
I’ll be making the assumption that you have (1) read the transfer learning chapters in Deep
Learning for Computer Vision with Python [50], and/or (2) read the PyImageSearch tutorials
on transfer learning (http://pyimg.co/r0rgh [55]), feature extraction (http://pyimg.co/1j05z [56]),
and fine-tuning (http://pyimg.co/rqtlj [57]).
If you haven’t read those chapters or tutorials, stop now and go read them before continuing.
In Section 10.4 we applied the YOLO object detector to localize vehicles in our input dataset of
trucks/buses. We will now use those locations to:
i. Extract the ROI of the vehicle using the bounding box coordinates.
ii. Pass the ROI through the ResNet network (without the FC layer head).
iv. Treat these activations as features and write them to disk in HDF5 format.
The output features will serve as input to our Logistic Regression model in Section 10.5.3.
You’ll notice that the HDF5DatasetWriter class is being imported on Line 2. We won’t
be reviewing this class as it’s outside the scope of this chapter, but all you need to know is
that this class allows us to store data on disk in HDF5 format.
HDF5 is a binary data format created by The HDF Group [58] to store gigantic numerical
datasets on disk (far too large to fit into memory) while at the same time facilitating easy access
and computation on the rows in the datasets. Feature extraction via deep learning tends to lead
to very large feature vectors so storing the data in HDF5 tends to be a natural choice.
Lines 15-18 parse our command line arguments. The only switch we need is --conf, the
path to our input configuration file.
Let’s now load our detections CSV file from disk and process it:
24
25 # extract the class labels from the image paths then encode the
26 # labels
27 labels = [r.split(",")[0].split(os.path.sep)[-2] for r in rows]
28 le = LabelEncoder()
29 labels = le.fit_transform(labels)
30
31 # load the ResNet50 network
32 print("[INFO] loading network...")
33 model = ResNet50(weights="imagenet", include_top=False)
Line 23 loads the contents of the "detections_path" CSV file, breaking it into a list, one
row per line.
We then extract the class labels from the filenames in each row (Line 27). These labels
are then used to fit a LabelEncoder (Lines 28 and 29), used to transform our labels from
strings to integers.
Line 33 loads the ResNet50 architecture from disk with weights pre-trained on ImageNet.
We leave off the top fully-connected layer from ResNet so we can perform transfer learning via
feature extraction.
35 # initialize the HDF5 dataset writer, then store the class label
36 # names in the dataset
37 dataset = HDF5DatasetWriter((len(rows), 100352),
38 conf["features_path"], dataKey="features",
39 bufSize=conf["buffer_size"])
40 dataset.storeClassLabels(le.classes_)
41
42 # initialize the progress bar
43 widgets = ["Extracting Features: ", progressbar.Percentage(), " ",
44 progressbar.Bar(), " ", progressbar.ETA()]
45 pbar = progressbar.ProgressBar(maxval=len(rows),
46 widgets=widgets).start()
The final output volume of ResNet50, without the fully-connected layers, is 7 × 7 × 2048,
thus implying that our output feature dimension is 100,352-d, as indicated when initializing the
HDF5DatasetWriter class on Lines 37-39.
Lines 43-46 initialize a progress bar widget we can use to estimate how long it will take for
the feature extraction process to complete.
We can now start looping over the rows in batches of size "batch_size":
Line 53 grabs the next batch of detected objects from the rows array. We then grab the
accompanying labels on Line 54. Line 55 initializes an empty list, batchImages, which will
store the images corresponding to each of the batchLabels.
We start looping over each of the batched rows on Line 58. For each row, we unpack it,
obtaining the imagePath and bounding box coordinates for the largest vehicle in the input
image (Lines 60-62). Line 65 loads the image from disk and grabs its dimensions.
At this point we need to extract the region of the image containing the vehicle:
Lines 70-73 truncate any bounding box coordinates which may fall outside the boundaries
of the input image. We then use NumPy array slicing on Line 76 to obtain the region of
interest that contains the vehicle. The ROI is processed by converting from BGR to RGB
color channel ordering, resizing to 224x224 pixels (the input dimensions for ResNet), and then
performing mean subtraction (Lines 80-87).
After all preprocessing steps are complete, we add the image to the batchImages list.
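A sketch of those ROI extraction and preprocessing steps is shown below. It assumes the bounding
box was unpacked as (startX, startY, endX, endY), that (h, w) are the image dimensions grabbed
above, and that the Keras helpers img_to_array and imagenet_utils are imported at the top of the
script:

# clip any bounding box coordinates that fall outside the image
startX = max(0, startX)
startY = max(0, startY)
endX = min(w, endX)
endY = min(h, endY)

# use NumPy array slicing to extract the vehicle ROI
roi = image[startY:endY, startX:endX]

# convert the ROI from BGR to RGB, resize it to the 224x224 input
# dimensions ResNet expects, then perform mean subtraction
roi = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)
roi = cv2.resize(roi, (224, 224))
roi = img_to_array(roi)
roi = np.expand_dims(roi, axis=0)
roi = imagenet_utils.preprocess_input(roi)

# add the preprocessed ROI to the batch
batchImages.append(roi)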
Given our batch of images we can pass them through our CNN to obtain the output features:
92 # pass the images through the network and use the outputs as
93 # our actual features
94 batchImages = np.vstack(batchImages)
95 features = model.predict(batchImages,
96 batch_size=conf["batch_size"])
97
98 # reshape the features so that each image is represented by
99 # a flattened feature vector
100 features = features.reshape((features.shape[0], 100352))
101
102 # add the features and labels to our HDF5 dataset
103 dataset.add(features, batchLabels)
104 pbar.update(i)
Lines 95 and 96 send the current batchImages through our ResNet network, obtaining
the activations from the network.
Again, recall that without the fully-connected layer head, ResNet produces an output volume
of size 7 × 7 × 2048. When flattened (Line 100), that leads to a total of 100,352 values used to
quantify the vehicle region. The combination of features and corresponding batchLabels
are then added to the HDF5 dataset on Line 103.
We wrap up the extract_features.py script by closing the dataset and then serializing
the LabelEncoder to disk:
If you’re interested in learning more about transfer learning, feature extraction, and fine-
tuning, make sure you refer to the following resources:
To extract features from our dataset of vehicle images, open up a terminal and execute the
following command:
On my machine the entire feature extraction process took 1m48s. You could also run this
script on a machine with a GPU to dramatically speed up the process as well.
After the script has run you can check the output/ directory to verify that the features.hdf5
file has been created:
$ ls output/
detections.csv features.hdf5
In the next section you will train a Logistic Regression model on these features, giving us
our final vehicle identification model.
Now that we have a dataset of extracted features, we can move on to training our machine
learning model to (1) accept the input features (extracted via ResNet), and (2) then actually
identify and recognize the vehicle.
Lines 2-9 handle our imports. You’ll want to specifically make note of the LogisticRegression
class, an instance of which we’ll be training later in this script, along with h5py, a Python
package used to create, modify, and access HDF5 datasets.
The next step is to load our LabelEncoder (created in Section 10.5.2 after running the
extract_features.py script) and then open the HDF5 database for reading:
Line 25 computes an index, i, used to determine the training/testing split. Here we’ll be
using 75% of the data for training and the remaining 25% for testing.
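A minimal sketch of those steps, using the imports from Lines 2-9 and assuming the "features" and
"labels" dataset keys created by the HDF5DatasetWriter, might look like this:

# load the label encoder and open the HDF5 dataset for reading
le = pickle.loads(open(conf["label_encoder_path"], "rb").read())
db = h5py.File(conf["features_path"], "r")

# compute the index of the training/testing split (75% of the data is
# used for training, the remaining 25% for testing)
i = int(db["labels"].shape[0] * 0.75)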
We can now define a set of hyperparameters we want to tune and then apply the GridSearchCV
class to perform a cross-validated search across each choice of these hyperparameters:
Here we’ll be tuning the C value which controls the “strictness” of the Logistic Regression
model:
• A larger value of C is more rigid and will try to make the model make fewer mistakes on
the training data.
• A smaller value of C is less rigid and will allow for some mistakes on the training data.
The benefit of a larger value of C is that you may be able to obtain higher training accuracy.
The downside is that you may overfit your model to the training data.
Conversely, a smaller value of C may lead to lower training accuracy, but could potentially
lead to better model generalization. At the same time though, too small of a C value could
make the model effectively useless and incapable of making meaningful predictions.
The goal of the GridSearchCV is to identify which value of C will perform best on our data.
We then serialize the model to disk so we can use it in the truck_id.py script in Section
10.6:
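Sketched out, the grid search and serialization might look like the following. It reuses the db handle
and split index i from above; the grid of C values and the max_iter setting are assumptions, not the
book's exact listing:

# construct the set of C values to search over, then run a
# cross-validated grid search over the Logistic Regression model
params = {"C": [0.0001, 0.001, 0.01, 0.1, 1.0]}
model = GridSearchCV(LogisticRegression(max_iter=1000), params,
    cv=3, n_jobs=conf["n_jobs"])
model.fit(db["features"][:i], db["labels"][:i])
print("[INFO] best hyperparameters: {}".format(model.best_params_))

# evaluate the model on the testing split
acc = model.score(db["features"][i:], db["labels"][i:])
print("[INFO] accuracy: {:.2f}%".format(acc * 100))

# serialize the best model to disk for use in truck_id.py
f = open(conf["model_path"], "wb")
f.write(pickle.dumps(model.best_estimator_))
f.close()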
Let’s go ahead and train our Logistic Regression model for vehicle recognition. Open up a
terminal and execute the following command:
As you can see, we are obtaining 96% accuracy. Knowing that we now have an accurate
truck identification model, we can move on to Phase #3 where we’ll build out our computer
vision app.
We are now ready to finish implementing the vehicle recognition pipeline! In this phase we
need two machines:
• The client, presumed to be a Raspberry Pi, that acts as an IP camera, responsible only
for capturing frames and streaming them back to the host.
• The server/host, which we assume is a more powerful laptop, desktop, or GPU machine,
which accepts frames from the RPi, runs motion detection, object detection, and vehicle
recognition, and then writes the identification to disk as a video clip.
Our client script, running on the Raspberry Pi, is responsible for capturing frames from the
video stream and then sending the frames to the server/host for processing via ImageZMQ.
The client.py script is identical to Chapter 4 but we’ll review it here as a matter of complete-
ness:
Line 2 imports our VideoStream which we’ll use to access our video stream, whether that
be a USB camera or a Raspberry Pi camera module. The imagezmq import on Line 3 is used
to interface with ImageZMQ for sending frames across the wire.
Lines 9-12 parse our command line arguments. The only argument we need is --server-
ip, the IP address of the server to which the client will connect. Lines 16 and 17 initialize the
sender used to send frames via ImageZMQ.
Line 23 then accesses our video stream. Lines 27-30 loop over frames from the camera
and then send them to the server for processing.
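For reference, a minimal sketch of the client script described above is shown below. The use of
usePiCamera=True and port 5555 (ImageZMQ's default) are assumptions; swap in src=0 if you are
using a USB webcam:

# import the necessary packages
from imutils.video import VideoStream
import imagezmq
import argparse
import socket
import time

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-s", "--server-ip", required=True,
    help="ip address of the server to which the client will connect")
args = vars(ap.parse_args())

# initialize the ImageSender object with the socket address of the server
sender = imagezmq.ImageSender(connect_to="tcp://{}:5555".format(
    args["server_ip"]))

# grab the hostname of the RPi and start the video stream
rpiName = socket.gethostname()
vs = VideoStream(usePiCamera=True).start()
time.sleep(2.0)

# loop over frames and send them to the server for processing
while True:
    frame = vs.read()
    sender.send_image(rpiName, frame)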
The server script is considerably more complex than the client script as it’s responsible for:
i. Accepting incoming frames from the RPi client via ImageZMQ
ii. Performing motion detection to determine when a vehicle may be present
iii. Applying the YOLO object detector to locate any vehicles in the frame
iv. Extracting features from the vehicle ROI using ResNet
v. Using our trained vehicle recognition model to actually identify the vehicle
It is technically possible to run all of these operations on the RPi, but you would need a
dedicated coprocessor such as a Movidius NCS or Google Coral USB Accelerator to handle
the object detection and feature extraction components.
Lines 2-15 import our required Python packages. You’ll want to take note that the ResNet50
model is being imported — that is the same model we used for Section 10.5.1 on feature ex-
traction. We’ll need to use ResNet to quantify the vehicle ROIs prior to passing them through
our Logistic Regression model for classification.
Lines 18-21 parse our command line arguments. Just as in the previous scripts in this chapter,
we only need to supply --conf, the path to our configuration file.
We can then load the configuration file, initialize the KeyClipWriter, and connect to Drop-
box (if necessary):
Take note of Line 29 where we initialize consecFrames — this variable is used to count the
total number of consecutive frames that have not contained any action. Once consecFrames
reaches a certain threshold we’ll know to stop recording the video clip.
40 # load the COCO class labels our YOLO model was trained on
41 labelsPath = os.path.sep.join([conf["yolo_path"], "coco.names"])
42 LABELS = open(labelsPath).read().strip().split("\n")
43
44 # load the truck ID model and label encoder
45 print("[INFO] loading truck ID model...")
46 truckModel = pickle.loads(open(conf["model_path"], "rb").read())
47 le = pickle.loads(open(conf["label_encoder_path"], "rb").read())
48
49 # load the ResNet50 network
50 print("[INFO] loading ResNet...")
51 model = ResNet50(weights="imagenet", include_top=False)
52
53 # derive the paths to the YOLO weights and model configuration
54 weightsPath = os.path.sep.join([conf["yolo_path"], "yolov3.weights"])
55 configPath = os.path.sep.join([conf["yolo_path"], "yolov3.cfg"])
56
57 # load our YOLO object detector and determine only the *output*
58 # layer names that we need from YOLO
59 print("[INFO] loading YOLO from disk...")
We have one final code block of initializations before we can start looping over frames from
ImageZMQ:
64 # initialize the ImageZMQ image hub along with the frame dimensions
65 # from our input video stream
66 imageHub = imagezmq.ImageHub()
67 (W, H) = (None, None)
68
69 # initialize the MOG foreground background subtractor and
70 # morphological kernels
71 fgbg = cv2.bgsegm.createBackgroundSubtractorMOG(
72 history=conf["mog_history"], nmixtures=conf["mog_nmixtures"],
73 backgroundRatio=conf["mog_bg_ratio"],
74 noiseSigma=conf["mog_noise_sigma"])
75 eKernel = np.ones((3, 3), "uint8")
76 dKernel = np.ones((5, 5), "uint8")
77
78 # initialize the motion status flag
79 motion = False
80
81 # initialize the label, filename, and path
82 label = None
83 filename = None
84 localPath = None
Line 66 initializes the ImageZMQ imageHub while Line 67 initializes the spatial dimensions
of our input frame (which we’ll populate once we’re inside the while loop).
Lines 71-74 initialize the MOG background subtractor used to detect motion/foreground in
the input frames. We also initialize two kernels, one for erosion (eKernel) and another for
dilation (dKernel) — we’ll be using these kernels to cleanup our foreground segmentation.
Line 79 initializes motion, a flag used to indicate whether or not motion has occurred in
the frame.
Finally, Lines 82-84 initialize variables used to save our output vehicle identification clips to
disk.
We can now enter the body of the while loop used to process frames from the RPi client:
Line 89 reads the next frame from the ImageZMQ video stream. If we haven’t already
grabbed the dimensions of the frame, we do so on Lines 93 and 94.
We then apply the MOG background subtractor on Line 98 to obtain the foreground/back-
ground mask. We apply a series of erosions and dilations to cleanup the mask (Lines 102 and
103) and then apply contour detection (Lines 106-108).
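A sketch of the start of that loop is shown below; the erosion/dilation iteration counts and the use of
imutils.grab_contours (assuming imutils is imported) are assumptions rather than the book's exact
listing:

# loop over frames sent from the RPi client
while True:
    # receive the frame from the client and acknowledge receipt
    (rpiName, frame) = imageHub.recv_image()
    imageHub.send_reply(b"OK")

    # grab the frame dimensions if we do not already have them
    if W is None or H is None:
        (H, W) = frame.shape[:2]

    # apply the MOG background subtractor to obtain the foreground
    # mask, then clean it up with a series of erosions and dilations
    mask = fgbg.apply(frame)
    mask = cv2.erode(mask, eKernel, iterations=2)
    mask = cv2.dilate(mask, dKernel, iterations=2)

    # find contours in the foreground mask
    cnts = cv2.findContours(mask.copy(), cv2.RETR_EXTERNAL,
        cv2.CHAIN_APPROX_SIMPLE)
    cnts = imutils.grab_contours(cnts)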
The following code block processes the contours to determine if motion has taken place:
119 area = w * h
120 minArea = (W * H) / conf["min_area_divisor"]
121 maxArea = (W * H) / conf["max_area_divisor"]
122
123 # check if the motion area is within range, and if so,
124 # indicate that motion was found
125 if area > minArea and area < maxArea:
126 motion = True
127
128 # otherwise there is no motion
129 else:
130 motion = False
Line 111 ensures that at least one contour was found. Provided we have at least one
contour, we find the largest one (Line 114) and compute its bounding box dimensions (Line
115).
Lines 119-121 compute the area of the largest bounding box and determine its relative
area (compared to the frame dimensions). Provided that the size of the motion area is greater
than the minArea and smaller than the maxArea, we set motion equal to True. Otherwise,
motion is set to False. We perform this check to help reduce false-positive detections.
Now that motion has been appropriately set, we should check to see if (1) motion was
detected or (2) we are already recording:
132 # check to see if (1) there was motion detected or (2) we are
133 # already recording (just in case the delivery truck stops and
134 # delivers to our house)
135 if motion or kcw.recording:
136 # construct a blob from the input frame and then perform a
137 # forward pass of the YOLO object detector, giving us our
138 # bounding boxes and associated probabilities
139 blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
140 swapRB=True, crop=False)
141 net.setInput(blob)
142 layerOutputs = net.forward(ln)
143
144 # otherwise, there is no motion and we are not recording, so
145 # indicate there were no outputs from YOLO (since we didn't have
146 # to run it)
147 else:
148 layerOutputs = []
Provided that motion was found or we are already recording, we need to apply YOLO object
detection to find the vehicle in the frame (Lines 139-142). Otherwise, if there is no motion
and we are not recording, we can simply set layerOutputs to an empty list (implying that no
vehicles needed to be detected).
The following code block is identical to the YOLO output processing from Section 10.4.2:
Here we are looping over the outputs of the YOLO model, filtering out low probability detec-
tions, computing the bounding box coordinates of the object, and then populating our boxes,
confidences, and classIDs lists, respectively.
The next code block is also near identical to the YOLO processing in Section 10.4.2:
192
193 # ensure at least one detection exists
194 if len(idxs) > 0:
195 # loop over the indexes we are keeping
196 for i in idxs.flatten():
197 # extract the bounding box coordinates
198 (x, y) = (boxes[i][0], boxes[i][1])
199 (w, h) = (boxes[i][2], boxes[i][3])
200
201 # if the label is not one we are interested in, then
202 # ignore it
203 if LABELS[classIDs[i]] not in conf["classes"]:
204 continue
Lines 190 and 191 apply NMS to suppress weak, overlapping bounding boxes.
We then start looping over all kept idxs on Line 196, extracting the bounding box coordi-
nates of the current object on Lines 198 and 199. We also check to ensure that the current
object is either a “truck” or “bus”.
Lines 208-211 truncate any bounding box coordinates that may fall outside the range of
the image. We use these coordinates to perform NumPy array slicing to extract the roi of the
vehicle (Line 214).
This roi is then converted from BGR to RGB channel ordering and resized to 224x224
pixels, followed by preprocessing the roi via mean subtraction (Lines 218-225). You’ll notice
that these are the exact preprocessing steps we performed in Section 10.5.1 when performing
feature extraction on our original vehicle dataset.
Line 229 passes the roi through ResNet50 (without the FC layer head), yielding our
feature vector. This feature vector is then supplied to the truckModel (i.e., the trained Logistic
Regression model) which gives us the final vehicle identification (Lines 234 and 235).
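A sketch of that feature extraction and classification step, assuming roi has already been
preprocessed into a (1, 224, 224, 3) blob as described above:

# pass the ROI through ResNet to obtain the feature vector, then
# flatten it into the 100,352-d representation used during training
features = model.predict(roi)
features = features.reshape((features.shape[0], 100352))

# classify the feature vector with the Logistic Regression model and
# look up the human-readable class label
preds = truckModel.predict(features)
label = le.classes_[preds[0]]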
We can then draw the bounding box rectangle and label for the vehicle on the frame:
Line 245 checks to see if we are not already recording a video clip; if we aren't, we start
recording (Lines 248-254).
Our next code block handles the case when we are both (1) currently recording and (2) we
have reached a consecFrames number of frames without motion (indicating the vehicle has
passed out of the view of our camera):
Line 264 finishes up the recording of the key event clip. We then check to see if the video
clip should be uploaded to Dropbox on Lines 267-278. We wrap up the video recording by
resetting the filename, localPath, and consecFrames count for the next time a vehicle enters
the scene.
The final step is to update our KeyClipWriter with the current frame, update the consecFrames
count (if necessary), and display the output frame to our screen:
302 if kcw.recording:
303 kcw.finish()
Figure 10.4: Montage of vehicle recognition results using a tiered deep learning model approach.
Each vehicle is correctly recognized.
With our truck_id.py file complete, let’s now put vehicle recognition to the test!
Make sure you have ZMQ installed, then open up a terminal and execute the following
command to start the server on your laptop, desktop, or host machine:
Figure 10.5: Montage of vehicle recognition results using a tiered deep learning model approach.
The vehicle on the bottom-left is incorrectly recognized as "usps" but then correctly recognized as
"ignore" on the bottom-right.
Then, open up a new terminal, and then launch the Raspberry Pi client script:
Once executed, your Raspberry Pi will act as a “network camera” and start streaming from
the RPi to the server. The server will then process the frames, apply object detection, locate
any vehicles, identify them, and then write the results to disk as a video clip.
Figure 10.6: Montage of vehicle recognition results using a tiered deep learning model approach.
On the bottom-right, the image experiences "washout" from the sun aiming into the camera sensor.
Examples of our vehicle recognition results can be seen in Figures 10.4-10.6. While our
system performs well, referring to the figures and their captions, you can see that our system is
not 100% reliable. On some occasions, the truck is incorrectly recognized. On other occasions,
camera sensor washout causes the vehicle to not be recognized at all. The washout problem
is not the deep learning recognizer’s fault — perhaps the image quality would be better with a
polarizing lens filter.
To improve accuracy, I would suggest capturing more examples of trucks to include in the
"ignore" folder for the training set and balancing the dataset as needed.
When building your own vehicle recognition system I suggest you follow the recipe in this
chapter. Additionally, you may be able to run the entire vehicle detection, feature extraction,
and identification models on the Raspberry Pi itself provided that you use a coprocessor such
as the Movidius NCS or Google Coral USB Accelerator to offload the object detector and ideally
feature extractor (the Logistic Regression model can easily run on the CPU). This would also
be a case where you definitely need at least a Raspberry Pi 4B.
10.8 Summary
In this chapter you learned how to build a vehicle recognition system using computer vision
and deep learning.
We framed our vehicle recognition project as delivery truck identification, capable of log-
ging when delivery trucks stop at your house. Using such a system you can monitor your home
for deliveries (and ideally help reduce the risk of package theft).
We broke the project into three phases:
i. Phase #1: Building our dataset of vehicles and extracting them from images
ii. Phase #2: Utilizing transfer learning to build the vehicle recognition model
iii. Phase #3: Applying vehicle recognition via the RPi and ImageZMQ
You should use this project as a starting point when developing your own vehicle identifica-
tion projects.
Finally, if you are interested in learning more about vehicle recognition, including how to
recognize the make (ex., “Tesla”) and model (ex., “Model-S”) of vehicle, be sure to refer to Deep
Learning for Computer Vision with Python (http://pyimg.co/dl4cv) [50] which can recognize
make/model with over 96% accuracy.
Chapter 11
What is the Movidius NCS
One of the biggest challenges of developing computer vision and deep learning applications
on the Raspberry Pi is trying to balance speed with computational complexity.
The Raspberry Pi, by definition, is an embedded device with limited computational re-
sources; however, we know that computationally intensive deep learning algorithms are re-
quired in order to build robust, highly accurate computer vision systems.
Trying to balance speed with computational complexity is like trying to push together identi-
cal poles of a magnet — they repel instead of attract.
Are we limited to having to choose between speed and computational complexity on em-
bedded devices?
But just as deep learning has enabled software to obtain unprecedented accuracy on com-
puter vision tasks, hardware is now enabling better, more efficient deep learning, creating a
feedback loop where one empowers the other.
Intel’s Movidius Neural Compute Stick (NCS) is one of the first pieces of hardware to bring
real-time computer vision and deep learning to the Raspberry Pi. The NCS is actually a co-
processor — a USB stick that plugs into your embedded device.
When you’re ready to apply deep learning to the NCS, you simply load an NCS-optimized
model into your software, and pass data through the network. The onboard co-processor
handles the computation in an efficient manner, enabling you to obtain faster performance
than using your CPU alone.
In this chapter, as well as the following chapters, you’ll discover how to use the Movidius
NCS in your own embedded deep learning applications, ensuring you can have your cake and
eat it too.
i. The Intel Movidius Neural Compute Stick including its capabilities, initial rollout, and
promising technology.
Figure 11.1: The Intel Movidius Neural Compute Stick is a USB-based deep learning coprocessor.
It is designed around the Myriad processor and is geared for single board computers like the
Raspberry Pi.
Marketed as the “Vision Processing Unit (VPU) with a dedicated Neural Compute Engine”
(NCE), the Intel Movidius Neural Compute Stick (NCS) is a USB stick which is optimized for
deep learning inference.
The Movidius company (founded in 2005) designed the Myriad X processor with the VPU/
NCE capability and Intel quickly bought the technology in September 2016 [9]. Intel/Movid-
ius has brought the technology to many embedded camera devices, drones such as the DJI
Tello, and the hobbyist community in the form of a USB stick that pairs well with Single Board
Computers (SBCs) like the Raspberry Pi.
According to Tractica market research, “the total revenue of the AI-related deep learning
chip market is forecast to rise from $500M USD in 2016 to $12.2B USD in 2025” [63]. Of
course 2016 was a few years ago and now we have competing products such as the Google
Coral Tensor Processing Unit (TPU) which is covered in the Complete Bundle.
In the remainder of this chapter, we’ll learn about what the Movidius NCS is capable of.
From there, we’ll learn about the Movidius NCS product’s history and promising future with
the advent of the OpenVINO library.
The NCS is small, USB-based, and draws only about 2.5W, compared to a 300W NVIDIA
K80 GPU. A full-blown GPU like the K80 also requires an even bigger Power Supply Unit
(PSU) to power the motherboard, additional GPUs, and other peripherals.
The NCS is designed to work in tandem with your Single Board Computer (SBC) CPU while
being very efficient for deep learning tasks.
Generally, it cannot be used for training the way a GPU can — rather, the NCS is
designed for deep learning inference.
The Raspberry Pi makes the perfect companion for the NCS — the deep learning stick
allows for much faster inference and allows the RPi CPU to work on other tasks.
The NCS is not designed to be used for training a deep learning model.
You’ll find yourself using a full sized computer with a more powerful setup to train a deep
learning model. This could be a laptop with CPU for a small model. Or it could be a full sized
deep learning rig desktop outfitted with lots of memory, powerful CPU(s), and powerful GPU(s).
From there you’d transfer your trained model to the target device (i.e. the Raspberry Pi with the
Movidius plugged in).
The Movidius NCS is capable of deep learning classification, object detection, and feature
extraction. Deep learning segmentation is also possible.
There are a selection of Movidius NCS compatible models available at the following links:
If you have the option, I highly recommend working with pretrained models that you find in
this book and in online examples.
You should exhaust all possible pretrained models before you embark on training and de-
ploying your own model to the Movidius NCS.
Thus far, the Movidius best supports Caffe and TensorFlow models. Unfortunately, Keras
models are not yet supported, although contacts on the product team at Intel tell us that
Keras support is highly requested and on their roadmap.
As with the advent of any new product, the Movidius NCS product had a tough start, but now
that the product is backed by Intel, it is set up for success. In this section we’ll briefly discuss
the history of the NCS and how we got to where we are today.
The NCS launched in 2017 and PyImageSearch was quick to get our hands on one to write
two blog posts about the shiny blue USB stick [66, 67].
The blog posts were a huge hit — they exceeded traffic expectations. It was clear that the
Raspberry Pi community had both (1) the need for such a product, and (2) high expectations
for the product.
Movidius/Intel certainly brought their product to the market at the right time to ride the deep
learning wave.
The NCS was launched with the Neural Compute Software Development Kit (NCSDK) Appli-
cation Program Interface version 1 (APIv1). The APIv1 left a lot to be desired, but it functioned
well enough to convince the community of the speed and promise of Myriad and the NCS.
Part of the SDK included a challenging, hard-to-use tool for converting deep learning models
into “graph files.” These graph files were required in order to use your own models with the
Movidius NCS. The tool was only capable of handling Caffe and TensorFlow models. Furthermore,
not all CNN architectures were supported, which frustrated many deep learning practitioners.
We all knew that Intel likely had something in development to address both the APIv1 and
graph file tool.
The question was: When will working with Movidius become easier?
Some of the issues were fixed with the release of APIv2 in mid-2018, which brought virtual
environment and Docker support (alleviating the need for an Ubuntu VM to convert models to
graph files).
Some were quick to port their code and others continued to watch from the sidelines.
The Raspberry Pi communities online persevered by posting examples on GitHub and blogs
highlighting the breakthroughs, limitations, and wishes.
Figure 11.2: Transitioning from the NCSDK to the OpenVINO Toolkit. OpenVINO is far easier to
use and work with than the NCSDK. Intel recommends using OpenVINO instead of the NCSDK
for all projects [68].
OpenVINO presents a much simpler API and makes the Movidius a lot easier to work with.
OpenVINO is essentially an Intel device-optimized version of OpenCV supporting deep learn-
ing on a range of Intel hardware (GPUs, CPUs, Coprocessors, FPGAs, and more).
Deep learning inference with the Movidius now requires little to no code changes.
Simply set the target processor (either the Raspberry Pi CPU or the Myriad NCS processor)
and the rest of the code is mostly the same.
OpenVINO supports the NCS1 (with no firmware upgrade) and the newer, faster NCS2
(announced on November 14th 2018 [9]).
The Raspberry Pi community was thrilled about OpenVINO and its long term viability. No
longer were we to rely on a challenging, non-Pythonesque API. In fact, the APIv1 and APIv2
were laid to rest in mid-2019 — they are no longer supported by Intel, but there are some
legacy models in the ModelZoo and the code is out there if you need it: http://pyimg.co/dc8ck
[64].
Figure 11.3: The Raspberry Pi 4B has USB 3.0 capability, unlocking the full potential of the Intel
Movidius NCS2 deep learning stick.
Shortly after OpenVINO and the NCS2 were released, the Raspberry Pi 4B line was re-
leased with highly anticipated USB 3.0 support.
The NCS has supported USB 3.0 since the beginning, but the Raspberry Pi hardware (3B
and 3B+) was stuck on USB 2.0. The lack of USB 3.0 severely limited overall inference speed
when using a Raspberry Pi. Some users switched to alternative hardware (i.e., non-RPi SBCs).
Apart from the Raspberry Pi 4B being 2x faster altogether, the time it takes to perform
inference on a frame is also reduced as we can transfer the data back and forth much more
quickly using USB 3.0.
The Movidius was the first product in its class to make a significant dent in the marketplace.
The chip is present in many cameras and devices.
Of course there are competing chips, but are there competing USB-based devices?
The main competitor is the Google Coral Tensor Processing Unit (TPU) with the benefit
being that it is backed by the Google behemoth. The Google Coral is covered in the Complete
Bundle of this text.
As computer vision tasks move to IoT/Edge devices, it will not be uncommon for new prod-
ucts to enter the marketplace. Our recommendation is not to jump on new products too
quickly. If you do find yourself evaluating a new coprocessor product, be sure to conduct a
suite of benchmarks (refer to Section 23.3 of the Hobbyist Bundle) across a variety of CNN
architectures. Always put the device through the wringer by processing video files for extended
periods of time so that the device warms up and the processor really gets a good workout.
11.6 Summary
When building deep learning applications on the Raspberry Pi, you’ll find yourself trying to
balance speed with computational complexity. The problem is, by nature, deep learning
models are incredibly computationally hungry, and the underpowered CPU on your embedded
device won’t be able to keep up with its appetite.
The answer is to turn to a co-processor device such as Intel’s Movidius NCS or the Google
Coral TPU USB Accelerator. These devices are essentially hardware-optimized USB sticks that
plug into your RPi and then handle the heavy lifting of deep learning inference on your device.
In the next few chapters in the text you’ll learn how to utilize the Movidius NCS to give your
deep learning applications a much needed speedup on the RPi.
Chapter 12
Image Classification with the Movidius NCS
As discussed in Chapter 11, the Intel Movidius NCS is a deep learning coprocessor designed
for Single Board Computers (SBCs) like the Raspberry Pi.
While the Intel Movidius NCS is not a fully-fledged GPU, it packs a punch that might be
just what you’re looking for to gain a few FPS in your Raspberry Pi classification (or object
detection) project.
In this chapter we’ll learn how to deploy pretrained classification models to your RPi with
the Intel Movidius NCS and OpenVINO.
If you want to train your own model for the NCS, be sure to refer to the RPi4CV Complete
Bundle.
First, we’ll review the project structure. From there, we’ll review our configuration file which
makes working with our CPU and Movidius classification scripts easy.
We’ll then dive into our Movidius image classification script. We’ll review the differences between
the CPU and Movidius image classification scripts — they are nearly identical, a testament to
how much work Intel has put into making the OpenVINO API seamless and easy to use.
Finally, we’ll compare results and calculate the Movidius classification speedup versus the
CPU for SqueezeNet and GoogLeNet.
Let’s begin!
|-- config
| |-- config.json
|-- images
| |-- beer.png
| |-- brown_bear.png
| |-- dog_beagle.png
| |-- keyboard.png
| |-- monitor.png
| |-- space_shuttle.png
| |-- steamed_crab.png
|-- models
| |-- googlenetv4
| | |-- caffe
| | | |-- googlenet-v4.caffemodel
| | | |-- googlenet-v4.prototxt
| | |-- ir
| | | |-- googlenet-v4.bin
| | | |-- googlenet-v4.mapping
| | | |-- googlenet-v4.xml
| |-- squeezenet
| | |-- caffe
| | | |-- squeezenet1.1.caffemodel
| | | |-- squeezenet1.1.prototxt
| | |-- ir
| | | |-- squeezenet1.1.bin
| | | |-- squeezenet1.1.mapping
| | | |-- squeezenet1.1.xml
|-- pyimagesearch
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
|-- imagenet_class_index.json
|-- imagenet_classes.pickle
|-- classify_cpu.py
|-- classify_movidius.py
Our scripts for this project are nearly identical. We will review classify_movidius.py
in detail. The classify_cpu.py file will not be reviewed in its entirety. Rather, we will only
review specific lines where the changes are. As you follow along, you should open both scripts
side-by-side on your computer screen. Both scripts include timestamps such that we can
benchmark the Movidius NCS vs. the CPU.
Our project settings are organized in a JSON file for convenience, eliminating the need for a
metric ton of command line arguments or separate scripts for GoogLeNet and SqueezeNet.
1 {
2 // paths to the model files
3 "model_paths": {
4 "ir": {
5 "squeezenet": {
6 "xml": "models/squeezenet/ir/squeezenet1.1.xml",
7 "bin": "models/squeezenet/ir/squeezenet1.1.bin"
8 },
9 "googlenet": {
10 "xml": "models/googlenetv4/ir/googlenet-v4.xml",
11 "bin": "models/googlenetv4/ir/googlenet-v4.bin"
12 }
13 },
14 "caffe": {
15 "squeezenet": {
16 "prototxt": "models/squeezenet/caffe/squeezenet1.1.prototxt",
17 "caffemodel": "models/squeezenet/caffe/squeezenet1.1.caffemodel"
18 },
19 "googlenet": {
20 "prototxt": "models/googlenetv4/caffe/googlenet-v4.prototxt",
21 "caffemodel": "models/googlenetv4/caffe/googlenet-v4.caffemodel"
22 }
23 }
24 },
Our "model_paths" are organized by "ir" (Movidius NCS graph files) and "caffe"
(Caffe files for the CPU). The pre-trained Movidius-compatible files came from OpenVINO/
OpenCV’s ModelZoo (http://pyimg.co/dc8ck [64]). The pre-trained CPU-compatible files came
from Intel’s official pre-trained model page (http://pyimg.co/78cqs [65]). These files are included
in the project folder so that you don’t have to go searching for them.
28 "squeezenet":
29 {
30 "input_size": [227, 227],
31 "mean": [104.0, 117.0, 123.0],
32 "scale": 1.0
33 },
34 "googlenet":
35 {
36 "input_size": [299, 299],
37 "mean": 128.0,
38 "scale": 128.0
39 }
40 },
Preprocessing settings are organized in Lines 27-40 under "preprocess". Both SqueezeNet
and GoogLeNet require different input dimensions ("input_size"), mean subtraction ("mean"),
and scaling ("scale").
Our CPU script performs both mean subtraction and scaling, but the Movidius NCS script
does not since the preprocessing steps are baked into the .bin and .xml files (the "ir"
paths above).
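Since standard JSON does not support comments, our Conf class (in pyimagesearch/utils/conf.py) has to strip them before parsing. Below is a minimal sketch of what such a comment-tolerant loader could look like, assuming the json_minify package is used to remove the // comments; it is an illustration, not the book's exact implementation:

import json
from json_minify import json_minify

class Conf:
    def __init__(self, confPath):
        # strip the // comments, then parse the remaining JSON into the
        # object's internal dictionary
        conf = json.loads(json_minify(open(confPath).read()))
        self.__dict__.update(conf)

    def __getitem__(self, k):
        # allow dictionary-style access, e.g. conf["model_paths"]
        return self.__dict__.get(k, None)

With a loader like this in place, conf["preprocess"][args["model"]] hands back the input size, mean, and scale for whichever model was requested.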
Finally, our ImageNet dataset class labels are specified in the .pickle file on Line 43:
If you’ve never used the old Movidius NCS APIs (APIv1 and APIv2), you really dodged a bullet.
Intel’s newer OpenVINO-compatible release of OpenCV now takes care of a lot of the hard
work for us.
As we’ll see in this section, as well as the following one, the scripts have very minor differ-
ences between deploying a model for the Movidius NCS or CPU to perform inference.
Go ahead and open a new file named classify_movidius.py and insert the following
code:
6 import argparse
7 import pickle
8 import time
9 import cv2
Lines 2-9 import our packages. Our openvino imports are on Lines 2 and 3.
Our Movidius NCS classification script requires three command line arguments:
• --conf: The path to the configuration file we reviewed in the previous section.
• --image: The path to the input image we wish to classify.
• --model: Which pretrained model to use, either squeezenet or googlenet.
When these arguments are provided via the terminal, the script will handle loading the
image, model, and configuration so that we can conduct classification with the Movidius NCS.
Then, Line 26 loads the ImageNet classes using the "labels_path" (path to the
.pickle file) contained within the conf dictionary.
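For reference, a hedged sketch of this setup stage is shown below. The argument names follow the bullets above and "labels_path" comes from the configuration file; the variable names and the import path for the Conf class are assumptions:

import argparse
import pickle
from pyimagesearch.utils import Conf

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-c", "--conf", required=True,
    help="path to the JSON configuration file")
ap.add_argument("-i", "--image", required=True,
    help="path to the input image we want to classify")
ap.add_argument("-m", "--model", required=True,
    choices=["squeezenet", "googlenet"],
    help="which pretrained model to use")
args = vars(ap.parse_args())

# load the configuration file and the ImageNet class labels
conf = Conf(args["conf"])
classes = pickle.loads(open(conf["labels_path"], "rb").read())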
Now we’ll setup the Intel Movidius NCS with our pretrained model:
Line 29 initializes the OpenVINO plugin. The "MYRIAD" device is the Movidius NCS pro-
cessor.
Lines 33-35 load the CNN. The model and weights file paths are provided directly from
the config file. Either SqueezeNet or GoogLeNet is loaded via the --model switch. Here we
are grabbing the "ir" paths which include .xml and .bin files.
Line 45 then grabs the batch size (n), number of channels (c), and the required input blob
dimensions (h and w).
From there, Line 49 goes ahead and loads the net onto the Movidius NCS. This step
only needs to be completed once, but it can take some time. If you were performing
classification on a video stream, rest assured that from here, inference will be as fast as it can
be.
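A condensed sketch of this OpenVINO setup is shown below. It assumes the legacy openvino.inference_engine Python API (IEPlugin/IENetwork) that shipped with OpenVINO releases from this era; newer releases expose IECore instead, and the book's own listing may differ:

from openvino.inference_engine import IENetwork, IEPlugin

# initialize the plugin for the "MYRIAD" device (the processor on the NCS)
plugin = IEPlugin(device="MYRIAD")

# read the IR topology (.xml) and weights (.bin) paths from our config
net = IENetwork(
    model=conf["model_paths"]["ir"][args["model"]]["xml"],
    weights=conf["model_paths"]["ir"][args["model"]]["bin"])

# grab the input blob name and its required shape (n, c, h, w)
input_blob = next(iter(net.inputs))
(n, c, h, w) = net.inputs[input_blob].shape

# load the network onto the NCS, which only needs to happen once
exec_net = plugin.load(network=net, num_requests=1)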
51 # load the input image and resize input frame to network size
52 orig = cv2.imread(args["image"])
53 frame = cv2.resize(orig, (w, h))
54
55 # change data layout from HxWxC to CxHxW
56 frame = frame.transpose((2, 0, 1))
57 frame = frame.reshape((n, c, h, w))
Lines 52-57 load our image from the path provided in --image and resize/reshape it.
Mean subtraction and image scaling preprocessing is baked into the .bin and .xml files
(the "ir" paths in our config). Be sure to refer to OpenVINO’s Optimization Guide (http://pyimg
.co/hyleq, [70]) and search the page for “image mean/scale parameters”.
We’re now ready to perform classification inference with the Movidius NCS:
Timestamps are taken before and after inference — the elapsed inference time is printed
via Lines 65 and 66 for benchmarking purposes.
We then loop over the sorted prediction indexes, idxs (Line 73). Inside the loop, we grab the
classification probability (proba). The results have different shapes depending on whether we used
the squeezenet (Lines 77 and 78) or googlenet (Lines 82 and 83) model.
From here we will (1) annotate our output image with the top result, and (2) print the top-five
results to the terminal:
Lines 86-90 annotate our orig input image with the top classification result and probability.
Lines 94 and 95 print out the top-five classification results and probabilities as the loop
iterates.
Finally, the freshly annotated image is displayed on screen until a key is pressed (Lines 98
and 99).
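Putting those pieces together, a hedged sketch of the inference and postprocessing stage, continuing the hypothetical variable names from the sketches above, could look like this:

import time
import numpy as np
import cv2

# perform inference on the NCS and time it for benchmarking
start = time.time()
results = exec_net.infer(inputs={input_blob: frame})
end = time.time()
print("[INFO] classification took {:.4f} seconds...".format(end - start))

# flatten the output probabilities and grab the top-5 class indexes
probs = list(results.values())[0].flatten()
idxs = np.argsort(probs)[::-1][:5]

# annotate the top prediction and print the top-5 results
for (i, idx) in enumerate(idxs):
    if i == 0:
        text = "{}: {:.2f}%".format(classes[idx], probs[idx] * 100)
        cv2.putText(orig, text, (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
            0.8, (0, 0, 255), 2)
    print("[INFO] {}. {}: {:.4f}".format(i + 1, classes[idx], probs[idx]))

# display the annotated image until a key is pressed
cv2.imshow("Image", orig)
cv2.waitKey(0)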
Raspberry Pi image classification with the CPU is nearly identical for this project. In this
section, we will inspect only the minor differences, paying attention to the line numbers since
identical blocks will not be reviewed:
6 import time
7 import cv2
Lines 2-9 import our packages — for CPU-based classification, we do not need openvino
imports.
26 # load our serialized model from disk, set the preferable backend and
27 # set the preferable target
28 print("[INFO] loading model...")
29 net = cv2.dnn.readNetFromCaffe(
30 conf["model_paths"]["caffe"][args["model"]]["prototxt"],
31 conf["model_paths"]["caffe"][args["model"]]["caffemodel"])
32 net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
33 net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
For classification with the CPU, we use OpenCV’s standard dnn module (Lines 29-31).
Notice that we are grabbing the "caffe" paths (the "prototxt" and "caffemodel").
This is as opposed to our Movidius NCS "ir" paths (the "xml" model and "bin" weights).
Additionally, take note of the backend and target (Lines 32 and 33). We are telling the
OpenVINO implementation of OpenCV that we will use the OpenCV backend and CPU (rather
than the Myriad processor on the Movidius NCS).
Preprocessing comes next and it is slightly different than our Movidius script:
Lines 38 and 39 set up our mean subtraction values and pixel scale. We didn’t need to
perform mean subtraction or pixel scaling previously as those steps are baked into the .bin
and .xml files (generated using the OpenVINO optimizer). Here, however, we are using
the original Caffe model and hence need to perform these preprocessing steps manually.
The orig image is loaded and the frame is resized (Lines 42 and 43) prior to us creating
a blob for classification inference:
45 # convert the frame to a blob and perform a forward pass through the
46 # model and obtain the classification result
47 blob = cv2.dnn.blobFromImage(frame, scale, (W, H), mean)
48 print("[INFO] classifying image...")
49 start = time.time()
50 net.setInput(blob)
51 results = net.forward()
52 end = time.time()
53 print("[INFO] classification took {:.4f} seconds...".format(
54 end - start))
Our image is converted to a blob (Line 47) so that we can send it through the neural net.
Be sure to refer to my blog post on How OpenCV’s blobFromImage works (http://pyimg.co/c4
gws) [45].
Classification inference takes place on Lines 50 and 51. Again, timestamps are collected
and the elapsed time is computed and printed to the terminal for benchmarking purposes.
From here we process the results — Lines 56-87 are identical to our previous Movidius
script (Lines 68-99).
Now that we have implemented classification scripts for both the (1) Movidius NCS, and (2)
CPU, let’s put them to the test and compare results.
If you are using the .img that accompanies the book, you will need to use the openvino
virtual environment.
I recommend initiating a VNC or SSH (with X forwarding) session for running this example.
Remark. Remote development on the Raspberry Pi was covered in the Hobbyist Bundle of
this text, but if you need a refresher, refer to this tutorial: http://pyimg.co/tunq7 [71].
When you’re ready, open a terminal on your Raspberry Pi and run the following script
to start the OpenVINO environment:
$ source start_openvino.sh
Starting Python 3.7 with OpenCV-OpenVINO 4.1.1 bindings...
[setupvars.sh] OpenVINO environment initialized
Remark. When using the openvino virtual environment, it is important to avoid relying solely
on the workon openvino command. If you source start_openvino.sh as shown, another
Intel-provided script is also sourced in the process to set up key environment variables. I
recommend that you inspect the start script on the .img via the following command: cat
~/start_openvino.sh.
Figure 12.1: Comparing image classification on both the Raspberry Pi CPU and Movidius NCS
using the SqueezeNet CNN pretrained on ImageNet.
From there, let’s fire up the CPU-based classification script using (1) SqueezeNet, and (2)
an image of a beer glass:
As you can see in the terminal output and in Figure 12.1, CPU inference took 0.1049 sec-
onds whereas Movidius inference required only 0.0125 seconds, a speedup of 8.39X.
Figure 12.2: Comparing image classification on both the Raspberry Pi CPU and Movidius NCS
using the GoogLeNet CNN pretrained on ImageNet.
Now let’s run (1) GoogLeNet with (2) an image of a brown bear on both the CPU and
Movidius:
Figure 12.3: Comparing image classification on both the Raspberry Pi CPU and Movidius NCS
using SqueezeNet and GoogLeNet models pretrained on ImageNet.
As you can see in the terminal output and in Figure 12.2, CPU inference took 3.1376 sec-
onds whereas Movidius inference required only 0.1624 seconds, a speedup of 19.32X.
The results in Figures 12.1 and 12.2 were collected with a Raspberry Pi 4B using a Movidius
NCS2 and OpenVINO version 4.1.1. A summary of the results is shown in the table in Figure
12.3. As the results show, the Movidius NCS2 paired with a Raspberry Pi 4B is 8x faster for
SqueezeNet classification, and a whopping 19x faster for GoogLeNet classification. Results
will not be as good using a Raspberry Pi 3B+ which does not have USB 3.0.
I would highly recommend giving this Intel OpenVINO document (http://pyimg.co/hyleq, [70])
a read regarding performance comparisons if you are conducting your own benchmarking.
12.3 Summary
In this chapter, we learned how to perform image classification with the Movidius NCS using
OpenVINO’s implementation of OpenCV.
As we discovered, the CPU and Movidius classification scripts are nearly identical minus
the initial setup.
The Movidius NCS2 on the Raspberry Pi 4 is a whopping 19x faster using GoogLeNet than
using only the Raspberry Pi CPU. This makes the Movidius and OpenVINO a great companion
for your Raspberry Pi deep learning projects.
In the next chapter, we’ll perform object detection with the Movidius NCS.
Chapter 13
Object Detection with the Movidius NCS
Rather than using a background subtraction method, which leads to inaccurate counting when
people are close together, we’ll use object detection.
When this code was originally written, the Raspberry Pi 3B+ was the best RPi hardware
available, so we needed to add additional horsepower. The horsepower comes in the form of
the Intel Movidius NCS with the OpenVINO software.
The example we will cover in this chapter will be similar to my original “OpenCV People
Counter” tutorial on PyImageSearch from August 2018 (http://pyimg.co/vgak6) [72], with the
main exception being that we will dispatch the Movidius NCS for the heavy lifting. Our ex-
ample also uses our new DirectionCounter class which was not included in the blog post
example.
ii. Use the MobileNet SSD with either a (1) CPU, or (2) Myriad/Movidius processor.
iii. Count objects (people) using object detection, correlation tracking, and centroid tracking.
In this chapter, we begin by reviewing our project structure. From there, we’ll briefly review
object counting which was covered in the Hobbyist Bundle.
We’ll then dive right into object counting with OpenVINO/Movidius. In our results section,
we will benchmark the CPU vs. the Movidius NCS. Now that the RPi 4B is available with USB
3.0 support, the results are quite impressive.
|-- config
| |-- config.json
|-- mobilenet_ssd
| |-- MobileNetSSD_deploy.caffemodel
| |-- MobileNetSSD_deploy.prototxt
|-- output
| |-- output_01.avi
| |-- output_02.avi
|-- pyimagesearch
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
| |-- centroidtracker.py
| |-- directioncounter.py
| |-- trackableobject.py
|-- videos
| |-- example_01.mp4
| |-- example_02.mp4
|-- people_counter_openvino.py
Input videos for testing are included in the videos/ directory. Both example_01.mp4 and
example_02.mp4 are provided by David McDuffee.
Our output/ folder will be where we’ll store processed videos. One example output video
is included.
The mobilenet_ssd/ directory contains our pretrained Caffe-based object detector files.
The pyimagesearch module contains our Conf class for parsing JSON configs. Addi-
tionally three classes related to counting objects are included: (1) TrackableObject, (2)
CentroidTracker, and (3) DirectionCounter. We will briefly review the purpose of these
three classes later in this chapter. A full line-by-line review can be found in Chapter 19 of the
Hobbyist Bundle.
Our driver script for object detection and people counting is contained within people_coun
ter_openvino.py. This script takes advantage of all the aforementioned classes as well as
MobileNet SSD in order to count people using the Movidius coprocessor or the Raspberry Pi
CPU. The CPU option is only recommended if you are using a Raspberry Pi 4 — it is mainly
for benchmarking purposes.
Figure 13.1: An example of our people/footfall counter in action. Our algorithm detects people in
a video stream, determines the direction they are going (up/down or left/right), and then counts
them appropriately once they cross the center line.
Our Movidius NCS, object detection, and people counting implementation uses three classes
developed in Chapter 19 of the Hobbyist Bundle.
The TrackableObject class only holds/stores data and does not have any methods.
Every person in our frame is a trackable object and has an objectID, list of centroids, and
a boolean indicating if it is counted or not.
The DirectionCounter class handles determining direction and counting. As a person moves
through the camera’s field of view, we use its find_direction method to determine if they are moving up/down or
left/right. From there the count_object method checks to see if the person has crossed the
“counting line” and increments the respective counter.
If you need a full review of the three classes, be sure to refer to Chapter 19 of the Hobbyist
Bundle now before moving on.
Now let’s put the Movidius Neural Compute Stick to work using OpenVINO.
To demonstrate the power of OpenVINO on the Raspberry Pi with Movidius, we’re going
to perform real-time deep learning object detection along with people counting. The Mo-
vidius/Myriad coprocessor will perform the actual deep learning inference, reducing the load
on the Pi’s CPU. We’ll use the Raspberry Pi CPU to process the results and tell the Movidius
what to do, but we’re reserving object detection for the Myriad as its hardware is optimized and
designed for deep learning inference.
This script is especially convenient as only a single function call affects whether the Mo-
vidius NCS Myriad or CPU is used for inference. The .setPreferableTarget call tells
OpenVINO which processor to use for deep learning inference. The function call demonstrates
the power of OpenVINO in that it is really quite simple to set either the CPU or coprocessor
to handle deep learning inference; however not all OpenVINO scripts are this convenient as of
the time of this writing.
In the results section, we will benchmark both CPU and Movidius people counting on a
Raspberry Pi 4B.
Lines 2-16 import packages and modules. As you can see, we begin by importing our cus-
tom classes including DirectionCounter, CentroidTracker, TrackableObject, and
Conf.
We’ll be using multiprocessing for writing to output video files so as not to slow down
the main functionality of the script. We’ll use a process-safe Queue of frames and a Value
variable to indicate whether we should be writing to video.
We will also use the dlib correlation tracker (a new addition compared to the implementa-
tion in Chapter 19 of the Hobbyist Bundle).
The write_video function will run in an independent process so that our main thread
of execution isn’t bogged down with the time-consuming blocking operation of writing video
frames to disk.
The main process will simply append frames to the FIFO frameQueue, and the write_
video process will handle writing them to disk efficiently.
The function accepts five parameters: (1) outputPath, the filepath to the output video
file, (2) writeVideo, a flag indicating if video writing should be ongoing, (3) frameQueue,
a process-safe Queue holding our frames to be written to disk in the video file, and (4/5) the
video file dimensions.
From there, a loop starts on Line 25 — it will continue to write to the video until writeVideo
is False. The frames are written as they become available in the frameQueue. When the
video is finished, the output file pointer is released (Line 33).
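A minimal sketch of such a writer process is shown below. The parameter names follow the description above, while the codec choice and polling logic are assumptions rather than the book's exact implementation:

import cv2

def write_video(outputPath, writeVideo, frameQueue, W, H):
    # initialize the video writer for the output file
    fourcc = cv2.VideoWriter_fourcc(*"MJPG")
    writer = cv2.VideoWriter(outputPath, fourcc, 30, (W, H), True)

    # keep writing until the main process flips the flag off and the
    # queue has been fully drained
    while writeVideo.value or not frameQueue.empty():
        if not frameQueue.empty():
            writer.write(frameQueue.get())

    # release the output video file pointer
    writer.release()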
With our video writing process out of the way, let’s define our command line arguments:
• --target: The target processor for object detection, either myriad or cpu.
• --mode: The direction (either horizontal or vertical) in which people will be moving
through the frame.
• --output: The path to an optional output video file. When an output video filepath is
provided, the write_video process will come to life.
From there, we’ll parse our config and list our MobileNet SSD CLASSES:
We are only concerned with people counting, so later in our frame processing loop, we will
filter for the "person" class from the CLASSES list (Lines 55-58).
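For reference, the label list for this Caffe MobileNet SSD is the standard background-plus-PASCAL-VOC set of 21 classes, which typically looks like this:

# the 21 class labels the Caffe MobileNet SSD was trained on (background
# plus the 20 PASCAL VOC classes); we only care about "person" here
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
    "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
    "dog", "horse", "motorbike", "person", "pottedplant", "sheep",
    "sofa", "train", "tvmonitor"]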
Let’s go ahead and load our serialized object detection model from disk and set the prefer-
able processor:
Lines 62 and 63 load our Caffe model from the .prototxt and .caffemodel files (the paths
are specified in config.json).
The power of OpenVINO lies in the ability to set the target processor for inference. Lines
67-75 set the OpenVINO target processor based on the --target processor provided via
command line argument.
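A sketch of what that target selection amounts to, using OpenCV's dnn constants, is shown below; the exact structure of the if/elif block in the book's listing may differ:

# a sketch of choosing the inference processor based on --target
if args["target"] == "myriad":
    # push deep learning inference to the Myriad processor on the NCS
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

else:
    # otherwise, fall back to the OpenCV backend on the RPi CPU
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
    net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)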
Lines 91-93 initialize our writerProcess and output video frame dimensions.
Next, we initialize our CentroidTracker (Line 98), trackers to hold our dlib correlation
trackers (Line 99), and our trackableObjects ordered dictionary (Line 100).
The directionInfo is initialized (Line 105) and will later hold a dictionary of our object
counts to be annotated on the screen.
Object detection is a resource-hungry task. Therefore, we will only perform object detection
every N skip-frames. The totalFrames variable is initialized to 0 for now and it will increment
upon each iteration of our while loop. When totalFrames % conf["skip_frames"]
== 0, we will perform object detection so that we have an accurate position of the people
walking through the frame.
We begin looping on Line 112. At the top of the loop we grab the next frame (Lines 115
and 116). In the event that we’ve reached the end of the video, we’ll break out of the loop.
On the first iteration of our loop, our frame dimensions will still be None. Lines 127-131
set the frame dimensions and initialize our DirectionCounter as dc while providing the
direction mode (vertical/horizontal).
If we will be writing a processed output video to disk, Lines 134-143 initialize the frameQueue
and start the writerProcess.
145 # initialize the current status along with our list of bounding
146 # box rectangles returned by either (1) our object detector or
147 # (2) the correlation trackers
148 status = "Waiting"
149 rects = []
150
151 # check to see if we should run a more computationally expensive
152 # object detection method to aid our tracker
153 if totalFrames % conf["skip_frames"] == 0:
154 # set the status and initialize our new set of object
155 # trackers
156 status = "Detecting"
157 trackers = []
158
159 # convert the frame to a blob and pass the blob through the
Our status variable will hold one of three states:
• Waiting: We’re waiting for people to enter the frame so they can be detected and tracked.
• Detecting: We’re actively in the process of detecting people using the MobileNet SSD.
• Tracking: People are being tracked in the frame and we’re counting the totalUp and
totalDown.
Our rects list will be populated either via detection or tracking. We go ahead and initialize
rects on Line 149.
Deep learning object detectors are very computationally expensive, especially if you are
running them on your CPU (and even if you use your Movidius NCS).
To avoid running our object detector on every frame, and to speed up our tracking pipeline,
we’ll be skipping every N frames (set by the "skip_frames" value in our JSON configuration
file).
Only every N frames will we exercise our SSD for object detection. Otherwise, we’ll simply
be tracking moving objects in-between. Using the modulo operator on Line 153 we ensure that
we’ll only execute the code in the if-statement every N frames.
We then initialize our new list of dlib correlation trackers (Line 157).
Next, we’ll perform inference via object detection. We begin by creating a blob from the
frame, followed by passing the blob through the net to obtain detections (Lines 161-165).
OpenVINO will seamlessly use either the (1) Myriad in the Movidius NCS, or (2) CPU,
depending on the preferred target processor we set via command line arguments.
Now we’ll loop over each of the detections in hopes of finding objects belonging to the
“person” class:
Looping over detections on Line 168, we proceed to grab the confidence (Line 171)
and filter out weak results and those that don’t belong to the "person" class (Lines 175-182).
We can now compute a bounding box for each person and begin correlation tracking:
Subsequently, we start tracking on Line 195 and add the tracker to the trackers list on
Line 199.
Again, we will have one dlib correlation tracker per person in the frame. Correlation tracking
is computationally expensive, but since we’ve offloaded the deep learning object detection
aspect of our system to the Movidius, the RPi CPU is able to handle multiple correlation trackers
within reason even on an RPi 3B+.
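For readers unfamiliar with dlib's correlation tracker, a hedged sketch of the start-tracking step for one detected person looks roughly like the following; the rgb frame and bounding box variable names are assumptions based on the description above:

import dlib

# build a dlib rectangle from the bounding box and start tracking that
# region in the RGB frame
rect = dlib.rectangle(int(startX), int(startY), int(endX), int(endY))
tracker = dlib.correlation_tracker()
tracker.start_track(rgb, rect)

# store the tracker so it can be updated on non-detection frames
trackers.append(tracker)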
Obviously the more people that are present in the frame, the lower our FPS will be — just
keep that in mind if you are counting people over a large area.
Let’s take care of the typical operations where tracking (not object detection) is taking place
in the else block:
Most of the time, we aren’t landing on a skip-frame multiple. During these iterations of the
loop we’ll utilize our trackers to track our object rather than applying detection.
We begin looping over the available trackers on Line 206. Inside the loop, we proceed to
update the status to "Tracking" (Line 209) and grab the object position (Lines 212 and
213). From there we extract the position coordinates (Lines 216-219) followed by populating
the information in our rects list.
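A sketch of that tracking-only branch could look like this (again, variable names are assumptions that follow the description):

# update each correlation tracker and collect its current position into
# the rects list
for tracker in trackers:
    # indicate that we are in "tracking" mode rather than "detecting"
    status = "Tracking"

    # update the tracker on the current RGB frame and grab its position
    tracker.update(rgb)
    pos = tracker.get_position()

    # unpack the position object into bounding box coordinates
    startX = int(pos.left())
    startY = int(pos.top())
    endX = int(pos.right())
    endY = int(pos.bottom())

    # add the bounding box to our list of rectangles
    rects.append((startX, startY, endX, endY))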
Now let’s draw a line that people must cross in order to be tracked:
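The book's listing handles both orientations; a sketch of the idea, assuming frame dimensions W and H, is:

# draw the counting line: a horizontal line across the center of the frame
# for vertical (up/down) motion, or a vertical line for left/right motion
if args["mode"] == "vertical":
    cv2.line(frame, (0, H // 2), (W, H // 2), (0, 255, 255), 2)
else:
    cv2.line(frame, (W // 2, 0), (W // 2, H), (0, 255, 255), 2)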
Next, we’ll update the centroid tracker with our fresh object rects:
238 # use the centroid tracker to associate the (1) old object
239 # centroids with (2) the newly computed object centroids
240 objects = ct.update(rects)
241
242 # loop over the tracked objects
243 for (objectID, centroid) in objects.items():
244 # grab the trackable object via its object ID
245 to = trackableObjects.get(objectID, None)
246
247 # create a new trackable object if needed
248 if to is None:
249 to = TrackableObject(objectID, centroid)
250
251 # otherwise, there is a trackable object so we can utilize it
252 # to determine direction
253 else:
254 # find the direction and update the list of centroids
255 dc.find_direction(to, centroid)
256 to.centroids.append(centroid)
257
258 # check to see if the object has been counted or not
259 if not to.counted:
260 # find the direction of motion of the people
261 directionInfo = dc.count_object(to, centroid)
262
263 # store the trackable object in our dictionary
264 trackableObjects[objectID] = to
Our centroid tracker will associate object IDs with object locations.
Line 243 begins a loop over our objects to (1) track, (2) determine direction, and (3) count
them if they cross the counting line.
On Line 245 we attempt to fetch a TrackableObject for the current objectID. If the
trackable object doesn’t exist for this particular ID, we create one (Lines 248 and 249).
If the person has not been counted, we go ahead and count them (Lines 259-261). Behind
the scenes in the DirectionCounter class, the object will not be counted if it has yet to
cross the counting line.
Finally, we store the trackable object in our trackableObjects dictionary (Line 264) so
we can grab and update it when the next frame is captured.
From here, the final code blocks handle:
i. Annotating the frame with each object’s ID, centroid, and the count/status information.
ii. Writing frames to a video file on disk (if the --output command line argument is present).
iii. Displaying the output frame and handling keypresses.
iv. Cleanup.
266 # draw both the ID of the object and the centroid of the
267 # object on the output frame
268 text = "ID {}".format(objectID)
269 color = (0, 255, 0) if to.counted else (0, 0, 255)
270 cv2.putText(frame, text, (centroid[0] - 10, centroid[1] - 10),
271 cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
272 cv2.circle(frame, (centroid[0], centroid[1]), 4, color, -1)
273
274 # check if there is any direction info available
275 if directionInfo is not None:
276 # construct a list of information as a combination of
277 # direction info and status info
278 info = directionInfo + [("Status", status)]
279
280 # otherwise, there is no direction info available yet
281 else:
282 # construct a list of information as status info since we
283 # don't have any direction info available yet
284 info = [("Status", status)]
285
286 # loop over the info tuples and draw them on our frame
287 for (i, (k, v)) in enumerate(info):
288 text = "{}: {}".format(k, v)
289 cv2.putText(frame, text, (10, H - ((i * 20) + 20)),
290 cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)
The person is annotated with a dot and an ID number where red means not counted and
green means counted (Lines 268-272).
We build our text info via Lines 275-284. It contains the (1) object counts, and (2) status.
Lines 287-290 then annotate the corner of the frame with the text-based info.
292 # put frame into the shared queue for video writing
293 if writerProcess is not None:
294 frameQueue.put(frame)
295
296 # show the output frame
297 cv2.imshow("Frame", frame)
298 key = cv2.waitKey(1) & 0xFF
299
300 # if the `q` key was pressed, break from the loop
301 if key == ord("q"):
302 break
303
304 # increment the total number of frames processed thus far and
305 # then update the FPS counter
306 totalFrames += 1
307 fps.update()
Lines 293 and 294 put a frame in the frameQueue for the writerProcess to consume.
Lines 297-302 display the frame to the screen and capture keypresses (the q key quits
the frame processing loop).
Our totalFrames counter is incremented and our fps counter is updated (Lines 306 and
307). The totalFrames are used in our modulo operation to check to see if we should skip
object detection (i.e. "skip-frames").
321 vs.stop()
322
323 # otherwise, release the video file pointer
324 else:
325 vs.release()
326
327 # close any open windows
328 cv2.destroyAllWindows()
Lines 315-317 stop the video writerProcess — our output video will be ready for us in
the output/ folder.
Let’s proceed to count people via object detection, correlation tracking, and centroid tracking
using both the (1) CPU, and (2) Myriad/Movidius processors.
If you are using the .img provided with this book, fire up the openvino virtual environment:
$ source ~/start_openvino.sh
Starting Python 3.7 with OpenCV-OpenVINO 4.1.1 bindings...
[setupvars.sh] OpenVINO environment initialized
Remark. When using the openvino virtual environment, it is recommended to avoid relying
solely on the workon openvino command. If you source start_openvino.sh as shown,
another Intel-provided script is also sourced in the process to set up key environment variables.
I recommend that you inspect the start script on the .img via the following command: cat
~/start_openvino.sh
From there, be sure to source the setup file in the project folder:
$ source setup.sh
Figure 13.2: People counting via object detection with the Raspberry Pi 4B 4GB CPU.
As you can see, the RPi 4B 4GB achieved 26 FPS using only the CPU for object detection.
Let’s see how much of an improvement the Intel Movidius Neural Compute Stick coupled
with OpenVINO yields using the Myriad processor. Ensure that your Movidius NCS2 is plugged
into your RPi 4B 4GB’s USB 3.0 port and update the --target:
Be sure to review Chapters 19 and 20 of the Hobbyist Bundle for more details on algorithm
optimizations. In particular, be sure to review Section 20.4 of the Hobbyist Bundle on leading
up to a successful project.
Figure 13.3: People counting via object detection with the Raspberry Pi 4B 4GB and Intel Movidius
NCS2 plugged into the USB 3.0 port.
13.3 Pre-trained Models and Custom Training with the Movidius NCS
In this chapter, we used a pre-trained Caffe-based MobileNet SSD for object detection.
Other pre-trained models compatible with OpenVINO and the Movidius NCS are available
at the following resources:
To learn how to train your own models for the Movidius NCS, be sure to refer to the Complete
Bundle volume of this book.
13.4 Summary
In this chapter, we learned how to build a people counter using object detection with the Intel
Movidius NCS.
Our implementation:
• Utilizes deep learning object detectors for improved person detection accuracy.
• Leverages two separate object tracking algorithms, including both centroid tracking and
correlation filters for improved tracking accuracy.
• Applies both a “detection” and “tracking” phase, making it capable of (1) detecting new
people, and (2) finding people that may have been “lost” during the tracking phase.
• Is capable of running in real-time at 32 FPS using the Movidius NCS, but only 24 FPS
using only the CPU.
As you can see from the images, this type of system would be especially useful to a store
owner to track the number of people that go in and out of the store at various times of day.
Using the data, the store owner could determine how much staff is needed to tend to the
customers, potentially saving the owner money in the long run.
If you enjoyed this object detection chapter using the Movidius NCS, you’re going to love
the Google Coral TPU alternative presented in the Complete Bundle of this book. Once you
become familiar with the Coral, a great homework assignment would be to implement people
counting as we did in this chapter, but replace the Intel Movidius coprocessor with the Google
Coral TPU coprocessor and benchmark your results.
Chapter 14
Fast, Efficient Face Recognition with the Movidius NCS
When we built our face recognition door monitor in Chapter 5 you may have noticed that you
could enter your doorway quickly enough to avoid face recognition by the door monitor.
Is there a problem with the face detection or face recognition models themselves? No, the
problem is that your Raspberry Pi CPU simply can’t process the frames quickly enough.
You need more computational horsepower.
As the title to this chapter suggests, we’re going to pair our RPi with the Movidius Neural
Compute Stick coprocessor. The NCS Myriad processor will handle both face detection and
extracting face embeddings. The RPi CPU processor will handle the final machine learning
classification using the results from the face embeddings.
The process of offloading deep learning tasks to the Movidius NCS frees up the Raspberry
Pi CPU to handle the non-deep learning tasks. Each processor is then doing what it is designed
for. We are certainly pushing our Raspberry Pi to the limit, but we don’t have much choice short
of using a completely different single board computer such as an NVIDIA Jetson Nano.
By the end of this chapter you’ll have a fully functioning face recognition script running at
6.29 FPS on the RPi and Movidius NCS, a 243% speedup compared to using just the RPi
CPU alone!
Remark. This chapter includes a selection of reposted content from the following two blog
posts: Face recognition with OpenCV, Python, and deep learning (http://pyimg.co/oh21b [73]) and
OpenCV Face Recognition (http://pyimg.co/i39fy [74]). The content in this chapter, however, is
optimized for the Movidius NCS.
In this chapter, we will build upon Chapter 5 with a simpler example that demonstrates: (1) extracting face embeddings with the Movidius NCS, (2) training a machine learning model on top of those embeddings, and (3) recognizing faces in video streams.
Prior to reading this chapter, be sure to read/review Chapter 5 in which face recognition was
first presented in this book. Specifically, you should review Section 5.4, “Deep learning for face
recognition” to ensure you understand modern face recognition concepts.
You can reuse your face dataset that you may have developed for that chapter; alternatively
take the time now to develop a face dataset for this chapter (Section 5.4.2).
Once you understand this chapter, you will be able to revisit Chapter 5 and integrate the
Movidius NCS into the door monitor project for both (1) faster speeds and (2) higher accuracy.
In the remainder of this chapter, we’ll begin by reviewing the process of extracting embed-
dings for/with the NCS. From there, we’ll train a model upon the embeddings.
Finally we’ll develop a quick demo script to ensure that our faces are being recognized
properly.
|-- face_detection_model
| |-- deploy.prototxt
| |-- res10_300x300_ssd_iter_140000.caffemodel
|-- face_embedding_model
| |-- openface_nn4.small2.v1.t7
|-- output
| |-- embeddings.pickle
| |-- le.pickle
| |-- recognizer.pickle
|-- setupvars.sh
|-- extract_embeddings.py
|-- train_model.py
|-- recognize_video.py
Our face detector will localize a face in the image to be recognized. The pre-trained Caffe
face detector files (provided by OpenCV) are included inside the face_detection_model/
directory. Be sure to refer to this deep learning face detection blog post to learn more about
the detector and how it can be put to use: http://pyimg.co/l9v8e [75].
We will extract face embeddings with a pre-trained OpenFace PyTorch model included in the
face_embedding_model/ directory. The openface_nn4.small2.v1.t7 file was trained
by the team at Carnegie Mellon University as part of the OpenFace project [76].
We’ll then train a Support Vector Machines (SVM) machine learning model on top of the
embeddings by executing the train_model.py script. The result of training our SVM will be
serialized to recognizer.pickle in the output/ directory.
Remark. You should delete the files included in the output/ directory and generate new files
associated with your own face dataset.
The recognize_video.py script simply activates your camera and detects plus recog-
nizes faces in each frame.
Our Movidius face recognition system will not work properly unless a system environment
variable is set.
This may change in future revisions of OpenVINO, but for now a shell script is provided in
the project associated with this chapter.
1 #!/bin/sh
2
3 export OPENCV_DNN_IE_VPU_TYPE=Myriad2
Line 3 sets the environment variable using the export command. You could, of course,
manually type the command in your terminal, but this shell script alleviates you from having to
memorize the variable name and setting.
$ source setup.sh
Provided that you have executed this script, you shouldn’t see any strange OpenVINO-related
errors with the rest of the project. If you do encounter an OpenVINO-related error message in
the next section, be sure to execute setup.sh first.
Recall from Section 5.4.1 that in order to perform deep learning face recognition, we need
real-valued feature vectors to train a model upon. The script in this section serves the purpose
of extracting 128-d feature vectors for all faces in our dataset.
Lines 2-8 import the necessary packages for extracting face embeddings.
Our script accepts the following command line arguments:
• --dataset: The path to our input dataset of face images.
• --embeddings: The path to our output embeddings file. Our script will compute face
embeddings which we’ll serialize to disk.
• --detector: Path to OpenCV’s Caffe-based deep learning face detector used to actu-
ally localize the faces in the images.
• --embedding-model: Path to the OpenCV deep learning Torch embedding model. This
model will allow us to extract a 128-D facial embedding vector.
• --confidence: The minimum probability threshold used to filter weak face detections.
From there, we load two models from disk:
• detector: Loaded via Lines 26-29. We’re using a Caffe based DL face detector to
localize faces in an image.
• embedder: Loaded on Line 33. This model is Torch-based and is responsible for extract-
ing facial embeddings via deep learning feature extraction.
Notice that we’re using the respective cv2.dnn functions to load the two separate models.
The dnn module is optimized by the Intel OpenVINO developers.
As you can see on Lines 30 and 36, we call setPreferableTarget and pass the Myr-
iad constant setting. These calls ensure that the Movidius Neural Compute Stick will conduct
the deep learning heavy lifting for us.
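A hedged sketch of these two loading steps is shown below. The file names come from the project structure, while the path-joining details are assumptions:

import os
import cv2

# load our serialized Caffe face detector from disk and push its
# inference to the Movidius NCS
protoPath = os.path.sep.join([args["detector"], "deploy.prototxt"])
modelPath = os.path.sep.join([args["detector"],
    "res10_300x300_ssd_iter_140000.caffemodel"])
detector = cv2.dnn.readNetFromCaffe(protoPath, modelPath)
detector.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)

# load the Torch-based OpenFace embedding model and target the NCS too
embedder = cv2.dnn.readNetFromTorch(args["embedding_model"])
embedder.setPreferableTarget(cv2.dnn.DNN_TARGET_MYRIAD)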
Moving forward, let’s grab our image paths and perform initializations:
The imagePaths list, built on Line 40, contains the path to each image in the dataset. The
imutils function, paths.list_images automatically traverses the directory tree to find all
image paths.
Our embeddings and corresponding names will be held in two lists: (1) knownEmbeddings,
and (2) knownNames (Lines 44 and 45).
We’ll also be keeping track of how many faces we’ve processed via the total variable (Line
48).
Let’s begin looping over the imagePaths — this loop will be responsible for extracting
embeddings from faces found in each image:
First, we extract the name of the person from the path (Line 55). To explain how this works,
consider the following example in a Python shell:
$ python
>>> from imutils import paths
>>> import os
>>> datasetPath = "../datasets/face_recognition_dataset"
>>> imagePaths = list(paths.list_images(datasetPath))
>>> imagePath = imagePaths[0]
>>> imagePath
'dataset/adrian/00004.jpg'
>>> imagePath.split(os.path.sep)
['dataset', 'adrian', '00004.jpg']
>>> imagePath.split(os.path.sep)[-2]
'adrian'
>>>
Notice how by using imagePath.split and providing the split character (the OS path
separator — “/” on Unix and “\” on Windows), the function produces a list of folder/file names
(strings) which walk down the directory tree. We grab the second-to-last index, the person’s
name, which in this case is adrian.
Finally, we wrap up the above code block by loading the image and resizing it to a known
width (Lines 60 and 61).
On Lines 65-67, we construct a blob. A blob packages an image into a data structure
compatible with OpenCV’s dnn module. To learn more about this process, please read Deep
learning: How OpenCV’s blobFromImage works (http://pyimg.co/c4gws [45]).
From there we detect faces in the image by passing the imageBlob through the detector
network (Lines 71 and 72).
75 if len(detections) > 0:
76 # we're making the assumption that each image has only ONE
77 # face, so find the bounding box with the largest probability
78 j = np.argmax(detections[0, 0, :, 2])
79 confidence = detections[0, 0, j, 2]
80
81 # ensure that the detection with the largest probability also
82 # meets our minimum probability test (thus helping filter out
83 # weak detections)
84 if confidence > args["confidence"]:
85 # compute the (x, y)-coordinates of the bounding box for
86 # the face
87 box = detections[0, 0, j, 3:7] * np.array([w, h, w, h])
88 (startX, startY, endX, endY) = box.astype("int")
89
90 # extract the face ROI and grab the ROI dimensions
91 face = image[startY:endY, startX:endX]
92 (fH, fW) = face.shape[:2]
93
94 # ensure the face width and height are sufficiently large
95 if fW < 20 or fH < 20:
96 continue
The detections list contains probabilities and bounding box coordinates to localize faces
in an image. Assuming we have at least one detection, we’ll proceed into the body of the if-
statement (Line 75). We make the assumption that there is only one face in the image, so we
extract the detection with the highest confidence and check to make sure that the confidence
meets the minimum probability threshold used to filter out weak detections (Lines 78-84).
When we’ve met that threshold, we extract the face ROI and grab/check dimensions to
make sure the face ROI is sufficiently large (Lines 87-96).
From there, we’ll take advantage of our embedder CNN and extract the face embeddings:
98 # construct a blob for the face ROI, then pass the blob
99 # through our face embedding model to obtain the 128-d
100 # quantification of the face
101 faceBlob = cv2.dnn.blobFromImage(face, 1.0 / 255,
102 (96, 96), (0, 0, 0), swapRB=True, crop=False)
103 embedder.setInput(faceBlob)
104 vec = embedder.forward()
105
106 # add the name of the person + corresponding face
107 # embedding to their respective lists
108 knownNames.append(name)
109 knownEmbeddings.append(vec.flatten())
110 total += 1
We construct another blob, this time from the face ROI and not the whole image as we did
previously (Lines 101 and 102).
Subsequently, we pass the faceBlob through the embedder CNN (Lines 103 and 104).
This generates a 128-D vector (vec) which quantifies the face. We’ll leverage this data to
recognize new faces via machine learning.
And then we simply add the name and embedding vec to knownNames and knownEmbedd
ings, respectively (Lines 108 and 109).
We also can’t forget about the variable we set to track the total number of faces either —
we go ahead and increment the value on Line 110.
We continue this process of looping over images, detecting faces, and extracting face em-
beddings for each and every image in our dataset.
All that’s left when the loop finishes is to dump the data to disk:
We add the name and embedding data to a dictionary and then serialize it into a pickle file
on Lines 113-117.
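That serialization step is only a few lines; a sketch of it looks like the following, where the dictionary key names are assumptions consistent with how the training script reads them back:

# serialize the embeddings and names to disk as a pickle file
data = {"embeddings": knownEmbeddings, "names": knownNames}
f = open(args["embeddings"], "wb")
f.write(pickle.dumps(data))
f.close()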
At this point we’re ready to extract embeddings by running our script. Prior to running the
embeddings script, be sure to setup environmental variables via our script if you did not do so
in the previous section:
$ source setup.sh
From there, open up a terminal and execute the following command to compute the face
embeddings with OpenCV and Movidius:
$ python extract_embeddings.py \
--dataset ../datasets/face_recognition_dataset \
--embeddings output/embeddings.pickle \
--detector face_detection_model \
--embedding-model face_embedding_model/openface_nn4.small2.v1.t7
[INFO] loading face detector...
[INFO] loading face recognizer...
[INFO] quantifying faces...
[INFO] processing image 1/120
This process completed in 57s on a RPi 4B with an NCS2 plugged into the USB 3.0 port.
You may notice a delay at the beginning as the model is being loaded. From there, each image
will process very quickly.
As you can see, we’ve extracted 120 embeddings, one for each of the 120 face photos in our
dataset. The embeddings.pickle file is now available in the output/ folder as well:
$ ls -lh output/*.pickle
-rw-r--r-- 1 pi pi 66K Nov 20 14:35 output/embeddings.pickle
The serialized embeddings filesize is 66KB — embeddings files grow linearly according
to the size of your dataset. Be sure to review Section 14.3.1 later in this chapter about the
importance of an adequately large dataset for achieving high accuracy.
At this point we have extracted 128-d embeddings for each face — but how do we actually
recognize a person based on these embeddings? The answer is that we need to train a
“standard” machine learning model (such as an SVM, k-NN classifier, Random Forest, etc.) on
top of the embeddings.
For small datasets a k-Nearest Neighbor (k-NN) approach can be used for face recognition
on 128-d embeddings created via the dlib [33] and face_recognition [34] libraries.
However, in this chapter, we will build a more powerful classifier (Support Vector Machines)
on top of the embeddings — you’ll be able to use this same method in your dlib-based face
recognition pipelines as well if you are so inclined.
We import our packages and modules on Lines 2-6. We’ll be using scikit-learn’s implemen-
tation of Support Vector Machines (SVM), a common machine learning model.
• --embeddings: The path to the serialized embeddings (we saved them to disk by run-
ning the previous extract_embeddings.py script).
• --recognizer: This will be our output model that recognizes faces. We’ll be saving it
to disk so we can use it in the next two recognition scripts.
• --le: Our label encoder output file path. We’ll serialize our label encoder to disk so that
we can use it and the recognizer model in our image/video face recognition scripts.
Here we load our embeddings from our previous section on Line 20. We won’t be gen-
erating any embeddings in this model training script — we’ll use the embeddings previously
generated and serialized.
Then we initialize our scikit-learn LabelEncoder and encode our name labels (Lines 24
and 25).
Now it’s time to train our SVM model for recognizing faces:
27 # train the model used to accept the 128-d embeddings of the face and
28 # then produce the actual face recognition
29 print("[INFO] training model...")
30 params = {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
31 "gamma": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]}
32 model = GridSearchCV(SVC(kernel="rbf", gamma="auto",
33 probability=True), params, cv=3, n_jobs=-1)
34 model.fit(data["embeddings"], labels)
35 print("[INFO] best hyperparameters: {}".format(model.best_params_))
We are using a machine learning Support Vector Machine (SVM) Radial Basis Function
(RBF) kernel [77] which is typically harder to tune than a linear kernel. Therefore, we will
undergo a process known as “gridsearching”, a method to find the optimal machine learning
hyperparameters for a model.
Lines 30-33 set our gridsearch parameters and construct the gridsearch object. Notice the
n_jobs parameter on Line 33: it controls how many workers perform the gridsearch in parallel
(n_jobs=-1 uses every available core, while n_jobs=1 restricts the search to a single worker
if the full search taxes your Raspberry Pi too heavily).
Line 34 handles training our face recognition model on the face embeddings vectors.
Remark. You can and should experiment with alternative machine learning classifiers. The
PyImageSearch Gurus course [78] covers popular machine learning algorithms in depth. To
learn more about the course use this link: http://pyimg.co/gurus
From here we’ll serialize our face recognizer model and label encoder to disk:
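A sketch of that serialization is shown below; whether to pickle the full GridSearchCV object or just the best estimator it found is a design choice, and the best estimator is assumed here:

# write the trained SVM and the label encoder to disk
f = open(args["recognizer"], "wb")
f.write(pickle.dumps(model.best_estimator_))
f.close()

f = open(args["le"], "wb")
f.write(pickle.dumps(le))
f.close()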
To execute our training script, run train_model.py from your terminal, supplying the
--embeddings, --recognizer, and --le paths reviewed above. Once training completes,
list the output/ directory to confirm the serialized files exist:
$ ls -lh output/*.pickle
-rw-r--r-- 1 pi pi 66K Nov 20 14:35 output/embeddings.pickle
-rw-r--r-- 1 pi pi 470 Nov 20 14:55 le.pickle
-rw-r--r-- 1 pi pi 97K Nov 20 14:55 recognizer.pickle
With our serialized face recognition model and label encoder, we’re ready to recognize faces
in images or video streams.
In this section we will code a quick demo script to recognize faces using your PiCamera or
USB webcamera. Go ahead and open recognize_video.py and insert the following code:
• --detector: The path to OpenCV’s deep learning face detector. We’ll use this model
to detect where in the image the face ROIs are.
• --recognizer: The path to our recognizer model. We trained our SVM recognizer in
Section 14.2.4. This model will actually determine who a face is.
• --le: The path to our label encoder. This contains our face labels such as adrian or
unknown.
Be sure to study these command line arguments — it is critical that you know the difference
between the two deep learning models and the SVM model. If you find yourself confused later
in this script, you should refer back to here.
Now that we’ve handled our imports and command line arguments, let’s load the three
models from disk into memory:
We load three models in this block. At the risk of being redundant, here is a brief summary
of the differences among the models:
i. detector: A pre-trained Caffe DL model to detect where in the image the faces are
(Lines 28-32).
ii. embedder: A pre-trained Torch DL model to calculate our 128-D face embeddings (Lines
37 and 38).
iii. recognizer: Our SVM machine learning face recognition model.
The first two are pre-trained deep learning models, meaning that they are provided to you
as-is by OpenCV. The Movidius NCS will perform inference using each of these models.
The third recognizer model is not a form of deep learning. Rather, it is our SVM machine
learning face recognition model. The RPi CPU will have to handle making face recognition
predictions using it.
We also load our label encoder which holds the names of the people our model can recog-
nize (Line 42).
44 # initialize the video stream, then allow the camera sensor to warm up
45 print("[INFO] starting video stream...")
46 #vs = VideoStream(src=0).start()
47 vs = VideoStream(usePiCamera=True).start()
48 time.sleep(2.0)
49
50 # start the FPS throughput estimator
51 fps = FPS().start()
Line 47 initializes and starts our VideoStream object. We wait for the camera sensor to
warm up on Line 48.
We grab a frame from the webcam on Line 56. We resize the frame (Line 61) and then
construct a blob prior to detecting where the faces are (Lines 65-72).
Given our new detections, let’s recognize faces in the frame. But first, we need to filter out
weak detections and extract the face ROI:
We loop over the detections on Line 75 and extract the confidence of each on Line 78.
Then we compare the confidence to the minimum probability detection threshold contained
in our command line args dictionary, ensuring that the computed probability is larger than the
minimum probability (Line 81).
From there, we extract the face ROI (Lines 84-89) as well as ensure its spatial dimensions
are sufficiently large (Lines 92 and 93).
Recognizing the name of the face ROI requires just a few steps:
95 # construct a blob for the face ROI, then pass the blob
96 # through our face embedding model to obtain the 128-d
First, we construct a faceBlob (from the face ROI) and pass it through the embedder to
generate a 128-D vector which quantifies the face (Lines 98-102).
Then, we pass the vec through our SVM recognizer model (Line 105), the result of which
is our predictions for who is in the face ROI.
We take the highest probability index and query our label encoder to find the name (Lines
106-108).
Remark. You can further filter out weak face recognitions by applying an additional threshold
test on the probability. For example, inserting if proba < T (where T is a variable you de-
fine) can provide an additional layer of filtering to ensure there are fewer false-positive face
recognitions.
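A sketch of that classification-plus-threshold step, reusing the hypothetical recognizer, le, and vec names from above, could look like this (T is a value you would choose yourself):

# classify the 128-d embedding with the SVM and grab the best prediction
preds = recognizer.predict_proba(vec)[0]
j = np.argmax(preds)
proba = preds[j]
name = le.classes_[j]

# optionally suppress weak recognitions by labeling them "unknown"
T = 0.5
if proba < T:
    name = "unknown"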
Now, let’s display face recognition results for this particular frame:
110 # draw the bounding box of the face along with the
111 # associated probability
112 text = "{}: {:.2f}%".format(name, proba * 100)
113 y = startY - 10 if startY - 10 > 10 else startY + 10
114 cv2.rectangle(frame, (startX, startY), (endX, endY),
115 (0, 0, 255), 2)
116 cv2.putText(frame, text, (startX, y),
117 cv2.FONT_HERSHEY_SIMPLEX, 0.45, (0, 0, 255), 2)
118
119 # update the FPS counter
120 fps.update()
121
122 # show the output frame
123 cv2.imshow("Frame", frame)
124 key = cv2.waitKey(1) & 0xFF
125
126 # if the `q` key was pressed, break from the loop
127 if key == ord("q"):
128 break
129
The final code blocks in the script:
• Draw a bounding box around the face along with the person’s name and corresponding predicted
probability (Lines 112-117).
• Display the annotated frame (Line 123) and wait for the q key to be pressed at which
point we break out of the loop (Lines 124-128).
• Stop our fps counter and print statistics in the terminal (Lines 131-133).
• Cleanup by closing windows and releasing pointers (Lines 136 and 137).
Now that we have (1) extracted face embeddings, (2) trained a machine learning model on the
embeddings, and (3) written our face recognition driver script for video streams, let’s see the
final result.
As you can see, faces have correctly been identified. What’s more, we are achieving 6.29
FPS using the Movidius NCS in comparison to 2.59 FPS using strictly the CPU. This comes
out to a speedup of 243% using the RPi 4B and Movidius NCS2.
Figure 14.1: Face recognition with the Raspberry Pi and Intel Movidius Neural Compute Stick.
Inevitably, you’ll run into a situation where OpenCV does not recognize a face correctly.
What do you do in those situations? And how do you improve your OpenCV face recognition
accuracy?
In this section, I’ll detail a few of the suggested methods to increase the accuracy of your
face recognition pipeline.
Remark. This section includes reposted content from my OpenCV Face Recognition blog post
http://pyimg.co/6hwuu [74].
My first suggestion is likely the most obvious one, but it’s worth sharing.
Figure 14.2: All face recognition systems are error-prone. There will never be a 100% accurate
face recognition system.
I get the impression that most readers already know they need more face images when they
only have one or two example faces per person, but I suspect they are hoping for me to pull a
computer vision technique out of my bag of tips and tricks to solve the problem.
If you find yourself with low face recognition accuracy and only have a few example faces
per person, gather more data — there are no “computer vision tricks” that will save you from
the data gathering process.
Invest in your data and you’ll have a better OpenCV face recognition pipeline. In
general, I would recommend a minimum of 10-20 faces per person.
Figure 14.3: Most people aren’t training their OpenCV face recognition models with enough data.
(image source: [79])
Remark. You may be thinking, “But Adrian, you only gathered 20 images per person for this
chapter!” Yes, you are right — and that is to prove a point. The face recognition system we
discussed in this chapter worked but can always be improved. There are times when smaller
datasets will give you your desired results, and there’s nothing wrong with trying a small dataset
— but when you don’t achieve your desired accuracy you’ll want to gather more data.
The face recognition model OpenCV uses to compute the 128-d face embeddings comes from
the OpenFace project [76].
The OpenFace model will perform better on faces that have been aligned. Face alignment
is the process of (1) identifying the geometric structure of faces in images and (2) attempting to
obtain a canonical alignment of the face based on translation, rotation, and scale.
To produce the aligned faces shown in Figure 14.4, we have:
i. Detected faces in the image and extracted the ROIs (based on the bounding box coordi-
nates).
ii. Applied facial landmark detection (http://pyimg.co/x0f5r [80]) to extract the coordinates of
the eyes.
Figure 14.4: Performing face alignment for OpenCV facial recognition can dramatically improve
face recognition performance.
iii. Computed the centroid for each respective eye along with the midpoint between the eyes.
iv. And based on these points, applied an affine transform to resize the face to a fixed size
and dimension.
If we apply face alignment to every face in our dataset, then in the output coordinate space,
all faces should:
i. Be centered in the image.
ii. Be rotated such that the eyes lie on a horizontal line (i.e., the face is rotated such that the
eyes lie along the same y-coordinates).
iii. Be scaled such that the size of the faces is approximately identical.
Applying face alignment to our OpenCV face recognition pipeline is outside the scope of
this chapter, but if you would like to further increase your face recognition accuracy using
OpenCV and OpenFace, I would recommend you apply face alignment using the method in
this PyImageSearch article: http://pyimg.co/tnbzf [81].
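To give you a feel for what that looks like in practice, here is a minimal face alignment sketch using dlib facial landmarks and the FaceAligner helper from imutils. The shape predictor path and input image below are assumptions on my part; download dlib's 68-point landmark model and adjust the paths for your own project.

import cv2
import dlib
from imutils.face_utils import FaceAligner

# initialize dlib's face detector, landmark predictor, and the aligner
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")
fa = FaceAligner(predictor, desiredFaceWidth=256)

# load a hypothetical example image and convert it to grayscale
image = cv2.imread("example.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# align each detected face before computing its 128-d embedding
for rect in detector(gray, 2):
    alignedFace = fa.align(image, gray, rect)

The aligned face ROI (alignedFace) would then be fed into the embedding model in place of the raw detection.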
My next suggestion is for you to attempt to tune your hyperparameters on whatever machine
learning model you are using (i.e., the model trained on top of the extracted face embeddings).
For this chapter’s tutorial, we used an SVM with a Radial Basis Function (RBF) kernel. To
tune the hyperparameters, we performed a grid search over the C value, which is typically the
most important value of an SVM to tune.
The C value is a “strictness” parameter and controls how much you want to avoid misclassi-
fying each data point in the training set. Larger values of C will be more strict, causing the SVM
to try harder to classify every input data point correctly, even at the risk of overfitting. Smaller
values of C will be more “soft”, allowing some misclassifications in the training data, but ideally
generalizing better to testing data.
You may also want to consider tuning the gamma value. The default scikit-learn implemen-
tation will attempt to automatically set the gamma value for you; however, the result may not be
satisfactory. The following example from the scikit-learn documentation shows you how to tune
both the C and gamma parameters of an RBF SVM: http://pyimg.co/qz3pq [82].
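As a quick, hedged sketch of what that grid search might look like, assuming data and labels hold your extracted 128-d face embeddings and the corresponding names (the parameter grid values are illustrative, not prescriptive):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# candidate values for the two RBF SVM hyperparameters
params = {
    "C": [0.1, 1.0, 10.0, 100.0, 1000.0],
    "gamma": [1e-1, 1e-2, 1e-3, 1e-4],
}

# exhaustively search the grid with 3-fold cross-validation
grid = GridSearchCV(SVC(kernel="rbf", probability=True), params, cv=3)
grid.fit(data, labels)

print("[INFO] best hyperparameters: {}".format(grid.best_params_))
recognizer = grid.best_estimator_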
It’s interesting to note that according to one of the classification examples in the OpenFace
GitHub [83], they actually recommend to not tune the hyperparameters if you are using a
linear SVM, as, from their experience, they found that setting C=1 obtains satisfactory face
recognition results in most settings.
That said, RBF SVMs tend to be significantly harder to tune, so if your face recognition
accuracy is not sufficient, it may be worth the extra effort and computational cost of tuning your
hyperparameters via either a grid search or random search.
In my experience using both OpenCV's face recognition model and dlib's face recognition model [35], I've found that dlib's face embeddings are more discriminative, especially for smaller datasets.
Furthermore, I've found that dlib's model is less dependent on (1) preprocessing steps such as face alignment, and (2) using a more powerful machine learning model on top of the extracted face embeddings.
To improve accuracy further, you may want to use dlib’s embedding model, and then instead
of applying k-NN, follow Section 14.2.4 from this chapter and train a more powerful classifier
on the face embeddings.
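As a hedged sketch of that workflow, assuming a dataset list of (imagePath, name) pairs you have already assembled, the dlib-based embeddings (via the face_recognition package) and an SVM trained on top of them could look like this:

import cv2
import face_recognition
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

knownEncodings = []
knownNames = []

for (imagePath, name) in dataset:
    # load the image and convert it from BGR to RGB ordering
    image = cv2.imread(imagePath)
    rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # detect faces and compute dlib's 128-d embedding for each one
    boxes = face_recognition.face_locations(rgb, model="hog")
    for encoding in face_recognition.face_encodings(rgb, boxes):
        knownEncodings.append(encoding)
        knownNames.append(name)

# encode the names as integers and train an SVM on the embeddings
le = LabelEncoder()
labels = le.fit_transform(knownNames)
recognizer = SVC(C=1.0, kernel="linear", probability=True)
recognizer.fit(knownEncodings, labels)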
14.4 Summary
In this chapter, we used OpenVINO and our Movidius NCS to perform face recognition. Our face recognition pipeline consisted of four steps:
i. Create your dataset of face images. You can, of course, swap in your own face dataset provided you follow the directory structure of the project (covered in Chapter 5).
ii. Extract 128-d face embeddings for each face in the dataset.
iii. Train a machine learning model (Support Vector Machine) on top of the face embeddings.
iv. Utilize OpenCV and our Movidius NCS to recognize faces in video streams.
We put our Movidius NCS to work for the following deep learning tasks:
• Face detection: localizing faces in the input frame
• Extracting face embeddings: generating 128-D vectors which quantify a face numerically
We then used the Raspberry Pi CPU to handle the non-DL machine learning classifier used
to make predictions on the 128-D embeddings.
This process of separating responsibilities allowed the CPU to call the shots, while employ-
ing the NCS for the heavy lifting. We achieved a speedup of 243% using the Movidius NCS
for face recognition in video streams.
Chapter 15
Recognizing Objects with IoT Pi-to-Pi Communication
Now that you’ve read all of the Hobbyist Bundle, and most of the Hacker Bundle, imagine that
you’re brainstorming with me on a security-related project to build upon Chapter 5 — the face
recognition door monitor project.
Let’s consider the drawbacks of many alarm systems available today for homes and busi-
nesses.
Commercial off-the-shelf (COTS) systems are expensive in comparison to the cheap and
affordable Raspberry Pi. Service contracts, installation fees, and false-alarm police call fees
come to mind.
COTS security systems typically cannot be customized. Despite being an IoT device, rarely
does a COTS security system integrate and interoperate with other devices in your home.
Alarm systems can also be a nuisance, and people end up not using them in the first place. Don't you hate that your motion sensor isn't smart enough to ignore your dog moving through the home? Many people with animals simply do not arm their home for fear of a false alarm.
As the Hacker Bundle comes to a close with this final chapter, what better way to end
than by putting all the pieces together and creating a real-world IoT security project you can
use in your home?
In this chapter, we will arm our home with multiple Raspberry Pis that communicate among
themselves to accomplish a common goal. In a sense, this chapter represents the culmination
of all the topics we have learned and discussed so far in the first two volumes of this book.
We will focus on learning by doing and using the concepts that we have learned along the way. There will certainly be room for you to hack the project to your needs and create entirely new projects of your own.
Admittedly, this chapter has a lot of moving parts. Keeping that in mind, we have actually
covered most of the topics in previous chapters of the Hobbyist and Hacker Bundle.
At a high level, we will:
i. Review the relevant concepts we have already covered.
ii. Introduce two new concepts: IoT lighting and state machines.
iii. And bring it all together for a multiple Pi Internet of Things project.
In this chapter, we’ll begin by building a case for security with an emphasis on the lack of
flexibility of commercially available systems — that’s where we bring in Raspberry Pis to the
rescue. From there we’ll review the concept of what we’re building in this chapter. Our interop-
erable system will involve at least two Raspberry Pis and all the concepts listed in the objectives section.
Next, we’ll review concepts that we have already covered in this book including links to
chapters which you should be familiar with prior to reading this chapter and hacking with the
code.
We’ll then introduce two new concepts: IoT lighting and state machines. We’ll review the
IoT lights that are recommended for compatibility with this chapter including a quick demo of
the API. If you have a computer science or computer engineering background, you’re likely
already familiar with state machines. If not, you’ll pick up the concept very quickly.
From there we’ll get into our IoT Case Study project including discussing our state machines,
project structure, config files, and driver scripts. We’ll bring it all together for deployment. And
finally, we’ll review suggestions and ideas for this project and spin-off projects that you dream
up.
Be sure to budget some time for this chapter as there are a lot of moving parts. For de-
ployment you may have to spend some time tweaking your physical environment including
positioning of your cameras and other hardware as well as testing.
In this section, we’ll discuss security systems, why they are important, and how networked
Raspberry Pis can help with the equation.
We all know way too many people who have had belongings stolen.
This leaves someone with a fear for their safety, their family's safety, and a fear that their precious belongings are vulnerable to loss. Protecting your family and belongings is important to you — theft and vandalism are on the rise in many areas and they are hard for communities to counteract.
Many break-ins even happen in the light of day, so what can we do?
Relying on law enforcement to pick up the pieces of an insecure home or car that gets
broken into doesn’t always lead to putting people behind bars.
Law enforcement needs evidence to act, and in many cases, they don’t have time to dust
for fingerprints for the theft of an expensive laptop or jewelry.
Thanks to GPS and connectivity in laptops and smartphones, thieves are beginning to leave
them behind. Check out Apple’s latest “Find My” system that relies on nearby bluetooth devices
to find phones, watches, laptops, and headphones [84]. Apple claims the system is secure and
will not impact battery life of your devices. Considering you’re nearly always within 30 feet of
an iPhone, thieves may begin to think twice about stealing an Apple product!
Instead, they go for items that can't be electronically tracked — currency, tools from your garage/shed, and jewelry. For these items, it is essential to guard your home with a security system that includes video evidence if you want any hope of the perpetrator serving time.
Let’s face it. In most cases a camera or motion sensitive light won’t deter a thief. The thief or
vandal will sometimes wear a mask making it nearly impossible for identification.
That’s not an excuse to skip installing cameras – cameras and storage are cheap. If you
don’t have a video to provide to police, you certainly aren’t helping them on their quest to track
down the bad guy. Video evidence provided to both the police and posted in community social
groups like the NextDoor app [85] may lead to action being taken.
15.3.3 How the RPi Shines for Security — Flexibility and Affordability
There are countless off-the-shelf systems for security surveillance, many with a local DVR or
cloud storage option. These systems are great for the “average Joe” homeowner. As Rasp-
berry Pi tinkerers, we tend to overlook the COTS (Commercial Off-the-Shelf) solution. There’s
actually a lot to learn from these systems so browsing product pages and talking to home
security experts is definitely worth our time.
But normally, COTS systems don’t provide much room for tinkering and custom develop-
ment. We’re hackers, so we say “boring!” We also become frustrated that we can’t make these
expensive products and services operate as we desire:
• It can't activate all the lights in the house to really throw the bad guy off his game at night.
• Nor can it send SMS messages to your smart phone (and your neighbors).
• Can it borrow a gigabyte of storage from Dropbox or Box that you already pay for to store
motion or pedestrian clips? I doubt it — you’re locked in to an expensive storage solution
possibly with a proprietary video file format.
• Does the camera interface with your alarm system? Some do, but you may pay a pre-
mium.
• Full control over your motion detection algorithm? I didn’t think so.
• Is that COTS system less than a one-time fee of $100 USD and some sweat equity on the
keyboard? No, you’re locked into a recurring monthly fee.
The Raspberry Pi (coupled with Python, OpenCV, and other libraries) can complete all the
above tasks as we’ve learned in the Hobbyist Bundle and so far in the Hacker Bundle.
Our solution is flexible, affordable, and interoperable with other IoT devices and services that
are worth paying for (there are plenty of poor products and services that won’t interoperate as
well).
Flexibility is both a pro and a con. On one hand, the possibilities are endless. On the other
hand, it requires time and development. But you’ll learn new skills in the process which is a
win-win in my mind coming from the computer vision education perspective.
You may even find yourself working for a security company developing computer vision
applications and algorithms now that you’re armed with knowledge you gained in this book!
Before we begin implementing our IoT security system, let’s break it down and understand how
it will actually work.
Our system has the goals of (1) monitoring our driveway outside the house for motion and
people/vehicles, (2) turning on a light inside the home to deter bad guys and help with face
detection/recognition at night, and (3) performing face recognition and alerting the homeowners
if someone shouldn’t be inside (i.e. not you or your family members).
We can accomplish this proof of concept IoT case study system with a minimum of three devices: two Raspberry Pis (each with its own camera) and a smart IoT lighting device (a plug, switch, or bulb).
Each Raspberry Pi will be running a separate program; however, the Raspberry Pis will pass
messages back and forth to convey the state of our system. This communication/message
passing paradigm allows for each RPi’s Python script to update its own state so that each Pi
smartly does the correct task (i.e. waits, monitors driveway, detects the door as opened/closed,
or performs face recognition, turns on a light, etc.).
Only one of the Raspberry Pis will communicate with our IoT lighting via an API. Arguably,
either or both Raspberry Pis could control the lights, but for simplicity only one RPi will be
responsible in this example.
Before we design our system, we need to review (1) previously covered topics, (2) state
machine basics, and (3) IoT lighting via APIs.
This case study builds upon many concepts previously covered in this book. The chapter is admittedly lengthy, so this section serves as your starting point, with pointers to other chapters/sections and outside resources.
Be sure to refer to previous chapters as needed while you read the rest of this chapter.
Previous chapters include more detailed code explanations that you should be familiar with at
this point.
Let’s review the concepts you should know for understanding this chapter.
Any Internet communications system you use and rely on utilizes sockets, which in programmer-speak are simply connections. A single program can have many connections to services
such as REST APIs, databases, websites, SFTP, applications residing on servers or other
systems, etc.
We will rely on simple message passing sockets in this project. Be sure to refer to Chapter
3 where we reviewed message passing applications by example, ZMQ, and ImageZMQ.
Figure 15.1: Left: Face detection localizes a face in an image. Right: Face recognition determines
who is in the detected face ROI.
Face detection involves localizing a face in an image (i.e. finding the bounding box (x,
y)-coordinates of all faces). Face detection algorithms rely on Haar Cascades, Histogram
of Oriented Gradient detectors (HOG), or deep learning object detectors. Each has its own
tradeoffs in speed/accuracy performance. Knowing which type of face detector to use is key to a project's success, especially on a resource-constrained device like the Raspberry Pi. The
following chapters of this book implement face detection:
• Hobbyist Bundle:
• Hacker Bundle:
• Chapter 5 uses face detection prior to recognition in a video stream of your doorway
to monitor and alert you of people entering your home.
• Chapter 6 utilizes face detection to find faces in a frame prior to recognition for class-
room attendance purposes.
• Chapter 14 utilizes a deep learning face detection model prior to applying deep learn-
ing/machine learning based face recognition using the Movidius NCS.
Face recognition involves discerning the difference between faces in an image or video feed
(i.e. who is who?).
Local Binary Patterns and Eigenfaces were successful algorithms used for facial recognition in the early days of the field.
These days, modern face recognition systems employ deep learning and machine learning
to accomplish face recognition. We compute “face embeddings”, a form of a feature vector, for
faces in a dataset. From there we can train a machine learning model on top of the extracted
face embeddings.
We then load the serialized model to recognize fresh faces presented to a camera as either
recognized or unknown.
Alternatively a k-Nearest Neighbor approach could be used, which is arguably not machine
learning. Rather, k-NN relies on computing the distance between face embeddings and finding
the closest match(es).
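For reference, a minimal sketch of that k-NN-style matching using the face_recognition package's distance-based compare_faces helper; knownEncodings and knownNames are assumed to come from your enrollment step, and encoding is the 128-d vector of the face to identify:

import face_recognition

# compare the new encoding against every known encoding
matches = face_recognition.compare_faces(knownEncodings, encoding,
    tolerance=0.6)
name = "unknown"

if True in matches:
    # vote among the matched encodings to pick the most likely name
    matchedIdxs = [i for (i, b) in enumerate(matches) if b]
    counts = {}
    for i in matchedIdxs:
        counts[knownNames[i]] = counts.get(knownNames[i], 0) + 1
    name = max(counts, key=counts.get)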
• Chapter 5 makes use of facial recognition to recognize inhabitants and intruders entering
your home.
• Face Recognition with OpenCV, Python, and Deep Learning: http://pyimg.co/oh21b [73]
Face recognition is a hot topic and you can find all relevant topics on PyImageSearch via
this "faces" category link: http://pyimg.co/yhluw.
Figure 15.2: Left: Background subtraction for motion detection. Right: Object detection for local-
izing and determining types of objects.
Background subtraction is a method that can help find motion areas in a video stream. In
order to successfully apply background subtraction, we need to make the assumption that our
background is mostly static and unchanging over consecutive frames of a video. Then, we can
model the background and monitor it for substantial changes. The changes are detected and
marked as motion. You can observe background subtraction in action in the following chapters:
• Hobbyist Bundle:
• Hacker Bundle:
Object detection with the pre-trained MobileNet SSD enables localization and recognition of 20
everyday classes.
In this chapter we will use the pretrained model to detect people and vehicles.
Object detection with the MobileNet SSD is covered in the following chapters of the Hacker
Bundle:
• Chapter 8 uses MobileNet SSD to find people and animals in multiple RPi client video
streams so you can locate them in any frame streamed via ImageZMQ.
• Chapter 13 improves upon Chapter 19 of the Hobbyist Bundle using MobileNet SSD for
accurate and fast people counting with the addition of OpenVINO and the Movidius NCS
for inference.
I’ve written about the MobileNet SSD a number of times on PyImageSearch, so be sure to
refer to these articles for more practical usages of MobileNet: http://pyimg.co/o64vu
If you are interested in training your own Faster R-CNNs, SSDs, or RetinaNet object detec-
tion models, you may refer to the ImageNet Bundle of Deep Learning for Computer Vision with
Python (http://pyimg.co/dl4cv) [50].
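To make the pretrained MobileNet SSD concrete, here is a minimal sketch of loading the model and keeping only confident "person" and "car" detections. The test image path is hypothetical; the prototxt/caffemodel filenames match the project structure reviewed later in this chapter.

import cv2
import numpy as np

# the 20 everyday classes (plus background) the model was trained on
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse",
    "motorbike", "person", "pottedplant", "sheep", "sofa", "train",
    "tvmonitor"]

# load the serialized Caffe model from disk
net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
    "MobileNetSSD_deploy.caffemodel")

# load a hypothetical test image and build a 300x300 input blob
image = cv2.imread("driveway.jpg")
(h, w) = image.shape[:2]
blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)), 0.007843,
    (300, 300), 127.5)

# perform a forward pass to obtain the detections
net.setInput(blob)
detections = net.forward()

for i in range(detections.shape[2]):
    confidence = detections[0, 0, i, 2]
    idx = int(detections[0, 0, i, 1])

    # keep only confident detections of the classes we care about
    if confidence > 0.5 and CLASSES[idx] in ("person", "car"):
        box = detections[0, 0, i, 3:7] * np.array([w, h, w, h])
        (startX, startY, endX, endY) = box.astype("int")
        print(CLASSES[idx], confidence, (startX, startY, endX, endY))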
Text message alerts are not only useful, but are a lot of fun and are a great way to show off
your projects while you’re out having drinks with friends. Be sure to check out the following
chapters involving Twilio SMS/MMS alerts:
• Hobbyist Bundle:
• Chapter 10 introduces the code templates for working with Twilio notifications.
• Chapter 11 includes a project that sends SMS notifications when your mailbox is
opened.
• Hacker Bundle:
• Chapter 5 alerts you via MMS when an unknown face has entered the door of your
home.
• Chapter 9 sends alerts to your phone when someone enters the wrong hand gesture
code in front of your camera.
Figure 15.3: Left: Smart plug. Center: Smart light bulb. Right: Smart switch.
In this chapter we will put IoT lighting devices to work for us. A Raspberry Pi will activate a
light in the home so that the camera can see our face for facial recognition. It may also serve as a deterrent to an unsuspecting intruder who thinks a person turned on the light.
There are many IoT lights on the market, but a lot of them are so secure that you can’t
easily work with them using Python. The TP-Link brand of IoT lights has a known security
vulnerability [86] and the folks at SoftSCheck reverse engineered the communication protocol.
SoftSCheck discovered the vulnerability, responsibly disclosed it to the TP-Link engineering
team, and published the WireShark capture files and Python scripts on GitHub [87].
Later, the GadgetReactor user on GitHub posted an API that supports more devices [88].
Luckily for us, the lights are easily turned on and off using this Python API. TP-Link's "Kasa" line of lighting products (Figure 15.3) that carries the vulnerability (i.e., is compatible with pyHS100) includes smart plugs, smart switches, and smart bulbs.
You may purchase any of the lighting devices listed on the companion website — they all
work with the API, though some may require slight modifications.
We will use the pyHS100 project to interface with our lights by IP address. The Python
package is already installed on the preconfigured .img that accompanies this book [89]. If you
are not using the PyImageSearch RPi .img, you may simply install the package via pip in your
virtual environment:
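A typical install command (run inside your Python virtual environment) looks like this:

$ pip install pyHS100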
The following demo program, iot_light_demo.py, will toggle your light on and off:
Lines 2-5 import packages, namely pyHS100. The preconfigured Raspbian .img that
accompanies this book includes the package. It is also pip-installable via pip install
pyHS100.
Lines 8-13 parse the --ip-address and smart device --type. You’ll need to follow the
instructions that come with your TP-LINK product to connect it to your network. You can find
the IP address by looking at your DHCP client table on your router. The type can be any one
of the choices listed on Line 12 ("plug", "switch", or "bulb" as shown in Figure 15.3).
With our known IP address and device type, we’re now ready to instantiate the device:
If the --type is either a plug or a switch, Lines 16-18 initialize the device. Otherwise, a
bulb is initialized via Lines 21-23; bulbs use a slightly different API for a different feature set.
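For reference, the instantiation logic just described looks roughly like the following sketch using the pyHS100 API. The argparse keys ("ip_address" and "type") mirror the command line arguments above; as described, plugs and switches share the SmartPlug interface.

from pyHS100 import SmartPlug, SmartBulb

# initialize a plug/switch, or fall back to a bulb
if args["type"] in ("plug", "switch"):
    device = SmartPlug(args["ip_address"])
else:
    device = SmartBulb(args["ip_address"])

# query the device's current on/off state
print("[INFO] device state: {}".format(device.state))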
From here, we’ll query the device status and request user input:
Line 29 begins a user input loop. Inside the loop, first we request user input via the
terminal. The input prompt asks the user to specify "on", "off", or "exit".
Next we’ll process the user input and print the device status:
50 continue
51
52 # print a success message if the device state matches desired state
53 if device.state == val.upper():
54 print("[SUCCESS] device state equals desired state of {}".format(
55 val.upper()))
56
57 # otherwise print a failure message when the state does not match
58 else:
59 print("[FAILURE] device state not equal to desired state of" \
60 " {}".format(val.upper()))
If the input val is "on" we turn_on the device (Lines 34 and 35). Conversely, if the
val is "off" we turn the device off (Lines 38 and 39).
Upon the "exit" command, the Python program will print a message and quit (Lines
42-45).
Lines 53-60 print a success or failure message depending on if the desired on/off state
matches the actual state of the device.
To run the program, simply pass the IP address and type via command line argument:
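For example (the IP address below is a placeholder; substitute the address of your own device and the appropriate type):

$ python iot_light_demo.py --ip-address 192.168.1.100 --type plug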
State machines, also known as “finite-state machines”, often simplify computer application
logic and design. During the design phase of a system that has multiple states of operation,
the developers and designers are forced to consider the various modes of operation (i.e. states)
a system can be in.
Consider matter. Matter can be in any one of four states: solid, liquid, gas, or plasma.
There’s nothing in between. The matter is able to transition between the four states, but can
only actually be in any one state at any given time.
Figure 15.4: State machine which mimics a common traffic light (not the entire intersection). All
state machines have initial states. Some state machines have final states. You could argue either
way whether a traffic light has a final state, but consider power failure, a final state in which all
lights are off. When power comes back, the light will go to its initial state of all lights off until the
controller is ready to route traffic through the intersection. At that point its next state would either
be "Red" (as shown) or "Green".
Now consider a computer program such as one that controls a traffic light. We know that
a standard traffic light has three states: “go” (green), “stop” (red), or “caution” (yellow/amber).
Sometimes there may be a protected turn arrow as well. Those states represent the three/four
possible states from one perspective of a traffic intersection. You will never see both the red
and green lights on at the same time in the case of an operational traffic light as shown in
Figure 15.4. The state can transition from "stop" to "go", however.
Let’s take this example a step further. Take into account the entire intersection. The system
controlling the intersection usually has a minimum of four incoming lines of traffic to route
through. Induction sensors in the road and/or cameras mounted above the intersection monitor
for cars waiting to pass through the intersection. There are also crosswalks and buttons for
pedestrians to request to cross the intersection. The state machine quickly grows to handle
all the lanes of traffic and directions vehicles can go in as well as crosswalks. There are also
individual states of each array of lights just as in Figure 15.4.
Without a logical state machine, not only would you struggle to understand all the conditional statements in your code, but the next traffic engineer would have no idea how to read the program just to make a change so that the intersection routes vehicles more efficiently after an additional turn lane has been built.
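To make the idea concrete, here is a tiny, hypothetical sketch of the single traffic light from Figure 15.4, reduced to three states with fixed timings (no turn arrow, crosswalk, or power-failure handling):

import time

# the three states and the single allowed transition out of each one
RED, GREEN, YELLOW = "red", "green", "yellow"
NEXT_STATE = {RED: GREEN, GREEN: YELLOW, YELLOW: RED}
DURATION = {RED: 30, GREEN: 25, YELLOW: 5}   # seconds, illustrative only

# the initial state
state = RED

while True:
    print("[STATE] light is {}".format(state))
    time.sleep(DURATION[state])

    # the machine is in exactly one state at a time and may only move
    # along a defined transition
    state = NEXT_STATE[state]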
In this chapter, we have two Raspberry Pis. Each RPi will operate its own state machine.
Information will be exchanged via message passing sockets when one Pi is alerting the other
Pi that it has completed an action. The other Pi then jumps into a new state and begins a
separate task.
We’ll review the state machine for each of our RPis in the next section.
In this section, we will begin by reviewing each of our two RPis’ state machines and how
they interact. From there we’ll dive into each separate project structure. We’ll then review the
separate JSON configurations.
Next, we’ll dive into the driver script for the Pi aimed out the window at our driveway (i.e.
“driveway-pi”). Similarly we’ll walk through the driver script for the face recognition door
monitor Pi (i.e. “home-pi”).
We’ll wrap up by learning how to execute and test our system and reviewing tips and sug-
gestions for developing similar projects to this one.
As shown in Figure 15.5 our state machines are simplified as compared to Figure 15.4. To
simplify the drawing, each initial state is marked as state 0. There is no final state — it is
expected that this program will run forever.
The state machines run independently in their own Python scripts on separate Raspberry
Pis.
But how does one Python application trigger a state machine on a separate system?
For the "driveway-pi" (Figure 15.5, left), the heart of the program is in the "0: Looking For
Objects" state. When the "driveway-pi" finds a person or car, it will send a message to the
"home-pi" and wait until the "home-pi" sends further instructions.
Similarly, the bulk of functionality for the "home-pi" (Figure 15.5, right) lies in the "1: Face
Recognition" state. The "home-pi" starts in a "0: Waiting for Object" state in which it needs a
message from the "driveway-pi" indicating that either a person or car has been detected. At
Figure 15.5: Left: The "driveway-pi" state machine. Right: The "home-pi" state machine. The
state machines run independently in their own Python scripts on separate Raspberry Pis. A
near-identical function/process will run on each RPi called exchange_messages, responsible
for sending/receiving messages and changing states accordingly.
that point it will turn on the light, wait for the door to open, and perform face recognition (the "1:
Face Recognition" state).
Figure 15.6 demonstrates the exchange_messages function/process and the two types
of messages our Pis will be configured to send/receive. To keep our code blocks short, no
validation is performed on the messages. If you were to have multiple types of messages,
you would, of course, need validation (i.e. conditional if/elif/else statements) to determine
which message is received and what action to take (i.e. changing to a different state).
If you are adding functionality to this system (i.e. additional Pis with new responsibilities), I
highly encourage you to think about the states of the system and how states will transition.
You should sketch your flowchart/state machine and messages your RPis and any other
computers will exchange on a blank sheet of paper. Make iterations until you are comfortable
working on the driver scripts.
As an example, maybe you’ll have a third Pi that monitors when your home’s garage door is
open/closed as Jeff Bass discussed at PyImageConf 2018 [20]. Or maybe you’ll have a Pi that
monitors for dogs roaming on your property when you are not home — if a picture is delivered
Figure 15.6: Each Raspberry Pi in our security system has an exchange_messages function/pro-
cess. This process is able to send/receive messages and change states at any time. All states
are process-safe variables.
to your smartphone, maybe you can let your neighbor know that their dog escaped!
In any of these cases, you may need more RPis, more states, and more types of messages
exchanged among the RPis. You would also add message validation as we did in our message
passing example (Section 3.4.3).
Each Pi will run an independent but interworking Python application. Inside the chapter code
folder are two subdirectories: (1) driveway-pi/, and (2) home-pi/. Let’s review the con-
tents of each subdirectory now.
|-- config
| |-- config.json
|-- pyimagesearch
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
|-- MobileNetSSD_deploy.caffemodel
|-- MobileNetSSD_deploy.prototxt
|-- detect.py
Our configuration settings for “driveway-pi” are stored in config.json and parsed by the
Conf class in conf.py.
The detect.py script uses MobileNet SSD to detect people and vehicles that are present
in your driveway. Depending on when these objects are detected, a message is sent to the
“home-pi” to turn on lights and perform facial recognition.
|-- cascade
| |-- haarcascade_frontalface_default.xml
|-- config
| |-- config.json
|-- face_recognition
| |-- encode_faces.py
| |-- train_model.py
|-- messages
| |-- abhishek.mp3
| |-- adrian.mp3
| |-- dave.mp3
| |-- mcCartney.mp3
| |-- unknown.mp3
|-- output
| |-- encodings.pickle
| |-- le.pickle
| |-- recognizer.pickle
|-- pyimagesearch
| |-- notifications
| | |-- __init__.py
| | |-- twilionotifier.py
| |-- utils
| | |-- __init__.py
| | |-- conf.py
| |-- __init__.py
|-- create_voice_msgs.py
|-- door_monitor.py
As you can see, there is a lot going on in the “home-pi” project tree. Let’s break it down.
Face recognition training happens in two steps: encode_faces.py extracts the 128-d face encodings for your dataset (producing encodings.pickle), and we then train a model on top of those encodings via train_model.py, which produces the recognizer and label encoder (recognizer.pickle and le.pickle).
Creating voice messages is a process that is conducted after you have trained your facial
recognition model. The create_voice_msgs.py script reads the label encoder file to grab
the names of the individuals the system can recognize. This script produces .mp3 text-to-
speech files for each person in the messages/ directory.
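As a hedged sketch of how such files could be generated, here is the core of the idea using the gTTS (Google Text-to-Speech) package. Note that gTTS and the exact greeting text are assumptions on my part; the book's create_voice_msgs.py may differ in its details, and the language/accent come from the config reviewed later.

from gtts import gTTS

# one of the names recovered from the label encoder (hypothetical)
name = "adrian"

# synthesize a short greeting and write it to the messages/ directory
tts = gTTS(text="Welcome home, {}.".format(name), lang="en")
tts.save("messages/{}.mp3".format(name))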
Door monitoring (door_monitor.py) is very similar to Chapter 5, but with added fea-
tures including a state machine, socket connection for reading and sending status messages,
and IoT light control. Existing features include detecting when the door is opened, performing
face detection (haarcascade_frontalface_default.xml), face recognition (recognizer.pickle), playing audio files, and sending SMS messages via Twilio (twilionotifier.py)
to the homeowner when an intruder is detected.
Each of the projects has its own respective config file, as mentioned in the previous section. Let's go ahead and review both of them now.
1 {
2 // home pi ip address and port number
3 "home_pi": {
4 "ip": "HOME_PI_IP_ADDRESS",
5 "port": 5558
6 },
7
8 // path to the object detection model
9 "model_path": "MobileNetSSD_deploy.caffemodel",
10
11 // path to the prototxt file of the object detection model
12 "prototxt_path": "MobileNetSSD_deploy.prototxt",
13
14 // variable indicating minimum threshold confidence
15 "confidence": 0.5,
16
17 // boolean variable used to indicate if frames should be displayed
18 // to our screen
19 "display": true
20 }
The “driveway-pi” must know the IP address and port of the “home-pi” (Lines 3-6). Be
sure to replace "HOME_PI_IP_ADDRESS" with the IP of the “home-pi” which is set up for face
recognition.
The pretrained MobileNet SSD object detector file paths are shown on Lines 9-12. The
"confidence" threshold is currently set to 50% via Line 15.
It is highly recommended that you set "display": true and set up a VNC connection
while you’re testing your system. Once you are satisfied with the operation of the system, you
can set "display": false. To learn more about working remotely with your RPi, including
VNC, be sure to refer to this article: http://pyimg.co/tunq7 [71].
The majority of settings that you need to tune are held inside the “home-pi” configuration. Go
ahead and inspect home-pi/config/config.json now:
1 {
2 // type of smart device being used
3 "smart_device": "smart_plug",
4
5 // smart device ip address
6 "smart_device_ip": "YOUR_SMART_DEVICE_IP_ADDRESS",
7
8 // port number used to communicate with driveway pi
9 "port": 5558,
Line 3 is the type of smart IoT lighting device that you have on your network. Provided
you are using a TP-Link Kasa device, your options are "smart_plug", "smart_switch",
or "smart_bulb".
Line 9 holds the "port" number used to communicate with the “driveway-pi”. This num-
ber should match the port number in the “driveway-pi” configuration. The “driveway-pi” will
be responsible for connecting to our “home-pi”, so the “home-pi” does not need to know the
“driveway-pi”’s IP address.
Lines 12-21 include the paths to our Haar cascade face detector, face encodings file, facial
recognition model, and label encoder.
The majority of the following settings are related to the door status and face recognition:
Again, for testing and demonstration purposes, it is recommended that you set "display":
true (Line 25) and establish a VNC connection to this Pi. You can and should establish two
VNC connections — one from your laptop to the “driveway-pi” and one from your laptop to the
“home-pi”. Being able to see the output of these two video feeds allows for monitoring each
system’s state machine and other frame annotations.
The door area "threshold" is set to 12% of the frame (Line 28). You can adjust this
value depending on (1) how close your camera is to the door, and (2) the frame resolution you
are using. Both of these factors impact the relative size of the doorway in the frame.
• "look_for_a_face": The number of consecutive frames to look for a face after which
it has been determined that the door is open.
On Line 43 the "sleep_time" represents the number of seconds for which the “home-pi”
sleeps before messaging the “driveway-pi” to begin detecting objects. It is currently set to 300
seconds (5 minutes).
I’ve chosen the English language and United States accent (Lines 48 and 49) for the
Google text-to-speech engine. Follow the instructions in the comment on Lines 45-47 to see
the available languages and accents/dialects.
All text-to-speech files should be stored in the path specified by "msgs_path" (Line 52).
The remaining configurations should look quite familiar to you for S3 file storage and Twilio
MMS settings:
66 "twilio_sid": "YOUR_TWILIO_SID",
67 "twilio_auth": "YOUR_TWILIO_AUTH_ID",
68 "twilio_to": "YOUR_PHONE_NUMBER",
69 "twilio_from": "YOUR_TWILIO_PHONE_NUMBER",
70
71 // message sent to the owners when an intruder is detected
72 "message_body": "There is an intruder in your house."
73 }
S3 will be used for temporary storage of the image file that will be included with an MMS
message (Lines 61-63). Twilio settings including your ID numbers, phone number, and desti-
nation number must be populated on Lines 66-69. The text message body is show on Line
72.
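For reference, here is a bare-bones sketch of sending that MMS with the official twilio client; the book wraps this logic in its own TwilioNotifier class, so treat this only as an illustration of what happens under the hood. The S3 media URL is hypothetical.

from twilio.rest import Client

# authenticate with the credentials from the config shown above
client = Client(conf["twilio_sid"], conf["twilio_auth"])

# send the alert message (with an optional image hosted on S3)
client.messages.create(
    to=conf["twilio_to"],
    from_=conf["twilio_from"],
    body=conf["message_body"],
    media_url=["https://your-bucket.s3.amazonaws.com/intruder.jpg"])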
Be sure to spend a few minutes familiarizing yourself with all configurations. You may need
to make adjustments later.
Take a moment now to refer to Section 15.7.1 and the figures in that section in which we
learned about the responsibilities and states for Pi #1, known as the “driveway-pi”.
The “driveway-pi” has one main responsibility: to monitor the area outside your home
where cars or people will be present.
Of course, if you don’t have a driveway, it could monitor a walkway or hallway by your
apartment. Maybe you are using this project at your business to monitor employees coming in
and to validate that they are not intruders attempting to steal physical or electronic goods and
records.
Recall that the secondary responsibility of the “driveway-pi” is to inform the “home-pi” (Pi
#2) that a person or vehicle has been detected. It is up to the “home-pi” what to do with that
information (we know it involves turning on a light and performing facial recognition when the
door opens).
Go ahead and open a new file named detect.py in the driveway-pi directory and insert
the following code:
6 import numpy as np
7 import argparse
8 import signal
9 import time
10 import cv2
11 import sys
12 import zmq
Lines 2-12 import our packages and modules. Namely, we will use Process and Value
from Python’s multiprocessing module. Additionally, we will utilize zmq for message pass-
ing via sockets.
Lines 15-17 indicate that each of our states are global variables. We’ll refer to these states
a number of times, so become familiar with them and refer to Section 15.7.1 as needed.
Now we’ll define our exchange_messages function which we’ll later implement as a sep-
arate Python process:
• conf: Our configuration dictionary is passed directly to this function/process so that all
configuration variables can be accessed.
• STATE: Depending on the current state of the “driveway-pi” we’ll either be sending a mes-
sage or waiting to receive a message. This exchange_messages function will handle
communication depending on either of those conditions.
Lines 21-27 initialize our socket connection to the “home-pi” server via its IP address and
port.
The only message that “driveway-pi” will ever send to “home-pi” is "start face reco"
on Line 30. This message indicates both (1) “driveway-pi” has detected a person or vehicle,
and (2) it is time for “home-pi” to begin face recognition.
Line 33 begins an infinite loop to monitor the STATE and act accordingly. Inside the loop, if
the current STATE is:
• STATE_SENDING_MESSAGE, then we go ahead and send the message over the socket
connection to “home-pi” followed by changing the state to STATE_WAITING_FOR_MESSAGE
(Lines 35-41).
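To summarize the pattern just described, here is a condensed sketch of the "driveway-pi" exchange_messages process. The ZMQ PAIR socket type and the behavior in the waiting state (receive a reply from the "home-pi", then resume looking for objects) are assumptions consistent with the walkthrough above, not the book's exact listing.

import zmq

def exchange_messages(conf, STATE):
    # connect to the "home-pi" server using its IP address and port
    context = zmq.Context()
    socket = context.socket(zmq.PAIR)
    socket.connect("tcp://{}:{}".format(conf["home_pi"]["ip"],
        conf["home_pi"]["port"]))

    while True:
        if STATE.value == STATE_SENDING_MESSAGE:
            # tell the "home-pi" to begin face recognition, then wait
            socket.send_string("start face reco")
            STATE.value = STATE_WAITING_FOR_MESSAGE

        elif STATE.value == STATE_WAITING_FOR_MESSAGE:
            # block until the "home-pi" replies, then resume detection
            socket.recv_string()
            STATE.value = STATE_LOOKING_FOR_OBJECTS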
With our message exchanging process coded and ready, now let’s define our signal handler
and parse command line arguments:
The signal_handler simply monitors for ctrl + c keypresses from the user upon
which the application exits (Lines 54-57).
All of our settings are in our configuration file; Lines 60-63 parse the command line argu-
ments for the path to the config (--conf). Line 66 then loads the configuration into memory.
With our configuration in hand, now let’s perform initializations related to our MobileNet SSD
object detector:
Lines 70-84 initialize our pretrained MobileNet SSD including its CLASSES, random anno-
tation COLORS, and frame annotations.
87 STATE_LOOKING_FOR_OBJECTS = 0
88 STATE_SENDING_MESSAGE = 1
89 STATE_WAITING_FOR_MESSAGE = 2
90
91 # state description lookup table
92 DRIVEWAY_PI_STATES = {
93 STATE_LOOKING_FOR_OBJECTS: "looking for a person/car",
94 STATE_SENDING_MESSAGE: "sending message to home pi",
95 STATE_WAITING_FOR_MESSAGE: "waiting for message from home pi"
96 }
97
98 # initialize shared variable used to set the STATE
99 STATE = Value("i", STATE_LOOKING_FOR_OBJECTS)
Our FSM consists of three states (we reviewed them in Section 15.7.1). Lines 87-89 assign
each state an integer value.
Let’s initialize our camera stream and begin looping over frames.
107 # initialize the video stream and allow the camera sensor to warmup
108 print("[INFO] warming up camera...")
109 # vs = VideoStream(src=0).start()
110 vs = VideoStream(usePiCamera=True, resolution=(608, 512)).start()
111 time.sleep(2.0)
112
113 # signal trap to handle keyboard interrupt
114 signal.signal(signal.SIGINT, signal_handler)
Line 110 initializes our PiCamera video stream with the specified resolution.
Beginning on Line 119, we loop over incoming video frames. Lines 124 and 125 extract
frame dimensions.
The main process of execution handles a single state and sets the message sending state
when appropriate for messageProcess to take care of. Let’s see what happens when the
STATE.value == STATE_LOOKING_FOR_OBJECTS:
When our “driveway-pi” is looking for objects in the driveway, we will perform three basic
tasks:
Lines 131-134 perform inference with MobileNet SSD. Lines 137-150 loop over the detec-
tions, ensure the confidence threshold is met and extract the bounding box coordinates of the
object.
Lines 164-170 annotate the class label and bounding box of the person or car on the
frame.
176
177 # display the frame and record keypresses, if required
178 if conf["display"]:
179 cv2.imshow("frame", frame)
180 key = cv2.waitKey(1) & 0xFF
181
182 # if the `q` key is pressed, break from the loop
183 if key == ord("q"):
184 break
185
186 # terminate the message process
187 messageProcess.terminate()
188 messageProcess.join()
189
190 # do a bit of cleanup
191 vs.stop()
Lines 173-175 annotate the corner of the frame with our STATE string which comes directly
from the DRIVEWAY_PI_STATES dictionary.
Lines 178-184 display our frame and capture keypresses. If q is pressed, we break out of
the frame processing loop and perform cleanup.
Take a moment now to recall the operation of the “home-pi”. The “home-pi” is positioned such
that when a known person or intruder enters the door of your home they are recognized.
The “home-pi” is responsible for communicating with the “driveway-pi” to know when it
should begin face recognition. The “home-pi” is dormant when it is assumed that no occu-
pants are inside the home (i.e. it has detected via background subtraction that a person has
left the home).
The “home-pi” also has full control over lights in the house. In this project, we only turn on
and off a single light; however, you should feel free to hack the code to turn on multiple lights if
you so choose.
With all that said, it is now time to code up the driver script for our “home-pi”. Go ahead and
create a file named door_monitor.py in the home-pi directory and insert the following code:
Lines 2-20 import our packages. Namely, we will use our custom TwilioNotifier to
send SMS messages, Process and Value to perform multiprocessing, and pyHS100
to work with smart plugs and smart bulbs. We’ll also take advantage of Adam Geitgey’s
face_recognition package. The zmq library will enable message passing between our
“home-pi” and “driveway-pi”.
Now that our imports are taken care of, let’s make our four states global:
Be sure to review Section 15.7.1 for a diagram and explanation of the “home-pi” state ma-
chine.
Let’s define our exchange_messages function — it is quite similar to the sister function in
the “driveway-pi” script:
Lines 28-35 begin our exchange_messages function/process. Two process safe vari-
ables, conf and STATE, are passed as parameters. The server connection is bound to the
"port" specified in the config. Once the client (i.e. the “driveway-pi”) connects, we’ll fall into
the while loop beginning on Line 38.
Next, we’ll set up our signal handler, load our config, and initialize our Twilio notifier:
Lines 67-70 parse the --conf command line argument and Line 73 loads the configura-
tion.
With our configuration loaded, we can instantiate our face detector and face recognizer
objects:
76 # load the actual face recognition model, label encoder, and face
77 # detector
78 recognizer = pickle.loads(open(conf["recognizer_path"], "rb").read())
79 le = pickle.loads(open(conf["le_path"], "rb").read())
80 detector = cv2.CascadeClassifier(conf["cascade_path"])
Our face recognizer consists of a Support Vector Machine (SVM) model serialized as a
.pickle file. Our label encoder contains the names of the house occupants our recognizer
can distinguish between.
Be sure to refer to Section 15.5.3 which refers to the face recognition chapters in this
volume so that you can train your face recognizer. We will not be reviewing face_recognition/encode_faces.py or face_recognition/train_model.py in this chapter as they
have already been reviewed.
We will use Haar cascades as our face detector (Line 80). If you prefer to use a deep
learning face detector, you should add a coprocessor such as the Movidius NCS or Google
Coral to conduct face detection. Be sure to refer to Section 15.5.2 for referrals to chapters
involving face detection alternatives.
Let’s go ahead an initialize a handful of housekeeping (pun intended) variables which are
key to door open/closed detection, face recognition, and control flow of our door monitor:
Line 83 initializes the MOG background subtractor object. We will use background sub-
traction as the first step in determining doorOpen status. Line 87 initializes the frameArea
constant which will be calculated once we know our frame dimensions. Depending on the ratio
of the door contour to the frameArea we will be able to determine doorOpen (Line 88) status.
Lines 92-94 initialize the previous and current person names to None and set the consecu-
tive same person count to 0. When the curPerson == prevPerson we will be incrementing
consecCount; the consecutive count will be compared to the threshold set in our configuration
file ("consec_frames").
Lines 100-108 initialize a smart lighting device. Refer to Section 15.6.1 for a full guide on
turning smart lights on and off.
Lines 111-114 assign unique integer values to our four states. Lines 116-121 define
homePiStates, a dictionary holding a string value associated with each state.
Line 125 then initializes the state machine to the STATE_WAITING_FOR_OBJECT state.
Be sure to refer to Section 15.7.1 for an explanation and diagram showing the overview of our
“home-pi” state machine.
Let’s kick off our messageProcess (just like the one we set up for “driveway-pi”):
And from there we’ll initialize our video stream, signal trap, and begin looping over frames:
133 # initialize the video stream and allow the camera sensor to warmup
134 print("[INFO] warming up camera...")
135 # vs = VideoStream(src=0).start()
136 vs = VideoStream(usePiCamera=True).start()
137 time.sleep(2.0)
138
139 # signal trap to handle keyboard interrupt
140 signal.signal(signal.SIGINT, signal_handler)
141 print("[INFO] Press `ctrl + c` to exit, or 'q' to quit if you have" \
142 " the display option on...")
143
144 # loop over the frames of the stream
145 while True:
146 # grab the next frame from the stream and resize it
147 frame = vs.read()
148 frame = imutils.resize(frame, width=600)
149
150 # if we haven't calculated the frame area yet, calculate it
151 if frameArea == None:
152 frameArea = (frame.shape[0] * frame.shape[1])
Line 136 initializes our PiCamera video stream. Line 140 sets our signal trap to handle
keyboard interrupt.
We then begin looping over frames on Line 145. A frame is grabbed and resized (Lines
147 and 148).
Lines 151 and 152 then calculate the frameArea, a key component for calculating our
door ratio soon.
The first task in the face recognition state is to turn_on the smart device (i.e. our light)
if it isn’t already on (Lines 157-163). The light will (1) allow the camera in our house to “see”
someone’s face, and (2) deter intruders from entering the home in the first place.
Lines 167-196 monitor the frame to determine the door open/closed status using the fol-
lowing method:
• We grab the largest motion contour and calculate the contourToFrame ratio (Lines
182-188).
• If the contourToFrame ratio exceeds the "threshold" (in our config), then the door
is marked as open (doorOpen) and a timestamp is made (Lines 193-196).
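A condensed sketch of that heuristic follows; it assumes the mog subtractor, grayscale frame, frameArea, and conf dictionary initialized earlier in the script, and uses MOG2 from OpenCV's main module as a stand-in for the chapter's MOG subtractor.

import cv2
import imutils
from datetime import datetime

# apply background subtraction and find the motion contours
mask = mog.apply(gray)
cnts = cv2.findContours(mask.copy(), cv2.RETR_EXTERNAL,
    cv2.CHAIN_APPROX_SIMPLE)
cnts = imutils.grab_contours(cnts)

if len(cnts) > 0:
    # grab the largest motion contour and compare it to the frame area
    c = max(cnts, key=cv2.contourArea)
    contourToFrame = cv2.contourArea(c) / float(frameArea)

    # a sufficiently large motion region implies the door has opened
    if contourToFrame > conf["threshold"]:
        doorOpen = True
        openedTS = datetime.now()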
If the door is open then we will run face recognition for a predetermined amount of time. It
may be a person we recognize or it may be someone “unknown” (i.e. an intruder). Or if no face
is detected, then it is an intruder (possibly wearing a mask so the face isn’t detected at all).
Line 201 begins the case where the door is already open and it extends until Line 289.
First, the delta time is calculated from when the door was initially opened (Line 205). If
delta is less than our config, "look_for_a_face", then we’ll perform face detection using
Haar cascades (Lines 208-222).
Lines 227 and 228 begin face recognition by extracting encodings for all faces in the image.
Lines 231-233 determine the current person’s name for the highest confidence face in the
image. We then annotate the frame with the person’s name and a box around their face
(Lines 236-242).
Lines 247-253 update the consecCount depending on if the previously recognized person
matches the current person or not.
Line 257 goes ahead and sets the prevPerson name for the next iteration.
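For reference, here is a condensed sketch of the detection-plus-recognition steps walked through above, converting OpenCV's Haar cascade boxes into the ordering the face_recognition package expects. It assumes the gray/rgb versions of the frame and the detector, recognizer, and le objects loaded earlier in the script.

import cv2
import numpy as np
import face_recognition

# detect face bounding boxes with the Haar cascade
rects = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
    minSize=(30, 30), flags=cv2.CASCADE_SCALE_IMAGE)

# convert OpenCV's (x, y, w, h) boxes to (top, right, bottom, left)
boxes = [(y, x + w, y + h, x) for (x, y, w, h) in rects]

# compute the 128-d encodings for each detected face
encodings = face_recognition.face_encodings(rgb, boxes)

for encoding in encodings:
    # the SVM outputs a probability for each known person
    preds = recognizer.predict_proba([encoding])[0]
    j = np.argmax(preds)
    curPerson = le.classes_[j]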
Let’s wrap up the case that the door was already open:
Additionally, if the person is "unknown", Lines 272-275 send a Twilio SMS message to the
homeowner to alert them of the intruder.
Lines 281-283 then update the consecCount and doorOpen in addition to setting the STATE to STATE_SKIP_FRAMES.
The final else block handles the case when no face was detected. Lines 286-289 mark
the door as closed and update the state to STATE_PERSON_LEAVES.
313 device.turn_off()
314 STATE.value = STATE_WAITING_FOR_OBJECT
Otherwise, if the required number of frames have been skipped, then we reset skipFrameCount and reinitialize our MOG background subtractor (Lines 300-302). If the state is STATE_SKIP_FRAMES we update it to STATE_FACE_RECOGNITION (Lines 305-307). Or if the state is STATE_PERSON_LEAVES, then we turn_off the light and set the state to STATE_WAITING_FOR_OBJECT (Lines 310-314).
Lines 317-319 annotate the STATE string in the top left corner of the frame.
Lines 323-325 display the frame and capture keypresses. If the q key is pressed, we
break from the loop and perform cleanup (Lines 328-336).
• “driveway-pi”:
Figure 15.7: IoT security system running and monitoring via my macOS desktop. Green: Separate
VNC sessions with OpenCV windows and terminals so I can monitor states. Red: Laptop webcam
feed so that I can record the live action despite a VNC delay. Blue: IP addresses of both RPis and
the smart plug for lamp control.
• Place one Pi and camera aimed outside where people and/or cars will trigger the
light. For me, this was a “low traffic” area where only the occupants of my home
typically park and enter. If someone else is there, they are likely an intruder.
• Ensure your Pi can see people and vehicles at all times of day. This can be accom-
plished with outdoor flood lighting, possibly infrared lighting, and potentially hacks to
the script to set camera parameters (Chapter 6 of the Hobbyist Bundle) based on
time of day.
• “home-pi”:
• Mount the Pi facing the doorway where the IoT light will illuminate the person’s face.
• Attach a speaker to the Pi for text-to-speech.
• Configure and test the smart IoT lighting device.
• Config files:
• Insert the IP address of the “home-pi” server into the “driveway-pi” config.
• Insert the IP address of the Smart IoT light device into the “home-pi” config.
• Other configuration settings can be made during testing/tuning.
• Face recognition:
Figure 15.8: Demonstration of the "driveway-pi" application running while aimed out the window at
the driveway. High resolution flowchart can be found here: http://pyimg.co/so3c2
From there, I recommend opening two VNC sessions from your laptop/desktop as shown
in Figure 15.7. One connection should be to the “driveway-pi” and the other connection should
be to the “home-pi”. The VNC connections allow you to easily see each terminal and OpenCV
video stream window. You could always hack the scripts to use ImageZMQ at another time.
Next, open the VNC window for your “driveway-pi” and start the application there:
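Based on the project structure and the --conf command line argument reviewed earlier, the launch command looks like this (a reconstruction, not the book's exact listing):

$ cd driveway-pi
$ python detect.py --conf config/config.json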
As you can see in Figures 15.8 and 15.9, the states are being updated based on both what
is happening in the frame and the messages exchanged between the Pis.
The system works fairly reliably, but could be improved by faster object detection (i.e. Mo-
vidius NCS or Google Coral). Additionally the system does not work well when two people
have entered the house and one person later leaves the house. The system was designed
with simplicity in mind and will work best if only one of the house occupants is present in the
house at any given time. With additional logic these edge cases could be handled.
Yes. If there is a will, there is a way. Intruders can fool any system, but typically they need
prior knowledge of its inner workings. For COTS systems, they can look up how systems work
online and/or purchase the devices to tinker with. For a custom Raspberry Pi system they are
unfamiliar with, it will be more challenging. If they pick up a copy of this book and learn about
how the system works, they may try to:
• Disconnect the power from the house by pulling the meter off the side of your house.
Most COTS security systems these days have battery backup and a cellular modem to
counteract this move.
• Wear a mask and slip in undetected (the “home-pi” only performs face recognition and
not “person” detection).
• Hack into your network to send your Pis messages (although they would first need
knowledge of the messages the Pis accept).
• Hack into your network and tinker with the lighting controls so they can enter in the dark-
ness.
In other words, if an intruder wants your stuff or wants you, they will likely get what they
want if they are determined. That shouldn’t restrict you from working to engineer something
unique and great though!
As with any case study and fully-fledged Raspberry Pi security project, you should be thinking
about how to ensure that your project is reliable and that it interoperates with other projects in
your home.
But before you consider adding features, ensure that a minimum set of features is working – the Minimum Viable Product (MVP). The MVP should consist of components that you have
near full control over (i.e. any code you, yourself develop with Python for your Raspberry Pi).
As you integrate other APIs and devices, consider that those devices will undergo software
updates that may be out of your control – they could impact the operation of your system.
When you have your project working, it is time to integrate it with other IoT devices in
your home such as COTS systems including Amazon Alexa, thermostats, door lock actuators,
window/door/motion sensors, and more.
You may find that your options with Z-Wave wireless devices outnumber WiFi devices. Here
are some questions for you to consider as you build upon this project:
• Will facial recognition automatically disarm the alarm? Or is that too risky?
Figure 15.9: Demonstration of the "home-pi" application running by an entry door. High resolution
flowchart can be found here: http://pyimg.co/y1e59
• Do they have a well documented API or are you relying on a vulnerability that may be
patched in the future? Remember, this chapter relies on an unofficial API for integrating
with TP-Link Kasa products. TP-Link could send security updates to their devices at any
time and then you will no-longer have control over your lights.
• What other chapters of this book can you integrate into the codebase?
• Do you need to provide guests (i.e. non-intruders, but people whose faces are not in your trained model) with a gesture recognition password? Refer to Chapter 9 of the
Hacker Bundle on building a gesture recognition system. How will you integrate message
passing with your gesture recognition system? Will the gesture recognition system turn
off your home alarm system?
• Do you need to capture license plates of cars that enter your driveway? Refer to the
Gurus Module 6 for license plate detection and recognition.
• Can you use multiple cameras and image stitching? Refer to Chapter 15 of the Hobbyist
Bundle.
• Do you need two or three “driveway-pis” on different sides of your home? If so, how will
the “home-pi” handle potential messages from each of these Pis?
• Or will there be multiple “home-pis” each aimed at a different entry door of a large home?
If so, will you perform facial recognition at each one?
• Will your project be “night-ready”? Maybe you can rely on motion sensor flood lights
outside your home for the “driveway-pi”. Perhaps you can schedule camera parameter
updates to account for changes in ambient light. Or maybe you could implement a sec-
ondary camera specifically for night vision (i.e. infrared or thermal).
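To make the TP-Link point concrete, here is a minimal sketch of how the unofficial pyHS100 library referenced in this chapter can be used to toggle a TP-Link Kasa smart plug from Python. The IP address is a placeholder for your own device, and because the API is unofficial, the exact behavior may differ from your firmware version:

# Minimal sketch: toggling a TP-Link Kasa smart plug with the unofficial
# pyHS100 library (https://github.com/GadgetReactor/pyHS100).
# The IP address below is a placeholder; substitute your own device's address.
from pyHS100 import SmartPlug

# connect to the smart plug on the local network
plug = SmartPlug("192.168.1.100")

# report the current state of the outlet ("ON" or "OFF")
print("[INFO] current state: {}".format(plug.state))

# turn the lights on (e.g., when the "home-pi" detects activity at night),
# then back off again
plug.turn_on()
plug.turn_off()

If TP-Link patches the local protocol these calls depend on, the library (and your lighting automation) will stop working, which is exactly why you should treat such integrations as additions to, not part of, your MVP.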
15.9 Summary
In this case study chapter, we brought together a number of concepts to build a multi-Pi security system.
We worked with message passing via sockets to send commands between the Pis, and we used the pretrained MobileNet SSD object detector to find people and cars in frames. Our TP-Link smart lighting devices were controlled with a simple API based on the state of our system. On each Pi we developed a state machine to make our software easier to understand and easier for future developers to extend. One of those states involved face recognition to determine who is entering the home: we first checked whether the door was open via background subtraction and contour processing, then either welcomed the known home dweller or scared off the unknown intruder with an audible text-to-speech message. In the event that an intruder was present, we alerted the homeowner with a Twilio text message.
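As a quick refresher on that door-open check, the sketch below shows the general background subtraction and contour processing pattern with OpenCV. It is not the chapter's exact implementation; the camera source, blur kernel, and minimum contour area are placeholder values you would tune for your own doorway:

# Sketch of a door-open check via background subtraction and contour
# processing. The camera index, blur kernel, and MIN_AREA threshold are
# placeholder values, not the chapter's exact settings.
import cv2
import imutils

vs = cv2.VideoCapture(0)                 # door-facing camera
bg = cv2.createBackgroundSubtractorMOG2(history=100, varThreshold=40,
    detectShadows=False)
MIN_AREA = 5000                          # min foreground area to call the door "open"

while True:
    (grabbed, frame) = vs.read()
    if not grabbed:
        break

    # blur to suppress sensor noise, then compute the foreground mask
    blurred = cv2.GaussianBlur(frame, (5, 5), 0)
    mask = bg.apply(blurred)

    # clean up the mask and find contours of the changed regions
    mask = cv2.erode(mask, None, iterations=2)
    mask = cv2.dilate(mask, None, iterations=2)
    cnts = cv2.findContours(mask, cv2.RETR_EXTERNAL,
        cv2.CHAIN_APPROX_SIMPLE)
    cnts = imutils.grab_contours(cnts)

    # a sufficiently large foreground region means the door has moved
    if any(cv2.contourArea(c) > MIN_AREA for c in cnts):
        print("[INFO] door appears to be open")

vs.release()

In the chapter's application, this kind of check is what triggers the transition into the face recognition state of the state machine.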
This chapter served as the culmination of nearly all concepts covered in the first two vol-
umes of this book. Great job implementing this project!
Chapter 16

Your Next Steps
Congratulations! You have just finished the Hacker Bundle of Raspberry Pi for Computer Vi-
sion. Let’s take a second and review the skills you’ve acquired.
• Learned how to use ImageZMQ and message passing programming paradigms to stream frames from RPi clients to a central server for processing (a refresher sketch follows this list).
• Utilized message passing for non-image data, enabling you to build IoT applications.
• Built a security system capable of monitoring "restricted zones" using a cascade of back-
ground subtraction, the YOLO object detector, and ImageZMQ.
• Monitored the front door of your home and performed face detection to identify the people
entering your house.
• Utilized the RPi, OpenVINO, and the Movidius NCS to detect vehicles in video streams, track them, and apply speed estimation to determine the speed (in MPH/KPH) of each moving vehicle.
• Helped prevent package theft by automatically detecting and recognizing delivery vehi-
cles.
• Discovered what the Intel Movidius NCS is and how we can use it to improve the speed
of deep learning inference/prediction on the RPi.
• Learned how to apply both deep learning-based face detectors and face recognizers at
the same time on a single NCS stick.
• Created a multi-RPi IoT project capable of monitoring your driveway for vehicles and then
communicating with a second RPi to monitor your front door for people entering your
home.
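If you would like a refresher on the ImageZMQ streaming pattern from the first bullet above, here is a minimal two-script sketch. The server hostname, port, and camera index are placeholders, not the exact values used in the chapters:

# client.py: runs on each RPi and streams frames to the central server.
# "your-server-ip" is a placeholder for the server's hostname or IP address.
import socket
import cv2
import imagezmq

sender = imagezmq.ImageSender(connect_to="tcp://your-server-ip:5555")
rpiName = socket.gethostname()          # identifies this RPi to the server
vs = cv2.VideoCapture(0)

while True:
    (grabbed, frame) = vs.read()
    if not grabbed:
        break
    # blocks until the server acknowledges the frame (REQ/REP pattern)
    sender.send_image(rpiName, frame)

# server.py: runs on the central machine and displays incoming frames.
import cv2
import imagezmq

imageHub = imagezmq.ImageHub()          # binds to tcp://*:5555 by default

while True:
    (rpiName, frame) = imageHub.recv_image()
    cv2.imshow(rpiName, frame)          # one window per sending RPi
    imageHub.send_reply(b"OK")          # acknowledge so the client can continue
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break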
At this point you understand the fundamentals of applying deep learning on resource-constrained devices such as the Raspberry Pi.
However, if you want to take a deeper dive into deep learning on embedded devices, I would suggest you move on to the Complete Bundle, where you will learn how to:
• Perform image classification using pre-trained models with the Google Coral
• Decide between the RPi, Movidius NCS, Google Coral, or Jetson Nano when confronted
with a new project
I hope you'll allow me to continue to guide you on your journey. If you haven't already picked up a copy of the Complete Bundle, you can do so here:
http://pyimg.co/rpi4cv
http://pyimg.co/contact
Cheers,
-Adrian Rosebrock