Object-Detection-Based Video Compression for Wireless Surveillance Systems

Lingchao Kong and Rui Dai, University of Cincinnati

To obtain better object-detection performance on compressed videos, this standard-compliant video-encoding scheme introduces new mode-decision strategies to suppress unnecessary temporal fluctuation in stable background areas while maintaining acceptable rate-distortion performance.

Wireless embedded camera sensors play crucial roles in various distributed surveillance applications, including those for border patrol, traffic monitoring, and environmental monitoring. In many distributed wireless surveillance systems,1 camera sensors report their video observations to a central base station through wireless communication. Given the embedded cameras' low computing power and limited energy and bandwidth, raw videos acquired by cameras are usually preprocessed, encoded, and compressed before being delivered to the base station.2,3 A powerful central server or data center at the base station can fully utilize its computing capability to perform data fusion on videos from multiple cameras, producing a much better result than what is available from individual cameras.4

A typical automatic surveillance system includes five stages: object detection, object classification, object tracking, understanding and description of behaviors, and human identification.4 Object detection is the first and most essential step of the entire procedure, because detecting the object provides a focus of attention for later processes, such as tracking and behavior analysis. However, the inevitable degradation of video quality caused by lossy compression at embedded cameras significantly impacts object detection.5,6 To address this, video encoders for surveillance systems should be designed to improve object-detection performance.

In our recent work, we studied the effects of lossy compression on object detection.7 Unlike human beings, who can easily extract and focus on a moving object in a blurred video, computer vision algorithms can be significantly affected by temporal fluctuation in background areas. All modern video-compression standards—such as H.264/Advanced Video Coding (H.264/AVC) and the latest High Efficiency Video Coding (HEVC, also known as H.265)—use the block-based hybrid approach, which includes intra- and interpicture prediction and 2D transform coding.8 As Figure 1 shows, this approach measures the encoding distortion by comparing the encoded video with the original video (the A direction), but it does not measure the temporal-domain fluctuation in the encoded video (the B direction). This strategy results in temporal fluctuation when colocated regions of consecutive frames—such as f_{t-1} and f_t—are not consistently encoded, especially when intraframes are periodically inserted at low and medium bit rates.

As the "Related Work on Temporal Fluctuation" sidebar shows, existing approaches are designed to optimize human visual perception; in contrast, we aim to address the temporal-fluctuation problem itself to improve object-detection performance. This approach is worthwhile because human and computer vision systems might respond differently to an encoded video. Our new temporal-fluctuation-reduced video-encoding (TFRE) scheme suppresses unnecessary temporal fluctuations in stable background areas. TFRE is designed to comply with the standardized hybrid block-based video-coding architecture, and it uses the sum-of-absolute frame difference (SFD) to measure the degree of temporal fluctuation in stable background areas.
Related Work on Temporal Fluctuation
Many researchers have investigated the problem of temporal fluctuation with the objective of improving the perceptual quality of compressed videos. The temporal fluctuation that humans perceive is defined as flicker, which usually refers to frequent perceptual changes in luminance or chrominance that do not appear in uncompressed raw videos.1

Researchers have proposed a temporal low-pass filtering scheme that smooths luminance changes on a block-by-block basis,1 as well as a two-pass coding scheme in which a first pass of simplified P-frame coding derives a no-flicker reference of the current frame and a second pass of actual I-frame coding uses small quantization parameters to closely approach that reference.2 Other researchers have proposed a modified distortion measure to reduce flicker; this approach considers the distortions in both the A and B directions (see Figure 1 in the main text) and applies the measure during the rate-distortion-optimized selection of the intraprediction mode.3 To reduce the flicker artifact in High Efficiency Video Coding (HEVC), researchers have proposed a region-classification-based rate control for coding tree units in I-frames that improves the reconstructed quality of I-frames.4
References
1. A. Jimenez-Moreno et al., "Standard-Compliant Low-Pass Temporal Filter to Reduce the Perceived Flicker Artifact," IEEE Trans. Multimedia, vol. 16, no. 7, 2014, pp. 1863–1873.
2. H. Yang, J.M. Boyce, and A. Stein, "Effective Flicker Removal from Periodic Intra Frames and Accurate Flicker Measurement," Proc. Int'l Conf. Image Processing (ICIP), 2008, pp. 2868–2871.
3. S.S. Chun, J.-R. Kim, and S. Sull, "Intra Prediction Mode Selection for Flicker Reduction in H.264/AVC," IEEE Trans. Consumer Electronics, vol. 52, no. 4, 2006, pp. 1303–1310.
4. P. Wang et al., "Region-Classification-Based Rate Control for Flicker Suppression of I-Frames in HEVC," Proc. Int'l Conf. Image Processing (ICIP), 2013, pp. 1986–1990.
Figure 1. Schematic diagram of temporal fluctuation. The traditional block-based hybrid approach does not measure the temporal-domain fluctuation in an encoded video.

Preliminary Study
We constructed a distorted video database to study the impact of lossy compression on object-detection performance.7 Eight raw video sequences with different spatial and temporal details were selected: three traffic videos—container, GRAM Road-Traffic Monitoring (GR), and GRAM Road-Traffic Monitoring HD (GRHD); three indoor videos—hall, horizontal, and overlook; and two outdoor videos—people and vehicle. Figure 2 shows snapshots of these videos. The open source H.264/AVC encoder x264 (www.videolan.org/developers/x264.html) was used to compress the raw videos. The one-pass constant quantization parameter (QP) mode with medium speed was applied in the x264 encoder, and the length of the group of pictures (GOP) was set to 20 with the IPPP structure (where I denotes an intraframe and P a predictive frame). Each raw video was compressed using 19 different QPs ranging from 22 to 40, resulting in a total of 152 compressed videos. We chose three object-detection algorithms from different categories10 to run on the compressed videos: the Gaussian mixture model (GMM) algorithm, the algorithm that combines statistical background estimation and per-pixel Bayesian segmentation (referred to as the GMG algorithm), and the adaptive background learning (ABL) algorithm.
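To make the setup concrete, the following sketch reproduces these compression settings through the x264 command-line interface, invoked from Python. The file names are hypothetical, and the exact flag set is our assumption about how to obtain a strictly periodic GOP of 20 with an IPPP structure; it is not taken from the article.

```python
import subprocess

def compress(raw_y4m: str, out_264: str, qp: int) -> None:
    """Encode one clip with one-pass constant-QP x264 settings."""
    subprocess.run(
        ["x264",
         "--preset", "medium",    # medium speed, as in the study
         "--qp", str(qp),         # one-pass constant quantization parameter
         "--keyint", "20",        # GOP length of 20 ...
         "--min-keyint", "20",    # ... with strictly periodic I-frames
         "--no-scenecut",         # disable adaptive I-frame insertion
         "--bframes", "0",        # IPPP structure: no B-frames
         "-o", out_264, raw_y4m],
        check=True)

# 19 QPs from 22 to 40 give 19 compressed copies per sequence,
# or 152 compressed videos over the eight sequences.
for qp in range(22, 41):
    compress("hall.y4m", f"hall_qp{qp}.264", qp)  # hypothetical file names
```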
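The three detector categories can be approximated with off-the-shelf background-subtraction implementations. The sketch below uses OpenCV's MOG2 as a stand-in for the GMM detector, the bgsegm GMG implementation (from opencv-contrib-python) for the GMG detector, and a simple running-average model as a stand-in for ABL; these substitutions and the input file name are our assumptions, not the article's exact implementations.

```python
import cv2
import numpy as np

gmm = cv2.createBackgroundSubtractorMOG2()       # GMM-style detector
gmg = cv2.bgsegm.createBackgroundSubtractorGMG() # GMG (needs ~120 warm-up frames)

def abl_mask(frame_gray, background, alpha=0.05, thresh=30):
    """Adaptive background learning stand-in: running average + threshold."""
    mask = np.abs(frame_gray.astype(np.float32) - background) > thresh
    background += alpha * (frame_gray - background)  # adapt the model in place
    return (mask * 255).astype(np.uint8)

cap = cv2.VideoCapture("hall_qp28.264")  # hypothetical compressed clip
ok, frame = cap.read()
background = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
while ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    fg_gmm = gmm.apply(frame)        # foreground masks from each detector
    fg_gmg = gmg.apply(frame)
    fg_abl = abl_mask(gray, background)
    ok, frame = cap.read()
```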
Figure 2. Eight video sequences with different spatial and temporal details were used for this study. There are three traffic videos: (a) container, (b) GRAM Road-Traffic Monitoring (GR), and (c) GRAM Road-Traffic Monitoring HD (GRHD); three indoor videos: (d) hall, (e) horizontal, and (f) overlook; and two outdoor videos: (g) people and (h) vehicle.
Object-detection results from uncompressed raw videos are regarded as the ground truth, and object-detection results on compressed videos are compared against this ground truth. We use the commonly known Recall and Precision measures to quantify object-detection performance: Recall = TP/(TP + FN) and Precision = TP/(TP + FP), where TP, FN, and FP stand for the numbers of true positive pixels, false negative pixels, and false positive pixels, respectively. Recall and Precision selectively evaluate the levels of missed and mistaken detections; to measure the overall performance, we use the F1 score, the harmonic mean of Recall and Precision.
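A minimal sketch of these pixel-level measures, assuming the detection masks are boolean NumPy arrays of equal shape:

```python
import numpy as np

def pixel_metrics(detected: np.ndarray, truth: np.ndarray):
    """Pixel-level Recall, Precision, and F1 for binary foreground masks.

    `truth` is the mask detected on the uncompressed video (ground truth);
    `detected` is the mask from the same detector on the compressed video.
    """
    tp = np.count_nonzero(detected & truth)    # true positive pixels
    fp = np.count_nonzero(detected & ~truth)   # false positive pixels
    fn = np.count_nonzero(~detected & truth)   # false negative pixels
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, precision, f1
```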
The performance of object-detection algorithms can be affected by the background's quality, and the video-coding procedure might introduce temporal fluctuations in the background that cause FP. To describe the degree of temporal fluctuation in stable background areas, we introduce the SFD, computed for each 16 x 16 macroblock (MB) between the current frame and the previous frame:

SFD = \sum_{i,j=1}^{16} \left| m_t(i,j) - m_{t-1}(i,j) \right|,

where m_t(i, j) is the reconstructed pixel value at location (i, j) in an MB of the current frame and m_{t-1}(i, j) is the reconstructed pixel value in the previous frame's corresponding MB.

Figure 3. False positive pixels (FP) and the sum-of-absolute frame difference (SFD). FP increases when SFD increases.

We collected SFD and FP samples from stable background areas for all the compressed videos in the aforementioned dataset. Figure 3 shows the relationship between SFD and FP in our test data: FP grows when SFD increases. We conducted an analysis of variance (ANOVA) on the pairs of FP and SFD, where a small p-value (p < 0.01) indicates a significant correlation.12 The resulting p-values are close to 0 and much smaller than 0.01, indicating that FP is closely associated with SFD.
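The SFD map is straightforward to compute from two reconstructed luma frames. A minimal sketch, assuming frame dimensions that are multiples of the 16-pixel MB size:

```python
import numpy as np

def macroblock_sfd(frame_t: np.ndarray, frame_prev: np.ndarray, mb: int = 16) -> np.ndarray:
    """Sum-of-absolute frame difference (SFD) per 16x16 macroblock.

    frame_t, frame_prev: reconstructed luma planes (H x W, uint8),
    with H and W assumed to be multiples of the macroblock size.
    Returns an (H/mb, W/mb) array whose entry (r, c) is the SFD of MB (r, c).
    """
    diff = np.abs(frame_t.astype(np.int32) - frame_prev.astype(np.int32))
    h, w = diff.shape
    # Reshape into (rows, mb, cols, mb) blocks and sum within each block.
    blocks = diff.reshape(h // mb, mb, w // mb, mb)
    return blocks.sum(axis=(1, 3))
```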
1: if current MB belongs to stable background then
2:   for each available type mode M_t^i do
3:     encode current MB and store C_t^i
4:     calculate and store SFD_t^i
5:   end for
6:   sort the records in ascending order of SFD and obtain the valid number of records (N_t)
7:   obtain SFD_t^th based on N_t^top = ⌈N_t × P_top⌉
8:   find the minimum C_t, subject to SFD_t^i ≤ SFD_t^th
9:   output the corresponding M_t^* as the selected type mode
10:  for each available prediction mode M_p^i of the selected type M_t^* do
11:    encode current MB and store C_p^i
12:    calculate and store SFD_p^i
13:  end for
14:  sort the records in ascending order of SFD and obtain the valid number of records (N_p)
15:  obtain SFD_p^th based on N_p^top = ⌈N_p × P_top⌉
16:  find the minimum C_p, subject to SFD_p^i ≤ SFD_p^th
17:  output the corresponding M_p^* as the selected prediction mode
18: end if

Figure 5. Algorithm 1: Intraframe joint temporal-fluctuation and rate-distortion (T-RD) mode selection. Joint T-RD selection first determines the best type mode and then determines the best prediction mode.
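The core of Algorithm 1 is a constrained argmin: keep the fraction P_top of candidates with the smallest SFD, then choose the cheapest mode among them. A minimal sketch, where the record format and the P_top value are illustrative assumptions rather than values from the article:

```python
import math

def select_mode(records, p_top=0.5):
    """Joint T-RD selection over candidate (mode, rd_cost, sfd) records.

    Keep the ceil(N * p_top) candidates with the smallest SFD, then pick
    the minimum-RD-cost mode among them. p_top is a tunable fraction.
    """
    by_sfd = sorted(records, key=lambda r: r[2])   # ascending SFD
    n_top = math.ceil(len(by_sfd) * p_top)
    sfd_th = by_sfd[n_top - 1][2]                  # SFD threshold
    eligible = [r for r in by_sfd if r[2] <= sfd_th]
    return min(eligible, key=lambda r: r[1])[0]    # min cost, SFD-constrained

# Algorithm 1 applies this selection twice: first over the intra type
# modes, then over the prediction modes of the winning type.
type_mode = select_mode([("I16x16", 1210.0, 35), ("I4x4", 1180.0, 90)])
```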
Interframe Coding/Mode Selection
A typical interframe analysis process includes three steps:

1. Probe the P_SKIP mode—that is, encode the current MB assuming no encoding residuals and no motion vector (MV) difference, using only the predictive MV. A decimate score is computed, indicating whether the discrete cosine transform (DCT) coefficients could be set to 0, given the DCT coefficients after the actual encoding of this inter-MB.13 If the decimate score of the current MB is less than 6, then the current MB can be encoded as P_SKIP and the analysis returns.

2. Otherwise, the other interprediction modes—P_16x16, P_8x16, P_16x8, P_8x8, P_4x8, P_8x4, and P_4x4—are all tried, the corresponding MVs are estimated, and a search is also performed over the intramodes.

3. Run the RDO process and determine the best mode from all available modes.

However, the typical interframe analysis process can result in temporal fluctuation in stable background areas, which reduces the accuracy of object detection. For example, the first row of Figure 6 shows three consecutive interframes of the GR video clip (Figure 6a–c). In this figure, each block represents one MB unit; the yellow, blue, and red colors denote the P_SKIP mode, other intermodes, and intramodes, respectively. The P_SKIP location distribution clearly fluctuates across these consecutive interframes. For an MB in the stable background area, when the intermode changes from P_SKIP to another interprediction mode, or vice versa, in consecutive frames, a temporal fluctuation occurs in the encoded frames; object detection might mistake this fluctuation for the appearance of a new object, producing FP.

We propose reducing temporal fluctuation by designing new criteria in the analysis of inter type modes. Specifically, we expect to classify more MBs in stable background areas as P_SKIP—or to set the MVs of these MBs to zero—while maintaining an acceptable level of the traditional distortion measure, the sum of squared differences (SSD) between the intensities of an original MB and the intensities of its encoded version. Based on the typical inter-MB analysis process, we designed new schemes for the probe P_SKIP process and for the analysis of the P_16x16 mode; a sketch of their decision tests follows the listings below.

For MBs that do not satisfy the original criterion in the probe P_SKIP process,13 we compare the encoding option of P_SKIP with the encoding option of using the predictive MV; if the P_SKIP option yields less SFD while maintaining acceptable SSD, the current MB is set as P_SKIP. Algorithm 2 (see Figure 7) shows the steps, where SSD_r and SFD_r are the SSD and SFD of the MB reconstructed from the predictive MV; SSD_s and SFD_s are the SSD and SFD of the current MB assuming P_SKIP encoding; and d_w and s_w are weight variables that can be customized by encoders.

Furthermore, to analyze the P_16x16 mode, we design an interframe P_16x16 direct copy mode that directly copies from the corresponding MB in the previous frame, exploiting the negligible motion in stable background areas. If the distortion incurred by assuming no motion is comparable to the distortion of the MB reconstructed after motion estimation, the process skips the other intermode analyses and jumps to the encode-current-MB step without the RDO process, as Figure 4 shows. Algorithm 3 (Figure 8) describes the detailed steps of the interframe P_16x16 direct copy mode, where SSD_me is the MB distortion based on the motion vector MV_me obtained from motion estimation, SSD_dc is the MB distortion under the assumption that there is no motion and a direct copy of the previous frame's corresponding MB is applied, and d_w is a custom weight parameter that restricts SSD_dc to within a threshold of d_w × SSD_me.
Figure 6. Fluctuation of the P_SKIP distribution in frames 8–10 of the GR video. The top row (a–c) shows the results of the x264 implementation; the bottom row (d–f) shows results from the TFRE scheme.

1: Input: decimate score of current MB
2: if decimate score of current MB < 6 then
3:   current MB is set as P_SKIP
4:   return
5: else if current MB belongs to stable background then
6:   encode current MB based on the predictive MV
7:   calculate SSD_r and SFD_r based on the reconstructed MB
8:   calculate SSD_s and SFD_s assuming the current MB is P_SKIP
9:   if SSD_s ≤ d_w × SSD_r and SFD_s ≤ s_w × SFD_r then
10:    current MB is set as P_SKIP
11:    return
12:  end if
13: end if

Figure 7. Algorithm 2: Interframe probe P_SKIP algorithm. It compares the encoding option of P_SKIP with the encoding option of using the predictive MV, aiming to introduce less temporal fluctuation while maintaining acceptable distortion.

1: Input: MV_me after motion estimation in P_16x16 inter analysis
2: if current MB belongs to stable background then
3:   encode current MB based on MV_me
4:   calculate SSD_me based on the reconstructed MB
5:   calculate SSD_dc assuming the current MB uses direct copy mode
6:   if SSD_dc ≤ d_w × SSD_me then
7:     current MB is set as P_16x16 direct copy mode
8:     return
9:   end if
10: end if

Figure 8. Algorithm 3: Interframe P_16x16 direct copy mode. It directly copies from the corresponding MB in the previous frame when there is only negligible motion in the stable background area.

To demonstrate how effectively Algorithms 2 and 3 reduce temporal fluctuation, Figure 6d–f shows an example of the proposed intercoding scheme applied to three consecutive interframes of the GR clip. Compared with the images in Figure 6a–c, which show results from the standard interanalysis process, the proposed scheme encodes more background MBs as P_SKIP, and the P_SKIP distribution remains stable across consecutive frames.
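The decision tests at the heart of Algorithms 2 and 3 reduce to two threshold comparisons. A minimal sketch, with placeholder weight values (the article leaves d_w and s_w encoder-tunable):

```python
def probe_p_skip(ssd_r, sfd_r, ssd_s, sfd_s, d_w=1.1, s_w=1.0):
    """Algorithm 2 core test: prefer P_SKIP when it does not raise SSD
    beyond d_w times the predictive-MV reconstruction and does not add
    temporal fluctuation beyond s_w times its SFD. The weight values
    here are illustrative placeholders."""
    return ssd_s <= d_w * ssd_r and sfd_s <= s_w * sfd_r

def try_direct_copy(ssd_me, ssd_dc, d_w=1.1):
    """Algorithm 3 core test: accept the P_16x16 direct copy mode when
    copying the colocated MB costs at most d_w times the distortion of
    the motion-estimated reconstruction."""
    return ssd_dc <= d_w * ssd_me
```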
Performance Evaluation
We compared the proposed TFRE scheme's performance with that of the H.264/AVC-based open source encoder x264 and the reducing-flicker video-coding approach (RFC).14 The objective of RFC is to improve perceptual video quality by reducing flicker effects; it considers the distortions not only between the encoded video and the original video but also in the encoded video's temporal domain during the intra-RDO process. We used the eight raw videos shown in Figure 2 for this test; Table 1 summarizes the compression settings. The x264 encoder (version 0.142.x) was configured to encode videos using the one-pass constant QP mode with medium speed. We applied the aforementioned three object-detection algorithms (GMM, GMG, and ABL) to these compressed videos.

We first evaluate object-detection performance in terms of TP and FP levels; Figure 9 shows the results.
TFRE achieves modest gains for GMG (0.94–2.54 percent) and GMM (0.54–2.57 percent).

We summarize the average Recall, Precision, and F1 scores over the eight videos for the three object-detection algorithms in Table 2. The numbers in the D1 and D2 columns denote the gains of TFRE over the x264 encoder and over RFC, respectively. Three points can be made based on the results over 10 different QP values:

- RFC's performance is comparable with that of x264, regardless of which measure is used.
Figure 9. True positive (TP) pixels (top row) and false positive (FP) pixels (bottom row) versus QP for three object-detection algorithms: the (a) adaptive background learning (ABL) algorithm, (b) GMG algorithm, and (c) Gaussian mixture model (GMM) algorithm. Each panel compares RFC, x264, and TFRE.
Figure 10. F1 scores versus QP for the eight test videos: (a) container, (b) GR, (c) GRHD, (d) hall, (e) horizontal, (f) overlook, (g) people, and (h) vehicle, for the ABL, GMG, and GMM algorithms under RFC, x264, and TFRE.
Table 2. Average object-detection results for various algorithms.
* Reducing-flicker video coding.
† Temporal-fluctuation-reduced video coding.
Figure 11. Rate-distortion curves: (a) peak signal-to-noise ratio (PSNR) versus bit rate and (b) structural similarity (SSIM) versus bit rate for all the videos.
For the other video sequences, TFRE performs similarly to x264. Compared with x264 encoding, TFRE's PSNR and SSIM values decrease slightly, by 0.237 dB and 0.0046 on average, respectively, whereas TFRE decreases the bit rate by 2.45 kbps on average. The slight decrease in bit rate occurs because TFRE encodes more inter-MBs as P_SKIP. Generally speaking, TFRE's rate-distortion performance is comparable with that of the x264 encoder.

References
8. G.J. Sullivan et al., "Overview of the High Efficiency Video Coding (HEVC) Standard," IEEE Trans. Circuits and Systems for Video Technology, vol. 22, no. 12, 2012, pp. 1649–1668.
9. L. Kong and R. Dai, "Temporal-Fluctuation-