Deep Learning in Object Detection, PDF
Pedestrian parsing
Human pose estimation
Deep learning
Crowd segmentation
Pedestrian detection
Crowd tracking
1986
• Hard to train
• Insufficient computational resources
• Small training sets
• Does not work well
Neural network
Back propagation
1986 2006
• ImageNet 2013
Detection
Pedestrian detection
Facial keypoint detection
Segmentation
Face parsing
Pedestrian parsing
Recognition
Face verification
Face attribute recognition
Pedestrian Detection
• Improves the state-of-the-art average miss detection rate on the largest Caltech dataset from 63% to 39% (ICCV’13)
Convolution Pooling
Classical Deep Models
• Deep belief net
– Hinton’06
Opinion I
• How to formulate a vision problem with deep learning?
– Make use of experience and insights obtained in CV research
– Sequential design/learning vs joint learning
– Effectively train a deep model (layerwise pre-training + fine tuning)
Spatial ↔ multi-level
pyramid pooling
SVM + feature
smoothness, shape prior…
Input → high-dimensional data transform → Output
Opinion III
• Deep learning likes challenging tasks (for better
generalization)
– Make input data more challenging (augmenting data by
translating, rotating, and scaling)
– Make training process more challenging (dropout:
randomly setting some responses to zero; dropconnect:
randomly setting some weights to zero)
– Make prediction more challenging
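The difference between dropout and dropconnect described above can be sketched in a few lines. This is a minimal illustration, not the training code behind these slides: the function names, the toy weight matrix, and the drop probability are all assumptions, and the rescaling of surviving values that practical implementations usually apply is omitted.

```python
import random

def linear(weights, x):
    """Compute y = W x, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def dropout(responses, p, rng):
    """Dropout: set each response (activation) to zero with probability p."""
    return [0.0 if rng.random() < p else r for r in responses]

def dropconnect_linear(weights, x, p, rng):
    """DropConnect: set each weight to zero with probability p, then apply the layer."""
    masked = [[0.0 if rng.random() < p else w for w in row] for row in weights]
    return linear(masked, x)

# Toy layer (hypothetical values chosen only for illustration).
rng = random.Random(0)
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, 1.0]

y = linear(W, x)                                    # clean responses
y_dropout = dropout(y, 0.5, rng)                    # whole responses are zeroed
y_dropconnect = dropconnect_linear(W, x, 0.5, rng)  # individual weights are zeroed
```

Dropout removes an entire response at once, while dropconnect removes individual connections, so a dropconnect response can lose only part of its input.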
Learning feature through face
verification (predicting 0/1 label):
92.57% on LFW with 480 CNNs
Z. Zhu, P. Luo, X. Wang, and X. Tang, “Deep Learning Identity-Preserving Face Space,” ICCV 2013.
Joint Deep Learning
What if we treat an existing deep model as
a black box in pedestrian detection?
ConvNet−U−MS
– P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, “Pedestrian Detection with
Unsupervised Multi-Stage Feature Learning,” CVPR 2013.
[Figures: results on Caltech Test; results on ETHZ]
• N. Dalal and B. Triggs. Histograms of oriented gradients for human detection.
CVPR, 2005. (6000 citations)
• P. Felzenszwalb, D. McAllester, and D. Ramanan. A Discriminatively Trained,
Multiscale, Deformable Part Model. CVPR, 2008. (2000 citations)
• W. Ouyang and X. Wang. A Discriminative Deep Model for Pedestrian Detection
with Occlusion Handling. CVPR, 2012.
Our Joint Deep Learning Model
Modeling Part Detectors
• Design the filters in the second
convolutional layer with variable sizes
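A minimal sketch of what variable-size filters mean in practice: each part filter is slid over the same feature map, and the size of its output map depends on the filter size. The filter names, shapes, and values below are hypothetical, chosen only for illustration, and are not the part filters of the model presented here.

```python
def correlate_valid(feature_map, kernel):
    """Slide the kernel over the feature map ("valid" mode, no padding)."""
    fh, fw = len(feature_map), len(feature_map[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(fh - kh + 1):
        row = []
        for j in range(fw - kw + 1):
            row.append(sum(feature_map[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

# Toy 5x4 feature map standing in for the output of the first layer.
feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(5)]

# Hypothetical part filters of different sizes (all-ones weights for simplicity).
part_filters = {
    "small_part": [[1.0] * 2 for _ in range(2)],  # 2x2
    "wide_part": [[1.0] * 4 for _ in range(2)],   # 2x4
    "tall_part": [[1.0] * 2 for _ in range(4)],   # 4x2
}

# One part detection map per filter; map sizes differ with filter size.
part_maps = {name: correlate_valid(feature_map, k)
             for name, k in part_filters.items()}
```

A larger filter covers a larger body region and therefore yields a smaller detection map over the same window.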
Part models learned
from HOG
[Figure: average miss rate (%) on Caltech by year, 2000–2014. Labeled points: 68%, 63% (state-of-the-art), 53%, 39% (best performing). Annotation: improve by ~20%.]
W. Ouyang and X. Wang, “A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling,” CVPR 2012.
W. Ouyang, X. Zeng, and X. Wang, “Modeling Mutual Visibility Relationship in Pedestrian Detection,” CVPR 2013.
W. Ouyang and X. Wang, “Single-Pedestrian Detection aided by Multi-pedestrian Detection,” CVPR 2013.
X. Zeng, W. Ouyang, and X. Wang, “A Cascaded Deep Learning Architecture for Pedestrian Detection,” ICCV 2013.
W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” ICCV 2013.
[Figures: results on Caltech Test; results on ETHZ]
DN-HOG
UDN-HOG
UDN-HOGCSS
UDN-CNNFeat
UDN-DefLayer
Multi-Stage Contextual Deep Learning
Motivated by Cascaded Classifiers and
Contextual Boost
• The classifier of each stage deals with a specific set
of samples
• The score map output by one classifier can serve as
contextual information for the next classifier
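The second point, passing one stage's score to the next stage as context, can be sketched as follows. This is a simplified illustration under stated assumptions: each stage is a hypothetical logistic classifier, the score map is reduced to a single score per window, and the per-stage selection of samples from the first point is not modeled.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def run_stages(features, stages):
    """Run a chain of classifiers on one window.

    Each stage is a tuple (weights, context_weight, bias). The score of the
    previous stage enters the current stage as one extra, contextual input.
    Returns the list of per-stage scores.
    """
    scores = []
    context = 0.0  # no context is available before the first stage
    for weights, context_weight, bias in stages:
        z = sum(w * f for w, f in zip(weights, features))
        z += context_weight * context + bias
        context = sigmoid(z)
        scores.append(context)
    return scores

# Hypothetical two-stage chain: the second stage reads the first stage's score.
stages = [
    ([1.0, 0.0], 0.0, 0.0),
    ([0.0, 1.0], 2.0, 0.0),
]
scores = run_stages([0.0, 0.0], stages)
```

In this toy setting the window's features are all zero, so the second stage's score differs from 0.5 only because of the context it receives from the first stage.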
Caltech ETHZ
DeepNetNoneFilter
Comparison of Different Training Strategies
Network-BP: use back propagation to update all the parameters without pre-training
PretrainTransferMatrix-BP: the transfer matrices are pre-trained without supervision, and then
all the parameters are fine-tuned
Multi-stage: our multi-stage training strategy
High-Dimensional Data Transforms
Input → high-dimensional data transform → Output
PLDA (Li, TPAMI’12): 90.07
Joint Bayesian (Chen, ECCV’12, 5-point align): 90.9
Fisher Vector Faces (Barkan, ICCV’13): 93.30
High-dim LBP (Chen, CVPR’13, 27-point align): 93.18
Ours (5-point align): 94.38
Comparison on LFW (with outside training data)
Associate-Predict (Yin, CVPR’12): 90.57
Joint Bayesian (Chen, ECCV’12, 5-point align): 92.4
Tom-vs-Peter (Berg, BMVC’12, 90-point align): 93.30
High-dim LBP (Chen, CVPR’13, 27-point align): 95.17
Transfer learning joint Bayesian (Cao, ICCV’13, 27-point align): 96.33
Ours (5-point align): 96.45
Face Parsing
• P. Luo, X. Wang and X. Tang, “Hierarchical Face
Parsing via Deep Learning,” CVPR 2012
Motivations
1. http://www.luxand.com/facesdk/
2. http://research.microsoft.com/en-us/projects/facesdk/
3. O. Jesorsky, K. J. Kirchberg, and R. Frischholz. Robust face detection using the Hausdorff distance. In Proc. AVBPA, 2001.
4. P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In Proc. CVPR, 2011.
5. X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. In Proc. CVPR, 2012.
6. L. Liang, R. Xiao, F. Wen, and J. Sun. Face alignment via component-based discriminative search. In Proc. ECCV, 2008.
7. M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted regression and graph models. In Proc. CVPR, 2010.
Validation.
BioID.
LFPW.
Conclusions
• Deep learning can jointly optimize key components in
vision systems
• Prior knowledge from vision research is valuable for
developing deep models and training strategies
• Deep learning can solve some vision challenges as
problems of high-dimensional data transform
• Challenging prediction tasks can make better use of the
large learning capacity and help avoid overfitting
People working on deep learning in our group
Acknowledgement
Hong Kong Research Grants Council
National Natural Science Foundation of China
Thank you!
http://mmlab.ie.cuhk.edu.hk/ http://www.ee.cuhk.edu.hk/~xgwang/