CS231N Section: Video Understanding

This document provides an overview of video understanding and summarizes some common approaches used for video classification models. It discusses pre-deep learning methods that used hand-crafted features and bag-of-words models. For deep learning models, it describes CNN+RNN approaches that model videos as sequences, 3D convolution networks that learn spatiotemporal features, and two-stream networks that separately model appearance and motion. Several widely used video datasets for tasks like action recognition and video captioning are also summarized.


CS231N Section

Video Understanding
6/1/2018
Outline
● Background / Motivation / History
● Video Datasets
● Models
○ Pre-deep learning
○ CNN + RNN
○ 3D convolution
○ Two-stream
What we’ve seen in class so far...
● Image Classification
● CNNs, GANs, RNNs, LSTMs, GRUs
● Reinforcement Learning
What’s missing → videos!
● Robotics / Manipulation
● Self-Driving Cars
● Collective Activity Understanding
● Video Captioning
...and more!
● Video editing

● VR (e.g. vision as inverse graphics)

● Video QA

● ...
Datasets
● Video Classification
● Atomic Actions
● Video Retrieval
Video Classification
UCF101
● YouTube videos
● 13,320 videos, 101 action categories
● Large variations in camera motion, object appearance and pose, viewpoint, background, illumination, etc.
Sports-1M
● YouTube videos
● 1,133,157 videos, 487 sports labels
YouTube-8M
● Data
○ Machine-generated annotations from 3,862 classes
○ Audio-visual features
Atomic Actions
Charades
● Hollywood in Homes: crowdsourced “boring” videos of daily activities
● 9,848 videos
● RGB + optical flow features
● Action classification, sentence prediction
● Pros and cons
○ Pros: objects; video-level and frame-level classification
○ Cons: no human localization
Atomic Visual Actions (AVA)
● Data
○ 57.6k 3-second segments
○ Pose and person-object interaction labels
● Pros and cons
○ Pros: fine-grained
○ Cons: no annotations for the objects themselves
Moments in Time (MIT)
● Dataset: 1,000,000 3-second videos
○ 339 verbs
○ Not limited to humans
○ Sound-dependent events: e.g. clapping in the background
● Advantages:
○ Balanced across classes
● Disadvantages:
○ Single label per video (classification, not detection)
Movie Querying
M-VAD and MPII-MD
● Video clips with descriptions, e.g.:
○ SOMEONE holds a crossbow.
○ He and SOMEONE exit a mansion. Various vehicles sit in the driveway, including an RV and a boat. SOMEONE spots a truck emblazoned with a bald eagle surrounded by stars and stripes.
○ At Vito's, the Datsun parks by a dumpster.
LSMDC (Large Scale Movie Description Challenge)
● Combination of M-VAD and MPII-MD

Tasks

● Movie description
○ Predict descriptions for 4-5s movie clips
● Movie retrieval
○ Find the correct caption for a video, or retrieve videos corresponding to a given activity
● Movie Fill-in-the-Blank (QA)
○ Given a video clip and a sentence with a blank in it, fill in the blank with the correct word
Challenges in Videos
● Computationally expensive
○ Video datasets are much larger than image datasets
● Lower quality
○ Resolution, motion blur, occlusion
● Requires lots of training data!
What a video framework should have
● Sequence modeling
● Temporal reasoning (a large temporal receptive field)
● Focus on action recognition
○ Representative task for video understanding
Models
Pre-Deep Learning
Pre-Deep Learning
Features:
● Local features: HOG (Histogram of Oriented Gradients) + HOF (Histogram of Optical Flow)
● Trajectory-based:
○ Motion Boundary Histograms (MBH)
○ (improved) dense trajectories: good performance, but computationally intensive

Ways to aggregate features:


● Bag of Visual Words (Ref)
● Fisher vectors (Ref)
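
To make the aggregation step concrete, here is a minimal sketch of a bag-of-visual-words pipeline (assuming scikit-learn; the descriptor arrays are random stand-ins for real HOG/HOF features):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: each video yields a set of local descriptors
# (e.g. HOG/HOF around interest points); random stand-ins here.
rng = np.random.default_rng(0)
train_descriptors = [rng.normal(size=(200, 96)) for _ in range(10)]

# 1) Learn a visual vocabulary by clustering all training descriptors.
codebook = KMeans(n_clusters=64, n_init=10, random_state=0)
codebook.fit(np.vstack(train_descriptors))

def bovw_histogram(descriptors, codebook):
    """Quantize descriptors to their nearest codeword and return a
    normalized histogram of codeword counts."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# 2) Each video becomes one fixed-length vector, fed to e.g. an SVM.
video_features = [bovw_histogram(d, codebook) for d in train_descriptors]
```

Fisher vectors follow the same encode-then-aggregate pattern, but replace hard codeword counts with gradients of a GMM log-likelihood, which typically performs better.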
Representing Motion
Optical flow: the pattern of apparent motion between consecutive frames

● Calculation methods: e.g. TV-L1, DeepFlow, etc.


Representing Motion
1) Optical flow 2) Trajectory stacking
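
A minimal sketch of computing dense optical flow and stacking the displacement fields over L frames into a network input volume (assuming OpenCV; Farneback flow is used here, and a TV-L1 implementation ships with opencv-contrib):

```python
import cv2
import numpy as np

def stacked_flow(frames, L=10):
    """Compute dense optical flow between consecutive frames and stack
    the x/y displacement fields into an (H, W, 2L) input volume."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames[:L + 1]]
    channels = []
    for prev, nxt in zip(gray[:-1], gray[1:]):
        # Farneback dense flow; a TV-L1 alternative would be
        # cv2.optflow.createOptFlow_DualTVL1() from opencv-contrib.
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        channels.extend([flow[..., 0], flow[..., 1]])  # horizontal, vertical
    return np.stack(channels, axis=-1)
```

Trajectory stacking differs in that the flow values are sampled along the motion trajectory of each pixel rather than at the same location in every frame.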
Deep Learning ☺
Large-scale Video Classification with Convolutional Neural Networks (pdf)

Two questions:

● Modeling perspective: what architecture best captures temporal patterns?
● Computational perspective: how can computation cost be reduced without sacrificing accuracy?
Large-scale Video Classification with Convolutional Neural Networks (pdf)

Architecture: different ways to fuse features from multiple frames

[Figure: frame-fusion architectures; legend: conv layer, norm layer, pooling layer]


Large-scale Video Classification with Convolutional Neural Networks (pdf)

Computational cost: reduce spatial dimensions to reduce model complexity

→ Multi-resolution: a low-res context stream + a high-res fovea stream

● Fovea stream: high-res image center of size (w/2, h/2)
● Context stream: full frame downsampled to (w/2, h/2)
● Reduces the number of parameters to around a half
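
A minimal sketch of preparing the two inputs (assuming OpenCV; the function name and shapes are illustrative):

```python
import cv2

def multires_inputs(frame):
    """Split a frame into a low-res context view and a high-res center
    crop, both at half the original spatial size."""
    h, w = frame.shape[:2]
    # Context stream: whole frame downsampled to (w/2, h/2).
    context = cv2.resize(frame, (w // 2, h // 2))
    # Fovea stream: centered crop of size (w/2, h/2) at full resolution.
    y0, x0 = h // 4, w // 4
    fovea = frame[y0:y0 + h // 2, x0:x0 + w // 2]
    return context, fovea
```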
Large-scale Video Classification with Convolutional Neural Networks (pdf)

Results on video retrieval (Hit@k: the correct video is ranked among the top k):
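
For concreteness, a minimal sketch of the Hit@k metric (NumPy assumed; `scores` is a hypothetical ranking-score matrix):

```python
import numpy as np

def hit_at_k(scores, labels, k=5):
    """Fraction of queries whose correct item appears among the top-k
    ranked predictions. scores: (N, C) array, labels: (N,) int array."""
    topk = np.argsort(-scores, axis=1)[:, :k]     # indices of k highest scores
    hits = (topk == labels[:, None]).any(axis=1)  # correct item in top k?
    return hits.mean()
```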
Next...
● CNN + RNN

● 3D Convolution

● Two-stream networks
CNN + RNN
Videos as Sequences
Previous work: multi-frame features are temporally local (e.g. 10 frames)

Hypothesis: a global description would be beneficial

Design choices:

● Modality: 1) RGB 2) optical flow 3) RGB + optical flow


● Features: 1) hand-crafted 2) extracted using CNN
● Temporal aggregation: 1) temporal pooling 2) RNN (e.g. LSTM, GRU)
Beyond Short Snippets: Deep Networks for Video Classification (arXiv)

1) Conv Pooling
2) Late Pooling
3) Slow Pooling
4) Local Pooling
5) Time-domain convolution
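
As an illustration of the first variant, a minimal PyTorch sketch of Conv Pooling, i.e. max-pooling per-frame CNN features over time before the classifier (`frame_cnn` is a placeholder for any image backbone; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class ConvPoolingClassifier(nn.Module):
    """Max-pool frame-level CNN features across time, then classify."""
    def __init__(self, frame_cnn, feat_dim=2048, num_classes=101):
        super().__init__()
        self.frame_cnn = frame_cnn            # any image CNN -> (B, feat_dim)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, clip):                  # clip: (B, T, C, H, W)
        b, t = clip.shape[:2]
        feats = self.frame_cnn(clip.flatten(0, 1))   # (B*T, feat_dim)
        feats = feats.view(b, t, -1)
        pooled, _ = feats.max(dim=1)          # temporal max pooling
        return self.fc(pooled)
```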


Beyond Short Snippets: Deep Networks for Video Classification (arXiv)

Learning a global description: aggregate per-frame features over the entire video, using the design choices listed above; the paper compares temporal pooling against LSTM aggregation (see the sketch below).
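
A minimal PyTorch sketch of the CNN + RNN pattern, with an LSTM aggregating per-frame features into a clip-level prediction (names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Run a 2D CNN per frame, then an LSTM over the feature sequence."""
    def __init__(self, frame_cnn, feat_dim=2048, hidden=512, num_classes=101):
        super().__init__()
        self.frame_cnn = frame_cnn
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip):                    # clip: (B, T, C, H, W)
        b, t = clip.shape[:2]
        feats = self.frame_cnn(clip.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)               # (B, T, hidden)
        return self.fc(out[:, -1])              # classify from final state
```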
3D Convolution
2D vs 3D Convolution
Previous work: 2D convolutions collapse temporal information

Proposal: 3D convolution → learning features that encode temporal information


3D Convolutional Neural Networks for Human Action Recognition (pdf)

Multiple channels as input:

1) gray, 2) gradient x, 3) gradient y, 4) optical flow x, 5) optical flow y


3D Convolutional Neural Networks for Human Action Recognition (pdf)

Auxiliary handcrafted long-term features: supply information beyond the 7-frame input and act as a regularizer


Learning Spatiotemporal Features with 3D Convolutional Networks (pdf)

Improve over the previous 3D conv model

● 3 x 3 x 3 homogeneous kernels
● End-to-end: no human detection preprocessing required
● Compact features; new SOTA on several benchmarks
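
A minimal PyTorch sketch of a C3D-style block with homogeneous 3×3×3 kernels (layer sizes are illustrative, not the exact C3D architecture):

```python
import torch
import torch.nn as nn

class C3DBlock(nn.Module):
    """Two 3x3x3 convolutions over (time, height, width), then pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),       # halves T, H, and W
        )

    def forward(self, x):                      # x: (B, C, T, H, W)
        return self.net(x)

clip = torch.randn(2, 3, 16, 112, 112)         # 16-frame RGB clip
print(C3DBlock(3, 64)(clip).shape)             # (2, 64, 8, 56, 56)
```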
Two-Stream
Video = Appearance + Motion
Complementary information:
● Single frame: static appearance
● Multiple frames: motion information, e.g. optical flow (pixel displacement between frames)
Two-Stream Convolutional Networks for Action Recognition in Videos (pdf)

Previous work struggled because learning motion implicitly from raw frames is hard

Proposal: separate motion (multi-frame stream) from static appearance (single-frame stream)

● Motion = object motion + camera motion → subtract the mean flow to compensate for camera motion
Two-Stream Convolutional Networks for Action Recognition in Videos (pdf)

Two types of motion representations:


1) Optical flow 2) Trajectory stacking
Convolutional Two-Stream Network Fusion for Video Action Recognition (pdf)

Disadvantages of the original two-stream network:

● The appearance and motion streams are not spatially aligned
○ Solution: spatial fusion
● Temporal evolution is not modeled
○ Solution: temporal fusion
Convolutional Two-Stream Network Fusion for Video Action Recognition (pdf)

Spatial fusion:

● Spatial correspondence: upsample the feature maps to the same spatial dimensions
● Channel correspondence: fuse appearance features x_a and motion features x_b into y:
○ Max fusion: y(i, j, d) = max(x_a(i, j, d), x_b(i, j, d))
○ Sum fusion: y(i, j, d) = x_a(i, j, d) + x_b(i, j, d)
○ Concat-conv fusion: stack channels, then a 1x1 conv layer for dimension reduction
■ The conv filters learn the channel correspondence: y = cat(x_a, x_b) * f + b
○ Bilinear fusion: outer product of the two feature vectors at each location, summed over all locations
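
A minimal PyTorch sketch of concat-conv fusion between the two streams' feature maps (channel counts are illustrative):

```python
import torch
import torch.nn as nn

class ConcatConvFusion(nn.Module):
    """Stack appearance and motion feature maps along channels, then let
    a 1x1 convolution learn the channel correspondence."""
    def __init__(self, channels=512):
        super().__init__()
        self.reduce = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x_appearance, x_motion):  # both: (B, C, H, W)
        fused = torch.cat([x_appearance, x_motion], dim=1)  # (B, 2C, H, W)
        return self.reduce(fused)                           # (B, C, H, W)

a = torch.randn(2, 512, 14, 14)
m = torch.randn(2, 512, 14, 14)
print(ConcatConvFusion()(a, m).shape)   # torch.Size([2, 512, 14, 14])
```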
Convolutional Two-Stream Network Fusion for Video Action Recognition (pdf)

Temporal fusion:

● 3D pooling
● 3D Conv + pooling
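
And a sketch of the two temporal-fusion options applied to fused feature maps (shapes illustrative):

```python
import torch
import torch.nn as nn

fused = torch.randn(2, 512, 8, 14, 14)   # (B, C, T, H, W) fused maps

# 1) 3D pooling over (T, H, W)
pool3d = nn.MaxPool3d(kernel_size=(2, 3, 3), stride=(2, 2, 2),
                      padding=(0, 1, 1))

# 2) 3D conv + pooling: learn spatiotemporal filters before pooling
conv3d = nn.Conv3d(512, 512, kernel_size=3, padding=1)

print(pool3d(fused).shape)            # torch.Size([2, 512, 4, 7, 7])
print(pool3d(conv3d(fused)).shape)    # torch.Size([2, 512, 4, 7, 7])
```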
Convolutional Two-Stream Network Fusion for Video Action Recognition (pdf)

Multi-scale: local spatiotemporal features + global temporal features


Model Takeaway
The motivations:

● CNN + RNN: video understanding as sequence modeling


● 3D Convolution: embed the temporal dimension directly into the CNN
● Two-stream: model motion explicitly
Further Readings
● CNN + RNN
❏ Unsupervised Learning of Video Representations using LSTMs (arXiv)
❏ Long-term Recurrent ConvNets for Visual Recognition and Description (arXiv)
● 3D Convolution
❏ I3D: inflates 2D networks (and their pretrained weights) into 3D
❏ P3D: factorizes 3D convolution into 2D spatial + 1D temporal
● Two streams
❏ I3D also uses both modalities
● Others:
❏ Objects2action: Classifying and localizing actions w/o any video example (arXiv)
❏ Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos (arXiv)
