MoveNet SinglePose Model Card
Model Details
A convolutional neural network model that runs on RGB images and predicts human joint
locations of a single person. The model is designed to run in the browser using TensorFlow.js
or on devices using TF Lite in real time, targeting movement/fitness activities. Two variants
are presented:
● MoveNet.SinglePose.Lightning: A lower-capacity model that can run >50FPS on most
modern laptops while achieving good performance.
● MoveNet.SinglePose.Thunder: A higher-capacity model (compared to Lightning) that can
run >30FPS on most modern laptops while achieving higher accuracy.
Model Specifications
Model Architecture
MobileNetV2 image feature extractor with a Feature Pyramid Network decoder (to an output
stride of 4), followed by CenterNet prediction heads with custom post-processing logic.
Lightning uses a depth multiplier of 1.0, while Thunder uses a depth multiplier of 1.75.
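For illustration only, the following is a minimal Keras sketch of an architecture of this shape, not the released implementation: a MobileNetV2 backbone, an FPN-style top-down decoder that merges skip connections down to stride 4, and CenterNet-style convolutional heads. The endpoint layer names, channel widths, and the exact set of heads are assumptions.

```python
# A minimal sketch (not the released implementation) of the architecture
# described above: MobileNetV2 backbone, FPN-style decoder to stride 4,
# and CenterNet-style prediction heads.
import tensorflow as tf

def build_movenet_like(input_size=192, depth_multiplier=1.0, num_keypoints=17):
    backbone = tf.keras.applications.MobileNetV2(
        input_shape=(input_size, input_size, 3),
        alpha=depth_multiplier,   # 1.0 for Lightning, 1.75 for Thunder
        include_top=False,
        weights=None,             # non-standard alphas have no pretrained weights
    )
    # Backbone endpoints at strides 4, 8, 16, 32 (Keras layer names assumed).
    names = ["block_2_add", "block_5_add", "block_12_add", "out_relu"]
    endpoints = [backbone.get_layer(n).output for n in names]

    # Top-down FPN decoder: start at stride 32, upsample and merge lateral
    # connections until the feature map reaches stride 4.
    x = tf.keras.layers.Conv2D(64, 1)(endpoints[-1])
    for skip in reversed(endpoints[:-1]):
        x = tf.keras.layers.UpSampling2D(2)(x)
        lateral = tf.keras.layers.Conv2D(64, 1)(skip)
        x = tf.keras.layers.Add()([x, lateral])
        x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)

    # CenterNet-style heads: person-center heatmap, per-keypoint heatmaps,
    # keypoint regression from the center, and local offset refinement.
    center = tf.keras.layers.Conv2D(1, 1, activation="sigmoid", name="center")(x)
    heatmaps = tf.keras.layers.Conv2D(num_keypoints, 1, activation="sigmoid",
                                      name="keypoint_heatmaps")(x)
    regression = tf.keras.layers.Conv2D(2 * num_keypoints, 1,
                                        name="keypoint_regression")(x)
    offsets = tf.keras.layers.Conv2D(2 * num_keypoints, 1,
                                     name="local_offsets")(x)
    return tf.keras.Model(backbone.input, [center, heatmaps, regression, offsets])
```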
Inputs
A frame of video or an image, represented as an int32 tensor of shape 192x192x3 (Lightning) /
256x256x3 (Thunder). Channel order: RGB with values in [0, 255].
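As a concrete example, input preparation for Lightning might look like the following Python sketch; the file name is illustrative, and the added batch dimension assumes the packaged model takes a batched tensor.

```python
# A minimal sketch of preparing a Lightning input; Thunder would use
# 256x256 instead of 192x192. The image path is illustrative.
import tensorflow as tf

image = tf.io.decode_jpeg(tf.io.read_file("frame.jpg"))   # HxWx3, RGB, uint8
input_image = tf.image.resize_with_pad(image, 192, 192)   # keep aspect ratio
input_image = tf.cast(input_image, tf.int32)              # model expects int32
input_image = tf.expand_dims(input_image, axis=0)         # [1, 192, 192, 3] (batch assumed)
```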
Outputs
A float32 tensor of shape [1, 1, 17, 3].
● The first two channels of the last dimension represent the yx coordinates (normalized to
the image frame, i.e. in the range [0.0, 1.0]) of the 17 keypoints (in the order: [nose, left eye,
right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist,
right wrist, left hip, right hip, left knee, right knee, left ankle, right ankle]).
● The third channel of the last dimension represents the prediction confidence scores of
each keypoint, also in the range [0.0, 1.0].
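To illustrate how this output is consumed, here is a minimal TF Lite inference sketch in Python; the model file name is illustrative, and input_image is the int32 tensor prepared in the Inputs section.

```python
# A minimal sketch of running inference with TF Lite and reading the output;
# the .tflite file name is illustrative.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="movenet_lightning.tflite")
interpreter.allocate_tensors()
input_index = interpreter.get_input_details()[0]["index"]
output_index = interpreter.get_output_details()[0]["index"]

interpreter.set_tensor(input_index, input_image.numpy().astype(np.int32))
interpreter.invoke()

keypoints = interpreter.get_tensor(output_index)   # shape [1, 1, 17, 3]
y, x, score = keypoints[0, 0, 0]                   # keypoint 0 is the nose
print(f"nose at (y={y:.2f}, x={x:.2f}) with confidence {score:.2f}")
```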
Authors
(Equal contributions)
Francois Beletti, Google
Yu-Hui Chen, Google
Ard Oerlemans, Google
Ronny Votel, Google
Evaluation Data
● COCO Keypoint Dataset Validation Set 2017: In-the-wild images with diverse scenes,
instance sizes, and occlusions. The original validation set contains 5k images (images,
annotations) in total. Images that contain a single person are retained as the evaluation
set for this model, 919 images in total. The dataset is chosen to evaluate the model's
performance in the general in-the-wild scenario.
● Active Dataset Evaluation Set: Images sampled from YouTube fitness, yoga, and
dance videos which capture people's movements. It contains diverse poses and motion,
with more motion blur and self-occlusions. The set contains 1161 images with a single
person in the frame. This dataset is chosen to evaluate the model's performance on the
targeted domain, i.e. fitness/human motion.
Training Data
● COCO Keypoint Dataset Training Set 2017: In-the-wild images with diverse scenes,
instance sizes, and occlusions. The original training set contains 64k images (images,
annotations). Images with three or more people were filtered out, resulting in a final
training set of 28k images.
● Active Dataset Training Set: Images sampled from YouTube fitness videos which
capture people exercising (e.g. HIIT, weight-lifting, etc.), stretching, or dancing. It
contains diverse poses and motion, with more motion blur and self-occlusions. The set
of single-person images contains 23.5k images.
Factors
Groups
To perform fairness evaluation, we analyze the model's performance across different person
attributes and categories:
● Gender: Male/Female
● Age: Young/Middle-age/Old
● Skin tone: Medium/Darker/Lighter
Instrumentation
The training dataset images were captured in real-world environments with varying lighting,
noise, and motion. The model is therefore robust to input video streams captured by common
devices' webcams.
Environments
The model is trained on images with various lighting, noise, and motion conditions and with
diverse augmentations.
Metrics
● Keypoint mean average precision (mAP) with Object Keypoint Similarity (OKS):
the standard metric used to evaluate the quality of a keypoint model's predictions in
the COCO competition (a sketch of the OKS computation follows this list).
● Inference Time: the time spent running model inference on a single image, measured
in milliseconds.
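As referenced above, here is a minimal NumPy sketch of the per-instance OKS computation, following the standard COCO definition; the per-keypoint falloff constants k_i come from the COCO toolkit and are the caller's responsibility here.

```python
# A minimal sketch of the per-instance OKS computation underlying keypoint
# mAP, following the standard COCO definition:
#   OKS = sum_i exp(-d_i^2 / (2 s^2 k_i^2)) * [v_i > 0] / sum_i [v_i > 0]
import numpy as np

def oks(pred_xy, gt_xy, visibility, s2, k):
    """pred_xy, gt_xy: [17, 2] coordinates; visibility: [17] labels;
    s2: object scale squared (segment area); k: [17] COCO falloff constants."""
    d2 = np.sum((pred_xy - gt_xy) ** 2, axis=-1)
    similarity = np.exp(-d2 / (2.0 * s2 * k ** 2))
    labeled = visibility > 0
    return similarity[labeled].sum() / max(labeled.sum(), 1)
```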
Quantitative Analyses
Prediction Quality
The following tables show the evaluation results for different attributes/categories. Both models
perform fairly (< 5% performance difference between categories) on our targeted Active Single
Person Image Set.
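As an illustration of this criterion, a small sketch that checks the gap between per-category mAPs; the numbers are placeholders rather than results from this card, and reading "< 5%" as an absolute mAP difference is an assumption.

```python
# A sketch of the fairness criterion described above; the mAP numbers are
# placeholders, not results from this card, and treating "< 5%" as an
# absolute difference in mAP is an assumption.
def max_gap(map_by_category):
    vals = list(map_by_category.values())
    return max(vals) - min(vals)

gender_map = {"male": 0.80, "female": 0.79}   # placeholder values
assert max_gap(gender_map) < 0.05, "category gap exceeds 5%"
```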