MediaPipe Holistic

MediaPipe Holistic provides live perception of simultaneous human pose, face landmarks, and hand tracking in real-time on mobile devices. It combines separate pose, face, and hand models into a single pipeline. The pipeline first estimates pose, then derives regions of interest for the hands and face at higher resolutions. It applies specialized face and hand models to these regions before merging all landmarks into a single output. Tracking is used to identify regions of interest between frames for improved efficiency.


Installing on Windows

Disclaimer: Running MediaPipe on Windows is experimental.

Note: building MediaPipe Android apps is still not possible on native Windows. Please
do this in WSL instead and see the WSL setup instruction in the next section.

1. Install MSYS2 and edit the %PATH% environment variable.

If MSYS2 is installed to C:\msys64, add C:\msys64\usr\bin to
your %PATH% environment variable.

2. Install necessary packages.

C:\> pacman -S git patch unzip

3. Install Python and allow the executable to edit the %PATH% environment variable.

Download the Python Windows executable from
https://www.python.org/downloads/windows/ and install it.

4. Install Visual C++ Build Tools 2019 and WinSDK.

Go to the Visual Studio website, download the build tools, and install Microsoft Visual
C++ 2019 Redistributable and Microsoft Build Tools 2019.

Download the WinSDK from the official Microsoft website and install it.

5. Install Bazel or Bazelisk and add the location of the Bazel executable to
the %PATH% environment variable.

Option 1. Follow the official Bazel documentation to install Bazel 5.2.0 or higher.

Option 2. Follow the official Bazel documentation to install Bazelisk.

6. Set Bazel variables. Learn more details about “Build on Windows” in the Bazel
official documentation.

# Please find the exact paths and version numbers from your local version.
C:\> set BAZEL_VS=C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools
C:\> set BAZEL_VC=C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC
C:\> set BAZEL_VC_FULL_VERSION=<Your local VC version>
C:\> set BAZEL_WINSDK_FULL_VERSION=<Your local WinSDK version>

7. Checkout the MediaPipe repository.

C:\Users\Username\mediapipe_repo> git clone https://github.com/google/mediapipe.git

# Change directory into the MediaPipe root directory
C:\Users\Username\mediapipe_repo> cd mediapipe

8. Install OpenCV.

Download the Windows executable from https://opencv.org/releases/ and install it.

We currently use OpenCV 3.4.10. Remember to edit the WORKSPACE file if OpenCV
is not installed at C:\opencv.

new_local_repository(
name = "windows_opencv",
build_file = "@//third_party:opencv_windows.BUILD",
path = "C:\\<path to opencv>\\build",
)

9. Run the Hello World! in C++ example.

Note: For building MediaPipe on Windows, please add --action_env
PYTHON_BIN_PATH="C://path//to//python.exe" to the build command. Alternatively, you
can follow issue 724 to fix the Python configuration manually.


C:\Users\Username\mediapipe_repo>bazel build -c opt --define MEDIAPIPE_DISABLE_GPU=1 --action_env PYTHON_BIN_PATH="C://python_36//python.exe" mediapipe/examples/desktop/hello_world

C:\Users\Username\mediapipe_repo>set GLOG_logtostderr=1

C:\Users\Username\mediapipe_repo>bazel-bin\mediapipe\examples\desktop\hello_world\hello_world.exe

# should print:
# I20200514 20:43:12.277598 1200 hello_world.cc:56] Hello World!
# I20200514 20:43:12.278597 1200 hello_world.cc:56] Hello World!
# I20200514 20:43:12.279618 1200 hello_world.cc:56] Hello World!
# I20200514 20:43:12.279618 1200 hello_world.cc:56] Hello World!
# I20200514 20:43:12.279618 1200 hello_world.cc:56] Hello World!
# I20200514 20:43:12.279618 1200 hello_world.cc:56] Hello World!
# I20200514 20:43:12.279618 1200 hello_world.cc:56] Hello World!
# I20200514 20:43:12.279618 1200 hello_world.cc:56] Hello World!
# I20200514 20:43:12.279618 1200 hello_world.cc:56] Hello World!
# I20200514 20:43:12.280613 1200 hello_world.cc:56] Hello World!

MediaPipe in Python
1. Ready-to-use Python Solutions
2. MediaPipe on Google Colab
3. MediaPipe Python Framework
4. Building MediaPipe Python Package

Ready-to-use Python Solutions


MediaPipe offers ready-to-use yet customizable Python solutions as a prebuilt Python
package. The MediaPipe Python package is available on PyPI for Linux, macOS and Windows.

You can, for instance, activate a Python virtual environment:


$ python3 -m venv mp_env && source mp_env/bin/activate

Install MediaPipe Python package and start Python interpreter:


(mp_env)$ pip install mediapipe
(mp_env)$ python3

In the Python interpreter, import the package and start using one of the solutions:
import mediapipe as mp
mp_face_mesh = mp.solutions.face_mesh

Tip: Use command deactivate to later exit the Python virtual environment.
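
For a minimal end-to-end sketch (the image path below is a placeholder, not part of the official example), Face Mesh can be run on a single local image like this:

import cv2
import mediapipe as mp

mp_face_mesh = mp.solutions.face_mesh

# "face.jpg" is a placeholder path for this sketch; substitute any local image.
image = cv2.imread('face.jpg')
if image is None:
  raise FileNotFoundError('face.jpg not found')

# static_image_mode=True because we process one unrelated image, not a video stream.
with mp_face_mesh.FaceMesh(static_image_mode=True, max_num_faces=1) as face_mesh:
  # MediaPipe expects RGB input, while OpenCV loads images as BGR.
  results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.multi_face_landmarks:
  print('Detected', len(results.multi_face_landmarks[0].landmark), 'face landmarks')
else:
  print('No face detected')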

To learn more about configuration options and usage examples, please find details in
each solution via the links below:

• MediaPipe Face Detection


• MediaPipe Face Mesh
• MediaPipe Hands
• MediaPipe Holistic
• MediaPipe Objectron
• MediaPipe Pose
• MediaPipe Selfie Segmentation

MediaPipe on Google Colab


• MediaPipe Face Detection Colab
• MediaPipe Face Mesh Colab
• MediaPipe Hands Colab
• MediaPipe Holistic Colab
• MediaPipe Objectron Colab
• MediaPipe Pose Colab
• MediaPipe Pose Classification Colab (Basic)
• MediaPipe Pose Classification Colab (Extended)
• MediaPipe Selfie Segmentation Colab

MediaPipe Python Framework


The ready-to-use solutions are built upon the MediaPipe Python framework, which can
be used by advanced users to run their own MediaPipe graphs in Python. Please
see here for more info.

Building MediaPipe Python Package


Follow the steps below only if you have local changes and need to build the Python
package from source. Otherwise, we strongly encourage our users to simply run pip
install mediapipe to use the ready-to-use solutions, which is more convenient and much faster.

MediaPipe PyPI currently doesn’t provide aarch64 Python wheel files. For building and
using MediaPipe Python on aarch64 Linux systems such as Nvidia Jetson and Raspberry
Pi, please read here.

1. Make sure that Bazel and OpenCV are correctly installed and configured for
MediaPipe. Please see Installation for how to set up Bazel and OpenCV for
MediaPipe on Linux and macOS.

2. Install the following dependencies.

Debian or Ubuntu:
$ sudo apt install python3-dev
$ sudo apt install python3-venv
$ sudo apt install -y protobuf-compiler

# If you need to build opencv from source.


$ sudo apt install cmake

macOS:
$ brew install protobuf

# If you need to build opencv from source.


$ brew install cmake

Windows:
Download the latest protoc win64 zip from the Protobuf GitHub repo, unzip the
file, and copy the protoc.exe executable to a preferred location. Please ensure
that location is added into the Path environment variable.

3. Activate a Python virtual environment.

$ python3 -m venv mp_env && source mp_env/bin/activate

4. In the virtual environment, go to the MediaPipe repo directory.

5. Install the required Python packages.

(mp_env)mediapipe$ pip3 install -r requirements.txt

6. Build and install the MediaPipe package.

(mp_env)mediapipe$ python3 setup.py install --link-opencv

or

(mp_env)mediapipe$ python3 setup.py bdist_wheel

MediaPipe Holistic

Overview
Live perception of simultaneous human pose, face landmarks, and hand tracking in real-
time on mobile devices can enable various modern life applications: fitness and sport
analysis, gesture control and sign language recognition, augmented reality try-on and
effects. MediaPipe already offers fast and accurate, yet separate, solutions for these
tasks. Combining them all in real-time into a semantically consistent end-to-end
solution is a uniquely difficult problem requiring simultaneous inference of multiple,
dependent neural networks.
Fig 1. Example of MediaPipe Holistic.

ML Pipeline
The MediaPipe Holistic pipeline integrates separate models
for pose, face and hand components, each of which are optimized for their particular
domain. However, because of their different specializations, the input to one component
is not well-suited for the others. The pose estimation model, for example, takes a lower,
fixed resolution video frame (256x256) as input. But if one were to crop the hand and
face regions from that image to pass to their respective models, the image resolution
would be too low for accurate articulation. Therefore, we designed MediaPipe Holistic as
a multi-stage pipeline, which treats the different regions using a region appropriate
image resolution.

First, we estimate the human pose (top of Fig 2) with BlazePose’s pose detector and
subsequent landmark model. Then, using the inferred pose landmarks we derive three
regions of interest (ROI) crops for each hand (2x) and the face, and employ a re-crop
model to improve the ROI. We then crop the full-resolution input frame to these ROIs
and apply task-specific face and hand models to estimate their corresponding
landmarks. Finally, we merge all landmarks with those of the pose model to yield the full
540+ landmarks.
Fig 2. MediaPipe Holistic Pipeline Overview.

To streamline the identification of ROIs for face and hands, we utilize a tracking
approach similar to the one we use for standalone face and hand pipelines. It assumes
that the object doesn’t move significantly between frames and uses estimation from the
previous frame as a guide to the object region on the current one. However, during fast
movements, the tracker can lose the target, which requires the detector to re-localize it
in the image. MediaPipe Holistic uses pose prediction (on every frame) as an additional
ROI prior to reduce the response time of the pipeline when reacting to fast movements.
This also enables the model to retain semantic consistency across the body and its parts
by preventing a mixup between left and right hands or body parts of one person in the
frame with another.

In addition, the resolution of the input frame to the pose model is low enough that the
resulting ROIs for face and hands are still too inaccurate to guide the re-cropping of
those regions, which require a precise input crop to remain lightweight. To close this
accuracy gap we use lightweight face and hand re-crop models that play the role
of spatial transformers and cost only ~10% of the corresponding model's inference time.

The pipeline is implemented as a MediaPipe graph that uses a holistic landmark
subgraph from the holistic landmark module and renders using a dedicated holistic
renderer subgraph. The holistic landmark subgraph internally uses a pose landmark
module, hand landmark module and face landmark module. Please check them for
implementation details.
Note: To visualize a graph, copy the graph and paste it into MediaPipe Visualizer. For
more information on how to visualize its associated subgraphs, please see visualizer
documentation.

Models

Landmark Models
MediaPipe Holistic utilizes the pose, face and hand landmark models in MediaPipe
Pose, MediaPipe Face Mesh and MediaPipe Hands respectively to generate a total of
543 landmarks (33 pose landmarks, 468 face landmarks, and 21 hand landmarks per
hand).
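
As a quick arithmetic check, the per-component counts add up to 543; the helper below is a sketch (the function name is illustrative) that sums the landmark lists exposed on a Holistic result object:

# 33 pose + 468 face + 21 landmarks per hand x 2 hands = 543 landmarks in total.
assert 33 + 468 + 2 * 21 == 543

def count_holistic_landmarks(results):
  """Sum the landmark counts over all components of a Holistic result (sketch)."""
  total = 0
  for component in (results.pose_landmarks, results.face_landmarks,
                    results.left_hand_landmarks, results.right_hand_landmarks):
    if component is not None:  # a component is None when that part is not detected
      total += len(component.landmark)
  return total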

Hand Recrop Model


For cases when the accuracy of the pose model is low enough that the resulting ROIs for
hands are still too inaccurate, we run an additional lightweight hand re-crop model that
plays the role of a spatial transformer and costs only ~10% of the hand model's inference time.

Solution APIs

Cross-platform Configuration Options


Naming style and availability may differ slightly across platforms/languages.

STATIC_IMAGE_MODE
If set to false, the solution treats the input images as a video stream. It will try to detect
the most prominent person in the very first images, and upon a successful detection
further localizes the pose and other landmarks. In subsequent images, it then simply
tracks those landmarks without invoking another detection until it loses track, thereby
reducing computation and latency. If set to true, person detection runs on every input
image, ideal for processing a batch of static, possibly unrelated, images. Default
to false.

MODEL_COMPLEXITY


Complexity of the pose landmark model: 0 , 1 or 2 . Landmark accuracy as well as
inference latency generally go up with the model complexity. Default to 1 .

SMOOTH_LANDMARKS
If set to true , the solution filters pose landmarks across different input images to reduce
jitter, but ignored if static_image_mode is also set to true . Default to true .

ENABLE_SEGMENTATION
If set to true , in addition to the pose, face and hand landmarks the solution also
generates the segmentation mask. Default to false .

SMOOTH_SEGMENTATION
If set to true , the solution filters segmentation masks across different input images to
reduce jitter. Ignored if enable_segmentation is false or static_image_mode is true .

Default to true .

REFINE_FACE_LANDMARKS
Whether to further refine the landmark coordinates around the eyes and lips, and
output additional landmarks around the irises. Default to false .

MIN_DETECTION_CONFIDENCE
Minimum confidence value ( [0.0, 1.0] ) from the person-detection model for the
detection to be considered successful. Default to 0.5 .

MIN_TRACKING_CONFIDENCE


Minimum confidence value ( [0.0, 1.0] ) from the landmark-tracking model for the pose
landmarks to be considered tracked successfully, or otherwise person detection will be
invoked automatically on the next input image. Setting it to a higher value can increase
robustness of the solution, at the expense of a higher latency. Ignored
if static_image_mode is true , where person detection simply runs on every image.
Default to 0.5 .
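
In the Python Solution API these options map directly to constructor keyword arguments. A sketch that spells out the documented defaults explicitly (the values shown are illustrative, not required):

import mediapipe as mp

holistic = mp.solutions.holistic.Holistic(
    static_image_mode=False,
    model_complexity=1,
    smooth_landmarks=True,
    enable_segmentation=False,
    smooth_segmentation=True,
    refine_face_landmarks=False,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5)
# ... call holistic.process(rgb_image) on frames here ...
holistic.close()  # release resources, or use a "with" block as in the examples below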

Output
Naming style may differ slightly across platforms/languages.

POSE_LANDMARKS
A list of pose landmarks. Each landmark consists of the following:

• x and y: Landmark coordinates normalized to [0.0, 1.0] by the image width
and height respectively.
• z: Should be discarded as currently the model is not fully trained to predict
depth, but this is something on the roadmap.
• visibility: A value in [0.0, 1.0] indicating the likelihood of the landmark being
visible (present and not occluded) in the image.
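
A sketch (the helper name and the 0.5 visibility threshold are illustrative) of converting these normalized pose landmarks to pixel coordinates while skipping low-visibility points:

def pose_landmarks_to_pixels(results, image_width, image_height, min_visibility=0.5):
  """Return (x_px, y_px) pixel coordinates of sufficiently visible pose landmarks."""
  points = []
  if results.pose_landmarks is None:
    return points
  for landmark in results.pose_landmarks.landmark:
    if landmark.visibility < min_visibility:  # likely occluded or out of frame
      continue
    points.append((landmark.x * image_width, landmark.y * image_height))
  return points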

POSE_WORLD_LANDMARKS
Another list of pose landmarks in world coordinates. Each landmark consists of the
following:

• x, y and z: Real-world 3D coordinates in meters with the origin at the center
between hips.
• visibility: Identical to that defined in the corresponding pose_landmarks.

FACE_LANDMARKS
A list of 468 face landmarks. Each landmark consists of x , y and z . x and y are
normalized to [0.0, 1.0] by the image width and height respectively. z represents the
landmark depth with the depth at center of the head being the origin, and the smaller
the value the closer the landmark is to the camera. The magnitude of z uses roughly
the same scale as x .

LEFT_HAND_LANDMARKS
A list of 21 hand landmarks on the left hand. Each landmark consists
of x , y and z . x and y are normalized to [0.0, 1.0] by the image width and height
respectively. z represents the landmark depth with the depth at the wrist being the
origin, and the smaller the value the closer the landmark is to the camera. The
magnitude of z uses roughly the same scale as x .

RIGHT_HAND_LANDMARKS
A list of 21 hand landmarks on the right hand, in the same representation
as left_hand_landmarks.

SEGMENTATION_MASK
The output segmentation mask, predicted only when enable_segmentation is set
to true . The mask has the same width and height as the input image, and contains
values in [0.0, 1.0] where 1.0 and 0.0 indicate high certainty of a “human” and
“background” pixel respectively. Please refer to the platform-specific usage examples
below for usage details.
Python Solution API
Please first follow general instructions to install MediaPipe Python package, then learn
more in the companion Python Colab and the usage example below.

Supported configuration options:

• static_image_mode
• model_complexity
• smooth_landmarks
• enable_segmentation
• smooth_segmentation
• refine_face_landmarks
• min_detection_confidence
• min_tracking_confidence
import cv2
import mediapipe as mp
import numpy as np
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles
mp_holistic = mp.solutions.holistic

# For static images:
IMAGE_FILES = []
BG_COLOR = (192, 192, 192) # gray
with mp_holistic.Holistic(
    static_image_mode=True,
    model_complexity=2,
    enable_segmentation=True,
    refine_face_landmarks=True) as holistic:
  for idx, file in enumerate(IMAGE_FILES):
    image = cv2.imread(file)
    image_height, image_width, _ = image.shape
    # Convert the BGR image to RGB before processing.
    results = holistic.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    if results.pose_landmarks:
      print(
          f'Nose coordinates: ('
          f'{results.pose_landmarks.landmark[mp_holistic.PoseLandmark.NOSE].x * image_width}, '
          f'{results.pose_landmarks.landmark[mp_holistic.PoseLandmark.NOSE].y * image_height})'
      )

    annotated_image = image.copy()
    # Draw segmentation on the image.
    # To improve segmentation around boundaries, consider applying a joint
    # bilateral filter to "results.segmentation_mask" with "image".
    condition = np.stack((results.segmentation_mask,) * 3, axis=-1) > 0.1
    bg_image = np.zeros(image.shape, dtype=np.uint8)
    bg_image[:] = BG_COLOR
    annotated_image = np.where(condition, annotated_image, bg_image)
    # Draw pose, left and right hands, and face landmarks on the image.
    mp_drawing.draw_landmarks(
        annotated_image,
        results.face_landmarks,
        mp_holistic.FACEMESH_TESSELATION,
        landmark_drawing_spec=None,
        connection_drawing_spec=mp_drawing_styles
        .get_default_face_mesh_tesselation_style())
    mp_drawing.draw_landmarks(
        annotated_image,
        results.pose_landmarks,
        mp_holistic.POSE_CONNECTIONS,
        landmark_drawing_spec=mp_drawing_styles
        .get_default_pose_landmarks_style())
    cv2.imwrite('/tmp/annotated_image' + str(idx) + '.png', annotated_image)
    # Plot pose world landmarks.
    mp_drawing.plot_landmarks(
        results.pose_world_landmarks, mp_holistic.POSE_CONNECTIONS)

# For webcam input:
cap = cv2.VideoCapture(0)
with mp_holistic.Holistic(
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5) as holistic:
  while cap.isOpened():
    success, image = cap.read()
    if not success:
      print("Ignoring empty camera frame.")
      # If loading a video, use 'break' instead of 'continue'.
      continue

    # To improve performance, optionally mark the image as not writeable to
    # pass by reference.
    image.flags.writeable = False
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = holistic.process(image)

    # Draw landmark annotation on the image.
    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    mp_drawing.draw_landmarks(
        image,
        results.face_landmarks,
        mp_holistic.FACEMESH_CONTOURS,
        landmark_drawing_spec=None,
        connection_drawing_spec=mp_drawing_styles
        .get_default_face_mesh_contours_style())
    mp_drawing.draw_landmarks(
        image,
        results.pose_landmarks,
        mp_holistic.POSE_CONNECTIONS,
        landmark_drawing_spec=mp_drawing_styles
        .get_default_pose_landmarks_style())
    # Flip the image horizontally for a selfie-view display.
    cv2.imshow('MediaPipe Holistic', cv2.flip(image, 1))
    if cv2.waitKey(5) & 0xFF == 27:
      break
cap.release()

JavaScript Solution API


Please first see general introduction on MediaPipe in JavaScript, then learn more in the
companion web demo and the following usage example.

Supported configuration options:

• modelComplexity
• smoothLandmarks
• enableSegmentation
• smoothSegmentation
• refineFaceLandmarks
• minDetectionConfidence
• minTrackingConfidence
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/camera_utils/camera_utils.js"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/control_utils/control_utils.js"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/drawing_utils/drawing_utils.js"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/holistic/holistic.js"
crossorigin="anonymous"></script>
</head>

<body>
<div class="container">
<video class="input_video"></video>
<canvas class="output_canvas" width="1280px" height="720px"></canvas>
</div>
</body>
</html>
<script type="module">
const videoElement = document.getElementsByClassName('input_video')[0];
const canvasElement = document.getElementsByClassName('output_canvas')[0];
const canvasCtx = canvasElement.getContext('2d');

function onResults(results) {
canvasCtx.save();
canvasCtx.clearRect(0, 0, canvasElement.width, canvasElement.height);
canvasCtx.drawImage(results.segmentationMask, 0, 0,
canvasElement.width, canvasElement.height);

// Only overwrite existing pixels.


canvasCtx.globalCompositeOperation = 'source-in';
canvasCtx.fillStyle = '#00FF00';
canvasCtx.fillRect(0, 0, canvasElement.width, canvasElement.height);

// Only overwrite missing pixels.


canvasCtx.globalCompositeOperation = 'destination-atop';
canvasCtx.drawImage(
results.image, 0, 0, canvasElement.width, canvasElement.height);
canvasCtx.globalCompositeOperation = 'source-over';
drawConnectors(canvasCtx, results.poseLandmarks, POSE_CONNECTIONS,
{color: '#00FF00', lineWidth: 4});
drawLandmarks(canvasCtx, results.poseLandmarks,
{color: '#FF0000', lineWidth: 2});
drawConnectors(canvasCtx, results.faceLandmarks, FACEMESH_TESSELATION,
{color: '#C0C0C070', lineWidth: 1});
drawConnectors(canvasCtx, results.leftHandLandmarks, HAND_CONNECTIONS,
{color: '#CC0000', lineWidth: 5});
drawLandmarks(canvasCtx, results.leftHandLandmarks,
{color: '#00FF00', lineWidth: 2});
drawConnectors(canvasCtx, results.rightHandLandmarks, HAND_CONNECTIONS,
{color: '#00CC00', lineWidth: 5});
drawLandmarks(canvasCtx, results.rightHandLandmarks,
{color: '#FF0000', lineWidth: 2});
canvasCtx.restore();
}

const holistic = new Holistic({locateFile: (file) => {
  return `https://cdn.jsdelivr.net/npm/@mediapipe/holistic/${file}`;
}});
holistic.setOptions({
modelComplexity: 1,
smoothLandmarks: true,
enableSegmentation: true,
smoothSegmentation: true,
refineFaceLandmarks: true,
minDetectionConfidence: 0.5,
minTrackingConfidence: 0.5
});
holistic.onResults(onResults);

const camera = new Camera(videoElement, {


onFrame: async () => {
await holistic.send({image: videoElement});
},
width: 1280,
height: 720
});
camera.start();
</script>

Example Apps
Please first see general instructions for Android, iOS, and desktop on how to build
MediaPipe examples.

Note: To visualize a graph, copy the graph and paste it into MediaPipe Visualizer. For
more information on how to visualize its associated subgraphs, please see visualizer
documentation.

Mobile
• Graph: mediapipe/graphs/holistic_tracking/holistic_tracking_gpu.pbtxt
• Android target: mediapipe/examples/android/src/java/com/google/mediapipe/apps/holistictrackinggpu:holistictrackinggpu (or download prebuilt ARM64 APK)

• iOS target: mediapipe/examples/ios/holistictrackinggpu:HolisticTrackingGpuApp

MediaPipe Pose

Overview
Human pose estimation from video plays a critical role in various applications such
as quantifying physical exercises, sign language recognition, and full-body gesture
control. For example, it can form the basis for yoga, dance, and fitness applications. It
can also enable the overlay of digital content and information on top of the physical
world in augmented reality.

MediaPipe Pose is an ML solution for high-fidelity body pose tracking, inferring 33 3D
landmarks and a background segmentation mask on the whole body from RGB video
frames, utilizing our BlazePose research that also powers the ML Kit Pose Detection API.
Current state-of-the-art approaches rely primarily on powerful desktop environments
for inference, whereas our method achieves real-time performance on most
modern mobile phones, desktops/laptops, in Python and even on the web.

Fig 1. Example of MediaPipe Pose for pose tracking.

ML Pipeline
The solution utilizes a two-step detector-tracker ML pipeline, proven to be effective in
our MediaPipe Hands and MediaPipe Face Mesh solutions. Using a detector, the
pipeline first locates the person/pose region-of-interest (ROI) within the frame. The
tracker subsequently predicts the pose landmarks and segmentation mask within the
ROI using the ROI-cropped frame as input. Note that for video use cases the detector is
invoked only as needed, i.e., for the very first frame and when the tracker could no
longer identify body pose presence in the previous frame. For other frames the pipeline
simply derives the ROI from the previous frame’s pose landmarks.

The pipeline is implemented as a MediaPipe graph that uses a pose landmark
subgraph from the pose landmark module and renders using a dedicated pose renderer
subgraph. The pose landmark subgraph internally uses a pose detection subgraph from
the pose detection module.

Note: To visualize a graph, copy the graph and paste it into MediaPipe Visualizer. For
more information on how to visualize its associated subgraphs, please see visualizer
documentation.

Pose Estimation Quality


To evaluate the quality of our models against other well-performing publicly available
solutions, we use three different validation datasets, representing different verticals:
Yoga, Dance and HIIT. Each image contains only a single person located 2-4 meters
from the camera. To be consistent with other solutions, we perform evaluation only for
17 keypoints from COCO topology.

Method | Yoga mAP | Yoga PCK@0.2 | Dance mAP | Dance PCK@0.2 | HIIT mAP | HIIT PCK@0.2
BlazePose GHUM Heavy | 68.1 | 96.4 | 73.0 | 97.2 | 74.0 | 97.5
BlazePose GHUM Full | 62.6 | 95.5 | 67.4 | 96.3 | 68.0 | 95.7
BlazePose GHUM Lite | 45.0 | 90.2 | 53.6 | 92.5 | 53.8 | 93.5
AlphaPose ResNet50 | 63.4 | 96.0 | 57.8 | 95.5 | 63.4 | 96.0
Apple Vision | 32.8 | 82.7 | 36.4 | 91.4 | 44.5 | 88.6

Fig 2. Quality evaluation in PCK@0.2.

We designed our models specifically for live perception use cases, so all of them work in
real-time on the majority of modern devices.

Method | Latency, Pixel 3 TFLite GPU | Latency, MacBook Pro (15-inch 2017)
BlazePose GHUM Heavy | 53 ms | 38 ms
BlazePose GHUM Full | 25 ms | 27 ms
BlazePose GHUM Lite | 20 ms | 25 ms

Models

Person/pose Detection Model (BlazePose Detector)


The detector is inspired by our own lightweight BlazeFace model, used in MediaPipe
Face Detection, as a proxy for a person detector. It explicitly predicts two additional
virtual keypoints that firmly describe the human body center, rotation and scale as a
circle. Inspired by Leonardo’s Vitruvian man, we predict the midpoint of a person’s hips,
the radius of a circle circumscribing the whole person, and the incline angle of the line
connecting the shoulder and hip midpoints.

Fig 3. Vitruvian man aligned via two virtual keypoints predicted by BlazePose detector in addition to the
face bounding box.
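
The alignment can be reproduced with basic trigonometry. The sketch below is illustrative only (the function and argument names are not the detector's actual output format): it derives a rotation angle and a square crop size from the two virtual keypoints.

import math

def pose_roi_from_virtual_keypoints(hip_mid, shoulder_mid, circle_radius):
  """Derive an ROI center, rotation and size from the two virtual keypoints (sketch).

  hip_mid, shoulder_mid: (x, y) points in image coordinates (y grows downwards).
  circle_radius: radius of the circle circumscribing the whole person.
  """
  dx = shoulder_mid[0] - hip_mid[0]
  dy = shoulder_mid[1] - hip_mid[1]
  # Incline of the hip-to-shoulder line, measured against the vertical axis.
  rotation = math.atan2(dx, -dy)
  # A square crop whose side covers the circumscribing circle.
  roi_size = 2.0 * circle_radius
  return hip_mid, rotation, roi_size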

Pose Landmark Model (BlazePose GHUM 3D)


The landmark model in MediaPipe Pose predicts the location of 33 pose landmarks (see
figure below).
Fig 4. 33 pose landmarks.

Optionally, MediaPipe Pose can predict a full-body segmentation mask represented as
a two-class segmentation (human or background).

Please find more detail in the BlazePose Google AI Blog, this paper, the model card and
the Output section below.

Solution APIs

Cross-platform Configuration Options


Naming style and availability may differ slightly across platforms/languages.

STATIC_IMAGE_MODE
If set to false, the solution treats the input images as a video stream. It will try to detect
the most prominent person in the very first images, and upon a successful detection
further localizes the pose landmarks. In subsequent images, it then simply tracks those
landmarks without invoking another detection until it loses track, thereby reducing
computation and latency. If set to true, person detection runs on every input image, ideal
for processing a batch of static, possibly unrelated, images. Default to false.
MODEL_COMPLEXITY
Complexity of the pose landmark model: 0 , 1 or 2 . Landmark accuracy as well as
inference latency generally go up with the model complexity. Default to 1 .

SMOOTH_LANDMARKS
If set to true , the solution filters pose landmarks across different input images to reduce
jitter, but ignored if static_image_mode is also set to true . Default to true .

ENABLE_SEGMENTATION
If set to true , in addition to the pose landmarks the solution also generates the
segmentation mask. Default to false .

SMOOTH_SEGMENTATION
If set to true , the solution filters segmentation masks across different input images to
reduce jitter. Ignored if enable_segmentation is false or static_image_mode is true .
Default to true .

MIN_DETECTION_CONFIDENCE
Minimum confidence value ( [0.0, 1.0] ) from the person-detection model for the
detection to be considered successful. Default to 0.5 .

MIN_TRACKING_CONFIDENCE


Minimum confidence value ( [0.0, 1.0] ) from the landmark-tracking model for the pose
landmarks to be considered tracked successfully, or otherwise person detection will be
invoked automatically on the next input image. Setting it to a higher value can increase
robustness of the solution, at the expense of a higher latency. Ignored
if static_image_mode is true , where person detection simply runs on every image.
Default to 0.5 .

Output
Naming style may differ slightly across platforms/languages.

POSE_LANDMARKS
A list of pose landmarks. Each landmark consists of the following:
• x and y : Landmark coordinates normalized to [0.0, 1.0] by the image width and
height respectively.
• z: Represents the landmark depth with the depth at the midpoint of hips being the
origin, and the smaller the value the closer the landmark is to the camera. The
magnitude of z uses roughly the same scale as x .
• visibility : A value in [0.0, 1.0] indicating the likelihood of the landmark being visible
(present and not occluded) in the image.

POSE_WORLD_LANDMARKS


Fig 5. Example of MediaPipe Pose real-world 3D coordinates.

Another list of pose landmarks in world coordinates. Each landmark consists of the
following:

• x, y and z : Real-world 3D coordinates in meters with the origin at the center between
hips.
• visibility : Identical to that defined in the corresponding pose_landmarks.
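
Because these world landmarks are metric, simple physical measurements fall out directly. A sketch (the helper name is illustrative; results is assumed to come from Pose.process()) that estimates shoulder width in meters:

import math
import mediapipe as mp

mp_pose = mp.solutions.pose

def shoulder_width_meters(results):
  """Euclidean distance between the shoulders in meters, or None if no pose."""
  if results.pose_world_landmarks is None:
    return None
  lm = results.pose_world_landmarks.landmark
  left = lm[mp_pose.PoseLandmark.LEFT_SHOULDER]
  right = lm[mp_pose.PoseLandmark.RIGHT_SHOULDER]
  return math.dist((left.x, left.y, left.z), (right.x, right.y, right.z))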

SEGMENTATION_MASK
The output segmentation mask, predicted only when enable_segmentation is set
to true . The mask has the same width and height as the input image, and contains
values in [0.0, 1.0] where 1.0 and 0.0 indicate high certainty of a “human” and
“background” pixel respectively. Please refer to the platform-specific usage examples
below for usage details.

Fig 6. Example of MediaPipe Pose segmentation mask.

Python Solution API


Please first follow general instructions to install MediaPipe Python package, then learn
more in the companion Python Colab and the usage example below.

Supported configuration options:

• static_image_mode
• model_complexity
• smooth_landmarks
• enable_segmentation
• smooth_segmentation
• min_detection_confidence
• min_tracking_confidence
import cv2
import mediapipe as mp
import numpy as np
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles
mp_pose = mp.solutions.pose

# For static images:
IMAGE_FILES = []
BG_COLOR = (192, 192, 192) # gray
with mp_pose.Pose(
    static_image_mode=True,
    model_complexity=2,
    enable_segmentation=True,
    min_detection_confidence=0.5) as pose:
  for idx, file in enumerate(IMAGE_FILES):
    image = cv2.imread(file)
    image_height, image_width, _ = image.shape
    # Convert the BGR image to RGB before processing.
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    if not results.pose_landmarks:
      continue
    print(
        f'Nose coordinates: ('
        f'{results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE].x * image_width}, '
        f'{results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE].y * image_height})'
    )

    annotated_image = image.copy()
    # Draw segmentation on the image.
    # To improve segmentation around boundaries, consider applying a joint
    # bilateral filter to "results.segmentation_mask" with "image".
    condition = np.stack((results.segmentation_mask,) * 3, axis=-1) > 0.1
    bg_image = np.zeros(image.shape, dtype=np.uint8)
    bg_image[:] = BG_COLOR
    annotated_image = np.where(condition, annotated_image, bg_image)
    # Draw pose landmarks on the image.
    mp_drawing.draw_landmarks(
        annotated_image,
        results.pose_landmarks,
        mp_pose.POSE_CONNECTIONS,
        landmark_drawing_spec=mp_drawing_styles.get_default_pose_landmarks_style())
    cv2.imwrite('/tmp/annotated_image' + str(idx) + '.png', annotated_image)
    # Plot pose world landmarks.
    mp_drawing.plot_landmarks(
        results.pose_world_landmarks, mp_pose.POSE_CONNECTIONS)

# For webcam input:
cap = cv2.VideoCapture(0)
with mp_pose.Pose(
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5) as pose:
  while cap.isOpened():
    success, image = cap.read()
    if not success:
      print("Ignoring empty camera frame.")
      # If loading a video, use 'break' instead of 'continue'.
      continue

    # To improve performance, optionally mark the image as not writeable to
    # pass by reference.
    image.flags.writeable = False
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = pose.process(image)

    # Draw the pose annotation on the image.
    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    mp_drawing.draw_landmarks(
        image,
        results.pose_landmarks,
        mp_pose.POSE_CONNECTIONS,
        landmark_drawing_spec=mp_drawing_styles.get_default_pose_landmarks_style())
    # Flip the image horizontally for a selfie-view display.
    cv2.imshow('MediaPipe Pose', cv2.flip(image, 1))
    if cv2.waitKey(5) & 0xFF == 27:
      break
cap.release()

JavaScript Solution API


Please first see general introduction on MediaPipe in JavaScript, then learn more in the
companion web demo and the following usage example.

Supported configuration options:

• modelComplexity
• smoothLandmarks
• enableSegmentation
• smoothSegmentation
• minDetectionConfidence
• minTrackingConfidence
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/camera_utils/camera_utils.js"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/control_utils/control_utils.js"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/control_utils_3d/control_utils_3d.js"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/drawing_utils/drawing_utils.js"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/pose/pose.js"
crossorigin="anonymous"></script>
</head>

<body>
<div class="container">
<video class="input_video"></video>
<canvas class="output_canvas" width="1280px" height="720px"></canvas>
<div class="landmark-grid-container"></div>
</div>
</body>
</html>
<script type="module">
const videoElement = document.getElementsByClassName('input_video')[0];
const canvasElement = document.getElementsByClassName('output_canvas')[0];
const canvasCtx = canvasElement.getContext('2d');
const landmarkContainer = document.getElementsByClassName('landmark-grid-container')[0];
const grid = new LandmarkGrid(landmarkContainer);

function onResults(results) {
if (!results.poseLandmarks) {
grid.updateLandmarks([]);
return;
}

canvasCtx.save();
canvasCtx.clearRect(0, 0, canvasElement.width, canvasElement.height);
canvasCtx.drawImage(results.segmentationMask, 0, 0,
canvasElement.width, canvasElement.height);

// Only overwrite existing pixels.


canvasCtx.globalCompositeOperation = 'source-in';
canvasCtx.fillStyle = '#00FF00';
canvasCtx.fillRect(0, 0, canvasElement.width, canvasElement.height);

// Only overwrite missing pixels.


canvasCtx.globalCompositeOperation = 'destination-atop';
canvasCtx.drawImage(
results.image, 0, 0, canvasElement.width, canvasElement.height);

canvasCtx.globalCompositeOperation = 'source-over';
drawConnectors(canvasCtx, results.poseLandmarks, POSE_CONNECTIONS,
{color: '#00FF00', lineWidth: 4});
drawLandmarks(canvasCtx, results.poseLandmarks,
{color: '#FF0000', lineWidth: 2});
canvasCtx.restore();

grid.updateLandmarks(results.poseWorldLandmarks);
}

const pose = new Pose({locateFile: (file) => {
  return `https://cdn.jsdelivr.net/npm/@mediapipe/pose/${file}`;
}});
pose.setOptions({
modelComplexity: 1,
smoothLandmarks: true,
enableSegmentation: true,
smoothSegmentation: true,
minDetectionConfidence: 0.5,
minTrackingConfidence: 0.5
});
pose.onResults(onResults);

const camera = new Camera(videoElement, {


onFrame: async () => {
await pose.send({image: videoElement});
},
width: 1280,
height: 720
});
camera.start();
</script>

Example Apps
Please first see general instructions for Android, iOS, and desktop on how to build
MediaPipe examples.

Note: To visualize a graph, copy the graph and paste it into MediaPipe Visualizer. For
more information on how to visualize its associated subgraphs, please see visualizer
documentation.

MediaPipe Face Mesh

Overview
MediaPipe Face Mesh is a solution that estimates 468 3D face landmarks in real-time
even on mobile devices. It employs machine learning (ML) to infer the 3D facial surface,
requiring only a single camera input without the need for a dedicated depth sensor.
Utilizing lightweight model architectures together with GPU acceleration throughout the
pipeline, the solution delivers real-time performance critical for live experiences.

Additionally, the solution is bundled with the Face Transform module that bridges the
gap between the face landmark estimation and useful real-time augmented reality (AR)
applications. It establishes a metric 3D space and uses the face landmark screen
positions to estimate a face transform within that space. The face transform data
consists of common 3D primitives, including a face pose transformation matrix and a
triangular face mesh. Under the hood, a lightweight statistical analysis method
called Procrustes Analysis is employed to drive a robust, performant and portable logic.
The analysis runs on CPU and has a minimal speed/memory footprint on top of the ML
model inference.
Fig 1. AR effects utilizing the 3D facial surface.

ML Pipeline
Our ML pipeline consists of two real-time deep neural network models that work
together: A detector that operates on the full image and computes face locations and a
3D face landmark model that operates on those locations and predicts the approximate
3D surface via regression. Having the face accurately cropped drastically reduces the
need for common data augmentations like affine transformations consisting of
rotations, translation and scale changes. Instead it allows the network to dedicate most
of its capacity towards coordinate prediction accuracy. In addition, in our pipeline the
crops can also be generated based on the face landmarks identified in the previous
frame, and only when the landmark model could no longer identify face presence is the
face detector invoked to relocalize the face. This strategy is similar to that employed in
our MediaPipe Hands solution, which uses a palm detector together with a hand
landmark model.

The pipeline is implemented as a MediaPipe graph that uses a face landmark
subgraph from the face landmark module, and renders using a dedicated face renderer
subgraph. The face landmark subgraph internally uses a face_detection_subgraph from
the face detection module.
Note: To visualize a graph, copy the graph and paste it into MediaPipe Visualizer. For
more information on how to visualize its associated subgraphs, please see visualizer
documentation.

Models

FACE DETECTION MODEL
The face detector is the same BlazeFace model used in MediaPipe Face Detection.
Please refer to MediaPipe Face Detection for details.

FACE LANDMARK MODEL
For 3D face landmarks we employed transfer learning and trained a network with several
objectives: the network simultaneously predicts 3D landmark coordinates on synthetic
rendered data and 2D semantic contours on annotated real-world data. The resulting
network provided us with reasonable 3D landmark predictions not just on synthetic but
also on real-world data.

The 3D landmark network receives as input a cropped video frame without additional
depth input. The model outputs the positions of the 3D points, as well as the probability
of a face being present and reasonably aligned in the input. A common alternative
approach is to predict a 2D heatmap for each landmark, but it is not amenable to depth
prediction and has high computational costs for so many points. We further improve the
accuracy and robustness of our model by iteratively bootstrapping and refining
predictions. That way we can grow our dataset to increasingly challenging cases, such as
grimaces, oblique angle and occlusions.

You can find more information about the face landmark model in this paper.
Fig 2. Face landmarks: the red box indicates the cropped area as input to the landmark model, the red
dots represent the 468 landmarks in 3D, and the green lines connecting landmarks illustrate the
contours around the eyes, eyebrows, lips and the entire face.

ATTENTION MESH MODEL
In addition to the Face Landmark Model we provide another model that
applies attention to semantically meaningful face regions, and therefore predicts
landmarks more accurately around lips, eyes and irises, at the expense of more
compute. It enables applications like AR makeup and AR puppeteering.
The attention mesh model can be selected in the Solution APIs via
the refine_landmarks option. You can also find more information about the model in
this paper.

Fig 3. Attention Mesh: Overview of model architecture.

Face Transform Module


The Face Landmark Model performs a single-camera face landmark detection in the
screen coordinate space: the X- and Y- coordinates are normalized screen coordinates,
while the Z coordinate is relative and is scaled as the X coordinate under the weak
perspective projection camera model. This format is well-suited for some applications,
however it does not directly enable the full spectrum of augmented reality (AR) features
like aligning a virtual 3D object with a detected face.

The Face Transform module moves away from the screen coordinate space towards a
metric 3D space and provides necessary primitives to handle a detected face as a
regular 3D object. By design, you’ll be able to use a perspective camera to project the
final 3D scene back into the screen coordinate space with a guarantee that the face
landmark positions are not changed.

Key Concepts

METRIC 3D SPACE
The Metric 3D space established within the Face Transform module is a right-handed
orthonormal metric 3D coordinate space. Within the space, there is a virtual
perspective camera located at the space origin and pointed in the negative direction of
the Z-axis. In the current pipeline, it is assumed that the input camera frames are
observed by exactly this virtual camera and therefore its parameters are later used to
convert the screen landmark coordinates back into the Metric 3D space. The virtual
camera parameters can be set freely, however for better results it is advised to set them
as close to the real physical camera parameters as possible.

Fig 4. A visualization of multiple key elements in the Metric 3D space.

CANONICAL FACE MODEL
The Canonical Face Model is a static 3D model of a human face, which follows the 468
3D face landmark topology of the Face Landmark Model. The model bears two
important functions:

• Defines metric units: the scale of the canonical face model defines the metric units of
the Metric 3D space. A metric unit used by the default canonical face model is a
centimeter;
• Bridges static and runtime spaces: the face pose transformation matrix is - in fact - a
linear map from the canonical face model into the runtime face landmark set estimated
on each frame. This way, virtual 3D assets modeled around the canonical face model can
be aligned with a tracked face by applying the face pose transformation matrix to them.
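
As a sketch of the second point, the snippet below is a generic homogeneous-transform application in NumPy (not the MediaPipe API itself): it applies a 4x4 face pose transformation matrix to vertices modeled around the canonical face.

import numpy as np

def apply_face_pose_transform(pose_transform_4x4, vertices_xyz):
  """Apply a 4x4 face pose transformation matrix to an (N, 3) array of vertices."""
  vertices_xyz = np.asarray(vertices_xyz, dtype=np.float32)
  ones = np.ones((vertices_xyz.shape[0], 1), dtype=np.float32)
  # Promote to homogeneous coordinates, transform, then drop the w component.
  homogeneous = np.concatenate([vertices_xyz, ones], axis=1)     # (N, 4)
  transformed = homogeneous @ np.asarray(pose_transform_4x4).T   # (N, 4)
  return transformed[:, :3]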

Components

TRANSFORM PIPELINE
The Transform Pipeline is a key component, which is responsible for estimating the
face transform objects within the Metric 3D space. On each frame, the following steps
are executed in the given order:

• Face landmark screen coordinates are converted into the Metric 3D space coordinates;
• Face pose transformation matrix is estimated as a rigid linear mapping from the
canonical face metric landmark set into the runtime face metric landmark set in a way
that minimizes a difference between the two;
• A face mesh is created using the runtime face metric landmarks as the vertex positions
(XYZ), while both the vertex texture coordinates (UV) and the triangular topology are
inherited from the canonical face model.

The transform pipeline is implemented as a MediaPipe calculator. For your convenience,
this calculator is bundled together with corresponding metadata into a unified
MediaPipe subgraph. The face transform format is defined as a Protocol Buffer message.

EFFECT RENDERER
The Effect Renderer is a component, which serves as a working example of a face effect
renderer. It targets the OpenGL ES 2.0 API to enable a real-time performance on mobile
devices and supports the following rendering modes:

• 3D object rendering mode: a virtual object is aligned with a detected face to emulate
an object attached to the face (example: glasses);
• Face mesh rendering mode: a texture is stretched on top of the face mesh surface to
emulate a face painting technique.

In both rendering modes, the face mesh is first rendered as an occluder straight into the
depth buffer. This step helps to create a more believable effect via hiding invisible
elements behind the face surface.

The effect renderer is implemented as a MediaPipe calculator.


Fig 5. An example of face effects rendered by the Face Transform Effect Renderer.

Solution APIs

Configuration Options
Naming style and availability may differ slightly across platforms/languages.

STATIC_IMAGE_MODE
If set to false , the solution treats the input images as a video stream. It will try to detect
faces in the first input images, and upon a successful detection further localizes the face
landmarks. In subsequent images, once all max_num_faces faces are detected and the
corresponding face landmarks are localized, it simply tracks those landmarks without
invoking another detection until it loses track of any of the faces. This reduces latency
and is ideal for processing video frames. If set to true , face detection runs on every
input image, ideal for processing a batch of static, possibly unrelated, images. Default
to false .

MAX_NUM_FACES
Maximum number of faces to detect. Default to 1.

REFINE_LANDMARKS
Whether to further refine the landmark coordinates around the eyes and lips, and
output additional landmarks around the irises by applying the Attention Mesh Model.
Default to false .

MIN_DETECTION_CONFIDENCE
Minimum confidence value ( [0.0, 1.0] ) from the face detection model for the detection
to be considered successful. Default to 0.5 .

MIN_TRACKING_CONFIDENCE


Minimum confidence value ( [0.0, 1.0] ) from the landmark-tracking model for the face
landmarks to be considered tracked successfully, or otherwise face detection will be
invoked automatically on the next input image. Setting it to a higher value can increase
robustness of the solution, at the expense of a higher latency. Ignored
if static_image_mode is true , where face detection simply runs on every image. Default
to 0.5 .

Output
Naming style may differ slightly across platforms/languages.

MULTI_FACE_LANDMARKS
Collection of detected/tracked faces, where each face is represented as a list of 468 face
landmarks and each landmark is composed of x , y and z . x and y are normalized
to [0.0, 1.0] by the image width and height respectively. z represents the landmark
depth with the depth at center of the head being the origin, and the smaller the value
the closer the landmark is to the camera. The magnitude of z uses roughly the same
scale as x .
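
A sketch (the helper name is illustrative; results is assumed to come from FaceMesh.process()) of converting the first detected face's normalized landmarks into pixel coordinates:

def face_landmarks_to_pixels(results, image_width, image_height):
  """Return (x_px, y_px, z) tuples for the first detected face, or an empty list."""
  if not results.multi_face_landmarks:
    return []
  face = results.multi_face_landmarks[0]
  return [(lm.x * image_width, lm.y * image_height, lm.z)
          for lm in face.landmark]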

Python Solution API


Please first follow general instructions to install MediaPipe Python package, then learn
more in the companion Python Colab and the usage example below.

Supported configuration options:

• static_image_mode
• max_num_faces
• refine_landmarks
• min_detection_confidence
• min_tracking_confidence
import cv2
import mediapipe as mp
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles
mp_face_mesh = mp.solutions.face_mesh

# For static images:
IMAGE_FILES = []
drawing_spec = mp_drawing.DrawingSpec(thickness=1, circle_radius=1)
with mp_face_mesh.FaceMesh(
    static_image_mode=True,
    max_num_faces=1,
    refine_landmarks=True,
    min_detection_confidence=0.5) as face_mesh:
  for idx, file in enumerate(IMAGE_FILES):
    image = cv2.imread(file)
    # Convert the BGR image to RGB before processing.
    results = face_mesh.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    # Print and draw face mesh landmarks on the image.
    if not results.multi_face_landmarks:
      continue
    annotated_image = image.copy()
    for face_landmarks in results.multi_face_landmarks:
      print('face_landmarks:', face_landmarks)
      mp_drawing.draw_landmarks(
          image=annotated_image,
          landmark_list=face_landmarks,
          connections=mp_face_mesh.FACEMESH_TESSELATION,
          landmark_drawing_spec=None,
          connection_drawing_spec=mp_drawing_styles
          .get_default_face_mesh_tesselation_style())
      mp_drawing.draw_landmarks(
          image=annotated_image,
          landmark_list=face_landmarks,
          connections=mp_face_mesh.FACEMESH_CONTOURS,
          landmark_drawing_spec=None,
          connection_drawing_spec=mp_drawing_styles
          .get_default_face_mesh_contours_style())
      mp_drawing.draw_landmarks(
          image=annotated_image,
          landmark_list=face_landmarks,
          connections=mp_face_mesh.FACEMESH_IRISES,
          landmark_drawing_spec=None,
          connection_drawing_spec=mp_drawing_styles
          .get_default_face_mesh_iris_connections_style())
    cv2.imwrite('/tmp/annotated_image' + str(idx) + '.png', annotated_image)

# For webcam input:
drawing_spec = mp_drawing.DrawingSpec(thickness=1, circle_radius=1)
cap = cv2.VideoCapture(0)
with mp_face_mesh.FaceMesh(
    max_num_faces=1,
    refine_landmarks=True,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5) as face_mesh:
  while cap.isOpened():
    success, image = cap.read()
    if not success:
      print("Ignoring empty camera frame.")
      # If loading a video, use 'break' instead of 'continue'.
      continue

    # To improve performance, optionally mark the image as not writeable to
    # pass by reference.
    image.flags.writeable = False
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = face_mesh.process(image)

    # Draw the face mesh annotations on the image.
    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    if results.multi_face_landmarks:
      for face_landmarks in results.multi_face_landmarks:
        mp_drawing.draw_landmarks(
            image=image,
            landmark_list=face_landmarks,
            connections=mp_face_mesh.FACEMESH_TESSELATION,
            landmark_drawing_spec=None,
            connection_drawing_spec=mp_drawing_styles
            .get_default_face_mesh_tesselation_style())
        mp_drawing.draw_landmarks(
            image=image,
            landmark_list=face_landmarks,
            connections=mp_face_mesh.FACEMESH_CONTOURS,
            landmark_drawing_spec=None,
            connection_drawing_spec=mp_drawing_styles
            .get_default_face_mesh_contours_style())
        mp_drawing.draw_landmarks(
            image=image,
            landmark_list=face_landmarks,
            connections=mp_face_mesh.FACEMESH_IRISES,
            landmark_drawing_spec=None,
            connection_drawing_spec=mp_drawing_styles
            .get_default_face_mesh_iris_connections_style())
    # Flip the image horizontally for a selfie-view display.
    cv2.imshow('MediaPipe Face Mesh', cv2.flip(image, 1))
    if cv2.waitKey(5) & 0xFF == 27:
      break
cap.release()

JavaScript Solution API


Please first see general introduction on MediaPipe in JavaScript, then learn more in the
companion web demo and the following usage example.

Supported configuration options:

• maxNumFaces
• refineLandmarks
• minDetectionConfidence
• minTrackingConfidence
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/camera_utils/camera_utils.js"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/control_utils/control_utils.js"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/drawing_utils/drawing_utils.js"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/face_mesh/face_mesh.js"
crossorigin="anonymous"></script>
</head>

<body>
<div class="container">
<video class="input_video"></video>
<canvas class="output_canvas" width="1280px" height="720px"></canvas>
</div>
</body>
</html>
<script type="module">
const videoElement = document.getElementsByClassName('input_video')[0];
const canvasElement = document.getElementsByClassName('output_canvas')[0];
const canvasCtx = canvasElement.getContext('2d');

function onResults(results) {
canvasCtx.save();
canvasCtx.clearRect(0, 0, canvasElement.width, canvasElement.height);
canvasCtx.drawImage(
results.image, 0, 0, canvasElement.width, canvasElement.height);
if (results.multiFaceLandmarks) {
for (const landmarks of results.multiFaceLandmarks) {
drawConnectors(canvasCtx, landmarks, FACEMESH_TESSELATION,
{color: '#C0C0C070', lineWidth: 1});
drawConnectors(canvasCtx, landmarks, FACEMESH_RIGHT_EYE, {color: '#FF3030'});
drawConnectors(canvasCtx, landmarks, FACEMESH_RIGHT_EYEBROW, {color: '#FF3030'});
drawConnectors(canvasCtx, landmarks, FACEMESH_RIGHT_IRIS, {color: '#FF3030'});
drawConnectors(canvasCtx, landmarks, FACEMESH_LEFT_EYE, {color: '#30FF30'});
drawConnectors(canvasCtx, landmarks, FACEMESH_LEFT_EYEBROW, {color: '#30FF30'});
drawConnectors(canvasCtx, landmarks, FACEMESH_LEFT_IRIS, {color: '#30FF30'});
drawConnectors(canvasCtx, landmarks, FACEMESH_FACE_OVAL, {color: '#E0E0E0'});
drawConnectors(canvasCtx, landmarks, FACEMESH_LIPS, {color: '#E0E0E0'});
}
}
canvasCtx.restore();
}

const faceMesh = new FaceMesh({locateFile: (file) => {
  return `https://cdn.jsdelivr.net/npm/@mediapipe/face_mesh/${file}`;
}});
faceMesh.setOptions({
maxNumFaces: 1,
refineLandmarks: true,
minDetectionConfidence: 0.5,
minTrackingConfidence: 0.5
});
faceMesh.onResults(onResults);

const camera = new Camera(videoElement, {


onFrame: async () => {
await faceMesh.send({image: videoElement});
},
width: 1280,
height: 720
});
camera.start();
</script>

Android Solution API


Please first follow general instructions to add MediaPipe Gradle dependencies and try
the Android Solution API in the companion example Android Studio project, and learn
more in the usage example below.

Supported configuration options:

• staticImageMode
• maxNumFaces
• refineLandmarks
• runOnGpu: Run the pipeline and the model inference on GPU or CPU.

CAMERA INPUT
// For camera input and result rendering with OpenGL.
FaceMeshOptions faceMeshOptions =
FaceMeshOptions.builder()
.setStaticImageMode(false)
.setRefineLandmarks(true)
.setMaxNumFaces(1)
.setRunOnGpu(true).build();
FaceMesh faceMesh = new FaceMesh(this, faceMeshOptions);
faceMesh.setErrorListener(
(message, e) -> Log.e(TAG, "MediaPipe Face Mesh error:" + message));

// Initializes a new CameraInput instance and connects it to MediaPipe Face Mesh Solution.
CameraInput cameraInput = new CameraInput(this);
cameraInput.setNewFrameListener(
textureFrame -> faceMesh.send(textureFrame));

// Initializes a new GlSurfaceView with a ResultGlRenderer<FaceMeshResult> instance
// that provides the interfaces to run user-defined OpenGL rendering code.
// See mediapipe/examples/android/solutions/facemesh/src/main/java/com/google/mediapipe/examples/facemesh/FaceMeshResultGlRenderer.java
// as an example.
SolutionGlSurfaceView<FaceMeshResult> glSurfaceView =
new SolutionGlSurfaceView<>(
this, faceMesh.getGlContext(), faceMesh.getGlMajorVersion());
glSurfaceView.setSolutionResultRenderer(new FaceMeshResultGlRenderer());
glSurfaceView.setRenderInputImage(true);

faceMesh.setResultListener(
faceMeshResult -> {
NormalizedLandmark noseLandmark =
faceMeshResult.multiFaceLandmarks().get(0).getLandmarkList().get(1);
Log.i(
TAG,
String.format(
"MediaPipe Face Mesh nose normalized coordinates (value range: [0, 1]): x=%f, y=%f",
noseLandmark.getX(), noseLandmark.getY()));
// Request GL rendering.
glSurfaceView.setRenderData(faceMeshResult);
glSurfaceView.requestRender();
});

// The runnable to start camera after the GLSurfaceView is attached.


glSurfaceView.post(
() ->
cameraInput.start(
this,
faceMesh.getGlContext(),
CameraInput.CameraFacing.FRONT,
glSurfaceView.getWidth(),
glSurfaceView.getHeight()));

IMAGE INPUT
// For reading images from gallery and drawing the output in an ImageView.
FaceMeshOptions faceMeshOptions =
FaceMeshOptions.builder()
.setStaticImageMode(true)
.setRefineLandmarks(true)
.setMaxNumFaces(1)
.setRunOnGpu(true).build();
FaceMesh faceMesh = new FaceMesh(this, faceMeshOptions);

// Connects MediaPipe Face Mesh Solution to the user-defined ImageView instance
// that allows users to have the custom drawing of the output landmarks on it.
// See mediapipe/examples/android/solutions/facemesh/src/main/java/com/google/mediapipe/examples/facemesh/FaceMeshResultImageView.java
// as an example.
FaceMeshResultImageView imageView = new FaceMeshResultImageView(this);
faceMesh.setResultListener(
faceMeshResult -> {
int width = faceMeshResult.inputBitmap().getWidth();
int height = faceMeshResult.inputBitmap().getHeight();
NormalizedLandmark noseLandmark =
faceMeshResult.multiFaceLandmarks().get(0).getLandmarkList().get(1);
Log.i(
TAG,
String.format(
"MediaPipe Face Mesh nose coordinates (pixel values): x=%f, y=%f",
noseLandmark.getX() * width, noseLandmark.getY() * height));
// Request canvas drawing.
imageView.setFaceMeshResult(faceMeshResult);
runOnUiThread(() -> imageView.update());
});
faceMesh.setErrorListener(
(message, e) -> Log.e(TAG, "MediaPipe Face Mesh error:" + message));

// ActivityResultLauncher to get an image from the gallery as Bitmap.


ActivityResultLauncher<Intent> imageGetter =
registerForActivityResult(
new ActivityResultContracts.StartActivityForResult(),
result -> {
Intent resultIntent = result.getData();
if (resultIntent != null && result.getResultCode() == RESULT_OK) {
Bitmap bitmap = null;
try {
bitmap =
MediaStore.Images.Media.getBitmap(
this.getContentResolver(), resultIntent.getData());
// Please also rotate the Bitmap based on its orientation.
} catch (IOException e) {
Log.e(TAG, "Bitmap reading error:" + e);
}
if (bitmap != null) {
faceMesh.send(bitmap);
}
}
});
Intent pickImageIntent = new Intent(Intent.ACTION_PICK);
pickImageIntent.setDataAndType(MediaStore.Images.Media.INTERNAL_CONTENT_URI, "image/*");
imageGetter.launch(pickImageIntent);

VIDEO INPUT
// For video input and result rendering with OpenGL.
FaceMeshOptions faceMeshOptions =
FaceMeshOptions.builder()
.setStaticImageMode(false)
.setRefineLandmarks(true)
.setMaxNumFaces(1)
.setRunOnGpu(true).build();
FaceMesh faceMesh = new FaceMesh(this, faceMeshOptions);
faceMesh.setErrorListener(
(message, e) -> Log.e(TAG, "MediaPipe Face Mesh error:" + message));

// Initializes a new VideoInput instance and connects it to MediaPipe Face Mesh Solution.
VideoInput videoInput = new VideoInput(this);
videoInput.setNewFrameListener(
textureFrame -> faceMesh.send(textureFrame));

// Initializes a new GlSurfaceView with a ResultGlRenderer<FaceMeshResult> instance
// that provides the interfaces to run user-defined OpenGL rendering code.
// See mediapipe/examples/android/solutions/facemesh/src/main/java/com/google/mediapipe/examples/facemesh/FaceMeshResultGlRenderer.java
// as an example.
SolutionGlSurfaceView<FaceMeshResult> glSurfaceView =
new SolutionGlSurfaceView<>(
this, faceMesh.getGlContext(), faceMesh.getGlMajorVersion());
glSurfaceView.setSolutionResultRenderer(new FaceMeshResultGlRenderer());
glSurfaceView.setRenderInputImage(true);

faceMesh.setResultListener(
faceMeshResult -> {
NormalizedLandmark noseLandmark =
faceMeshResult.multiFaceLandmarks().get(0).getLandmarkList().get(1);
Log.i(
TAG,
String.format(
"MediaPipe Face Mesh nose normalized coordinates (value range: [0, 1]): x=%f, y=%f",
noseLandmark.getX(), noseLandmark.getY()));
// Request GL rendering.
glSurfaceView.setRenderData(faceMeshResult);
glSurfaceView.requestRender();
});

ActivityResultLauncher<Intent> videoGetter =
registerForActivityResult(
new ActivityResultContracts.StartActivityForResult(),
result -> {
Intent resultIntent = result.getData();
if (resultIntent != null) {
if (result.getResultCode() == RESULT_OK) {
glSurfaceView.post(
() ->
videoInput.start(
this,
resultIntent.getData(),
faceMesh.getGlContext(),
glSurfaceView.getWidth(),
glSurfaceView.getHeight()));
}
}
});
Intent pickVideoIntent = new Intent(Intent.ACTION_PICK);
pickVideoIntent.setDataAndType(MediaStore.Video.Media.INTERNAL_CONTENT_URI, "video/*");
videoGetter.launch(pickVideoIntent);

Example Apps
Please first see general instructions for Android, iOS and desktop on how to build
MediaPipe examples.

Note: To visualize a graph, copy the graph and paste it into MediaPipe Visualizer. For
more information on how to visualize its associated subgraphs, please see visualizer
documentation.

Face Landmark Example


Face landmark example showcases real-time, cross-platform face landmark detection.
For visual reference, please refer to Fig. 2.

MOBILE
• Graph: mediapipe/graphs/face_mesh/face_mesh_mobile.pbtxt
• Android target: mediapipe/examples/android/src/java/com/google/mediapipe/apps/facemeshgpu:facemeshgpu (or download the prebuilt ARM64 APK)

• iOS target: mediapipe/examples/ios/facemeshgpu:FaceMeshGpuApp


Tip: Maximum number of faces to detect/process is set to 1 by default. To change it, for
Android modify NUM_FACES in MainActivity.java, and for iOS
modify kNumFaces in FaceMeshGpuViewController.mm.

DESKTOP
• Running on CPU
o Graph: mediapipe/graphs/face_mesh/face_mesh_desktop_live.pbtxt
o Target: mediapipe/examples/desktop/face_mesh:face_mesh_cpu
• Running on GPU
o Graph: mediapipe/graphs/face_mesh/face_mesh_desktop_live_gpu.pbtxt
o Target: mediapipe/examples/desktop/face_mesh:face_mesh_gpu

Tip: Maximum number of faces to detect/process is set to 1 by default. To change it,
modify the option of ConstantSidePacketCalculator in the graph file.

Face Effect Example


Face effect example showcases real-time mobile face effect application use case for the
Face Mesh solution. To enable a better user experience, this example only works for a
single face. For visual reference, please refer to Fig. 4.

MediaPipe Hands

Overview
The ability to perceive the shape and motion of hands can be a vital component in
improving the user experience across a variety of technological domains and platforms.
For example, it can form the basis for sign language understanding and hand gesture
control, and can also enable the overlay of digital content and information on top of the
physical world in augmented reality. While coming naturally to people, robust real-time
hand perception is a decidedly challenging computer vision task, as hands often occlude
themselves or each other (e.g. finger/palm occlusions and hand shakes) and lack high
contrast patterns.

MediaPipe Hands is a high-fidelity hand and finger tracking solution. It employs


machine learning (ML) to infer 21 3D landmarks of a hand from just a single frame.
Whereas current state-of-the-art approaches rely primarily on powerful desktop
environments for inference, our method achieves real-time performance on a mobile
phone, and even scales to multiple hands. We hope that providing this hand perception
functionality to the wider research and development community will result in an
emergence of creative use cases, stimulating new applications and new research
avenues.

Fig 1. Tracked 3D hand landmarks are represented by dots in different shades, with the brighter ones
denoting landmarks closer to the camera.

ML Pipeline
MediaPipe Hands utilizes an ML pipeline consisting of multiple models working together:

• A palm detection model that operates on the full image and returns an oriented hand bounding box.
• A hand landmark model that operates on the cropped image region defined by the palm detector and returns high-fidelity 3D hand keypoints.

This strategy is similar to that employed in our MediaPipe Face Mesh solution, which
uses a face detector together with a face landmark model.

Providing the accurately cropped hand image to the hand landmark model drastically
reduces the need for data augmentation (e.g. rotations, translation and scale) and
instead allows the network to dedicate most of its capacity towards coordinate
prediction accuracy. In addition, in our pipeline the crops can also be generated based
on the hand landmarks identified in the previous frame, and only when the landmark
model could no longer identify hand presence is palm detection invoked to relocalize
the hand.
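
To make this detection/tracking hand-off concrete, the following is a minimal Python sketch of
the control flow only, not the actual MediaPipe implementation (which lives inside the
MediaPipe graph rather than in user code). The detect_palms, predict_landmarks and
landmarks_to_roi callables and the presence threshold are hypothetical placeholders.

# Sketch: palm detection runs only when no hand is currently being tracked.
def track_hands(frames, detect_palms, predict_landmarks, landmarks_to_roi,
                presence_threshold=0.5):
  rois = []  # regions of interest carried over from the previous frame
  for frame in frames:
    if not rois:
      # Nothing tracked yet (or tracking was lost): run the palm detector.
      rois = detect_palms(frame)
    landmarks_per_hand = []
    next_rois = []
    for roi in rois:
      landmarks, presence = predict_landmarks(frame, roi)
      if presence >= presence_threshold:
        landmarks_per_hand.append(landmarks)
        # Derive the next frame's crop from the current landmarks.
        next_rois.append(landmarks_to_roi(landmarks))
    rois = next_rois  # if empty, palm detection re-runs on the next frame
    yield landmarks_per_hand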

The pipeline is implemented as a MediaPipe graph that uses a hand landmark tracking
subgraph from the hand landmark module, and renders using a dedicated hand
renderer subgraph. The hand landmark tracking subgraph internally uses a hand
landmark subgraph from the same module and a palm detection subgraph from
the palm detection module.

Note: To visualize a graph, copy the graph and paste it into MediaPipe Visualizer. For
more information on how to visualize its associated subgraphs, please see visualizer
documentation.

Models

Palm Detection Model


To detect initial hand locations, we designed a single-shot detector model optimized for
mobile real-time uses in a manner similar to the face detection model in MediaPipe Face
Mesh. Detecting hands is a decidedly complex task: our lite model and full model have
to work across a variety of hand sizes with a large scale span (~20x) relative to the
image frame and be able to detect occluded and self-occluded hands. Whereas faces
have high contrast patterns, e.g., in the eye and mouth region, the lack of such features
in hands makes it comparatively difficult to detect them reliably from their visual
features alone. Instead, providing additional context, like arm, body, or person features,
aids accurate hand localization.

Our method addresses the above challenges using different strategies. First, we train a
palm detector instead of a hand detector, since estimating bounding boxes of rigid
objects like palms and fists is significantly simpler than detecting hands with articulated
fingers. In addition, as palms are smaller objects, the non-maximum suppression
algorithm works well even for two-hand self-occlusion cases, like handshakes. Moreover,
palms can be modelled using square bounding boxes (anchors in ML terminology)
ignoring other aspect ratios, and therefore reducing the number of anchors by a factor
of 3-5. Second, an encoder-decoder feature extractor is used for bigger scene context
awareness even for small objects (similar to the RetinaNet approach). Lastly, we
minimize the focal loss during training to support the large number of anchors resulting
from the high scale variance.

With the above techniques, we achieve an average precision of 95.7% in palm detection.
Using a regular cross entropy loss and no decoder gives a baseline of just 86.22%.
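
For reference, below is a minimal NumPy sketch of the standard binary focal loss from the
RetinaNet paper. The exact variant and the alpha/gamma values used to train the palm
detector are not specified here, so treat them as illustrative assumptions.

import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
  """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).

  p: predicted probabilities of the positive class, shape (N,).
  y: ground-truth labels in {0, 1}, shape (N,).
  Easy examples are down-weighted so the many easy negative anchors do not
  overwhelm the few positives.
  """
  p = np.clip(p, eps, 1.0 - eps)
  p_t = np.where(y == 1, p, 1.0 - p)
  alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
  return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

# Example: a confident easy negative contributes far less than a hard positive.
print(focal_loss(np.array([0.1, 0.9, 0.3]), np.array([0, 1, 1])))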

Hand Landmark Model


After the palm detection over the whole image our subsequent hand
landmark model performs precise keypoint localization of 21 3D hand-knuckle
coordinates inside the detected hand regions via regression, that is direct coordinate
prediction. The model learns a consistent internal hand pose representation and is
robust even to partially visible hands and self-occlusions.

To obtain ground truth data, we have manually annotated ~30K real-world images with
21 3D coordinates, as shown below (we take Z-value from image depth map, if it exists
per corresponding coordinate). To better cover the possible hand poses and provide
additional supervision on the nature of hand geometry, we also render a high-quality
synthetic hand model over various backgrounds and map it to the corresponding 3D
coordinates.

Fig 2. 21 hand landmarks.


Fig 3. Top: Aligned hand crops passed to the tracking network with ground truth annotation. Bottom:
Rendered synthetic hand images with ground truth annotation.

Solution APIs

Configuration Options
Naming style and availability may differ slightly across platforms/languages.

STATIC_IMAGE_MODE
If set to false, the solution treats the input images as a video stream. It will try to detect
hands in the first input images, and upon a successful detection further localizes the
hand landmarks. In subsequent images, once all max_num_hands hands are detected
and the corresponding hand landmarks are localized, it simply tracks those landmarks
without invoking another detection until it loses track of any of the hands. This reduces
latency and is ideal for processing video frames. If set to true, hand detection runs on
every input image, which is ideal for processing a batch of static, possibly unrelated,
images. Defaults to false.

MAX_NUM_HANDS
Maximum number of hands to detect. Defaults to 2.

MODEL_COMPLEXITY
Complexity of the hand landmark model: 0 or 1. Landmark accuracy as well as
inference latency generally go up with the model complexity. Defaults to 1.

MIN_DETECTION_CONFIDENCE
Minimum confidence value ([0.0, 1.0]) from the hand detection model for the
detection to be considered successful. Defaults to 0.5.

MIN_TRACKING_CONFIDENCE
Minimum confidence value ([0.0, 1.0]) from the landmark-tracking model for the hand
landmarks to be considered tracked successfully; otherwise hand detection will be
invoked automatically on the next input image. Setting it to a higher value can increase
robustness of the solution, at the expense of higher latency. Ignored
if static_image_mode is true, where hand detection simply runs on every image. Defaults
to 0.5.
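
Taken together, these options map directly onto the constructor arguments of the Python
solution (see the usage example later in this section). The snippet below simply spells out
the defaults described above for reference:

import mediapipe as mp

# All five options written out with their default values; tune per use case.
hands = mp.solutions.hands.Hands(
    static_image_mode=False,       # treat input as a video stream
    max_num_hands=2,               # detect/track at most two hands
    model_complexity=1,            # 0 = faster, 1 = more accurate
    min_detection_confidence=0.5,  # palm detection threshold
    min_tracking_confidence=0.5)   # landmark tracking threshold
hands.close()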

Output
Naming style may differ slightly across platforms/languages.

MULTI_HAND_LANDMARKS
Collection of detected/tracked hands, where each hand is represented as a list of 21
hand landmarks and each landmark is composed of x, y and z. x and y are
normalized to [0.0, 1.0] by the image width and height respectively. z represents the
landmark depth with the depth at the wrist being the origin, and the smaller the value
the closer the landmark is to the camera. The magnitude of z uses roughly the same
scale as x.
MULTI_HAND_WORLD_LANDMARKS
Collection of detected/tracked hands, where each hand is represented as a list of 21
hand landmarks in world coordinates. Each landmark is composed of x, y and z:
real-world 3D coordinates in meters with the origin at the hand's approximate
geometric center.

MULTI_HANDEDNESS
Collection of handedness of the detected/tracked hands (i.e. whether each is a left or
right hand). Each hand is composed of label and score. label is a string with value
either "Left" or "Right". score is the estimated probability of the predicted handedness
and is always greater than or equal to 0.5 (the opposite handedness has an
estimated probability of 1 - score).

Note that handedness is determined assuming the input image is mirrored, i.e., taken
with a front-facing/selfie camera with images flipped horizontally. If it is not the case,
please swap the handedness output in the application.
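
As a small illustration of how these outputs are consumed from the Python Solution API
(described next), the helper below converts a normalized landmark to pixel coordinates and
prints the world coordinates and handedness for each detected hand. It assumes results was
produced by Hands.process() as in the usage example that follows.

import mediapipe as mp

mp_hands = mp.solutions.hands

def summarize_hands(results, image_width, image_height):
  """Print pixel coords, world coords and handedness for each detected hand."""
  if not results.multi_hand_landmarks:
    return
  for hand_landmarks, world_landmarks, handedness in zip(
      results.multi_hand_landmarks,
      results.multi_hand_world_landmarks,
      results.multi_handedness):
    # multi_hand_landmarks: normalized [0.0, 1.0] -> convert to pixels.
    wrist = hand_landmarks.landmark[mp_hands.HandLandmark.WRIST]
    print('Wrist (pixels):', wrist.x * image_width, wrist.y * image_height)
    # multi_hand_world_landmarks: metric coordinates around the hand center.
    wrist_world = world_landmarks.landmark[mp_hands.HandLandmark.WRIST]
    print('Wrist (meters):', wrist_world.x, wrist_world.y, wrist_world.z)
    # multi_handedness: label is "Left" or "Right", score is >= 0.5.
    top = handedness.classification[0]
    print('Handedness:', top.label, 'score:', top.score)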

Python Solution API


Please first follow general instructions to install MediaPipe Python package, then learn
more in the companion Python Colab and the usage example below.

Supported configuration options:

• static_image_mode
• max_num_hands
• model_complexity
• min_detection_confidence
• min_tracking_confidence
import cv2
import mediapipe as mp
mp_drawing = mp.solutions.drawing_utils
mp_drawing_styles = mp.solutions.drawing_styles
mp_hands = mp.solutions.hands

# For static images:
IMAGE_FILES = []
with mp_hands.Hands(
    static_image_mode=True,
    max_num_hands=2,
    min_detection_confidence=0.5) as hands:
  for idx, file in enumerate(IMAGE_FILES):
    # Read an image, flip it around y-axis for correct handedness output (see
    # above).
    image = cv2.flip(cv2.imread(file), 1)
    # Convert the BGR image to RGB before processing.
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    # Print handedness and draw hand landmarks on the image.
    print('Handedness:', results.multi_handedness)
    if not results.multi_hand_landmarks:
      continue
    image_height, image_width, _ = image.shape
    annotated_image = image.copy()
    for hand_landmarks in results.multi_hand_landmarks:
      print('hand_landmarks:', hand_landmarks)
      print(
          f'Index finger tip coordinates: (',
          f'{hand_landmarks.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP].x * image_width}, '
          f'{hand_landmarks.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP].y * image_height})'
      )
      mp_drawing.draw_landmarks(
          annotated_image,
          hand_landmarks,
          mp_hands.HAND_CONNECTIONS,
          mp_drawing_styles.get_default_hand_landmarks_style(),
          mp_drawing_styles.get_default_hand_connections_style())
    cv2.imwrite(
        '/tmp/annotated_image' + str(idx) + '.png', cv2.flip(annotated_image, 1))
    # Draw hand world landmarks.
    if not results.multi_hand_world_landmarks:
      continue
    for hand_world_landmarks in results.multi_hand_world_landmarks:
      mp_drawing.plot_landmarks(
          hand_world_landmarks, mp_hands.HAND_CONNECTIONS, azimuth=5)

# For webcam input:
cap = cv2.VideoCapture(0)
with mp_hands.Hands(
    model_complexity=0,
    min_detection_confidence=0.5,
    min_tracking_confidence=0.5) as hands:
  while cap.isOpened():
    success, image = cap.read()
    if not success:
      print("Ignoring empty camera frame.")
      # If loading a video, use 'break' instead of 'continue'.
      continue

    # To improve performance, optionally mark the image as not writeable to
    # pass by reference.
    image.flags.writeable = False
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    results = hands.process(image)

    # Draw the hand annotations on the image.
    image.flags.writeable = True
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
    if results.multi_hand_landmarks:
      for hand_landmarks in results.multi_hand_landmarks:
        mp_drawing.draw_landmarks(
            image,
            hand_landmarks,
            mp_hands.HAND_CONNECTIONS,
            mp_drawing_styles.get_default_hand_landmarks_style(),
            mp_drawing_styles.get_default_hand_connections_style())
    # Flip the image horizontally for a selfie-view display.
    cv2.imshow('MediaPipe Hands', cv2.flip(image, 1))
    if cv2.waitKey(5) & 0xFF == 27:
      break
cap.release()

JavaScript Solution API


Please first see general introduction on MediaPipe in JavaScript, then learn more in the
companion web demo and a fun application, and the following usage example.

Supported configuration options:

• maxNumHands
• modelComplexity
• minDetectionConfidence
• minTrackingConfidence
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/camera_utils/camera_utils.js"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/control_utils/control_utils.js"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/drawing_utils/drawing_utils.js"
crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/@mediapipe/hands/hands.js"
crossorigin="anonymous"></script>
</head>

<body>
<div class="container">
<video class="input_video"></video>
<canvas class="output_canvas" width="1280px" height="720px"></canvas>
</div>
</body>
</html>
<script type="module">
const videoElement = document.getElementsByClassName('input_video')[0];
const canvasElement = document.getElementsByClassName('output_canvas')[0];
const canvasCtx = canvasElement.getContext('2d');

function onResults(results) {
canvasCtx.save();
canvasCtx.clearRect(0, 0, canvasElement.width, canvasElement.height);
canvasCtx.drawImage(
results.image, 0, 0, canvasElement.width, canvasElement.height);
if (results.multiHandLandmarks) {
for (const landmarks of results.multiHandLandmarks) {
drawConnectors(canvasCtx, landmarks, HAND_CONNECTIONS,
{color: '#00FF00', lineWidth: 5});
drawLandmarks(canvasCtx, landmarks, {color: '#FF0000', lineWidth: 2});
}
}
canvasCtx.restore();
}

const hands = new Hands({locateFile: (file) => {


return `https://cdn.jsdelivr.net/npm/@mediapipe/hands/${file}`;
}});
hands.setOptions({
maxNumHands: 2,
modelComplexity: 1,
minDetectionConfidence: 0.5,
minTrackingConfidence: 0.5
});
hands.onResults(onResults);

const camera = new Camera(videoElement, {


onFrame: async () => {
await hands.send({image: videoElement});
},
width: 1280,
height: 720
});
camera.start();
</script>

Android Solution API


Please first follow general instructions to add MediaPipe Gradle dependencies and try
the Android Solution API in the companion example Android Studio project, and learn
more in the usage example below.

Supported configuration options:

• staticImageMode
• maxNumHands
• runOnGpu: Run the pipeline and the model inference on GPU or CPU.

CAMERA INPUT
// For camera input and result rendering with OpenGL.
HandsOptions handsOptions =
HandsOptions.builder()
.setStaticImageMode(false)
.setMaxNumHands(2)
.setRunOnGpu(true).build();
Hands hands = new Hands(this, handsOptions);
hands.setErrorListener(
(message, e) -> Log.e(TAG, "MediaPipe Hands error:" + message));

// Initializes a new CameraInput instance and connects it to MediaPipe Hands Solution.


CameraInput cameraInput = new CameraInput(this);
cameraInput.setNewFrameListener(
textureFrame -> hands.send(textureFrame));

// Initializes a new GlSurfaceView with a ResultGlRenderer<HandsResult> instance
// that provides the interfaces to run user-defined OpenGL rendering code.
// See mediapipe/examples/android/solutions/hands/src/main/java/com/google/mediapipe/examples/hands/HandsResultGlRenderer.java
// as an example.
SolutionGlSurfaceView<HandsResult> glSurfaceView =
new SolutionGlSurfaceView<>(
this, hands.getGlContext(), hands.getGlMajorVersion());
glSurfaceView.setSolutionResultRenderer(new HandsResultGlRenderer());
glSurfaceView.setRenderInputImage(true);

hands.setResultListener(
handsResult -> {
if (handsResult.multiHandLandmarks().isEmpty()) {
return;
}
NormalizedLandmark wristLandmark =
handsResult.multiHandLandmarks().get(0).getLandmarkList().get(HandLandmark.WRIST);
Log.i(
TAG,
String.format(
"MediaPipe Hand wrist normalized coordinates (value range: [0, 1]): x=%f, y=%f",
wristLandmark.getX(), wristLandmark.getY()));
// Request GL rendering.
glSurfaceView.setRenderData(handsResult);
glSurfaceView.requestRender();
});

// The runnable to start camera after the GLSurfaceView is attached.


glSurfaceView.post(
() ->
cameraInput.start(
this,
hands.getGlContext(),
CameraInput.CameraFacing.FRONT,
glSurfaceView.getWidth(),
glSurfaceView.getHeight()));

IMAGE INPUT
// For reading images from gallery and drawing the output in an ImageView.
HandsOptions handsOptions =
HandsOptions.builder()
.setStaticImageMode(true)
.setMaxNumHands(2)
.setRunOnGpu(true).build();
Hands hands = new Hands(this, handsOptions);

// Connects MediaPipe Hands Solution to the user-defined ImageView instance that
// allows users to have the custom drawing of the output landmarks on it.
// See mediapipe/examples/android/solutions/hands/src/main/java/com/google/mediapipe/examples/hands/HandsResultImageView.java
// as an example.
HandsResultImageView imageView = new HandsResultImageView(this);
hands.setResultListener(
handsResult -> {
if (handsResult.multiHandLandmarks().isEmpty()) {
return;
}
int width = handsResult.inputBitmap().getWidth();
int height = handsResult.inputBitmap().getHeight();
NormalizedLandmark wristLandmark =
handsResult.multiHandLandmarks().get(0).getLandmarkList().get(HandLandmark.WRIST);
Log.i(
TAG,
String.format(
"MediaPipe Hand wrist coordinates (pixel values): x=%f, y=%f",
wristLandmark.getX() * width, wristLandmark.getY() * height));
// Request canvas drawing.
imageView.setHandsResult(handsResult);
runOnUiThread(() -> imageView.update());
});
hands.setErrorListener(
(message, e) -> Log.e(TAG, "MediaPipe Hands error:" + message));

// ActivityResultLauncher to get an image from the gallery as Bitmap.


ActivityResultLauncher<Intent> imageGetter =
registerForActivityResult(
new ActivityResultContracts.StartActivityForResult(),
result -> {
Intent resultIntent = result.getData();
if (resultIntent != null && result.getResultCode() == RESULT_OK) {
Bitmap bitmap = null;
try {
bitmap =
MediaStore.Images.Media.getBitmap(
this.getContentResolver(), resultIntent.getData());
// Please also rotate the Bitmap based on its orientation.
} catch (IOException e) {
Log.e(TAG, "Bitmap reading error:" + e);
}
if (bitmap != null) {
hands.send(bitmap);
}
}
});
Intent pickImageIntent = new Intent(Intent.ACTION_PICK);
pickImageIntent.setDataAndType(MediaStore.Images.Media.INTERNAL_CONTENT_URI, "image/*");
imageGetter.launch(pickImageIntent);

VIDEO INPUT
// For video input and result rendering with OpenGL.
HandsOptions handsOptions =
HandsOptions.builder()
.setStaticImageMode(false)
.setMaxNumHands(2)
.setRunOnGpu(true).build();
Hands hands = new Hands(this, handsOptions);
hands.setErrorListener(
(message, e) -> Log.e(TAG, "MediaPipe Hands error:" + message));

// Initializes a new VideoInput instance and connects it to MediaPipe Hands Solution.


VideoInput videoInput = new VideoInput(this);
videoInput.setNewFrameListener(
textureFrame -> hands.send(textureFrame));

// Initializes a new GlSurfaceView with a ResultGlRenderer<HandsResult> instance
// that provides the interfaces to run user-defined OpenGL rendering code.
// See mediapipe/examples/android/solutions/hands/src/main/java/com/google/mediapipe/examples/hands/HandsResultGlRenderer.java
// as an example.
SolutionGlSurfaceView<HandsResult> glSurfaceView =
new SolutionGlSurfaceView<>(
this, hands.getGlContext(), hands.getGlMajorVersion());
glSurfaceView.setSolutionResultRenderer(new HandsResultGlRenderer());
glSurfaceView.setRenderInputImage(true);

hands.setResultListener(
handsResult -> {
if (handsResult.multiHandLandmarks().isEmpty()) {
return;
}
NormalizedLandmark wristLandmark =
handsResult.multiHandLandmarks().get(0).getLandmarkList().get(HandLandmark.WRIST);
Log.i(
TAG,
String.format(
"MediaPipe Hand wrist normalized coordinates (value range: [0, 1]): x=%f, y=%f",
wristLandmark.getX(), wristLandmark.getY()));
// Request GL rendering.
glSurfaceView.setRenderData(handsResult);
glSurfaceView.requestRender();
});

ActivityResultLauncher<Intent> videoGetter =
registerForActivityResult(
new ActivityResultContracts.StartActivityForResult(),
result -> {
Intent resultIntent = result.getData();
if (resultIntent != null) {
if (result.getResultCode() == RESULT_OK) {
glSurfaceView.post(
() ->
videoInput.start(
this,
resultIntent.getData(),
hands.getGlContext(),
glSurfaceView.getWidth(),
glSurfaceView.getHeight()));
}
}
});
Intent pickVideoIntent = new Intent(Intent.ACTION_PICK);
pickVideoIntent.setDataAndType(MediaStore.Video.Media.INTERNAL_CONTENT_URI, "video/*");
videoGetter.launch(pickVideoIntent);

Example Apps
Please first see general instructions for Android, iOS and desktop on how to build
MediaPipe examples.

Note: To visualize a graph, copy the graph and paste it into MediaPipe Visualizer. For
more information on how to visualize its associated subgraphs, please see visualizer
documentation.
