Using FFmpeg with NVIDIA GPU Hardware Acceleration
User Guide
All NVIDIA® GPUs starting with Kepler generation support fully-accelerated hardware video
encoding and decoding. The hardware encoder and hardware decoder are referred to as NVENC
and NVDEC, respectively, in the rest of the document.
The hardware capabilities of NVENC and NVDEC are exposed in the NVIDIA Video Codec SDK
through APIs (herein referred to as NVENCODE API and NVDECODE API), by which the user can
access the hardware acceleration abilities of NVENC and NVDEC.
FFmpeg is the most popular multimedia transcoding software and is used extensively for video
and audio transcoding. NVENC and NVDEC can be effectively used with FFmpeg to significantly
speed up video decoding, encoding, and end-to-end transcoding.
This document explains ways to accelerate video encoding, decoding, and end-to-end transcoding on NVIDIA GPUs through FFmpeg, which uses the APIs exposed in the NVIDIA Video Codec SDK.
Compiling for Linux
‣ Clone ffnvcodec
git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
‣ Install ffnvcodec
cd nv-codec-headers && sudo make install && cd -
‣ Configure
./configure --enable-nonfree --enable-cuda-nvcc --enable-libnpp --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64 --disable-static --enable-shared
‣ Compile
make -j 8
Compiling for Windows
‣ Create a folder named nv_sdk in the parent directory of FFmpeg and copy all the header files from C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include and the library files from C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\lib\x64 into the nv_sdk folder.
‣ Launch the Visual Studio x64 Native Tools Command Prompt.
‣ From the Visual Studio x64 Native Tools Command Prompt, launch the MinGW64
environment by running mingw64.exe from the msys2 installation folder.
‣ In the MinGW64 environment, install the necessary packages.
pacman -S diffutils make pkg-config yasm
Once the FFmpeg binary with NVIDIA hardware acceleration support is compiled, hardware-accelerated video transcoding should be tested to ensure everything works well. To automatically detect the NV-accelerated video codec and keep video frames in GPU memory for transcoding, the FFmpeg CLI options "-hwaccel cuda -hwaccel_output_format cuda" are used in the code snippets that follow.
There is a built-in cropper in the cuvid decoder as well. The following command illustrates the use of cropping (-crop (top)x(bottom)x(left)x(right)).
ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -crop 16x16x32x32 -i input.mp4 -c:a copy -c:v h264_nvenc -b:v 5M output.mp4
The pixel format (pix_fmt) should be changed to yuv444p/p010/yuv444p16 for encoding YUV 444,
420-10 and 444-10 files respectively.
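For example (a sketch; the resolution, bitrate, and file names are illustrative assumptions), a raw 8-bit YUV 4:4:4 file could be encoded as follows:
ffmpeg -y -vsync 0 -f rawvideo -s 1920x1080 -pix_fmt yuv444p -i input.yuv -c:v h264_nvenc -b:v 5M output.mp4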
The following example decodes input1.mp4 and transcodes it to output11.mp4 at 480p and output12.mp4 at 240p, and decodes input2.mp4 and transcodes it to output21.mp4 at 720p and output22.mp4 at 480p, all as H.264 videos produced by a single command line.
Input: input1.mp4, input2.mp4
Output: 480p, 240p (from input1.mp4); 720p, 480p (from input2.mp4) (audio same as input)
ffmpeg -y -hwaccel cuda -hwaccel_output_format cuda -i input1.mp4
-hwaccel cuda -hwaccel_output_format cuda -i input2.mp4
-map 0:0 -vf scale_npp=640:480 -c:v h264_nvenc -b:v 1M output11.mp4
-map 0:0 -vf scale_npp=320:240 -c:v h264_nvenc -b:v 500k output12.mp4
-map 1:0 -vf scale_npp=1280:720 -c:v h264_nvenc -b:v 3M output21.mp4
-map 1:0 -vf scale_npp=640:480 -c:v h264_nvenc -b:v 2M output22.mp4
Once basic FFmpeg setup is confirmed to be working properly, other options provided on the
FFmpeg command line can be used to test encoding, decoding, and transcoding.
This chapter lists FFmpeg commands for accelerating video encoding, decoding, and
transcoding using NVENC and NVDEC.
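For instance, a basic one-in/one-out transcode looks like the following (a sketch; the input file name and bitrate are illustrative):
ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 -c:a copy -c:v h264_nvenc -b:v 5M output.mp4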
This generates the output file in MP4 format (output.mp4) with H264 encoded video.
Video encoding can be broadly classified into two types of use cases:
‣ Latency tolerant high quality: In this kind of use case, latency is permitted. Encoder features such as B-frames, look-ahead, reference B-frames, variable bitrate (VBR), and higher VBV buffer sizes can be used. Typical use cases include cloud transcoding, recording, and archiving.
‣ Low latency: In this kind of use case, latency must be low and can be as low as 16 ms. In this mode, B-frames are disabled, constant bitrate modes are used, and VBV buffer sizes are kept very low. Typical use cases include real-time gaming, live streaming, and video conferencing. This encoding mode results in lower encoding quality because of these constraints.
NVENCODE API supports several features for adjusting quality, performance, and latency, which are exposed through the FFmpeg command line. It is recommended to enable the feature(s) and command line option(s) appropriate to the use case.
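For example, a low-latency encode for live streaming could disable B-frames and use constant bitrate with a small VBV buffer. This is a hedged sketch; the preset, bitrate, buffer size, and file names are illustrative assumptions rather than prescribed settings:
ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 -c:a copy -c:v h264_nvenc -preset p1 -tune ull -rc cbr -b:v 3M -bufsize 300k -bf 0 output.mp4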
To decode multiple input bitstreams concurrently within a single FFmpeg process, use the
following command.
ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i input1.264
-hwaccel cuda -hwaccel_output_format cuda -i input2.264
-hwaccel cuda -hwaccel_output_format cuda -i input3.264
-filter_complex "[0:v]hwdownload,format=nv12[o0];[1:v]hwdownload,format=nv12[o1];[2:v]hwdownload,format=nv12[o2]"
-map "[o0]" -f rawvideo output1.yuv
-map "[o1]" -f rawvideo output2.yuv
-map "[o2]" -f rawvideo output3.yuv
This uses a separate thread per decode operation and a single CUDA context shared among all threads, and generates the output files in NV12 format (outputN.yuv).
‣ Slow Preset
ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 -c:a copy
-c:v h264_nvenc -preset p6 -tune hq -b:v 5M -bufsize 5M -maxrate 10M -qmin 0 -g 250
-bf 3 -b_ref_mode middle -temporal-aq 1 -rc-lookahead 20 -i_qfactor 0.75 -b_qfactor
1.1 output.mp4
‣ Medium Preset
Use -preset p4 instead of -preset p6 in the above command line.
‣ Fast Preset
Use -preset p2 instead of -preset p6 in the above command line.
5.1. Lookahead
Lookahead improves the video encoder’s rate-control accuracy by enabling the encoder to buffer
the specified number of frames, estimate their complexity, and allocate the bits appropriately
among these frames proportional to their complexity. This typically results in better quality
because the encoder can distribute the bits proportional to the complexity over a larger number
of frames. The number of lookahead frames should be at least the number of B frames + 1 to
avoid CPU stalls. A lookahead of 10-20 frames is suggested for optimal quality benefits.
To enable lookahead, use the -rc-lookahead N (N = number of frames) option on the FFmpeg command line.
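For example (a sketch; the file names, preset, and bitrate are illustrative assumptions), a lookahead of 20 frames can be enabled as follows:
ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 -c:a copy -c:v h264_nvenc -preset p6 -tune hq -b:v 5M -rc-lookahead 20 output.mp4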
5.2. Temporal AQ
Temporal AQ attempts to adjust the encoding quantization parameter (QP) (on top of the QP evaluated by the rate control algorithm) based on the temporal characteristics of the sequence.
Temporal AQ improves the quality of encoded frames by adjusting QP for regions which are
constant or have low motion across frames but have high spatial detail, such that they become
a better reference for future frames. Allocating extra bits to such regions in reference frames
is better than allocating them to the residuals in referred frames because it helps improve the
overall encoded video quality. If most of the region within a frame has little or no motion but has
high spatial details (e.g. high-detail non-moving background), enabling temporal AQ will benefit
the most.
One of the potential disadvantages of temporal AQ is that enabling temporal AQ may result in
high fluctuation of bits consumed per frame within a GOP. I/P-frames will consume more bits
than average P-frame size, and B-frames will consume fewer bits. Although the target bitrate
will be maintained at the GOP level, the frame size will fluctuate from one frame to the next within
a GOP more than it would without temporal AQ. Enabling temporal AQ is not recommended if a
strict CBR profile is required for every frame size within a GOP. To enable temporal AQ, use the
-temporal-aq 1 option on the FFmpeg command line.
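For example (a sketch; the file names, preset, and bitrate are illustrative assumptions), temporal AQ can be enabled as follows:
ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 -c:a copy -c:v h264_nvenc -preset p6 -tune hq -b:v 5M -temporal-aq 1 output.mp4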
Various factors affect the performance of hardware accelerated transcoding on the GPU. Getting
the highest performance for your workload requires some tuning. This section provides some
tips for measuring and optimizing end-to-end transcode performance.
The NVIDIA Video Codec SDK documentation publishes the performance of the GPU hardware-accelerated encoder and decoder as stand-alone numbers, measured using the high-performance encode or decode applications included in the SDK. Although FFmpeg is highly optimized, its performance is slightly lower than the performance reported in the SDK documentation, mainly due to software overheads and additional setup/initialization time within the FFmpeg code. Therefore, to get high transcoding throughput using FFmpeg, it is essential to saturate the hardware encoder and decoder engines such that the initialization time overhead for one session is hidden behind the transcoding time of other sessions. This can be achieved by running multiple parallel encode/decode sessions on the hardware (see Section 1:N HWACCEL encode from YUV or RAW Data). In that case, the aggregate transcode performance with FFmpeg closely matches the theoretically expected hardware performance.
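As a rough sketch (the session count, file names, and bitrate are assumptions, not SDK-supplied values), several independent transcode sessions can be launched in parallel from a shell so that per-session initialization overhead overlaps with the transcoding work of the other sessions:
# Launch four concurrent transcode sessions and wait for all of them to finish.
for i in 1 2 3 4; do
  ffmpeg -y -vsync 0 -hwaccel cuda -hwaccel_output_format cuda -i input$i.mp4 -c:a copy -c:v h264_nvenc -b:v 5M output$i.mp4 &
done
wait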
Splitting video into smaller chunks provides clean boundaries for streaming bandwidth adaptation and helps parallelize transcoding workloads across servers. Transcoding smaller video chunks with GPU-hardware-accelerated transcoding, however, poses a challenge because the initialization time overhead of each FFmpeg process becomes significant.
To minimize the overhead when transcoding M input files into M×N output files (i.e. when each of the M inputs is transcoded into N outputs), it is better to minimize the number of FFmpeg processes launched (see Section 1:N HWACCEL encode from YUV or RAW Data for example command lines).
Additionally, follow these tips to reduce the FFmpeg initialization time overhead:
‣ Use FFmpeg command lines such as those in Sections 1:N HWACCEL Transcode with
Scaling and 1:N HWACCEL encode from YUV or RAW Data. These command lines share
the CUDA context across multiple transcode sessions, thereby reducing the CUDA context
initialization time overhead significantly.
Notice
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgment, unless otherwise agreed in
an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any
customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed
either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications
where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA
accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product
is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document,
ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of
the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional
or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem
which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.
Trademarks
NVIDIA, the NVIDIA logo, and cuBLAS, CUDA, CUDA Toolkit, cuDNN, DALI, DIGITS, DGX, DGX-1, DGX-2, DGX Station, DLProf, GPU, Jetson, Kepler, Maxwell, NCCL,
Nsight Compute, Nsight Systems, NVCaffe, NVIDIA Deep Learning SDK, NVIDIA Developer Program, NVIDIA GPU Cloud, NVLink, NVSHMEM, PerfWorks, Pascal,
SDK Manager, Tegra, TensorRT, TensorRT Inference Server, Tesla, TF-TRT, Triton Inference Server, Turing, and Volta are trademarks and/or registered trademarks
of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which
they are associated.
Copyright
© 2010-2021 NVIDIA Corporation. All rights reserved.
NVIDIA Corporation | 2788 San Tomas Expressway, Santa Clara, CA 95051
http://www.nvidia.com