An Overview of Omnidirectional MediA Format (OMAF)

ABSTRACT | During recent years, there have been product launches and research for enabling immersive audio–visual media experiences. For example, a variety of head-mounted displays and 360° cameras are available in the market. To facilitate interoperability between devices and media system components by different vendors, the Moving Picture Experts Group (MPEG) developed the Omnidirectional MediA Format (OMAF), which is arguably the first virtual reality (VR) system standard. OMAF is a storage and streaming format for omnidirectional media, including 360° video and images, spatial audio, and associated timed text. This article provides a comprehensive overview of OMAF.

KEYWORDS | 360° video; Dynamic Adaptive Streaming over HTTP (DASH); file format; Omnidirectional MediA Format (OMAF); omnidirectional media; viewport; virtual reality (VR).

Manuscript received February 28, 2020; revised October 29, 2020; accepted February 19, 2021. (Corresponding author: Miska M. Hannuksela.) Miska M. Hannuksela is with Nokia Technologies, 33100 Tampere, Finland (e-mail: [email protected]). Ye-Kui Wang is with Bytedance Inc., San Diego, CA 92130 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/JPROC.2021.3063544

I. INTRODUCTION

Virtual reality (VR) has been researched and trialed for many years [1], [2]. Due to the growth of computing capability in devices and network bandwidth, as well as advances in the technology for head-mounted displays (HMDs), wide deployment of VR became possible only recently. Facebook's two-billion-dollar acquisition of Oculus in 2014 seemed to be a start and a catalyst to the fast proliferation of VR research and development, device production, and services throughout the globe. Almost suddenly, VR became a buzzword everywhere in the world, many companies in the information and communication technology field started to have VR as an important strategic direction, and all kinds of VR cameras and devices became available in the market.

Unavoidably, numerous, different, noninteroperable VR solutions have been designed and used. This called for standardization, for which the number one target is always to enable devices and services by different manufacturers and providers to interoperate.

The Moving Picture Experts Group (MPEG) started to look at the development of a VR standard in October 2015. This effort led to arguably the first VR system standard, called Omnidirectional MediA Format (OMAF) [3]. OMAF defines a media format that enables omnidirectional media applications, focusing on 360° video, images, and audio, as well as the associated timed text. The first edition (also referred to as the first version or v1) of OMAF was finalized in October 2017. It provides basic support for 360° video, images, and audio with three degrees of freedom (3DOF), meaning that only rotations around the coordinate axes are supported. Since the finalization of the standard, source code packages of several implementations compatible with OMAF v1 have been made publicly available [4]–[6]. The development of the second edition of OMAF was completed in October 2020. OMAF v2 [7] includes all v1 features and also supports richer 360° presentations with overlays and multiple viewpoints and improves viewport-dependent delivery. OMAF v2 enables limited support for six degrees of freedom (6DOF), where the translational movement of the user impacts the rendering of overlays. Even though OMAF v2 was only recently finalized, there are already implementations supporting its new features [8], [9].
OMAF has been further profiled to suit specific industries and environments by the VR Industry Forum (VRIF) and the 3rd Generation Partnership Project (3GPP). VRIF has the mission to advocate industry consensus on standards for the end-to-end VR ecosystem and chose to reference some of the OMAF media profiles and specific viewport-dependent streaming scenarios in the VRIF Guidelines [10]. Just a few months after finalizing OMAF v2, the VRIF Guidelines were updated to version 2.3, which incorporates selected video profiles and toolset brands from OMAF v2. At the time of writing this article, the Streaming Video Alliance is carrying out a trial using selected OMAF video profiles as recommended in the VRIF Guidelines for streaming to various end-user devices [11]. 3GPP standardizes cellular telecommunications, including multimedia services. The 3GPP specification on VR profiles for streaming applications [12] is based on technical elements specified in OMAF v1.

Fig. 1 shows the OMAF architecture, which consists of three major modules: OMAF content authoring, delivery, and OMAF player. The OMAF content authoring module consists of media acquisition, omnidirectional video/image preprocessing, media encoding, and media file and segment encapsulation. OMAF may use either file delivery or streaming delivery, for which the content is timewise partitioned into segments. The OMAF player module mainly consists of media file and segment decapsulation, media decoding, and media rendering. In some operation modes, the media decapsulation block may contain a bitstream rewriting process that combines several delivered streams into one video bitstream for decoding and rendering. Note that the rendering process is not normatively specified in the OMAF standard. The OMAF player also contains essential processing blocks for the player operation, namely, the tracking and selection strategy modules. The tracking module controls the viewing orientation and, in some cases, also the viewing position according to which the content is rendered. For example, the tracking module may obtain the head orientation when an HMD is used for rendering. The selection strategy module makes the decisions on which content pieces are streamed. The delivery access module acts as a bridge between the selection strategy and stream(s) delivery.

The media types supported in OMAF include video, audio, image, and timed text. However, in this article, we focus only on video and image, and therefore, we will not discuss audio and timed text beyond this point.

The key underlying technologies for file/segment encapsulation and delivery of OMAF are the ISO Base Media File Format (ISOBMFF) [13] and Dynamic Adaptive
Streaming over HTTP (DASH) [14]. OMAF specifies file format and DASH extensions in a backward-compatible manner, which enables reusing existing ISOBMFF and DASH implementations for conventional 2-D media formats with only moderate changes. Note that, while OMAF also specifies signaling and delivery of omnidirectional media over MPEG Media Transport (MMT, ISO/IEC 23008-1), it is not discussed in this article.

This article is organized as follows. ISOBMFF and DASH basics are reviewed in Section II. Representation formats of omnidirectional video/image are discussed in Section III. Section IV provides an introduction to 360° video streaming with an emphasis on viewport-dependent streaming, which mitigates the large resolution and high bitrate required for 360° video by prioritizing the displayed area, i.e., the viewport. Sections V and VI present the OMAF video and image profiles, which specify how a media codec is adapted for omnidirectional application usage. OMAF v2 defines the concept of toolset brands for functionalities beyond basic playback of omnidirectional audio–visual content. Toolset brands are elaborated in Section VII. In Section VIII, we draw a conclusion and take a look at future VR standardization work in MPEG.

This article contains a significant amount of additional details compared to our earlier paper that provides a simpler overview of OMAF v1 [15]. Furthermore, we have added the descriptions for omnidirectional images and OMAF image profiles. Moreover, this article is arguably the first publication that provides a comprehensive review of OMAF v2.

II. BACKGROUND

A. ISOBMFF and HEIF

The ISOBMFF is a popular media container format for audio, video, and timed text. ISOBMFF compliant files are often casually referred to as MP4 files. The High Efficiency Image File Format (HEIF) [16] derives from the ISOBMFF and is gaining popularity as a storage format for still images and image sequences, such as exposure stacks. It is natively supported by major operating systems for smartphones and personal computers, i.e., iOS and Android, as well as Windows 10 and macOS. OMAF file format features for omnidirectional video and still images are built on top of ISOBMFF and HEIF, respectively.

A basic building block in ISOBMFF is called a box, which is a data structure consisting of a four-character-code (4CC) box type, the byte count of the box, and a payload, whose format is determined by the box type and which may contain other boxes. An ISOBMFF file consists of a sequence of boxes.
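To make the box structure concrete, the following is a minimal sketch of a box scanner: it reads the size and 4CC type of each box and descends into a handful of well-known container types. It is only an illustration under stated assumptions (the file name is hypothetical, FullBox version/flags fields are ignored, and only a small subset of container types is recognized), not a complete ISOBMFF parser.

```python
import struct

# A small subset of box types that are pure containers of other boxes.
CONTAINER_TYPES = {b"moov", b"trak", b"mdia", b"minf", b"stbl", b"moof", b"traf"}

def parse_boxes(buf, offset=0, end=None, depth=0):
    """Yield (depth, type, size) for each box found in buf[offset:end]."""
    end = len(buf) if end is None else end
    while offset + 8 <= end:
        size, = struct.unpack_from(">I", buf, offset)       # 32-bit box size
        box_type = buf[offset + 4:offset + 8]                # 4CC box type
        if size == 0:          # box extends to the end of the enclosing container
            size = end - offset
        elif size == 1:        # 64-bit "largesize" follows the type field
            size, = struct.unpack_from(">Q", buf, offset + 8)
        yield depth, box_type.decode("ascii", "replace"), size
        if box_type in CONTAINER_TYPES:
            header = 16 if struct.unpack_from(">I", buf, offset)[0] == 1 else 8
            yield from parse_boxes(buf, offset + header, offset + size, depth + 1)
        offset += size

with open("example.mp4", "rb") as f:    # hypothetical file name
    data = f.read()
for depth, btype, size in parse_boxes(data):
    print("  " * depth + f"{btype} ({size} bytes)")
```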
Each stream of timed media or metadata is logically stored in a track, for which timestamps, random access positions, and other information are provided in respective boxes. The media data for tracks are composed of samples carried in MediaDataBox(es), where each sample corresponds to the coded media data of a single time instance. It is possible to store the track metadata for its entire duration in a MovieBox or to split the metadata in time ranges using MovieFragmentBoxes. In a self-contained movie fragment, the MediaDataBox containing the samples of a movie fragment is next to the respective MovieFragmentBox.

A sample entry of a track describes the coding and encapsulation format used in the samples and includes a 4CC sample entry type and contained boxes that provide further information about the format or content of the track. A restricted video sample entry type ("resv") is used for video tracks that require postprocessing operations after decoding to be displayed properly. The type of postprocessing is specified by one or more scheme types associated with the restricted video track.

ISOBMFF defines items for storing untimed media or metadata, and HEIF uses items for storing still images. In addition to coded image items, HEIF supports derived image items, where an operation corresponding to the type of the derived image item is applied to one or more indicated input images to produce an output image to be displayed. The "grid" derived image item arranges input images onto a grid to create a large output image. Metadata that are specific to an item are typically stored as an item property. A comprehensive technical summary of HEIF is available in [17].
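As an illustration of how a reader could compose the output of a "grid" derived image item after decoding its input images, the following numpy sketch tiles equally sized decoded images row by row. It is a sketch under simplifying assumptions; the actual grid item syntax (row/column counts, output size, cropping) is defined in ISO/IEC 23008-12 and is not reproduced here.

```python
import numpy as np

def compose_grid(decoded_tiles, rows, columns):
    """Arrange equally sized decoded images (H x W x 3 arrays) into a
    rows x columns mosaic in row-major order, as a 'grid' item implies."""
    tile_h, tile_w = decoded_tiles[0].shape[:2]
    out = np.zeros((rows * tile_h, columns * tile_w, 3), dtype=decoded_tiles[0].dtype)
    for idx, tile in enumerate(decoded_tiles):
        r, c = divmod(idx, columns)
        out[r * tile_h:(r + 1) * tile_h, c * tile_w:(c + 1) * tile_w] = tile
    return out

# Example: sixteen 1024x1024 input images form a 4096x4096 output image.
tiles = [np.zeros((1024, 1024, 3), dtype=np.uint8) for _ in range(16)]
mosaic = compose_grid(tiles, rows=4, columns=4)
```

This is also how the OMAF image profiles discussed in Section VI can represent images larger than the coded-image size limit.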
B. DASH

DASH specifies a Media Presentation Description (MPD) format for describing the content available for streaming and segment formats for the streamed content. There are three basic types of segments in DASH: initialization segment, media segment, and index segment. Initialization segments are meant for bootstrapping the media decoding and playback. Media segments contain the coded media data. Index segments provide a directory to the media segments for accessing them in a more fine-grained manner than on a segment basis. In the segment format for ISOBMFF, each media segment consists of one or more self-contained movie fragments, whereas the movie header containing the track header is delivered as an initialization segment. It is possible to omit separate initialization segments by creating self-initializing media segments that contain the necessary movie and track headers. Conventionally, index segments have not been used with ISOBMFF, but rather each media segment can be split into subsegments that are indexed within the media segment itself.

DASH does not specify carriage of image items, but, since an image item can be used as a viewpoint, an overlay, or a background for overlays, OMAF v2 specifies carriage of image items as self-initializing media segments. Fig. 2 summarizes how timed and static media are encapsulated into ISOBMFF files and further into segments for DASH delivery.
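For readers less familiar with DASH, the sketch below shows how a client might enumerate Representations in an MPD and expand a SegmentTemplate into segment URLs. It assumes the common $RepresentationID$/$Number$ template addressing of a static presentation; element and attribute names follow ISO/IEC 23009-1, while error handling, other addressing modes, and the example manifest URL are assumptions of this sketch.

```python
import urllib.request
import xml.etree.ElementTree as ET

NS = {"dash": "urn:mpeg:dash:schema:mpd:2011"}

def list_segment_urls(mpd_url, segments_per_representation=3):
    """Return {representation_id: [initialization URL, media segment URLs...]}."""
    mpd = ET.fromstring(urllib.request.urlopen(mpd_url).read())
    base = mpd_url.rsplit("/", 1)[0] + "/"
    urls = {}
    for aset in mpd.iterfind(".//dash:AdaptationSet", NS):
        aset_template = aset.find("dash:SegmentTemplate", NS)
        for rep in aset.iterfind("dash:Representation", NS):
            tmpl = rep.find("dash:SegmentTemplate", NS)
            if tmpl is None:
                tmpl = aset_template            # inherit from the adaptation set
            if tmpl is None:
                continue
            rep_id = rep.get("id")
            start = int(tmpl.get("startNumber", "1"))
            init = tmpl.get("initialization", "").replace("$RepresentationID$", rep_id)
            media = tmpl.get("media", "")
            urls[rep_id] = [base + init] + [
                base + media.replace("$RepresentationID$", rep_id)
                            .replace("$Number$", str(start + i))
                for i in range(segments_per_representation)
            ]
    return urls

# Hypothetical usage:
# print(list_segment_urls("https://example.com/omaf/manifest.mpd"))
```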
Conventionally, DASH can be used in two operation modes, namely, live and on-demand. For both operation modes, the DASH standard provides profiles that specify restrictions on the features that may be used in MPDs and segments.
III. REPRESENTATION FORMATS OF OMNIDIRECTIONAL VIDEO AND IMAGES

A. Introduction

OMAF specifies three types of representation formats, namely, projected, mesh, and fisheye omnidirectional video and images. These formats differ in image

Fig. 3. OMAF coordinate system [3].

rotation equations for the conversion between the global coordinate system and a local coordinate system.

C. Omnidirectional Projection Formats

Omnidirectional projection is a necessary geometric operation applied at the content production side to generate 2-D pictures from the stitched sphere signal, and an inverse projection operation needs to be used in the rendering process by the OMAF player.

OMAF specifies the support of two types of projection: equirectangular projection (ERP) and cubemap projection (CMP). In addition to ERP and CMP, a number of other projection methods were studied during the OMAF v1 standardization process, but none of them were found to provide sufficient technical benefits over the widely used ERP and CMP formats.

As illustrated in Fig. 4, the ERP process is close to how a 2-D world map is typically generated, but with the left-hand side being the east instead of the west, as the viewing perspective is opposite. In ERP, the user looks from the center of the sphere outward toward the inside surface of the sphere, while, for a world map, the user looks from outside the sphere toward the outside surface of the sphere.

As illustrated in Fig. 5, in the CMP specified in OMAF, the sphere signal is rectilinearly projected onto six square faces that are laid out to form a rectangle with a 3:2 ratio of width versus height, with some of the faces rotated to maximize continuity across face edges.
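To make the ERP mapping concrete, the sketch below converts a sphere direction given as azimuth and elevation into ERP picture coordinates and back. It follows one common convention in which azimuth 0° and elevation 0° map to the picture center, azimuth increases toward the left of the picture, and elevation increases toward the top; OMAF's normative sampling equations, sign conventions, and sample-position offsets are defined in [3] and are not reproduced exactly here.

```python
def erp_sphere_to_picture(azimuth_deg, elevation_deg, pic_width, pic_height):
    """Map a sphere direction (azimuth in [-180, 180], elevation in [-90, 90],
    both in degrees) to continuous ERP picture coordinates under one common
    convention: (0, 0) maps to the picture center."""
    x = (0.5 - azimuth_deg / 360.0) * pic_width
    y = (0.5 - elevation_deg / 180.0) * pic_height
    return x, y

def erp_picture_to_sphere(x, y, pic_width, pic_height):
    """Inverse mapping used when rendering: picture position back to a direction."""
    azimuth_deg = (0.5 - x / pic_width) * 360.0
    elevation_deg = (0.5 - y / pic_height) * 180.0
    return azimuth_deg, elevation_deg

# The center of a 4096x2048 ERP picture corresponds to azimuth 0, elevation 0:
print(erp_sphere_to_picture(0.0, 0.0, 4096, 2048))   # -> (2048.0, 1024.0)
```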
D. Regionwise Packing

Regionwise packing (RWP) is an optional step after projection on the content production side. It enables resizing, repositioning, rotation by 90°, 180°, or 270°, and vertical/horizontal mirroring of any rectangular region before encoding.
RWP can be used, e.g., for the following purposes: 1) indicating the exact coverage of content that does not cover the entire sphere; 2) generating viewport-specific (VS) video or extractor tracks with regionwise mixed-resolution packing or overlapping regions; 3) arranging the cube faces of CMP in an adaptive manner; 4) providing guard bands, i.e., some additional pixels added at geometric boundaries when generating the 2-D pictures for encoding, which can be used to avoid or reduce seam artifacts in rendered 360° video due to projection; and 5) compensating the oversampling of pole areas in ERP.

An example of using RWP for compensating the oversampling of pole areas in ERP is presented in Fig. 6. First, an ERP picture is split into three regions: top, middle, and bottom, where the top and bottom regions cover the two poles and have the same height, while the middle region covers the equator. Second, the top and bottom regions are subsampled to keep the same height but half of the width, and then, the subsampled top and bottom regions are placed next to each other on top of the middle region. This way, the equator area keeps its original resolution, while the top and bottom regions are subsampled to half of the width, which compensates for the oversampling of the pole areas in ERP.

Fig. 6. Example of using RWP for compensating pole area oversampling of ERP.
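The following numpy sketch implements the packing just described for an illustrative 4096 × 2048 ERP picture: the top and bottom quarters are horizontally decimated to half width and placed side by side above the untouched middle region. It only illustrates the geometry; real packings are signaled with RWP metadata, and the resampling would use a proper filter rather than pixel dropping.

```python
import numpy as np

def pack_erp_poles(erp):
    """Pack an ERP picture (H x W x C) so that the top and bottom quarters are
    halved in width and placed side by side on top of the equator region."""
    h, w = erp.shape[:2]
    qh = h // 4                                   # height of each pole region
    top, middle, bottom = erp[:qh], erp[qh:h - qh], erp[h - qh:]
    top_half = top[:, ::2]                        # crude 2:1 horizontal decimation
    bottom_half = bottom[:, ::2]
    pole_row = np.concatenate([top_half, bottom_half], axis=1)   # qh x w
    return np.concatenate([pole_row, middle], axis=0)            # (h - qh) x w

erp = np.zeros((2048, 4096, 3), dtype=np.uint8)
packed = pack_erp_poles(erp)
print(packed.shape)   # (1536, 4096, 3): same width, 25% fewer samples overall
```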
The RWP metadata indicate the interrelations between regions in the projected picture (e.g., an ERP picture) and the respective regions in the packed picture (i.e., the picture in the coded video bitstream) through the position and size of the regions in both the projected and packed pictures, as well as indications of the applied rotation and mirroring, if any. When RWP has been applied, the decoded pictures are packed pictures characterized by RWP metadata. Players can map the regions of decoded pictures onto projected pictures and, consequently, onto the sphere by processing the RWP metadata.
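A player-side interpretation of such metadata can be sketched as a loop over regions: crop the packed region, undo mirroring and rotation, resize it to the projected region size, and paste it at the projected position. The field names and the rotation direction below are illustrative only and do not follow the normative RWP syntax of the OMAF specification.

```python
import numpy as np

def unpack_regions(packed_pic, regions, proj_w, proj_h):
    """regions: list of dicts with illustrative keys
       packed_x/y/w/h, proj_x/y/w/h, rotation (0/90/180/270), mirror (bool)."""
    proj_pic = np.zeros((proj_h, proj_w, packed_pic.shape[2]), dtype=packed_pic.dtype)
    for r in regions:
        src = packed_pic[r["packed_y"]:r["packed_y"] + r["packed_h"],
                         r["packed_x"]:r["packed_x"] + r["packed_w"]]
        if r["mirror"]:
            src = src[:, ::-1]                        # undo horizontal mirroring
        src = np.rot90(src, k=r["rotation"] // 90)    # undo rotation (illustrative direction)
        # Nearest-neighbor resize to the projected region size (real players filter).
        ys = np.arange(r["proj_h"]) * src.shape[0] // r["proj_h"]
        xs = np.arange(r["proj_w"]) * src.shape[1] // r["proj_w"]
        proj_pic[r["proj_y"]:r["proj_y"] + r["proj_h"],
                 r["proj_x"]:r["proj_x"] + r["proj_w"]] = src[np.ix_(ys, xs)]
    return proj_pic
```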
E. Mesh Omnidirectional Video

OMAF v2 adds the 3-D mesh format as a new omnidirectional content format type. A 3-D mesh is specified as a set of mesh elements, all of which are either parallelograms or regions on a sphere surface. The parallelograms can appear at any location and orientation within the unit sphere and need not be connected. A sphere-surface mesh element is specified through an azimuth range and an elevation range, as illustrated in Fig. 7. Thus, it is possible to specify a 3-D mesh to represent both ERP and CMP as special cases. However, the mesh omnidirectional video provides flexibility for optimizing the projection beyond ERP and CMP.

The given 3-D mesh can be used directly for rendering. In other words, the 3-D mesh format enables direct one-to-one mapping of regions of a 2-D image to elements of a 3-D mesh, which is often referred to as UV mapping in computer graphics terminology. The 3-D mesh format avoids the need for deriving the UV map according to the projection format and the RWP metadata.

Fig. 7. Mesh element specified as a region on the sphere surface through an azimuth range and an elevation range.

F. Fisheye Omnidirectional Video and Images

Fisheye video/images do not use projection or RWP. Rather, for each picture, the circular images captured by fisheye cameras are directly placed onto a 2-D picture, e.g., as shown in Fig. 8.

Parameters indicating the placement of the circular images on the 2-D picture and the characteristics of the fisheye video/images are specified in OMAF and can be used for correct rendering. The fisheye format avoids the need for real-time stitching in video recording. OMAF files with fisheye video/images could be suitable for low-cost consumer 360° cameras and smartphones, for example.

G. Supplemental Metadata for Omnidirectional Video and Images

This section provides a summary of supplemental metadata for omnidirectional video or images that may optionally be present in OMAF files or MPDs.

Regionwise Quality Ranking (RWQR): OMAF specifies RWQR metadata as a basic mechanism to enable viewport-dependent content selection. Quality ranking metadata can be provided for sphere regions and for rectangular regions on decoded 2-D pictures. Quality ranking values are given for indicated regions and describe the relative quality order of the regions.
Table 2. Approaches for Improving the Compression of ERP Video for Viewport-Independent Streaming.

Due to the inherent delay in the streaming system to react to viewport changes, the spherical video not contained within the viewport is typically streamed too, albeit at a lower bitrate and thus also at lower picture quality. Another benefit provided by some viewport-dependent streaming approaches over viewport-independent streaming is that the sample count can be nonuniformly allocated, with a higher sampling density covering the viewport. Thus, the effective resolution on the viewport is greater than what the decoding capacity would otherwise support. An example scheme where the content of the viewport originates from a 6K (6144 × 3072) ERP was presented in [27].

One approach for viewport-dependent streaming is to create multiple VS 360° streams by encoding the same input video content for a predefined set of viewport orientations. Each stream also covers areas other than the targeted viewport, though at lower quality. Moreover, the content may be encoded for several bitrates and/or picture resolutions. The streams are made available for streaming, and metadata describing the viewports that the streams are aimed for are provided. Clients select the 360° stream that is targeted for their current viewport and suits the network throughput. Approaches to achieve VS 360° streams are summarized in Table 3.

In tile-based viewport-dependent 360° streaming, projected pictures are encoded as several tiles. Early approaches, such as [29] and [30], split the video prior to encoding into regions that were encoded independently of each other and decoded with separate decoding instances. However, managing and synchronizing many video decoder instances pose practical problems. Thus, a more practical approach is to encode tiles in a manner that they can be merged into a bitstream that can be decoded with a single decoder instance. Thus, in the context of viewport-dependent 360° streaming, the term tile commonly refers to an isolated region [31], which depends only on the collocated isolated region in reference pictures and does not depend on any other picture regions.

Several versions of the tiles are encoded at different bitrates and/or resolutions. Coded tile sequences are made available for streaming together with metadata describing the location of the tile on the omnidirectional video. Clients select which tiles are received so that the viewport has higher quality and/or resolution than the tiles outside the viewport. A categorization of tile-based viewport-dependent 360° streaming is presented in Table 4.
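As a rough illustration of the client-side selection, the sketch below picks, for each tile, a high-bitrate version when the tile's sphere coverage overlaps the current viewport and a low-bitrate version otherwise. Tile coverage is given here as simple azimuth/elevation ranges without wraparound handling; real players use the signaled coverage and RWQR metadata and also account for throughput and switching delay. All field names and URLs are hypothetical.

```python
def overlaps(range_a, range_b):
    """1-D overlap test for (min, max) intervals in degrees."""
    return range_a[0] < range_b[1] and range_b[0] < range_a[1]

def select_tile_versions(tiles, viewport):
    """tiles: list of dicts with illustrative keys
         'azimuth_range', 'elevation_range', 'high_url', 'low_url'.
       viewport: dict with 'azimuth_range' and 'elevation_range'.
       Returns the list of URLs to request for the next segment."""
    selected = []
    for tile in tiles:
        in_view = (overlaps(tile["azimuth_range"], viewport["azimuth_range"]) and
                   overlaps(tile["elevation_range"], viewport["elevation_range"]))
        selected.append(tile["high_url"] if in_view else tile["low_url"])
    return selected

# Hypothetical 4x2 tile grid covering the sphere, 90 x 90 degrees per tile:
tiles = [{"azimuth_range": (a, a + 90), "elevation_range": (e, e + 90),
          "high_url": f"tile_{a}_{e}_hq.mp4", "low_url": f"tile_{a}_{e}_lq.mp4"}
         for a in (-180, -90, 0, 90) for e in (-90, 0)]
viewport = {"azimuth_range": (-45, 45), "elevation_range": (-30, 30)}
print(select_tile_versions(tiles, viewport))
```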
The remaining part of this section discusses tile-based viewport-dependent streaming and is organized as follows. The present OMAF video profiles use either the Advanced Video Coding (AVC) [18] or the High Efficiency Video Coding (HEVC) [19] standard as the basis. Section IV-B describes the use of AVC and HEVC for tile-based viewport-dependent streaming. In a typical arrangement for tile-based viewport-dependent 360° streaming, a player binds the received tiles into a single video bitstream for decoding. Section IV-C presents tile binding approaches applicable to OMAF video profiles. Section IV-D introduces tile index and tile data segment formats that are specified in OMAF v2 for improving viewport-dependent streaming. Section IV-E discusses a content authoring pipeline for tile-based viewport-dependent streaming.

B. Isolated Regions in AVC and HEVC

Video coding formats provide different high-level structures for realizing isolated regions, which are used as elementary units in tile-based viewport-dependent 360° streaming. This section provides more details on how isolated regions can be realized in AVC and HEVC.
Table 3. Approaches for Achieving VS 360° Streams.

In HEVC, a picture is split into tiles along a grid of tile columns and rows. A slice can be either an integer number of complete tiles or a subset of a single tile. Coded slices consist of a slice header and slice data. Among other things, the slice header indicates the position of the slice within the picture. Encoders can choose to use only rectangular slices, keep the tile and slice boundaries unchanged throughout a coded video sequence, and constrain the coding mode and motion vector selection so that a slice references only the collocated slices in the reference picture(s). In a common operation mode, a slice encloses a set of one or more complete tiles, which can be referred to as a motion-constrained tile set (MCTS).

AVC does not enable picture partitioning into tiles. However, slices can be arranged vertically into a single column, and their encoding can be constrained as described above for HEVC.

A sub-picture is a picture that represents a spatial subset of the original video content. Consequently, a sub-picture bitstream represents a sub-picture sequence. As an alternative to partitioning pictures into tiles and/or slices, pictures can be split prior to encoding into sub-picture sequences. Each sub-picture sequence is encoded with constraints on the coding modes and motion vectors so that the encoded sub-picture bitstreams can be merged into a single bitstream with multiple tiles.

Each coded tile or sub-picture sequence is typically stored in its own track. There are a few options for the storage of a coded tile or sub-picture sequence as a track, which are summarized in Table 5. A sub-picture track contains a sub-picture bitstream and can be decoded with a regular decoding process of AVC or HEVC. Slice headers

Table 5. Storage Options for Coded Sub-Picture and Tile Sequences.

tile tracks originating from several bitstreams, which may require rewriting of parameter sets and slice headers.

C. Tile Binding

OMAF supports both author-driven and late tile binding approaches. In author-driven tile binding, the processing that requires knowledge of the video coding format is performed by content authors, and OMAF players follow instructions created as a part of the content authoring process to merge tiles. In late tile binding, OMAF players rewrite high-level syntax structures of a video bitstream to merge tiles. Both tile binding approaches are described in further detail in the following.

In author-driven tile binding, an extractor track contains instructions to extract data from other tracks and is resolved into a single video bitstream. Extractor tracks are specified in the ISOBMFF encapsulation format of HEVC and AVC bitstreams (ISO/IEC 14496-15). In author-driven tile binding, an extractor track serves as a prescription for OMAF players of how tiles are merged from other tracks. An extractor track also contains rewritten parameter sets and slice headers since they cannot typically be inherited from the referenced tracks.
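Conceptually, resolving an extractor track can be thought of as executing a per-sample list of copy instructions: inline bytes authored into the extractor track (e.g., rewritten parameter sets and slice headers) interleaved with byte ranges copied from time-aligned samples of the referenced tile or sub-picture tracks. The sketch below uses that simplified model with hypothetical instruction tuples; the normative extractor syntax (sample constructors, inline constructors, and so on) is defined in ISO/IEC 14496-15 and is not reproduced here.

```python
def resolve_extractor_sample(instructions, referenced_samples):
    """instructions: list of ('inline', bytes) or
                     ('copy', track_id, offset, length) tuples (illustrative model).
       referenced_samples: {track_id: bytes of the time-aligned sample}.
       Returns the merged video sample handed to the single decoder instance."""
    out = bytearray()
    for instr in instructions:
        if instr[0] == "inline":             # bytes stored in the extractor track itself
            out += instr[1]
        else:                                # bytes copied from a referenced track
            _, track_id, offset, length = instr
            out += referenced_samples[track_id][offset:offset + length]
    return bytes(out)

# Hypothetical example: rewritten header bytes followed by two tile payloads.
merged = resolve_extractor_sample(
    [("inline", b"\x00\x00\x01"),            # placeholder for rewritten header bytes
     ("copy", 1, 0, 1200), ("copy", 2, 0, 1500)],
    {1: b"\x00" * 1200, 2: b"\x00" * 1500})
print(len(merged))
```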
In free-viewport author-driven tile binding, an extractor track suits any viewing orientation (hence, the qualifier free-viewport) and provides multiple options for how tiles can be merged. For example, an extractor track may contain references to track groups, each containing collocated tiles of different bitrates. An OMAF player can choose tiles
Fig. 11. Example of late tile binding.

headers are rewritten so that a conforming video bitstream is obtained. In this example, the OMAF player selects all low-resolution tile tracks as a fallback to cope with sudden viewing orientation changes and 27 high-resolution tile tracks covering the viewport.

D. Tile Index and Tile Data Segment Formats

In tile-based viewport-dependent 360° streaming, the number of Representations can be relatively high, even up to hundreds, since the content may be partitioned into several tens of tiles and may be coded with several resolutions and bitrates. Moreover, the duration of (sub)segments may be inconveniently long to update the viewport quickly with high-quality tiles after a viewing orientation change. Thus, requests having a finer granularity than (sub)segments could be desirable. To enable fine-grained requests, even down to a single picture interval, and to obtain the indexing data conveniently for all tiles, OMAF v2 includes new segment formats, namely, an initialization segment for an OMAF base track, a tile index segment, and a tile data segment.

The initialization segment for an OMAF base track contains the track header for the OMAF base track and all the referenced tile or sub-picture tracks. This allows the client to download only the initialization segment for the OMAF base track without the need to download the initialization segments of the referenced tile or sub-picture tracks.

The tile index segment is logically an index segment as specified in the DASH standard. It is required to include MovieFragmentBoxes for the OMAF base track and all the referenced tile or sub-picture tracks. MovieFragmentBoxes indicate the byte ranges on a sample basis. Consequently, a client can choose to request content in smaller units than (sub)segments.

The tile data segments are media segments containing only media data enclosed in IdentifiedMediaDataBoxes ("imda"). The byte offsets contained in MovieFragmentBoxes ("moof") are relative to the start of IdentifiedMediaDataBoxes. Thus, MovieFragmentBoxes and media data can reside in separate resources, unlike in conventional DASH segment formats where the byte offsets to the media data are relative to the MovieFragmentBox. The box payload of each IdentifiedMediaDataBox starts with a sequence number that is also contained in the respective MovieFragmentBox.

E. Content Authoring

Since OMAF supports many types of viewport-dependent streaming, a content author has the freedom to choose which approach is used for preparing the content. Thus, the viewport-dependent streaming approach needs to be selected first. Preparation of multiple VS 360° streams would require preprocessing (e.g., generation of regionwise mixed-resolution content), spatially tailored encoding, and/or rewriting of encoded streams. The choice between tile-based viewport-dependent streaming approaches may depend on the resolution of the original content, the expected decoding capability, and the expected display resolutions. The targeted OMAF video profile also limits which codecs and viewport-dependent streaming approaches can be supported, as indicated in Table 6.

A benefit of both the viewport + 360° video and RWMR methods is that they enable improving the resolution on the viewport with a constrained video decoding capacity. For example, in [27], it was shown that the viewport can originate from a 6K (6144 × 3072) version of the content even though the decoding capacity of the OMAF player only ranges up to about 4K (4096 × 2048) resolution. That article also compared the rate-distortion performance of the RWMR and RWMQ approaches. An advantage of RWMR compared to the viewport + 360° technique is that no decoding capacity is spent on decoding low-resolution video that is superimposed by the high-resolution tiles.

Some devices may have problems downloading tens of HTTP streams in parallel, each requiring bandwidth of up to several Mb/s. It is, therefore, advisable to keep the number of required tile or sub-picture representations for author-driven tile binding at the lower end of the range allowed by the codec, at least in some extractor or tile base tracks.

In the following, we concentrate on the tile-based operation of HEVC, while an AVC-based pipeline could be implemented similarly. The content authoring workflow for tile-based viewport-dependent operation is depicted in Fig. 12, and the steps of the workflow are described in the next paragraphs. For practical implementation examples, the Nokia OMAF reference implementation [4] covers steps 2–6 described below, and HEVC encoding with tiles is supported, for example, in the HM reference software [36] and in the Kvazaar open-source software [37].

1) Encoding: The video content is encoded using tiles, or the content is split into sub-picture sequences before encoding and then encoded in a constrained manner so that merging of the coded sub-picture sequences into the same bitstream is possible. Usually, multiple versions of the content are generated at different bitrates. A relatively short random access interval,
e.g., in the order of 1 s, is used in encoding to enable frequent viewport switching.

2) Bitstream Processing: A processing step may be needed to prepare the encoded bitstreams for encapsulation into sub-picture or tile tracks. When the content was encoded using tiles, each tile sequence is extracted from the bitstream. This requires parsing of the high-level structure of the bitstream, including parameter sets and slice headers. When sub-picture bitstreams were encoded, no additional processing at this phase is needed.

3) Sub-Picture or Tile Track Generation: OMAF video profiles constrain which sample entry types are allowed for the sub-picture or tile tracks. Slice headers require rewriting in all cases where the slice position in the encoded bitstream does not match the position implied by the sample entry type. As an integral part of generating both the sub-picture or tile tracks and the extractor or tile base track(s), the necessary OMAF file format metadata is also authored.

4) Extractor or Tile Base Track Generation: If the "hvt1" or "hvt3" sample entry type is in use, a tile base track is generated. Otherwise, one or more extractor tracks are created. A single extractor track is typically sufficient for free-viewport author-driven tile binding, whereas one extractor track per distinct viewing direction may be needed for VS author-driven tile binding.

5) (Sub)segment Encapsulation: (Sub)segments are created from each track for DASH delivery. When conventional segment formats specified in the DASH standard are in use, no changes to the (sub)segment encapsulation process are needed compared to the corresponding process for 2-D video.

6) DASH MPD Generation: An MPD is generated. Each extractor track and tile base track forms a representation in its own adaptation set. An adaptation set consists of the sub-picture or tile representations covering the same sphere region at the same resolution but at different bitrates. The DASH preselection feature is used to associate the extractor or tile base adaptation set with the associated sub-picture or tile adaptation sets. Moreover, in this processing step, the OMAF file metadata is interpreted to create the OMAF extensions for the DASH MPD (a minimal MPD sketch is given after this list).
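As a sketch of step 6, the following uses Python's ElementTree to emit a skeleton MPD with one adaptation set for a hypothetical extractor representation and one for a group of collocated tile representations. The preselection association is expressed here with a SupplementalProperty descriptor whose scheme URN and value format are given from memory of ISO/IEC 23009-1 and should be verified; the OMAF-specific descriptors (projection, RWP, coverage, and so on) are omitted, and the codecs strings are illustrative.

```python
import xml.etree.ElementTree as ET

DASH_NS = "urn:mpeg:dash:schema:mpd:2011"
ET.register_namespace("", DASH_NS)

def q(tag):                          # qualify a tag with the DASH namespace
    return f"{{{DASH_NS}}}{tag}"

mpd = ET.Element(q("MPD"), type="static", minBufferTime="PT2S")
period = ET.SubElement(mpd, q("Period"))

# Adaptation set carrying the extractor (or tile base) representation.
extractor_as = ET.SubElement(period, q("AdaptationSet"), id="0", mimeType="video/mp4")
ET.SubElement(extractor_as, q("SupplementalProperty"),
              schemeIdUri="urn:mpeg:dash:preselection:2016",   # assumed URN, verify
              value="ext0,0 1")      # illustrative: preselection tag + component ids
ET.SubElement(extractor_as, q("Representation"),
              id="extractor", bandwidth="200000", codecs="hvc2.1.6.L153.B0")

# One adaptation set per group of collocated tiles, with bitrate alternatives.
tile_as = ET.SubElement(period, q("AdaptationSet"), id="1", mimeType="video/mp4")
for bw in (800_000, 2_000_000):
    ET.SubElement(tile_as, q("Representation"),
                  id=f"tile0_{bw}", bandwidth=str(bw), codecs="hvc1.1.6.L153.B0")

print(ET.tostring(mpd, encoding="unicode"))
```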
Fig. 12. Basic flow of content authoring operations for tile-based viewport-dependent streaming.

V. OMAF VIDEO PROFILES

A summary of the video profiles specified in OMAF is presented in Table 6. This section first introduces the video profiles and then discusses the similarities and differences between the profiles.
The HEVC-based viewport-independent profile is intended for basic viewport-independent files and streaming using the ERP. In OMAF v1, the decoding capacity of the HEVC-based viewport-independent profile was limited to approximately 4K (4096 × 2048) resolution at a 60-Hz picture rate, while the unconstrained HEVC-based viewport-independent profile was specified similarly in OMAF v2 but without decoding capacity constraints, to respond to the need for higher HMD resolutions and the availability of more powerful video decoding hardware.

The HEVC- and AVC-based viewport-dependent profiles support both VS streaming and different types of tile-based viewport-dependent streaming schemes. Two tiling profiles, namely the simple and advanced tiling profiles, were added for viewport-dependent streaming in OMAF v2. The main difference of the simple tiling profile compared to the HEVC-based viewport-dependent profile is the use of the tile index and tile data segment formats. The advanced tiling profile is the only profile that uses the 3-D mesh projection format and requires players to support late tile binding, while, otherwise, it is similar to the simple tiling profile.

Bit Depth: Since the HEVC-based profiles require support for the HEVC Main 10 profile, they support bit depths up to 10 bits, whereas the AVC-based viewport-dependent profile is limited to 8 bits per color component.

Decoding Capacity: The HEVC-based profiles specified in OMAF v1 require support for Level 5.1, which, in practice, means a decoding capacity of approximately 4K pictures at 60 Hz, whereas the AVC-based profile can support only 4K pictures at 30 Hz. The profiles specified in OMAF v2 are tailorable in terms of decoding capacity, and thus, no HEVC level constraints are specified for them.

Projection Formats and RWP: In the HEVC-based viewport-independent profile, RWP can be used only to indicate limited content coverage. In the HEVC- and AVC-based viewport-dependent profiles, RWP is not constrained. In the simple tiling profile, RWP is otherwise unconstrained, but a single region is not allowed to cross a boundary of a projection surface, such as a cube face boundary. Moreover, the RWP format of an OMAF base track is not indicated but is inherited by OMAF players from the selected tile or sub-picture tracks. Consequently, OMAF base tracks can enable free-viewport author-driven tile binding. In the advanced tiling profile, the 3-D mesh format is used, and RWP is disabled.

Viewport-Dependent Streaming: The HEVC- and

contain exactly one HEVC tile. Compared to the advanced tiling profile, the HEVC-based viewport-dependent and the simple tiling profiles provide more freedom since they enable using rectangular slices that comprise one or more tiles or a subset of a tile as the unit for tile-based streaming.

VI. OMAF IMAGE PROFILES

The image profiles of OMAF were designed to be seamlessly compatible with HEIF. Consequently, devices and platforms with HEIF capability are easily extensible to support 360° images with the metadata specified in OMAF. Since OMAF is a toolbox standard, it is envisioned that devices could implement only specific parts of OMAF. For example, 360° cameras could support only an OMAF image profile or the HEIF image metadata specified in the OMAF standard.

At the time of releasing OMAF v1, there was arguably no other standard for storage of 360° images with the necessary metadata for displaying them properly. Since then, the JPEG 360 standard [33] was finalized and includes omnidirectional metadata specifications for JPEG [34] and JPEG 2000 [35] images. Since OMAF specifies the omnidirectional image metadata for HEIF files, there is no overlap with JPEG 360 even though the types of metadata in OMAF and JPEG 360 are similar.

OMAF v2 integrates images more tightly into 360° presentations that can also contain timed media types. Images can be used as overlays enriching an omnidirectional video background. The opposite arrangement is equally supported, i.e., an omnidirectional background image can be accompanied by video overlays. Moreover, presentations with multiple viewpoints can equally use images or video clips as the visual content of the viewpoints.

Table 7. OMAF Image Profiles.

OMAF v1 specifies two profiles for projected omnidirectional images: the OMAF HEVC image profile uses the HEVC Main 10 profile, and the OMAF legacy image profile uses the JPEG codec, as summarized in Table 7. Both OMAF image profiles are compatible with HEIF, and they share common features, as listed in Table 8. Coded image items of the OMAF HEVC image profile are limited to approximately the 4K resolution, but larger image sizes can be achieved by using the "grid" derived image item, which arranges input images onto a grid to create a large output image. The image resolution constraint ensures that most
hardware implementations can be used for HEVC image decoding.

Table 9. OMAF Toolset Brands.

VII. OMAF TOOLSET BRANDS

A. Introduction

OMAF v2 specifies viewpoint, nonlinear storyline, and overlay toolset brands, which are summarized in Table 9. Compatibility with a toolset brand can be indicated at the file level using the 4CC of the brand. This section reviews the OMAF features for multiple viewpoints and overlays, as well as the toolset brands.

B. Multiple Viewpoints

OMAF v2 supports 360° video content comprising pieces captured by multiple 360° video cameras or camera rigs, referred to as viewpoints. This way, users can switch between different viewpoints, e.g., in a basketball game switch between scenes captured by 360° video cameras located at different ends of the court.

Switching between viewpoints captured by 360° video cameras that can "see" each other can be seamless in the sense that, after switching, the user still sees the same object, e.g., the same player in a sports game, just from a different viewing angle. However, when there is an obstacle, e.g., a wall, between two 360° video cameras such that they cannot "see" each other, switching between the two viewpoints incurs a noticeable cut or transition.

When multiple viewpoints exist, identification and association of tracks or image items belonging to one viewpoint are needed. For this purpose, OMAF specifies the viewpoint grouping of tracks and image items, as well as similar metadata for the DASH MPD. This grouping mechanism provides an identifier (ID) of the viewpoint and a set of other information that can be used to assist streaming of the content and switching between different viewpoints. Such information includes the following (a structural sketch of these fields is given after the list).

1) A label, for annotation of the viewpoint, e.g., "home court."

2) Mapping of the viewpoint to a viewpoint group consisting of cameras that "see" each other and have an indicated viewpoint group ID. This information provides a means to indicate whether the switching between two particular viewpoints can be seamless.

3) Viewpoint position relative to the common reference coordinate system shared by all viewpoints of a viewpoint group. Viewpoint positions enable a good user experience during viewpoint switching, provided that the client can properly utilize the positions in its rendering process.

4) Rotation information for conversion from the global coordinate system of the viewpoint to the common reference coordinate system.

5) Optionally, the orientation of the common reference coordinate system relative to the geomagnetic north.

6) Optionally, the global positioning system (GPS) location of the viewpoint, which enables the client application to place the viewpoint on a real-world map.

7) Optionally, viewpoint switching information, which provides a number of switching transitions possible from the current viewpoint and, for each of these, information such as the sphere region that a user can select to cause the viewpoint switch, the destination viewpoint, the viewport to view after switching, the presentation time to start the playback of the destination viewpoint, and a recommended transition effect during switching (such as zoom-in, walk-through, fade-to-black, or mirroring).

8) Optionally, viewpoint looping information indicating which time period of the presentation is looped and a maximum count of how many times the time period is looped. The looping feature can be used for requesting the end-user's input for initiating viewpoint switching.
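The sketch below groups the listed pieces of information into Python dataclasses simply to show how they relate to each other; the class and field names are illustrative and do not follow the normative OMAF syntax structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ViewpointSwitchOption:                       # item 7), one possible transition
    destination_viewpoint_id: int
    activation_sphere_region: Tuple[float, float, float, float]  # center az/el + az/el range
    viewport_after_switch: Tuple[float, float]     # azimuth, elevation to view after switching
    start_time_s: float                            # presentation time in the destination
    transition_effect: str = "fade-to-black"

@dataclass
class ViewpointInfo:
    viewpoint_id: int
    label: str                                     # 1) e.g., "home court"
    group_id: int                                  # 2) viewpoints that "see" each other
    position_xyz: Tuple[float, float, float]       # 3) in the common reference system
    rotation_ypr: Tuple[float, float, float]       # 4) global-to-common rotation
    north_orientation: Optional[float] = None      # 5) relative to geomagnetic north
    gps: Optional[Tuple[float, float, float]] = None  # 6) latitude, longitude, altitude
    switch_options: List[ViewpointSwitchOption] = field(default_factory=list)  # 7)
    loop_period_s: Optional[Tuple[float, float]] = None  # 8) looped time range
    max_loops: Optional[int] = None                       # 8) loop count limit
```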
Some of the viewpoints can be static, i.e., captured by 360° video cameras at fixed positions. Other viewpoints can be dynamic, e.g., captured by a 360° video camera mounted on a flying drone. For dynamic viewpoints, the above information is stored in timed metadata tracks that are time-synchronized with the media tracks.

C. Nonlinear Storyline

The viewpoint switching and looping information enable content authors to generate presentations with a nonlinear storyline. Each viewpoint is a scene in the storyline. The viewpoint switching metadata can be used to provide multiple switching options from which an end-user is required to choose before advancing to the next scene of the storyline. The user selection may be linked to a given sphere region, viewport region, or overlay, but other user input means are not precluded either. The viewpoint looping metadata may be used to create a loop in the playback of the current scene to wait for the user's selection. The viewpoint looping metadata also allow defining a default destination viewpoint that is applied when an indicated maximum number of loops has been passed.

Fig. 13 presents an example where Scene 1 is played until the end of its timeline, and then, a given time range of Scene 1 is repeated until an end-user selects between Scenes 2a and 2b. After completing the playback of Scene 2a or 2b, the playback automatically switches to Scene 3, after which the presentation ends.
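The playback behavior described for Fig. 13 can be summarized as a small control loop: play the scene, then repeat an indicated time range until the user picks a switch option or the maximum loop count is reached, in which case a default destination applies. The sketch below captures that logic with hypothetical helper callbacks for playback and user input; it is not part of the OMAF specification.

```python
def play_nonlinear_presentation(scenes, start_id, get_user_choice, play_range):
    """scenes: {scene_id: {"duration": float, "loop_range": (t0, t1) or None,
                           "max_loops": int, "default_next": scene_id or None}}.
       get_user_choice(scene_id) -> chosen destination scene_id or None.
       play_range(scene_id, t0, t1) plays the given time range (hypothetical callback)."""
    scene_id = start_id
    while scene_id is not None:
        scene = scenes[scene_id]
        play_range(scene_id, 0.0, scene["duration"])      # play the scene once
        next_id = None
        if scene["loop_range"] is not None:               # wait for the user's selection
            t0, t1 = scene["loop_range"]
            for _ in range(scene["max_loops"]):
                next_id = get_user_choice(scene_id)
                if next_id is not None:
                    break
                play_range(scene_id, t0, t1)              # repeat the indicated range
        scene_id = next_id if next_id is not None else scene["default_next"]

# Example resembling Fig. 13: Scene 1 loops until the user picks 2a or 2b,
# both of which continue automatically to Scene 3, after which playback ends.
scenes = {
    "1":  {"duration": 30.0, "loop_range": (20.0, 30.0), "max_loops": 5, "default_next": "2a"},
    "2a": {"duration": 20.0, "loop_range": None, "max_loops": 0, "default_next": "3"},
    "2b": {"duration": 25.0, "loop_range": None, "max_loops": 0, "default_next": "3"},
    "3":  {"duration": 15.0, "loop_range": None, "max_loops": 0, "default_next": None},
}
```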
D. Overlays

An overlay is a video clip, an image, or text that is superimposed on top of an omnidirectional video or image.
REFERENCES

[1] R. S. Kalawsky, The Science of Virtual Reality and Virtual Environments: A Technical, Scientific and Engineering Reference on Virtual Environments. Reading, MA, USA: Addison-Wesley, 1993.
[2] F. Biocca and M. R. Levy, Eds., Communication in the Age of Virtual Reality. Newark, NJ, USA: Lawrence Erlbaum Associates, 1995.
[3] Information Technology—Coded Representation of Immersive Media—Part 2: Omnidirectional Media Format, Standard ISO/IEC 23090-2:2019, 2019.
[4] Nokia OMAF Implementation. Accessed: Mar. 9, 2021. [Online]. Available: https://github.com/nokiatech/omaf
[5] D. Podborski et al., "HTML5 MSE playback of MPEG 360 VR tiled streaming: JavaScript implementation of MPEG-OMAF viewport-dependent video profile with HEVC tiles," in Proc. 10th ACM Multimedia Syst. Conf., Jun. 2019, pp. 324–327. [Online]. Available: https://www.youtube.com/watch?v=FpQiF8YEfY4 and https://github.com/fraunhoferhhi/omaf.js
[6] Open Visual Cloud Immersive Video Samples. Accessed: Mar. 9, 2021. [Online]. Available: https://github.com/OpenVisualCloud/Immersive-Video-Sample
[7] S. Deshpande, Y.-K. Wang, and M. M. Hannuksela, Eds., Text of ISO/IEC FDIS 23090-2 2nd Edition OMAF, document ISO/IEC JTC1 SC29 WG3 N00072, Dec. 2020.
[8] K. K. Sreedhar, I. D. D. Curcio, A. Hourunranta, and M. Lepistö, "Immersive media experience with MPEG OMAF multi-viewpoints and overlays," in Proc. 11th ACM Multimedia Syst. Conf., May 2020, pp. 333–336. [Online]. Available: https://www.youtube.com/watch?v=WcucAw3HNVE
[9] How ClearVR Drives and Leverages Standards. Accessed: Oct. 27, 2020. [Online]. Available: https://www.tiledmedia.com/index.php/standards/
[10] VR Industry Forum Guidelines, Version 2.3. Accessed: Jan. 2021. [Online]. Available: https://www.vr-if.org/guidelines/
[11] VR Industry Forum Newsletter. Accessed: Dec. 2020. [Online]. Available: https://www.vr-if.org/december-2020-newsletter/
[12] Virtual Reality (VR) Profiles for Streaming Applications, document 3GPP Technical Specification 26.118, 2020. Accessed: Oct. 27, 2020. [Online]. Available: https://www.3gpp.org/ftp//Specs/archive/26_series/26.118/
[13] Information Technology—Coding of Audio-Visual Objects—Part 12: ISO Base Media File Format, Standard ISO/IEC 14496-12, 2012.
[14] Information Technology—Dynamic Adaptive Streaming Over HTTP (DASH)—Part 1: Media Presentation Description and Segment Formats, Standard ISO/IEC 23009-1:2019, 2019.
[15] M. M. Hannuksela, Y.-K. Wang, and A. Hourunranta, "An overview of the OMAF standard for 360° video," in Proc. Data Compress. Conf. (DCC), Mar. 2019, pp. 418–427.
[16] Information Technology—High Efficiency Coding and Media Delivery in Heterogeneous Environments—Part 12: Image File Format, Standard ISO/IEC 23008-12, 2012.
[17] M. M. Hannuksela, E. B. Aksu, V. K. M. Vadakital, and J. Lainema, Overview of the High Efficiency Image File Format, document JCTVC-V0072, Oct. 2015, pp. 1–12. [Online]. Available: http://phenix.it-sudparis.eu/jct/doc_end_user/documents/22_Geneva/wg11/JCTVC-V0072-v1.zip
[18] Advanced Video Coding, document ITU-T Rec. H.264, ISO/IEC 14496-10, 2010.
[19] High Efficiency Video Coding, document ITU-T Rec. H.265, ISO/IEC 23008-2, 2002.
[20] M. Budagavi, J. Furton, G. Jin, A. Saxena, J. Wilkinson, and A. Dickerson, "360 degrees video coding using region adaptive smoothing," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2015, pp. 750–754.
[21] R. G. Youvalari, A. Aminlou, and M. M. Hannuksela, "Analysis of regional down-sampling methods for coding of omnidirectional video," in Proc. Picture Coding Symp. (PCS), Dec. 2016, pp. 1–5.
[22] M. Tang, Y. Zhang, J. Wen, and S. Yang, "Optimized video coding for omnidirectional videos," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2017, pp. 799–804.
[23] Y. Li, J. Xu, and Z. Chen, "Spherical domain rate-distortion optimization for omnidirectional video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 6, pp. 1767–1780, Jun. 2019.
[24] A. Zare, A. Aminlou, M. M. Hannuksela, and M. Gabbouj, "HEVC-compliant tile-based streaming of panoramic video for virtual reality applications," in Proc. ACM Multimedia Conf., Oct. 2016, pp. 601–605.
[25] K. K. Sreedhar, A. Aminlou, M. M. Hannuksela, and M. Gabbouj, "Viewport-adaptive encoding and streaming of 360-degree video for virtual reality applications," in Proc. IEEE Int. Symp. Multimedia, Dec. 2016, pp. 583–586.
[26] R. Ghaznavi-Youvalari et al., "Comparison of HEVC coding schemes for tile-based viewport-adaptive streaming of omnidirectional video," in Proc. IEEE 19th Int. Workshop Multimedia Signal Process. (MMSP), Oct. 2017, pp. 1–6.
[27] A. Zare, A. Aminlou, and M. M. Hannuksela, "6K effective resolution with 4K HEVC decoding capability for OMAF-compliant 360° video streaming," in Proc. 23rd Packet Video Workshop, Jun. 2018, pp. 72–77.
[28] H. Hristova, X. Corbillon, G. Simon, V. Swaminathan, and A. Devlic, "Heterogeneous spatial quality for omnidirectional video," in Proc. IEEE 20th Int. Workshop Multimedia Signal Process. (MMSP), Aug. 2018, pp. 1–6.
[29] A. Smolic and P. Kauff, "Interactive 3-D video representation and coding technologies," Proc. IEEE, vol. 93, no. 1, pp. 98–110, Jan. 2005.
[30] P. R. Alface, J.-F. Macq, and N. Verzijp, "Evaluation of bandwidth performance for interactive spherical video," in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2011, pp. 1–6.
[31] M. M. Hannuksela, Y.-K. Wang, and M. Gabbouj, "Isolated regions in video coding," IEEE Trans. Multimedia, vol. 6, no. 2, pp. 259–267, Apr. 2004.
[32] R. Skupin, Y. Sanchez, C. Hellge, and T. Schierl, "Tile based HEVC video for head mounted displays," in Proc. IEEE Int. Symp. Multimedia (ISM), Dec. 2016, pp. 399–400.
[33] Information Technology—JPEG Systems—JPEG 360, Standard ISO/IEC 19566-6:2019, 2019.
[34] Digital Compression and Coding of Continuous-Tone Still Images, Standard ISO/IEC 10918-1:1994, 1994.
[35] JPEG 2000 Image Coding System, Standard ISO/IEC 15444-1:2019, 2019.
[36] Reference Software for High Efficiency Video Coding, document ITU-T Rec. H.265.2, Dec. 2016, and ISO/IEC 23008-5:2017, 2017.
[37] A. Lemmetti, M. Viitanen, A. Mercat, and J. Vanne, "Kvazaar 2.0: Fast and efficient open-source HEVC inter encoder," in Proc. 11th ACM Multimedia Syst. Conf., May 2020, pp. 237–242. [Online]. Available: https://github.com/ultravideo/kvazaar
[38] M.-L. Champel and I. D. D. Curcio, Eds., Requirements for MPEG-I Phase 2, document ISO/IEC JTC1 SC29 WG11 N19511, Jul. 2020.
[39] Visual Volumetric Video-Based Coding and Video-Based Point Cloud Compression, document ISO/IEC JTC1 SC29 WG11 N19579, Sep. 2020.
[40] MPEG Immersive Video, document ISO/IEC CD 23090-12, ISO/IEC JTC1 SC29 WG11 N19482, Jul. 2020.