
An Overview of Omnidirectional MediA Format (OMAF)

By MISKA M. HANNUKSELA, Member IEEE, AND YE-KUI WANG

ABSTRACT | During recent years, there have been product launches and research for enabling immersive audio–visual media experiences. For example, a variety of head-mounted displays and 360° cameras are available in the market. To facilitate interoperability between devices and media system components by different vendors, the Moving Picture Experts Group (MPEG) developed the Omnidirectional MediA Format (OMAF), which is arguably the first virtual reality (VR) system standard. OMAF is a storage and streaming format for omnidirectional media, including 360° video and images, spatial audio, and associated timed text. This article provides a comprehensive overview of OMAF.

KEYWORDS | 360° video; Dynamic Adaptive Streaming over HTTP (DASH); file format; Omnidirectional MediA Format (OMAF); omnidirectional media; viewport; virtual reality (VR).

I. INTRODUCTION

Virtual reality (VR) has been researched and trialed for many years [1], [2]. Due to the growth of computing capability in devices and network bandwidth, as well as advances in the technology for head-mounted displays (HMDs), wide deployment of VR became possible only recently. Facebook's two-billion-dollar acquisition of Oculus in 2014 seemed to be a start and a catalyst to the fast proliferation of VR research and development, device production, and services throughout the globe. Almost suddenly, VR became a buzzword everywhere in the world, many companies in the information and communication technology field started to have VR as an important strategic direction, and all kinds of VR cameras and devices started to be available in the market.

Unavoidably, numerous, different, noninteroperable VR solutions have been designed and used. This called for standardization, for which the number one target is always to enable devices and services by different manufacturers and providers to interoperate.

The Moving Picture Experts Group (MPEG) started to look at the development of a VR standard in October 2015. This effort led to the arguably first VR system standard, called Omnidirectional MediA Format (OMAF) [3]. OMAF defines a media format that enables omnidirectional media applications, focusing on 360° video, images, and audio, as well as the associated timed text. The first edition (also referred to as the first version or v1) of OMAF was finalized in October 2017. It provides basic support for 360° video, images, and audio with three degrees of freedom (3DOF), meaning that only rotations around the coordinate axes are supported. Since the finalization of the standard, source code packages of several implementations compatible with OMAF v1 have been made publicly available [4]–[6]. The development of the second edition of OMAF was completed in October 2020. OMAF v2 [7] includes all v1 features and also supports richer 360° presentations with overlays and multiple viewpoints and improves viewport-dependent delivery. OMAF v2 enables limited support for six degrees of freedom (6DOF), where the translational movement of the user impacts the rendering of overlays. Even though OMAF v2 was just recently finalized, there are already implementations supporting its new features [8], [9].

Manuscript received February 28, 2020; revised October 29, 2020; accepted February 19, 2021. (Corresponding author: Miska M. Hannuksela.)
Miska M. Hannuksela is with Nokia Technologies, 33100 Tampere, Finland (e-mail: [email protected]).
Ye-Kui Wang is with Bytedance Inc., San Diego, CA 92130 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/JPROC.2021.3063544

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/


Fig. 1. OMAF architecture.

OMAF has been further profiled to suit specific industries and environments by the VR Industry Forum (VRIF) and the 3rd Generation Partnership Project (3GPP). VRIF has the mission to advocate industry consensus on standards for the end-to-end VR ecosystem and chose to reference some of the OMAF media profiles and specific viewport-dependent streaming scenarios in the VRIF Guidelines [10]. Just a few months after finalizing OMAF v2, the VRIF Guidelines were updated to version 2.3, which incorporates selected video profiles and toolset brands from OMAF v2. At the time of writing this article, the Streaming Video Alliance is carrying out a trial using selected OMAF video profiles as recommended in the VRIF Guidelines for streaming to various end-user devices [11]. 3GPP standardizes cellular telecommunications, including multimedia services. The 3GPP specification on VR profiles for streaming applications [12] is based on technical elements specified in OMAF v1.

Fig. 1 shows the OMAF architecture, which consists of three major modules: OMAF content authoring, delivery, and OMAF player. The OMAF content authoring module consists of media acquisition, omnidirectional video/image preprocessing, media encoding, and media file and segment encapsulation. OMAF may either use file delivery or streaming delivery, for which the content is timewise partitioned into segments. The OMAF player module mainly consists of the media file and segment decapsulation, media decoding, and media rendering. In some operation modes, the media decapsulation block may contain a bitstream rewriting process that combines several delivered streams into one video bitstream for decoding and rendering. Note that the rendering process is not normatively specified in the OMAF standard. The OMAF player also contains essential processing blocks for the player operation, namely, the tracking and selection strategy modules. The tracking module controls the viewing orientation and, in some cases, also the viewing position according to which the content is rendered. For example, the tracking module may obtain the head orientation when an HMD is used for rendering. The selection strategy module decides which content pieces are streamed. The delivery access module acts as a bridge between the selection strategy and stream(s) delivery.
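The interplay of these player modules can be pictured as a simple control loop. The following Python sketch is purely illustrative: the module objects and their method names (get_viewing_orientation, select, fetch, and so on) are hypothetical and are not defined by OMAF.

def player_loop(tracking, selection_strategy, delivery_access, decoder, renderer):
    # Illustrative OMAF player main loop; all objects and methods are hypothetical.
    while True:
        # Tracking: obtain the current viewing orientation (and possibly position).
        orientation = tracking.get_viewing_orientation()
        # Selection strategy: decide which content pieces to stream next.
        requests = selection_strategy.select(orientation)
        # Delivery access: fetch the selected (sub)segments, e.g., over HTTP.
        for segment in delivery_access.fetch(requests):
            # Decapsulation and optional bitstream rewriting precede decoding.
            for picture in decoder.decode(segment):
                renderer.render(picture, tracking.get_viewing_orientation())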
The media types supported in OMAF include video, audio, image, and timed text. However, in this article, we focus only on video and image, and therefore, we will not discuss audio and timed text beyond this point.

The key underlying technologies for file/segment encapsulation and delivery of OMAF are the ISO Base Media File Format (ISOBMFF) [13] and Dynamic Adaptive Streaming over HTTP (DASH) [14].


OMAF specifies file format and DASH extensions in a backward-compatible manner, which enables reusing of existing ISOBMFF and DASH implementations for conventional 2-D media formats with only moderate changes. Note that, while OMAF also specifies signaling and delivery of omnidirectional media over MPEG Media Transport (MMT, ISO/IEC 23008-1), it is not discussed in this article.

This article is organized as follows. ISOBMFF and DASH basics are reviewed in Section II. Representation formats of omnidirectional video/image are discussed in Section III. Section IV provides an introduction to 360° video streaming with an emphasis on viewport-dependent streaming, which mitigates the large resolution and high bitrate required for 360° video by prioritizing the displayed area, i.e., the viewport. Sections V and VI present the OMAF video and image profiles, which specify how a media codec is adapted for omnidirectional application usage. OMAF v2 defines the concept of toolset brands for functionalities beyond basic playback of omnidirectional audio–visual content. Toolset brands are elaborated in Section VII. In Section VIII, we draw a conclusion and take a look at future VR standardization work in MPEG.

This article contains a significant amount of additional details compared to our earlier paper that provides a simpler overview of OMAF v1 [15]. Furthermore, we have added the descriptions for omnidirectional images and OMAF image profiles. Moreover, this article is arguably the first publication that provides a comprehensive review of OMAF v2.

II. BACKGROUND

A. ISOBMFF and HEIF
The ISOBMFF is a popular media container format for audio, video, and timed text. ISOBMFF compliant files are often casually referred to as MP4 files. The High Efficiency Image File Format (HEIF) [16] derives from the ISOBMFF and is gaining popularity as a storage format for still images and image sequences, such as exposure stacks. It is natively supported by major operating systems for smartphones and personal computers, i.e., iOS and Android, as well as Windows 10 and MacOS. OMAF file format features for omnidirectional video and still images are built on top of ISOBMFF and HEIF, respectively.

A basic building block in ISOBMFF is called a box, which is a data structure consisting of a four-character-code (4CC) box type, the byte count of the box, and a payload, whose format is determined by the box type and which may contain other boxes. An ISOBMFF file consists of a sequence of boxes.

Each stream of timed media or metadata is logically stored in a track, for which timestamps, random access positions, and other information are provided in respective boxes. The media data for tracks are composed of samples carried in MediaDataBox(es), where each sample corresponds to the coded media data of a single time instance. It is possible to store the track metadata for its entire duration in a MovieBox or to split the metadata in time ranges using MovieFragmentBoxes. In a self-contained movie fragment, the MediaDataBox containing the samples of a movie fragment is next to the respective MovieFragmentBox.

A sample entry of a track describes the coding and encapsulation format used in the samples and includes a 4CC sample entry type and contained boxes that provide further information on the format or content of the track. A restricted video sample entry type ("resv") is used for video tracks that require postprocessing operations after decoding to be displayed properly. The type of postprocessing is specified by one or more scheme types associated with the restricted video track.

ISOBMFF defines items for storing untimed media or metadata, and HEIF uses items for storing still images. In addition to coded image items, HEIF supports derived image items, where an operation corresponding to the type of the derived image item is performed on one or more indicated input images to produce an output image to be displayed. The "grid" derived image item arranges input images onto a grid to create a large output image. Metadata that are specific to an item are typically stored as an item property. A comprehensive technical summary of HEIF is available in [17].
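To make the box structure concrete, the sketch below walks the top-level boxes of an ISOBMFF file in Python. It handles the common 32-bit size field, the 64-bit largesize escape, and the size-0 case (box extends to the end of the file), but it is a reading aid rather than a complete parser.

import struct

def iter_top_level_boxes(path):
    # Yield (box_type, payload_offset, payload_size) for top-level ISOBMFF boxes.
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                return
            size, box_type = struct.unpack(">I4s", header)
            header_size = 8
            if size == 1:  # 64-bit "largesize" follows the box type
                size = struct.unpack(">Q", f.read(8))[0]
                header_size = 16
            elif size == 0:  # box extends to the end of the file
                cur = f.tell()
                f.seek(0, 2)
                size = f.tell() - cur + header_size
                f.seek(cur)
            yield box_type.decode("ascii", "replace"), f.tell(), size - header_size
            f.seek(size - header_size, 1)  # skip the payload

# Example usage: list the top-level boxes (typically ftyp, moov, mdat, ...).
# for box_type, offset, size in iter_top_level_boxes("example.mp4"):
#     print(box_type, offset, size)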
B. DASH
DASH specifies a Media Presentation Description (MPD) format for describing the content available for streaming and segment formats for the streamed content. There are three basic types of segments in DASH: initialization segment, media segment, and index segment. Initialization segments are meant for bootstrapping the media decoding and playback. Media segments contain the coded media data. Index segments provide a directory to the media segments for accessing them in a more fine-grained manner than on a segment basis. In the segment format for ISOBMFF, each media segment consists of one or more self-contained movie fragments, whereas the movie header containing the track header is delivered as an initialization segment. It is possible to omit separate initialization segments by creating self-initializing media segments that contain the necessary movie and track headers. Conventionally, index segments have not been used with ISOBMFF, but rather each media segment can be split into subsegments that are indexed within the media segment itself. DASH does not specify carriage of image items, but, since an image item can be used as a viewpoint, an overlay, or a background for overlays, OMAF v2 specifies carriage of image items as self-initializing media segments. Fig. 2 summarizes how timed and static media are encapsulated into ISOBMFF files and further into segments for DASH delivery.

Conventionally, DASH can be used in two operation modes, namely, live and on-demand. For both operation modes, the DASH standard provides profiles that specify constraints on the MPD and segment formats.


Fig. 2. Relation of media stream formats, ISOBMFF, and DASH units.

In the live profiles, the MPD contains sufficient information for requesting media segments, and the client can adapt the streaming bitrate by selecting the representations from which the media segments are received. In the on-demand profiles, in addition to the information in the MPD, the client typically obtains an index of subsegments of the media segments of each representation. The client selects the representation(s) from which subsegments are fetched and requests them using byte-range requests.

The MPD syntax is specified as an Extensible Markup Language (XML) schema and contains one or more adaptation sets, each containing one or more representations. A representation corresponds to an ISOBMFF track, and an adaptation set contains representations of the same content between which the player can select, e.g., based on the available bitrate.

The MPD format includes bitrates and other characteristics for representations and adaptation sets for player-driven content selection. DASH specifies essential and supplemental property descriptor elements for describing additional characteristics of representations or adaptation sets. When a player does not recognize an essential property descriptor, it is required to omit the representation or adaptation set that contains the descriptor. In contrast, a player is allowed to ignore an unknown supplemental property descriptor and continue the processing of the respective representation or adaptation set.

An MPD contains either a template for deriving a uniform resource locator (URL) for each segment or a list of segment URLs. Players use the URLs (or byte ranges of them) of the selected segments when requesting the content over the Hypertext Transfer Protocol (HTTP). A conventional web server can be used for responding to HTTP requests.
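The descriptor rules can be illustrated with a few lines of Python using the standard XML library. The MPD namespace below is the standard DASH one, whereas the set of recognized scheme URIs is just a placeholder for whatever a particular player implements, and only adaptation-set-level descriptors are checked for brevity.

import xml.etree.ElementTree as ET

DASH_NS = "{urn:mpeg:dash:schema:mpd:2011}"
# Placeholder: scheme URIs that this hypothetical player understands.
KNOWN_SCHEMES = {"urn:example:recognized-scheme"}

def usable_adaptation_sets(mpd_xml):
    # Return the adaptation sets that the player may use, per the DASH descriptor rules.
    usable = []
    root = ET.fromstring(mpd_xml)
    for aset in root.iter(DASH_NS + "AdaptationSet"):
        essential = aset.findall(DASH_NS + "EssentialProperty")
        # An unrecognized EssentialProperty forces the player to skip the element.
        if any(e.get("schemeIdUri") not in KNOWN_SCHEMES for e in essential):
            continue
        # Unrecognized SupplementalProperty descriptors may simply be ignored.
        usable.append(aset)
    return usable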
axes, where the rotation consists of yaw, pitch, and roll
acteristics for representations and adaptation sets for
rotation angles, around the Z-, Y-, and X-axes, respec-
player-driven content selection. DASH specifies essen-
tively. The use of unaligned global and local coordinate
tial and supplemental property descriptor elements for
axes can be advantageous, e.g., for correcting the horizon
describing additional characteristics of representations or
to be exactly horizontal in the projected omnidirectional
adaptation sets. When a player does not recognize an
video or image or for improving perceived picture quality
essential property descriptor, it is required to omit the rep-
by avoiding seams between projection surfaces to cross
resentation or adaptation set that contains the descriptor.
objects of interest. OMAF specifies the signaling and the
In contrast, a player is allowed to ignore an unknown sup-
plemental property descriptor and continue the processing
of the respective representation or adaptation set.
An MPD contains either a template for deriving a uni-
form resource locator (URL) for each segment or a list
of segment URLs. Players use the URLs (or byte ranges
of them) of the selected segments when requesting the
content over the Hypertext Transfer Protocol (HTTP).
A conventional web server can be used for responding to
HTTP requests.

III. R E P R E S E N T A T I O N F O R M A T S O F
OMNIDIRECTIONAL VIDEO
AND IMAGES
A. Introduction
OMAF specifies three types of representation for-
mats, namely, projected, mesh, and fisheye omnidirec-
tional video and images. These formats differ in image Fig. 3. OMAF coordinate system [3].
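For illustration, the sketch below converts sphere coordinates to a unit vector (with the X-axis pointing toward azimuth 0, elevation 0) and rotates that vector around the X-, Y-, and Z-axes. The rotation order and sign conventions used here are one common choice; the normative conversion equations are those in the OMAF specification.

import math

def sphere_to_unit_vector(azimuth_deg, elevation_deg):
    # Unit vector for the sphere point (azimuth, elevation); X points toward (0, 0).
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))

def rotate_yaw_pitch_roll(v, yaw_deg, pitch_deg, roll_deg):
    # Rotate vector v by roll (around X), then pitch (around Y), then yaw (around Z).
    x, y, z = v
    a, b, c = map(math.radians, (yaw_deg, pitch_deg, roll_deg))
    # Roll around the X-axis.
    y, z = y * math.cos(c) - z * math.sin(c), y * math.sin(c) + z * math.cos(c)
    # Pitch around the Y-axis.
    x, z = x * math.cos(b) + z * math.sin(b), -x * math.sin(b) + z * math.cos(b)
    # Yaw around the Z-axis.
    x, y = x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a)
    return (x, y, z)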


Table 1 Summary of Omnidirectional Video and Image Representation Formats in OMAF

C. Omnidirectional Projection Formats
Omnidirectional projection is a necessary geometric operation applied at the content production side to generate 2-D pictures from the stitched sphere signal, and an inverse projection operation needs to be used in the rendering process by the OMAF player.

OMAF specifies the support of two types of projection: equirectangular projection (ERP) and cubemap projection (CMP). In addition to ERP and CMP, a number of other projection methods were studied during the OMAF v1 standardization process, but none of them were found to provide sufficient technical benefits over the widely used ERP and CMP formats.

As illustrated in Fig. 4, the ERP process is close to how a 2-D world map is typically generated, but with the left-hand side being the east instead of the west, as the viewing perspective is opposite. In ERP, the user looks from the center of the sphere outward toward the inside surface of the sphere, while, for a world map, the user looks from outside the sphere toward the outside surface of the sphere.

Fig. 4. Illustration of the ERP.

As illustrated in Fig. 5, in the CMP specified in OMAF, the sphere signal is rectilinearly projected into six square faces that are laid out to form a rectangle with a 3:2 ratio of width versus height, with some of the faces rotated to maximize continuity across face edges.

Fig. 5. Illustration of the CMP in OMAF [3].
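A simplified version of the ERP mapping can be written as follows. The sketch places azimuth 0, elevation 0 at the picture center with azimuth +180° at the left edge, and it omits the half-sample offsets and exact rounding defined in the OMAF specification.

def erp_sphere_to_uv(azimuth_deg, elevation_deg):
    # Map sphere coordinates to normalized ERP picture coordinates in [0, 1).
    # Horizontal: azimuth +180 deg at the left edge, -180 deg at the right edge.
    u = (0.5 - azimuth_deg / 360.0) % 1.0
    # Vertical: elevation +90 deg at the top edge, -90 deg at the bottom edge.
    v = 0.5 - elevation_deg / 180.0
    return u, v

def erp_uv_to_sphere(u, v):
    # Inverse mapping from normalized ERP picture coordinates to sphere coordinates.
    return (0.5 - u) * 360.0, (0.5 - v) * 180.0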
D. Regionwise Packing
RWP is an optional step after projection on the content production side. It enables resizing, repositioning, rotation by 90°, 180°, or 270°, and vertical/horizontal mirroring of any rectangular region before encoding.


RWP can be used, e.g., for the following purposes:
1) indicating the exact coverage of content that does not cover the entire sphere;
2) generating viewport-specific (VS) video or extractor tracks with regionwise mixed-resolution packing or overlapping regions;
3) arranging the cube faces of CMP in an adaptive manner;
4) providing guard bands by adding some additional pixels at geometric boundaries when generating the 2-D pictures for encoding, which can be used to avoid or reduce seam artifacts in rendered 360° video due to projection; and
5) compensating the oversampling of pole areas in ERP.

An example of using RWP for compensating the oversampling of pole areas in ERP is presented in Fig. 6. First, an ERP picture is split into three regions: top, middle, and bottom, where the top and bottom regions cover the two poles and have the same height, while the middle region covers the equator. Second, the top and bottom regions are subsampled to keep the same height but half of the width, and then, the subsampled top and bottom regions are placed next to each other on top of the middle region. This way, the equator area retains its original resolution, while the top and bottom regions are subsampled to half of the width, which compensates for the oversampling of the pole areas in ERP.

Fig. 6. Example of using RWP for compensating pole area oversampling of ERP.

The RWP metadata indicate the interrelations between regions in the projected picture (e.g., an ERP picture) and the respective regions in the packed picture (i.e., the picture in the coded video bitstream) through the position and size of the regions in both projected and packed pictures, as well as indications of the applied rotation and mirroring, if any. When RWP has been applied, the decoded pictures are packed pictures characterized by RWP metadata. Players can map the regions of decoded pictures onto projected pictures and, consequently, onto the sphere by processing the RWP metadata.
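For the arrangement of Fig. 6, the RWP-style region metadata and the packed-to-projected mapping can be sketched as follows. The picture dimensions are illustrative, and the region description is a simplification of the actual RWP syntax.

# Illustrative RWP metadata for the Fig. 6 arrangement on a 4096x2048 ERP picture.
# Each entry maps a projected rectangle (x, y, w, h) to a packed rectangle; no
# rotation or mirroring is applied in this example.
REGIONS = [
    {"proj": (0, 0,    4096, 512),  "packed": (0,    0,   2048, 512)},   # top, half width
    {"proj": (0, 1536, 4096, 512),  "packed": (2048, 0,   2048, 512)},   # bottom, half width
    {"proj": (0, 512,  4096, 1024), "packed": (0,    512, 4096, 1024)},  # equator, unchanged
]

def packed_to_projected(x, y):
    # Map a packed-picture sample position back onto the projected ERP picture.
    for r in REGIONS:
        px, py, pw, ph = r["packed"]
        if px <= x < px + pw and py <= y < py + ph:
            qx, qy, qw, qh = r["proj"]
            # Linear resampling between the packed and projected region sizes.
            return qx + (x - px) * qw / pw, qy + (y - py) * qh / ph
    raise ValueError("sample outside all packed regions")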
E. Mesh Omnidirectional Video
OMAF v2 adds the 3-D mesh format as a new omnidirectional content format type. A 3-D mesh is specified as a set of mesh elements, all of which are either parallelograms or regions on a sphere surface. The parallelograms can appear at any location and orientation within the unit sphere and need not be connected. A sphere-surface mesh element is specified through an azimuth range and an elevation range, as illustrated in Fig. 7. Thus, it is possible to specify a 3-D mesh to represent both ERP and CMP as special cases. However, the mesh omnidirectional video provides flexibility for optimizing the projection beyond ERP and CMP.

Fig. 7. Mesh element specified as a region on the sphere surface through an azimuth range and an elevation range.

The given 3-D mesh can be used directly for rendering. In other words, the 3-D mesh format enables direct one-to-one mapping of regions of a 2-D image to elements of a 3-D mesh, which is often referred to as UV mapping in computer graphics terminology. The 3-D mesh format avoids the need for deriving the UV map according to the projection format and the RWP metadata.

F. Fisheye Omnidirectional Video and Images
Fisheye video/images do not use projection or RWP. Rather, for each picture, the circular images captured by fisheye cameras are directly placed onto a 2-D picture, e.g., as shown in Fig. 8.

Fig. 8. Example fisheye omnidirectional video captured by two lenses.

Parameters indicating the placement of the circular images on the 2-D picture and the characteristics of the fisheye video/images are specified in OMAF and can be used for correct rendering. The fisheye format avoids the need for real-time stitching in video recording. OMAF files with fisheye video/images could be suitable for low-cost consumer 360° cameras and smartphones, for example.

G. Supplemental Metadata for Omnidirectional Video and Images
This section provides a summary of supplemental metadata for omnidirectional video or images that may optionally be present in OMAF files or MPDs.

Regionwise Quality Ranking (RWQR): OMAF specifies RWQR metadata as a basic mechanism to enable viewport-dependent content selection. Quality ranking metadata can be provided for sphere regions and for rectangular regions on decoded 2-D pictures. Quality ranking values are given for indicated regions and describe the relative quality order of the regions.


When region A has a nonzero quality ranking value less than that of region B, region A has a higher quality than region B. RWQR metadata remain static for the entire duration of the track. OMAF players can use RWQR metadata for a viewport-dependent selection of tracks for streaming and/or playback.
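The comparison rule and its use for track selection can be captured in a small helper. The track and region representations below are our own simplification rather than the OMAF file-format syntax, and the containment test assumes rectangular 2-D regions.

def higher_quality(rank_a, rank_b):
    # True if quality ranking rank_a indicates higher quality than rank_b.
    # Nonzero values are comparable; a smaller nonzero value means higher quality.
    return rank_a != 0 and rank_b != 0 and rank_a < rank_b

def covers(region, viewport):
    # Axis-aligned containment test on (x, y, w, h) rectangles (simplified).
    rx, ry, rw, rh = region
    vx, vy, vw, vh = viewport
    return rx <= vx and ry <= vy and rx + rw >= vx + vw and ry + rh >= vy + vh

def pick_track_for_viewport(tracks, viewport_region):
    # Pick the track whose region covering the viewport has the best quality ranking.
    # Each track is a dict like {"id": ..., "regions": [(region, quality_ranking), ...]}.
    best = None
    for track in tracks:
        for region, rank in track["regions"]:
            if covers(region, viewport_region):
                if best is None or higher_quality(rank, best[1]):
                    best = (track, rank)
    return best[0] if best else None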
ERP region timed metadata provides a time-varying relative quality rank recommendation, relative priority information, or heatmap signaling for a rectangular grid relative to ERP. OMAF players may use the information for spatially fine-grained streaming rate adaptation choices so that picture quality is first reduced in the regions that are subjectively the least important.

Initial Viewing Orientation: The default viewing orientation to start displaying the omnidirectional video or image is along the X-axis of the global coordinate axes. Content authors can override the default behavior by using an initial viewing orientation timed metadata track and item property for video and images, respectively. If an HMD is used for viewing, players are expected to obey only the indicated initial azimuth. Otherwise (i.e., when a conventional 2-D display is used for viewing), players should use the initial azimuth, elevation, and tilt for rendering. Initial viewing orientation can be indicated to apply also during normal playback. This is helpful to reset the viewing orientation toward the content author's choice after a scene cut.

Recommended Viewport Timed Metadata: OMAF supports a playback mode where a user does not have, or has given up, control of the viewing orientation. Such usage may suit, for example, displaying omnidirectional video on a conventional flat-panel display. Rather than the user controlling the viewing orientation, the displayed viewport is indicated in a recommended viewport timed metadata track. Several recommended viewport tracks can be made available, may be indicated to be based on viewing statistics or manual selections, and may be labeled with a description.

The 2-D spatial relationship track grouping provides another option for viewport-dependent omnidirectional video streaming, in addition to the viewport-dependent video profiles. Each track in an indicated 2-D spatial relationship group corresponds to a planar spatial part of a video source. The signaling indicates the size (width and height) of the original video content and the position and size of each of the split sub-pictures. In addition, the signaling also indicates whether a sub-picture track is intended to be presented alone without any other sub-picture tracks from the same original video content and whether the video bitstream carried in the sub-picture track can be merged with the video bitstreams carried in any other sub-picture tracks split from the same original video content to generate a single video bitstream without decoding mismatch by rewriting only the header data of the bitstreams. Here, a decoding mismatch means that the value of some pixel obtained by decoding the video bitstream in the current track is not identical to the value of the same pixel obtained by decoding the merged video bitstream. Besides file format signaling for sub-pictures, OMAF v2 also specifies DASH signaling for sub-pictures through the sub-picture composition identifier element, which indicates the DASH adaptation sets that contain sub-picture representations carrying sub-picture tracks belonging to the same 2-D spatial relationship track group.

IV. 360° VIDEO STREAMING

A. Introduction
This section reviews approaches for omnidirectional video streaming and describes which building blocks OMAF provides for them. Section V describes further details on the types and features of 360° video streaming that are supported in OMAF video profiles.

360° video streaming can either be carried out in a viewport-independent or viewport-dependent manner. In viewport-independent 360° video streaming, no picture quality emphasis is given to any spatial part of the video, and the prevailing viewing orientation has no impact on which version of the video content is streamed. However, since the spherical sampling density depends on the elevation angle in the ERP format, content authoring for ERP may be adapted to provide a more consistent picture quality in the spherical domain with any approach described in Table 2. Typically, a sequence of projected omnidirectional pictures is encoded in one or more bitrate or resolution versions, each of which is made available for streaming as a single DASH representation. A client selects the version that best suits its display resolution and the prevailing throughput.

Since the viewport covers only a fraction of the omnidirectional video at any time instance, a large portion of the omnidirectional video is not displayed. Thus, network bandwidth is inefficiently utilized in viewport-independent 360° video streaming. A key idea of viewport-dependent 360° video streaming is to dedicate a large share of the available bandwidth to the video covering the viewport. Studies presented in [24]–[26] have shown that viewport-dependent streaming is able to reach a bit rate reduction of several tens of percent compared to viewport-independent streaming.

Table 2 Approaches for Improving the Compression of ERP Video for Viewport-Independent Streaming

Since there is an inherent delay in the streaming system to react to viewport changes, the spherical video not contained within the viewport is typically streamed too, albeit at a lower bitrate and thus also at lower picture quality. Another benefit provided by some viewport-dependent streaming approaches over viewport-independent streaming is that the sample count can be nonuniformly allocated, with a higher sampling density covering the viewport. Thus, the effective resolution on the viewport is greater than what the decoding capacity would otherwise support. An example scheme where the content of the viewport originates from a 6K (6144 × 3072) ERP was presented in [27].

One approach for viewport-dependent streaming is to create multiple VS 360° streams by encoding the same input video content for a predefined set of viewport orientations. Each stream also covers areas other than the targeted viewport, though at lower quality. Moreover, the content may be encoded for several bitrates and/or picture resolutions. The streams are made available for streaming, and metadata describing the viewports that the streams are aimed for are provided. Clients select the 360° stream that is targeted for their current viewport and suits the network throughput. Approaches to achieve VS 360° streams are summarized in Table 3.

In tile-based viewport-dependent 360° streaming, projected pictures are encoded as several tiles. Early approaches, such as [29] and [30], split the video prior to encoding into regions that were encoded independently of each other and decoded with separate decoding instances. However, managing and synchronizing many video decoder instances pose practical problems. Thus, a more practical approach is to encode tiles in a manner that they can be merged into a bitstream that can be decoded with a single decoder instance. In the context of viewport-dependent 360° streaming, the term tile commonly refers to an isolated region [31], which depends only on the collocated isolated region in reference pictures and does not depend on any other picture regions. Several versions of the tiles are encoded at different bitrates and/or resolutions. Coded tile sequences are made available for streaming together with metadata describing the location of the tile on the omnidirectional video. Clients select which tiles are received so that the viewport has higher quality and/or resolution than the tiles outside the viewport. A categorization of tile-based viewport-dependent 360° streaming is presented in Table 4.
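As a simplified illustration of the client-side tile selection, the sketch below determines which tiles of an ERP tile grid overlap a viewport given by its center and field of view. The overlap test compares center distances only; a real player would also add a prefetch margin and account for projection distortion.

def tiles_covering_viewport(cols, rows, center_az_deg, center_el_deg,
                            hfov_deg, vfov_deg):
    # Return the set of (col, row) ERP tile indices overlapping the viewport.
    selected = set()
    for col in range(cols):
        # Azimuth range of this tile column (+180 deg at the left picture edge).
        az_hi = 180.0 - col * 360.0 / cols
        az_lo = az_hi - 360.0 / cols
        for row in range(rows):
            el_hi = 90.0 - row * 180.0 / rows
            el_lo = el_hi - 180.0 / rows
            # Wrap-around-aware azimuth distance between tile and viewport centers.
            d_az = (((az_lo + az_hi) / 2 - center_az_deg) + 180.0) % 360.0 - 180.0
            d_el = (el_lo + el_hi) / 2 - center_el_deg
            if abs(d_az) <= (hfov_deg + 360.0 / cols) / 2 and \
               abs(d_el) <= (vfov_deg + 180.0 / rows) / 2:
                selected.add((col, row))
    return selected

# Example: a 4x2 tile grid with a 90x90 degree viewport at azimuth 0, elevation 0.
# print(sorted(tiles_covering_viewport(4, 2, 0.0, 0.0, 90.0, 90.0)))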
The remaining part of this section discusses tile-based viewport-dependent streaming and is organized as follows. The present OMAF video profiles use either the Advanced Video Coding (AVC) [18] or the High Efficiency Video Coding (HEVC) [19] standard as the basis. Section IV-B describes the use of AVC and HEVC for tile-based viewport-dependent streaming. In a typical arrangement for tile-based viewport-dependent 360° streaming, a player binds the received tiles into a single video bitstream for decoding. Section IV-C presents tile binding approaches applicable to OMAF video profiles. Section IV-D introduces the tile index and tile data segment formats that are specified in OMAF v2 for improving viewport-dependent streaming. Section IV-E discusses a content authoring pipeline for tile-based viewport-dependent streaming.

B. Isolated Regions in AVC and HEVC
Video coding formats provide different high-level structures for realizing isolated regions, which are used as elementary units in tile-based viewport-dependent 360° streaming. This section provides more details on how isolated regions can be realized in AVC and HEVC.
Table 3 Approaches for Achieving VS 360◦ Streams


Table 4 Tile-Based Viewport-Dependent 360◦ Streaming Approaches [15]

In HEVC, a picture is split into tiles along a grid of tile columns and rows. A slice can be either an integer number of complete tiles or a subset of a single tile. Coded slices consist of a slice header and slice data. Among other things, the slice header indicates the position of the slice within the picture. Encoders can choose to use only rectangular slices, keep the tile and slice boundaries unchanged throughout a coded video sequence, and constrain the coding mode and motion vector selection so that a slice references only the collocated slices in the reference picture(s). In a common operation mode, a slice encloses a set of one or more complete tiles, which can be referred to as a motion-constrained tile set (MCTS).

AVC does not enable picture partitioning into tiles. However, slices can be arranged vertically into a single column, and their encoding can be constrained as described above for HEVC.
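The encoder-side constraint can be illustrated by a check that a candidate motion vector keeps the referenced block inside the collocated tile set. Actual encoders enforce this during motion search, and the integer-sample test below ignores the extra margin needed for sub-sample interpolation.

def mv_stays_in_tile_set(block_x, block_y, block_w, block_h, mv_x, mv_y, tile_rect):
    # Check that the block referenced by (mv_x, mv_y) lies inside tile_rect,
    # where tile_rect is (x, y, w, h) of the motion-constrained tile set in luma samples.
    tx, ty, tw, th = tile_rect
    ref_x, ref_y = block_x + mv_x, block_y + mv_y
    return (tx <= ref_x and ty <= ref_y and
            ref_x + block_w <= tx + tw and
            ref_y + block_h <= ty + th)

# A motion-constrained encoder would discard candidate vectors failing this test,
# so that the tile set can later be decoded without pixels from outside it.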
A sub-picture is a picture that represents a spatial subset of the original video content. Consequently, a sub-picture bitstream represents a sub-picture sequence. As an alternative to partitioning pictures into tiles and/or slices, pictures can be split prior to encoding into sub-picture sequences. Each sub-picture sequence is encoded with constraints on the coding modes and motion vectors so that the encoded sub-picture bitstreams can be merged into a single bitstream with multiple tiles.

Each coded tile or sub-picture sequence is typically stored in its own track. There are a few options for the storage of a coded tile or sub-picture sequence as a track, which are summarized in Table 5. A sub-picture track contains a sub-picture bitstream and can be decoded with a regular decoding process of AVC or HEVC. Slice headers of a sub-picture track always indicate the sub-picture to appear in the top-left corner of the picture. A tile track contains only a coded tile sequence with its original slice headers, indicating the tile location where it appeared during the encoding. A bitstream can be reconstructed in the form that it was encoded by combining the content from all its tile tracks. An HEVC tile base track references HEVC tile tracks in their order in the coded picture and, hence, facilitates bitstream reconstruction. However, many viewport-dependent streaming approaches combine tile tracks originating from several bitstreams, which may require rewriting of parameter sets and slice headers.

Table 5 Storage Options for Coded sub-picture and Tile Sequences

C. Tile Binding
OMAF supports both author-driven and late tile binding approaches. In author-driven tile binding, the processing that requires knowledge of the video coding format is performed by content authors, and OMAF players follow instructions created as a part of the content authoring process to merge tiles. In late tile binding, OMAF players rewrite high-level syntax structures of a video bitstream to merge tiles. Both tile binding approaches are described in further detail in the following.

In author-driven tile binding, an extractor track contains instructions to extract data from other tracks and is resolved into a single video bitstream. Extractor tracks are specified in the ISOBMFF encapsulation format of HEVC and AVC bitstreams (ISO/IEC 14496-15). In author-driven tile binding, an extractor track serves as a prescription for OMAF players on how tiles are merged from other tracks. An extractor track also contains rewritten parameter sets and slice headers since they cannot typically be inherited from the referenced tracks.

In free-viewport author-driven tile binding, an extractor track suits any viewing orientation (hence, the qualifier free-viewport) and provides multiple options for how tiles can be merged. For example, an extractor track may contain references to track groups, each containing collocated tiles of different bitrates. An OMAF player can choose tiles covering the viewport so that they have higher bitrate and/or picture quality than the tiles selected for the other parts of the sphere.

Fig. 9. Example of free-viewport author-driven tile binding.

Content authoring for free-viewport author-driven tile binding is illustrated through an example in Fig. 9. ERP content is encoded with 4 × 2 tiles at two qualities. Each encoded tile sequence is stored as a tile track. Each pair of collocated tile tracks may be encapsulated into the same track group. An extractor track is also created, where each tile location may reference the track group of that location, thus indicating that a player should choose which of the two tile tracks is received for that location. The figure illustrates one possible player's selection for the tile tracks to be received and merged into a bitstream with tiles of mixed quality.

In VS author-driven tile binding, each extractor track is tailor-made for a certain range of viewing orientations, described by RWQR metadata. Thus, the content author must prepare several extractor tracks to cover all possible viewing orientations. An OMAF player selects an extractor track based on its RWQR metadata so that the viewport is covered by higher quality than the remaining parts of the sphere.

Fig. 10 presents an example of content authoring for VS author-driven tile binding, where CMP content is encoded at two resolutions, with 2 × 2 tiles per cube face. While not presented in the figure, each encoded tile sequence is stored as a tile track. Moreover, several extractor tracks are created by selecting 12 high-resolution tiles covering a certain range of viewing orientations and the remaining tiles from the low-resolution CMP. Each sample in an extractor track hence contains instructions to copy slice data from selected tile tracks. The figure illustrates one possible selection of the tiles in relation to the CMP format and the spatial arrangement according to which the extractor track organizes the tiles into a coded picture. The RWP metadata of the extractor track describe the mapping between rectangular regions in the decoded pictures and the CMP picture format.

Fig. 10. Example of content authoring for the VS author-driven tile binding.

In late tile binding, an OMAF player selects the tiles to be received and merges them into a single video bitstream. Late tile binding gives freedom to OMAF players, e.g., on selecting the field of view for the viewport but also requires more sophisticated client-side processing compared to author-driven tile binding.

An OMAF base track provides instructions to reconstruct a single video bitstream by merging samples of the referenced tile or sub-picture tracks. An OMAF base track can either be an HEVC tile base track or an extractor track. When late tile binding is targeted, the OMAF base track is typically an HEVC tile base track due to its low byte count overhead. However, it is remarked that, even if extractor tracks were provided by the content author, an OMAF player could choose to ignore them and perform late tile binding.

Several versions of the content at different resolutions and possibly for different bitrates or different random access point periods are encoded. The tile tracks that have the same resolution and are collocated may be encapsulated into the same track group to indicate that they are alternatives out of which players should choose at most one track. The same tile dimensions are typically used across all resolution versions to simplify the merging of tile tracks in any order.

In late tile binding, an OMAF player performs the following operations for bitstream rewriting.
1) The parameter sets in the initialization segment in the main adaptation set can be used as the basis but need to be modified according to the selected tile adaptation sets.
2) The spatial location of a slice in the merged bitstream may differ from its location in the encoded bitstream, and when it does differ, rewriting of the slice header is needed.
3) Removal and insertion of the start code emulation prevention bytes may be needed depending on the rewritten syntax structures of parameter sets and slice headers.

An example of late tile binding is illustrated in Fig. 11. CMP content is encoded at two resolutions (2048 × 2048 and 512 × 512 per cube face) and the same tile size (512 × 512). Each encoded tile sequence is stored as a tile track, out of which an OMAF player can select any set of tile tracks to be received. The coded slices are decapsulated from the received tile tracks, and their slice headers are rewritten so that a conforming video bitstream is obtained.
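At a high level, the player-side part of late tile binding can be outlined as below. The rewriting helpers are passed in as placeholders for the codec-specific operations listed above (parameter-set modification, slice-header rewriting, and start-code emulation handling); this is a conceptual outline, not an HEVC implementation.

def bind_late(base_track_sample, selected_tile_samples,
              rewrite_parameter_sets, rewrite_slice_header):
    # Conceptual sketch of late tile binding; real implementations operate on HEVC NAL units.
    bitstream = []
    # 1) Parameter sets from the main adaptation set, modified for the current tile selection.
    bitstream.extend(rewrite_parameter_sets(base_track_sample, selected_tile_samples))
    # 2) Slice data from each selected tile, with headers rewritten whenever the slice
    #    position in the merged picture differs from the position used during encoding.
    for tile in selected_tile_samples:
        for slice_nal in tile["slices"]:
            bitstream.append(rewrite_slice_header(slice_nal, tile["target_position"]))
    # 3) The result is a single conforming bitstream for one decoder instance.
    return b"".join(bitstream)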


Fig. 11. Example of late tile binding.

In this example, the OMAF player selects all low-resolution tile tracks as a fallback to cope with sudden viewing orientation changes and 27 high-resolution tile tracks covering the viewport.

D. Tile Index and Tile Data Segment Formats
In tile-based viewport-dependent 360° streaming, the number of representations can be relatively high, even up to hundreds of representations, since the content may be partitioned into several tens of tiles and may be coded at several resolutions and bitrates. Moreover, the duration of (sub)segments may be inconveniently long to update the viewport quickly with high-quality tiles after a viewing orientation change. Thus, requests having a finer granularity than (sub)segments could be desirable. To enable fine-grained requests, even down to a single picture interval, and to obtain the indexing data conveniently for all tiles, OMAF v2 includes new segment formats, namely, an initialization segment for an OMAF base track, a tile index segment, and a tile data segment.

The initialization segment for an OMAF base track contains the track header for the OMAF base track and all the referenced tile or sub-picture tracks. This allows the client to download only the initialization segment for the OMAF base track without the need to download the initialization segments of the referenced tile or sub-picture tracks.

The tile index segment is logically an index segment as specified in the DASH standard. It is required to include MovieFragmentBoxes for the OMAF base track and all the referenced tile or sub-picture tracks. The MovieFragmentBoxes indicate the byte ranges on a sample basis. Consequently, a client can choose to request content in smaller units than (sub)segments.

The tile data segments are media segments containing only media data enclosed in IdentifiedMediaDataBoxes ("imda"). The byte offsets contained in MovieFragmentBoxes ("moof") are relative to the start of IdentifiedMediaDataBoxes. Thus, MovieFragmentBoxes and media data can reside in separate resources, unlike in conventional DASH segment formats where the byte offsets to the media data are relative to the MovieFragmentBox. The box payload of each IdentifiedMediaDataBox starts with a sequence number that is also contained in the corresponding MovieFragmentBox, thus enabling a MovieFragmentBox to be paired with the corresponding IdentifiedMediaDataBox.
ture interval, and to obtain the indexing data conveniently RWMR and RWMQ approaches. An advantage of RWMR
for all tiles, OMAF v2 includes new segment formats, compared to the viewport + 360◦ technique is that no
namely, initialization segment for an OMAF base track, decoding capacity is spent for decoding low-resolution
a tile index segment, and a tile data segment. video that is superimposed by the high-resolution tiles.
The initialization segment for an OMAF base track con- Some devices may have problems downloading tens of
tains the track header for the OMAF base track and all the HTTP streams in parallel, each requiring bandwidth of
referenced tile or sub-picture tracks. This allows the client up to several Mb/s. It is, therefore, advisable to keep the
to download only the initialization segment for the OMAF number of required tile or sub-picture representations for
base track without the need to download the initialization the author-driven tile binding at the lower end of the range
segments of the referenced tile or sub-picture tracks. allowed by the codec at least in some extractor or tile base
The tile index segment is logically an index seg- tracks.
ment as specified in the DASH standard. It is required In the following, we concentrate on the tile-based oper-
to include MovieFragmentBoxes for the OMAF base ation of HEVC, while an AVC-based pipeline could be
track and all the referenced tile or sub-picture tracks. implemented similarly. The content authoring workflow
MovieFragmentBoxes indicate the byte ranges on a sam- for tile-based viewport-dependent operation is depicted
ple basis. Consequently, a client can choose to request in Fig. 12, and the steps of the workflow are described in
content on smaller units than (sub)segments. the next paragraphs. For practical implementation exam-
The tile data segments are media segments ples, the Nokia OMAF reference implementation [4] covers
containing only media data enclosed in steps 2–6 described below, and HEVC encoding with tiles is
IdentifiedMediaDataBoxes (“imda”). The byte supported for example in the HM reference software [36]
offsets contained in MovieFragmentBoxes (“moof”) are and in the Kvazaar open-source software [37].
relative to the start of IdentifiedMediaDataBoxes. 1) Encoding: The video content is encoded using tiles or
Thus, MovieFragmentBoxes and media data can reside the content is split into sub-picture sequences before
in separate resources, unlike in conventional DASH encoding and then encoded in a constrained manner
segment formats where the byte offsets to the media so that merging of the coded sub-picture sequences
data are relative to the MovieFragmentBox. The box into the same bitstream is possible. Usually, multiple
payload of each IdentifiedMediaDataBox starts versions of the content are generated at different
with a sequence number that is also contained in the bitrates. A relatively short random access interval,


2) Bitstream Processing: A processing step may be needed to prepare the encoded bitstreams for encapsulation into sub-picture or tile tracks. When the content was encoded using tiles, each tile sequence is extracted from the bitstream. This requires parsing of the high-level structure of the bitstream, including parameter sets and slice headers. When sub-picture bitstreams were encoded, no additional processing at this phase is needed.

3) Sub-Picture or Tile Track Generation: OMAF video profiles constrain which sample entry types are allowed for the sub-picture or tile tracks. Slice headers require rewriting in all cases where the slice position in the encoded bitstream does not match the position implied by the sample entry type. As an integral part of generating both the sub-picture or tile tracks and the extractor or tile base track(s), the necessary OMAF file format metadata is also authored.

4) Extractor or Tile Base Track Generation: If the "hvt1" or "hvt3" sample entry type is in use, a tile base track is generated. Otherwise, one or more extractor tracks are created. A single extractor track is typically sufficient for free-viewport author-driven tile binding, whereas one extractor track per distinct viewing direction may be needed for VS author-driven tile binding.

5) (Sub)segment Encapsulation: (Sub)segments are created from each track for DASH delivery. When conventional segment formats specified in the DASH standard are in use, no changes to the (sub)segment encapsulation process are needed compared to the corresponding process for 2-D video.

6) DASH MPD Generation: An MPD is generated. Each extractor track and tile base track forms a representation in its own adaptation set. An adaptation set consists of the sub-picture or tile representations covering the same sphere region at the same resolution but at different bitrates. The DASH preselection feature is used to associate the extractor or tile base adaptation set with the associated sub-picture or tile adaptation sets. Moreover, in this processing step, the OMAF file metadata is interpreted to create the OMAF extensions for the DASH MPD.

V. OMAF VIDEO PROFILES
A summary of the video profiles specified in OMAF is presented in Table 6. This section first introduces the video profiles and then discusses the similarities and differences between the profiles.

Table 6 OMAF Video Profiles

Fig. 12. Basic flow of content authoring operations for tile-based viewport-dependent streaming.


The HEVC-based viewport-independent profile is contain exactly one HEVC tile. Compared to the advanced
intended for basic viewport-independent files and tiling profile, the HEVC-based viewport-dependent and
streaming using the ERP. In OMAF v1, the decoding the simple tiling profiles provide more freedom since
capacity of the HEVC-based viewport-independent they enable using rectangular slices that comprise one or
profile was limited to approximately 4K (4096 × 2048) more tiles or a subset of a tile as the unit for tile-based
resolution at 60-Hz picture rate, while the unconstrained streaming.
HEVC-based viewport-independent was specified similarly
in OMAF v2 but without decoding capacity constraints to VI. O M A F I M A G E P R O F I L E S
respond to the need of higher HMD resolutions and the The image profiles of OMAF were designed to be seam-
availability of more powerful video decoding hardware. lessly compatible with HEIF. Consequently, devices and
The HEVC- and AVC-based viewport-dependent profiles platforms with HEIF capability are easily extensible to
support both VS streaming and different types of tile- support 360◦ images with metadata specified in OMAF.
based viewport-dependent streaming schemes. Two tiling Since OMAF is a toolbox standard, it is envisioned that
profiles, namely the simple and advanced tiling profiles, devices could only implement specific parts of OMAF.
were added for viewport-dependent streaming in OMAF For example, 360◦ cameras could only support an OMAF
v2. The main difference of the simple tiling profile image profile or the HEIF image metadata specified in the
compared to the HEVC-based viewport-dependent profile OMAF standard.
is the use of the tile index and tile data segment formats. At the time of releasing OMAF v1, there was arguably
The advanced tiling profile is the only profile that uses no other standard for storage of 360◦ images with the nec-
the 3-D mesh projection format and requires players to essary metadata for displaying them properly. Since then,
support late tile binding, while, otherwise, it is similar to the JPEG 360 standard [33] was finalized and includes
the simple tiling profile. omnidirectional metadata specifications for JPEG [34] and
Bit Depth: Since the HEVC-based profiles require support JPEG 2000 [35] images. Since OMAF specifies the omnidi-
for the HEVC Main 10 Profile, they support bit depths rectional image metadata for HEIF files, there is no overlap
up to 10 bits, whereas the AVC-based viewport-dependent with JPEG 360 even though the types of metadata in OMAF
profile is limited to 8 bits per color component. and JPEG 360 are similar.
Decoding Capacity: The HEVC-based profiles specified in OMAF v2 integrates images more tightly to 360◦ pre-
OMAF v1 require support for Level 5.1, which, in practice, sentations that can contain timed media types too. Images
means decoding capacity of approximately 4K pictures at can be used as overlays enriching an omnidirectional
60 Hz, whereas the AVC-based profile can support only 4K video background. An opposite arrangement is equally
pictures at 30 Hz. The profiles specified in OMAF v2 are supported, i.e., an omnidirectional background image can
tailorable in terms of decoding capacity, and thus, no HEVC be accompanied by video overlays. Moreover, presenta-
level constraints are specified for them. tions with multiple viewpoints can equally use images or
Projection Formats and RWP: In the HEVC-based viewport-independent profile, RWP can only be used to indicate a limited content coverage. In the HEVC- and AVC-based viewport-dependent profiles, RWP is not constrained. In the simple tiling profile, RWP is otherwise unconstrained, but a single region is not allowed to cross a boundary of a projection surface, such as a cube face boundary. Moreover, the RWP format of an OMAF base track is not indicated but inherited by OMAF players from the selected tile or sub-picture tracks. Consequently, OMAF base tracks can enable free-viewport author-driven tile binding. In the advanced tiling profile, the 3-D mesh format is used, and RWP is disabled.

Viewport-Dependent Streaming: The HEVC- and AVC-based viewport-dependent profiles enable both VS streams and tile-based viewport-dependent streaming, while the simple and advanced tiling profiles only enable the latter. While both the HEVC- and AVC-based viewport-dependent profiles support all categories, the AVC-based profile is more constrained since AVC does not support tile partitioning, arranging slices vertically imposes restrictions on slice sizes, and AVC has limits on picture aspect ratio. The advanced tiling profile requires using HEVC tiles of identical width and height and a tile track to carry them, whereas the other tile-based profiles enable using rectangular slices that comprise one or more tiles or a subset of a tile as the unit for tile-based streaming.

VI. OMAF IMAGE PROFILES

The image profiles of OMAF were designed to be seamlessly compatible with HEIF. Consequently, devices and platforms with HEIF capability are easily extensible to support 360° images with metadata specified in OMAF. Since OMAF is a toolbox standard, it is envisioned that devices could implement only specific parts of OMAF. For example, 360° cameras could support only an OMAF image profile or the HEIF image metadata specified in the OMAF standard.

At the time of releasing OMAF v1, there was arguably no other standard for storage of 360° images with the necessary metadata for displaying them properly. Since then, the JPEG 360 standard [33] was finalized and includes omnidirectional metadata specifications for JPEG [34] and JPEG 2000 [35] images. Since OMAF specifies the omnidirectional image metadata for HEIF files, there is no overlap with JPEG 360 even though the types of metadata in OMAF and JPEG 360 are similar.

OMAF v2 integrates images more tightly into 360° presentations that can also contain timed media types. Images can be used as overlays enriching an omnidirectional video background. The opposite arrangement is equally supported, i.e., an omnidirectional background image can be accompanied by video overlays. Moreover, presentations with multiple viewpoints can equally use images or video clips as the visual content of the viewpoints.

OMAF v1 specifies two profiles for projected omnidirectional images: the OMAF HEVC image profile, which uses the HEVC Main 10 profile, and the OMAF legacy image profile, which uses the JPEG codec, as summarized in Table 7. Both OMAF image profiles are compatible with HEIF, and they share common features, as listed in Table 8. Coded image items of the OMAF HEVC image profile are limited to approximately the 4K resolution, but larger image sizes can be achieved by using the "grid" derived image item, which arranges input images onto a grid to create a large output image. The image resolution constraint ensures that most hardware implementations can be used for HEVC image decoding.

Table 7 OMAF Image Profiles

Table 8 Features of OMAF Image Profiles
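The grid arrangement described above can be sketched as follows. The Python below is illustrative only: it does not use any particular HEIF library, and the default tile limits are example values standing in for the roughly 4K coded-item limit of the OMAF HEVC image profile; it simply computes how a large output image could be split into equally sized input images, mirroring how a "grid" derived item reassembles them.

```python
import math

def plan_grid(out_width: int, out_height: int,
              max_tile_width: int = 4096, max_tile_height: int = 2048):
    """Hypothetical planning helper: split a large 360-degree image into a
    grid of equally sized input images so that each stays within a
    coded-item size limit, as a HEIF 'grid' derived item would recombine them."""
    columns = math.ceil(out_width / max_tile_width)
    rows = math.ceil(out_height / max_tile_height)
    tile_width = math.ceil(out_width / columns)
    tile_height = math.ceil(out_height / rows)
    return {"rows": rows, "columns": columns,
            "tile_size": (tile_width, tile_height),
            "output_size": (out_width, out_height)}

if __name__ == "__main__":
    # An 8192 x 4096 ERP image could be carried as a 2 x 2 grid of 4K inputs.
    print(plan_grid(8192, 4096))
```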

VII. OMAF TOOLSET BRANDS

A. Introduction

OMAF v2 specifies viewpoint, nonlinear storyline, and overlay toolset brands, which are summarized in Table 9. Compatibility with a toolset brand can be indicated at the file level using the 4CC of the brand. This section reviews the OMAF features for multiple viewpoints and overlays, as well as the toolset brands.

Table 9 OMAF Toolset Brands
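File-level brand signaling can be illustrated with a small sketch that reads the compatible-brands list of an ISOBMFF "ftyp" box. The sketch assumes the "ftyp" box is the first top-level box with a 32-bit size field, and the brand code used in the usage comment is a placeholder, not an actual OMAF toolset brand value.

```python
import struct

def ftyp_brands(path: str):
    """Return (major_brand, compatible_brands) from the 'ftyp' box of an ISOBMFF file.

    Minimal sketch: assumes 'ftyp' is the first top-level box and that the
    32-bit box size is used (no 64-bit 'largesize').
    """
    with open(path, "rb") as f:
        size, box_type = struct.unpack(">I4s", f.read(8))
        if box_type != b"ftyp":
            raise ValueError("expected 'ftyp' as the first box")
        payload = f.read(size - 8)
    major = payload[0:4].decode("ascii")
    compatible = [payload[i:i + 4].decode("ascii") for i in range(8, len(payload), 4)]
    return major, compatible

def supports_toolset(path: str, brand_4cc: str) -> bool:
    """True if the given 4CC appears among the file's declared brands."""
    major, compatible = ftyp_brands(path)
    return brand_4cc == major or brand_4cc in compatible

# Example (hypothetical brand code, for illustration only):
# supports_toolset("presentation.mp4", "ovlt")
```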
B. Multiple Viewpoints

OMAF v2 supports 360° video content comprising pieces captured by multiple 360° video cameras or camera rigs, referred to as viewpoints. This way, users can switch between different viewpoints, e.g., in a basketball game, switch between scenes captured by 360° video cameras located at different ends of the court.

Switching between viewpoints captured by 360° video cameras that can "see" each other can be seamless in the sense that, after switching, the user still sees the same object, e.g., the same player in a sports game, just from a different viewing angle. However, when there is an obstacle, e.g., a wall, between two 360° video cameras such that they cannot "see" each other, switching between the two viewpoints incurs a noticeable cut or transition.

When multiple viewpoints exist, identification and association of tracks or image items belonging to one viewpoint are needed. For this purpose, OMAF specifies the viewpoint grouping of tracks and image items, as well as similar metadata for DASH MPD. This grouping mechanism provides an identifier (ID) of the viewpoint and a set of other information that can be used to assist streaming of the content and switching between different viewpoints. Such information includes the following (see also the sketch at the end of this subsection).

1) A label, for annotation of the viewpoint, e.g., "home court."
2) Mapping of the viewpoint to a viewpoint group consisting of cameras that "see" each other and have an indicated viewpoint group ID. This information provides a means to indicate whether the switching between two particular viewpoints can be seamless.
3) Viewpoint position relative to the common reference coordinate system shared by all viewpoints of a viewpoint group. Viewpoint positions enable a good user experience during viewpoint switching, provided that the client can properly utilize the positions in its rendering process.
4) Rotation information for conversion from the global coordinate system of the viewpoint to the common reference coordinate system.
5) Optionally, the orientation of the common reference coordinate system relative to the geomagnetic north.
6) Optionally, the global positioning system (GPS) location of the viewpoint, which enables the client application to place the viewpoint into a real-world map.
7) Optionally, viewpoint switching information, which provides a number of switching transitions possible from the current viewpoint and, for each of these, information such as the sphere region that a user can select to cause the viewpoint switch, the destination viewpoint, the viewport to view after switching, the presentation time to start the playback of the destination viewpoint, and a recommended transition effect during switching (such as zoom-in, walk-through, fade-to-black, or mirroring).
8) Optionally, viewpoint looping information indicating which time period of the presentation is looped and a maximum count of how many times the time period is looped. The looping feature can be used for requesting the end-user's input for initiating viewpoint switching.

Some of the viewpoints can be static, i.e., captured by 360° video cameras at fixed positions. Other viewpoints can be dynamic, e.g., captured by a 360° video camera mounted on a flying drone. For dynamic viewpoints, the above information is stored in timed metadata tracks that are time-synchronized with the media tracks.
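The following is a hedged sketch of how a player might model the viewpoint information enumerated above as data structures. The class and field names are illustrative and do not follow the exact syntax of the OMAF metadata structures.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ViewpointSwitchOption:
    """One possible switching transition from the current viewpoint (illustrative)."""
    activation_sphere_region: Tuple[float, float, float, float]  # azimuth, elevation, ranges
    destination_viewpoint_id: int
    destination_viewport: Optional[Tuple[float, float]] = None   # viewing orientation after switch
    start_time: Optional[float] = None                           # presentation time, seconds
    transition_effect: Optional[str] = None                      # e.g., "zoom-in", "fade-to-black"

@dataclass
class Viewpoint:
    viewpoint_id: int
    label: str                                   # e.g., "home court"
    group_id: int                                # cameras that can "see" each other
    position: Tuple[float, float, float]         # in the common reference coordinate system
    rotation: Tuple[float, float, float]         # yaw, pitch, roll to the common system
    north_orientation: Optional[float] = None    # common system vs. geomagnetic north
    gps_location: Optional[Tuple[float, float, float]] = None  # latitude, longitude, altitude
    switch_options: List[ViewpointSwitchOption] = field(default_factory=list)
    loop_range: Optional[Tuple[float, float]] = None  # looped time period, seconds
    max_loops: Optional[int] = None

def seamless_switch_possible(a: Viewpoint, b: Viewpoint) -> bool:
    """Viewpoints in the same group can 'see' each other, so switching can be seamless."""
    return a.group_id == b.group_id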
C. Nonlinear Storyline

The viewpoint switching and looping information enable content authors to generate presentations with a nonlinear storyline. Each viewpoint is a scene in the storyline. The viewpoint switching metadata can be used to provide multiple switching options from which an end-user is required to choose before advancing to the next scene of the storyline. The user selection may be linked to a given sphere region, viewport region, or overlay, but other user input means are not precluded either. The viewpoint looping metadata may be used to create a loop in the playback of the current scene to wait for the user's selection. The viewpoint looping metadata also allow defining a default destination viewpoint that is applied when an indicated maximum number of loops has been passed.

Fig. 13 presents an example where Scene 1 is played until the end of its timeline, and then, a given time range of Scene 1 is repeated until an end-user selects between Scenes 2a and 2b. After completing the playback of Scene 2a or 2b, the playback automatically switches to Scene 3, after which the presentation ends.

Fig. 13. Nonlinear storyline example.
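A minimal sketch of the playback logic behind the Fig. 13 example is given below. The scene dictionary keys, the callbacks, and their parameters are made up for illustration; they only mimic the looping-until-selection and default-destination behavior described above and are not an OMAF player API.

```python
def play_scene(scene, get_user_choice, play_range):
    """Illustrative scene playback with viewpoint looping (not an OMAF API).

    scene: dict with 'timeline' (start, end), optional 'loop_range',
           'max_loops', and 'default_next' (default destination scene id).
    get_user_choice: callable returning a chosen scene id or None.
    play_range: callable that "plays" a (start, end) time range.
    """
    play_range(scene["timeline"])            # play the scene once to the end of its timeline
    loops = 0
    choice = get_user_choice()
    while choice is None and "loop_range" in scene:
        if scene.get("max_loops") is not None and loops >= scene["max_loops"]:
            return scene.get("default_next")  # maximum loop count reached: default destination
        play_range(scene["loop_range"])       # repeat the indicated time range of the scene
        loops += 1
        choice = get_user_choice()
    return choice if choice is not None else scene.get("default_next")
```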
D. Overlays

An overlay is a video clip, an image, or text that is superimposed on top of an omnidirectional video or image.
Overlays can be used for multiple purposes, including the following:
1) annotations of the content; for instance, stock tickers and player statistics of sports games;
2) recommended viewport for the content, for example, giving the end-user the possibility to follow the director's intent while having the freedom to peek freely around;
3) 2-D video or image close-ups of the omnidirectional video or image on the background;
4) hotspots for switching viewpoints interactively;
5) displaying a logo of the content provider;
6) displaying a semitransparent watermark on top of the content;
7) advertisements.

The appearance of overlays can be controlled flexibly in OMAF. Moreover, the overlay structures are extensible, and new controls or properties can be specified in future versions or amendments of the OMAF standard. Some basic concepts related to overlays are illustrated in Fig. 14, which shows an equator-level cross section of the unit sphere and different types of overlays. Background visual media is defined as the omnidirectional video or image that is rendered on the unit sphere, and the term overlay source refers to the visual content displayed as an overlay.

Fig. 14. 2-D illustration of overlays and background visual media.

The following types of overlays are specified in OMAF.
1) Sphere-relative 2-D overlays, where an overlay source is displayed on a plane of a given width and height. The center point of the plane is located at given spherical coordinates and distance from the center of the unit sphere, and the plane can be rotated by given yaw, pitch, and roll angles.
2) Sphere-relative omnidirectional overlays, where an omnidirectional projection, such as ERP, has been used for an overlay source. Sphere-relative omnidirectional overlays may, but need not, cover the entire sphere and are located at a given spherical location and distance from the center of the unit sphere.
3) Sphere-relative 3-D mesh overlays, where both a 3-D mesh and a mapping of an overlay source onto the 3-D mesh are specified. The 3-D mesh can consist of parallelograms having any rotation and being located at any position within the unit sphere.
4) Viewport-relative overlays, which are located at a given position within the viewport regardless of the viewing orientation. The rendering process projects the sphere-relative overlays and the background visual media onto the viewport, which is then superimposed by the viewport-relative overlays. This is illustrated in Fig. 14 through an isosceles triangle whose sides illustrate the horizontal field of view of a display and whose base corresponds to a viewport. Since viewports can be of different shapes and sizes in different player devices, the top-left corner position, width, and height of a viewport-relative overlay are provided in percents relative to the viewport dimensions; a conversion sketch follows this list.
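The placement parameters above reduce to simple conversions at rendering time. The Python below is a sketch under assumed conventions (azimuth and elevation in degrees, a right-handed system with x forward, y left, and z up, and a viewport given in pixels); it is not an excerpt of the OMAF equations.

```python
import math

def sphere_point(azimuth_deg: float, elevation_deg: float, distance: float = 1.0):
    """Cartesian center point of a sphere-relative overlay placed at the given
    spherical coordinates and distance from the unit-sphere center
    (assumed convention: x forward, y left, z up)."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (distance * math.cos(el) * math.cos(az),
            distance * math.cos(el) * math.sin(az),
            distance * math.sin(el))

def viewport_relative_rect(left_pct, top_pct, width_pct, height_pct,
                           viewport_w, viewport_h):
    """Pixel rectangle of a viewport-relative overlay whose top-left position
    and size are given in percent of the viewport dimensions."""
    return (round(left_pct / 100 * viewport_w),
            round(top_pct / 100 * viewport_h),
            round(width_pct / 100 * viewport_w),
            round(height_pct / 100 * viewport_h))

# e.g., a 25% x 10% overlay near the top-left of a 1440 x 1600 viewport:
# viewport_relative_rect(5, 5, 25, 10, 1440, 1600) -> (72, 80, 360, 160)
```

Expressing the position and size in percent is what lets the same overlay metadata serve viewports of different shapes and sizes, as noted above.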
OMAF enables two rendering modes for presenting sphere-relative overlays with background visual media. In conventional 3DOF rendering, a viewing position that is in the center of the unit sphere is used for projecting the sphere-relative overlays and the background visual media onto the viewport. In the second rendering mode, the viewing position is tracked and used for projecting the content onto the viewport. When the second rendering mode is used with an HMD, it may be referred to as head-tracked rendering. The second rendering mode enables viewing overlays from different perspectives and peeking on the background appearing behind the overlays. Sphere-relative overlays can be placed at given distances from the center of the unit sphere, which is perceivable through motion parallax. Content authors can define a viewing space that specifies valid viewing positions around the center of the unit sphere. OMAF enables specifying the viewing space boundaries as a cuboid, a sphere, a cylinder, or an ellipsoid.
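A minimal sketch, assuming the viewing-space shape is centered at the unit-sphere center and axis-aligned, of how a player might test whether a tracked viewing position is still inside the authored viewing space; the parameterization is simplified compared with the OMAF syntax.

```python
def inside_viewing_space(pos, shape, dims):
    """Illustrative containment test for a viewing position (x, y, z).

    shape: 'cuboid', 'sphere', 'cylinder', or 'ellipsoid'
    dims:  half-extents (rx, ry, rz) for cuboid/ellipsoid,
           (radius,) for sphere, (radius, half_height) for cylinder.
    Assumes the viewing space is centered at the unit-sphere center and axis-aligned.
    """
    x, y, z = pos
    if shape == "cuboid":
        rx, ry, rz = dims
        return abs(x) <= rx and abs(y) <= ry and abs(z) <= rz
    if shape == "sphere":
        (r,) = dims
        return x * x + y * y + z * z <= r * r
    if shape == "cylinder":
        r, half_height = dims
        return x * x + y * y <= r * r and abs(z) <= half_height
    if shape == "ellipsoid":
        rx, ry, rz = dims
        return (x / rx) ** 2 + (y / ry) ** 2 + (z / rz) ** 2 <= 1.0
    raise ValueError(f"unknown viewing space shape: {shape}")
```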
As discussed above, sphere-relative overlays are located at a given distance from the center of the unit sphere. A layering order can be given so that the player behavior is deterministic when several overlays are positioned at the same distance or when viewport-relative overlays overlap.

By default, overlays are opaque. However, either a constant opacity or an alpha plane that specifies a pixelwise opacity can be optionally provided.
The content author can specify, separately for each overlay, which types of user interactions are enabled. The following user interaction types can be enabled or disabled in an OMAF file: changing the position, modifying the distance from the center of the unit sphere, switching the overlay ON or OFF, tuning the opacity, resizing, rotating, cropping, and switching the overlay source to another one. A textual label can be given for each overlay and utilized by a user interface to enable end-users to switch overlays ON or OFF. Another way is to provide an associated sphere region that the user can select to turn an overlay ON or OFF.

As discussed above, an overlay source can be either a video track or an image item; in that case, the overlay consists of the entire decoded picture. Since some player devices might not be capable of running several video decoder instances simultaneously, it is also possible to pack overlays spatially with the background visual media. In that case, an overlay source is specified as a rectangle within the decoded picture area. Furthermore, it is possible to indicate that an overlay source is defined by the recommended viewport timed metadata track or provided by external means, such as through a URL. The externally specified overlay source could be used to show content from a separate application within an OMAF presentation.
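The spatial packing arrangement can be sketched as cutting the overlay source rectangle out of the single decoded picture that carries both the background and the overlays; the array layout and the example region below are assumptions for illustration only.

```python
def extract_overlay_source(decoded_picture, rect):
    """Cut an overlay source out of a decoded picture that spatially packs
    overlays together with the background visual media.

    decoded_picture: a NumPy array (height x width, optionally x channels)
                     or anything supporting 2-D slicing with a tuple index.
    rect: (left, top, width, height) of the overlay source in luma samples.
    """
    left, top, width, height = rect
    return decoded_picture[top:top + height, left:left + width]

# Example with NumPy (assumed available; values are made up):
# import numpy as np
# frame = np.zeros((2048, 4096, 3), dtype=np.uint8)
# logo = extract_overlay_source(frame, (3840, 0, 256, 128))  # 256 x 128 region packed top-right
```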
The content author has two mechanisms to enable scaling the player-side complexity of overlay rendering. First, each overlay can be given a priority for rendering. The highest priority value means that the overlay must be rendered. Second, it is indicated whether a control or property associated with an overlay is essential or optional. For example, it can be indicated that overlay composition with an alpha plane is optional. In this case, if the player does not have enough resources to carry out the processing required for alpha planes, it is allowed to render an opaque overlay.
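The optional alpha-plane composition mentioned above amounts to standard alpha blending, with opaque rendering as the permitted fallback. The sketch below uses single-channel nested lists and an explicit per-pixel loop for clarity; a real renderer would operate on full pictures and vectorize or run this on the GPU.

```python
def composite(background, overlay, alpha=None, alpha_supported=True):
    """Blend an overlay onto a background region of the same size.

    background, overlay: nested lists of single-channel pixel values in [0, 255].
    alpha: optional per-pixel opacity in [0.0, 1.0]; None means fully opaque.
    alpha_supported: if False, the player falls back to opaque rendering,
                     which is allowed when alpha processing is marked optional.
    """
    if alpha is None or not alpha_supported:
        return [row[:] for row in overlay]       # opaque overlay covers the background region
    return [[round(a * o + (1.0 - a) * b)
             for b, o, a in zip(bg_row, ov_row, a_row)]
            for bg_row, ov_row, a_row in zip(background, overlay, alpha)]
```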
The controls and properties for overlays can be static, i.e., remain constant for the entire duration of the overlay, or dynamic, i.e., signaled by a timed metadata track where the controls and properties are dynamically adjusted. For example, it is possible to move or resize an overlay as a function of time.
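Dynamic controls carried in a timed metadata track can be thought of as time-stamped samples that the player looks up at the current presentation time. The sketch below, with made-up sample contents, applies the most recent sample at or before the playback time, assuming a sample stays in effect until the next one.

```python
import bisect

def active_control(samples, t):
    """Return the overlay control sample in effect at presentation time t.

    samples: list of (time, value) pairs sorted by time; value could be an
             overlay position, size, opacity, etc. (illustrative contents).
    A sample is assumed to remain in effect until the next sample time.
    """
    times = [s[0] for s in samples]
    i = bisect.bisect_right(times, t) - 1
    return samples[i][1] if i >= 0 else None

# Example: move an overlay toward larger azimuth over time (made-up values).
# path = [(0.0, {"azimuth": -30}), (5.0, {"azimuth": 0}), (10.0, {"azimuth": 30})]
# active_control(path, 7.2)  -> {"azimuth": 0}
```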

VIII. CONCLUSION

An overview of OMAF, the arguably first VR system standard, was provided. The overview focused on omnidirectional video and images, without much detail on audio and timed text. This article described the OMAF architecture, the representation formats for omnidirectional video and images, and the file format and DASH extensions. Furthermore, 360° video streaming techniques and the related features in OMAF were discussed in detail. In addition, the OMAF video and image profiles, as well as the toolsets for overlays, viewpoints, and nonlinear storylines, were summarized.

The OMAF standard supports many different approaches for viewport-dependent streaming. It is an open research question which approach provides the best end-user experience. Furthermore, there are many detailed research topics that would benefit from a more thorough investigation, for example, determination of optimal projection format or 3-D mesh, tiling strategy, and bitrate adaptation logic for tile-based streaming.

Requirements for the next OMAF version have been agreed in MPEG [38] and include support for new visual volumetric media types, namely, video-based point cloud compression (V-PCC) and immersive video. The MPEG standard for visual volumetric video-based coding and V-PCC [39] was recently finalized and can be used to represent captured volumetric objects. The MPEG Immersive Video standard [40] has a target completion by July 2021 and enables 6DOF within a limited viewing volume. It is expected that the OMAF standardization for integrating these media types will start in 2021.

Acknowledgment

The authors would like to greatly thank the numerous Moving Picture Experts Group (MPEG) delegates who have contributed to the development of Omnidirectional MediA Format (OMAF). They also express gratitude to the coeditors with whom the authors had a pleasure to work either in v1 or v2 of OMAF. They are also grateful to the anonymous reviewers and Lukasz Kondrad for their excellent suggestions to improve this article.

REFERENCES
[1] R. S. Kalawsky, The Science of Virtual Reality and Virtual Environments: A Technical, Scientific and Engineering Reference on Virtual Environments. Reading, MA, USA: Addison-Wesley, 1993.
[2] F. Biocca and M. R. Levy, Eds., Communication in the Age of Virtual Reality. Newark, NJ, USA: Lawrence Erlbaum Associates, 1995.
[3] Information Technology—Coded Representation of Immersive Media—Part 2: Omnidirectional Media Format, Standard ISO/IEC 23090-2:2019, 2019.
[4] Nokia OMAF Implementation. Accessed: Mar. 9, 2021. [Online]. Available: https://github.com/nokiatech/omaf
[5] D. Podborski et al., "HTML5 MSE playback of MPEG 360 VR tiled streaming: JavaScript implementation of MPEG-OMAF viewport-dependent video profile with HEVC tiles," in Proc. 10th ACM Multimedia Syst. Conf., Jun. 2019, pp. 324–327. [Online]. Available: https://www.youtube.com/watch?v=FpQiF8YEfY4 and https://github.com/fraunhoferhhi/omaf.js
[6] Open Visual Cloud Immersive Video Samples. Accessed: Mar. 9, 2021. [Online]. Available: https://github.com/OpenVisualCloud/Immersive-Video-Sample
[7] S. Deshpande, Y.-K. Wang, and M. M. Hannuksela, Eds., Text of ISO/IEC FDIS 23090-2 2nd edition OMAF, document ISO/IEC JTC1 SC29 WG3, N00072, Dec. 2020.
[8] K. K. Sreedhar, I. D. D. Curcio, A. Hourunranta, and M. Lepistö, "Immersive media experience with MPEG OMAF multi-viewpoints and overlays," in Proc. 11th ACM Multimedia Syst. Conf., May 2020, pp. 333–336. [Online]. Available: https://www.youtube.com/watch?v=WcucAw3HNVE
[9] How ClearVR Drives and Leverages Standards. Accessed: Oct. 27, 2020. [Online]. Available: https://www.tiledmedia.com/index.php/standards/
[10] VR Industry Forum Guidelines, Version 2.3. Accessed: Jan. 2021. [Online]. Available: https://www.vr-if.org/guidelines/
[11] VR Industry Forum Newsletter. Accessed: Dec. 2020. [Online]. Available: https://www.vr-if.org/december-2020-newsletter/
[12] Virtual Reality (VR) Profiles for Streaming Applications, document 3GPP Technical Specification 26.118, 2020. Accessed: Oct. 27, 2020. [Online]. Available: https://www.3gpp.org/ftp//Specs/archive/26_series/26.118/
[13] Information Technology—Coding of Audio-Visual Objects—Part 12: ISO Base Media File Format, Standard ISO/IEC 14496-12, 2012.
[14] Information Technology—Dynamic Adaptive Streaming Over HTTP (DASH)—Part 1: Media Presentation Description and Segment Formats, Standard ISO/IEC 23009-1:2019, 2019.

[15] M. M. Hannuksela, Y.-K. Wang, and A. Hourunranta, "An overview of the OMAF standard for 360° video," in Proc. Data Compress. Conf. (DCC), Mar. 2019, pp. 418–427.
[16] Information Technology—High Efficiency Coding and Media Delivery in Heterogeneous Environments—Part 12: Image File Format, Standard ISO/IEC 23008-12, 2012.
[17] M. M. Hannuksela, E. B. Aksu, V. K. M. Vadakital, and J. Lainema, Overview of the High Efficiency Image File Format, document JCTVC-V0072, Oct. 2015, pp. 1–12. [Online]. Available: http://phenix.it-sudparis.eu/jct/doc_end_user/documents/22_Geneva/wg11/JCTVC-V0072-v1.zip
[18] Advanced Video Coding, document ITU-T Rec. H.264, ISO/IEC 14496-10, 2010.
[19] High Efficiency Video Coding, document ITU-T Rec. H.265, ISO/IEC 23008-2, 2002.
[20] M. Budagavi, J. Furton, G. Jin, A. Saxena, J. Wilkinson, and A. Dickerson, "360 degrees video coding using region adaptive smoothing," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2015, pp. 750–754.
[21] R. G. Youvalari, A. Aminlou, and M. M. Hannuksela, "Analysis of regional down-sampling methods for coding of omnidirectional video," in Proc. Picture Coding Symp. (PCS), Dec. 2016, pp. 1–5.
[22] M. Tang, Y. Zhang, J. Wen, and S. Yang, "Optimized video coding for omnidirectional videos," in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2017, pp. 799–804.
[23] Y. Li, J. Xu, and Z. Chen, "Spherical domain rate-distortion optimization for omnidirectional video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 6, pp. 1767–1780, Jun. 2019.
[24] A. Zare, A. Aminlou, M. M. Hannuksela, and M. Gabbouj, "HEVC-compliant tile-based streaming of panoramic video for virtual reality applications," in Proc. ACM Multimedia Conf., Oct. 2016, pp. 601–605.
[25] K. K. Sreedhar, A. Aminlou, M. M. Hannuksela, and M. M. Gabbouj, "Viewport-adaptive encoding and streaming of 360-degree video for virtual reality applications," in Proc. IEEE Int. Symp. Multimedia, Dec. 2016, pp. 583–586.
[26] R. Ghaznavi-Youvalari et al., "Comparison of HEVC coding schemes for tile-based viewport-adaptive streaming of omnidirectional video," in Proc. IEEE 19th Int. Workshop Multimedia Signal Process. (MMSP), Oct. 2017, pp. 1–6.
[27] A. Zare, A. Aminlou, and M. M. Hannuksela, "6K effective resolution with 4K HEVC decoding capability for OMAF-compliant 360° video streaming," in Proc. 23rd Packet Video Workshop, Jun. 2018, pp. 72–77.
[28] H. Hristova, X. Corbillon, G. Simon, V. Swaminathan, and A. Devlic, "Heterogeneous spatial quality for omnidirectional video," in Proc. IEEE 20th Int. Workshop Multimedia Signal Process. (MMSP), Aug. 2018, pp. 1–6.
[29] A. Smolic and P. Kauff, "Interactive 3-D video representation and coding technologies," Proc. IEEE, vol. 93, no. 1, pp. 98–110, Jan. 2005.
[30] P. R. Alface, J.-F. Macq, and N. Verzijp, "Evaluation of bandwidth performance for interactive spherical video," in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2011, pp. 1–6.
[31] M. M. Hannuksela, Y.-K. Wang, and M. Gabbouj, "Isolated regions in video coding," IEEE Trans. Multimedia, vol. 6, no. 2, pp. 259–267, Apr. 2004.
[32] R. Skupin, Y. Sanchez, C. Hellge, and T. Schierl, "Tile based HEVC video for head mounted displays," in Proc. IEEE Int. Symp. Multimedia (ISM), Dec. 2016, pp. 399–400.
[33] Information Technology—JPEG Systems—JPEG 360, Standard ISO/IEC 19566-6:2019, 2019.
[34] Digital Compression and Coding of Continuous-Tone Still Images, Standard ISO/IEC 10918-1:1994, 1994.
[35] JPEG 2000 Image Coding System, Standard ISO/IEC 15444-1:2019, 2019.
[36] Reference Software for High Efficiency Video Coding, document ITU-T Rec. H.265.2, Dec. 2016, ISO/IEC 23008-5:2017, 2017.
[37] A. Lemmetti, M. Viitanen, A. Mercat, and J. Vanne, "Kvazaar 2.0: Fast and efficient open-source HEVC inter encoder," in Proc. 11th ACM Multimedia Syst. Conf., May 2020, pp. 237–242. [Online]. Available: https://github.com/ultravideo/kvazaar
[38] M.-L. Champel and I. D. D. Curcio, Eds., Requirements for MPEG-I Phase 2, document ISO/IEC JTC1 SC29 WG11, N19511, Jul. 2020.
[39] Visual Volumetric Video-Based Coding and Video-Based Point Cloud Compression, document ISO/IEC JTC1 SC29 WG11, N19579, Sep. 2020.
[40] MPEG Immersive Video, document ISO/IEC CD 23090-12, ISO/IEC JTC1 SC29 WG11, N19482, Jul. 2020.

ABOUT THE AUTHORS


Miska M. Hannuksela (Member, IEEE) received the M.Sc. degree in engineering and the D.Sc. degree in technology from the Tampere University of Technology, Tampere, Finland, in 1997 and 2010, respectively.

He has been with Nokia Technologies, Tampere, since 1996, in different roles including research manager/leader positions in the areas of video and image compression, end-to-end multimedia systems, and sensor signal processing and context extraction. He is currently the Bell Labs Fellow and the Head of Video Research, Nokia Technologies. He has published above 180 journal articles and conference papers and more than 1000 standardization contributions in Joint Video Experts Team (JVET), Joint Collaborative Team on Video Coding (JCT-VC), Joint Video Team (JVT), Moving Picture Experts Group (MPEG), the 3rd Generation Partnership Project (3GPP), and Digital Video Broadcasting Project (DVB). He has granted patents from more than 130 patent families. His research interests include video compression, multimedia communication systems and formats, user experience and human perception of multimedia, and sensor signal processing.

Dr. Hannuksela has several best paper awards and received an award of the best doctoral thesis of the Tampere University of Technology in 2009 and the Scientific Achievement Award nominated by the Centre of Excellence of Signal Processing, Tampere University of Technology, in 2010. He was an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY from 2010 to 2015. He has been an Editor in several video and systems standards, including the High Efficiency Image File Format (HEIF), the Omnidirectional Media Format, RFC 3984, and RFC 7798 and some parts of H.264/AVC, H.265/High Efficiency Video Coding (HEVC), and the ISO Base Media File Format.

Ye-Kui Wang received the B.S. degree in industrial automation from the Beijing Institute of Technology, Beijing, China, in 1995, and the Ph.D. degree in information and telecommunication engineering from the Graduate School in Beijing, University of Science and Technology of China, Hefei, China, in 2001.

His earlier working experiences and titles include the Chief Scientist of Media Coding and Systems at Huawei Technologies, San Diego, CA, USA, the Director of Technical Standards at Qualcomm, San Diego, CA, a Principal Member of Research Staff at Nokia Corporation, Tampere, Finland, and so on. He is currently a Principal Scientist with Bytedance Inc., San Diego. He has been an active contributor to various multimedia standards, including video codecs, file formats, real-time transport protocol (RTP) payload formats, and multimedia streaming and application systems, developed by various standardization organizations including International Telecommunication Union, Telecommunication Standardization Sector (ITU-T) Video Coding Experts Group (VCEG), ISO/IEC Moving Picture Experts Group (MPEG), Joint Video Team (JVT), Joint Collaborative Team on Video Coding (JCT-VC), Joint Collaborative Team on 3D Video Coding Extension Development (JCT-3V), Internet Engineering Task Force (IETF), Audio Video coding Standard (AVS), Digital Video Broadcasting Project (DVB), Advanced Television Systems Committee (ATSC), and Digital Entertainment Content Ecosystem (DECE). He has coauthored about 1000 standardization contributions, over 60 academic articles, and about 500 families of patent applications (out of which 336 U.S. patents have been granted as of February 23, 2021). His research interests include video coding, storage, transport, and multimedia systems.

Dr. Wang has been chairing the development of Omnidirectional MediA Format (OMAF) at MPEG. He has been an Editor for several standards, including versatile video coding (VVC), OMAF, all versions of High Efficiency Video Coding (HEVC), VVC file format, HEVC file format, layered HEVC file format, ITU-T H.271, SVC file format, multiview video coding (MVC), RFC 6184, RFC 6190, RFC 7798, 3GPP TR 26.906, and 3GPP TR 26.948.
