Emphasis is placed on clear and explanatory styles that support a tutorial approach,
so that even the most complex of topics is presented in a lucid and intelligible man-
ner.
Lingfen Sun r Is-Haka Mkwawa r
Emmanuel Jammeh r Emmanuel Ifeachor
Series Editors
A.J. Sammes
Centre for Forensic Computing
Cranfield University
Shrivenham campus
Swindon, UK
Since the release of the first Internet Phone in 1995, Voice over Internet Protocol
(VoIP) has grown exponentially, from a lab-based application to today’s established
technology, with global penetration, for real-time communications for business and
daily life. Many organisations are moving from the traditional PSTN networks to
modern VoIP solutions and are using VoIP products such as audio/video conferenc-
ing systems for their daily business operation. We depend on different VoIP tools
such as Skype, Google Talk and Microsoft Lync to keep contact with our business
partners, colleagues, friends and family members, virtually any time and from any-
where. We now enjoy free or low cost VoIP audio or even high quality video calls
which have made the world like a small village for real-time audio/video communi-
cations. VoIP tools have been incorporated into our mobile devices, tablets, desktop
PCs and even TV sets and the use of VoIP tools is just an easy one-click task.
Behind the huge success and global penetration of VoIP, we have witnessed
great advances in the technologies that underpin VoIP such as speech/video sig-
nal processing and compression (e.g., from narrowband, wideband to fullband
speech/audio compression), computer networking techniques and protocols (for bet-
ter and more efficient transmission of multimedia services), and mobile/wireless
communications (e.g., from 2G, 3G to 4G broadband mobile communications).
This book aims to provide an understanding and a practical guide to some of
the fundamental techniques (including their latest developments) which are behind
the success of VoIP. These include speech compression, video compression, me-
dia transport protocols (RTP/RTCP), VoIP signalling protocols (SIP/SDP), QoS and
QoE for voice/video calls, Next Generation Networks based on IP Multimedia Sub-
system (IMS) and mobile VoIP, together with case studies on how to build a VoIP
system based on Asterisk, how to assess and analyse VoIP quality, and how to set
up a mobile VoIP system based on Open IMS and Android mobile. We have pro-
vided many practical examples including real trace data to illustrate and explain
the concepts of relevant transport and signalling protocols. Exercises, illustrative
worked examples in the chapters and end-of-chapter problems will also help readers
to check their understanding of the topics and to stretch their knowledge. Step-by-
step instructions are provided in the case studies to enable readers to build their own
open-source based VoIP system and to assess voice/video call quality accordingly,
or to set up their own mobile VoIP system based on Open IMS Core and IMSDroid
with an Android mobile. Challenging questions are set in the case studies to help readers think more deeply and to gain more practice.
This book has benefitted from the authors’ research and related activities in VoIP over more than 10 years. In particular, it has benefitted from the recent in-
ternational collaborative projects, including the EU FP7 ADAMANTIUM project
(Grant agreement no. 214751), the EU FP7 GERYON project (Grant agreement
no. 284863) and the EU COST Action IC1003 European Network on Quality of
Experience in Multimedia Systems and Services (QUALINET). The book has also
benefitted from the authors’ teaching experience in developing and delivering mod-
ules on “Voice and Video over IP” to undergraduate and postgraduate students at
Plymouth University over the past four years. Some of the content of the book was drawn from the lecture notes, and some of the case study materials from the lab activities.
This book can be used as a textbook for final year undergraduate and first year
postgraduate courses in computer science and/or electronic engineering. It can also
serve as a reference book for engineers in industry and for those interested in VoIP,
for example, those who wish to have a general understanding of VoIP as well as
those who wish to have an in-depth and practical understanding of key VoIP tech-
nologies.
In this book, Dr. Sun has contributed to Chaps. 1 (Introduction), 2 (Speech Com-
pression), 3 (Video Compression), 4 (Media Transport) and 6 (VoIP QoE); Dr. Mk-
wawa has contributed to Chaps. 1 (Introduction), 5 (SIP Signalling), 7 (IMS and Mo-
bile VoIP), 8 (Case Study 1), 9 (Case Study 2) and 10 (Case Study 3); Dr. Jammeh
has contributed to Chap. 3 (Video Compression) and Professor Ifeachor has con-
tributed to Chap. 1 (Introduction) and the editing of the book. Due to time constraints and the limits of our knowledge, some errors or omissions may be inevitable in the book; we welcome any feedback and comments.
Finally, we would like to thank Simon Rees, our editor at Springer-Verlag, for his encouragement, patience, support and understanding over the past two years in
helping us complete the book. We would also like to express our deepest gratitude
to our family for their love, support and encouragement throughout the process of
this book.
Introduction
1
This chapter provides background information about the book. In particular, it pro-
vides an overview of VoIP to make the reader aware of its benefits and growing
importance, how it works and factors that affect VoIP quality. We also introduce
current VoIP approaches and tools which are used in the real world for VoIP calls
and highlight the trends in VoIP and its applications. Finally, we give an outline of
the book in relation to the VoIP protocol stack to give the reader a deeper insight into the contents
of the book.
Voice over Internet Protocol or Voice over IP (VoIP) is a technology used to transmit real-time voice over Internet Protocol (IP) based networks (e.g., the Inter-
net or private IP networks). The original idea behind VoIP is to transmit real-time
speech signal over a data network and to reduce the cost for long distance calls
as VoIP calls would go through packet-based networks at a flat rate, instead of
the traditional Public Switched Telephone Network (PSTN) which was expensive
for long-distance calls. Today, the new trend is to include both voice and video
calls in VoIP. VoIP was originally invented by Alon Cohen and Lior Haramaty in
1995. The first Internet Phone was released in February 1995 by VocalTec and a
flagship patent on an audio transceiver for real-time or near real-time communication of audio signals over a data network was granted in 1998 [3]. This was the first attempt in telecommunications history to transmit both data and voice at the same time over one common network. Traditionally, voice and data
were sent over two separate networks, with data on packet networks and speech
over PSTN.
Since its invention, VoIP has grown exponentially, from a small-scale lab-based
application to today’s global tool with applications in most areas of business and
daily life. More and more organisations are moving from traditional PSTN to mod-
ern VoIP solutions, such as Microsoft’s Unified Communications Solution (Mi-
crosoft Lync1 ) which provides a unified solution for voice, Instant Message, audio
and video conferencing for business operations. Telecommunication and network
service providers now offer attractive packages to customers which include the pro-
vision of VoIP, TV (or IPTV) together with broadband data access for Triple-play
and Quadruple-play services (including mobility). An increasing number of peo-
ple in different age groups now rely on VoIP products and tools such as Skype,2
to make voice/video calls to keep in contact with family and friends because they
are free or inexpensive. Many companies and organizations use VoIP (e.g., Skype)
for routine conference calls for project meetings and for interviewing prospective
employees. New VoIP applications such as mobile VoIP have widened the VoIP
arena further to include seamless and timely communications. VoIP has truly be-
come an invaluable tool which we rely on for business, social and family communi-
cations.
Behind the great success and the wide penetration of VoIP lie major technol-
ogy advances in Information and Communication Technology (ICT) which under-
pin its delivery and applications. Without these, VoIP as we know today would not
be possible. The key technologies include advanced speech compression methods (for narrowband, wideband and fullband compression), advanced video compression methods (including layered coding to support various network conditions), signalling protocols (SIP/SDP), media transport protocols (RTP/RTCP),
Quality of Service (QoS) and Quality of Experience (QoE) management, monitor-
ing and control; IMS (IP Multimedia Subsystem) and mobile VoIP. Descriptions of
these key technologies and their use in VoIP form an important part of this book.
IP based networks now carry all types of traffic, including real-time voice and
video. Figure 1.1 depicts a generalised set-up for VoIP calls. As can be seen in the
figure, a VoIP call can originate from or be sent to a mobile or landline device or
1 http://lync.microsoft.com
2 http://www.skype.com
The voice packets are then sent over the IP network. As the voice packets are
transported, they may be subjected to a variety of impairments such as delay, delay
variation and packet loss.
At the receiving end, a de-packetizer is used to remove the protocol headers added for transmission; a jitter buffer (or playout buffer) is used to absorb the delay variations suffered by the voice packets and make it possible to obtain a smooth playout and
hence smooth reconstruction of speech. Playout buffers can lead to additional packet
loss as packets arriving too late are discarded. Some modern codecs have built-in
packet loss concealment mechanisms which can alleviate the impact of network
packet loss on voice quality.
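To make the playout-buffer behaviour concrete, the sketch below shows a minimal fixed jitter buffer in Python. The 60 ms buffer depth and the packet timings are illustrative assumptions, not values from any particular VoIP tool; real receivers often adapt the buffer depth to the observed delay variation.

```python
# Minimal sketch of a fixed jitter (playout) buffer decision, assuming a
# 20 ms packetisation interval and an illustrative 60 ms buffer depth.
PACKET_INTERVAL_MS = 20
BUFFER_DEPTH_MS = 60  # hypothetical playout delay added at the receiver

def playout_schedule(arrival_times_ms):
    """Return a list of (seq, status) where late packets are discarded."""
    results = []
    for seq, arrival in enumerate(arrival_times_ms):
        send_time = seq * PACKET_INTERVAL_MS        # sender timestamp
        playout_time = send_time + BUFFER_DEPTH_MS  # fixed playout deadline
        if arrival <= playout_time:
            results.append((seq, "played"))
        else:
            # Arrived after its deadline: counted as (additional) packet loss.
            results.append((seq, "discarded (late)"))
    return results

# Example: packet 2 suffers 90 ms of network delay and misses its deadline.
print(playout_schedule([25, 48, 130, 85, 105]))
```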
Since the first VoIP software (named Internet Phone) was released by VocalTec Ltd in 1995, there have been many VoIP softphones on the market. Typical VoIP tools
or softphones include Microsoft Netmeeting, Yahoo Messenger, MSN Messenger,
Windows Live Messenger, Skype, Google Talk, Linphone, Eyebeam and X-Lite. Some of them are open source, such as Linphone, and some are proprietary, such as Skype and X-Lite. In this section, we present some key VoIP tools, including Microsoft’s Lync, which is a unified VoIP solution, Skype, Google Talk and X-Lite.
Microsoft’s Lync 2010 is a rich client application providing a unified solution for Instant Messaging (IM), presence, audio and video conferencing. It can be easily
1.3.2 Skype
Skype has over 660 million registered users and has a record of over 36 million
simultaneous online users in March 2012.5 Skype has been incorporated into many
devices including mobile phones such as Android based phones and iPhones, tablet
such as iPad, and PCs. The main home screen of Skype is shown in Fig. 1.4.
Skype, with IM, voice call, video call, and audio/video conferencing features, is the most successful VoIP software, with the largest registered user base. It provides free PC-to-PC voice/video call and conferencing functions and cheap call rates for PC to traditional landline calls. Skype supports narrowband to wideband audio coding (with codecs such as iSAC, SVOPC, iLBC and SILK) and high quality video coding with VP7/VP8 from On26 (now part of Google7 ). Skype has many ad-
vanced features such as Forward Error Control (FEC), variable audio/video sender
bitrate, variable packet size and variable sampling rate and outperforms other VoIP
tools such as Google Talk and Windows Live Messenger in many experiments car-
ried out by VoIP researchers [2, 13].
Due to the proprietary nature of Skype, there have been many research efforts
trying to evaluate the performance of Skype’s voice/video call when compared
with other VoIP tools such as Google Talk, MSN messenger and Windows Live
Messenger [2, 9, 13], to analyse Skype traffic (e.g., how does it compete with
5 http://en.wikipedia.org/wiki/Skype
6 http://www.on2.com/
7 http://www.google.com
other TCP background traffic under limited network bandwidth and whether it is
TCP-friendly or not) [2, 7], and to understand Skype’s architecture and its un-
derlying QoE control mechanisms (e.g., how does it adapt audio/video sender bi-
trate to available network bandwidth and how does it cope with network conges-
tion) [1, 8].
Google Talk [6] is a voice and instant messaging service that can be run in Google
Chrome OS, Microsoft Windows OS, Android and Blackberry. The communication
between Google Talk servers and clients for authentication, presence and messag-
ing is via Extensible Messaging and Presence Protocol (XMPP). The popularity of
Google Talk is driven by its integration into Gmail, whereby Gmail users can send instant messages and talk to each other. Furthermore, it works within a browser, so the Google Talk client application does not need to be downloaded in order to talk and send instant messages amongst Gmail users. Figure 1.5 shows the Google
Talk client home screenshot with video chat enabled.
Google Talk supports the following audio and video codecs: PCMA, PCMU,
G.722, GSM, iLBC, Speex, ISAC, IPCMWB, EG711U, EG711A, H.264/SVC,
H.264, H.263-1998 and Google VP8.
1.3.4 X-Lite
X-Lite is a proprietary freeware VoIP softphone that uses SIP for VoIP session setup and termination. It combines voice calls, video calls and Instant Messaging
in a simple interface. X-Lite is developed by CounterPath [5]. The screen shot of
X-Lite version 4 is depicted in Fig. 1.6 together with its video screen in Fig. 1.7.
Some of the basic X-Lite functionalities include call display and a message-waiting indicator, speakerphone and mute, hold and redial, and a call history for incoming, outgoing and missed calls.
Some of the enhanced X-Lite features and functions include video call support, instant messaging and presence via the SIMPLE protocol, contact list support, automatic detection and configuration of voice and video devices, echo cancellation, voice activity detection (VAD) and automatic gain control. X-Lite supports the following audio and video codecs: Broadvoice-32, Broadvoice-32 FEC, DVI4, DVI4 Wideband, G711aLaw, G711uLaw, GSM, L16 PCM Wideband, iLBC, Speex, Speex FEC, Speex Wideband, Speex Wideband FEC, H.263 and H.263+ (1998).
The German Internet traffic management systems provider Ipoque [14] sampled about three Petabytes of data in December 2007 from Australia, Germany, Eastern and Southern Europe and the Middle East. Ipoque found that while VoIP made up only about 1 % of all Internet traffic, it was used by around 30 % of Internet users. Skype accounted for 95 % of all VoIP traffic in December 2007.
The number of VoIP subscribers in Western Europe continued to grow, reaching 21.7 million in June 2007, a significant increase compared to 15.6 million in January 2007. TeleGeography [16] estimated that Eu-
ropean VoIP subscribers would have grown to 29 million by December 2007. The
report by Infonetics [11] indicated that there were about 80 million VoIP subscribers
in the world in 2007, with the high rate of adoption coming from the Asia Pacific
region.
The September 2009 report by Ovum Telecoms Research [6] on world consumer VoIP, which forecast the VoIP trend for 2009–14, showed that in the fourth quarter of 2009 in Europe:
• VoIP voice call volumes rose, with mobile VoIP communication the driving factor behind this growth.
• VoIP voice call prices fell. Conventional telephony voice call prices fell at an annual rate of 2.6 %, while mobile VoIP call prices fell sharply.
The Ovum world consumer VoIP forecast shows that worldwide revenues from VoIP subscribers will continue to rise until 2014, with growth slowing down in the subsequent years (cf. Fig. 1.8).
Figure 1.9 illustrates the projected growth in VoIP revenues per region whereby
North America is at the top followed by Europe and Asia-Pacific.
This growth in VoIP revenues is attributed to the increase in VoIP subscribers. As depicted in Fig. 1.10, the number of VoIP subscribers is expected to maintain its upward trend.
The trend of VoIP and Skype in 2009, as reported by Ipoque [15], showed that SIP generated over 50 % of all VoIP traffic, while Skype was number one in the Middle East and Eastern Europe. Skype is still a popular VoIP application due to its diverse functions and ease of use. It provides voice, video and file transfer, and it also has the ability to traverse firewalls and Network Address Translation (NAT) enabled routers. Applications such as Yahoo and Microsoft Messengers and Google Talk, which were previously used for messaging, now offer VoIP services as well. They differ from Skype in that they use a standard-based or modified SIP protocol, and therefore RTP packets are used to transport voice and video payloads. This has triggered another trend: SIP/RTP traffic initiated by Instant Messaging (IM) applications. According to Ipoque [15], SIP/RTP traffic initiated by IM accounts for 20–30 % of the overall VoIP traffic.
Figure 1.11 depicts the VoIP protocol distribution. It can be seen that Skype is by far the most popular VoIP protocol in Eastern Europe and the Middle East, with more than an 80 % share. Skype is popular in these regions, where Internet speeds are low, because its audio codec adapts to the varying Internet bandwidth.
The rapid growth of mobile broadband and advances in mobile device capabilities have prompted an increase in VoIP services on mobile devices. According to the report by In-Stat [10], there are an estimated 255 million active VoIP subscribers via GPRS/3G/HSDPA. In-Stat also forecasts that mobile VoIP applications and services will generate annual revenue of around $33 billion. Figure 1.12 depicts the growth of VoIP subscribers via UMTS and HSPA/LTE cellular networks.
In order to have a better understanding of the scope of the book, here we introduce
briefly the VoIP protocol stack, which is illustrated in Fig. 1.13.
From the top to the bottom, the VoIP protocol stack consists of techniques/proto-
cols at the application layer, the transport layer (e.g., TCP or UDP), the network
layer (e.g., IP) and the link/physical layer. The link/physical layer concerns the techniques and protocols of the transmission networks/media, such as Ethernet (IEEE 802.3), wireless local area networks (WLANs, e.g., IEEE 802.11) and cellular mo-
bile networks (e.g., GSM/UMTS-based 2G/3G mobile networks or LTE-based 4G
mobile networks). The network layer protocol, such as the Internet Protocol (IP), is responsible for the transmission of IP packets from the sender to the receiver over the Internet and mainly concerns where to send a packet and how to route packets along the best path from the sender to the receiver (i.e., routing protocols). The transport layer protocol (e.g., TCP or UDP) is responsible
for providing a logical transport channel between the sender and the receiver hosts
(or, equivalently, for building a logical channel between two processes running on two hosts linked by the Internet). Unlike the physical/link layer and/or network layer proto-
cols which are run by all network devices (such as wireless access points, network
switches and routers) along the path from the sender to the receiver, the transport layer protocol, together with the application layer protocols, is run only in the end systems. The VoIP protocol stack involves both TCP and UDP transport layer proto-
cols, with media transport protocols (such as RTP/RTCP) located on top of UDP,
whereas the signalling protocol (e.g., SIP) can be located on top of either TCP or
UDP as shown in Fig. 1.13.
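To make the layering concrete, the sketch below builds a minimal 12-byte RTP header (as defined in RFC 3550) around a G.711 payload and hands it to UDP through a standard socket; the destination port and the zero-filled payload are illustrative assumptions, and RTCP and SIP signalling are omitted entirely.

```python
# Minimal sketch of the "RTP over UDP over IP" layering described above.
import socket
import struct

def build_rtp_packet(seq, timestamp, ssrc, payload, payload_type=0):
    # First byte: version 2, no padding, no extension, no CSRCs.
    byte0 = 2 << 6
    # Second byte: marker bit 0, payload type (0 = PCMU / G.711 mu-law).
    byte1 = payload_type & 0x7F
    header = struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)
    return header + payload

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # RTP rides on UDP
payload = bytes(160)              # 20 ms of 8 kHz G.711 = 160 payload bytes
packet = build_rtp_packet(seq=1, timestamp=160, ssrc=0x1234ABCD, payload=payload)
sock.sendto(packet, ("127.0.0.1", 5004))   # 5004 is a commonly used RTP port
```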
This book will focus on VoIP techniques and protocols at the application layer
which include audio/video media compression (how voice and video streams are
compressed before they are sent over to the Internet, which will be presented in
Chaps. 2 and 3, respectively), media transport protocols (how voice and video streams are packetised and transmitted over the Internet using the Real-time Transport Protocol (RTP) and the RTP Control Protocol (RTCP), discussed in Chap. 4), and VoIP signalling protocols (how VoIP sessions are established, maintained, and torn down, which is dealt with by the Session Initiation Protocol (SIP) together with the Session Description Protocol (SDP), covered in Chap. 5). We focus only on the SIP signalling protocol from the IETF (Internet
Engineering Task Force, or the Internet community), mainly due to its popularity
with the Internet applications, its applicability (e.g., with 3GPP mobile applications
and with the Next Generation Networks (NGNs)) and its simplified structure. For
the alternative VoIP signalling protocol, H.323 [12, 17] from the ITU-T (the International Telecommunication Union, Telecommunication Standardisation Sector, i.e., from the telecommunications community), readers are recommended to consult relevant books such as [4].
When a VoIP session is established, it is important to know how good or bad the delivered voice or video quality is. The user-perceived quality of VoIP, or the Quality of Experience (QoE) of VoIP services, is key to the success of VoIP applications from the perspective of both service providers and network operators. How to assess and monitor VoIP quality (voice and video quality) will be discussed in Chap. 6.
In Chap. 7, we will introduce the IP Multimedia Subsystem (IMS) and mobile
VoIP. IMS is a standardised Next Generation Network (NGN) architecture for de-
livering multimedia services over converged, all IP-based networks. It provides a
combined structure for delivering voice and video over fixed and mobile networks
including fixed and mobile access (e.g., ADSL/cable modem, WLAN and 3G/4G
mobile networks) with SIP as its signalling protocol. This chapter will describe the
future of VoIP services and video streaming services, over the next generation net-
works.
In the last three chapters (Chaps. 8 to 10), we provide three case studies to give the reader hands-on experience with VoIP systems, VoIP protocol analysis, voice/video quality assessment and mobile VoIP systems. In Chaps. 8 and 9, we provide two case studies to demonstrate how to build a VoIP system based on the open-source Asterisk tool in a lab or home environment and how to evaluate and analyse voice and video quality for voice/video calls in the resulting VoIP testbed.
The reader can follow the step-by-step instructions to set up their own VoIP system, and to analyse the VoIP trace data captured by Wireshark, together with recorded voice samples or captured video clips, for voice/video quality evaluation (for informal subjective and further objective analysis). Many challenging questions are set in the case studies for readers to test their knowledge and stretch their understanding of the topic. In the last chapter (Chap. 10), we present the third case study on building a mobile VoIP system based on the Open Source IMS Core with IMSDroid as an
IMS client. Step-by-step instructions will be provided for setting up Open Source
IMS Core in Ubuntu and IMSDroid in an Android based mobile handset. We will
also demonstrate how to make SIP audio and video calls between two Android based
mobile handsets.
The book will provide you with the required basic principles and the latest advances in VoIP technologies, together with many practical case studies and examples for VoIP and mobile VoIP applications, including both voice and video calls.
1.6 Summary
In this chapter, we have given an overview of VoIP (including its importance) and
explained how it works and factors that affect VoIP quality. We have introduced a
number of key VoIP tools that are used in practice. The range of applications and
trends in VoIP show that this technology is having a major impact on our lives in
both business and at home.
References
1. Bonfiglio D, Mellia M, Meo M, Rossi D (2009) Detailed analysis of Skype traffic. IEEE
Trans Multimed 11(1):117–127
2. Boyaci O, Forte AG, Schulzrinne H (2009) Performance of video-chat applications under
congestion. In: 11th IEEE international symposium on multimedia, pp 213–218
3. Cohen A, Haramaty L (1998) Audio transceiver. US Patent: 5825771
4. Collins D (2003) Carrier grade voice over IP. McGraw-Hill Professional, New York. ISBN
0-07-140634-4
5. Counterpath (2011) X-lite 4. http://www.counterpath.com/x-lite.html. [Online; accessed 12-
June-2011]
6. Google (2012) Ovum telecoms research. http://ovum.com/section/telecoms/. [Online; ac-
cessed 30-August-2012]
7. Hosfeld T, Binzenhofer A (2008) Analysis of Skype VoIP traffic in UMTS: end-to-end QoS
and QoE measurements. Comput Netw 52(3):650–666
8. Huang TY, Huang P, Chen KT, Wang PJ (2010) Could Skype be more satisfying? A QoE-
centric study of the FEC mechanism in an Internet-scale VoIP system. IEEE Netw 24(2):42–
48
9. Kho W, Baset SA, Schulzrinne H (2008) Skype relay calls: measurements and experiments.
In: IEEE INFOCOM, pp 1–6
10. Maisto M (2012) Mobile VoIP trend. http://www.eweek.com/networking/. [Online; accessed
25-September-2012]
11. Myers D (2012) Service provider VoIP and IMS. http://www.infonetics.com/research.asp.
[Online; accessed 30-September-2012]
12. Packet-based multimedia communications systems. ITU-T H.323 v.2 (1998)
13. Sat B, Wah BW (2007) Evaluation of conversational voice communication quality of the
Skype, Google-Talk, Windows Live, and Yahoo Messenger VoIP systems. In: IEEE 9th
workshop on multimedia signal processing, pp 135–138
14. Schulze H, Mochalski K (2012) The impact of p2p file sharing, voice over IP, Skype, Joost,
instant messaging, one-click hosting and media streaming such as Youtube on the Internet.
http://www.ipoque.com/sites/default/files/mediafiles/documents/internet-study-2007.pdf.
[Online; accessed 30-August-2012]
15. Schulze H, Mochalski K (2012) Internet study 2008 and 2009. http://www.ipoque.com/
sites/default/files/mediafiles/documents/internet-study-2008-2009.pdf. [Online; accessed
30-August-2012]
16. TeleGeography (2012) Global internet geography. http://www.telegeography.com/research-
services/global-internet-geography/index.html. [Online; accessed 15-August-2012]
17. Visual telephone systems and equipment for local area networks which provide a non-
guaranteed quality of service. ITU-T H.323 v.1 (1996)
Speech Compression
2
2.1 Introduction
In VoIP applications, voice call is the mandatory service even when a video ses-
sion is enabled. A VoIP tool (e.g., Skype, Google Talk and X-Lite) normally provides
many voice codecs which can be selected or updated manually or automatically.
Typical voice codecs used in VoIP include ITU-T standards such as 64 kb/s G.711
PCM, 8 kb/s G.729 and 5.3/6.3 kb/s G.723.1; ETSI standards such as AMR; open-
source codecs such as iLBC and proprietary codecs such as Skype’s SILK codec
which has variable bit rates in the range of 6 to 40 kb/s and variable sampling fre-
quencies from narrowband to super-wideband. Some codecs can only operate at a
fixed bit rate, whereas many advanced codecs can have variable bit rates which may
be used for adaptive VoIP applications to improve voice quality or QoE. Some VoIP
tools allow the speech codec used to be changed during a VoIP session, making it possible to select the most suitable codec for a given network condition.
Voice codecs or speech codecs are based on different speech compression tech-
niques which aim to remove redundancy from the speech signal to achieve compres-
sion and to reduce transmission and storage costs. In practice, speech compression
codecs are normally compared with the 64 kb/s PCM codec which is regarded as the
reference for all speech codecs. Speech codecs with the lowest data rates (e.g., 2.4
or 1.2 kb/s Vocoder) are used mainly in secure communications. These codecs can
achieve compression ratios of about 26.6 or 53.3 (compared to PCM) and still main-
tain intelligibility, but with speech quality that is somewhat ‘mechanical’. Most of
speech codecs operate in the range of 4.8 kb/s to 16 kb/s and have good speech qual-
ity and reasonable compression ratio. These codecs are mainly used in bandwidth
resource limited mobile/wireless applications. In general, the higher the speech bit
rate, the higher the speech quality and the greater the bandwidth and storage require-
ments. In practice, it is always a trade-off between bandwidth utilisation and speech
quality.
In this chapter, we first introduce briefly underpinning basics of speech compres-
sion, including speech signal digitisation, voice waveform, spectrum and spectro-
gram, and the concept of voiced and unvoiced speech. We then look at key tech-
niques in speech compression coding which include waveform coding, parametric
coding and hybrid coding (or Analysis-by-Synthesis coding). Finally, we present
a number of key speech compression standards, from international standardisation
body (ITU-T), regional standardisation bodies (Europe’s ETSI and North America’s
TIA), together with some open source and proprietary codecs (such as GIP’s iLBC,
now Google’s iLBC and Skype’s SILK codec).
same (here the value of Δ) in the speech dynamic range considered. For a speech
signal in the range of 0 to Δ (input), the output after quantisation will be represented
by the quantised value of 0.5Δ with maximum quantisation error of 0.5Δ. When
non-uniform quantisation is applied, different quantisation steps will be applied in
the speech dynamic range. Due to the fact that speech has non-uniform PDFs, the
quantisation step will be kept smaller in lower level signal. For example for speech
signal in the range of 0 to 0.5Δ (input), the output will be represented by quantised
value of 0.25Δ with maximum quantisation error of 0.25Δ (lower than that for
uniform quantisation for low level signals). Similarly for higher level speech signal
with lower PDF values, the quantisation step is set much bigger than that for uniform
quantisation (coarse quantisation). As illustrated in the figure, for speech signal from
1.5Δ to 3Δ, the quantisation output will be 2.25Δ, with maximum quantisation
error of 0.75Δ, much higher than that for uniform quantisation (0.5Δ), also higher
than that for lower level speech signal (e.g., 0.25Δ for speech between 0 to 0.5Δ).
As the PDF of low level speech signals is much higher than that of high level speech signals, the overall performance (in terms of Signal-to-Noise Ratio (SNR)) will be better than that of uniform quantisation coding. In this example, the same signal dynamic range (i.e., from −3Δ to +3Δ for the input signal) is applied for both uniform and non-uniform quantisation. Non-uniform quantisation has been applied in Pulse Code Modulation (PCM), the simplest and most commonly used speech codec. PCM exploits non-uniform quantisation by using a logarithmic companding method to provide fine quantisation for low level speech and coarse quantisation for high level speech signals.
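To illustrate the logarithmic companding idea, the sketch below applies the standard μ-law compression formula (μ = 255, as used in G.711 μ-law) to normalised samples. The simple uniform 8-bit quantisation of the companded value is an illustrative simplification, not the exact segmented G.711 encoding.

```python
import math

MU = 255  # mu-law parameter used in G.711 (North America / Japan)

def mu_law_compress(x):
    """Map a normalised sample x in [-1, 1] to the companded range [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def quantise_8bit(y):
    """Uniformly quantise the companded value to 8 bits (simplified, not
    the exact G.711 segment/step encoding)."""
    return int(round((y + 1.0) / 2.0 * 255))

# Small-amplitude inputs get many more output levels than large ones,
# which is exactly the 'fine quantisation for low level speech' idea.
for x in (0.01, 0.1, 0.5, 1.0):
    print(x, quantise_8bit(mu_law_compress(x)))
```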
After sampling, quantisation and coding, the analog speech signal is converted
into a digitised speech signal which can be processed, transmitted or stored. Speech
compression coding is normally carried out before digital transmission or storage in
order to reduce the required transmission bandwidth or required storage space. For
the PCM codec with 8000 sampling rate, each sample is represented by 8 bits, giv-
ing transmission bit rate of 8000 × 8 = 64000 bit/s (64 kb/s). Speech compression
coding algorithms are normally compared with 64 kb/s PCM to obtain the compres-
sion ratio. Details of speech compression coding techniques will be discussed in
Sect. 2.3.
If we look more closely at the spectrum for voiced signal, it shows harmonic
frequency components. For a normal male, the pitch is about 125 Hz and for a
female the pitch is at about 250 Hz (Fig. 2.4 has a pitch of 285 Hz for a fe-
male sample) for voiced speech, whereas unvoiced signal does not have this fea-
ture (as can be seen from Fig. 2.5, the spectrum is almost flat and similar to the
spectrum of white noise). The spectra in Figs. 2.4 and 2.5 are obtained by us-
ing a Hamming window with 256 sample window length. The value of waveform
amplitude has been normalised to −1 to +1. Spectrum magnitude is converted
to dB values. For a detailed treatment of the Hamming window and the role of windows in speech signal frequency analysis, readers are recommended to read the book by
Kondoz [26].
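As an illustration of the analysis described above, the following sketch computes the magnitude spectrum of a single 256-sample frame using a Hamming window. The 8 kHz sampling rate matches narrowband speech, and the synthetic 285 Hz tone simply stands in for a voiced frame with that pitch; it is not real speech data.

```python
import numpy as np

fs = 8000                      # narrowband sampling rate (Hz)
n = np.arange(256)             # one 256-sample analysis frame
frame = 0.8 * np.sin(2 * np.pi * 285 * n / fs)   # stand-in for a voiced frame

window = np.hamming(256)                   # Hamming analysis window
spectrum = np.fft.rfft(frame * window)     # short-time spectrum
magnitude_db = 20 * np.log10(np.abs(spectrum) + 1e-12)

freqs = np.fft.rfftfreq(256, d=1.0 / fs)
peak = freqs[np.argmax(magnitude_db)]
print(f"Spectral peak near {peak:.0f} Hz")   # close to the 285 Hz 'pitch'
```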
Figure 2.6 shows the speech waveform of a sentence, “Note closely the size of
the gas tank” spoken by a female speaker and its spectrogram. The sentence is
about 2.5 seconds long. The speech spectrogram displays the spectrum of the whole sentence of speech using a grey scale for the magnitude of the spectrum (the darker the colour, the higher the spectral energy). Pitch harmonic bars are also clearly visible in the spectrogram for the voiced segments of speech. From the sentence, it is clear that the proportion of voiced speech segments (with pitch bars) is higher than that of the unvoiced ones (without pitch bars).
Speech compression, especially at low bit rate speech compression, explores the
nature of human speech production mechanism. In this section, we briefly explain
how human speech is produced.
Figure 2.7 shows a conceptual diagram of human speech production physical
model. When we speak, the air from the lungs pushes through the vocal tract and out of the mouth to produce a sound. For some sounds, for example voiced sounds such as the vowels ‘a’, ‘i’ and ‘u’, as shown in Fig. 2.4, the vocal cords vibrate (open and close) at a certain rate (the fundamental frequency or pitch frequency) and the produced speech samples show a quasi-periodic pattern. For other sounds (e.g., fricatives such as ‘s’ and ‘f’, and plosives such as ‘p’, ‘t’ and ‘k’, known as unvoiced sounds, as shown in Fig. 2.5) [28], the vocal cords do not vibrate and remain open during sound production. The waveform of an unvoiced sound is more like noise. The change of the shape of the vocal tract (in combination with the shape of the nose and mouth cavities and the position of the tongue) produces different sounds, and the change of shape is relatively slow (e.g., 10–100 ms). This forms the basis for the short-term stationary
feature of speech signal used for all frame-based speech coding techniques which
will be discussed in the next section.
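As a rough illustration of the voiced/unvoiced distinction, the sketch below classifies a 20 ms frame using frame energy and zero-crossing rate: voiced frames tend to have high energy and few zero crossings, while noise-like unvoiced frames show the opposite. The thresholds and the synthetic test frames are illustrative assumptions, not values used by any standard codec.

```python
import numpy as np

def classify_frame(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Very rough voiced/unvoiced decision for one speech frame.
    Thresholds are illustrative only."""
    energy = np.mean(frame ** 2)
    zero_crossing_rate = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    if energy > energy_thresh and zero_crossing_rate < zcr_thresh:
        return "voiced"
    return "unvoiced"

fs = 8000
t = np.arange(160) / fs                       # one 20 ms frame at 8 kHz
voiced = 0.5 * np.sin(2 * np.pi * 125 * t)    # 125 Hz 'pitch' stand-in
unvoiced = 0.05 * np.random.randn(160)        # noise-like stand-in
print(classify_frame(voiced), classify_frame(unvoiced))
```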
smaller than that of the PCM input signal, fewer coding bits are needed to represent the ADPCM sample.
e(n) = s(n) − ŝ(n) (2.2)
The difference between e(n) and e_q(n) is due to the quantisation error n_q(n), as given in Eq. (2.3).
n_q(n) = e(n) − e_q(n) (2.3)
The decoder at the receiver side will use the same prediction algorithm to reconstruct the speech sample. If we do not consider channel errors, the quantised error received at the decoder is identical to the one produced at the encoder, i.e., e_q'(n) = e_q(n). The difference between the reconstructed PCM signal at the decoder, s̃(n), and the input linear PCM signal at the encoder, s(n), will be just the quantisation error n_q(n). In this case, the Signal-to-Noise Ratio (SNR) for the ADPCM system will
be mainly decided by the signal to quantisation noise ratio and the quality will be
based on the performance of the adaptive quantiser.
If an ADPCM sample is coded into 4 bits, the produced ADPCM bit rate is
4 × 8 = 32 kb/s. This means that one PCM channel (at 64 kb/s) can transmit two
ADPCM channels at 32 kb/s each. If an ADPCM sample is coded into 2 bits, then
ADPCM bit rate is 2 × 8 = 16 kb/s. One PCM channel can transmit four ADPCM channels
at 16 kb/s each. ITU-T G.726 [15] defines ADPCM bit rate at 40, 32, 24 and 16 kb/s
which corresponds to 5, 4, 3, 2 bits of coding for each ADPCM sample. The higher
the ADPCM bit rate, the higher the numbers of the quantisation levels, the lower
the quantisation error, and thus the better the voice quality. This is why the quality
for 40 kb/s ADPCM is better than that of 32 kb/s. The quality of 24 kb/s ADPCM
is also better than that of 16 kb/s.
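The sketch below illustrates the differential-coding idea behind ADPCM with a trivial fixed first-order predictor and a uniform quantiser of the prediction error. Real G.726 uses an adaptive predictor and an adaptive quantiser, so this is only a conceptual illustration; the step size and the sinusoidal test signal are illustrative assumptions.

```python
import numpy as np

def dpcm_encode_decode(samples, bits=4, step=0.05):
    """Toy differential coder: predict each sample as the previously
    reconstructed one, quantise the prediction error e(n) uniformly,
    and reconstruct. Real ADPCM (G.726) adapts predictor and quantiser."""
    levels = 2 ** (bits - 1)
    prediction = 0.0
    reconstructed = []
    for s in samples:
        error = s - prediction                        # e(n) = s(n) - s_hat(n)
        code = int(np.clip(round(error / step), -levels, levels - 1))
        error_q = code * step                         # e_q(n)
        prediction = prediction + error_q             # reconstructed sample
        reconstructed.append(prediction)
    return np.array(reconstructed)

fs = 8000
t = np.arange(160) / fs
speech_like = 0.5 * np.sin(2 * np.pi * 200 * t)       # stand-in for speech
out = dpcm_encode_decode(speech_like)
snr = 10 * np.log10(np.sum(speech_like**2) / np.sum((speech_like - out)**2))
print(f"4-bit differential coding SNR: {snr:.1f} dB")
```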
H(z) = S(z)/X(z) = G / (1 − Σ_{j=1}^{p} a_j z^{−j}) (2.4)

s(n) = G x(n) + Σ_{j=1}^{p} a_j s(n − j) (2.5)
For a more detailed explanation of the speech signal generation model and LPC analysis, readers are recommended to read the reference book [26].
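The all-pole synthesis filter of Eq. (2.5) can be implemented directly as a recursion. The sketch below drives an illustrative 4th-order filter with a periodic impulse train, mimicking a voiced frame; the coefficients are hypothetical values chosen for the example, not coefficients estimated from real speech.

```python
import numpy as np

def lpc_synthesise(excitation, a, gain=1.0):
    """Implement s(n) = G*x(n) + sum_{j=1..p} a_j * s(n - j), i.e. Eq. (2.5).
    'a' holds predictor coefficients a_1..a_p (illustrative values only)."""
    p = len(a)
    s = [0.0] * p                      # zero initial conditions s(-1)..s(-p)
    for x in excitation:
        past = s[-1:-p - 1:-1]         # s(n-1), s(n-2), ..., s(n-p)
        s.append(gain * x + sum(aj * sj for aj, sj in zip(a, past)))
    return np.array(s[p:])

# Voiced-like synthesis: an impulse train with an 80-sample period
# (10 ms at 8 kHz, i.e. a 100 Hz pitch) driving the synthesis filter.
excitation = np.zeros(320)
excitation[::80] = 1.0
coeffs = [1.3, -0.9, 0.4, -0.1]        # hypothetical a_1..a_4
speech = lpc_synthesise(excitation, coeffs, gain=0.5)
print(speech[:5])
```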
these parameters to the channel/network and then synthesise the speech based on
the received parameters at the decoder. For a continuous speech signal which is
segmented for 20 ms speech frames, this process is repeated for each speech frame.
The basic LPC model is illustrated in Fig. 2.10.
At the encoder, the key components are pitch estimation (to estimate the pitch
period of the speech segment), voicing decision (to decide whether it is a voiced or
unvoiced frame), gain calculation (to calculate the power of the speech segment) and
LPC filter analysis (to predict the LPC filter coefficients for this segment of speech).
These parameters/coefficients are quantised, coded and packetised appropriately (in
the right order) before they are sent to the channel. The parameters and coded bits
from the LPC encoder are listed below.
• Pitch period (T): for example, coded in 7 bits as in LPC-10 (together with voic-
ing decision) [31].
• Voiced/unvoiced decision: to indicate whether it is voiced or unvoiced segment.
For hard-decision, a binary bit is enough.
• Gain (G) or signal power: coded in 5 bits as in LPC-10.
• Vocal tract model coefficients: or LPC filter coefficients, normally in 10-order,
i.e. a1 , a2 , . . . , a10 , coded in 41 bits in LPC-10.
At the decoder, the packetised LPC bitstream is unpacked and sent to the relevant decoder components (e.g., LPC decoder, pitch period decoder) to retrieve the
LPC coefficients, pitch period and gain. The voicing detection bit will be used
to control the voiced/unvoiced switch. The pitch period will control the impulse
train sequence period when in a voiced segment. The synthesiser will synthesise the
speech according to the received parameters/coefficients.
LPC-10 [31] is a standard specified as US Department of Defense (DoD) Federal Standard (FS) 1015 and is based on 10th order LP analysis. Its coded bits are 54 (including one bit for synchronisation) for one speech frame of 180 samples. At an 8 kHz sampling rate, 180 samples per frame corresponds to 22.5 ms per frame (180/8000 = 22.5 ms). For every 22.5 ms, 54 coded binary bits from the encoder are sent to the channel. The encoder bit rate is 2400 bit/s or 2.4 kb/s (54 bits/22.5 ms = 2.4 kb/s). The compression ratio is 26.7 when compared with 64 kb/s PCM (64/2.4). LPC-10 was mainly used in radio communications with secure voice transmission. The voice quality is low in naturalness (a somewhat ‘mechanical’ sound), but with reasonable intelligibility. Some variants of LPC-10 explore
different techniques (e.g., subsampling, silence detection, variable LP coded bits) to
achieve bit rates from 2400 bit/s to 800 bit/s.
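The frame-rate arithmetic used above generalises directly to any frame-based codec. The helper below reproduces the 2.4 kb/s figure for LPC-10 and, for comparison, the rates of two codecs whose bits-per-frame figures are discussed later in this chapter.

```python
def codec_bit_rate(bits_per_frame, frame_ms):
    """Bit rate in kb/s for a frame-based codec."""
    return bits_per_frame / frame_ms   # bits per millisecond == kb/s

print(codec_bit_rate(54, 22.5))    # 2.4  -> LPC-10: 54 bits every 22.5 ms
print(codec_bit_rate(80, 10))      # 8.0  -> G.729: 80 bits every 10 ms (Sect. 2.4.4)
print(codec_bit_rate(244, 20))     # 12.2 -> AMR highest mode: 244 bits / 20 ms (Sect. 2.4.7)
```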
are transmitted to the receiver. The decoder will synthesise the speech signal based on the optimum excitation signal. Any difference between the signal synthesised at the output of the decoder and the one estimated at the encoder is due to channel errors. If there are no channel transmission errors, the synthesised signals at the encoder and the decoder are the same.
In hybrid compression coding, the most successful approach is the Code-Excited Linear Prediction (CELP) based AbS technique, which was a major breakthrough in low bit rate speech compression coding in the late 1980s. CELP-based coding normally con-
tains a codebook with a size of 256 to 1024 at both sender and receiver. Each code-
book entry contains a waveform-like excitation signal, or multi-pulse excitation sig-
nal [5] (instead of only periodic impulse train and noise in parametric coding). This
resolves a major problem in the coding of a transition frame (or “onset” frame), for
example, a frame contains transition from unvoiced to voiced, such as the phonetic
sound at the beginning of the word “see” [si:] or “tea” [ti:] which is very important
from perceptual quality point of view (affects the intelligibility of speech commu-
nications). The closed-loop search process will find the best match excitation from
the codebook and only the index of the matched excitation of the codebook will be
coded and sent to the decoder at the receiver side. At the decoder side, the matched
excitation signal will be retrieved from the same codebook and used to reconstruct
the speech. For a codebook with the size of 256 to 1024, 8–10 bits can be used for the
coding of codebook index. In order to achieve high efficiency in coding and low in
(TIA) of the Electronic Industries Association (EIA), and the International Maritime
Satellite Corporation (INMARSAT).
• LD-CELP: Low Delay CELP, used in ITU-T G.728 at 16 kb/s [16].
• CS-ACELP: Conjugate-Structure Algebraic-Code-Excited Linear Prediction,
used in ITU-T G.729 [17] at 8 kb/s.
• RPE/LTP: Regular Pulse Excitation/Long Term Prediction, used in ETSI GSM
Full-Rate (FR) at 13 kb/s [6].
• VSELP: Vector Sum Excited Linear Prediction, used in ETSI GSM Half-Rate (HR) at 5.6 kb/s [7].
• EVRC based on RCELP: Enhanced Variable Rate Codec [30], specified in
TIA/EIA’s Interim Standard TIA/EIA/IS-127 for use in the CDMA systems in
North America, operating at bit rates of 8.5, 4 or 0.8 kb/s (full-rate, half-rate,
eighth-rate at 20 ms speech frame) [30].
• ACELP: Algebraic CELP, used in ETSI GSM Enhanced Full-Rate (EFR) at
12.2 kb/s [9] and ETSI AMR from 4.75 to 12.2 kb/s [8].
• ACELP/MP-MLQ: Algebraic CELP/Multi Pulse—Maximum Likelihood Quan-
tisation, used in ITU-T G.723.1 at 5.3/6.3 kb/s [18].
• IMBE: Improved Multiband Excitation Coding at 4.15 kb/s for INMARSAT-M.
• AMBE: Advanced Multiband Excitation Coding at 3.6 kb/s for INMARSAT-
AMBE.
Table 2.1 Summary of NB, WB, SWB and FB speech/audio compression coding

Mode                 | Signal bandwidth (Hz) | Sampling rate (kHz) | Bit-rate (kb/s) | Examples
Narrowband (NB)      | 300–3400              | 8                   | 2.4–64          | G.711, G.729, G.723.1, AMR, LPC-10
Wideband (WB)        | 50–7000               | 16                  | 6.6–96          | G.711.1, G.722, G.722.1, G.722.2
Super-wideband (SWB) | 50–14000              | 32                  | 24–48           | G.722.1 (Annex C)
Fullband (FB)        | 20–20000              | 48                  | 32–128          | G.719
In this section, we will discuss some key standardised narrowband (NB), wideband
(WB), super-wideband (SWB) and fullband (FB) speech/audio codecs from the In-
ternational Telecommunication Union, Telecommunication Standardisation Sector (ITU-T) (e.g.,
G.729, G.723.1, G.722.2 and G.719), from the European Telecommunications Stan-
dards Institute (ETSI) (e.g., GSM, AMR, AMR-WB) and from the Internet Engi-
neering Task Force (IETF) (e.g., iLBC and SILK) which are normally used in VoIP
and conferencing systems.
G.711 for 64 kb/s Pulse Coding Modulation (PCM) was first adopted by ITU-T in
1972 and further amended in 1988 [14]. It is the first ITU-T speech compression
coding standard for the ITU-T G-series for narrowband speech with a frequency
range of 300–3400 Hz. Two logarithmic companding laws were defined for historical reasons, with the μ-law for use in North America and Japan, and the A-law for
use in Europe and the rest of the world. The G.711 encoder converts linear 14 bits
uniform PCM code to 8 bits A-law or μ-law PCM (non-uniform quantisation, or
logarithm companding) code per sample with fine quantisation for low level speech
signal and coarse quantisation for high level speech signal. At the decoder side, a de-companding (expansion) process is applied to convert back to the uniform PCM signal. PCM operates at 64 kb/s and is sample-based coding, which means that the algorithmic
delay for the encoder is only one sample of 0.125 ms at 8000 Hz sampling rate.
When the PCM codec is used in VoIP applications, a 20 ms speech frame is normally formed and packetised for transmission over the network. The original G.711 PCM standard did not contain a packet loss concealment mechanism, which is necessary for codecs used in VoIP applications. G.711 Appendix I [19], added in 1999, contains a high quality, low-complexity algorithm for packet loss concealment. G.711 with this packet loss concealment (PLC) algorithm is mandatory for all VoIP applications.
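The concealment idea can be illustrated with a much simpler scheme than the actual G.711 Appendix I algorithm: repeat the last good frame with decaying amplitude whenever a frame is missing. The attenuation factor and frame values below are illustrative assumptions only.

```python
import numpy as np

def conceal_stream(frames, attenuation=0.7):
    """Replace missing frames (None) by repeating the last good frame with
    decaying amplitude. A conceptual stand-in for packet loss concealment,
    not the standardised G.711 Appendix I algorithm."""
    output, last_good, decay = [], None, 1.0
    for frame in frames:
        if frame is not None:
            last_good, decay = frame, 1.0
            output.append(frame)
        elif last_good is not None:
            decay *= attenuation              # fade out during long losses
            output.append(last_good * decay)
        else:
            output.append(np.zeros(160))      # nothing to repeat yet
    return output

frame = 0.5 * np.sin(2 * np.pi * 200 * np.arange(160) / 8000)
stream = [frame, None, None, frame]           # two consecutive lost frames
print([float(np.max(np.abs(f))) for f in conceal_stream(stream)])
```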
G.711.1 is the wideband extension for G.711 Pulse Code Modulation (PCM-WB)
defined by ITU-T in 2008 [24]. It supports both narrowband and wideband speech
coding. When it is applied for wideband speech coding, it can support speech and
audio input signal frequency range from 50 to 7000 Hz. The encoder input signal,
sampled at 16 kHz (in the wideband coding case), is divided into 8 kHz sampled lower-band and higher-band signals, with the lower band using G.711-compatible coding and the higher band coded using the Modified Discrete Cosine Transform (MDCT) on 5 ms speech frames. For the lower-band and higher-band signals, there are
three layers of bitstreams as listed below.
• Layer 0: lower-band base bitstream at 64 kb/s PCM (base bitstream), 320 coded
bits for 5 ms speech frame.
• Layer 1: lower-band enhancement bitstream at 16 kb/s, 80 coded bits for 5 ms
speech frame.
• Layer 2: higher-band enhancement bitstream at 16 kb/s, 80 coded bits for 5 ms
speech frame.
The overall bit rates for G.711.1 PCM-WB can be 64, 80 and 96 kb/s. With
5 ms speech frame, the coded bits are 320, 400, 480 bits, respectively. The algo-
rithmic delay is 11.875 ms (5 ms speech frame, 5 ms look-ahead, and 1.875 ms for
Quadrature-Mirror Filterbank (QMF) analysis/synthesis).
per sample are 5, 4, 3, and 2 bits. It operates at narrowband with the sampling rate
of 8000 Hz. G.726 was originally proposed to be used for Digital Circuit Multi-
plication Equipment (DCME) to improve transmission efficiency for long distance
speech transmission (e.g., one 64 kb/s PCM channel can hold two 32 kb/s ADPCM channels or four 16 kb/s ADPCM channels). The G.726 codec is currently also used for VoIP applications.
10 ms speech frame, whereas the excitation signal parameters (fixed and adaptive
codebook indices and gains) are estimated based on the analysis of each subframe
(5 ms). LPC filter coefficients are transformed to Line Spectrum Pairs (LSP) for sta-
bility and efficiency of transmission. For the G.729 encoder, every 10 ms speech
frame (for 8 kHz sampling rate, it is equivalent to 80 speech samples) is anal-
ysed to obtain relevant parameters, which are then encoded to 80 bits and trans-
mitted to the channel. The encoder bit rate is 8 kb/s (80 bits/10 ms = 8 kb/s).
G.729 supports three speech frame types, which are normal speech frame (with
80 bits), Silence Insertion Description (SID) frame (with 15 bits, to indicate the
features of background noise when voice activity detection (VAD) is enabled)
and a null frame (with 0 bit). G.729 was designed for cellular and network ap-
plications. It has a built-in concealment mechanism to conceal a missing speech
frame using interpolation techniques based on previous received speech frames.
For detailed bit allocation of 80 bits to LPC filter coefficients and excitation code-
books, you can read ITU-T G.729 [17]. The G.729 standard also defines G.729A (G.729 Annex A), a reduced-complexity algorithm operating at 8 kb/s, Annex D for a low-rate extension at 6.4 kb/s and Annex E for a high-rate extension at 11.8 kb/s.
ITU-T G.723.1 [18], standardised in 1996, is based on Algebraic CELP (ACELP) for
bit rate at 5.3 kb/s and Multi Pulse—Maximum Likelihood Quantisation (MP-MLQ)
for bit rate at 6.3 kb/s. It was proposed for multimedia communications such as for
very low bit rate visual telephony applications and provides dual rates for flexibility.
The higher bit rate will have better speech quality. G.723.1 uses a 30 ms speech
frame (240 samples for a frame for 8 kHz sampling rate). The switch between the
two bit rates can be carried out at any frame boundary (30 ms). Each 30 ms speech
frame is divided into four subframes (each 7.5 ms). The look-ahead of G.723.1 is 7.5 ms (one subframe length), which results in an algorithmic delay of 37.5 ms.
The 10th order LPC analysis is applied for each subframe. Both open-loop and closed-loop pitch period estimation/prediction are performed for every two subframes
(120 samples). Two different excitation methods are used for the high and the low
bit rate codecs (one on ACELP and one on MP-MLQ).
four subframes (5 ms each). LP analysis is carried out for each speech frame (20 ms).
The Regular pulse excitation (RPE) analysis is based on the subframe, whereas Long
Term Prediction (LTP) is based on the whole speech frame. The encoded block of
260 bits contains the parameters from LPC filter, RPE and LTP analysis. Detailed
bits allocation can be found from [6].
GSM half rate (HR), known as GSM 06.20, was defined by ETSI in 1999 [7].
This codec is based on VSELP (Vector-Sum Excited Linear Prediction) operating at 5.6 kb/s. It uses a vector-sum excited linear prediction codebook, with each codebook vector formed as a linear combination of fixed basis vectors. The speech frame length is 20 ms and is divided into four subframes (5 ms each). The LPC filter is 10th order. The encoded block length is 112 bits, containing the parameters for the LPC filter, codebook indices and gains.
Enhanced Full Rate (EFR) GSM, known as GSM 06.60, was defined by ETSI in
2000 [9]. It is based on ACELP (Algebraic CELP) and operates at 12.2 kb/s, same
as the highest rate in AMR (see the next section).
Adaptive Multi Rate (AMR) narrowband speech codec, based on ACELP (Alge-
braic Code Excited Linear Prediction), was defined by ETSI, Special Mobile Group
(SMG), in 2000 [8]. It has been chosen by 3GPP (the 3rd Generation Partnership
Project) as the mandatory codec for Universal Mobile Telecom Systems (UMTS)
or the 3rd Generation Mobile Networks (3G). AMR is a multi-mode codec with 8
narrowband modes for bit rates of 4.75, 5.15, 5.9, 6.7, 7.4, 7.95, 10.2 and 12.2 kb/s.
The speech frame length is 20 ms (160 speech samples at an 8000 Hz sampling rate). Mode switching can occur at the boundary of each speech frame (20 ms). For a
speech frame, the speech signal is analysed in order to obtain the parameters of 10th
LP coefficients, adaptive and fixed codebooks’ indices and gains. The LP analysis
is carried out twice for 12.2 kb/s AMR mode and only once for all other modes.
Each 20 ms speech frame is divided into four subframes (5 ms each). Pitch anal-
ysis is based on every subframe, and adaptive and fixed codebooks parameters are
transmitted for every subframe. The bit numbers for encoded blocks for the 8 modes
from 4.75 to 12.2 kb/s are 95, 103, 118, 134, 148, 159, 204 and 244 bits, respec-
tively. Here you can calculate and check the relevant bit rate based on bit numbers in
an encoded block. For example, for 244 bits over a 20 ms speech frame, the bit rate
is 12.2 kb/s (244 bits/20 ms = 12.2 kb/s). For detailed bit allocation for 8 modes
AMR, the reader can follow the AMR ETSI specification [8].
The flexibility in bandwidth requirements and the tolerance of bit errors of the AMR codec are not only beneficial for wireless links, but are also desirable for VoIP applications, e.g., in QoE management for mobile VoIP using automatic AMR bit rate adaptation in response to network congestion [27].
iLBC (Internet Low Bit Rate Codec), an open source speech codec, was proposed
by Andersen et al. in 2002 [3] at Global IP Sound (GIP, acquired by Google Inc in
2011)1 and was defined in IETF RFC 3951 [2] in 2004. It was aimed at Internet applications, with robustness to packet loss. Based on block-independent CELP (frame-independent long-term prediction), it can overcome the error propagation problem that occurs in traditional CELP codecs and achieve better voice quality under packet
loss conditions (when compared with other CELP codecs, such as G.729, G.723.1
and AMR) [29]. The frame length for iLBC is 20 ms (15.2 kb/s, with 304 bits per
coded block) or 30 ms (13.33 kb/s, with 400 bits per coded block). Each speech
frame is divided into four (for 20 ms frame with 160 samples) or six subframes
(for 30 ms frame with 240 samples) with each subframe corresponding to 5 ms of
speech (40 samples). For 30 ms frame, two LPC analyses are carried out, whereas
for 20 ms frame, only one LPC analysis is required (both are based on 10th order
LPC analysis). Codebook search is carried out for each subframe. Key techniques used in iLBC are LPC analysis, dynamic codebook search, scalar quantisation and perceptual weighting. The dynamic codebooks are used to code the residual signal only for the current speech block, without using information from previous speech frames, thus eliminating the error propagation problem due to packet loss.
This method enhances the packet loss concealment performance and results in better
speech quality under packet loss conditions.
iLBC has been used in many VoIP tools such as Google Talk and Yahoo! Mes-
senger.
SILK, the Super Wideband Audio Codec, is the codec recently used in Skype. It was designed and developed by Skype2 as a speech codec for real-time, packet-based voice communications and was submitted to the IETF in 2009 [32].
The SILK codec has four operating modes which are Narrowband (NB, 8 kHz
sampling rate), Mediumband (MB, 8 or 12 kHz sampling rate), Wideband (WB,
8, 12 or 16 kHz sampling rate) and Super Wideband (SWB, 8, 12, 16 or 24 kHz
sampling rate). Its basic speech frame is 20 ms (160 samples at 8 kHz sampling
rate). The core encoder uses AbS techniques similar to those described earlier, including pitch estimation (every 5 ms) and voicing decision (every 20 ms), short-term prediction (LPC) and long-term prediction (LTP), LTP scaling control, transformation of the LPC coefficients to LSF coefficients, and noise shaping analysis.
The key scalability features of the SILK codec can be categorised as follows, as shown in Fig. 2.15.
1 http://www.globalipsound.com
2 https://developer.skype.com/silk/
• Sampling rate: Skype supports the sampling rates of 8, 12, 16 or 24 kHz which
can be updated in real-time to support NB, MB, WB and SWB voice applica-
tions.
• Bit rate: Skype supports bit rates from 6 to 40 kb/s. Bit rates can be adapted
automatically according to network conditions.
• Packet loss rate: packet loss rate can be used as one of the control parameters
for the Skype encoder to control its Forward Error Control (FEC) and packet
loss concealment mechanisms.
• Use FEC: whether the Forward Error Control (FEC) mechanism is used can be decided according to network conditions. Perceptually important packets, for example speech transition frames, can be encoded at a lower bit rate and sent again over the channel. At the receiver side, if the main speech packet is lost, its lower bit rate copy can be used to recover the lost packet and to improve overall speech quality. However, FEC increases bandwidth usage as extra information needs to be sent through the network.
• Complexity: There are three complexity settings provided in Skype which are
high, medium and low. Appropriate complexity (CPU load) can be decided ac-
cording to applications.
Other features, such as a variable packet size (e.g., one packet can contain from 1 up to 5 speech frames) and DTX (Discontinuous Transmission) to stop transmitting packets during silence periods, are common features which can also be found in other speech codecs.
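To illustrate how such control parameters might be combined, the following sketch is a purely hypothetical adaptation policy (not Skype's actual algorithm): it picks a target bit rate within SILK's 6–40 kb/s range and an FEC decision from a measured packet loss rate and an available bandwidth estimate.

```python
# Hypothetical encoder control policy: choose a SILK-style target bit rate and
# decide whether to enable FEC, based on measured network conditions.
def choose_encoder_settings(loss_rate: float, available_kbps: float):
    """loss_rate: packet loss ratio (0.0-1.0); available_kbps: bandwidth estimate."""
    # Stay within the codec's supported range (6-40 kb/s for SILK).
    target_kbps = max(6.0, min(40.0, 0.8 * available_kbps))
    # Enable FEC when loss is noticeable; leave headroom for the redundant data.
    use_fec = loss_rate > 0.02
    if use_fec:
        target_kbps = max(6.0, target_kbps * 0.85)  # reserve ~15% for FEC copies
    return target_kbps, use_fec

if __name__ == "__main__":
    for loss, bw in [(0.0, 50.0), (0.05, 30.0), (0.15, 12.0)]:
        rate, fec = choose_encoder_settings(loss, bw)
        print(f"loss={loss:.0%}, bw={bw} kb/s -> rate={rate:.1f} kb/s, FEC={fec}")
```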
G.722 [12], defined by ITU-T in 1988, is a compression coding standard for 7 kHz
audio at 16 kHz sampling rate. It is based on sub-band adaptive differential pulse
code modulation (SB-ADPCM) with bit rates of 64, 56 or 48 kb/s (depending on the operating mode). When the encoder bit rate is 56 or 48 kb/s, an auxiliary data channel of 8 or 16 kb/s can be added during transmission to make up a 64 kb/s channel.
At the SB-ADPCM encoder, the input audio signal (0 to 8 kHz) at 16 kHz sam-
pling rate is split into two sub-band signals, each at 8 kHz sampling rate. The lower
sub-band is for the signal from 0 to 4 kHz (same frequency range as narrowband
speech), and the higher sub-band is for signal from 4 to 8 kHz. Each sub-band
signal is encoded using ADPCM with a structure similar to that illustrated in Fig. 2.8, including an adaptive quantiser and an adaptive predictor. The lower sub-band ADPCM applies adaptive 60-level non-uniform quantisation, which requires 6 bits for each ADPCM codeword and results in a 48 kb/s bit rate. The higher sub-band ADPCM applies 4-level non-uniform quantisation using 2 bits per codeword, giving a 16 kb/s transmission bit rate. Overall, 64 kb/s is achieved for the SB-ADPCM coding. For 56 or 48 kb/s operation, 30-level or 15-level non-uniform quantisation is used instead of 60-level quantisation, which results in 5-bit or 4-bit coding of each lower sub-band ADPCM codeword. The 4-level quantisation of the higher sub-band remains the same.
Due to the nature of ADPCM sample-based coding, G.722 ADPCM-WB is suit-
able for both wideband speech and music coding.
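The overall G.722 bit rates follow directly from the per-codeword bit allocations and the 8 kHz codeword rate of each sub-band; a minimal sketch of this arithmetic (an illustrative check, not part of the standard):

```python
# G.722 SB-ADPCM bit rate arithmetic: each sub-band produces 8000 codewords/s.
CODEWORDS_PER_SECOND = 8000  # each sub-band is at 8 kHz after the band split

# (lower sub-band bits, higher sub-band bits) per codeword for each operating mode
modes = {"64 kb/s": (6, 2), "56 kb/s": (5, 2), "48 kb/s": (4, 2)}

for name, (low_bits, high_bits) in modes.items():
    low_rate = low_bits * CODEWORDS_PER_SECOND / 1000     # kb/s, lower sub-band
    high_rate = high_bits * CODEWORDS_PER_SECOND / 1000   # kb/s, higher sub-band
    print(f"{name}: lower band {low_rate:.0f} kb/s + higher band {high_rate:.0f} kb/s"
          f" = {low_rate + high_rate:.0f} kb/s")
```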
G.722.1 [20], approved by ITU-T in 1999, is for 7 kHz audio coding at 24 and
32 kb/s for hands-free applications, for example, conferencing systems. It can be
used for both speech and music. The encoder input signal is sampled at a 16 kHz sampling rate. The coding algorithm is based on transform coding, known as the Modulated Lapped Transform (MLT). The audio coding frame is 20 ms (320 samples at 16 kHz sampling rate), with 20 ms look-ahead, resulting in an algorithmic coding delay of 40 ms. Each 20 ms audio frame is independently transformed into 320 MLT coefficients and then coded into 480 or 640 bits for the bit rates of 24 or 32 kb/s, respectively. This independent coding of the MLT coefficients of each frame gives better resilience to frame loss, as no error propagation exists in the coding algorithm. This is why G.722.1 is suitable for use in conferencing systems with low frame loss. A bit rate change for this codec can occur at any 20 ms frame boundary.
The latest version of G.722.1 [22] (2005) defines both the 7 kHz audio coding mode (in the main body) and a 14 kHz coding mode (in Annex C). The 14 kHz audio coding mode expands the audio frequency range from 7 kHz to 14 kHz, with the sampling rate doubled from 16 to 32 kHz and the number of samples per audio frame doubled from 320 to 640. The bit rates supported by Annex C are 24, 32 and 48 kb/s. The speech produced by the 14 kHz coding algorithm is normally referred to as "High Definition Voice" or "HD" voice. This codec has been used in video conference phones and video streaming systems by Polycom.3
3 http://www.polycom.com
Adaptive Multi-Rate Wideband (AMR-WB) has been defined by both 3GPP [1]
in Technical Specification TS 26.190 and ITU-T G.722.2 [21]. It is for wideband
application (7 kHz bandwidth speech signals) with 16 kHz sampling rate. It operates
at a wide range of bit rates from 6.6 to 23.85 kb/s (6.60, 8.85, 12.65, 14.25, 15.85,
18.25, 19.85, 23.05 or 23.85 kb/s) with bit rate change at any 20 ms frame boundary.
Like AMR, AMR-WB is based on the ACELP coding technique, but it uses a 16th order linear prediction (LP) filter (short-term prediction filter) instead of the 10th order LP filter used in narrowband AMR. AMR-WB can provide high quality voice and
is suitable for applications such as combined speech and music, and multi-party
conferences.
G.719, approved by ITU-T in 2008, is the latest ITU-T standard for Fullband (FB)
audio coding [23] with bit rates ranging from 32 to 128 kb/s and audio frequen-
cies up to 20 kHz. It is a joint effort by Polycom and Ericsson,4 and is aimed at high quality speech, music and general audio transmission, being suitable for conversational applications such as teleconferencing and telepresence. The 20 Hz–20 kHz frequency range covers the full human auditory bandwidth, i.e. all frequencies the human ear can hear. The sampling rate at the input of the encoder and the output
of the decoder is 48 kHz. The frame size is 20 ms, with 20 ms look-ahead, resulting
in an algorithmic delay of 40 ms. The compression technique is based on Transform
Coding. Features such as adaptive time resolution, adaptive bit allocation and lattice vector quantisation make it flexible and efficient in handling different input audio signal characteristics and allow it to provide a variable bit rate from 32 to 128 kb/s. The encoder detects each 20 ms input signal frame and classi-
fies it as either a stationary frame (such as speech) or a non-stationary frame (such
as music) and applies different transform coding techniques accordingly. For a sta-
tionary frame, the modified Discrete Cosine Transform (DCT) is applied, whereas
for a non-stationary frame, a higher temporal resolution transform (in the range of
5 ms) is used. The spectral coefficients after transform coding are grouped into dif-
ferent bands, then quantised using lattice-vector quantisation and coded based on
different bit allocation strategies to achieve different transmission bit rates from 32
to 128 kb/s. G.719 can be applied in high-end video conferencing and telepresence applications to provide high definition (HD) voice, accompanying an HD video stream.
4 http://www.ericsson.com
2.5 Illustrative Worked Examples
In the previous sections, we have discussed key narrowband to fullband speech com-
pression codecs standardised by ITU-T, ETSI and IETF. We now summarize them
in Table 2.2, which includes each codec's basic information: the standardisation body involved, the year it was standardised, the codec type, whether it is Narrowband (NB), Wideband (WB), Super-wideband (SWB) or Fullband (FB), the bit rate (kb/s), the speech frame length (ms), the coded bits per sample or per frame, the look-ahead time (ms), and the coding algorithmic delay (ms). From this table, you should be able to see the historic development of speech compression coding standards (from 64 kb/s, 32 kb/s, 16 kb/s and 8 kb/s down to 6.3/5.3 kb/s) to achieve high compression efficiency, the development of mobile codecs from GSM to AMR for 2G and 3G applications, the development from single-rate, dual-rate and 8-mode codecs to variable-rate codecs to achieve high application flexibility, and the trend from narrowband (NB) to wideband (WB) codecs to achieve high speech quality (even High Definition voice). This development has made
speech compression codecs more efficient and more flexible for many different ap-
plications including VoIP. In the table, the columns on coded bits per sample/frame
and speech frame for each codec will help you to understand payload size and to
calculate VoIP bandwidth which will be covered in Chap. 4 on RTP transport pro-
tocol. The columns on look-ahead time and codec’s algorithmic delay will help to
understand codec delay and VoIP end-to-end delay, a key QoS metric, which will be
discussed in detail in Chap. 6 on VoIP QoE.
It should be mentioned that many VoIP phones (hardphones or softphones) incorporate many different NB and even WB codecs. How to negotiate which codec is to be used at each VoIP terminal, and how to change the codec/mode/bit rate on the fly during a VoIP session, will be discussed in Chap. 5 on SIP/SDP signalling.
Table 2.2 Summary of standardised narrowband to fullband speech/audio codecs

Codec      Standard        Type        NB/WB  Bit rate   Speech     Bits per      Look-   Algor.
           Body/Year                   /FB    (kb/s)     frame (ms) sample/frame  ahead   delay
                                                                                  (ms)    (ms)
G.711      ITU/1972        PCM         NB     64         0.125      8             0       0.125
G.726      ITU/1990        ADPCM       NB     40         0.125      5             0       0.125
                                              32                    4
                                              24                    3
                                              16                    2
G.728      ITU/1992        LD-CELP     NB     16         0.625      10            0       0.625
G.729      ITU/1996        CS-ACELP    NB     8          10         80            5       15
G.723.1    ITU/1996        ACELP       NB     5.3        30         159           7.5     37.5
                           MP-MLQ      NB     6.3                   189
GSM        ETSI/1991 (FR)  RPE-LTP     NB     13         20         260           0       20
           ETSI/1999 (HR)  VSELP       NB     5.6                   112           0       20
           ETSI/2000 (EFR) ACELP       NB     12.2                  244           0       20
AMR        ETSI/2000       ACELP       NB     4.75       20         95            5       25
                                              5.15                  103
                                              5.9                   118
                                              6.7                   134
                                              7.4                   148
                                              7.95                  159
                                              10.2                  204
                                              12.2                  244           0       20
iLBC       IETF/2004       CELP        NB     15.2       20         304           0       20
                                              13.33      30         400                   30
G.711.1    ITU/2008        PCM-WB      NB/WB  64         5          320           5       11.875
                           (MDCT)             80                    400
                                              96                    480
G.722      ITU/1988        SB-ADPCM    WB     64         0.125      8             0       0.125
                                              56                    7
                                              48                    6
G.722.1    ITU/1999        Transform   WB     24         20         480           20      40
                           Coding             32                    640
           ITU/2005                    SWB    24/32/48              480–960
G.719      ITU/2008        Transform   FB     32–128     20         640–2560      20      40
                           Coding
AMR-WB     ETSI/ITU        ACELP       WB     6.6–       20         132–477       0       20
(G.722.2)  /2003                              23.85
SILK       IETF/2009       CELP        WB     6–40       20         120–800       0       20

2.5.1 Question 1

Determine the input and output data rates (in kb/s) and hence the compression ratio for a G.711 codec. Assume that the input speech signal is first sampled at 8 kHz and that each sample is then converted to a 14-bit linear code before being compressed into 8-bit non-linear PCM by the G.711 codec.

SOLUTION: The input speech signal is sampled at 8 kHz, which means that there are 8000 samples per second, and each sample is coded using 14 bits. Thus the input data rate is:

8000 samples/s × 14 bits/sample = 112000 bit/s = 112 kb/s

For the output data, each sample is coded using 8 bits, thus the output data rate is:

8000 samples/s × 8 bits/sample = 64000 bit/s = 64 kb/s

The compression ratio is therefore:

112/64 = 1.75
2.5.2 Question 2
The G.726 is the ITU-T standard codec based on ADPCM. Assume the codec’s
input speech signal is 16-bit linear PCM and the sampling rate is 8 kHz. The output
of the G.726 ADPCM codec can operate at four possible data rates: 40 kb/s, 32 kb/s,
24 kb/s and 16 kb/s. Explain how these rates are obtained and what the compression
ratios are when compared with 64 kb/s PCM.
SOLUTION: The input speech signal is sampled at 8 kHz, i.e. there are 8000 samples per second. For an output data rate of 40 kb/s, the number of bits used to code each quantised difference signal is 40000/8000 = 5 bits. Thus, using 5 bits to code each quantised difference signal creates an ADPCM bit stream operating at 40 kb/s.
Similarly, for 32, 24 and 16 kb/s, the required number of bits for each quantised difference signal is 4 bits, 3 bits and 2 bits, respectively. The fewer the coding bits, the higher the quantisation error and thus the lower the speech quality.
The compression ratio of 40 kb/s ADPCM compared with 64 kb/s PCM is 64/40 = 1.6.
For 32, 24 and 16 kb/s ADPCM, the compression ratios are 2, 2.67 and 4, respectively.
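A quick sketch of this arithmetic (an illustrative check, not part of the standard) derives the bits per sample and the compression ratio for each G.726 rate:

```python
# G.726 ADPCM: bits per difference sample and compression ratio versus 64 kb/s PCM.
SAMPLE_RATE = 8000      # samples per second (narrowband speech)
PCM_RATE_KBPS = 64      # reference 64 kb/s G.711 PCM

for adpcm_kbps in (40, 32, 24, 16):
    bits_per_sample = adpcm_kbps * 1000 / SAMPLE_RATE
    ratio = PCM_RATE_KBPS / adpcm_kbps
    print(f"{adpcm_kbps} kb/s ADPCM: {bits_per_sample:.0f} bits/sample, "
          f"compression ratio {ratio:.2f} vs 64 kb/s PCM")
```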
2.5.3 Question 3
For the G.723.1 codec, it is known that the transmission bit rates can operate at
either 5.3 or 6.3 kb/s. What is the frame size for G.723.1 codec? How many speech
samples are there within one speech frame? Determine the number of parameter bits coded for the G.723.1 encoding.
SOLUTION: For the G.723.1 codec, the frame size is 30 ms. As G.723.1 is a narrowband codec, the sampling rate is 8 kHz. The number of speech samples in a speech frame is:

8000 samples/s × 0.030 s = 240 samples

So, there are 240 speech samples within one speech frame.
For 5.3 kb/s G.723.1, the number of parameter bits used per frame is:

5300 bit/s × 0.030 s = 159 bits

For 6.3 kb/s G.723.1, the number of parameter bits used per frame is:

6300 bit/s × 0.030 s = 189 bits
2.6 Summary
In this chapter, we discussed speech/audio compression techniques and summarised
narrowband, wideband and fullband speech/audio compression standards from
ITU-T, ETSI and IETF. We focused mainly on narrowband speech compression,
but covered some wideband and the latest fullband speech/audio compression stan-
dards. We started the chapter with some fundamental concepts of speech, including speech signal digitisation (sampling, quantisation and coding), the characteristics of voiced and unvoiced speech, and speech signal representation including the speech waveform and spectrum. We then presented three key speech
compression techniques which are waveform compression, parametric compression
and hybrid compression. For waveform compression, we mainly explained ADPCM
which is widely used for both narrowband and wideband speech/audio compres-
sion. For parametric compression, we started from the speech production model and
then explained the concept of parametric compression techniques, such as LPC-10.
For hybrid compression, we started from the problems with waveform and parametric compression techniques and the need to develop speech codecs with both high speech quality and a high compression ratio, and then discussed the revolutionary Analysis-by-Synthesis (AbS) and CELP (Code Excited Linear Prediction) approach. We also listed the major CELP variants used in mobile, satellite and secure communications systems.
In this chapter, we summarised major speech/audio compression standards
for narrowband, wideband and fullband speech/audio compression coding from
ITU-T, ETSI and IETF. We covered narrowband codecs including G.711, G.726,
G.728, G.729, G.723.1, GSM, AMR and iLBC; wideband codecs including G.722,
G.722.1, G.722.2/AMR-WB; and fullband codec (i.e., G.719). We explained the
historic development of these codecs and the trend from narrowband, wideband
to fullband speech/audio compression to provide high fidelity or “High Definition
Voice” quality. Their applications cover VoIP, video call, video conferencing and
telepresence.
This chapter, together with the next chapter on video compression, forms the basis for the other chapters in the book. We illustrated concepts such as speech codec type, speech frame size, sampling rate, bit rate and the number of coded bits per speech frame. These will help you to understand the payload size and to calculate VoIP bandwidth, which will be covered in Chap. 4 on the RTP transport protocol. The codec compression and algorithmic delay also affect overall VoIP quality, which will be further discussed in Chap. 6 on VoIP QoE. How to negotiate and decide which codec is to be used for a VoIP session, and how to change the mode or codec type during a session, will be discussed in Chap. 5 on SIP/SDP signalling.
2.7 Problems
14. For G.722 ADPCM-WB, what is the sampling rate for signal at the input of
the encoder? What is the sampling rate for the input at each sub-band ADPCM
block?
15. Describe the speech/audio frequency range and sampling rate for narrowband,
wideband, super-wideband and fullband speech/audio compression coding.
16. Describe the differences between G.711 and G.711.1.
References
1. 3GPP (2011) Adaptive Multi-Rate—Wideband (AMR-WB) speech codec, transcoding
functions (Release 10). 3GPP TS 26.190 V10.0.0
2. Andersen S, Duric A, et al (2004) Internet Low Bit rate Codec (iLBC). IETF RFC 3951
3. Andersen SV, Kleijn WB, Hagen R, Linden J, Murthi MN, Skoglund J (2002) iLBC—
a linear predictive coder with robustness to packet losses. In: Proceedings of IEEE 2002
workshop on speech coding, Tsukuba Ibaraki, Japan, pp 23–25
4. Atal BS, Hanauer SL (1971) Speech analysis and synthesis by linear prediction. J Acoust
Soc Am 50:637–655
5. Atal BS, Remde JR (1982) A new model of LPC excitation for producing natural-sounding
speech at low bit rates. In: Proc IEEE int conf acoust speech, signal processing, pp 614–617
6. ETSI (1991) GSM full rate speech transcoding. GSM Rec 06.10
7. ETSI (1999) Digital cellular telecommunications system (Phase 2+); half rate speech; half
rate speech transcoding. ETSI-EN-300-969 V6.0.1
8. ETSI (2000) Digital cellular telecommunications system (Phase 2+); Adaptive Multi-Rate
(AMR) speech transcoding. ETSI-EN-301-704 V7.2.1
9. ETSI (2000) Digital cellular telecommunications system (Phase 2+); Enhanced Full Rate (EFR) speech transcoding. ETSI-EN-300-726 V8.0.1
10. Griffin DW, Lim JS (1988) Multiband excitation vocoder. IEEE Trans Acoust Speech Signal
Process 36:1223–1235
11. ITU-T (1988) 32 kbit/s adaptive differential pulse code modulation (ADPCM). ITU-T
G.721
12. ITU-T (1988) 7 kHz audio-coding within 64 kbit/s. ITU-T Recommendation G.722
13. ITU-T (1988) Extensions of Recommendation G.721 adaptive differential pulse code mod-
ulation to 24 and 40 kbit/s for digital circuit multiplication equipment application. ITU-T
G.723
14. ITU-T (1988) Pulse code modulation (PCM) of voice frequencies. ITU-T G.711
15. ITU-T (1990) 40, 32, 24, 16 kbit/s Adaptive Differential Pulse Code Modulation (ADPCM).
ITU-T G.726
16. ITU-T (1992) Coding of speech at 16 kbit/s using low-delay code excited linear prediction.
ITU-T G.728
17. ITU-T (1996) Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited
linear prediction (CS-ACELP). ITU-T G.729
18. ITU-T (1996) Dual rate speech coder for multimedia communication transmitting at 5.3 and
6.3 kbit/s. ITU-T Recommendation G.723.1
19. ITU-T (1999) G.711: a high quality low-complexity algorithm for packet loss concealment
with G.711. ITU-T G.711 Appendix I
20. ITU-T (1999) Coding at 24 and 32 kbit/s for hands-free operation in systems with low frame
loss. ITU-T Recommendation G.722.1
21. ITU-T (2003) Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate
Wideband (AMR-WB). ITU-T Recommendation G.722.2
22. ITU-T (2005) Low-complexity coding at 24 and 32 kbit/s for hands-free operation in sys-
tems with low frame loss. ITU-T Recommendation G.722.1
23. ITU-T (2008) Low-complexity, full-band audio coding for high-quality, conversational ap-
plications. ITU-T Recommendation G.719. http://www.itu.int/rec/T-REC-G.719-200806-I
24. ITU-T (2008) Wideband embedded extension for G.711 pulse code modulation. ITU-T
G.711.1
25. Jayant NS (1974) Digital coding of speech waveforms: PCM, DPCM and DM quantizers.
Proc IEEE 62:611–632
26. Kondoz AM (2004) Digital speech: coding for low bit rate communication systems, 2nd ed.
Wiley, New York. ISBN:0-470-87008-7
27. Mkwawa IH, Jammeh E, Sun L, Ifeachor E (2010) Feedback-free early VoIP quality adapta-
tion scheme in next generation networks. In: Proceedings of IEEE Globecom 2010, Miami,
Florida
28. Schroeder MR (1966) Vocoders: analysis and synthesis of speech. Proc IEEE 54:720–734
29. Sun L, Ifeachor E (2006) Voice quality prediction models and their applications in VoIP
networks. IEEE Trans Multimed 8:809–820
30. TIA/EIA (1997) Enhanced Variable Rate Codec (EVRC). TIA-EIA-IS-127. http://www.
3gpp2.org/public_html/specs/C.S0014-0_v1.0_revised.pdf
31. Tremain TE (1982) The government standard linear predictive coding algorithm: LPC-10.
Speech Technol Mag 40–49
32. Vos K, Jensen S, et al (2009) SILK speech codec. IETF RFC draft-vos-silk-00
3 Video Compression
Compression in VoIP is the technical term which refers to the reduction of the size
and bandwidth requirement of voice and video data. In VoIP, ensuring acceptable
voice and video quality is critical for acceptance and success. However, quality is
critically dependent on the compression method and on the sensitivity of the com-
pressed bitstream to transmission impairments. An understanding of standard voice
and video compression techniques, encoders and decoders (codecs) is necessary in
order to design robust VoIP applications that ensure reliable and acceptable qual-
ity of delivery. This understanding of the techniques and issues with compression
is important to ensure that appropriate codecs are selected and configured properly.
This chapter firstly introduces the need for media compression and then explains
some basic concepts for video compression, such as video signal representation,
resolution, frame rate, lossless and lossy video compression. This is followed by
video compression techniques including predictive coding, quantisation, transform
coding and interframe coding. The chapter finally describes the standards in video
compression, e.g. H.120, H.261, MPEG1&2, H.263, MPEG4, H.264 and the latest
HEVC (High Efficiency Video Coding) standard.
In the recent past, personal and business activities have been heavily impacted by
the Internet and mobile communications which have gradually become pervasive.
New consumer devices such as mobile phones have increasingly been integrated
with mobile communications capabilities that enable them to access VoIP services.
Advances in broadband and wireless network technologies such as third generation
(3G), the emerging fourth generation (4G) systems, and the IEEE 802.11 WLAN
standards have increased channel bandwidth. This has resulted in the proliferation
of networked multimedia communications services such as VoIP.
Although the success of VoIP services depends on viable business models and device capability, it also depends to a large extent on the perceived quality of service
(PQoS) and the quality of experience (QoE). The Internet is notorious for its limited and time-varying bandwidth, while wireless channels suffer from limited and fluctuating channel bandwidth and block error rate (BLER). These limitations make the reliable delivery of VoIP services over the Internet and wireless channels a challenge.
A video sequence is generally captured and represented in its basic form as a
sequence of images which are commonly referred to as frames. The frames are dis-
played at a constant rate, called frame rate (frames/second). The most commonly
used frame rates are 30 and 25 frames per second. Analogue video signals are
produced by scanning a 2-D scene which is then converted to a 1-D signal. The
analogue video signal is digitised by the process of filtering, sampling and quanti-
sation [4].
The filtering process reduces the aliasing effect that would otherwise be in-
troduced as a result of sampling. The choice of sampling frequency influences
the quality of the video and its bandwidth requirement. A sampling frequency of
13.5 MHz has been recommended by International Radio Consultative Committee
CCIR [3] for broadcast quality video with the quantised samples being coded using
8 bits and 256 levels. Digitising analogue video signals results in huge amount of
data which require a high transmission bandwidth because of the process of dig-
itally sampling each video frame in light intensity (luminance) and color compo-
nents (chrominance) and sending it at 25/30 frames per second. As an example,
a typical 625-line encoded video with 720 pixels by 576 lines, 25 frames/second
with 24 bits representing luminance and colour components requires 249 Mbit/s
(720 × 576 × 25 × 24 = 248832000). Clearly, the transmission of raw video in VoIP
services is unrealistic and very expensive given its huge bandwidth requirements
and the limited channel bandwidth of the Internet and wireless channels.
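The raw bit rate figures quoted above follow directly from resolution, frame rate and bits per pixel; a small sketch of the calculation (the CIF and QCIF comparisons are illustrative additions, all assuming 24-bit RGB):

```python
# Raw (uncompressed) video bit rate = width x height x bits_per_pixel x frame_rate.
def raw_video_rate_mbps(width, height, bits_per_pixel, fps):
    return width * height * bits_per_pixel * fps / 1e6  # Mbit/s

# 625-line broadcast example from the text: 720x576, 24 bits/pixel, 25 fps.
print(f"720x576 @25fps, 24 bpp: {raw_video_rate_mbps(720, 576, 24, 25):.1f} Mbit/s")

# Common conferencing formats for comparison (24 bpp assumed here).
for name, (w, h) in {"CIF": (352, 288), "QCIF": (176, 144)}.items():
    print(f"{name} @25fps, 24 bpp: {raw_video_rate_mbps(w, h, 24, 25):.1f} Mbit/s")
```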
It is necessary to compress the video to reduce its bandwidth requirements and
make its transmission over current transmission media realistic and cheap. Video
compression involves the removal of redundancies inherent in the video signal.
The removal of redundancies is the job of a video CODEC (enCOder DECoder).
There are correlations between two successive frames in a video sequence (tempo-
ral redundancy). Subtracting the previous frame from the current frame and only
sending the resulting difference (residual frame) leads to a significant reduction in
transmission bandwidth requirement. The process of only sending the difference
of successive frames instead of the current frame is called inter-frame or temporal
redundancy removal. Compression can be increased by the further removal of cor-
relation between adjacent pixels within a frame (spatial redundancy), color spectral
redundancy and redundancy due to the human visual system.
Coding techniques such as the Motion Picture Expert Group (MPEG) use block-
based motion compensation techniques, together with predictive and interpolative
coding. Motion estimation and compensation removes temporal redundancies in the
video sequence [11]. Spatial redundancies are removed by converting the sequences
into a transform domain such as the Discrete Cosine Transform (DCT) [1, 9, 13] and
then quantising, followed by Variable Length Coding (VLC) [5, 16] of the trans-
formed coefficients to reduce the bit rate. The efficiency of a compression scheme depends on how effectively these redundancies are removed.
3.2 Video Compression Basics
As illustrated in Table 3.2, video formats can be represented by different video reso-
lutions, such as Common Intermediate Format (CIF) and Quarter CIF (QCIF). Each
resolution can be represented by the number of horizontal pixels × the number of
vertical pixels. For example, a CIF format (or resolution) is represented as 352 × 288
which means there are 352 pixels at the horizontal dimension and 288 pixels at
the vertical dimension. Each pixel of a grayscale image can be represented using 8 bits, with a value from 0 (black) to 255 (white). A pixel of a colour image needs 24 bits, as shown in Fig. 3.1, where each pixel is represented by three colour components, Red (R), Green (G) and Blue (B), with each colour component represented using 8 bits. For example, the middle left pixel is represented as [208, 132, 116], which is dominated by the red colour.
The above Red, Green and Blue colour representation, also known as the RGB colour format, is normally transformed to the YUV (or YCbCr) format based on Eq. (3.1) [15], where Y represents the luminance component, and U and V (or Cb and Cr) represent the chrominance components.
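Eq. (3.1) itself is not reproduced here; as an illustration, the sketch below uses the commonly cited ITU-R BT.601 conversion coefficients (an assumption on our part, which may differ in detail from the exact equation given in [15]):

```python
# RGB -> YCbCr conversion for one pixel, using the widely used BT.601 full-range
# form (assumed here for illustration; the book's Eq. (3.1) may differ in detail).
def rgb_to_ycbcr(r, g, b):
    y  =  0.299 * r + 0.587 * g + 0.114 * b          # luminance
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128    # blue-difference chrominance
    cr =  0.500 * r - 0.419 * g - 0.081 * b + 128    # red-difference chrominance
    return round(y), round(cb), round(cr)

# The reddish pixel [208, 132, 116] from Fig. 3.1:
print(rgb_to_ycbcr(208, 132, 116))
```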
Table 3.1 Required bits per pixel under different colour sub-sampling schemes
Format (YUV) Y (bits/pixel) Cb (bits/pixel) Cr (bits/pixel) total bits/pixel
4:4:4 8 8 8 24
4:2:2 8 4 4 16
4:2:0 8 2 2 12
The video bandwidth requirement also depends on the resolution used. Table 3.2 shows the bandwidth requirement for selected video formats based on the Common Intermediate Format (CIF), 25 frames per second and 4 : 2 : 0 (YCbCr or YUV) sampling.
The huge amounts of data needed to represent high-quality voice and video make their transmission over the Internet and wireless channels impractical and very expensive (the bandwidth requirement for voice is less severe). These data and bandwidth requirements make data compression essential. The current level of compression has been achieved over several decades, and advances in computing power and signal processing have enabled the implementation of video compression in a variety of equipment such as mobile telephones.
There are in general two basic methods of compression. These are lossy and
lossless compression methods. These two compression methods are generally used
together because combining them achieves greater compression. Lossless compres-
sion techniques do not introduce any distortion to the original video, and an exact copy of the input to the compressor can be recovered at the output. Lossy video compression techniques, on the other hand, introduce distortion, and it is impossible to recover an exact replica of the input video at the output. Clever techniques have been developed to tailor the introduced distortion to the characteristics of the Human Visual System (HVS) so that it is minimally perceptible.
The entropy H of a source whose symbols occur with probabilities pi is defined as

H = − Σ_(all i) pi log2 pi = Σ_(all i) pi log2 (1/pi)    (3.2)

H = Σ_(all i) pi log2 (1/pi)    (3.3)
The Huffman code provides compression when the set of input video symbols has an entropy that is much less than log2 (number of symbols). In other words, it works better with data that have a highly non-uniform Probability Density Function (PDF). However, the PDF of typical video sequences does not fit this profile.
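To make the idea concrete, the sketch below builds a Huffman code for a small, skewed symbol distribution and compares the average code length with the entropy of Eq. (3.2) (an illustrative example with made-up probabilities):

```python
# Build a Huffman code for a toy symbol distribution and compare the average
# code length with the source entropy H = -sum(p_i * log2(p_i)).
import heapq
import math

probs = {"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.1}  # made-up, non-uniform PDF

def huffman_code(probs):
    # Each heap entry: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)       # two least probable subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

code = huffman_code(probs)
entropy = -sum(p * math.log2(p) for p in probs.values())
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code)   # e.g. {'a': '0', 'b': '10', 'c': '111', 'd': '110'}
print(f"entropy = {entropy:.3f} bits, average code length = {avg_len:.2f} bits")
```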
Predictive coding for video is typically realised as a DPCM (Differential Pulse Code Modulation) scheme, with its block diagram shown in Fig. 3.3. At the encoder side, an estimate of the current signal is predicted from previous signal samples (via the predictor). Only the difference signal, or prediction error signal (e(m, n) = s(m, n) − ŝ(m, n)), is sent to the quantiser and then coded by the entropy encoder before being sent to the channel. At the decoder side, the binary input signal is first decoded by the entropy decoder, and the reconstructed signal (s′(m, n)) is then obtained by adding the signal estimate generated by the predictor.
Unlike the speech ADPCM scheme, which is one-dimensional, the video DPCM scheme is based on a two-dimensional (2D) signal. As shown in Fig. 3.3, the video source signal is represented by a 2D signal (s(m, n)), with m and n representing the horizontal and vertical position of a pixel. The prediction for the current source signal can be based on previous samples within the same picture (intra-coding, exploiting spatial correlation) or on samples belonging to previous pictures (inter-coding, exploiting temporal correlation).
Similar to ADPCM in speech coding, at least 1 bit is needed to code each predic-
tion error. Thus DPCM coding is not suitable for low bit rate video coding.
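As a simple illustration of intra-picture DPCM, the sketch below uses the previous pixel in the same row as the predictor and codes only the prediction error (a toy example; practical codecs use more elaborate 2-D predictors and quantisers):

```python
# Toy previous-pixel DPCM over one image row: encode prediction errors, then
# reconstruct the row by adding each error back to the running prediction.
import numpy as np

row = np.array([100, 102, 105, 105, 104, 110], dtype=int)  # pixel values s(m, n)

# Encoder: e(n) = s(n) - s_hat(n), with s_hat(n) = reconstructed previous pixel.
errors = np.empty_like(row)
prediction = 0
for i, sample in enumerate(row):
    errors[i] = sample - prediction      # prediction error sent to the channel
    prediction = prediction + errors[i]  # encoder tracks the decoder's state

# Decoder: rebuild each pixel from the previous reconstruction plus the error.
reconstructed = np.cumsum(errors)
print("errors:       ", errors)          # small values clustered around zero
print("reconstructed:", reconstructed)   # identical to the original row (no quantiser used)
```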
3.5.2 Quantisation
Further compression can also be achieved by subsampling the video data both vertically and horizontally.
Transform coding is different from predictive coding methods, but like predictive coding its purpose is to exploit the spatial correlations within the picture and to conceal compression artifacts. Transform coding does not generally work on the
whole frame but rather on small blocks such as 8 × 8 blocks. Transform coding
literally transforms the video signal from one domain to another domain in such
a way that the transformed elements become uncorrelated. This allows them to be
individually quantised. Although the Karhunen–Loève Transform (KLT) is theoretically maximally efficient, it has implementation limitations. The Discrete Cosine Transform (DCT), which is less efficient than the KLT, is used instead because it is straightforward to compute; it is therefore used in JPEG and MPEG compression. Figure 3.4 illustrates a block diagram for a video encoder and decoder (codec) based on the DCT. We use x(n) to represent the video samples in the time domain and y(k) to represent the transformed coefficients in the frequency domain. The transformed DCT coefficients are quantised and coded using Variable Length Coding (VLC) (e.g., Huffman coding). The coded bitstream is packetised and sent to the channel/network. At the decoder, the bitstream from the channel may be different from the bitstream generated by the encoder due to channel errors or network packet loss. Thus, we use x′(n) and y′(k) to differentiate the decoder-side signals from those used at the encoder.
The DCT requires only one set of orthogonal basis functions for each ‘frequency’. For a block of N × 1 picture elements (pixels), expressed by x(n), the forward DCT coefficients y[k] and the inverse transform are given by:

y[0] = (1/√N) Σ_(n=0..N−1) x(n)    (3.4)

y[k] = √(2/N) Σ_(n=0..N−1) x(n) · cos(kπ(2n + 1)/(2N)),  k = 1, . . . , N − 1    (3.5)

x′[n] = (1/√N) y′[0] + √(2/N) Σ_(k=1..N−1) y′[k] · cos(kπ(2n + 1)/(2N)),  n = 0, . . . , N − 1    (3.6)
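A direct numerical sketch of Eqs. (3.4)–(3.6), implementing the forward and inverse 1-D DCT for a small block and checking that the inverse reproduces the input (exact here, since no quantisation is applied; the pixel values are arbitrary examples):

```python
# 1-D DCT of an N x 1 pixel block as in Eqs. (3.4)-(3.6), plus its inverse.
import numpy as np

def dct_forward(x):
    N = len(x)
    y = np.zeros(N)
    y[0] = x.sum() / np.sqrt(N)                       # Eq. (3.4)
    n = np.arange(N)
    for k in range(1, N):                             # Eq. (3.5)
        y[k] = np.sqrt(2.0 / N) * np.sum(x * np.cos(k * np.pi * (2 * n + 1) / (2 * N)))
    return y

def dct_inverse(y):
    N = len(y)
    k = np.arange(1, N)
    x = np.zeros(N)
    for n in range(N):                                # Eq. (3.6)
        x[n] = y[0] / np.sqrt(N) + np.sqrt(2.0 / N) * np.sum(
            y[1:] * np.cos(k * np.pi * (2 * n + 1) / (2 * N)))
    return x

block = np.array([52, 55, 61, 66, 70, 61, 64, 73], dtype=float)  # 8 sample pixels
coeffs = dct_forward(block)
print(np.round(coeffs, 2))                      # energy concentrated in low-order coefficients
print(np.allclose(dct_inverse(coeffs), block))  # True: exact reconstruction without quantisation
```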
Standard video codecs all follow a generic structure that consists of motion esti-
mation and compensation to remove temporal or inter-frame redundancies, trans-
form coding to manipulate the resultant PDF of the transform coefficients and de-
correlate inter and intra pixel redundancies, and entropy coding to remove statistical
redundancies. They are intended for lossy compression of naturally occurring video sequences. Significant gains in coding efficiency have been achieved over the years through improvements to the algorithms of this generic structure, resulting in much more advanced standard codecs. The largest compression gains have come from motion estimation and compensation. Another area that has seen significant improvement is computational complexity.
Video standards are mainly from ITU-T (International Telecommunication
Union, H series standards, e.g., H.261, H.263 and H.264) and from the Motion
Picture Experts Group (MPEG), a working group formed by the ISO (International Organization for Standardization) and the IEC (International Electrotechnical Commission) to
set standards for audio and video compression and transmission. The well-known
MPEG standards are MPEG1, MPEG2 and MPEG4.
This section describes the various standard video codecs that have been devel-
oped so far through successive refinement of the various coding algorithms.
3.6.1 H.120
This was the first video coding developed by the International Telegraph and Tele-
phone Consultative Committee (CCITT, now ITU-T) in 1984. It was targeted for
video conferencing applications. It was based on Differential Pulse Code Mod-
ulation (DPCM), scalar quantisation, and conditional replenishment. H.120 supported bit rates aligned to T1 and E1 links, i.e. 1.544 and 2.048 Mbit/s. This codec was abandoned not long afterwards with the development of the H.261 video standard.
3.6.2 H.261
This codec (H.261 [6]) was developed as a replacement for H.120 and is widely regarded as the origin of modern video compression standards. H.261 introduced the hybrid coding structure that is predominantly used in current video codecs. It used 16 × 16 macroblock (MB) motion estimation and compensation, an 8 × 8 DCT block, zig-zag scanning of DCT coefficients, scalar quantisation and variable length coding (VLC). It was also the first codec to use a loop filter for the removal of block boundary artifacts. H.261 supports bitrates of p × 64 kbit/s (p = 1–30), which range from 64 kbit/s to 1920 kbit/s (64 kbit/s is the base rate for ISDN links). Although it is still used, H.261 has been replaced by the H.263 video codec.
3.6.3 MPEG-1 and MPEG-2
MPEG-1 was developed in 1991, mainly for video storage applications on CD-ROM. MPEG-1 is based on the same structure as H.261. However, it was the first codec to use bi-directional prediction, in which bi-directional pictures (B-pictures) are predicted from anchor intra pictures (I-pictures) and predictively coded pictures (P-pictures). It has much better picture quality than H.261, operates at bitrates up to 1.5 Mbit/s for CIF picture sizes (352 × 288 pixels), and has an improved motion estimation algorithm compared with H.261.
The MPEG-2 coding standard, which is also known as H.262, was developed around 1994/95 for DVD and Digital Video Broadcasting (DVB). The main differences between this codec and the MPEG-1 standard are the introduction of interlaced scanning pictures to increase compression efficiency and the provision of scalability, which enables channel adaptation. MPEG-2 was targeted towards high quality video with bitrates that range between 2 and 20 Mbit/s. It is not generally suitable for low bit rate applications such as VoIP, which has bitrates below 1 Mbit/s.
The MPEG-2 video standard is a hybrid coder that uses a mixture of intraframe
coding to remove spatial redundancies and Motion Compensated (MC) interframe
coding to remove temporal redundancies [12]. Intraframe coding exploits the spatial
correlation of nearby pixels in the same picture, while interframe coding exploits the
correlation between adjacent pixels in the corresponding area of a nearby picture to
achieve compression. In intraframe coding, the pixels are transformed into the DCT
domain, resulting in a set of uncorrelated coefficients, which are subsequently quan-
tised and VLC encoded. Interframe coding removes temporal redundancy by using
reference picture(s) to predict the current picture being encoded and the prediction
error is transformed, quantised and encoded [4]. In MPEG-2 standard [11] either a
past or future picture can be used for the prediction of the current picture being en-
coded, and the reference picture(s) can be located more than one picture away from
the current picture being encoded.
The DCT removes spatial redundancy in a picture or block by mapping a set of
N pixels into a set of N uncorrelated coefficients that represent the spatial frequency
components of the picture or pixel block [1]. This transformation does not yield
any compression by itself, but concentrates the transformed coefficients in the low
frequency domain of the transform. Compression is achieved by discarding the least
important coefficients to the human visual system and the remaining coefficients are
not represented with full quality. This process is achieved through the quantisation
of the transformed coefficients using visually weighted factors [4]. Quantisation
is a nonlinear process and it is nonreversible. The original coefficients cannot be
reconstructed without error once quantisation has taken place. Further compression
is then achieved by VLC coding of the quantised coefficients.
MPEG-2 is a coding standard intended for moving pictures and was developed
for video storage, delivery of video over Telecommunications networks, and for
multimedia applications [11]. For streaming compressed video over IP, MPEG-2
bit streams are normally encoded at the Source Intermediate Format (SIF) size of
352 × 288 pixels and at a temporal resolution of 25 f/s for Europe and 352 × 240
pixels and 30 f/s for America [14]. The MPEG-2 standard defines three main picture
types:
• I: These pictures are intracoded without reference to any other picture. They provide an access point to the sequence for decoding and have moderate compression levels.
• P: These pictures are predictively coded with reference to past I or P pictures,
and are themselves used as reference for coding of future pictures.
• B: These are the bidirectionally coded pictures, and require both past and future
pictures to be coded and are not used as a reference to code other pictures.
Figure 3.5 shows a simplified model of an MPEG-2 encoder. The frame reorder-
ing process allows the coding of the B pictures to be delayed until the I and P
pictures are coded. This allows the I and the P pictures to be used as reference
in coding the B pictures. DCT performs the transformation into the DCT domain,
Quantise performs the quantisation process and VLC is the variable length coding
process. A buffer (BUF) is used for rate control and smoothing of the encoded bit rate. The frame store and predictors are used to hold pictures to enable predictive
coding of pictures.
An MPEG-2 encoded video has a hierarchical representation of the video signal
as shown on Fig. 3.6.
• Sequence: This is the top layer of the hierarchy and is a sequence of the input
video. It contains a header and a group of pictures (GOP).
• GOP: This coding unit provides for random access into the video sequence and is defined by two parameters: the distance between anchor pictures (M) and the total number of pictures in a GOP (N). A GOP always starts with an intraframe (I) picture and contains a combination of predictive (P) and bi-directional (B) coded pictures (a short sketch after this list illustrates how M and N determine the picture pattern).
• Picture: This is the main coding and display unit and can be I, P or B type. Its
size is determined by the spatial resolution required by an application.
• Slice: This consists of a number of macroblocks (MBs) and is the smallest self-contained coding and re-synchronisation unit.
• Macroblock: This is the basic coding unit of the pictures and consists of blocks
of luminance and chrominance.
• Block: This is an 8 × 8 pixel block and is the smallest coding unit in the video
signal structure and is the DCT unit.
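The sketch below (an illustrative helper, with M and N values chosen arbitrarily) generates the display-order picture-type pattern of one GOP from the two parameters defined above:

```python
# Build the display-order picture-type pattern of one MPEG-2 GOP.
# N = total pictures in the GOP, M = distance between anchor (I/P) pictures.
def gop_pattern(n: int, m: int) -> str:
    pattern = []
    for i in range(n):
        if i == 0:
            pattern.append("I")          # every GOP starts with an intra picture
        elif i % m == 0:
            pattern.append("P")          # anchor pictures every M positions
        else:
            pattern.append("B")          # bi-directional pictures in between
    return "".join(pattern)

print(gop_pattern(12, 3))  # IBBPBBPBBPBB - a commonly quoted broadcast pattern
print(gop_pattern(15, 3))  # IBBPBBPBBPBBPBB
```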
Fig. 3.7 Group of Blocks (GoBs) and Macroblocks (MBs) in H.263 (for CIF format)
3.6.4 H.263
This video coding standard was developed in 1996 [7] as a replacement for the H.261 video coding standard. It was intended for low bit rate communication, such as video conferencing applications. It supports standard video formats
based on Common Intermediate Format (CIF), which includes sub-QCIF, QCIF,
CIF, 4CIF and 16CIF. It utilises DCT to reduce spatial redundancy and motion com-
pensation prediction for removing temporal redundancy. The YUV format applied is
4 : 2 : 0 and the standard Picture Clock Frequency (PCF) is 30000/1001 (approxi-
mately 29.97) times per second.
An H.263 picture is made up of Groups of Blocks (GoBs) or slices, each of which consists of k × 16 lines, where k depends on the picture format (k = 1 for QCIF and CIF, k = 2 for 4CIF and k = 4 for 16CIF). So a CIF picture consists of 18 GoBs (288/16 = 18), and each GoB contains one row of macroblocks (MBs), as shown in Fig. 3.7. An MB contains four blocks of luminance components and two blocks of chrominance components (one for Cb and one for Cr) for the YUV 4 : 2 : 0 format. The positions of the luminance and chrominance component blocks are also shown in Fig. 3.7.
The number of pixels (horizontal × vertical or width × height) for the lumi-
nance and chrominance components for each H.263 picture format are summarised
in Table 3.3.
H.263 has seven basic picture types: I-picture, P-picture, PB-picture, Improved PB-picture, B-picture, EI-picture and EP-picture. Within these seven picture types, only the I-picture and P-picture are mandatory. The I-picture is an intra-coded picture with no reference to other pictures for prediction; it exploits only the spatial redundancy within the picture itself.
Table 3.3 Number of pixels (horizontal × vertical) for luminance and chrominance components
for H.263
Format Luminance (Y) Chrominance (Cb ) Chrominance (Cr )
Sub-QCIF 128 × 96 64 × 48 64 × 48
QCIF 176 × 144 88 × 72 88 × 72
CIF 352 × 288 176 × 144 176 × 144
4CIF 704 × 576 352 × 288 352 × 288
16CIF 1408 × 1152 704 × 576 704 × 576
3.6.5 MPEG-4
MPEG-4 was developed in 1999 and has many similarities to the H.263 design. MPEG-4 has the capability to code multiple objects within a video frame. It has many application profiles and levels. It is considerably more complex than MPEG-1 and MPEG-2 and is regarded as a compression toolset rather than a codec in the strict sense of MPEG-1 and MPEG-2.
MPEG4 was developed mainly for storing and delivering multimedia content
over the Internet. It has bit rates from 64 kb/s to 2 Mb/s for CIF and QCIF formats.
Its simple profile (level 0) is normally used in 3G video call applications (e.g., for
QCIF operating at 64 kbit/s).
3.6.6 H.264
H.264, which is also known as Advanced Video Coding (AVC), MPEG-4 Part 10 or the Joint Video Team (JVT) codec, is the most advanced state-of-the-art video codec; it was standardised in 2003 [8]. Its use in applications is wide ranging and includes broadcast with set-top boxes, DVD storage, use in IP networks, multimedia telephony and networked multimedia such as VoIP. It has a wide range of bit rates and quality resolutions and supports HDTV, Blu-ray disc storage, applications with limited computing resources such as mobile phones, video conferencing and mobile applications. H.264 uses a fixed-point implementation and it is network friendly
in that it has a video coding layer (VCL) and a network abstraction layer (NAL).
It uses predictive intra-frame coding, multi-frame and variable block size motion
compensation and has an increased range of quantisation parameters.
3.7 Illustrative Worked Examples

3.7.1 Question 1
Calculate the bandwidth requirement for QCIF video, 25 frames per second, with
8 bits per component and 4 : 2 : 0 format. If the YUV format is changed to 4 : 2 : 2,
what is the bandwidth requirement?
SOLUTION: For QCIF video, the video resolution is 176 × 144. For the 4 : 2 : 0 format, 12 bits are required for each pixel, so the bandwidth requirement is 176 × 144 × 12 × 25 = 7603200 bit/s ≈ 7.6 Mbit/s. For the 4 : 2 : 2 format, 16 bits are required for each pixel, so the bandwidth requirement is 176 × 144 × 16 × 25 = 10137600 bit/s ≈ 10.1 Mbit/s.
Fig. 3.8 GoBs and MBs for H.263 (based on QCIF format)
3.7.2 Question 2
Illustrate the concept of Group of Blocks (GoBs) and Macroblocks (MB) used in
H.263 coding based on QCIF format. Decide the number of GoBs and MBs con-
tained in a picture of QCIF format.
SOLUTION: Figure 3.8 illustrates the concept of Group of Blocks (GoBs) and
Macroblocks (MBs) used in H.263 coding for the QCIF format. For QCIF format,
there are 9 GoBs (144/16 = 9).
As shown in the figure, a picture contains a number of Group of Blocks (from
GoB 1 to GoB N). Each GoB may contain one or several rows of Macroblocks
(MBs) depending on the video format. For the QCIF format, one GoB contains only
one row of MBs as shown in the figure. Each GoB contains 11 MBs (176/16 = 11).
This results in a total of 99 MBs for one picture of the QCIF format (11 × 9 = 99).
Each MB contains four blocks for luminance components (four blocks for Y) and
two blocks for chrominance components (one block for Cb and one block for Cr )
which is equivalent to YUV 4 : 2 : 0 format. Each block consists of 8 × 8 pixels
which is the basic block for DCT transform.
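The GoB and MB counts for the other H.263 picture formats follow from the same arithmetic; a small sketch (using the k values and luminance resolutions listed earlier):

```python
# Number of GoBs and macroblocks per picture for the H.263 formats.
# k = number of MB rows per GoB; resolution = luminance width x height.
formats = {
    "Sub-QCIF": (128, 96, 1),
    "QCIF":     (176, 144, 1),
    "CIF":      (352, 288, 1),
    "4CIF":     (704, 576, 2),
    "16CIF":    (1408, 1152, 4),
}

for name, (width, height, k) in formats.items():
    mb_cols = width // 16                 # macroblocks per row (16x16 luminance)
    mb_rows = height // 16                # macroblock rows per picture
    gobs = mb_rows // k                   # each GoB spans k x 16 lines
    mbs = mb_cols * mb_rows
    print(f"{name}: {gobs} GoBs, {mbs} MBs per picture")
```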
3.7.3 Question 3
3.8 Summary
The acceptance and success of VoIP services is very much dependent on their ability to provide services with an acceptable quality of experience. The video coding techniques and the selection of video codecs are key to providing the required QoE. It is paramount for designers and VoIP service providers to understand the issues with compression in order to select the appropriate coding techniques and codecs that will enable them to provide the necessary QoE to their users.
This chapter discussed video compression in the context of VoIP services. The chapter started by describing the need for compression. It then described the basic techniques for video compression, including lossless and lossy video compression, with a focus on lossy compression techniques such as predictive coding, quantisation, transform coding and interframe coding. The chapter then gave a description of the most popular video coding standards, including MPEG1, MPEG2, MPEG4, H.261, H.263, H.264 and the latest HEVC. It also showed the evolution of these standards.
3.9 Problems
References
1. Chen W, Smith C, Fralick S (1979) A fast computational algorithm for the discrete cosine
transform. IEEE Trans Commun 1004–1009
2. Choi H, Nam J, Sim D, Bajic IV (2011) Scalable video coding based on high efficiency video
coding (HEVC). In: Proceedings of 2011 IEEE Pacific Rim conference on communications,
computers and signal processing (PacRim), pp 346–351
3. Encoding parameters of digital television for studios, digital methods of transmitting televi-
sion information. ITU-R BT.601 (2011)
4. Ghanbari M (2003) Standard codecs image compression to advanced video coding. IEE,
London. ISBN:0-85296-0-710-2
5. Huffman D (1952) A method for the construction of minimum redundancy codes. In: Proceedings of the IRE, vol 40, pp 1098–1101
6. ITU-T (1993) Video codec for audiovisual services at p × 64 kbit/s. ITU-T H.261
7. ITU-T (1996) Video coding for low bit rate communication. ITU-T H.263
8. ITU-T (2003) Advanced video coding for generic audiovisual services. ITU-T H.264
9. Jain AK (1989) The fundamentals of digital image processing. Prentice Hall, New York
10. Jayant NS, Noll P (1984) Digital coding of waveforms: principles and applications to speech
and video. Prentice Hall, London
11. MPEG-1: Coding of moving pictures and associated audio for digital storage media at up to
about 1.5 Mbit/s. ISO/IEC (1991)
12. Netravali AN, Haskell BG (1988) Digital pictures: representation and compression. Plenum,
New York
13. Oppenheim AV, Schafer RW (1990) Discrete-time signal processing. Prentice Hall, New
York
14. Rosdiana E (2000) Transmission of transcoded video over ABR networks. Master’s thesis,
University of Essex
15. Symes P (1998) Video compression. McGraw Hill, New York. ISBN:0-07-063344-4
16. Witten IH, Neal RM, Cleary JG (1987) Arithmetic coding for data compression. Commun
ACM 520–540
4 Media Transport for VoIP
TCP and UDP are the most commonly used transport layer protocols. TCP is a
connection-oriented, reliable, in-order transport protocol. Its features such as re-
transmission, flow control and congestion control are not suitable for real-time mul-
timedia applications such as VoIP. UDP is a connectionless and unreliable transport
protocol. Its simple header, non-retransmission and non-congestion-control features
make it suitable for real-time applications. However, as UDP does not have a sequence number in the UDP header, the media stream packets transferred over UDP may be duplicated or arrive out of order. This can make the received media (e.g., voice or video) unrecognisable or unviewable. The Real-time Transport Protocol (RTP) was developed to assist the transfer of real-time media streams on top of the unreliable UDP protocol. It has many fields, such as the sequence number (to detect packet loss), the timestamp (to locate the media packet in time) and the payload type (to identify the voice or video codec used). The associated RTP Control Protocol (RTCP) was also developed to assist media control and
QoS/QoE management for VoIP applications. This chapter presents the key con-
cepts of RTP and RTCP, together with detailed header analysis based on real trace
data using Wireshark. The compressed RTP (cRTP) and bandwidth efficiency issues
are also discussed together with illustrative worked examples for VoIP bandwidth
calculation.
After voice/video has been compressed via the encoder at the sender side, the com-
pressed voice/video bit streams need to be packetised and then sent over packet
based networks (e.g., IP networks). For voice over IP, one packet normally contains one or several speech frames. For example, for G.729, one speech frame contains 10 ms of speech samples. If one packet contains one speech frame, then every 10 ms one IP packet will be sent to the IP network (via the network interface). If one packet contains two speech frames, then every 20 ms one
IP packet, which contains two speech frames, will be sent to the network. If more speech frames are put into an IP packet, the end-to-end transmission delay becomes longer, which will affect the quality of the VoIP session, but transmission bandwidth is used more efficiently (considering that the same protocol headers need to be added to each packet). There is always a tradeoff in deciding the right number of speech frames to be put in an IP packet.
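The bandwidth side of this tradeoff can be made concrete with a small sketch: it computes the packet rate and the total IP-level bandwidth for G.729 (8 kb/s, 10 ms frames) as the number of frames per packet varies, assuming the usual 40 bytes of IP/UDP/RTP headers per packet (header overheads are discussed in detail later in this chapter):

```python
# IP-level VoIP bandwidth for G.729 as a function of frames per packet.
# Assumes 20-byte IP + 8-byte UDP + 12-byte RTP headers = 40 bytes per packet.
CODEC_KBPS = 8          # G.729 bit rate
FRAME_MS = 10           # G.729 speech frame length
FRAME_BYTES = CODEC_KBPS * FRAME_MS // 8   # 10 bytes of payload per frame
HEADER_BYTES = 40

for frames_per_packet in (1, 2, 3, 4):
    packets_per_second = 1000 / (FRAME_MS * frames_per_packet)
    payload = FRAME_BYTES * frames_per_packet
    bandwidth_kbps = packets_per_second * (payload + HEADER_BYTES) * 8 / 1000
    print(f"{frames_per_packet} frame(s)/packet: {packets_per_second:.0f} packets/s, "
          f"{bandwidth_kbps:.1f} kb/s including headers")
```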
In the TCP/IP protocol stack, there are two transport layer protocols: the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). About 90 % of today's Internet traffic comes from TCP-based applications such as HTTP/Web, e-mail, file transfer, instant messaging, online gaming, and some video streaming applications (e.g., YouTube). The remaining 10 % of Internet traffic belongs to UDP-based applications such as the Domain Name System (DNS) and the real-time VoIP applications which are covered in this book.
The TCP protocol, originally defined in RFC 793 in 1981 [3], is a connection-
oriented, point-to-point, reliable transport protocol. Connection-oriented means that
TCP will establish a connection between the sender and the receiver via three-way
handshake before a data transfer session starts (as shown in Fig. 4.1 for TCP header,
flag bits of SYN, ACK are used in the initial connection buildup stage). Each TCP
header contains a 16-bit source port number, a 16-bit destination port number, a 32-bit sequence number, a 32-bit acknowledgement number, a 4-bit TCP header length, flag bits such as FIN (Finish), SYN (Synchronisation), RST (Reset), PSH (Push 'data'), ACK (Acknowledgement) and URG (Urgent), a 16-bit window size, a 16-bit checksum, a 16-bit urgent pointer and options fields. The minimum TCP header is 20 bytes (when there are no options).
The sequence number and the acknowledgement number are used to indicate the location of a sent packet within the sending stream and to acknowledge the receipt of the relevant packets (together with the ACK flag bit). This acknowledgement mechanism, together with retransmission of lost packets, is key to the reliable transmission of TCP packets. Other features such as flow control (through the use of the 16-bit window size) ensure that the sending and receiving processes run at a matching speed (not sending too fast or too slow). The congestion control mechanism is used to adjust the sending bit rate in response to network congestion (e.g., when there is a lost packet, which indicates possible network congestion, the TCP sender will automatically reduce its sending bit rate in order to relieve the congestion in the network). TCP's congestion control mechanism is very important for the healthy operation of the Internet. Due to the above features, TCP is mainly used for transmitting delay-insensitive, highly reliable data applications (such as email, ftp data transfer and http web applications). The acknowledgement, retransmission and congestion control features are not suitable for real-time VoIP applications. The point-to-point and flow control features are also not suitable for voice/video conference applications in which
one stream needs to be sent to several clients. This has made UDP the only option for transmitting VoIP packets.
Compared to TCP, UDP (User Datagram Protocol), originally defined in RFC 768 in 1980 [8], is very simple in its structure and functions. Figure 4.2 shows the UDP header, which contains only 8 bytes: 16 bits for the source port number, 16 bits for the destination port number, 16 bits for the UDP packet length and the remaining 16 bits for the UDP checksum (for some error detection). There is no connection establishment stage, and there are no flow control, congestion control or retransmission mechanisms as provided in TCP. Having no connection stage (connectionless operation) and no retransmission mechanism means that UDP transfer is faster than TCP transfer. Having no sequence number or acknowledgement mechanism means that a UDP transfer does not know the order of its packets and does not know whether a packet is received or not. This makes UDP transfer fast but unreliable. The fast transfer nature of UDP makes it suitable for real-time multimedia applications, such as VoIP, which can also tolerate some degree of packet loss. The absence of point-to-point semantics and flow control makes UDP suitable for both unicast and multicast applications.
From the socket implementation point of view, a UDP packet is simply sent towards the destination side (via its destination socket); whether it reaches the destination depends entirely on the network. Due to the nature of the IP network, some packets may be duplicated and some packets may arrive out of order. It is clear that UDP itself cannot solve the problem of putting the voice or video packets into the right order so that they can be played out properly at the receiver side in VoIP applications. This motivated the development of the Real-time Transport Protocol (RTP), which will be covered in the next section.
For more details about the TCP/UDP protocols, the reader is recommended to
read relevant books on computer networking, such as [6].
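To make the "fire-and-forget" nature of UDP concrete, the following minimal Python sketch hands one datagram to the socket API. The destination address, port and dummy payload are illustrative only and are not taken from a real deployment.

import socket

# Hypothetical destination and payload: UDP gives no delivery or ordering guarantee.
DEST = ("192.168.0.67", 26448)      # example address/port for illustration
payload = b"\x00" * 160             # e.g. 20 ms of G.711 speech (160 bytes)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # connectionless UDP socket
sock.sendto(payload, DEST)          # fire-and-forget: no ACK, no retransmission
sock.close()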
4.3 Real-Time Transport Protocol—RTP
RTP was originally proposed in RFC 1889 [10] in 1996 (now obsolete) and refined
in RFC 3550 [11] in 2003. It aims to support the transfer of real-time multimedia
data over the UDP transport protocol. RTP adds a sequence number in order to identify the loss of RTP packets. Together with the timestamp field, it allows the receiver to play out the received voice/video packets in the right order and at the right position. Other fields such as SSRC and CSRC (explained later in this section) are used to identify the voice or video source involved in a VoIP session, or to
identify the contributing sources which are mixed by the VoIP sender (e.g., in a VoIP
conference situation). RTCP (the RTP Control Protocol), in association with RTP, is
used to monitor the quality of service of a VoIP session and to convey information
about the participants in an on-going session. RTCP packets are sent periodically
and contain sender and/or receiver reports (e.g., for packet loss rate and jitter value).
The RTP header includes mainly the payload type (for audio/video codecs), the
sequence number and the timestamp.
Figure 4.3 shows the RTP header which includes the following header fields.
• V: This field (2 bits) contains the version of the RTP protocol. The version
defined by RFC 1889 [10] is two.
• P: This is the Padding bit indicating whether there are padding fields in the RTP
packet.
• X: This is the eXtension bit indicating whether extension header is present.
• CC: This field (4 bits) contains the CSRC count, the number of contributing
source identifiers.
• M: This is the Marker bit. For voice, this marks the start of a voice talkspurt,
if silence suppression is enabled at the encoder. For example, M is set to 1
for the 1st packet after a silence period and is zero otherwise. For video, the
marker bit is set to one (True) for the last packet of a video frame and zero
otherwise. For example, if an I-frame is split into 8 packets to transmit over
the channel/network, the first seven packets will have the marker bit set to Zero
(false) and the 8th packet (last packet for the I-frame) will have the marker bit
set to One (True). If a P-frame is put into two packets. The first packet will have
the marker bit set to Zero and the second packet’s M bit set as One. If there is
only one packet for a P or B frame, the marker bit will always be One.
• Payload type: This field (7 bits) contains the payload type for voice or video
codecs, e.g. for PCM-μ law, the payload type is defined as zero. The payload types for common voice and video codecs are shown in Tables 4.1 and 4.2, respectively.
• Sequence number: This field (16 bits) contains the sequence number, which is incremented by one for each RTP packet sent; it is used for detecting packet loss.
• Timestamp: This field (32 bits) indicates the sampling instant when the first
octet of the RTP data was generated. It is measured according to media clock
rate. For voice, the timestamp clock rate is 8 kHz for the majority of codecs and 16 kHz for some codecs. For example, for the G.723.1 codec with a frame size of 30 ms (containing 240 speech samples at 8 kHz sampling rate) and one speech frame per packet, the timestamp difference between two consecutive packets will be 240. For speech using the G.723.1 codec, the clock rate is the same as the sampling rate, so the timestamp difference between two consecutive packets is simply determined by the number of speech samples a packet contains.
For video, the timestamp clock rate is 90 kHz for the majority of codecs. The timestamp will be the same on successive packets belonging to the same video frame (e.g., one I-frame segmented into several IP packets, which will all have the same timestamp value in their RTP headers). If a video encoder uses a constant frame rate, for example, 30 frames per second, the timestamp difference between two consecutive packets (belonging to different video frames) will have the same value of 3000 (90,000/30 = 3000), i.e., the media clock difference between two consecutive packets is 3000. If the frame rate is reduced to half (e.g., 15 frames per second), the timestamp increment will be doubled (e.g., 6000).
• SSRC identifier: SSRC is for Synchronisation Source. This field (32 bits) con-
tains the identifier for a voice or video source. Packets originated from the same
source will have the same SSRC number.
• CSRC identifier: CSRC is for Contributing Source. It is only present when the CC field value is nonzero, which means more than one source has been mixed to produce this packet's contents. Each entry in this field (32 bits per entry) contains the identifier of a contributing source; a maximum of 15 entries can be carried (the CC field is 4 bits). More information about the functions of RTP mixers can be found in Perkins's book on RTP [7].
If there are no mixed sources involved in a VoIP session, the RTP header will have the minimum header size of 12 bytes. If more mixed sources (or contributing sources) are involved, the RTP header size will increase accordingly. A minimal sketch of how these fixed header fields can be packed and parsed is given below.
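As an illustration of the fixed header fields described above, the following Python sketch packs and parses the 12-byte RTP header plus any CSRC entries. It is a teaching aid rather than a complete RTP implementation, and the example field values are made up.

import struct

def parse_rtp_header(packet: bytes):
    """Parse the 12-byte fixed RTP header (RFC 3550) and any CSRC entries."""
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    header = {
        "version": b0 >> 6,             # V: 2 bits
        "padding": (b0 >> 5) & 0x01,    # P: 1 bit
        "extension": (b0 >> 4) & 0x01,  # X: 1 bit
        "csrc_count": b0 & 0x0F,        # CC: 4 bits
        "marker": b1 >> 7,              # M: 1 bit
        "payload_type": b1 & 0x7F,      # PT: 7 bits
        "sequence": seq,                # 16 bits
        "timestamp": timestamp,         # 32 bits
        "ssrc": ssrc,                   # 32 bits
    }
    # CSRC list: one 32-bit identifier per contributing source (0 to 15 entries).
    n = header["csrc_count"]
    header["csrc"] = list(struct.unpack("!%dI" % n, packet[12:12 + 4 * n]))
    return header

# Made-up example: version 2, PT 0 (PCMU), marker set, no CSRC entries.
raw = struct.pack("!BBHII", 0x80, 0x80, 61170, 160, 0x1234ABCD) + b"\x00" * 160
print(parse_rtp_header(raw))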
Tables 4.1 and 4.2 show examples of RTP payload types (PT) for voice and video
codecs according to RFC 3551 [9]. Media types are defined as “A” for audio only,
“V” for video only and “AV” for combined audio and video. The payload type for
H.264 is dynamic or profile-defined, which means that the payload type for H.264
can be defined dynamically during a session. The range for dynamically assigned
payload types is from 96 to 127 according to RFC 3551. The clock rate for media
is used to define the RTP timestamp in each packet’s RTP header. The clock rate
for voice is normally the same as the sampling rate (i.e., 8000 Hz for narrowband
speech codecs). The clock rate for video is 90,000 Hz.
In order to have a practical understanding of the RTP protocol and RTP header
fields, we show some trace data examples collected from a VoIP system. The details
on how to set up a VoIP system and how to collect and further analyse the trace data will
be discussed in detail in Chaps. 8 and 9.
Figures 4.4 and 4.5 illustrate an example of RTP trace data for one direction of a voice stream, with Fig. 4.4 presenting an overall picture and Fig. 4.5 showing further information regarding the RTP header. In Fig. 4.4, the filter of “ip.src ==
192.168.0.29 and rtp” was applied in order to get a view of a voice stream sent
from the source station (IP address: 192.168.0.29) to the destination station (IP ad-
dress: 192.168.0.67), and only the RTP packets were filtered out. From these two
figures, it can be seen that all the shown packets (from No. 174 to No. 192) have the
same Ethernet frame length of 214 (bytes).
The sequence number (seq) is incremented by one for each packet (e.g., the 1st
one is 61170 and the 2nd one is 61171). The sequence number can be used easily to
detect whether there is a packet loss.
Fig. 4.5 RTP trace example for voice from Wireshark—more RTP information
The SSRC (synchronisation source) identifier is kept the same for all these packets
(indicating that they are from the same source). The Payload Type (PT) is ITU-T
G.711 PCMU (PCM-μ law) which has the PT value of 0.
The timestamp is incremented by 160 for each packet (i.e., the 1st packet’s
timestamp is 160, the 2nd is 320, and the 3rd is 480). This is equivalent to
160 speech samples for each speech packet which contains 20 ms of speech at
8 kHz sampling rate. In other words, 20 ms speech contains 160 speech samples
(8000 samples/s × 20 ms = 160 samples). This can also be seen from the time difference between two consecutive packets: for example, the time for the first packet (No. 174) is 11.863907 s and the time for the second packet (No. 176) is 11.883855 s, a difference of about 0.02 s or 20 ms.
Table 4.3 Example of RTP timestamp and packet interval for G.711 voice
Packet No. Sequence number Timestamp Packet interval (ms)
174 61170 160 0
176 61171 320 (160 × 2) 20
178 61172 480 (160 × 3) 20
181 61173 640 (160 × 4) 20
183 61174 800 (160 × 5) 20
We list the sequence number, the timestamp and the packet interval for the first five packets in Table 4.3 to further illustrate the concept of the timestamp for a voice call.
If we look at the payload length, we get a value of 160 bytes. The payload length is the same for all the packets (the details of how to obtain the payload length will be explained later in this section). This further demonstrates that this is the PCM codec with 20 ms of speech per packet (160 samples are equivalent to 160 bytes when each sample is coded into 8 bits or 1 byte).
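The arithmetic behind the 160-byte payload and the 160-sample timestamp increment can be reproduced with a few lines of Python; the constants below simply restate the G.711/20 ms settings used in this trace.

SAMPLE_RATE_HZ = 8000    # narrowband sampling (and RTP clock) rate for G.711
PACKET_MS = 20           # speech duration carried in each RTP packet
BITS_PER_SAMPLE = 8      # G.711 codes each sample into one byte

samples_per_packet = SAMPLE_RATE_HZ * PACKET_MS // 1000        # 160 samples
payload_bytes = samples_per_packet * BITS_PER_SAMPLE // 8      # 160 bytes
print(samples_per_packet, payload_bytes)   # expected timestamp increment and payload size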
If we look at the RTP header in more detail for the first packet illustrated in
Fig. 4.4 (Packet No. 174), the Marker bit is set to ONE (True). From Fig. 4.5, only
the first packet (Packet No. 174) has “Mark” listed in the last column, while the remaining packets have no “Mark” entry (i.e., their Marker bit is set to ZERO). As we explained in the previous section, a Marker bit of ONE indicates the beginning of a speech talkspurt. In this example, the marker bit is only set at the beginning of the session; all the remaining packets within the session have their marker bit set to ZERO, indicating that they belong to the same talkspurt. This is due to the fact that no voice activity detection mechanism was enabled in this VoIP testbed. This can also be seen from the steady timestamp changes in the trace data: all packets have the same 160-sample (20 ms) difference in media clock units. If voice activity detection is enabled, packets for silence periods of speech do not need to be transmitted; this results in a large gap in timestamps (more than 160 samples), and the length of the gap will depend on the length of the silence period.
For the RTP payload length calculation, you may expand the header for Ethernet,
IP and UDP as shown in Fig. 4.6. The payload length can be obtained by deducting the protocol header sizes from the Ethernet frame size of 214 bytes (from the 1st line, 214 bytes on wire), from the IP length of 200 bytes (the total length from the IP header), or from the UDP packet length of 180 bytes (the length field from the UDP header).
The differences between them are due to the length of Ethernet header (14 bytes), IP
header (20 bytes) and UDP header (8 bytes) as shown in Fig. 4.7. The length of RTP
header is 12 bytes. The total length of IP/UDP/RTP header is 40 bytes (20 + 8 + 12).
Let us now look at how to calculate the payload length in this trace example.
Fig. 4.6 RTP trace example for voice from Wireshark—more header information
Figure 4.8 further shows the payload information for the packet No. 174. The
codec used is PCM-μ law. You can also double check the payload length of
160 bytes from the bottom panel (each row shows 16 bytes of data and there are
a total of 10 rows).
In the above example, a packet's payload length is 160 bytes, whereas the protocol header length is 54 bytes (14 + 20 + 8 + 12) at the Ethernet level, or 40 bytes
at IP level. If we look at the bandwidth usage, for a pure PCM stream, the required
bandwidth is:
160 × 8 (bits) / 20 (ms) = 64 kb/s
This means that for every 20 ms, 160 bytes of data need to be sent out. This is the
required bandwidth for a PCM system as we discussed in Chap. 2. 64 kb/s PCM is
the reference point for all speech compression codecs.
When transmitting an RTP PCM stream, the required bandwidth at the IP level
(named as IP BW) is:
(160 + 40) × 8 (bits) / 20 (ms) = 80 kb/s
This means that for every 20 ms, 200 bytes of data need to be sent out to the chan-
nel/network. Within 200 bytes of data, 160 bytes belong to the payload (PCM data).
Another 40 bytes are headers of IP, UDP and RTP protocols required to send a voice
packet over the Internet. It is clear that the IP BW of 80 kb/s is higher than the pure
PCM bandwidth of 64 kb/s.
You can also calculate the Ethernet bandwidth (Ethernet BW) which is:
(160 + 54) × 8 (bits) / 20 (ms) = 85.6 kb/s
It is clear that the Ethernet BW of 85.6 kb/s is higher than the IP BW because the
overhead due to the Ethernet header has to be taken into account.
The bandwidth efficiency for transmitting PCM voice stream at the IP level is:
Length of payload / Length of packet at IP level = 160 / (160 + 40) = 0.8 (or 80 %)
Consider now the 8 kb/s G.729 codec, whose 10 ms speech frame gives a payload of 10 bytes (see Table 2.2). The bandwidth efficiency for the one frame per packet case is:
10 / (10 + 40) = 0.2 (or 20 %)
The required IP bandwidth for the two frames per packet case is:
(20 + 40) × 8 (bits) / 20 (ms) = 24 kb/s
The bandwidth efficiency for the two frames per packet case is:
20 / (20 + 40) = 0.33 (or 33 %)
From the above example, we can see that the transmission efficiency when using one G.729 speech frame per packet is very low (only 20 %); in other words, 80 % of the transmission bandwidth is used for transmitting overheads (e.g., IP, UDP and RTP
headers). Increasing the number of speech frames in a packet (e.g., from one frame per packet to two frames per packet) can increase the transmission bandwidth efficiency (from 20 % to 33 % in this example); however, it also increases the packetisation delay from 10 ms to 20 ms and hence the overall delay for VoIP applications. There is therefore a tradeoff in deciding how many speech frames should be put in an IP packet. Many VoIP systems provide a flexible packetisation scheme, such as Skype's SILK codec (see Sect. 2.4.9), which can support 1 to 5 speech frames in a packet.
If we compare the transmission efficiency of 8 kb/s G.729 and 64 kb/s G.711, both with 20 ms of speech per packet, the transmission efficiencies for G.729 and G.711 are 33 % and 80 %, respectively. The lower transmission efficiency for G.729 is due to its higher speech compression rate and thus its smaller payload size.
It is clear that the bandwidth transmission efficiency depends on which codec is used and on how many speech frames are put in an IP packet. When you calculate the required IP bandwidth or bandwidth efficiency, you always need to work out the payload size for the selected codec and the packetisation scheme used. If you have any doubts about how to work out the payload size for a codec, you are advised to reread the contents of Chap. 2, especially Table 2.2. A small calculation sketch is given below.
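The calculations above can be wrapped into a small helper so that different codecs and packetisation schemes can be compared quickly. The Python sketch below uses the header sizes assumed in this chapter and reproduces the G.711 and G.729 numbers; it is illustrative only.

HEADER_IP_UDP_RTP = 40   # bytes: 20 (IP) + 8 (UDP) + 12 (RTP)
HEADER_ETHERNET = 14     # additional bytes of Ethernet header

def ip_bandwidth_kbps(payload_bytes, packet_interval_ms, header_bytes=HEADER_IP_UDP_RTP):
    """Bandwidth when one packet of payload_bytes is sent every packet_interval_ms."""
    return (payload_bytes + header_bytes) * 8 / packet_interval_ms

def efficiency(payload_bytes, header_bytes=HEADER_IP_UDP_RTP):
    """Share of the transmitted bytes that carry actual speech payload."""
    return payload_bytes / (payload_bytes + header_bytes)

# G.711: 160-byte payload every 20 ms -> 80 kb/s at IP level, 80 % efficiency
print(ip_bandwidth_kbps(160, 20), efficiency(160))
# The same stream at the Ethernet level -> 85.6 kb/s
print(ip_bandwidth_kbps(160, 20, HEADER_IP_UDP_RTP + HEADER_ETHERNET))
# G.729, one 10-byte frame per packet (every 10 ms) -> 40 kb/s, 20 % efficiency
print(ip_bandwidth_kbps(10, 10), efficiency(10))
# G.729, two frames per packet (20 bytes every 20 ms) -> 24 kb/s, 33 % efficiency
print(ip_bandwidth_kbps(20, 20), efficiency(20))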
Considering the cost of transmission bandwidth in any communication system, improving the bandwidth efficiency of a VoIP system means cost savings and greater competitiveness for VoIP service providers. This has motivated the work on cRTP (RTP Header Compression, or Compressed RTP), which can compress the 40 bytes of IP/UDP/RTP headers into a 2- or 4-byte cRTP header. The concept of cRTP and the improvement in bandwidth efficiency will be covered in Sect. 4.5.
Figures 4.9 and 4.10 show an example of RTP trace data for one direction of a video
stream. In Fig. 4.9, the filters of “ip.src == 192.168.0.29 and udp.port == 15624
and rtp” were applied in order to get a view of the video stream sent from the
source station (IP address: 192.168.0.29) to the destination station (IP address:
192.168.0.103), and only rtp packets for video are filtered out.
Compared with the filter command used for voice RTP analysis in the previous
section, a port filter part was also added in order to only filter out the video stream
(in this example, voice and video streams were sent through different port pairs for a
video call scenario). From the figures, it can be seen that the lengths of the video packets are variable, not constant as for the PCM voice RTP packets. Packet No. 7170 has the longest packet size (1206 bytes, equivalent to 1152 bytes of RTP payload), indicating an I-frame (why 1152 bytes? You should be able to work it out by yourself now; if not, go back to the previous section and work out the header lengths for Ethernet, IP, UDP and RTP). All the other packets shown in the figure have shorter packet lengths than the I-frame packet, indicating possible P-frame packets.
From the RTP header, it is noted that the video codec used is H.263 with a payload type of 34. The sequence number for packet No. 7170 is 53916, and it is incremented by one for each consecutive packet. All the packets have the same SSRC
Fig. 4.10 RTP trace example for video from Wireshark—more RTP information
identifier, indicating that they are from the same video source. The timestamps for the first two packets (No. 7170 and No. 7171) have the same value, indicating that they belong to the same video frame (the I-frame). This is due to the fact that the I-frame had to be put into two consecutive packets (too large to fit into one packet because of the Ethernet maximum transmission unit of 1500 bytes). This can also be seen from the Marker bit, which has the value of Zero for the first part of the I-frame and One for the second part. For the other P-frames shown in the figure, each P-frame was put into one IP packet with its marker bit set to One.
From these figures, you can see that the sequence number was incremented by
one for each consecutive packet.
The trace data shown in Fig. 4.10, which was collected from a VoIP testbed based on X-Lite (see Chaps. 8 and 9 for details), does not have a constant timestamp increment.
For example, the timestamp increment from packet No. 7191 to packet No. 7198 is 3330, whereas the timestamp difference from packet No. 7198 to packet No. 7212 is 2520. This may be due to the particular implementation of X-Lite and the attached camera used for video capture. Later we will show another trace data example with constant timestamp increments, which is more common in real VoIP systems.
For the first packet (No. 7170), the H.263 RTP payload header (RFC 2190) is illustrated in Fig. 4.11. As indicated, the RTP payload header follows IETF RFC
2190 [12] which specifies the payload format for packetising H.263 bitstreams into
RTP packets. There are three modes defined in RFC 2190 which may use different
fragmentation schemes for packetising H.263 streams. Mode A supports fragmenta-
tion at Group of Block (GOB) boundary and modes B and C support fragmentation
at Macroblock (MB) boundary. From Fig. 4.11, we can see that the 1st bit (F bit)
is set to zero (False), indicating that the mode of the payload header is “Mode A”
with only four bytes for the H.263 header. In this mode, the video bitstream is pack-
etised on a Group of Block (GOB) boundary. This can be further explained from the
H.263 payload part (the part illustrated as “ITU-T Recommendation H.263”) which
starts either with H.263 picture start code (0x00000020) or H.263 Group of Block
start code (0x00000001) as shown in Figs. 4.11 (for the packet No. 7170, the 1st
part of the I-frame) and Fig. 4.12 (for the packet No. 7171, the 2nd part of I-frame),
respectively. For the I-frame, the Inter-coded frame bit is set to Zero (False) indi-
cating that it is an intra-coded frame. For the P-frames, this bit is set to One (True).
The SRC (source) format shows that the QCIF (176 × 144) resolution was used for this video call. It has to be noted that in this example, one I-frame (for the
QCIF format) has been split into only two RTP packets with a boundary at GoB.
For other video formats, e.g. CIF, one I-frame may be split into several RTP packets
with boundary still at GoBs. The H.263 Group Number field in the H.263 header
will indicate where this part of H.263 payload is located within a picture. For more
detailed explanations on RFC 2190 and ITU-T H.263, the reader is suggested to
read [12] and [4].
Figure 4.13 shows another example of video call trace data based on an IMS
client (IMS Communicator).1 The video resolution is CIF (Common Intermediate
Format, 352 × 288). From the figure, it can be seen that one I-frame has been seg-
mented into 15 IP packets starting from packet No. 30487 to packet No. 30504
which all have the same timestamp of 6986, indicating the same media clock timing
for all these packets belonging to the same I-frame. The sequence number is incre-
mented by one for each consecutive packet. The last packet of the I-frame (packet
1 http://imscommunicator.berlios.de/
No. 30504) has the marker bit set as One (True) and all others have the marker bits
set as Zero (False, “MARK” is not shown on the list). For the first two P-frames
shown in the figure, both have been segmented into two IP packets (with same
timestamps). The first part of the P-frame has the marker bit set to Zero and the
second part of the P-frame has the marker bit set to One (True). For the other three
P-frames shown in the figure, each has only one IP packet with the marker bit set
to One. The timestamp increment for each video frame is constant in this trace data
(e.g., 12992 − 6986 = 6006; 18998 − 12992 = 6006; 25004 − 18998 = 6006). As
the media clock rate for H.263 is 90 kHz, the timestamp increment of 6006 indicates
that the video frame rate is about 15 frames per second (90, 000/6006 = 14.985 Hz).
This also indicates that the Picture Clock Frequency (PCF) of 15000/1001 = 14.985
is used for H.263 in this case.
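The relationship between the media clock, the timestamp increment and the frame rate can be checked with a one-line calculation; the Python sketch below reproduces the 14.985 fps figure derived above.

VIDEO_CLOCK_HZ = 90000   # RTP media clock rate used by most video codecs

def frame_rate_from_increment(timestamp_increment):
    """Estimate the video frame rate from the RTP timestamp step between frames."""
    return VIDEO_CLOCK_HZ / timestamp_increment

print(frame_rate_from_increment(6006))   # ~14.985 fps (H.263 PCF of 15000/1001)
print(frame_rate_from_increment(3000))   # 30 fps for a constant 30 fps encoder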
4.4 RTP Control Protocol—RTCP
The RTP Control Protocol (RTCP), also defined in RFC 1889 [10] (now obsolete)
and RFC 3550 [11], is a transport control protocol associated with RTP. It can pro-
vide quality related feedback information for an on-going RTP session, together
with identification information for participants of an RTP session. It can be used
for VoIP quality control and management (e.g., a sender can adjust its sending bit
rate according to received network and VoIP quality information) and can also be used by third-party monitoring tools. RTCP packets are sent periodically by each participating member to the other session members, and the RTCP bandwidth should be no more than 5 % of the RTP session bandwidth, which means that session participants need to control their RTCP sending interval.
As shown in Fig. 4.14, the RTP and RTCP packets are sent in two separate chan-
nels. The RTP channel is used to transfer audio/video data using an even-numbered
UDP port (e.g., x), whereas the RTCP channel is used to transfer control or moni-
toring information using the next odd-numbered UDP port (e.g., x + 1).
If an RTP session is established between two end points as Host A: 192.168.0.29:
19124 and Host B: 192.168.0.67:26448, the associated RTCP channel is also built
up between 192.168.0.29:19125 and 192.168.0.67:26449. The RTP session uses the even-numbered UDP ports (19124 and 26448), whereas the RTCP session uses the next odd-numbered UDP ports (19125 and 26449, in this example). As RTCP does not use the same channel as the RTP media stream, RTCP is normally regarded as an out-of-band protocol (outside the media band).
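The even/odd port pairing convention can be expressed as a trivial helper; the Python sketch below simply restates the rule with the port numbers from this example.

def rtcp_port_for(rtp_port):
    """RTP conventionally uses an even UDP port; RTCP uses the next odd port."""
    if rtp_port % 2 != 0:
        raise ValueError("RTP is expected on an even-numbered port")
    return rtp_port + 1

print(rtcp_port_for(19124))   # 19125
print(rtcp_port_for(26448))   # 26449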
There are five different types of RTCP packets, which are SR (Sender Report),
RR (Receiver Report), SDES (Source Description) packet, BYE (Goodbye) packet,
and APP (Application-defined) packet.
• SR: Sender Report—provides feed-forward information about the data sent and feedback information on reception statistics for all sources from which the sender receives RTP data.
The format of Sender Report (SR) is shown in Fig. 4.15. It contains three sections in-
cluding header part, sender information part and reception report blocks (for sources
from 1 to n).
The SR’s header part includes the following fields:
• V: Version (2 bits), version 2 for RFC 1889.
• P: Padding (1 bit), zero (False) for no padding part or one (True) with padding.
• RC: Reception Report Count (5 bits), indicating how many reception report blocks are included. The number can be from 0 to 31, which means the packet can contain from zero up to 31 reception report blocks.
• PT: Packet Type (8 bits), PT = 200 for Sender Report.
• Length: Packet Length (16 bits), the length of this RTCP Sender Report.
• SSRC sender (32 bits): sender source identifier.
The SR’s sender information part includes the following:
• NTP timestamp (64 bits): consists of the MSW (most significant word) and LSW (least significant word) of the NTP (Network Time Protocol) timestamp. The MSW and LSW form the 8-byte NTP timestamp, e.g. Nov 22, 2011 14:52:26.593000000 UTC. This reflects the time when the RTCP packet is generated. It can be used to calculate the Round Trip Time (RTT).
• RTP timestamp (32 bits): the RTP timestamp of the RTP packet sent just before this RTCP packet (from the same sender). It shows where the sender's sampling clock (RTP timestamp) is at the moment this RTCP sender report is issued. This is normally used for intra- and inter-media synchronisation.
• Sender’s packet count (32 bits): the total number of RTP packets transmitted
from the sender since starting transmission up until the moment this sender
report was generated.
• Sender's octet count (32 bits): the total count of RTP payload octets (bytes) sent since the beginning of the transmission up to the moment this RTCP report was sent. For example, if the sender's packet count is 168 and each packet's RTP length is 172 bytes (160 bytes of PCM payload + 12 bytes of RTP header), then the
total sender's RTP payload octet count is 168 × 160 = 26880 bytes. This field can be used to estimate the average payload data rate.
The SR’s report block (e.g., report block 1) includes the following fields:
• SSRC_1 (32 bits): SSRC of the 1st source (if there are only two hosts involved in a VoIP session, this will be the SSRC of the receiver).
• Fraction Lost (8 bits): RTP packet lost fraction since the previous SR was sent,
which is defined by the number of packets lost divided by the number of packets
expected.
• Cumulative number of packets lost (24 bits): the total number of RTP packets lost since the start of the transmission.
• Extended highest sequence number received (32 bits): the highest sequence
number received, together with the first sequence number received, are used
to compute the number of packets expected.
• Interarrival jitter (32 bits): estimation of interarrival jitter. Details about how
jitter is estimated will be covered in Chap. 6.
• Time of last sender report (LSR): 32 bits, the timestamp (the middle 32 bits of the NTP timestamp) of the most recently received Sender Report.
• Delay since last sender report (DLSR): 32 bits, the delay between the time when the last sender report was received and the time when this reception report was generated. DLSR and LSR are used to estimate the Round Trip Time (RTT); a small calculation sketch is given after this list.
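As mentioned in the DLSR item above, the LSR and DLSR fields allow the sender to estimate the round trip time when a report block arrives. The Python sketch below follows the calculation described in RFC 3550 (Sect. 6.4.1); the numeric inputs are hypothetical.

def round_trip_time_seconds(arrival, lsr, dlsr):
    """
    Estimate the RTT from a reception report block (cf. RFC 3550, Sect. 6.4.1).
    All three values are in the 'middle 32 bits of an NTP timestamp' format,
    i.e. units of 1/65536 second:
      arrival - local time when the report block arrived
      lsr     - 'last SR' timestamp echoed back by the remote side
      dlsr    - delay the remote side held the SR before reporting
    """
    rtt_units = (arrival - lsr - dlsr) & 0xFFFFFFFF   # 32-bit modular arithmetic
    return rtt_units / 65536.0

# Hypothetical numbers: the report arrives 0.25 s after the SR it references,
# and the remote side held that SR for 0.10 s before reporting.
lsr = 0x00010000
print(round_trip_time_seconds(lsr + int(0.25 * 65536), lsr, int(0.10 * 65536)))  # ~0.15 s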
If there are only two hosts involved in a VoIP session, there will be only one
report block (i.e., report block 1) which will provide feedback information for the
1st source, or the receiver in this case. The QoS information (e.g., fraction lost,
interarrival jitter) regarding the VoIP session can be used for VoIP quality control
and management.
If there are a total of N participants involved in a VoIP session (e.g., in a VoIP
conference), there will be N − 1 report blocks from block 1 (source 1) to block
N − 1 (source N − 1).
Figure 4.16 shows an example of an RTCP Sender Report (SR) from Wireshark. Please note that the fraction lost shown in Wireshark is expressed as 14/256: the 8-bit field carries the number of packets lost divided by the number of packets expected, scaled by 256. Dividing the field value by 256 gives the fraction loss rate, which is about 5.5 % in this case.
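The conversion from the raw 8-bit field to a percentage can be written as a one-line helper (illustrative only):

def fraction_lost_percent(fraction_lost_field):
    """Convert the 8-bit 'fraction lost' field (lost/expected scaled by 256) to a percentage."""
    return fraction_lost_field / 256 * 100

print(fraction_lost_percent(14))   # ~5.5 % for the value shown in this trace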
The RTCP Receiver Report (RR) has a similar format to the Sender Report, but it carries no sender information part. Its header includes the following fields:
• RC: Reception Report Count (5 bits), indicating the number of Reception Report blocks contained within the RR.
• PT: Packet Type (8 bits), PT = 201 for Receiver Report.
• Length: Packet Length (16 bits) of the Receiver Report.
• SSRC sender: (32 bits) sender source identifier.
Figure 4.18 shows an example of RTCP Receiver Report (RR) from Wireshark.
The format of RTCP Source Description is illustrated in Fig. 4.19. It contains the
following fields:
• V: Version (2 bits), version 2 for RFC 1889.
• P: Padding (1 bit), zero (False) for no padding part or one (True) with padding.
• SC: Source Count (5 bits), the count of the number of sources involved, from 0 to 31.
• PT: Packet Type (8 bits), PT = 202 for source description.
• Length: Packet Length (16 bits) of the Source Description packet.
• SSRC/CSRC_1: (32 bits) sender source identifier and 1st contributing source
identifier.
• SDES items: these include a Type, such as CNAME (Canonical Name), for example user@domain or id@host, a Length for the type defined, and the Text content of the type, for the sender and the 1st contributing source. SDES items can carry information such as name, e-mail, phone, location or notes.
Figure 4.20 shows an example of an RTCP Source Description packet from Wireshark. Please note that an RTCP Sender Report or Receiver Report is always listed before the Source Description packet. The Chunk 1 part of the data includes the SSRC/CSRC and the SDES items. In this example, the SDES items have three parts, each normally containing Type, Length, and Text.
The format of the RTCP Goodbye (BYE) packet is illustrated in Fig. 4.21. It contains
the following fields:
• V: Version (2 bits), version 2 for RFC 1889.
• P: Padding (1 bit), zero (False) for no padding part or one (True) with padding.
• SC: Source Count (5 bits), indicating the number of SSRC identifiers included.
The RTCP Sender Report (SR) and Receiver Report (RR) only contain basic QoS information regarding a media session, such as the packet loss rate and the interarrival jitter value. In order to provide more information regarding the underlying network QoS and VoIP QoE metrics, such as the Mean Opinion Score (MOS), for quality monitoring purposes, the extended RTCP report type (XR) was defined by RFC 3611 [2] in 2003.
The VoIP metrics provided by RTCP XR are shown in Fig. 4.23.
According to their functions, these metrics are divided into the following six categories. Detailed descriptions of these metrics (e.g., burst characteristics, R-factor, MOS-LQ and MOS-CQ) will be covered in Chap. 6.
• Loss and Discard: include metrics for the loss rate (due to packet loss in the network), the discard rate (due to packets arriving too late and being discarded at the receiver), and burst/gap density, burst/gap duration and Gmin (metrics describing the characteristics of burst packet losses). A Gmin of 16 is the recommended minimum number of consecutively received packets (no loss) required for the transition from a burst state to a gap state.
• Delay: include round trip time (RTT) and delay introduced by an end system, in-
cluding encoding delay, packetisation delay, decoding delay and playout buffer
delay.
• Signal related: include the signal level (or speech signal level), the noise level (or background noise level during silence periods) and the Residual Echo Return Loss (RERL).
• VoIP Call Quality: include R Factor, extended R Factor, and MOS scores for
listening quality (LQ) and conversational quality (CQ).
• Configuration: Rx config (receiver configuration byte) reflects the receiver configuration: what kind of packet loss concealment (PLC) method is used, whether an adaptive or fixed jitter buffer is used and, for an adaptive jitter buffer, the jitter buffer adjustment rate.
• Jitter Buffer: include jitter buffer values, such as jitter buffer nominal delay,
jitter buffer maximum delay, and jitter buffer absolute maximum delay.
cRTP (Compressed RTP) was originally proposed in RFC 2508 in 1999 [1] to improve transmission efficiency when sending audio or video over low-speed serial links, such as dial-up modems at 14.4 or 28.8 kb/s.
It compresses the 40 bytes of IP/UDP/RTP headers to either 2 bytes when there is no checksum in the UDP header or 4 bytes when the UDP checksum is present. The compressed cRTP header is decompressed back into the original full IP/UDP/RTP headers at the receiver side before going through the RTP, UDP and IP level packet header processing.
The idea of compression is based on the concept that the header fields in IP,
UDP and RTP are either constant between consecutive packets or the differences
between these fields are constant or very small. For example, the RTP header fields
of SSRC (Synchronisation Source Identifier) and the PT (Payload Type) are constant
for packets from the same voice or video source, as you can see from Fig. 4.5. The differences in the Sequence number and Timestamp header fields are also constant between consecutive packets: from Fig. 4.5, you can see that the difference between the sequence numbers of two consecutive packets is one, whereas the difference between their timestamps is 160 (samples). Based on this observation, compressed RTP works by sending a packet with full headers at the initial stage; afterwards, only the updates (differences) between the headers of consecutive packets are sent to the decompressor at the receiver side. The decompressor reconstructs the full header information based on the previously received full headers. Full-header packets are sent periodically in order to avoid desynchronisation between compressor and decompressor due to packet loss. To further improve the performance of cRTP over links with packet loss, packet reordering and long delay, enhanced cRTP was proposed in RFC 3545 in 2003 [5], which specifies methods to prevent context corruption and to improve the resynchronisation process between compressor and decompressor when the scheme loses synchronisation due to packet loss. For more details on the principles of cRTP, the reader can consult Perkins's book on RTP [7].
In the following, we illustrate a worked example of calculating the transmission efficiency of a VoIP system and demonstrate the efficiency improvement obtained by using cRTP.
SOLUTION: For the 6.3 kb/s G.723.1 codec, the length of a speech frame is 30 ms, which results in 189 coded bits (6.3 kb/s × 30 ms). The payload size for 189 bits is equivalent to 24 bytes (padded with three zero bits at the end of the last byte).
Considering the IP/UDP/RTP header size of 40 bytes, the required IP bandwidth
for G.723.1 RTP is:
(24 + 40) × 8 (bits) / 30 (ms) = 17 kb/s
If cRTP with a 4-byte compressed header is used instead of the 40-byte IP/UDP/RTP headers, the required IP bandwidth becomes (24 + 4) × 8 (bits) / 30 (ms) = 7.46 kb/s, and the bandwidth efficiency rises from 24/(24 + 40) = 37.5 % to 24/(24 + 4) ≈ 86 %. From the above example, it is clear that the cRTP scheme can reduce the required IP bandwidth (from 17 kb/s to 7.46 kb/s in this example) and improve the transmission bandwidth efficiency (from 37.5 % to 86 %).
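The comparison can be reproduced with the same style of helper used earlier in this chapter; the Python sketch below assumes a 24-byte G.723.1 payload every 30 ms and a 4-byte cRTP header.

def bandwidth_kbps(payload_bytes, header_bytes, packet_interval_ms):
    """IP bandwidth when one packet is emitted every packet_interval_ms milliseconds."""
    return (payload_bytes + header_bytes) * 8 / packet_interval_ms

PAYLOAD = 24   # bytes per 30 ms G.723.1 (6.3 kb/s) frame, padded to a whole number of bytes

print(bandwidth_kbps(PAYLOAD, 40, 30))   # plain RTP: ~17.1 kb/s, efficiency 24/64 = 37.5 %
print(bandwidth_kbps(PAYLOAD, 4, 30))    # cRTP, 4-byte header: ~7.5 kb/s, efficiency 24/28 ~ 86 %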
4.6 Summary
In this chapter, we discussed media transport for VoIP, which mainly involves two protocols: RTP and its associated RTCP protocol. We started the chapter with why real-time VoIP applications are based on UDP for data transfer, what problems arise when UDP is used, and the need for an application layer protocol, such as RTP, to facilitate VoIP data transfer. We explained the RTP header in detail and showed examples for voice and video from Wireshark
based on real VoIP trace data collected in the lab. We also explained the concepts of RTCP reports and cRTP header compression.
4.7 Problems
11. During a VoIP application, you have decided to change its codec from 64 kb/s
PCM to 8 kb/s G.729. What will be the IP bandwidth usage change due to this
codec change (assuming both use 30 ms of speech in a packet)? If the applica-
tion developer has decided to use cRTP with only four bytes for compressed
IP/UDP/RTP headers instead of 40 bytes in normal RTP case, what will be the
bandwidth usage change (if codec is still G.711 64 kb/s and 30 ms of speech in
a packet)? From your results, which method (i.e., change codec from G.711 to
G.729 or change from RTP to cRTP) is more efficient from bandwidth usage
point of view? You need to show the process of your solution.
12. It is known that G.711.1 PCM-WB is used in a VoIP system. Calculate the
IP bandwidth usage for Layer 0, Layer 1 and Layer 2 bitstream, respectively.
What is the overall IP bandwidth for G.711.1 at 96 kb/s?
13. As a general principle for RTCP, it is required that the bandwidth for RTCP transmission should be no more than 5 % of the RTP session bandwidth. How do session participants measure and control their RTCP transmission rate? When the number of participants in a VoIP session increases, will the RTCP packet size get bigger? How about the RTCP transmission bandwidth consumption?
14. Why is an RTCP message always sent as a compound (bundled) packet including different types of RTCP packets?
15. Describe QoS metrics in RTCP reports.
16. Describe QoE metrics in the Extended RTCP report.
References
1. Casner S, Jacobson V (1999) Compressing IP/UDP/RTP headers for low-speed serial links.
IETF RFC 2508
2. Friedman T, Caceres R, Clark A (2003) RTP control protocol extended reports (RTCP XR).
IETF RFC 3611
3. Information Sciences Institute (1981) Transmission control protocol. IETF RFC 793
4. ITU-T (2005) Video coding for low bit rate communication. ITU-T H.263
5. Koren T, Casner S, et al (2003) Enhanced compressed RTP (CRTP) for links with high
delay, packet loss and reordering. IETF RFC 3545
6. Kurose JF, Ross KW (2010) Computer networking, a top–down approach, 5th edn. Pearson
Education, Boston. ISBN-10:0-13-136548-7
7. Perkins C (2003) RTP: audio and video for the Internet. Addison-Wesley, Reading. ISBN:0-
672-32249-8
8. Postel J (1980) User datagram protocol. IETF RFC 768
9. Schulzrinne H, Casner S (2003) RTP profile for audio and video conferences with minimal
control. IETF RFC 3551
10. Schulzrinne H, Casner S, et al. (1996) RTP: a transport protocol for real-time applications.
IETF RFC 1889
11. Schulzrinne H, Casner S, et al (2003) RTP: a transport protocol for real-time applications.
IETF RFC 3550
12. Zhu C (1997) RTP payload format for H.263 video streams. IETF RFC 2190
5 VoIP Signalling—SIP
Clients and Servers are the two main devices defined in the SIP architecture (cf.,
Fig. 5.1). A client is described in RFC 3261 [14] as a network element that sends
SIP requests and receives SIP responses. A client may or may not interact with a
human being. Similarly, the server is a network element that receives requests in
order to service them, and then responds to those requests.
Generally, SIP network elements have the following capabilities,
• SIP determines the location of the UAs. This is achieved during the UA registration process. The registration process allows SIP to easily find the IP addresses of UAs.
• SIP determines the availability of the UAs. Application servers are used to keep track of the availability of UAs. UAs have the option to forward calls to voice mail if they are not available. UAs can create profiles for how to route calls when they are not available, across multiple locations (e.g., office, home and mobile) or several devices (laptop, mobile and desktop computer).
• SIP establishes a session between UAs. SIP can create sessions by using SIP
methods such as Invite.
• In addition to establishing sessions between UAs, SIP has the capability to man-
age sessions. For instance, if a UA is registered on several devices, a call can be
seamlessly transferred between devices (e.g., from a mobile phone to a laptop).
• SIP determines UAs' capabilities. This is mainly related to media capabilities such as the voice/video codecs of the participating UAs. These capabilities are extracted from the Session Description Protocol (SDP). UA capabilities can be negotiated during session establishment.
It is possible that a device can have both UAC and UAS functions, and sometimes
a session will have both UAC and UAS roles. This happens if a device wants to add
another participant to an ongoing call.
A Proxy Server (proxy) is responsible for receiving SIP requests and responses and forwarding them to their destination on behalf of the UAs. The proxy simply acts as a router of SIP messages. There are three types of proxies:
1. Stateful Proxy: A stateful proxy maintains the state of every dialog it is servic-
ing. It remembers call identifiers of each session and receives all responses if a
session status has changed or ended.
2. Stateless Proxy: A stateless proxy does not maintain any state of any dialog it is servicing. It simply forwards SIP requests and responses.
3. Forking Proxy: A forking proxy is responsible for forwarding SIP requests to more than one destination. There are two types of forking proxy, parallel and sequential. In the parallel case, a given user can have several UAs available and registered at different locations, such as at home, in the office and on a mobile; the parallel forking proxy will call all three user locations simultaneously. In the sequential case, the proxy tries to call the different UAs one after the other, each for a certain period of time, until one is picked up. A forking proxy must be a stateful proxy. Figure 5.5 illustrates how a sequential forking proxy operates. A SIP message destined for Alice is received at the forking proxy. Since Alice has three registered UAs, at home, in the office and on a mobile, the proxy server first rings Alice at home; the call is not answered after a certain duration of time,
so the proxy rings Alice at the office, which is not answered either. Finally the forking proxy rings Alice on the mobile and the call is answered.
5.1.5 Registrar
A registrar is responsible for authenticating and recording UAs. A UA sends a REGISTER SIP message (cf., Fig. 5.7) to a registrar when it is switched on or changes its
IP address. After receiving the REGISTER SIP message from the UA, the registrar can either accept the UA registration or challenge it by rejecting the first registration attempt. This challenge forces the UA to send its credentials for verification.
1. Syntax and encoding layer: This is the lowest layer. It is the set of rules that defines the format and structure of a SIP message. The syntax and encoding layer is mandatory for every SIP network element.
2. Transport layer: It defines how SIP network elements send and receive SIP requests and responses. All SIP network elements must support the transport layer.
3. Transaction layer: It is responsible for handling all SIP transactions. A SIP transaction consists of a SIP request generated by a UA and the responses to it. The transaction layer handles retransmissions, timeouts and the correlation between SIP requests and responses. The client's transaction component is called the client transaction, while that of the server is called the server transaction. The transaction layer is only available in UAs and stateful proxies.
4. Transaction user layer: It creates client transactions such as an INVITE with
destination IP address, port number and transport.
The SIP message format is text-based and very similar to HTTP/1.1. Requests and responses are the two types of SIP messages, whereby the UAC sends requests and the UAS replies with responses. SIP URIs [5] are used to identify SIP UAs; they are made up of a username and a domain name. SIP URIs can carry other parameters such as the transport: sip:alice@home and sip:alice@home;transport=udp are two SIP URIs, without and with the transport parameter, respectively. SIPS URIs use Transport Layer Security (TLS) [7] for message security; sips:alice@home is an example of a SIPS URI. The SIP message format for both request and response types is depicted in Table 5.1.
Request-Line
The Request-Line contains the SIP method name, the Request-URI and the SIP protocol version. An example of the Request-Line is “INVITE sip:alice@office SIP/2.0”. The SIP method defines the purpose of the request; in this example INVITE denotes the SIP method. The Request-URI shows the request's destination, which is alice@office. The SIP protocol version is 2.0. Table 5.2 lists the main SIP methods.
The request identifies the type of session being requested by the UAC. The requirements for supporting a session, such as payload types and encoding parameters, are included as part of the UAC's request. However, some requests, such as the MESSAGE method, do not require a session or dialog; the receiving UA may choose whether or not to accept the message through its response.
The SIP request of interest is INVITE [15]; this request invites UAs to participate in a session. Its body contains the session description in the form of SDP. This
request includes a unique identifier for the call, the destination and originating IP
addresses, and information about the session type to be established. An example of
the INVITE request is depicted in Fig. 5.10. The first line contains the method name
which is INVITE. The rest of the lines that follow are header fields. The header
fields are,
• Via: This field records the path that a request takes to reach the destination, the
same path should be taken by all corresponding responses.
• To: This field contains the URI of the destination UA, in this scenario, the value
is <sip:[email protected]>.
• From: This field contains the URI of the originating UA, in this case it is From:
<sip:[email protected]>.
• CSeq: This field contains sequence number and SIP method name. It is used to
match requests and responses. At the start of a transaction, the first message is
given a random integer sequence number, then there will be an increment of one
for each new message. In this example the sequence number is 2.
• Contact: This field identifies the URI that should be used to contact the UA who
created the request. In this example, the contact value is <sip:[email protected].
208.151:40332>.
• Content-Type: This field is used to identify the content media type sent to an-
other UA in a message body. In this example, the content type is application/sdp.
• Call-ID: This field provides a unique identifier for a SIP message. This allows
UAS to keep track of each session.
• Max-Forwards: This field is used to limit the number of hops a request can traverse. It is decreased by one at each hop. In this scenario, the Max-Forwards value is 70.
• Allow: This field is used by the UAC to determine which SIP methods are sup-
ported by the UAS. For instance, a query from UAC to UAS to find out which
methods are supported can be replied by the UAS in a response containing:
ALLOW: INVITE, CANCEL, BYE, SUBSCRIBE.
• Content-Length: This field contains an octet (byte) count of the message body. It is 402 bytes in this example. A minimal sketch of how such a request can be assembled as text is given after this list.
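To see how these header fields fit together on the wire, the following Python sketch assembles a minimal INVITE as plain text. All URIs, the branch parameter, tag and Call-ID are made-up illustrative values, and several headers a real stack would add (e.g., Allow, Supported) are omitted.

def build_invite(from_uri, to_uri, contact_uri, call_id, cseq, sdp_body):
    """Assemble a minimal SIP INVITE as text; all values here are illustrative only."""
    lines = [
        "INVITE %s SIP/2.0" % to_uri,                    # Request-Line: method, Request-URI, version
        "Via: SIP/2.0/UDP 192.0.2.10:5060;branch=z9hG4bKexample",
        "Max-Forwards: 70",
        "To: <%s>" % to_uri,
        "From: <%s>;tag=1928301774" % from_uri,
        "Call-ID: %s" % call_id,
        "CSeq: %d INVITE" % cseq,
        "Contact: <%s>" % contact_uri,
        "Content-Type: application/sdp",
        "Content-Length: %d" % len(sdp_body.encode()),
    ]
    # A blank line separates the header fields from the message body.
    return "\r\n".join(lines) + "\r\n\r\n" + sdp_body

sdp = "v=0\r\no=bob 2890844526 2890844526 IN IP4 192.0.2.10\r\ns=call\r\nt=0 0\r\n"
print(build_invite("sip:bob@home", "sip:alice@office",
                   "sip:bob@192.0.2.10:5060", "a84b4c76e66710", 2, sdp))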
Status-Line
The Status-Line consists of the SIP version, the status code (an integer between 100 and 699 inclusive) and the reason phrase. An example of the Status-Line is “SIP/2.0 200 OK”. In the example, the SIP version is 2.0 and the status code is 200, which means OK (Success). The status codes are grouped into 6 classes; the first digit of the code defines its class (cf., Table 5.3).
The provisional or informational class indicates that a request has been received
and is being processed. This class serves as an acknowledgement with the purpose
The global failure class indicates that a request cannot be fulfilled anywhere in the network. For instance, the 600 Busy Everywhere response is issued when the called user is busy at every registered location and does not wish to take the call.
Table 5.4 outlines the list of SIP status codes with their descriptions.
Header Fields
A header field consists of a header field name, a colon and the header field value. It
contains detailed information of the UAC requests and UAS responses. The header
field mainly includes the origination and destination addresses of SIP requests and
responses together with routing information. The following main field names can be
found in the header field of a SIP message.
• Via: This field records the path that a request takes to reach the destination, the
same path should be taken by all corresponding responses.
• To: This field contains the URI of the destination UA, e.g., “To:Alice<sip:alice@
home>;tag = 1234”.
• From: This field contains the URI of the originating UA.
• CSeq: This field contains sequence number and SIP method name. It is used to
match requests and responses. At the start of a transaction, the first message is
given a random integer sequence number, then there will be an increment of one
for each new message.
• Contact: This field identifies the URI that should be used to contact the UA who
created the request.
• Content-Type: This field is used to identify the content media type sent to an-
other UA in a message body.
• Call-ID: This field provides a unique identifier for a SIP message. This allows
UAS to keep track of each session.
• Max-Forwards: This field is used to limit the number of hops a request can traverse. It is decreased by one at each hop.
• Allow: This field is used by the UAC to determine which SIP methods are sup-
ported by the UAS. For instance, a query from UAC to UAS to find out which
methods are supported can be replied by the UAS in a response containing:
ALLOW: INVITE, CANCEL, BYE, SUBSCRIBE.
SIP Identities
SIP and SIPS URIs are used to identify SIP elements in an IP network. These two
URIs are identical but SIPS URI denotes that the URI is secured. SIP and SIPS URIs
are defined in the SIP RFC 3261 and take the form of sip:user:password@host:
port;uri-parameters?headers. The host part can be in the form of an IP address or a DNS name. A common form of SIP URI is sip:alice@home:5060, where 5060 represents the port number on which the SIP stack listens for incoming SIP requests and responses.
Private User Identity: Private user identity uniquely identifies the UAC sub-
scription. Private identity enables the VoIP network operator to identify one sub-
scription for all VoIP services for the purpose of authorization, registration, admin-
istration and billing.
The private identity is not visible to other network providers and it is not used
for routing purposes. RFC 2486 [3] specifies that the private identity takes the form of the Network Access Identifier (NAI). The NAI is similar to an e-mail address, where “@” separates the username and the domain parts.
Public User Identity: Public user identity is used by UAs to advertise their pres-
ence. Public identity takes the form of the NAI format. Public identity is not limited to one per subscriber; a subscriber can have more than one public identity in order to use different devices.
Public identity allows VoIP service providers to offer flexibility to subscribers by eliminating the need to have multiple accounts for each identity. Public identity also allows flexible routing: for example, if Alice is not in the office, a call can be routed to Alice's device at home. Private and public identities are sent in a REGISTER message when the UAC is registering for a VoIP service.
Message Body
An empty line separates the header fields from the message body. The message body can be divided into different parts; SIP uses MIME to encode its multiple message bodies. A set of header fields provides information on the contents of a particular body part, such as Content-Disposition, Content-Encoding and Content-Type. Figure 5.13 shows a multipart SIP message body.
In Fig. 5.13, the Content-Disposition shows that the body is a session description,
the Content-Type denotes that the session description used is Session Description
Protocol (SDP) [9] and the Content-Length indicates the length of the body in bytes.
The first part of the SIP message body consists of an SDP session description and
the second part is made up of the text.
Message bodies are transmitted end to end; as a result, proxy servers do not need to parse the message body in order to route the message. A UA may also wish to encrypt the content of the message body.
3GPP has made SDP the de facto session description protocol in IMS because it has the capability to describe a wide range of media types, which can also be treated separately. For instance, a webinar session may involve voice, video, a PowerPoint presentation and more, such as text editors and a whiteboard session. All these media types can be described in one SIP message by using SDP.
The SDP is carried within the SIP message body and has three main parts, the
Session, the Time, and the Media descriptions.
z = time zone adjustments. This is important for session participants who are in
different time zones in order to properly communicate the session time.
k = encryption key. If encryption is in place then the encryption key is needed to
read the payload.
a = zero or more session attribute lines. Attributes are used to extend SDP for
other applications whose attributes are not defined by the IETF.
The time description provides information about the timing of a session. This might include when the session should start, stop and be repeated.
t = time the session is active. This field denotes the start and stop times for the
session. Its format is t = <start-time><stop-time>. The session is unbounded
if <stop-time> is set to 0.
r = zero or more repeat times. This field denotes when the session will be re-
peated. Its format is r = <repeat interval><active duration><offsets from
start-time>.
z = time zone adjustments. This field is used by UAs to make time zone adjust-
ments. This field is important because different time zones change their times at
different times of the day and several countries adjust daylight saving times at
different dates, while some countries do not have daylight saving times.
Data represents data streams sent by an application for processing by the desti-
nation application. An application can be any multimedia application such as white-
board or similar multimedia applications. Control represents a control panel for an
end application.
The port defines the port number on which to receive the session. Transport describes the transport protocol to be used for the session; RTP/AVP is supported, i.e., the Real-time Transport Protocol with the Audio Video Profile. The media formats field defines the formats to be used, such as μ-law encoded voice and H.264 encoded video.
5.3.4 Attributes
Attributes are the SDP extension mechanism; a number of them are defined in [9]. The main attributes are:
a = rtpmap: <payload type><encoding name>/<clock rate><encoding
parameters>. In this attribute, the payload type denotes whether the session
is audio or video. Encoding parameters are optional which identify the number
of audio channels. There are no encoding parameters for video session.
a = cat:<category>. This SDP attribute hierarchically lists session category
whereby the session receiver can filter the unwanted sessions.
a = keywds:<keywords>. This attribute enables the session receiver to search
sessions according to specific keywords.
a = tool:<name and version of tool>. This attribute makes it possible for the session receiver to establish which tool has been used to set up the session.
a = ptime:<packet time>. This attribute is useful for audio data which provides
the length of time in milliseconds represented in received packets of the session.
This attribute is intended as a recommendation for the packetization of audio
packets.
a = recvonly. This attribute is used to set the UA to receive-only mode when receiving a session.
a = sendrecv. This attribute sets the UA to send-and-receive mode. This will enable the receiving UA to participate in the session.
a = orient:<whiteboard orientation>. This attribute is used with whiteboard ap-
plications to specify the orientation of the whiteboard application on the screen.
The three supported values are landscape, portrait and seascape.
a = type:<conference type>. This attribute specifies the type of the conference.
The suggested values are Broadcast, Meeting, Moderated and Test.
a = charset:<character set>. This attribute specifies the character set to describe
the session name and the information. The ISO-10646 character set is used by
default.
a = sdplang:<SDP language>. This attribute specifies the language to be used
in the SDP. The default language is English.
The SDP message extracted from Wireshark is shown in Fig. 5.14. Each line of the SDP message describes a particular attribute of the session to be created and follows the format described in Sect. 5.3. For VoIP sessions, the important fields are listed below (a small parsing sketch follows this list):
a—attributes, this is in the form of a = rtpmap:<payload type><encoding
name>/<clock rate><encoding parameters>. The payload types in this exam-
ple are 101 and 107 with the corresponding clock rates of 8000 and 16000, re-
spectively.
c—connection information, which has the connection address (192.168.2.4) for the RTP stream. The network type is “IN” and the address type is IP4.
m—media description, which includes the port number 52942 on which the RTP
stream will be received. The media type is audio, the transport is RTP/AVP and
the media formats supported are PCMU and PCMA.
Before presenting the SIP message flow for multimedia session establishment, it is important to describe the relationship between SIP messages, transactions and dialogs. Although SIP messages are sent independently between UAs, they are normally grouped into transactions and dialogs.
Session establishment is a 3-way process and the UAC must be registered before
establishing a session. Figure 5.16 illustrates that the current location of Alice
5.5 Summary
The Session Initiation Protocol has emerged as the industry choice for real-time communication and applications such as voice and video over IP, Instant Messaging and presence. Borrowing from proven Internet protocols such as SMTP and HTTP, SIP is ideal for the Internet and other IP platforms. SIP provides the
platform to implement a range of features such as call control, next generation
service creation and interoperability with existing mobile and telephony systems.
SIP is the de facto signalling protocol in IMS, TISPAN and PacketCable architec-
tures.
5.6 Problems
1. What do the acronyms UA, UAC and UAS stand for? Describe what they do.
2. Describe any four types of SIP servers.
3. Do we really need a proxy server? Explain your answer.
4. Why is the forking proxy a stateful proxy?
5. Describe advantages and disadvantages of using stateful proxy server.
References
1. 3GPP (2008) TISPAN; Presence Service; Architecture and functional description [Endorse-
ment of 3GPP TS 23.141 and OMA-AD-Presence_SIMPLE-V1_0]. TS 23.508, 3rd Gener-
ation Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/html-info/23508.htm
2. 3GPP (2012) 3rd generation partnership project. http://www.3gpp.org. [Online; accessed
15-August-2012]
3. Aboba B, Beadles M (1999) The network access identifier. RFC 2486, Internet Engineering
Task Force. http://www.rfc-editor.org/rfc/rfc2486.txt
4. Berners-Lee T, Fielding R, Frystyk H (1996) Hypertext transfer protocol—HTTP/1.0. RFC
1945, Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc1945.txt
5. Berners-Lee T, Fielding R, Masinter L (1998) Uniform resource identifiers (URI):
generic syntax. RFC 2396, Internet Engineering Task Force. http://www.rfc-editor.org/
rfc/rfc2396.txt
6. Day M, Aggarwal S, Mohr G, Vincent J (2000) Instant messaging/presence proto-
col requirements. RFC 2779, Internet Engineering Task Force. http://www.rfc-editor.org/
rfc/rfc2779.txt
7. Dierks T, Rescorla E (2008) The transport layer security (TLS) protocol version 1.2. RFC
5246, Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc5246.txt
8. Donovan S (2000) The SIP INFO method. RFC 2976, Internet Engineering Task Force.
http://www.rfc-editor.org/rfc/rfc2976.txt
9. Handley M, Jacobson V (1998) SDP: session description protocol. RFC 2327, Internet En-
gineering Task Force. http://www.rfc-editor.org/rfc/rfc2327.txt
10. Handley M, Schulzrinne H, Schooler E, Rosenberg J (1999) SIP: session initiation protocol.
RFC 2543, Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc2543.txt
11. ITU-T (1996) H.323: visual telephone systems and equipment for local area networks which
provide a non-guaranteed quality of service. Recommendation H.323 (11/96), International Telecommunication Union
12. Klensin J (2001) Simple mail transfer protocol. RFC 2821, Internet Engineering Task Force.
http://www.rfc-editor.org/rfc/rfc2821.txt
13. Rosenberg J (2002) The session initiation protocol (SIP) UPDATE method. RFC 3311, In-
ternet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc3311.txt
6 VoIP Quality of Experience (QoE)
Quality of Experience (QoE) is a term used to describe user perceived experience for
a provided service, e.g. VoIP. This term is also referred to as User Perceived Quality
of Service (PQoS) in order to differentiate with Network Quality of Service (QoS)
which reflects network performance. Network QoS metrics generally include packet
loss, delay and jitter which are the main impairments affecting voice and video qual-
ity in VoIP applications. The key QoE metric is Mean Opinion Score (MOS), an
overall voice/video quality metric. In this chapter, the definition of QoS and QoS
metrics will be introduced first. Then the characteristics of these metrics and how
to obtain them in a practical way will be discussed. Further, the QoE concept and
an overview of QoE measurement for VoIP applications will be presented. Finally,
the most commonly used subjective and objective QoE measurement for voice and
video will be presented in detail, including Perceptual Evaluation of Speech Qual-
ity (PESQ) and E-model for voice quality assessment, and Full-Reference (FR),
Reduced-Reference (RR) and No-Reference (NR) models for video quality assess-
ment.
VoIP terminal through the IP network in a PC-to-PC call scenario, or from when IP
packets leave a media gateway in a PSTN/IP combined network to another media
gateway in a phone-to-phone call scenario.
In VoIP applications, the end-to-end QoS is also regarded as mouth-to-ear quality, reflecting the quality of a VoIP call from a user speaking into a handset's microphone at one end to another user listening on the phone at the other end. This mainly covers one-way listening speech quality, without consideration of interactivity (conversational quality).
In the past decade, the term end-to-end Quality of Service has gradually been replaced by Perceived Quality of Service (PQoS), to reflect how an end user perceives the quality provided, and further by Quality of Experience (QoE), with a focus on the user's experience of the quality of the service provided. The QoE concept will be covered in the later sections of the chapter.
Fig. 6.6 Example of burst packet loss and burst loss length
1 − q (i.e. p11) is the probability that a packet will be dropped given that the previous packet was dropped.
Let π0 and π1 denote the state probability for state 0 and 1, as π0 = P (X = 0)
and π1 = P (X = 1), respectively.
The procedure to compute π0 and π1 is as follows. At steady state, we have:
π0 = (1 − p) · π0 + q · π1,   π0 + π1 = 1   (6.2)
Solving Eq. (6.2) gives:
π0 = q/(p + q),   π1 = p/(p + q)   (6.3)
The ulp (= π1) provides a measure of the average packet loss rate. 1 − q is also referred to as the conditional packet loss probability (clp).
The Gilbert model implies a geometric distribution for the number of consecutive packet losses k; that is, the probability pk of a burst loss having length k can be expressed as:
pk = P(Y = k) = q · (1 − q)^(k−1)   (6.4)
Based on Eq. (6.4), the mean burst loss length E[Y] can be calculated as:
E[Y] = Σ_{k=1}^{∞} k · pk = Σ_{k=1}^{∞} k · q · (1 − q)^(k−1) = 1/q   (6.5)
Note that E[Y] is computed based on q, which is directly related to the conditional loss probability clp (q = 1 − clp); i.e. the value of the mean burst loss length depends only on the behaviour of consecutive lost packets.
The probabilities p and q can also be calculated from the loss-length distribution statistics of the trace data. Let oi, i = 1, 2, . . . , n − 1, denote the number of loss bursts having length i, where n − 1 is the length of the longest loss burst. Let o0 denote the number of successfully delivered packets (obviously o0 = c0). Then p and q can be calculated by the following equations [44] (you can also derive Eq. (6.7) from Eq. (6.6) based on the trace data concept):
p = (Σ_{i=1}^{n−1} oi) / o0,   q = 1 − (Σ_{i=1}^{n−1} oi · (i − 1)) / (Σ_{i=1}^{n−1} oi · i)   (6.7)
The unconditional loss probability (ulp or π1 ) and the conditional loss probabil-
ity (clp or 1 − q) are two metrics used to represent bursty packet loss in IP networks.
This 2-state Markov model is also used in the calculation of effective equipment im-
pairment factor (Ie-eff ) in the E-model which will be covered in Sect. 6.4.3. In prac-
tice, the mean burst loss length (E) is normally used to replace the conditional loss
probability (clp) due to its clear practical meaning. The average packet loss rate Ppl
(in %) is generally used instead of the unconditional loss probability (ulp) metric.
A numerical example to demonstrate how to obtain the average packet loss rate
and the mean burst loss length based on the trace data information will be provided
in Sect. 6.7.
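As a hedged illustration of these calculations (with made-up burst counts rather than the trace data used in Sect. 6.7), the following Matlab sketch computes p, q, the average packet loss rate and the mean burst loss length from burst-length counts oi and the number of delivered packets o0:

    % Gilbert-model statistics from loss-burst counts (hypothetical values).
    % o(i) = number of loss bursts of length i; o0 = number of delivered packets.
    o  = [30 8 2];            % e.g. 30 bursts of length 1, 8 of length 2, 2 of length 3
    o0 = 4000;                % successfully delivered packets

    i = 1:length(o);          % possible burst lengths
    p = sum(o) / o0;                            % Eq. (6.7)
    q = 1 - sum(o .* (i - 1)) / sum(o .* i);    % Eq. (6.7)

    lost   = sum(o .* i);                       % total number of lost packets
    Ppl    = 100 * lost / (lost + o0);          % average packet loss rate (%)
    Eburst = 1 / q;                             % mean burst loss length, Eq. (6.5)

    fprintf('p = %.4f, q = %.4f\n', p, q);
    fprintf('Ppl = %.2f %%, mean burst loss length = %.2f\n', Ppl, Eburst);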
chain models [49], and loss run-length and no-loss run-length models [44]. Interested readers can refer to these publications for more information.
In this section, we will discuss the concept of delay and delay variation (jitter) in
VoIP applications. The components of network delay and further end-to-end delay,
and the definition of delay variation (jitter) used in Real-time Transport Protocol
(RTP) in IETF RFC 3550 [45] will be covered. This is the jitter definition generally
used in VoIP applications.
• Packetization delay: the time needed to build data packets at the sender, as well
as to strip off packet headers at the receiver. For example, for AMR codec,
if one packet contains two speech frames, then the packetization delay equals
2 × 20 = 40 ms.
• Playout buffer delay: the time a packet waits in the playout buffer at the receiver side. This will be explained in detail in a later section.
The end-to-end delay dend-to-end can be expressed as the sum of these delay components:
dend-to-end = dcodec + dpacketization + dnetwork + dbuffer
For VoIP applications, if the codec, packetization size and jitter buffer are fixed, then the end-to-end delay is mainly affected by the network delay. More details on buffer delay (dbuffer) will be explained later.
receiver side. As illustrated in Fig. 6.10(c), packets are played out at constant intervals at the receiver side. In the figure, Pi represents the playout time of packet i. Now, Pi+1 − Pi = Pi − Pi−1 = 20 ms (here the AMR codec is assumed). The time a packet stays in the buffer between arrival and playout is called the buffer delay. If a packet arrives too late (see packet i + 2 as an example), the packet will be dropped by the playout buffer (this is called late arrival packet loss, in contrast to network packet loss, where the packet is lost in the network).
Figure 6.11 illustrates the relationship between network delay and playout buffer
delay using packet i as an example. ni = Ri − Si is the network delay for the
packet i. bi is the buffer delay, i.e. the time the packet stays in the playout buffer at the
receiver. di can be viewed as the time spent by the packet i from the moment it
leaves the sender to the time it is played out at the receiver.
Jitter is the statistical variance of the packet interarrival time or variance of the
packet network delay and is caused mainly by the queuing delay component along
the path. There are different definitions of jitter to represent the degree of the vari-
ance of the delay.
According to IETF RFC 3550 [45], the interarrival jitter J is updated on the arrival of each packet i as J(i) = J(i − 1) + (|D(i − 1, i)| − J(i − 1))/16, where D is the difference of the packet spacing, or the difference of IP network delay, between two consecutive packets, here packet i and its previous packet i − 1. The D value can be calculated as:
D(i − 1, i) = (Ri − Ri−1) − (Si − Si−1) = (Ri − Si) − (Ri−1 − Si−1)
where Ri − Ri−1 is the packet arrival space between the packet i and the packet
i − 1, and Ri − Si is the packet network delay for the packet i, shown as ni in
Fig. 6.11.
The interarrival jitter J is thus a running estimate of the Mean Packet to Packet Delay Variation (MPPDV) computed from D.
A Practical Example to Calculate Jitter and Average Delay
Here we show a practical example to calculate jitter according to IETF RFC 3550.
Assume we have trace data after pre-processing (trace1.txt), similar to the one shown in Fig. 6.4, with the 1st column holding the sequence number and the 2nd column the one-way network delay. Below is a sample Matlab script to calculate the jitter and the average delay. At the end of the calculation, the jitter value and the average delay are printed out on the screen.
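A minimal sketch of such a script (assuming the two-column format of trace1.txt described above, with packets in sequence order and no loss, for simplicity) is:

    % Interarrival jitter (RFC 3550) and average delay from trace1.txt.
    % Column 1: sequence number; column 2: one-way network delay in ms.
    trace = load('trace1.txt');
    delay = trace(:, 2);

    J = 0;                               % running jitter estimate
    for k = 2:length(delay)
        D = delay(k) - delay(k-1);       % (R_k - S_k) - (R_(k-1) - S_(k-1))
        J = J + (abs(D) - J) / 16;       % RFC 3550 update with gain 1/16
    end

    avgDelay = mean(delay);
    fprintf('Jitter = %.2f ms, average delay = %.2f ms\n', J, avgDelay);

Note that, per RFC 3550, J is an exponentially weighted running estimate updated on every packet arrival, rather than a single-pass variance calculation.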
to follow network delay changes in order to get a good tradeoff in buffer delay and
buffer loss.
In VoIP applications, there exist many adaptive jitter buffer algorithms which may adjust the jitter buffer before or during a speech talkspurt, continuously throughout a VoIP call, to adapt to changing network conditions. The following shows an example of a jitter buffer algorithm proposed by Ramjee et al. [42], which follows a concept similar to the estimation of the TCP round trip time (RTT) and the retransmission timeout interval [37].
The algorithm attempts to maintain a running estimate of the mean and variation
of network delay, that is, d̂i and v̂i, seen up to the arrival of the ith packet. If packet i is the first packet of a talkspurt, its playout time Pi (see Fig. 6.11) is computed as:
Pi = Si + d̂i + 4 · v̂i
where the running estimates are updated on each packet arrival as:
d̂i = α · d̂i−1 + (1 − α) · ni
v̂i = α · v̂i−1 + (1 − α) · |d̂i − ni|
with α = 0.998002.
Please note that parameters such as α = 0.998002 in the above equations were obtained from trace data collected in research carried out over 15 years ago [42]. This jitter buffer algorithm and the optimized parameter may not be appropriate for VoIP applications in today's Internet. The algorithm shown above is only used to demonstrate how a jitter buffer and a jitter buffer algorithm work.
Currently there are no standards for jitter buffer algorithms. The implementation of jitter buffer algorithms in VoIP terminals/software is purely vendor dependent.
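To illustrate how such an adaptive algorithm behaves, the following Matlab sketch applies the running estimates of Ramjee et al. [42] to a short, hypothetical sequence of network delays (the delay values are assumptions for illustration only):

    % Adaptive playout estimates following Ramjee et al. [42] (illustration only).
    alpha = 0.998002;                    % smoothing factor from [42]
    n = [40 42 55 41 43 60 44];          % hypothetical one-way network delays (ms)

    d_hat = n(1);  v_hat = 0;            % initial mean delay and variation estimates
    for k = 2:length(n)
        d_hat = alpha * d_hat + (1 - alpha) * n(k);              % running mean delay
        v_hat = alpha * v_hat + (1 - alpha) * abs(d_hat - n(k)); % running variation
    end

    % Playout delay that would be applied to the first packet of the next talkspurt
    % (Pi = Si + d_hat + 4*v_hat, i.e. a buffer target of d_hat + 4*v_hat):
    fprintf('Estimated playout delay = %.2f ms\n', d_hat + 4 * v_hat);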
quality such as network packet loss, delay and jitter. The key QoE metric for VoIP applications is the Mean Opinion Score (MOS), which represents the overall speech quality provided during a VoIP call. This metric is generally obtained by averaging the overall quality scores given by a group of users (hence the term mean opinion). MOS is also used to represent perceived video quality for video call or video streaming applications, or perceived audiovisual quality when both voice and video are taken into account, as in video conference scenarios. In this chapter, we focus on MOS, the most widely used QoE metric for VoIP applications. Other QoE metrics for speech quality, such as intelligibility (how well a VoIP call, or its content, can be understood) or fidelity (how faithful the degraded speech is compared to the original), are not covered.
There are many factors affecting end-to-end voice quality in VoIP applications. They
are normally classified into two categories.
• Network Factors: for those factors occurring in the IP network, such as packet
loss, delay and jitter.
• Application Factors: for those factors occurring in the application devices/software (e.g., codec impairment, jitter buffer in VoIP terminals).
Figure 6.14 shows the factors affecting voice quality along an end-to-end transmission path for a VoIP application. At the sender, they include codec impairment (e.g., quantization error), coding delay (e.g., the time to form a speech frame) and packetization delay (e.g., putting two or three speech frames into a packet). In the IP network, they include network packet loss, delay and delay variation (jitter). At the receiver, they include depacketization delay (e.g., removing the header and getting the payload), buffer delay (the time spent in the playout buffer), buffer loss (due to packets arriving too late), and codec impairment and codec delay. From an end-to-end point of view, end-to-end packet loss may include network packet loss and late-arrival loss occurring at the receiver. End-to-end delay needs to include all delays from the sender, through the IP network, to the receiver, as shown in the figure.
At the application level, other impairment factors which are not shown in
Fig. 6.14 may include echo, sidetone (if analog network or analog circuit is in-
volved) and background noise. Application related mechanisms such as Forward
Error Correction (FEC), packet loss concealment (PLC), codec bitrate adaptation
and jitter buffer algorithms at either sender or receiver side may also affect the end-
to-end overall perceived voice quality.
VoIP quality measurement can be categorized into subjective and objective assessment. Objective assessment can be further divided into intrusive and non-intrusive methods. Figure 6.15 shows a conceptual diagram of the different speech quality assessment methods, which will be covered in detail in this section. We follow ITU-T Rec. P.800.1 [22] Mean Opinion Score (MOS) terminology to describe the relevant MOS scores obtained from subjective and objective tests.
jective tests. In general, subjective tests are carried out based on degraded speech
samples. The subjective test scores can be described as MOS-LQS (MOS-listening
quality subjective) for one-way listening quality. If a conversation is involved,
then MOS-CQS (MOS-conversational quality subjective) is used instead. For ob-
jective tests, if a reference speech is inserted into the system/network and the
measurement is carried out by comparing the reference speech with the degraded
speech obtained at the other end of the tested system/network, this is called in-
trusive measurement. As shown in the figure, MOS-LQO (MOS-listening quality
objective) is obtained with the typical example of PESQ (Perceptual Evaluation of
Speech Quality) [17]. For non-intrusive measurement, there is no reference speech
inserted into the system/network. Instead, the measurement is carried out by ei-
ther utilizing/analyzing the captured IP packet headers (parameter-based) or ana-
lyzing the degraded speech signal itself (single-end signal based). For parameter-
based methods, a typical example is ITU-T Rec. G.107 (E-model) [32] for predict-
ing conversational speech quality, MOS-CQE (MOS-Conversational Quality Esti-
mated, with delay and echo impairment considered) or listening-only speech qual-
ity, MOS-LQE (MOS-Listening Quality Estimated) from network related parame-
ters such as packet loss, delay and jitter. Methods to predict listening-only speech
quality from network parameters are also summarized in ITU-T Rec. P.564 [23].
In some applications, parameter-based speech quality prediction models are embedded into the end device (e.g., located just after the jitter buffer, so that late-arrival loss and jitter buffer delay are taken into account when calculating relevant parameters such as packet loss and delay). For the signal-based method, a typical example is 3SQM (Single Sided Speech Quality Measure) following ITU-T Rec. P.563 [21], which predicts listening-only speech quality by analyzing the degraded speech signal alone. In the following sections, we give a detailed description of the comparison-based method (PESQ) and the parameter-based method (E-model), which are the two most widely used objective quality assessment methods for VoIP applications in industry.
A further comparison of all voice quality measurement methods is shown in Fig. 6.16. Please note that all objective measurement methods are calibrated against subjective test results. In other words, objective measurement methods are designed solely to predict how subjects would assess the quality of the tested system/network/device. User experience, or user perceived quality, is always the final judgement for quality assessment.
Subjective voice quality tests are carried out by asking people to grade the qual-
ity of speech samples under controlled conditions (e.g., in a sound-proof room).
speech codecs such as ITU-T Rec. G.728 [10] and G.729 [11]. In these cases, the impairments due to speech compression are normally consistent throughout a test speech sample. The MOS score given at the end of a test sample reflects the overall codec quality for the test sequence. However, in VoIP applications, impairments from the IP network, such as packet loss, have an inconsistent nature compared with codec compression impairment. Research has shown that the perceived quality of a speech sample varies with the location of impairments such as packet loss. Subjects tend to give a lower MOS score when the impairments occur near the end of the test sample than when they occur early in the sample. This is called the “Recency Effect”, as humans tend to remember the last few things better than those in the middle or at the beginning.
Figure 6.17 depicts the test results from an experiment described in ANSI T1A1.7/98-031 [3], in which noise bursts were introduced at the beginning, middle and end of a 60-second test call. It shows that subjects gave the lowest MOS score when bursts occurred at the end of the call. Similar recency effects were also observed in other subjective tests where noise bursts were replaced with bursts of packet loss [16]. Due to the nature of IP networks, the impact of network impairments such as packet loss on speech quality is inconsistent during a VoIP call.
In order to capture this inconsistency for time-varying speech quality, instan-
taneous subjective speech quality measurement was developed in addition to the
overall MOS score tests as in ITU-T Rec. P.800 [12]. In EURESCOM Project [1], a
continuous rating of a 1-minute sample is proposed to assess quality for voice signal
over the Internet and UMTS networks. Instead of voting at the end of a test sentence
(as in ITU-T Rec. P.800), a continuous voting is carried out at several segments
of the test sentence to obtain a more accurate assessment of voice quality. Further
in ITU-T Rec. P.880 [19], continuous evaluation of time varying speech quality
was standardized, in which both instantaneous perceived quality (perceived at any
instant of a speech sequence) and the overall perceived quality (perceived at the
end of the speech sequence) are required to be tested for time-varying speech samples. This method is called Continuous Evaluation of Time Varying Speech Quality (CETVSQ). Instead of the short speech sequences (e.g., 8 s) used in P.800, longer test sequences (between 45 seconds and 3 minutes) are recommended. An adequate number of naive listeners (at least 24) shall participate in the test. An appropriate slider bar should be used to assist the continuous evaluation of speech quality during a test sample, according to the continuous quality scale defined in P.880 and shown in Fig. 6.18. At the end of each test sequence, subjects are still asked to rate its overall speech quality according to the ACR scale in P.800. Overall, the continuous evaluation of time-varying speech quality test method is more suitable for subjective assessment of speech quality over IP or mobile networks, where network packet loss or link bit errors are inevitable.
of the network or system under test. Figure 6.19 shows an example of the reference and the degraded speech signals. The degraded speech signal has experienced some packet losses, as indicated in the figure. The example shown is from the G.729 codec [11], which has a built-in packet loss concealment mechanism and has filled in the missing parts based on previous packets' information as part of the packet loss concealment process.
There are a variety of intrusive objective speech quality measurement methods,
which are normally classified into three categories.
1. Time Domain Measures: based on time-domain signal processing and analysis (i.e., analysing the time-domain speech waveform as shown in Fig. 6.19). Typical methods include Signal-to-Noise Ratio (SNR) and Segmental Signal-to-Noise Ratio (SNRseg) analysis. These methods are very simple to implement, but are not suitable for estimating the quality of low bit rate codecs (which are normally not waveform based) or of voice over IP networks.
2. Spectral Domain Measures: based on spectral-domain signal analysis, such as
the Linear Predictive Coding (LPC) parameter distance measures and the cep-
stral distance measure. These distortion measures are closely related to speech
codec design and use the parameters of speech production models. Their per-
formance is limited by the constraints of the speech production models used in
codecs.
3. Perceptual Domain Measures: based on perceptual domain measures which use models of human auditory perception. These models transform the speech signal into a perceptually relevant domain, such as the bark spectrum or loudness domain, and incorporate human auditory models. Perceptual based models provide the
unique problem in VoIP applications which has posed a real challenge to traditional
time-aligned objective quality evaluation methods. The “insert” or “slip” of some
speech segments for jitter buffer adjustment normally occurs during the silence pe-
riod of a call. If a jitter buffer adjustment is also carried out during mid-talkspurt, the adjustment itself may also affect voice quality.
After the development of ITU-T Rec. P.862 (PESQ) algorithm, two extensions of
PESQ are also standardized in ITU-T Rec. P.862.1 [18] for mapping from raw PESQ
score (ranging from −0.5 to 4.5) to MOS-LQO (ranging from 1 to 5) and ITU-T
Rec. P.862.2 [26] for mapping from raw PESQ score (narrow-band) to wideband
PESQ. The mapping function from PESQ to MOS-LQO is defined in Eq. (6.14).
y = 0.999 + (4.999 − 0.999) / (1 + e^(−1.4945·x + 4.6607))   (6.14)
where x is the raw PESQ MOS score, and y is MOS-LQO score after mapping.
Normal VoIP applications default to narrow-band (300–3400 Hz) speech. When wideband telephony applications/systems (50–7000 Hz) are considered, the raw PESQ score needs to be mapped to a PESQ-WB (PESQ-WideBand) score using Eq. (6.15):
y = 0.999 + (4.999 − 0.999) / (1 + e^(−1.3669·x + 3.8224))   (6.15)
where x is the raw PESQ MOS score, and y is the PESQ-WB MOS value after
mapping.
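The two mappings are straightforward to apply; the following Matlab sketch converts a hypothetical raw PESQ score to MOS-LQO (Eq. (6.14)) and to the PESQ-WB scale (Eq. (6.15)):

    % Mapping a raw PESQ score to MOS-LQO (Eq. (6.14)) and PESQ-WB (Eq. (6.15)).
    x = 3.2;                                         % hypothetical raw PESQ score
    mos_lqo = 0.999 + (4.999 - 0.999) / (1 + exp(-1.4945 * x + 4.6607));  % Eq. (6.14)
    pesq_wb = 0.999 + (4.999 - 0.999) / (1 + exp(-1.3669 * x + 3.8224));  % Eq. (6.15)
    fprintf('Raw PESQ %.2f -> MOS-LQO %.2f, PESQ-WB %.2f\n', x, mos_lqo, pesq_wb);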
ITU-T P.863 [34], defined in 2011, is an objective speech quality prediction method for both narrowband (300–3400 Hz) and super-wideband (50–14000 Hz)
speech and is regarded as the next generation speech quality assessment technology
suitable for fixed, mobile and IP-based networks. It predicts listening-only speech
quality in terms of MOS. The predicted speech quality for its narrowband and super-
wideband mode is expressed as MOS-LQOn (MOS Listening Quality Objective
for Narrowband) and MOS-LQOsw (MOS Listening Quality Objective for Super-
wideband), respectively. P.863 is suitable for all the narrow and wideband speech
codecs listed in Table 2.2 in Chap. 2 and can be applied for applications over GSM,
UMTS, CDMA, VoIP, video telephony and TETRA emergency communications
networks.
R = R0 − Is − Id − Ie-eff + A (6.16)
where
R0 : S/N at 0 dBr point (groups the effects of noise)
Is : Impairments that occur simultaneously with speech (e.g. quantization
noise, received speech level and sidetone level)
Id : Impairments that are delayed with respect to speech (e.g. talker/listener
echo and absolute delay)
Ie-eff : Effective equipment impairment (e.g. codecs, packet loss and jitter)
A: Advantage factor or expectation factor (e.g. 0 for wireline and 10 for GSM)
ITU-T Rec. G.107 has gone through seven different versions in the past ten years, reflecting the continuous development of the model for modern applications such as VoIP. For example, the Ie model within the E-model has evolved from a simple random loss model and a 2-state Markov model to a more complicated 4-state Markov model, which takes into account bursty losses and gap/burst states (as discussed in Sect. 6.1.3) to reflect real packet loss characteristics in IP networks. Efforts are still ongoing to further improve the E-model for applications in modern fixed/mobile networks.
ITU-T Rec. G.109 [15] defines the speech quality classes with the Rating (R), as
illustrated in Table 6.4. A rating below 50 indicates unacceptable quality.
The score obtained from the E-model is referred to as MOS-CQE (MOS conversa-
tional quality estimated). This MOS score can be converted from R-value by using
Eq. (6.17) according to ITU-T Rec. G.107 [32], which is also depicted in Fig. 6.23.
It is clear that when R is below 50, the MOS score is below 2.6, indicating low voice quality. When R is above 80, the MOS score is over 4, which indicates high voice quality (reaching the “toll quality” category (MOS: 4.0–4.5) used in traditional PSTN networks). A detailed mapping of R vs. MOS for R above 50 is listed in Table 6.5.
MOS = 1   for R ≤ 0
MOS = 1 + 0.035 · R + R · (R − 60) · (100 − R) · 7 · 10^(−6)   for 0 < R < 100   (6.17)
MOS = 4.5   for R ≥ 100
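Eq. (6.17) is easy to implement directly; a minimal Matlab sketch of the R-to-MOS conversion is shown below (for example, r_to_mos(80) returns about 4.0 and r_to_mos(50) about 2.6, consistent with the values discussed above):

    function mos = r_to_mos(R)
    % Convert an E-model rating R to MOS according to Eq. (6.17).
        if R <= 0
            mos = 1;
        elseif R >= 100
            mos = 4.5;
        else
            mos = 1 + 0.035 * R + R * (R - 60) * (100 - R) * 7e-6;
        end
    end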
Effective equipment impairment factor Ie-eff can be calculated in Eq. (6.21) ac-
cording to ITU-T G.107 [32].
Ie-eff = Ie + (95 − Ie) · Ppl / (Ppl/BurstR + Bpl)   (6.21)
Ie is the equipment impairment factor at zero packet loss which reflects purely
codec impairment. Bpl is defined as the packet-loss robustness factor which is also
codec-specific. As defined in ITU-T Rec. G.113 [25], Ie = 0 for G.711 PCM codec
at 64 kb/s, which is set as a reference point (zero codec impairment). All other
codecs have higher than zero Ie value (e.g. Ie = 10 for G.729 at 8 kb/s, Ie = 15 for
G.723.1 at 6.3 kb/s). Normally the lower the codec bit rate, the higher the equipment
impairment Ie value for the codec. The Bpl value reflects the codec's built-in packet loss concealment ability. The value is not only codec-dependent, but also packet-size-dependent (i.e., it depends on how many speech frames are in a
packet). According to G.113, Bpl = 16.1 for G.723.1+VAD (Voice Activity De-
tection (VAD) is activated) with packet size of 30 ms (only one speech frame in a
packet). Bpl = 19.0 for G.729A+VAD (VAD activated) with packet size of 20 ms (2
speech frames in a packet). Ppl is the average packet-loss rate (in %).
BurstR is the so-called burst ratio. When packet loss is random, BurstR = 1; and
when packet loss is bursty, BurstR > 1.
In a 2-state Markov model as shown in Fig. 6.5, BurstR can be calculated as:
BurstR = 1/(p + q) = (Ppl/100)/p = (1 − Ppl/100)/q   (6.22)
Please note that p is the transition probability from the “No Loss” to the “Loss” state and q is the transition probability from the “Loss” to the “No Loss” state, as shown in Fig. 6.5. The equivalence of the forms using Ppl and p and using p and q in Eq. (6.22) can be easily derived from Eq. (6.3), since Ppl/100 = π1 = p/(p + q).
Overall, effective equipment impairment factor Ie-eff can be obtained when codec
type and packet size are known and network packet loss parameters (in a 2-state
Markov model) have been derived.
The E-model R-factor can be calculated from Eq. (6.18) after Id and Ie-eff are
derived. Further MOS can be obtained from the R-factor according to Eq. (6.17).
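Putting these pieces together, the following Matlab sketch derives Ie-eff and an R-factor from assumed 2-state Markov loss parameters. The codec values (Ie = 10, Bpl = 19.0 for G.729A+VAD with 20 ms packets) are those quoted above; the Id value and the commonly used simplified form R = 93.2 − Id − Ie-eff are assumptions for illustration only:

    % Effective equipment impairment and R-factor from 2-state Markov loss parameters.
    p = 0.01;  q = 0.77;                 % hypothetical transition probabilities (Fig. 6.5)
    Ppl    = 100 * p / (p + q);          % average packet loss rate (%), from Eq. (6.3)
    BurstR = 1 / (p + q);                % burst ratio, Eq. (6.22)

    Ie  = 10;   Bpl = 19.0;              % G.729(A)+VAD values quoted in the text
    Ie_eff = Ie + (95 - Ie) * Ppl / (Ppl / BurstR + Bpl);   % Eq. (6.21)

    Id = 5;                              % assumed delay impairment value
    R  = 93.2 - Id - Ie_eff;             % assumed simplified R-factor (default R0 - Is)
    fprintf('Ppl = %.2f %%, BurstR = %.2f, Ie-eff = %.1f, R = %.1f\n', ...
            Ppl, BurstR, Ie_eff, R);
    % MOS-CQE then follows from R via Eq. (6.17), e.g. using the r_to_mos sketch above.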
of an assessment, the test session (which should last up to half an hour, with a random presentation order for the test video sequences) and the final presentation of the subjective test results (e.g., calculating the mean score and 95 % confidence interval and removing inconsistent observers).
Depending on how the evaluation or quality voting is carried out, subjective test methods can be either a standalone one-vote test (e.g., a vote given at the end of a test session for a test video sequence) or a continuous test (e.g., the viewer moves a voting scale bar and indicates the video quality continuously during a test session for a test video sequence). The latter is more appropriate for assessing video quality in VoIP applications, as network impairments such as packet loss and jitter are time-varying. Their impact on video quality also depends on the location of these impairments in relation to the video content or scenes. Typ-
ical subjective test methods include Absolute Category Rating (ACR), Absolute
Category Rating with Hidden Reference (ACR-HR), Degradation Category Rating
(DCR), Pair Comparison Method (PC), Double-stimulus continuous quality-scale
(DSCQS), Single stimulus continuous quality evaluation (SSCQE) and simultane-
ous double stimulus for continuous evaluation (SDSCE) methods, which are listed
and explained below.
• Absolute Category Rating (ACR) method: also called single stimulus (SS)
method, where only the degraded video sequence is shown to the viewer for
quality evaluation. The five-scale quality rating for ACR is 5 (Excellent), 4
(Good), 3 (Fair), 2 (Poor) and 1 (Bad) (similar as the one shown in Table 6.2 for
speech quality evaluation).
• Absolute Category Rating with Hidden Reference (ACR-HR) method: includes a reference version of each test video sequence as one of the test stimuli (hence the term hidden reference). Differential viewer scores (DV) are calculated as in Table 6.6 (comparing this table with Table 6.3 for voice, the only difference is the change of the word ‘inaudible’ for the voice/audio condition to ‘imperceptible’ for the video condition).
• Pair Comparison (PC) method: a pair of test video clips is presented to the viewer, who indicates his/her preference (e.g., if the viewer prefers the 1st video sequence, he/she will tick the box for the 1st one, and vice versa).
• Single stimulus continuous quality evaluation (SSCQE) method: the viewer is asked to provide a continuous quality assessment using a slider ranging from 0 (Bad) to 100 (Excellent). The final results are mapped to a single quality metric such as a 5-level MOS score. The test video sequence is typically of 20–30 minutes duration.
• Double-stimulus continuous quality-scale (DSCQS) method: viewers are asked
to assess the video quality for a pair of video clips including both the reference
and the degraded video clips. The degraded video clips may include hidden
reference pictures. Test video sequences are short (about 10 seconds). Pairs of
video clips are normally shown twice and viewers are asked to vote during the second presentation for both video clips using a continuous quality-scale
as shown in Fig. 6.25.
• Simultaneous double stimulus for a continuous evaluation (SDSCE) method:
viewers are asked to view two video clips (one reference and one degraded,
normally displayed side-by-side in one monitor) at the same time. Viewers are
requested to concentrate on viewing the differences between two video clips
and judge the fidelity of the test video to the reference one by moving the slider
continuously (100 for the highest fidelity and 0 for the lowest fidelity) during
a test session. The length of the test sequence can be longer for SDSCE when
compared with that of DSCQS.
where M × N is the width × height (in pixels) of the image, and F and f are the luminance components of the original and the degraded images (compared pixel by pixel). The number of bits per pixel (luminance component) is normally 8, which results in a Vpeak of 255 (this is where the name ‘Peak’ Signal-to-Noise Ratio comes from).
PSNR between the reference and degraded video sequences can be obtained from
the PSNR value image by image (or frame by frame) and can be expressed by an
average PSNR value among all frames considered.
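A minimal Matlab sketch of this frame-by-frame PSNR computation (assuming two grayscale luminance frames of equal size read from hypothetical files) is:

    % Frame-by-frame PSNR between a reference and a degraded luminance frame.
    % Assumes grayscale (luminance-only) frames of equal size, 8 bits per pixel.
    F = double(imread('ref_frame.png'));     % hypothetical reference frame
    f = double(imread('deg_frame.png'));     % hypothetical degraded frame

    Vpeak = 255;
    mse     = mean((F(:) - f(:)).^2);         % mean squared error over all pixels
    psnr_dB = 10 * log10(Vpeak^2 / mse);      % PSNR in dB

    fprintf('PSNR = %.2f dB\n', psnr_dB);
    % The sequence-level PSNR is then the average of the per-frame values.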
PSNR expressed in dB can be mapped to the MOS score of video quality accord-
ing to [36] and is shown in Table 6.7.
Other popular FR video quality measurement methods are the Structural Simi-
larity Index (SSIM) [46] and Video Quality Metric (VQM) from National Telecom-
munications and Information Administration (NTIA) [41], also defined in ITU-T
J.144 [20]. These FR models assume that the reference and degraded video se-
quences are properly aligned at both spatial (pixel by pixel) and temporal (frame
by frame) domains. Perceptual video quality assessment is based on pixel-by-pixel
and frame-by-frame comparison between the reference and the degraded video sig-
nals. If metrics such as PSNR and SSIM are applied directly to reference and degraded video clips with spatial or temporal misalignment, poor PSNR or SSIM results will be obtained. In test environments such as Internet video applications, video perceptual quality assessment algorithms therefore need to integrate spatial and temporal alignment mechanisms into the perceptual video quality model.
It has to be mentioned that Table 6.7 gives only a rough mapping between PSNR and the MOS score. Research such as [47] has demonstrated that degraded video clips with the same PSNR value may have different perceived video quality. In the Phase II Full-Reference model test for Standard-Definition (SD) TV carried out by the Video Quality Experts Group (VQEG)2 (Phase II of VQEG's FRTV test), PSNR achieved only about 70 % correlation with the subjective test results (MOS) [48].
Much effort has been put into the research and development of better FR models for predicting perceived video quality more accurately. VQEG has also conducted several projects on Full-Reference (FR) video quality assessment, including FR-TV Phase I and FR-TV Phase II, aimed at evaluating FR video quality assessment for SD TV applications with a focus on MPEG-2 compression for digital TV broadcasting, and Multimedia (MM) Phase I and MM Phase II, aimed at multimedia applications such as Internet multimedia streaming, video telephony and conferencing, and mobile video streaming. The former work resulted in the specification of ITU-T J.144 (2004) [20], which included four perceptual video quality measurement methods from British Telecom (BT)3 from the UK; Yonsei University4 /SK Telecom5 /Radio Research Laboratory (RRL) from Korea; the Telecommunications Research and Development Center (CPqD) from Brazil;6 and the National Telecommunications and Information Administration (NTIA)7 from the USA. The Video Quality Metric (VQM) software developed by NTIA is also downloadable (royalty free) from the NTIA/ITS website.8 The MM Phase I produced four FR models for multimedia applications, which are specified in ITU-T J.247 [28]; the models are from NTT, Japan;9 Opticom, Germany;10 Psytechnics, UK;11 and Yonsei
2 http://www.vqeg.org
3 http://www.bt.com
4 http://www.yonsei.ac.kr
5 http://www.sktelecom.com
6 http://www.cpqd.com.br
7 http://www.ntia.doc.gov
8 http://www.its.bldrdoc.gov/resources/video-quality-research/software.aspx
9 http://www.ntt.com
10 http://www.opticom.de
11 http://www.psytechnics.com
for video quality prediction, which is easier to be transmitted to the receiver for
quality comparison. ITU-T J.246 (2008) [29] defines RR models for multimedia ap-
plications in which both temporal and spatial alignment processes are applied and
test conditions are similar to those set in ITU-T J.247 (e.g. supporting video resolutions of QCIF, CIF and VGA and video frame rates from 5 to 30 fps). The RR
model proposed by Yonsei University, Korea is also included in ITU-T J.246 An-
nex A. ITU-T J.249 (2010) [33] specified RR models for SDTV (Standard Defini-
tion Television) applications and included three RR models from Yonsei University
of Korea, NEC of Japan13 and National Telecommunications and Information Ad-
ministration/Institute for Telecommunication Sciences (NTIA/ITS)14 of USA. All
three models contain spatial and temporal alignment process and gain adjustment
between the reference and degraded video sequences.
13 http://www.nec.com
14 http://www.its.bldrdoc.gov
6.7 Illustrative Worked Examples
6.7.1 Question 1
Explain, with the aid of a suitable block diagram, how speech signals are trans-
ported over IP networks in real-time. Your answer should highlight the main infor-
mation/signal processing operations that take place in transporting speech from the
speaker to the listener and how they affect voice quality.
Indicate on your diagram the main impairment factors in VoIP systems.
6.7.2 Question 2
successfully delivered packets. Calculate the average packet loss rate and the mean
burst loss length for this trace data.
So the average packet loss rate is 1.4 % and the mean burst loss length is 1.39 for
this trace data.
6.7.3 Question 3
Explain, with the aid of a suitable block diagram, how Full-Reference (FR) and No-
Reference (NR) video quality assessment models work? In video quality monitoring
for VoIP video call applications, which model (FR or NR) should we use? Why?
SOLUTION: The Full-Reference (FR) and No-Reference (NR) video quality assessment models are shown in Fig. 6.30. Figure (a) shows the Full-Reference model, where both the reference video and the degraded video signals are input to the FR model to predict video quality. For the NR model (see figures (b) and (c)), no reference video is involved in the video quality prediction. Instead, only the degraded video signal (figure (b)) or parameters derived from the transmission system (e.g., parameters derived from IP packet headers or from TS streams, figure (c)) are used for the video quality prediction. It is also possible that the degraded video signal and parameters derived from the transmission system are both used for the video quality prediction (this is called a hybrid video quality model).
In video quality monitoring for VoIP video call applications, the no-reference model is normally used. This is because no reference video signal is injected into the tested system/network, and only the received degraded video signal or the received video packets/bitstreams are available for video quality prediction. The no-reference model can be used for real-time video quality monitoring of operational systems/networks. An NR model can also be incorporated into terminals (such as mobile phones and TV set-top boxes) for real-time video quality monitoring of video call or video streaming applications.
Fig. 6.30 Full-Reference (FR) and No-Reference (NR) video quality assessment
6.8 Summary
In this chapter, we have discussed the concept of Quality of Service (QoS) and
Quality of Experience (QoE) mainly around VoIP applications. QoS is generally
used to express network performance, using metrics such as packet loss, delay and
jitter. QoE is normally used to express user perceived quality for a provided ser-
vice such as VoIP and usually uses Mean Opinion Score (MOS) to represent an
overall quality for voice, video or audiovisual applications. In the chapter, we have
explained the QoS metrics (i.e., loss, delay and jitter), from their definitions and characteristics to a practical approach on how to obtain them. We have presented in detail QoE measurements for VoIP applications, ranging from subjective tests (ACR/DCR) to intrusive/non-intrusive objective measurements (e.g., PESQ and the E-model), together with practical approaches on how to use them for VoIP applications. We have
also presented subjective and objective video quality measurement. For subjective
video quality measurement, we have discussed ACR, ACR-HR, DCR, PC, SSCQE,
DSCQS and SDSCE methods. For objective video quality measurement, we have
illustrated FR, RR and NR quality measurement and summarised standardisation
efforts from VQEG and ITU-T on FR/RR/NR models.
6.9 Problems
17. Illustrate and describe briefly the FR, RR and NR video quality measurement
methods. What is the difference between the NR bitstream-model and the NR
hybrid-model?
18. Why do we need to develop non-intrusive (or no reference) speech/video qual-
ity assessment model?
References
1. AQUAVIT—assessment of quality for audio-visual signals over Internet and UMTS—
Deliverable 2: Methodology for subjective audio-visual quality evaluation in mobile and
IP networks. EURESCOM Project P905-PF (2000)
2. Allnatt J (1975) Subjective rating and apparent magnitude. Int J Man-Mach Stud 7:801–816
3. ANSI (1998) Testing the quality of connections having time varying impairments. ANSI
T1A1.7/98-031
4. Clark A (2001) Modeling the effects of burst packet loss and recency on subjective voice
quality. In: Proceedings of the 2nd IP-telephony workshop, Columbia University, New York,
USA, pp 123–127
5. Cole RG, Rosenbluth JH (2001) Voice over IP performance monitoring. Comput Commun
Rev 31(2):9–24
6. Ellis M, Perkins C (2010) Packet loss characteristics of IPTV-like traffic on residential links.
In: 7th IEEE, consumer communications and networking conference (CCNC), pp 1–5
7. ETSI (1996) Speech communication quality from mouth to ear of 3.1 kHz handset telephony
across networks. Tech. report. ETSI ETR250
8. ITU-R (2012) Methodology for the subjective assessment of the quality of televi-
sion pictures. ITU-R Recommendation BT.500-13. http://www.itu.int/rec/R-REC-BT.500-
13-201201-I/en
9. ITU-T (1988) Quality of service and dependability vocabulary. ITU-T Recommendation
E.800
10. ITU-T (1992) Coding of speech at 16 kbit/s using low-delay code excited linear prediction.
ITU-T Recommendation G.728
11. ITU-T (1996) Coding of speech at 8 kbit/s using Conjugate-Structure Algebraic-Code-
Excited Linear-Prediction (CS-ACELP). ITU-T Recommendation G.729
12. ITU-T (1996) Methods for subjective determination of transmission quality. ITU-T Recom-
mendation P.800
13. ITU-T (1996) Objective quality measurement of telephone-band (300–3400 Hz) speech
codecs. ITU-T Recommendation P.861
14. ITU-T (1996) Subjective performance assessment of telephone-band and wideband digital
codecs. ITU-T Recommendation P.830
15. ITU-T (1999) Definition of categories of speech transmission quality. ITU-T Recommenda-
tion G.109
16. ITU-T (2000) Study of the relationship between instantaneous and overall subjective speech
quality for time-varying quality speech sequences: influence of a recency effect (Delayed
Contributions 9–18 May 2000). ITU-T Contribution COM12-D139
17. ITU-T (2001) Perceptual evaluation of speech quality (PESQ), an objective method for end-
to-end speech quality assessment of narrow-band telephone networks and speech codec.
ITU-T Recommendation P.862
18. ITU-T (2003) Mapping function for transforming P.862 raw result scores to MOS-LQO.
ITU-T Recommendation P.862.1
19. ITU-T (2004) Continuous evaluation of time varying speech quality. ITU-T Recommenda-
tion P.880
20. ITU-T (2004) Objective perceptual video quality measurement techniques for digital cable
television in the presence of a full reference. ITU-T Recommendation J.144
21. ITU-T (2004) Single-ended method for objective speech quality assessment in narrow-band
telephony applications. ITU-T Recommendation P.563
22. ITU-T (2006) Mean opinion score (MOS) terminology. ITU-T Recommendation P.800
23. ITU-T (2007) Conformance testing for narrowband voice over IP transmission quality as-
sessment models. ITU-T Recommendation P.564
24. ITU-T (2007) Opinion model for video-telephony applications. ITU-T Recommendation
G.1070
25. ITU-T (2007) Transmission impairments due to speech processing. ITU-T Recommendation
G.113
26. ITU-T (2007) Wideband extension to Recommendation P.862 for the assessment of wide-
band telephone networks and speech codecs. ITU-T Recommendation P.862.2
27. ITU-T (2008) Full reference and reduced reference calibration methods for video transmis-
sion systems with constant misalignment of spatial and temporal domains with constant gain
and offset. ITU-T Recommendation J.244
28. ITU-T (2008) Objective perceptual multimedia video quality measurement in the presence
of a full reference. ITU-T Recommendation J.247
29. ITU-T (2008) Perceptual visual quality measurement techniques for multimedia services
over digital cable television networks in the presence of a reduced bandwidth reference.
ITU-T Recommendation J.246
30. ITU-T (2008) Subjective video quality assessment methods for multimedia applications.
ITU-T Recommendation P.910. http://www.itu.int/rec/T-REC-P.910-200804-I
31. ITU-T (2008) Vocabulary for performance and quality of service, Amendment 2: New
definitions for inclusion in Recommendation ITU-T P.10/G.100. ITU-T Recommendation
G.100
32. ITU-T (2009) The E-model, a computational model for use in transmission planning. ITU-T
Recommendation G.107. http://www.itu.int/rec/T-REC-G.107
33. ITU-T (2010) Perceptual video quality measurement techniques for digital cable television
in the presence of a reduced reference. ITU-T Recommendation J.249
34. ITU-T (2011) Perceptual objective listening quality assessment. ITU-T Recommendation
P.863. http://www.itu.int/rec/T-REC-P.863-201101-I
35. ITU-T (2011) The E-model, a computational model for use in transmission planning. ITU-T
Recommendation G.107. http://www.itu.int/rec/T-REC-G.107
36. Klaue J, Rathke B, Wolisz A (2003) EvalVid—a framework for video transmission and
quality evaluation. In: Proc of the 13th international conference on modelling techniques
and tools for computer performance evaluation
37. Kurose JF, Ross KW (2010) Computer networking, a top–down approach, 5th edn. Pearson
Education, Boston ISBN-10:0-13-136548-7
38. Ma L, Li S, Zhang F, Ngan KN (2011) Reduced-reference image quality assessment using
reorganized DCT-based image representation. IEEE Trans Multimed 13(4):824–829
39. Mkwawa IH, Jammeh E, et al (2010) Feedback-free early VoIP quality adaptation scheme
in next generation networks. In: IEEE Globecom, pp 1–5
40. Mohamed S, Rubino G (2002) A study of real-time packet video quality using random
neural networks. IEEE Trans Circuits Syst Video Technol 12(12):1071–1083
41. Pinson MH, Wolf S (2004) A new standardized method for objectively measuring video
quality. IEEE Trans Broadcast 50(3):312–322
42. Ramjee R, Kurose J, et al (1994) Adaptive playout mechanisms for packetized audio appli-
cations in wide-area networks. In: Proc of IEEE Infocom, pp 680–688
43. Reibman AR, Vaishampayan VA, Sermadevi Y (2004) Quality monitoring of video over a
packet network. IEEE Trans Multimed 6(2):327–334
44. Sanneck H (2000) Packet loss recovery and control for voice transmission over the Internet.
PhD Dissertation, Technical University of Berlin
45. Schulzrinne H, Casner S (2003) RTP: a transport protocol for real-time applications. IETF
RFC 3550
46. Wang Z, Lu L, Bovik AC (2004) Video quality assessment based on structural distortion
measurement. Signal Process Image Commun 19(2):121–132
47. Winkler S, Mohandas P (2008) The evolution of video quality measurement: from PSNR to
hybrid metrics. IEEE Trans Broadcast 54(3):660–668
48. Winkler S, Mohandas P (2009) Video quality measurement standards—current status and trends. In: ICICS 2009
49. Yajnik M, Kurose J, Towsley D (1995) Packet loss correlation in the MBone multicast network: experimental measurements and Markov chain models. Technical report, University of Massachusetts, UM-CS-1995-115
7 IMS and Mobile VoIP
The Internet has evolved from a small network linking a few research centres into a massive network with billions of computers. The reason behind the growth of the Internet has been its ability to provide very useful services such as the World Wide Web, email, instant messaging, VoIP and video conferencing. On the other hand, cellular networks have experienced dramatic growth over the years. This growth was due not only to services such as voice and video calls and short messaging, but also to the fact that cellular network users can access the network from virtually everywhere. These facts prompted 3GPP to come up with the idea of the
IP Multimedia Subsystem. The IP Multimedia Subsystem aims at merging cellular
networks and the Internet, two of the most successful infrastructures in telecommu-
nication. By merging the two infrastructures, the IP Multimedia Subsystem will be
able to provide ubiquitous cellular access to all services that are provided by the
Internet.
Through the data connection, cellular network users can already access Internet services. So why do we need IMS, if the idea of IMS is to offer Internet services over cellular networks and cellular networks already have full access to Internet services through the data connection? The answer is that IMS goes beyond the idea
The IMS architecture supports a range of services enabled by the Session Initiation Protocol (SIP). IMS consists of all the core network elements providing IP multimedia services, such as audio, video, text and chat, over the packet switched domain of the core network. The overall network is made up of two parts:
1. The access network, which provides the wireless access points.
2. The core network, which provides service control and the fixed connectivity to other access points, to other fixed networks and to service resources such as databases and content delivery.
The IMS architecture is capable of supporting several application servers providing conventional telephony and non-telephony services such as instant messaging, push to talk over cellular (PoC), multimedia streaming and multimedia messaging (cf., Fig. 7.1).
The IMS services architecture consists of logical functions which are divided
into three layers.
the appropriate routing of the SIP signalling messages to application servers. The
CSCFs work together with the transport and endpoint layer to provision the quality
of experience across all services.
The session control layer also includes the Home Subscriber Server (HSS) database, in which service profiles for each IMS user can be stored and retrieved. The end users' service profiles keep all of the user's service data and preferences in a centralised manner. User profiles include IMS users' current registration information such as IP address, roaming, telephony and instant messaging services, and voicemail box options. The centralised approach to IMS user profiles enables applications to share the information and create unified presence information, blended services and personal directories. This approach also simplifies the administration of user information by ensuring that consistent views of active IMS subscribers are maintained across all services.
The session control layer also contains the Media Gateway Control Function (MGCF) [11], which, in conjunction with the SIP signalling protocol, controls the media gateways using protocols such as H.248 [18].
capable of supporting various telephony and non-telephony application servers. For instance, SIP-based application servers have been mainly developed to run telephony and Instant Messaging services.
Telephony Application Server The IMS architecture can support several application servers that implement SIP telephony services. The Telephony Application Server (TAS) is a Back-to-Back User Agent (B2BUA) that maintains the SIP call state. The TAS comprises service logic which provides basic call processing services such as digit analysis, call setup, routing, call waiting and call forwarding. The TAS's service logic can also invoke media servers in order to support SIP call progress announcements and tones. If SIP calls originate or terminate on the PSTN, then the TAS will send SIP signalling messages to the MGCF so that the media gateways convert the PSTN voice bit stream into an RTP stream and direct the RTP stream to the IP address of the callee/caller user equipment.
The TAS implements Advanced Intelligent Network (AIN) call trigger points. If a SIP call progresses to a trigger point, then the TAS will suspend call processing and check the subscriber profile to find out whether any additional service is required for the call at that moment. If the subscriber profile requires a service from an application server, then the TAS will format an IMS Service Control (ISC) message and hand over SIP call control to the required application server.
For instance, one TAS can serve centrex business services such as speed dialling, call transfer, direct call, call park, divert, hold and barring, while another TAS can provide services such as call back, last number redial, reminder calls and number display. Multiple TASs can inter-operate via SIP signalling to perform several services for different IMS UEs.
Open Service Access Gateway The IMS architecture has the flexibility to incorporate additional services; this can be done by integrating them with SIP based application servers. For instance, an organisation might want to initiate a call or an instant message from its back office delivery software when an order is a few minutes away from being delivered to an address. This can be done by retrieving the location information of the courier. This is made possible in IMS by the well defined Parlay API. The API hides complex signalling protocols such as SS7 [7], CAMEL [5], ANSI-41, SIP and ISDN from non-telecommunication application developers and makes it simple to interact with telephony services such as IM and voice calls. 3GPP and ETSI have worked closely with the Parlay forum to define the Parlay API for accessing telephony networks. The interconnection between SIP and the Parlay API is defined in the Open Services Access Gateway (OSAGW). The 3GPP IMS architecture defines the OSAGW as part of the application server layer.
The IMS architecture through the SIP signalling can support the provision of ad-
vanced broadband multimedia services such as Internet Protocol Television (IPTV),
video on demand (VoD) and video telephony.
Through the use of application servers, service providers have the flexibility to enable application developer partners who are located outside of the core IMS domain to develop new applications and integrate them with the IMS core domain via SIP signalling.
The overall IMS network architecture is made up of access and core networks. The
core network has two domains.
Packet Switched Domain Packet switched connections are the opposite of circuit switched connections because they do not require dedicated resources. The information is broken down into packets which are routed independently through the network to their destinations, where they are reassembled into the original information streams.
The IMS comprises eight elements in the packet switched domain (cf., Fig. 7.2).
The IMS elements are logical functions which can be implemented in one server or across different servers. They can be broken down into the following functionalities.
Proxy CSCF The Proxy CSCF (P-CSCF) is the first contact point within the IMS core network for IMS subscribers. The P-CSCF accepts SIP requests and either serves them internally in the home IMS or forwards them to an external IMS. The main functions of the P-CSCF are [15] (a sample registration request is sketched after this list),
• Forwarding SIP registration requests received from the UE to an entry point out-
lined either by using the home domain name or from the SIP registration header.
When SIP requests are sent to external IMS domain they might, if required, be
routed via a local network Interconnection Border Control Function (IBCF),
which will then forward the SIP request to the entry point of the external IMS
domain.
• Ensuring that SIP messages received from the UE contain the correct informa-
tion regarding the access network type currently used by the UE. The P-CSCF
shall not modify the SIP Request URI in the SIP INVITE message.
• Detecting and handling emergency session establishment through SIP requests.
• Maintaining a Security Association such as IPSec with the UE, therefore secur-
ing the access part of the SIP signalling plane.
• Performing SIP message compression and decompression.
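As an illustration of the first of these functions, the following is a minimal sketch of the kind of SIP REGISTER request a UE sends towards the P-CSCF; the identity, addresses and header values used here are hypothetical and are not taken from a real trace:

REGISTER sip:home.example.net SIP/2.0
Via: SIP/2.0/UDP 10.0.0.5:5060;branch=z9hG4bKnashds7
Max-Forwards: 70
From: <sip:alice@home.example.net>;tag=4fa3
To: <sip:alice@home.example.net>
Contact: <sip:10.0.0.5:5060>;expires=600000
Call-ID: apb03a0s09dkjdfglkj49111
CSeq: 1 REGISTER
Content-Length: 0

The P-CSCF uses the home domain name in the Request-URI (here home.example.net) to determine the entry point to which the registration should be forwarded.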
Serving CSCF The Serving CSCF (S-CSCF) provides the session control services for IMS subscribers. The S-CSCF maintains the session states needed by the network operator to support the established services. Within the same IMS domain, several S-CSCFs may be deployed with different functionalities. The S-CSCF can retrieve a subscriber's security credentials and service information from the HSS. The S-CSCF can also act as a SIP registrar, translating SIP addresses to UE locations. The S-CSCF also uses filter criteria to determine the call flow from the UE to the destination application server or to another UE. The S-CSCF is the main call session controller, with the main responsibilities described in [15].
Interrogating CSCF The Interrogating CSCF (I-CSCF) is the network entry point into the IMS home domain. It is tasked with routing SIP messages and selecting S-CSCFs. The selection of an S-CSCF is performed by the I-CSCF during the UE's IMS registration process. According to [15], the I-CSCF,
• If the I-CSCF determines that the destination of the session is not within the
IMS home domain, it will either forward the SIP request or return with a failure
response code towards the originating UE.
PSTN Gateway
A PSTN gateway interfaces the IMS core network with PSTN circuit switched net-
works. Table 7.1 outlines the technologies used by both circuit switched networks
and IMS regarding signalling messages and transfer of media. Figure 7.3 depicts the
PSTN Gateway with its communication interfaces.
Signalling Gateway The Signalling Gateway (SGW) interfaces the IMS signalling plane with that of the CS network. It transforms lower layer protocols such as the Stream Control Transmission Protocol (SCTP) [38] into the Message Transfer Part (MTP), an SS7 protocol, in order to pass ISUP from the Media Gateway Control Function (MGCF) to the CS network.
Media Gateway Control Function The MGCF performs call control protocol conversion between SIP and ISUP and handles charging details. Other key features are the support of the SGW functionality and the control of the announcement application server. The MGCF and the MGW are not compulsory in the IMS network; both elements are required only when the IMS has to inter-operate with the PSTN CS domain.
Breakout Gateway
Breakout Gateway Control Function If the S-CSCF of the originating UE cannot route the SIP INVITE message to a terminating IMS, it passes the SIP session to the Breakout Gateway Control Function (BGCF). The BGCF is tasked with selecting the MGCF. The BGCF is always co-located with the S-CSCF. The BGCF is not compulsory in the IMS core network, but it is required whenever the IMS has to inter-operate with the PSTN CS domain.
• The protocol on the ISC interface must allow the S-CSCF to differentiate
between SIP requests on Mw, Mm and Mg interfaces [14] and SIP Requests
on the ISC interface.
User Databases
Home Subscriber Server: Home Subscriber Server (HSS) is a database that
stores the IMS user profiles of all IMS subscribers. The user profile includes the
subscriber’s identity information such as IMPU, IMPI, IMSI, and MSISDN. It also
includes service provisioning information such as Filter Criteria (FC), user mobility
information such as CSCF IP addresses and charging information such as GCID,
SGSN-ID, and ECF address. The HSS also provides support for the IMS user au-
thentication. In the IMS, the HSS communicates with the HLR using the MAP
protocol in order to get the AKA security information from the HLR. The HSS
performs authentication and authorisation of the IMS user and also provides infor-
mation about the physical location of the IMS user.
Subscription Locator Function (SLF): The SLF locates the database that stores a subscriber's information in response to requests from the I-CSCF or AS.
Other
Signaling Gateway Function: Signaling Gateway Function (SGF) provides sig-
nalling conversion between SS7 and IP networks.
Policy Decision Function: Policy Decision Function (PDF) provides two main
functions,
• Policy based Network Resource Control: This is used to authorise and control resource usage for each GPRS/UMTS Secondary Packet Data Protocol (PDP) context. Policy based network resource control prevents misuse of the network quality of service agreement; for instance, it stops an IMS subscriber from using a Secondary PDP Context with a higher QoS class than agreed. Policy based network resource control also allows the network operator to limit resource usage. The PDF acts as the Policy Decision Point (PDP) and the GGSN is its corresponding Policy Enforcement Point (PEP).
• Charging Correlation Support: This function supports the exchange of charging correlation information between the PDF and the GGSN. The PDF is not a mandatory IMS network element, but it is required whenever GPRS/UMTS Secondary PDP Contexts using QoS have to be controlled.
IMS can enable new converged and existing services for IMS subscribers using wireline or wireless access. An IMS based network can provide the ability to blend multimedia services that may currently be available only in isolation. One such instance is that an IMS subscriber could simultaneously have web portal based management of personal call features and policies, presence capability to detect the availability of individuals on the contact list, and the ability to share video clips with contacts while in conversation with them, regardless of their network access types or devices.
Users are interested in multimedia services that allow them to share information via their preferred delivery methods with several individuals across multiple access networks at the time of their choosing. Examples of such services are as follows:
• Voice/Video calls: This IMS multimedia service involves point to point commu-
nication between IMS users.
• Conferencing: This IMS service includes video or audio conferencing which
allows two or more IMS users to interact simultaneously.
• Content sharing: This IMS service allows users to share any type of informa-
tion, such as video and data files with any contact.
• Push-to-talk Over Cellular: This service involves voice calls which are in half
duplex mode of communication. In half duplex mode, when one person speaks
another one listens.
• Multiparty Gaming: This IMS service allows IMS users to play games with each other in real time.
• Presence Information: This IMS service allows an IMS user to be notified
whenever a contact is available. Presence information displays the availability
and willingness of an IMS user to communicate.
• Messaging: There are several types of messaging in IMS. These are:
– Instant messaging: This is a real time text communication between two or
more IMS users.
– Universal messaging: This IMS service gives IMS users the ability to compose, send, reply to and forward voice messages to and from a voicemail system.
– Enhanced messaging: This is an enhancement to SMS for mobile networks.
A mobile phone is capable of sending and receiving messages that have
special formatting, pictures, animations, icons, and ring tones.
• Speech Synthesis and Recognition
– Text to Speech (TTS) conversion: Speech synthesis is the artificial production of human speech. A system used for this purpose is termed a speech synthesizer, and can be implemented in software or hardware.
– Speech Recognition: Also known as automatic speech recognition, it is the
process of converting a speech signal to a set of words, by means of an
algorithm implemented as a computer application software.
This section outlines the main IMS internal and external interfaces.
• Gm-interface: Interface between the UE and the P-CSCF [14] (cf., Fig. 7.5).
7.1 What Is IP Multimedia Subsystem? 175
• Sh-interface: Interface between the HSS and AS [16] (cf., Fig. 7.9). Only an IPv4 transport is supported for the Sh interface; 3GPP defines the use of IPv6 at the Network Layer.
• Mw-interface: Interface between CSCFs [14] (cf., Fig. 7.10). IPv4 and IPv6 are
supported for the Mw interface.
• Mg-interface: Interface between the MGCF and the S-/I-CSCF [14] (cf.,
Fig. 7.11). 3GPP defines the use of IPv6 at the Network Layer.
More IMS internal and external interfaces are listed in Table 7.2.
7.2 Mobile Access Networks
We will concentrate on the standards which are relevant in the area of the European
Union. This means that we will have a look at GSM [14] and UMTS [3] but also
at the GSM to UMTS migration path. We will also look briefly at the latest 3GPP standard, termed Long Term Evolution (LTE).
In the quest to establish UMTS standards, a number of proposals were submitted
to ITU-R for evaluation and adoption within the IMT-2000 [25] family.
Figure 7.12 depicts the IMT-2000 terrestrial radio interfaces and categories. It
also points out various technologies and their affiliation by categories.
The 3GPP specifications group deals with cellular standards such as W-CDMA, TD-CDMA, TD-SCDMA and EDGE, while 3GPP2 handles CDMA2000. The latter will not be considered, because it describes enhancements of the IS-95 protocol, which is not used in Europe. The aforementioned technologies and standards are considered to be of great significance to 3G mobile networks.
Wireless standards are divided into cellular and non-cellular standards. Mobile operators are primarily interested in obtaining the best possible cellular system, so the standards that are most important to this type of system are discussed in this book. Cellular systems principally operate in the frequency range of 800 to 2200 MHz.
For many years, 2G systems have been operated successfully all over the world.
The anticipated demand for mobile data services providing high throughput, excel-
lent quality of service and improved system capacity prompted operators to begin
screening their options for the best choice in a 3G mobile system. A look at 3G re-
alizations reveals that all have related 2G predecessors. These can be classed in two
major families, GSM/UMTS and IS-95/CDMA2000. This book will only discuss
GSM/UMTS. This means that smooth evolution from 2G to 3G within each family
is possible, which is therefore also valid in the context of IMS/FMC configuration.
The GSM system is commonly operated in 900 MHz and 1800 MHz bands but
450 MHz, 850 MHz, and 1900 MHz bands are also used. It requires a paired spec-
trum and supports a carrier bandwidth granularity of 200 kHz. The GSM radio inter-
face uses a combination of FDMA and TDMA (cf., Fig. 7.13). The TDMA structure
comprises eight time slots (bursts) per TDMA frame on each carrier providing a
gross bit rate of 22.8 kb/s per time slot or physical channel. Dedicated logical chan-
nels carry user data or signaling information, and they are mapped on time slots of
the TDMA frame structure on a given frequency carrier.
The basic GSM system supports voice bearers at 13 kb/s (full rate codec, FR)
or 6.5 kb/s (half rate codec, HR) as well as circuit switched (CS) data services at
300 bps up to 14.4 kb/s. A suitable combination of FR and HR channels/codecs for
voice can increase voice capacity by 50 % over FR channels alone.
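One illustrative reading of this figure: a carrier whose eight time slots all carry FR voice offers 8 traffic channels, whereas converting four of those slots to carry two HR calls each yields 4 + 8 = 12 channels, i.e. 50 % more than with FR channels alone.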
Voice and data are transported via multiple 16 kb/s channels within the GSM Radio Access Network (RAN) (i.e., between the network entities BTS and BSC of the Base Station Subsystem (BSS)).
Transport systems such as PCM30 or PCM24 are realized for the Abis interface.
The GSM Core Network (cf., Fig. 7.14) provides circuit switched bearers for
voice and data at 64 kb/s granularity. The GSM system uses the Mobile Application
Part (MAP), which runs on signalling system No. 7 (SS7 of ITU-T) to exchange
mobility related information between the core network entities.
The GSM system was extended to offer flexible user rates for packet oriented data transfer, using time slot assignment on demand rather than permanent occupation. To this end, General Packet Radio Service (GPRS) introduces packet data functions into the radio interface, the radio access network and the core network (cf., Fig. 7.14).
The SGSN and GGSN network nodes are introduced into the core network to support GPRS. The SGSN and GGSN also communicate with the HLR using a GSM MAP that has been extended with data related functions.
SGSN and GGSN are used exclusively for packet data transport and control.
Packet information is conveyed between SGSN and GGSN via the GPRS Tunnelling
Protocol (GTP) on top of an IP based network.
The basic frame structure of the radio interface remains unchanged, but one or
more time slots are allocated on demand to transmit one or more packets.
Four new coding schemes (CS-1 to CS-4) are introduced to adapt the radio interface to the given radio conditions and improve its performance. This provides maximum user data rates per time slot as indicated in Table 7.3.
A number of time slots on a given carrier can be concatenated into a GPRS channel. This provides potential data rates of up to 171.2 kb/s (8 time slots with CS-4). In addition, the system supports a limited number of QoS characteristics such as delay, throughput and packet loss. Fast resource allocation on demand in the core and radio networks enables an “always on” terminal status.
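To see how the 171.2 kb/s figure arises: taking the commonly quoted CS-4 value of 21.4 kb/s per time slot (cf., Table 7.3), concatenating all eight time slots gives 8 × 21.4 kb/s = 171.2 kb/s.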
GPRS services may be charged on the basis of transported data volume rather
than channel occupation time. With GPRS the entire GSM system supports voice
and CS data as well as packet oriented data services.
The Third Generation Partnership Project (3GPP) specification group defined the
Universal Mobile Telecommunication System (UMTS) in the past decade.
The first release of the specifications provides a new radio network archi-
tecture including W-CDMA (FDD) and TD-CDMA (TDD) radio technologies,
GSM/GPRS/EDGE enabled services both for the CS and PO domain, and inter-
working to GSM. Meanwhile further features are defined like Virtual Home Envi-
ronment (VHE) and Open Services Architecture (OSA) evolution, full support of
Location Services (LCS) in CS and PO domains, an additional TDD mode (TD-
SCDMA), evolution of UTRAN transport (primarily IP support), multi-rate wide-
band voice codec, IP-based multimedia services (IMS), and high speed downlink
packet access (HSDPA).
As for GSM, the UMTS network architecture defines a core network (CN) and a
terrestrial radio access network (UTRAN) (cf., Fig. 7.15), the interface between the
two is named Iu. Notably, this interface is also projected to connect to GERAN.
This approach is evolutionary, so the UMTS core network may integrate into
the GSM core network. This also applies to core network entities as well as to
functions and protocols across the network such as call processing (CP) and mo-
bility management (MM). It applies specifically to the GSM/UMTS mobile appli-
cation part (MAP), which is independent of the RAN. The integrated GSM and
UMTS core network entities facilitate development, provisioning of network enti-
ties and introduction of UMTS services. Multi-mode terminals for both GSM and
UMTS allow for smooth migration from GSM to UMTS. Based on CDMA tech-
nology, UTRAN has been designed specifically to satisfy the service requirements
of 3G.
The fundamental function of CDMA (cf., Fig. 7.13) is to spread the actual user data signals over a broad frequency range, thereby mitigating multi-path fading. For this purpose, signals are multiplied with a unique bit sequence (spreading code) at a certain bit
rate (called chip rate). In this way users and channels are separated on the same
carrier. In contrast to a TDMA system, in a CDMA system other users within the
same cell generate most of the interference. This allows adjacent cells to use the
same frequency, which they usually do, and obviates the need for frequency plan-
ning.
Time division principles may be used within a CDMA system, much in the way of FDMA systems. This has the benefit of allowing time division duplexing to be used to separate uplink from downlink signals, which leads to a radio transmission technology suited for use in unpaired frequency bands.
The UTRAN system is designed to efficiently handle voice and data as well as
realtime and non realtime services over the same air interface (i.e., on the same
carrier), all at the same time and in any mix of data and voice. This variant is better
suited for data transport than GSM, and it provides a powerful platform for voice
traffic. A comprehensive channel structure was defined for the radio interface.
It consists of:
• Dedicated channels that may be assigned to one and only one mobile at any
given time.
• Common channels that may be used by all mobiles within this cell.
• Shared channels that are like common channels, but can only be used by an
assigned subset of mobiles at a given time. These channels are used for packet
data transfer.
The UTRAN system calls for several radio interface modes. Essentially, the definition distinguishes between two modes of operation,
• Frequency division duplexing (UTRAN FDD) for operation in paired frequency bands.
• Time division duplexing (UTRAN TDD) for operation in unpaired frequency bands.
FDD and TDD are harmonized, particularly in terms of how higher layers of the
radio network protocols and the Iu interface are used. In practice, the various modes
are hidden from the core network, meaning that the particulars of FDD and TDD
are limited to the UTRAN and to terminals. Both the operator and user benefit when
FDD and TDD are available in the same network,
• Unique UMTS service can be offered to the end users irrespective of the radio
access technology.
• The end user will enjoy the best possible coverage without giving a thought to
technical implications.
• The UMTS network can be deployed in such a way as to drive down costs.
Wideband CDMA
The UTRAN FDD mode employs Wideband CDMA (W-CDMA). This radio access technology uses direct sequence CDMA with a chip rate of 3.84 Mcps on a 2 × 5 MHz bandwidth carrier (uplink/downlink).
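As a simple worked illustration of the spreading operation described above: the spreading factor (SF) is the ratio of the chip rate to the symbol rate, so a physical channel carrying 30 ksymbols/s on the 3.84 Mcps carrier is spread with SF = 3 840 000/30 000 = 128, while higher rate channels use correspondingly smaller spreading factors.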
Due to the nature of the system, it usually operates with a frequency reuse of one,
meaning that all cells use the same carrier frequencies. As a consequence, the system
provides a special process that mitigates interference among the cells, especially at
cell borders. Soft handover (SHO) is used for CS traffic. Rather than using SHO,
PO traffic is switched in between two subsequent packets. In the course of an SHO,
a mobile terminal is connected to more than just one NodeB, depending on actual
radio conditions.
The RNC multiplies and combines signals sent to and received from the terminal.
Though SHO is primarily a macro diversity feature, it also provides the basis for
smooth and seamless inter-cell handover within the same frequency band. Softer handover, in contrast, is used between sectors of one base station. This enhances efficiency, but it requires improved digital signal processing capabilities within the base station. Its effect is comparable to that of an SHO. Again,
other users within the same cell generate the majority of interference. This means
that a CDMA system’s cell size depends on the actual cell load. This effect is called
cell breathing. To address this issue and ensure cell stability, CDMA networks
should operate with a nominal cell load of some 50 %, leaving margin for inter-
ference and allowing for some flexibility under peak load conditions. More than one
carrier can be used within a given cell or cell sector. Hard HO capability is provided
to handover between these carriers. Separate carriers do not have common channels,
therefore, they operate on their own.
The radio network controller (RNC) coordinates all carriers within a given area, handling tasks such as admission control. W-CDMA can be used in all environments
such as vehicular, pedestrian and indoor, and for all kinds of traffic. However, by its
very nature it is primarily suited for symmetric traffic using macro or micro cells in
areas with medium population density.
• The 3GPP UTRAN standard: The technology is UTRAN TDD's 1.28 Mcps option.
Fig. 7.16 TD-CDMA/TD-SCDMA spectrum usage
• The CWTS TSM standard: It is complemented with GSM radio procedures and
embedded entirely in the GSM BSS and inter-worked into a GSM core network
using GSM A and Gb interfaces.
The latest standard in cellular networks, now being deployed around the world, is an evolution of 3G towards an evolved radio access. This evolution is widely known as Long Term Evolution (LTE).
• Capabilities: The maximum downlink and uplink data rates are 100 Mb/s and 50 Mb/s, respectively, when operating in a 20 MHz spectrum. The user plane latency requirement is defined as the time it takes to transmit an IP packet from the mobile terminal to the RAN edge node or vice versa. The one way transmission time should not exceed 5 ms in an unloaded network (i.e., no other terminals are present in the cell).
• System performance: The LTE system performance is specified in terms of user throughput, mobility, spectrum efficiency and coverage.
– The LTE user throughput target is categorized into the average throughput and the throughput at the fifth percentile of the user distribution.
– The LTE spectrum efficiency is defined as the system throughput per cell in bit/s/MHz/cell (a small worked example follows this list).
– The mobility targets focus on the mobile terminal speed. Maximum performance is achievable at low mobile terminal speeds of 0–15 km/h, and a slight degradation is expected at higher speeds. For mobile terminal speeds up to 120 km/h, the system should provide high performance, and for speeds above 120 km/h, the system should be able to maintain its connection with the mobile terminals.
• Deployment aspects: There are two deployment scenarios: in the first, LTE is deployed stand-alone; in the second, LTE is deployed together with UMTS and/or GSM. When LTE coexists with other 3GPP systems, there are requirements on the acceptable interruption time in the mobility management (cf., Table 7.5).
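As the small worked example promised above: the 100 Mb/s downlink capability target in a 20 MHz carrier corresponds to a peak spectral efficiency of 100 Mb/s / 20 MHz = 5 bit/s/Hz; the per-cell spectrum efficiency targets are averages over the whole cell and are therefore lower than this peak value.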
Overall Architecture
EPS provides LTE users with IP connectivity to access the Internet and other IP based services such as VoIP via the PDN. Figure 7.17 depicts the overall architecture of the EPS, which includes the network elements as well as the standardized interfaces.
The architecture is composed of the core network (CN) and the access network (AN). The CN, which is the EPC, has several logical nodes, while the AN (i.e., E-UTRAN) is made up of only one node, the eNodeB, which connects the UE to the CN.
The CN has the following main nodes,
• PDN Gateway: This gateway provides the IP connectivity from the UE to the
Internet. This means that the PDN Gateway is the point of entry and exit of the
IP traffic for the UE.
• Serving Gateway: When UE moves between eNodeBs, the Serving Gateway
serves as an anchor for the data bearers.
• MME: The Mobility Management Entity (MME) is responsible for processing signaling messages between the CN and the UE. Other functions are to establish, maintain and release bearers. It is also involved in the security between the UE and the network. It interfaces with the SGSN of 2G and 3G networks during the handover process.
• HSS: It serves as a central database where LTE user profiles are stored.
7.3 Summary
UMTS networks have been rolled out in steps. Deployment kicked off in urban ar-
eas where a specific demand for data services was anticipated. Next came suburban
areas, and so on down the line. In order to provide full coverage for service conti-
nuity, cellular networks and terminals are designed to enable roaming and handover
between GSM/GPRS and UMTS.
In general, the migration of the overall architecture from GSM to UMTS has gone smoothly, particularly in the core network. A look at the core migration issues is as follows:
• Terminals: Handset manufacturers were committed to providing GSM/UMTS dual-mode terminals.
• Radio network: UMTS technologies were designed specifically to use a band-
width of 5 MHz (TD-SCDMA occupies 1.6 MHz only) of an unpaired or
2 × 5 MHz of a paired spectrum to efficiently support high user data rate ser-
vices in highly mobile environments. A new radio spectrum was allocated for
3G, providing the basis for introducing new radio technologies without requir-
ing spectrum to be re-farmed and legacy services and equipment to be replaced.
The entire 3G spectrum was subdivided into 5 MHz bands. UTRAN was in-
troduced alongside GSM RAN owing to its extended functionality and band-
width.
• Core network: GSM and UMTS define the same core network architecture.
GPRS is part of both GSM and UMTS.
The implications for UMTS service introduction were clear: the legacy GSM core network could be upgraded to operate both GSM and UMTS within an integrated UMTS core network. This meant that operators could offer wide area coverage via GSM/GPRS and gradually build up their UMTS radio access infrastructure. At the same time, GPRS nodes and GSM MSCs were upgraded to support UMTS services and to interconnect the UMTS radio network via the new Iu interface.
The rollout of LTE networks is now gathering pace around the world. A report by the Global Mobile Suppliers Association (GSA) confirms that 338 mobile operators in 101 countries have committed to commercial LTE network deployments or are in the process of conducting trials and testing [23]. The GSA has confirmed that LTE is the fastest growing mobile technology ever.
7.4 Problems
1. The caller dials a number by using SIP URI of another IMS subscriber. At
which call session control function does the call first make a contact?
2. What does the P-CSCF first check when it receives a call setup request from the IMS subscriber?
3. If the SIP URI is found by the P-CSCF through the DNS during a call setup,
what will be the next step performed by the P-CSCF?
4. What is the relationship between the S-CSCF and the HSS?
5. Explain the importance of the initial Filter criteria (iFC).
6. Explain a scenario where the initial Filter criteria would be applied.
7. What are the interface and protocol used to connect the HSS and the S-CSCF?
8. What are the interface and protocol used to connect the application server and the S-CSCF?
9. I want to provide video on demand (VoD) services to IMS subscribers; where would I place them in the IMS architecture? Explain your answer.
10. Why does GSM use TDMA and not CDMA?
11. I have a GSM mobile handset; can I use it in all countries? Explain your answer.
References
1. 3GPP Policy and charging control over Gx reference point. TS 29.212, 3rd Generation Part-
nership Project (3GPP). http://www.3gpp.org/ftp/Specs/html-info/29212.htm
2. 3GPP Policy and charging control over Rx reference point. TS 29.214, 3rd Generation Part-
nership Project (3GPP). http://www.3gpp.org/ftp/Specs/html-info/29214.htm
3. 3GPP (2001) UMTS Phase 1. TS 22.100, 3rd Generation Partnership Project (3GPP).
http://www.3gpp.org/ftp/Specs/html-info/22100.htm
4. 3GPP (2005) Application of ISDN User Part (ISUP) Version 3 for the Integrated Ser-
vices Digital Network (ISDN)—Public Land Mobile Network (PLMN) signalling inter-
face; Part 1: Protocol specification. TS 09.14, 3rd Generation Partnership Project (3GPP).
http://www.3gpp.org/ftp/Specs/html-info/0914.htm
5. 3GPP (2005) Customized Applications for Mobile network Enhanced Logic (CAMEL);
Service description; Stage 1. TS 22.078, 3rd Generation Partnership Project (3GPP).
http://www.3gpp.org/ftp/Specs/html-info/22078.htm
6. 3GPP (2007) Policy control over Gq interface. TS 29.209, 3rd Generation Partnership
Project (3GPP). http://www.3gpp.org/ftp/Specs/html-info/29209.htm
7. 3GPP (2007) Signaling System No 7 (SS7) signalling transport in core network; Stage 3.
TS 29.202, 3rd Generation Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/
html-info/29202.htm
8. 3GPP (2007) Technical specifications and technical reports for a GERAN-based 3GPP sys-
tem. TS 01.01, 3rd Generation Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/
html-info/0101.htm
9. 3GPP (2008) Internet Protocol (IP) multimedia call control protocol based on Session Initi-
ation Protocol (SIP) and Session Description Protocol (SDP); Stage 3. TS 24.229, 3rd Gen-
eration Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/html-info/24229.htm
10. 3GPP (2008) IP Multimedia (IM) subsystem Cx and Dx interfaces; Signaling flows and mes-
sage contents. TS 29.228, 3rd Generation Partnership Project (3GPP). http://www.3gpp.org/
ftp/Specs/html-info/29228.htm
11. 3GPP (2008) Media Gateway Control Function (MGCF)—IM Media Gateway (IM-MGW);
Mn interface. TS 29.332, 3rd Generation Partnership Project (3GPP). http://www.3gpp.org/
ftp/Specs/html-info/29332.htm
12. 3GPP (2008) Mobile Radio Interface NAS signalling—SIP translation/conversion. TS
29.292, 3rd Generation Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/html-
info/29292.htm
13. 3GPP (2008) Multimedia Resource Function Controller (MRFC)—Multimedia Resource
Function Processor (MRFP) Mp interface; Procedures descriptions. TS 23.333, 3rd Gener-
ation Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/html-info/23333.htm
14. 3GPP (2008) Network architecture. TS 23.002, 3rd Generation Partnership Project (3GPP).
http://www.3gpp.org/ftp/Specs/html-info/23002.htm
15. 3GPP (2008) Service requirements for the Internet Protocol (IP) multimedia core net-
work subsystem (IMS); Stage 1. TS 22.228, 3rd Generation Partnership Project (3GPP).
http://www.3gpp.org/ftp/Specs/html-info/22228.htm
16. 3GPP (2008) Sh interface based on the Diameter protocol; Protocol details. TS
29.329, 3rd Generation Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/
html-info/29329.htm
17. 3GPP (2012) 3rd generation partnership project. http://www.3gpp.org. [Online; accessed
15-August-2012]
18. Blatherwick P, Bell R, Holland P (2001) Megaco IP phone media gateway application pro-
file. RFC 3054, Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc3054.txt
19. Calhoun P, Loughney J, Guttman E, Zorn G, Arkko J (2003) Diameter base protocol. RFC
3588, Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc3588.txt
20. Camarillo G, Marshall W, Rosenberg J (2002) Integration of resource management and
Session Initiation Protocol (SIP). RFC 3312, Internet Engineering Task Force. http://www.
rfc-editor.org/rfc/rfc3312.txt
21. Deering S, Hinden R (1998) Internet Protocol, version 6 (IPv6) specification. RFC 2460,
Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc2460.txt
22. ETSI (2012) The European Telecommunications Standards Institute (ETSI). http://www.
etsi.org/WebSite/AboutETSI/AboutEtsi.aspx. [Online; accessed 15-August-2012]
23. GSA (2011) GSA—the global mobile suppliers association. http://www.gsacom.com. [On-
line; accessed 25-August-2012]
24. Handley M, Jacobson V (1998) SDP: Session Description Protocol. RFC 2327, Internet
Engineering Task Force. http://www.rfc-editor.org/rfc/rfc2327.txt
25. ITU (2006) Detailed specifications of the radio interfaces of the international mobile
telecommunications-2000 (IMT-2000). Recommendation 1457, International Communica-
tion Union
26. Jennings C, Peterson J, Watson M (2002) Private extensions to the Session Initiation Pro-
tocol (SIP) for asserted identity within trusted networks. RFC 3325, Internet Engineering
Task Force. http://www.rfc-editor.org/rfc/rfc3325.txt
27. Kent S, Atkinson R (1998) Security architecture for the Internet Protocol. RFC 2401, Inter-
net Engineering Task Force. http://www.rfc-editor.org/rfc/rfc2401.txt
28. Olson S, Camarillo G, Roach AB (2002) Support for IPv6 in Session Description Pro-
tocol (SDP). RFC 3266, Internet Engineering Task Force. http://www.rfc-editor.org/rfc/
rfc3266.txt
29. Peterson J (2002) A privacy mechanism for the Session Initiation Protocol (SIP). RFC 3323,
Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc3323.txt
30. Postel J (1980) User Datagram Protocol. RFC 0768, Internet Engineering Task Force.
http://www.rfc-editor.org/rfc/rfc768.txt
31. Postel J (1981) Internet Protocol. RFC 0791, Internet Engineering Task Force. http://www.
rfc-editor.org/rfc/rfc791.txt
32. Postel J (1981) Transmission Control Protocol. RFC 0793, Internet Engineering Task Force.
http://www.rfc-editor.org/rfc/rfc793.txt
33. Price R, Bormann C, Christoffersson J, Hannu H, Liu Z, Rosenberg J (2003) Signaling Com-
pression (SigComp). RFC 3320, Internet Engineering Task Force. http://www.rfc-editor.org/
rfc/rfc3320.txt
34. Rosenberg J (2002) The Session Initiation Protocol (SIP) UPDATE method. RFC 3311,
Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc3311.txt
35. Rosenberg J, Schulzrinne H (2002) An offer/answer model with Session Description
Protocol (SDP). RFC 3264, Internet Engineering Task Force. http://www.rfc-editor.org/
rfc/rfc3264.txt
36. Rosenberg J, Schulzrinne H (2002) Reliability of provisional responses in Session Initia-
tion Protocol (SIP). RFC 3262, Internet Engineering Task Force. http://www.rfc-editor.org/
rfc/rfc3262.txt
In this case study, we will set up and configure a VoIP testbed using Asterisk PBX and the X-Lite 4 soft phone. Upon completion of this lab, you will be able to add SIP phones, configure basic dial plans, set up a SIP soft phone, start and stop Asterisk, make voice calls between SIP soft phones, make video calls between SIP soft phones and make voice calls between SIP soft phones and analogue phones.
Asterisk is an open source private branch exchange (PBX) software that provides
all of the features expected from a PBX and more. Asterisk implements VoIP and
can inter-operate with almost all standard telephony equipment using relatively in-
expensive hardware. Digium [3] is the creator and sponsor of the Asterisk project.
SIP, Inter-Asterisk Exchange (IAX2) and H323 are the main control protocols used
in Asterisk. Figure 8.1 illustrates the Asterisk architecture.
Asterisk architecture is built on modules whereby each module is a loadable
component with a specific function. Asterisk main modules are:
1. Channel modules: They handle different types of connection such as (SIP,
Digium Asterisk Hardware Device Interface (DAHDI), H323, IAX2).
2. Codec translator modules: They support audio and video encoding and decod-
ing formats such as G711, GSM, H263 and H264.
3. Application modules: They support various functions such as voicemail, call detail recording (CDR), dialplans and SMS.
4. File format modules: They handle writing and reading of different file formats
in the Asterisk system.
Asterisk can turn a computer into a communications server. Asterisk powers IP PBX systems, VoIP gateways, conference servers and more. It is used by small and large businesses, call centers, carriers and government institutions around the world.
Channel modules enable Asterisk to make calls. Each channel module is specific to a channel type. A list of some useful channel modules used in this Lab is given in Table 8.1.
Codec translator modules are essential for transcoding media formats in a scenario
whereby communication end points do not have compatible codecs. Table 8.2 lists
useful codec translator used in this Lab.
The most popular applications in Asterisk are dialplan applications. Dialplan applications are configured in the extensions.conf file, which defines incoming and outgoing call routing. Table 8.3 lists some useful and essential applications used in this Lab.
The file format module is responsible for transcoding media files during recording and playback. For instance, if a recorded file is in GSM format and playback is over a PSTN channel, then a GSM to PCM translator is required. Table 8.4 lists some useful file format modules used in this Lab.
Asterisk is designed to run under Linux. Most Linux distributions, such as Fedora, Debian, Ubuntu and SUSE, include Asterisk among their packages, and you can simply use these packages to install Asterisk. You can download Asterisk from [1] if your Linux distribution does not provide a package in its repositories.
Together with Asterisk, DAHDI and libpri are essential for Asterisk to work properly with ISDN or the PSTN. The libpri library enables Asterisk to communicate with ISDN connections; if you plan to connect to an ISDN line, this library is recommended on your system. The DAHDI library enables Asterisk to communicate with analog and digital telephone lines, including communication with the PSTN. DAHDI and libpri are recommended to be installed on your system even if you don't plan to use analog or digital telephone lines.
DAHDI is the abbreviation of Digium Asterisk Hardware Device Interface. It is a
set of drivers and utilities for analog and digital telephony cards, such as TDM cards
manufactured by Digium. The DAHDI drivers are independent of Asterisk, and can
be used by other communication applications. DAHDI originates from Zaptel which
was created by the Zapata Telephony Project.
As Ubuntu is one of the most popular Linux distributions, the following instructions demonstrate how to install Asterisk on Ubuntu.
1. It is essential to update the Ubuntu system and reboot
• sudo apt-get update
• sudo apt-get upgrade
• sudo reboot
2. It is essential to synchronise time in communication; therefore, make sure the Network Time Protocol (NTP) daemon is installed
• sudo apt-get install ntp
3. It is important to install the software dependencies; in Ubuntu, the xml, ncurses and ssl development libraries are required, together with subversion for getting the Asterisk source code via the subversion system
• sudo apt-get install subversion
• sudo apt-get install libssl-dev libncurses5-dev libxml2-dev
4. Create a directory where Asterisk source files will reside
• sudo mkdir /src/asterisk
5. Go to the created Asterisk directory
• cd /src/asterisk
6. Download Asterisk source code via subversion
• sudo svn co http://svn.asterisk.org/svn/asterisk/branches/1.6
7. Build and install Asterisk
• cd 1.6
• sudo ./configure
• sudo make
• sudo make install
• sudo make config
At this stage your system is ready to configure dialplans, channels and any addi-
tional modules such as sounds and music on hold. The full documentation on how
to install Asterisk, libpri and DAHDI is available at [1].
8.2 What Is X-Lite 4
X-Lite 4 is a proprietary freeware VoIP soft phone that uses SIP for VoIP sessions
setup and termination. It combines voice calls, video calls and Instant Messaging
in a simple interface. X-Lite 4 is developed by CounterPath [2]. The screen shot of
X-Lite 4 is depicted in Fig. 8.2.
X-Lite 4 has basic functionalities which include:
• Call display and message waiting indicator.
• Speaker phone and mute icon.
• Hold and redial.
• Call history for incoming, outgoing and missed calls.
The X-Lite 4 supports the following features and functions:
• Video call support.
• Instant messaging and presence via the SIMPLE protocol.
• Contact list support.
• Automatic detection and configuration of voice and video devices.
• Support of echo cancelation, voice activity detection (VAD) and automatic gain
control.
• The following audio codecs are supported: Broadvoice-32, Broadvoice-32 FEC, DVI4, DVI4 Wideband, G711aLaw, G711uLaw, GSM, L16 PCM Wideband, iLBC, Speex, Speex FEC, Speex Wideband, Speex Wideband FEC.
• DTMF (RFC 2833 [7], inband DTMF or SIP INFO messages) support.
X-Lite can be started from the “Windows Start Menu” if it is installed. If it is not installed, it can be downloaded and installed from [2].
X-Lite Menu
• Softphone. Under the Softphone Menu (cf., Fig. 8.2) the following can be con-
figured.
– Accounts. This item will configure how X-Lite communicates with your
Asterisk server. The settings such as X-Lite username, password, display
name and SIP server domain or proxy IP address can be configured under
this item.
– Preferences. Preferences such as audio and video codecs can be configured
in this item, moreover, audio and video devices can be configured under
this item.
– Exit. This item will let you exit X-Lite; pressing Ctrl + Q will also exit X-Lite.
• View. This menu item will change how X-Lite looks.
• Contacts. This menu item will let you manage your contacts such as adding and
modifying contact list.
• Actions. Depending on the X-Lite state, this menu item will let you perform several actions. For instance, if a contact is selected, actions such as placing a call or sending an instant message to that contact become available.
Placing a Call
A call can be placed in two ways,
• by using a traditional phone number such as 123456
• by using a SIP URI such as [email protected]
Ending a Call
The red End call button on the established call window is used to end a call (cf., Fig. 8.4).
• Video: If the incoming call is a video call, then the Video button will appear to
accept the video call (cf., Fig. 8.6).
• Audio: If the incoming call is a video call, then the Audio button will also ap-
pear. You will have the choice to accept only the audio call (cf., Fig. 8.6).
8.3 Voice and Video Injection Tools
As discussed in Chap. 6, voice and video quality can be assessed using a full-reference model or intrusive voice/video quality assessment model, in which a reference voice/video signal is compared with the corresponding degraded voice/video signal. For full-reference voice/video quality assessment, a standard reference voice signal or video clip needs to be injected into the system under test, and the degraded voice signal or video clip recorded at the receiving end.
In this section, we introduce methods to inject a standard voice signal (e.g., standard speech samples from ITU-T P.50 Appendix I [8]) or a standard video clip from the VQEG video clip database [9].
A popular tool for video injection is ManyCam Virtual Webcam, or simply ManyCam [4], and for audio it is Virtual Audio Cable [6].
ManyCam Virtual Webcam, or simply ManyCam, is freeware live effects and webcam effects software. Its name derives from its ability to allow a single webcam to be used by multiple applications at the same time. ManyCam takes as its input stream a webcam, a video camera, or picture or movie files, and replicates the stream as an alternative source of input for other applications such as Skype, Google Talk and X-Lite. This book uses ManyCam version 3.0.91. The front screen of ManyCam is depicted in Fig. 8.7.
Virtual Audio Cable (VAC) is freeware software that allows audio streams to be transferred between applications and/or audio devices. VAC creates virtual audio devices (“Virtual Cables”) through which any application can send or receive an audio stream without sound quality loss.
8.4 Lab Scenario
Figure 8.12 depicts the scenario we are going to use in this lab. Asterisk version 1.6 is used under Ubuntu 10.04 LTS. X-Lite 4 SIP phones Version 4.1 build 63214 for Windows and Asterisk are connected via an Ethernet switch. The analogue phone is
connected to Asterisk via Digium TDM11B TDM PCI Card by using RJ11 cable.
Headset and webcams are required for voice and video calls on the X-Lite 4.
For the purpose of learning, a low powered CPU, such as a 433 MHz to 700 MHz Celeron processor, is able to host the Asterisk system. This is the minimum requirement for two to three concurrent VoIP calls.
8.5 Adding SIP Phones
[2000]
type=friend
username=2000
secret=1234
context=mycontext
host=dynamic
Each line added into the sip.conf file is described below:
type: This defines the connection class for each phone. There are three options: peer, user and friend. A peer can only make calls; it is used when Asterisk is connecting to a proxy. A user can only accept calls. A friend is used as both a peer and a user, i.e., it can make and accept calls.
username: Sets the username for registering into Asterisk.
secret: The password for registering into Asterisk.
context: This sets the context for this phone. This context will be used in the extensions.conf file (cf., Sect. 8.6) for this phone's dial plans.
host: IP address or host name of the phone. It can also be set to ‘dynamic’ to allow
the phone to connect from any IP address.
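The corresponding sip.conf entry for phone 1000 is not reproduced in the listing above; it would look much the same as the entry for phone 2000 (a sketch, assuming the same password and context are used):

[1000]
type=friend
username=1000
secret=1234
context=mycontext
host=dynamic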
The file we are particularly interested in for configuring dial plans is extensions.conf, so go into /etc/asterisk and edit the extensions.conf file. Adding the following lines to the extensions.conf file will configure dial plans for the phones 1000 and 2000 added in the sip.conf file (cf., Sect. 8.5).
An extension is made up of three components separated by commas,
• The name/number of the extension
• The priority
• The application/command
Extension components are formatted as exten => name, priority, application(), and a real example of this is exten => 100,1,Answer(). In this example, 100 is the extension name/number, the priority is 1, and the application/command is Answer().
[mycontext]
exten => 100,1,Answer()
exten => 100,2,Dial(SIP/1000)
exten => 100,3,Hangup()
Below is the description of each line added into the extensions.conf file:
[mycontext]: Phones 1000 or 2000 will be directed to this context whenever a call
is initiated. Note that this context name is defined in sip.conf.
exten => 100,1,Answer(): If phone 1000 or 2000 dials extension 100, this line
will answer a ringing SIP channel.
exten => 100,2,Dial(SIP/1000): If phone 1000 or 2000 dials 100, this line re-
quests a new SIP channel, places an outgoing call to phone 1000 and bridges the
two SIP channels when phone 1000 answers.
exten => 100,3,Hangup(): Unconditionally hangup a SIP channel, terminating a
call.
exten => 300,2,Dial(DAHDI/1): If phone 1000 or 2000 dials 300 (see the sketch below), this line requests a new DAHDI channel, places an outgoing call to the analogue phone and bridges the SIP and DAHDI channels when the analogue phone answers.
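The dial plan lines for extensions 200 and 300, on which the descriptions above and the calling steps later in this chapter rely, follow the same pattern as extension 100. A sketch of these lines (an assumption kept consistent with the descriptions above, not the book's exact listing) is:

exten => 200,1,Answer()
exten => 200,2,Dial(SIP/2000)
exten => 200,3,Hangup()
exten => 300,1,Answer()
exten => 300,2,Dial(DAHDI/1)
exten => 300,3,Hangup()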
8.7 Configuring DAHDI Channels
Before making voice calls between SIP and analog phones, it is important to ensure the Digium cards are working well. After installing Asterisk and DAHDI, verify that your cards are set up and configured properly by executing the commands in steps 6 and 7. If you get errors in steps 6 and 7, then follow all the steps below to make your Asterisk and Digium card work properly.
1. Detect your hardware. The command below will detect your hardware and, if it is successful, the files /etc/dahdi/system.conf and /etc/asterisk/dahdi-channels.conf will be generated.
Linux:~# dahdi_genconf
2. This command will read the system.conf file generated in the above step and configure the DAHDI drivers in the kernel of your Linux distribution.
Linux:~# dahdi_cfg -v
3. The following line will restart DAHDI in which all modules and drivers will be
unloaded and loaded again. Note that the location of this script may vary from
one Linux distribution to another.
Linux:~# /etc/init.d/dahdi restart
4. The statement below will include the file /etc/asterisk/dahdi-channels.conf in chan_dahdi.conf under the [channels] section.
[channels]
#include /etc/asterisk/dahdi-channels.conf
5. These configurations will not take effect until you restart Asterisk (or start it if it is not running). The following command will restart Asterisk.
6. After restarting Asterisk, verify your card status. Reconnect to Asterisk and run the following command under the CLI; you will get output like this.
asterisk*CLI>
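The CLI command itself is not reproduced above. A typical way to verify the DAHDI card status from the Asterisk CLI (an assumption, not necessarily the exact command used in the original lab) is:

asterisk*CLI> dahdi show status
asterisk*CLI> dahdi show channels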
To start Asterisk, log in to Linux as a user with permission to run Asterisk, type “asterisk -vvvc” at the terminal console and press the return key. “-c” enables console (command line interface) mode and “-v” tells Asterisk to produce verbose output; more “v”s produce more verbose output. If Asterisk is successfully started, the console mode will look like Fig. 8.13. To stop Asterisk, type “core stop now” at the Asterisk console and press the return key. The console mode will exit and Asterisk will stop.
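Summarising the above as a short console session (the prompt and the elided startup output are illustrative):

Linux:~# asterisk -vvvc
...
asterisk*CLI> core stop now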
This lab uses X-Lite 4 Version 4.1 build 63214 under Microsoft Windows XP as a SIP phone to connect to Asterisk. The following steps are needed in order to set up X-Lite 4,
• Open X-Lite 4 from Microsoft Windows.
• Click on the “Softphone” menu, then select “Account Settings” (cf., Fig. 8.14).
• Under “SIP Account” Input the following details (cf., Fig. 8.14):
Display Name: 1000
User Name: 1000
Password: 1234
Authorization User: 1000
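In addition to these fields, the account also needs the address of the SIP server so that X-Lite knows where to register. A hedged example, assuming the Asterisk server sits at the hypothetical address 192.168.1.10:

Domain: 192.168.1.10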
8.10 Making Voice Calls Between SIP Phones
The following steps are required to make voice calls between SIP phones.
• Start Asterisk
• Start X-Lite 4 with SIP phone 1000. The X-Lite 4 screen will look like Fig. 8.15, showing that SIP phone 1000 is successfully registered
• Start X-Lite 4 with username 2000 on another PC. The X-Lite 4 screen will look similar to Fig. 8.15, but this time showing that SIP phone 2000 is successfully registered
• Observe the output messages at Asterisk CLI console for possible errors or
successful registration messages
• If you are using SIP phone 2000, dial extension 100 to call SIP phone 1000 by
clicking “Call” icon, let SIP phone 1000 pick up the call and start talking. You
can type the extension number in the X-Lite 4 call entry field or use the X-Lite
4 dial pad. The dial pad can be expanded by clicking the “Show or hide dial
pad” icon
• If you are using SIP phone 1000, dial extension 200 to call SIP phone 2000, let
SIP phone 2000 pick up the call and start talking
• Observe the output messages at Asterisk CLI when the VoIP session is estab-
lished between the SIP phones 1000 and 2000
• The call can be terminated by using “End” icon
• Observe the output messages at Asterisk CLI when the VoIP session is termi-
nated between the SIP phones 1000 and 2000
8.11 Making Video Calls Between SIP Phones
In order for video calls to work, you are required to turn on support for SIP video. This is done by adding the line “videosupport=yes” under the [general] context in sip.conf. The [general] context, among other lines, will look like this,
[general]
videosupport=yes
After enabling SIP video support, carry on with the following steps,
• Start X-Lite 4
• Open “Video window” from the X-Lite 4 screen (cf., Fig. 8.16) by clicking
“Show or hide video window” icon. You should be able to see yourself from
your own camera (cf., Fig. 8.16)
• Dial extension 100 if you are on SIP phone 2000 to call SIP phone 1000. When SIP phone 1000 picks up the call, click the “Send video” icon in the “Video window”
• Both SIP phones will now display the video call (cf., Fig. 8.17)
This section will make use of the DAHDI channels configured above. Extension 300 is used to connect to the analogue phone.
• Start “X-Lite 4”
• Dial extension 300, the analogue phone will ring, pick up the phone and start
talking
• From the analogue phone, dial extension 100 or 200 for calls to SIP phones
1000 or 2000, respectively, pick up the calls from X-Lite 4 and start talking
8.13 Problems
This challenge will help you understand more about adding SIP phones and dial
plans in sip.conf and extensions.conf, respectively. To help with current time play-
back and voice mail dial plans configuration, “Asterisk: The Future of Telephony”
is recommended [5].
1. Add SIP phone with username: 3000, password: 1234, type: friend and host:
dynamic.
2. Add SIP phone with username: 4000, password: 1234 and type: user.
3. Make voice call from SIP phone 2000 to 4000. (hint, you should add dial plans
and setup X-Lite 4 for this to work).
4. Make voice call from SIP phone 4000 to 2000.
5. Explain if any of Step 3 or 4 is not working and why.
6. Rectify any problem if exists in Step 5.
7. Add a dial plan that will playback the current time (hint, use SayUnixTime
Asterisk command).
8. Add a dial plan that will include a voice mail (hint, use VoiceMail command).
The file /etc/asterisk/voicemail.conf is used to setup Asterisk voicemail con-
texts. The voice mail dial plan should include the following functions:
a. Enable SIP phone user to leave voice mail after 30 seconds of no answer.
b. Enable SIP phone user to access and retrieve saved voicemails.
c. Prompt for a password when accessing voicemail mailbox.
References
1. Asterisk (2011) Asterisk downloads. http://www.asterisk.org/downloads/. [Online; accessed
02-February-2011]
2. Counterpath (2011) X-lite 4. http://www.counterpath.com/x-lite.html. [Online; accessed 12-
June-2011]
9 Case Study 2—VoIP Quality Analysis and Assessment
In this case study, we will analyse and assess voice and video quality of the multi-
media sessions established in Chap. 8. We will use Wireshark in order to capture and
analyse SIP and RTP packet headers. Upon completion of this Lab, you will be
familiar with Wireshark and with the tc commands of the Linux network emulator,
and you will have experienced the SIP message flows during user registration and
multimedia session setup and termination. This Lab will also help you to emulate a
Wide Area Network (WAN) by using the tc command in Linux in order to assess the
impact of packet loss and jitter on the quality of VoIP calls.
• If “From the first to the last marked packet” is chosen then all packets marked
from the first to the last one will be processed.
• If “Specify a packet range” is chosen then you will have to input your own packet
range in the text field below it and all the ranges specified in the text field will
be processed.
Packets can be read and written in many different capture file formats such
as tcpdump (libpcap), Pcap NG, Catapult DCT2000, Cisco Secure IDS iplog,
Microsoft Network Monitor, Network General Sniffer (compressed and uncom-
pressed), Sniffer Pro, and NetXray, Network Instruments Observer, NetScreen
snoop, Novell LANalyzer, RADCOM WAN/LAN Analyzer, Shomiti/Finisar Sur-
veyor, Tektronix K12xx, Visual Networks Visual UpTime, WildPackets Ether-
Peek/TokenPeek/AiroPeek, and many others.
Live data can be read from Ethernet, IEEE 802.11, PPP/HDLC, ATM, Blue-
tooth, USB, Token Ring, Frame Relay, FDDI, and others (depending on your plat-
form). It also supports decryption for many protocols, including IPsec, ISAKMP,
Kerberos, SNMPv3, SSL/TLS, WEP, and WPA/WPA2. Output can be exported to
XML, PostScript, CSV, or plain text and capture files compressed with gzip can be
decompressed on the fly.
• No.: This column denotes the number of the packet in the capture file. This
number will not change, even if a display filter is used.
• Time: This column shows the timestamp of the packet. The format of this
timestamp can be changed into different formats, such as seconds, milliseconds
or nanoseconds, or the date and time of day.
• Source: This column displays the IP address where this packet is coming
from.
• Destination: The IP address to which this packet is destined.
• Protocol: The protocol name is displayed in this column in short form, such as
TCP, UDP or HTTP.
• Info: Additional information about the packet content is displayed in this
column such as NOTIFY and TCP segment of a reassembled PDU.
2. The “Packet Details” pane (cf., Fig. 9.3): Selecting a packet in the “Packet
List” pane causes its details to be shown in the “Packet Details” pane.
This pane gives the protocols and the protocol fields corresponding to the
packet selected in the “Packet List” pane. The fields are arranged in a tree
structure which can be expanded to reveal more information.
3. The “Packet Bytes” pane (cf., Fig. 9.4): This pane displays a hexdump-style
view of the packet selected in the “Packet List” pane. The left side of the
hexdump shows the offset into the packet data, the hexadecimal representation
of the packet data is shown in the middle, and the right side shows the ASCII
characters corresponding to the hexadecimal representation.
Rich VoIP analysis is available under “Telephony” menu item where RTP and SIP
protocols together with VoIP calls statistics can be analysed.
Under the Telephony menu (cf., Fig. 9.5), SIP statistics can be generated when the SIP
protocol is selected. For instance, the SIP statistics of Fig. 9.6 show that 14 SIP packets
were captured: two packets each for “SIP 100 Trying” and “SIP 180 Ringing”, four
“SIP 200 OK” packets, two “SIP INVITE”, one “SIP ACK” and three “SIP REGISTER”
packets.
VoIP call analysis can be done in the same “Telephony” menu item by selecting
“VoIP calls” in the drop down menu. A list of detected VoIP calls will appear (cf.,
Fig. 9.7). The VoIP calls list includes:
• Start Time: Start time of the VoIP call.
• Stop Time: Stop time of the VoIP call.
• Initial Speaker: The IP address of the source of the packet that initiated the VoIP
call.
• From: This is the “From” field of the SIP INVITE.
• To: This is the “To” field of the SIP INVITE.
• Protocol: This column displays the VoIP protocols used such as SIP and H323.
• Packets: This column denotes the number of packets involved in the VoIP call.
• State: This displays the current VoIP call state. This can be the following.
– CALL SETUP: This will show a VoIP call in setup state (Setup, Proceeding,
Progress or Alerting).
– RINGING: This will display a VoIP call ringing (this state is only supported
for MGCP calls).
– IN CALL: This will denote that a VoIP call is still connected.
– CANCELED: This state shows that a VoIP call was released by the originating
caller before being connected.
– COMPLETED: This will show a VoIP call was connected and then released.
– REJECTED: This state shows that a VoIP call was released by the callee
before being connected.
– UNKNOWN: This VoIP call is in unknown state.
• Comment: Any additional comment will be displayed here for a VoIP call.
A desired VoIP call can be chosen from the list for playback through the Wireshark
RTP player (cf., Fig. 9.8). The RTP player will show the percentage of packets that
are dropped because of the jitter buffer, as well as the packets that are out of se-
quence.
From the list of VoIP calls, the SIP flow diagram can also be displayed as a graph;
the graph will display the following information (cf., Fig. 9.9):
• All SIP packets that are in the same call will be coloured with the same colour.
• Arrows showing the direction of each SIP packet.
• Labels on top of each arrow will show the SIP message type.
• RTP traffic will be represented with a wider arrow with the corresponding pay-
load type on its top.
• The UDP/TCP source and destination ports per packet will be shown.
• The comment column will depend on the VoIP protocol in use. For the SIP protocol,
comments such as “Request” or “Status” messages will appear. For the RTP proto-
col, comments such as the number of RTP packets and the duration in seconds will
be displayed.
RTP streams can also be analysed from the “Telephony” menu item. A list of
RTP streams (cf., Fig. 9.10) will be displayed and any of them can be picked for
analysis (cf., Fig. 9.11).
The “RTP Stream Analysis” window will show some basic data such as the RTP
packet number and sequence number. Enhanced statistics, which are created based on
each packet's arrival time, delay, jitter and packet size, are also listed on a per-packet basis.
The lower pane of the window shows overall statistics such as minimum and
maximum delta, clock skew, jitter and packet loss rate. There is also an option to
save the payload for offline playback or further analysis.
A graph depicting several parameters such as jitter and delta can also be drawn
(cf., Fig. 9.12).
The following steps will help you to familiarize yourself with the Wireshark Graphical
User Interface (GUI). Wireshark has five main windows: the command menu, the display
filter specification, the listing of captured packets, the details of the selected packet
header, and the packet content in hexadecimal and ASCII.
• Start Wireshark from both Computer 1 and Computer 2.
• Capture a short period of live traffic and view it immediately. To do this,
click Capture → Start (you may first have to select an interface under the
Capture menu) (cf., Fig. 9.13).
• If you then try to browse any website (e.g., google.com), you should be able to
see captured packets on the capture screen.
• Stop the capture.
• Understand the significance of each of the columns in the top part of the Wire-
shark window: No, Time, Source, Destination, Protocol, Info.
• Wireshark will show both the dotted decimal form and the hex form of IP ad-
dresses. Look at an example and check if they are the same.
9.3 Introduction to Netem and tc Commands
Netem is a network emulator for the Linux kernel 2.6.7 and higher versions. Netem
emulates network dynamics such as IP packet delay, drops, corruption and dupli-
cation. Netem extends the Linux Traffic Control (tc) command available in the iproute2
package. In order for Netem to work, Linux must be configured to act as a router.
The simplest network topology for Netem to work is depicted in Fig. 9.14.
IP packets from the sender enter the Linux router via a network interface card,
whereby each packet is classified and queued before entering the Linux internal
IP packet handling. The IP packet classification is done by examining packet header
parameters such as the source and destination IP addresses.
From the IP packet handling process, packets are classified and queued ready for
transmission via the egress network interface card. The tc command is tasked with
classifying IP packets. The default queueing discipline used by tc is First In First
Out (FIFO).
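For reference, the qdisc referred to in Sect. 9.3.1 is added with commands of the following form (a sketch; eth0 is assumed to be the egress interface of the Linux router),
• tc qdisc add dev eth0 root netem loss 2%
• tc qdisc add dev eth0 root netem delay 100ms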
Changing and deleting qdisc commands have the same structure as the adding qdisc com-
mand. To change the 2 % loss rate added to the qdisc in Sect. 9.3.1 to 10 %, the follow-
ing command is used,
• tc qdisc change dev eth0 root netem loss 10%
To modify the 100 ms delay added to the qdisc in Sect. 9.3.1 to 200 ms, the following
command is used,
• tc qdisc change dev eth0 root netem delay 200ms
Figure 9.15 depicts the results shown (below the red line) when pinging a
Linux router with an emulated delay of 200 ms; the results show an average delay of
201 ms. The results above the red line show the ping results at the normal
delay, without any Netem entry, where the average delay is about 1 ms.
To delete a complete root qdisc tree added in Sect. 9.3.1 or modified in this
section, the following command is used,
• tc qdisc del dev eth0 root
9.4 Lab Scenario
This Lab uses Wireshark on Microsoft Windows XP, configured as shown
in Fig. 9.16. In this scenario, Wireshark is installed on the same computer as
X-Lite 4.
9.4.1 Challenges
This Lab will help you to understand and analyse the SIP messages captured during the
SIP registration process.
• Start Wireshark capture.
• Filter the SIP protocol only (cf., Fig. 9.17); example display filter expressions are given after this list.
• Start Asterisk server.
• Start X-Lite 4.
• Examine the SIP message headers under the Wireshark packet header screen.
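For reference, typing sip in the Wireshark display filter toolbar restricts the packet list to SIP traffic; a filter such as sip || rtp shows both the signalling and the media packets, and udp.port == 5060 shows everything carried on the default SIP port.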
9.5.1 Challenges
From Wireshark,
1. List the source and destination IP addresses.
2. List the source and destination port numbers.
3. List SIP methods seen on the header.
9.6 SIP Invite
This Lab will help you to understand and analyse the SIP message and RTP headers
during the SIP Invite process.
• Start Wireshark capture.
• Make a voice call to another SIP client.
• Filter SIP protocol and examine the SIP Message header (cf., Fig. 9.18).
9.6.1 Challenges
From Wireshark,
1. List SIP methods seen on the header.
2. List the status codes and their description.
3. Find the content-type.
228 9 Case Study 2—VoIP Quality Analysis and Assessment
This Lab will help you to understand the VoIP message flows between two or more
communicating SIP phones.
• Start Wireshark capture.
• Make a voice call to another SIP client.
• After a couple of minutes stop Wireshark.
• Under Wireshark menu, click “Telephony” and select “VoIP calls”.
• A list of calls will appear (cf., Fig. 9.22).
• Select one or more calls on the list and click flow.
• The message flow of the VoIP calls will appear (cf., Fig. 9.23).
9.7.1 Challenges
This Lab will use the traffic controller “tc” in Linux in order to manipulate traffic control
settings. It will also help you to assess the impact of packet losses on voice
and video quality.
• Start Wireshark capture.
• Make video calls between X-Lite clients and assess their quality (Q) in the range
of 1 to 5, 1 being worst and 5 excellent.
• Start Wireshark at each end where X-Lite is running.
• Capture the flowing RTP/RTCP traffic.
• Use the Wireshark “Statistics” tab on RTP and, by showing all streams, examine the
packet loss rate from Asterisk (cf., Fig. 9.24) to your X-Lite client. Fill in the
tables in Fig. 9.25 by replacing Q with the quality between 1 and 5 and P
with the packet loss rate obtained from the Wireshark statistics.
• Stop Wireshark capture.
9.8.1 Challenges
1. At the Linux terminal, identify the network interface used by Asterisk and
execute the command below
a. # tc qdisc add dev ethx root netem loss 2%
Replace ethx with the network interface used by Asterisk.
2. Give the meaning of the above command.
3. Start Wireshark at each end where X-Lite is running.
4. Capture the flowing RTP/RTCP traffic.
5. Use the Wireshark “Statistics” tab on RTP and, by showing all streams, examine
the packet loss rate from Asterisk to your X-Lite client. Wait a few minutes
until the packet loss rate stabilizes. Fill in the table in Fig. 9.26 by replacing Q
with the quality between 1 and 5 and P with the packet loss rate obtained from
the Wireshark statistics.
6. Observe packet loss rate values found in the RTCP report.
7. Stop Wireshark capture.
8. Delete the emulation created by the “tc” command; use “tc qdisc del dev ethx
root”.
9. Repeat this Lab for loss rates between 2 % and 10 % (a scripted version of this sweep is sketched after this list).
10. At what packet loss rate do you start to notice video quality degradation?
11. At what packet loss rate do you start to notice voice quality degradation?
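If preferred, the loss-rate sweep in Step 9 can be scripted. The following is a minimal sketch (ethx and the 2 % step size are our own assumptions; a test call should be made and rated at every step),
tc qdisc add dev ethx root netem loss 2%
# make and rate a test call at 2 % loss, then raise the loss rate step by step
for p in 4 6 8 10; do
    tc qdisc change dev ethx root netem loss ${p}%
    # make a test call here and record Q and P before the next step
done
tc qdisc del dev ethx root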
9.9 VoIP Quality Assessment: Delay Variation
This Lab will help you to assess the impact of delay variation on voice and video
quality. The following steps will help you to achieve this.
• Make video calls between X-Lite clients and assess their quality (Q) in the range
of 1 to 5, 1 being worst and 5 excellent.
• Start Wireshark capture at each end where X-Lite is running.
• Capture the flowing RTP/RTCP traffic.
• Use the Wireshark “Statistics” tab on RTP and, by showing all streams, examine
the Max Jitter and Mean Jitter from Asterisk to your X-Lite client. Fill in the
table in Fig. 9.27 by replacing Q with the quality between 1 and 5, and MaxJ and
MeanJ with the Max Jitter and Mean Jitter obtained from the Wireshark statistics,
respectively.
• Stop the Wireshark capture.
9.9.1 Challenges
1. At the Linux terminal, identify the network interface used by Asterisk and
execute the command below
a. # tc qdisc add dev ethx root netem delay 150ms 5ms
Replace ethx with the network interface used by Asterisk.
2. Give the meaning of the above command.
3. Start Wireshark at each end where X-Lite is running.
4. Capture the flowing RTP/RTCP traffic.
5. Use the Wireshark “Statistics” tab on RTP and, by showing all streams, examine
the Max and Mean Jitter from Asterisk to your X-Lite client. Fill in the table in
Fig. 9.28 by replacing Q with the quality between 1 and 5, and MaxJ and MeanJ
with the Max Jitter and Mean Jitter obtained from the Wireshark statistics, re-
spectively.
6. Observe delay variations found in the RTCP reports.
7. Stop Wireshark.
8. Delete the emulation created by the “tc” command; use “tc qdisc del dev ethx
root”.
9. Repeat this Lab for delay variations between 5 ms and 10 ms (see also the note after this list).
10. At what delay variation do you start to notice video quality degradation?
11. At what delay variation do you start to notice voice quality degradation?
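If you wish to experiment further, netem can also apply correlation or a statistical distribution to the delay variation; hedged examples (ethx and the values below are our own choices),
• tc qdisc change dev ethx root netem delay 150ms 10ms 25%
• tc qdisc change dev ethx root netem delay 150ms 10ms distribution normal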
9.10 Problems
1. Investigate and find out how network impairment (e.g., packet loss) affects
voice/video call quality (using subjective observation) and voice/video call es-
tablishment time by performing the following steps.
• Use the “tc” command to set two packet loss rates within the range of 0 % to
10 % (e.g., choose 0 % and 5 %, respectively) to investigate how network
impairment (i.e., packet loss) affects voice/video quality.
• Before you start a video call, make sure you start Wireshark first and cap-
ture the traffic during the call setup and the beginning part of the call ses-
sion (to keep trace data size small, but make sure the video session has
started before you stop Wireshark). When the required data is collected,
you can close Wireshark and save the trace data for offline analysis (the
trace data will be used for questions in both Problem 1 and Problem 2). At
the same time, you can start evaluating voice/video quality subjectively.
• Observe and explain how voice/video quality is affected by network packet
loss (you may give your own MOS score and describe your observations of the
quality changes, for example, some video freezing observed when the packet
loss rate is x %). You first need to explain briefly your VoIP testbed and
what “tc” commands have been used in your experiments for this task.
• Based on the trace data, draw a diagram and explain how the SIP call setup
is performed in the VoIP testbed. Further explain how the call setup time is
calculated and how call setup time is affected by the network impairments.
2. Choose one captured Wireshark trace data (from Problem 1 above) for a
voice/video call and answer the following questions.
• Are the voice stream and video stream transmitted separately, or combined
together during a video call? What are the payload types for the voice and video
sessions? What are the payload sizes for voice and video? Does the payload
size change for the voice or video session? What is the average payload size (in
bits) and sender bit rate (in kb/s) for the voice or video session? (Hint: choose
one session from PC-A to PC-B, and choose 3 or 4 GOPs for video to calculate
the average payload size and sender bit rate for the video session; a worked example
is given after this list.) Explain your findings and how you obtained your results/conclusions.
• From Wireshark, select “Statistics”, then “RTP”, then “Show All streams”.
Select one stream which you want to analyse (e.g., ITU-T G.711 PCMU),
click “Analyze”, then choose “Save Payload” which will save the sound
file in different format (e.g., .au). Using other tools (e.g., Audacity), can
you listen to the dumped audio trace? Is the VoIP system secure? Explain your
findings.
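As a worked example of the kind of calculation expected (the numbers assume ITU-T G.711 with 20 ms packetisation, which is typical but should be verified against your own trace): each RTP packet then carries 8000 samples/s × 0.02 s × 8 bits = 1280 bits (160 bytes) of payload and packets are sent at 50 packets/s, giving a payload rate of 64 kb/s; adding the roughly 40 bytes of RTP/UDP/IP headers per packet raises the on-wire rate to about 80 kb/s. For video, sum the payload bytes of the 3 or 4 GOPs you selected and divide by their duration to obtain the sender bit rate.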
10 Case Study 3—Mobile VoIP Applications and IMS
This Lab introduces the Open Source IMS Core, which deploys the main IMS call
session control functions described in Chap. 7, and IMSDroid as an IMS client. We
will also outline the main steps needed for successful installation, configuration and
setup of the Open Source IMS Core in Ubuntu and of IMSDroid on an Android based
mobile handset. We will finally demonstrate how to make SIP audio and video calls
between two Android based mobile handsets via the Open Source IMS Core over a
Wi-Fi access network.
10.1 What Is Open Source IMS Core
In 2004, the Fraunhofer Institute FOKUS launched the “Open IMS Playground”, and
by November 2006 the Open Source IMS Core (OSIMS Core) was released under
a GNU General Public License on the FOKUS BerliOS site.
The main goal of releasing the OSIMS Core was to fill the void in open source
IMS software which existed in the mid-2000s. The OSIMS Core has enabled
several research and development activities, such as ADAMANTIUM and GERYON,
to deploy IMS services and proofs of concept around the core IMS elements.
The OSIMS Core deploys the main IMS Call Session Control Functions, which act as the
central routing elements for any IMS SIP signaling, and a Home Subscriber Server to
manage user profiles and all associated routing rules. The central components of the
OSIMS Core are the Open IMS CSCFs (P-CSCF, I-CSCF and S-CSCF). The OSIMS
Core was developed as a set of extensions to the SIP Express Router (SER) [6]. SER
is an open source SIP server which acts as a SIP registrar, proxy or redirect server.
A simple HSS, the FOKUS Home Subscriber Server (FHoSS), is part of the OSIMS
Core. The FHoSS is written in Java and runs in the open source Tomcat servlet con-
tainer; its main component is based on the MySQL database system. The main function
of the FHoSS is to manage user profiles and all the associated routing rules.
• Providing local registrar synchronization via the “Reg” event as per RFC 3680.
• Providing Path header support by inserting network and path identifiers for
correct further processing of SIP messages.
• Providing verification and enforcement of service routes.
• Maintaining stateful dialogs and supporting Record-Route verification and en-
forcement.
• Supporting IPSec setup by using the Cipher Key (CK) and Integrity Key (IK) from
Authentication and Key Agreement (AKA).
• Providing integrity protection for UA authentication.
• Supporting the Security-Client, Security-Server and Security-Verify headers
as per RFC 3329, Security Mechanism Agreement for SIP.
• Providing support for the basic P-Charging-Vector according to RFC 3455.
• Providing support for the Visited-Network-ID header as per RFC 3455.
• Acting as a router between end points by supporting NAT during signaling.
• Providing NAT support for media in case it is configured as a media proxy
through RTPProxy [8].
• Hiding the internal network from the outside by encrypting parts of the SIP
message; this is known as the Topology Hiding Inter-network Gateway (THIG).
• Firewalling capability that only allows signaling traffic coming from trusted net-
works via Network Domain Security (NDS).
The features of the S-CSCF in the OSIMS Core are illustrated in Fig. 10.4 and include,
• Supporting the full Cx interface to the HSS according to 3GPP TS 29.228.
• Providing authentication through AKAv1-MD5, AKAv2-MD5 and MD5.
• Supporting the Service-Route header as per RFC 3455.
• Supporting the Path header as per RFC 3455.
• Supporting the P-Asserted-Identity header according to RFC 3455.
• Supporting the Visited-Network-ID header according to RFC 3455.
• Downloading of the Service-Profile from the HSS via the Cx interface as per 3GPP TS
29.228.
• Supporting Initial Filter Criteria (iFC) triggering to enforce specific user routing
rules.
• Supporting ISC interface routing towards SIP application servers. The ISC helps
an application server to know the capabilities of the UA and invoke its services.
• Implementing a “Reg” event server with access restrictions, which allows it to
bind the UA location.
• Maintaining the state of the SIP Dialog.
The features of the FHoSS in the OSIMS Core are depicted in Fig. 10.5 and include,
• Supporting the 3GPP Cx Diameter interface to the S-CSCF and I-CSCF as per
3GPP TS 29.228.
• Supporting the 3GPP Sh Diameter interface to application servers as per 3GPP
TS 29.228.
• Supporting the 3GPP Zh Diameter interface as per 3GPP TS 29.109.
• Supporting integrated simple Authentication Centre (AuC) functionality.
• Implementing a Java based Diameter stack.
As Ubuntu is one of the most popular Linux distributions, the following instructions
demonstrate how to install the OSIMS Core on the Ubuntu Linux distribution.
Prerequisite Packages
The following Linux packages are required for successful OSIMS Core installation:
Oracle-java7-jdk, mysql-server, libmysqlclient15-dev, libxml2-dev, bind, ant, flex,
curl, libcurl4-gnutls-dev, openssl, bison and subversion.
Execute the following commands at the Ubuntu console terminal in order to add
the oracle-java7-jdk repository and eventually install it,
• sudo add-apt-repository ppa:webupd8team/java
• sudo apt-get update
• sudo apt-get install oracle-java7-installer
• If the installation is successful then, by running “java -version”, you should be
able to get a positive response showing the Java version (cf., Fig. 10.6)
The following commands will install mysql-server, libmysqlclient15-dev,
libxml2, libxml2-dev, bind9, ant, flex, bison, curl, libcurl4-gnutls-dev, openssl
and subversion for the OSIMS Core,
– sudo apt-get install mysql-server libmysqlclient15-dev libxml2 libxml2-dev
bind9 ant flex bison curl libcurl4-gnutls-dev openssl subversion
– If the MySQL installation is successful then, by running “mysql -V”, you should
be able to get a positive response showing the MySQL version (cf., Fig. 10.7).
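If the installation directory /opt/OpenIMSCore does not exist yet, create it first (this path follows the usual OSIMS Core install guide),
• sudo mkdir -p /opt/OpenIMSCore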
Give yourself the ownership of the OSIMS Core directory, replace username with
your current username,
• sudo chown -R username /opt/OpenIMSCore/
Create CSCFs and the FHoSS directories in the OSIMS Core directory,
• cd /opt/OpenIMSCore
• mkdir ser_ims
• mkdir FHoSS
Execute the following commands to check out the latest version of the OSIMS
Core from the BerliOS subversion server,
• svn checkout http://svn.berlios.de/svnroot/repos/openimscore/ser_ims/trunk ser_ims
• svn checkout http://svn.berlios.de/svnroot/repos/openimscore/FHoSS/trunk FHoSS
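After the checkout, the sources are normally compiled before the DNS is configured. Under the default layout this is done roughly as follows (a sketch based on the usual OSIMS Core installation guide; verify the targets against the guide you are following),
• cd /opt/OpenIMSCore/ser_ims && make install-libs all
• cd /opt/OpenIMSCore/FHoSS && ant compile deploy
The domain open-ims.test must then be made resolvable by the local DNS server. Add a zone entry such as the following to your bind configuration (typically /etc/bind/named.conf.local),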
zone "open-ims.test" {
type master;
file "/etc/bind/open-ims.dnszone";
};
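The zone file /etc/bind/open-ims.dnszone referenced above ships with the OSIMS Core sources (typically under ser_ims/cfg/open-ims.dnszone) and can be copied into place, for example,
• sudo cp /opt/OpenIMSCore/ser_ims/cfg/open-ims.dnszone /etc/bind/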
Edit the file /etc/resolv.conf and add the following lines below,
• search open-ims.test
• nameserver 127.0.0.1
You might need to reload bind for the above changes to take effect,
• sudo /etc/init.d/bind9 reload
Try to ping and see if you get a positive response (cf., Fig. 10.10).
• ping pcscf.open-ims.test
10.2 What Is Android
Mobile operators that form OHA include, Bouygues Telecom, China Mobile
Communications Corporation, China Telecommunications Corporation, China
United Network Communications, KDDI CORPORATION, NTT DOCOMO, INC.,
SOFTBANK MOBILE Corp., Sprint Nextel, T-Mobile, Telecom Italia, Telefónica,
TELUS and Vodafone.
Handset manufacturers that contribute to Android through OHA include, Acer
Inc., Alcatel mobile phones, ASUSTeK Computer Inc., CCI, Dell, Foxconn Inter-
national Holdings Limited, FUJITSU LIMITED, Garmin International, Inc., Haier
Telecom (Qingdao) Co., Ltd., HTC Corporation, Huawei Technologies, Kyocera,
Lenovo Mobile Communication Technology Ltd., LG Electronics, Inc., Motorola,
Inc., NEC Corporation, Pantech, Samsung Electronics, Sharp Corporation, Sony Er-
icsson, Toshiba Corporation and ZTE Corporation.
Semiconductor companies that constitute OHA include, AKM Semiconduc-
tor Inc, Audience, ARM, Atheros Communications, Broadcom Corporation, CSR
Plc., Cypress Semiconductor Corporation, Freescale Semiconductor, Gemalto, In-
tel Corporation, Marvell Semiconductor, Inc., MediaTek, Inc., MIPS Technologies,
Inc., NVIDIA Corporation, Qualcomm Inc., Renesas Electronics Corporation, ST-
Ericsson, Synaptics, Inc., Texas Instruments Incorporated and Via Telecom.
Software companies that form OHA include, Ándago Ingeniería S.L., ACCESS
CO., LTD., Ascender Corp., Cooliris, Inc., eBay Inc., Google Inc., LivingImage
LTD., Myriad, MOTOYA Co., Ltd., Nuance Communications, Inc., NXP Software,
OMRON SOFTWARE Co, Ltd., PacketVideo (PV), SkyPop, SONiVOX, SVOX,
and VisualOn Inc.
Commercialization companies that comprise OHA include, Accenture, Aplix
Corporation, Borqs, Intrinsyc Software International, L&T Infotech, Noser Engi-
neering Inc., Sasken Communication Technologies Limited, SQLStar International
Inc., TAT—The Astonishing Tribe AB, Teleca AB, Wind River and Wipro Tech-
nologies.
Android has experienced significant growth since the first release of the Android beta
in 2007. According to Canalys’ statistics [1], in the second quarter (Q2) of 2012,
quarterly shipments of Android smart phones surpassed 100 million for the
first time. Table 10.1 depicts the smart phone market share amongst popular mobile
operating systems.
The Android architecture (cf., Fig. 10.13) follows a bottom-up paradigm. The
bottom layer is the Linux kernel, which runs Linux version 2.6.x for core system
services such as security, memory and process management, network stacks and
the driver model.
The next layer consists of the Android native libraries, written in C and C++; among
others, these include the media framework, SQLite and WebKit.
The Android Runtime is made up of the Dalvik Virtual Machine (DVM) and the core
Java libraries. The DVM is a Java Virtual Machine (JVM) optimized for Android mobile
devices with low memory and processing power. The core Java libraries provide
most of the classes defined in the Java SE libraries, such as the networking and I/O
libraries.
The Application Framework layer provides the interface between Android applica-
tions and the native Android libraries. This layer also manages the default phone
functions, such as voice calls, and resource management, such as energy and memory
resources.
The application layer includes the default pre-installed Android applications such as
SMS, the dialer, the web browser and the contact manager. This layer allows Android
developers to install their own applications without seeking permission from Google,
the main developer. This layer is written in Java.
The official release of Android was in October 2008, when the T-Mobile G1 was
launched in the USA. Table 10.2 traces the history of the major Android ver-
sions from October 2008 to August 2012.
IMSDroid [2] is an open source IMS client that implements the 3GPP IMS client spec-
ifications. The client is developed by Doubango Telecom [3]. Doubango Telecom
is a telecommunication company specializing in NGN technologies, such as 3GPP,
TISPAN and PacketCable, with the aim of providing open source NGN products.
Apart from Android, Doubango also has open source IMS clients for Windows
Mobile, iPhone, iPad and Symbian. The SIP implementation is based on the RFC 3261
and 3GPP TS 24.229 Rel-9 specifications. IMSDroid is built to support both voice
and SMS over LTE as outlined in the One Voice initiative (Version 1.0.0) (cf.,
Fig. 10.14).
The One Voice Profile outlines the minimum requirements that a wireless mobile
device and network must implement in order to guarantee an interoperable, high quality
IMS based telephony service over LTE network access. The architecture includes IMS
capabilities such as SIP registration, authentication, addressing, call establishment,
call termination, and signalling tracing and compression.
The main features of IMSDroid include,
• Supports video call codecs such as VP8, H.264, MP4V-ES, Theora, H.263,
H.263-1998 and H.261.
• Supports DTMF according to RFC 4733.
• Implements QoS negotiation using Preconditions according to RFC 3312, RFC
4032 and RFC 5027.
• Implements SIP Session Timers as per RFC 4028.
• Implements Provisional Response Acknowledgments (PRACK).
• Supports communication Hold according to 3GPP TS 24.610.
• Implements Message Waiting Indication (MWI) as per 3GPP TS 24.606.
• It is capable of calling E.164 numbers by using ENUM protocol according to
RFC 3761.
• Supports NAT Traversal using STUN2 as per RFC 5389, with the possibility to
automatically discover the server by using DNS SRV.
• Supports Image Sharing according to PRD IR.79 Image Share Inter-operability
Specification 1.0.
• Supports Video Sharing as per PRD IR.74 Video Share Inter-operability Speci-
fication, 1.0.
• Implements File Transfer which conforms to OMA SIMPLE IM 1.0.
• Supports Explicit Communication Transfer (ECT) using the IP Multimedia (IM)
Core Network (CN) subsystem as per 3GPP TS 24.629.
• Supports IP Multimedia Subsystem (IMS) emergency sessions according to
3GPP TS 23.167.
• Supports Full HD (1080p) video.
• Supports NAT Traversal using ICE.
• Supports TLS and SRTP.
• Full support for RTCP as per RFC 3550 and other extensions such as RFC 4585
and RFC 5104.
• Implements MSRP chat.
• Implements adaptive video jitter buffer. This has advanced features like error
correction, packet loss retransmission and delay recovery.
• Fully supports RTCWeb standards such as ICE, SRTP/SRTCP, and RTCP-
MUX.
Figures 10.15 and 10.16 depict IMSDroid screen shots before and after register-
ing to IMS, respectively.
10.3 Lab Scenario
Figure 10.17 depicts the Lab scenario that will be used to build up a testbed
for voice and video IMS communication using the OSIMS Core and IMSDroid. The
testbed consists of a Wi-Fi access network provided by a wireless router, two Android
based mobile handsets with IMSDroid installed, and the OSIMS Core for VoIP session
setup and termination.
IMSDroid can be downloaded and installed from Google Play [4], previously known
as the Android Market. The following configurations are required in order for IMSDroid
to work with the OSIMS Core.
After successful installation and running of the OSIMS Core, the main task left is
to add OSIMS Core subscribers. This is easily done through the FHoSS web based inter-
face manager. By default, the FHoSS comes provisioned with the subscribers
[email protected] and [email protected]. The password for alice is alice and that of bob is
bob.
You can always use the FHoSS web interface at http://localhost:8080 on the
FHoSS machine. By default, the administrator username is “hssAdmin” and the
password is “hss”.
Click the “Create” menu item under “IMS Subscription” on the left menu, then insert
the Name of a new user of your choice. In our case we have used “charlie” as the username
(cf., Fig. 10.21); leave the other fields unchanged and click the “Save” button.
Associate IMSU
When the “Save” button is clicked, another screen will appear on the right side.
This screen (cf., Fig. 10.23) is for associating the IMSU with an IMPI. Input the IMS
User subscription which you created, in our case “charlie”, and then click the
“Add/Change” button. Once the “charlie” IMSU is added, “charlie” will appear
under the “Associated IMSU” section (cf., Fig. 10.24).
Then click the “Save” button to save, leaving the rest of the fields unchanged (cf.,
Fig. 10.25).
After clicking the “Save” button, another screen will appear on the right;
this screen (cf., Fig. 10.26) is for adding a “Visited network” for the IMPU
sip:[email protected]. If this step is not done, user charlie will not be able
to register to the IMS. Under the list of “Visited network”, select open-ims.test and
click the “Add” button. The IMPU sip:[email protected] will then appear in the “Visited
network” section (cf., Fig. 10.27).
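Finally, IMSDroid itself has to be pointed at the OSIMS Core before calls can be placed. The following is a typical configuration sketch for the user charlie created above (the address 192.168.1.10 is our own assumption for the machine running the P-CSCF; 4060 is the default P-CSCF port of the OSIMS Core):
• Identity: Display Name charlie, IMPU sip:[email protected], IMPI [email protected], Password as provisioned in the FHoSS
• Network: Realm open-ims.test, Proxy-CSCF Host 192.168.1.10, Port 4060, Transport UDP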
This section will demonstrate how to make voice and video calls between Android
based mobile handsets installed with IMSDroid via the IMS.
Voice and video calls can be placed from the dialer, address book and History. The
dialer is accessible from the home screen.
You can enter any phone number (for instance, ‘+441752586230’ or
‘01752586278’) or a SIP URI (for example, ‘sip:[email protected]’). If the SIP URI
is incomplete (for instance, ‘alice’), the IMSDroid application will automatically add
the prefix “sip:” and a domain name (in our case ‘@open-ims.test’) as described in
the “realm” before placing the call.
If you input a telephone number with a ‘tel:’ prefix, the client will map the num-
ber to the SIP URI using ENUM protocol.
The IMSDroid dialer is depicted in Fig. 10.29. The username alice is ready to be
called; the bottom left square is for placing video calls, while the second bottom
left square is for placing voice calls. Once a call is placed, the outgoing screen will
appear (cf., Fig. 10.30) with the callee username and a “Cancel” button to terminate
the call if needed.
Once a call is placed and answered at the other end, a new screen, the “In Call Screen”
(cf., Fig. 10.31), will appear and a notification icon will be displayed in the status
bar of the mobile handset. This icon will stay in the status bar as long as the phone is
in a call. The icon in the status bar is useful because it allows you to reopen the
“In Call Screen” if it is hidden. On the callee side, the “Incoming Call Screen” appears
(cf., Fig. 10.32) with the caller username and two buttons, either to “Answer” the call
or to “Cancel” it.
If the “Answer” button is clicked, the session will be established. The video
screen will appear if a video session is established (cf., Fig. 10.33). You can share
multimedia content by pressing the “Menu” button as long as a call is ongoing (cf.,
Fig. 10.34).
10.5 Problems
1. This case study and the case study in Chap. 8 both use SIP as the signalling protocol.
This means that there is a possibility to interconnect the two systems.
• Outline the steps needed to interconnect the OSIMS Core and Asterisk (a possible
starting point is sketched after this problem list).
• Implement the above steps and make sure you can make a call from a user
connected to Asterisk to a user connected to the IMS, and vice versa.
2. Compute the call setup time between users connected to OSIMS Core.
3. Compare the call setup time between the two systems, i.e., Asterisk and OSIMS
Core.
4. By using Wireshark, compare and contrast SIP Registration headers of OSIMS
Core and Asterisk.
5. By using Wireshark, compare and contrast SIP Invite headers of OSIMS Core
and Asterisk.
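As a possible starting point for Problem 1 (a sketch only: the peer name osims, the context from-osims and the 7-prefix routing rule are our own choices, the OSIMS Core address must be filled in, and routing from the IMS side towards Asterisk still has to be arranged, for example via iFC or DNS), a SIP peer pointing at the OSIMS Core P-CSCF could be declared on the Asterisk side in sip.conf,
[osims]
type=friend
host=<IP address of the OSIMS Core>
port=4060
context=from-osims
insecure=port,invite
and a rule added to extensions.conf so that dialled numbers beginning with 7 are routed to the IMS,
exten => _7.,1,Dial(SIP/${EXTEN:1}@osims)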
References
1. Canalys (2012) Global smart phone market. http://www.eeherald.com/section/news/
nws20120861.html. [Online; accessed 07-August-2012]
2. Doubango (2011) Imsdroid: SIP/IMS client for Android. http://code.google.com/p/imsdroid/.
[Online; accessed 07-August-2012]
3. Doubango (2011) Ngn open source projects. http://www.doubango.org/index.html. [Online;
accessed 07-August-2012]
4. Google (2012) Google play. https://play.google.com/store. [Online; accessed 07-August-
2012]
5. GSMA (2012) GSM Association. http://www.gsma.com. [Online; accessed 07-August-
2012]
6. IPTEL (2001) Sip express router. http://www.iptel.org/ser. [Online; accessed 12-August-
2012]
7. OHA (2009) Open handset alliance. http://www.openhandsetalliance.com. [Online; accessed
07-August-2012]
8. Software S (2008) Sippy rtpproxy. http://www.rtpproxy.org/. [Online; accessed 12-August-
2012]
Index
D
DAHDI, 193, 196, 211
Dalvik Virtual Machine, 248
Database, 119
DCME, 38
DCR, 139, 149
DCT, 44
Debian, 196
Degradation Category Rating, 139, 149
Dialog, 117
Differential Pulse Coding Modulation, 59
Digital Circuit Multiplication Equipment, 38
Digium, 195, 207
Discontinuous transmission, 42
Discrete Cosine Transform, 44, 54
DNS, 106
Doubango, 250
Double stimulus impairment scale, 149
Double-stimulus continuous quality-scale, 150
DPCM, 59
DSCH, 186
DSCQS, 150
DSIS, 149
DTMF, 197
DTX, 42
DVM, 248
E
E-model, 145, 147
EIA, 35
Electronic Industries Association, 35
EPC, 187
EPS, 187
Establishment, 118
Ethernet bandwidth, 82
Ethernet BW, 82
ETSI, 34, 36, 163
European Telecommunication Standards Institute, 34
F
Fedora, 196
FHoSS, 237
FIFO, 224
Flow, 117
FOKUS, 237
Forking Proxy, 104
Format, 107
Forward Error Control (FEC), 6
From, 108, 111
Full-Reference (FR) video quality assessment
Fullband, 44
G
G1, 249
GERAN, 181
GGSN, 173, 180
GIP, 180
Google Chrome, 7
Google Talk, 7
GPRS, 180
GSM, 39, 179, 251
GSMA, 251
GUI, 222
H
H.263, 67
H.264, 68
H.265, 69
H.323, 101
H323, 193
Header Fields, 111
HEVC, 69
High Definition Voice, 44
Highly Efficiency Video Coding, 69
HSDPA, 182
HSS, 173
HTTP, 101, 107
Hypertext Transfer Protocol, 101
I
IAX2, 193, 194
IBCF, 168
ICID, 171
IETF, 36, 115, 163
IFC, 240
ILBC, 41
IM, 10, 101
IM-SSF, 172
IMPI, 256
IMPU, 256
IMPUI, 169
IMS, 101, 114, 163, 251
IMSDroid, 250
IMSU, 255
IMT-2000, 178
INFO, 108
Information and Communication Technology (ICT), 2
INMARSAT, 35
Instant Message (IM), 2
Instant Messaging, 10
Integrated Switched Digital Network (ISDN), 3
Integrity Key, 239
Interarrival jitter, 133
Fullband, 44 Interim Standard, 35
Index 267