
Computer Communications and

Networks

For further volumes:


www.springer.com/series/4198
The Computer Communications and Networks series is a range of textbooks,
monographs and handbooks. It sets out to provide students, researchers and non-
specialists alike with a sure grounding in current knowledge, together with com-
prehensible access to the latest developments in computer communications and net-
working.

Emphasis is placed on clear and explanatory styles that support a tutorial approach,
so that even the most complex of topics is presented in a lucid and intelligible man-
ner.
Lingfen Sun · Is-Haka Mkwawa · Emmanuel Jammeh · Emmanuel Ifeachor

Guide to Voice and


Video over IP
For Fixed and Mobile Networks
Lingfen Sun
School of Computing and Mathematics,
University of Plymouth,
Plymouth, UK

Is-Haka Mkwawa
School of Computing and Mathematics,
University of Plymouth,
Plymouth, UK

Emmanuel Jammeh
School of Computing and Mathematics,
University of Plymouth,
Plymouth, UK

Emmanuel Ifeachor
School of Computing and Mathematics,
University of Plymouth,
Plymouth, UK

Series Editors
A.J. Sammes
Centre for Forensic Computing
Cranfield University
Shrivenham campus
Swindon, UK

Computer Communications and Networks
ISSN 1617-7975
ISBN 978-1-4471-4904-0
ISBN 978-1-4471-4905-7 (eBook)
DOI 10.1007/978-1-4471-4905-7
Springer London Heidelberg New York Dordrecht

Library of Congress Control Number: 2013930008

© Springer-Verlag London 2013


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher’s location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of pub-
lication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect
to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

Since the release of the first Internet Phone in 1995, Voice over Internet Protocol
(VoIP) has grown exponentially, from a lab-based application to today’s established
technology, with global penetration, for real-time communications for business and
daily life. Many organisations are moving from the traditional PSTN networks to
modern VoIP solutions and are using VoIP products such as audio/video conferenc-
ing systems for their daily business operation. We depend on different VoIP tools
such as Skype, Google Talk and Microsoft Lync to keep in contact with our business
partners, colleagues, friends and family members, virtually any time and from any-
where. We now enjoy free or low-cost VoIP audio or even high-quality video calls,
which have made the world feel like a small village for real-time audio/video communi-
cations. VoIP tools have been incorporated into our mobile devices, tablets, desktop
PCs and even TV sets and the use of VoIP tools is just an easy one-click task.
Behind the huge success and global penetration of VoIP, we have witnessed
great advances in the technologies that underpin VoIP such as speech/video sig-
nal processing and compression (e.g., from narrowband, wideband to fullband
speech/audio compression), computer networking techniques and protocols (for bet-
ter and more efficient transmission of multimedia services), and mobile/wireless
communications (e.g., from 2G, 3G to 4G broadband mobile communications).
This book aims to provide an understanding and a practical guide to some of
the fundamental techniques (including their latest developments) which are behind
the success of VoIP. These include speech compression, video compression, me-
dia transport protocols (RTP/RTCP), VoIP signalling protocols (SIP/SDP), QoS and
QoE for voice/video calls, Next Generation Networks based on IP Multimedia Sub-
system (IMS) and mobile VoIP, together with case studies on how to build a VoIP
system based on Asterisk, how to assess and analyse VoIP quality, and how to set
up a mobile VoIP system based on Open IMS and Android mobile. We have pro-
vided many practical examples including real trace data to illustrate and explain
the concepts of relevant transport and signalling protocols. Exercises, illustrative
worked examples in the chapters and end-of-chapter problems will also help readers
to check their understanding of the topics and to stretch their knowledge. Step-by-
step instructions are provided in the case studies to enable readers to build their own
open-source based VoIP system and to assess voice/video call quality accordingly,
or to set up their own mobile VoIP system based on Open IMS Core and IMSDroid
with an Android mobile. Challenging questions are set in the case studies to help
readers think more deeply and practise further.


This book has benefitted from the authors’ research activities in VoIP and re-
lated areas over more than 10 years. In particular, it has benefitted from recent in-
ternational collaborative projects, including the EU FP7 ADAMANTIUM project
(Grant agreement no. 214751), the EU FP7 GERYON project (Grant agreement
no. 284863) and the EU COST Action IC1003 European Network on Quality of
Experience in Multimedia Systems and Services (QUALINET). The book has also
benefitted from the authors’ teaching experience in developing and delivering mod-
ules on “Voice and Video over IP” to undergraduate and postgraduate students at
Plymouth University over the past four years. Some of the contents of the book were
drawn from the lecture notes and some of the case study materials from the lab
activities.
This book can be used as a textbook for final year undergraduate and first year
postgraduate courses in computer science and/or electronic engineering. It can also
serve as a reference book for engineers in industry and for those interested in VoIP,
for example, those who wish to have a general understanding of VoIP as well as
those who wish to have an in-depth and practical understanding of key VoIP tech-
nologies.
In this book, Dr. Sun has contributed to Chaps. 1 (Introduction), 2 (Speech Com-
pression), 3 (Video Compression), 4 (Media Transport) and 6 (VoIP QoE); Dr. Mk-
wawa has contributed to Chaps. 1 (Introduction), 5 (SIP Signalling), 7 (IMS and Mo-
bile VoIP), 8 (Case Study 1), 9 (Case Study 2) and 10 (Case Study 3); Dr. Jammeh
has contributed to Chap. 3 (Video Compression) and Professor Ifeachor has con-
tributed to Chap. 1 (Introduction) and the book editing. Due to time constraints
and the limitations of our knowledge, some errors or omissions may be inevitable in
the book; we welcome any feedback and comments about it.
Finally, we would like to thank Simon Rees, our editor at Springer-Verlag, for
his encouragement, patience, support and understanding over the past two years in
helping us complete the book. We would also like to express our deepest gratitude
to our families for their love, support and encouragement throughout the writing of
this book.
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Overview of VoIP . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 How VoIP Works and Factors That Affect Quality . . . . . . . . . 3
1.3 VoIP Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Microsoft’s Lync . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Skype . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.3 Google Talk . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.4 X-Lite . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 VoIP Trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 VoIP Protocol Stack and the Scope of the Book . . . . . . . . . . . 12
1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2 Speech Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Speech Compression Basics . . . . . . . . . . . . . . . . . . . . . 18
2.2.1 Speech Signal Digitisation . . . . . . . . . . . . . . . . . . 18
2.2.2 Speech Waveform and Spectrum . . . . . . . . . . . . . . 21
2.2.3 How Is Human Speech Produced? . . . . . . . . . . . . . . 23
2.3 Speech Compression and Coding Techniques . . . . . . . . . . . . 25
2.3.1 Waveform Compression Coding . . . . . . . . . . . . . . . 26
2.3.2 Parametric Compression Coding . . . . . . . . . . . . . . 28
2.3.3 Hybrid Compression Coding—Analysis-by-Synthesis . . . 31
2.3.4 Narrowband to Fullband Speech Audio Compression . . . 35
2.4 Standardised Narrowband to Fullband Speech/Audio Codecs . . . 36
2.4.1 ITU-T G.711 PCM and G.711.1 PCM-WB . . . . . . . . . 36
2.4.2 ITU-T G.726 ADPCM . . . . . . . . . . . . . . . . . . . . 37
2.4.3 ITU-T G.728 LD-CELP . . . . . . . . . . . . . . . . . . . 38
2.4.4 ITU-T G.729 CS-ACELP . . . . . . . . . . . . . . . . . . 38
2.4.5 ITU-T G.723.1 MP-MLQ/ACELP . . . . . . . . . . . . . 39
2.4.6 ETSI GSM . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4.7 ETSI AMR . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.4.8 IETF’s iLBC . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.4.9 Skype/IETF’s SILK . . . . . . . . . . . . . . . . . . . . . 41
2.4.10 ITU-T G.722 ADPCM-WB . . . . . . . . . . . . . . . . . 42


2.4.11 ITU-T G.722.1 Transform Coding . . . . . . . . . . . . . 43


2.4.12 ETSI AMR-WB and ITU-T G.722.2 . . . . . . . . . . . . 44
2.4.13 ITU-T G.719 Fullband Audio Coding . . . . . . . . . . . . 44
2.4.14 Summary of Narrowband to Fullband Speech Codecs . . . 45
2.5 Illustrative Worked Examples . . . . . . . . . . . . . . . . . . . . 45
2.5.1 Question 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.2 Question 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.5.3 Question 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3 Video Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.1 Introduction to Video Compression . . . . . . . . . . . . . . . . . 53
3.2 Video Compression Basics . . . . . . . . . . . . . . . . . . . . . 55
3.2.1 Digital Image and Video Colour Components . . . . . . . . 55
3.2.2 Colour Sub-sampling . . . . . . . . . . . . . . . . . . . . 56
3.2.3 Video Resolution and Bandwidth Requirement . . . . . . . 57
3.3 Video Compression Techniques . . . . . . . . . . . . . . . . . . . 58
3.4 Lossless Video Compression . . . . . . . . . . . . . . . . . . . . 58
3.5 Lossy Video Compression . . . . . . . . . . . . . . . . . . . . . . 59
3.5.1 Predictive Coding . . . . . . . . . . . . . . . . . . . . . . 59
3.5.2 Quantisation . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.5.3 Transform Coding . . . . . . . . . . . . . . . . . . . . . . 61
3.5.4 Interframe Coding . . . . . . . . . . . . . . . . . . . . . . 62
3.6 Video Coding Standards . . . . . . . . . . . . . . . . . . . . . . . 63
3.6.1 H.120 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.6.2 H.261 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.6.3 MPEG 1&2 . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.6.4 H.263 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.6.5 MPEG-4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.6.6 H.264 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.6.7 High Efficiency Video Coding (HEVC) . . . . . . . . . . . 69
3.7 Illustrative Worked Examples . . . . . . . . . . . . . . . . . . . . 69
3.7.1 Question 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.7.2 Question 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.7.3 Question 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.9 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4 Media Transport for VoIP . . . . . . . . . . . . . . . . . . . . . . . . 73
4.1 Media Transport over IP Networks . . . . . . . . . . . . . . . . . 73
4.2 TCP or UDP? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Real-Time Transport Protocol—RTP . . . . . . . . . . . . . . . . 76
4.3.1 RTP Header . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.2 RTP Header for Voice Call Based on Wireshark . . . . . . 78
4.3.3 RTP Payload and Bandwidth Calculation for VoIP . . . . . 80

4.3.4 Illustrative Worked Example . . . . . . . . . . . . . . . . 83


4.3.5 RTP Header for Video Call Based on Wireshark . . . . . . 84
4.4 RTP Control Protocol—RTCP . . . . . . . . . . . . . . . . . . . . 88
4.4.1 RTCP Sender Report and Example . . . . . . . . . . . . . 89
4.4.2 RTCP Receiver Report and Example . . . . . . . . . . . . 91
4.4.3 RTCP Source Description and Example . . . . . . . . . . . 92
4.4.4 RTCP BYE Packet and Example . . . . . . . . . . . . . . 94
4.4.5 Extended RTCP Report—RTCP XR for VoIP Metrics . . . 95
4.5 Compressed RTP—cRTP . . . . . . . . . . . . . . . . . . . . . . 96
4.5.1 Basic Concept of Compressed RTP—cRTP . . . . . . . . . 96
4.5.2 Illustrative Worked Example . . . . . . . . . . . . . . . . 98
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.7 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5 VoIP Signalling—SIP . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.1 What is Session Initiation Protocol? . . . . . . . . . . . . . . . . . 101
5.1.1 SIP Network Elements . . . . . . . . . . . . . . . . . . . . 102
5.1.2 User Agent . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.1.3 Proxy Server . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.1.4 Redirect Server . . . . . . . . . . . . . . . . . . . . . . . 105
5.1.5 Registrar . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.1.6 Location Server . . . . . . . . . . . . . . . . . . . . . . . 106
5.2 SIP Protocol Structure . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2.1 SIP Message Format . . . . . . . . . . . . . . . . . . . . . 107
5.3 Session Description Protocol . . . . . . . . . . . . . . . . . . . 113
5.3.1 Session Description . . . . . . . . . . . . . . . . . . . . . 114
5.3.2 Time Description . . . . . . . . . . . . . . . . . . . . . . 115
5.3.3 Media Description . . . . . . . . . . . . . . . . . . . . . . 115
5.3.4 Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.3.5 Example of SDP Message from Wireshark . . . . . . . . . 117
5.4 SIP Messages Flow . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.4.1 Session Establishment . . . . . . . . . . . . . . . . . . . . 118
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.6 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6 VoIP Quality of Experience (QoE) . . . . . . . . . . . . . . . . . . . 123
6.1 Concept of Quality of Service (QoS) . . . . . . . . . . . . . . . . 123
6.1.1 What is Quality of Service (QoS)? . . . . . . . . . . . . . 123
6.1.2 QoS Metrics and Measurements . . . . . . . . . . . . . . . 124
6.1.3 Network Packet Loss and Its Characteristics . . . . . . . . 125
6.1.4 Delay, Delay Variation (Jitter) and Its Characteristics . . . . 130
6.2 Quality of Experience (QoE) for VoIP . . . . . . . . . . . . . . . . 135
6.2.1 What is Quality of Experience (QoE)? . . . . . . . . . . . 135
6.2.2 Factors Affecting Voice Quality in VoIP . . . . . . . . . . . 136
6.2.3 Overview of QoE for Voice and Video over IP . . . . . . . 137
6.3 Subjective Speech Quality Assessment . . . . . . . . . . . . . . . 138

6.4 Objective Speech Quality Assessment . . . . . . . . . . . . . . . . 141


6.4.1 Comparison-Based Intrusive Objective Test (Full-
Reference Model) . . . . . . . . . . . . . . . . . . . . . . 141
6.4.2 Parameter-Based Measurement: E-Model . . . . . . . . . . 145
6.4.3 A Simplified and Applicable E-Model . . . . . . . . . . . 147
6.5 Subjective Video Quality Assessment . . . . . . . . . . . . . . . . 148
6.6 Objective Video Quality Assessment . . . . . . . . . . . . . . . . 150
6.6.1 Full-Reference (FR) Video Quality Assessment . . . . . . 150
6.6.2 Reduced-Reference (RR) Video Quality Assessment . . . . 153
6.6.3 No-Reference Video Quality Assessment . . . . . . . . . . 154
6.7 Illustrative Worked Examples . . . . . . . . . . . . . . . . . . . . 155
6.7.1 Question 1 . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.7.2 Question 2 . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.7.3 Question 3 . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.9 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7 IMS and Mobile VoIP . . . . . . . . . . . . . . . . . . . . . . . . . . 163
7.1 What Is IP Multimedia Subsystem? . . . . . . . . . . . . . . . . . 163
7.1.1 What Do We Need IMS for? . . . . . . . . . . . . . . . . . 163
7.1.2 IMS Architecture . . . . . . . . . . . . . . . . . . . . . . 164
7.1.3 IMS Elements . . . . . . . . . . . . . . . . . . . . . . . . 167
7.1.4 IMS Services . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.1.5 IMS Signalling and Bearer Traffic Interfaces . . . . . . . . 174
7.2 Mobile Access Networks . . . . . . . . . . . . . . . . . . . . . . 177
7.2.1 Cellular Standards . . . . . . . . . . . . . . . . . . . . . . 178
7.2.2 The GSM Standard . . . . . . . . . . . . . . . . . . . . . 179
7.2.3 The UMTS Standard . . . . . . . . . . . . . . . . . . . . . 181
7.2.4 Long-Term Evolution . . . . . . . . . . . . . . . . . . . . 186
7.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
7.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8 Case Study 1—Building Up a VoIP System Based on Asterisk . . . . 193
8.1 What is Asterisk? . . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.1.1 Channel Modules . . . . . . . . . . . . . . . . . . . . . . 194
8.1.2 Codec Translator Modules . . . . . . . . . . . . . . . . . . 194
8.1.3 Application Modules . . . . . . . . . . . . . . . . . . . . 195
8.1.4 File Format Modules . . . . . . . . . . . . . . . . . . . . . 195
8.1.5 Installing Asterisk . . . . . . . . . . . . . . . . . . . . . . 196
8.2 What Is X-Lite 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.2.1 Using X-Lite . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.3 Voice and Video Injection Tools . . . . . . . . . . . . . . . . . . . 201
8.3.1 Manycam Video Injection Tool . . . . . . . . . . . . . . . 201
8.3.2 Virtual Audio Cable Injection Tool . . . . . . . . . . . . . 202
8.4 Lab Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
8.5 Adding SIP Phones . . . . . . . . . . . . . . . . . . . . . . . . . 205

8.6 Configuring Dial Plans . . . . . . . . . . . . . . . . . . . . . . . . 206


8.7 Configuring DAHDI Channels . . . . . . . . . . . . . . . . . . . . 207
8.8 Starting and Stopping Asterisk . . . . . . . . . . . . . . . . . . . 208
8.9 Setup SIP Phone . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
8.10 Making Voice Calls Between SIP Phones . . . . . . . . . . . . . . 209
8.11 Making Video Calls Between SIP Phones . . . . . . . . . . . . . . 211
8.12 Making Voice Calls Between SIP and Analogue Phones . . . . . . 211
8.13 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
9 Case Study 2—VoIP Quality Analysis and Assessment . . . . . . . . 215
9.1 What Is Wireshark . . . . . . . . . . . . . . . . . . . . . . . . . . 215
9.1.1 Live Capture and Offline Analysis . . . . . . . . . . . . . 215
9.1.2 Three-Pane Packet Browser . . . . . . . . . . . . . . . . . 216
9.1.3 VoIP Analysis . . . . . . . . . . . . . . . . . . . . . . . . 218
9.2 Wireshark Familiarization . . . . . . . . . . . . . . . . . . . . . . 222
9.3 Introduction to Netem and tc Commands . . . . . . . . . . . . . . 223
9.3.1 Adding qdisc . . . . . . . . . . . . . . . . . . . . . . . . . 224
9.3.2 Changing and Deleting qdisc . . . . . . . . . . . . . . . . 224
9.4 Lab Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
9.4.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 225
9.5 SIP Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.5.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.6 SIP Invite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
9.6.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 227
9.7 VoIP Messages Flow . . . . . . . . . . . . . . . . . . . . . . . . . 230
9.7.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 230
9.8 VoIP Quality Assessment: Packet Losses . . . . . . . . . . . . . . 232
9.8.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 232
9.9 VoIP Quality Assessment: Delay Variation . . . . . . . . . . . . . 233
9.9.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . 234
9.10 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234
10 Case Study 3—Mobile VoIP Applications and IMS . . . . . . . . . . 237
10.1 What Is Open Source IMS Core . . . . . . . . . . . . . . . . . . . 237
10.1.1 The Main Features of OSIMS Core P-CSCF . . . . . . . . 238
10.1.2 The Main Features of OSIMS Core I-CSCF . . . . . . . . 239
10.1.3 The Main Features of OSIMS Core S-CSCF . . . . . . . . 240
10.1.4 The Main Features of OSIMS Core FHoSS . . . . . . . . . 241
10.1.5 Installation and Configuration of OSIMS Core . . . . . . 242
10.2 What Is Android . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
10.2.1 Android Smart Phone Market Share . . . . . . . . . . . . . 248
10.2.2 Android Architecture . . . . . . . . . . . . . . . . . . . . 248
10.2.3 The History of Android . . . . . . . . . . . . . . . . . . . 249
10.2.4 IMSDroid IMS Client . . . . . . . . . . . . . . . . . . . . 250
10.3 Lab Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
10.3.1 Configuring IMSDroid . . . . . . . . . . . . . . . . . . . . 254

10.3.2 Adding OSIMS Core Subscribers . . . . . . . . . . . . . . 255


10.4 Making Voice and Video Calls . . . . . . . . . . . . . . . . . . . 260
10.4.1 Placing a Call . . . . . . . . . . . . . . . . . . . . . . . . 260
10.4.2 In Call Screen . . . . . . . . . . . . . . . . . . . . . . . . 261
10.5 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
1 Introduction

This chapter provides background information about the book. In particular, it pro-
vides an overview of VoIP to make the reader aware of its benefits and growing
importance, how it works and factors that affect VoIP quality. We also introduce
current VoIP approaches and tools which are used in the real world for VoIP calls
and highlight the trends in VoIP and its applications. Finally, we give an outline of
the book in relation to the VoIP protocol stack to give the reader a deeper insight
into the contents of the book.

1.1 Overview of VoIP

Voice over Internet Protocol or Voice over IP (VoIP) is a technology used to trans-
mit real-time voice over the Internet Protocol (IP) based networks (e.g., the Inter-
net or private IP networks). The original idea behind VoIP is to transmit real-time
speech signal over a data network and to reduce the cost for long distance calls
as VoIP calls would go through packet-based networks at a flat rate, instead of
the traditional Public Switched Telephone Network (PSTN) which was expensive
for long-distance calls. Today, the new trend is to include both voice and video
calls in VoIP. VoIP was originally invented by Alon Cohen and Lior Haramaty in
1995. The first Internet Phone was released in February 1995 by VocalTec, and a
flagship patent on an audio transceiver for real-time or near real-time communication
of audio signals over a data network was filed in 1998 [3]. This was the first at-
tempt in telecommunications history aimed at transmitting both data and voice
at the same time over one common network. Traditionally, voice and data
were sent over two separate networks, with data on packet networks and speech
over PSTN.
Since its invention, VoIP has grown exponentially, from a small-scale lab-based
application to today’s global tool with applications in most areas of business and
daily life. More and more organisations are moving from traditional PSTN to mod-
ern VoIP solutions, such as Microsoft’s Unified Communications Solution (Mi-


Fig. 1.1 A systematic diagram of VoIP systems and networks

crosoft Lync1) which provides a unified solution for voice, Instant Messaging, audio
and video conferencing for business operations. Telecommunication and network
service providers now offer attractive packages to customers which include the pro-
vision of VoIP, TV (or IPTV) together with broadband data access for Triple-play
and Quadruple-play services (including mobility). An increasing number of peo-
ple in different age groups now rely on VoIP products and tools such as Skype,2
to make voice/video calls to keep in contact with family and friends because they
are free or inexpensive. Many companies and organizations use VoIP (e.g., Skype)
for routine conference calls for project meetings and for interviewing prospective
employees. New VoIP applications such as mobile VoIP have widened the VoIP
arena further to include seamless and timely communications. VoIP has truly be-
come an invaluable tool which we rely on for business, social and family communi-
cations.
Behind the great success and the wide penetration of VoIP lie major technol-
ogy advances in Information and Communication Technology (ICT) which under-
pin its delivery and applications. Without these, VoIP as we know today would not
be possible. The key technologies include advanced speech compression methods
(including for narrowband, wideband and fullband compression), advanced video
compression methods (including layered coding to support various network condi-
tions), signalling protocols (SIP/SDP), media transport protocols (RTP/RTCP),
Quality of Service (QoS) and Quality of Experience (QoE) management, monitor-
ing and control; IMS (IP Multimedia Subsystem) and mobile VoIP. Descriptions of
these key technologies and their use in VoIP form an important part of this book.
IP based networks now carry all types of traffic, including real-time voice and
video. Figure 1.1 depicts a generalised set-up for VoIP calls. As can be seen in the
figure, a VoIP call can originate from or be sent to a mobile or landline device or

1 http://lync.microsoft.com

2 http://www.skype.com

a PC and may be routed through a variety of networks, including private networks,
the Internet, mobile/cellular networks and satellite links. For example, a VoIP call
is packetised and transmitted through the IP network to reach a callee via an IP phone,
a PC softphone, or an analogue/ISDN phone through appropriate media gateways.
A mobile phone call can also reach an IP phone or a PC softphone through a wireless
gateway and the IP network. Current dual-mode smart phones can automatically
switch between Wireless LAN (WLAN) access (when in the WLAN cloud) and
cellular mobile access. Internet access can be wireless or fixed line (wired). Wireless
access can be based on WLAN, Wireless Mesh Network or WiMAX. Fixed line or
wired access can be based on ADSL, cable modem or optical fibre to the home. The
broadband capability of wireless and wired Internet access has greatly extended the
scope of VoIP applications. Video phone
call and conferencing features further extend the applications of VoIP services. VoIP
facilities can be based on proprietary products, such as Cisco’s Voice Gateway and
Cisco’s CallManager,3 or on an open-source approach, such as the Asterisk4 open-
source VoIP PBX (Private Branch Exchange).

1.2 How VoIP Works and Factors That Affect Quality


Figure 1.2 shows the key steps and processes that take place as voice information
is transported over IP networks from the speaker to the listener (the process will be
similar for a video call where video codec will be involved instead of voice codec).
Unfortunately, IP networks are not designed to support real-time voice (or video)
communications. Factors such as network delay, jitter (variations in delay) and
packet loss lead to unpredictable deterioration in voice (or video) quality and these
should be borne in mind when we use VoIP. There is a fundamental need to measure
voice (or video) quality in communications network for technical, legal and com-
mercial reasons. The way this is done in practice is an important aspect of this book.
As can be seen in Fig. 1.2, the first step is to digitise the analog voice signal
using the encoder (a specialised analog-to-digital converter). This compresses the
voice signal into digital samples (see later for more detail). The basic encoder is the
ITU G.711 which samples the voice signal once every 0.125 ms (8 kHz) and gener-
ates 8-bits per sample (i.e., 64 kb/s). The more recent encoders provide significant
reduction in data rate (e.g., G.723.1, G.726 and G.729). The encoder introduces a
variety of impairments, including delay and encoding distortions which depend on
the type of encoder used.
Next, the packetizer places a certain number of speech samples (in the case of
G.711) or frames (in the case of encoders such as G.723.1 and G.729) into packets
and then adds relevant protocol headers to the data to form IP packets. The headers
are necessary for successful transmission and routing of the packets through the
networks and for recovery of the data at the receiving end.
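To make the packetisation step concrete, the short Python sketch below estimates the
packet size and the IP-layer bandwidth for G.711; the 20 ms packetisation interval and
the 40-byte IPv4/UDP/RTP header overhead are illustrative assumptions rather than
values taken from this chapter (Chap. 4 treats the bandwidth calculation in detail).

# Rough G.711 packetisation and bandwidth estimate (illustrative assumptions:
# 20 ms of speech per packet, 40 bytes of IPv4 + UDP + RTP headers).
SAMPLE_RATE_HZ = 8000        # G.711 takes 8000 samples per second
BITS_PER_SAMPLE = 8          # 8 bits per sample, i.e. 64 kb/s
PACKET_INTERVAL_S = 0.020    # assumed 20 ms of speech carried in each packet
HEADER_BYTES = 40            # assumed IPv4 (20) + UDP (8) + RTP (12) headers

payload_bytes = int(SAMPLE_RATE_HZ * PACKET_INTERVAL_S * BITS_PER_SAMPLE / 8)
packet_bytes = payload_bytes + HEADER_BYTES
packets_per_second = 1 / PACKET_INTERVAL_S
bandwidth_kbps = packet_bytes * 8 * packets_per_second / 1000

print(f"payload per packet : {payload_bytes} bytes")      # 160 bytes
print(f"packet size (IP)   : {packet_bytes} bytes")       # 200 bytes
print(f"IP-layer bandwidth : {bandwidth_kbps:.0f} kb/s")  # 80 kb/s

With these assumptions, the 64 kb/s voice stream becomes an 80 kb/s stream at the
IP layer, which illustrates why header overhead matters for low-bit-rate codecs.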

3 http://www.cisco.com

4 http://www.asterisk.org

Fig. 1.2 Key processes that take place in transmitting VoIP

The voice packets are then sent over the IP network. As the voice packets are
transported, they may be subjected to a variety of impairments such as delay, delay
variation and packet loss.
At the receiving end, a de-packetizer is used to remove the protocol headers added
for transmission; a jitter buffer (or playback buffer) is used to absorb delay variations
suffered by the voice packets and make it possible to obtain a smooth playout and
hence a smooth reconstruction of speech. Playout buffers can lead to additional packet
loss as packets arriving too late are discarded. Some modern codecs have built-in
packet loss concealment mechanisms which can alleviate the impact of network
packet loss on voice quality.
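The following minimal Python sketch illustrates the jitter (playout) buffer idea
described above: each packet has a scheduled playout time equal to its generation
time plus a fixed playout delay, and packets that arrive later than that are discarded.
The 60 ms playout delay and the arrival times are made-up illustrative values, not
measurements.

# Minimal fixed playout (jitter) buffer sketch: packets arriving after their
# scheduled playout time are discarded, adding to the effective packet loss.
PACKET_INTERVAL_MS = 20      # one voice packet every 20 ms
PLAYOUT_DELAY_MS = 60        # assumed fixed jitter-buffer (playout) delay

# Hypothetical arrival times (ms) for packets 0..9 after network delay and jitter.
arrival_ms = [5, 28, 44, 130, 82, 105, 131, 230, 168, 185]

played, discarded = 0, 0
for seq, arrived in enumerate(arrival_ms):
    playout_time = seq * PACKET_INTERVAL_MS + PLAYOUT_DELAY_MS
    if arrived <= playout_time:
        played += 1        # in time: buffered and played at its scheduled slot
    else:
        discarded += 1     # too late: dropped, perceived as packet loss

print(f"played {played}, discarded as late {discarded}")   # played 8, discarded 2

Increasing the playout delay reduces late discards but adds to the end-to-end delay,
which is the basic trade-off a jitter buffer has to manage.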

1.3 VoIP Tools

Since the first VoIP software (named Internet Phone) was released by VocalTec Ltd
in 1995, there have been many VoIP softphones on the market. Typical VoIP tools
or softphones include Microsoft Netmeeting, Yahoo Messenger, MSN Messenger,
Windows Live Messenger, Skype, Google Talk, Linphone, Eyebeam and XLite.
Some of them are open source such as Linphone and XLite and some are propri-
etary such as Skype. In this section, we present some key VoIP tools, including
Microsoft’s Lync which is a unified VoIP solution, Skype, Google Talk and XLite.

1.3.1 Microsoft’s Lync

Microsoft’s Lync 2010 is a rich client application providing a unified solution for
Instant Messaging (IM), presence, audio and video conferencing. It can be easily

Fig. 1.3 Microsoft Lync 2010 sign in screen

incorporated with Microsoft’s Office applications such as Microsoft Outlook,
Microsoft Word and Microsoft SharePoint to provide a one-click access/setup from
your familiar Microsoft tools. Microsoft’s Lync also supports High-Definition (res-
olution: 1280 × 720; aspect ratio: 16:9) peer-to-peer video calls, together with VGA
(resolution: 640 × 480; aspect ratio: 4:3) video conferencing. It has many features
such as Group Chat, document sharing, meeting recordings (for audio, video and/or
meeting contents), flexible call forwarding and single number reach support (utiliz-
ing a single phone number for office phone, PCs and mobile phones and users can
be reached no matter where they are). The Microsoft Lync 2010 sign in screen is
depicted in Fig. 1.3.
Due to its licensing cost, Microsoft Lync is mainly for business use, integrating
with a business’s existing Microsoft applications such as MS Outlook and MS
SharePoint.

1.3.2 Skype

Skype is a peer-to-peer VoIP application software originally developed by Estonians
Ahti Heinla, Priit Kasesalu, and Jaan Tallinn, who are also co-founders of KaZaA,
a well-known peer-to-peer file sharing software in 2002. Skype was acquired by
eBay in 2005 and then owned by Microsoft in 2011. By the end of September 2011,

Fig. 1.4 Skype main home screen

Skype had over 660 million registered users, and it set a record of over 36 million
simultaneous online users in March 2012.5 Skype has been incorporated into many
devices including mobile phones such as Android based phones and iPhones, tablets
such as the iPad, and PCs. The main home screen of Skype is shown in Fig. 1.4.
Skype, with its IM, voice call, video call, and audio/video conferencing features, is
the most successful VoIP software with the largest number of registered users. It provides free
PC-to-PC voice/video call or conferencing functions and cheap call rates for PC to
traditional landline calls. Skype supports narrowband to wideband audio coding and
high quality video, using codecs such as iSAC, SVOPC, iLBC and SILK for speech/audio coding
and VP7/VP8 from On2 (now part of Google) for video coding. Skype has many ad-
vanced features such as Forward Error Correction (FEC), variable audio/video sender
bitrate, variable packet size and variable sampling rate, and it outperforms other VoIP
tools such as Google Talk and Windows Live Messenger in many experiments car-
ried out by VoIP researchers [2, 13].
Due to the proprietary nature of Skype, there have been many research efforts
trying to evaluate the performance of Skype’s voice/video call when compared
with other VoIP tools such as Google Talk, MSN messenger and Windows Live
Messenger [2, 9, 13], to analyse Skype traffic (e.g., how does it compete with

5 http://en.wikipedia.org/wiki/Skype

6 http://www.on2.com/

7 http://www.google.com

Fig. 1.5 Google Talk screenshot with video chat enabled

other TCP background traffic under limited network bandwidth and whether it is
TCP-friendly or not) [2, 7], and to understand Skype’s architecture and its un-
derlying QoE control mechanisms (e.g., how does it adapt audio/video sender bi-
trate to available network bandwidth and how does it cope with network conges-
tion) [1, 8].

1.3.3 Google Talk

Google Talk [6] is a voice and instant messaging service that can be run in Google
Chrome OS, Microsoft Windows OS, Android and Blackberry. The communication
between Google Talk servers and clients for authentication, presence and messag-
ing is via Extensible Messaging and Presence Protocol (XMPP). The popularity of
Google Talk is driven by its integration into Gmail, whereby Gmail users can send
instant messages and talk to each other. Furthermore, it works within a browser and,
therefore, the Google Talk client application does not need to be downloaded to talk
and send instant messages amongst Gmail users. Figure 1.5 shows the Google
Talk client home screenshot with video chat enabled.
Google Talk supports the following audio and video codecs: PCMA, PCMU,
G.722, GSM, iLBC, Speex, ISAC, IPCMWB, EG711U, EG711A, H.264/SVC,
H.264, H.263-1998 and Google VP8.

Fig. 1.6 X-Lite 4 screen shot

Fig. 1.7 X-Lite 4 video window

1.3.4 X-Lite

X-Lite is a proprietary freeware VoIP softphone that uses SIP for VoIP session
setup and termination. It combines voice calls, video calls and Instant Messaging
in a simple interface. X-Lite is developed by CounterPath [5]. The screenshot of
X-Lite version 4 is depicted in Fig. 1.6 together with its video window in Fig. 1.7.
Some of the basic X-Lite functionalities include call display and message waiting
indicator, speaker phone and mute icons, hold and redial, and call history for incoming,
outgoing and missed calls.

Some of the enhanced X-Lite features and functions include video call support,
instant message and presence support via the SIMPLE protocol, contact list support, au-
tomatic detection and configuration of voice and video devices, and support of echo can-
cellation, voice activity detection (VAD) and automatic gain control. X-Lite supports
the following audio and video codecs: Broadvoice-32, Broadvoice-32 FEC, DVI4,
DVI4 Wideband, G711aLaw, G711uLaw, GSM, L16 PCM Wideband, iLBC, Speex,
Speex FEC, Speex Wideband, Speex Wideband FEC, H.263 and H.263+1998.

1.4 VoIP Trend

The German Internet traffic management systems provider Ipoque [14] sampled
about three Petabytes of data in December 2007 from Australia, Germany, East-
ern and southern Europe and the Middle East. Ipoque found that while VoIP made up
only about 1 % of all Internet traffic, it was used by around 30 % of Internet users.
Skype accounted for 95 % of all VoIP traffic in December 2007.
The number of VoIP subscribers in Western Europe continued to increase, reaching
21.7 million in June 2007, up significantly from 15.6 million in January 2007.
TeleGeography [16] estimated that Eu-
ropean VoIP subscribers would have grown to 29 million by December 2007. The
report by Infonetics [11] indicated that there were about 80 million VoIP subscribers
in the world in 2007, with the high rate of adoption coming from the Asia Pacific
region.
The report by OVUM Telecom Research [6] on World Consumer VoIP of
September 2009, which predicted the VoIP trend for 2009–14, showed that in the
4th quarter of 2009 in Europe:
• VoIP voice call volumes rose. Mobile VoIP communication was the driving fac-
tor behind this growth.
• VoIP voice call prices fell. Conventional telephony voice call prices fell at an an-
nual rate of 2.6 %. Mobile VoIP call prices fell sharply.
The OVUM World Consumer VoIP forecast shows that worldwide revenues
from VoIP subscribers will continue to rise until 2014, with growth slowing
down in the subsequent years (cf., Fig. 1.8).
Figure 1.9 illustrates the projected growth in VoIP revenues per region whereby
North America is at the top followed by Europe and Asia-Pacific.
This growth in VoIP revenues is attributed to the increase in the number of VoIP subscribers.
As depicted in Fig. 1.10, the number of VoIP subscribers will keep up the upward
trend.
The trend of VoIP and Skype in 2009, as reported by Ipoque [15], showed that SIP
generated over 50 % of all VoIP traffic, while Skype was number one in the Middle
East and Eastern Europe. Skype is still a popular VoIP application due to its
diverse functions and its ease of use. It provides voice, video and file transfer, and it
also has the ability to traverse firewalls and Network Address Translation (NAT)
enabled routers. Applications such as Yahoo and Microsoft Messengers and

Fig. 1.8 VoIP revenue growth

Fig. 1.9 VoIP revenue growth per region

Google Talk, which were previously used only for messaging, now offer VoIP services
as well. They differ from Skype in that they use standards-based or modified
SIP, and therefore RTP packets are used to transport the voice and video pay-
loads. This has triggered another trend: SIP/RTP traffic initiated by Instant
Messaging (IM) applications. According to Ipoque [15], SIP/RTP traffic initiated by
IM accounts for 20–30 % of the overall VoIP traffic.
Figure 1.11 depicts the distribution of VoIP protocols. It can be seen that Skype is
by far the most popular VoIP protocol in Eastern Europe and the Middle East, with more than

Fig. 1.10 VoIP subscribers growth

Fig. 1.11 Traffic distribution of VoIP protocols

80 % share. Skype is popular in these regions, where Internet speeds are low, because it
adapts its audio codec to the varying Internet bandwidth.
The rapid growth of mobile broadband and advances in mobile device capa-
bilities have prompted an increase in VoIP services on mobile devices. According to

Fig. 1.12 VoIP subscribers growth for Mobile VoIP

the report by In-Stat [10], it is estimated that there are about 255 million active VoIP
subscribers via GPRS/3G/HSDPA. In-Stat also forecasts that mobile VoIP applications
and services will generate annual revenue of around $33 billion. Figure 1.12 de-
picts the growth of VoIP subscribers via UMTS and HSPA/LTE cellular networks.

1.5 VoIP Protocol Stack and the Scope of the Book

In order to have a better understanding of the scope of the book, here we introduce
briefly the VoIP protocol stack, which is illustrated in Fig. 1.13.
From top to bottom, the VoIP protocol stack consists of techniques/proto-
cols at the application layer, the transport layer (e.g., TCP or UDP), the network
layer (e.g., IP) and the link/physical layer. The link/physical layer concerns the
techniques and protocols of the transmission networks/media, such as Ethernet (IEEE
802.3), wireless local area networks (WLANs, e.g., IEEE 802.11) and cellular mo-
bile networks (e.g., GSM/UMTS-based 2G/3G mobile networks or LTE-based 4G
mobile networks). The network layer protocol, such as the Internet Protocol (IP), is
responsible for the transmission of IP packets from the sender to the receiver over
the Internet and is mainly concerned with where to send a packet and how to route
packets via a best path from the sender to the receiver (i.e., with routing protocols).
The transport layer protocol (e.g., TCP or UDP) is responsible
for providing a logical transport channel between the sender and receiver hosts
(that is, it builds a logical channel between two processes running on two hosts which are
linked by the Internet). Unlike the physical/link layer and/or network layer proto-
cols which are run by all network devices (such as wireless access points, network

Fig. 1.13 VoIP Protocol Stack

switches and routers) along the path from the sender to the receiver, the transport
layer protocol, together with application layer protocols, are only run in end sys-
tems. The VoIP protocol stack involves both TCP and UDP transport layer proto-
cols, with media transport protocols (such as RTP/RTCP) located on top of UDP,
whereas the signalling protocol (e.g., SIP) can be located on top of either TCP or
UDP as shown in Fig. 1.13.
This book will focus on VoIP techniques and protocols at the application layer
which include audio/video media compression (how voice and video streams are
compressed before they are sent over to the Internet, which will be presented in
Chaps. 2 and 3, respectively), media transport protocols (how voice and video
streams are packetised and transmitted over the Internet which include the Real-
time Transport Protocol (RTP) and the RTP Control Protocol (RTCP) will be dis-
cussed in Chap. 4), VoIP signalling protocols (how VoIP sessions are established,
maintained, and torn down, which are dealt with by the Session Initiation Pro-
tocol (SIP), together with the Session Description Protocol (SDP) will be covered
in Chap. 5). We focus only on the SIP signalling protocol from the IETF (Internet
Engineering Task Force, or the Internet community), mainly due to its popularity
with the Internet applications, its applicability (e.g., with 3GPP mobile applications
and with the Next Generation Networks (NGNs)) and its simplified structure. For
the alternative VoIP signalling protocol, that is, H.323 [12, 17] from ITU-T (the In-
ternational Telecommunication Union, Telecommunication Standardisation Sector,
or from the Telecommunications community), it is recommended to read relevant
books such as [4].
When a VoIP session is established, it is important to know how good or bad the
voice or video quality provided is. The user-perceived quality of VoIP, or the Quality
of Experience (QoE) of VoIP services, is key to the success of VoIP applications
for both service providers and network operators. How to assess and monitor VoIP
quality (voice and video quality) will be discussed in Chap. 6.
In Chap. 7, we will introduce the IP Multimedia Subsystem (IMS) and mobile
VoIP. IMS is a standardised Next Generation Network (NGN) architecture for de-
livering multimedia services over converged, all IP-based networks. It provides a
combined structure for delivering voice and video over fixed and mobile networks
including fixed and mobile access (e.g., ADSL/cable modem, WLAN and 3G/4G
mobile networks) with SIP as its signalling protocol. This chapter will describe the
future of VoIP services and video streaming services, over the next generation net-
works.

In the last three chapters (Chaps. 8 to 10), we provide three case stud-
ies to guide the reader in gaining hands-on experience with VoIP systems, VoIP pro-
tocol analysis, voice/video quality assessment and mobile VoIP systems. In Chaps. 8
and 9, we provide two case studies to demonstrate how to build a VoIP system
based on the open-source Asterisk tool in a lab or home environment and how to evalu-
ate and analyse voice and video quality for voice/video calls in the resulting VoIP testbed.
Readers can follow the step-by-step instructions to set up their own VoIP system,
and to analyse the VoIP trace data captured by Wireshark, together with recorded
voice samples or captured video clips, for voice/video quality evaluation (for infor-
mal subjective and further objective analysis). Many challenging questions are set in
the case studies for readers to test their knowledge and stretch their understanding
of the topic. In the last chapter (Chap. 10), we present the third case study, in which we build
up a mobile VoIP system based on the Open Source IMS Core and IMSDroid as an
IMS client. Step-by-step instructions will be provided for setting up Open Source
IMS Core in Ubuntu and IMSDroid in an Android based mobile handset. We will
also demonstrate how to make SIP audio and video calls between two Android based
mobile handsets.
The book will provide you with the required basic principles and the latest advances
in VoIP technologies, together with many practical case studies and examples for
VoIP and mobile VoIP applications, including both voice and video calls.

1.6 Summary
In this chapter, we have given an overview of VoIP (including its importance) and
explained how it works and factors that affect VoIP quality. We have introduced a
number of key VoIP tools that are used in practice. The range of applications and
trends in VoIP show that this technology is having a major impact on our lives, both
in business and at home.

References
1. Bonfiglio D, Mellia M, Meo M, Rossi D (2009) Detailed analysis of Skype traffic. IEEE
Trans Multimed 11(1):117–127
2. Boyaci O, Forte AG, Schulzrinne H (2009) Performance of video-chat applications under
congestion. In: 11th IEEE international symposium on multimedia, pp 213–218
3. Cohen A, Haramaty L (1998) Audio transceiver. US Patent: 5825771
4. Collins D (2003) Carrier grade voice over IP. McGraw-Hill Professional, New York. ISBN
0-07-140634-4
5. Counterpath (2011) X-lite 4. http://www.counterpath.com/x-lite.html. [Online; accessed 12-
June-2011]
6. Google (2012) Ovum telecoms research. http://ovum.com/section/telecoms/. [Online; ac-
cessed 30-August-2012]
7. Hosfeld T, Binzenhofer A (2008) Analysis of Skype VoIP traffic in UMTS: end-to-end QoS
and QoE measurements. Comput Netw 52(3):650–666
8. Huang TY, Huang P, Chen KT, Wang PJ (2010) Could Skype be more satisfying? A QoE-
centric study of the FEC mechanism in an Internet-scale VoIP system. IEEE Netw 24(2):42–
48

9. Kho W, Baset SA, Schulzrinne H (2008) Skype relay calls: measurements and experiments.
In: IEEE INFOCOM, pp 1–6
10. Maisto M (2012) Mobile VoIP trend. http://www.eweek.com/networking/. [Online; accessed
25-September-2012]
11. Myers D (2012) Service provider VoIP and IMS. http://www.infonetics.com/research.asp.
[Online; accessed 30-September-2012]
12. Packet-based multimedia communications systems. ITU-T H.323 v.2 (1998)
13. Sat B, Wah BW (2007) Evaluation of conversational voice communication quality of the
Skype, Google-Talk, Windows Live, and Yahoo Messenger VoIP systems. In: IEEE 9th
workshop on multimedia signal processing, pp 135–138
14. Schulze H, Mochalski K (2012) The impact of p2p file sharing, voice over IP, Skype, Joost,
instant messaging, one-click hosting and media streaming such as Youtube on the Internet.
http://www.ipoque.com/sites/default/files/mediafiles/documents/internet-study-2007.pdf.
[Online; accessed 30-August-2012]
15. Schulze H, Mochalski K (2012) Internet study 2008 and 2009. http://www.ipoque.com/
sites/default/files/mediafiles/documents/internet-study-2008-2009.pdf. [Online; accessed
30-August-2012]
16. TeleGeography (2012) Global internet geography. http://www.telegeography.com/research-
services/global-internet-geography/index.html. [Online; accessed 15-August-2012]
17. Visual telephone systems and equipment for local area networks which provide a non-
guaranteed quality of service. ITU-T H.323 v.1 (1996)
2 Speech Compression

This chapter presents an introduction to speech compression techniques, together
with a detailed description of speech/audio compression standards including nar-
rowband, wideband and fullband codecs. We will start with the fundamental con-
cepts of speech signal digitisation, speech signal characteristics such as voiced
speech and unvoiced speech and speech signal representation. We will then discuss
three key speech compression techniques, namely waveform compression, paramet-
ric compression and hybrid compression methods. This is followed by a consider-
ation of the concept of narrowband, wideband and fullband speech/audio compres-
sion. Key features of standards for narrowband, wideband and fullband codecs are
then summarised. These include ITU-T, ETSI and IETF speech/audio codecs, such
as G.726, G.728, G.729, G.723.1, G.722.1, G.719, GSM/AMR, iLBC and SILK
codecs. Many of these codecs are widely used in VoIP applications and some have
also been used in teleconferencing and telepresence applications. Understanding
the principles of speech compression and main parameters of speech codecs such
as frame size, codec delay, bitstream is important to gain a deeper understanding of
the later chapters on Media Transport, Signalling and Quality of Experience (QoE)
for VoIP applications.

2.1 Introduction

In VoIP applications, voice call is the mandatory service even when a video ses-
sion is enabled. A VoIP tool (e.g., Skype, Google Talk and xLite) normally provides
many voice codecs which can be selected or updated manually or automatically.
Typical voice codecs used in VoIP include ITU-T standards such as 64 kb/s G.711
PCM, 8 kb/s G.729 and 5.3/6.3 kb/s G.723.1; ETSI standards such as AMR; open-
source codecs such as iLBC and proprietary codecs such as Skype’s SILK codec
which has variable bit rates in the range of 6 to 40 kb/s and variable sampling fre-
quencies from narrowband to super-wideband. Some codecs can only operate at a
fixed bit rate, whereas many advanced codecs can have variable bit rates which may


be used for adaptive VoIP applications to improve voice quality or QoE. Some VoIP
tools allow the speech codec used to be changed during a VoIP session, making it
possible to select the most suitable codec for a given network condition.
Voice codecs or speech codecs are based on different speech compression tech-
niques which aim to remove redundancy from the speech signal to achieve compres-
sion and to reduce transmission and storage costs. In practice, speech compression
codecs are normally compared with the 64 kb/s PCM codec which is regarded as the
reference for all speech codecs. Speech codecs with the lowest data rates (e.g., 2.4
or 1.2 kb/s Vocoder) are used mainly in secure communications. These codecs can
achieve compression ratios of about 26.6 or 53.3 (compared to PCM) and still main-
tain intelligibility, but with speech quality that is somewhat ‘mechanical’. Most
speech codecs operate in the range of 4.8 kb/s to 16 kb/s and have good speech qual-
ity and a reasonable compression ratio. These codecs are mainly used in bandwidth
resource limited mobile/wireless applications. In general, the higher the speech bit
rate, the higher the speech quality and the greater the bandwidth and storage require-
ments. In practice, it is always a trade-off between bandwidth utilisation and speech
quality.
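As a quick check of the compression ratios quoted above, the short Python sketch
below computes the ratio of the 64 kb/s PCM reference to a few of the bit rates
mentioned in this chapter; the function and variable names are ours, purely for
illustration.

# Compression ratio relative to the 64 kb/s PCM reference codec.
PCM_RATE_KBPS = 64.0

def compression_ratio(codec_rate_kbps):
    """Return how many times smaller the codec bit rate is than 64 kb/s PCM."""
    return PCM_RATE_KBPS / codec_rate_kbps

for rate in (2.4, 1.2, 8.0, 16.0):
    print(f"{rate:5.1f} kb/s codec -> compression ratio {compression_ratio(rate):5.1f}")

# Expected output:
#   2.4 kb/s codec -> compression ratio  26.7
#   1.2 kb/s codec -> compression ratio  53.3
#   8.0 kb/s codec -> compression ratio   8.0
#  16.0 kb/s codec -> compression ratio   4.0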
In this chapter, we first briefly introduce the underpinning basics of speech compres-
sion, including speech signal digitisation, voice waveform, spectrum and spectro-
gram, and the concept of voiced and unvoiced speech. We then look at key tech-
niques in speech compression coding, which include waveform coding, parametric
coding and hybrid coding (or Analysis-by-Synthesis coding). Finally, we present
a number of key speech compression standards, from the international standardisation
body (ITU-T) and regional standardisation bodies (Europe’s ETSI and North America’s
TIA), together with some open source and proprietary codecs (such as GIPS’ iLBC,
now Google’s iLBC, and Skype’s SILK codec).

2.2 Speech Compression Basics


The purpose of speech compression is to reduce the number of bits required to repre-
sent speech signals (by reducing redundancy) in order to minimise the requirement
for transmission bandwidth (e.g., for voice transmission over mobile channels with
limited capacity) or to reduce the storage costs (e.g., for speech recording).
Before we start describing speech compression coding techniques, it is important
to understand how speech signal is represented in its digital form, that is, the process
of speech signal digitisation. We then need to understand what the key features
of speech signal are (e.g., voiced and unvoiced speech) and their characteristics.
In broad terms, speech compression techniques are mainly focused on removing
short-term correlation (in the order of 1 ms) among speech samples and long-term
correlation (in the order of 5 to 10 ms) among repeated pitch patterns. In this section,
we will start with speech signal digitisation and then discuss speech signal features
and speech representation (waveform, spectrum and spectrogram).

2.2.1 Speech Signal Digitisation


Speech signal digitisation is the process of converting speech from an analog signal into a digital signal for digital processing and transmission. The three main phases in speech signal digitisation are sampling, quantisation and coding.

Fig. 2.1 Example of voice digitisation

As shown in Fig. 2.1, sampling is the periodic measurement of an analog signal and changes a
continuous-time signal into a discrete-time signal. For a narrow-band speech sig-
nal with a bandwidth limited to 300 to 3400 Hz (normally simplified to 0–4 kHz),
the sampling rate is 8 kHz (i.e., twice the maximum signal frequency) in accordance with the sampling theorem. If the sampling rate is at least twice the highest frequency in the signal (4 kHz for narrow-band voice), the analog signal can be fully recovered from the samples [26]. If an 8 kHz sampling rate is applied, the time difference between two consecutive samples is 0.125 milliseconds (1/8000 s = 0.125 ms). Quantisation converts the signal from a continuous-amplitude signal into a discrete-amplitude signal. The coding process then converts the discrete-amplitude signal into a series of binary bits (a bitstream) for transmission and storage. For uniform quantisation, the quantisation steps are kept the same for all signal amplitudes; see, for example, Fig. 2.1. In the figure, the amplitude range is evenly divided into 6 steps, so three-bit binary codes can be used (3 bits can represent up to 8 levels). Each speech sample is approximated by its closest available quantisation amplitude and then coded into binary bits through the coding process. For example, for the 1st sample, the quantised amplitude is zero and the coded bits are 100; for the 2nd sample in the figure, the quantised amplitude is 2 and the coded bits are 010. The difference between the quantised amplitude and the actual signal amplitude is called the “quantisation error”. Clearly, the more quantisation steps there are (fine quantisation), the lower the quantisation error, but more bits are required to represent the signal and the transmission bandwidth is correspondingly greater. In practice, there is always a trade-off between the desired quantisation error and the transmission bandwidth used.
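As a rough illustration of sampling and uniform quantisation, the short Python sketch below quantises a sampled tone with a small number of uniform levels and reports the resulting quantisation error; the tone, dynamic range and level count are arbitrary illustrative choices, not values from any standard.

import numpy as np

fs = 8000                          # sampling rate (Hz) for narrow-band speech
t = np.arange(0, 0.02, 1/fs)       # 20 ms of samples (160 samples)
x = 0.8 * np.sin(2*np.pi*200*t)    # a 200 Hz tone standing in for a speech segment

levels = 8                         # number of uniform quantisation levels (3 bits)
x_min, x_max = -1.0, 1.0           # assumed input dynamic range
step = (x_max - x_min) / levels

# mid-rise uniform quantiser: index each sample, then map back to a level centre
idx = np.clip(np.floor((x - x_min) / step), 0, levels - 1)
xq = x_min + (idx + 0.5) * step

err = x - xq
print("max quantisation error: %.4f (step/2 = %.4f)" % (np.max(np.abs(err)), step/2))
print("SNR: %.1f dB" % (10*np.log10(np.sum(x**2) / np.sum(err**2))))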
Speech signals have a non-uniform Probability Density Function (PDF), with low-level speech samples occurring much more often than high-level ones, so uniform quantisation normally creates a relatively higher quantisation error (or quantisation noise) for low-level speech and hence lower speech quality. Thus, non-uniform quantisation is normally used in speech compression coding. In non-uniform quantisation, fine quantisation is applied to low-level speech signals.

Fig. 2.2 Uniform quantisation and non-uniform quantisation

As shown in Fig. 2.2, when uniform quantisation is applied, the quantisation step is kept the same (here the value of Δ) over the speech dynamic range considered. For a speech
signal in the range of 0 to Δ (input), the output after quantisation will be represented
by the quantised value of 0.5Δ with maximum quantisation error of 0.5Δ. When
non-uniform quantisation is applied, different quantisation steps will be applied in
the speech dynamic range. Because speech has a non-uniform PDF, the quantisation step is kept smaller for lower level signals. For example, for a speech signal in the range of 0 to 0.5Δ (input), the output will be represented by the quantised value of 0.25Δ, with a maximum quantisation error of 0.25Δ (lower than that of uniform quantisation for low-level signals). Similarly, for higher level speech signals with lower PDF values, the quantisation step is set much larger than in uniform quantisation (coarse quantisation). As illustrated in the figure, for a speech signal from 1.5Δ to 3Δ, the quantisation output will be 2.25Δ, with a maximum quantisation error of 0.75Δ, much higher than that of uniform quantisation (0.5Δ) and also higher than that for lower level speech (e.g., 0.25Δ for speech between 0 and 0.5Δ). As the PDF of low-level speech signals is much higher than that of high-level speech signals, the overall performance (in terms of Signal-to-Noise Ratio (SNR)) will be better than that of uniform quantisation coding. In this example, the same signal dynamic range is applied for both uniform and non-uniform quantisation (i.e., from −3Δ to +3Δ for the input signal). Non-uniform quantisation is applied in Pulse Code Modulation (PCM), the simplest and most commonly used speech codec. PCM exploits non-uniform quantisation by using a logarithmic companding method to provide fine quantisation for low-level speech and coarse quantisation for high-level speech signals.
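To illustrate the logarithmic companding idea used by PCM, the sketch below compresses samples with a continuous μ-law style curve (μ = 255) before uniform quantisation, so small amplitudes receive finer effective quantisation. This is only an illustration of the principle; the actual G.711 algorithm uses a segmented approximation, and the amplitudes and bit counts here are arbitrary choices.

import numpy as np

mu = 255.0   # mu-law parameter used in North America/Japan

def mu_law_compress(x):
    # x in [-1, 1]; compress the amplitude logarithmically
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y):
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu

def quantise_uniform(x, bits):
    levels = 2 ** bits
    step = 2.0 / levels
    idx = np.clip(np.floor((x + 1.0) / step), 0, levels - 1)
    return -1.0 + (idx + 0.5) * step

# Compare a small and a large amplitude sample, each coded with 8 bits
for amp in (0.01, 0.9):
    x = np.array([amp])
    direct = quantise_uniform(x, 8)                                  # uniform quantisation
    companded = mu_law_expand(quantise_uniform(mu_law_compress(x), 8))  # companded quantisation
    print("amp=%.2f  uniform err=%.6f  mu-law err=%.6f"
          % (amp, abs(x - direct)[0], abs(x - companded)[0]))

Running the loop shows the trade-off: the companded quantiser gives a much smaller error for the small amplitude at the cost of a larger error for the large amplitude, which matches the non-uniform PDF of speech.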
After sampling, quantisation and coding, the analog speech signal is converted
into a digitised speech signal which can be processed, transmitted or stored. Speech
compression coding is normally carried out before digital transmission or storage in
order to reduce the required transmission bandwidth or required storage space. For
the PCM codec with an 8000 Hz sampling rate, each sample is represented by 8 bits, giving a transmission bit rate of 8000 × 8 = 64,000 bit/s (64 kb/s). Speech compression coding algorithms are normally compared with 64 kb/s PCM to obtain the compression ratio. Details of speech compression coding techniques will be discussed in Sect. 2.3.

Fig. 2.3 Sample of speech waveform for the word ‘Decision’

2.2.2 Speech Waveform and Spectrum

The speech waveform is the time-domain representation of a digitised speech signal, while the speech spectrum is its representation in the frequency domain. Figure 2.3 shows the speech waveform for the word ‘Decision’. A speech waveform is normally made up of voiced and unvoiced speech segments, which is mainly linked to how speech is produced (details are discussed in Sect. 2.2.3). Voiced speech sounds (e.g., vowel sounds such as ‘a’ and ‘i’) are essentially produced by the vibration of the vocal cords and are oscillatory in nature with repeatable patterns. Figure 2.4 illustrates the waveform of a voiced speech segment, which has repetitive patterns, and its spectrum, which shows the fundamental frequency (pitch) and its harmonic frequencies. For unvoiced sounds, such as ‘s’ and ‘sh’, the signals are more noise-like and there are no repeatable patterns (see Fig. 2.5 for an example of the waveform and spectrum of an unvoiced speech segment).

Fig. 2.4 Sample of voiced speech—waveform and spectrum

Looking more closely at the spectrum of the voiced signal, it shows harmonic frequency components. For voiced speech, the pitch of a typical male voice is about 125 Hz and that of a female voice about 250 Hz (the female sample in Fig. 2.4 has a pitch of 285 Hz), whereas an unvoiced signal does not have this feature (as can be seen from Fig. 2.5, the spectrum is almost flat and similar to the spectrum of white noise). The spectra in Figs. 2.4 and 2.5 are obtained using a Hamming window with a 256-sample window length. The waveform amplitude has been normalised to the range −1 to +1, and the spectrum magnitude is given in dB. For details of the Hamming window and the role of windowing in speech frequency analysis, readers are referred to the book by Kondoz [26].
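As a rough sketch of how such spectra can be computed, the snippet below applies a 256-sample Hamming window to a segment and takes the magnitude spectrum in dB. A synthetic harmonic segment is used as a stand-in for real speech samples; the tone frequencies are illustrative only.

import numpy as np

fs = 8000
n = np.arange(256)                         # 256-sample analysis window, as in the figures
# synthetic 'voiced' segment: 285 Hz fundamental plus two harmonics
segment = (np.sin(2*np.pi*285*n/fs)
           + 0.5*np.sin(2*np.pi*570*n/fs)
           + 0.25*np.sin(2*np.pi*855*n/fs))
segment /= np.max(np.abs(segment))         # normalise the amplitude to [-1, +1]

windowed = segment * np.hamming(len(segment))
spectrum = np.fft.rfft(windowed)
mag_db = 20*np.log10(np.abs(spectrum) + 1e-12)      # magnitude in dB
freqs = np.fft.rfftfreq(len(segment), d=1/fs)

peak = freqs[np.argmax(mag_db)]
print("strongest spectral component near %.0f Hz" % peak)   # close to the 285 Hz pitch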
Figure 2.6 shows the speech waveform of the sentence “Note closely the size of the gas tank”, spoken by a female speaker, together with its spectrogram. The sentence is about 2.5 seconds long. A speech spectrogram displays the spectrum of the whole sentence, using a grey scale for the magnitude of the spectrum (the darker the colour, the higher the spectral energy). Pitch harmonic bars are clearly visible in the spectrogram for the voiced segments of speech. For this sentence, it can be seen that the proportion of voiced speech segments (with pitch bars) is higher than that of the unvoiced ones (without pitch bars).

Fig. 2.5 Sample of unvoiced speech—waveform and spectrum

2.2.3 How Is Human Speech Produced?

Speech compression, especially low bit rate speech compression, exploits the nature of the human speech production mechanism. In this section, we briefly explain how human speech is produced.
Figure 2.7 shows a conceptual diagram of the physical model of human speech production. When we speak, the air from the lungs pushes through the vocal tract and out of the mouth to produce a sound. For some sounds, for example voiced sounds such as the vowels ‘a’, ‘i’ and ‘u’, as shown in Fig. 2.4, the vocal cords vibrate (open and close) at a certain rate (the fundamental frequency or pitch frequency) and the produced speech samples show a quasi-periodic pattern. For other sounds (e.g., fricatives such as ‘s’ and ‘f’, and plosives such as ‘p’, ‘t’ and ‘k’, known as unvoiced sounds, as shown in Fig. 2.5) [28], the vocal cords do not vibrate and remain open during sound production. The waveform of an unvoiced sound is more noise-like. Changes in the shape of the vocal tract (in combination with the shape of the nose and mouth cavities and the position of the tongue) produce different sounds, and these changes are relatively slow (e.g., 10–100 ms). This forms the basis of the short-term stationarity of the speech signal exploited by all frame-based speech coding techniques, which will be discussed in the next section.

Fig. 2.6 Speech waveform and speech spectrogram

Fig. 2.7 Conceptual diagram of human speech production

2.3 Speech Compression and Coding Techniques


Speech compression aims to remove redundancy in speech representation to reduce
transmission bandwidth and storage space (and further to reduce cost). There are
in general three basic speech compression techniques, which are waveform-based,
parametric-based and hybrid coding techniques. As the name implies, waveform-based speech compression mainly removes redundancy in the speech waveform and reconstructs the speech waveform at the decoder side as closely as possible to the original speech waveform. Waveform-based speech compression techniques are simple and normally low in implementation complexity, but their compression ratios are also low. The typical bit rate range for waveform-based speech compression coding is from 64 kb/s down to 16 kb/s. At bit rates lower than 16 kb/s, the quantisation error for waveform-based speech compression coding is too high, which results in lower speech quality. Typical waveform-based speech compression codecs are PCM and ADPCM (Adaptive Differential PCM), which will be covered in Sect. 2.3.1.
Parametric-based speech coding is based on the principles of how speech is produced. It relies on the fact that the speech signal is stationary, i.e. the shape of the vocal tract is stable, over a short period of time (e.g., 20 ms). During this period of time, a speech segment can be classified as either a voiced or an unvoiced speech segment. The spectral characteristics of the vocal tract can be represented by a time-varying digital filter. For each speech segment, the vocal tract filter parameters, the voiced/unvoiced decision, the pitch period and the gain (signal energy) are obtained via speech analysis at the encoder. These parameters are then coded into a binary bitstream and sent over the transmission channel. The decoder at the receiver side reconstructs the speech (carries out speech synthesis) based on the received parameters. Compared to waveform-based codecs, parametric-based codecs are higher in implementation complexity, but can achieve a better compression ratio. The quality of parametric-based speech codecs is low, with a mechanical sound, but with reasonable intelligibility. A typical parametric codec is the Linear Prediction Coding (LPC) vocoder, which has bit rates from 1.2 to 4.8 kb/s and is normally used in secure wireless communications systems where transmission bandwidth is very limited. The details of parametric-based speech coding are discussed in Sect. 2.3.2.
As parametric-based codecs cannot achieve high speech quality, because of the simple classification of speech segments as either voiced or unvoiced and the simple representation of voiced speech by a periodic impulse train, hybrid coding techniques were proposed to combine the features of both waveform-based and parametric-based coding (hence the name hybrid coding). Hybrid coding keeps the nature of parametric coding, which includes vocal tract filter and pitch period analysis and the voiced/unvoiced decision. However, instead of using a periodic impulse train to represent the excitation signal for voiced speech segments, it uses a waveform-like excitation signal for voiced, unvoiced or transition (containing both voiced and unvoiced) speech segments. Many different techniques have been explored to represent waveform-like excitation signals, such as multi-pulse excitation, codebook excitation and vector quantisation. The most well known, so-called Code-Excited Linear Prediction (CELP), has brought huge success to hybrid speech codecs in the range of 4.8 kb/s to 16 kb/s for mobile/wireless/satellite communications, achieving toll quality (MOS over 4.0) or communications quality (MOS over 3.5). Almost all modern speech codecs (such as G.729, G.723.1, AMR, iLBC and SILK) belong to hybrid compression coding, with the majority of them based on CELP techniques. More details regarding hybrid speech coding will be presented in Sect. 2.3.3.

2.3.1 Waveform Compression Coding

Waveform-based codecs are intended to remove the correlation between speech samples to achieve speech compression. They aim to minimise the error between the reconstructed and the original speech waveforms. Typical examples are Pulse Code Modulation (PCM) and Adaptive Differential PCM (ADPCM).
PCM uses non-uniform quantisation, with finer quantisation steps for small speech signals and coarser quantisation steps for large speech signals (logarithmic compression). Statistics show that small speech signals make up a higher percentage of overall speech. Smaller quantisation steps give a lower quantisation error and thus a better Signal-to-Noise Ratio (SNR) for PCM coding. There are two PCM codecs, namely PCM μ-law, standardised for use in North America and Japan, and PCM A-law, for use in Europe and the rest of the world. G.711 was standardised by ITU-T for the PCM codecs in 1988 [14]. For both PCM A-law and μ-law, each sample is coded using 8 bits (compressed from 16-bit linear PCM data per sample), which yields a PCM transmission rate of 64 kb/s when an 8 kHz sampling rate is applied (8000 samples/s × 8 bits/sample = 64 kb/s). 64 kb/s PCM is normally used as a reference point for all other speech compression codecs.
ADPCM, proposed by Jayant in 1974 at Bell Labs [25], was developed to further compress the PCM signal by exploiting the correlation between adjacent speech samples. A block diagram of the ADPCM encoder and decoder (codec), consisting of an adaptive quantiser and an adaptive predictor, is illustrated in Fig. 2.8.
At the encoder side, ADPCM first converts the 8-bit PCM signal (A-law or μ-law) to a 16-bit linear PCM signal (this conversion is not shown in the figure). The adaptive predictor predicts or estimates the current speech sample from the previous N reconstructed speech samples \tilde{s}(n - i), as given in Eq. (2.1):

\hat{s}(n) = \sum_{i=1}^{N} a_i(n)\, \tilde{s}(n - i)    (2.1)

where a_i(n), i = 1, . . . , N, are the estimated predictor coefficients, and a typical value of N is six.
The difference signal e(n), also known as the prediction error, is calculated from the speech signal s(n) and the signal estimate \hat{s}(n), as given in Eq. (2.2). Only this difference signal (hence the name differential coding) is input to the adaptive quantiser. As the dynamic range of the prediction error e(n) is smaller than that of the PCM input signal, fewer coding bits are needed to represent each ADPCM sample.

e(n) = s(n) - \hat{s}(n)    (2.2)

Fig. 2.8 Block diagram for ADPCM codec

The quantiser output is the quantised difference signal e_q(n); the difference between e(n) and e_q(n) is the quantisation error n_q(n), as given in Eq. (2.3).

e(n) = e_q(n) + n_q(n)    (2.3)
The decoder at the receiver side uses the same prediction algorithm to reconstruct the speech sample. If channel errors are not considered, the quantised difference signal received at the decoder, e_q'(n), equals e_q(n). The difference between the reconstructed PCM signal at the decoder, \tilde{s}'(n), and the input linear PCM signal at the encoder, s(n), is then just the quantisation error n_q(n). In this case, the Signal-to-Noise Ratio (SNR) of the ADPCM system is mainly determined by the signal-to-quantisation-noise ratio, and the quality depends on the performance of the adaptive quantiser.
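To make Eqs. (2.1)–(2.3) concrete, here is a highly simplified differential-coding loop in Python. A fixed first-order predictor and a crude uniform quantiser stand in for the adaptive predictor and adaptive quantiser of real ADPCM, so this is only a sketch of the principle, not the G.726 algorithm; the coefficient and step size are arbitrary.

import numpy as np

def simple_dpcm(x, a=0.9, step=0.05):
    """Toy differential codec: predict, quantise the prediction error, reconstruct."""
    s_rec = 0.0                          # last reconstructed sample, shared by coder and decoder
    recon = np.zeros_like(x)
    for n in range(len(x)):
        s_hat = a * s_rec                # Eq. (2.1) with N = 1: prediction from the past sample
        e = x[n] - s_hat                 # Eq. (2.2): prediction error
        e_q = step * np.round(e / step)  # quantised error (e = e_q + n_q, Eq. (2.3))
        s_rec = s_hat + e_q              # decoder reconstruction, identical at both ends
        recon[n] = s_rec
    return recon

fs = 8000
t = np.arange(0, 0.02, 1/fs)
x = 0.8 * np.sin(2*np.pi*200*t)          # stand-in for a speech segment

y = simple_dpcm(x)
snr = 10*np.log10(np.sum(x**2) / np.sum((x - y)**2))
print("reconstruction SNR: %.1f dB" % snr)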
If an ADPCM sample is coded into 4 bits, the resulting ADPCM bit rate is 4 bits × 8 kHz = 32 kb/s. This means that one PCM channel (at 64 kb/s) can carry two ADPCM channels at 32 kb/s each. If an ADPCM sample is coded into 2 bits, the ADPCM bit rate is 2 bits × 8 kHz = 16 kb/s, and one PCM channel can carry four ADPCM channels at 16 kb/s each. ITU-T G.726 [15] defines ADPCM bit rates of 40, 32, 24 and 16 kb/s, which correspond to 5, 4, 3 and 2 coding bits per ADPCM sample. The higher the ADPCM bit rate, the higher the number of quantisation levels, the lower the quantisation error, and thus the better the voice quality. This is why the quality of 40 kb/s ADPCM is better than that of 32 kb/s, and the quality of 24 kb/s ADPCM is better than that of 16 kb/s.

2.3.2 Parametric Compression Coding

Waveform-based coding aims to reduce redundancy among speech samples and to reconstruct the speech as closely as possible to the original speech waveform. Due to its sample-based nature, waveform-based coding cannot achieve a high compression ratio and normally operates at bit rates ranging from 64 kb/s down to 16 kb/s.
In contrast, parametric-based compression methods are based on how speech is produced. Instead of transmitting speech waveform samples, parametric compression sends only the parameters of a speech production model to the receiver side, where the speech is reconstructed from the model. Thus, a high compression ratio can be achieved. The most typical example of parametric compression is Linear Prediction Coding (LPC), proposed by Atal in 1971 [4] at Bell Labs. It was designed to emulate the human speech production mechanism, and the compression can reach bit rates as low as 800 bit/s (a compression ratio of 80 compared to 64 kb/s PCM). It normally operates at bit rates from 4.8 down to 1.2 kb/s. LPC-based speech codecs can achieve high compression ratios; however, the voice quality is low, especially the naturalness of the speech (i.e., whether you can recognise who is talking). Speech produced by the simple LPC model sounds mechanical or robotic, but still achieves high intelligibility (i.e., the meaning of a sentence can be understood). In this section, we briefly discuss how human speech is generated and what the basic LPC model is.

Speech Generation Mathematical Model


Based on the nature of speech production, a speech generation mathematical model is shown in Fig. 2.9. Depending on whether the speech signal is voiced or unvoiced, the excitation signal x(n) is switched between a periodic pulse train (controlled by the pitch period T for voiced signals) and a random noise signal (for unvoiced speech). The excitation signal is amplified by a gain G (the energy of the signal) and then fed into the vocal tract filter, or LPC filter.
The vocal tract filter can be modelled by a linear prediction coding (LPC) filter (a time-varying digital filter) and can be approximated by an all-pole filter, as given by Eq. (2.4). The LPC filter mainly reflects the spectral envelope of the speech segment.

H(z) = \frac{S(z)}{X(z)} = \frac{G}{1 - \sum_{j=1}^{p} a_j z^{-j}}    (2.4)

Fig. 2.9 Speech generation mathematical model

where a_j, j = 1, . . . , p, are the pth-order LPC filter coefficients; p is normally ten for narrow-band speech (commonly referred to as a tenth-order LPC filter). When converted to the time domain, the generated speech signal s(n) can be obtained from a difference equation (see Eq. (2.5)). This means that the output speech signal s(n) can be predicted from the weighted sum of the past p speech output samples (s(n - j), j = 1, . . . , p), i.e. from a linear combination of previous speech outputs (hence the name Linear Prediction Coding, LPC), together with the present excitation signal x(n) and the gain G. Equation (2.5) is a general expression for an LPC-based model, which includes two key elements: the excitation part and the LPC filter. In a basic LPC model, only a periodic impulse train (for voiced speech) or white noise (for unvoiced speech) is used as the excitation signal. This simplified excitation model can achieve high compression efficiency (with bit rates normally between 800 bit/s and 2,400 bit/s), but with low perceived speech quality (due to the mechanical sound) and reasonable intelligibility. Such codecs are mainly used in secure telephony communications.

s(n) = G\, x(n) + \sum_{j=1}^{p} a_j\, s(n - j)    (2.5)
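Equation (2.5) can be read directly as a synthesis loop: each output sample is the gained excitation plus a weighted sum of the previous p outputs. The sketch below implements this with a toy second-order coefficient set and an impulse-train excitation standing in for a voiced segment; the values are illustrative, not taken from any real LPC analysis.

import numpy as np

def lpc_synthesise(excitation, a, gain):
    """Synthesise speech via Eq. (2.5): s(n) = G x(n) + sum_j a_j s(n-j)."""
    p = len(a)
    s = np.zeros(len(excitation))
    for n in range(len(excitation)):
        past = sum(a[j] * s[n - 1 - j] for j in range(p) if n - 1 - j >= 0)
        s[n] = gain * excitation[n] + past
    return s

# toy 2nd-order 'vocal tract' (a resonance around 500 Hz at fs = 8 kHz) -- illustrative only
fs = 8000
r, f0 = 0.95, 500.0
a = np.array([2*r*np.cos(2*np.pi*f0/fs), -r**2])

# voiced excitation: impulse train with a 100 Hz pitch (period of 80 samples)
x = np.zeros(160)
x[::80] = 1.0

s = lpc_synthesise(x, a, gain=1.0)
print("synthesised %d samples, peak amplitude %.2f" % (len(s), np.max(np.abs(s))))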

For a more detailed explanation of the speech signal generation model and LPC analysis, readers are referred to [26].

Linear Prediction Coding (LPC) Model


The LPC model, also known as the LPC vocoder (VOice enCODER), was proposed in the 1960s and is based on the speech generation model presented in Fig. 2.9. The idea is that, for a given segment of speech (e.g., 20 ms, which corresponds to 160 samples at an 8 kHz sampling rate), if we can detect whether it is voiced or unvoiced and estimate its LPC filter parameters, pitch period (for a voiced signal) and gain (power) via speech signal analysis, we can then simply encode and send these parameters over the channel/network and synthesise the speech from the received parameters at the decoder. For a continuous speech signal, segmented into 20 ms speech frames, this process is repeated for each frame. The basic LPC model is illustrated in Fig. 2.10.

Fig. 2.10 The LPC model
At the encoder, the key components are pitch estimation (to estimate the pitch period of the speech segment), the voicing decision (to decide whether the frame is voiced or unvoiced), gain calculation (to calculate the power of the speech segment) and LPC filter analysis (to predict the LPC filter coefficients for this segment of speech). These parameters/coefficients are quantised, coded and packetised appropriately (in the right order) before they are sent to the channel. The parameters and coded bits from the LPC encoder are listed below.
• Pitch period (T): for example, coded in 7 bits in LPC-10 (together with the voicing decision) [31].
• Voiced/unvoiced decision: to indicate whether the segment is voiced or unvoiced. For a hard decision, a single binary bit is enough.
• Gain (G) or signal power: coded in 5 bits in LPC-10.
• Vocal tract model coefficients: the LPC filter coefficients, normally 10th order, i.e. a_1, a_2, . . . , a_10, coded in 41 bits in LPC-10.
At the decoder, the packetised LPC bitstream is unpacked and sent to the relevant decoder components (e.g., LPC decoder, pitch period decoder) to retrieve the LPC coefficients, pitch period and gain. The voicing decision bit is used to control the voiced/unvoiced switch, and the pitch period controls the period of the impulse train in a voiced segment. The synthesiser then synthesises the speech according to the received parameters/coefficients.
LPC-10 [31] is the US Department of Defense (DoD) Federal Standard (FS) 1015 and is based on 10th-order LP analysis. It produces 54 coded bits (including one bit for synchronisation) for one speech frame of 180 samples. At an 8 kHz sampling rate, 180 samples per frame corresponds to 22.5 ms per frame (180/8000 s = 22.5 ms). Every 22.5 ms, 54 coded binary bits from the encoder are sent to the channel, so the encoder bit rate is 2400 bit/s or 2.4 kb/s (54 bits/22.5 ms = 2.4 kb/s). The compression ratio is 26.7 when compared with 64 kb/s PCM (64/2.4). LPC-10 was mainly used in radio communications with secure voice transmission. The quality of the voice is low in naturalness (a more mechanical sound), but with reasonable intelligibility. Some variants of LPC-10 exploit different techniques (e.g., subsampling, silence detection, variable LP coded bits) to achieve bit rates from 2400 bit/s down to 800 bit/s.
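The LPC-10 figures above can be checked with a few lines of arithmetic, adding up the per-frame bit budget quoted in the text and converting the frame length into a bit rate and a compression ratio against 64 kb/s PCM:

# LPC-10 per-frame bit budget as quoted in the text
bits = {"pitch+voicing": 7, "gain": 5, "LPC coefficients": 41, "sync": 1}
frame_bits = sum(bits.values())                  # 54 bits per frame

frame_samples = 180
fs = 8000
frame_ms = 1000 * frame_samples / fs             # 22.5 ms per frame

bit_rate = frame_bits / (frame_ms / 1000)        # bits per second
print("frame length: %.1f ms" % frame_ms)
print("bit rate: %.0f bit/s" % bit_rate)                               # 2400 bit/s
print("compression ratio vs 64 kb/s PCM: %.1f" % (64000 / bit_rate))   # about 26.7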

2.3.3 Hybrid Compression Coding—Analysis-by-Synthesis

Problems with Simple LPC Model


The major limitations of LPC-based vocoders are the use of a hard voiced/unvoiced decision to classify a speech frame as either voiced or unvoiced, and the use of a periodic impulse train to emulate the voiced speech signal and noise for unvoiced speech.
Now let us look at an example of the output from LPC analysis. Figure 2.11 shows the waveform, the LPC filter (vocal tract filter, or spectral envelope) and the residual signal after removing the short-term LPC estimate from the original speech signal. From the residual signal, we can see that the signal energy is much lower than that of the original speech but the periodic pattern is still present. This is because the LPC filter can only remove short-term correlation between samples, not the long-term correlation between pitch period patterns. This can also be seen from Fig. 2.12, where the residual signal spectrum is flatter (the formants have been removed by the LPC filter). However, the pitch frequency and its harmonic frequencies are still present and need to be removed by a pitch filter, the so-called Long-Term Prediction (LTP) filter, which removes the correlation between pitch period patterns.
From Fig. 2.11, for a voiced speech segment, we can see that the LPC residual signal is not a simple periodic pulse signal. If we can find an excitation signal that matches this residual signal as closely as possible, then, when that excitation signal is passed through the LPC filter, a near-perfect reconstruction of the speech will be produced.
In order to find the best matching excitation signal, a synthesiser (including an LPC synthesiser and a pitch synthesiser) is included at the encoder side and a closed-loop search is carried out to find the excitation signal which gives the minimum perceptual error between the original and the synthesised speech. This is the key concept of hybrid speech coding (which combines the features of both waveform and parametric coding), also known as the Analysis-by-Synthesis (AbS) method, as shown in Fig. 2.13. The LPC synthesiser predicts the short-term vocal tract filter coefficients, whereas the pitch synthesiser predicts the long-term pitch period and gain for the voiced segment. The parameters of the best matching excitation signal, together with the pitch period, gain and LPC filter coefficients, are transmitted to the receiver. The decoder synthesises the speech signal based on the optimum excitation signal. Any difference between the signal synthesised at the output of the decoder and the one estimated at the encoder is due to channel errors; if there are no channel transmission errors, the synthesised signals at the encoder and the decoder are the same.

Fig. 2.11 Speech waveform, LPC filter, and residual signal
Fig. 2.12 Spectrum for original speech and residual signal

Fig. 2.13 Analysis-by-Synthesis LPC codec

In hybrid compression coding, the most successful technique is the Code-Excited Linear Prediction (CELP) based AbS technique, which was a major breakthrough in low bit rate speech compression coding in the late 1980s. CELP-based coding normally contains a codebook with a size of 256 to 1024 entries at both the sender and the receiver. Each codebook entry contains a waveform-like excitation signal, or a multi-pulse excitation signal [5] (instead of only the periodic impulse train and noise used in parametric coding). This resolves a major problem in the coding of a transition (or “onset”) frame, for example a frame containing a transition from unvoiced to voiced, such as the phonetic sound at the beginning of the word “see” [si:] or “tea” [ti:], which is very important from a perceptual quality point of view (it affects the intelligibility of speech communications). The closed-loop search process finds the best matching excitation in the codebook, and only the index of the matched codebook entry is coded and sent to the decoder at the receiver side. At the decoder side, the matched excitation signal is retrieved from the same codebook and used to reconstruct the speech. For a codebook with 256 to 1024 entries, 8–10 bits can be used to code the codebook index. In order to achieve high coding efficiency and low implementation complexity, a large codebook is normally split into several smaller codebooks.

Fig. 2.14 Example of Code-Excited Linear Prediction (CELP)

Figure 2.14 shows an example of CELP as used in the AMR codec [1], which includes two codebooks: an adaptive codebook used to search for the pitch excitation, and a fixed codebook containing a set of fixed pulse trains with preset pulse positions and signs. The adaptive (pitch excitation) codebook contains waveform-like excitation signals. Due to the successful use of CELP techniques, the voice quality of hybrid compression coding has reached toll quality (MOS score over 4) or communications quality (MOS score over 3.5) at bit rates from 16 kb/s down to 4.8 kb/s; waveform and parametric codecs cannot achieve such speech quality in this range of bit rates. Hybrid AbS-based codecs are widely used in today’s mobile, satellite, marine and secure communications.
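A toy closed-loop (analysis-by-synthesis) codebook search can be sketched in a few lines of Python: every candidate excitation in a small random codebook is passed through the same synthesis filter the decoder would use, and the entry with the smallest error against the target frame wins, so only its index and gain would need to be transmitted. The random codebook and fixed filter coefficients below are purely illustrative; real CELP codecs use structured adaptive/fixed codebooks and a perceptually weighted error.

import numpy as np

rng = np.random.default_rng(0)
frame_len, codebook_size = 40, 256          # one 5 ms subframe at 8 kHz, 8-bit codebook
codebook = rng.standard_normal((codebook_size, frame_len))

# toy all-pole synthesis filter s(n) = x(n) + 1.5 s(n-1) - 0.7 s(n-2) -- illustrative only
def synthesise(x):
    s = np.zeros(len(x))
    for n in range(len(x)):
        s[n] = x[n]
        if n >= 1: s[n] += 1.5 * s[n-1]
        if n >= 2: s[n] -= 0.7 * s[n-2]
    return s

# target: the frame the encoder is trying to approximate (here built from a random excitation)
target = synthesise(rng.standard_normal(frame_len))

best_idx, best_gain, best_err = -1, 0.0, np.inf
for i, entry in enumerate(codebook):        # closed-loop (analysis-by-synthesis) search
    y = synthesise(entry)
    gain = float(np.dot(target, y) / np.dot(y, y))   # best gain for this entry
    err = float(np.sum((target - gain * y) ** 2))
    if err < best_err:
        best_idx, best_gain, best_err = i, gain, err

print("selected codebook index %d (8 bits) with gain %.2f" % (best_idx, best_gain))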
In general, there are two categories of hybrid compression coding, based on how the excitation signal is generated. One is based on excitation signal analysis and generation in the time domain and aims to reconstruct the speech frame as closely as possible to the speech waveform. The majority of CELP variants belong to this category, such as ACELP (Algebraic Code Excited Linear Prediction) and RELP (Residual Excited Linear Prediction). The other category is based on excitation signal analysis in the frequency domain and aims to reconstruct the speech frame as closely as possible from the speech spectrum point of view. The Multiband Excitation (MBE) model, proposed by Griffin and Lim in 1988 at MIT [10], is in this category. MBE divides the speech spectrum into several sub-bands (about 20) and allocates a binary voiced/unvoiced parameter to each frequency band. This makes the spectrum of the reconstructed speech frame closer to the spectrum of the original speech frame and produces better speech quality than traditional time-domain CELP at low bit rates, for example 2.4 to 4.8 kb/s.
Typical hybrid compression codecs include the following, from several standardisation bodies such as the International Telecommunication Union, Telecommunication Standardisation Sector (ITU-T), the European Telecommunications Standards Institute (ETSI), North America’s Telecommunications Industry Association (TIA) of the Electronic Industries Association (EIA), and the International Maritime Satellite Corporation (INMARSAT).
• LD-CELP: Low Delay CELP, used in ITU-T G.728 at 16 kb/s [16].
• CS-ACELP: Conjugate-Structure Algebraic-Code-Excited Linear Prediction,
used in ITU-T G.729 [17] at 8 kb/s.
• RPE/LTP: Regular Pulse Excitation/Long Term Prediction, used in ETSI GSM
Full-Rate (FR) at 13 kb/s [6].
• VSELP: Vector Sum Excited Linear Prediction, used in ETSI GSM Half-Rate (HR) at 5.6 kb/s [7].
• EVRC (based on RCELP): Enhanced Variable Rate Codec, specified in TIA/EIA Interim Standard IS-127 for use in CDMA systems in North America, operating at bit rates of 8.5, 4 or 0.8 kb/s (full-rate, half-rate and eighth-rate, with a 20 ms speech frame) [30].
• ACELP: Algebraic CELP, used in ETSI GSM Enhanced Full-Rate (EFR) at
12.2 kb/s [9] and ETSI AMR from 4.75 to 12.2 kb/s [8].
• ACELP/MP-MLQ: Algebraic CELP/Multi Pulse—Maximum Likelihood Quan-
tisation, used in ITU-T G.723.1 at 5.3/6.3 kb/s [18].
• IMBE: Improved Multiband Excitation Coding at 4.15 kb/s for INMARSAT-M.
• AMBE: Advanced Multiband Excitation Coding at 3.6 kb/s for INMARSAT-
AMBE.

2.3.4 Narrowband to Fullband Speech Audio Compression

In the sections above, we mainly discussed Narrowband (NB) speech compression, aimed at a speech spectrum from 0 to 4 kHz. This 0 to 4 kHz narrowband speech, extended from the speech frequency range of 300 Hz to 3400 Hz, is used not only in VoIP systems but also in traditional digital telephony in the Public Switched Telephone Network (PSTN).
In VoIP and mobile applications, there has been a trend in recent years towards Wideband (WB) speech to provide high-fidelity speech transmission quality. For WB speech, the speech spectrum is extended to 0–7 kHz, with a sampling rate of 16 kHz. Compared to 0–4 kHz narrowband speech, wideband speech has more high-frequency components and higher speech fidelity. The 0–7 kHz wideband speech frequency range is comparable to the frequency range of general audio signals (e.g., music).
There are currently three wideband speech compression methods which have
been used in different wideband speech codecs standardised by ITU-T or ETSI.
They are:
• Waveform compression based on sub-band (SB) ADPCM: such as ITU-T
G.722 [12].
• Hybrid compression based on CELP: such as AMR-WB or ITU-T G.722.2 [21].
• Transform compression coding: such as ITU-T G.722.1 [20].

Table 2.1 Summary of NB, WB, SWB and FB speech/audio compression coding

Mode                  Signal bandwidth (Hz)  Sampling rate (kHz)  Bit-rate (kb/s)  Examples
Narrowband (NB)       300–3400               8                    2.4–64           G.711, G.729, G.723.1, AMR, LPC-10
Wideband (WB)         50–7000                16                   6.6–96           G.711.1, G.722, G.722.1, G.722.2
Super-wideband (SWB)  50–14000               32                   24–48            G.722.1 (Annex C)
Fullband (FB)         20–20000               48                   32–128           G.719

It should be mentioned that G.711.1 uses both waveform compression (for the lower-band signal, based on PCM) and transform compression (for the higher-band signal, based on the Modified DCT).
Super-wideband (SWB) normally refers to speech compression coding for speech and audio frequencies from 50 to 14,000 Hz.
Fullband (FB) speech/audio compression coding considers the full human auditory bandwidth from 20 Hz to 20 kHz to provide high-quality, efficient compression for speech, music and general audio. An example is the latest ITU-T standard, G.719 [23], which is mainly used in teleconferencing and telepresence applications.
Table 2.1 summarises the basic information for Narrowband, Wideband, Super-wideband and Fullband speech/audio compression coding, including signal bandwidth, sampling rate, typical bit rate range and example standards. Details of these standard codecs are covered in the next section.

2.4 Standardised Narrowband to Fullband Speech/Audio Codecs

In this section, we discuss some key standardised narrowband (NB), wideband (WB), super-wideband (SWB) and fullband (FB) speech/audio codecs from the International Telecommunication Union, Telecommunication Standardisation Sector (ITU-T) (e.g., G.729, G.723.1, G.722.2 and G.719), from the European Telecommunications Standards Institute (ETSI) (e.g., GSM, AMR, AMR-WB) and from the Internet Engineering Task Force (IETF) (e.g., iLBC and SILK), which are commonly used in VoIP and conferencing systems.

2.4.1 ITU-T G.711 PCM and G.711.1 PCM-WB

G.711 for 64 kb/s Pulse Code Modulation (PCM) was first adopted by ITU-T in 1972 and further amended in 1988 [14]. It is the first speech compression coding standard in the ITU-T G-series for narrowband speech with a frequency range of 300–3400 Hz. Two logarithmic companding laws were defined for historical reasons, with the μ-law used in North America and Japan, and the A-law used in Europe and the rest of the world. The G.711 encoder converts a 14-bit uniform (linear) PCM code to an 8-bit A-law or μ-law PCM code per sample (non-uniform quantisation, or logarithmic companding), with fine quantisation for low-level speech signals and coarse quantisation for high-level speech signals. At the decoder side, a de-companding (expanding) process is applied to convert back to the uniform PCM signal. PCM operates at 64 kb/s and is sample-based coding, which means that the algorithmic delay of the encoder is only one sample, i.e. 0.125 ms at an 8000 Hz sampling rate.
When the PCM codec is used in VoIP applications, a 20 ms speech frame is normally formed and packetised for transmission over the network. The original G.711 PCM standard did not contain a packet loss concealment mechanism, which is necessary for codecs used in VoIP applications. G.711 Appendix I [19], added in 1999, contains a high-quality, low-complexity algorithm for packet loss concealment. G.711 with the packet loss concealment (PLC) algorithm is mandatory for VoIP applications.
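As a small worked example of the payload size implied by this packetisation, the sketch below assumes a 20 ms packetisation interval for G.711 (header overheads are ignored here; the full VoIP bandwidth calculation with RTP/UDP/IP headers is covered in Chap. 4).

fs = 8000                 # samples per second
bits_per_sample = 8       # G.711 A-law / mu-law
frame_ms = 20             # assumed packetisation interval for VoIP

samples_per_frame = fs * frame_ms // 1000           # 160 samples
payload_bytes = samples_per_frame * bits_per_sample // 8
bit_rate = fs * bits_per_sample

print("samples per 20 ms frame :", samples_per_frame)            # 160
print("RTP payload per packet  : %d bytes" % payload_bytes)      # 160 bytes
print("codec bit rate          : %d kb/s" % (bit_rate // 1000))  # 64 kb/s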
G.711.1 is the wideband extension of G.711 Pulse Code Modulation (PCM-WB), defined by ITU-T in 2008 [24]. It supports both narrowband and wideband speech coding. When applied to wideband speech coding, it can support a speech and audio input signal frequency range from 50 to 7000 Hz. The encoder input signal, sampled at 16 kHz (in the wideband coding case), is divided into 8 kHz sampled lower-band and higher-band signals, with the lower band using G.711-compatible coding and the higher band coded using the Modified Discrete Cosine Transform (MDCT), based on a 5 ms speech frame. For the lower-band and higher-band signals, there are three layers of bitstreams, as listed below.
• Layer 0: lower-band base bitstream at 64 kb/s PCM (base bitstream), 320 coded
bits for 5 ms speech frame.
• Layer 1: lower-band enhancement bitstream at 16 kb/s, 80 coded bits for 5 ms
speech frame.
• Layer 2: higher-band enhancement bitstream at 16 kb/s, 80 coded bits for 5 ms
speech frame.
The overall bit rates for G.711.1 PCM-WB can therefore be 64, 80 and 96 kb/s. With a 5 ms speech frame, the coded bits are 320, 400 and 480 bits, respectively. The algorithmic delay is 11.875 ms (5 ms speech frame, 5 ms look-ahead, and 1.875 ms for Quadrature-Mirror Filterbank (QMF) analysis/synthesis).
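These layer bit rates can be checked with simple arithmetic, since each layer contributes a fixed number of coded bits per 5 ms frame. A minimal sketch:

frame_ms = 5
layers = {"Layer 0 (lower-band base)": 320,
          "Layer 1 (lower-band enhancement)": 80,
          "Layer 2 (higher-band enhancement)": 80}   # coded bits per 5 ms frame

for name, bits in layers.items():
    print("%-35s %3d bits -> %.0f kb/s" % (name, bits, bits / frame_ms))  # bits per ms = kb/s

# the three standard layer combinations: L0, L0+L1, L0+L1+L2
combos = [["Layer 0 (lower-band base)"],
          ["Layer 0 (lower-band base)", "Layer 1 (lower-band enhancement)"],
          list(layers)]
for c in combos:
    total_bits = sum(layers[k] for k in c)
    print("%d layer(s): %d bits per frame -> %.0f kb/s" % (len(c), total_bits, total_bits / frame_ms))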

2.4.2 ITU-T G.726 ADPCM

G.726 [15], defined by ITU-T in 1990, is an ADPCM-based narrowband codec operating at bit rates of 40, 32, 24 and 16 kb/s. G.726 incorporates the earlier ADPCM standards G.721 [11] at 32 kb/s and G.723 [13] at 24 and 40 kb/s (both specified in 1988). For the 40, 32, 24 and 16 kb/s bit rates, the corresponding numbers of ADPCM bits per sample are 5, 4, 3 and 2. It operates on narrowband speech with a sampling rate of 8000 Hz. G.726 was originally proposed for Digital Circuit Multiplication Equipment (DCME) to improve transmission efficiency for long-distance speech transmission (e.g., one 64 kb/s PCM channel can carry two 32 kb/s ADPCM channels or four 16 kb/s ADPCM channels). The G.726 codec is now also used in VoIP applications.

2.4.3 ITU-T G.728 LD-CELP

G.728 [16], a narrowband codec, was standardised by ITU-T in 1992 as a 16 kb/s speech compression coding standard based on low-delay code excited linear prediction (LD-CELP). It represented a major breakthrough in the history of speech compression coding and was the first speech standard based on Code Excited Linear Prediction (CELP) using an analysis-by-synthesis approach for the codebook search. It achieves near toll quality (MOS score near 4) at 16 kb/s, similar quality to 32 kb/s ADPCM and 64 kb/s PCM. After G.728, many speech coding standards were proposed based on variants of CELP.
To achieve low delay, G.728 uses a small speech block for CELP coding. The speech block consists of only five consecutive speech samples, giving an algorithmic delay of 0.625 ms (0.125 ms × 5). It uses a codebook of 1024 vectors, with the codebook index (“code-vector”) coded into 10 bits. Only these 10 bits for the codebook index (or Vector Quantisation, VQ, index) are sent to the receiver for each block of speech (equivalent to 10 bits/0.625 ms = 16 kb/s). Backward gain adaptation and backward predictor adaptation are used to derive the excitation gain and the LPC synthesis filter coefficients at both the encoder and the decoder. These parameters are updated every four consecutive blocks (every 20 speech samples, or 2.5 ms). A 50th-order (instead of 10th-order) LPC predictor is applied. To reduce the codebook search complexity, two smaller codebooks are used instead of one 10-bit 1024-entry codebook (a 7-bit 128-entry “shape codebook” and a 3-bit 8-entry “gain codebook”).
G.728 can further reduce its transmission bit rate to 12.8 and 9.6 kb/s, as defined in Annex H. The lower bit rates are more efficient for DCME and VoIP applications.

2.4.4 ITU-T G.729 CS-ACELP

ITU-T G.729 [17], standardised in 1996, is based on the CS-ACELP (Conjugate-Structure Algebraic Code Excited Linear Prediction) algorithm. It operates at 8 kb/s with a 10 ms speech frame length plus 5 ms look-ahead (a total algorithmic delay of 15 ms). Each 10 ms speech frame is made up of two 5 ms subframes. The LPC filter coefficients are estimated from analysis of the 10 ms speech frame, whereas the excitation signal parameters (fixed and adaptive codebook indices and gains) are estimated from analysis of each 5 ms subframe. The LPC filter coefficients are transformed to Line Spectrum Pairs (LSP) for stability and efficiency of transmission. In the G.729 encoder, every 10 ms speech frame (equivalent to 80 speech samples at an 8 kHz sampling rate) is analysed to obtain the relevant parameters, which are then encoded into 80 bits and transmitted over the channel. The encoder bit rate is therefore 8 kb/s (80 bits/10 ms = 8 kb/s). G.729 supports three speech frame types: a normal speech frame (80 bits), a Silence Insertion Descriptor (SID) frame (15 bits, indicating the characteristics of the background noise when voice activity detection (VAD) is enabled) and a null frame (0 bits). G.729 was designed for cellular and network applications. It has a built-in concealment mechanism to conceal a missing speech frame using interpolation techniques based on previously received speech frames. For the detailed allocation of the 80 bits to the LPC filter coefficients and excitation codebooks, see ITU-T G.729 [17]. The G.729 standard also defines G.729A (Annex A), a reduced-complexity algorithm operating at 8 kb/s; Annex D, a low-rate extension at 6.4 kb/s; and Annex E, a high-rate extension at 11.8 kb/s.

2.4.5 ITU-T G.723.1 MP-MLQ/ACELP

ITU-T G.723.1 [18], standardised in 1996, is based on Algebraic CELP (ACELP) for the 5.3 kb/s bit rate and on Multi-Pulse Maximum Likelihood Quantisation (MP-MLQ) for the 6.3 kb/s bit rate. It was proposed for multimedia communications, such as very low bit rate visual telephony, and provides dual rates for flexibility. The higher bit rate gives better speech quality. G.723.1 uses a 30 ms speech frame (240 samples per frame at an 8 kHz sampling rate). Switching between the two bit rates can be carried out at any frame boundary (30 ms). Each 30 ms speech frame is divided into four subframes (7.5 ms each). The look-ahead of G.723.1 is 7.5 ms (one subframe length), which results in an algorithmic delay of 37.5 ms. A 10th-order LPC analysis is applied to each subframe. Both open-loop and closed-loop pitch period estimation/prediction are performed every two subframes (120 samples). Two different excitation methods are used for the high and low bit rates (MP-MLQ for the higher rate and ACELP for the lower rate).

2.4.6 ETSI GSM

GSM (Global System for Mobile Communications) is a speech codec standard specified by ETSI for the Pan-European Digital Mobile Radio System (2G mobile communications). GSM Rec 06.10 (1991) [6] defines full-rate GSM operating at 13 kb/s, based on a Regular Pulse Excitation/Long Term Prediction (RPE/LTP) linear prediction coder. The speech frame length is 20 ms (160 samples at an 8 kHz sampling rate) and the encoded block is 260 bits. Each speech frame is divided into four subframes (5 ms each). LP analysis is carried out for each speech frame (20 ms). The Regular Pulse Excitation (RPE) analysis is based on the subframe, whereas the Long Term Prediction (LTP) is based on the whole speech frame. The encoded block of 260 bits contains the parameters from the LPC filter, RPE and LTP analysis. The detailed bit allocation can be found in [6].
GSM Half Rate (HR), known as GSM 06.20, was defined by ETSI in 1999 [7]. This codec is based on VSELP (Vector-Sum Excited Linear Prediction) operating at 5.6 kb/s. It uses a vector-sum excited linear prediction codebook in which each codebook vector is formed by a linear combination of fixed basis vectors. The speech frame length is 20 ms, divided into four subframes (5 ms each). The LPC filter is 10th order. The encoded block length is 112 bits, containing the parameters for the LPC filter, the codebook indices and the gains.
Enhanced Full Rate (EFR) GSM, known as GSM 06.60, was defined by ETSI in 2000 [9]. It is based on ACELP (Algebraic CELP) and operates at 12.2 kb/s, the same as the highest AMR rate (see the next section).

2.4.7 ETSI AMR

The Adaptive Multi-Rate (AMR) narrowband speech codec, based on ACELP (Algebraic Code Excited Linear Prediction), was defined by ETSI, Special Mobile Group (SMG), in 2000 [8]. It has been chosen by 3GPP (the 3rd Generation Partnership Project) as the mandatory codec for the Universal Mobile Telecommunications System (UMTS), i.e. the 3rd Generation Mobile Networks (3G). AMR is a multi-mode codec with 8 narrowband modes at bit rates of 4.75, 5.15, 5.9, 6.7, 7.4, 7.95, 10.2 and 12.2 kb/s.
The speech frame length is 20 ms (160 speech samples at an 8000 Hz sampling rate). Mode switching can occur at the boundary of each speech frame (20 ms). For each speech frame, the speech signal is analysed to obtain the 10th-order LP coefficients, the adaptive and fixed codebook indices and the gains. The LP analysis is carried out twice for the 12.2 kb/s AMR mode and only once for all other modes. Each 20 ms speech frame is divided into four subframes (5 ms each). Pitch analysis is performed for every subframe, and the adaptive and fixed codebook parameters are transmitted for every subframe. The numbers of bits in the encoded blocks for the 8 modes from 4.75 to 12.2 kb/s are 95, 103, 118, 134, 148, 159, 204 and 244 bits, respectively. From the number of bits in an encoded block you can calculate and check the corresponding bit rate; for example, for 244 bits over a 20 ms speech frame, the bit rate is 12.2 kb/s (244 bits/20 ms = 12.2 kb/s). For the detailed bit allocation of the 8 AMR modes, the reader can consult the AMR ETSI specification [8].
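As the text suggests, each AMR mode's bit rate can be verified from its encoded block size and the 20 ms frame length. A minimal check in Python:

frame_ms = 20
block_bits = [95, 103, 118, 134, 148, 159, 204, 244]   # encoded bits per 20 ms frame

for bits in block_bits:
    rate_kbps = bits / frame_ms        # bits per ms == kb/s
    print("%3d bits per frame -> %.2f kb/s" % (bits, rate_kbps))
# prints 4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.20 and 12.20 kb/s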
The flexibility in bandwidth requirements and the tolerance to bit errors of the AMR codec are not only beneficial for wireless links, but also desirable for VoIP applications, e.g. in QoE management for mobile VoIP using automatic AMR bit rate adaptation in response to network congestion [27].

2.4.8 IETF’s iLBC

iLBC (Internet Low Bit Rate Codec), an open-source speech codec, was proposed by Andersen et al. in 2002 [3] at Global IP Sound (GIP, acquired by Google Inc in 2011)1 and was defined in IETF RFC 3951 [2] in 2004. It was aimed at Internet applications requiring robustness to packet loss. Based on block-independent CELP (frame-independent long-term prediction), it can overcome the error propagation problem that occurs in traditional CELP codecs and achieve better voice quality under packet loss conditions (when compared with other CELP codecs, such as G.729, G.723.1 and AMR) [29]. The frame length for iLBC is 20 ms (15.2 kb/s, with 304 bits per coded block) or 30 ms (13.33 kb/s, with 400 bits per coded block). Each speech frame is divided into four subframes (for a 20 ms frame with 160 samples) or six subframes (for a 30 ms frame with 240 samples), with each subframe corresponding to 5 ms of speech (40 samples). For a 30 ms frame, two LPC analyses are carried out, whereas for a 20 ms frame only one LPC analysis is required (both are 10th-order LPC analyses). A codebook search is carried out for each subframe. The key techniques used in iLBC are LPC analysis, dynamic codebook search, scalar quantisation and perceptual weighting. The dynamic codebooks are used to code the residual signal of the current speech block only, without using information from previous speech frames, thus eliminating the error propagation problem due to packet loss. This enhances packet loss concealment performance and results in better speech quality under packet loss conditions.
iLBC has been used in many VoIP tools such as Google Talk and Yahoo! Messenger.

2.4.9 Skype/IETF’s SILK

SILK, the super-wideband audio codec, is the codec recently adopted by Skype. It was designed and developed by Skype2 as a speech codec for real-time, packet-based voice communications and was submitted to the IETF in 2009 [32].
The SILK codec has four operating modes: Narrowband (NB, 8 kHz sampling rate), Mediumband (MB, 8 or 12 kHz sampling rate), Wideband (WB, 8, 12 or 16 kHz sampling rate) and Super Wideband (SWB, 8, 12, 16 or 24 kHz sampling rate). Its basic speech frame is 20 ms (160 samples at an 8 kHz sampling rate). The core SILK encoder uses similar AbS techniques, which include pitch estimation (every 5 ms) and voicing decision (every 20 ms), short-term prediction (LPC) and long-term prediction (LTP), LTP scaling control and LPC coefficients transformed to LSF coefficients, together with noise shaping analysis.
The key scalability features of the SILK codec can be categorised as follows, as shown in Fig. 2.15.

1 http://www.globalipsound.com
2 https://developer.skype.com/silk/

Fig. 2.15 Features for Skype codec

• Sampling rate: SILK supports sampling rates of 8, 12, 16 or 24 kHz, which can be changed in real time to support NB, MB, WB and SWB voice applications.
• Bit rate: SILK supports bit rates from 6 to 40 kb/s. The bit rate can be adapted automatically according to network conditions.
• Packet loss rate: the packet loss rate can be used as one of the control parameters for the SILK encoder to control its Forward Error Correction (FEC) and packet loss concealment mechanisms.
• Use of FEC: the Forward Error Correction (FEC) mechanism can be switched on or off depending on network conditions. Perceptually important packets, for example speech transition frames, can be encoded at a lower bit rate and sent again over the channel. At the receiver side, if the main speech packet is lost, its lower bit rate copy can be used to recover the lost packet and to improve overall speech quality. However, FEC increases bandwidth usage, as extra information needs to be sent through the network.
• Complexity: three complexity settings are provided in SILK: high, medium and low. The appropriate complexity (CPU load) can be chosen according to the application.
Other features, such as a variable packet size (e.g., one packet can contain 1, 2 or up to 5 speech frames) and DTX (Discontinuous Transmission) to stop transmitting packets during silence periods, are common features which can also be found in other speech codecs.

2.4.10 ITU-T G.722 ADPCM-WB

G.722 [12], defined by ITU-T in 1988, is a compression coding standard for 7 kHz audio at a 16 kHz sampling rate. It is based on sub-band adaptive differential pulse code modulation (SB-ADPCM) with bit rates of 64, 56 or 48 kb/s (depending on the operation mode). When the encoder bit rate is 56 or 48 kb/s, an auxiliary data channel of 8 or 16 kb/s can be added during transmission to form a 64 kb/s channel.
At the SB-ADPCM encoder, the input audio signal (0 to 8 kHz) at a 16 kHz sampling rate is split into two sub-band signals, each at an 8 kHz sampling rate. The lower sub-band covers the signal from 0 to 4 kHz (the same frequency range as narrowband speech), and the higher sub-band the signal from 4 to 8 kHz. Each sub-band signal is encoded using ADPCM, with a structure similar to that illustrated in Fig. 2.8, including an adaptive quantiser and an adaptive predictor. The lower sub-band ADPCM applies an adaptive 60-level non-uniform quantisation, which requires 6 coding bits for each ADPCM codeword and results in a 48 kb/s bit rate. The higher sub-band ADPCM applies a 4-level non-uniform quantisation using 2 coding bits and gives a 16 kb/s transmission bit rate. Overall, 64 kb/s is achieved for the SB-ADPCM coding. In the 56 or 48 kb/s operation modes, 30-level or 15-level non-uniform quantisation is used instead of 60-level quantisation, which results in 5 or 4 coding bits for each ADPCM codeword in the lower sub-band. The 4-level quantisation for the higher sub-band remains the same.
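The three G.722 mode bit rates follow directly from the per-sample bit allocation of the two sub-bands, each sampled at 8 kHz after the QMF split. A minimal sketch of the arithmetic:

subband_fs = 8000   # each sub-band is sampled at 8 kHz after the QMF split

# (lower-band bits, higher-band bits) per sample for the three G.722 modes
modes = {"64 kb/s mode": (6, 2), "56 kb/s mode": (5, 2), "48 kb/s mode": (4, 2)}

for name, (low_bits, high_bits) in modes.items():
    low_rate = subband_fs * low_bits       # lower sub-band bit rate
    high_rate = subband_fs * high_bits     # higher sub-band bit rate
    print("%s: %d + %d = %d kb/s"
          % (name, low_rate // 1000, high_rate // 1000, (low_rate + high_rate) // 1000))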
Due to the nature of ADPCM sample-based coding, G.722 ADPCM-WB is suit-
able for both wideband speech and music coding.

2.4.11 ITU-T G.722.1 Transform Coding

G.722.1 [20], approved by ITU-T in 1999, is a 7 kHz audio coding standard at 24 and 32 kb/s for hands-free applications, for example conferencing systems. It can be used for both speech and music. The encoder input signal is sampled at a 16 kHz sampling rate. The coding algorithm is based on transform coding, known as the Modulated Lapped Transform (MLT). The audio coding frame is 20 ms (320 samples at a 16 kHz sampling rate), with 20 ms look-ahead, resulting in a coding algorithmic delay of 40 ms. Each 20 ms audio frame is transformed independently into 320 MLT coefficients and then coded into 480 or 640 bits for the bit rates of 24 and 32 kb/s, respectively. This independent coding of the MLT coefficients for each frame gives better resilience to frame loss, as no error propagation exists in the coding algorithm. This is why G.722.1 is suitable for use in conferencing systems with low frame loss. The bit rate of this codec can be changed at the boundary of any 20 ms frame.
The latest version of G.722.1 [22] (2005) defines both the 7 kHz audio coding mode (in the main body) and a 14 kHz coding mode (in Annex C). The 14 kHz audio coding mode further expands the audio frequency range from 7 kHz to 14 kHz, with the sampling rate doubled from 16 to 32 kHz and the number of samples per audio frame doubled from 320 to 640. The bit rates supported by Annex C are 24, 32 and 48 kb/s. The speech produced by the 14 kHz coding algorithm is normally referred to as “High Definition Voice” or “HD voice”. This codec has been used in video conference phones and video streaming systems by Polycom.3

3 http://www.polycom.com

2.4.12 ETSI AMR-WB and ITU-T G.722.2

Adaptive Multi-Rate Wideband (AMR-WB) has been defined both by 3GPP [1] in Technical Specification TS 26.190 and by ITU-T as G.722.2 [21]. It is intended for wideband applications (7 kHz bandwidth speech signals) with a 16 kHz sampling rate. It operates over a wide range of bit rates from 6.6 to 23.85 kb/s (6.60, 8.85, 12.65, 14.25, 15.85, 18.25, 19.85, 23.05 or 23.85 kb/s), with bit rate changes allowed at any 20 ms frame boundary. Like AMR, AMR-WB is based on the ACELP coding technique, but uses a 16th-order linear prediction (LP) filter (or short-term prediction filter) instead of the 10th-order LP filter used in narrowband AMR. AMR-WB can provide high-quality voice and is suitable for applications such as combined speech and music and multi-party conferences.

2.4.13 ITU-T G.719 Fullband Audio Coding

G.719, approved by ITU-T in 2008, is the latest ITU-T standard for Fullband (FB)
audio coding [23] with bit rates ranging from 32 to 128 kb/s and audio frequen-
cies up to 20 kHz. It is a joint effort from Polycom and Ericsson,4 and is aimed at
high quality speech, music and general audio transmission, suitable for conver-
sational applications such as teleconferencing and telepresence. The 20 Hz–20 kHz
frequency range covers the full human auditory bandwidth, i.e., all frequencies the
human ear can hear. The sample rate at the input of the encoder and the output
of the decoder is 48 kHz. The frame size is 20 ms, with 20 ms look-ahead, resulting
in an algorithmic delay of 40 ms. The compression technique is based on Transform
Coding. Features such as adaptive time-resolution, adaptive bit allocation and
lattice vector quantisation make it flexible and efficient for different audio input
signal characteristics and allow it to provide a variable bit rate
from 32 to 128 kb/s. The encoder detects each 20 ms input signal frame and classi-
fies it as either a stationary frame (such as speech) or a non-stationary frame (such
as music) and applies different transform coding techniques accordingly. For a sta-
tionary frame, the Modified Discrete Cosine Transform (MDCT) is applied, whereas
for a non-stationary frame, a higher temporal resolution transform (in the range of
5 ms) is used. The spectral coefficients after transform coding are grouped into dif-
ferent bands, then quantised using lattice-vector quantisation and coded based on
different bit allocation strategies to achieve different transmission bit rates from 32
to 128 kb/s. G.719 can be applied in high-end video conferencing and telepresence
applications to provide high definition (HD) voice to accompany an HD video
stream.

4 http://www.ericsson.com

2.4.14 Summary of Narrowband to Fullband Speech Codecs

In the previous sections, we have discussed key narrowband to fullband speech com-
pression codecs standardised by ITU-T, ETSI and IETF. We now summarise them
in Table 2.2, which includes each codec's basic information: the standardisation
body involved, the year it was standardised, the codec type, whether it is Narrowband
(NB), Wideband (WB), Super-wideband (SWB) or Fullband (FB), the bit rate (kb/s),
the length of the speech frame (ms), the bits per sample/frame (coded bits per sample
or per frame), the look-ahead time (ms), and the coding algorithmic delay (ms).
From this table, you should be able to see the historic development of speech
compression coding standards (from 64 kb/s, 32 kb/s, 16 kb/s and 8 kb/s down to
6.4/5.3 kb/s) to achieve high compression efficiency; the development of mobile
codecs from GSM to AMR for 2G and 3G applications; the evolution from single-rate,
dual-rate and 8-mode codecs to variable-rate codecs for high application flexibility;
and the trend from narrowband (NB) to wideband (WB) codecs for high speech
quality (even High Definition voice). This development has made speech compression
codecs more efficient and more flexible for many different applications, including
VoIP. In the table, the columns on coded bits per sample/frame and speech frame
length will help you to understand payload size and to calculate VoIP bandwidth,
which will be covered in Chap. 4 on the RTP transport protocol. The columns on
look-ahead time and algorithmic delay will help you to understand codec delay and
VoIP end-to-end delay, a key QoS metric, which will be discussed in detail in Chap. 6
on VoIP QoE.
It has to be mentioned that many VoIP phones (hardphones or softphones) incorporate
many different NB and even WB codecs. How to negotiate which codec to use at
each VoIP terminal, and how to change the codec/mode/bit rate on the fly during a
VoIP session, will be discussed in Chap. 5 on SIP/SDP signalling.

2.5 Illustrative Worked Examples

2.5.1 Question 1

Determine the input and output data rates (in kb/s) and hence the compression ratio
for a G.711 codec. Assume that the input speech signal is first sampled at 8 kHz and
that each sample is then converted to 14-bit linear code before being compressed
into 8-bit non-linear PCM by the G.711 codec.

SOLUTION: The input speech signal is sampled at 8 kHz, which means there are
8000 samples per second. Each sample is coded using 14 bits. Thus the input data
rate is:

8000 × 14 = 112,000 (bit/s) = 112 (kb/s)



Table 2.2 Summary of NB, WB, SWB and FB speech codecs

Codec Standard Type NB or Bit rate Speech Bits per Look- Algor.
Body/Year WB or (kb/s) frame sample/ ahead delay
FB (ms) frame (ms) (ms)
G.711 ITU/1972 PCM NB 64 0.125 8 0 0.125
G.726 ITU/1990 ADPCM NB 40 0.125 5 0 0.125
32 4
24 3
16 2
G.728 ITU/1992 LD-CELP NB 16 0.625 10 0 0.625
G.729 ITU/1996 CS-ACELP NB 8 10 80 5 15
G.723.1 ITU/1996 ACELP NB 5.3 30 159 7.5 37.5
MP-MLQ NB 6.3 189
GSM ETSI/1991 (FR) RPE-LTP NB 13 20 260 0 20
ETSI/1999 (HR) VSELP NB 5.6 112 0 20
ETSI/2000 (EFR) ACELP NB 12.2 244 0 20
AMR ETSI/2000 ACELP NB 4.75 20 95 5 25
5.15 103
5.9 118
6.7 134
7.4 148
7.95 159
10.2 204
12.2 244 0 20
iLBC IETF/2004 CELP NB 15.2 20 304 0 20
13.33 30 400 30
G.711.1 ITU/2008 PCM-WB NB/WB 64 5 320 5 11.875
(MDCT) 80 400
96 480
G.722 ITU/1988 SB-ADPCM WB 64 0.125 8 0 0.125
56 7
48 6
G.722.1 ITU/1999 Transform WB 24 20 480 20 40
Coding 32 640
ITU/2005 SWB 24/32/48 480–
960
G.719 ITU/2008 Transform FB 32–128 20 640– 20 40
Coding 2560
AMR-WB ETSI/ITU ACELP WB 6.6– 20 132– 0 20
(G.722.2) /2003 23.85 477
SILK IETF/2009 CELP WB 6–40 20 120– 0 20
800

For the output data, each sample is coded using 8 bits, thus the output data rate is:

8000 × 8 = 64,000 (bit/s) = 64 (kb/s)

The compression ratio for a G.711 codec is:

112/64 = 1.75
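The same arithmetic can be expressed as a short Python sketch (for illustration only; the variable names are ours):

# G.711 input/output rates and compression ratio (worked example 1).
fs = 8000                  # samples per second
bits_in, bits_out = 14, 8  # linear code vs non-linear PCM bits per sample

rate_in = fs * bits_in / 1000.0    # 112 kb/s
rate_out = fs * bits_out / 1000.0  # 64 kb/s
print(rate_in, rate_out, rate_in / rate_out)   # 112.0 64.0 1.75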

2.5.2 Question 2

The G.726 is the ITU-T standard codec based on ADPCM. Assume the codec’s
input speech signal is 16-bit linear PCM and the sampling rate is 8 kHz. The output
of the G.726 ADPCM codec can operate at four possible data rates: 40 kb/s, 32 kb/s,
24 kb/s and 16 kb/s. Explain how these rates are obtained and what the compression
ratios are when compared with 64 kb/s PCM.

SOLUTION: ADPCM codec uses speech signal waveform correlation to com-


press speech. For the ADPCM encoder, only the difference signal between the input
PCM linear signal and the predicted signal is quantised and coded. The dynamic
range of the difference signal is much smaller than that of the input PCM speech
signal, thus fewer quantisation levels and coding bits are needed for the ADPCM cod-
ing.
For 40 kb/s ADPCM, let’s assume the number of bits needed to code each quan-
tised difference signal is x, then we have:

40 kb/s = 8000 (samples/s) × x (bits/sample)


x = 40 × 1000/8000 = 5 (bits)

Thus, using 5 bits to code each quantised difference signal will create an ADPCM
bit stream operating at 40 kb/s.
Similarly, for 32, 24 and 16 kb/s, the required bits for each quantised difference
signal is 4 bits, 3 bits and 2 bits, respectively. The lower the coding bits, the higher
the quantisation error, thus, the lower the speech quality.
The compression ratio for 40 kb/s ADPCM compared with 64 kb/s PCM is
64/40 = 1.6.
For 32, 24 and 16 kb/s ADPCM, the compression ratios are 2, 2.67 and 4, respectively.
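A minimal Python sketch of the same calculation (illustrative only; names are ours):

# G.726 ADPCM: bits per coded difference sample and compression ratio vs 64 kb/s PCM.
fs = 8000  # samples per second
for rate_kbps in (40, 32, 24, 16):
    bits_per_sample = rate_kbps * 1000 // fs      # 5, 4, 3, 2 bits
    ratio = 64.0 / rate_kbps                      # 1.6, 2, 2.67, 4
    print(rate_kbps, "kb/s:", bits_per_sample, "bits/sample, ratio", round(ratio, 2))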

2.5.3 Question 3

For the G.723.1 codec, it is known that the transmission bit rate can be either
5.3 or 6.3 kb/s. What is the frame size of the G.723.1 codec? How many speech
samples are there within one speech frame? Determine the number of parameter
bits coded for the G.723.1 encoding.

SOLUTION: For the G.723.1 codec, the frame size is 30 ms. As G.723.1 is a
narrowband codec, the sampling rate is 8 kHz. The number of speech samples in a
speech frame is:

30 (ms) × 8000 (samples/s) = 240 (samples)

So, there are 240 speech samples within one speech frame.
For 5.3 kb/s G.723.1, the number of parameter bits used is:

30 (ms) × 5.3 (kb/s) = 159 (bits)

For 6.3 kb/s G.723.1, the number of parameter bits used is:

30 (ms) × 6.3 (kb/s) = 189 (bits)
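The frame arithmetic generalises to any frame-based codec; the short Python sketch below (illustrative only, names ours) reproduces the figures for G.723.1:

# G.723.1: samples per frame and coded bits per frame.
fs = 8000       # samples per second (narrowband)
frame_ms = 30
samples_per_frame = fs * frame_ms // 1000          # 240 samples
for rate_kbps in (5.3, 6.3):
    bits_per_frame = round(rate_kbps * frame_ms)   # 159 and 189 bits
    print(rate_kbps, "kb/s ->", samples_per_frame, "samples,", bits_per_frame, "bits/frame")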

2.6 Summary
In this chapter, we discussed speech/audio compression techniques and summarised
narrowband, wideband and fullband speech/audio compression standards from
ITU-T, ETSI and IETF. We focused mainly on narrowband speech compression,
but covered some wideband and the latest fullband speech/audio compression stan-
dards. We started the chapter with some fundamental concepts of speech, includ-
ing speech signal digitisation (sampling, quantisation and coding), speech signal
characteristics for voiced and unvoiced speech, and speech signal representation, in-
cluding the speech waveform and speech spectrum. We then presented three key speech
compression techniques which are waveform compression, parametric compression
and hybrid compression. For waveform compression, we mainly explained ADPCM
which is widely used for both narrowband and wideband speech/audio compres-
sion. For parametric compression, we started from the speech production model and
then explained the concept of parametric compression techniques, such as LPC-10.
For hybrid compression, we started from the problems with waveform and para-
metric compression techniques, the need to develop high speech quality and high
compression ratio speech codecs, and then discussed the revolutionary Analysis-
by-Synthesis (AbS) and CELP (Code Excited Linear Prediction) approach. We also
listed the major CELP variants used in mobile, satellite and secure communications
systems.
In this chapter, we summarised major speech/audio compression standards
for narrowband, wideband and fullband speech/audio compression coding from
ITU-T, ETSI and IETF. We covered narrowband codecs including G.711, G.726,
G.728, G.729, G.723.1, GSM, AMR and iLBC; wideband codecs including G.722,
G.722.1, G.722.2/AMR-WB; and fullband codec (i.e., G.719). We explained the
historic development of these codecs and the trend from narrowband, wideband
to fullband speech/audio compression to provide high fidelity or “High Definition
Voice” quality. Their applications cover VoIP, video call, video conferencing and
telepresence.

This chapter, together with the next chapter on video compression, forms the ba-
sis for other chapters in the book. We illustrated concepts such as the speech codec
type, speech frame size, sampling rate, bit rate and coded bits for each speech frame.
This will help you to understand the payload size and to calculate VoIP bandwidth
which will be covered in Chap. 4 on the RTP transport protocol. The codec com-
pression and algorithmic delay also affect overall VoIP quality which will be further
discussed in Chap. 6 on VoIP QoE. How to negotiate and decide which codec to be
used for a VoIP session and how to change the mode or codec type during a session
will be discussed in Chap. 5 on the SIP/SDP signalling.

2.7 Problems

1. Describe the purpose of non-uniform quantisation.


2. What are the main differences between vocoder and hybrid coding?
3. What is the normal bit rate range for waveform speech codecs, vocoder and
hybrid speech codecs?
4. From human speech production mechanism, explain the difference between
‘unvoiced’ speech and ‘voiced’ speech.
5. What is the LPC filter order used in modern codecs such as G.729, G.723.1
and AMR?
6. Based on the human speech production mechanism, illustrate and explain
briefly the LPC model. What are the main reasons for the LPC model achieving
a low bit rate but with low speech quality (especially in terms of fidelity and
naturalness of the speech)? In which application areas is the LPC-based vocoder
still used today?
7. What is the basis for speech compression for hybrid and parametric codings
using 10 to 30 ms speech frames?
8. Describe the bit rate or bit rate ranges used in the following codecs, G.711,
G.726, G.729.1, G.723.1 and AMR.
9. In an ADPCM system, it is known that a 62-level non-linear quantiser is used.
How many bits are required to code each ADPCM codeword (i.e. prediction
error signal)? What is the bit rate of this ADPCM system?
10. Explain the reasons for CELP-based codecs to achieve better speech quality
when compared with LPC vocoder.
11. Determine the compression ratio for LPC-10 at 2.4 kb/s when compared with
G.711 PCM. Determine the number of parameters bits coded in LPC-10 for
one speech frame with 180 speech samples.
12. Which ITU-T speech coding standard is the first ITU-T standard based on
CELP technique? What is the size of codebook used in this standard? How
many bits are required to transmit the codebook index? How do you calculate
its bit rate?
13. List the following codecs in order of bit rate, from high to low:
LPC-10, G.723.1 ACELP, G.711 PCM, G.728 LD-CELP, FR-GSM,
G.729 CS-ACELP.

14. For G.722 ADPCM-WB, what is the sampling rate for signal at the input of
the encoder? What is the sampling rate for the input at each sub-band ADPCM
block?
15. Describe the speech/audio frequency range and sampling rate for narrowband,
wideband, super-wideband and fullband speech/audio compression coding.
16. Describe the differences between G.711 and G.711.1.

References
1. 3GPP (2011) Adaptive Multi-Rate—Wideband (AMR-WB) speech codec, transcoding
functions (Release 10). 3GPP TS 26.190 V10.0.0
2. Andersen S, Duric A, et al (2004) Internet Low Bit rate Codec (iLBC). IETF RFC 3951
3. Andersen SV, Kleijn WB, Hagen R, Linden J, Murthi MN, Skoglund J (2002) iLBC—
a linear predictive coder with robustness to packet losses. In: Proceedings of IEEE 2002
workshop on speech coding, Tsukuba Ibaraki, Japan, pp 23–25
4. Atal BS, Hanauer SL (1971) Speech analysis and synthesis by linear prediction. J Acoust
Soc Am 50:637–655
5. Atal BS, Remde JR (1982) A new model of LPC excitation for producing natural-sounding
speech at low bit rates. In: Proc IEEE int conf acoust speech, signal processing, pp 614–617
6. ETSI (1991) GSM full rate speech transcoding. GSM Rec 06.10
7. ETSI (1999) Digital cellular telecommunications system (Phase 2+); half rate speech; half
rate speech transcoding. ETSI-EN-300-969 V6.0.1
8. ETSI (2000) Digital cellular telecommunications system (Phase 2+); Adaptive Multi-Rate
(AMR) speech transcoding. ETSI-EN-301-704 V7.2.1
9. ETSI (2000) digital cellular telecommunications system (phase 2+); Enhanced Full Rate
(EFR) speech transcoding. ETSI-EN-300-726 V8.0.1
10. Griffin DW, Lim JS (1988) Multiband excitation vocoder. IEEE Trans Acoust Speech Signal
Process 36:1223–1235
11. ITU-T (1988) 32 kbit/s adaptive differential pulse code modulation (ADPCM). ITU-T
G.721
12. ITU-T (1988) 7 kHz audio-coding within 64 kbit/s. ITU-T Recommendation G.722
13. ITU-T (1988) Extensions of Recommendation G.721 adaptive differential pulse code mod-
ulation to 24 and 40 kbit/s for digital circuit multiplication equipment application. ITU-T
G.723
14. ITU-T (1988) Pulse code modulation (PCM) of voice frequencies. ITU-T G.711
15. ITU-T (1990) 40, 32, 24, 16 kbit/s Adaptive Differential Pulse Code Modulation (ADPCM).
ITU-T G.726
16. ITU-T (1992) Coding of speech at 16 kbit/s using low-delay code excited linear prediction.
ITU-T G.728
17. ITU-T (1996) Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited
linear prediction (CS-ACELP). ITU-T G.729
18. ITU-T (1996) Dual rate speech coder for multimedia communication transmitting at 5.3 and
6.3 kbit/s. ITU-T Recommendation G.723.1
19. ITU-T (1999) G.711: a high quality low-complexity algorithm for packet loss concealment
with G.711. ITU-T G.711 Appendix I
20. ITU-T (1999) Coding at 24 and 32 kbit/s for hands-free operation in systems with low frame
loss. ITU-T Recommendation G.722.1
21. ITU-T (2003) Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate
Wideband (AMR-WB). ITU-T Recommendation G.722.2
22. ITU-T (2005) Low-complexity coding at 24 and 32 kbit/s for hands-free operation in sys-
tems with low frame loss. ITU-T Recommendation G.722.1

23. ITU-T (2008) Low-complexity, full-band audio coding for high-quality, conversational ap-
plications. ITU-T Recommendation G.719. http://www.itu.int/rec/T-REC-G.719-200806-I
24. ITU-T (2008) Wideband embedded extension for G.711 pulse code modulation. ITU-T
G.711.1
25. Jayant NS (1974) Digital coding of speech waveforms: PCM, DPCM and DM quantizers.
Proc IEEE 62:611–632
26. Kondoz AM (2004) Digital speech: coding for low bit rate communication systems, 2nd ed.
Wiley, New York. ISBN:0-470-87008-7
27. Mkwawa IH, Jammeh E, Sun L, Ifeachor E (2010) Feedback-free early VoIP quality adapta-
tion scheme in next generation networks. In: Proceedings of IEEE Globecom 2010, Miami,
Florida
28. Schroeder MR (1966) Vocoders: analysis and synthesis of speech. Proc IEEE 54:720–734
29. Sun L, Ifeachor E (2006) Voice quality prediction models and their applications in VoIP
networks. IEEE Trans Multimed 8:809–820
30. TIA/EIA (1997) Enhanced Variable Rate Codec (EVRC). TIA-EIA-IS-127. http://www.3gpp2.org/public_html/specs/C.S0014-0_v1.0_revised.pdf
31. Tremain TE (1982) The government standard linear predictive coding algorithm: LPC-10.
Speech Technol Mag 40–49
32. Vos K, Jensen S, et al (2009) SILK speech codec. IETF RFC draft-vos-silk-00
3 Video Compression

Compression in VoIP is the technical term which refers to the reduction of the size
and bandwidth requirement of voice and video data. In VoIP, ensuring acceptable
voice and video quality is critical for acceptance and success. However, quality is
critically dependent on the compression method and on the sensitivity of the com-
pressed bitstream to transmission impairments. An understanding of standard voice
and video compression techniques, encoders and decoders (codecs) is necessary in
order to design robust VoIP applications that ensure reliable and acceptable qual-
ity of delivery. This understanding of the techniques and issues with compression
is important to ensure that appropriate codecs are selected and configured properly.
This chapter firstly introduces the need for media compression and then explains
some basic concepts for video compression, such as video signal representation,
resolution, frame rate, lossless and lossy video compression. This is followed by
video compression techniques including predictive coding, quantisation, transform
coding and interframe coding. The chapter finally describes the standards in video
compression, e.g. H.120, H.261, MPEG1&2, H.263, MPEG4, H.264 and the latest
HEVC (High Efficiency Video Coding) standard.

3.1 Introduction to Video Compression

In the recent past, personal and business activities have been heavily impacted by
the Internet and mobile communications which have gradually become pervasive.
New consumer devices such as mobile phones have increasingly been integrated
with mobile communications capabilities that enable them to access VoIP services.
Advances in broadband and wireless network technologies such as third generation
(3G), the emerging fourth generation (4G) systems, and the IEEE 802.11 WLAN
standards have increased channel bandwidth. This has resulted in the proliferation
of networked multimedia communications services such as VoIP.
Although the success of VoIP services depends on viable business models and
device capability, it also depends to a large extent on the perceived quality of service


(PQoS) and the quality of experience (QoE). The Internet is notorious for its limited
and time-varying bandwidth, and wireless channels are likewise notorious for their
limited and fluctuating bandwidth and block error rate (BLER). These
limitations make the reliable delivery of VoIP services over the Internet and wireless
channels a challenge.
A video sequence is generally captured and represented in its basic form as a
sequence of images which are commonly referred to as frames. The frames are dis-
played at a constant rate, called frame rate (frames/second). The most commonly
used frame rates are 30 and 25 frames per second. Analogue video signals are
produced by scanning a 2-D scene which is then converted to a 1-D signal. The
analogue video signal is digitised by the process of filtering, sampling and quanti-
sation [4].
The filtering process reduces the aliasing effect that would otherwise be in-
troduced as a result of sampling. The choice of sampling frequency influences
the quality of the video and its bandwidth requirement. A sampling frequency of
13.5 MHz has been recommended by International Radio Consultative Committee
CCIR [3] for broadcast quality video with the quantised samples being coded using
8 bits and 256 levels. Digitising analogue video signals results in huge amount of
data which require a high transmission bandwidth because of the process of dig-
itally sampling each video frame in light intensity (luminance) and color compo-
nents (chrominance) and sending it at 25/30 frames per second. As an example,
a typical 625-line encoded video with 720 pixels by 576 lines, 25 frames/second
with 24 bits representing luminance and colour components requires 249 Mbit/s
(720 × 576 × 25 × 24 = 248832000). Clearly, the transmission of raw video in VoIP
services is unrealistic and very expensive given its huge bandwidth requirements
and the limited channel bandwidth of the Internet and wireless channels.
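The raw bit-rate figure quoted above is easy to reproduce; the short Python sketch below is for illustration only (the variable names are ours):

# Raw (uncompressed) bit rate of a 625-line video: 720 x 576, 25 frames/s, 24 bits/pixel.
width, height, fps, bits_per_pixel = 720, 576, 25, 24
bit_rate = width * height * fps * bits_per_pixel    # 248 832 000 bit/s
print(bit_rate, "bit/s =", round(bit_rate / 1e6), "Mbit/s")   # about 249 Mbit/s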
It is necessary to compress the video to reduce its bandwidth requirements and
make its transmission over current transmission media realistic and cheap. Video
compression involves the removal of redundancies inherent in the video signal.
The removal of redundancies is the job of a video CODEC (enCOder DECoder).
There are correlations between two successive frames in a video sequence (tempo-
ral redundancy). Subtracting the previous frame from the current frame and only
sending the resulting difference (residual frame) leads to a significant reduction in
transmission bandwidth requirement. The process of only sending the difference
of successive frames instead of the current frame is called inter-frame or temporal
redundancy removal. Compression can be increased by the further removal of cor-
relation between adjacent pixels within a frame (spatial redundancy), color spectral
redundancy and redundancy due to the human visual system.
Coding techniques such as those of the Moving Picture Experts Group (MPEG) use block-
based motion compensation techniques, together with predictive and interpolative
coding. Motion estimation and compensation removes temporal redundancies in the
video sequence [11]. Spatial redundancies are removed by converting the sequences
into a transform domain such as the Discrete Cosine Transform (DCT) [1, 9, 13] and
then quantising, followed by Variable Length Coding (VLC) [5, 16] of the trans-
formed coefficients to reduce the bit rate. The efficiency of a compression scheme

is determined by the amount of redundancy remaining in the video sequence after


compression and on the coding techniques used [10].
Removing redundancies in a video sequence impacts VoIP services in two ways.
Firstly, compression results in a loss in video quality which is generally propor-
tional to the level of compression and bandwidth reduction. Secondly, the removal
of redundancy in the video makes it sensitive to loss and error. The higher the com-
pression the lower the bandwidth requirement but the more sensitive the compressed
video stream is to loss and error. Fortunately, the eye can tolerate loss of quality for
most natural video scenes. The digitisation and compression of voice signal has bro-
ken the limit imposed by the huge bandwidth of the analogue video signal that had
previously limited their use to broadcast services.
Video traffic is predominant in the Internet and in 3G/4G networks. This is due
to increased traffic volumes and increased demand and huge bandwidth require-
ments for video services. This further increases the challenge of delivering reliable
VoIP over the Internet and wireless channels. Specifically, the variable and unpre-
dictable QoS, variable and unbounded delay, high error rates and variable available
bandwidth impacts delivered VoIP quality. Additionally, the very nature of video
compression makes the compressed bitstream sensitive to transmission impairments
such as loss and error and increases the challenge of delivering VoIP services with
acceptable quality.
In VoIP services, ensuring acceptable voice and video quality is critical, but the
sensitivity of the compressed video bitstream to transmission channel impairments
has the potential to degrade quality to unacceptable levels. The method used to com-
press video is therefore critical to the success of VoIP, and an understanding of stan-
dard video compression techniques and video encoders and decoders (codecs) is
necessary for the design of robust VoIP services with acceptable QoS. This under-
standing of the mechanisms and issues with digital video compression is important
to ensure that appropriate video codecs are selected and configured properly to en-
sure quality.

3.2 Video Compression Basics


3.2.1 Digital Image and Video Colour Components

As illustrated in Table 3.2, video formats can be represented by different video reso-
lutions, such as Common Intermediate Format (CIF) and Quarter CIF (QCIF). Each
resolution can be represented by the number of horizontal pixels × the number of
vertical pixels. For example, a CIF format (or resolution) is represented as 352 × 288
which means there are 352 pixels at the horizontal dimension and 288 pixels at
the vertical dimension. Each pixel can be represented using 8 bits for a grayscale
image with a value from 0 (black) to 255 (white). A colour image pixel needs 24 bits,
as shown in Fig. 3.1, where each pixel is represented by three colour components,
Red (R), Green (G) and Blue (B), each coded with 8 bits.
For example, the middle left pixel is represented as
[208, 132, 116] which is dominated by the red colour.

Fig. 3.1 Digital colour image representation

The above Red, Green and Blue colour representation, also known as the RGB
colour format, is normally transformed to YUV or (Y Cb Cr ) format based on
Eq. (3.1) [15], where Y represents the luminance component, U and V (or Cb
and Cr ) represent the chrominance components.

Y = 0.299 × R + 0.587 × G + 0.114 × B
U = 0.701 × R − 0.587 × G − 0.114 × B                          (3.1)
V = −0.299 × R − 0.587 × G + 0.886 × B
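For illustration, Eq. (3.1) can be applied directly to a pixel; the small Python sketch below (the function name is ours) uses the example pixel [208, 132, 116] from Fig. 3.1:

# RGB -> YUV conversion using the coefficients of Eq. (3.1) (illustrative sketch).
def rgb_to_yuv(r, g, b):
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = 0.701 * r - 0.587 * g - 0.114 * b
    v = -0.299 * r - 0.587 * g + 0.886 * b
    return y, u, v

print(rgb_to_yuv(208, 132, 116))   # prints the Y, U, V values for this red-dominated pixel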

3.2.2 Colour Sub-sampling


Since the human visual system is more sensitive to luminance resolution (bright-
ness), the chrominance (colour) components are normally down-sampled in video
processing. This will aid video compression. Figure 3.2 illustrates the concept of
YUV 4 : 2 : 2 and YUV 4 : 2 : 0 formats for down-sampling the chrominance com-
ponents. In the J : a : b format, J represents the number of pixels (for the luminance
component) considered at the horizontal dimension, a represents the corresponding
number of chrominance components at the first row and b is the corresponding num-
ber of chrominance components at the second row.
When there is no down-sampling, the YUV format can be expressed as 4 : 4 : 4.
If we assume 8 bits per component, there are 8 bits for the luminance component,
8 bits for Cb (colour for blue) and 8 bits for Cr (colour for red), resulting in a total
of 24 bits per pixel for YUV components. In the YUV 4 : 2 : 2 format, the chromi-
nance components are half-sampled in the horizontal dimension, resulting overall in
half-sampling of the chrominance components. If we look at the number of bits per
pixel, there are 8 bits for the luminance component, 4 bits for the Cb component
(8/2 = 4 bits) and 4 bits for the Cr component (8/2 = 4 bits), resulting in a total
of 16 bits (8 + 4 + 4 = 16) per pixel. In the YUV 4 : 2 : 0 format, the chrominance

Fig. 3.2 Digital colour image representation

Table 3.1 Required bits per pixel under different colour sub-sampling schemes
Format (YUV) Y (bits/pixel) Cb (bits/pixel) Cr (bits/pixel) total bits/pixel
4:4:4 8 8 8 24
4:2:2 8 4 4 16
4:2:0 8 2 2 12

component is half-sampled at both the horizontal and vertical dimensions (equiva-


lent to an overall of quarter-sampling of the chrominance components). Let’s look
again at the number of bits required per pixel, there are 8 bits for the luminance
component, 2 bits for the Cb (8/4 = 2 bits) and 2 bits for the Cr (8/4 = 2 bits),
resulting in a total of 12 bits (8 + 2 + 2 = 12) per pixel. Compared to the 24 bits per
pixel for the 4 : 4 : 4 format, the required video bandwidth for 4 : 2 : 0 format is only
half of that for the 4 : 4 : 4 format. The above example illustrates how video colour
sub-sampling can assist video compression. Table 3.1 summarises the bits required
for each component and overall YUV components per pixel under different colour
sub-sampling schemes.
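The figures in Table 3.1 follow directly from the J : a : b definition; a minimal Python sketch reproducing them, assuming 8 bits per component (the function name is ours):

# Average bits per pixel for a J:a:b chroma sub-sampling pattern, 8 bits per component.
def bits_per_pixel(J, a, b, bits=8):
    # Over a J x 2 pixel region there are 2*J luminance samples and (a + b)
    # samples for each of the two chrominance components.
    chroma = bits * (a + b) / (2.0 * J)   # per chrominance component, averaged per pixel
    return bits + 2 * chroma

for fmt in ((4, 4, 4), (4, 2, 2), (4, 2, 0)):
    print(fmt, "->", bits_per_pixel(*fmt), "bits/pixel")   # 24, 16 and 12 bits/pixel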

3.2.3 Video Resolution and Bandwidth Requirement

The video bandwidth requirement also depends on what resolution is used. Table 3.2
shows the bandwidth requirement for selected video formats based on Common
Intermediate Format (CIF), 25 frames per second and 4 : 2 : 0 (Y Cb Cr or YUV)

Table 3.2 Bandwidth requirement for selected video formats


Format Resolution (horizontal × vertical) Bits per frame Bandwidth
Sub-QCIF 128 × 96 147456 3.68 Mbit/s
QCIF 176 × 144 304128 7.6 Mbit/s
CIF 352 × 288 1216512 30.42 Mbit/s
4CIF 704 × 576 4866048 121.65 Mbit/s
16CIF 1408 × 1152 19464192 486.6 Mbit/s

format (for 8 bits/component, 4 : 2 : 0 YUV format is equivalent to 12 bits/pixel). It


is clear that the higher the video resolution, the higher the required video bandwidth.

3.3 Video Compression Techniques

The huge amounts of data needed to represent high-quality voice and video makes
their transmission in the Internet and wireless channels impractical and very ex-
pensive. The problem of huge bandwidth requirement for voice is less severe. The
huge data and bandwidth requirements make data compression essential. The cur-
rent level of compression has been achieved over several decades, and advances in
computing power and signal processing have enabled the implementation of video
compression in a variety of equipment such as mobile telephones.
There are in general two basic methods of compression. These are lossy and
lossless compression methods. These two compression methods are generally used
together because combining them achieves greater compression. Lossless compres-
sion techniques do not introduce any distortion to the original video and an exact
copy of the input to the compressor can be recovered at the output. Lossy video
compression techniques, on the other hand, introduce distortions, and it is impossi-
ble to recover an exact replica of the input video at the output. Clever techniques
have been developed to tailor the introduced distortion to the characteristics of the
Human Visual System (HVS) so that its perceptual impact is minimised.

3.4 Lossless Video Compression

Lossless compression techniques exploit the statistical characteristics of video sig-


nals. The level of achievable lossless compression is given by the statistical entropy
of the video signal H , which is the lower bound on compression level. H is formally
defined in Eqs. (3.2) and (3.3).

H = − Σ_{all i} p_i · log2(p_i) = Σ_{all i} p_i · log2(1/p_i)          (3.2)

H = Σ_{all i} p_i · log2(1/p_i)                                        (3.3)

where p_i are the probabilities of the symbols of the video source.


Lossless compression in its basic form is a mechanism of allocating a variable
number of bits to symbols in such a manner that the number of bits allocated to a
symbol is inversely proportional to the probability of the symbol occurring. Fewer
bits are allocated to frequently occurring symbols and more bits to less frequent
symbols. This bit allocation mechanism improves transmission efficiency. Morse
code is an example of lossless coding. In the English language the letter “E” is the
most common. It is represented by a dot in Morse code to improve transmission
efficiency.
In video compression, Huffman coding is a very good example of lossless encod-
ing. The Huffman coding procedure is a Variable Length Coding (VLC) mechanism
that generates variable length codes in such a way that the resultant average bit rate
of the compressed video is as close as possible to the minimum determined by the
entropy of the video signal.
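A quick numerical illustration of Eq. (3.2): the Python sketch below computes the entropy lower bound for an assumed, highly non-uniform symbol distribution (the probabilities are chosen purely for illustration and are not from the text).

# Source entropy H = -sum(p_i * log2(p_i)), the lower bound on lossless compression.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]       # assumed symbol probabilities
print(entropy(probs), "bits/symbol")    # 1.75, against 2 bits for fixed-length coding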

3.5 Lossy Video Compression

The Huffman code provides compression when the set of input video symbols has
an entropy that is much less than log2 (number of symbols). In other words, it
works better with data that have a highly non-uniform Probability Density Function
(PDF). However, the PDF of typical video sequences does not fit this profile.

3.5.1 Predictive Coding

Lossy compression techniques process the video signal in a totally non-reversible


manner such that the resultant PDF fits the required highly non-uniform profile.
Specifically, this is achieved by the exploitation of spatial and temporal correlation
within and between video frames. The system generates a prediction of future
frames but only transmits a coded error of the prediction (the residual frame) instead
of the whole frame. The same prediction is made at the decoder based on received
frames, onto which the decoded error is added. This achieves a significant compression
ratio because, in addition to the reduction in the data required to represent the residual
instead of the full frame, it also results in a modification of the PDF of the video which will ensure
a further gain in lossless compression.
Similar to the speech ADPCM codec explained in Sect. 2.3.1, video predic-
tion coding can be illustrated by a Differential Pulse Coding Modulation (DPCM)

Fig. 3.3 Block diagram for video predictive coding—based on DPCM

scheme with its block diagram shown in Fig. 3.3. At the encoder side, the estimated
current signal is predicted based on previous signal samples (via the predictor). Only
the difference signal or the prediction error signal (e(m, n) = s(m, n) − ŝ(m, n)) is
sent to the quantiser and then coded through entropy encoder before sending to the
channel. At the decoder side, the binary input signal is first decoded by entropy
decoder, then the reconstructed signal (s  (m, n)) is obtained by adding the signal
estimate generated by the predictor.
Unlike speech ADPCM scheme which is one-dimensional, video DPCM scheme
is based on two-dimensional (2D) signal. As shown in Fig. 3.3, video source signal
is represented by a 2D signal (s(m, n)) with m and n representing the horizontal
and vertical position of a pixel. The prediction for the current source signal can be
based on previous samples within the same picture (intra-coding, exploiting spatial
correlation) or on samples belonging to previous pictures (inter-coding, exploiting
temporal correlation).
Similar to ADPCM in speech coding, at least 1 bit is needed to code each predic-
tion error. Thus DPCM coding is not suitable for low bit rate video coding.
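As a minimal illustration of the DPCM idea in Fig. 3.3, the sketch below (Python; a one-dimensional previous-sample predictor and the pixel values are assumed purely for simplicity, and no quantiser is included) forms the prediction error at the encoder and reconstructs the signal at the decoder.

# 1-D DPCM with a previous-sample predictor (illustrative sketch, no quantisation).
samples = [120, 122, 125, 124, 126, 130]    # one row of pixel values (assumed)

# Encoder: transmit only the prediction error e(n) = s(n) - s(n-1).
errors, prev = [], 0
for s in samples:
    errors.append(s - prev)
    prev = s

# Decoder: rebuild s'(n) by adding each received error to the previous reconstruction.
recon, prev = [], 0
for e in errors:
    prev += e
    recon.append(prev)

print(errors)   # small dynamic range after the first sample: [120, 2, 3, -1, 2, 4]
print(recon)    # identical to the input because no quantisation error is introduced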

3.5.2 Quantisation

Further compression can be achieved by incorporating the HVS to deliberately intro-


duce distortions in such a way that they do not have a significant perceptual effect. For example,
the HVS is not sensitive to high frequency detail in video frames, and although dis-
carding high frequency video data reduces the data requirement, the resulting
video quality impairment is masked by the viewer's insensitivity to high frequency de-

Fig. 3.4 Block diagram for video codec based on DCT

tail. Further compression can also be achieved by subsampling the video data both
vertically and horizontally.

3.5.3 Transform Coding

Transform coding is different from predictive coding methods but, like predictive
coding, its purpose is the exploitation of spatial correlations within the picture and
the concealment of compression artifacts. Transform coding does not generally work on the
whole frame but rather on small blocks such as 8 × 8 blocks. Transform coding
literally transforms the video signal from one domain to another domain in such
a way that the transformed elements become uncorrelated. This allows them to be
individually quantised. Although the Karhunen–Loève Transform (KLT) is theoret-
ically maximally efficient, it has implementation limitations. The Discrete Cosine
Transform (DCT), which is less efficient than the KLT, is used instead because it
is straightforward to compute. It is therefore used in JPEG and MPEG compres-
sion. Figure 3.4 illustrates a block diagram for video encoder and decoder (codec)
based on DCT. We use x(n) to represent video samples in the time-domain and
y(k) to represent the transformed coefficients in the frequency-domain. The trans-
formed DCT coefficients are quantised and coded using Variable Length Coding
(VLC) (e.g., Huffman coding). The coded bitstream is packetised and sent to the
channel/network. At the decoder, the bitstream from the channel may be differ-
ent from the bitstream generated from the encoder due to channel error or network
packet loss. Thus, we use x′(n) and y′(k) to differentiate them from those used at
the encoder.
The DCT requires only one set of orthogonal basis functions for each ‘fre-
quency’ and for a block of N × 1 picture elements (pixels), expressed by x(n),

n = 0, . . . , N − 1, the forward DCT, expressed by y(k), k = 0, . . . , N − 1, is de-


fined by Eqs. (3.4) and (3.5).

y[0] = (1/√N) Σ_{n=0}^{N−1} x(n)                                               (3.4)

y[k] = √(2/N) Σ_{n=0}^{N−1} x(n) · cos(kπ(2n + 1)/(2N)),   k = 1, . . . , N − 1    (3.5)

where y[0] is the DC (zero-frequency) coefficient and y[k] (k = 1, . . . , N − 1) are the


other frequency coefficients. The lower the k value, the lower the frequency.
Transform coding using the DCT does not achieve any compression. It simply
transforms the video signal into the DCT domain. The transformed DCT coeffi-
cients have an energy distribution that is heavily biased towards lower frequency
coefficients. They then therefore lend themselves to an adaptive quantisation pro-
cess (Quantiser in Fig. 3.4) in such a way that higher frequency coefficients are
truncated. This truncation of insignificant high frequency DCT coefficients achieves
compression. The more the high frequency coefficients are discarded the higher the
compression ratio, but the more distortion introduced. The DCT coefficients are
quantised (e.g., using a uniform quantiser) and coded using Variable Length Cod-
ing (VLC), for example Huffman coding, at the encoder side before being transmitted
to the channel or packetised and sent to the network.
At the decoder side, after VLC decoding and inverse quantisation, the inverse
DCT is used to transform the frequency-domain signal back to the time-domain
video signal. To consider the impact from channel error and quantisation error, we
use y′(k) and x′(n) to represent the frequency and time-domain signals at the de-
coder side. The Inverse DCT is defined by Eq. (3.6).

x′[n] = (1/√N) y′[0] + √(2/N) Σ_{k=1}^{N−1} y′[k] · cos(kπ(2n + 1)/(2N)),   n = 0, . . . , N − 1    (3.6)

where x′(n) is the reconstructed video signal.
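To make Eqs. (3.4)-(3.6) concrete, the direct (unoptimised) 1-D implementation below can be used as a reference (Python sketch; the 8-pixel block values are assumed, and practical codecs use fast 2-D transforms instead):

# Forward and inverse 1-D DCT following Eqs. (3.4)-(3.6) (illustrative sketch).
import math

def dct(x):
    N = len(x)
    y = [sum(x) / math.sqrt(N)]                       # y[0], Eq. (3.4)
    for k in range(1, N):                             # y[k], Eq. (3.5)
        y.append(math.sqrt(2.0 / N) *
                 sum(x[n] * math.cos(k * math.pi * (2 * n + 1) / (2 * N))
                     for n in range(N)))
    return y

def idct(y):
    N = len(y)
    return [y[0] / math.sqrt(N) +                     # Eq. (3.6)
            math.sqrt(2.0 / N) *
            sum(y[k] * math.cos(k * math.pi * (2 * n + 1) / (2 * N))
                for k in range(1, N))
            for n in range(N)]

x = [52, 55, 61, 66, 70, 61, 64, 73]                  # assumed 8-pixel block
y = dct(x)
print([round(v, 1) for v in y])        # energy is concentrated in the low-frequency terms
print([round(v, 1) for v in idct(y)])  # reconstructs the original block (no quantisation)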

3.5.4 Interframe Coding

Interframe coding exploits the temporal redundancies of a sequence of frames to


achieve bit-rate reduction. Natural moving video has a strong correlation in the tem-
poral domain. Consecutive frames tend to be very similar, which can be exploited to
reduce temporal redundancy.

3.6 Video Coding Standards

Standard video codecs all follow a generic structure that consists of motion esti-
mation and compensation to remove temporal or inter-frame redundancies, trans-
form coding to manipulate the resultant PDF of the transform coefficients and de-
correlate inter and intra pixel redundancies, and entropy coding to remove statistical
redundancies. They are intended for lossy compression of natural occurring video
sequences. Significant gains in coding efficiency have been achieved over the years,
coming from improvements to the algorithms of this generic structure, which have
resulted in much more advanced standard codecs. It is from motion estimation and
compensation that the largest compression gains have been achieved. Another area
that has seen significant improvement is computational complexity.
Video standards are mainly from ITU-T (International Telecommunication
Union, H series standards, e.g., H.261, H.263 and H.264) and from the Moving
Picture Experts Group (MPEG), a working group formed by ISO (International Or-
ganisation for Standards) and IEC (International Electrotechnical Commission) to
set standards for audio and video compression and transmission. The well-known
MPEG standards are MPEG1, MPEG2 and MPEG4.
This section describes the various standard video codecs that have been devel-
oped so far through successive refinement of the various coding algorithms.

3.6.1 H.120

This was the first video coding developed by the International Telegraph and Tele-
phone Consultative Committee (CCITT, now ITU-T) in 1984. It was targeted for
video conferencing applications. It was based on Differential Pulse Code Mod-
ulation (DPCM), scalar quantisation, and conditional replenishment. H.120 sup-
ported bit rates aligned to T1 and E1, i.e. 1.544 and
2.048 Mbit/s. This codec was abandoned not long afterwards with the development
of the H.261 video standard.

3.6.2 H.261

This codec (H.261 [6]) was developed as a replacement for H.120 and is widely
regarded as the origin of modern video compression standards. H.261 introduced
the hybrid coding structure that is predominantly used in current video codecs. This
codec used 16 ×16 macroblock (MB) motion estimation and compensation, an 8 ×8
DCT block, zig-zag scanning of DCT coefficients, scalar quantisation and variable
length coding (VLC). This codec was the first to use a loop filter for the removal
of block boundary artifacts. H.261 supports bit rates of p × 64 kbit/s (p = 1–30),
ranging from 64 kbit/s to 1920 kbit/s (64 kbit/s is the base rate for ISDN links).
Although it is still used, H.261 has been replaced by H.263 video codec.

3.6.3 MPEG 1&2

MPEG-1 was developed in 1991 mainly for video storage applications on CD-
ROM. MPEG-1 is based on the same structure as H.261. However, it
was the first codec to use bi-directional prediction, in which bi-directional pictures
(B-pictures) are predicted from anchor intra pictures (I-pictures) and predictively
coded pictures (P-pictures). It has much improved picture quality compared with
H.261, operates at bit rates of up to 1.5 Mbit/s for CIF picture sizes (352 × 288
pixels), and has an improved motion estimation algorithm relative to H.261.
The MPEG-2 coding standard, which is also known as H.262 was developed
around 1994/95 for DVD and Digital Video Broadcasting (DVB). The main differ-
ences between this codec and the MPEG-1 standard are the introduction of interlaced
scanning pictures to increase compression efficiency and the provision of scalability
that enabled channel adaptation. MPEG-2 was targeted towards high quality video
with bitrates that range between 2 and 20 Mbit/s. It is not generally suitable for low
bit rate applications such as VoIP application that has bitrates below 1 Mbit/s.
The MPEG-2 video standard is a hybrid coder that uses a mixture of intraframe
coding to remove spatial redundancies and Motion Compensated (MC) interframe
coding to remove temporal redundancies [12]. Intraframe coding exploits the spatial
correlation of nearby pixels in the same picture, while interframe coding exploits the
correlation between adjacent pixels in the corresponding area of a nearby picture to
achieve compression. In intraframe coding, the pixels are transformed into the DCT
domain, resulting in a set of uncorrelated coefficients, which are subsequently quan-
tised and VLC encoded. Interframe coding removes temporal redundancy by using
reference picture(s) to predict the current picture being encoded and the prediction
error is transformed, quantised and encoded [4]. In MPEG-2 standard [11] either a
past or future picture can be used for the prediction of the current picture being en-
coded, and the reference picture(s) can be located more than one picture away from
the current picture being encoded.
The DCT removes spatial redundancy in a picture or block by mapping a set of
N pixels into a set of N uncorrelated coefficients that represent the spatial frequency
components of the picture or pixel block [1]. This transformation does not yield
any compression by itself, but concentrates the transformed coefficients in the low
frequency domain of the transform. Compression is achieved by discarding the least
important coefficients to the human visual system and the remaining coefficients are
not represented with full quality. This process is achieved through the quantisation
of the transformed coefficients using visually weighted factors [4]. Quantisation
is a nonlinear process and it is nonreversible. The original coefficients cannot be
reconstructed without error once quantisation has taken place. Further compression
is then achieved by VLC coding of the quantised coefficients.
MPEG-2 is a coding standard intended for moving pictures and was developed
for video storage, delivery of video over Telecommunications networks, and for
multimedia applications [11]. For streaming compressed video over IP, MPEG-2
bit streams are normally encoded at the Source Intermediate Format (SIF) size of
352 × 288 pixels and at a temporal resolution of 25 f/s for Europe and 352 × 240

Fig. 3.5 A simplified MPEG-2 encoder

pixels and 30 f/s for America [14]. The MPEG-2 standard defines three main picture
types:
• I: This picture is intracoded without reference to any other picture. They provide
an access point to the sequence for decoding and have moderate compression
levels.
• P: These pictures are predictively coded with reference to past I or P pictures,
and are themselves used as reference for coding of future pictures.
• B: These are the bidirectionally coded pictures, and require both past and future
pictures to be coded and are not used as a reference to code other pictures.
Figure 3.5 shows a simplified model of an MPEG-2 encoder. The frame reorder-
ing process allows the coding of the B pictures to be delayed until the I and P
pictures are coded. This allows the I and the P pictures to be used as reference
in coding the B pictures. DCT performs the transformation into the DCT domain,
Quantise performs the quantisation process and VLC is the variable length coding
process. A buffer, BUF is used for rate control and smoothing of the encoded bit
rate. The frame store and predictors are used to hold pictures to enable predictive
coding of pictures.
An MPEG-2 encoded video has a hierarchical representation of the video signal
as shown on Fig. 3.6.
• Sequence: This is the top layer of the hierarchy and is a sequence of the input
video. It contains a header and a group of pictures (GOP).
• GOP: This coding unit provides for random access into the video sequence and
is defined by two parameters: the distance between anchor pictures (M) and the

Fig. 3.6 MPEG-2 coded video structure

total number of pictures in a GOP (N). A GOP always starts with an intraframe
(I) picture and contains a combination of predictive (P) and bi-directional (B)
coded pictures.
• Picture: This is the main coding and display unit and can be I, P or B type. Its
size is determined by the spatial resolution required by an application.
• Slice: This consists of a number of macroblocks (MBs) and is the smallest self-
contained coding and re-synchronisation unit.
• Macroblock: This is the basic coding unit of the pictures and consists of blocks
of luminance and chrominance.
• Block: This is an 8 × 8 pixel block and is the smallest coding unit in the video
signal structure and is the DCT unit.

Fig. 3.7 Group of Blocks (GoBs) and Macroblocks (MBs) in H.263 (for CIF format)

3.6.4 H.263

This video coding standard was developed in 1996 [7] as a replacement for the H.261
video coding standard. It was intended to be used for low bit rate communication,
such as for video conferencing applications. It supports standard video formats
based on Common Intermediate Format (CIF), which includes sub-QCIF, QCIF,
CIF, 4CIF and 16CIF. It utilises DCT to reduce spatial redundancy and motion com-
pensation prediction for removing temporal redundancy. The YUV format applied is
4 : 2 : 0 and the standard Picture Clock Frequency (PCF) is 30000/1001 (approxi-
mately 29.97) times per second.
An H.263 picture is made up of Groups of Blocks (GoBs) or slices, each of which
consists of k × 16 lines, where k depends on the picture format (k = 1 for QCIF and
CIF, k = 2 for 4CIF and k = 4 for 16CIF). So a CIF picture consists of 18 GoBs
(288/16 = 18) and each GoB contains one row of macroblocks (MBs) as shown
in Fig. 3.7. A MB contains four blocks of luminance components and two blocks
of chrominance components (one for Cb and one for Cr ) for YUV 4 : 2 : 0 format.
The position of luminance and chrominance component blocks are also shown in
Fig. 3.7.
The number of pixels (horizontal × vertical or width × height) for the lumi-
nance and chrominance components for each H.263 picture format are summarised
in Table 3.3.
H.263 has seven basic picture types, including I-picture, P-picture, PB-picture,
Improved PB-picture, B-picture, EI-picture and EP-picture. Within these seven pic-
ture types, only the I-picture and P-picture are mandatory. The I-picture is an intra-coded
picture with no reference to other pictures for prediction; it only exploits spatial correlation within the picture.

Table 3.3 Number of pixels (horizontal × vertical) for luminance and chrominance components
for H.263
Format Luminance (Y) Chrominance (Cb ) Chrominance (Cr )
Sub-QCIF 128 × 96 64 × 48 64 × 48
QCIF 176 × 144 88 × 72 88 × 72
CIF 352 × 288 176 × 144 176 × 144
4CIF 704 × 576 352 × 288 352 × 288
16CIF 1408 × 1152 704 × 576 704 × 576

The P-picture is an inter-coded picture, which uses previous


pictures for prediction in order to further remove temporal redundancy.
H.263 reduces the bitrates to half of those of H.261. This bitrate reduction was
possible due to an improved motion prediction and compensation and a 3-D VLC
for coding of DCT coefficients. H.263 was subsequently refined to the H.263+
codec in 1998 and further to H.263++ in 2000. This video coding standard
was developed for mobile networks and the Internet and therefore has improved
error resiliency and scalability features.
H.263 is normally used for video phone applications over the Internet and for video
calls on 3G mobile handsets (e.g., H.263 baseline level 10 with bit rates of up
to 64 kbit/s used in 3G mobile). In the VoIP testbed explained in Chaps. 8 and 9,
H.263 is used within X-Lite VoIP soft phone based on QCIF or CIF format. The RTP
H.263 payload format will be explained in detail (based on real trace data collected)
in Chap. 4.

3.6.5 MPEG-4

MPEG-4 was developed in 1999 and has many similarities to the H.263 design.
MPEG-4 has the capability to code multiple objects within a video frame. It has
many application profiles and levels. It is much more complex than MPEG-1 and
MPEG-2 and is regarded as a compression toolset rather than a codec in the
strict sense of MPEG-1 and MPEG-2.
MPEG4 was developed mainly for storing and delivering multimedia content
over the Internet. It has bit rates from 64 kb/s to 2 Mb/s for CIF and QCIF formats.
Its simple profile (level 0) is normally used in 3G video call applications (e.g., for
QCIF operating at 64 kbit/s).

3.6.6 H.264

H.264, which is also known as Advanced Video Coding (AVC), MPEG-4 Part-10
or the Joint Video Team (JVT) codec, is the most advanced state-of-the-art video codec; it

was standardised in 2003 [8]. Its use in applications is wide ranging and includes
broadcast with set-top-boxes, DVD storage, use in IP networks, multimedia tele-
phone and networked multimedia such as VoIP. It has a wide range of bit rates and
quality resolutions and supports HDTV, Blu-ray disc storage, applications with
limited computing resources such as mobile phones, video-conferencing and mo-
bile applications. H.264 uses fixed-point implementation and it is network friendly
in that it has a video coding layer (VCL) and a network abstraction layer (NAL).
It uses predictive intra-frame coding, multi-frame and variable block size motion
compensation and has an increased range of quantisation parameters.

3.6.7 High Efficiency Video Coding (HEVC)

With the increasing demand for streaming high-resolution video, e.g. HDTV and
further UHDTV (Ultra-HDTV), over the Internet and for transmitting high quality
video over bandwidth-limited networks (e.g., mobile networks), there is a growing
need for higher video compression efficiency than H.264 provides.
This has motivated work on High Efficiency Video Coding (HEVC) [2], a new
generation video coding standard, also known as H.265, from the Joint Collaborative
Team on Video Coding (JCT-VC) from the ITU-T Study Group 16 and the ISO/IEC
MPEG working group. HEVC is currently at proposal and test stage with the final
standard expected to be released in 2013.
HEVC aims to double the compression efficiency of Advanced Video Coding
(AVC) while keeping comparable video quality, at the cost of increased implementation
complexity (longer encoding time). It targets a wide variety of applications, from
mobile TV and home cinema to UHDTV (Ultra-High Definition TV).
Although the final HEVC standard has not been released yet, there have been
some industry implementations of preliminary versions of HEVC such as the im-
plementation of HEVC on an Android tablet from Qualcomm and equipment from
Ericsson to support TV broadcasting over mobile networks.

3.7 Illustrative Worked Examples

3.7.1 Question 1

Calculate the bandwidth requirement for QCIF video, 25 frames per second, with
8 bits per component and 4 : 2 : 0 format. If the YUV format is changed to 4 : 2 : 2,
what is the bandwidth requirement?

SOLUTION: For QCIF video, the video resolution is 176 × 144. For 4 : 2 : 0 for-
mat, the bits required for each pixel is 12 bits. For 4 : 2 : 2 format, the bits required
for each pixel is 16 bits.

Fig. 3.8 GoBs and MBs for H.263 (based on QCIF format)

The bandwidth required for 4 : 2 : 0 format is:

176 × 144 × 25 × 12 = 7.6 (Mb/s)

The bandwidth required for 4 : 2 : 2 format is:

176 × 144 × 25 × 16 = 10.1 (Mb/s)

3.7.2 Question 2

Illustrate the concept of Group of Blocks (GoBs) and Macroblocks (MB) used in
H.263 coding based on QCIF format. Decide the number of GoBs and MBs con-
tained in a picture of QCIF format.

SOLUTION: Figure 3.8 illustrates the concept of Group of Blocks (GoBs) and
Macroblocks (MBs) used in H.263 coding for the QCIF format. For QCIF format,
there are 9 GoBs (144/16 = 9).
As shown in the figure, a picture contains a number of Group of Blocks (from
GoB 1 to GoB N). Each GoB may contain one or several rows of Macroblocks
(MBs) depending on the video format. For the QCIF format, one GoB contains only
one row of MBs as shown in the figure. Each GoB contains 11 MBs (176/16 = 11).
This results in a total of 99 MBs for one picture of the QCIF format (11 × 9 = 99).
Each MB contains four blocks for luminance components (four blocks for Y) and
two blocks for chrominance components (one block for Cb and one block for Cr )
which is equivalent to YUV 4 : 2 : 0 format. Each block consists of 8 × 8 pixels
which is the basic block for DCT transform.
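The counts above generalise to the other H.263 formats in Table 3.3; a couple of lines of Python (illustrative only, names ours) reproduce them:

# GoBs and macroblocks for H.263 picture formats (illustrative sketch).
def gob_mb_count(width, height, k):
    gobs = height // (16 * k)           # each GoB spans k x 16 lines
    mbs_per_gob = (width // 16) * k     # one MB covers 16 x 16 luminance pixels
    return gobs, gobs * mbs_per_gob

print(gob_mb_count(176, 144, 1))   # QCIF: (9, 99)
print(gob_mb_count(352, 288, 1))   # CIF:  (18, 396)
print(gob_mb_count(704, 576, 2))   # 4CIF: (18, 1584)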

3.7.3 Question 3

The structure of Group of Pictures (GoPs) can be expressed by GoP(M, N ). De-


scribe how M and N are decided. Describe how GoP(3, 12) is formed. From
the video coding point of view, explain the differences between the I-frame, B-frame
and P-frame, and their possible impact on video quality when a frame is lost.

SOLUTION: In the GoP(M, N ) expression, M is the distance between two anchor


pictures (e.g. I-picture and P-picture) and N is the total number of pictures within a
GoP or GoP length.
GoP(3, 12) can be illustrated as IBBPBBPBBPBB which consists of 12 pictures
in one Group of Picture (GoP) and the distance between two anchor pictures (i.e.,
I-frame or P-frame) is three.
The I-frame is intra-coded based on the current picture without the need for ref-
erences to any other pictures. It only exploits spatial redundancies for video com-
pression. It is used as a reference point for remaining pictures within a GoP during
decoding.
The P-frame is coded using motion-compensated prediction from a previous I-
frame or P-frame, thus, the name of predictive-coded frame or P-frame.
The B-frame is coded with reference to both previous and future reference frames
(either I or P frames). It uses bidirectional motion compensation, thus the name
bidirectional predictive-coded frame or B-frame.
When an I-frame is lost during transmission, it affects the whole GoP: the error due
to the I-frame loss propagates to all pictures within the GoP. When a P-frame is lost, it
may affect the remaining P-frames and the relevant B-frames within the GoP. When a
B-frame is lost, it only affects the B-frame itself and no error propagation occurs. In terms
of the impact on video quality, losing an I-frame is the worst, followed by a P-frame, and
then a B-frame.
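
As a simple illustration of the structure described above, the Python sketch below (illustrative only) generates the frame-type pattern of a GoP directly from (M, N).

# Build the frame-type pattern of a GoP(M, N): M is the anchor-frame spacing
# (I/P frames), N is the GoP length.
def gop_pattern(m, n):
    pattern = []
    for i in range(n):
        if i == 0:
            pattern.append("I")       # first picture of the GoP is intra-coded
        elif i % m == 0:
            pattern.append("P")       # anchor pictures every M frames
        else:
            pattern.append("B")       # bidirectionally predicted pictures in between
    return "".join(pattern)

print(gop_pattern(3, 12))  # IBBPBBPBBPBB
print(gop_pattern(3, 9))   # IBBPBBPBB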

3.8 Summary
The acceptance and success of VoIP services are very much dependent on their ability
to provide services with an acceptable quality of experience. The video coding
techniques and the selection of video codecs are key to providing the required
QoE. It is paramount for designers and VoIP service providers to understand the
issues with compression in order to select the appropriate coding techniques and
codecs that will enable them to provide the necessary QoE to their users.
This chapter discussed video compression in the context of VoIP services. The
chapter started by describing the need for compression. It then described the basic
techniques for video compression, including lossless and lossy video compression,
with a focus on lossy video compression, which includes predictive coding, quantisation,
transform coding and interframe coding. The chapter then gave a description of the
most popular video coding standards, including MPEG-1, MPEG-2, MPEG-4, H.261,
H.263, H.264 and the latest HEVC, and showed the evolution of these standards.

3.9 Problems

1. Calculate the bandwidth requirement for video with a resolution of
352 × 288, 25 frames per second, with 8 bits per component and 4 : 2 : 2 format.
What is the bandwidth requirement if the picture colour format is changed
to YUV 4 : 2 : 0?
2. Determine the number of Group of Blocks and Macroblocks contained in a pic-
ture of sub-QCIF and 4CIF formats based on H.263 coding.
3. What is the resolution for CIF format? How about QCIF and 4CIF?
4. Describe the purpose of variable length coding (VLC).
5. Describe the two mandatory picture types used in H.263.
6. What are the main differences between the MPEG-1 and MPEG-2 coding stan-
dards?
7. What is the effect of a scene change on an H.264 encoded video?
8. What is the main purpose of Intra (I) coded video frames?
9. Explain the differences between I-frame, P-frame and B-frame.
10. Explain the meaning of M and N in GoP(M, N ) structure. Illustrate the struc-
ture of GoP(3, 9).

4 Media Transport for VoIP

TCP and UDP are the most commonly used transport layer protocols. TCP is a
connection-oriented, reliable, in-order transport protocol. Its features such as re-
transmission, flow control and congestion control are not suitable for real-time mul-
timedia applications such as VoIP. UDP is a connectionless and unreliable transport
protocol. Its simple header, non-retransmission and non-congestion-control features
make it suitable for real-time applications. However, as UDP has no sequence
number in its header, media stream packets transferred over UDP may be duplicated
or may arrive out of order. This can render the received media (e.g., voice or video)
unrecognisable or unviewable. The Real-time Transport Protocol (RTP) was developed
to assist the transfer of real-time media streams on top of the unreliable UDP protocol.
It adds several fields, such as the sequence number (to detect packet loss), the timestamp
(to place each media packet correctly in time) and the payload type (to identify the
voice or video codec used). The associ-
ated RTP Control Protocol (RTCP) was also developed to assist media control and
QoS/QoE management for VoIP applications. This chapter presents the key con-
cepts of RTP and RTCP, together with detailed header analysis based on real trace
data using Wireshark. The compressed RTP (cRTP) and bandwidth efficiency issues
are also discussed together with illustrative worked examples for VoIP bandwidth
calculation.

4.1 Media Transport over IP Networks

After voice/video has been compressed via the encoder at the sender side, the com-
pressed voice/video bit streams need to be packetised and then sent over packet
based networks (e.g., IP networks). For voice over IP, one packet normally contains
one or several speech frames. For example, for G.729, one speech frame contains
10 ms of speech samples. If one packet contains one speech frame, then every
10 ms one IP packet will be sent to the IP network (via the network interface).
If one packet contains two speech frames, then every 20 ms one


IP packet, which contains two speech frames, will be sent to the network. Putting more
speech frames in an IP packet leads to a longer end-to-end transmission delay, which
affects the quality of VoIP sessions, but it is more efficient in the use of transmission
bandwidth (considering that the same protocol headers need to be added to each packet).
There is always a tradeoff in deciding the right number of speech frames to put in an
IP packet, as the sketch below illustrates.
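
The following Python sketch (illustrative only; it assumes the 40-byte IP/UDP/RTP header overhead discussed later in this chapter) quantifies the tradeoff for G.729: more frames per packet means a higher packetisation delay but a lower required IP bandwidth.

# Packetisation tradeoff for a frame-based codec: delay vs required IP bandwidth.
IP_UDP_RTP_HEADER_BYTES = 40  # 20 (IP) + 8 (UDP) + 12 (RTP)

def packetisation_tradeoff(codec_kbps, frame_ms, frames_per_packet):
    payload_bytes = codec_kbps * frame_ms * frames_per_packet / 8  # bits -> bytes
    packet_interval_ms = frame_ms * frames_per_packet              # packetisation delay
    ip_bw_kbps = (payload_bytes + IP_UDP_RTP_HEADER_BYTES) * 8 / packet_interval_ms
    return packet_interval_ms, ip_bw_kbps

for n in (1, 2, 3):
    delay, bw = packetisation_tradeoff(codec_kbps=8, frame_ms=10, frames_per_packet=n)
    print(f"G.729, {n} frame(s)/packet: delay {delay} ms, IP bandwidth {bw:.1f} kb/s")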

4.2 TCP or UDP?

In the TCP/IP protocol stack, there are two transport layer protocols, which are
Transmission Control Protocol or Transport Control Protocol (TCP) and User
Datagram Protocol (UDP). About 90 % of today’s Internet traffic is from TCP-
based applications such as HTTP/Web, e-mail, file transfer, instant messaging, on-
line gaming, and some video streaming applications (e.g., YouTube). The remain-
ing 10 % of Internet traffic belongs to UDP-based applications such as Domain
Name System (DNS) and real-time VoIP applications which are covered in this
book.
The TCP protocol, originally defined in RFC 793 in 1981 [3], is a connection-
oriented, point-to-point, reliable transport protocol. Connection-oriented means that
TCP will establish a connection between the sender and the receiver via three-way
handshake before a data transfer session starts (as shown in Fig. 4.1 for TCP header,
flag bits of SYN, ACK are used in the initial connection buildup stage). Each TCP
header contains 16-bit source port number, 16-bit destination port number, 32-bit se-
quence number, 32-bit acknowledgement number, 4-bit TCP header length, Flag bits
such as FIN (Finish), SYN (Synchronisation), RST (Reset), PSH (Push ‘data’), ACK
(Acknowledgement), URG (Urgent bit), 16-bit checksum, 16-bit urgent pointer and
options fields. The minimum TCP header is 20 bytes (when there are no options).
The sequence number and the acknowledgement number are used to indicate the
location of the sent data within the sending byte stream and to acknowledge the
receipt of relevant packets (together with the ACK flag bit). This acknowledgement
mechanism, together with retransmission of lost packets, is key to the reliable
transmission of TCP data. Other features such as flow control (through the use of the
16-bit window size) keep the sending and receiving processes at a matching speed
(neither too fast nor too slow). The congestion control mechanism is used to adjust
the sending bit rate in response to network congestion (e.g., when a packet is lost,
which indicates possible network congestion, the TCP sender will automatically
reduce its sending bit rate in order to relieve the congestion in the network). TCP’s
congestion control mechanism is very important for the healthy operation of the
Internet. Due to the above features, TCP is mainly used for transmitting
delay-insensitive, reliability-critical data applications (such as email, FTP data
transfer and HTTP web applications). The acknowledgement, retransmission and
congestion control features are not suitable for real-time VoIP applications. The
point-to-point nature and flow control are also not suitable for voice/video
conference applications in which

Fig. 4.1 TCP Header

Fig. 4.2 UDP Header

one stream needs to be sent to several clients. This has made UDP the only practical
option for transmitting VoIP packets.
Compared to TCP, UDP (User Datagram Protocol), originally defined in RFC
768 in 1980 [8], is very simple in its structure and functions. Figure 4.2 shows the
UDP header which only contains 8 bytes, with 16 bits for source port number, 16 bits
for destination port number, 16 bits for UDP packet length and remaining 16 bits
for the UDP checksum (for basic error detection). There is no connection establishment
stage, and no flow control, congestion control or retransmission mechanisms as provided
in TCP. Being connectionless and having no retransmission mechanism means that UDP
transfer is faster than TCP transfer. Having no sequence number or acknowledgement
mechanism means that a UDP receiver cannot know the order of the packets or whether
a packet has been received at all. This makes UDP transfer fast, but unreliable. The fast
transfer nature of UDP makes it suitable for real-time multimedia applications, such as
VoIP, which can also tolerate some degree of packet loss. The absence of point-to-point
connections and flow control also makes UDP suitable for both unicast and multicast
applications.
From the socket implementation point of view, a UDP packet is simply sent towards the
destination (via its destination socket); whether it reaches the destination depends entirely
on the network. Due to the nature of IP networks, some packets may be duplicated and
some may arrive out of order. Clearly, UDP itself cannot solve the problem of putting the
voice or video packets back in the right order so that they can be played out properly at
the receiver side of a VoIP application. This motivated the development of the Real-time
Transport Protocol (RTP), which is covered in the next section.
For more details about the TCP/UDP protocols, the reader is recommended to
read relevant books on computer networking, such as [6].
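
As a simple illustration of this connectionless, fire-and-forget behaviour, the Python sketch below (the address, port and payload are hypothetical) sends a single datagram with no connection setup, acknowledgement or retransmission.

# Minimal UDP sender: no connection setup, no acknowledgement, no retransmission.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # UDP (datagram) socket
payload = b"\x00" * 160                                    # e.g., 20 ms of G.711 samples
# The datagram is handed to the network; delivery, ordering and duplication
# are entirely up to the network -- UDP gives no feedback either way.
sock.sendto(payload, ("192.0.2.10", 5004))                 # hypothetical receiver address/port
sock.close()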

Fig. 4.3 RTP Header

4.3 Real-Time Transport Protocol—RTP

RTP was originally proposed in RFC 1889 [10] in 1996 (now obsolete) and refined
in RFC 3550 [11] in 2003. It aims to support the transfer of real-time multimedia
data over the UDP transport protocol. RTP added the sequence number in order to
identify the loss of RTP packets. Together with the timestamp field, it allows the
receiver to play out the received voice/video packets in the right order and at the
right position. Other fields such as SSRC and CSRC (explained later in this
section) are used to identify the voice or video source involved in a VoIP session, or to
identify the contributing sources which are mixed by the VoIP sender (e.g., in a VoIP
conference situation). RTCP (the RTP Control Protocol), in association with RTP, is
used to monitor the quality of service of a VoIP session and to convey information
about the participants in an on-going session. RTCP packets are sent periodically
and contain sender and/or receiver reports (e.g., for packet loss rate and jitter value).

4.3.1 RTP Header

The RTP header includes mainly the payload type (for audio/video codecs), the
sequence number and the timestamp.
Figure 4.3 shows the RTP header which includes the following header fields.
• V: This field (2 bits) contains the version of the RTP protocol. The version
defined by RFC 1889 [10] is two.
• P: This is the Padding bit indicating whether there are padding fields in the RTP
packet.
• X: This is the eXtension bit indicating whether extension header is present.
• CC: This field (4 bits) contains the CSRC count, the number of contributing
source identifiers.
• M: This is the Marker bit. For voice, this marks the start of a voice talkspurt,
if silence suppression is enabled at the encoder. For example, M is set to 1
for the 1st packet after a silence period and is zero otherwise. For video, the
marker bit is set to one (True) for the last packet of a video frame and zero
otherwise. For example, if an I-frame is split into 8 packets to transmit over
the channel/network, the first seven packets will have the marker bit set to Zero

(false) and the 8th packet (last packet for the I-frame) will have the marker bit
set to One (True). If a P-frame is put into two packets, the first packet will have
the marker bit set to Zero and the second packet will have its M bit set to One. If there is
only one packet for a P-frame or B-frame, the marker bit will always be One.
• Payload type: This field (7 bits) contains the payload type for voice or video
codecs, e.g. for PCM-μ law, the payload type is defined as zero. The payload
type for common voice and video codecs are shown in Tables 4.1 and 4.2, re-
spectively.
• Sequence number: This field (16 bits) contains the sequence number which will
be incremented by one for each RTP packet sent for detecting packet loss.
• Timestamp: This field (32 bits) indicates the sampling instant when the first
octet of the RTP data was generated. It is measured according to media clock
rate. For voice, the timestamp clock rate is 8 kHz for majority of codecs and
16 kHz for some codecs. For example, for G.723.1 codec with frame size of
30 ms (containing 240 speech samples at 8 kHz sampling rate) and one speech
frame per packet, the timestamp difference between two consecutive packets
will be 240. In the case of speech using the G.723.1 codec, the clock rate is the
same as the sampling rate, and the timestamp difference between two consecutive
packets is determined by the number of speech samples that a packet contains.
For video, the timestamp clock rate is 90 kHz for majority of codecs. The
timestamp will be the same on successive packets belonging to the same video
frame (e.g., an I-frame segmented into several IP packets will have
the same timestamp value in each RTP header). If a video encoder uses a
constant frame rate, for example, 30 frames per second, the timestamp differ-
ence between two consecutive packets (belonging to different video frames) will
have the same value of 3000 (90,000/30 = 3000), or the media clock difference
between two consecutive packets is 3000. If frame rate is reduced to half (e.g.,
15 frames per second), the timestamp increment will be doubled (e.g., 6000).
• SSRC identifier: SSRC is for Synchronisation Source. This field (32 bits) con-
tains the identifier for a voice or video source. Packets originating from the same
source will have the same SSRC number.
• CSRC identifier: CSRC is for Contributing Source. It is only present
when the CC field value is nonzero, which means more than one source has
been mixed to produce this packet’s contents. Each entry (32 bits) contains the
identifier of a contributing source. A maximum of 15 entries can be
supported (the CC count is a 4-bit field). More information about the functions of RTP mixers can be found
from Perkins’s book on RTP [7].
If there are no mixed sources involved in a VoIP session, the RTP header will
have the minimum header size of 12 bytes. If more mixed sources (or contributing
sources) are involved, the RTP header size will increase accordingly.
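
To make the field layout concrete, the Python sketch below (illustrative only; the payload type, sequence number, timestamp and SSRC values are made up) packs a minimal 12-byte RTP header as described above and parses it back.

# Pack and parse a minimal 12-byte RTP header (version 2, no padding/extension/CSRC).
import struct

def pack_rtp_header(payload_type, seq, timestamp, ssrc, marker=0):
    byte0 = (2 << 6)                      # V=2, P=0, X=0, CC=0
    byte1 = (marker << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)

def parse_rtp_header(data):
    byte0, byte1, seq, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": byte0 >> 6,
        "cc": byte0 & 0x0F,
        "marker": byte1 >> 7,
        "payload_type": byte1 & 0x7F,
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }

hdr = pack_rtp_header(payload_type=0, seq=61170, timestamp=160, ssrc=0x12345678, marker=1)
print(len(hdr), parse_rtp_header(hdr))   # 12 bytes; PCMU, seq 61170, timestamp 160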
Tables 4.1 and 4.2 show examples of RTP payload types (PT) for voice and video
codecs according to RFC 3551 [9]. Media types are defined as “A” for audio only,
“V” for video only and “AV” for combined audio and video. The payload type for
H.264 is dynamic or profile defined which means that the payload type for H.264

Table 4.1 Examples of RTP payload type for voice codecs

PT   Codec   Media type   ms/frame   Default ms/packet   Clock rate (Hz)
0    PCMU    A            –          20                  8,000
3    GSM     A            20         20                  8,000
4    G.723   A            30         30                  8,000
7    LPC     A            –          –                   8,000
8    PCMA    A            –          20                  8,000
9    G.722   A            –          20                  8,000
18   G.729   A            10         20                  8,000

Table 4.2 Examples of RTP payload type for video codecs

PT                     Codec   Media type   Clock rate (Hz)
26                     JPEG    V            90,000
31                     H.261   V            90,000
33                     MP2T    AV           –
34                     H.263   V            90,000
dynamic (or profile)   H.264   V            90,000

can be defined dynamically during a session. The range for dynamically assigned
payload types is from 96 to 127 according to RFC 3551. The clock rate for media
is used to define the RTP timestamp in each packet’s RTP header. The clock rate
for voice is normally the same with the sampling rate (i.e., 8000 Hz for narrowband
speech codecs). The clock rate for video is 90,000 Hz.

4.3.2 RTP Header for Voice Call Based on Wireshark

In order to have a practical understanding of the RTP protocol and RTP header
fields, we show some trace data examples collected from a VoIP system. The details
on how to setup a VoIP system, how to collect and further analyse the trace data will
be discussed in detail in Chaps. 8 and 9.
Figures 4.4 and 4.5 illustrate an example of RTP trace data for one direction
of a voice stream, with Fig. 4.4 presenting an overall picture and Fig. 4.5 showing fur-
ther information regarding the RTP header. In Fig. 4.4, the filter of “ip.src ==
192.168.0.29 and rtp” was applied in order to get a view of a voice stream sent
from the source station (IP address: 192.168.0.29) to the destination station (IP ad-
dress: 192.168.0.67), and only the RTP packets were filtered out. From these two
figures, it can be seen that all the shown packets (from No. 174 to No. 192) have the
same Ethernet frame length of 214 (bytes).
The sequence number (seq) is incremented by one for each packet (e.g., the 1st
one is 61170 and the 2nd one is 61171). The sequence number can be used easily to
detect whether there is a packet loss.

Fig. 4.4 RTP trace example for voice from Wireshark

Fig. 4.5 RTP trace example for voice from Wireshark—more RTP information

The SSRC (synchronised source) identifier is kept to the same for these packets
(indicating that they are from the same source). The Payload Type (PT) is ITU-T
G.711 PCMU (PCM-μ law) which has the PT value of 0.
The timestamp is incremented by 160 for each packet (i.e., the 1st packet’s
timestamp is 160, the 2nd is 320, and the 3rd is 480). This is equivalent to
160 speech samples for each speech packet which contains 20 ms of speech at
8 kHz sampling rate. In other words, 20 ms speech contains 160 speech samples
(8000 samples/s × 20 ms = 160 samples). This can also be seen from the time dif-
ference between two consecutive packets, for example, the time for the No. 1 packet
is 11.863907 s and time for the No. 2 packet is 11.883855 s. The time difference be-

Table 4.3 Example of RTP timestamp and packet interval for G.711 voice

Packet No.   Sequence number   Timestamp        Packet interval (ms)
174          61170             160              0
176          61171             320 (160 × 2)    20
178          61172             480 (160 × 3)    20
181          61173             640 (160 × 4)    20
183          61174             800 (160 × 5)    20

tween them is about 0.02 s or 20 ms. We list the sequence number, the timestamp
and the packet interval for the first five packets in Table 4.3 to further illustrate the
concept of the timestamp for a voice call.
If we look at the payload length, we get a value of 160 bytes, and the payload
length is the same for all the packets (how to obtain the payload length is
explained later in this section). This further
demonstrates that it is the PCM codec with 20 ms of speech per packet (160 samples
are equivalent to 160 bytes when each sample is coded into 8 bits or 1 byte).
If we look at the RTP header in more detail for the first packet illustrated in
Fig. 4.4 (Packet No. 174), the Marker bit is set to ONE (True). In Fig. 4.5, only
the first packet (Packet No. 174) has “Mark” listed in the last column; the re-
maining packets do not (i.e., their Marker bits are set to ZERO). As we
explained in the previous section, a Marker bit of ONE indicates the beginning of
a speech talkspurt. In this example, the marker bit is only set at the beginning of
the session, and all the remaining packets within the session have the marker bit
set to ZERO, indicating that they belong to the same talkspurt. This is because
no voice activity detection mechanism was enabled in this VoIP testbed. This
can also be seen from the steady timestamp changes in the trace data: all packets
have the same 160-sample (20 ms) difference in the timestamp clock. If voice ac-
tivity detection is enabled, packets for silence periods of speech do not need to be
transmitted; this results in a larger gap in timestamps (more than 160 samples), and
the length of the gap will depend on the length of the silence period.

4.3.3 RTP Payload and Bandwidth Calculation for VoIP

For the RTP payload length calculation, you may expand the header for Ethernet,
IP and UDP as shown in Fig. 4.6. The payload length can be obtained by deducting
protocol header size from the Ethernet frame size of 214 bytes (from the 1st line,
214 bytes on wire), from the IP length of 200 bytes (total length from IP header),
or from UDP packet length of 180 (the length field of 180 from the UDP header).
The differences between them are due to the length of Ethernet header (14 bytes), IP
header (20 bytes) and UDP header (8 bytes) as shown in Fig. 4.7. The length of RTP
header is 12 bytes. The total length of IP/UDP/RTP header is 40 bytes (20 + 8 + 12).
Let us now look at how to calculate the payload length in this trace example.

Fig. 4.6 RTP trace example for voice from Wireshark—more header information

Fig. 4.7 IP/UDP/RTP headers and payload length

• Ethernet Header: 14 bytes
• IP Header: 20 bytes
• UDP Header: 8 bytes
• RTP Header: 12 bytes
• Payload length: 214 − 14 − 20 − 8 − 12 = 160 bytes, or deducted from the IP packet
length: 200 − 20 − 8 − 12 = 160 bytes, or deducted from the UDP packet length:
180 − 8 − 12 = 160 bytes

Figure 4.8 further shows the payload information for the packet No. 174. The
codec used is PCM-μ law. You can also double check the payload length of
160 bytes from the bottom panel (each row shows 16 bytes of data and there are
a total of 10 rows).
In the above example, a packet’s payload length is 160 bytes, whereas the pro-
tocol header length is 54 bytes (14 + 20 + 8 + 12) at the Ethernet level, or 40 bytes

Fig. 4.8 RTP Payload for PCM codec

at IP level. If we look at the bandwidth usage, for a pure PCM stream, the required
bandwidth is:
160 × 8 (bits) / 20 (ms) = 64 kb/s
This means that for every 20 ms, 160 bytes of data need to be sent out. This is the
required bandwidth for a PCM system as we discussed in Chap. 2. 64 kb/s PCM is
the reference point for all speech compression codecs.
When transmitting an RTP PCM stream, the required bandwidth at the IP level
(named as IP BW) is:
(160 + 40) × 8 (bits) / 20 (ms) = 80 kb/s
This means that for every 20 ms, 200 bytes of data need to be sent out to the chan-
nel/network. Within 200 bytes of data, 160 bytes belong to the payload (PCM data).
Another 40 bytes are headers of IP, UDP and RTP protocols required to send a voice
packet over the Internet. It is clear that the IP BW of 80 kb/s is higher than the pure
PCM bandwidth of 64 kb/s.
You can also calculate the Ethernet bandwidth (Ethernet BW) which is:
(160 + 54) × 8 (bits) / 20 (ms) = 85.6 kb/s

It is clear that the Ethernet BW of 85.6 kb/s is higher than the IP BW because the
overhead due to the Ethernet header has to be taken into account.
The bandwidth efficiency for transmitting PCM voice stream at the IP level is:
Length of payload / Length of packet at IP level = 160 / (160 + 40) = 0.8 (or 80 %)

The bandwidth efficiency at the Ethernet level is:


Length of payload / Length of packet at Ethernet level = 160 / (160 + 54) = 0.75 (or 75 %)
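
These calculations generalise easily. The Python sketch below (illustrative only) computes the required bandwidth and bandwidth efficiency for an arbitrary payload size, packet interval and header overhead; with a 160-byte payload every 20 ms it reproduces the PCM figures above, and the same function can be applied to the G.729 cases in the next subsection.

# Required bandwidth (kb/s) and bandwidth efficiency for a voice packet stream.
def stream_bandwidth(payload_bytes, packet_interval_ms, header_bytes):
    bw_kbps = (payload_bytes + header_bytes) * 8 / packet_interval_ms
    efficiency = payload_bytes / (payload_bytes + header_bytes)
    return bw_kbps, efficiency

# G.711 PCM: 160-byte payload every 20 ms
print(stream_bandwidth(160, 20, header_bytes=40))  # IP level: (80.0, 0.8)
print(stream_bandwidth(160, 20, header_bytes=54))  # Ethernet level: (85.6, ~0.75)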

4.3.4 Illustrative Worked Example


QUESTION: In a VoIP system, 8 kb/s G.729 codec is applied. The packet size
can be configured to include ONE G.729 speech frame in a packet or TWO G.729
speech frames in a packet. Calculate the G.729 payload size, the required G.729 IP
Bandwidth and the bandwidth efficiency in both cases.
SOLUTION: For 8 kb/s G.729 codec, the length of a speech frame is 10 ms, the
codec bits for 10 ms speech frame is 80 bits (8 × 10).
If one packet includes one speech frame, the payload size is 80 bits or 10 bytes.
If one packet includes two speech frames, the payload size is 160 bits (80 × 2) or
20 bytes.
Considering the IP/UDP/RTP header size of 40 bytes, the required IP bandwidth
for one frame one packet case is:
(10 + 40) × 8 (bits) / 10 (ms) = 40 kb/s

The bandwidth efficiency for one frame one packet case is:
10 / (10 + 40) = 0.2 (or 20 %)

The required IP bandwidth for two frames one packet case is:
(20 + 40) × 8 (bits) / 20 (ms) = 24 kb/s

The bandwidth efficiency for two frames one packet case is:
20 / (20 + 40) = 0.33 (or 33 %)

From the above example, we can see that the transmission efficiency when using
one G.729 speech frame per packet is very low (only 20 %), i.e., 80 % of the
transmission bandwidth is used for transmitting overheads (i.e., the IP, UDP and RTP

headers). Increasing the number of speech frames in a packet (e.g., from one frame
per packet to two frames per packet) can increase transmission bandwidth effi-
ciency (from 20 % to 33 % in this example); however, it increases the packetisation
delay from 10 ms to 20 ms and thus the overall delay of the VoIP application. Deciding
how many speech frames to put in an IP packet is therefore a tradeoff. Many
VoIP systems provide a flexible packetisation scheme, such as Skype’s SILK codec
(see Sect. 2.4.9), which can support 1 to 5 speech frames in a packet.
If we compare the transmission efficiency for 8 kb/s G.729 and 64 kb/s G.711,
both with 20 ms of speech per packet, the transmission efficiencies for G.729
and G.711 are 33 % and 80 %, respectively. The lower transmission efficiency for
G.729 is due to its higher speech compression rate and thus its smaller payload size.
It is clear that the bandwidth transmission efficiency depends on which codec is
used, and how many speech frames are put in an IP packet. When you calculate the
required IP bandwidth or bandwidth efficiency, you always need to work out what
the payload size for a selected codec is and what packetisation scheme is used. If
you have any doubts about how to work out the payload size for a codec, you are
suggested to read again the contents in Chap. 2, especially Table 2.2.
Considering the cost of transmission bandwidth in any communication sys-
tem, improving the bandwidth efficiency of VoIP systems means cost savings
and a more competitive position for VoIP service providers. This has motivated the work on
cRTP (RTP Header Compression or Compressed RTP), which can compress the
40 bytes of IP/UDP/RTP headers into 2 or 4 bytes of cRTP header. The concept of cRTP
and the improvement in bandwidth efficiency will be covered in Sect. 4.5.

4.3.5 RTP Header for Video Call Based on Wireshark

Figures 4.9 and 4.10 show an example of RTP trace data for one direction of a video
stream. In Fig. 4.9, the filters of “ip.src == 192.168.0.29 and udp.port == 15624
and rtp” were applied in order to get a view of the video stream sent from the
source station (IP address: 192.168.0.29) to the destination station (IP address:
192.168.0.103), and only rtp packets for video are filtered out.
Compared with the filter command used for voice RTP analysis in the previous
section, a port filter part was also added in order to only filter out the video stream
(in this example, voice and video streams were sent through different port pairs for a
video call scenario). From the figures, it can be seen that the lengths of the video packets
are variable, not constant as in the PCM voice RTP packets. Packet No. 7170
has the longest packet size (1206 bytes, which is equivalent to 1152 bytes of
RTP payload), indicating an I-frame (why 1152 bytes? You should be able
to work it out by yourself now; if not, go
back to the previous section and work out the header lengths for Ethernet, IP, UDP and
RTP). All the other packets shown in the figure have a shorter packet length when
compared with the I-frame packet, indicating possible P-frame packets.
From the RTP header, it is noted that video codec used is H.263 with the payload
type of 34. The sequence number for packet No. 7170 is 53916, then it is incre-
mented by one for each consecutive packet. All the packets have the same SSRC

Fig. 4.9 RTP trace example for video from Wireshark

Fig. 4.10 RTP trace example for video from Wireshark—more RTP information

identifier indicating that they are from the same video source. The timestamp for
the first two packets (No. 7170 and No. 7171) have the same value, indicating that
they belong to the same video frame (the I-frame). This is due to the fact that one I-
frame has to be put into two consecutive packets (too larger to fit into one packet due
to the maximum transfer unit of Ethernet of 1500 bytes). This can also be demon-
strated by the Marker bit, which has the value of Zero for the first part of the I-frame,
and value of One for the second part of the I-frame. For the other P-frames shown
in the figure, each P-frame was put into one IP packet with its marker bit set to One.
From these figures, you can see that the sequence number was incremented by
one for each consecutive packet.
The trace data shown in Fig. 4.10 which was collected from a VoIP testbed based
on X-Lite (details see Chaps. 8 and 9) did not have a constant timestamp increment.

Fig. 4.11 H.263 RTP payload header—1

For example, the timestamp increment from packet No. 7191 to packet No. 7198 is
3330, whereas, the timestamp difference from packet No. 7198 to packet No. 7212
is 2520. This may be due to the detailed implementation of X-Lite and the attached
camera for video capturing. We will show later another trace data example with
constant timestamp increments which is more common in real VoIP systems.
For the 1st packet (No. 7170), its H.263 RTP payload header (RFC 2190) is il-
lustrated in Fig. 4.11. As indicated, the RTP payload header follows IETF RFC
2190 [12] which specifies the payload format for packetising H.263 bitstreams into
RTP packets. There are three modes defined in RFC 2190 which may use different
fragmentation schemes for packetising H.263 streams. Mode A supports fragmenta-
tion at Group of Block (GOB) boundary and modes B and C support fragmentation
at Macroblock (MB) boundary. From Fig. 4.11, we can see that the 1st bit (F bit)
is set to zero (False), indicating that the mode of the payload header is “Mode A”
with only four bytes for the H.263 header. In this mode, the video bitstream is pack-
etised on a Group of Block (GOB) boundary. This can be further explained from the
H.263 payload part (the part illustrated as “ITU-T Recommendation H.263”) which
starts either with H.263 picture start code (0x00000020) or H.263 Group of Block
start code (0x00000001) as shown in Figs. 4.11 (for the packet No. 7170, the 1st
part of the I-frame) and Fig. 4.12 (for the packet No. 7171, the 2nd part of I-frame),
respectively. For the I-frame, the Inter-coded frame bit is set to Zero (False) indi-
cating that it is an intra-coded frame. For the P-frames, this bit is set to One (True).
The SRC (source) format shows that the QCIF (176 × 144) resolution was used in
this video call’s settings. It has to be noted that in this example, one I-frame (for the
QCIF format) has been split into only two RTP packets with a boundary at GoB.
For other video formats, e.g. CIF, one I-frame may be split into several RTP packets
with boundary still at GoBs. The H.263 Group Number field in the H.263 header
will indicate where this part of H.263 payload is located within a picture. For more

Fig. 4.12 H.263 RTP payload header—2

Fig. 4.13 H.263 RTP trace example 2

detailed explanations on RFC 2190 and ITU-T H.263, the reader is suggested to
read [12] and [4].
Figure 4.13 shows another example on video call trace data based on an IMS
client (IMS Communicator).1 The video resolution is CIF (Common Intermediate
Format, 352 × 288). From the figure, it can be seen that one I-frame has been seg-
mented into 15 IP packets starting from packet No. 30487 to packet No. 30504
which all have the same timestamp of 6986, indicating the same media clock timing
for all these packets belonging to the same I-frame. The sequence number is incre-
mented by one for each consecutive packet. The last packet of the I-frame (packet

1 http://imscommunicator.berlios.de/

No. 30504) has the marker bit set as One (True) and all others have the marker bits
set as Zero (False, “MARK” is not shown on the list). For the first two P-frames
shown in the figure, both have been segmented into two IP packets (with same
timestamps). The first part of the P-frame has the marker bit set to Zero and the
second part of the P-frame has the marker bit set to One (True). For the other three
P-frames shown in the figure, each has only one IP packet with the marker bit set
to One. The timestamp increment for each video frame is constant in this trace data
(e.g., 12992 − 6986 = 6006; 18998 − 12992 = 6006; 25004 − 18998 = 6006). As
the media clock rate for H.263 is 90 kHz, the timestamp increment of 6006 indicates
that the video frame rate is about 15 frames per second (90, 000/6006 = 14.985 Hz).
This also indicates that the Picture Clock Frequency (PCF) of 15000/1001 = 14.985
is used for H.263 in this case.
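
The same derivation can be written as a one-line calculation. The Python sketch below (illustrative only) converts an RTP timestamp increment into a frame rate for a 90 kHz video media clock, reproducing the figures above.

# Derive the video frame rate from the RTP timestamp increment (90 kHz media clock).
VIDEO_CLOCK_HZ = 90_000

def frame_rate_from_increment(timestamp_increment):
    return VIDEO_CLOCK_HZ / timestamp_increment

print(frame_rate_from_increment(6006))  # ~14.985 fps (PCF of 15000/1001)
print(frame_rate_from_increment(3000))  # 30.0 fps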

4.4 RTP Control Protocol—RTCP

The RTP Control Protocol (RTCP), also defined in RFC 1889 [10] (now obsolete)
and RFC 3550 [11], is a transport control protocol associated with RTP. It can pro-
vide quality related feedback information for an on-going RTP session, together
with identification information for participants of an RTP session. It can be used
for VoIP quality control and management (e.g., a sender can adjust its sending bit
rate according to received network and VoIP quality information) and can also be
used by third-party monitoring tools. RTCP packets are sent periodically by
each participating member to the other session members, and their bandwidth should be no
more than 5 % of the RTP session bandwidth, which means the session participants
need to control their RTCP sending interval.
As shown in Fig. 4.14, the RTP and RTCP packets are sent in two separate chan-
nels. The RTP channel is used to transfer audio/video data using an even-numbered
UDP port (e.g., x), whereas the RTCP channel is used to transfer control or moni-
toring information using the next odd-numbered UDP port (e.g., x + 1).
If an RTP session is established between two end points as Host A: 192.168.0.29:
19124 and Host B: 192.168.0.67:26448, the associated RTCP channel is also built
up between 192.168.0.29:19125 and 192.168.0.67:26449. The RTP session uses
even-numbered UDP port (19124, and 26448), whereas, the RTCP session will use
the next odd-numbered UDP port (19125, and 26449, in this example). As RTCP
does not use the same channel as the RTP media stream, RTCP is normally regarded
as an out-of-band protocol (out of media band).
There are five different types of RTCP packets, which are SR (Sender Report),
RR (Receiver Report), SDES (Source Description) packet, BYE (Goodbye) packet,
and APP (Application-defined) packet.
• SR: Sender Report—provide feed-forward information about the data sent and
feedback information of reception statistics for all sources from which the
sender receives RTP data.

Fig. 4.14 RTP and RTCP channels

• RR: Receiver Report—provide feedback information of reception statistics for


all participants during the reporting period.
• SDES: Source Description—provide source identifier information, such as
Canonical name (CNAME) of participating source, for example, user name and
email address.
• BYE: Goodbye packet—indicate the end of a participant (e.g., a participant
hangs up a VoIP call).
• APP: Application-defined packet—provide application specific functions.
When a host involved in an RTP session sends an RTCP packet, it normally
sends a compound RTCP packet which contains several different types of RTCP
packets. For example, one RTCP packet could contain Sender Report and Source
Description; or Receiver Report and Source Description; or Receiver Report, Source
Description and BYE packets.

4.4.1 RTCP Sender Report and Example

The format of Sender Report (SR) is shown in Fig. 4.15. It contains three sections in-
cluding header part, sender information part and reception report blocks (for sources
from 1 to n).
The SR’s header part includes the following fields:
• V: Version (2 bits), version 2 for RFC 1889.
• P: Padding (1 bit), zero (False) for no padding part or one (True) with padding.
• RC: Reception Report Count (5 bits), indicating how many reception report
blocks are included. The number can be from 0 to 31, which means it can con-
tain zero blocks or up to 31 reception report blocks.
• PT: Packet Type (8 bits), PT = 200 for Sender Report.
• Length: Packet Length (16 bits), the length of this RTCP Sender Report.
• SSRC sender (32 bits): sender source identifier.
The SR’s sender information part includes the following:

Fig. 4.15 RTCP Sender Report (SR)

• NTP timestamp (64 bits): consists of MSW (most significant word) and LSW
(least significant word) of the NTP (Network Time Protocol) timestamp. MSW
and LSW form 8 bytes NTP timestamp, e.g. Nov 22, 2011 14:52:26.593000000
UTC. This reflects the time when this RTCP packet is generated. It can be used
to calculate Round Trip Time (RTT).
• RTP timestamp (32 bits): the RTP timestamp for the RTP packet just before this
RTCP packet is sent (from the same sender). It shows where the sender’s sampling
clock (RTP timestamp) is at the moment this RTCP sender report is issued.
This is normally used for intra- and inter-media synchronisation.
• Sender’s packet count (32 bits): the total number of RTP packets transmitted
from the sender since starting transmission up until the moment this sender
report was generated.
• Sender’s octet count (32 bits): the total number of RTP payload octets (bytes)
sent since the beginning of the transmission until the moment this RTCP re-
port was sent. For example, if the sender’s packet count = 168 and each packet’s RTP
length is 172 bytes (160 bytes of PCM payload + 12 bytes of RTP header), then the

total sender’s RTP payload octet is 168 × 160 = 26880 bytes. This field can be
used to estimate the average payload data rate.
The SR’s report block (e.g., report block 1) includes the following fields:
• SSRC_1 (32 bits): SSRC of the 1st source (if there are only two hosts involved
in a VoIP session, this will be the SSRC of the receiver).
• Fraction Lost (8 bits): RTP packet lost fraction since the previous SR was sent,
which is defined by the number of packets lost divided by the number of packets
expected.
• Cumulative number of packet lost (24 bits): the total number of RTP packets
lost since the start of the transmission.
• Extended highest sequence number received (32 bits): the highest sequence
number received, together with the first sequence number received, are used
to compute the number of packets expected.
• Interarrival jitter (32 bits): estimation of interarrival jitter. Details about how
jitter is estimated will be covered in Chap. 6.
• Time of last sender report (LSR): 32 bits, the timestamp (the middle 32 bits of
the 64-bit NTP timestamp) of the most recent Sender Report received.
• Delay since last sender report (DLSR): 32 bits, the delay between the time when
the last sender report was received and the time when this reception report was
generated. DLSR and LSR are used to estimate the Round Trip Time (RTT); a sketch
of this estimation is given after this list.
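
As a concrete illustration of how LSR and DLSR are used, the Python sketch below (illustrative only; the timestamp values are made up) applies the usual RTT estimation: the RTT is the arrival time of the report block minus LSR and minus DLSR, all expressed in the 32-bit "middle of the NTP timestamp" format, where the low 16 bits are fractions of a second.

# Estimate Round Trip Time from an RTCP report block using LSR and DLSR.
# All three values are in the 32-bit "middle of NTP timestamp" format:
# upper 16 bits = seconds, lower 16 bits = fraction of a second (1/65536 s units).
def rtt_seconds(arrival_ntp32, lsr, dlsr):
    rtt_units = arrival_ntp32 - lsr - dlsr          # still in 1/65536-second units
    return rtt_units / 65536.0

# Hypothetical values: the report arrived 0.5 s after the SR it refers to was sent,
# and the remote end held the SR for 0.3 s before reporting back.
lsr = 0x0001_0000           # SR sent at t = 1.0 s
dlsr = int(0.3 * 65536)     # remote held the SR for 0.3 s
arrival = int(1.5 * 65536)  # report block arrived at t = 1.5 s
print(rtt_seconds(arrival, lsr, dlsr))  # ~0.2 s round trip time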
If there are only two hosts involved in a VoIP session, there will be only one
report block (i.e., report block 1) which will provide feedback information for the
1st source, or the receiver in this case. The QoS information (e.g., fraction lost,
interarrival jitter) regarding the VoIP session can be used for VoIP quality control
and management.
If there are a total of N participants involved in a VoIP session (e.g., in a VoIP
conference), there will be N − 1 report blocks from block 1 (source 1) to block
N − 1 (source N − 1).
Figure 4.16 shows an example of RTCP Sender Report (SR) from Wireshark.
Please note that the fraction lost shown in Wireshark is expressed as 14/256, i.e., the
raw 8-bit fixed-point field value is 14. Dividing this value by 256 gives the fraction
loss rate, which is about 5.5 % in this case.

4.4.2 RTCP Receiver Report and Example


Unlike the SR, the Receiver Report (RR), as shown in Fig. 4.17, contains only two
sections, which are the header part and the receiver report blocks (for sources
from 1 to n). Compared with the SR as shown in Fig. 4.15, the difference between
header and report blocks sections is only the packet type (PT = 201 for RR, whereas
PT = 200 for SR). When a host is actively involved in sending RTP data, it will send
Sender Report. Otherwise, it will send Receiver Report (RR).
The fields of header part of RR include:
• V: Version (2 bits), version 2 for RFC 1889.
• P: Padding (1 bit), zero (False) for no padding part or one (True) with padding.

Fig. 4.16 RTCP Sender Report from Wireshark

• RC: Reception Report Count (5 bits), indicating the number of Reception Re-
port blocks contained within the RR.
• PT: Packet Type (8 bits), PT = 201 for Receiver Report.
• Length: Packet Length (16 bits) of the Receiver Report.
• SSRC sender: (32 bits) sender source identifier.
Figure 4.18 shows an example of RTCP Receiver Report (RR) from Wireshark.

4.4.3 RTCP Source Description and Example

The format of RTCP Source Description is illustrated in Fig. 4.19. It contains the
following fields:
• V: Version (2 bits), version 2 for RFC 1889.
• P: Padding (1 bit), zero (False) for no padding part or one (True) with padding.
• SC: Source Count (5 bits), the count of the number of sources involved. The
count is 0 to 31.
• PT: Packet Type (8 bits), PT = 202 for source description.
• Length: Packet Length (16 bits) of the Source Description packet.
• SSRC/CSRC_1: (32 bits) sender source identifier and 1st contributing source
identifier.
• SDES items: each item consists of a Type, such as CNAME (Canonical Name), for
example, user@domain or id@host, a Length for that type and the Text content,
given for the sender and the 1st contributing source. SDES items can contain
information such as name, e-mail, phone, location or notes.

Fig. 4.17 RTCP Receiver Report (RR)

Fig. 4.18 RTCP Receiver Report from Wireshark

Figure 4.20 shows an example of an RTCP Source Description packet from Wire-
shark. Please note that an RTCP Sender Report or Receiver Report is always listed
before the Source Description packet. The Chunk 1 part of the data includes the
SSRC/CSRC and the SDES items. In this example, the SDES items have three parts,
each normally containing Type, Length, and Text.

Fig. 4.19 RTCP Source Description (SD)

Fig. 4.20 RTCP Source Description from Wireshark

4.4.4 RTCP BYE Packet and Example

The format of the RTCP Goodbye (BYE) packet is illustrated in Fig. 4.21. It contains
the following fields:
• V: Version (2 bits), version 2 for RFC 1889.
• P: Padding (1 bit), zero (False) for no padding part or one (True) with padding.
• SC: Source Count (5 bits), indicating the number of SSRC identifiers included.

Fig. 4.21 RTCP Goodbye Packet (BYE)

Fig. 4.22 RTCP Goodbye Description from Wireshark

• PT: Packet Type (8 bits), PT = 203 for BYE packet.


• Length: Length (16 bits) of the BYE packet.
• SSRC/CSRC_1: (32 bits) sender source identifier and 1st contributing source
identifier.
• ...
• SSRC/CSRC_n: (32 bits) sender source identifier and the nth contributing
source identifier.
• Length and Reason for leaving (optional fields): the length of the reason-for-
leaving text, followed by the reason for leaving, for example, “camera malfunc-
tion”.
Figure 4.22 shows an example of the BYE packet based on Wireshark. Please
note that in this figure, the BYE packet is listed after the RTCP RR and the RTCP
SD packets in a compound RTCP packet and there is no optional field (i.e., reason
for leaving) in this example.

4.4.5 Extended RTCP Report—RTCP XR for VoIP Metrics

The RTCP Sender Report (SR) and Receiver Report (RR) only contain basic QoS
information regarding a media session, such as packet loss rate, and interarrival jitter
value. In order to provide more information regarding the underlying network QoS
and VoIP QoE metrics such as Mean Opinion Score (MOS) for quality monitoring
purposes, extended RTCP report type (XR) was defined by RFC 3611 [2] in 2003.

Fig. 4.23 Extended RTCP report—VoIP Metrics

The VoIP metrics provided by the RTCP XR are shown in Fig. 4.23.
According to their functions, these metrics are divided into the following six
categories. The detailed descriptions about these metrics (e.g., burst characteristics,
R-factor, MOS-LQ and MOS-CQ) will be covered in Chap. 6.
• Loss and Discard: includes the loss rate (due to packet loss in the network), the
discard rate (packets that arrive too late and are discarded at the receiver), and
burst/gap density, burst/gap duration and Gmin (metrics that describe the characteristics
of burst packet losses). A Gmin of 16 is the recommended minimum number of
consecutively received packets (with no loss) required for the transition from a
burst state to a gap state.
• Delay: includes the round trip time (RTT) and the delay introduced by the end system,
including encoding delay, packetisation delay, decoding delay and playout buffer
delay.
• Signal related: includes the signal level (or speech signal level), the noise level (or
background noise level during silence periods) and the Residual Echo Return Loss (RERL).
• VoIP Call Quality: includes the R Factor, the extended R Factor, and MOS scores for
listening quality (LQ) and conversational quality (CQ).
• Configuration: Rx config (receiver configuration byte) reflects the receiver
configuration: what kind of packet loss concealment (PLC) method is used,
whether an adaptive or fixed jitter buffer is used, and, for an adaptive jitter buffer,
the jitter buffer adjustment rate.
• Jitter Buffer: includes jitter buffer values, such as the jitter buffer nominal delay,
the jitter buffer maximum delay, and the jitter buffer absolute maximum delay.

4.5 Compressed RTP—cRTP

4.5.1 Basic Concept of Compressed RTP—cRTP

The compressed RTP (cRTP) refers to a technique to compress IP/UDP/RTP head-
ers (e.g., from 40 bytes to 2–4 bytes) as shown in Fig. 4.24. cRTP was first defined

Fig. 4.24 Compressed IP/UDP/RTP Header

in RFC 2508 in 1999 [1] to improve transmission efficiency while sending audio
or video over low-speed serial links, such as dial-up modems at 14.4 or 28.8 kb/s.
It compresses the 40 bytes of IP/UDP/RTP headers to either 2 bytes when there is no
UDP checksum or 4 bytes when the UDP checksum is present. The compressed
cRTP header is decompressed back to the original full IP/UDP/RTP headers at the re-
ceiver side before going through the RTP, UDP and IP level packet header processing.
The idea of compression is based on the concept that the header fields in IP,
UDP and RTP are either constant between consecutive packets or the differences
between these fields are constant or very small. For example, the RTP header fields
of SSRC (Synchronisation Source Identifier) and the PT (Payload Type) are constant
for packets from the same voice or video source as you can see from Fig. 4.5. The
differences between the RTP Sequence number and Timestamp fields of consecutive
packets are also constant. From Fig. 4.5, you can see that the
difference between the sequence numbers of two consecutive packets is one, whereas the
difference between their timestamps is 160 (samples). Based
on this observation, compressed RTP works by sending a packet with full headers
at the initial stage; after that, only the updates (deltas) between the headers of consecutive
packets are sent to the decompressor at the receiver side. The decompressor reconstructs
the full headers from the previously received full header and these updates. Full-header
packets are sent periodically in order to avoid desynchronisation between compressor and
decompressor due to packet loss. To further improve the performance of cRTP over
links with packet loss, packet reordering and long delay, enhanced cRTP was also
proposed in RFC 3545 in 2003 [5], which specifies methods to prevent context corruption
and to improve the resynchronisation process between compressor and decompressor
when the scheme falls out of synchronisation due to packet loss. For more
details on the principles of cRTP, the reader can read Perkins’s book on RTP [7].
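
As a rough illustration of why these per-packet updates compress so well, the Python sketch below (a simplified toy model, not the actual cRTP encoding) shows that once the receiver knows one full header, each subsequent header of a steady G.711 stream is described by two small constants (sequence +1, timestamp +160) and an unchanged SSRC.

# Toy illustration of the cRTP premise: consecutive RTP headers differ only by
# small, constant deltas, so only those deltas need to be transmitted.
def header_delta(prev, curr):
    """Return the per-field differences between two (simplified) RTP headers."""
    return {field: curr[field] - prev[field] for field in ("seq", "timestamp", "ssrc")}

headers = [
    {"seq": 61170, "timestamp": 160, "ssrc": 0x1234},
    {"seq": 61171, "timestamp": 320, "ssrc": 0x1234},
    {"seq": 61172, "timestamp": 480, "ssrc": 0x1234},
]
for prev, curr in zip(headers, headers[1:]):
    print(header_delta(prev, curr))   # {'seq': 1, 'timestamp': 160, 'ssrc': 0} each time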
In the following section, we illustrate a worked example that calculates the
transmission efficiency of a VoIP system and demonstrates the efficiency improvement
achieved by using cRTP.

4.5.2 Illustrative Worked Example


QUESTION: In a VoIP system deploying 6.3 kb/s G.723.1 codec, the packet
size is set as one speech frame per packet. In order to improve transmission effi-
ciency, it is suggested to use Compressed RTP (cRTP, with 4 bytes for compressed
IP/UDP/RTP headers) to replace the current RTP scheme. Calculate the required
IP Bandwidth and the transmission bandwidth efficiency for both RTP and cRTP
schemes.

SOLUTION: For the 6.3 kb/s G.723.1 codec, the length of a speech frame is 30 ms,
which results in 189 coded bits (6.3 × 30). A payload of 189 bits is
equivalent to 24 bytes (padded with three zero bits at the end of the last byte).
Considering the IP/UDP/RTP header size of 40 bytes, the required IP bandwidth
for G.723.1 RTP is:
(24 + 40) × 8 (bits) / 30 (ms) = 17 kb/s

The bandwidth efficiency for G.723.1 RTP scheme is:


24 / (24 + 40) = 0.375 (or 37.5 %)

Considering the compressed IP/UDP/RTP header of 4 bytes, the required IP
bandwidth for G.723.1 cRTP is:
(24 + 4) × 8 (bits) / 30 (ms) = 7.46 kb/s

The bandwidth efficiency for G.723.1 cRTP scheme is:


24 / (24 + 4) = 0.86 (or 86 %)

From the above example, it is clear that the cRTP scheme can reduce the required IP
bandwidth (from 17 kb/s to 7.46 kb/s in this example) and improve
the transmission bandwidth efficiency (from 37.5 % to 86 %).

4.6 Summary
In this chapter, we discussed media transport for VoIP, which mainly involves two
protocols: RTP and its associated control protocol, RTCP. We started the chapter with
why real-time VoIP applications are based on UDP for data transfer, the prob-
lems that arise when UDP is used, and the need for an application layer protocol,
namely RTP, to facilitate VoIP data transfer. We explained the RTP
header in detail and showed examples for voice and video from Wireshark,
based on real VoIP trace data collected in the lab. We also explained the concept

of IP bandwidth, Ethernet bandwidth and bandwidth efficiency, and further demon-
strated, with illustrative worked examples, how to calculate IP bandwidth and/or
bandwidth efficiency when different codecs or different packet sizes are applied.
In this chapter, we also explained the RTCP protocol, the control protocol associated
with RTP, which is mainly used for session management and quality control/management
purposes. We discussed the different RTCP reports and showed examples of these reports
from real trace data collected and viewed in Wireshark. Further, the extended RTCP
report (RTCP XR), which carries VoIP quality metrics, was discussed; the detailed
meaning of these metrics will be covered in Chap. 6. We concluded the chapter with
Compressed RTP (cRTP), a technique used to compress the IP/UDP/RTP headers
to improve transmission bandwidth efficiency, and demonstrated the cal-
culation of the required transmission bandwidth and bandwidth efficiency with cRTP
compared with normal RTP.

4.7 Problems

1. Why is RTCP regarded as an out-of-band protocol?


2. A VoIP session is set up between two hosts. Via Wireshark, it shows that
the RTP session is established between 192.168.0.47:9060 and 192.168.0.32:
12300, what are the UDP port numbers used in the RTCP session?
3. There are five different types of RTCP packets. Describe them.
4. What is the M (Marker) bit used for in RTP header? Explain its usage for
voice or video calls.
5. Explain the sequence number and timestamp fields in RTP header. Explain
their usages for voice and video calls.
6. Describe the main QoS parameters/metrics in RTCP reports. What is the pur-
pose for RTCP Extended Report?
7. For the G.729 codec with an 8 kb/s transmission rate and a frame length of
10 ms, the RTP timestamp clock rate is 8 kHz. Assuming the packet size is
2 frames/packet, what is the increment step for RTP timestamp for each RTP
packet? If the packet size is 1 frame/packet, what is the increment step for
RTP timestamp?
8. A VoIP system uses H.263 for video codec and frame rate of 30 frames per
second. It is known that the media clock rate is 90 kHz. What is the increment
step for RTP timestamp between consecutive video frames? If several RTP
packets belong to the same video frame (e.g., I-frame or P-frame), are the
timestamps for these RTP packets the same?
9. For video call applications, one video I-frame may be put into several RTP
packets, how do you know that these RTP packets belong to the same I-frame?
10. Describe the difference between SR and RR. If a participant is only receiving
RTP data (e.g., only listening), what report (SR or RR) does the participant
send? If the participant starts to send RTP data, what report (SR or RR) does
the participant send now?

11. During a VoIP application, you have decided to change its codec from 64 kb/s
PCM to 8 kb/s G.729. What will be the IP bandwidth usage change due to this
codec change (assuming both use 30 ms of speech in a packet)? If the applica-
tion developer has decided to use cRTP with only four bytes for compressed
IP/UDP/RTP headers instead of 40 bytes in normal RTP case, what will be the
bandwidth usage change (if codec is still G.711 64 kb/s and 30 ms of speech in
a packet)? From your results, which method (i.e., change codec from G.711 to
G.729 or change from RTP to cRTP) is more efficient from bandwidth usage
point of view? You need to show the process of your solution.
12. It is known that G.711.1 PCM-WB is used in a VoIP system. Calculate the
IP bandwidth usage for Layer 0, Layer 1 and Layer 2 bitstream, respectively.
What is the overall IP bandwidth for G.711.1 at 96 kb/s?
13. In general principles for RTCP, it is required that the bandwidth for RTCP
transmission should be no more than 5 % of RTP session bandwidth, how do
session participants measure and control their RTCP transmission rate? When
the number of participants for a VoIP session increases, will the RTCP packet
size get bigger? How about RTCP transmission bandwidth consumption?
14. Why is RTCP message always sent in a compound packet, or bundled packet
(including different types of RTCP packets)?
15. Describe QoS metrics in RTCP reports.
16. Describe QoE metrics in the Extended RTCP report.

References
1. Casner S, Jacobson V (1999) Compressing IP/UDP/RTP headers for low-speed serial links.
IETF RFC 2508
2. Friedman T, Caceres R, Clark A (2003) RTP control protocol extended reports (RTCP XR).
IETF RFC 3611
3. Information Sciences Institute (1981) Transmission control protocol. IETF RFC 793
4. ITU-T (1996) Video coding for low bit rate communication. ITU-T H.263
5. Koren T, Casner S, et al (2003) Enhanced compressed RTP (CRTP) for links with high
delay, packet loss and reordering. IETF RFC 3545
6. Kurose JF, Ross KW (2010) Computer networking, a top–down approach, 5th edn. Pearson
Education, Boston. ISBN-10:0-13-136548-7
7. Perkins C (2003) RTP: audio and video for the Internet. Addison-Wesley, Reading. ISBN:0-
672-32249-8
8. Postel J (1980) User datagram protocol. IETF RFC 768
9. Schulzrinne H, Casner S (2003) RTP profile for audio and video conferences with minimal
control. IETF RFC 3551
10. Schulzrinne H, Casner S, et al. (1996) RTP: a transport protocol for real-time applications.
IETF RFC 1889
11. Schulzrinne H, Casner S, et al (2003) RTP: a transport protocol for real-time applications.
IETF RFC 3550
12. Zhu C (1997) RTP payload format for H.263 video streams. IETF RFC 2190
5 VoIP Signalling—SIP

Traditional circuit switching networks such as Public Switched Telephone Network


(PSTN) were designed to use dedicated circuits for connecting end to end voice
calls. Although circuit switching networks remain reliable and better in quality, they
are highly inefficient in resource utilization. This is because the circuits remain ded-
icated throughout the call duration and idle until the next call. Despite their short-
comings in QoS, the packet switching networks are more efficient than the circuit
switching networks in resource utilization. Since packet switching networks were
not originally designed for real-time applications, session control protocols such as
the Session Initiation Protocol (SIP) have been developed to enable real-time appli-
cations such as voice and video calls over packet switching networks.

5.1 What is Session Initiation Protocol?

Session Initiation Protocol (SIP) is an ASCII-based protocol developed by the IETF


Working Group for creating, modifying, and terminating interactive user sessions
between two or more participants. These sessions can involve multimedia elements
such as video, voice, instant messaging, online games, and virtual reality.
SIP was published in March 1999 under RFC 2543 [10], and in November 2000 it
was accepted as a 3GPP [2] signalling protocol and permanent element of the IMS
architecture. It is one of the leading signalling protocols for Voice over IP, along
with H.323 [11].
In recent years, there have been SIP extensions such as the Session Initiation
Protocol for Instant Messaging and Presence Leveraging Extensions (SIMPLE) [1].
This is an instant messaging (IM) and presence protocol suite [6] which is used to
handle subscriptions to events and deliver instant messages. The SIP protocol is a
TCP/IP based application layer protocol designed to be independent of the underly-
ing transport layer; it combines elements of the Hypertext Transfer Protocol (HTTP)
[4] and the Simple Mail Transfer Protocol (SMTP) [12].


Fig. 5.1 SIP architecture

5.1.1 SIP Network Elements

Clients and Servers are the two main devices defined in the SIP architecture (cf.,
Fig. 5.1). A client is described in RFC 3261 [14] as a network element that sends
SIP requests and receives SIP responses. A client may or may not interact with a
human being. Similarly, the server is a network element that receives requests in
order to service them, and then responds to those requests.
Generally, SIP network elements have the following capabilities:
• SIP determines the location of the UAs. This is achieved during UAs registration
process. The registration process allows SIP to easily find the IP addresses of
UAs.
• SIP determines the availability of the UAs. Application servers are used to keep
the availability of UAs. UAs have options to forward calls to voice mails if they
are not available. UAs can create profiles on how to route calls when they are
not available, on multiple locations (e.g., office, home and mobile) or on several
devices (laptop, mobile and desktop computer).
• SIP establishes a session between UAs. SIP can create sessions by using SIP
methods such as Invite.
• In addition to establishing sessions between UAs, SIP has the capability to man-
age sessions. For instance, if a UA is registered on several devices, a call can be
seamlessly transferred between devices (e.g., from a mobile phone to a laptop).
• SIP determines UAs capabilities. This is mainly related to media capabilities
such as voice/video codecs of the participating UAs. These capabilities are ex-
tracted from the Session Description Protocol (SDP). UAs capabilities can be

Fig. 5.2 Cisco 7960 SIP hard phone

Fig. 5.3 X-Lite 4 SIP soft phone

extended to encryption, decryption algorithms and SIP applications. For exam-


ple, a white-board application will not work on an IP phone device but will
work on a personal computer.

5.1.2 User Agent

A User Agent (UA) is a piece of software that runs in a computer or embedded in


devices like mobile phones. The UA is used to create SIP requests and responds to
them. The UA can perform the role of User Agent Client and User Agent Server.
User Agent Client (UAC)
UAC is a logical function that creates a SIP request, and then uses the client func-
tionality to send that request. SIP phones such as hard and soft phones (Figs. 5.2
and 5.3) are examples of UACs.

Fig. 5.4 Examples of UAS

It is possible that a device can have both UAC and UAS functions, and sometimes
a session will have both UAC and UAS roles. This happens if a device wants to add
another participant in an ongoing call.

User Agent Server (UAS)


UAS is a logical function that responds to requests from other UACs. The UAS can
reject or accept UAC’s requests depending on its capabilities such as codec types
and QoS. The main UASs are depicted in Fig. 5.4.
If a request is received from the UAC, the UAS will first find if the SIP method
in the request is supported by the destination device. For instance, if the MESSAGE
method is sent, the UAS will determine if the destination device supports messaging;
if not, the UAS will respond indicating that the method is not supported.

5.1.3 Proxy Server

A Proxy Server (proxy) is responsible for receiving SIP requests and responses and
forward them to their particular destination on behalf of the UAs. The proxy simply
acts as a router of SIP messages. There are three types of proxies:
1. Stateful Proxy: A stateful proxy maintains the state of every dialog it is servic-
ing. It remembers call identifiers of each session and receives all responses if a
session status has changed or ended.
2. Stateless Proxy: A stateless proxy does not maintain any state of any dialog it
is servicing. It simply forwards SIP requests and responses.
3. Forking Proxy is responsible for forwarding SIP requests to more than one
destination. There are two types of forking proxy, parallel and sequential. In
the case of the parallel proxy, a given user can have several UAs available
and registered at different locations such as home, office and on a mobile. The
parallel proxy will therefore call all three user locations simultaneously. In the
case of a sequential proxy, it will try to call the different UAs one after the other,
each for a certain period of time, until one is picked up. A forking proxy must be a
stateful proxy. Figure 5.5 illustrates how a sequential forking proxy operates.
A SIP message destined for Alice is received at the forking proxy. Since Alice
has three registered UAs, at home, at the office and on a mobile, the proxy server
first rings Alice at home, but the call is not answered after a certain duration of time,

Fig. 5.5 Forking proxy

Fig. 5.6 Redirect server

so the proxy rings Alice at the office, which is not answered either. Finally,
the forking proxy rings Alice on the mobile and the call is answered.

5.1.4 Redirect Server


Redirect servers are used to provide alternative addresses for UAs. Redirect servers
are very useful in providing alternative addresses when some proxies become faulty
or overloaded. Figure 5.6 depicts the operation of the redirect server. Bob, who is at
home, sends a SIP message to Alice at home, but the redirect server returns an
alternative address for Alice, who can be reached at the office.

5.1.5 Registrar
A registrar is responsible for authenticating and recording UAs. A UA sends a REG-
ISTER SIP message (cf., Fig. 5.7) to a registrar when it is switched on or changes its

Fig. 5.7 Registrar

Fig. 5.8 Location server role with registrar and proxy server

IP address. After receiving the REGISTER SIP message from the UA, the registrar
can either accept the registration or challenge it by rejecting the first attempt. This
challenge will force the UA to send its credentials for verification.

5.1.6 Location Server


The function of the location server is to provide subscriber addresses to proxies.
The location server mainly gets its data from DNS and is normally integrated in a
SIP server providing registrar and proxy services. Figure 5.8 illustrates the role of
the location server. Alice's location information is updated in the location server by
the registrar once Alice is registered. The location of Alice is queried at the location
server by the proxy when Bob wants to communicate with Alice.

5.2 SIP Protocol Structure


SIP is a text-based protocol which makes it easier to understand. SIP is a layered
protocol whereby each SIP network element must be able to support the first two
layers. There are four layers (cf., Fig. 5.9):

Fig. 5.9 SIP Layers

Table 5.1 SIP message format

Request message    Response message
Request-Line       Status-Line
Header fields      Header fields
Empty line         Empty line
Message body       Message body

1. Syntax and encoding layer: It is the lowest layer. It is a set of rules that de-
fines the format and structure of a SIP message. Syntax and encoding layer is
mandatory for each SIP network element.
2. Transport layer: It defines how SIP network elements send and receive SIP
requests and responses. All SIP network elements must support transport layer.
3. Transaction layer: It is responsible for handling all SIP transactions. SIP trans-
actions can be defined as generated SIP requests and responses by UAs. Trans-
action layer handles retransmissions, timeouts and correlation between SIP re-
quests and responses. The client's transaction component is called the client trans-
action, while that of the server is called the server transaction. The transaction layer is only
available in UAs and stateful proxies.
4. Transaction user layer: It creates client transactions such as an INVITE with
destination IP address, port number and transport.

5.2.1 SIP Message Format

SIP message format is text-based and very similar to HTTP/1.1. Requests and re-
sponses are the two types of SIP messages whereby UAC sends requests and UAS
replies with responses. SIP URIs [5] are used to identify SIP UAs; they are made
up of username and domain name. SIP URIs can have other parameters such as
transport. sip:alice@home and sip:alice@home; transport = udp are two SIP URIs
without transport and with transport, respectively. SIPS URIs use Transport Layer
Security (TLS) [7] for message security. sips:alice@home is an example of a SIPS
URI. The SIP message format for both request and response type is depicted in
Table 5.1.

Table 5.2 SIP method names and their meanings


SIP method name Description
REGISTER [10] Register UA
INVITE [15] Invite UA to a session (establish a session)
ACK [10] Acknowledge receipt of a request
BYE [10] Terminate a session or transaction
CANCEL [10] Cancel a transaction
NOTIFY [10] Notify UA about a particular event
UPDATE [13] Update session information without sending a re-INVITE
MESSAGE [10] Indicate an instant message
SUBSCRIBE [10] Subscribe to a particular event
INFO [8] Send optional application layer information such as account balance
information and wireless signal strength
OPTIONS [10] Queries a UAS about its capabilities

Request-Line
The Request-Line contains SIP method name, Request-URI and SIP protocol
version. An example of the Request-Line is “INVITE sip:alice@office SIP/2.0”.
The SIP method defines the purpose of the request; in this example INVITE is
the SIP method. The Request-URI shows the request's destination, which is
alice@office. The SIP protocol version is 2.0. Table 5.2 lists the main SIP methods.
The request identifies the type of session that is being requested by the UAC. The
requirements for supporting a session such as payload types and encoding param-
eters are included as part of the UAC's request. However, some requests, such as
the MESSAGE method, do not require a session or dialog; the receiving UA may
choose to accept the message or not through a response.
The SIP request of interest is INVITE [15], this request invites UAs to participate
in a session. Its body contains the session description in the form of SDP. This
request includes a unique identifier for the call, the destination and originating IP
addresses, and information about the session type to be established. An example of
the INVITE request is depicted in Fig. 5.10. The first line contains the method name
which is INVITE. The rest of the lines that follow are header fields. The header
fields are,

• Via: This field records the path that a request takes to reach the destination, the
same path should be taken by all corresponding responses.
• To: This field contains the URI of the destination UA, in this scenario, the value
is <sip:[email protected]>.
• From: This field contains the URI of the originating UA, in this case it is From:
<sip:[email protected]>.

Fig. 5.10 INVITE method from Wireshark

• CSeq: This field contains sequence number and SIP method name. It is used to
match requests and responses. At the start of a transaction, the first message is
given a random integer sequence number, then there will be an increment of one
for each new message. In this example the sequence number is 2.
• Contact: This field identifies the URI that should be used to contact the UA who
created the request. In this example, the contact value is <sip:[email protected].
208.151:40332>.
• Content-Type: This field is used to identify the content media type sent to an-
other UA in a message body. In this example, the content type is application/sdp.
• Call-ID: This field provides a unique identifier for a SIP message. This allows
UAS to keep track of each session.
• Max-Forwards: This field is used to limit the number of hops a request can tra-
verse. It is decremented by one at each hop. In this scenario, the Max-Forwards
value is 70.
• Allow: This field is used by the UAC to determine which SIP methods are sup-
ported by the UAS. For instance, a query from UAC to UAS to find out which
methods are supported can be replied by the UAS in a response containing:
ALLOW: INVITE, CANCEL, BYE, SUBSCRIBE.
• Content-Length: This contains an octet (byte) count of the message body. It is
402B in this example.
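
To make the request layout concrete, the short Python sketch below assembles a minimal INVITE request from the header fields discussed above. All addresses, tags and identifiers in the sketch are hypothetical placeholders (they are not taken from the capture in Fig. 5.10); a real UAC would generate them according to RFC 3261.

# A minimal sketch of a SIP INVITE request assembled from header fields.
# All addresses, tags and identifiers below are hypothetical examples.

def build_invite(from_uri, to_uri, contact_uri, call_id, cseq, sdp_body):
    """Return an INVITE request as text (CRLF line endings per RFC 3261)."""
    request_line = f"INVITE {to_uri} SIP/2.0"
    headers = [
        "Via: SIP/2.0/UDP 192.168.2.4:5060;branch=z9hG4bK776asdhds",
        "Max-Forwards: 70",
        f"To: <{to_uri}>",
        f"From: <{from_uri}>;tag=1928301774",
        f"Call-ID: {call_id}",
        f"CSeq: {cseq} INVITE",
        f"Contact: <{contact_uri}>",
        "Content-Type: application/sdp",
        f"Content-Length: {len(sdp_body.encode())}",
    ]
    return "\r\n".join([request_line] + headers) + "\r\n\r\n" + sdp_body

sdp = "v=0\r\no=alice 2890844526 2890844526 IN IP4 192.168.2.4\r\ns=-\r\n"
print(build_invite("sip:alice@home", "sip:bob@office",
                   "sip:alice@192.168.2.4:5060", "a84b4c76e66710", 2, sdp))

The sketch simply concatenates the Request-Line, the header fields, an empty line and the message body, which mirrors the message format of Table 5.1.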

Status-Line
The Status-Line consists of the SIP version, a status code (an integer between 100
and 699 inclusive) and a reason phrase. An example of the Status-Line is “SIP/2.0
200 OK”. In this example, the SIP version is 2.0 and the status code is 200, which
means OK (Success). The status codes are grouped into six classes; the first digit of
the code indicates its class (cf., Table 5.3).
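
As a small illustration of how the class of a response is read directly from the first digit of the status code, the following Python sketch parses a few hypothetical Status-Lines:

# Sketch: parse a SIP Status-Line and map the status code to its class
# using the first digit (cf. Table 5.3). The example lines are hypothetical.

CLASSES = {
    "1": "Provisional (informational)",
    "2": "Success",
    "3": "Redirection",
    "4": "Client error",
    "5": "Server error",
    "6": "Global failure",
}

def parse_status_line(line):
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason, CLASSES[code[0]]

for line in ("SIP/2.0 180 Ringing", "SIP/2.0 200 OK", "SIP/2.0 401 Unauthorized"):
    print(parse_status_line(line))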
The provisional or informational class indicates that a request has been received
and is being processed. This class serves as an acknowledgement with the purpose

Table 5.3 The 6 SIP classes of status codes and their meanings

Status code    Description
1xx            Provisional (also known as informational)
2xx            Success
3xx            Redirection
4xx            Client error
5xx            Server error
6xx            Global failure

Fig. 5.11 180 SIP response

Fig. 5.12 401 SIP response

of preventing retransmission of requests. One example of the provisional class is


the 180 RINGING response (cf., Fig. 5.11), this response indicates that the UAS is
alerting the UAC of an incoming session request.
The redirection class is used to give UAC additional addresses in order to reach
other subscribers. For instance, the 301 response indicates a subscriber has moved to
another address permanently. In this case, the registrar will provide the new address
to the UAC for address book updates.
The client error class indicates problems with a session at the UAC side. For
example, the common 401 unauthorized response is sent from the registrar to the
UAC in order to challenge a registration. Upon receiving the 401 response (cf.,
Fig. 5.12), the UAC will send credentials as required by the authentication scheme.
The server error class denotes problems at the UAS when processing requests and
responses from UAC. One example of this is the 501 not implemented response, this
response is generated when a SIP method is not implemented at the UAS.

Table 5.4 SIP status codes with their meanings


Status code Status code Status code
100—Trying 380—Alternative Service 410—Gone
180—Ringing 400—Bad Request 411—Length Required
181—Call Being Forwarded 401—Unauthorized 413—Request Entity Too Large
182—Call Queued 402—Payment Required 414—Request URI Too Long
183—Session Progress 403—Forbidden 415—Unsupported Media Type
200—OK 404—Not Found 416—Unsupported URI Scheme
202—Accepted 405—Method Not Allowed 420—Bad Extension
300—Multiple Choices 406—Not Acceptable 421—Extension Required
301—Moved Permanently 407—Authentication Required 423—Interval Too Brief
302—Moved Temporarily 408—Request Timeout 480—Temporarily Unavailable
305—Use Proxy 409—Conflict 481—Call Does Not Exist
482—Loop Detected 483—Too Many Hops 484—Address Incomplete
485—Ambiguous 486—Busy Here 487—Request Terminated
488—Not Acceptable Here 491—Request Pending 493—Undecipherable
500—Server Internal Error 501—Not Implemented 502—Bad Gateway
503—Service Unavailable 504—Server Time-Out 505—Version Not Supported
513—Message Too Large 600—Busy Everywhere 603—Declined
604—Does Not Exist Anywhere 605—Not Acceptable

The global failure class indicates that the request cannot be fulfilled at any location.
For instance, the 600 Busy Everywhere response is issued when the callee has been
contacted but is busy and cannot take the call at any of its registered locations.
Table 5.4 outlines the list of SIP status codes with their descriptions.

Header Fields
A header field consists of a header field name, a colon and the header field value. It
contains detailed information of the UAC requests and UAS responses. The header
field mainly includes the origination and destination addresses of SIP requests and
responses together with routing information. The following main field names can be
found in the header field of a SIP message.

• Via: This field records the path that a request takes to reach the destination, the
same path should be taken by all corresponding responses.
• To: This field contains the URI of the destination UA, e.g., “To:Alice<sip:alice@
home>;tag = 1234”.
• From: This field contains the URI of the originating UA.
• CSeq: This field contains sequence number and SIP method name. It is used to
match requests and responses. At the start of a transaction, the first message is

given a random integer sequence number, then there will be an increment of one
for each new message.
• Contact: This field identifies the URI that should be used to contact the UA who
created the request.
• Content-Type: This field is used to identify the content media type sent to an-
other UA in a message body.
• Call-ID: This field provides a unique identifier for a SIP message. This allows
UAS to keep track of each session.
• Max-Forwards: This field is used to limit the number of hops a request can tra-
verse. It is decremented by one at each hop.
• Allow: This field is used by the UAC to determine which SIP methods are sup-
ported by the UAS. For instance, a query from UAC to UAS to find out which
methods are supported can be replied by the UAS in a response containing:
ALLOW: INVITE, CANCEL, BYE, SUBSCRIBE.

SIP Identities
SIP and SIPS URIs are used to identify SIP elements in an IP network. These two
URIs are identical but SIPS URI denotes that the URI is secured. SIP and SIPS URIs
are defined in the SIP RFC 3261 and take the form of sip:user:password@host:
port;uri-parameters?headers. The host can be given as an IP address or
a DNS name. A common form of SIP URI is alice@home:5060, where 5060 repre-
sents the port number on which the SIP stack listens for incoming SIP requests and
responses.
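
The following simplified Python sketch splits a SIP or SIPS URI of the general form given above into its main components. It ignores escaping and several optional parts allowed by RFC 3261, and the example URIs are hypothetical:

# Simplified sketch: split a SIP/SIPS URI into its main components.
# It does not cover every case allowed by RFC 3261 (e.g. escaping, IPv6 hosts).

def parse_sip_uri(uri):
    scheme, rest = uri.split(":", 1)              # "sip" or "sips"
    rest, _, headers = rest.partition("?")        # optional ?headers
    rest, _, params = rest.partition(";")         # optional ;uri-parameters
    userinfo, _, hostport = rest.rpartition("@")  # optional user[:password]@
    user, _, password = userinfo.partition(":")
    host, _, port = hostport.partition(":")
    return {"scheme": scheme, "user": user or None, "host": host,
            "port": int(port) if port else None,
            "params": params or None, "headers": headers or None}

print(parse_sip_uri("sip:alice@home:5060"))
print(parse_sip_uri("sips:alice@home;transport=tls"))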

Private User Identity: Private user identity uniquely identifies the UAC sub-
scription. Private identity enables the VoIP network operator to identify one sub-
scription for all VoIP services for the purpose of authorization, registration, admin-
istration and billing.
The private identity is not visible to other network providers and it is not used
for routing purposes. RFC 2486 [3] specifies the private identity to take the form of
the Network Access Identifier (NAI). The NAI is similar to e-mail addresses, where
“@” separates the username and the domain parts.

Public User Identity: Public user identity is used by UAs to advertise their pres-
ence. Public identity takes the form of the NAI format. Public identity is not limited
to one per subscriber; a subscriber can have more than one public identity in order
to use different devices.
Public identity allows VoIP service providers to offer flexibility to subscribers by
eliminating the need of having multiple accounts for each identity. Public identity
allows flexible routing; for example, if Alice is not in the office, a call can be routed
to Alice's device at home. Private and public identities are sent in a REGISTER mes-
sage when the UAC is registering for a VoIP service.

Fig. 5.13 A SIP message with multiple parts

Message Body
The empty line separates the message body from the header fields. The message body
can be divided into different parts. SIP uses MIME to encode its multiple message
bodies. There is a set of header fields that provide information on the contents of a
particular body part such as Content-Disposition, Content-Encoding and Content-
Type. Figure 5.13 shows a multipart SIP message body.
In Fig. 5.13, the Content-Disposition shows that the body is a session description,
the Content-Type denotes that the session description used is Session Description
Protocol (SDP) [9] and the Content-Length indicates the length of the body in bytes.
The first part of the SIP message body consists of an SDP session description and
the second part is made up of the text.
Message bodies are transmitted end to end; as a result, proxy servers do not need
to parse the message body in order to route the message. A UA may also wish
to encrypt the content of the message body.

5.3 Session Description Protocol

Session description describes how a multimedia session is communicated between


UAs. The Session Description Protocol (SDP) defines the format of a multimedia session
between UAs. The SDP format is made up of a number of text lines and formatted
as <type> = <value>, where <type> defines a unique session parameter and the
<value> provides a specific value for that parameter. The SDP was not originally
intended for SIP usage only; it can be used by many protocols including HTTP,
SMTP and RTP.

3GPP has made SDP the de facto session description protocol in IMS be-
cause it has the capability to describe a wide range of media types which can also
be treated separately. For instance, in a webinar session there may be voice, video,
a PowerPoint presentation and other media such as text editors and a whiteboard
session. All these media types would be described in one SIP message by using
SDP.
The SDP is carried within the SIP message body and has three main parts, the
Session, the Time, and the Media descriptions.

5.3.1 Session Description


Session description describes the session, such as the host and address of the session.
It is possible to have more than one session description within a single SIP message;
this is done through the SDP. For instance, in a conference call, multiple media types
such as voice, video and whiteboard applications might be needed.
v = protocol version. This describes the SDP version in use in a particular ses-
sion. By knowing the protocol version, the destination UA will figure out how to
interpret the rest of the attribute lines in the SDP.
o = owner or originator. This gives the originator of the session and session iden-
tifier. The format is o = <username><sessionid><version><networktype>
<addresstype><address>. The username should be in one word and without
spaces, like alice.john. The <networktype> uses “IN” to denote the Internet,
in other environments such as IMS, “IMS” might be used. The <addresstype>
indicates the version of IP being used. IPv4 or IPv6 can be used.
s = session name. The session name describes a session such as s = “A Practical
Guide to Voice and Video over IP”.
i = session information. This is used together with session name in order to pro-
vide additional information about a session such as i = “Voice and video calls
over fixed and mobile IP networks”.
u = URI of description. This represents the location where session participants
can retrieve further information about a session. The format is u = “www.
sessiondescr.com”.
e = email address. This is an email address provided by the session owner to
other participants for further contacts regarding the session.
p = phone number. This is the phone number provided by the session owner for
further contacts regarding the session.
c = connection information. This is in the format of c = <networktype>
<addresstype><connectionaddress>, where <connectionaddress> is the ac-
tual IP address to be used for the connection.
b = bandwidth information. This indicates the bandwidth size to be used in the
session. The format is b = <modifier><bandwidth-value>, where the modifier
can be either “CT” or “AS”. “CT” stands for conference total and indicates the
total bandwidth of all media in a session. “AS” stands for application specific and
denotes the amount of bandwidth for a single application in a session.

z = time zone adjustments. This is important for session participants who are in
different time zones in order to properly communicate the session time.
k = encryption key. If encryption is in place then the encryption key is needed to
read the payload.
a = zero or more session attribute lines. Attributes are used to extend SDP for
other applications whose attributes are not defined by the IETF.

5.3.2 Time Description

Time description provides the information about the time of a session. This might
include when the session should start, stop and repeated.
t = time the session is active. This field denotes the start and stop times for the
session. Its format is t = <start-time><stop-time>. The session is unbounded
if <stop-time> is set to 0.
r = zero or more repeat times. This field denotes when the session will be re-
peated. Its format is r = <repeat interval><active duration><offsets from
start-time>.
z = time zone adjustments. This field is used by UAs to make time zone adjust-
ments. This field is important because different time zones change their times at
different times of the day and several countries adjust daylight saving times at
different dates, while some countries do not have daylight saving times.

5.3.3 Media Description

Media description provides information regarding media of the session. Multiple


media descriptions exist where there are several media types in the session.
m = media name and transport address
i = media title
c = connection information
b = bandwidth information
k = encryption key
a = zero or more media attribute lines
The first part of SIP message body in Fig. 5.13 depicts the SDP format. Me-
dia description gives specific information about the media of the session estab-
lished. The format for the media descriptions is m = <media><port><transport>
<media formats>, where the media type can be,
• audio,
• video,
• data,
• control, and
• application.

Data represents data streams sent by an application for processing by the desti-
nation application. An application can be any multimedia application such as white-
board or similar multimedia applications. Control represents a control panel for an
end application.
The port defines the port number to receive the session. Transport describes the
transport protocol to be used for the session; RTP/AVP denotes the Real-time
Transport Protocol with the Audio Video Profile. The media formats field defines the
formats to be used, such as μ-law encoded voice and H.264 encoded video.

5.3.4 Attributes

Attributes are SDP extensions and a number of them are defined in [9]. The main
attributes are:
a = rtpmap: <payload type><encoding name>/<clock rate><encoding
parameters>. In this attribute, the payload type denotes whether the session
is audio or video. The encoding parameters are optional and identify the number
of audio channels. There are no encoding parameters for video sessions.
a = cat:<category>. This SDP attribute hierarchically lists session category
whereby the session receiver can filter the unwanted sessions.
a = keywds:<keywords>. This attribute enables the session receiver to search
sessions according to specific keywords.
a = tool:<name and version of tool>. This attribute makes it possible for the
session receiver to establish which tool has been used to set up the session.
a = ptime:<packet time>. This attribute is useful for audio data; it gives the
length of time in milliseconds of media carried in each packet of the session.
This attribute is intended as a recommendation for the packetization of audio
packets.
a = recvonly. This attribute is used to set the UAs to receive-only mode when re-
ceiving a session.
a = sendrecv. This attribute sets the UAs to send and receive mode. This will
enable the receiving UA to participate in the session.
a = orient:<whiteboard orientation>. This attribute is used with whiteboard ap-
plications to specify the orientation of the whiteboard application on the screen.
The three supported values are landscape, portrait and seascape.
a = type:<conference type>. This attribute specifies the type of the conference.
The suggested values are Broadcast, Meeting, Moderated and Test.
a = charset:<character set>. This attribute specifies the character set to describe
the session name and the information. The ISO-10646 character set is used by
default.
a = sdplang:<SDP language>. This attribute specifies the language to be used
in the SDP. The default language is English.

Fig. 5.14 SDP message

5.3.5 Example of SDP Message from Wireshark

The SDP message extracted from Wireshark is shown in Fig. 5.14. Each line
of the SDP message describes a particular attribute of the session to be created and
follows the format described in Sect. 5.3. For VoIP sessions, the important attributes
are:
a—attributes, this is in the form of a = rtpmap:<payload type><encoding
name>/<clock rate><encoding parameters>. The payload types in this exam-
ple are 101 and 107 with the corresponding clock rates of 8000 and 16000, re-
spectively.
c—connection information, which has the connection address (192.168.2.4) for
the RTP stream. The connection type is “IN” and the address type is IPv4.
m—media description, which includes the port number 52942 on which the RTP
stream will be received. The media type is audio, the transport is RTP/AVP and
the media formats supported are PCMU and PCMA.
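
To show how the <type> = <value> lines can be processed in practice, the Python sketch below parses an SDP body and extracts the connection address, the media line and the rtpmap attributes. The SDP text is a hypothetical example in the spirit of Fig. 5.14, not the actual capture:

# Sketch: parse an SDP body into <type> = <value> pairs and pick out the
# connection address, media line and rtpmap attributes. The SDP text is a
# hypothetical example, not the Wireshark capture of Fig. 5.14.

sdp_text = """v=0
o=- 8001 8001 IN IP4 192.168.2.4
s=SIP Call
c=IN IP4 192.168.2.4
t=0 0
m=audio 52942 RTP/AVP 0 8 101
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
"""

fields = [line.split("=", 1) for line in sdp_text.splitlines() if "=" in line]

connection = next(v for t, v in fields if t == "c")
media      = next(v for t, v in fields if t == "m")
rtpmaps    = [v for t, v in fields if t == "a" and v.startswith("rtpmap:")]

print("connection:", connection)          # IN IP4 192.168.2.4
print("media:", media)                    # audio 52942 RTP/AVP 0 8 101
for r in rtpmaps:
    pt, codec = r[len("rtpmap:"):].split(" ", 1)
    print("payload type", pt, "->", codec)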

5.4 SIP Messages Flow

Before presenting the SIP message flow for multimedia session establishment, it is
important to describe the relationship between SIP messages, transaction and dialog.
Although SIP messages are sent independently between UAs, they are normally

Fig. 5.15 Relationship between SIP message, transaction and dialog

arranged into transactions by UAs. In this context SIP is known as a transactional


protocol.
Transaction Requests and responses are the only two types of SIP messages that
are exchanged between UAs. A transaction takes place between UAs and com-
prises all messages from the initial request sent from one UA to another up to
the final non-1xx response. In the case of an INVITE request, the
transaction includes ACK, but only if the final response was not a 2xx. If the
final response was 2xx, then the ACK will not be part of the transaction.
Dialog A dialog is a peer-to-peer SIP relationship between UAs. A dialog per-
sists for some time and is identified by a Call-ID, a local tag and a remote tag.
SIP messages that have the same identifiers belong to the same dialog. A dialog
cannot be established until a UA receives a 2xx response from another UA. A session
consists of all the dialogs it is involved in. Figure 5.15 clearly depicts the rela-
tionship between SIP message, transaction and dialog.

5.4.1 Session Establishment

Session establishment is a 3-way process and the UAC must be registered before
establishing a session. Figure 5.16 illustrates that the current location of Alice

Fig. 5.16 Successful UAC registration

Fig. 5.17 UAC registration from Wireshark

“sip:alice@home” is successfully registered with the registrar by issuing the REG-


ISTER request. The registrar provides a challenge to Alice, and Alice enters valid cre-
dentials (user ID and password). The registrar validates Alice's credentials, registers
Alice in its contact database and returns a 200 OK response to Alice.
The 200 OK response includes Alice’s current contact list in contact headers.
If Alice wants to cancel the registration with the registrar, Alice will send a REG-
ISTER request (cf., Fig. 5.17) to the registrar. The REGISTER request will have an
expiration period of 0 and will apply to all existing contact locations. Since Alice
already has been authenticated by the registrar, Alice provides authentication cre-
dentials with the REGISTER request and will not be challenged by the registrar.
The SIP server then validates Alice's credentials and clears Alice's contact list,
and returns a 200 OK response to Alice.
Figure 5.18 depicts how UAs can establish a multimedia session. In this scenario,
Alice establishes a multimedia session with Bob by using a proxy. The initial INVITE
request, containing a route header with Bob's public URI and the address of the proxy,
is sent by Alice, and the proxy relays the INVITE request to Bob. Bob then accepts
the invitation by sending a 200 OK response to Alice via the proxy. The 200 OK
response includes Bob's Contact header field. The Contact header field will be used
by Alice for further exchange of messages with Bob. After receiving an ACK, the
session is established. If any changes are desired, such as adding video, then
another INVITE, usually known as a re-INVITE, has to be sent.
If Alice wants to terminate the session with Bob, Alice will send a BYE request
via the proxy to Bob. Bob will respond with 200 OK and the session will terminate.

Fig. 5.18 Successful session establishment via a proxy

5.5 Summary

The Session Initiation Protocol has emerged as the industry choice for real time
communication and applications, such as voice and video over IP, Instant Messag-
ing and presence. Borrowing from the proven Internet Protocols, such as SMTP
and HTTP, SIP is ideal for the Internet and other IP platforms. SIP provides the
platform to implement a range of features such as call control, next generation
service creation and interoperability with existing mobile and telephony systems.
SIP is the de facto signalling protocol in IMS, TISPAN and PacketCable architec-
tures.

5.6 Problems

1. What do the acronyms UA, UAC and UAS stand for? Describe what they do.
2. Describe any four types of SIP servers.
3. Do we really need a proxy server? Explain your answer.
4. Why is the forking proxy a stateful proxy?
5. Describe advantages and disadvantages of using stateful proxy server.

6. How does a caller determine its proxy server?


7. Name any six types of SIP method and describe the purpose of each one.
8. There is an ongoing voice and video call between two UACs, due to network
conditions one of the UACs decides to switch the current voice codec to an-
other. The new codec is supported by both UACs. Which SIP METHOD will
be issued to accomplish the codec switching? Sketch the flow of SIP messages
in this process.
9. Differentiate between a Dialog and a Transaction.
10. What is the importance of SDP in SIP call setup?
11. Does SIP carry voice packets? Explain your answer.
12. Explain the importance of the location server.
13. Does a caller need to know the location of the location server?
14. Explain the relationship between the following SIP headers, From, Contact,
Via and Record-Route/Route.

References

1. 3GPP (2008) TISPAN; Presence Service; Architecture and functional description [Endorse-
ment of 3GPP TS 23.141 and OMA-AD-Presence_SIMPLE-V1_0]. TS 23.508, 3rd Gener-
ation Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/html-info/23508.htm
2. 3GPP (2012) 3rd generation partnership project. http://www.3gpp.org. [Online; accessed
15-August-2012]
3. Aboba B, Beadles M (1999) The network access identifier. RFC 2486, Internet Engineering
Task Force. http://www.rfc-editor.org/rfc/rfc2486.txt
4. Berners-Lee T, Fielding R, Frystyk H (1996) Hypertext transfer protocol—HTTP/1.0. RFC
1945, Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc1945.txt
5. Berners-Lee T, Fielding R, Masinter L (1998) Uniform resource identifiers (URI):
generic syntax. RFC 2396, Internet Engineering Task Force. http://www.rfc-editor.org/
rfc/rfc2396.txt
6. Day M, Aggarwal S, Mohr G, Vincent J (2000) Instant messaging/presence proto-
col requirements. RFC 2779, Internet Engineering Task Force. http://www.rfc-editor.org/
rfc/rfc2779.txt
7. Dierks T, Rescorla E (2008) The transport layer security (TLS) protocol version 1.2. RFC
5246, Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc5246.txt
8. Donovan S (2000) The SIP INFO method. RFC 2976, Internet Engineering Task Force.
http://www.rfc-editor.org/rfc/rfc2976.txt
9. Handley M, Jacobson V (1998) SDP: session description protocol. RFC 2327, Internet En-
gineering Task Force. http://www.rfc-editor.org/rfc/rfc2327.txt
10. Handley M, Schulzrinne H, Schooler E, Rosenberg J (1999) SIP: session initiation protocol.
RFC 2543, Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc2543.txt
11. ITU-T (1996) H.323: visual telephone systems and equipment for local area networks which
provide a non-guaranteed quality of service. Recommendation H.323 (11/96), International
Communication Union
12. Klensin J (2001) Simple mail transfer protocol. RFC 2821, Internet Engineering Task Force.
http://www.rfc-editor.org/rfc/rfc2821.txt
13. Rosenberg J (2002) The session initiation protocol (SIP) UPDATE method. RFC 3311, In-
ternet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc3311.txt

14. Rosenberg J, Schulzrinne H, Camarillo G, Johnston A, Peterson J, Sparks R, Handley M,


Schooler E (2002) SIP: session initiation protocol. RFC 3261, Internet Engineering Task
Force. http://www.rfc-editor.org/rfc/rfc3261.txt
15. Rosenberg J, Schulzrinne H, Mahy R (2005) An INVITE-initiated dialog event pack-
age for the session initiation protocol (SIP). RFC 4235, Internet Engineering Task Force.
http://www.rfc-editor.org/rfc/rfc4235.txt
6 VoIP Quality of Experience (QoE)

Quality of Experience (QoE) is a term used to describe the user's perceived experience of
a provided service, e.g. VoIP. This term is also referred to as User Perceived Quality
of Service (PQoS) in order to differentiate it from Network Quality of Service (QoS)
which reflects network performance. Network QoS metrics generally include packet
loss, delay and jitter which are the main impairments affecting voice and video qual-
ity in VoIP applications. The key QoE metric is Mean Opinion Score (MOS), an
overall voice/video quality metric. In this chapter, the definition of QoS and QoS
metrics will be introduced first. Then the characteristics of these metrics and how
to obtain them in a practical way will be discussed. Further, the QoE concept and
an overview of QoE measurement for VoIP applications will be presented. Finally,
the most commonly used subjective and objective QoE measurement for voice and
video will be presented in detail, including Perceptual Evaluation of Speech Qual-
ity (PESQ) and E-model for voice quality assessment, and Full-Reference (FR),
Reduced-Reference (RR) and No-Reference (NR) models for video quality assess-
ment.

6.1 Concept of Quality of Service (QoS)


6.1.1 What is Quality of Service (QoS)?

Quality of Service (QoS) is defined as “the collective effect of service performance,


which determine the degree of satisfaction of a user of the service” in ITU-T Rec.
E.800 (1988 version) [9]. In networking, or more specifically in the voice and video over IP
field, QoS generally refers to Network Quality of Service (NQoS), with a
focus on the quality of IP network performance, in contrast to end-to-end QoS, which
also includes the quality of the terminal/end-device or segments/devices related
to Switched Communication Networks (SCN) such as PSTN, ISDN and GSM,
as shown in Fig. 6.1. In VoIP applications, Network QoS covers the quality of the IP
transmission segment, which starts from when IP packets leave the terminal or end
device (e.g., a PC or a laptop running VoIP software such as Skype) to another


Fig. 6.1 Network QoS vs. End-to-End QoS

Fig. 6.2 Active network QoS measurement

VoIP terminal through the IP network in a PC-to-PC call scenario, or from when IP
packets leave a media gateway in a PSTN/IP combined network to another media
gateway in a phone-to-phone call scenario.
In VoIP applications, the end-to-end QoS is also regarded as mouth-to-ear qual-
ity, reflecting the quality of a VoIP call from a user speaking into a handset's
microphone at one end to another user listening on the phone at the other end. This
is mainly for one-way listening speech quality without consideration of interactivity
(for conversation quality).
In the past decade, the term end-to-end Quality of Service has gradually
been replaced by Perceived Quality of Service (PQoS), to reflect how
an end user perceives the quality provided, and further by Quality of Experience (QoE),
with a focus on the user's experience of the quality of the service provided. The QoE
concept will be covered in the later sections of the chapter.

6.1.2 QoS Metrics and Measurements

Network Quality of Service (QoS) or network performance is normally represented


by metrics such as packet loss, delay and delay variation (jitter). These metrics can
be measured or monitored by either active (intrusive) measurement or passive (non-
intrusive) measurement. In active measurement as shown in Fig. 6.2, probe packets
(e.g., Internet Control Message Protocol (ICMP) ping packets as used in common
“ping” tool) are sent into the network and further compared with the echo packet
to obtain the network performance metrics such as packet loss percentage and max-

Fig. 6.3 Passive network QoS measurement

Fig. 6.4 Trace data example with sequence number and delay

imum/average/minimum Round Trip Time (RTT). In passive (non-intrusive) mea-


surement as shown in Fig. 6.3, there are no probe packets involved. A measure-
ment/monitoring tool such as Wireshark1 is normally run in a monitor computer or
sometimes resides in a sender or receiver host to monitor network performance
(in terms of QoS metrics) and its behavior by analyzing the IP packet headers of the
traffic of interest (e.g., traffic in a VoIP session). Figure 6.4 illustrates an example of
trace data collected/pre-processed at the receiver side which shows the packet se-
quence number (the 1st column) and the one-way network delay (the 2nd column).
The missing packet number 13 indicates that this packet was lost. From the trace
data, packet loss, delay and delay variation metrics can be calculated. These will be
presented in the following sections.
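
As a simple illustration of how these metrics can be derived from receiver-side trace data of the kind shown in Fig. 6.4, the short Python sketch below works on a list of (sequence number, one-way delay) pairs. The sample values are invented for illustration, and the jitter figure computed here is just the mean absolute difference of successive delays, not the RTCP interarrival jitter estimator defined in RFC 3550.

# Sketch: derive basic QoS metrics from receiver-side trace data of the form
# (sequence number, one-way delay in ms), as in Fig. 6.4. The sample values
# below are illustrative only.

trace = [(10, 42.1), (11, 40.3), (12, 45.8), (14, 41.0), (15, 43.6)]  # 13 lost

seqs   = [s for s, _ in trace]
delays = [d for _, d in trace]

sent     = max(seqs) - min(seqs) + 1          # packets the sender emitted
received = len(trace)
loss_pct = 100.0 * (sent - received) / sent

avg_delay = sum(delays) / len(delays)
# A simple jitter figure: mean absolute difference of successive delays.
jitter = sum(abs(b - a) for a, b in zip(delays, delays[1:])) / (len(delays) - 1)

print(f"loss = {loss_pct:.1f} %, mean delay = {avg_delay:.1f} ms, "
      f"jitter (mean abs diff) = {jitter:.1f} ms")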

6.1.3 Network Packet Loss and Its Characteristics


Network packet loss is a key network QoS impairment which will affect voice/video
quality in VoIP applications. There are mainly two kinds of packet loss. One is
caused by network congestions at bottleneck links along the path due to router
buffer queuing overflow. This kind of loss is called Congestive Loss which is bursty

1 www.wireshark.com

in nature. Another kind of packet loss, which is called Non-Congestive Loss is


mainly due to lossy links such as mobile/wireless networks and ADSL access net-
works [6, 39] and is random in nature. Bursty packet loss has a more adverse effect
on voice/video quality when compared with random packet loss. This is because
modern codecs' built-in packet loss concealment mechanisms at the decoder side
are able to conceal isolated lost packets based on previously received packet in-
formation, but are much less effective for consecutive losses.
Much research has been carried out to investigate packet loss charac-
teristics based on real Internet trace data collected via either active or passive
QoS measurement. Different packet loss models have been developed to char-
acterize the features of packet losses over the Internet. In this section, we
will discuss the three most widely used packet loss models and provide a practi-
cal approach on how to obtain these packet loss metrics. The application of
these models in voice quality prediction (e.g., E-model) will be covered in
Sect. 6.4.3.

Bernoulli Loss Model


The simplest packet loss model is the Bernoulli loss model or random loss
model, which assumes each packet loss is independent (memoryless), regardless of
whether the previous packet is lost or not. For this model, only one metric is needed
to represent average packet loss rate. This one single parameter of average packet
loss rate (Ppl ) can be calculated based on the total number of lost packets divided
by the total number of sent packets as represented in Eq. (6.1). This model was used
in the first edition of ITU-T Rec. G.107 [35] E-model (Edition 1 was approved in
1998; now Edition 8 was approved in 2011).

Ppl = (total number of lost packets / total number of sent packets) × 100 %    (6.1)

2-State Markov Model


In IP networks, because several of the mechanisms that contribute to loss are tran-
sient in nature (e.g., network congestion and buffer overflow), the packet loss in IP
network is bursty in nature (not purely random). When a packet is lost due to con-
gestion in a network, there is a temporal dependency for the next packet to be lost
as well. The temporal dependency of packet loss can be characterized by a Markov
model, typically a 2-state Markov model (also called Gilbert model) as shown in
Fig. 6.5.
There are two states (state 0 and state 1). We define a random variable X as fol-
lows: X = 0 (state 0) is for a packet received (no loss) and X = 1 (state 1) is for a
packet dropped (packet lost). p01 , p11 , p10 and p00 are used to represent four transi-
tion probabilities between No-loss state and Loss state. p (or p01 ) is the probability
that a packet will be dropped given that the previous packet was received. 1 − q (or

Fig. 6.5 2-state Markov model

Fig. 6.6 Example of burst packet loss and burst loss length

p11 ) is the probability that a packet will be dropped given that the previous packet
was dropped.
Let π0 and π1 denote the state probability for state 0 and 1, as π0 = P (X = 0)
and π1 = P (X = 1), respectively.
The procedure to compute π0 and π1 is as follows. At steady state, we have:

π0 = (1 − p) · π0 + q · π1,    π0 + π1 = 1    (6.2)

Thus π1 , the unconditional loss probability (ulp), can be computed as follows:


π1 = p / (p + q)    (6.3)

The ulp provides a measure of the average packet loss rate. 1 − q is also referred
to as the conditional packet loss probability (clp).
The Gilbert model implies a geometric distribution of the probability for the
number of consecutive packet losses k, that is the probability of a burst loss having
length k, pk can be expressed as:
pk = P(Y = k) = q · (1 − q)^(k−1)    (6.4)

Y is defined as a random variable which describes the distribution of burst loss


lengths with respect to the burst loss events. The concept of burst loss length is also
shown in Fig. 6.6.

Fig. 6.7 Trace data example to calculate packet loss parameters

Based on Eq. (6.4), the mean burst loss length E can be calculated as:

E = Σ_{k=1}^{∞} k · pk = Σ_{k=1}^{∞} k · q · (1 − q)^(k−1) = 1/q    (6.5)

Note that E[Y] is computed based on q, which is related only to the conditional
loss probability, clp (q = 1 − clp), i.e., the value of the mean burst loss length is
dependent only on the behaviour of consecutively lost packets.
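
For completeness, the following short derivation (a standard geometric-series manipulation, using the identity sum over k >= 1 of k x^(k-1) = 1/(1-x)^2 for |x| < 1, with x = 1 − q) shows how Eq. (6.5) follows from Eq. (6.4):

% Derivation of the mean burst loss length E from Eq. (6.4)
\begin{aligned}
E &= \sum_{k=1}^{\infty} k\,p_k
   = \sum_{k=1}^{\infty} k\,q\,(1-q)^{k-1}
   = q \sum_{k=1}^{\infty} k\,(1-q)^{k-1} \\
  &= \frac{q}{\bigl(1-(1-q)\bigr)^{2}}
   = \frac{q}{q^{2}}
   = \frac{1}{q}
\end{aligned}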

Practical Approach to Calculate Packet Loss Parameters


The calculation of packet loss parameters according to the 2-state Markov model is
normally based on the analysis of the trace data collected/monitored. For example,
we may need to calculate these parameters for trace data of 20 seconds (assuming
AMR codec is used with 20 ms as packet interval, then 20 seconds of trace data
is equivalent to 1000 packets sent out from the sender). At the receiver side, the
received packet is marked as “0”, whereas the lost packet is marked as “1” as shown
in Fig. 6.7. Assume c01 is used as the counter for the number of transitions
from state 0 to state 1 (the previous packet is received, the next packet is lost), and c11
is the counter for transitions from state 1 to state 1. c0 and c1 are the counters for the
total number of packets in state 0 and state 1, respectively.
The probabilities p and q can be calculated from these counters based on the trace
data as follows:
p = c01 / c0,    q = 1 − c11 / c1    (6.6)

The probabilities p and q can also be calculated from the loss length distribution
statistics of the trace data. Let oi, i = 1, 2, . . . , n − 1 denote the number of loss
bursts having length i, where n − 1 is the length of the longest loss burst. Let o0
denote the number of successfully delivered packets (obviously o0 = c0 ). Then p,
q can be calculated by the following equations [44] (you can also derive Eq. (6.7)
from Eq. (6.6) based on the trace data concept):
 
p = (Σ_{i=1}^{n−1} oi) / o0,    q = 1 − (Σ_{i=1}^{n−1} oi · (i − 1)) / (Σ_{i=1}^{n−1} oi · i)    (6.7)

When p = 1 − q, the 2-state Markov model reduces to a Bernoulli loss model.
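
As a practical illustration of the counter-based calculation in Eq. (6.6), the following Python sketch estimates p, q, the average loss rate and the mean burst loss length from a short received/lost sequence in the style of Fig. 6.7. The sample trace is made up for illustration only.

# Sketch: estimate the 2-state Markov (Gilbert) parameters from a trace in
# which 0 marks a received packet and 1 marks a lost packet, following the
# counters of Eq. (6.6). The sample trace is illustrative only, and the
# calculation assumes at least one lost and one received packet.

trace = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

c0  = trace.count(0)                         # packets in state 0 (received)
c1  = trace.count(1)                         # packets in state 1 (lost)
c01 = sum(1 for a, b in zip(trace, trace[1:]) if a == 0 and b == 1)
c11 = sum(1 for a, b in zip(trace, trace[1:]) if a == 1 and b == 1)

p = c01 / c0                                 # P(loss | previous packet received)
q = 1 - c11 / c1                             # 1 - P(loss | previous packet lost)

ppl  = 100.0 * c1 / len(trace)               # average packet loss rate, Eq. (6.1)
ulp  = p / (p + q)                           # unconditional loss probability, Eq. (6.3)
mean_burst = 1 / q                           # mean burst loss length, Eq. (6.5)

print(f"p={p:.3f}, q={q:.3f}, Ppl={ppl:.1f} %, ulp={ulp:.3f}, "
      f"mean burst length={mean_burst:.2f}")

For this example trace the loss bursts have lengths 2, 1 and 3, so the sketch reports a mean burst loss length of 2 and an average loss rate of 30 %, consistent with the formulas above.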



Fig. 6.8 4-state Markov model

The unconditional loss probability (ulp or π1 ) and the conditional loss probabil-
ity (clp or 1 − q) are two metrics used to represent bursty packet loss in IP networks.
This 2-state Markov model is also used in the calculation of effective equipment im-
pairment factor (Ie-eff ) in the E-model which will be covered in Sect. 6.4.3. In prac-
tice, the mean burst loss length (E) is normally used to replace the conditional loss
probability (clp) due to its clear practical meaning. The average packet loss rate Ppl
(in %) is generally used instead of the unconditional loss probability (ulp) metric.
A numerical example to demonstrate how to obtain the average packet loss rate
and the mean burst loss length based on the trace data information will be provided
in Sect. 6.7.

4-State Markov Model


In IP networks, network performance is affected by the amount of usage of the net-
work. For example, during peak time (from 9am to 5pm), office network traffic is
much higher than during off-peak time. The network traffic in a residential
area may display a totally different pattern. To account for the pattern of a busy (or
bursty) state and an idle (or non-bursty) state, the 2-state Markov model is further ex-
tended to a 4-state Markov model as shown in Fig. 6.8.
This model is used in the Extended E-model [4]. A new parameter, Gmin, is also
introduced to represent the number of consecutive received packets required for a
transition from a burst state to a gap state; the default value in the Extended E-model is 16.
The above three packet loss models (Bernoulli loss model, 2-state Markov model
and 4-state Markov model) are the most widely used loss models in VoIP applica-
tions. There exist other more complicated packet loss models such as 8-state Markov

Fig. 6.9 Example of end-to-end delay for VoIP applications

chain models [49], and loss run-length and no-loss run-length models [44]. Inter-
ested readers can refer to the literature for more information.

6.1.4 Delay, Delay Variation (Jitter) and Its Characteristics

In this section, we will discuss the concept of delay and delay variation (jitter) in
VoIP applications. The components of network delay and further end-to-end delay,
and the definition of delay variation (jitter) used in Real-time Transport Protocol
(RTP) in IETF RFC 3550 [45] will be covered. This is the jitter definition generally
used in VoIP applications.

Delay and Delay Components


As shown in Fig. 6.9 for a VoIP application example, IP network delay is the time a
packet sent out from the sender (Point A) to the time it reaches the receiver (Point B).
The IP network delay mainly consists of the following components:
• Propagation delay: depends only on the physical distance of the communica-
tions path and the communication medium.
• Transmission delay: the sum of the time it takes the network interfaces in the
routers to send out the packet along the path.
• Nodal Processing delay: the sum of the time it takes in the routers to decide
where (which interface) to send the packet based on packet header analysis and
the routing table.
• Queuing delay: the time a packet has to spend in the queues of the routers along
the path. It is mainly caused by network congestion.
More details about these delay components for network delay and how to calcu-
late them can be found in [37].
If we consider the end-to-end delay, the following delay components incurred at
the sender and the receiver have to be taken into account.
• Codec delay: this is the delay used by the encoder and decoder (codec) to en-
code the speech samples into the speech bitstream and decode back into speech
samples. For modern hybrid codecs, speech compression is based on a speech
frame (normally 10–30 ms). Some codecs also need a look-ahead time (about

Table 6.1 Codec algorithmic delay

Codec      Bit rates (kb/s)   Frame length (ms)   Look-ahead (ms)   Codec delay (ms)
G.711      64                 0.125               0                 0.25
G.729      8                  10                  5                 25
G.723.1    5.3/6.3            30                  7.5               67.5
AMR        4.75 ∼ 12.2        20                  0                 40

a half or a quarter of a speech frame) to complete the encoding process. Codec


processing also needs one frame length time. So the total codec delay is the
sum of the speech frame length × 2, plus look-ahead time (see some examples
in Table 6.1). For waveform-based codecs, such as PCM and ADPCM, encod-
ing is sample-based, instead of frame-based, the codec delay is only 2 sampling
intervals.

Codec delay = 2 × FrameSize + Look ahead

• Packetization delay: the time needed to build data packets at the sender, as well
as to strip off packet headers at the receiver. For example, for AMR codec,
if one packet contains two speech frames, then the packetization delay equals
2 × 20 = 40 ms.
• Playout buffer delay: the time a packet waits in the playout buffer at the receiver side. This will be explained in detail in a later section.
The end-to-end delay dend-to-end can be expressed as:

dend-to-end = dcodec + dpacketization + dnetwork + dbuffer (6.8)

For VoIP applications, if a codec, packetization size and jitter buffer are fixed,
then end-to-end delay is mainly affected by network delay. More details on buffer
delay (dbuffer ) will be explained later.
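As a simple numerical illustration of Eq. (6.8), the Matlab sketch below estimates the end-to-end delay for a G.729 call with two speech frames per packet, using the codec figures from Table 6.1; the network and buffer delay values are illustrative assumptions only.

End-to-end delay calculation sketch using Matlab (illustrative values)

d_frame   = 10;                       % G.729 frame length in ms (Table 6.1)
d_look    = 5;                        % G.729 look-ahead in ms (Table 6.1)
d_codec   = 2*d_frame + d_look;       % codec delay = 25 ms
d_packet  = 2*d_frame;                % packetization delay, 2 frames per packet
d_network = 80;                       % assumed one-way IP network delay in ms
d_buffer  = 40;                       % assumed playout buffer delay in ms
d_end2end = d_codec + d_packet + d_network + d_buffer;  % Eq. (6.8), = 165 ms
display(d_end2end);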

Delay Variation (Jitter) and Its Characteristics


In VoIP applications, packets are sent out at a fixed time interval at the sender side,
as shown in Fig. 6.10(a) for a packet sequence from i − 2 to i + 3 with
send-out times expressed as Si−2 to Si+3. The packet interval is codec/packetization
dependent. For example, if the AMR codec is used and one packet contains one speech
frame, then packets are sent out at a 20 ms interval, which can be expressed as
Si+1 − Si = Si − Si−1 = 20 ms. As packets may traverse different routes/paths
and thus incur different IP network delays, packets arrive at the
receiver side with varying packet intervals. As shown in Fig. 6.10(b), Ri represents
the receiving time of packet i. Then, in general, Ri+1 − Ri ≠ Ri − Ri−1. The
variation of delay is called jitter which is one of the major network impairments
affecting voice/video quality. The playout buffer at the receiver side is used to alle-
viate the impact of jitter and to guarantee a smooth playout of audio or video at the

Fig. 6.10 Conceptual diagram of delay variation (jitter)

Fig. 6.11 Network delay and playout buffer delay

receiver side. As illustrated in Fig. 6.10(c), packets are played out at a constant interval
at the receiver side. In the figure, Pi represents the playout time of packet i.
Now, Pi+1 − Pi = Pi − Pi−1 = 20 ms (here the AMR codec is assumed). The time
a packet stays in the buffer between its arrival and its playout is called buffer delay. If
a packet arrives too late (see packet i + 2 as an example), it will be
dropped by the playout buffer (this is called late arrival packet loss, in contrast
to network packet loss, where the packet is lost in the network).
Figure 6.11 illustrates the relationship between network delay and playout buffer
delay using packet i as an example. ni = Ri − Si is the network delay of packet i. bi is the buffer delay, i.e., the time the packet stays in the playout buffer at the receiver. di can be viewed as the total time spent by packet i from the moment it leaves the sender to the time it is played out at the receiver.
Jitter is the statistical variance of the packet interarrival time or variance of the
packet network delay and is caused mainly by the queuing delay component along
the path. There are different definitions of jitter to represent the degree of the vari-
ance of the delay.

In VoIP applications, the definition of jitter is normally based on IETF RFC 3550 [45] (which replaces the obsoleted IETF RFC 1889), where the jitter is defined to be
the mean deviation (the smoothed absolute value) of the packet spacing change be-
tween the sender and the receiver. This jitter value can be calculated and displayed
in the periodically generated RTCP reports to reflect network performance during a
VoIP application.
For the packet i, the interarrival jitter Ji is calculated as:


Ji = Ji−1 + (|D(i − 1, i)| − Ji−1)/16 (6.9)

where D is the difference of the packet spacing or the difference of IP network delay
between two consecutive packets, here packet i and its previous packet i − 1. The
D value can be calculated as below.

D(i − 1, i) = (Ri − Ri−1 ) − (Si − Si−1 ) = (Ri − Si ) − (Ri−1 − Si−1 ) (6.10)

where Ri − Ri−1 is the packet arrival space between the packet i and the packet
i − 1, and Ri − Si is the packet network delay for the packet i, shown as ni in
Fig. 6.11.
The interarrival jitter J is a running estimate of Mean Packet to Packet Delay
Variation (MPPDV), expressed as D.
A Practical Example to Calculate Jitter and Average Delay
Here we show a practical example to calculate jitter according to IETF RFC 3550.
Assume we have trace data after pre-processing (trace1.txt) similar to the one shown in Fig. 6.4, with the 1st column for the sequence number and the 2nd column for the one-way network delay. Below is a sample Matlab script to calculate the
jitter and the average delay. At the end of the calculation, the jitter value and the
average delay are printed out on the screen.

Jitter calculation example (jitter1.m) using Matlab

trace1 = load('trace1.txt');     % load the pre-processed trace data
x = trace1(:,1);                 % x is the seq. number
y = trace1(:,2);                 % y is the network delay
T = length(x);                   % T is the total packets received
j = 0;                           % set initial jitter as zero
d = y(1);                        % set initial delay as y(1)
for i = 2:T
    D = y(i) - y(i-1);           % calculate delay variation
    j = j + (abs(D) - j)/16;     % calculate jitter
    d = d + y(i);                % accumulate delay for the average
end
avgd = d/T;
display('jitter=');    display(j);     % display jitter result
display('avg delay='); display(avgd);  % display average delay

Fig. 6.12 Examples of IP network delay and jitter

Fig. 6.13 Example of adaptive buffer and fixed buffer

Jitter Characteristics and Jitter Buffer Adjustment


Figure 6.12 shows two examples of the Internet trace data collected for IP network
delay and delay variation. Figure 6.12(a) has a stable network performance with a
roughly constant level of delay variation (jitter). The IP network delay of all packets
shown is in the range of 15 to 20 ms. A fixed jitter buffer (e.g., a buffer of 5 ms)
will absorb all jitter effects and will not create any late arrival loss due to jitter.
Figure 6.12(b) shows small spikes of jitter with a periodic pattern. In this case, if a
fixed jitter buffer is used, periodic single-packet losses due to late arrival may occur.
If a jitter spike spans many packets (or lasts a long time due to major
network congestion), many consecutive packets may be lost due to late arrival
if a fixed jitter buffer is used. In these cases, an adaptive jitter buffer, which
can adapt to network delay variations, is normally used. Figure 6.13 shows an example
of an adaptive jitter buffer which has adapted the buffer size according to changing
network conditions. In the figure, if a fixed buffer were used, the majority of the packets
at the beginning of the trace data would be lost. An adaptive jitter buffer will be able

to follow network delay changes in order to get a good tradeoff in buffer delay and
buffer loss.
In VoIP applications, there exist many adaptive jitter buffer algorithms which
may adjust the jitter buffer before or during a speech talkspurt, continuously throughout a
VoIP call, to adapt to changing network conditions. The following shows an example
of a jitter buffer algorithm proposed by Ramachandran et al. [42], which follows
a concept similar to the estimation of the TCP round trip time (RTT) and the retransmission timeout interval [37].
The algorithm attempts to maintain a running estimate of the mean and variation
of network delay, that is, d̂i and v̂i , seen up to the arrival of the i th packet. If packet
i is the first packet of a talkspurt, its playout time Pi (see Fig. 6.11) is computed as:

Pi = Si + d̂i + μ × v̂i (6.11)

where μ is a constant (normally μ = 4) and v̂i is given by:

v̂i = α v̂i−1 + (1 − α)|d̂i − ni | (6.12)

ni is the network delay of the i th packet.


The playout delay for subsequent packets (e.g., packet j) in a talkspurt is kept
the same, i.e., dj = di (the buffer adjustment only occurs during silence periods,
which is normally imperceptible).
The mean delay is estimated through an exponentially weighted average.

d̂i = α d̂i−1 + (1 − α)ni (6.13)

with α = 0.998002.
Please note that parameters such as α = 0.998002 in the above equation
were obtained from trace data collected in research carried out over 15
years ago [42]. This jitter buffer algorithm and its optimized parameters may
therefore not be appropriate for VoIP applications in today's Internet. The algorithm shown
above is only used to demonstrate how a jitter buffer and a jitter buffer algorithm
work.
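To illustrate the running estimates in Eqs. (6.11)–(6.13), a minimal Matlab sketch is shown below. It assumes a vector n of one-way network delays (e.g., taken from trace data) and, as noted above, it only demonstrates the estimation step of the algorithm, not a complete jitter buffer implementation.

Playout delay estimation sketch (Eqs. (6.11)–(6.13)) using Matlab

% n is a vector of one-way network delays in ms, assumed to be available
alpha = 0.998002;                % smoothing factor from [42]
mu    = 4;                       % safety factor used in Eq. (6.11)
d_hat = n(1);                    % initial estimate of the mean delay
v_hat = 0;                       % initial estimate of the delay variation
playout_delay = zeros(size(n));
for i = 2:length(n)
    d_hat = alpha*d_hat + (1 - alpha)*n(i);               % Eq. (6.13)
    v_hat = alpha*v_hat + (1 - alpha)*abs(d_hat - n(i));  % Eq. (6.12)
    playout_delay(i) = d_hat + mu*v_hat;  % offset applied at a talkspurt start,
                                          % i.e., Pi = Si + d_hat + mu*v_hat (Eq. (6.11))
end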
Currently there are no standards for jitter buffer algorithms. The implementation
of jitter buffer algorithms in VoIP terminals/software is purely vendor-dependent.

6.2 Quality of Experience (QoE) for VoIP

6.2.1 What is Quality of Experience (QoE)?

Quality of Experience (QoE) is defined as “the overall acceptability of an application or service, as perceived subjectively by the end-user” according to ITU-T Rec.
P.10/G.100 Amendment 2 [31]. QoE is normally regarded as Perceived Quality of
Service (PQoS), in contrast to Quality of Service (QoS), which is generally regarded as Network Quality of Service (NQoS) or related metrics for network

Fig. 6.14 Factors affecting end-to-end speech quality

quality such as network packet loss, delay and jitter. The key QoE metric for VoIP
applications is the Mean Opinion Score (MOS), a metric used to represent the
overall speech quality provided during a VoIP call. This metric is generally obtained by averaging the overall quality scores provided by a group of users
(hence the term mean opinion). MOS is also used to represent perceived video quality
for video call or video streaming applications, or perceived audiovisual quality
when both voice and video are taken into account, as in video conferencing scenarios. In this chapter, we focus on MOS, the most widely used QoE metric for VoIP
applications. Other QoE metrics for speech quality such as intelligibility (how well
the content of a VoIP call can be understood) or fidelity (how faithful the degraded
speech is to the original) are not covered.

6.2.2 Factors Affecting Voice Quality in VoIP

There are many factors affecting end-to-end voice quality in VoIP applications. They
are normally classified into two categories.
• Network factors: those occurring in the IP network, such as packet loss,
delay and jitter.
• Application factors: those occurring in the application devices/software
(e.g., codec impairment, jitter buffer in VoIP terminals).
Figure 6.14 shows factors affecting voice quality along an end-to-end transmis-
sion path for a VoIP application. At the sender, these include codec impairment (e.g.,
quantization error), coding delay (e.g., the time to form a speech frame) and packetization delay (e.g., putting two or three speech frames into a packet). In the IP network,
they include network packet loss, delay and delay variation (jitter). At the receiver, they
include depacketization delay (e.g., removing the header and extracting the payload), buffer
delay (the time spent in the playout buffer), buffer loss (due to packets arriving too late),
and codec impairment and codec delay. From an end-to-end point of view, end-to-end
packet loss may include network packet loss and late arrival loss occurring at
the receiver. End-to-end delay needs to include all delays from the sender, through the IP network,
to the receiver, as shown in the figure.

Fig. 6.15 Conceptual diagram of voice quality assessment methods

At the application level, other impairment factors which are not shown in
Fig. 6.14 may include echo, sidetone (if an analog network or analog circuit is involved) and background noise. Application-related mechanisms such as Forward
Error Correction (FEC), packet loss concealment (PLC), codec bitrate adaptation
and jitter buffer algorithms at either sender or receiver side may also affect the end-
to-end overall perceived voice quality.

6.2.3 Overview of QoE for Voice and Video over IP

VoIP quality measurement can be categorized into subjective assessment and objective assessment. Objective assessment can be further divided into intrusive and non-intrusive methods. Figure 6.15 shows a conceptual diagram of the dif-
ferent speech quality assessment methods which will be covered in detail in this
section. We follow the ITU-T Rec. P.800.1 [22] for Mean Opinion Score (MOS)
terminology to describe the relevant MOS scores obtained from subjective and ob-
jective tests. In general, subjective tests are carried out based on degraded speech
samples. The subjective test scores can be described as MOS-LQS (MOS-listening
quality subjective) for one-way listening quality. If a conversation is involved,
then MOS-CQS (MOS-conversational quality subjective) is used instead. For ob-
jective tests, if a reference speech is inserted into the system/network and the
measurement is carried out by comparing the reference speech with the degraded
speech obtained at the other end of the tested system/network, this is called in-
trusive measurement. As shown in the figure, MOS-LQO (MOS-listening quality
objective) is obtained with the typical example of PESQ (Perceptual Evaluation of
Speech Quality) [17]. For non-intrusive measurement, there is no reference speech
inserted into the system/network. Instead, the measurement is carried out by ei-
ther utilizing/analyzing the captured IP packet headers (parameter-based) or ana-
lyzing the degraded speech signal itself (single-end signal based). For parameter-

Fig. 6.16 Comparison of voice quality measurement methods

based methods, a typical example is ITU-T Rec. G.107 (E-model) [32] for predict-
ing conversational speech quality, MOS-CQE (MOS-Conversational Quality Esti-
mated, with delay and echo impairment considered) or listening-only speech qual-
ity, MOS-LQE (MOS-Listening Quality Estimated) from network related parame-
ters such as packet loss, delay and jitter. Methods to predict listening-only speech
quality from network parameters are also summarized in ITU-T Rec. P.564 [23].
In some applications, parameter-based speech quality prediction models are em-
bedded into the end device (e.g., locating just after the jitter buffer, thus, late ar-
rival loss and jitter buffer are taken into account in calculating relevant parame-
ters such as packet loss and delay). For signal-based method, a typical example is
3SQM (Single Sided Speech Quality Measure) following ITU Rec. P.563 [21] which
predicts listening-only speech quality from analyzing the degraded speech signal
alone. In the following section, we will give a detailed description on comparison-
based method (PESQ) and parameter-based method (E-model) which are two most
widely used objective quality assessment methods for VoIP applications in indus-
try.
Further comparison of all voice quality measurement methods is shown in
Fig. 6.16. Please note that all objective measurement methods are calibrated by
subjective test results. In other words, objective measurement methods are solely intended to predict how subjects would assess the quality of the tested system/network/device.
User experience or user perceived quality is always the final judgement for the qual-
ity assessment.

6.3 Subjective Speech Quality Assessment

Subjective voice quality tests are carried out by asking people to grade the qual-
ity of speech samples under controlled conditions (e.g., in a sound-proof room).

Table 6.2 Absolute Category Rating (ACR)

Category   Speech quality
5          Excellent
4          Good
3          Fair
2          Poor
1          Bad

Table 6.3 Degradation Category Rating (DCR)

Category   Degradation level
5          Inaudible
4          Audible but not annoying
3          Slightly annoying
2          Annoying
1          Very annoying

The methods and procedures for conducting subjective evaluations of transmission systems and components have been set out clearly in ITU-T Recommendation P.800 [12]. Recommended methods include those for Listening-opinion tests
and Conversation-opinion tests. For Listening-opinion tests, the recommended test
method is “Absolute Category Rating” (ACR), as shown in Table 6.2, in which subjects only listen to the degraded speech test samples and give an opinion score from 5
(Excellent) to 1 (Bad). An alternative Listening-opinion test, the Degradation Category Rating (DCR), is also provided, in which subjects need to listen to both the reference and the degraded speech samples and decide their quality difference on a
five-point scale as shown in Table 6.3. Normally the DCR method is more suit-
able when the difference between the reference speech and the degraded speech is
small. The standard has also set out basic test requirements or principles, including
the test room setting (e.g., sound-proofing and the level of background noise),
test procedure (e.g., eligibility of subjects), speech test material preparation, test sequences (which need to be randomized) and test data analysis. The mean opinion score (MOS)
is obtained by averaging individual opinion scores for a number of listeners (e.g.,
from 32–100). The suggested speech samples for testing are normally 10–30 seconds
long, consisting of several short sentences spoken by both male and female speakers.
Conversation-opinion tests require two subjects seated in two separate soundproof
rooms/cabinets to carry out a conversation test on selected topics. The five-point
opinion scale (from 5 for excellent to 1 for bad) is normally used and the mean
opinion score for conversational tests can be expressed as MOSc . Subjective tests
based on the mean opinion score (MOS) reflect an overall speech quality, which is
an opinion score normally given by a subject at the end of a tested speech sample.
The ITU-T P.800, together with ITU-T P.830 [14], were originally proposed for
assessing voice quality in selecting speech codec algorithms and standardizations of

Fig. 6.17 Impact of impairment location on MOS (Recency Effect)

speech codecs such as ITU-T Rec. G.728 [10] and G.729 [11]. In these cases, the
impairments due to speech compression are normally consistent during a test speech
sample. The MOS score given at the end of a test sample reflects the overall codec
quality for the test sequence. However, in VoIP applications, the impairments from
the IP network, such as packet loss, have an inconsistent nature when compared with
codec compression impairment. Research has shown that the perceived quality of a
speech sample varies with the location of impairments such as packet loss. Subjects
tend to give a lower MOS score when the impairments occur near the end of the
test sample than when the impairments occur early in the sample. This is
called “Recency Effect” as humans tend to remember the last few things more than
those in the middle or at the beginning.
Figure 6.17 depicts the test results from an experiment described in ANSI
T1A1.7/98-031 [3] in which noise bursts were introduced at the beginning, mid-
dle and end of a 60 second test call. It shows that subjects gave the lowest MOS
score when bursts occurred at the end of the call. Similar recency effects were also
observed in other subjective tests where noise bursts were replaced with bursts of
packet loss [16]. Due to the nature of IP networks, the impact of network impairments
such as packet loss on speech quality is inconsistent during a VoIP call.
In order to capture this inconsistency for time-varying speech quality, instan-
taneous subjective speech quality measurement was developed in addition to the
overall MOS score tests as in ITU-T Rec. P.800 [12]. In EURESCOM Project [1], a
continuous rating of a 1-minute sample is proposed to assess quality for voice signal
over the Internet and UMTS networks. Instead of voting at the end of a test sentence
(as in ITU-T Rec. P.800), a continuous voting is carried out at several segments
of the test sentence to obtain a more accurate assessment of voice quality. Further
in ITU-T Rec. P.880 [19], continuous evaluation of time varying speech quality
was standardized, in which both instantaneous perceived quality (perceived at any
instant of a speech sequence) and the overall perceived quality (perceived at the

Fig. 6.18 Continuous quality scale

end of the speech sequence) are required to be tested for time varying speech sam-
ples. This method is called Continuous Evaluation of Time Varying Speech Quality
(CETVSQ). Instead of short speech sequence (e.g., 8 s) as in P.800, longer speech
test sequence (between 45 seconds and 3 minutes) is recommended. An adequate number (at least 24) of naive listeners shall participate in the test. An appropriate slider
bar should be used to assist the continuous evaluation of speech quality according to
the continuous quality scale as defined in P.880 and shown in Fig. 6.18 during a test
sample. At the end of each test sequence, subjects are still asked to rate its overall
speech quality according to ACR scale in P.800. Overall the continuous evaluation
of time varying speech quality test method is more suitable for subjective assess-
ment of speech quality in voice over IP or mobile networks where network packet loss or link bit errors are inevitable.

6.4 Objective Speech Quality Assessment


As shown in Fig. 6.15, objective speech quality assessment includes intrusive and
non-intrusive measurement methods. Intrusive measurement is an active method
which needs an injection of a reference speech signal into the tested system/network
and predicts speech quality by comparing the reference and the degraded speech sig-
nals. Non-intrusive measurement is a passive method and predicts speech quality by
analyzing the IP packet header (and relevant parameters) or analyzing the degraded
speech signal itself. It does not need the injection of the test signal and is mainly
used for quality monitoring for operational services. In this section, we start with
intrusive objective test such as ITU-T Rec. P.862 (PESQ) and P.863 (POLQA) for
Perceptual Objective Listening Quality Assessment and then discuss non-intrusive
tests including parameter-based model (e.g., E-model).

6.4.1 Comparison-Based Intrusive Objective Test (Full-Reference Model)

The comparison-based intrusive objective test method, also known as ‘full-reference’ or ‘double-ended’, predicts speech quality by comparing the reference (or original)
speech signal and the degraded (or distorted) speech signal measured at the output

Fig. 6.19 Example of reference and degraded speech signals

of the network or system under test. Figure 6.19 shows an example of the reference
and the degraded speech signals. The degraded speech signal has experienced some
packet losses as indicated in the figure. The example shown in the figure is from
the G.729 codec [11], which has a built-in packet loss concealment mechanism and has filled in the missing parts based on previous packet information as part of the packet loss concealment process.
There are a variety of intrusive objective speech quality measurement methods,
which are normally classified into three categories.
1. Time Domain Measures: based on time-domain signal processing and analysis
(i.e., analysis of the time-domain speech waveform as shown in Fig. 6.19). Typical
methods include Signal-to-Noise Ratio (SNR) and Segmental Signal-to-Noise
Ratio (SNRseg) analysis. These methods are very simple to implement, but are
not suitable for estimating the quality for low bit rate codecs (normally not
waveform based codecs) and voice over IP networks.
2. Spectral Domain Measures: based on spectral-domain signal analysis, such as
the Linear Predictive Coding (LPC) parameter distance measures and the cep-
stral distance measure. These distortion measures are closely related to speech
codec design and use the parameters of speech production models. Their per-
formance is limited by the constraints of the speech production models used in
codecs.
3. Perceptual Domain Measures: based on perceptual domain measures which
use models of human auditory perception. The models transform speech signal
into a perceptually relevant domain such as bark spectrum or loudness domain
and incorporate human auditory models. Perceptual based models provide the

Fig. 6.20 Conceptual Diagram of Comparison-based tests

Fig. 6.21 Effect of buffer adjustment—slip a frame (adjustment size = 1 frame)

highest accuracy in predicting human perceived speech quality when compared


with time-domain and frequency-domain methods.
The basic structure of the perceptual measure methods is illustrated in Fig. 6.20.
It consists of two modules: perceptual transform module and cognition/judge mod-
ule. The perceptual transform module transforms the signal into a psychophysical
representation that approximates human perception (mimic how human ear pro-
cesses speech). The cognition/judge model can map the difference between the orig-
inal and the degraded signals into estimated perceptual distortion or further to Mean
Opinion Score (MOS) (mimic how brain perceives/judges the speech quality).
Typical examples of perceptual-domain measures include the ITU-T P.861, Per-
ceptual Speech Quality Measure (PSQM) [13], proposed in 1996 mainly for assess-
ing speech quality of codecs; the ITU-T P.862, Perceptual Evaluation of Speech
Quality (PESQ) [17] specified in 2001; and the latest ITU-T P.863, Perceptual Ob-
jective Listening Quality Assessment (POLQA) [34], defined in 2011. The main dif-
ference between PSQM and PESQ is that PESQ has incorporated a time-alignment
algorithm to tackle the time difference between the reference and the degraded
speech samples due to jitter buffer adjustment as shown in Figs. 6.21 and 6.22 where
a speech frame is either “inserted” or “slipped” due to jitter buffer adjustment (as-
suming buffer adjustment size of 1 speech frame). This time non-alignment between
the reference speech and the degraded speech caused by jitter buffer adjustment is a

Fig. 6.22 Effect of buffer adjustment—insert a frame (adjustment size = 1 frame)

unique problem in VoIP applications which has posed a real challenge to traditional
time-aligned objective quality evaluation methods. The “insert” or “slip” of some
speech segments for jitter buffer adjustment normally occurs during the silence pe-
riod of a call. If a jitter buffer adjustment is also carried out mid-talkspurt, the adjustment itself may also affect voice quality.
After the development of ITU-T Rec. P.862 (PESQ) algorithm, two extensions of
PESQ are also standardized in ITU-T Rec. P.862.1 [18] for mapping from raw PESQ
score (ranging from −0.5 to 4.5) to MOS-LQO (ranging from 1 to 5) and ITU-T
Rec. P.862.2 [26] for mapping from raw PESQ score (narrow-band) to wideband
PESQ. The mapping function from PESQ to MOS-LQO is defined in Eq. (6.14).
y = 0.999 + (4.999 − 0.999) / (1 + e^(−1.4945·x + 4.6607)) (6.14)
where x is the raw PESQ MOS score, and y is MOS-LQO score after mapping.
Normal VoIP applications default to narrow-band (300–3400 Hz) speech. When wideband telephony applications/systems (50–7000 Hz) are con-
sidered, the PESQ raw score needs to be mapped to PESQ-WB (PESQ-WideBand)
score using the following Eq. (6.15).
y = 0.999 + (4.999 − 0.999) / (1 + e^(−1.3669·x + 3.8224)) (6.15)
where x is the raw PESQ MOS score, and y is the PESQ-WB MOS value after
mapping.
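The two mappings in Eqs. (6.14) and (6.15) are simple logistic functions and can be coded directly, as in the Matlab sketch below (the function name pesq_map is ours and is used only for illustration). For example, a raw PESQ score of 4.0 maps to a MOS-LQO of about 4.15.

PESQ score mapping sketch (pesq_map.m, Eqs. (6.14) and (6.15)) using Matlab

function y = pesq_map(x, wideband)
% Map a raw PESQ score x to MOS-LQO (Eq. (6.14)), or to the PESQ-WB
% score (Eq. (6.15)) when wideband is true
if nargin < 2 || ~wideband
    y = 0.999 + (4.999 - 0.999) ./ (1 + exp(-1.4945*x + 4.6607));  % Eq. (6.14)
else
    y = 0.999 + (4.999 - 0.999) ./ (1 + exp(-1.3669*x + 3.8224));  % Eq. (6.15)
end
end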
ITU-T P.863 [34] defined in 2011, is an objective speech quality prediction
method for both narrowband (300–3400 Hz) and super-wideband (50 to 14000 Hz)
speech and is regarded as the next generation speech quality assessment technology
suitable for fixed, mobile and IP-based networks. It predicts listening-only speech
quality in terms of MOS. The predicted speech quality for its narrowband and super-
wideband mode is expressed as MOS-LQOn (MOS Listening Quality Objective
for Narrowband) and MOS-LQOsw (MOS Listening Quality Objective for Super-
wideband), respectively. P.863 is suitable for all the narrow and wideband speech
codecs listed in Table 2.2 in Chap. 2 and can be applied for applications over GSM,
UMTS, CDMA, VoIP, video telephony and TETRA emergency communications
networks.

6.4.2 Parameter-Based Measurement: E-Model

The E-model, abbreviated from the European Telecommunications Standards Institute (ETSI) Computation Model, was originally developed by a working group
within ETSI during the work on ETSI Technical Report ETR 250 [7]. It is a passive,
computational tool mainly for network planning. It takes into account all possible
impairments for an end-to-end speech transmission such as equipment-related im-
pairment (e.g., codec, packet loss), delay-related impairment (e.g., end-to-end delay
and echo) and impairments that occur simultaneously with speech (e.g., quantiza-
tion noise, speech level).
The fundamental principle of the E-model is based on a concept established more
than 30 years ago by J. Allnatt [2]: “Psychological factors on the psychological
scale are additive”. It is used for describing the perceptual effects of diverse impair-
ments occurring simultaneously on a telephone connection. Because the perceived
integral quality is a multidimensional attribute, the dimensionality is reduced to a
one-dimensional so-called transmission rating scale or R scale (in the range of 0
to 100). On this scale, all the impairments are—by definition—additive and thus
independent of one another.
The E-model takes into account all possible impairments for an end-to-end
speech transmission, such as delay-related impairment and equipment-related im-
pairment, and is given by the equation below according to ITU-T Rec. G.107 [32].

R = R0 − Is − Id − Ie-eff + A (6.16)

where
R0 : S/N at 0 dBr point (groups the effects of noise)
Is : Impairments that occur simultaneously with speech (e.g. quantization
noise, received speech level and sidetone level)
Id : Impairments that are delayed with respect to speech (e.g. talker/listener
echo and absolute delay)
Ie-eff : Effective equipment impairment (e.g. codecs, packet loss and jitter)
A: Advantage factor or expectation factor (e.g. 0 for wireline and 10 for GSM)
The ITU-T Rec. G.107 has gone through 7 different versions in the past ten years
which reflect the continuous development of the model for modern applications,
such as VoIP. For example, the Ie model in the E-model has evolved from a simple
random loss model and a 2-state Markov model to a more complicated 4-state Markov
model, which takes into account bursty losses and gap/bursty states (as discussed
in Sect. 6.1.3) to reflect real packet loss characteristics in IP networks. Efforts are
still ongoing to further improve E-model in applications in modern fixed/mobile
networks.
ITU-T Rec. G.109 [15] defines the speech quality classes with the Rating (R), as
illustrated in Table 6.4. A rating below 50 indicates unacceptable quality.
The score obtained from the E-model is referred to as MOS-CQE (MOS conversa-
tional quality estimated). This MOS score can be converted from R-value by using
Eq. (6.17) according to ITU-T Rec. G.107 [32], which is also depicted in Fig. 6.23.

Table 6.4 Definition of categories of speech transmission quality [15]

R-value range   Speech transmission quality category   User satisfaction
100–90          Best                                    Very satisfied
90–80           High                                    Satisfied
80–70           Medium                                  Some users dissatisfied
70–60           Low                                     Many users dissatisfied
60–50           Poor                                    Nearly all users dissatisfied

Table 6.5 Mapping of R vs. MOS for R over 50

R-value   MOS score   User experience
90        4.3         Good
80        4.0         Good
70        3.6         Fair
60        3.1         Fair
50        2.6         Poor

Fig. 6.23 MOS vs. R-value from ITU-T G.107

It is clear that when R is below 50, the MOS score is below 2.6, indicating low voice
quality. When R is above 80, the MOS score is over 4.0, which indicates high voice quality
(reaching the “toll quality” category (MOS: 4.0–4.5) used in traditional PSTN networks).
A detailed mapping of R vs. MOS when R is above 50 is listed in Table 6.5.


MOS = 1                                              for R ≤ 0
MOS = 1 + 0.035R + R(R − 60)(100 − R) · 7 · 10^−6    for 0 < R < 100    (6.17)
MOS = 4.5                                            for R ≥ 100
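The piecewise mapping in Eq. (6.17) is straightforward to implement; a short Matlab sketch is given below (the function name r_to_mos is ours). For example, r_to_mos(80) returns about 4.0, consistent with Table 6.5.

R-value to MOS mapping sketch (r_to_mos.m, Eq. (6.17)) using Matlab

function mos = r_to_mos(R)
% Convert an E-model R-value to a MOS estimate according to Eq. (6.17)
if R <= 0
    mos = 1;
elseif R >= 100
    mos = 4.5;
else
    mos = 1 + 0.035*R + R*(R - 60)*(100 - R)*7e-6;
end
end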

Fig. 6.24 Id vs. one-way delay

6.4.3 A Simplified and Applicable E-Model


For VoIP applications, a simplified E-model as shown in Eq. (6.18) can be used
when only packet loss and delay impairments are considered.
R = 93.2 − Id − Ie-eff (6.18)

Now Id can be expressed as a function of one-way delay d. Assuming only the


default values listed in G.107 [32] are used, the relationship of Id versus one-way
delay (d) from G.107 is shown in Fig. 6.24 (curve from G.107). As the computa-
tional process to obtain Id according to G.107 is too complicated, a simplified Id
calculation is proposed as Eq. (6.19) [5]. The curves of Id vs. one-way delay from
both G.107 and the simplified model are shown in Fig. 6.24.
Id = 0.024d + 0.11(d − 177.3)H(d − 177.3) (6.19)

where H(x) = 0 if x < 0, and H(x) = 1 if x ≥ 0.

For VoIP applications, delay d is normally regarded as end-to-end one-way delay


which includes network delay dn , codec/packetization delay dc , and jitter buffer
delay db , as shown below:
d = d n + dc + db (6.20)

Effective equipment impairment factor Ie-eff can be calculated in Eq. (6.21) ac-
cording to ITU-T G.107 [32].
Ie-eff = Ie + (95 − Ie) · Ppl / (Ppl/BurstR + Bpl) (6.21)

Ie is the equipment impairment factor at zero packet loss which reflects purely
codec impairment. Bpl is defined as the packet-loss robustness factor which is also

codec-specific. As defined in ITU-T Rec. G.113 [25], Ie = 0 for G.711 PCM codec
at 64 kb/s, which is set as a reference point (zero codec impairment). All other
codecs have higher than zero Ie value (e.g. Ie = 10 for G.729 at 8 kb/s, Ie = 15 for
G.723.1 at 6.3 kb/s). Normally the lower the codec bit rate, the higher the equipment
impairment Ie value for the codec. The Bpl value reflects a codec's built-in packet loss concealment ability. The value is not only codec-dependent,
but also packet-size-dependent (i.e., depends on how many speech frames in a
packet). According to G.113, Bpl = 16.1 for G.723.1+VAD (Voice Activity De-
tection (VAD) is activated) with packet size of 30 ms (only one speech frame in a
packet). Bpl = 19.0 for G.729A+VAD (VAD activated) with packet size of 20 ms (2
speech frames in a packet). Ppl is the average packet-loss rate (in %).
BurstR is the so-called burst ratio. When packet loss is random, BurstR = 1; and
when packet loss is bursty, BurstR > 1.
In a 2-state Markov model as shown in Fig. 6.5, BurstR can be calculated as:
BurstR = 1/(p + q) = (Ppl/100)/p = (1 − Ppl/100)/q (6.22)

Please note that the p value is the transition probability from the “No Loss” state to the “Loss” state
and the q value is the transition probability from the “Loss” state to the “No Loss” state, as shown
in Fig. 6.5. The equivalence of the (Ppl, p) form and the (p, q) form in Eq. (6.22) can be easily
derived from Eq. (6.3).
Overall, effective equipment impairment factor Ie-eff can be obtained when codec
type and packet size are known and network packet loss parameters (in a 2-state
Markov model) have been derived.
The E-model R-factor can be calculated from Eq. (6.18) after Id and Ie-eff are
derived. Further MOS can be obtained from the R-factor according to Eq. (6.17).
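Putting Eqs. (6.18)–(6.22) together, the Matlab sketch below estimates the R-factor and MOS for an assumed G.729-type call (Ie = 10 and Bpl = 19.0 as quoted above); the delay and 2-state Markov loss parameters are illustrative assumptions, and r_to_mos is the mapping function sketched after Eq. (6.17).

Simplified E-model calculation sketch (Eqs. (6.18)–(6.22)) using Matlab

Ie  = 10;                  % equipment impairment for G.729 (G.113, quoted above)
Bpl = 19.0;                % packet-loss robustness factor (G.729A+VAD, 20 ms packets)
p   = 0.015;  q = 0.735;   % assumed 2-state Markov transition probabilities
d   = 150;                 % assumed end-to-end one-way delay in ms (Eq. (6.20))

Ppl    = 100*p/(p + q);                            % average loss rate, here 2 %
BurstR = 1/(p + q);                                % Eq. (6.22)
Id     = 0.024*d + 0.11*(d - 177.3)*(d >= 177.3);  % Eq. (6.19)
Ie_eff = Ie + (95 - Ie)*Ppl/(Ppl/BurstR + Bpl);    % Eq. (6.21)
R      = 93.2 - Id - Ie_eff;                       % Eq. (6.18), about 71 here
mos    = r_to_mos(R);                              % Eq. (6.17), about 3.7 here
display(R); display(mos);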

6.5 Subjective Video Quality Assessment


Similar to subjective voice quality assessment, subjective video quality assessment
evaluates video quality based on an overall subjective opinion score, such as the Mean
Opinion Score (MOS). It is mainly defined by ITU-T P.910 [30] with a focus on sub-
jective assessment for multimedia applications and ITU-R BT.500 [8] with a focus
on subjective assessment of television pictures. ITU-T P.910 describes subjective
assessment methods for evaluating one-way overall video quality for applications
such as video conferencing and telemedical applications. It specifies subjective test
requirements and procedures including source signal preparation and selection; test
methods and experimental design (such as ACR and DCR and length for a test ses-
sion), test procedures (such as viewing conditions, viewers selection, instructions
to viewers and training session) and final test results analysis. ITU-R BT.500 also
defines subjective test methodology, including general viewing conditions, source
signal (source video sequences), selection of test materials (different test conditions
to create different test materials), observers (experts or non-experts depending on
the purpose of a subjective test; at least 15 observers should be used), instructions

Table 6.6 Degradation Category Rating (DCR) for video

Category   Degradation level
5          Imperceptible
4          Perceptible but not annoying
3          Slightly annoying
2          Annoying
1          Very annoying

of an assessment, test session (should last up to half an hour and random presen-
tation order for test video sequences) and final subjective test results presentation
(e.g., to calculate mean score and 95 % confidence interval and to remove inconsis-
tent observers).
Depending on how evaluation or quality voting is carried out, subjective test
methods can be either a standalone one-vote test (e.g., giving a vote at the
end of a test session for a test video sequence) or a continuous test (e.g., the
viewer moves a voting scale bar to indicate the video quality continuously during
a test session). The latter is more appropriate for assess-
ing video quality in VoIP applications as the network impairments such as packet
loss and jitter are time-varying. Their impact on video quality also depends on the
location of these impairments in connection with the video contents or scenes. Typ-
ical subjective test methods include Absolute Category Rating (ACR), Absolute
Category Rating with Hidden Reference (ACR-HR), Degradation Category Rating
(DCR), Pair Comparison Method (PC), Double-stimulus continuous quality-scale
(DSCQS), Single stimulus continuous quality evaluation (SSCQE) and simultane-
ous double stimulus for continuous evaluation (SDSCE) methods, which are listed
and explained below.
• Absolute Category Rating (ACR) method: also called single stimulus (SS)
method, where only the degraded video sequence is shown to the viewer for
quality evaluation. The five-scale quality rating for ACR is 5 (Excellent), 4
(Good), 3 (Fair), 2 (Poor) and 1 (Bad) (similar as the one shown in Table 6.2 for
speech quality evaluation).
• Absolute Category Rating with Hidden Reference (ACR-HR) method: includes
a reference version of each test video sequence as its test stimulus (refers to the
term of hidden reference). Differential viewer scores (DV) are calculated as

DV = Test video score − Reference video score + 5

A DV of 5 indicates an ‘Excellent’ video quality (or the quality is the same as


the reference video sequence). A DV of 1 indicates a ‘Bad’ video quality.
• Degradation Category Rating (DCR) method: also known as double stimulus
impairment scale (DSIS), where the reference and the degraded video sequences
are presented to the viewer in pairs (the reference first, then the degraded one;
the pair can be presented twice if impairments for the degraded video clip is
small when compared with the reference one). The 5-level DCR scale is shown

Fig. 6.25 Double-stimulus continuous quality-scale (DSCQS)

in Table 6.6 (if you compare this table with Table 6.3 for voice, you will notice
that the difference is only the change of the word ‘inaudible’ for voice/audio
condition to the word ‘imperceptible’ for video condition).
• Pair Comparison (PC) method: a pair of test video clips are presented to the
viewer who indicates his/her preference for a video (e.g., if the viewer prefers
the 1st video sequence, he/she will tick the box for the 1st one, and vice versa).
• Single stimulus continuous quality evaluation (SSCQE) method: the viewer is
asked to provide a continuous quality assessment using a slider ranging from
0 (Bad) to 100 (Excellent). Final results will be mapped to a single quality
metric such as 5 level MOS score. The test video sequence is typically of 20–30
minutes duration.
• Double-stimulus continuous quality-scale (DSCQS) method: viewers are asked
to assess the video quality for a pair of video clips including both the reference
and the degraded video clips. The degraded video clips may include hidden
reference pictures. Test video sequences are short (about 10 seconds). Pairs of
video clips are normally shown twice and viewers are asked to give voting dur-
ing the second presentation for both video clips using a continuous quality-scale
as shown in Fig. 6.25.
• Simultaneous double stimulus for a continuous evaluation (SDSCE) method:
viewers are asked to view two video clips (one reference and one degraded,
normally displayed side-by-side in one monitor) at the same time. Viewers are
requested to concentrate on viewing the differences between two video clips
and judge the fidelity of the test video to the reference one by moving the slider
continuously (100 for the highest fidelity and 0 for the lowest fidelity) during
a test session. The length of the test sequence can be longer for SDSCE when
compared with that of DSCQS.

6.6 Objective Video Quality Assessment

6.6.1 Full-Reference (FR) Video Quality Assessment

Full-Reference (FR) video quality assessment compares a reference video sequence with the degraded video sequence to predict the video quality (e.g. in terms

Fig. 6.26 Full-Reference (FR) video quality assessment

Table 6.7 Mapping from PSNR to MOS

PSNR (dB)   MOS
> 37        5 (Excellent)
31–37       4 (Good)
25–31       3 (Fair)
20–25       2 (Poor)
< 20        1 (Bad)

of MOS), as illustrated in Fig. 6.26. The comparison is carried out on a pixel-by-pixel and frame-by-frame basis between the reference and the degraded video sequences.
The simplest and most widely used video quality assessment metric is the Peak Signal-to-
Noise Ratio (PSNR) which measures the Mean Squared Error (MSE) between the
reference and the test sequences, as defined by Eqs. (6.23) and (6.24).

MSE = Σ∀M,N [f(M, N) − F(M, N)]² / (N · M) (6.23)

PSNR = 20 · log10(Vpeak / √MSE)
Vpeak = 2^k − 1, k = number of bits per pixel (luminance component) (6.24)

where M × N is the width × height (in pixels) of the image. F and f are the
luminance components of the original and the degraded images (pixel by pixel).
The number of bits per pixel (luminance component) is normally 8, which results
in a Vpeak of 255 (this is where the name ‘Peak’ Signal-to-Noise Ratio comes from).
PSNR between the reference and degraded video sequences can be obtained from
the PSNR value image by image (or frame by frame) and can be expressed by an
average PSNR value among all frames considered.
PSNR expressed in dB can be mapped to the MOS score of video quality accord-
ing to [36] and is shown in Table 6.7.
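A minimal Matlab sketch of the per-frame computation in Eqs. (6.23) and (6.24) is given below, assuming the reference and degraded luminance frames are already spatially and temporally aligned and stored as 8-bit matrices of the same size; the per-frame values are then averaged over all frames considered (the function name frame_psnr is ours).

Per-frame PSNR calculation sketch (frame_psnr.m, Eqs. (6.23) and (6.24)) using Matlab

function psnr_db = frame_psnr(F, f)
% PSNR between an aligned reference luminance frame F and a degraded
% frame f (both 8-bit matrices of identical size)
F = double(F);  f = double(f);
mse     = mean((f(:) - F(:)).^2);       % Eq. (6.23)
vpeak   = 2^8 - 1;                      % 8 bits per pixel gives Vpeak = 255
psnr_db = 20*log10(vpeak/sqrt(mse));    % Eq. (6.24)
end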
Other popular FR video quality measurement methods are the Structural Simi-
larity Index (SSIM) [46] and Video Quality Metric (VQM) from National Telecom-
munications and Information Administration (NTIA) [41], also defined in ITU-T

J.144 [20]. These FR models assume that the reference and degraded video se-
quences are properly aligned at both spatial (pixel by pixel) and temporal (frame
by frame) domains. Perceptual video quality assessment is based on pixel-by-pixel
and frame-by-frame comparison between the reference and the degraded video sig-
nals. If metrics such as PSNR and SSIM are applied directly to reference and degraded
video clips with spatial or temporal misalignment, poor PSNR or SSIM
results will be obtained. In test environments such as Internet video applications,
spatial and temporal alignment mechanisms therefore need to be integrated into the
perceptual video quality model.
It has to be mentioned that Table 6.7 is only a brief mapping between PSNR
and the MOS score. Research such as [47] has demonstrated that degraded video
clips with the same PSNR value may have different perceived video quality. In the
Phase II Full-Reference model test for Standard-Definition (SD) TV carried out by
the Video Quality Experts Group (VQEG)2 (Phase II of VQEG’s FRTV test), PSNR
only achieved about 70 % correlation with the subjective test results (MOS) [48].
Much effort has been put into the research and development of better FR models
for predicting perceived video quality more accurately. VQEG has also conducted
several projects on Full-Reference (FR) video quality assessment, including FR-TV
Phase I and FR-TV Phase II, aimed at evaluating FR video quality assessment for
SD TV applications with a focus on MPEG-2 compression for digital TV broadcasting,
and Multimedia (MM) Phase I and MM Phase II, aimed at multimedia
applications such as Internet multimedia streaming, video telephony and conferenc-
ing and mobile video streaming. The former work resulted in the specification of
ITU-T J.144 (2004) [20] which included four perceptual video quality measure-
ment methods from British Telecom (BT)3 from the UK; Yonsei University4 /SK
Telecom5 /Radio Research Laboratory (RRL) from Korea; the Telecommunications
Research and Development Center (CPqD) from Brazil;6 and the National Telecom-
munications and Information Administration (NTIA)7 from the USA. The Video
Quality Metric (VQM) software developed by NTIA is also downloadable (royalty
free) from the NTIA/ITS website.8 The MM Phase I produced four FR models for
multimedia applications which are specified in ITU-T J.247 [28] and included four
models from NTT, Japan;9 Opticom, Germany;10 Psytechnics, UK11 and Yonsei

2 http://www.vqeg.org

3 http://www.bt.com

4 http://www.yonsei.ac.kr

5 http://www.sktelecom.com

6 http://www.cpqd.com.br

7 http://www.ntia.doc.gov

8 http://www.its.bldrdoc.gov/resources/video-quality-research/software.aspx

9 http://www.ntt.com

10 http://www.opticom.de

11 http://www.psytechnics.com

Fig. 6.27 Reduced-Reference (RR) video quality assessment

University, Korea. The model from Opticom is named Perceptual Evaluation of


Video Quality (PEVQ)12 and is widely used in industry. The multimedia FR models
can be applied to VGA (Video Graphics Array, 640 × 480 pixels), CIF (Common
Intermediate Format, 352 × 288 pixels) and QCIF (Quarter Common Intermedi-
ate Format, 176 × 144 pixels) resolution, video frame rates from 5 to 30 fps, video
codecs including H.264/AVC (MPEG-4 part 10) and MPEG-4 Part 2, and maximum
temporal errors (pausing with skipping) of 2 seconds.
Considering the impact of network impairments (e.g., packet loss and jitter) on
degraded video sequence such as frame skip and frame repeat, and the effect of
adaptive video terminals such as frame rate adjustment/adaptation, all FR models
specified in ITU-T J.144 and J.247 have included some degree of temporal and
spatial alignment processing algorithm before normal Full-Reference video qual-
ity assessment method can be applied. More details on calibration algorithms for
detecting and registering spatio-temporal misalignment between the reference and
degraded video sequences can also be found in ITU-T J.244 [27].

6.6.2 Reduced-Reference (RR) Video Quality Assessment

Reduced-Reference (RR) video quality assessment is a method utilizing reduced reference video information, instead of the full reference video sequence. As shown in
Fig. 6.27, important perception-based image or video features extracted from the
reference video source, such as features related with edge degradation (i.e. loca-
tions and pixel values of some edge pixels) as used in ITU-T J.246 [29] or informa-
tion regarding the Discrete Cosine Transform (DCT) coefficient distributions used
in [38] are encoded and transmitted over a side channel to the receiver. At the re-
ceiver side, RR video quality assessment model predicts perceived video quality
(e.g. MOS) by comparing the perception-based video features extracted from the
degraded video signal with those received from the channel (features extracted from
the source reference sequence). Compared with the FR model, the RR-model only
uses low-bandwidth perception-based video features instead of full video sequences

12 http://www.pevq.org

for video quality prediction, and these features are easier to transmit to the receiver for
quality comparison. ITU-T J.246 (2008) [29] defines RR models for multimedia ap-
plications in which both temporal and spatial alignment processes are applied and
test conditions are similar with those set in ITU-T J.247 (e.g. supporting video res-
olutions of QCIF, CIF and VGA and video frame rates from 5 to 30 fps). The RR
model proposed by Yonsei University, Korea is also included in ITU-T J.246 An-
nex A. ITU-T J.249 (2010) [33] specified RR models for SDTV (Standard Defini-
tion Television) applications and included three RR models from Yonsei University
of Korea, NEC of Japan13 and National Telecommunications and Information Ad-
ministration/Institute for Telecommunication Sciences (NTIA/ITS)14 of USA. All
three models contain spatial and temporal alignment process and gain adjustment
between the reference and degraded video sequences.

6.6.3 No-Reference Video Quality Assessment

No-Reference (NR) or Zero-Reference (ZR) video quality assessment is a method that predicts video quality from only the degraded video signal
or the received video bitstream, or a combination of both. It does not require any information from the reference video signal. As no reference information is required,
NR video quality assessment can easily be applied to live (or in-service)
real-time video quality monitoring and assessment and to QoS/QoE
management.
Depending on the input signal to the NR model, an NR model can be classified into the following categories (also shown in Fig. 6.28).
• Packet-based NR model: to predict video quality only from packet header in-
formation or information derived from packet header, for example, packet loss
rate, codec type, jitter.
• Bitstream-based NR model: to predict video quality only from bitstream header
information, e.g. Transport Stream (TS) information in TV broadcasting appli-
cations, and/or bitstream information from video codec headers such as GOB
start codes, picture size, frame rate, motion vector length or number of I/B/P
frame/slice losses. Examples include the model proposed by Mohamed [40],
which predicts video quality based on sender bitrate, frame rate, packet loss
rate, loss burstiness and the ratio of encoded intra macro-blocks to inter
macro-blocks, and the models proposed by Reibman [43], which can predict
video quality from either network-level parameters or parameters extracted
from bitstreams.
• Signal-based NR model: to predict video quality only from the degraded video
signal. This is also called picture metric.

13 http://www.nec.com

14 http://www.its.bldrdoc.gov

Fig. 6.28 No-Reference (NR) video quality assessment

• Hybrid NR model: to predict video quality from both packet/bitstream header


information and the degraded video signal, such as V-factor model [47].
Similar to the E-model for speech, ITU-T G.1070 [24] defines an opinion model for
predicting multimedia quality (MMq) for video-telephony applications. The output
from the model also includes video quality influenced by speech quality Vq (Sq )
and speech quality influenced by video quality Sq (Vq ). Multimedia quality contains
a combined effect from both perceived video quality and perceived speech quality.
The video quality can be estimated from application and network parameters includ-
ing codec type (e.g., H.263 or MPEG4), video format (QQVGA (Quarter Quarter
VGA, 160 × 120 pixels), QVGA (Quarter Video Graphics Array, 320 × 240 pixels)
and VGA), key frame interval (i.e., the interval between two consecutive I frames
or the length of GoP), video display size (e.g., 4.2-inch or 2.1-inch), video bitrate
(kb/s), frame rate (from 1 to 30 fps), packet loss rate (PLR, also including packet
loss robustness factor) and end-to-end delay. The opinion model is mainly used for
network planning purposes, for example, QoS/QoE planning.

6.7 Illustrative Worked Examples

6.7.1 Question 1

Explain, with the aid of a suitable block diagram, how speech signals are trans-
ported over IP networks in real-time. Your answer should highlight the main infor-
mation/signal processing operations that take place in transporting speech from the
speaker to the listener and how they affect voice quality.
Indicate on your diagram the main impairment factors in VoIP systems.

Fig. 6.29 Sources of impairment in VoIP systems

Table 6.8 Burst loss length distribution for a sample trace data

Burst loss length (i)   0      1    2   3   4   5
Counts (oi)             3005   25   4   0   0   2

SOLUTION: A suitable block diagram is shown in Fig. 6.29 which highlights


the key information processing stages in VoIP and the impairment factors at each
stage. A brief explanation of the information processing operations and associated impairment factors is given below:
• Encoder—samples, quantises/compresses the voice signal. The basic encoder is
the ITU G.711 which uses PCM and samples the voice signal once every 0.125
ms (8 kHz) and generates 8-bits per sample (i.e., 64 kb/s). The more recent
encoders provide significant reduction in data rate (e.g., G.723.1, G.726 and
G.729). Many of these are frame-based.
• Packetizer—places a certain number of speech samples (in the case of G.711)
or frames (in the case of G.723.1 and G.729) into packets and then adds the RTP
header. In addition, UDP header and IP header are added to form IP packets.
• Network—voice packets are then sent over the network. As the voice packets
are transported, they may be subjected to a variety of impairments—delay, delay
variation and packet loss.
• Playback buffer—playback buffers are used to absorb variations suffered by
the voice packets and make it possible to obtain a smooth playout and hence
smooth reconstruction of speech. Playout buffer can lead to additional packet
loss as packets arriving too late are discarded.

6.7.2 Question 2

Consider the burst loss length distribution illustrated in Table 6.8, where oi (i = 1, 2, 3, 4, 5) denotes the number of packet loss bursts that have length i. In this
case, 5 is the length of the longest packet loss bursts, and o0 denotes the number of

successfully delivered packets. Calculate the average packet loss rate and the mean
burst loss length for this trace data.

SOLUTION: Based on Eq. (6.7), we can calculate p and q as below:



p = (Σi oi)/o0 = (25 + 4 + 2)/3005 = 31/3005 = 0.0103

q = 1 − (Σi oi · (i − 1))/(Σi oi · i) = 1 − (4 · (2 − 1) + 2 · (5 − 1))/(1 · 25 + 2 · 4 + 5 · 2) = 1 − 12/43 = 31/43 = 0.721

π1 = p/(p + q) = 0.0103/(0.0103 + 0.721) = 0.0103/0.731 = 0.014

Ppl = π1 × 100 % = 1.4 %

E = 1/q = 1.39

So the average packet loss rate is 1.4 % and the mean burst loss length is 1.39 for
this trace data.
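The same calculation can be reproduced with a few lines of Matlab, which is convenient when processing longer burst length histograms; the vectors below simply restate Table 6.8.

Burst loss statistics sketch using Matlab

i   = 1:5;                            % burst loss lengths
oi  = [25 4 0 0 2];                   % number of bursts of each length (Table 6.8)
o0  = 3005;                           % number of successfully delivered packets
p   = sum(oi)/o0;                     % 31/3005, about 0.0103
q   = 1 - sum(oi.*(i-1))/sum(oi.*i);  % 31/43, about 0.721
Ppl = 100*p/(p + q);                  % average packet loss rate, about 1.4 %
E   = 1/q;                            % mean burst loss length, about 1.39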

6.7.3 Question 3

Explain, with the aid of a suitable block diagram, how Full-Reference (FR) and No-
Reference (NR) video quality assessment models work? In video quality monitoring
for VoIP video call applications, which model (FR or NR) should we use? Why?

SOLUTION: The full-reference (FR) and no-reference (NR) video quality as-
sessment models are shown in Fig. 6.30. Figure (a) shows the Full-Reference model,
where the reference video and the degraded video signals are both input to the
FR model to predict video quality. For the NR model (see figures (b) and (c)), there
is no reference video involved in the video quality prediction. Instead, only the degraded
video signal (see figure (b)) or parameters derived from the transmission system
(e.g., parameters derived from IP packet headers or from TS streams,
see figure (c)) are used for the video quality prediction. It is also possible that
the degraded video signal and parameters derived from the transmission system are
both used for the video quality prediction (known as a hybrid video quality model).
In video quality monitoring for VoIP video call applications, no-reference model
is normally used. This is because that there is no reference video signal injected
into the tested system/network and only received degraded video signal or received
video packets/bitstreams are used for video quality prediction. No-reference model
can be used for real-time video quality monitoring for operational systems/networks
for video quality prediction. NR model can also be incorporated into terminals (such
as mobile phones and TV set-top boxes) for real-time video quality monitoring for
video call or video streaming applications.

Fig. 6.30 Full-Reference (FR) and No-Reference (NR) video quality assessment
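For comparison with the FR approach just described (see also Problem 12 below), the sketch below computes PSNR, the simplest full-reference metric, between a reference frame and a degraded frame. The frames are random arrays purely for illustration, and PSNR only captures pixel-level error rather than perceived quality.

```python
import numpy as np

def psnr(reference, degraded, max_value=255.0):
    """Peak Signal-to-Noise Ratio between two frames of equal size."""
    mse = np.mean((reference.astype(np.float64) - degraded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

# Purely illustrative frames: a random reference and a noisy "degraded" copy.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(144, 176), dtype=np.uint8)   # QCIF-sized luma frame
deg = np.clip(ref.astype(np.int16) + rng.integers(-10, 11, ref.shape), 0, 255).astype(np.uint8)

print(f"PSNR = {psnr(ref, deg):.2f} dB")
```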

6.8 Summary

In this chapter, we have discussed the concept of Quality of Service (QoS) and
Quality of Experience (QoE) mainly around VoIP applications. QoS is generally
used to express network performance, using metrics such as packet loss, delay and
jitter. QoE is normally used to express user perceived quality for a provided ser-
vice such as VoIP and usually uses Mean Opinion Score (MOS) to represent an
overall quality for voice, video or audiovisual applications. In the chapter, we have explained the QoS metrics (i.e., loss, delay and jitter), from their definitions and characteristics to practical approaches for obtaining them. We have presented in detail QoE measurement for VoIP applications, ranging from subjective tests (ACR/DCR) to intrusive/non-intrusive objective measurements (e.g., PESQ and the E-model), together with practical approaches for using them in VoIP applications. We have
also presented subjective and objective video quality measurement. For subjective
video quality measurement, we have discussed ACR, ACR-HR, DCR, PC, SSCQE,

DSCQS and SDSCE methods. For objective video quality measurement, we have
illustrated FR, RR and NR quality measurement and summarised standardisation
efforts from VQEG and ITU-T on FR/RR/NR models.

6.9 Problems

1. What is the difference between QoS and QoE?


2. List five factors from both network and application levels which may affect
QoE.
3. Describe briefly three application mechanisms which can be used to improve
QoE.
4. What is the main cause of packet loss in IP networks?
5. Describe the characteristics of network packet loss. Does network packet loss occur randomly or in bursts? Why?
6. What is the difference between network packet loss and late arrival loss? Is
there any difference on their impact on the end-to-end voice quality?
7. How do you calculate the mean burst loss length given the average packet loss and conditional packet loss probabilities?
8. For a 2-state Markov model, how to calculate state probabilities (i.e., π1 for
state 1 representing a packet received and π0 for state 0 representing a packet
lost) from state transition probabilities (i.e., p00 , p01 , p11 and p10 )? Given
state transition probabilities p00 = 0.85 and p10 = 0.8, estimate the average
packet loss rate and mean burst loss length.
9. What is the difference between unconditional loss probability (ulp) and con-
ditional loss probability (clp)?
10. What is the difference between trace data collection using the “ping” tool and using real VoIP software (explain in terms of, for example, network protocol, packet size and route)?
11. Sketch a schematic diagram to show and explain the difference between intrusive speech quality measurement (e.g., PESQ) and non-intrusive speech quality measurement (e.g., the E-model).
12. Describe how PSNR (Peak Signal-to-Noise Ratio) works. Can we use PSNR as a full-reference perceived video quality assessment metric? Why?
13. Describe the main factors that contribute to end-to-end transmission delay at
sender side, network side and receiver side, respectively. What is the main
cause for delay variation? How does it affect end-to-end voice quality?
14. According to ITU-T P.910 and ITU-R BT.500, what is the maximum length
of a subjective test session? What is the minimum number of subjects (or
observers) required for a subjective video/picture quality test?
15. Describe four subjective video quality test methods and explain the differ-
ences between them.
16. Describe briefly video subjective measurement on ACR, ACR-HR, DCR, PC,
SSCQE, DSCQS and SDSCE methods.

17. Illustrate and describe briefly the FR, RR and NR video quality measurement
methods. What is the difference between the NR bitstream-model and the NR
hybrid-model?
18. Why do we need to develop non-intrusive (or no reference) speech/video qual-
ity assessment model?

References
1. AQUAVIT—assessment of quality for audio-visual signals over Internet and UMTS—
Deliverable 2: Methodology for subjective audio-visual quality evaluation in mobile and
IP networks. EURESCOM Project P905-PF (2000)
2. Allnatt J (1975) Subjective rating and apparent magnitude. Int J Man-Mach Stud 7:801–816
3. ANSI (1998) Testing the quality of connections having time varying impairments. ANSI
T1A1.7/98-031
4. Clark A (2001) Modeling the effects of burst packet loss and recency on subjective voice
quality. In: Proceedings of the 2nd IP-telephony workshop, Columbia University, New York,
USA, pp 123–127
5. Cole RG, Rosenbluth JH (2001) Voice over IP performance monitoring. Comput Commun
Rev 31(2):9–24
6. Ellis M, Perkins C (2010) Packet loss characteristics of IPTV-like traffic on residential links.
In: 7th IEEE, consumer communications and networking conference (CCNC), pp 1–5
7. ETSI (1996) Speech communication quality from mouth to ear of 3.1 kHz handset telephony
across networks. Tech. report. ETSI ETR250
8. ITU-R (2012) Methodology for the subjective assessment of the quality of televi-
sion pictures. ITU-R Recommendation BT.500-13. http://www.itu.int/rec/R-REC-BT.500-13-201201-I/en
9. ITU-T (1988) Quality of service and dependability vocabulary. ITU-T Recommendation
E.800
10. ITU-T (1992) Coding of speech at 16 kbit/s using low-delay code excited linear prediction.
ITU-T Recommendation G.728
11. ITU-T (1996) Coding of speech at 8 kbit/s using Conjugate-Structure Algebraic-Code-
Excited Linear-Prediction (CS-ACELP). ITU-T Recommendation G.729
12. ITU-T (1996) Methods for subjective determination of transmission quality. ITU-T Recom-
mendation P.800
13. ITU-T (1996) Objective quality measurement of telephone-band (300–3400 Hz) speech
codecs. ITU-T Recommendation P.861
14. ITU-T (1996) Subjective performance assessment of telephone-band and wideband digital
codecs. ITU-T Recommendation P.830
15. ITU-T (1999) Definition of categories of speech transmission quality. ITU-T Recommenda-
tion G.109
16. ITU-T (2000) Study of the relationship between instantaneous and overall subjective speech
quality for time-varying quality speech sequences: influence of a recency effect (Delayed
Contributions 9–18 May 2000). ITU-T Contribution COM12-D139
17. ITU-T (2001) Perceptual evaluation of speech quality (PESQ), an objective method for end-
to-end speech quality assessment of narrow-band telephone networks and speech codec.
ITU-T Recommendation P.862
18. ITU-T (2003) Mapping function for transforming P.862 raw result scores to MOS-LQO.
ITU-T Recommendation P.862.1
19. ITU-T (2004) Continuous evaluation of time varying speech quality. ITU-T Recommenda-
tion P.880
20. ITU-T (2004) Objective perceptual video quality measurement techniques for digital cable
television in the presence of a full reference. ITU-T Recommendation J.144

21. ITU-T (2004) Single-ended method for objective speech quality assessment in narrow-band
telephony applications. ITU-T Recommendation P.563
22. ITU-T (2006) Mean opinion score (MOS) terminology. ITU-T Recommendation P.800
23. ITU-T (2007) Conformance testing for narrowband voice over IP transmission quality as-
sessment models. ITU-T Recommendation P.564
24. ITU-T (2007) Opinion model for video-telephony applications. ITU-T Recommendation
G.1070
25. ITU-T (2007) Transmission impairments due to speech processing. ITU-T Recommendation
G.113
26. ITU-T (2007) Wideband extension to Recommendation P.862 for the assessment of wide-
band telephone networks and speech codecs. ITU-T Recommendation P.862.2
27. ITU-T (2008) Full reference and reduced reference calibration methods for video transmis-
sion systems with constant misalignment of spatial and temporal domains with constant gain
and offset. ITU-T Recommendation J.244
28. ITU-T (2008) Objective perceptual multimedia video quality measurement in the presence
of a full reference. ITU-T Recommendation J.247
29. ITU-T (2008) Perceptual visual quality measurement techniques for multimedia services
over digital cable television networks in the presence of a reduced bandwidth reference.
ITU-T Recommendation J.246
30. ITU-T (2008) Subjective video quality assessment methods for multimedia applications.
ITU-T Recommendation P.910. http://www.itu.int/rec/T-REC-P.910-200804-I
31. ITU-T (2008) Vocabulary for performance and quality of service, Amendment 2: New
definitions for inclusion in Recommendation ITU-T P.10/G.100. ITU-T Recommendation
G.100
32. ITU-T (2009) The E-model, a computational model for use in transmission planning. ITU-T
Recommendation G.107. http://www.itu.int/rec/T-REC-G.107
33. ITU-T (2010) Perceptual video quality measurement techniques for digital cable television
in the presence of a reduced reference. ITU-T Recommendation J.249
34. ITU-T (2011) Perceptual objective listening quality assessment. ITU-T Recommendation
P.863. http://www.itu.int/rec/T-REC-P.863-201101-I
35. ITU-T (2011) The E-model, a computational model for use in transmission planning. ITU-T
Recommendation G.107. http://www.itu.int/rec/T-REC-G.107
36. Klaue J, Rathke B, Wolisz A (2003) EvalVid—a framework for video transmission and
quality evaluation. In: Proc of the 13th international conference on modelling techniques
and tools for computer performance evaluation
37. Kurose JF, Ross KW (2010) Computer networking, a top–down approach, 5th edn. Pearson
Education, Boston ISBN-10:0-13-136548-7
38. Ma L, Li S, Zhang F, Ngan KN (2011) Reduced-reference image quality assessment using
reorganized DCT-based image representation. IEEE Trans Multimed 13(4):824–829
39. Mkwawa IH, Jammeh E, et al (2010) Feedback-free early VoIP quality adaptation scheme
in next generation networks. In: IEEE Globecom, pp 1–5
40. Mohamed S, Rubino G (2002) A study of real-time packet video quality using random
neural networks. IEEE Trans Circuits Syst Video Technol 12(12):1071–1083
41. Pinson MH, Wolf S (2004) A new standardized method for objectively measuring video
quality. IEEE Trans Broadcast 50(3):312–322
42. Ramjee R, Kurose J, et al (1994) Adaptive playout mechanisms for packetized audio appli-
cations in wide-area networks. In: Proc of IEEE Infocom, pp 680–688
43. Reibman AR, Vaishampayan VA, Sermadevi Y (2004) Quality monitoring of video over a
packet network. IEEE Trans Multimed 6(2):327–334
44. Sanneck H (2000) Packet loss recovery and control for voice transmission over the Internet.
PhD Dissertation, Technical University of Berlin
45. Schulzrinne H, Casner S (2003) RTP: a transport protocol for real-time applications. IETF
RFC 3550

46. Wang Z, Lu L, Bovik AC (2004) Video quality assessment based on structural distortion
measurement. Signal Process Image Commun 19(2):121–132
47. Winkler S, Mohandas P (2008) The evolution of video quality measurement: from PSNR to
hybrid metrics. IEEE Trans Broadcast 54(3):660–668
48. Winkler S, Mohandas P (2009) Video quality measurement standards: current status and
trends. In: ICICS 2009
49. Yajnik M, Kurose J, Towsley D (1995) Packet loss correlation in the MBone multicast network: experimental measurements and Markov chain models. Technical report, University
of Massachusetts, UM-CS-1995-115
7 IMS and Mobile VoIP

The Internet has evolved from a small network linking a few research centres into a massive network with billions of computers. The reason behind the growth of the
Internet has been its ability to provide very useful services such as World Wide Web,
email, instant messaging, VoIP and video conferencing. On the other hand, cellular
networks have experienced dramatic growth over the years. The cellular network
growth was not only due to its services such as voice and video calls and short mes-
saging services, but also because cellular network users can access the network from
virtually everywhere. These facts prompted 3GPP to come up with the idea of the
IP Multimedia Subsystem. The IP Multimedia Subsystem aims at merging cellular
networks and the Internet, two of the most successful infrastructures in telecommu-
nication. By merging the two infrastructures, the IP Multimedia Subsystem will be
able to provide ubiquitous cellular access to all services that are provided by the
Internet.

7.1 What Is IP Multimedia Subsystem?

The IP Multimedia Subsystem (IMS) is a standardised Next Generation Network (NGN) architecture defined by the Third Generation Partnership Project (3GPP) and 3GPP2 standards organisations [17]. It is based on IETF Internet protocols for Internet media services. IMS has also been embraced by other standardisation bodies, including ETSI [22] and TISPAN [39]. TISPAN adds support for fixed networks, since the standard was originally designed for mobile networks.

7.1.1 What Do We Need IMS for?

Cellular network users can already access Internet services through their data connections. So why do we need IMS if its idea is to offer Internet services over cellular networks, which already have full access to Internet services through the data connection? The answer is that IMS goes beyond the idea


of offering Internet services by providing Quality of Service, service charging


and integration of different services.
At the time of multimedia session establishment, IMS takes care of negotiating QoS provision so that users of multimedia services have an acceptable quality of experience. IMS also provides information about any service being used by customers; with this information, a service provider can decide how to charge for it, whether at a fixed rate, time-based, QoS-based, byte-based, or with any other type of charging.
IMS has standard interfaces available to service developers; with these interfaces, operators can use and integrate different services from different vendors.

7.1.2 IMS Architecture

The IMS architecture supports a range of services enabled by the Session Initiation Protocol (SIP). IMS consists of all the core network elements providing IP multimedia services such as audio, video, text and chat over the packet switched domain of the core network. The overall network is made up of two parts:
1. The access network which provides the wireless access points.
2. The core network which provides service control and the fixed connectivity to
other access points, to other fixed networks and to service resources such as databases and content delivery.
The IMS architecture is capable of supporting several application servers providing conventional telephony and non-telephony services such as instant messaging,
push to talk over cellular (PoC), multimedia streaming and multimedia messaging
(cf., Fig. 7.1).
The IMS services architecture consists of logical functions which are divided
into three layers.

Transport and Endpoint Layer


The transport and endpoint layer initiates and terminates SIP signalling messages in order to set up VoIP sessions, and provides bearer services such as converting voice from analogue to digital format and carrying it in IP packets as RTP streams. In this layer reside the media gateways, which convert VoIP bearer streams between the RTP format and the PSTN TDM format.
This layer also consists of media servers that provide several media related ser-
vices such as audio and video conferencing, announcements, speech recognition and
speech synthesis. The media server resources are available across all applications
such as voicemail and Voice Extensible Markup Language (VXML).

Session Control Layer


The session control layer is made up of the Call Session Control Function (CSCF).
This layer, through its CSCFs enables the registration of SIP user equipments and

Fig. 7.1 A simplified 3GPP IMS architecture

the appropriate routing of the SIP signalling messages to application servers. The
CSCFs work together with the transport and endpoint layer to provision the quality
of experience across all services.
The session control layer also includes the Home Subscriber Server (HSS) database, on which the service profile for each IMS user can be stored and retrieved. The end users' service profiles keep all of the user's service data and preferences
in a centralised manner. User profiles include IMS users’ current registration infor-
mation such as IP address, roaming, telephony and instant messaging services and
voicemail box options. The centralised approach of IMS users’ profiles enables ap-
plications to share the information and create unified presence information, blended
services and personal directories. This approach also simplifies the administration of
user information by making sure that the consistent views of active IMS subscribers
across all services are maintained.
In the session control layer also resides the Media Gateway Control Function (MGCF) [11], which inter-works SIP signalling with the circuit switched domain and controls the media gateways via protocols such as H.248 [18].

Application Server Layer


The application server layer is made up of the application servers, which provide
the IMS user with the service logic. The IMS architecture and SIP signalling are

capable of supporting various telephony and non telephony application servers. For
instance, SIP based applications have been mainly developed to run telephony and
Instant Messaging services.

Telephony Application Server The IMS architecture can support several application servers that implement SIP telephony services. The Telephony Application Server (TAS) is a Back-to-Back User Agent (B2BUA) that maintains the SIP call state. TAS comprises service logic which provides simple call processing services such as digit analysis and call setup, routing, waiting and forwarding. TAS's service logic can also invoke media servers in order to support SIP call progress announcements and tones. If SIP calls originate or terminate on the PSTN, the TAS will send a SIP signalling message to the MGCF so that the media gateways convert the PSTN voice bit stream into an RTP stream and direct the RTP stream to the IP address of the callee/caller user equipment.
TAS implements Advanced Intelligent Network (AIN) call trigger points. If a SIP call progresses to a trigger point, the TAS will suspend call processing and check the subscriber profile to find out whether any additional service is required for the call at that moment. If the subscriber profile requires a service from an application server, the TAS will format the IMS Service Control (ISC) message and hand over the SIP call control to the required application server.
For instance, one TAS can serve centrex business services such as speed dialling, call transfer, direct call, call park, divert, hold and barring, while another TAS can provide services such as call back, last number redial, reminder calls and number display. Multiple TASs can inter-operate via SIP signalling to perform several services for different IMS UEs.

IP Multimedia Services Switching Function IP Multimedia Services Switching


Function (IMSSF) provides the interface between IMS and Customized Applica-
tions for Mobile Networks Enhanced Logic (CAMEL) used in 2G mobile networks.
It enables IMS subscribers to access services provided by 2G mobile networks such
as calling name and Local Number Portability (LNP) services.

Supplemental Telephony Application Servers Telephony services such as prepaid billing, click to call and voicemail can be classified as supplemental telephony services, which can be served by independent application servers. These services can be provided at the beginning, in the middle or at the end of a call.

Non-Telephony Application Servers Non-telephony SIP application servers


such as instant messaging, location and push to talk can be incorporated in the IMS
architecture. By implementing SIP telephony and non-telephony, IMS is therefore,
able to provide blended communication services such as presence information and
location services.

Open Service Access Gateway The IMS architecture has the flexibility to incorporate additional services; this can be done by integrating them with SIP-based application servers. For instance, an organisation might want to initiate a call or an instant message from its back office delivery software if an order is a few minutes from being delivered to an address. This can be done by retrieving the location information of the courier. This is made possible in IMS by the well defined Parlay API. The API hides complex signalling protocols such as SS7 [7], CAMEL [5], ANSI-41, SIP and ISDN from non-telecommunication application developers and makes it simple to interact with telephony services such as IM and voice calls. The 3GPP and ETSI organisations have worked closely with the Parlay forum to define the Parlay API for accessing telephony networks. The interconnection between SIP and the Parlay API is defined in the Open Services Access Gateway (OSAGW). The 3GPP IMS architecture defines the OSAGW as part of the application server layer.
The IMS architecture, through SIP signalling, can support the provision of advanced broadband multimedia services such as Internet Protocol Television (IPTV), video on demand (VoD) and video telephony.
By the use of application servers, service providers have the flexibility to enable application developer partners located outside of the core IMS domain to develop new applications and integrate them with the IMS core domain via SIP signalling.

7.1.3 IMS Elements

The overall IMS network architecture is made up of access and core networks. The
core network has two domains.

Circuit Switched Domain Traditional circuit switching networks such as Public


Switched Telephone Network (PSTN) were designed to use dedicated circuits for
connecting end to end voice calls. Although circuit switching networks remain reli-
able and better in quality, they are highly inefficient in resource utilization. This is
because the circuits remain dedicated throughout the call duration and idle until the
next call.

Packet Switched Domain Packet switched connections are the opposite of circuit switched connections because they do not require dedicated resources. The information is broken down into packets which are routed independently through the network to their destinations, where they are reassembled into the original information streams.
The IMS comprises eight elements in the packet switched domain (cf., Fig. 7.2).
The IMS elements are logical functions which can be implemented in one or
different servers. They can be broken down into the following functionalities.

Call Session Control


Call Session Control Function The Call Session Control Function (CSCF) works as a routing engine, policy manager and enforcement point to facilitate the delivery of multiple real-time multimedia applications over an IP network. The CSCF is

Fig. 7.2 IMS elements

application aware and uses dynamic session information to manage network resources such as media gateways and edge devices. The CSCF provides advance allocation of resources depending on the application and end users' profiles. The CSCF is made up of the following functions:

Proxy CSCF The Proxy CSCF (P-CSCF) is the first contact point within the IMS
core network for the IMS subscribers. P-CSCF accepts SIP requests and serves them
internally in the home IMS or forwards them to external IMS. The main functions
of P-CSCF are [15],
• Forwarding SIP registration requests received from the UE to an entry point out-
lined either by using the home domain name or from the SIP registration header.
When SIP requests are sent to external IMS domain they might, if required, be
routed via a local network Interconnection Border Control Function (IBCF),
which will then forward the SIP request to the entry point of the external IMS
domain.
• Ensuring that SIP messages received from the UE contain the correct informa-
tion regarding the access network type currently used by the UE. The P-CSCF
shall not modify the SIP Request URI in the SIP INVITE message.
• Detecting and handling emergency session establishment through SIP requests.
• Maintaining a Security Association such as IPSec with the UE, therefore secur-
ing the access part of the SIP signalling plane.
• Performing SIP message compression and decompression.

• Authorizing bearer resources and quality of service management through the


Policy and Charging Control (PCC) framework.
• Generating Call Detail Records (CDRs).

Serving CSCF The Serving CSCF (S-CSCF) serves the session control services
for the IMS subscribers. S-CSCF maintains session states needed by the network
operator to support the established services. Within the same IMS domain, several S-CSCFs may be deployed with different functionalities. The S-CSCF can retrieve
subscriber’s security credentials and service information from HSS. S-CSCF can
also act as a SIP registrar for translating SIP messages to UE location. S-CSCF also
uses filter criteria to determine the call flow from the UE to the destined applica-
tion server or another UE. The S-CSCF is the main call session controller with the
following responsibilities [15],

• To accept SIP registration requests, making these requests available through the HSS and notifying IMS subscribers about SIP registration changes.
• To provide policy information, for the IMS Public User Identity (IMPUI) from
the HSS to the P-CSCF or UE during SIP registration session.
• To perform session control for registered subscribers, which includes:
– Access control, S-CSCF can accept or reject sessions after registration pro-
cess is complete.
– Translating, terminating, forwarding and forking SIP signalling messages
to other IMS nodes depending on the filtering criteria such as identity and
service. In this context, S-CSCF acts as a proxy server.
– The capabilities of terminating and generating SIP transactions. In this re-
gard S-CSCF behaves as a UA.
• Making sure that the content of SIP requests and responses conforms to the
IMS communication service definition such as available codecs between UEs
and enabled services.
• Maintaining and handling interactions through SIP ISC interface to application
servers.
• Providing IMS subscribers with service events related information such as no-
tification through tones and announcement. It also provides locations of media
resources and billing notification.

Interrogating CSCF The Interrogating CSCF (I-CSCF) is the network entry point into the IMS home domain. It is tasked with SIP message routing and the selection of S-CSCFs. The selection of S-CSCFs is done by the I-CSCF during the UE's IMS registration process. According to [15], the I-CSCF:

• Obtains IP addresses of S-CSCFs from HSS.


• Forwards SIP requests or responses to the S-CSCF.

Table 7.1 CS and IMS technologies

            Circuit switched                              IMS
Signaling   ISUP (ISDN User Part) over MTP                SIP over IP
            (Message Transfer Part)
Media       PCM                                           RTP

Fig. 7.3 PSTN gateway

• If the I-CSCF determines that the destination of the session is not within the
IMS home domain, it will either forward the SIP request or return with a failure
response code towards the originating UE.

PSTN Gateway
A PSTN gateway interfaces the IMS core network with PSTN circuit switched net-
works. Table 7.1 outlines the technologies used by both circuit switched networks
and IMS regarding signalling messages and transfer of media. Figure 7.3 depicts the
PSTN Gateway with its communication interfaces.

Signalling Gateway The Signalling Gateway (SGW) interfaces the IMS signalling plane with that of the CS network. It transforms lower layer protocols such as the Stream Control Transmission Protocol (SCTP) [38] into the Message Transfer Part (MTP), an SS7 protocol, in order to pass ISUP from the Media Gateway Control Function (MGCF) to the CS network.

Media Gateway Control Function The Media Gateway Control Function (MGCF) is used to inter-operate the IMS and the PSTN CS domain. The MGCF performs the signalling conversion from SIP to ISUP and vice versa. The MGCF further controls the Media Gateway (MGW) via H.248 over the Mn reference point and collects charging details. Other key features are the support of the SGW functionality and the control of the announcement application server. The MGCF and the MGW do not have to be deployed in the IMS network; both network elements are only required when the IMS has to inter-operate with the PSTN CS domain.

IP Multimedia Subsystem Media Gateway Function The IP Multimedia Subsystem Media Gateway Function (IMS-MGW) is deployed to interface the IMS media plane with the CS network media plane. The IMS-MGW may be used to support media conversion and payload processing, such as codec conversion and conference bridging.

Breakout Gateway
Breakout Gateway Control Function If the S-CSCF of the originating UE cannot route the SIP INVITE message to a terminating IMS, it passes the SIP session to the Breakout Gateway Control Function (BGCF). The BGCF is tasked with selecting the MGCF. The BGCF is always co-located with the S-CSCF. The BGCF is not compulsory in the IMS core network, but it is required whenever the IMS has to inter-operate with the PSTN CS domain.

Application and Media Servers


Application Server The Application Server (AS) provides value added multimedia services such as voicemail and PoC. It is responsible for hosting and executing VoIP application services. It is located in the IMS subscriber's home domain or in an external IMS domain. The AS interfaces with the S-CSCF using the SIP ISC interface [12]. A list of AS types is depicted in Fig. 7.4 and includes the following.
SIP AS: This is the native IMS application server. The SIP AS is a node which hosts the logic of a native IMS application and is capable of controlling the SIP session and applying a defined SIP service. The AS is proxied by the S-CSCF and is able to:
• Accept SIP requests for services, control, terminate or initiate a new SIP
transaction.
• Route SIP session to another UE or network.
The AS is able to communicate with the HSS via the Sh Diameter interface [16] in order to obtain information about UE subscriptions and services. In order to apply the established service logic, the AS is capable of communicating with other external nodes such as databases or network nodes. ISC stands for IP Multimedia Subsystem Service Control; it is the interface between the S-CSCF and AS platforms. The ISC allows a SIP session to be controlled by the AS; moreover, it allows the S-CSCF to delegate service control to the AS. The following functions apply to the ISC interface:
• The ISC interface must be able to convey charging information. The S-CSCF passes the IMS Charging Identifier (ICID) to the AS, and the latter uses the ICID to correlate with its own CDR produced during a SIP transaction.

Fig. 7.4 Three types of IMS


application servers

• The protocol on the ISC interface must allow the S-CSCF to differentiate
between SIP requests on Mw, Mm and Mg interfaces [14] and SIP Requests
on the ISC interface.

OSA-SCS: The Open Service Access–Service Capability Server (OSA-SCS) provides an interface towards OSA application servers using Parlay APIs.
IM-SSF: The IP Multimedia Service Switching Function (IM-SSF) is an interface between IMS and CAMEL.

Multimedia Resource Function The Multimedia Resource Function (MRF) provides a source of media in the IMS home network. It is divided into the following.

Multimedia Resource Function Controller Multimedia Resource Function Con-


troller (MRFC) [13] is a signalling plane node that acts as a SIP UA to the S-
CSCF. The MRFC handles the media stream resources in the Multimedia Re-
source Function Processor (MRFP) by using the H.248 interface. The MRFC
interprets information coming from AS and S-CSCF and controls the MRFP ac-
cordingly. MRFC also generates CDRs.
Multimedia Resource Function Processor Multimedia Resource Function Pro-
cessor (MRFP) is a media plane node that implements all media related functions
such as mixing and playing. It provides a wide range of functions for multimedia

resources, including provision of resources to be controlled by the MRFC, mix-


ing of incoming media streams and processing of media streams. The MRFP is
located in the IMS home network.

User Databases
Home Subscriber Server: Home Subscriber Server (HSS) is a database that
stores the IMS user profiles of all IMS subscribers. The user profile includes the
subscriber’s identity information such as IMPU, IMPI, IMSI, and MSISDN. It also
includes service provisioning information such as Filter Criteria (FC), user mobility
information such as CSCF IP addresses and charging information such as GCID,
SGSN-ID, and ECF address. The HSS also provides support for the IMS user au-
thentication. In the IMS, the HSS communicates with the HLR using the MAP
protocol in order to get the AKA security information from the HLR. The HSS
performs authentication and authorisation of the IMS user and also provides infor-
mation about the physical location of the IMS user.

Subscription Locator Function (SLF): The SLF locates the database that stores a subscriber's information in response to requests from the I-CSCF or AS.

Other
Signaling Gateway Function: Signaling Gateway Function (SGF) provides sig-
nalling conversion between SS7 and IP networks.

Policy Decision Function: Policy Decision Function (PDF) provides two main
functions,
• Policy based Network Resource Control: It is used to authorize and control resource usage for each GPRS/UMTS Secondary Packet Data Protocol (PDP) context. Policy based network resource control prevents the misuse of the network quality of service agreement; for instance, it will stop an IMS subscriber from using a Secondary PDP Context with higher QoS classes than agreed. Policy based network resource control also allows the network operator to limit the resource usage. The PDF acts as the Policy Decision Point (PDP) and the GGSN is its corresponding Policy Enforcement Point (PEP).
• Charging Correlation Support: This function supports the exchange of charging correlation information between the PDF and the GGSN. Charging correlation support is not a mandatory IMS function, but it is required whenever GPRS/UMTS Secondary PDP Contexts using QoS have to be controlled.

7.1.4 IMS Services

IMS can enable new converged and existing services for IMS subscribers using
wireline or wireless access. An IMS based network can provide the ability to blend

multimedia services that may currently be available in isolation. One such instance
is that an IMS subscriber could simultaneously have web portal based management
of personal call features and policies, presence capability to detect the availability of
individuals on the contact list, and the ability to share video clips with contacts while
on conversation with them, regardless of their network access types or devices.
Users are interested in multimedia services that allow them to share information
via their preferred delivery methods with several individuals across multiple access
networks at the time of their choosing. Examples of such services are as follows:
• Voice/Video calls: This IMS multimedia service involves point to point commu-
nication between IMS users.
• Conferencing: This IMS service includes video or audio conferencing which
allows two or more IMS users to interact simultaneously.
• Content sharing: This IMS service allows users to share any type of informa-
tion, such as video and data files with any contact.
• Push-to-talk Over Cellular: This service involves voice calls which are in half
duplex mode of communication. In half duplex mode, when one person speaks
another one listens.
• Multiparty Gaming: This IMS service allows IMS users to play games with each other in real time.
• Presence Information: This IMS service allows an IMS user to be notified
whenever a contact is available. Presence information displays the availability
and willingness of an IMS user to communicate.
• Messaging: There are several types of messaging in IMS. These are:
– Instant messaging: This is a real time text communication between two or
more IMS users.
– Universal messaging: This IMS service gives IMS users the ability to com-
pose, send, reply and forward voice messages to and from voicemail sys-
tem.
– Enhanced messaging: This is an enhancement to SMS for mobile networks.
A mobile phone is capable of sending and receiving messages that have
special formatting, pictures, animations, icons, and ring tones.
• Speech Synthesis and Recognition
– Text to Speech (TTS) conversion: Speech synthesis is an artificial produc-
tion of human speech. A system used for this purpose is termed a speech
synthesizer, and can be implemented in software or hardware.
– Speech Recognition: Also known as automatic speech recognition, it is the
process of converting a speech signal to a set of words, by means of an
algorithm implemented as a computer application software.

7.1.5 IMS Signalling and Bearer Traffic Interfaces

This section outlines the main IMS internal and external interfaces.
• Gm-interface: Interface between the UE and the P-CSCF [14] (cf., Fig. 7.5).

Fig. 7.5 Gm interface,


protocol layer model

Fig. 7.6 Gq interface,


protocol layer model

– SIP/SDP: As defined in [3GPP TS 24.229] [9], [RFC3261] [37] and


[RFC2327]/[RFC3266] [24, 28] with SIP extensions; for example,
[RFC3262]/[RFC3264]/[RFC3311]/[RFC3312] [20, 34–36] as necessary
for SBLP (Service Based Local Policy) support, [RFC3323], [RFC3325]
[26, 29] as required for SIP Privacy, etc.
– SigComp: As defined in [RFC3320] [33]. SigComp is only used when a
Mobile Access (e.g., GPRS/UMTS) network is used.
– TCP/UDP: As defined in [RFC793]/[RFC768] [30, 32].
– IP: As defined in [RFC791] [31] or [RFC2460] [21]. IPv4 and IPv6 are supported for the Gm-interface. IPSec (especially in transport mode), defined in [RFC2401] [27] and [RFC2406] [27], is additionally used over the Gm-interface when IMS AKA User Authentication is used. IPSec is not used on the Gm-interface when the User Authentication is performed using either HTTP Digest, Improved HTTP Digest, or Early IMS Authentication.
• Gq-interface: Interface between the PDF and the P-CSCF [6] (cf., Fig. 7.6).
– 3GPP Gq Ext.: It denotes the 3GPP extensions of the Diameter protocol for Gq.
– Diameter: As defined in [RFC3588] [19].
– TCP/UDP: As defined in [RFC793]/[RFC768] [30, 32].
– IP: As defined in [RFC791] [31]. Only an IPv4 transport is supported for
the Gq-interface. 3GPP defines the use of IPv6 at the Network Layer.
• ISC-interface: Interface between the S-CSCF and the AS (cf., Fig. 7.7). IPv4
and IPv6 are supported by the S-CSCF for the ISC-interface.
• Cx-interface: Interface between the HSS and the I-/S-CSCF [10] (cf., Fig. 7.8).
Only an IPv4 transport is supported for the Cx interface. 3GPP defines the use
of IPv6 at the Network Layer.

Fig. 7.7 ISC interface,


protocol layer model

Fig. 7.8 Cx interface,


protocol layer model

Fig. 7.9 Sh interface,


protocol layer model

Fig. 7.10 Mw interface,


protocol layer model

Fig. 7.11 Mg interface,


protocol layer model

• Sh-interface: Interface between the HSS and AS [16] (cf., Fig. 7.9). Only an
IPv4 transport is supported for the Sh interface. 3GPP defines the use of IPv6
at the Network Layer.
• Mw-interface: Interface between CSCFs [14] (cf., Fig. 7.10). IPv4 and IPv6 are
supported for the Mw interface.
• Mg-interface: Interface between the MGCF and the S-/I-CSCF [14] (cf.,
Fig. 7.11). 3GPP defines the use of IPv6 at the Network Layer.
More IMS internal and external interfaces are listed in Table 7.2; a minimal code sketch illustrating SIP transport on the Gm interface is given after the table.

Table 7.2 IMS signalling and bearer traffic interfaces

Interface   Short description                                                   Protocols
B&R         Back-up and restore interface between B&R server and IMS NEs       B&R
Bx          Offline charging interface between BS and IMS NEs                  FTP
Cx          Interface between HSS and CSCF                                     Cx over Diameter
DNS         Interface between DNS server and IMS NEs or between DNS servers    DNS
ISC         Interface between CSCF and AS                                      SIP
ISUP        Interface between MGCF and PSTN/CS network                         ISUP [4]
Gm          Interface between P-CSCF and UE                                    SIP, but via ACME
Gx/Go       Interface between PDF and GGSN                                     Gx over Diameter [1]
Gq          Interface between PDF and P-CSCF                                   Gq over Diameter
Ia          H.248 interface between PDF and BGF (Media Proxy)                  H.248
MAP         Interface between HSS and HLR                                      MAP
Gi/Mb       Interface between MGW and IP network                               RTP
Mg          Interface between MGCF and S-CSCF and between MGCF and I-CSCF      SIP
Mj          Interface between MGCF and BGCF (collocated with S-CSCF)           SIP
Mn          Interface between MGCF and MGW                                     H.248 [11]
Mw (Mm)     Interface between CSCFs                                            SIP
OAM         Interface for OAM purposes                                         SNMP, HTTP, NTP, Ut, CORBA, Telnet, (S)FTP, SSH, UDP
Ro          Interface between S-CSCF and OCS                                   Ro over Diameter
Rx          The term Rx interface is used in this document to refer to the     RADIUS [2]
            internal RADIUS interface to the HSS
SSA         Interface for SSA between web server and HSS                       XML/HTTP
Sh          Interface between HSS and AS                                       Sh over Diameter
TDM         Interface between MGW and PSTN/CS network                          TDM
Ut          Interface for subscriber self-administration between UE and AS     XML/HTTP, but only via VPN
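As a concrete illustration of the Gm protocol stack described above (SIP carried over UDP over IP, cf. Fig. 7.5), the following minimal sketch builds a bare SIP REGISTER request and sends it over UDP. The identities and addresses are hypothetical placeholders, and a real IMS registration would additionally involve P-CSCF discovery, IMS AKA authentication, SigComp and IPSec as noted earlier.

```python
import socket

# Hypothetical identities and addresses, for illustration only (not from the text).
IMPU = "sip:alice@example.ims.net"          # IMS Public User Identity
PCSCF_ADDR = ("192.0.2.1", 5060)            # placeholder P-CSCF address
LOCAL_IP, LOCAL_PORT = "192.0.2.10", 5060   # placeholder UE address

# A bare SIP REGISTER request; header lines are separated by CRLF as SIP requires.
register = (
    "REGISTER sip:example.ims.net SIP/2.0\r\n"
    f"Via: SIP/2.0/UDP {LOCAL_IP}:{LOCAL_PORT};branch=z9hG4bK776asdhds\r\n"
    "Max-Forwards: 70\r\n"
    f"From: <{IMPU}>;tag=1928301774\r\n"
    f"To: <{IMPU}>\r\n"
    "Call-ID: a84b4c76e66710\r\n"
    "CSeq: 1 REGISTER\r\n"
    f"Contact: <sip:alice@{LOCAL_IP}:{LOCAL_PORT}>\r\n"
    "Expires: 3600\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

print(register)
# Send the request as a single UDP datagram towards the (placeholder) P-CSCF.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
    sock.sendto(register.encode("ascii"), PCSCF_ADDR)
except OSError:
    pass  # no route to the placeholder address in this illustrative run
finally:
    sock.close()
```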

7.2 Mobile Access Networks

We will concentrate on the standards which are relevant in the area of the European
Union. This means that we will have a look at GSM [14] and UMTS [3] but also

Fig. 7.12 IMT-2000 terrestrial radio interfaces and categories

at the GSM to UMTS migration path. We will also look briefly at the latest 3GPP standard, termed Long Term Evolution (LTE).
In the quest to establish UMTS standards, a number of proposals were submitted
to ITU-R for evaluation and adoption within the IMT-2000 [25] family.
Figure 7.12 depicts the IMT-2000 terrestrial radio interfaces and categories. It
also points out various technologies and their affiliation by categories.
The 3GPP specifications group deals with cellular standards such as W-CDMA, TD-CDMA, TD-SCDMA and EDGE, while 3GPP2 handles CDMA2000. The latter will not be considered because it describes enhancements to the IS-95 protocol, which is not used in Europe. The aforementioned technologies and standards are considered to be of great significance to 3G mobile networks.

7.2.1 Cellular Standards

Wireless standards are divided into cellular and non-cellular standards. Mobile op-
erators are primarily interested in obtaining the best possible cellular system, so
the standards that are most important to this brand of system are discussed in this
book. Cellular systems are principally operated in the frequency range of 800 to
2200 MHz.
For many years, 2G systems have been operated successfully all over the world.
The anticipated demand for mobile data services providing high throughput, excel-
lent quality of service and improved system capacity prompted operators to begin
screening their options for the best choice in a 3G mobile system. A look at 3G re-
alizations reveals that all have related 2G predecessors. These can be classed in two
major families, GSM/UMTS and IS-95/CDMA2000. This book will only discuss
GSM/UMTS. This means that smooth evolution from 2G to 3G within each family
is possible, which is therefore also valid in the context of IMS/FMC configuration.

Fig. 7.13 Frequency/Time Division Multiple Access

7.2.2 The GSM Standard

The GSM system is commonly operated in 900 MHz and 1800 MHz bands but
450 MHz, 850 MHz, and 1900 MHz bands are also used. It requires a paired spec-
trum and supports a carrier bandwidth granularity of 200 kHz. The GSM radio inter-
face uses a combination of FDMA and TDMA (cf., Fig. 7.13). The TDMA structure
comprises eight time slots (bursts) per TDMA frame on each carrier providing a
gross bit rate of 22.8 kb/s per time slot or physical channel. Dedicated logical chan-
nels carry user data or signaling information, and they are mapped on time slots of
the TDMA frame structure on a given frequency carrier.
The basic GSM system supports voice bearers at 13 kb/s (full rate codec, FR)
or 6.5 kb/s (half rate codec, HR) as well as circuit switched (CS) data services at
300 bps up to 14.4 kb/s. A suitable combination of FR and HR channels/codecs for
voice can increase voice capacity by 50 % over FR channels alone.
Voice and data are transported via multiple 16 kb/s channels within the GSM Radio
Access Network (RAN) (i.e., between the network entities BTS and BSC of the Base
Station Subsystem (BSS)).
Transport systems such as PCM30 or PCM24 are realized for the Abis interface.
The GSM Core Network (cf., Fig. 7.14) provides circuit switched bearers for
voice and data at 64 kb/s granularity. The GSM system uses the Mobile Application
Part (MAP), which runs on signalling system No. 7 (SS7 of ITU-T) to exchange
mobility related information between the core network entities.

General Packet Radio Service


The basic GSM system was designed largely to cope with voice and CS data with
low bit rates. However, to support transmission of packet oriented information while
making efficient use of the air interface, the system must be able to accommodate

Fig. 7.14 GSM network architecture

flexible user rates for packet oriented data transfer using time slot assignment on
demand rather than via permanent occupation. To this end, General Packet Radio
Service (GPRS) introduces packet data functions to the radio interface, the radio access and the core networks (cf., Fig. 7.14).
SGSN and GGSN network nodes are introduced into the core network to support GPRS. The SGSN and GGSN also communicate with the HLR using a GSM MAP that has been extended with data related functions.
SGSN and GGSN are used exclusively for packet data transport and control.
Packet information is conveyed between SGSN and GGSN via the GPRS Tunnelling
Protocol (GTP) on top of an IP based network.
The basic frame structure of the radio interface remains unchanged, but one or
more time slots are allocated on demand to transmit one or more packets.
Four new coding sets (CS-1 to CS-4) are introduced to adapt the radio interface
to the given radio conditions and improve its performance. This provides maximum
user data rates per time slot as indicated in Table 7.3.
A number of time slots on a given carrier can be concatenated to form a GPRS channel. This provides potential data rates up to 171.2 kb/s (8 time slots for CS-4). In
addition, the system supports a limited number of QoS characteristics such as delay,
throughput and packet loss. Fast resource allocation on demand in core and radio
networks enables an “always on” terminal status.
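A quick calculation illustrates the figures quoted above. The per-slot rates below are those of Table 7.3; the helper simply multiplies the rate of the chosen coding set by the number of concatenated time slots (the function name is ours, introduced for illustration).

```python
# Net user data rate per time slot for the GPRS coding sets (kb/s), from Table 7.3.
GPRS_RATES_KBPS = {"CS-1": 9.05, "CS-2": 13.4, "CS-3": 15.6, "CS-4": 21.4}

def gprs_channel_rate(coding_set: str, time_slots: int) -> float:
    """Aggregate data rate when `time_slots` slots are concatenated into one GPRS channel."""
    return GPRS_RATES_KBPS[coding_set] * time_slots

for cs in GPRS_RATES_KBPS:
    print(f"{cs}: 1 slot = {gprs_channel_rate(cs, 1):5.2f} kb/s, "
          f"8 slots = {gprs_channel_rate(cs, 8):6.1f} kb/s")
# CS-4 with all 8 time slots gives the 171.2 kb/s maximum mentioned in the text.
```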
GPRS services may be charged on the basis of transported data volume rather
than channel occupation time. With GPRS the entire GSM system supports voice
and CS data as well as packet oriented data services.

Table 7.3 GPRS and EDGE net user data rates

Standard   Coding set   Modulation scheme   Net data rate per time slot (kb/s)
GPRS       CS-1         GMSK                 9.05
           CS-2         GMSK                13.4
           CS-3         GMSK                15.6
           CS-4         GMSK                21.4
EDGE       MCS-1        GMSK                 8.8
           MCS-2        GMSK                11.2
           MCS-3        GMSK                14.8
           MCS-4        GMSK                17.6
           MCS-5        8-PSK               22.4
           MCS-6        8-PSK               29.4
           MCS-7        8-PSK               44.8
           MCS-8        8-PSK               54.4
           MCS-9        8-PSK               59.2

GSM/EDGE Radio Access Network


GSM/EDGE Radio Access Network (GERAN) [8] standardization has two main
objectives,
• To align GSM/GPRS/EDGE and UMTS packet services mainly in terms of the
quality of service.
• To interface with the UMTS core network (i.e., the Iu interfaces, both Iu-CS and Iu-PS).
In summary, GERAN constitutes a radio access network featuring EDGE modu-
lation, coding modes and interconnection to UMTS core network, which makes it a
UMTS RAN that supports 3G services. The GERAN definition has several conse-
quences for packet data transmission,
• All QoS classes defined for UMTS also apply to GERAN (cf., Table 7.4).
• QoS class definitions also enable GERAN to support wideband AMR codec.
This allows voice capacity to be improved.
• “Seamless” service can be provided across both UTRAN and GERAN for CS
and PO services.
In addition to its aforementioned capabilities, GERAN provides a backward com-
patibility architecture to GSM/GPRS via A and Gb interfaces. In this case, only QoS
classes 3 and 4 (cf., Table 7.4) can be supported across the Gb interface.

7.2.3 The UMTS Standard

The Third Generation Partnership Project (3GPP) specification group defined the
Universal Mobile Telecommunication System (UMTS) in the past decade.

Table 7.4 UMTS QoS classes

Class  Traffic class    Class description                                    Example                        Relevant QoS requirements
1      Conversational   Preserves the time relation between entities         voice and video telephony,     low jitter, low delay
                        making up the stream; conversational pattern         video gaming, video
                        based on human perception; real-time                 conferencing
2      Streaming        Preserves the time relation between entities         multimedia video on demand,    low jitter
                        making up the stream; real-time                      webcast, real-time video
3      Interactive      Bounded response time; preserves the payload         web browsing, database         low round trip delay time,
                        content                                              retrieval                      low BER
4      Background       Preserves the payload content                        e-mail, SMS, file transfer     low BER

The first release of the specifications provides a new radio network archi-
tecture including W-CDMA (FDD) and TD-CDMA (TDD) radio technologies,
GSM/GPRS/EDGE enabled services both for the CS and PO domain, and inter-
working to GSM. Meanwhile, further features have been defined, such as Virtual Home Environment (VHE) and Open Services Architecture (OSA) evolution, full support of
Location Services (LCS) in CS and PO domains, an additional TDD mode (TD-
SCDMA), evolution of UTRAN transport (primarily IP support), multi-rate wide-
band voice codec, IP-based multimedia services (IMS), and high speed downlink
packet access (HSDPA).
As for GSM, the UMTS network architecture defines a core network (CN) and a
terrestrial radio access network (UTRAN) (cf., Fig. 7.15); the interface between the two is named Iu. Notably, this interface is also projected to connect to GERAN.
This approach is evolutionary, so the UMTS core network may integrate into
the GSM core network. This also applies to core network entities as well as to
functions and protocols across the network such as call processing (CP) and mo-
bility management (MM). It applies specifically to the GSM/UMTS mobile appli-
cation part (MAP), which is independent of the RAN. The integrated GSM and
UMTS core network entities facilitate development, provisioning of network enti-
ties and introduction of UMTS services. Multi-mode terminals for both GSM and
UMTS allow for smooth migration from GSM to UMTS. Based on CDMA tech-
nology, UTRAN has been designed specifically to satisfy the service requirements
of 3G.
CDMA's fundamental function (cf., Fig. 7.13) is to spread the actual user data signals over a broad frequency range, fending off multi-path fading. For this purpose, signals are multiplied with a unique bit sequence (spreading code) at a certain bit

Fig. 7.15 UMTS network architecture

rate (called chip rate). In this way users and channels are separated on the same
carrier. In contrast to a TDMA system, in a CDMA system other users within the
same cell generate most of the interference. This allows adjacent cells to use the
same frequency, which they usually do, and obviates the need for frequency plan-
ning.
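The spreading operation described above can be sketched in a few lines of Python. The spreading factor and code below are tiny illustrative values chosen for readability; real W-CDMA uses a 3.84 Mcps chip rate and much longer channelisation and scrambling codes.

```python
# Minimal direct-sequence spreading example with an illustrative spreading factor of 8.
# Data bits and chips are represented as +1/-1.
code_a = [+1, -1, +1, -1, +1, -1, +1, -1]   # illustrative spreading code for user A
data_a = [+1, -1, +1]                       # user A data bits

def spread(bits, code):
    """Multiply each data bit by the spreading code (one chip sequence per bit)."""
    return [b * c for b in bits for c in code]

def despread(chips, code):
    """Correlate the received chips with the code to recover the data bits."""
    sf = len(code)
    bits = []
    for i in range(0, len(chips), sf):
        corr = sum(ch * c for ch, c in zip(chips[i:i + sf], code))
        bits.append(+1 if corr > 0 else -1)
    return bits

tx = spread(data_a, code_a)            # chip rate = bit rate x spreading factor
print(despread(tx, code_a) == data_a)  # True: correlation recovers the data
```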
Time division principles may be used within a CDMA system much in the way of
FDMA systems. This has its benefits of allowing time division duplexing to be used
to separate uplink from downlink signals, this leads to creating radio transmission
technology suited for use in unpaired frequency bands.
The UTRAN system is designed to efficiently handle voice and data as well as
realtime and non realtime services over the same air interface (i.e., on the same
carrier), all at the same time and in any mix of data and voice. This variant is better
suited for data transport than GSM, and it provides a powerful platform for voice
traffic. A comprehensive channel structure was defined for the radio interface.
It consists of:
• Dedicated channels that may be assigned to one and only one mobile at any
given time.
• Common channels that may be used by all mobiles within this cell.
• Shared channels that are like common channels, but can only be used by an
assigned subset of mobiles at a given time. These channels are used for packet
data transfer.
The UTRAN system calls for several radio interface modes. Essentially, the def-
inition distinguishes between two modes of operations,
• Frequency division duplexing (UTRAN FDD) for operation in paired frequency
bands.

• Time division duplexing (UTRAN TDD) for operation in unpaired frequency


bands. This option allows for alternative chip rates and bandwidths to be imple-
mented.

FDD and TDD are harmonized, particularly in terms of how higher layers of the
radio network protocols and the Iu interface are used. In practice, the various modes
are hidden from the core network, meaning that the particulars of FDD and TDD
are limited to the UTRAN and to terminals. Both the operator and user benefit when
FDD and TDD are available in the same network,

• Unique UMTS service can be offered to the end users irrespective of the radio
access technology.
• The end user will enjoy the best possible coverage without giving a thought to
technical implications.
• The UMTS network can be deployed in such a way as to drive down costs.

Wideband CDMA
The UTRAN FDD mode employs Wideband CDMA (W-CDMA). This radio access technology uses direct sequence CDMA with a chip rate of 3.84 Mcps on a
2 × 5 MHz bandwidth carrier (uplink/downlink).
Due to the nature of the system, it usually operates with a frequency reuse of one,
meaning that all cells use the same carrier frequencies. As a consequence, the system
provides a special process that mitigates interference among the cells, especially at
cell borders. Soft handover (SHO) is used for CS traffic. Rather than using SHO,
PO traffic is switched in between two subsequent packets. In the course of an SHO,
a mobile terminal is connected to more than just one NodeB, depending on actual
radio conditions.
The RNC multiplies and combines signals sent to and received from the terminal.
Though SHO is primarily a macro diversity feature, it also provides the basis for
smooth and seamless inter-cell handover within the same frequency band. Softer handover is used between sectors of one base station.
This enhances efficiency, but it requires improved digital signal processing capa-
bilities within the base station. Its effect is comparable to that of an SHO. Again,
other users within the same cell generate the majority of interference. This means
that a CDMA system’s cell size depends on the actual cell load. This effect is called
cell breathing. To address this issue and ensure cell stability, CDMA networks
should operate with a nominal cell load of some 50 %, leaving margin for inter-
ference and allowing for some flexibility under peak load conditions. More than one
carrier can be used within a given cell or cell sector. Hard handover capability is provided to hand over between these carriers. Separate carriers do not have common channels; therefore, they operate on their own.
The radio network controller (RNC) coordinates all carriers within a given area
such as handling of admission control. W-CDMA can be used in all environments
such as vehicular, pedestrian and indoor, and for all kinds of traffic. However, by its
very nature it is primarily suited for symmetric traffic using macro or micro cells in
areas with medium population density.

Time Division CDMA


The UTRAN TDD (time division duplex) employs time division CDMA (TD-
CDMA) with a chip rate of 3.84 Mcps on a 5 MHz bandwidth carrier.
This technology uses CDMA as well as TDMA to separate the various commu-
nication channels, which is why any given radio resource is denoted by time slot and
code. Time slots can be allocated to carry either downlink or uplink channels, en-
abling this technology to operate within an unpaired band. In other words, a duplex
frequency band is not required.
That makes the minimum spectrum requirement just half the bandwidth of W-
CDMA, that is, one 5 MHz block. Moreover, TD-CDMA employs a joint detec-
tion algorithm. As its name suggests, it recognizes and decodes multiple channels
jointly. This method eliminates intra-cell interference and helps boost system capac-
ity.
The absence of intra-cell interference makes the system behave more like a
TDMA system. It does not suffer from cell breathing, nor does it require SHO ca-
pability. That makes it particularly valuable in densely populated urban areas where
indoor (pico environment) and outdoor (micro environment) solutions must cope
with heavy data loads using the smallest cells.
Moreover, since uplink and downlink time slots can be assigned separately, TD-
CDMA is well suited for asymmetric traffic.

Time Division Synchronous CDMA


A technology similar to TD-CDMA, time division synchronous CDMA (TD-SCDMA) is different in that it uses special methods to maintain uplink synchronicity
and avoid excessive guarding periods in the frame structure. It implements all the
functions of TD-CDMA (in particular the joint detection algorithm), but it is based
on a chip rate of 1.28 Mcps on a 1.6 MHz bandwidth carrier, which amounts to a
third of the TD-CDMA chip rate and carrier (cf., Fig. 7.16).
This means three carriers may be used within the given spectrum’s 5 MHz
band. This affords operators greater flexibility. The system may be operated with
frequency reuse of one, two or three. By the same token, the system could also
be used in places where a contiguous 5 MHz block of the spectrum is unavail-
able.
TD-SCDMA technology is actually a component part of two different stan-
dards,

• The 3GPP UTRAN standard: It is UTRAN TDD's 1.28 Mcps option.
Fig. 7.16 TD-CDMA/TD-SCDMA spectrum usage

• The CWTS TSM standard: It is complemented with GSM radio procedures and
embedded entirely in the GSM BSS and inter-worked into a GSM core network
using GSM A and Gb interfaces.

High Speed Downlink Packet Access


Designed to enhance the UTRAN system, high speed downlink packet access (HSDPA) endows the downlink with user data rates up to 10 Mbps. This feature can be applied to UTRAN FDD and TDD, that is, to W-CDMA, TD-CDMA, and TD-SCDMA.
High user data rates are achieved by applying a higher level modulation scheme
(16QAM), including adapted coding rates with turbo codes. Since these modulation
schemes require a better C/I ratio, the range of such a high speed radio link will
shrink, as will, by extension, cell size.
This means HSDPA will primarily be used in scenarios with high traffic density
or peak user data rates. To achieve both high user data rates and excellent transmis-
sion quality, HSDPA defines a number of functions such as:
• Adaptive modulation and coding (AMC)
• Hybrid ARQ (H-ARQ)
• Fast cell selection (FCS)
• Standalone downlink shared channel (S-DSCH)
In general, this concept is similar to the downlink shared channel (DSCH) avail-
able in UTRAN. It allows the same physical channel to be shared by many mobile
users on a statistical basis.
The S-DSCH feature calls for a configuration in which an entire 5 MHz downlink
carrier is allocated to the DSCH and used exclusively for HSDPA. Typically this
would be an amendment to a UTRAN FDD system.

7.2.4 Long-Term Evolution

The latest cellular network standard being deployed around the world is an evolution of 3G towards an evolved radio access. This evolution is widely known as Long Term Evolution (LTE).
Table 7.5 Interruption time requirements for LTE mobility

             Non-real-time (ms)   Real-time (ms)
LTE to UMTS  500                  300
LTE to GSM   500                  300

Unlike 3G, LTE uses Orthogonal Frequency-division Multiplexing (OFDM) for the downlink and single carrier Frequency-division Multiple Access (SC-FDMA) for the uplink. Its core network is all IP based, instead of the mixed CS and IP core network of 3G.
LTE aims to provide seamless IP connectivity between the UE and the Packet Data Network (PDN), without disruption to the end users' applications during mobility. LTE is also accompanied by an evolution of the non-radio systems, known as the System Architecture Evolution (SAE), which includes the Evolved Packet Core (EPC) network. Therefore, LTE and SAE comprise the Evolved Packet System (EPS).
One main reason for the evolution of the 3G mobile network is for cellular networks to provide significantly higher data rates compared to what is available with the current HSDPA standards. The requirements of LTE are stipulated by 3GPP in TR 25.913 and include,

• Capabilities: The downlink and uplink maximum data rates are 100 Mbps and 50 Mbps, respectively, when operating in a 20 MHz spectrum. The user plane latency requirement is denoted as the time it takes to transmit an IP packet from
the mobile terminal to the RAN edge node or vice versa. The one way trans-
mission time should not exceed 5 ms in an unloaded network (i.e., no other
terminals are present in the cell).
• System performance: The LTE system performance is in terms of user through-
put, mobility, spectrum efficiency and coverage.
– The LTE user throughput target is categorized into the average and at the
fifth percentile of the user distribution.
– The LTE spectrum efficiency is defined as the system throughput per cell in
bit/s/MHz/cell.
– The mobility targets focus on the mobile terminal's speed. Maximum performance is achievable at low mobile terminal speeds of 0–15 km/h. A slight degradation is expected at higher speeds. For mobile terminal speeds
up to 120 km/h, the system should provide high performance and for speeds
above 120 km/h, the system should be able to maintain its connection with
the mobile terminals.
• Deployment aspects: There are two deployment scenarios: the first is when LTE is deployed stand-alone, and the second is when LTE is deployed together with UMTS and/or GSM. When LTE coexists with other 3GPP systems, there
are requirements on the acceptable interruption time in the mobility manage-
ment (cf., Table 7.5).
Fig. 7.17 LTE architecture

Overall Architecture
EPS provides LTE users with the IP connectivity to access the Internet and other
IP based services such as VoIP via the PDN. Figure 7.17 depicts the overall archi-
tecture of the EPS which includes the network elements as well as the standardized
interfaces.
The architecture is composed of the core network (CN) and the access network (AN). The CN, which is the EPC, has several logical nodes, while the AN (i.e., E-UTRAN) is made up of only one node type, the eNodeB, which connects the UE to the CN.
The CN has the following main nodes,
• PDN Gateway: This gateway provides the IP connectivity from the UE to the
Internet. This means that the PDN Gateway is the point of entry and exit of the
IP traffic for the UE.
• Serving Gateway: When UE moves between eNodeBs, the Serving Gateway
serves as an anchor for the data bearers.
• MME: Mobility Management Entity (MME) is responsible for processing sig-
naling messages between CN and UE. Other functions are to establish, maintain
and release bearers. It is also involved in the security between the UE and the
network. It interfaces with the SGSN of 2G and 3G networks during the handover process.
• HSS: It serves as a central database where LTE users' profiles are stored.

7.3 Summary

UMTS networks have been rolled out in steps. Deployment kicked off in urban ar-
eas where a specific demand for data services was anticipated. Next came suburban
areas, and so on down the line. In order to provide full coverage for service conti-
nuity, cellular networks and terminals are designed to enable roaming and handover
between GSM/GPRS and UMTS.
In general, the migration of the overall architecture from GSM to UMTS went smoothly, particularly in the core network. A look at the core migration issues is as follows:
• Terminals: Handset manufacturers were committed to providing GSM/UMTS
dual-mode terminals.
• Radio network: UMTS technologies were designed specifically to use a band-
width of 5 MHz (TD-SCDMA occupies 1.6 MHz only) of an unpaired or
2 × 5 MHz of a paired spectrum to efficiently support high user data rate ser-
vices in highly mobile environments. A new radio spectrum was allocated for
3G, providing the basis for introducing new radio technologies without requir-
ing spectrum to be re-farmed and legacy services and equipment to be replaced.
The entire 3G spectrum was subdivided into 5 MHz bands. UTRAN was in-
troduced alongside GSM RAN owing to its extended functionality and band-
width.
• Core network: GSM and UMTS define the same core network architecture.
GPRS is part of both GSM and UMTS.
The implications for UMTS service introduction were clear: the legacy GSM core network could be upgraded to operate both GSM and UMTS within an integrated UMTS core network. This meant that operators could offer wide area coverage via GSM/GPRS and gradually build up their UMTS radio access infrastructure. At the same time, GPRS nodes and GSM MSCs were upgraded to support UMTS services and interconnect the UMTS radio network via the new Iu interface.
The rollout of LTE networks is now gathering pace around the world. A report by the Global Mobile Suppliers Association (GSA) confirms that 338 mobile operators in 101 countries have committed to start commercial LTE network deployments or are in the process of conducting trials and testing [23]. The GSA has confirmed that LTE is the fastest growing mobile technology ever.

7.4 Problems

1. The caller dials a number by using the SIP URI of another IMS subscriber. At which call session control function does the call first make contact?
2. What does the P-CSCF first check when it receives a call setup request from the IMS subscriber?
3. If the SIP URI is found by the P-CSCF through the DNS during a call setup,
what will be the next step performed by the P-CSCF?
4. What is the relationship between the S-CSCF and the HSS?
5. Explain the importance of the initial Filter criteria (iFC).
6. Explain a scenario where the initial Filter criteria would be applied.
7. Which interface and protocol are used to connect the HSS and the S-CSCF?
8. Which interface and protocol are used to connect the application server and the S-CSCF?
9. I want to provide video on demand (VoD) services to IMS subscribers, where
would I place it in the IMS architecture? Explain your answer.
10. Why does GSM use TDMA and not CDMA?
11. I have a GSM mobile handset; can I use it in all countries? Explain your answer.

References

1. 3GPP Policy and charging control over Gx reference point. TS 29.212, 3rd Generation Part-
nership Project (3GPP). http://www.3gpp.org/ftp/Specs/html-info/29212.htm
2. 3GPP Policy and charging control over Rx reference point. TS 29.214, 3rd Generation Part-
nership Project (3GPP). http://www.3gpp.org/ftp/Specs/html-info/29214.htm
3. 3GPP (2001) UMTS Phase 1. TS 22.100, 3rd Generation Partnership Project (3GPP).
http://www.3gpp.org/ftp/Specs/html-info/22100.htm
4. 3GPP (2005) Application of ISDN User Part (ISUP) Version 3 for the Integrated Ser-
vices Digital Network (ISDN)—Public Land Mobile Network (PLMN) signalling inter-
face; Part 1: Protocol specification. TS 09.14, 3rd Generation Partnership Project (3GPP).
http://www.3gpp.org/ftp/Specs/html-info/0914.htm
5. 3GPP (2005) Customized Applications for Mobile network Enhanced Logic (CAMEL);
Service description; Stage 1. TS 22.078, 3rd Generation Partnership Project (3GPP).
http://www.3gpp.org/ftp/Specs/html-info/22078.htm
6. 3GPP (2007) Policy control over Gq interface. TS 29.209, 3rd Generation Partnership
Project (3GPP). http://www.3gpp.org/ftp/Specs/html-info/29209.htm
7. 3GPP (2007) Signaling System No 7 (SS7) signalling transport in core network; Stage 3.
TS 29.202, 3rd Generation Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/
html-info/29202.htm
8. 3GPP (2007) Technical specifications and technical reports for a GERAN-based 3GPP sys-
tem. TS 01.01, 3rd Generation Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/
html-info/0101.htm
9. 3GPP (2008) Internet Protocol (IP) multimedia call control protocol based on Session Initi-
ation Protocol (SIP) and Session Description Protocol (SDP); Stage 3. TS 24.229, 3rd Gen-
eration Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/html-info/24229.htm
10. 3GPP (2008) IP Multimedia (IM) subsystem Cx and Dx interfaces; Signaling flows and mes-
sage contents. TS 29.228, 3rd Generation Partnership Project (3GPP). http://www.3gpp.org/
ftp/Specs/html-info/29228.htm
11. 3GPP (2008) Media Gateway Control Function (MGCF)—IM Media Gateway (IM-MGW);
Mn interface. TS 29.332, 3rd Generation Partnership Project (3GPP). http://www.3gpp.org/
ftp/Specs/html-info/29332.htm
12. 3GPP (2008) Mobile Radio Interface NAS signalling—SIP translation/conversion. TS
29.292, 3rd Generation Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/html-
info/29292.htm
13. 3GPP (2008) Multimedia Resource Function Controller (MRFC)—Multimedia Resource
Function Processor (MRFP) Mp interface; Procedures descriptions. TS 23.333, 3rd Gener-
ation Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/html-info/23333.htm
14. 3GPP (2008) Network architecture. TS 23.002, 3rd Generation Partnership Project (3GPP).
http://www.3gpp.org/ftp/Specs/html-info/23002.htm
15. 3GPP (2008) Service requirements for the Internet Protocol (IP) multimedia core net-
work subsystem (IMS); Stage 1. TS 22.228, 3rd Generation Partnership Project (3GPP).
http://www.3gpp.org/ftp/Specs/html-info/22228.htm
16. 3GPP (2008) Sh interface based on the Diameter protocol; Protocol details. TS
29.329, 3rd Generation Partnership Project (3GPP). http://www.3gpp.org/ftp/Specs/
html-info/29329.htm
17. 3GPP (2012) 3rd generation partnership project. http://www.3gpp.org. [Online; accessed
15-August-2012]
18. Blatherwick P, Bell R, Holland P (2001) Megaco IP phone media gateway application pro-
file. RFC 3054, Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc3054.txt
19. Calhoun P, Loughney J, Guttman E, Zorn G, Arkko J (2003) Diameter base protocol. RFC
3588, Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc3588.txt
20. Camarillo G, Marshall W, Rosenberg J (2002) Integration of resource management and
Session Initiation Protocol (SIP). RFC 3312, Internet Engineering Task Force. http://www.
rfc-editor.org/rfc/rfc3312.txt
21. Deering S, Hinden R (1998) Internet Protocol, version 6 (IPv6) specification. RFC 2460,
Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc2460.txt
22. ETSI (2012) The European Telecommunications Standards Institute (ETSI). http://www.
etsi.org/WebSite/AboutETSI/AboutEtsi.aspx. [Online; accessed 15-August-2012]
23. GSA (2011) GSA—the global mobile suppliers association. http://www.gsacom.com. [On-
line; accessed 25-August-2012]
24. Handley M, Jacobson V (1998) SDP: Session Description Protocol. RFC 2327, Internet
Engineering Task Force. http://www.rfc-editor.org/rfc/rfc2327.txt
25. ITU (2006) Detailed specifications of the radio interfaces of the international mobile
telecommunications-2000 (IMT-2000). Recommendation 1457, International Communica-
tion Union
26. Jennings C, Peterson J, Watson M (2002) Private extensions to the Session Initiation Pro-
tocol (SIP) for asserted identity within trusted networks. RFC 3325, Internet Engineering
Task Force. http://www.rfc-editor.org/rfc/rfc3325.txt
27. Kent S, Atkinson R (1998) Security architecture for the Internet Protocol. RFC 2401, Inter-
net Engineering Task Force. http://www.rfc-editor.org/rfc/rfc2401.txt
28. Olson S, Camarillo G, Roach AB (2002) Support for IPv6 in Session Description Pro-
tocol (SDP). RFC 3266, Internet Engineering Task Force. http://www.rfc-editor.org/rfc/
rfc3266.txt
29. Peterson J (2002) A privacy mechanism for the Session Initiation Protocol (SIP). RFC 3323,
Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc3323.txt
30. Postel J (1980) User Datagram Protocol. RFC 0768, Internet Engineering Task Force.
http://www.rfc-editor.org/rfc/rfc768.txt
31. Postel J (1981) Internet Protocol. RFC 0791, Internet Engineering Task Force. http://www.
rfc-editor.org/rfc/rfc791.txt
32. Postel J (1981) Transmission Control Protocol. RFC 0793, Internet Engineering Task Force.
http://www.rfc-editor.org/rfc/rfc793.txt
33. Price R, Bormann C, Christoffersson J, Hannu H, Liu Z, Rosenberg J (2003) Signaling Com-
pression (SigComp). RFC 3320, Internet Engineering Task Force. http://www.rfc-editor.org/
rfc/rfc3320.txt
34. Rosenberg J (2002) The Session Initiation Protocol (SIP) UPDATE method. RFC 3311,
Internet Engineering Task Force. http://www.rfc-editor.org/rfc/rfc3311.txt
35. Rosenberg J, Schulzrinne H (2002) An offer/answer model with Session Description
Protocol (SDP). RFC 3264, Internet Engineering Task Force. http://www.rfc-editor.org/
rfc/rfc3264.txt
36. Rosenberg J, Schulzrinne H (2002) Reliability of provisional responses in Session Initia-
tion Protocol (SIP). RFC 3262, Internet Engineering Task Force. http://www.rfc-editor.org/
rfc/rfc3262.txt
37. Rosenberg J, Schulzrinne H, Camarillo G, Johnston A, Peterson J, Sparks R, Handley M,


Schooler E (2002) SIP: Session Initiation Protocol. RFC 3261, Internet Engineering Task
Force. http://www.rfc-editor.org/rfc/rfc3261.txt
38. Stewart R, Xie Q, Morneault K, Sharp C, Schwarzbauer H, Taylor T, Rytina I, Kalla M,
Zhang L, Paxson V (2000) Stream Control Transmission Protocol. RFC 2960, Internet En-
gineering Task Force. http://www.rfc-editor.org/rfc/rfc2960.txt
39. TISPAN (2012) The telecoms & internet converged services & protocols for advanced net-
works (tispan). http://www.tispan.org. [Online; accessed 15-August-2012]
8 Case Study 1—Building Up a VoIP System Based on Asterisk

In this case study, we will set up and configure a VoIP testbed using the Asterisk PBX and the X-Lite 4 soft phone. Upon completion of this lab, you will be able to add SIP phones, configure basic dial-plans, set up a SIP soft phone, start and stop Asterisk, make voice calls between SIP soft phones, make video calls between SIP soft phones, and make voice calls between SIP soft phones and analogue phones.

8.1 What is Asterisk?

Asterisk is an open source private branch exchange (PBX) software that provides
all of the features expected from a PBX and more. Asterisk implements VoIP and
can inter-operate with almost all standard telephony equipment using relatively in-
expensive hardware. Digium [3] is the creator and sponsor of the Asterisk project.
SIP, Inter-Asterisk Exchange (IAX2) and H323 are the main control protocols used
in Asterisk. Figure 8.1 illustrates the Asterisk architecture.
Asterisk architecture is built on modules whereby each module is a loadable
component with a specific function. Asterisk main modules are:
1. Channel modules: They handle different types of connection such as (SIP,
Digium Asterisk Hardware Device Interface (DAHDI), H323, IAX2).
2. Codec translator modules: They support audio and video encoding and decod-
ing formats such as G711, GSM, H263 and H264.
3. Application modules: They support various functions such as voicemail, call
detail recording (CDR), dialplans and SMS.
4. File format modules: They handle writing and reading of different file formats
in the Asterisk system.
Asterisk will turn a computer into a communications server. Asterisk powers IP
PBX systems, VoIP gateways, conference servers and more. It is mainly used by
small and large businesses, call centers, carriers and government institutions around
the world.


Fig. 8.1 Asterisk architecture

Table 8.1 Sample Asterisk channel modules

Name         Function
chan_dahdi   It provides connections to and from PSTN cards which use DAHDI drivers
chan_sip     It is an essential channel which enables SIP calls
chan_iax2    It provides IAX2 calls

8.1.1 Channel Modules

Channel modules enable Asterisk to make calls. Each channel module is specific to a channel type. Some useful channel modules used in this Lab are listed in Table 8.1.

8.1.2 Codec Translator Modules

Codec translator modules are essential for transcoding media formats in a scenario
whereby communication end points do not have compatible codecs. Table 8.2 lists
useful codec translators used in this Lab.
Table 8.2 Sample Asterisk codec translator modules

Name          Function
codec_alaw    It is an essential A-law PCM codec used over PSTN communication (not used in USA/Canada)
codec_ulaw    It is an essential Mu-law PCM codec used over PSTN communication in USA/Canada
codec_dahdi   It is an essential codec used with Digium cards
codec_a_mu    It transcodes between A-law and Mu-law PCM codecs
codec_gsm     It is a low bitrate codec useful over Global System for Mobile Communications (GSM) networks

Table 8.3 Sample Asterisk application modules


Name Function
app_dial It makes phone calls by connecting different channels together
app_voicemail It enables voicemail
app_playback It plays a media file such as mp3 to the channel

Table 8.4 Sample Asterisk file format modules


Name Function
format_gsm It plays audio files stored in GSM format
format_pcm It plays audio files stored in various PCM formats
format_h263 It plays video files stored in H.263 format
format_wav_gsm It plays audio files stored in GSM format in a WAV container

8.1.3 Application Modules

The most popular applications in Asterisk are dialplan applications. Dialplan applications are configured in the extensions.conf file, which defines incoming and outgoing call routings. Table 8.3 lists some useful and essential applications used in this Lab.

8.1.4 File Format Modules

The file format module is responsible for transcoding media files during recording and playback. For instance, if a recorded file is in GSM format and playback is over a PSTN channel, then a GSM to PCM interpreter is required. Table 8.4 lists some useful file format modules used in this Lab.
8.1.5 Installing Asterisk

Asterisk is suited to work under Linux. Most Linux distributions such as Fedora, Debian, Ubuntu and SUSE have Asterisk as part of their packages, and you can simply use these packages to install Asterisk. You can download Asterisk from [1] if your Linux distribution does not have an Asterisk package in its repositories.
Together with Asterisk, DAHDI and libpri are essential for Asterisk to work properly with ISDN or the PSTN. The libpri library enables Asterisk to communicate with ISDN connections. If you plan to connect to an ISDN line, this library is recommended on your system. The DAHDI library enables Asterisk to communicate with analog and digital telephone lines, including communication with the PSTN. DAHDI and libpri are recommended to be installed in your system even if you don't plan to use analog or digital telephone lines.
DAHDI is the abbreviation of Digium Asterisk Hardware Device Interface. It is a set of drivers and utilities for analog and digital telephony cards, such as TDM cards manufactured by Digium. The DAHDI drivers are independent of Asterisk, and can be used by other communication applications. DAHDI originates from Zaptel, which was created by the Zapata Telephony Project.
As Ubuntu is one of the most popular Linux distributions, the following instructions demonstrate how to install Asterisk in Ubuntu.
1. It is essential to update Ubuntu system and reboot
• sudo apt-get update
• sudo apt-get upgrade
• sudo reboot
2. It is essential to synchronise time in communication; therefore, make sure the Network Time Protocol (NTP) is installed
• sudo apt-get install ntp
3. It is important to install the software dependencies, in Ubuntu xml, ncurses, ssl, and subversion for getting the Asterisk source code via the subversion system
• sudo apt-get install subversion
• sudo apt-get install libssl-dev libncurses5-dev libxml2-dev
4. Create a directory where Asterisk source files will reside
• sudo mkdir /src/asterisk
5. Go to the created Asterisk directory
• cd /src/asterisk
6. Download Asterisk source code via subversion
• sudo svn co http://svn.asterisk.org/svn/asterisk/branches/1.6
7. Build and install Asterisk
• cd 1.6
• sudo ./configure
• sudo make
• sudo make install
• sudo make config
At this stage your system is ready to configure dialplans, channels and any additional modules such as sounds and music on hold. The full documentation on how to install Asterisk, libpri and DAHDI is available at [1].
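As a quick sanity check (this verification step is not part of the official installation guide and assumes the init scripts installed by "make config"), Asterisk can be started and queried from the shell:
Linux:~# /etc/init.d/asterisk start
Linux:~# asterisk -rx "core show version"
If the installation succeeded, the second command prints the installed Asterisk version.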

8.2 What Is X-Lite 4

X-Lite 4 is a proprietary freeware VoIP soft phone that uses SIP for VoIP session setup and termination. It combines voice calls, video calls and Instant Messaging
in a simple interface. X-Lite 4 is developed by CounterPath [2]. The screen shot of
X-Lite 4 is depicted in Fig. 8.2.
X-Lite 4 has basic functionalities which include:
• Call display and message waiting indicator.
• Speaker phone and mute icon.
• Hold and redial.
• Call history for incoming, outgoing and missed calls.
The X-Lite 4 supports the following features and functions:
• Video call support.
• Instant messaging and presence via SIMPLE protocol support.
• Contact list support.
• Automatic detection and configuration of voice and video devices.
• Support of echo cancelation, voice activity detection (VAD) and automatic gain
control.
• The following audio codecs are supported: Broadvoice-32, Broadvoice-32 FEC, DVI4, DVI4 Wideband, G711aLaw, G711uLaw, GSM, L16 PCM Wideband, iLBC, Speex, Speex FEC, Speex Wideband, Speex Wideband FEC.
• DTMF (RFC 2833 [7], inband DTMF or SIP INFO messages) support.

8.2.1 Using X-Lite

X-Lite can be started from the "Windows Start Menu" if it is installed. If it is not installed, it can be downloaded and installed from [2].

X-Lite Menu
• Softphone. Under the Softphone Menu (cf., Fig. 8.2) the following can be con-
figured.
Fig. 8.2 X-Lite 4 screen shot

– Accounts. This item will configure how X-Lite communicates with your
Asterisk server. The settings such as X-Lite username, password, display
name and SIP server domain or proxy IP address can be configured under
this item.
– Preferences. Preferences such as audio and video codecs can be configured
in this item, moreover, audio and video devices can be configured under
this item.
– Exit. This item will let you exit X-Lite; pressing Ctrl + Q will also exit X-Lite.
• View. This menu item will change how X-Lite looks.
• Contacts. This menu item will let you manage your contacts such as adding and
modifying contact list.
• Actions. Depending on the X-Lite state, this menu item will let you perform several actions. For instance, if a contact is selected, actions such as placing a call to that contact become available.

Placing a Call
A call can be placed in two ways,
• by using a traditional phone number such as 123456
• by using a URL address in the form user@domain
Fig. 8.3 X-Lite dialpad

The above ways can be performed by using the following approaches,


• Dialing: This approach is done through the X-Lite Dialpad (cf., Fig. 8.3).
• Keying: This approach is performed from the X-Lite dialpad or the computer
keyboard.
• Drag and drop contact or previous call: This approach is done through the X-
Lite Contacts or the History tab.
• Right click on the contact or previous call: This can be done from the X-Lite
Contacts or the History tab.
• Double click on a contact: This approach is done via the X-Lite Contacts tab.
• Double click on a previous call: This is performed via the X-Lite History tab.
• Redial: This approach uses the X-Lite Redial button.
An established X-Lite call screen will look like the one in Fig. 8.4.

Ending a Call
The red End call button on the established call window is used to end a call (cf., Fig. 8.4).

Handling an Incoming Call


The X-Lite incoming screen (cf., Fig. 8.5) will appear when an incoming call is
ringing.
The following actions can be performed on the incoming call screen.
• Answer: This can be performed by clicking the Answer button. If you are on
another call, then the first call will be put on hold.
• Decline: This can be done by clicking the Decline button. If this action is done
then there will be a busy tone. If the voicemail is configured, then the call will
be directed to the voicemail server.
Fig. 8.4 X-Lite established call

Fig. 8.5 X-Lite incoming call screen

Fig. 8.6 X-Lite incoming video call

• Video: If the incoming call is a video call, then the Video button will appear to
accept the video call (cf., Fig. 8.6).
• Audio: If the incoming call is a video call, then the Audio button will also ap-
pear. You will have the choice to accept only the audio call (cf., Fig. 8.6).
Fig. 8.7 ManyCam main screen

8.3 Voice and Video Injection Tools

As discussed in Chap. 6, voice and video quality can be assessed using full-reference
model or intrusive voice/video quality assessment model, in which a reference
voice/video signal can be compared with the corresponding degraded voice/video
signal. For full reference voice/video quality assessment, a standard reference voice signal or video clip needs to be inserted into the system under test, and the degraded voice signal or video clip needs to be recorded at the receiving end.
In this section, we introduce methods to inject a standard voice signal (e.g., from
standard speech samples from ITU-T P.50 Appendix I [8]) or a standard video clip
from VQEG video clip database [9].
The popular tool for video injection is Manycam Virtual Webcam or simply
Manycam [4] and that for audio is Virtual Audio Cable [6].

8.3.1 Manycam Video Injection Tool

ManyCam Virtual Webcam, or simply ManyCam, is freeware live effects and webcam effects software. The name is derived from its feature which allows a single webcam to be used with multiple applications at the same time. ManyCam takes as its input stream a webcam, video camera, picture or movie files, and replicates the stream as an alternative source of input into other applications such as Skype, Google Talk and X-Lite. This book uses ManyCam version 3.0.91. The front screen
of the ManyCam is depicted in Fig. 8.7.
Fig. 8.8 X-Lite devices settings

Injecting Video into X-Lite


First, ManyCam should be configured to use standard video clips files from your
computer. This is performed by using the Video tab from the ManyCam main win-
dow and then by selecting Sources and finally Movies (cf., Fig. 8.8). The Add Files
drop down list will then be used to upload files.
The other way of selecting a video file as the input source is by clicking Many-
Cam drop down menu list, then by selecting Video Sources and finally Movies (cf.,
Fig. 8.8).
The injection of video sequences into the X-Lite by using ManyCam is achieved
by selecting X-Lite video devices as “ManyCam Virtual Webcam”. Under the Soft-
phone menu item, choose Preferences. When the Preferences window appears, click
Devices. The camera source configuration will be available after selecting Other
Devices tab (cf., Fig. 8.8). Under the Camera drop down list, select ManyCam Virtual Webcam (cf., Fig. 8.8). These steps will make X-Lite use ManyCam as the video source device.
Once the X-Lite video camera source is set to take the video stream from the
ManyCam Virtual Webcam device, standard video sequences such as Akiyo will
then be streamed (cf., Fig. 8.9) whenever the video files are played on the ManyCam
main window.

8.3.2 Virtual Audio Cable Injection Tool

Virtual Audio Cable (VAC) is freeware that allows you to transfer audio streams between applications and/or audio devices. VAC creates virtual audio devices ("Virtual Cables") whereby any application can send or receive an audio stream through them without sound quality loss.
Fig. 8.9 X-Lite video display from ManyCam Virtual Webcam device

Injecting Audio into X-Lite


VAC can be used to transfer an audio stream to another application. For instance, you can play back an ITU-T P.50 sample voice such as "British English" in any audio player such as Windows Media Player to produce the audio stream, output it to the VAC device, and use the VAC device as an input audio device in X-Lite instead of a microphone.
Before injecting audio stream into X-Lite, the well-known Windows Media
Player should be configured to playback an audio file and set VAC as an output
audio device. Loading an audio file in Windows Media is performed by following
the menu sequence File → Open, and browsing to the location of the audio file or
by dragging the audio file to the Windows Media Player.
To set VAC as the audio playback device, open the following menu sequence
Tools → Options → Devices → Speakers → Properties → Sound Playback. Then
choose Line 1 (Virtual Audio Cable) as the Playback device (cf., Fig. 8.10).
The injection of audio stream into the X-Lite is then achieved by selecting Line1
(Virtual Audio cable) device under the Softphone → Preferences. When the Prefer-
ences window appears, click Devices. The speakerphone or headset mode configu-
ration will be visible (cf., Fig. 8.11). Under the Microphone drop down list, select
Line 1 (Virtual Audio Cable) device (cf., Fig. 8.11). These steps will make X-Lite use VAC as the audio input source device.

8.4 Lab Scenario

Figure 8.12 depicts the scenario we are going to use in this lab. Asterisk version 1.6 is used under Ubuntu 10.04 LTS. X-Lite 4 SIP phones (Version 4.1 build 63214 for Windows) and Asterisk are connected via an Ethernet switch. The analogue phone is
Fig. 8.10 Windows Media Player output device

Fig. 8.11 X-Lite input audio device

connected to Asterisk via Digium TDM11B TDM PCI Card by using RJ11 cable.
Headset and webcams are required for voice and video calls on the X-Lite 4.
For the purpose of learning, a low-powered CPU (a 433 MHz to 700 MHz Celeron processor) will be able to host the Asterisk system. This is the minimal requirement for two to three concurrent VoIP calls.
Fig. 8.12 Lab scenario

8.5 Adding SIP Phones


The configuration files for Asterisk are by default located in the /etc/asterisk directory. The file we are particularly interested in for adding SIP phones is sip.conf. This is the configuration file that is used to define SIP phones and channels for both inbound and outbound calls. Navigate into /etc/asterisk and edit the sip.conf file by using any plain text editor you are familiar with in Linux.
The following lines will add two SIP phones, namely, 1000 and 2000 into
sip.conf file.
[1000]
type=friend
username=1000
secret=1234
context=mycontext
host=dynamic

[2000]
type=friend
username=2000
secret=1234
context=mycontext
host=dynamic
Each line added into the sip.conf file is described below:
type: This defines the connection class for each phone. There are three options: peer, user and friend. Peer can only make calls; it is used when Asterisk is connecting to a proxy. User can only accept calls. Friend is used as both a peer and a user, i.e., it can make and accept calls.
username: Sets the username for registering into Asterisk.
secret: The password for registering into Asterisk.
context: This sets the context for this phone. This context will be used in the extensions.conf file (cf., Sect. 8.6) for this phone's dial plans.
host: IP address or host name of the phone. It can also be set to ‘dynamic’ to allow
the phone to connect from any IP address.

8.6 Configuring Dial Plans

The file we are particularly interested in for configuring dial plans is extensions.conf, so go into /etc/asterisk and edit the extensions.conf file. Adding the following lines into the extensions.conf file will configure dial plans for phones 1000 and 2000 added in the sip.conf file (cf., Sect. 8.5).
An extension is made up of three components separated by commas,
• The name/number of the extension
• The priority
• The application/command
Extension components are formatted as exten => name, priority, application()
and the real example of this is exten => 100, 1, Answer(). In this example, 100
is the extension name/number, the priority is 1, and the application/command is
Answer().

[mycontext]
exten => 100,1,Answer()
exten => 100,2,Dial(SIP/1000)
exten => 100,3,Hangup()

exten => 200,1,Answer()
exten => 200,2,Dial(SIP/2000)
exten => 200,3,Hangup()

exten => 300,1,Answer()
exten => 300,2,Dial(DAHDI/1)
exten => 300,3,Hangup()

Below is the description of each line added into the extensions.conf file:
[mycontext]: Phones 1000 or 2000 will be directed to this context whenever a call
is initiated. Note that this context name is defined in sip.conf.
exten => 100,1,Answer(): If phone 1000 or 2000 dials extension 100, this line
will answer a ringing SIP channel.
exten => 100,2,Dial(SIP/1000): If phone 1000 or 2000 dials 100, this line re-
quests a new SIP channel, places an outgoing call to phone 1000 and bridges the
two SIP channels when phone 1000 answers.
exten => 100,3,Hangup(): Unconditionally hangup a SIP channel, terminating a
call.
exten => 300,2,Dial(DAHDI/1): If phone 1000 or 2000 dials 300, this line requests a new DAHDI channel, places an outgoing call to the analogue phone and bridges the SIP and DAHDI channels when the analogue phone answers.
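After editing extensions.conf, the dial plan can be reloaded and inspected from the Asterisk console; a minimal verification sketch (the exact output format may differ between Asterisk versions):
asterisk*CLI> dialplan reload
asterisk*CLI> dialplan show mycontext
The second command should list extensions 100, 200 and 300 with their three priorities each.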

8.7 Configuring DAHDI Channels

Before making voice calls between SIP and analogue phones it is important to ensure the Digium cards are working well. After installing Asterisk and DAHDI, verify that your cards are set up and configured properly by executing the commands in steps 6 and 7. If you get errors in steps 6 and 7, then follow all the steps below to make sure your Asterisk and Digium card work properly.
1. Detect your hardware. The command below will detect your hardware and, if it is successful, the files /etc/dahdi/system.conf and /etc/asterisk/dahdi-channels.conf will be generated.
Linux:~# dahdi_genconf
2. This command will read system.conf file generated from the above step and
configure the kernel of your Linux distribution.
Linux:~# dahdi_cfg -v
3. The following line will restart DAHDI in which all modules and drivers will be
unloaded and loaded again. Note that the location of this script may vary from
one Linux distribution to another.
Linux:~# /etc/init.d/dahdi restart
4. The statement below will include the file /etc/asterisk/dahdi-channels.conf in chan_dahdi.conf under the [channels] section.
[channels]
#include /etc/asterisk/dahdi-channels.conf
5. These configurations will not take effect until you restart Asterisk. The following command will restart Asterisk.
Linux:~# /etc/init.d/asterisk restart
6. After restarting Asterisk, verify your card status. Reconnect to the Asterisk CLI and run the following command; you should get output like this.
Fig. 8.13 Asterisk console (command line interface)

Linux*CLI> dahdi show status

Description Alarms IRQ bpviol CRC4


Wildcard TDM11B Board 1 OK 0 0 0
7. Verify your configured DAHDI channels
asterisk*CLI> dahdi show channels

Chan Extension Context Language MOH Interpret


pseudo default default
1 from-pstn en default
2 from-pstn en default

asterisk*CLI>

8.8 Starting and Stopping Asterisk

To start Asterisk, log in to Linux as a user with permission to run Asterisk; at the terminal console type "asterisk -vvvc" and press the return key. "-c" enables console (command line interface) mode and "-v" tells Asterisk to produce verbose output. More "v"s will produce more verbose output. If Asterisk is successfully started, then the console mode will look like Fig. 8.13. To stop Asterisk, type "core stop now" at the Asterisk console and press the return key. The console mode will exit and Asterisk will stop.
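In summary, the start and stop sequence at the shell and console looks like this (a sketch; the verbosity level is illustrative):
Linux:~# asterisk -vvvc
asterisk*CLI> core stop now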

8.9 Setup SIP Phone

This lab uses X-Lite 4 Version 4.1 build 63214 under Microsoft Windows XP as a SIP phone to connect to Asterisk. The following steps are needed in order to set up X-Lite 4,
• Open X-Lite 4 from Microsoft Windows.

• Click on the “Softphone” menu, then select “Account Settings” (cf., Fig. 8.14).

• Under “SIP Account” Input the following details (cf., Fig. 8.14):
Display Name: 1000
User Name: 1000
Password: 1234
Authorization User: 1000
Fig. 8.14 SIP account settings

Domain: Your Asterisk IP address
Enable "Register with domain and receive incoming calls"
Set "outbound via" to Domain
The Asterisk IP address can be obtained by using the "ifconfig" Linux command and locating an appropriate network interface. Note that proper permissions are needed in order to execute the "ifconfig" command if you are not a super user.
• Click “Ok” to close the “SIP Account” settings window.
• Click “Softphone” menu and then click “Exit” to close X-Lite 4.
At this stage you have already set up X-Lite 4 and you are ready to connect to Asterisk. Use other computers and repeat the same steps for SIP phones "2000", "3000" and "4000".
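Successful registration can also be checked from the Asterisk console; a minimal sketch (the exact output columns may vary between Asterisk versions):
asterisk*CLI> sip show peers
Each registered SIP phone (1000, 2000, and so on) should appear in the list together with the IP address it registered from.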

8.10 Making Voice Calls Between SIP Phones

The following steps are required to make voice calls between SIP phones.
Fig. 8.15 Successfully registered SIP phone

• Start Asterisk
• Start X-Lite 4 with SIP phone 1000. The X-Lite 4 screen will look like Fig. 8.15, showing that SIP phone 1000 is successfully registered
• Start X-Lite 4 with username 2000 in another PC. The X-Lite 4 screen will look
similar to Fig. 8.15, but this time showing that SIP phone 2000 is successfully
registered
• Observe the output messages at Asterisk CLI console for possible errors or
successful registration messages
• If you are using SIP phone 2000, dial extension 100 to call SIP phone 1000 by
clicking “Call” icon, let SIP phone 1000 pick up the call and start talking. You
can type the extension number in the X-Lite 4 call entry field or use the X-Lite
4 dial pad. The dial pad can be expanded by clicking the “Show or hide dial
pad” icon
• If you are using SIP phone 1000, dial extension 200 to call SIP phone 2000, let
SIP phone 2000 pick up the call and start talking
• Observe the output messages at Asterisk CLI when the VoIP session is estab-
lished between the SIP phones 1000 and 2000
• The call can be terminated by using “End” icon
• Observe the output messages at Asterisk CLI when the VoIP session is termi-
nated between the SIP phones 1000 and 2000
Fig. 8.16 X-Lite 4 video window

8.11 Making Video Calls Between SIP Phones

In order for video calls to work, you are required to turn on support for SIP video. This is done by adding the "videosupport=yes" line under the [general] context in sip.conf. The [general] context, among other lines, will look like this,

[general]
videosupport=yes
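For the new setting to take effect, the SIP configuration should be reloaded (or Asterisk restarted); a minimal sketch, assuming the Asterisk console is reachable from the shell:
Linux:~# asterisk -rx "sip reload"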

After enabling SIP video support, carry on with the following steps,
• Start X-Lite 4
• Open “Video window” from the X-Lite 4 screen (cf., Fig. 8.16) by clicking
“Show or hide video window” icon. You should be able to see yourself from
your own camera (cf., Fig. 8.16)
• Dial extension 100 if you are on SIP phone 2000 to call SIP phone 1000. When SIP phone 1000 picks up the call, click the "Send video" icon in the "Video window"
• Both SIP phones will now display video calls (cf., Fig. 8.17)

8.12 Making Voice Calls Between SIP and Analogue Phones

This section will make use of DAHDI channels configured above. Extension 300 is
used to connect to analogue phone.
• Start “X-Lite 4”
• Dial extension 300, the analogue phone will ring, pick up the phone and start
talking
• From the analogue phone, dial extension 100 or 200 for calls to SIP phones
1000 or 2000, respectively, pick up the calls from X-Lite 4 and start talking
Fig. 8.17 Video start button

8.13 Problems

This challenge will help you understand more about adding SIP phones and dial
plans in sip.conf and extensions.conf, respectively. To help with current time play-
back and voice mail dial plans configuration, “Asterisk: The Future of Telephony”
is recommended [5].
1. Add SIP phone with username: 3000, password: 1234, type: friend and host:
dynamic.
2. Add SIP phone with username: 4000, password: 1234 and type: user.
3. Make voice call from SIP phone 2000 to 4000. (hint, you should add dial plans
and setup X-Lite 4 for this to work).
4. Make voice call from SIP phone 4000 to 2000.
5. Explain if any of Step 3 or 4 is not working and why.
6. Rectify any problem if exists in Step 5.
7. Add a dial plan that will playback the current time (hint, use SayUnixTime
Asterisk command).
8. Add a dial plan that will include a voice mail (hint, use VoiceMail command).
The file /etc/asterisk/voicemail.conf is used to setup Asterisk voicemail con-
texts. The voice mail dial plan should include the following functions:
a. Enable SIP phone user to leave voice mail after 30 seconds of no answer.
b. Enable SIP phone user to access and retrieve saved voicemails.
c. Prompt for a password when accessing voicemail mailbox.

References
1. Asterisk (2011) Asterisk downloads. http://www.asterisk.org/downloads/. [Online; accessed
02-February-2011]
2. Counterpath (2011) X-lite 4. http://www.counterpath.com/x-lite.html. [Online; accessed 12-
June-2011]
3. DIGIUM (2010) Digium cards. http://www.digium.com/. [Online; accessed 27-July-2010]


4. Manycam (2012) Free live studio and webcam effects software. http://www.manycam.com.
[Online; accessed 02-October-2012]
5. Meggelen JV, Smith J, Madsen L (2005) Asterisk: the future of telephony. O’Reilly Media,
Sebastopol [Freely available at http://www.asteriskdocs.org/]
6. Muzychenko E (2012) Virtual audio cable. http://software.muzychenko.net/eng/vac.htm.
[Online; accessed 02-October-2012]
7. Schulzrinne H, Petrack S (2000) RTP payload for DTMF digits, telephony tones and telephony signals. RFC 2833
8. ITU-T (2012) ITU-T test signals for telecommunication systems. http://www.itu.int/net/itu-t/sigdb/genaudio/AudioForm-g.aspx?val=1000050. [Online; accessed 02-October-2012]
9. Arizona State University (2012) YUV video sequences. http://trace.eas.asu.edu/yuv. [Online; accessed 02-October-2012]
9 Case Study 2—VoIP Quality Analysis and Assessment

In this case study, we will analyse and assess the voice and video quality of the multimedia sessions established in Chap. 8. We will use Wireshark in order to capture and analyse SIP and RTP packet headers. Upon completion of this lab, you will be familiar with Wireshark and the tc commands of the Linux network emulator, and will have experienced SIP message flows during user registration and multimedia session setup and termination. This Lab will also help you to emulate a Wide Area Network (WAN) by using the tc command in Linux in order to assess the impact of packet loss and jitter on the quality of VoIP calls.

9.1 What Is Wireshark


Wireshark [1] is a free network protocol analyzer that runs on Microsoft Windows, Unix, and Mac computers. Wireshark will be used throughout this Lab to study the contents of network packets. Its rich and powerful features have made it very popular with network professionals, security experts, developers, and educators around the world.

9.1.1 Live Capture and Offline Analysis


Offline analysis can be done by saving captured packets by using “Save as” menu
item. This menu item can give you options such as which file format and which
packets to save (cf., Fig. 9.1).
In the "Packet Range Frame", you will be able to save all captured packets or the displayed packets. If the "Captured" button is selected then all captured packets will be saved. If the "Displayed" button is chosen then all the currently displayed packets will be saved.
The radio buttons in the "Packet Range Frame" have the following functions,
• If "All packets" is chosen then all packets will be processed.
• If "Selected packet only" is selected then only the selected packet will be processed.
• If "Marked packets only" is chosen then only the marked packets will be processed.

Fig. 9.1 Wireshark “Save as” screenshot

• If "From the first to the last marked packet" is chosen then all packets from the first to the last marked one will be processed.
• If "Specify a packet range" is chosen then you will have to input your own packet range in the text field below it, and all the ranges specified in the text field will be processed.
Packets can be read and written in many different capture file formats such
as tcpdump (libpcap), Pcap NG, Catapult DCT2000, Cisco Secure IDS iplog,
Microsoft Network Monitor, Network General Sniffer (compressed and uncom-
pressed), Sniffer Pro, and NetXray, Network Instruments Observer, NetScreen
snoop, Novell LANalyzer, RADCOM WAN/LAN Analyzer, Shomiti/Finisar Sur-
veyor, Tektronix K12xx, Visual Networks Visual UpTime, WildPackets Ether-
Peek/TokenPeek/AiroPeek, and many others.
Live data can be read from Ethernet, IEEE 802.11, PPP/HDLC, ATM, Blue-
tooth, USB, Token Ring, Frame Relay, FDDI, and others (depending on your plat-
form). It also supports decryption for many protocols, including IPsec, ISAKMP,
Kerberos, SNMPv3, SSL/TLS, WEP, and WPA/WPA2. Output can be exported to
XML, PostScript, CSV, or plain text and capture files compressed with gzip can be
decompressed on the fly.

9.1.2 Three-Pane Packet Browser

Wireshark has a three-pane packet browser,


1. The “Packet List” pane (cf., Fig. 9.2): This pane lists all the packets captured
in the file. Each line corresponds to a packet.
The “Packet List” pane has default columns,
Fig. 9.2 Wireshark packet list pane

Fig. 9.3 Wireshark “Packet Details” pane

• No.: This column denotes the number of the packet in the capture file. This
number will not change, even if a display filter is used.
• Time: This column shows the timestamp of the packet. The format of this
timestamp can be changed into different time format such as seconds, mil-
liseconds and nanoseconds or date and time of the day.
• Source: This column displays the IP address where this packet is coming
from.
• Destination: The IP address where this packet is destined to.
• Protocol: The protocol name in a short is displayed in this column such as
TCP, UDP and HTTP.
• Info: Additional information about the packet content is displayed in this
column such as NOTIFY and TCP segment of a reassembled PDU.

2. The “Packet Details” pane (cf., Fig. 9.3): The packet selected in the “Packet
List” pane will result into more details shown in the “Packet Details” pane.
This pane gives the protocols and the protocol fields corresponding to the packet selected in the "Packet List" pane. These fields are in a tree structure which can be expanded to reveal more information.
3. The “Packet Bytes” pane (cf., Fig. 9.4): This pane displays the hexdump style
of the packet selected in the “Packet List” pane. The left side of the hexdump
style shows the offset of the packet data. The hexadecimal representation of
the packet data is shown in the middle and the right side shows the ASCII
characters of the corresponding hexadecimal representation.
Fig. 9.4 Wireshark “Packet Bytes” pane

Fig. 9.5 Telephony menu item

9.1.3 VoIP Analysis

Rich VoIP analysis is available under “Telephony” menu item where RTP and SIP
protocols together with VoIP calls statistics can be analysed.
Under the Telephony menu (cf., Fig. 9.5), SIP statistics can be generated when the SIP protocol is selected. For instance, the SIP statistics of Fig. 9.6 show that 14 SIP packets were captured, two packets each for "SIP 100 Trying" and "SIP 180 Ringing". Four "SIP 200 OK" packets were recorded. Two "SIP INVITE", one "SIP ACK" and three "SIP REGISTER" packets were captured.
VoIP call analysis can be done in the same “Telephony” menu item by selecting
“VoIP calls” in the drop down menu. A list of detected VoIP call will appear (cf.,
Fig. 9.7). The VoIP calls list includes:
• Start Time: Start time of the VoIP call.
• Stop Time: Stop time of the VoIP call.
• Initial Speaker: The IP address of the source of the packet that initiated the VoIP
call.
• From: This is the “From” field of the SIP INVITE.
• To: This is the “To” field of the SIP INVITE.
• Protocol: This column displays the VoIP protocols used such as SIP and H323.
• Packets: This column denotes the number of packets involved in the VoIP call.
Fig. 9.6 Wireshark SIP statistics

Fig. 9.7 List of VoIP calls

• State: This displays the current VoIP call state. This can be the following.
– CALL SETUP: This will show a VoIP call in setup state (Setup, Proceeding,
Progress or Alerting).
– RINGING: This will display a VoIP call ringing (this state is only supported
for MGCP calls).
– IN CALL: This will denote that a VoIP call is still connected.
– CANCELED: This state illustrates that a VoIP call was released by the originating caller before being connected.
– COMPLETED: This will show a VoIP call was connected and then released.
– REJECTED: This state shows that a VoIP call was released by the callee before being connected.
– UNKNOWN: This VoIP call is in unknown state.
• Comment: Any additional comment will be displayed here for a VoIP call.
Fig. 9.8 VoIP call playback

Fig. 9.9 SIP messages flow

A desired VoIP call can be chosen from the list for playback through the Wireshark RTP player (cf., Fig. 9.8). The RTP player will show the percentages of packets that are dropped because of the jitter buffer as well as the packets that are out of sequence.
From the list of VoIP calls, the SIP flow diagram can also be displayed as a graph; the graph will display the following information (cf., Fig. 9.9):
• All SIP packets that are in the same call will be coloured with the same colour.
• Arrows showing the direction of each SIP packet.
• Labels on top of each arrow will show the SIP message type.
• RTP traffic will be represented with a wider arrow with the corresponding payload type on its top.
• The UDP/TCP source and destination ports per packet will be shown.
Fig. 9.10 List of RTP streams

Fig. 9.11 RTP stream window

• The comment column will depend on the VoIP protocol in use. For the SIP protocol,
comments such as “Request” or “Status” message will appear. For the RTP proto-
col, comments such as the number of RTP packets and the duration in seconds will
be displayed.
RTP streams can also be analysed from the “Telephony” menu item. A list of
RTP streams (cf., Fig. 9.10) will be displayed and any one can be picked for
analysis (cf., Fig. 9.11).
The “RTP Stream Analysis” window will show some basic data such as the RTP
packet number and sequence number. Enhanced statistics, which are created based on
each packet’s arrival time, delay, jitter and packet size, are also listed on a per-packet basis.
The lower pane of the window shows overall statistics such as minimum and
maximum delta, clock skew, jitter and packet loss rate. There is also an option to
save the payload for offline playback or further analysis.

Fig. 9.12 RTP graph analysis

A graph depicting several parameters such as jitter and delta can also be drawn
(cf., Fig. 9.12).

9.2 Wireshark Familiarization

The following steps will help you to familiarize yourself with the Wireshark Graphical User
Interface (GUI). Wireshark has five main windows: command menus, display filter
specification, listing of captured packets, details of the selected packet header, and packet
content in hexadecimal and ASCII windows.
• Start Wireshark from both Computer 1 and Computer 2.
• Start capture a short period of live traffic and view it immediately. To do this
click on Capture → Start (you may first have to select an interface under the
capture menu) (cf., Fig. 9.13).
• If you then try to browse any website (e.g., google.com), you should be able to
see captured packets on the capture screen.
• Stop the capture.
• Understand the significance of each of the columns in the top part of the Wire-
shark window: No, Time, Source, Destination, Protocol, Info.
• Wireshark will show both the dotted decimal form and the hex form of IP ad-
dresses. Look at an example and check if they are the same.

Fig. 9.13 Wireshark screenshot

Fig. 9.14 Simplest network topology for Netem

9.3 Introduction to Netem and tc Commands

Netem is a network emulator for the Linux kernel 2.6.7 and higher versions. Netem
emulates network dynamics such as IP packet delay, loss, corruption and duplication.
Netem extends the Linux Traffic Control (tc) command available in the iproute2
package. In order for Netem to work, Linux must be configured to act as a router.
The simplest network topology for Netem to work is depicted in Fig. 9.14.

IP packets from the sender enter the Linux router via a network interface card,
where each packet is classified and queued before entering the Linux internal
IP packet handling. The IP packet classification is done by examining packet header
parameters such as source and destination IP addresses.
From the IP packet handling process, packets are classified and queued ready for
transmission via the egress network interface card. The tc command is tasked with
classifying IP packets. The default queueing discipline used by tc is First In First
Out (FIFO).
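Before any qdisc is added, the qdisc currently attached to an interface can be inspected with the command below (eth0 is only an example interface name; substitute your own). On an unmodified interface this typically reports pfifo_fast, the default first-in-first-out qdisc; once a Netem qdisc has been added, the same command will also list its delay and loss parameters.
• tc qdisc show dev eth0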

9.3.1 Adding qdisc

Common terminologies used are,


• qdisc: This stands for queueing discipline where packets are queued based on
an algorithm that decides how and when to transmit packets.
• Root qdisc: This is the root qdisc that is attached to each network interface card.
It can be described as classful or classless.
• egress qdisc: This is qdisc which acts on the outgoing traffic.
• ingress qdisc: This is the qdisc which works on the incoming traffic.
The common command to add a qdisc is,
• tc qdisc add dev DEVICE handle 1: root QDISC [PARAMETER]. This com-
mand will generate a root qdisc. DEVICE denotes the network interface card, such
as eth0, eth1 or eth2. PARAMETER represents the parameters associated with
the attached qdisc, and QDISC denotes the type of qdisc used, such as pfifo_fast,
which is the first in first out qdisc, or Token Bucket Filter (TBF), which controls
the IP packet rate.
• Examples:
– tc qdisc add dev eth0 root netem loss 2 %. This tc command will add a
2 % packet loss rate (each outgoing IP packet is dropped with a probability of
2 %) on the network interface card eth0.
– tc qdisc add dev eth0 root netem delay 100 ms. This tc command will add
100 ms of delay to the outgoing traffic via the eth0 network
interface card.
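Netem parameters can also be combined in a single command. The line below is only a sketch (eth0 and the values are illustrative): it adds a 100 ms delay with ±20 ms of random variation and a 25 % correlation between consecutive delay values, together with a 1 % packet loss rate, to the outgoing traffic,
• tc qdisc add dev eth0 root netem delay 100ms 20ms 25% loss 1%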

9.3.2 Changing and Deleting qdisc

The commands for changing and deleting a qdisc have the same structure as the adding qdisc com-
mand. To change the 2 % loss rate added to the qdisc in Sect. 9.3.1 to 10 %, the follow-
ing command is used,
• tc qdisc change dev eth0 root netem loss 10 %

Fig. 9.15 Ping results for 200 ms delay

To modify 100 ms delay added to the qdisc in Sect. 9.3.1 to 200 ms, the following
command is used,
• tc qdisc change dev eth0 root netem delay 200 ms

Figure 9.15 depicts the results shown (below the red line) when pinging a
Linux router with an emulated delay of 200 ms. The results show an average delay of
201 ms. The results above the red line show the ping results at the normal
delay without any Netem entry; there the average delay is 1 ms.
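The results in Fig. 9.15 can be reproduced with an ordinary ping towards the Linux router (the IP address below is illustrative; use the address of your own router). The round-trip times reported should increase by roughly the delay configured with Netem,
• ping -c 10 192.168.0.1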
To delete a complete root qdisc tree added in Sect. 9.3.1 or modified in this
section, the following command is used,
• tc qdisc del dev eth0 root

9.4 Lab Scenario

This Lab will use Wireshark on Microsoft Windows XP and is configured as shown
in Fig. 9.16. In this scenario, Wireshark is installed on the same computer as
X-Lite 4.

9.4.1 Challenges

1. List network interfaces and their corresponding IP addresses.


2. Find out which interface can be used for capturing.
3. List different protocols that appear.
4. Filter the UDP protocol.
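As a hint for Challenge 4, display filters are typed into the Wireshark filter toolbar. A few example filter expressions are given below (the IP address is illustrative),
• udp
• sip
• rtp
• ip.addr == 192.168.1.20 && udp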

Fig. 9.16 Lab scenario

9.5 SIP Registration

This Lab will help you understand and analyse captured SIP messages during SIP
registration process.
• Start Wireshark capture.
• Filter SIP protocol only (cf., Fig. 9.17).
• Start Asterisk server.
• Start X-Lite 4.
• Examine the SIP messages header under Wireshark packet header screen.
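As a rough guide to what to expect, the listing below is a simplified, hand-written REGISTER request; the IP addresses, extension number, tag, branch and Call-ID values are purely illustrative and will differ in your own capture, but the header fields follow the same pattern.

REGISTER sip:192.168.1.10 SIP/2.0
Via: SIP/2.0/UDP 192.168.1.20:5060;branch=z9hG4bK776asdhds
Max-Forwards: 70
From: <sip:1001@192.168.1.10>;tag=456248
To: <sip:1001@192.168.1.10>
Call-ID: 843817637684230@192.168.1.20
CSeq: 1 REGISTER
Contact: <sip:1001@192.168.1.20:5060>
Expires: 3600
Content-Length: 0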

9.5.1 Challenges

From Wireshark,
1. List the source and destination IP addresses.
2. List the source and destination port numbers.
3. List SIP methods seen on the header.

Fig. 9.17 Wireshark screenshot: SIP registration

4. Find the sequence number.


5. List the status codes and their description.
6. Name the transport protocol used.
7. Name the authorization scheme employed.
8. What is the user agent used?
9. Find how long X-Lite took to register with Asterisk.
10. Exit X-Lite and examine extra packets that appear.
11. Compare these extra SIP headers with those of Step 3 above.
12. Stop the Wireshark capture.

9.6 SIP Invite

This Lab will help you understand and analyse SIP message and RTP headers during
SIP Invite process.
• Start Wireshark capture.
• Make a voice call to another SIP client.
• Filter SIP protocol and examine the SIP Message header (cf., Fig. 9.18).
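For reference when answering the challenges below, a minimal SDP body carried in an INVITE might look like the listing that follows; the addresses, port numbers and codec list are illustrative and not taken from the Lab capture.

v=0
o=- 12345 1 IN IP4 192.168.1.20
s=-
c=IN IP4 192.168.1.20
t=0 0
m=audio 49170 RTP/AVP 0 8 101
a=rtpmap:0 PCMU/8000
a=rtpmap:8 PCMA/8000
a=rtpmap:101 telephone-event/8000
m=video 49172 RTP/AVP 96
a=rtpmap:96 H263-1998/90000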

9.6.1 Challenges

From Wireshark,
1. List SIP methods seen on the header.
2. List the status codes and their description.
3. Find the content-type.

Fig. 9.18 Wireshark screenshot: SIP Invite

4. Under Session Description Protocol examine (cf., Fig. 9.19).


a. Media description.
b. Media attributes.
c. Compare the media description attributes of Audio and Video.
5. Find how long X-Lite took to invite another SIP phone.
While the session dialog is in progress, start the video call and,
6. Compare the SIP message header with that of voice call above.
7. Find the content-type.
8. Find the codec used.
While the voice and video call session is ongoing,
9. Filter the RTP protocol (cf., Fig. 9.20)
10. List the source and destination port numbers.
11. Identify the Payload type.
12. Find the sequence number and show how it is incremented.
13. Find the time-stamps and show how they are incremented (a worked example is given after this challenge list).
While the voice and video call session is ongoing,
1. Filter the RTCP protocol (cf., Fig. 9.21).

Fig. 9.19 Wireshark screenshot: SDP

Fig. 9.20 Wireshark screenshot: RTP



Fig. 9.21 Wireshark screenshot: RTCP

2. How do RTCP packets appear periodically during a session?


3. List the source and destination port numbers and find the relationship with the
RTP ports.
4. Examine receiver report.
5. Examine the source description report.
6. Identify the packet loss information.
7. Identify Jitter information.
8. Stop Wireshark capture.
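As a worked example for Challenges 12 and 13 above (assuming the voice stream negotiated is G.711 PCMU, payload type 0, a common X-Lite default): PCMU is sampled at 8000 Hz and is typically packetized every 20 ms, so each RTP packet carries 8000 × 0.02 = 160 samples, i.e. 160 bytes of payload. The sequence number therefore increases by 1 from one packet to the next, the timestamp increases by 160 per packet, and the payload-only rate is 160 bytes every 20 ms, i.e. 64 kb/s. Video streams use a 90 kHz RTP clock instead, and packets belonging to the same video frame carry the same timestamp.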

9.7 VoIP Messages Flow

This Lab will help you to understand VoIP message flows between two or more
communicating SIP phones.
• Start Wireshark capture.
• Make a voice call to another SIP client.
• After a couple of minutes stop Wireshark.
• Under Wireshark menu, click “Telephony” and select “VoIP calls”.
• A list of calls will appear (cf., Fig. 9.22).
• Select one or more calls on the list and click flow.
• The messages flow of VoIP calls will appear (cf., Fig. 9.23).

9.7.1 Challenges

1. List SIP status codes and their descriptions.



Fig. 9.22 Wireshark screenshot: VoIP calls

Fig. 9.23 Wireshark screenshot: VoIP calls

2. List SIP methods involved in the call.


3. List the number of REGISTER, ACK, BYE, INVITE packets.
4. List the number of 200, 100, 180, and 407 status code packets.

Fig. 9.24 RTP streams statistics

Fig. 9.25 Perceived voice and video without packet loss

9.8 VoIP Quality Assessment: Packet Losses

This Lab will use the traffic controller “tc” in Linux in order to manipulate traffic control
settings. This Lab will also help you to assess the impact of packet losses on voice
and video quality.
• Start Wireshark capture.
• Make video calls between X-Lite clients and assess the quality (Q) in the range
of 1 to 5, 1 being worst and 5 excellent.
• Start Wireshark at each end where X-Lite is running.
• Capture the flowing RTP/RTCP traffic.
• Use Wireshark “Statistics” tab on RTP, and by showing all streams, examine the
packet loss rate from Asterisk (cf., Fig. 9.24 ) to your X-Lite client. Fill in the
Tables in (cf., Fig. 9.25) by replacing Q with the quality between 1 and 5 and P
with the packet loss rate obtained from the Wireshark statistics.
• Stop Wireshark capture.

9.8.1 Challenges

1. At the Linux terminal, identify the network interface used by Asterisk and
execute the command below
a. # tc qdisc add dev ethx root netem loss 2 %
Replace ethx with the network interface used by Asterisk.
2. Give the meaning of the above command.
3. Start Wireshark at each end where X-Lite is running.
4. Capture the flowing RTP/RTCP traffic.

Fig. 9.26 Perceived voice and video with packet loss

Fig. 9.27 Perceived voice and video without delay variation

5. Use Wireshark “Statistics” tab on RTP, and by showing all streams, examine
the packet loss rate from Asterisk to your X-Lite client. Wait for few minutes
until packet loss rates stabilize. Fill in the Table in Fig. 9.26 by replacing Q
with the quality between 1 and 5 and P with the packet loss rate obtained from
the Wireshark statistics.
6. Observe packet loss rate values found in the RTCP report.
7. Stop Wireshark capture.
8. Delete the emulation created by the “tc” command; use “tc qdisc del dev ethx
root”.
9. Repeat this Lab for loss rates between 2 % and 10 % (a shell sketch for stepping through the loss rates is given after this challenge list).
10. At which packet loss rate do you start to notice video quality degradation?
11. At which packet loss rate do you start to notice voice quality degradation?
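A minimal shell sketch for stepping through the loss rates in Challenge 9 is given below. It assumes that the Netem qdisc has already been added in Challenge 1, that the commands are run as root on the Linux router, and that ethx is replaced by the interface used by Asterisk; the chosen loss values and the manual pause are only examples.

for p in 2 4 6 8 10
do
    # apply the next loss rate to the existing netem qdisc
    tc qdisc change dev ethx root netem loss ${p}%
    echo "Loss rate set to ${p} % - make a test call and record Q and P"
    # wait for the user before moving to the next loss rate
    read -p "Press Enter to continue"
done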

9.9 VoIP Quality Assessment: Delay Variation

This Lab will help you to assess the impact of delay variation on voice and video
quality. The following steps will help you to achieve this.
• Make video calls between X-Lite clients and assess the quality (Q) in the range
of 1 to 5, 1 being worst and 5 excellent.
• Start Wireshark capture at each end where X-Lite is running.
• Capture the flowing RTP/RTCP traffic.
• Use Wireshark “Statistics” tab on RTP, and by showing all streams, examine
the Max Jitter and Mean Jitter from Asterisk to your X-Lite client. Fill in the
Table in Fig. 9.27 by replacing Q with the quality between 1 and 5, MaxJ and
MeanJ with Max Jitter and Mean Jitter obtained from the Wireshark statistics,
respectively.
• Stop the Wireshark capture.

Fig. 9.28 Perceived voice and video with delay variation

9.9.1 Challenges

1. At the Linux terminal, identify the network interface used by Asterisk and
execute the command below
a. # tc qdisc add dev ethx root netem delay 150ms 5ms
Replace ethx with the network interface used by Asterisk.
2. Give the meaning of the above command.
3. Start Wireshark at each end where X-Lite is running.
4. Capture the flowing RTP/RTCP traffic.
5. Use Wireshark “Statistics” tab on RTP, and by showing all streams, examine
Max and Mean Jitter from Asterisk to your X-Lite client. Fill in the Table in
Fig. 9.28 by replacing Q with the quality between 1 and 5, MaxJ and MeanJ
with Max Jitter and Mean Jitter obtained from the Wireshark statistics, re-
spectively.
6. Observe delay variations found in the RTCP reports.
7. Stop Wireshark.
8. Delete the emulation created by the “tc” command; use “tc qdisc del dev ethx
root”.
9. Repeat this Lab for delay variations between 5 ms and 10 ms.
10. At which delay variation do you start to notice video quality degradation?
11. At which delay variation do you start to notice voice quality degradation?

9.10 Problems

1. Investigate and find out how network impairment (e.g., packet loss) affects
voice/video call quality (using subjective observation) and voice/video call es-
tablishment time by performing the following steps.
• Use the “tc” command to set two packet loss rates within the range of 0 % to
10 % (e.g., choose 0 % and 5 % respectively) to investigate how network
impairment (i.e., packet loss) affects voice/video quality.
• Before you start a video call, make sure you start Wireshark first and cap-
ture the traffic during the call setup and the beginning part of the call ses-
sion (to keep trace data size small, but make sure the video session has
started before you stop Wireshark). When the required data is collected,

you can close Wireshark and save the trace data for offline analysis (the
trace data will be used for questions in both Problem 1 and Problem 2). At
the same time, you can start evaluating voice/video quality subjectively.
• Observe and explain how voice/video quality is affected by network packet
loss (you may give your own MOS score, and describe your observation for
quality changes, for example, some video freezing observed when packet
loss rate is x %). You need to first explain briefly your VoIP testbed and
what “tc” commands have been used in your experiments for the task.
• Based on the trace data, draw a diagram and explain how a SIP call is set up
in the VoIP testbed. Further explain how the call setup time is calculated and
how the call setup time is affected by the network impairments.
2. Choose one captured Wireshark trace data (from Problem 1 above) for a
voice/video call and answer the following questions.
• Are the voice stream and video stream transmitted separately, or combined
together during a video call? What are the payload types for the voice and video
sessions? What are the payload sizes for voice and video? Does the payload
size change during the voice or video session? What is the average payload size (in
bits) and sender bit rate (in kb/s) for the voice or video session? (Hint: choose
one session from PC-A to PC-B; choose 3 or 4 GOPs for video to calculate the
average payload size and sender bit rate for the video session.) Explain your
findings and how you got your results/conclusions.
• From Wireshark, select “Statistics”, then “RTP”, then “Show All streams”.
Select the stream which you want to analyse (e.g., ITU-T G.711 PCMU),
click “Analyze”, then choose “Save Payload”, which will save the sound
file in a chosen format (e.g., .au). Using other tools (e.g., Audacity), can
you listen to the dumped audio trace? Is the VoIP system secure? Explain your
findings.

References
1. Wireshark (2011) The world’s foremost network protocol analyzer. http://www.wireshark.org/. [Online; accessed 27-August-2011]
Case Study 3—Mobile VoIP Applications
and IMS 10

This Lab introduces an Open Source IMS Core which deploys the main IMS call
session control functions described in Chap. 7 and IMSDroid as an IMS client. We
will also outline the main steps needed for successful installation, configuration and
setup of an Open Source IMS Core in Ubuntu and IMSDroid in an Android based
mobile handset. We will finally demonstrate how to make SIP audio and video calls
between two Android based mobile handsets via the Open Source IMS Core over
Wi-Fi access network.

10.1 What Is Open Source IMS Core

In 2004, the Fraunhofer Institute FOKUS launched the “Open IMS Playground”, and
by November 2006 the Open Source IMS Core (OSIMS Core) was released under
a GNU General Public License on the FOKUS BerliOS site.
The main goal of releasing the OSIMS Core was to fill the void of open source
IMS software which existed in the mid-2000s. The OSIMS Core has enabled
several research and development activities, such as ADAMANTIUM and GERYON,
to deploy IMS services and proofs of concept around the core IMS elements.
The OSIMS Core deploys the main IMS Call Session Control Functions as the cen-
tral routing elements for any IMS SIP signaling, and a Home Subscriber Server to
manage user profiles and all associated routing rules. The central components of the
OSIMS Core are the Open IMS CSCFs (P-CSCF, I-CSCF and S-CSCF). The OS-
IMS Core was developed as a set of extensions to the SIP Express Router (SER) [6]. SER
is an open source SIP server which acts as a SIP registrar, proxy or redirect server.
A simple HSS, the FOKUS Home Subscriber Server (FHoSS), is part of the OS-
IMS Core. The FHoSS is written in Java and uses the open source Tomcat servlet con-
tainer. The main component of the HSS is based on the MySQL database system. The
main function of the FHoSS is to manage user profiles and all their associated routing
rules.


Fig. 10.1 OSIMS Core architecture

The OSIMS Core implements an Extensible Markup Language Document Management
Server (XDMS). The main function of an XDMS, as defined by the Open Mobile Alliance (OMA),
is to manage contact lists, groups and access lists. The XML Configuration Access
Protocol (XCAP), RFC 4827, is the protocol used to communicate with the XDMS.
Presence services are also available in the OSIMS Core via the Presence Server. Rel-
evant presence information may include IMS user and terminal availability, IMS
user communication preferences, IMS terminal capabilities, current activities and
location.
The OSIMS Core architecture is depicted in Fig. 10.1.

10.1.1 The Main Features of OSIMS Core P-CSCF


The main features of P-CSCF in OSIMS Core are depicted in Fig. 10.2 and include,
• Capabilities of providing a firewall at an application level to the core network.
• Asserting identity of UE (P-Preferred-Identity, P-Asserted-Identity header sup-
port).

Fig. 10.2 OSIMS Core P-CSCF main features

• Providing local registrar synchronization via “Reg” event as per RFC 3680.
• Providing path header support by inserting network and path identifiers for the
correct further SIP messages processing.
• Providing verification and enforcement of service routes.
• Maintaining stateful dialog and supporting Record-route verification and en-
forcement.
• Supporting IPSec setup by using Cipher Key (CK) and Integrity Key (IK) from
Authentication and Key Agreement (AKA).
• Providing integrity protection for UA authentication.
• Supporting the Security-Client, Security-Server and Security-Verify headers
as per RFC 3329, Security Mechanism Agreement for SIP.
• Providing support for the basic P-Charging-Vector according to RFC 3455.
• Providing support for the Visited-Network-ID header as per RFC 3455.
• Acting as a router between end points by supporting NAT during signaling.
• Providing NAT support for media in case it is configured as a media proxy
through RTPProxy [8].

10.1.2 The Main Features of OSIMS Core I-CSCF


The features of I-CSCF in OSIMS Core can be seen at Fig. 10.3 and include,
• Providing support for full Cx interface to HSS as per 3GPP TS 29.228.
• Providing S-CSCF selection based on UA capabilities.
• Supporting serial forking by forwarding SIP requests and responses to S-CSCF.
• Supporting the Visited-Network-ID header and roaming permission verifi-
cation as per RFC 3455.

Fig. 10.3 OSIMS Core I-CSCF main features

• Hiding the internal network from the outside by encrypting parts of the SIP
message; this is known as Topology Hiding Inter-network Gateway (THIG).
• Firewalling capability that only allows signaling traffic coming from trusted net-
works via Network Domain Security (NDS).

10.1.3 The Main Features of OSIMS Core S-CSCF

The features of S-CSCF in OSIMS Core are illustrated in Fig. 10.4 and include,
• Supporting full Cx interface to HSS according to 3GPP TS 29.228.
• Providing authentication through AKAv1-MD5, AKAv2-MD5 and MD5.
• Supporting the Service-Route header as per RFC 3455.
• Supporting path header as per RFC 3455.
• Supporting P-Asserted-Identity header according to RFC 3455.
• Supporting Visited-Network-ID header according to RFC 3455.
• Downloading of the Service-Profile from the HSS via the Cx interface as per 3GPP TS
29.228.
• Supporting Initial Filter Criteria triggering (iFC) to enforce specific user routing
rules.
• Supporting ISC interface routing towards SIP application servers. The ISC helps
application server to know the capabilities of the UA and invoke its services.
• Implementing a “Reg” event server with access restrictions which allows it to
bind the UA location.
• Maintaining the state of SIP Dialog.

Fig. 10.4 OSIMS Core S-CSCF main features

Fig. 10.5 OSIMS Core FHoSS main features

10.1.4 The Main Features of OSIMS Core FHoSS

The features of the FHoSS in the OSIMS Core are depicted in Fig. 10.5 and include,
• Supporting the 3GPP Cx Diameter interface to S-CSCF and I-CSCF as per
3GPP TS 29.228.
• Supporting the 3GPP Sh Diameter interface to application servers as per 3GPP
TS 29.228.
• Supporting the 3GPP Zh Diameter interface as per 3GPP TS 29.109.
• Supporting integrated simple Authentication Centre (AuC) functionality.
• Implementing Java based Diameter Stack.

Fig. 10.6 Java 7 version

Fig. 10.7 MySQL version

• Providing an HTTP based management console for easy management of OSIMS
Core users and their iFC.

10.1.5 Installation and Configuration of OSIMS Core

As Ubuntu is one of the most popular Linux distributions, the following instructions
demonstrate how to install OSIMS Core in Ubuntu Linux distribution.

Prerequisite Packages
The following Linux packages are required for successful OSIMS Core installation:
Oracle-java7-jdk, mysql-server, libmysqlclient15-dev, libxml2-dev, bind, ant, flex,
curl, libcurl4-gnutls-dev, openssl, bison and subversion.
Execute the following commands at the Ubuntu console terminal in order to add the
oracle-java7-jdk repository and eventually install it,
• sudo add-apt-repository ppa:webupd8team/java
• sudo apt-get update
• sudo apt-get install oracle-java7-installer
• If the installation is successful then, by running “java -version”, you should
get a positive response with the Java version (cf., Fig. 10.6)
The following commands will install mysql-server, libmysqlclient15-dev,
libxml2, libxml2-dev, bind9, ant, flex, bison, curl, libcurl4-gnutls-dev, openssl
and subversion for the OSIMS Core,
– sudo apt-get install mysql-server libmysqlclient15-dev libxml2 libxml2-dev
bind9 ant flex bison curl libcurl4-gnutls-dev openssl subversion
– If the MySQL installation is successful then, by running “mysql -V”, you should
get a positive response with the MySQL version (cf., Fig. 10.7).

Downloading OSIMS Core


Before downloading the OSIMS Core into your Ubuntu machine, create the follow-
ing directory for OSIMS Core by using the following command,
• sudo mkdir /opt/OpenIMSCore/

Give yourself the ownership of the OSIMS Core directory, replace username with
your current username,
• sudo chown -R username /opt/OpenIMSCore/
Create CSCFs and the FHoSS directories in the OSIMS Core directory,
• cd /opt/OpenIMSCore
• mkdir ser_ims
• mkdir FHoSS
Execute the following commands to check out the latest version of the OSIMS
Core from the BerliOS subversion server,
• svn checkout http://svn.berlios.de/svnroot/repos/openimscore/ser_ims/trunk ser_ims
• svn checkout http://svn.berlios.de/svnroot/repos/openimscore/FHoSS/trunk FHoSS

Install OSIMS Core FHoSS


This section will setup the FHoSS database with its associated MySQL tables. The
following commands will setup the FHoSS database and populate it with its MySQL
tables,
• mysql -u root -p < ser_ims/cfg/icscf.sql
• mysql -u root -p < FHoSS/scripts/hss_db.sql
• mysql -u root -p < FHoSS/scripts/userdata.sql

Set JAVA_HOME Environment Variable


Take a note of the JAVA_HOME variable as this will be added in the file /etc/environ-
ment,
• JAVA_HOME="/usr/local/share/jdk1.7.0_xx/jre"
Replacing “xx” with your installed Java version number, add the above line at
the end of the file /etc/environment. Add the following line to the ~/.bashrc file,
• export JAVA_HOME=/usr/local/share/jdk1.7.0_xx/jre
again replacing “xx” with your installed Java version. You can perform these operations
using any plain text Linux editor of your choice.

Compile and Install ser_ims, FHoSS and the CSCFs


This section will compile and install OSIMS Core ser_ims, FHoSS and the CSCFs.
In /opt/OpenIMSCore/ser_ims,
• sudo make install-libs all
• This will take a while and you should be able to see messages scrolling along
the screen (cf., Fig. 10.8)
In /opt/OpenIMSCore/FHoSS,
• ant compile deploy

Fig. 10.8 SIP Express Router compilation messages

Fig. 10.9 prepend line in dhclient file

Configure Ubuntu DHCP and DNS


Configure the DHCP and DNS settings by editing the file /etc/dhcp/dhclient.conf (cf.,
Fig. 10.9). Uncomment the following line if you are running DNS on your own
Ubuntu machine,
• # prepend domain-name-servers 127.0.0.1;
The uncommented line will look like the line below,
• prepend domain-name-servers 127.0.0.1;
Copy the open-ims.dnszone DNS file from the SER configuration directory to
the bind folder,
• sudo cp /opt/OpenIMSCore/ser_ims/cfg/open-ims.dnszone /etc/bind/
Add these lines to /etc/bind/named.conf.local

zone "open-ims.test" {
type master;
file "/etc/bind/open-ims.dnszone";
};

Fig. 10.10 Positive ping responses

Edit the file /etc/resolv.conf and add the following lines below,
• search open-ims.test
• nameserver 127.0.0.1
You might need to reload bind for the above changes to take effect,
• sudo /etc/init.d/bind9 reload
Try to ping and see if you get a positive response (cf., Fig. 10.10).
• ping pcscf.open-ims.test

Run OSIMS Core


Before running OSIMS Core, copy configuration files from the SER configuration
directory to the OSIMS Core root directory,
• cp /opt/OpenIMSCore/ser_ims/cfg/* /opt/OpenIMSCore/
Run each CSCF in its own terminal console,
• ./pcscf.sh
• ./icscf.sh
• ./scscf.sh
By default periodic log messages will appear on the screen of each CSCF. Take a
note of any error message that appears periodically.
Run the FHoSS in its own terminal console,
• cd FHoSS/deploy/
• ./startup
• Figure 10.11 depicts the successful FHoSS deployment
If all CSCFs and HSS are running well, you should be able to see debug messages
on the I-CSCF terminal console stating that the HSS is opened (cf., Fig. 10.12).
If the FHoSS fails, check whether the variable JAVA_HOME is set properly by execut-
ing “echo $JAVA_HOME” at the terminal console.

Fig. 10.11 Successful FHoSS screen

Fig. 10.12 Successful communication between I-CSCF and HSS

10.2 What Is Android

Android is a mobile device platform developed by the Open Handset Alliance
(OHA) [7]. Google is the main developer. The OHA is a group of mobile
operators, handset manufacturers, software, semiconductor and commercialization
companies aiming to,
• Build better mobile phones for consumers:
According to the OHA’s own data, there are 1.5 billion television sets in the world
and a billion people accessing the Internet, but there are 3 billion people with at
least one mobile phone. Mobile phones are therefore one of the most successful
electronic consumer products in the world. With these statistics, the OHA realised
that building a better mobile phone would enrich the lives of countless
people in the world.

• Innovating in the open:
Knowing the importance of the open source community in responding to con-
sumers’ needs, the OHA started its first joint project, Android, with the main goal
of producing the first open, complete and free platform for mobile devices.
• Making the vision a reality:
Android is not just an operating system, it is also
– a complete set of software for mobile devices,
– a middleware, and
– key mobile applications.
The OHA will therefore provide mobile operators, developers and handset manu-
facturers everything required in order to build innovative devices, software and
services.

Mobile operators that form OHA include, Bouygues Telecom, China Mo-
bile Communications Corporation, China Telecommunications Corporation, China
United Network Communications, KDDI CORPORATION, NTT DOCOMO, INC.,
SOFTBANK MOBILE Corp., Sprint Nextel, T-Mobile, Telecom Italia, Telefónica,
TELUS and Vodafone.
Handset manufacturers that contribute to Android through the OHA include, Acer
Inc., Alcatel mobile phones, ASUSTeK Computer Inc., CCI, Dell, Foxconn Inter-
national Holdings Limited, FUJITSU LIMITED, Garmin International, Inc., Haier
Telecom (Qingdao) Co., Ltd., HTC Corporation, Huawei Technologies, Kyocera,
Lenovo Mobile Communication Technology Ltd., LG Electronics, Inc., Motorola,
Inc., NEC Corporation, Pantech, Samsung Electronics, Sharp Corporation, Sony Er-
icsson, Toshiba Corporation and ZTE Corporation.
Semiconductor companies that constitute the OHA include, AKM Semiconduc-
tor Inc, Audience, ARM, Atheros Communications, Broadcom Corporation, CSR
Plc., Cypress Semiconductor Corporation, Freescale Semiconductor, Gemalto, In-
tel Corporation, Marvell Semiconductor, Inc., MediaTek, Inc., MIPS Technologies,
Inc., NVIDIA Corporation, Qualcomm Inc., Renesas Electronics Corporation, ST-
Ericsson, Synaptics, Inc., Texas Instruments Incorporated and Via Telecom.
Software companies that form OHA include, Ándago Ingeniería S.L., ACCESS
CO., LTD., Ascender Corp., Cooliris, Inc., eBay Inc., Google Inc., LivingImage
LTD., Myriad, MOTOYA Co., Ltd., Nuance Communications, Inc., NXP Software,
OMRON SOFTWARE Co, Ltd., PacketVideo (PV), SkyPop, SONiVOX, SVOX,
and VisualOn Inc.
Commercialization companies that comprise OHA include, Accenture, Aplix
Corporation, Borqs, Intrinsyc Software International, L&T Infotech, Noser Engi-
neering Inc., Sasken Communication Technologies Limited, SQLStar International
Inc., TAT—The Astonishing Tribe AB, Teleca AB, Wind River and Wipro Tech-
nologies.

Table 10.1 Global smart phone market share

Platform     Q2 2012 shipments (million)   Q2 2012 % share   Q2 2011 shipments (million)   Q2 2011 % share   % Growth (Q2'12/Q2'11)
Total        158.3                         100.0             107.7                         100.0             46.0
Android      108.8                         68.1              51.2                          47.6              110.4
iOS          26.0                          16.4              20.3                          18.9              28.0
BlackBerry   8.5                           5.4               12.5                          11.6              −32.1
Symbian      6.4                           4.1               18.1                          16.8              −64.6
Windows      5.1                           3.2               1.3                           1.2               277.3
bada         3.3                           2.1               3.1                           2.9               5.1
Others       1.2                           0.8               1.1                           1.0               15.2

10.2.1 Android Smart Phone Market Share

Android has experienced a significant growth since its first release of Android beta
in 2007. According to Canalys’ [1] statistics, as per 2012 second quarter (Q2), a
quarterly shipment of Android has surpassed 100 millions smart phones for the
first time. Table 10.1 depicts the smart phone market share amongst popular mobile
operating systems.

10.2.2 Android Architecture

The Android architecture (cf., Fig. 10.13) follows a bottom-up paradigm. The
bottom layer is the Linux Kernel, which runs Linux version 2.6.x for core system
services such as security, memory and process management, network stacks and
the driver model.
The next layer is the Android native libraries, written in C and C++. Some of the
main native libraries are,

• Surface manager: It manages different windows for different Android applications.
• Media framework: It provides various audio and video codecs for recording and
playback.
• SQLite: It provides Android with database engine for data storage and retrieval.
• WebKit: It is a browser engine for web browser.
• OpenGL: It renders 2D or 3D graphics to the screen.

Fig. 10.13 Android architecture

The Android Runtime is made up of the Dalvik Virtual Machine (DVM) and the Core
Java Libraries. The DVM is a Java Virtual Machine (JVM) optimized for Android
mobile devices with low memory and processing power. The Core Java libraries provide
most of the classes defined in the Java SE libraries, such as the networking and IO
libraries.
The Application Framework layer provides the interface between Android applica-
tions and the native Android libraries. This layer also manages the default phone
functions, such as voice calls, and resource management, such as energy and memory
resources.
The application layer includes the default pre-installed Android applications such as
SMS, dialer, web browser and contact manager. This layer allows Android develop-
ers to install their own applications without seeking permission from the main developer,
Google. This layer is written in Java.

10.2.3 The History of Android

The official release of Android was in October 2008, when the T-Mobile G1 was
launched in the USA. Table 10.2 traces the history of the major Android ver-
sions from October 2008 to August 2012.

Table 10.2 The history of Android versions

• 1.5 Cupcake (April 2009): an on-screen keyboard, video capture and playback. Device: T-Mobile G1.
• 1.6 Donut (September 2009): CDMA support, Quick Search Box. Device: T-Mobile G1.
• 2.0, 2.1 Eclair (November 2009): ability to swipe the screen to unlock, Google Maps Navigation, multiple Google account support. Devices: HTC Nexus One, Motorola Droid One.
• 2.2 Froyo (May 2010): traditional password/PIN lock screen, redesigned home screen. Devices: Motorola Droid Two, HTC Nexus One.
• 2.3 Gingerbread (December 2010): an improved keyboard, support for front-facing cameras, better battery and application management tools. Device: Samsung Nexus S.
• 3.x Honeycomb (February 2011): a move from green to blue accents, redesigned home screen and widget placement. Device: Motorola Xoom.
• 4.0 Ice Cream Sandwich (October 2011): more home screen improvements, Android Beam, Face Unlock, new calendar and mail apps, data usage analysis. Device: Samsung Galaxy Nexus.
• 4.1 Jelly Bean (July 2012): Roboto refresh, expandable and “actionable” notifications, predictive text. Device: Samsung Galaxy Nexus.

10.2.4 IMSDroid IMS Client

IMSDroid [2] is an open source IMS client that implements the 3GPP IMS client spec-
ifications. The client is developed by Doubango Telecom [3]. Doubango Telecom
is a telecommunication company specializing in NGN technologies such as 3GPP,
TISPAN and PacketCable, with the aim of providing open source NGN products.
Apart from Android, Doubango also has open source IMS clients for Windows
Mobile, iPhone, iPad and Symbian. The SIP implementation is based on the [RFC3261]
and [3GPPTS24.229] Rel-9 specifications. IMSDroid is built to support both voice
and SMS over LTE as outlined in the One Voice initiative (Version 1.0.0) (cf.,
Fig. 10.14).
The One Voice Profile outlines the minimum requirements for a wireless mobile
device and network in order to guarantee an interoperable and high quality IMS
based telephony service over LTE network access. The required IMS capabilities
include SIP registration, authentication, addressing, call establishment, call
termination and signalling tracing and compression.

Fig. 10.14 One profile mobile device

IMSDroid supports the GSM Association (GSMA) [5] Rich Communication Suite
(RCS) Release 3. RCS is an effort by the GSMA that focuses on the use of IMS to
provide more than just voice for communication services. IMSDroid supports the
following features [2],
• Fully supports SIP as per RFC 3261 and 3GPP TS 24.229 Rel-9.
• Fully supports both TCP and UDP over IPv4 or IPv6.
• Fully supports Signalling Compression (SigComp) according to RFC 3320,
RFC 3485, RFC 4077, RFC 4464, RFC 4465, RFC 4896, RFC 5049, RFC 5112
and RFC 1951.
• Fully supports enhanced Address Book such as XCAP storage, authorizations
and presence.
• Partially supports GSM Association (GSMA) Rich Communication Suite
(RCS) Release 3.
• Partially supports the One Voice Profile V1.0.0, also known as GSMA Voice over
LTE (VoLTE).
• Partially supports MMTel UNI, which is used by GSMA RCS and GSMA
VoLTE.
• Implements IMS-AKA registration for both AKA-v1 and AKA-v2, Digest
MD5 and Basic.
• Implements 3GPP Early IMS Security as per 3GPP TS 33.978.
• Supports Proxy-CSCF discovery using DNS NAPTR+SRV.
• Supports private extension headers for 3GPP.
• Supports service Route discovery.
• Implements subscription to REG event package.
• Implements 3GPP SMS Over IP as per 3GPP TS 23.038, 3GPP TS 24.040,
3GPP TS 24.011, 3GPP TS 24.341 and 3GPP TS24.451.
• Supports voice call codecs such as G.729AB, AMR-NB, iLBC, GSM, PCMA,
PCMU and Speex-NB.

Fig. 10.15 Home screen before login to IMS

• Supports video call codecs such as VP8, H264, MP4V-ES, Theora, H.263,
H.263-1998 and H.261.
• Supports DTMF according to RFC 4733.
• Implements QoS negotiation using Preconditions according to RFC 3312, RFC
4032 and RFC 5027.
• Implements SIP Session Timers as per RFC 4028.
• Implements Provisional Response Acknowledgments (PRACK).
• Supports communication Hold according to 3GPP TS 24.610.
• Implements Message Waiting Indication (MWI) as per 3GPP TS 24.606.
• It is capable of calling E.164 numbers by using ENUM protocol according to
RFC 3761.
• Supports NAT Traversal using STUN as per RFC 5389, with the possibility to
automatically discover the server by using DNS SRV.
• Supports Image Sharing according to PRD IR.79 Image Share Inter-operability
Specification 1.0.
• Supports Video Sharing as per PRD IR.74 Video Share Inter-operability Speci-
fication, 1.0.
• Implements File Transfer which conforms to OMA SIMPLE IM 1.0.
• Support Explicit Communication Transfer (ECT) using IP Multimedia (IM)
Core Network (CN) subsystem as per 3GPP TS 24.629.
• Supports IP Multimedia Subsystem (IMS) emergency sessions according to
3GPP TS 23.167.
• Supports Full HD (1080p) video.
• Supports NAT Traversal using ICE.
• Support for TLS, SRTP.
• Full support for RTCP as per RFC 3550 and other extensions such as RFC 4585
and RFC 5104.
• Implements MSRP chat.
• Implements adaptive video jitter buffer. This has advanced features like error
correction, packet loss retransmission and delay recovery.
• Fully supports RTCWeb standards such as ICE, SRTP/SRTCP, and RTCP-
MUX.
Figures 10.15 and 10.16 depict IMSDroid screen shots before and after register-
ing to IMS, respectively.

Fig. 10.16 Home screen after login to IMS

Fig. 10.17 Lab scenario

10.3 Lab Scenario

Figure 10.17 depicts the Lab scenario that will be used to build up a testbed
for voice and video IMS communication using the OSIMS Core and IMSDroid. The
testbed consists of a Wi-Fi access network through a wireless router, two Android
based mobile handsets with IMSDroid installed, and the OSIMS Core for VoIP session
setup and termination.

Fig. 10.18 IMS client identity configuration

10.3.1 Configuring IMSDroid

IMSDroid can be downloaded and installed from Google Play [4], previously known
as Google Market. The following configurations are required in order for IMS-
Droid to work with the OSIMS Core.

IMSDroid Identity Settings


IMS identity settings are configured in the “Options” screen (cf., Fig. 10.18).
• Display Name: user nickname.
• IMS Public Identity: Public visible identifier could be either a SIP or tel URI
(example tel: +44123456 or sip:[email protected]).
• IMS Private Identity: The unique identifier assigned to a user. It could be either
a SIP URI (example, sip:alice@open-ims.test) or a tel URI (e.g., tel: +44123456).
• Password: Your password.
• Realm: The realm is the name of the domain to authenticate to. It should be a
valid SIP URI (example, open-ims.test).

IMSDroid Networking Settings


Network settings for IMS connectivity are configured as illustrated in Fig. 10.19,
• Enable WiFi: The client can be set up to use a Wi-Fi access network.
• Enable 4G/3G/2.5G: The client can be configured to use LTE, UMTS and EDGE
access networks.
• IPv4 or IPv6: The client can use IPv4 or IPv6 depending on the P-CSCF host.
• Proxy-CSCF Host: This is the IPv4/IPv6 address or Fully-Qualified Domain
Name of the IMS client PCSCF.

Fig. 10.19 IMS client network configuration

Fig. 10.20 “User Identities” menu item

• Proxy-CSCF Port: The port associated to the PCSCF. It is 4060 by default.


• Transport: The transport protocols supported are either UDP or TCP.
• Proxy-CSCF Discovery: It is omitted by default.
• Enable SigComp: This can be selected if the PCSCF supports SigComp.
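As a concrete sketch, the settings below would register the default subscriber alice against the OSIMS Core built in Sect. 10.1.5. The Proxy-CSCF host assumes that the handset can resolve the open-ims.test names; if it cannot, use the IP address of the machine running the P-CSCF instead.

Display Name: alice
IMS Public Identity: sip:alice@open-ims.test
IMS Private Identity: alice@open-ims.test
Password: alice
Realm: open-ims.test
Enable WiFi: enabled
Proxy-CSCF Host: pcscf.open-ims.test (or the P-CSCF machine’s IP address)
Proxy-CSCF Port: 4060
Transport: UDP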

10.3.2 Adding OSIMS Core Subscribers

After successful installation and running of the OSIMS Core, the main task left is
to add OSIMS Core subscribers. This is easily done through the FHoSS web based inter-
face manager. By default, the FHoSS comes provisioned with the subscribers alice@open-ims.test and
bob@open-ims.test. The password for alice is alice and that of bob is
bob.
You can always use the FHoSS web interface at http://localhost:8080 on the
FHoSS machine. By default, the administrator username is “hssAdmin” and the
password is “hss”.

Create IMS Subscription


To create an IMS Subscription (IMSU), click the “User Identities” menu item on the upper
menu of the FHoSS web interface (cf., Fig. 10.20).

Fig. 10.21 IMS Subscription

Click the “Create” menu item under “IMS Subscription” on the left menu, then insert
the Name of a new user of your choice. In our case we have used “charlie” as the username
(cf., Fig. 10.21); leave the other fields unchanged and click the “Save” button.

Create IMS Private Identity


The next step after creating the IMSU is to create the IMS Private Identity (IMPI). Click
the “Create” menu item under “Private Identity” on the left menu, then input the fol-
lowing,
• Identity field (in our case charlie@open-ims.test)
• Secret key (the password), in our case “charlie”
• Under Authentication Schemes, select the “ALL” checkbox
• Under Default authentication, choose Digest MD5
Leave the rest of the fields unchanged (cf., Fig. 10.22) and click the “Save”
button.

Associate IMSU
When the “Save” button is clicked, another screen will appear on the right side.
This screen (cf., Fig. 10.23) is for associating the IMSU to the IMPI. Input the IMS
User subscription which you created, in our case “charlie”, and then click on the
“Add/Change” button. Once the “charlie” IMSU is added, “charlie” will appear
under the “Associated IMSU” section (cf., Fig. 10.24).

Create IMS Public User Identity


After associating the IMSU to the IMPI, the next step is to create the IMS Public User Identity
(IMPU). Click the “Create” menu item under “Public User Identity” on the left menu,
then input the following,

Fig. 10.22 IMS private identity

• Identity field (in our case sip:charlie@open-ims.test)


• Service profile field: Select the default service profile (default_sp)

then click the “Save” button to save and leave the rest of the fields unchanged (cf.,
Fig. 10.25).
After clicking the “Save” button, another screen will appear on the right;
this screen (cf., Fig. 10.26) is for adding a “Visited network” for the IMPU
sip:charlie@open-ims.test. If this step is not done, user charlie will not be able
to register to the IMS. Under the list of “Visited network”, select open-ims.test and
click the “Add” button. The IMPU sip:charlie@open-ims.test will then appear in the “Visited
network” section (cf., Fig. 10.27).

Fig. 10.23 Associated IMSU to IMPI

Fig. 10.24 List of Associated IMSU to IMPI

Fig. 10.25 IMS Public User Identity



Fig. 10.26 IMPU visited network

Fig. 10.27 List of visited network

Fig. 10.28 Association of IMPU with IMPI

Association of IMPU with IMPI


This section will associate the IMPU to the IMPI. Under the “Associate IMPI(s) to IMPU”
section, input the IMPI “charlie@open-ims.test” in the “IMPI identity” field and click the
“Add” button. This step will list the IMPI “charlie@open-ims.test” in the “List of asso-
ciated IMPIs” section (cf., Fig. 10.28).

Fig. 10.29 IMSDroid dialer

10.4 Making Voice and Video Calls

This section will demonstrate how to make voice and video calls between Android
based mobile handsets installed with IMSDroid via the IMS.

10.4.1 Placing a Call

Voice and video calls can be placed from the dialer, the address book and the History. The
dialer is accessible from the home screen.
You can enter any phone number (for instance, ‘+441752586230’ or
‘01752586278’) or SIP URI (for example, ‘sip:alice@open-ims.test’). If the SIP URI
is incomplete (for instance, ‘alice’), the IMSDroid application will automatically add
the prefix “sip:” and a domain name (in our case ‘@open-ims.test’), as specified in
the “realm”, before placing a call.
If you input a telephone number with a ‘tel:’ prefix, the client will map the num-
ber to a SIP URI using the ENUM protocol.
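As a brief illustration of how this mapping works (using one of the example numbers above; the lookup itself is performed by the ENUM infrastructure rather than configured in this Lab): for tel:+441752586230 the digits are reversed, separated by dots and suffixed with e164.arpa, giving the domain 0.3.2.6.8.5.2.5.7.1.4.4.e164.arpa; a DNS NAPTR query on that domain then returns the SIP URI to be dialled.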
The IMSDroid dialer is depicted in Fig. 10.29. The username alice is ready to be
called; the bottom left square is for placing video calls, while the second bottom
left square is for placing voice calls. Once a call is placed, the outgoing screen will
appear (cf., Fig. 10.30) with the callee username and a “Cancel” button to terminate
the call if needed.

Fig. 10.30 IMSDroid outgoing call

10.4.2 In Call Screen

Once a call is placed and answered at the other end, a new screen, the “In Call Screen”
(cf., Fig. 10.31), will appear and a notification icon will be displayed in the status
bar of the mobile handset. This icon will stay in the status bar as long as the phone is
on a call. The icon is useful because it allows you to reopen the
‘In Call Screen’ if it is hidden. On the callee side, the “Incoming Call Screen” appears (cf., Fig. 10.32)
with the caller username and two buttons, either to “Answer” the call or to “Cancel” it.
If the “Answer” button is clicked then the session will be established. The video
screen will appear if a video session is established (cf., Fig. 10.33). You can share
multimedia content by pressing the “Menu” button as long as a call is ongoing (cf.,
Fig. 10.34).

10.5 Problems

1. This case study and the case study in Chap. 8 both use SIP as the signalling protocol.
This means that there is a possibility to interconnect the two systems.
• Outline the steps needed to interconnect the OSIMS Core and Asterisk.
• Implement the above steps and make sure you can make a call from a user
connected to Asterisk to a user connected to the IMS and vice versa.
2. Compute the call setup time between users connected to the OSIMS Core.

Fig. 10.31 In Call Screen

Fig. 10.32 Voice Incoming Call

3. Compare the call setup time between the two systems, i.e., Asterisk and OSIMS
Core.
4. By using Wireshark, compare and contrast SIP Registration headers of OSIMS
Core and Asterisk.

Fig. 10.33 Ongoing video session

Fig. 10.34 Content sharing

5. By using Wireshark, compare and contrast SIP Invite headers of OSIMS Core
and Asterisk.

References
1. Canalys (2012) Global smart phone market. http://www.eeherald.com/section/news/nws20120861.html. [Online; accessed 07-August-2012]
2. Doubango (2011) IMSDroid: SIP/IMS client for Android. http://code.google.com/p/imsdroid/. [Online; accessed 07-August-2012]
3. Doubango (2011) NGN open source projects. http://www.doubango.org/index.html. [Online; accessed 07-August-2012]
4. Google (2012) Google Play. https://play.google.com/store. [Online; accessed 07-August-2012]
5. GSMA (2012) GSM Association. http://www.gsma.com. [Online; accessed 07-August-2012]
6. IPTEL (2001) SIP Express Router. http://www.iptel.org/ser. [Online; accessed 12-August-2012]
7. OHA (2009) Open Handset Alliance. http://www.openhandsetalliance.com. [Online; accessed 07-August-2012]
8. Sippy Software (2008) Sippy RTPProxy. http://www.rtpproxy.org/. [Online; accessed 12-August-2012]
Index

Symbols Bernoulli loss model, 126


2-state Markov model, 126 Blackberry, 7
3G, 40 British Telecom (BT), 152
3GPP, 40, 101, 163 Broadvoice-32, 9, 197
3GPP2, 163 Broadvoice-32 FEC, 9, 197
3rd Generation Partnership Project, 40 BSC, 179
3SQM, 138 BSS, 179
4-state Markov model, 129 BTS, 179
BYE, 108, 119
A
Absolute Category Rating, 139, 149 C
Absolute Category Rating with Hidden Call-ID, 109, 112, 118
Reference, 149 CAMEL, 166
ACELP, 39 CANCEL, 108
ACK, 108, 119
CDR, 169
ACR, 139
CELP, 32, 38
Adaptive Multi Rate, 40
Challenge, 119
Adaptive Multi-Rate Wideband, 44
Cipher Key, 239
ADPCM, 26
Classes, 109
Advanced Video Coding, 68
CLI, 210
AIN, 166
Akiyo, 202 Client, 102
Algebraic CELP, 39 CN, 188
Allow, 109, 112 Code, 109
AMR, 40 Code Excited Linear Prediction, 38
AMR-WB, 44 Code-Excitation Linear Prediction, 32
AN, 188 Codebook Excitation Linear Prediction, 25
Android, 7 Common Intermediate Format (CIF), 67, 153
ASCII, 101, 222 Compressed RTP, 96
Asterisk, 203 Contact, 109, 112
Attributes, 116 Content-Disposition, 113
AuC, 241 Content-Encoding, 113
Audio, 115, 117 Content-Length, 109, 113
Authentication, 119 Content-Type, 109, 112, 113
AVC, 68 Contribution source identifier, 77
AVP, 116, 117 CounterPath, 8
Credentials, 119
B CRTP, 96
B2BUA, 166 CSCF, 167
Back-to-back, 166 CSeq, 109, 111
Bandwidth efficiency, 83 CSRC identifier, 77


D G
DAHDI, 193, 196, 211 G1, 249
Dalvik Virtual Machine, 248 GERAN, 181
Database, 119 GGSN, 173, 180
DCME, 38 GIP, 180
DCR, 139, 149 Google Chrome, 7
DCT, 44 Google Talk, 7
Debian, 196 GPRS, 180
Degradation Category Rating, 139, 149 GSM, 39, 179, 251
Dialog, 117 GSMA, 251
Differential Pulse Coding Modulation, 59 GUI, 222
Digital Circuit Multiplication Equipment, 38
Digium, 195, 207 H
Discontinuous transmission, 42 H.263, 67
Discrete Cosine Transform, 44, 54 H.264, 68
DNS, 106 H.265, 69
Doubango, 250 H.323, 101
Double stimulus impairment scale, 149 H323, 193
Double-stimulus continuous quality-scale, 150 Header Fields, 111
DPCM, 59 HEVC, 69
DSCH, 186 High Definition Voice, 44
DSCQS, 150 Highly Efficiency Video Coding, 69
DSIS, 149 HSDPA, 182
DTMF, 197 HSS, 173
DTX, 42 HTTP, 101, 107
Hypertext Transfer Protocol, 101
DVM, 248
I
E
IAX2, 193, 194
E-model, 145, 147
IBCF, 168
EIA, 35 ICID, 171
Electronic Industries Association, 35 IETF, 36, 115, 163
EPC, 187 IFC, 240
EPS, 187 ILBC, 41
Establishment, 118 IM, 10, 101
Ethernet bandwidth, 82 IM-SSF, 172
Ethernet BW, 82 IMPI, 256
ETSI, 34, 36, 163 IMPU, 256
European Telecommunication Standards IMPUI, 169
Institute, 34 IMS, 101, 114, 163, 251
IMSDroid, 250
F IMSU, 255
Fedora, 196 IMT-2000, 178
FHoSS, 237 INFO, 108
FIFO, 224 Information and Communication Technology
Flow, 117 (ICT), 2
FOKUS, 237 INMARSAT, 35
Forking Proxy, 104 Instant Message (IM), 2
Format, 107 Instant Messaging, 10
Forward Error Control (FEC), 6 Integrated Switched Digital Network (ISDN),
From, 108, 111 3
Full-Reference (FR) video quality assessment, Integrity Ley, 239
150 Interarrival jitter, 133
Fullband, 44 Interim Standard, 35

International Maritime Satellite Corporation, MGCP, 219


35 MHz, 178
International Radio Consultative Committee MIME, 113
(CCIR), 54 MLT, 43
International Telecommunications Union, MME, 188
Telecom Section, 34 Mobile, 105
Internet Engineering Task Force, 36 Modified Discrete Cosine Transfor, 37
Internet Low Bit Rate Codec, 41 Modulated Lapped Transform, 43
INVITE, 108, 119 MOS, 136
IP, 106 Motion Compensated, 64
IP Multimedia Subsystem, 163 Motion Picture Expert Group, 54
IPTV, 167 MP-MLQ, 39
IPv4, 114 MPEG-1, 64
IPv6, 114 MPEG-2, 64
ISC, 166, 171 MPEG-4, 68
ISDN, 196 MRFC, 172
ITU-R, 178 MSE, 151
ITU-T, 34, 36 Multi Pulse—Maximum Likelihood
Quantisation, 39
J Multiband Excitation model, 34
Java Virtual Machine, 248 Multimedia, 101, 117
JVM, 248 Multipart, 113
MySQL, 237
K
Key Frame Interval, 155 N
NAI, 112
L Narrowband speech, 35
Layers, 106 NAT, 9
LD-CELP, 38 National Telecommunications and Information
Libri, 196 Administration (NTIA), 152
Linear Prediction Coding (LPC), 29 NB speech, 35
Linux, 196 Network Address Translation, 9
Location Server, 106 Network Element, 102
Long Term Evolution, 178 Network Quality of Service, 123
Low-delay code excited linear prediction, 38 Next Generation Network, 163
LTE, 178, 186, 250 NGN, 163
No-Reference (NR) or Zero-Reference (ZR)
M video quality assessment, 154
ManyCam, 202 Non-uniform quantisation, 20
ManyCam Virtual Webcam, 201 NOTIFY, 108
MAP, 179 NQoS, 123
Marker bit, 76 NTTP, 196
Max-Forward, 109, 112
MBE, 34 O
MC, 64 OFDM, 187
MDCT, 37 OMA, 238
Mean Opinion Score, 136 OPTIONS, 108
Mean Packet to Packet Delay Variation, 133 OSA-SCS, 172
Mean Squared Error, 151
Media, 115, 117 P
MESSAGE, 108 P-CSCF, 168
Method, 108 Pair Comparison, 150
MFRP, 172 Password, 119
MGCF, 165, 166, 170 Payload, 115

Payload length, 80 Relationship, 118


Payload type, 77 Request, 104
PBX, 193 Request-Line, 108
PCC, 169 Response, 104
PCI, 204 RFC, 101, 102
PCM, 26, 170, 195 RJ11, 204
PCM μ-law, 37 RR, 91
PCM A-law, 37 RTCP, 76, 88
PCM-WB, 37 RTCP BYE, 94
PDF, 173 RTCP Extended report, 96
PDN, 187 RTCP Goodbye, 94
PDP-Context, 173 RTCP Source Description, 92
Peak Signal Noise Ratio, 151 RTCP XR, 96
Perceived Quality of Service, 124, 135 RTP, 75, 113
Perceptual Evaluation of Speech Quality, 143 RTP Control Protocol, 76, 88
Perceptual Evaluation of Video Quality, 153
Perceptual Speech Quality Measure, 143 S
PESQ, 143 S-CSCF, 169
Port, 115 SAE, 187
PQoS, 124 SB-ADPCM, 42
Private Branch Exchange (PBX), 3 SC-FDMA, 187
Proxy, 104 SCTP, 170
Proxy Server, 104 SDP, 113
PSNR, 151 SDSCE, 150
PSQM, 143 Sender Report, 89
PSTN, 35, 101, 164, 166, 167, 194, 195 Sequence number, 77
Public Switched Telephone Network, 35, 101, Sequential, 104
167 SER, 237
Public Switched Telephone Network (PSTN), Server, 102
1, 3 Session, 117
Session Description Protocol, 113
Q Session Initiation Protocol, 101
QoE, 124, 135 SGF, 173
QoS, 104, 123 SGSN, 180, 188
Quadruple-play services, 2 SGW, 170
Quality of Experience, 124, 135 Signal-to-Noise Ratio, 142
Quality of Service, 123 SILK, 41
Quarter Common Intermediate Format (QCIF), SIMPLE, 9, 101, 197
153 Simple Mail Transfer Protocol, 101
Quarter Quarter VGA (QQVGA), 155 Simultaneous double stimulus for a continuous
Quarter Video Graphics Array (QVGA), 155 evaluation, 150
Single Sided Speech Quality Measure, 138
R Single stimulus, 149
RAN, 179 Single stimulus continuous quality evaluation,
RCS, 251 150
Re-INVITE, 119 SIP, 9, 102, 164, 193
Real-time Transport Protocol, 75 SIPS, 107, 112
Receiver Report, 91 SMS, 174, 193, 249
Recency Effect, 140 SMTP, 101, 113
Redirect Server, 105 SNR, 142
Reduced-Reference (RR) video quality SNRseg, 142
assessment, 153 Spectrogram, 22
REGISTER, 105, 108, 119 Spectrum, 22
Registrar, 105, 119 Speech signal digitisation, 19

SR, 89 Ubuntu, 196


SSCQE, 150 UBUNTU, 203
SSRC identifier, 77 Ubuntu, 242
Standard Definition Television (SDTV), 154 Ultra-High Definition TV, 69
Standard-Definition TeleVision (SDTV), 152 UMTS, 40
Stateful Proxy, 104 Uniform quantisation, 19
Stateless Proxy, 104 Universal Mobile Telecom Systems, 40
Status-Line, 109 UPDATE, 108
Structural Similarity Index (SSIM), 151 URI, 107
Structure, 106 User Agent, 103
Sub-Band Adaptive Differential Pulse Code User Agent Client, 103
Modulation, 42 User Agent Server, 103
SUBSCRIBE, 108 Username, 107
Super-wideband, 36
SUSE, 196 V
SWB, 36 VAC, 202
Switched Communication Networks, 123 VAD, 9, 197
Synchronisation source identifier, 77 Validate, 119
Syntax and Encoding, 107 Variable Length Coding, 54
Version, 108
T VHE, 182
T-Mobile, 249 Via, 108, 111
TAS, 166 Video, 115
Tc, 215 Video Graphics Array (VGA), 153
TCP, 74 Video Quality Metric (VQM), 151
TCP/IP, 101 Virtual Audio Cable, 202
TD-CDMA, 178 VOCODER, 29
TD-SCDMA, 178 VOice enCODER, 29
TDM, 164, 196, 204 Voice over Internet Protocol (VoIP), 1
Telecommunications Industry Association, 35 Voicemail, 193
Telephony Application Server, 166
Terminate, 119 W
The 3rd Generation Mobile Networks, 40 W-CDMA, 178
TIA, 35 WAN, 215
Timestamp, 77 Waveform, 21
TISPAN, 163 WB speech, 35
TLS, 107 Wideband Extension of Pulse Code
To, 108, 111 Modulation, 37
Transaction, 117 Wideband speech, 35
Transaction Layer, 107 Windows Media Player, 203
Transaction User, 107 Wireshark, 117, 215
Transmission Control Protocol, 74
Transport Control Protocol, 74 X
Transport Layer, 107 X-Lite, 8, 197
Triple-play services, 2 XCAP, 238
XMPP, 7
U
UA, 103, 118 Z
UAC, 103 Zapata, 196
UAS, 104 Zaptel, 196
