Showing posts with label Skype. Show all posts
Showing posts with label Skype. Show all posts

Protocol Jungle of Internet multimedia communication

The diagram shows several protocols for Internet multimedia communication. (Click on the diagram to see the full size picture.) In the protocol jungle, a protocol is analogous to a species, its real-world implementation or deployment is an animal of the species. Some animals or species compete with each other for survival. Some animals live with each other in harmony. Some animals do not care or interact with each other since they live in different place, i.e., application or domain. Evolution and mutation results in long lasting survival of some species whereas others become extinct. Unlike using a protocol zoo metaphor, I use a protocol jungle, because there is really a competition between protocols when big companies have invested in certain protocol unlike a closely guarded and nurtured zoo system.




The diagram shows the species and its relationship with other species, e.g., whether A uses B or whether A and B are friendly. Due to space constraint, some items are grouped together, e.g., all the audio/video codecs, and some relationships are missing, e.g., RTMP is friendly with Speex. Ideally, we need a multi-dimensional representation to show multiple aspects of the jungle and how they are related. The following text lists the protocols that serve similar or common functions, and usually are competing within that function.

FunctionProtocols
Structured data encodingXML, ASN.1, RFC822, others
Audio encodingG.711, G.723.1, G.722, G.726, G.728, G.729, MP3, Speex, Nellymoser, AMR, Silk, GIPS, etc.
Video encodingH.261, H.263, H.264, MPEG, Sorenson, Vidyo, etc.
Media transportRTP/RTCP, SRTP, ZRTP, Skype, IAX, RTMP, RTMFP
RendezvousSIP, H.323, Skype, Stratus/RTMFP
Session descriptionSDP, H.245, Jingle
Session negotiationSIP/SDP, H.245, Jingle, Skype, RTMFP
Call signaling and controlSIP, H.225/Q.931, Skype, IAX, MGCP, SCCP (Skinny), RTMFP
Streaming media controlRTSP, RTMP
Session announcementSAP
ConnectivityICE/STUN/TURN, Skype, RTMFP
Remote Procedure CallSOAP, XMLRPC, REST, RTMP
Programming callsCGI, CPL, CCXML, MSCML
Programming voice dialogVoiceXML
Instant messagingXMPP, SIMPLE, MSRP
PresenceXMPP, SIMPLE
Shared resource accessREST, XMPP, XCAP
Shared stateXMPP, RTMP, HTTP


As you can see that a SIP system typically employs one protocol for one task or a few related tasks, but integrated monolithic systems such as those based on RTMP/RTMFP, Skype or IAX tend to combine multiple functions in the single protocol. I have not listed H.32x protocols other than H.323 because those are intended for non-IP networks. Nevertheless, there are several H.32x systems, e.g., for room based video conferencing or for carrying voice among carriers.

Interworking

With multiple protocols available for the same function, interoperability or interworking among those becomes important. I have talked about SIP and XMPP interworking in the last post. I have hands-on experience with several of the interworking scenarios among protocols shown in the diagram.

H.323-H.324: One of my projects in my first job was interworking between H.323 and H.324. Since both these systems use H.245 as the main session description and negotiation, the interworking task is relatively simple. I also worked on part of H.320 system to try to build H.323-H.320 interworking, but did not complete.

SIP-H.323: One of my first project during my M.S. at Columbia University was SIP-H.323 interworking. I have written sip323 software and couple of internet drafts and papers [1] on this. My PhD thesis gives a complete interworking procedure for basic call setup and registration. The conclusion was that while basic call setup and registration are easy to interwork, the full interworking of all the supplementary services is not feasible and not even needed in many cases. Since both SIP and H.323 use RTP/RTCP for media transport and can use the same set of codecs, the signaling gateway is efficient. The company SIPquest which productized my software demonstrated 10k simultaneous calls (this article).

SIP-RTSP: These protocols serve different purposes, but it is possible to build a system that needs both these functions in a standard compliant way. The sipum software is a voice mail and answering machine that uses SIP for calls and RTSP for recording and playback of media. Since both these use RTP/RTCP for media transport and can use the same set of codecs, the software is efficient as the media path can bypass the software. Please see my papers [1] for details.

SIP-RTMP: There have been several attempts at implementing Flash based SIP systems and SIP-RTMP translator is one of the approach. Some existing projects that implement these are siprtmp, gtalk2voip, red5phone and flaphone. Since RTMP is an integrated streaming protocol which can also do control and RPC, the translator is inefficient since it needs to incorporate the media path as well.

SIP-Skype: Being a proprietary protocol, it is not easy to interwork with Skype. However, Skype itself uses SIP to allow trunking with PSTN providers, and recently there was some news about SIP-based Skype gateway for enterprise.

SIP-IAX: Although IAX is open, it is an integrated protocol that combines media and signaling in the same connection, hence suffers from the same scalability problem as other integrated protocols like RTMP. Asterix also has a SIP gateway so that it can talk to SIP-enabled devices, especially carrier equipments.

SIP-XMPP: There is a interest group that discusses this in depth. My last post gives more links about the interworking scenarios using a gateway or co-location in the client.

SIP-RTMFP: Given the P2P promise of RTMFP, a gateway between these two protocols will be able to connect the proprietary Adobe protocol with the rest of the world for a true web-based end-to-end media path. I haven't seen any system that does this.

SIP-H.320: This gateway is particularly useful for existing room based video conferencing systems that want to connect with more Internet-style SIP devices. The idea is similar to SIP-H.323 translator, and in fact a real deployment may use two gateways: SIP-H.323 and H.323-H.320 in practice.

RTMP-XMPP: Since RTMP and XMPP serve two completely different functions, there is no need to interoperate. However, people have built systems that use XMPP for messaging and signaling while using RTMP for media path. Unfortunately since Jingle extension wants to define its own end-to-end session, it becomes not so useful for exchanging RTMP server session information. In particular use XMPP custom extensions based on presence and message to rendezvous, but do session control and call management in RTMP itself.

XMPP-SIMPLE: The SIP-XMPP interest group is also looking at SIMPLE-XMPP translation. However, given the disconnect between the two protocols, it is likely that all the presence and message updates go through the gateway and hence not as efficient as one would want for presence and instant messages.

RTMP-Skype: Now this is going to be really tough because firstly Skype is still a proprietary protocol, and secondly, both these are integrated protocols hence requiring complete conversion of signaling and media. An specific example could be allowing people to access Skype from web pages, e.g., by having a simple RTMP server in the Skype application itself. This works if Skype is running on your local computer. Alternatively, you need the Flash application to connect to Skype super-nodes running on public computers using RTMP. This poses security risk and is inefficient. Why inefficient? because RTMP over TCP means that only the applications on public Internet will be able to receive the connection, and RTMP is not really good for real-time interactive communication because of its latency and buffering. However, if such gateway are incorporated in Skype, then it truly become ubiquitous to web applications.

RTMP-RTSP: These are two competing streaming protocols. Instead of having a gateway that translates between the two protocols, it might be better to build an integrated client or integrated server -- you can record using RTMP and view using Quicktime (RTSP), or you can use the same client to access real-time streams from RTMP or RTSP. Since RTMP incorporates RPC along with streaming control and media path, whether as RTSP is just streaming control, a complete translation of all the functions may not be feasible.

ASN.1-XML: There has been effort to standardize this, e.g., XER. The proposed H.325 standard by ITU-T will use XML while allowing compatibility with some of the predecessors which are in ASN.1 PER. ASN.1 and XML are just data formats and for the purpose of P2P-SIP, they are not very significant.

If you have data about the usage in real deployment for particular protocol(s), feel free to post your comment.

Problems due to NATs and firewalls

Network Address Translators (NAT) and firewalls create problems for end-to-end connectivity on the Internet. This not only affects P2P-SIP but also client-server SIP. In this article I post some example numbers to illustrate the point.

These numbers are for example only: suppose there are 10% public Internet nodes, 30% nodes behind good (cone or address restricted) NAT, 30% nodes behind bad (symmetric) NAT and 30% nodes behind UDP blocking firewalls (F). Let's denote these as P=10%, G=30%, B=30%, F=30%. Here the public Internet nodes are typically from universities and research institutes, those behind good NAT are usually from residential DSL/Cable access, those behind bad NAT are partly from residential and partly from enterprise environment, and those behind UDP blocking firewalls are from enterprise and corporate networks. Suppose a call event between any two pair of nodes is independent of each other for the probability analysis purpose and nodes are equally likely to call any other node. Thus, percentage of calls between two public Internet nodes is (10%)^2 = 0.01 = 1%.

Now let us enumerate the NAT and firewall traversal techniques available to SIP. STUN helps with good NAT, whereas TURN relay is needed for bad NAT. ICE is used to negotiate the connectivity using STUN or TURN bindings. A TCP-based relay (or even HTTP relay) is needed for UDP blocking and very restricted firewalls. (what about TCP hole punching and other techniques?) A STUN server is light in terms of bandwidth utilization, whereas a TURN relay needs high network bandwidth and hence costs the service provider more money. Same is the case with TCP-based relay.

In a call if one participant is behind a UDP blocking firewall (F), then the call must use a TCP relay. This amounts to 1-(1-F)^2 = 51% calls going through TCP relay.

In a call if both participants are behind bad NAT, then we need a TURN relay. This amounts to B^2 = 9% of the calls.

If one participant in a call is either on public Internet or good NAT and other is on public Internet, good NAT or bad NAT, then the media can go end-to-end using STUN bindings. This amounts to 40% of the calls.

In conclusion, the VoIP provider will need to host UDP or TCP relays for 51+9=60% of the calls. This is not a good proposition.

In real world, the call events are not independent of each other: probability of a corporate user calling another corporate user within the same corporation is high. Also probability of a home user calling another home user is also high. For example, a SIP service targeted towards consumers can expect to have most of the calls among residential users. Thus, the percentage of calls that can be end-to-end is much higher than 40%. Similarly, an enterprise VoIP system can expect to have mostly internal intra-enterprise calls, which do not need to cross the enterprise firewall. Hence the percentage of calls needing the relay is not as high as 60%. Let us analyze these two use cases separately.

Suppose, for a consumer SIP service, the distribution of nodes is P=15%, G=50%, B=30%, F=5%, i.e., less number of users are from bad NAT or UDP blocking firewalls. In this scenario about 20% calls need a relay whereas 80% calls don't.

In an enterprise VoIP system, suppose 60% calls are intra-office and 40% are with outside the office network, then only those 40% calls need a relay whereas 60% calls don't. In a properly engineered enterprise VoIP system, appropriate ports are opened for UDP as well as appropriate media relays are installed in DMZ which facilitates smooth media path for inter-office communication.

While we can play with these numbers as much as we want, the fact remains that a significant percentage of calls need media relay, either UDP TURN relays or TCP relays. This puts unnecessary burden on the VoIP service provider to install and manage relays and buy network bandwidth for those relays, or simply disallow calls that require relay (in which case they may lose customers).

In a peer-to-peer system with super nodes such as Skype, these super nodes can act as media relays and hence save a lot of bandwidth and maintenance cost for the provider. There are some things to consider though: a node behind public Internet can become UDP as well as TCP relay for any call, whereas a node behind good NAT can become only UDP relay with some workaround, but not a TCP relay. This puts too much burden on nodes behind public Internet.

Let us consider the original example with P=10%, G=30%, B=30%, F=30%. In this case the 51% of calls that require TCP relay must use one of the 10% P nodes. When acting as a relay, the bandwidth requirement at the relay is twice that of when the node is in a call. Suppose each node makes N calls a day, and generally speaking needs bandwidth for N calls. However, a public Internet node not only needs bandwidth for its own N calls, but also for relaying 5xN calls of other users which amounts to total bandwidth for 11xN calls. Thus, while the super-node architecture is beneficial to the provider, it heavily punishes users on the public Internet. (My guess is that number of public nodes using VoIP are about 4-5%, which further burdens the public nodes).

A managed P2P-SIP infrastructure can be a good alternative, where corporations and universities donate hosts/bandwidth on high speed network to act as relays/super-nodes. Alternatively, one can have an incentive system to promote hosts to become relays and super-nodes.

Structured vs unstructured P2P or Why we chose DHT for P2P-SIP?

One of the questions that people ask me is 'whether structured or unstructured P2P works better for SIP-based Internet telephony?'. My answer is that structured P2P such as distributed hash table (DHT) works better because of the following reason.

Unstructured P2P networks usually rely on caching and replication to improve the lookup performance and reliability. In the case of file-sharing application, once the file named "matrix-II.mpg" is uploaded in a P2P network such as Kazaa, the content of the file remains the same (un-mutable). Thus, for reliability the file may get replicated at multiple places. Intermediate nodes in the download path may cache the file locally to improve download performance by other downstream nodes. On the other hand, a phone's contact location or IP address uploaded under the resource named "[email protected]" may change often (i.e., mutable), e.g., if bob changes his device or the device's IP address is changed. This makes all the randomly replicated and cached copy of the resource content useless. Thus, random caching and replication is not useful.

In unstructured P2P network, any replica of the file "matrix-II.mpg" is good enough, whereas in Internet telephony you want to reach only "[email protected]" but not some replica of Bob. You can not replicate Bob's phone. The only thing that can be replicated is the mapping between Bob's identifier and his phone's IP address. However, to allow frequent updates and consistent view of the mapping, it is desired that the mapping is replicated in a structured manner so that both Bob can easily update the mapping and any prospective caller can easily query the up-to-date copy of this mapping. This means given the resource name, we can know where the resource should be present in the P2P network. Thus, it is structured.

Unstructured P2P is suitable for random or blind search and allows wild-card search. For example, you can search for a file name containing "matrix" keyword. On the other hand, in Internet telephony most calls are made to the known destination (phone number or destination user identifier). Hence, hash table interface that allows lookup based on a key is good enough. Distributed hash table (DHT)-style P2P algorithms such as Pastry, Chord, Bamboo provide ideal platform for such lookups.

Skype is believed to be unstructured among the super-node, but I think there may be some form of structure in how the data is stored among the super-nodes. Plus, there is no guarantee on upper bound on lookup cost which means it might be falling back to some centralized lookup if the number of hops in search exceed some limit. Since Skype is proprietary, we do not know how exactly it does lookups.

Thunderbird as a deployment vehicle

One of the reasons for success of Skype is that they already had a deployment vehicle in the form of KaZaA file sharing network. I believe that for mass deployment of any P2P VoIP tool including open standard based P2P-SIP we need free software availability and a deployment vehicle.

Deployment vehicle is just a mechanism or tool that already exists and is widely used, e.g., tools for email, web, file sharing, or something more basic as Windows OS. For example, someone can add P2P-SIP support in popular email clients such as Microsoft Outlook Express or Mozilla Thunderbird. The latter is easier since it is open source.

Project description: the Thunderbird client can be extended to also serve as a (peer-to-peer) voice and IM client. IM buddies would appear in the Thunderbird address book, allowing to display buddies and their online status. IM transcripts might be saved and searched in the same way that messages are saved. This work can leverage a SIP project, http://www.croczilla.com/zap/, and might build on the Cockatoo project at http://cockatoo.mozdev.org.

One issue is that not many people use Thunderbird compared to Microsoft product (although I believe that will change in future). Alternatively, a web interface similar to Gmail and google-talk can be provided that can be easily accessed from any web browser. The challenge is to use P2P instead of central web server.