Kundan Singh: Python


Introducing rtclite

We have attempted to unify our diverse Python based projects related to SIP, RTP, XMPP, RTMP, REST, etc., into a single theme of real-time communication (RTC). In particular, we have migrated the relevant source code from other projects to the new "rtclite" project after cleanup and refactoring. Moving forward, any new functionality related to Python implementations of real-time communication protocols and applications will be done in this new project.

Over the next few weeks, we will continue to evolve this project, migrate or fix issues/comments, and deprecate the previous older projects.

Over the next few weeks, I will also write blog posts here to introduce specific modules and show how they are used in the real world.

From the project page at http://www.rtclite.com (currently forwards to https://github.com/theintencity/rtclite)

"Light-weight implementations of real-time communication protocols and application in Python

This project aims to create an open source repository of light weight implementations of real-time communication (RTC) protocols and applications. In a nutshell, it contains reference implementations, prototypes and sample applications, mostly written in Python, of various RTC standards defined in the IETF, W3C and elsewhere, e.g., RFC 3261 (SIP), RFC 3550 (RTP), RFC 3920 (XMPP), RTMP, etc."

We encourage all users of my open source projects to visit this new project, and if possible migrate your applications to use this new project.

Performance of siprtmp: multitask vs gevent

Poor performance has been an issue in my RTMP server and SIP-RTMP gateway. Traditionally, I blamed the multitask framework for the poor performance. In this article I present my measurement results as well as introduce an alternative gevent-based implementation to improve the performance.

There are several performance aspects of this software, e.g., CPU utilization per call or session, memory usage, bandwidth requirement, etc. This article only focuses on the CPU performance. Moreover, I only consider the steady state CPU usage to measure the number of active simultaneous calls through the gateway. The CPU usage during call setup and termination is not considered.

The conclusion of my measurement is as follows. The SIP-RTMP gateway software using gevent takes about 2/3 of the CPU cycles of the multitask version, and the RTMP server software using gevent takes about 1/2 of the CPU cycles of the multitask version. After the improvements, on a dual-core 2.13 GHz CPU machine, a single audio call going through gevent-based siprtmp using the Speex audio codec at 8 kHz sampling takes about 3.1% CPU, and hence in theory the gateway can support about 60 active calls in steady state. Another way to look at it is that the software requires CPU cycles of about 66 MHz per audio call.

The gevent-based software is also available under the same license for you to try out. The next step to further improve the performance is to move part of the media processing of siprtmp to an external C/C++ extension module.

Background

Traditionally, I have used the multitask framework for co-operative multitasking in my Python software including p2p-sip, rtmplite and siprtmp. In the past, people have complained about high CPU utilization in siprtmp for a single call or even with no call. Part of the discussion is documented in issue 31. It turned out that the no-call CPU usage was a bug, and that we could optimize the multitask framework to improve the performance by approximately 2x. The optimization alters the way in which the multitask framework looks for io-events and more tasks. In particular, it gives more preference to tasks than to io-events, hence if a single io-event generates multiple tasks, all of them run before waiting for next io-events. These optimizations and fixes are in SVN r60 and r68. Unfortunately, these optimizations are not enough.
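The scheduling change described above can be pictured with a toy scheduler. This is only a sketch of the idea, not the multitask framework's code: `run`, `worker` and `poll_io` are hypothetical names, and `poll_io` stands in for a real wait on sockets and timers. It is written for Python 3.

```python
from collections import deque

def run(tasks, poll_io):
    """Toy cooperative scheduler that drains all runnable tasks before polling I/O."""
    ready = deque(tasks)
    trace = []
    while True:
        # Tasks take priority: keep running until no task is runnable.
        while ready:
            task = ready.popleft()
            try:
                trace.append(next(task))   # run the task up to its next yield
                ready.append(task)         # still alive, so schedule it again
            except StopIteration:
                pass                       # task finished
        # Only now wait for I/O; one event may produce several new tasks.
        new_tasks = poll_io()
        if not new_tasks:
            return trace
        trace.append("poll")
        ready.extend(new_tasks)

def worker(name, steps):
    """Hypothetical task that yields a label at each step."""
    for i in range(steps):
        yield "%s:%d" % (name, i)

# The first I/O event creates two tasks, the second creates one.
events = [[worker("a", 2), worker("b", 1)], [worker("c", 1)]]

def poll_io():
    return events.pop(0) if events else []

trace = run([], poll_io)
print(trace)  # ['poll', 'a:0', 'b:0', 'a:1', 'poll', 'c:0']
```

Both tasks created by the first event run to completion before the scheduler polls again, which is the behaviour the optimization aims for.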

To further improve the performance, I looked at the built-in asyncore module of Python and re-implemented rtmp.py to use asyncore. There was significant improvement of approximately another 1.5x to 2x. Unfortunately, getting timers to work with asyncore is not trivial. Hence I couldn't implement siprtmp easily as the SIP/RTP library relies heavily on timers.

Then I looked at the gevent project, thanks to a co-worker for recommending it. It supports co-routine based co-operative multitasking by modifying the existing blocking modules such as socket. Compared to the multitask framework, source code using gevent is more readable and easier to maintain because the task switching happens behind the scenes. In contrast, the multitask framework requires yield statements scattered everywhere and a non-trivial StopIteration exception to return a value from a task. I re-implemented siprtmp.py, and the related SIP/RTP modules, using gevent. Since the siprtmp module includes all of the rtmp module, it can also be used as an RTMP server in addition to being a SIP-RTMP gateway.
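To illustrate why generator-based tasks need a yield at every sub-call, here is a minimal trampoline using only the standard library. It is a sketch under assumptions, not the multitask API: `run_task`, `read_line` and `handle` are hypothetical names, and the "I/O" is faked by echoing the yielded value back. It targets Python 3, where a generator returns a value with `return` (delivered to the driver as `StopIteration.value`); Python 2 era code raised `StopIteration` explicitly instead.

```python
import types

def run_task(task):
    """Minimal trampoline: drive nested generator tasks to completion."""
    stack, value = [task], None
    while stack:
        try:
            step = stack[-1].send(value)
        except StopIteration as stop:
            # A finished task hands its result to its caller via StopIteration.
            stack.pop()
            value = stop.value
            continue
        if isinstance(step, types.GeneratorType):
            stack.append(step)   # "call" a sub-task
            value = None         # a fresh generator must first be sent None
        else:
            value = step         # stand-in for the result of an I/O request
    return value

def read_line(data):
    # A real task would yield an I/O request here and resume when data arrives.
    line = yield data.split("\r\n")[0]
    return line

def handle(data):
    line = yield read_line(data)   # every sub-call needs a yield
    return line.split(" ")[0]

method = run_task(handle("INVITE sip:bob@example.net SIP/2.0\r\n\r\n"))
print(method)  # INVITE
```

With gevent, `read_line` and `handle` would be ordinary functions that call each other directly, since the switch between tasks happens inside the patched blocking calls.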

Test Setup

All my tests were done on my MacBook laptop, with a 2.13 GHz Intel Core 2 Duo, 2 GB memory, and running Mac OS X 10.5.6. I used Python 2.7 for server side components and Flash debug player version MAC 10,0,45,2. I used X-Lite version 3 as a standard SIP client. The debug trace on the server was disabled by not supplying any -d option. All my clients and the server ran on the local host, hence bandwidth was not an issue. I used Mac's Activity Monitor to measure the CPU usage.

Measurement Result

The main metric is the CPU usage in percentage as reported by the Activity Monitor. There are several parameters that were altered and the effects were measured.

The siprtmp performance was measured for an audio call between a web-based VideoPhone sample application available as part of the siprtmp software, and the third-party X-Lite application. The sampling rate of the Speex audio codec can be 8kHz or 16kHz. The larger the sampling rate, the larger the encoded packet is. The CPU usage increases with higher sampling rate. Note that there is no transcoding in siprtmp. The following table shows the percentage CPU usage for siprtmp using multitask and gevent, and for the two sampling rates.

Rate	multitask	gevent
8 kHz	4.8-5.1%	3.1-3.2%
16 kHz	6.2-6.5%	4.0-4.1%

Based on these, we can conclude that the gevent-based SIP-RTMP gateway takes about 2/3 of the CPU of the multitask-based gateway. Roughly, the gevent-based gateway takes about 66 MHz of CPU cycles per audio call in steady state.
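The arithmetic behind these two figures can be reproduced from the table. The snippet below is just a worked calculation: it takes the midpoint of each reported range (my simplification) and converts the 8 kHz gevent figure of 3.1% into clock cycles on the 2.13 GHz test machine.

```python
CPU_MHZ = 2130  # 2.13 GHz clock of the test machine

# CPU usage ranges in percent from the table: (multitask, gevent)
usage = {
    "8 kHz": ((4.8, 5.1), (3.1, 3.2)),
    "16 kHz": ((6.2, 6.5), (4.0, 4.1)),
}

def midpoint(low_high):
    """Midpoint of a (low, high) range."""
    return sum(low_high) / 2.0

# Ratio of gevent CPU usage to multitask CPU usage for each sampling rate.
ratios = {rate: midpoint(gv) / midpoint(mt) for rate, (mt, gv) in usage.items()}

# Clock cycles consumed by one 8 kHz call at 3.1% CPU.
mhz_per_call = 3.1 / 100 * CPU_MHZ

for rate, ratio in ratios.items():
    print("%s: gevent/multitask = %.2f" % (rate, ratio))
print("about %.0f MHz per call" % mhz_per_call)
```

Both ratios come out at roughly 0.64, which is where the "about 2/3" figure comes from.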

The rtmp performance was measured using one publisher and zero or more players. The CPU usage increases with the number of players. Typically, an audio-only session gives less variance in the CPU usage, whereas if video is included, the packet size changes with the amount of movement or image detail, and so does the CPU usage. I used the Flash VideoIO project's test page to perform the tests. If video is present, Flash Player's camera capture uses these properties: cameraQuality=80, cameraWidth=320, cameraHeight=240, cameraFPS=12. Audio is always Speex 16 kHz with encodeQuality=6. The following table shows the percentage CPU usage using multitask and gevent, with one publisher and different numbers of players, and with or without video. If the variance is small, only the average is reported, whereas if the variance is large the range is listed.

Media	#players	multitask	gevent
Audio	0	2.2%	1.3%
Audio	1	3.5%	1.8%
Audio	2	4.5%	2.1%
Audio	3	5.5%	2.5%
Audio+Video	0	3.0-3.9%	1.4-1.7%
Audio+Video	1	4.2-4.7%	2.1%
Audio+Video	2	5.5-6.3%	2.7%
Audio+Video	3	7.0-7.6%	3.1%

Based on these, we can conclude that the gevent-based software takes about 1/2 or less of the CPU of the multitask-based software for RTMP streaming. Roughly, the gevent-based server takes 34 MHz/publisher and 12 MHz/player of the CPU cycles in steady state.

Tips for implementing application protocols

This article presents some tips for implementing application protocols such as for web services, multimedia communication, streaming or Internet telephony. The tips are mostly relevant for implementations in the Python programming language.

Keep all blocking operations outside your protocol implementation. This mostly includes sockets, files and timers. If you design your protocol parser and controller to be independent of blocking calls, then it can easily be converted to various asynchronous or synchronous controllers as needed. For example, the rfc3261.py module implements the core SIP stack using the Stack class. The application supplies the API for timer creation, message receiving as well as sending. When the application receives a packet on a socket, it invokes a method on the stack. When the stack has parsed the received packet and needs to report a high-level event, such as an incoming call, to the application, it invokes a method on the application. This allows an application such as voip.py to provide a controller based on co-operative multitasking. On the other hand, the built-in HTTPServer in Python includes synchronous and blocking calls for sockets and disks. This makes the built-in class's HTTP implementation hard to use for high-performance applications that cannot afford to block. Due to this, almost every web framework implements its own HTTP, instead of re-using the built-in class. The trade-off is that your implementation may become more involved if you keep blocking operations outside.
Do not use multi-threading in your protocol implementation. Firstly, getting a multi-threaded application right is very hard. Secondly, for CPU intensive tasks or disk I/O bound tasks, CPython's global interpreter lock (GIL) will prevent efficiency anyway; hence multiprocessing should be used. Thirdly, for network I/O bound applications multi-threading has an advantage, but not as much as multiprocessing. Consider using multiprocessing, but beware of cross-platform problems, especially on Windows! In my experience, co-operative multitasking (or green threads) works best for protocol implementation. If you are worried about efficiency on multi-core CPUs, you should leave that decision to the main application that will use your protocol implementation to present a client or a server application. The main application can decide whether to use multi-threading or multiprocessing and how to co-ordinate among threads or processes.
Decouple the protocol parsing and handling implementation. Sometimes you may need to use just the parser without the handler. For example, if a single incoming TCP connection can carry HTTP, SIP or RTSP messages, then it becomes easier for the application to first parse the message to determine what it is, and then invoke the appropriate handler. Because of NATs and firewalls, many application protocols need to be sent via a single port, e.g., 80 or 443. If an application from Flash Player connects to your server on TCP, it will first send a socket policy request, before sending any other actual application protocol message. If your protocol parser is separate from the handler, you can invoke the socket policy request parser as well as the protocol parser, to determine what request it is.
Avoid blocking on DNS lookup, if possible. This goes back to the first point: do not block in your protocol implementation. Usually it is hard to notice that a DNS lookup is blocking. Most built-in libraries provide synchronous and blocking calls for DNS lookup. Consider using an asynchronous DNS library. If that is not possible, move the DNS lookup out of the core protocol implementation, to the main application. Sometimes DNS lookup is done during logging, e.g., to convert a client IP address to a host name, and may be hard to detect.
Log all warnings, errors and exceptions. In server implementations, you may be tempted to handle various exceptions and ignore them, to make your server more "robust". Unfortunately, this practice leads to more headaches later on when some critical bug appears but is hard to detect. If you log all warning, error or exception conditions, even if you ignore them, you may be able to detect such bugs early on. A warning is a suspicious behavior either in your code or in an external system. An error is a failure case due to some external problem, e.g., a file requested by the client was not found on the server. An exception is most likely a programming mistake, e.g., accessing an attribute on "NoneType".
Do not hold on to resources. Even with automatic reference counting and garbage collection, it remains your responsibility to free up any unused references. Typically the application protocol defines how long the resources should be kept, e.g., how long a SIP transaction lasts. But there are some resources which can persist for a much longer duration, e.g., user contact location. External databases are more suitable for such resources. Secondly, with an event driven software architecture such as the listener-provider model, it is easy to get into reference loops, e.g., the listener has a reference to the provider and vice-versa. Similarly, a Message object may have a list of Header objects, and each Header may refer back to the Message. Your cleanup code should correctly free up unused references, e.g., "del varname".
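As an illustration of the tips on keeping blocking calls out of the protocol code and on decoupling parsing from handling, here is a minimal sketch of a detector for a shared TCP port. It is not taken from any of my modules: `detect_protocol`, `POLICY_REQUEST` and `VERSION` are hypothetical names, the function only looks at the first line, and it is written for Python 3. It relies on the fact that HTTP, SIP and RTSP put the version token last in a request line and first in a response line, and that Flash Player's socket policy request is the NUL-terminated string `<policy-file-request/>`.

```python
import re

# Flash Player sends this NUL-terminated request before any application data.
POLICY_REQUEST = b"<policy-file-request/>\x00"

# Version token, e.g. the last word of "INVITE sip:bob@example.net SIP/2.0"
# or the first word of "HTTP/1.1 200 OK".
VERSION = re.compile(rb"^(HTTP|SIP|RTSP)/\d+\.\d+$")

def detect_protocol(data):
    """Return 'policy', 'HTTP', 'SIP', 'RTSP', or None if unknown or incomplete.

    Pure function over bytes: no sockets, timers or other blocking calls.
    """
    if data.startswith(POLICY_REQUEST):
        return "policy"
    line, sep, _ = data.partition(b"\r\n")
    if not sep:
        return None  # first line is not complete yet; the caller reads more
    tokens = line.split()
    # Check the first token (responses) and the last token (requests).
    for token in tokens[:1] + tokens[-1:]:
        match = VERSION.match(token)
        if match:
            return match.group(1).decode("ascii")
    return None

print(detect_protocol(b"INVITE sip:bob@example.net SIP/2.0\r\nVia: ...\r\n"))
print(detect_protocol(b"GET /index.html HTTP/1.1\r\nHost: example.net\r\n\r\n"))
```

The main application owns the socket, feeds the received bytes to this function, and then passes the connection to whichever handler matches. Because the function never touches a socket, the same code works under gevent, asyncore-style callbacks or plain blocking reads.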

Programming languages for implementing SIP

The programming language used for the implementation can affect the software architecture. For example, Flash ActionScript is a pure event-based language with no way of implementing a blocking operation. Hence, when a connection is made on a socket object, the socket object will dispatch the success or failure event. The caller installs the appropriate event handlers to continue the processing after the connection is completed.

C
There are two main reasons for implementing SIP in C: the ability to compile on several platforms and very high performance. The primary advantage of implementing a SIP stack in C is that it can be easily ported and compiled on a variety of platforms, especially embedded platforms. Usually a C compiler is available for a platform, whereas others such as a Java interpreter or C++ compiler may not be. Secondly, because there is no overhead (e.g., in terms of run-time environment and code size), the performance is usually the best. The main problem with implementing a SIP stack in C is the development time and the cost of maintaining the software. Finding bugs and adding a feature in a C program is usually more challenging than in other languages. However, if the software is well designed then the problem can be alleviated to a large extent. In any case, the number of lines of code that need to be written in C is usually much larger than in high level languages such as Java or Python.

The pjsip project presents an implementation of SIP and other related protocols. It has been used in a variety of real-world projects and has proven itself to be a good SIP implementation especially where performance matters or the resources are limited.

C++
The object oriented design allows better reusability and maintainability compared to programming in C. However, the number of lines of code is still large. If advanced C++ features such as standard template library (STL) are used then portability may become a concern for certain embedded platforms.

One of my earlier SIP implementations was in C++ (and C) at Columbia University. Another example implementation is reSIProcate, an open source SIP stack.

Java
The Java programming language is very popular in the corporate world and among enterprise application developers. The standards community has developed APIs that cover several aspects of SIP implementation and some of its extensions. This allows the application to be built against those APIs whereas the actual implementation of the SIP stack can be provided by several vendors. The main problem with an implementation written in Java is that it tends to be too verbose. Java on one hand gives the illusion of a very high level language, but on the other hand requires the programmer to write a lot of code even to do a small thing. Part of this is because of the way the language is defined -- in particular the exception handling and strict compiler enforced type checking. This requires the programmer to do a lot of typecasting and hence there is potential for run-time errors. Another problem with Java is that the run-time could become a memory hog.

NIST SIP Stack provides the reference implementation of the JAIN SIP API.

Tcl
The Tcl programming language is not as popular as other high level languages such as Perl, PHP or Java. However, because of the simplicity of its language constructs, it is very easy to learn. Unless the software is designed right, it becomes very difficult to maintain a large piece of software.

Columbia’s SIP user agent, sipc, is written in Tcl.

ActionScript
ActionScript (or ECMAScript 4) improves on the Java programming language by allowing a much smaller source code size and much shorter syntax for common operations. However, there are two major limitations: the platform is limited to Flash Player or AIR (Adobe integrated run-time) and the language is purely event-based. The limitation of Flash Player may not seem important at the beginning, but it prevents certain features. For example, the current version of Flash Player doesn’t have UDP or TLS sockets that could be used for SIP. Even the functions of a TCP socket are limited in that you can only initiate connections but cannot receive incoming connections. This prevents us from implementing a complete SIP stack in ActionScript without support from Flash Player or an additional plugin. The Flash Player version for embedded devices is usually older than the current version, which makes portability an issue. Because of the run-time overhead the performance is limited. The media codecs supplied by Flash Player 9 or earlier are proprietary codecs that are usually not supported by any other implementation. This causes interoperability issues as well.

Python
The object oriented nature, the compact coding style and the very small source code size of the implementation make Python a very good choice for implementing application protocols such as SIP. There is some overhead because it is an interpreted language; however the overhead is comparable to that of the Java run-time. The interpreter is now usually available for embedded platforms as well, making it more portable than other languages such as ActionScript or C++.
The biggest advantage of an implementation in Python is that the code size is drastically smaller than in other languages. I have implemented the basic SIP stack in less than 2000 lines of Python source code. Compare this with the Java implementation of SIP which has more than 1000 files. The smaller size means that not only is the development time shorter, but the testing, code review and maintenance costs are also much lower.

Welcome to 39 peers!

I have launched an open-source project named "39 peers". From the web site:

"The 39 Peers project aims at implementing an open-source peer-to-peer Internet telephony software using the Session Initiation Protocol (P2P-SIP) in the Python programming language. The software is still incomplete -- especially the P2P part.

Peer-to-peer systems inherently have high scalability, fault tolerance and robustness against catastrophic failures because there is no central server and the network self-organizes itself. Internet telephony can be an application of peer-to-peer architecture where the participants locate and communicate with each other without relying on expensive or managed service providers. 39 peers project is an attempt to provide a open source and free-for-all peer-to-peer network targeted towards open standards based real-time communication.

The 39 peers project is developed for student developers and researchers to experiment with new ideas. It is written in Python scripting language. It supports open protocols such as IETF SIP and RTP. It is licensed under GNU/GPL license."