Description
Background: I'm using @chapmanb's ipython-cluster-helper to distribute approximately 660 jobs over an 800-core cluster running SGE.
I noticed that if I request more than 250 jobs from a pool of 600, the ipcontroller process overloads: it pegs at 100% CPU and stops answering requests from the engines. It does not matter whether all jobs are sent at once or staggered in batches; I observe the same issue either way.
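For reference, the batched submission pattern looks roughly like the following. This is a minimal sketch: `run_job` and the batch size of 250 are illustrative, and the actual load-balanced-view submission is only shown in a comment since it needs a live cluster.

```python
# Hypothetical sketch of the batched submission pattern (names illustrative).

def chunks(seq, size):
    """Split seq into consecutive batches of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

jobs = list(range(660))       # ~660 jobs, as in the report above
batches = chunks(jobs, 250)   # e.g. batches of 250 jobs each

print(len(batches))           # 3 batches: 250 + 250 + 160 jobs

# In practice each batch would be submitted through an IPython.parallel
# load-balanced view, roughly:
#   rc = Client(); lview = rc.load_balanced_view()
#   for batch in batches:
#       lview.map_async(run_job, batch).get()
```

Either way (one batch of 660, or several smaller ones), the controller ends up in the state described above.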
Setting different values of TaskScheduler.hwm has no effect.
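For completeness, the hwm values were tried through the controller configuration. A minimal sketch (the file name and option are the standard IPython.parallel ones; the value 1 is just an example):

```python
# ipcontroller_config.py (in the active IPython profile directory)
c = get_config()

# High-water mark for the task scheduler: 0 (the default) means the
# scheduler buffers tasks without limit; a small value pushes tasks
# out to engines eagerly. Neither setting changed the behaviour.
c.TaskScheduler.hwm = 1
```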
Attaching gdb to the runaway process shows:

```
(gdb) py-list
  92                timeout = -1
  93
  94            timeout = int(timeout)
  95            if timeout < 0:
  96                timeout = -1
 >97            return zmq_poll(list(self.sockets.items()), timeout=timeout)
  98
```

called from /usr/local/lib/python2.7/dist-packages/zmq/sugar/poll.py.
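For context, a negative timeout passed to zmq_poll means "block until a socket is ready", which is what the clamping to -1 in the listing implements; yet the process spins at 100% CPU. The timeout semantics are analogous to the stdlib select call (a stdlib sketch for illustration, not pyzmq itself):

```python
import select
import socket

# A connected pair of sockets to poll on.
a, b = socket.socketpair()
b.send(b"ping")

# timeout=0 is a non-blocking check: select returns immediately.
readable, _, _ = select.select([a], [], [], 0)
print(readable == [a])  # True: data is pending on `a`

# Omitting the timeout (like zmq_poll's timeout=-1) would instead
# block until a registered socket becomes ready.
a.close()
b.close()
```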
Example log: https://gist.github.com/lbeltrame/6087409
This happens with:

- IPython master (from yesterday)
- pyzmq latest stable and latest git (happens with both)
- Python 2.7.3
Some details on the software/hardware setup as well:

32 diskless machines plus 2 diskful machines (A and B). A serves the OS images for the 32 diskless nodes, and B hosts the storage area (NFS in both cases). All run Debian 7. IPython and related packages are installed through git/pip (no virtualenv).

All processing happens on the shared NFS storage. When this issue occurs, network traffic is not particularly high (it's a 1 Gbit network) and I/O is negligible as well.
Previous discussion (more related to the software where this issue occurred originally): bcbio/bcbio-nextgen#60