ipcontroller process goes to 100% CPU, ignores connection requests #3795

@lbeltrame

Background: I'm using @chapmanb's ipython-cluster-helper to distribute approximately 660 jobs over an 800-core cluster managed by SGE.

I noticed that if I request more than 250 jobs from a pool of 600, the ipcontroller process overloads, goes to 100% CPU, and stops answering requests from the engines. It does not matter whether all jobs are sent at once or staggered in batches: I ultimately observe the same issue.

Setting different values of TaskScheduler.hwm has no effect.
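
For reference, a sketch of how this option is typically set in a profile's ipcontroller_config.py. The value shown is purely illustrative and is not one of the values tried here; the file location and the way ipython-cluster-helper passes options are assumptions.

```python
# ipcontroller_config.py (inside the IPython profile directory; location assumed)
c = get_config()  # provided by IPython when it loads the config file

# High water mark: maximum number of outstanding tasks per engine.
# 0 means no limit; the value below is an illustrative choice only.
c.TaskScheduler.hwm = 1
```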

Attaching gdb to the runaway process shows

(gdb) py-list
  92                timeout = -1
  93            
  94            timeout = int(timeout)
  95            if timeout < 0:
  96                timeout = -1
 >97            return zmq_poll(list(self.sockets.items()), timeout=timeout)
  98    

Called in /usr/local/lib/python2.7/dist-packages/zmq/sugar/poll.py.
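
To make the frame above easier to read, here is a minimal standalone sketch of the timeout handling visible in the listing. It assumes the line just before the listing is an `if timeout is None:` check (not shown in the gdb output); the function name `normalize_poll_timeout` is hypothetical and is not part of pyzmq.

```python
def normalize_poll_timeout(timeout):
    """Hypothetical helper mirroring the timeout handling shown in the listing."""
    # Assumption: the line preceding the listing checks for None.
    if timeout is None:
        timeout = -1
    # Fractional values are truncated to an integer number of milliseconds.
    timeout = int(timeout)
    # Any negative value collapses to -1, which zmq_poll treats as
    # "block until an event arrives".
    if timeout < 0:
        timeout = -1
    return timeout


# Show how a few representative inputs are normalized.
for value in (None, -5, 0, 2.7):
    print(repr(value), "->", normalize_poll_timeout(value))
```

In other words, the process is sitting in the `zmq_poll` call itself; the sketch only illustrates which timeout value that call would receive.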

Example log: https://gist.github.com/lbeltrame/6087409

This happens with:

ipython master (from yesterday)
pyzmq latest stable or latest git (happens with both)
Python 2.7.3

Some details on the SW/HW setup as well:

32 diskless machines + 2 diskful machines (A and B). A serves the OS images for the 32 diskless machines, and B provides the storage area (both over NFS). All run Debian 7. ipython etc. are installed through git / pip (no virtualenv).

All the processing happens on the shared NFS storage. When this issue occurs, network traffic isn't that high (it's a 1 Gbit network) and I/O is negligible as well.

Previous discussion (more related to the software where this issue occurred originally): bcbio/bcbio-nextgen#60
