Description
Background: I'm using @chapmanb's ipython-cluster-helper to distribute approximately 660 jobs over an 800-core cluster running SGE.
I noticed that if I request more than 250 jobs from a pool of 600, the ipcontroller process overloads: it pegs at 100% CPU and stops answering requests from the engines. It does not matter whether all jobs are sent at once or staggered in batches; I observe the same issue either way.
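For reference, the batched submission pattern looks roughly like the following. This is a minimal sketch: `run_job` and the batch size of 250 are illustrative, and the actual load-balanced-view submission is only shown in a comment since it needs a live cluster.

```python
# Hypothetical sketch of the batched submission pattern (names illustrative).

def chunks(seq, size):
    """Split seq into consecutive batches of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

jobs = list(range(660))       # ~660 jobs, as in the report above
batches = chunks(jobs, 250)   # e.g. batches of 250 jobs each

print(len(batches))           # 3 batches: 250 + 250 + 160 jobs

# In practice each batch would be submitted through an IPython.parallel
# load-balanced view, roughly:
#   rc = Client(); lview = rc.load_balanced_view()
#   for batch in batches:
#       lview.map_async(run_job, batch).get()
```

Either way (one batch of 660, or several smaller ones), the controller ends up in the state described above.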
Setting different values of TaskScheduler.hwm has no effect.
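For completeness, the hwm values were tried through the controller configuration. A minimal sketch (the file name and option are the standard IPython.parallel ones; the value 1 is just an example):

```python
# ipcontroller_config.py (in the active IPython profile directory)
c = get_config()

# High-water mark for the task scheduler: 0 (the default) means the
# scheduler buffers tasks without limit; a small value pushes tasks
# out to engines eagerly. Neither setting changed the behaviour.
c.TaskScheduler.hwm = 1
```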
Attaching gdb to the runaway process shows:

```
(gdb) py-list
  92                timeout = -1
  93
  94            timeout = int(timeout)
  95            if timeout < 0:
  96                timeout = -1
 >97            return zmq_poll(list(self.sockets.items()), timeout=timeout)
  98
```

called from /usr/local/lib/python2.7/dist-packages/zmq/sugar/poll.py.
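For context, a negative timeout passed to zmq_poll means "block until a socket is ready", which is what the clamping to -1 in the listing implements; yet the process spins at 100% CPU. The timeout semantics are analogous to the stdlib select call (a stdlib sketch for illustration, not pyzmq itself):

```python
import select
import socket

# A connected pair of sockets to poll on.
a, b = socket.socketpair()
b.send(b"ping")

# timeout=0 is a non-blocking check: select returns immediately.
readable, _, _ = select.select([a], [], [], 0)
print(readable == [a])  # True: data is pending on `a`

# Omitting the timeout (like zmq_poll's timeout=-1) would instead
# block until a registered socket becomes ready.
a.close()
b.close()
```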
Example log: https://gist.github.com/lbeltrame/6087409
This happens with:

- IPython master (from yesterday)
- pyzmq latest stable and latest git (happens with both)
- Python 2.7.3
Some details on the software/hardware setup as well:

32 diskless machines plus 2 diskful machines (A and B). A serves the OS images for the 32 diskless nodes, and B hosts the storage area (NFS in both cases). All run Debian 7. IPython and related packages are installed through git/pip (no virtualenv).

All processing happens on the shared NFS storage. When this issue occurs, network traffic is not particularly high (it's a 1 Gbit network) and I/O is negligible as well.
Previous discussion (more related to the software where this issue occurred originally): bcbio/bcbio-nextgen#60