TFMesos
=======
.. image:: https://badges.gitter.im/douban/tfmesos.svg
   :alt: Join the chat at https://gitter.im/douban/tfmesos
   :target: https://gitter.im/douban/tfmesos?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge

.. image:: https://img.shields.io/travis/douban/tfmesos.svg
   :target: https://travis-ci.org/douban/tfmesos/

.. image:: https://img.shields.io/pypi/v/tfmesos.svg
   :target: https://pypi.python.org/pypi/tfmesos

.. image:: https://img.shields.io/docker/automated/tfmesos/tfmesos.svg
   :target: https://hub.docker.com/r/tfmesos/tfmesos/
TFMesos is a lightweight framework that helps run distributed `TensorFlow <https://www.tensorflow.org>`_ machine learning tasks on `Apache Mesos <http://mesos.apache.org>`_ within `Docker <https://www.docker.com>`_ and `Nvidia-Docker <https://github.com/NVIDIA/nvidia-docker/>`_.
TFMesos dynamically allocates resources from a Mesos cluster, builds a distributed training cluster for TensorFlow, and keeps different training tasks managed and isolated within the shared Mesos cluster with the help of Docker.
Prerequisites
-------------
- For Mesos >= 1.0.0:

  - A Mesos cluster (cf. `Mesos Getting Started <http://mesos.apache.org/documentation/latest/getting-started>`_). All nodes in the cluster should be reachable using their hostnames, and all nodes should have identical ``/etc/passwd`` and ``/etc/group``.
  - Set up the Mesos agent to enable the `Mesos Containerizer <http://mesos.apache.org/documentation/container-image/>`_ and (optionally) `Mesos Nvidia GPU Support <https://issues.apache.org/jira/browse/MESOS-4626>`_, e.g.: ``mesos-agent --containerizers=mesos --image_providers=docker --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia``
  - (optional) A distributed filesystem, e.g. `MooseFS <https://moosefs.com>`_
  - Ensure the latest TFMesos Docker image (`tfmesos/tfmesos <https://hub.docker.com/r/tfmesos/tfmesos/>`_) is pulled across the whole cluster
- For Mesos < 1.0.0:

  - A Mesos cluster (cf. `Mesos Getting Started <http://mesos.apache.org/documentation/latest/getting-started>`_). All nodes in the cluster should be reachable using their hostnames, and all nodes should have identical ``/etc/passwd`` and ``/etc/group``.
  - Docker (cf. `Docker Get Started Tutorial <https://docs.docker.com/engine/installation/linux/>`_)
  - Mesos Docker Containerizer support (cf. `Mesos Docker Containerizer <http://mesos.apache.org/documentation/latest/docker-containerizer/>`_)
  - (optional) An Nvidia-Docker installation (cf. `Nvidia-Docker installation <https://github.com/NVIDIA/nvidia-docker/wiki/Installation>`_); make sure ``nvidia-docker-plugin`` is accessible from remote hosts (i.e. started with ``-l 0.0.0.0:3476``)
  - (optional) A distributed filesystem, e.g. `MooseFS <https://moosefs.com>`_
  - Ensure the latest TFMesos Docker image (`tfmesos/tfmesos <https://hub.docker.com/r/tfmesos/tfmesos/>`_) is pulled across the whole cluster
If you are using an AWS G2 instance, here is a `sample script <https://github.com/douban/tfmesos/blob/master/misc/setup-aws-g2.sh>`_ to set up most of these prerequisites.
Running a simple test
---------------------
After setting up Mesos and pulling the Docker image on a single node (or a cluster), you should be able to run a simple test with the following command:
.. code:: bash

    $ docker run -e MESOS_MASTER=mesos-master:5050 \
        -e DOCKER_IMAGE=tfmesos/tfmesos \
        --net=host \
        -v /path-to-your-tfmesos-code/tfmesos/examples/plus.py:/tmp/plus.py \
        --rm \
        -it \
        tfmesos/tfmesos \
        python /tmp/plus.py mesos-master:5050
Successfully running the test should result in an output of 42 on the console.
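For reference, ``plus.py`` builds a tiny TFMesos cluster and evaluates a simple graph whose result is 42. The following is a minimal sketch of that pattern using the ``tfmesos.cluster`` context manager; the job layout and constants here are illustrative assumptions, and the authoritative version is ``examples/plus.py`` in the repository:

.. code:: python

    import sys

    import tensorflow as tf
    from tfmesos import cluster


    def main(argv):
        mesos_master = argv[1]
        # illustrative layout: one parameter server, one worker
        jobs_def = [
            dict(name='ps', num=1),
            dict(name='worker', num=1),
        ]
        with cluster(jobs_def, master=mesos_master, quiet=False) as targets:
            # place the constants on the ps task and the sum on the worker
            with tf.device('/job:ps/task:0'):
                a = tf.constant(10)
                b = tf.constant(32)
            with tf.device('/job:worker/task:0'):
                op = a + b
            # drive the remote graph through the worker's gRPC target
            with tf.Session(targets['/job:worker/task:0']) as sess:
                print(sess.run(op))  # prints 42


    if __name__ == '__main__':
        main(sys.argv)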
Running in replica mode
-----------------------
This mode is called *Between-graph replication* in the official `Distributed TensorFlow Howto <https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/distributed/index.md#replicated-training>`_.
Most of the distributed training models that Google has open-sourced (such as `mnist_replica <https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/dist_test/python/mnist_replica.py>`_ and `inception <https://github.com/tensorflow/models/blob/master/inception/inception/inception_distributed_train.py>`_) use this mode. In this mode, two kinds of jobs are defined, named 'ps' and 'worker': 'ps' tasks act as parameter servers and 'worker' tasks run the actual training process.
Here we use our modified `mnist_replica <https://github.com/douban/tfmesos/blob/master/examples/mnist/mnist_replica.py>`_ as an example:

- Check out the mnist example code into a directory on the shared filesystem, e.g. ``/nfs/mnist``
- Assume the Mesos master is ``mesos-master:5050``
- Now we can launch this script using the following commands:
CPU:
.. code:: bash

    $ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
        --net=host \
        -v /nfs/mnist:/nfs/mnist \
        -v /etc/passwd:/etc/passwd:ro \
        -v /etc/group:/etc/group:ro \
        -u `id -u` \
        -w /nfs/mnist \
        tfmesos/tfmesos \
        tfrun -w 1 -s 1 \
        -V /nfs/mnist:/nfs/mnist \
        -- python mnist_replica.py \
        --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
        --job_name {job_name} --worker_index {task_index}
GPU (1 GPU per worker):
.. code:: bash

    $ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
        --net=host \
        -v /nfs/mnist:/nfs/mnist \
        -v /etc/passwd:/etc/passwd:ro \
        -v /etc/group:/etc/group:ro \
        -u `id -u` \
        -w /nfs/mnist \
        tfmesos/tfmesos \
        tfrun -w 1 -s 1 -Gw 1 -- python mnist_replica.py \
        --ps_hosts {ps_hosts} --worker_hosts {worker_hosts} \
        --job_name {job_name} --worker_index {task_index}
Note:

In this mode, ``tfrun`` is used to prepare the cluster and launch the training script on each node, and worker #0 (the chief worker) is launched in the local container. ``tfrun`` substitutes ``{ps_hosts}``, ``{worker_hosts}``, ``{job_name}``, and ``{task_index}`` with the corresponding values for each task.
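To make the substitution concrete: a between-graph training script like ``mnist_replica.py`` typically turns these four values into a ``tf.train.ClusterSpec`` and a ``tf.train.Server``. The sketch below uses the standard TensorFlow 1.x API and mirrors the flag names from the invocation above; the model body is elided:

.. code:: python

    import argparse

    import tensorflow as tf

    parser = argparse.ArgumentParser()
    parser.add_argument('--ps_hosts')                # filled from {ps_hosts}
    parser.add_argument('--worker_hosts')            # filled from {worker_hosts}
    parser.add_argument('--job_name')                # 'ps' or 'worker', from {job_name}
    parser.add_argument('--worker_index', type=int)  # filled from {task_index}
    args = parser.parse_args()

    cluster = tf.train.ClusterSpec({
        'ps': args.ps_hosts.split(','),
        'worker': args.worker_hosts.split(','),
    })
    server = tf.train.Server(cluster, job_name=args.job_name,
                             task_index=args.worker_index)

    if args.job_name == 'ps':
        # parameter servers only host variables and serve requests
        server.join()
    else:
        # pin variables to the ps tasks and computation to this worker
        with tf.device(tf.train.replica_device_setter(
                worker_device='/job:worker/task:%d' % args.worker_index,
                cluster=cluster)):
            pass  # build the model here
        # ...then train against server.target with a tf.Session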
Running in fine-grained mode
----------------------------
This mode is called *In-graph replication* in the official `Distributed TensorFlow Howto <https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/distributed/index.md#replicated-training>`_.
In this mode, we have more control over the cluster spec. All nodes in the cluster are remote and just run a gRPC server; each worker is driven by a local thread that runs its training task.
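Concretely, the driver script declares the whole cluster up front and builds one graph spanning all tasks, using the same ``tfmesos.cluster`` context manager as in the test sketch above. A hedged sketch of the pattern follows; the job sizes, GPU counts, and the placeholder model body are illustrative, and the modified ``mnist.py`` referenced below is the real version:

.. code:: python

    import tensorflow as tf
    from tfmesos import cluster

    # illustrative layout: two parameter servers, two workers with one GPU each
    jobs_def = [
        dict(name='ps', num=2),
        dict(name='worker', num=2, gpus=1),
    ]

    with cluster(jobs_def, master='mesos-master:5050', quiet=False) as targets:
        # in-graph replication: one replica of the model per worker task
        losses = []
        for i in range(2):
            with tf.device('/job:worker/task:%d' % i):
                losses.append(tf.constant(0.0))  # placeholder for a real loss
        total_loss = tf.add_n(losses)
        # a single local thread drives all remote tasks via the chief's gRPC target
        with tf.Session(targets['/job:worker/task:0']) as sess:
            print(sess.run(total_loss))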
Here we use our modified `mnist <https://github.com/douban/tfmesos/blob/master/examples/mnist/mnist.py>`_ as an example:

- Check out the mnist example code into a directory, e.g. ``/tmp/mnist``
- Assume the Mesos master is ``mesos-master:5050``
- Now we can launch this script using the following commands:
CPU:
.. code:: bash

    $ docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
        --net=host \
        -v /tmp/mnist:/tmp/mnist \
        -v /etc/passwd:/etc/passwd:ro \
        -v /etc/group:/etc/group:ro \
        -u `id -u` \
        -w /tmp/mnist \
        tfmesos/tfmesos \
        python mnist.py
GPU (1 GPU per worker):
.. code:: bash

    $ nvidia-docker run --rm -it -e MESOS_MASTER=mesos-master:5050 \
        --net=host \
        -v /tmp/mnist:/tmp/mnist \
        -v /etc/passwd:/etc/passwd:ro \
        -v /etc/group:/etc/group:ro \
        -u `id -u` \
        -w /tmp/mnist \
        tfmesos/tfmesos \
        python mnist.py --worker-gpus 1
