Shard Optimizer States with ZeroRedundancyOptimizer
===================================================

.. note::
    ``ZeroRedundancyOptimizer`` was introduced in PyTorch 1.8 as a prototype
    feature. Its API is subject to change.

In this recipe, you will learn:

- The high-level idea of ``ZeroRedundancyOptimizer``.
- How to use ``ZeroRedundancyOptimizer`` in distributed training and its impact.


Requirements
------------

- PyTorch 1.8+
- `Getting Started With Distributed Data Parallel <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_


What is ``ZeroRedundancyOptimizer``?
------------------------------------

The idea of ``ZeroRedundancyOptimizer`` comes from the
`DeepSpeed/ZeRO project <https://github.com/microsoft/DeepSpeed>`_ and
`Marian <https://github.com/marian-nmt/marian-dev>`_, which shard
optimizer states across distributed data-parallel processes to
reduce the per-process memory footprint. In the
`Getting Started With Distributed Data Parallel <https://pytorch.org/tutorials/intermediate/ddp_tutorial.html>`_
tutorial, we showed how to use
`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html>`_
(DDP) to train models. In that tutorial, each process keeps a dedicated replica
of the optimizer. Since DDP has already synchronized gradients in the
backward pass, all optimizer replicas operate on the same parameter and
gradient values in every iteration, and this is how DDP keeps model replicas in
the same state. Oftentimes, optimizers also maintain local states. For example,
the ``Adam`` optimizer uses per-parameter ``exp_avg`` and ``exp_avg_sq`` states. As a
result, the ``Adam`` optimizer's memory consumption is at least twice the model
size. Given this observation, we can reduce the optimizer memory footprint by
sharding optimizer states across DDP processes. More specifically, instead of
creating per-parameter states for all parameters, each optimizer instance in
different DDP processes only keeps optimizer states for a shard of all model
parameters. The optimizer ``step()`` function only updates the parameters in its
shard and then broadcasts its updated parameters to all other peer DDP
processes, so that all model replicas still land in the same state.

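The sketch below illustrates this sharding idea in a single process. It is a
simplified, hypothetical example rather than the actual ``ZeroRedundancyOptimizer``
implementation (which handles partitioning and synchronization for you); the
round-robin assignment of parameters to "ranks" is an assumption made for brevity.

::

    import torch
    import torch.nn as nn

    def shard_parameters(params, world_size):
        # Assign each parameter to exactly one rank; only that rank keeps
        # optimizer states (e.g., Adam's exp_avg and exp_avg_sq) for it.
        shards = [[] for _ in range(world_size)]
        for i, p in enumerate(params):
            shards[i % world_size].append(p)
        return shards

    model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
    shards = shard_parameters(list(model.parameters()), world_size=2)

    # Each "rank" builds a regular Adam instance over its own shard only, so
    # per-parameter states exist for roughly 1 / world_size of the model.
    optimizers = [torch.optim.Adam(shard, lr=0.01) for shard in shards]

    # In the real distributed setting, rank r would call optimizers[r].step()
    # after the backward pass and then broadcast its updated shard to the
    # other ranks, so that all model replicas stay in the same state.
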
How to use ``ZeroRedundancyOptimizer``?
---------------------------------------

The code below demonstrates how to use ``ZeroRedundancyOptimizer``. The majority
of the code is similar to the simple DDP example presented in the
`Distributed Data Parallel notes <https://pytorch.org/docs/stable/notes/ddp.html>`_.
The main difference is the ``if-else`` clause in the ``example`` function, which
wraps the optimizer construction, toggling between ``ZeroRedundancyOptimizer``
and the plain ``Adam`` optimizer.


::

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    import torch.optim as optim
    from torch.distributed.optim import ZeroRedundancyOptimizer
    from torch.nn.parallel import DistributedDataParallel as DDP

    def print_peak_memory(prefix, device):
        if device == 0:
            print(f"{prefix}: {torch.cuda.max_memory_allocated(device) // 1e6}MB ")

    def example(rank, world_size, use_zero):
        torch.manual_seed(0)
        torch.cuda.manual_seed(0)
        os.environ['MASTER_ADDR'] = 'localhost'
        os.environ['MASTER_PORT'] = '29500'
        # create default process group
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        # create local model
        model = nn.Sequential(*[nn.Linear(2000, 2000).to(rank) for _ in range(20)])
        print_peak_memory("Max memory allocated after creating local model", rank)

        # construct DDP model
        ddp_model = DDP(model, device_ids=[rank])
        print_peak_memory("Max memory allocated after creating DDP", rank)

        # define loss function and optimizer
        loss_fn = nn.MSELoss()
        if use_zero:
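            # ZeroRedundancyOptimizer wraps a regular optimizer class
            # (Adam here) and shards its per-parameter states, such as
            # exp_avg and exp_avg_sq, across the processes in the group.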
            optimizer = ZeroRedundancyOptimizer(
                ddp_model.parameters(),
                optim=torch.optim.Adam,
                lr=0.01
            )
        else:
            optimizer = torch.optim.Adam(ddp_model.parameters(), lr=0.01)

        # forward pass
        outputs = ddp_model(torch.randn(20, 2000).to(rank))
        labels = torch.randn(20, 2000).to(rank)
        # backward pass
        loss_fn(outputs, labels).backward()

        # update parameters
        print_peak_memory("Max memory allocated before optimizer step()", rank)
        optimizer.step()
        print_peak_memory("Max memory allocated after optimizer step()", rank)

        print(f"params sum is: {sum(model.parameters()).sum()}")



    def main():
        world_size = 2
        print("=== Using ZeroRedundancyOptimizer ===")
        mp.spawn(example,
            args=(world_size, True),
            nprocs=world_size,
            join=True)

        print("=== Not Using ZeroRedundancyOptimizer ===")
        mp.spawn(example,
            args=(world_size, False),
            nprocs=world_size,
            join=True)

    if __name__ == "__main__":
        main()

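If the script is saved to a file (say ``zero_example.py``, a hypothetical name)
and run directly with ``python zero_example.py``, it spawns ``world_size = 2``
processes on a single machine via ``mp.spawn``. It expects at least two CUDA
devices, since each rank places its model replica and inputs on ``cuda:rank``.
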
The output is shown below. When ``ZeroRedundancyOptimizer`` is enabled with ``Adam``,
the additional peak memory consumed by the optimizer ``step()`` is about half of
what vanilla ``Adam`` consumes. This agrees with our expectation, as we are sharding
the ``Adam`` optimizer states across two processes. The output also shows that, with
``ZeroRedundancyOptimizer``, the model parameters still end up with the same
values after one iteration (the parameter sum is the same with and without
``ZeroRedundancyOptimizer``).

::

    === Using ZeroRedundancyOptimizer ===
    Max memory allocated after creating local model: 335.0MB
    Max memory allocated after creating DDP: 656.0MB
    Max memory allocated before optimizer step(): 992.0MB
    Max memory allocated after optimizer step(): 1361.0MB
    params sum is: -3453.6123046875
    params sum is: -3453.6123046875
    === Not Using ZeroRedundancyOptimizer ===
    Max memory allocated after creating local model: 335.0MB
    Max memory allocated after creating DDP: 656.0MB
    Max memory allocated before optimizer step(): 992.0MB
    Max memory allocated after optimizer step(): 1697.0MB
    params sum is: -3453.6123046875
    params sum is: -3453.6123046875

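As a rough sanity check (assuming 4-byte ``float32`` parameters and ignoring
allocator rounding and temporary buffers), these numbers match the sharding
argument: the model holds about 20 * (2000 * 2000 + 2000) ~ 80M parameters,
i.e. roughly 320 MB, and ``Adam`` keeps two extra states per parameter.

::

    model size                     ~ 80M params * 4 bytes ~ 320 MB
    full Adam states               ~ 2 * 320 MB           ~ 640 MB
    step() increase, vanilla Adam  = 1697 MB - 992 MB     = 705 MB
    step() increase, sharded       = 1361 MB - 992 MB     = 369 MB  (about half)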