add `auto_wrap_policy` into XLA FSDP for automatic wrapping #4318

ronghanghu · 2022-12-12T19:55:28Z

This PR adds the auto-wrapping feature in XLA FSDP, similar to the native PyTorch FSDP's auto_wrap_policy argument.

Auto-wrapping submodules based on policies

We now allow to automatically wrap the submodules in an nn.Module based on the policy specified in the auto_wrap_policy argument to the XlaFullyShardedDataParallel class.

For example, one can set

from torch_xla.distributed.fsdp.wrap import transformer_auto_wrap_policy
auto_wrap_policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={GPT2Block})

to automatically wrap all GPT2Block submodules (which is probably the most common scenario in transformer-style models).

Or one can also apply it based on the parameter size of a submodule

from torch_xla.distributed.fsdp.wrap import size_based_auto_wrap_policy
auto_wrap_policy = partial(size_based_auto_wrap_policy, min_num_params=1e7)

to automatically wrap all submodules with more than e.g. 1e7 (10M) parameters.

There are also more policies such as lambda_auto_wrap_policy to determine whether to wrap a module by a custom callable. The wrapping policies are directly borrowed from native PyTorch FSDP policies in https://github.com/pytorch/pytorch/blob/v1.13.0/torch/distributed/fsdp/wrap.py.

Gradient checkpointing (i.e. activation checkpointing/rematerialization)

Additionally, now one can also specify an auto_wrapper_callable argument to the XlaFullyShardedDataParallel class to use a custom callable wrapper for the submodules (default wrapper is just XlaFullyShardedDataParallel). For example, one can use the following to apply gradient checkpointing (i.e. activation checkpointing/rematerialization) to each auto-wrapped submodule.

from torch_xla.distributed.fsdp import checkpoint_module
auto_wrapper_callable = lambda m, *args, **kwargs: XlaFullyShardedDataParallel(
    checkpoint_module(m), *args, **kwargs)

The MNIST and ImageNet examples are updated accordingly to show examples of auto-wrapping usage based on size or classes. Also, this PR changes the MNIST and ImageNet FSDP tests to pin_layout=True by default to be consistent with #4359.

cc: @AlexWertheim @JackCaoG

New tests added:

[OK] Test MNIST size-based auto-wrap FSDP (and command line checkpoint consolidation) on v3-8

python3 -u ~/xla_fsdp_dev/test/test_train_mp_mnist_fsdp_with_ckpt.py \
  --batch_size 16 --drop_last --num_epochs 2 \
  --auto_wrap_policy size_based

Results: matching expected accuracy for 2 training epochs

found 8 checkpoint files in /tmp/mnist-fsdp/final_ckpt_rank-*-of-*.pth
saved consolidated model to /tmp/mnist-fsdp/final_ckpt_consolidated.pth
Checkpoint consolidated, Accuracy=98.91 (note: it can be slightly different from the final training accuracy due to non-sync BatchNorm2d in the model)
Max Accuracy: 98.94%

[OK] Test MNIST type-based auto-wrap FSDP (and command line checkpoint consolidation) on v3-8

python3 -u ~/xla_fsdp_dev/test/test_train_mp_mnist_fsdp_with_ckpt.py \
  --batch_size 16 --drop_last --num_epochs 2 \
  --auto_wrap_policy type_based

Results: matching expected accuracy for 2 training epochs

found 8 checkpoint files in /tmp/mnist-fsdp/final_ckpt_rank-*-of-*.pth
saved consolidated model to /tmp/mnist-fsdp/final_ckpt_consolidated.pth
Checkpoint consolidated, Accuracy=98.91 (note: it can be slightly different from the final training accuracy due to non-sync BatchNorm2d in the model)
Max Accuracy: 98.94%

[OK] Test MNIST type-based auto-wrap FSDP + gradient checkpointing (and command line checkpoint consolidation) on v3-8

python3 -u ~/xla_fsdp_dev/test/test_train_mp_mnist_fsdp_with_ckpt.py \
  --batch_size 16 --drop_last --num_epochs 2 \
  --auto_wrap_policy type_based --use_gradient_checkpointing

Results: matching expected accuracy for 2 training epochs

found 8 checkpoint files in /tmp/mnist-fsdp/final_ckpt_rank-*-of-*.pth
saved consolidated model to /tmp/mnist-fsdp/final_ckpt_consolidated.pth
Checkpoint consolidated, Accuracy=98.91 (note: it can be slightly different from the final training accuracy due to non-sync BatchNorm2d in the model)
Max Accuracy: 98.94%

[OK] Test ImageNet ResNet-50 size-based auto-wrap FSDP on v3-8

python3 -u ~/xla_fsdp_dev/test/test_train_mp_imagenet_fsdp.py \
  --datadir /datasets02/imagenet-1k --drop_last \
  --model resnet50 --test_set_batch_size 64 --eval_interval 10 \
  --lr 0.4 --batch_size 128 --num_warmup_epochs 5 --lr_scheduler_divide_every_n_epochs 30 --lr_scheduler_divisor 10 --num_epochs 100 \
  --auto_wrap_policy size_based

Results: matching expected accuracy for batch size 128

Max Accuracy: 75.79%

[OK] Test ImageNet ResNet-50 type-based auto-wrap FSDP on v3-8

python3 -u ~/xla_fsdp_dev/test/test_train_mp_imagenet_fsdp.py \
  --datadir /datasets02/imagenet-1k --drop_last \
  --model resnet50 --test_set_batch_size 64 --eval_interval 10 \
  --lr 0.4 --batch_size 128 --num_warmup_epochs 5 --lr_scheduler_divide_every_n_epochs 30 --lr_scheduler_divisor 10 --num_epochs 100 \
  --auto_wrap_policy type_based

Results: matching expected accuracy for batch size 128

Max Accuracy: 75.99%

[OK] Test ImageNet ResNet-50 type-based auto-wrap + gradient checkpointing FSDP on v3-8

python3 -u ~/xla_fsdp_dev/test/test_train_mp_imagenet_fsdp.py \
  --datadir /datasets02/imagenet-1k --drop_last \
  --model resnet50 --test_set_batch_size 64 --eval_interval 10 \
  --lr 0.4 --batch_size 128 --num_warmup_epochs 5 --lr_scheduler_divide_every_n_epochs 30 --lr_scheduler_divisor 10 --num_epochs 100 \
  --auto_wrap_policy type_based --use_gradient_checkpointing

Results: matching expected accuracy for batch size 128

Max Accuracy: 75.93%

JackCaoG · 2022-12-13T02:11:08Z

This is great! @ronghanghu can you also update the usage of this arg in https://github.com/pytorch/xla/blob/master/docs/fsdp.md ? I think this is something many people would want to use!

ronghanghu · 2022-12-13T08:11:52Z

This is great! @ronghanghu can you also update the usage of this arg in https://github.com/pytorch/xla/blob/master/docs/fsdp.md ? I think this is something many people would want to use!

@JackCaoG I added the usages to this doc. We should probably test it on more cases like GPT-2 before merging.

jianguoz · 2022-12-19T03:33:56Z

Hi @ronghanghu @JackCaoG, thanks so much for your great contribution! Can I know that are auto_warp_policy also suitable for general HuggingFace models (T5, OPT, etc.) especially the ones without wrap structure like GPT2Block? Thanks

ronghanghu · 2022-12-19T03:55:34Z

Hi @ronghanghu @JackCaoG, thanks so much for your great contribution! Can I know that are auto_warp_policy also suitable for general HuggingFace models (T5, OPT, etc.) especially the ones without wrap structure like GPT2Block? Thanks

@jianguoz, yes, it should be compatible with general Hugging Face models such as BERT, T5, and OPT.

jianguoz · 2022-12-19T03:59:22Z

@ronghanghu Thanks for your quick reply! That is really awesome! Since you are testing this new feature on more cases. Before this is merged, do I also need to do a test on new models such as T5/OPT?

ronghanghu · 2022-12-19T04:08:45Z

@ronghanghu Thanks for your quick reply! That is really awesome! Since you are testing this new feature on more cases. Before this is merged, do I also need to do a test on new models such as T5/OPT?

@jianguoz Yes, although it isn't finalized yet, you're welcome to try it out on more models or cases! (And since this PR is entirely in Python, it could be added to an existing torch_xla installation by directly copying over the files in torch_xla/distributed/fsdp/)

jianguoz · 2022-12-19T04:13:39Z

@ronghanghu That is great! I will copy the files try it on the T5 and OPT). Hope you could finalize it soon and open easy door for HuggingFace very large model fine-tuning on TPU!

test/test_train_mp_imagenet_fsdp.py

torch_xla/distributed/fsdp/wrap.py

JackCaoG

Mostly lgtm, can you rebase to resolve conflicts?

JackCaoG · 2022-12-19T23:14:18Z

Hey @jianguoz if you tried this pr and confirmed it worked with other HF models, could you give an update here. FYI we have an pr huggingface/transformers#20774 to add FSDP support to HF and we will update that pr to use auto_wrap as well.

ronghanghu · 2022-12-19T23:26:09Z

I just rebased it to the latest master. Let me also test it on more cases before merging.

jianguoz · 2022-12-19T23:36:43Z

Hi @JackCaoG, thanks for the efforts:) I am trying it on other HF models, will give an update soon!

JackCaoG · 2022-12-20T19:35:01Z

Thanks @ronghanghu I am going to merge this one and add a test to CI for auto_wrap_policy.

jianguoz · 2023-01-09T23:58:13Z

Hi @ronghanghu, Good afternoon:) Thanks for the new testing cases! Can I know if autowrap or testing cases such as test_train_mp_mnist_fsdp_with_ckpt.py support running on a VM with more pods (e.g., V3-32)? I copy the folder in torch_xla/distributed/fsdp/, but it raises AssertionError: Expecting 32 files (based on metadata in /tmp/mnist-fsdp/final_ckpt_rank-00000000-of-00000032.pth) but got 4 files. Please check if you have missing or unexpected files in /tmp/mnist-fsdp/final_ckpt_rank-*-of-*.pth. on V3-32.

JackCaoG · 2023-01-10T00:01:55Z

It should work, take a look at https://github.com/pytorch/xla#how-to-run-on-tpu-vm-pods-distributed-training.

jianguoz · 2023-01-10T09:35:47Z

@JackCaoG @ronghanghu I tested test_train_mp_mnist_fsdp_with_ckpt.py on Both TPU v3-32 and TPU v4-64, and both raise errors like below

File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/fsdp/state_dict_utils.py", line 152, in consolidate_sharded_model_checkpoints
2023-01-10 09:25:38 172.16.96.127 [0]     assert ckpt["shard_metadata"]["world_size"] == len(checkpoints), (
2023-01-10 09:25:38 172.16.96.127 [0] AssertionError: Expecting 32 files (based on metadata in /tmp/mnist-fsdp/final_ckpt_rank-00000000-of-00000032.pth) but got 8 files.

Please check if you have missing or unexpected files in /tmp/mnist-fsdp/final_ckpt_rank-*-of-*.pth.

Files in /tmp/mnist for TPU V3-32:

final_ckpt_rank-00000000-of-00000032.pth
final_ckpt_rank-00000001-of-00000032.pth
final_ckpt_rank-00000002-of-00000032.pth
final_ckpt_rank-00000003-of-00000032.pth
final_ckpt_rank-00000004-of-00000032.pth
final_ckpt_rank-00000005-of-00000032.pth
final_ckpt_rank-00000006-of-00000032.pth
final_ckpt_rank-00000007-of-00000032.pth

While for TPU V4-64 (It has different topology with TPU V3), there are only 4 files (as V4-64 has 2x4x4 topology) generated in /tmp/mnist:

final_ckpt_rank-00000000-of-00000032.pth
final_ckpt_rank-00000001-of-00000032.pth
final_ckpt_rank-00000002-of-00000032.pth
final_ckpt_rank-00000003-of-00000032.pth

I guess there may be errors or wrong assert conditions (e.g., 8) when sharding model checkpoints together into a full model state dict on >8 pods. Can you check whether this is caused by the error of checkpoint saving? Thanks:)

ronghanghu · 2023-01-10T16:35:53Z

Hi @jianguoz, I think it's because on v3-32 or v4-64, the filesystems are separate on each host VM in the TPU pod. The sharded checkpoints are saved in a distributed manner by each host, while the consolidation part requires them to be in the same filesystem.

On v3-32 or v4-64, one can either skip this checkpoint consolidation part by adding --no_ckpt_consolidation, or to save checkpoints to a shared NFS filesystem across the host VMs such as Filestore by specifying --ckpt_prefix. (In my own deep learning experimentation w/ PyTorch XLA, I'm using a Filestore filesystem.)

jianguoz · 2023-01-11T01:16:32Z

Hi @ronghanghu, sorry for the late reply. Thanks for your help and super valuable contributions:) It works with the filestore system! In addition, we get similar results to your testing results, i.e.,Max Accuracy: 98.94% on V3-32 and Max Accuracy: 98.72% on V4-64.

I will modify your vit_10b_fsdp_example to add autowrap and do a test on the 10B model.

jianguoz · 2023-01-13T21:15:05Z

Hi @ronghanghu, good afternoon! I am trying the size_based setting on a 10b and another more large model on V3-32/64. I see the default is 1e8. Do you have any suggestions or experiences for tuning auto_wrap_min_num_params? Such as, do we need to limit the number to less than the per TPU memory capacity? Thanks so much!

jianguoz · 2023-01-13T21:23:23Z

Hey @jianguoz if you tried this pr and confirmed it worked with other HF models, could you give an update here. FYI we have an pr huggingface/transformers#20774 to add FSDP support to HF and we will update that pr to use auto_wrap as well.

Hi @JackCaoG, I start to test this and will give updates soon. Meanwhile, I believe HuggingFace xla_spawn only supports training on a single TPU node (<=8 cores) and it does not support TPU pods like v3-32. Hence, it is better if they can resolve this issue first and then auto FSDP could be scaled to more TPUs.

JackCaoG · 2023-01-13T21:28:40Z

spawn by design can only handle 8 cores, but you can use xla_dist to scale the training on pods.

jianguoz · 2023-01-13T21:39:26Z

spawn by design can only handle 8 cores, but you can use xla_dist to scale the training on pods.

@JackCaoG That is awesome! Before I did not know that we can use xla_dist for huggingface torch_xla examples!

ronghanghu · 2023-01-13T22:38:39Z

Hi @ronghanghu, good afternoon! I am trying the size_based setting on a 10b and another more large model on V3-32/64. I see the default is 1e8. Do you have any suggestions or experiences for tuning auto_wrap_min_num_params? Such as, do we need to limit the number to less than the per TPU memory capacity? Thanks so much!

Yes, this auto_wrap_min_num_params needs to be less than the per TPU memory capacity. I think the default 1e8 (100M parameters, or 400 MB memory size for float32 parameters) would usually be a good balance.

jianguoz · 2023-01-13T23:07:25Z

Hi @ronghanghu, good afternoon! I am trying the size_based setting on a 10b and another more large model on V3-32/64. I see the default is 1e8. Do you have any suggestions or experiences for tuning auto_wrap_min_num_params? Such as, do we need to limit the number to less than the per TPU memory capacity? Thanks so much!

Yes, this auto_wrap_min_num_params needs to be less than the per TPU memory capacity. I think the default 1e8 (100M parameters, or 400 MB memory size for float32 parameters) would usually be a good balance.

@ronghanghu Thanks so much for your suggestions! I will set them accordingly:)

jianguoz · 2023-01-23T02:44:16Z

Hi @ronghanghu, Thanks very much for your auto_wrap FSDP contributions! I have a question regarding consolidating models during modifying your code. I checked that there is a process to Consolidate the sharded model checkpoints for MNIST in test_train_mp_mnist_fsdp_with_ckpt.py, and there is no such code in run_vit_training.py. I have two questions here:

If the consolidated file is too large, e.g., a 10B model, it could cause an OOM error on rank 0. Do you have any suggestions to modify run_vit_training.py for saving a 10B model, loading it, and making inference faster without OOM?
In line 297-299 of test_train_mp_mnist_fsdp_with_ckpt.py. I saw it only has model = MNIST().to(device) and does not have FSDP wrap before loading the model. To load a 10B model and accelerate inference, can we add one line, model = fsdp_wrap(model) after Line 299:

model = MNIST().to(device)
ckpt_consolidated = torch.load(f'{flags.ckpt_prefix}_consolidated.pth')
model.load_state_dict(ckpt_consolidated['model'])
model = fsdp_wrap(model)

Thanks so much for your help again!

ronghanghu · 2023-01-23T04:01:01Z

Hi @jianguoz, thanks for your test! Here checkpoint consolidation is only needed if one wants to stitch the sharded checkpoints together into a single checkpoint file for a non-FSDP-wrapped model (the original model without fsdp_wrap). If one needs to resuming FSDP training, one can simply load the sharded checkpoint files corresponding to each rank.

In line 297-299 of test_train_mp_mnist_fsdp_with_ckpt.py. I saw it only has model = MNIST().to(device) and does not have FSDP wrap before loading the model. To load a 10B model and accelerate inference, can we add one line, model = fsdp_wrap(model) after Line 299:

This test was to verify that the consolidated checkpoint could work for the original MNIST model, so it does not have FSDP wrap before loading the model. If it is needed to resume FSDP training, one can simply load the sharded checkpoint files:

# the FSDP-wrapped model and its optimizer
model = fsdp_wrap(MNIST().to(device))
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=flags.momentum)

# load the sharded checkpoint file
rank = xm.get_ordinal()
world_size = xm.xrt_world_size()
ckpt_path = f'{flags.ckpt_prefix}_rank-{rank:08d}-of-{world_size:08d}.pth'

ckpt_sharded = torch.load(ckpt_path)
model.load_state_dict(ckpt_sharded['model'])
optimizer.load_state_dict(ckpt_sharded['optimizer'])

If the consolidated file is too large, e.g., a 10B model, it could cause an OOM error on rank 0. Do you have any suggestions to modify run_vit_training.py for saving a 10B model, loading it, and making inference faster without OOM?

I think for 10B model size (which should be around 40 GB parameter size for float32), it should still be able to fit into the host memory of a TPU VM (which typically has 300+ GB memory). Do you experience host-side OOM when consolidating the checkpoint from command line as follows (here the checkpoint files are /tmp/mnist-fsdp/final_ckpt_rank-*-of-*.pth)?

# consolidate the checkpoint files matching `ckpt_prefix` + `ckpt_suffix`
python3 -m torch_xla.distributed.fsdp.consolidate_sharded_ckpts \
  --ckpt_prefix /tmp/mnist-fsdp/final_ckpt \
  --ckpt_suffix "_rank-*-of-*.pth"

jianguoz · 2023-01-23T05:46:00Z

Hi @ronghanghu, thank you so much for your super quick reply and suggestions! I have tested above consolidating commands on mnist-fsdp (model is quite small) and it does not have any issues. Since I haven't saving the shared checkpoint files for >=10B model, I will give an update this week:).

One more question is that regarding to inference (i.e., test), do you usually consolidate the shared checkpoint for very large models or only keep the original shared files? or we consolidate files based on size, say 20G maximum for a file (like OPT 30B, BLOOM 175B), to make it easier for users download them and load them on limited GPU device (with sharding the models).

ronghanghu · 2023-01-23T06:15:46Z

One more question is that regarding to inference (i.e., test), do you usually consolidate the shared checkpoint for very large models or only keep the original shared files? or we consolidate files based on size, say 20G maximum for a file (like OPT 30B, BLOOM 175B), to make it easier for users download them and load them on limited GPU device (with sharding the models).

Hi @jianguoz, for very large models (e.g. those with 20B+ parameters), I usually just keep the original sharded checkpoint files, since these models are hard to run without FSDP anyway :) For smaller models I sometimes consolidate them into a single checkpoint to use them in other tasks.

jianguoz · 2023-01-23T06:27:16Z

Hi @ronghanghu, That really makes sense:) Thanks for your shared experience and have a nice night:)

jianguoz · 2023-01-29T19:26:15Z

Hi @JackCaoG @ronghanghu, good morning! I am testing HuggingFace models following the above auto_wrap_policy instructions. I start with a T5-3B model for seq2seq generation tasks on V3-64 (330G memory). Here is the core code for implementing auto_wrap_policy. I tried both type_based and size_based; however, I encountered some issues here:

For type_based, I wrapped transformer modules such as T5block, and per-tpu has around 45m parameters, and it takes 130-160G total memory to load the tokenizer and model. However, when I run the model, it shows OOM errors:

2023-01-29 11:15:58 172.16.96.57 [5] 2023-01-29 11:15:57.602089: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1674990955.995476830","description":"Error received from peer ipv4:172.16.96.57:51011","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC

I still have the same errors even though i set a very short input length (128 tokens just works with bfloat16) and label length (128 tokens). The issue is also not solved when i add bfloat16 and set up the batch size to 1. I do not have such errors if I use DeepSpeed zero-3, run T5 on GPUs, and set up the input length to a large number 512. I also feel it is faster on GPUs than TPUs. Could you give some insights here?

For size_based, I set up the auto_wrap_min_num_params to a number like 1e7. However, it returns

Return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse). weight' must be 2-D

My inputs are 2-D. I know that for encoder-decoder models, they may share the embeddings. Does torch_xla size_based not support such models?

Really appreciate your time and help!

jianguoz · 2023-01-30T17:18:35Z

@JackCaoG A further update is that I wrote code to test Huggingface PR using HuggingFace T5-3B model with more TPUs, i.e., v3-128. I use type_based method (size_based has 2-D issue) to wrap the T5block. Each model has 3B/128=22m parameters, which is relatively small on each device. However, It still easily gets the OOM issue.

ronghanghu · 2023-01-30T17:56:51Z

Hi @jianguoz, regarding Hugging Face transformers, I earlier set up a small example in https://github.com/huggingface/transformers/compare/main...ronghanghu:transformers:huggingface_fsdp_example?expand=1. There is an ongoing PR to add it to the Hugging Face transformer repo (here is a draft).

Regarding the issue of torch.embedding with shared embeddings between the encoder-decoder models -- sharing weights across separately-wrapped FSDP sub-modules is indeed an unsupported case at this moment. A workaround is to use the same module as output_embedding_layer rather than building two separate modules and sharing their weights. To auto-wrap both the input_embedding_layer and output_embedding_layer with an inner FSDP (to save extra memory during forward pass), then one can try building an embedding module into a layer that can be used both at the input and at the output of BERT/GPT models, such as follows:

class SharedEmbedding(torch.nn.Embedding):

	def __init__(self, *args, **kwargs):
		super().__init__(*args, **kwargs)
		# a bias term for the BERT output logits
		self.bias = nn.Parameter(self.weight.new_zeros(self.weight.size(0)))

	def forward(self, inds_or_hidden_states, use_in_output=False):
		if use_in_output:
			return torch.nn.functional.linear(inds_or_hidden_states, self.weight, self.bias)

		return super().forward(inds_or_hidden_states)

This layer can be used for both input_embedding_layer and output_embedding_layer.

jianguoz · 2023-01-30T21:40:38Z

Hi @ronghanghu, thanks for your reply and detailed information. Regarding the ongoing PR, I think my code is similar to the PR except that I changed the nested FSDP to auto-wrapping functionality into FSDP. So far, the OOM issues of 3B model is still unsolved with bfloat16 and batch size 1 on V3-128, and I am checking the potential errors.

ronghanghu force-pushed the xla_fsdp_auto_wrap branch 3 times, most recently from b6dd40f to bd01aea Compare December 13, 2022 08:04

JackCaoG requested review from AlexWertheim, JackCaoG and miladm December 13, 2022 23:21

AlexWertheim mentioned this pull request Dec 15, 2022

Enable PyTorch/XLA Fully Sharded Data Parallel (FSDP) for a Specific Class of Transformer Models huggingface/transformers#20774

Closed

JackCaoG added the fsdp label Dec 19, 2022

JackCaoG reviewed Dec 19, 2022

View reviewed changes

test/test_train_mp_imagenet_fsdp.py Outdated Show resolved Hide resolved

JackCaoG reviewed Dec 19, 2022

View reviewed changes

torch_xla/distributed/fsdp/wrap.py Outdated Show resolved Hide resolved

JackCaoG approved these changes Dec 19, 2022

View reviewed changes

ronghanghu force-pushed the xla_fsdp_auto_wrap branch 2 times, most recently from 58f632c to 8da27b5 Compare December 19, 2022 23:05

ronghanghu added 2 commits December 19, 2022 23:24

add auto_wrap_policy into XLA FSDP for automatic wrapping

8e87bf6

address review comments

0002f45

ronghanghu force-pushed the xla_fsdp_auto_wrap branch from 8da27b5 to 0002f45 Compare December 19, 2022 23:24

default to pin_layout=True in MNIST and ImageNet FSDP

a5b4da7

JackCaoG merged commit d7d0479 into pytorch:master Dec 20, 2022

JackCaoG mentioned this pull request Dec 21, 2022

Add gpu fsdp auto_wrap test #4378

Merged

AlexWertheim mentioned this pull request Feb 1, 2023

Enable PyTorch/XLA Fully Sharded Data Parallel (FSDP) huggingface/transformers#21406

Merged

add auto_wrap_policy into XLA FSDP for automatic wrapping #4318

add auto_wrap_policy into XLA FSDP for automatic wrapping #4318

Uh oh!

Conversation

ronghanghu commented Dec 12, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Auto-wrapping submodules based on policies

Gradient checkpointing (i.e. activation checkpointing/rematerialization)

[OK] Test MNIST size-based auto-wrap FSDP (and command line checkpoint consolidation) on v3-8

[OK] Test MNIST type-based auto-wrap FSDP (and command line checkpoint consolidation) on v3-8

[OK] Test MNIST type-based auto-wrap FSDP + gradient checkpointing (and command line checkpoint consolidation) on v3-8

[OK] Test ImageNet ResNet-50 size-based auto-wrap FSDP on v3-8

[OK] Test ImageNet ResNet-50 type-based auto-wrap FSDP on v3-8

[OK] Test ImageNet ResNet-50 type-based auto-wrap + gradient checkpointing FSDP on v3-8

Uh oh!

JackCaoG commented Dec 13, 2022

Uh oh!

ronghanghu commented Dec 13, 2022

Uh oh!

jianguoz commented Dec 19, 2022

Uh oh!

ronghanghu commented Dec 19, 2022

Uh oh!

jianguoz commented Dec 19, 2022

Uh oh!

ronghanghu commented Dec 19, 2022

Uh oh!

jianguoz commented Dec 19, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

JackCaoG left a comment

Choose a reason for hiding this comment

Uh oh!

JackCaoG commented Dec 19, 2022

Uh oh!

ronghanghu commented Dec 19, 2022

Uh oh!

jianguoz commented Dec 19, 2022

Uh oh!

JackCaoG commented Dec 20, 2022

Uh oh!

jianguoz commented Jan 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JackCaoG commented Jan 10, 2023

Uh oh!

jianguoz commented Jan 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ronghanghu commented Jan 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jianguoz commented Jan 11, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jianguoz commented Jan 13, 2023

Uh oh!

jianguoz commented Jan 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JackCaoG commented Jan 13, 2023

Uh oh!

jianguoz commented Jan 13, 2023

Uh oh!

ronghanghu commented Jan 13, 2023

Uh oh!

jianguoz commented Jan 13, 2023

Uh oh!

jianguoz commented Jan 23, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ronghanghu commented Jan 23, 2023

Uh oh!

jianguoz commented Jan 23, 2023

Uh oh!

ronghanghu commented Jan 23, 2023

Uh oh!

jianguoz commented Jan 23, 2023

add `auto_wrap_policy` into XLA FSDP for automatic wrapping #4318

add `auto_wrap_policy` into XLA FSDP for automatic wrapping #4318

ronghanghu commented Dec 12, 2022 •

edited

Loading

jianguoz commented Dec 19, 2022 •

edited

Loading

jianguoz commented Jan 9, 2023 •

edited

Loading

jianguoz commented Jan 10, 2023 •

edited

Loading

ronghanghu commented Jan 10, 2023 •

edited

Loading

jianguoz commented Jan 11, 2023 •

edited

Loading

jianguoz commented Jan 13, 2023 •

edited

Loading

jianguoz commented Jan 23, 2023 •

edited

Loading

jianguoz commented Jan 29, 2023 •

edited

Loading

jianguoz commented Jan 30, 2023 •

edited

Loading

jianguoz commented Jan 30, 2023 •

edited

Loading