Description
🚀 Describe the improvement or the new tutorial
After reading the "How FSDP works" section in https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html?highlight=fsdp, I still couldn't figure out what FSDP is, because the tutorial does not explain ALL_GATHER and REDUCE-SCATTER, which I believe are the key concepts in FSDP.
This article helped me: https://engineering.fb.com/2021/07/15/open-source/fsdp/
I believe this passage from it is the important part: "The key insight to unlock full parameter sharding is that we can decompose the all-reduce operations in DDP into separate reduce-scatter and all-gather operations:"
I think adding this explanation to the tutorial would greatly help readers understand FSDP.
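To make the quoted idea concrete, here is a minimal sketch of the decomposition in plain Python. It does not use `torch.distributed` and performs no real communication: each "rank" is just a list inside an outer list, and the helper names (`all_reduce`, `reduce_scatter`, `all_gather`) are my own, chosen to mirror the collective names. It assumes the buffer length divides evenly by the number of ranks.

```python
# Simulated collectives over per-rank buffers (hypothetical helpers, no real communication).

def all_reduce(bufs):
    """Every rank ends up with the element-wise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]


def reduce_scatter(bufs):
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    world = len(bufs)
    chunk = len(bufs[0]) // world  # assumption: length is divisible by world size
    total = [sum(vals) for vals in zip(*bufs)]
    return [total[r * chunk:(r + 1) * chunk] for r in range(world)]


def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


# Two ranks, each holding its own local gradients for the same four parameters.
grads = [
    [1, 2, 3, 4],      # rank 0
    [10, 20, 30, 40],  # rank 1
]

via_all_reduce = all_reduce(grads)
via_two_steps = all_gather(reduce_scatter(grads))

print(reduce_scatter(grads))  # each rank holds only its own shard of the summed gradients
print(via_all_reduce == via_two_steps)
```

As I understand the article, this is what lets FSDP shard: after the reduce-scatter step each rank only needs to keep its own shard of the reduced gradients, and the all-gather step is used to rebuild full parameters from the shards only when they are needed for computation.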
Existing tutorials on this topic
GETTING STARTED WITH FULLY SHARDED DATA PARALLEL (FSDP)
https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html?highlight=fsdp
Additional context
No response
cc @wconstab @osalpekar @H-Huang @kwen2501 @sekyondaMeta @svekars @carljparker @NicolasHug @kit1980 @subramen