
Commit edd1b29

Authored Apr 30, 2025
Merge branch 'site' into 4-27a
2 parents: 6cb93d2 + 38e512e · commit edd1b29

17 files changed: +358 −8 lines
 

‎_events/pt-27-release-qa.md

Lines changed: 1 addition & 1 deletion
@@ -21,5 +21,5 @@ Nikita is a Software Engineer at Meta where, among other things, he is responsib
  Bring your PyTorch 2.7 questions for Piotr & Nikita during this live Q&A session.

- [Register now](/pt-27-release-qa)
+ [Learn more about this event](/pt-27-release-qa)

‎_includes/main_menu.html

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@
  <span class="dropdown-title">Tools</span>
  <p>Learn about the tools and frameworks in the PyTorch Ecosystem</p>
  </a>
- <a class="nav-dropdown-item" href="https://github.com/pytorch-fdn/ecosystem" target="_blank">
+ <a class="nav-dropdown-item" href="{{ site.baseurl }}/join-ecosystem">
  <span class="dropdown-title">Join the Ecosystem</span>
  </a>
  <a class="nav-dropdown-item" href="{{ site.baseurl }}/#community-module">

‎_includes/mobile_menu.html

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@
  <a href="https://landscape.pytorch.org/">Tools</a>
  </li>
  <li>
- <a href="https://github.com/pytorch-fdn/ecosystem">Join the Ecosystem</a>
+ <a href="{{ site.baseurl }}/join-ecosystem">Join the Ecosystem</a>
  </li>
  <li>
  <a href="{{ site.baseurl }}/#community-module">Community</a>
Lines changed: 195 additions & 0 deletions
@@ -0,0 +1,195 @@
---
layout: blog_detail
title: "Accelerating Large Scale Training and Convergence with PyTorch Float8 Rowwise on Crusoe 2K H200s"
author: Meta and Crusoe
---

**Meta**: Less Wright, Hamid Shojanazeri, Vasiliy Kuznetsov, Daniel Vega-Myhre, Gokul Nadathur, Will Constable, Tianyu Liu, Tristan Rice, Driss Guessous, Josh Fromm, Luca Wehrstedt, Jiecao Yu, Sandeep Parab

**Crusoe**: Ethan Petersen, Martin Cala, Chip Smith

Working with [Crusoe.AI](http://Crusoe.AI), we were given access to one of their new 2K H200 clusters in Iceland, which enabled us to showcase training accelerations of 34-43% at scale by leveraging TorchTitan's HSDP2 and TorchAO's new Float8 rowwise, with comparable convergence and stability vs. BF16.

![bar chart](/assets/images/accelerating-training-float8-rowwise-crusoe/fg1.png){:style="width:100%;"}

In this post we detail how H200s, PyTorch's new Float8 rowwise training, and TorchTitan's FSDP2/HSDP2 and Context Parallel (CP) work together at scale.

## Background - what is an H200?

The H200 is an 'enhanced' H100, offering the exact same compute as an H100 but with two additional improvements:

* Larger global memory: 141GiB HBM3e vs. the standard 80GiB HBM3.
* ~43% faster memory bandwidth: 4.8TB/s vs. 3.35TB/s. The faster memory transfer has an outsized effect on training speed, especially for PyTorch's AsyncTP.

## What is PyTorch Float8 rowwise?

Float8 Rowwise quantizes at a finer granularity than the previous 'tensorwise' Float8. It is designed to provide finer-grained accuracy to support larger workloads, which tend to become more sensitive to quantization at scale and as training progresses.

There are two key improvements with Float8 rowwise:

* Each row now maintains its own scaling factor, versus a single scaling factor for the entire tensor, thus improving quantization precision. Finer-grained scaling per row helps reduce the effect of outliers (extreme values that force the quantization scaling factor to stretch and degrade the precision of the normally distributed values) and thus ensures better precision.
* The scaling factor itself is now computed by rounding down to the nearest power of 2. This has been shown to help reduce quantization errors when multiplying/dividing by the scaling factor, as well as ensuring that large values remain scaled identically in both the forward and backward passes. Both ideas are illustrated in the sketch after this list.
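
To make the two bullets concrete, here is a minimal, self-contained sketch of rowwise Float8 quantization with power-of-2 scale rounding, written against plain PyTorch ops. The function name and the choice of the e4m3 format are our own illustrative assumptions; TorchAO's actual implementation uses fused kernels and also handles the backward pass and scaled matmuls.

```python
import torch

# Illustrative sketch (not TorchAO's production kernel): quantize a 2D tensor
# to float8 e4m3 with one scaling factor per row, rounded down to a power of 2.
def quantize_float8_rowwise(x: torch.Tensor):
    f8_max = torch.finfo(torch.float8_e4m3fn).max            # ~448 for e4m3
    # One scale per row, driven by that row's own max magnitude, so an outlier
    # in one row no longer degrades the precision of every other row.
    row_amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = f8_max / row_amax
    # Round the scale down to the nearest power of 2: multiplying/dividing by
    # it is then exact, and forward/backward see identically scaled values.
    scale = torch.exp2(torch.floor(torch.log2(scale)))
    x_f8 = (x * scale).clamp(-f8_max, f8_max).to(torch.float8_e4m3fn)
    return x_f8, scale                    # dequantize with x_f8.float() / scale

x = torch.randn(4, 4096)
x_f8, scale = quantize_float8_rowwise(x)
max_err = (x - x_f8.float() / scale).abs().max().item()
print(x_f8.dtype, scale.flatten().tolist(), f"max abs error: {max_err:.5f}")
```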

Note that other large scale models have been trained using Float8 at 2K scale with a combination of 1x128 groupwise and 128x128 blockwise scaling, with power-of-2 scaling factors. They had the same goal of improving Float8's precision for supporting large scale training.

Thus, Float8 rowwise offers a similar promise of enabling Float8 for very large scale training, but we wanted to provide proof of stability and convergence at scale, and training on the Crusoe H200 2K cluster provided that initial verification.

## Showcasing Float8 Rowwise Loss convergence vs BF16 at 1600 and 1920 GPU Scale:

In order to verify comparable loss convergence, we ran two separate runs at both 1920 and then 1600 (1.6K) GPU scale using TorchTitan and Llama 3 70B. The 1.6K GPU runs were set for 2.5K iterations, using TorchTitan's HSDP2 and Context Parallel to enable 2D parallelism.

The loss convergence tests were run using Titan's deterministic mode - this mode effectively freezes most potential sources of variation from run to run, and thus helps ensure that the only substantial change is what we want to test, namely the loss convergence and loss curves of BF16 vs. Float8 Rowwise.

Note that deterministic mode also slows down training speed because various kernels will not be autotuned to maximize throughput (otherwise we risk using different kernels between runs and introducing variance).
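
To give a sense of what "freezing sources of variation" entails, the snippet below shows the kinds of settings a deterministic mode typically controls in plain PyTorch. This is a generic sketch of our own, not TorchTitan's actual deterministic-mode code (which also handles seeding across data-parallel ranks).

```python
import os
import torch

# Generic sketch of run-to-run determinism in plain PyTorch (not TorchTitan's code).
def enable_determinism(seed: int = 42) -> None:
    torch.manual_seed(seed)               # fix CPU RNG (init, dropout, data order)
    torch.cuda.manual_seed_all(seed)      # fix RNG on every visible GPU
    # Required by cuBLAS for deterministic GEMMs when deterministic algos are enforced.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.use_deterministic_algorithms(True)   # error out on nondeterministic ops
    # Disable autotuning so the same kernel is picked every run; this is the main
    # reason deterministic runs train slower than autotuned ones.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

enable_determinism(seed=42)
```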

Two runs were completed, one with BF16 and the other with Float8 Rowwise.

Both runs completed their assigned 2.5K iterations without issue, showcasing the Crusoe cluster's stability, with FP8 completing at exactly 24 hours and BF16 finishing after 31 hours, 19 minutes.

| DType | Time / Iters | Loss |
|---|---|---|
| BF16 | 24 hours | 3.15453 |
| Float8 Rowwise | 24 hours | 2.86386 |
| BF16 | 31 hours, 19 minutes / 2.5K | 2.88109 |
| Float8 Rowwise | 24 hours / 2.5K | 2.86386 |

At the 24 hour mark, Float8 had already completed its 2.5K iterations, showcasing the comparative speedup (even in deterministic mode) of Float8 training. At the same 24 hour mark, Float8 delivered a **+9.21%** relative improvement in loss compared to BF16 for the same 24 hours of large scale training time.
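
As a quick sanity check, the +9.21% figure follows directly from the two 24-hour loss values in the table above:

```python
bf16_loss, float8_loss = 3.15453, 2.86386        # losses at the 24 hour mark
rel_improvement = (bf16_loss - float8_loss) / bf16_loss
print(f"{rel_improvement:.2%}")                   # -> 9.21%
```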

After 31 hours, 19 minutes, the BF16 run finally completed its 2.5K iterations.

The final loss numbers:
BF16 = **2.88109**
Float8 = **2.86386**

From the loss curves we observed very similar behavior in the first and last thirds, with a turbulent zone in the middle where both runs showed similar spikes, though with a slight skew in the relative timing of the spikes.

![line chart](/assets/images/accelerating-training-float8-rowwise-crusoe/fg2.png){:style="width:100%;"}

As a result, we can see that PyTorch's Float8 rowwise offers similar convergence with over a 33% speedup for the same amount of training time.

## Long Term Training stability with Float8 Rowwise

Beyond showcasing comparable convergence, we also wanted to show longer term training stability with Float8, so we launched a 4-day, 15K-iteration run at 256-GPU scale.

![line chart](/assets/images/accelerating-training-float8-rowwise-crusoe/fg3.png){:style="width:100%;"}

As shown above, Float8 training ran for over 100 hours with no issues, highlighting the long term stability of Float8 Rowwise.

## Determinism in TorchTitan

To verify determinism and to see whether the spikiness in the longer runs came from scale, we also ran a smaller experiment comprising two BF16 runs and one Float8 run at 256-GPU scale, with HSDP2 only (i.e. without 2D Context Parallel).

In this case both BF16 runs had identical curves and final loss, and we saw a similar spikiness zone for all three runs.

At the 2K iteration mark, both Float8 and BF16 ended at nearly identical points:
BF16 (both runs) = **3.28538**
Float8 rowwise = **3.28203**

![line chart](/assets/images/accelerating-training-float8-rowwise-crusoe/fg4.png){:style="width:100%;"}

The above result confirms that neither CP nor scale (2K) is responsible for the spikiness in the loss, as we saw a similar effect at 256-GPU scale as well. The most likely explanation for the loss spikes is content distribution in the dataset.

For the sake of determinism, the experiments were run with a serialized C4 dataset (not shuffled), meaning the spikes could come from encountering new content within the dataset.

## Net speedups at various Scales with Float8 rowwise:

We performed shorter runs at various GPU scales to understand how Float8 Rowwise would scale in terms of training acceleration as cluster sizes expanded. Doubling in scale from 960 to 1920 GPUs, Float8 continued to deliver impressive training speedups, ranging from 34% to over 43% compared to BF16. We also note that when scaling from 1K to 2K GPUs, communication overhead likely kicked in, and we observed a 4% hit on BF16 throughput.

![bar chart](/assets/images/accelerating-training-float8-rowwise-crusoe/fg5.png){:style="width:100%;"}

As shown in the longer training runs at scale above, Float8 rowwise delivered substantial speedups - 34% at 1920 (DeepSeek) scale - with equal or even slightly improved loss endpoints.

## How can I use Float8 Rowwise in my training?

Float8 Rowwise is available now for you to use in your large scale training. It is packaged in [TorchAO's](https://github.com/pytorch/ao) latest builds (0.9 and higher) and integrated natively into [TorchTitan](https://github.com/pytorch/torchtitan) if you want to get up and running quickly.

To activate Float8 Rowwise in TorchTitan:

First, enable the model converter to hot-swap the nn.Linear layers into Float8 linear layers in your model's .toml file - see line 29:

![code](/assets/images/accelerating-training-float8-rowwise-crusoe/fg6.png){:style="max-width:600px; display: block; margin-left: auto; margin-right: auto"}

Second, specify the 'rowwise' Float8 recipe - see line 72:

![code](/assets/images/accelerating-training-float8-rowwise-crusoe/fg7.png){:style="max-width:600px; display: block; margin-left: auto; margin-right: auto"}

Note that you have three choices for the 'recipe_name':

* rowwise, which is the recommended default,
* tensorwise (the older style Float8), and
* rowwise_with_gw_hp.

The rowwise_with_gw_hp option keeps the gradients to the weights in BF16 precision during the backward pass, which can further enhance Float8 precision for extremely sensitive workloads. Somewhat counterintuitively, it can also be a bit more performant than generic rowwise if the majority of the matmul sizes in your model are smaller (with an estimated tipping point at roughly 13-16K dimensions on H100).

Thus, while we recommend rowwise as the default, it may be worth comparing it with rowwise_with_gw_hp on your model to verify which provides the best performance, with the upside of even greater precision.
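
For teams not using TorchTitan, the same conversion can be applied straight from Python with TorchAO. The sketch below assumes the `convert_to_float8_training` and `Float8LinearConfig.from_recipe_name` entry points found in recent TorchAO builds (0.9+); treat the exact import paths and signatures as assumptions and verify them against the TorchAO version you have installed.

```python
import torch
import torch.nn as nn

# Sketch: applying Float8 rowwise training directly with TorchAO, outside of
# TorchTitan. API names assumed from recent TorchAO builds; verify before use.
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

model = nn.Sequential(
    nn.Linear(4096, 16384),
    nn.GELU(),
    nn.Linear(16384, 4096),
).to(torch.bfloat16).cuda()

# recipe_name: "rowwise" (recommended default), "tensorwise", or "rowwise_with_gw_hp"
config = Float8LinearConfig.from_recipe_name("rowwise")

# Hot-swaps the eligible nn.Linear modules for Float8 linear layers in place.
convert_to_float8_training(model, config=config)

# Training then proceeds as usual (optimizer, torch.compile, FSDP2/HSDP2, ...).
out = model(torch.randn(8, 4096, dtype=torch.bfloat16, device="cuda"))
```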

By toggling the model converter on and off (commenting it out with a #), you can directly compare training acceleration between BF16 and Float8 Rowwise to understand the potential speedups for your own training.

## Future Updates:

We'll have an additional update coming showcasing multiple improvements for Pipeline Parallel and Async Distributed Checkpointing, so please stay tuned.
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
---
layout: blog_detail
title: "PyTorch Foundation Expands to an Umbrella Foundation to Accelerate AI Innovation"
author: Matt White, Executive Director, PyTorch Foundation
---

Today, I am thrilled to announce a significant milestone for the PyTorch Foundation: we are expanding our scope to become an umbrella foundation, allowing us to host additional projects. This expansion positions the PyTorch Foundation to foster a broader ecosystem of high-value, trusted, and innovative AI projects that cater to all stages of the AI lifecycle—from training and inference to industry-specific applications.

## Why Expand?

Since its inception at the Linux Foundation two and a half years ago, the PyTorch Foundation has rapidly grown, now encompassing over 30 member organizations and 120 vibrant ecosystem projects. PyTorch itself has become the framework of choice for AI researchers, practitioners, and industry leaders worldwide. Our flagship PyTorch Conference has seen attendance multiply sixfold over just two years, reflecting the community’s tremendous enthusiasm and engagement.

With new initiatives such as PyTorch Day events, global community meetups, the PyTorch Ambassador Program, Open Source Program Office (OSPO) outreach, the Speaker’s Bureau, and our upcoming training and certification programs, we have significantly deepened our community’s expertise and collaboration capabilities. To sustain and accelerate this momentum, the logical next step was to expand the PyTorch Foundation into an umbrella organization.

## What Does an Umbrella Foundation Mean?

By transitioning into an umbrella foundation, PyTorch will now host a range of diverse, high-quality AI and ML projects beyond PyTorch Core. These include foundation-hosted projects in two categories:

* **Platform Projects**: Domain-agnostic solutions essential across various stages of the AI lifecycle, such as training, inference, model optimization, and deployment, as well as agentic systems.
* **Vertical Projects**: Domain-specific projects tailored to particular industries or applications, such as biomedical imaging, protein folding, and geospatial analysis.

Projects under our umbrella gain immediate access to vendor-neutral governance, enhanced visibility, increased funding opportunities, and robust community engagement and support.

## Foundation-Hosted vs. Ecosystem Projects

As we expand, it’s important to clarify the distinction between foundation-hosted and ecosystem projects:

* **Foundation-Hosted Projects** are projects that fall under the umbrella: they are officially governed and administered under the PyTorch Foundation’s neutral and transparent governance model. Project maintainers continue to oversee their projects, and they transfer assets to the Linux Foundation for independent stewardship and adopt an open governance model, significantly reducing vendor bias and encouraging broader community contributions and adoption. These projects gain greater stability and longevity and integrate with the larger PyTorch community.
* **Ecosystem Projects** remain independently managed but receive recognition and increased visibility by aligning themselves closely with the PyTorch Foundation community standards. These projects meet specific quality and maturity criteria but retain full independence in governance and asset management.

## How to Join the PyTorch Ecosystem or Become a Foundation-Hosted Project

We have clearly defined pathways for projects looking to become part of the PyTorch community:

1. **[Ecosystem Project Status](https://github.com/pytorch-fdn/ecosystem)**: Projects must meet defined criteria, such as active development, comprehensive documentation, CI/CD infrastructure, clear governance, and community engagement. Approved ecosystem projects benefit from increased exposure and official recognition on the [PyTorch Landscape](https://landscape.pytorch.org/).
2. **[Candidate Project Status](https://github.com/pytorch-fdn/foundation-hosted)**: Ecosystem projects aspiring to foundation-hosted status can become candidates by securing sponsorship from a PyTorch Foundation [Technical Advisory Council (TAC)](/tac) voting member. Candidates receive guidance on meeting all necessary governance, technical, and strategic criteria.
3. **[Foundation-Hosted Project Status](https://github.com/pytorch-fdn/foundation-hosted)**: Candidate projects demonstrating high maturity, stability, multi-platform support, security best practices, and strategic value to the PyTorch community can be approved by the TAC. These projects gain extensive benefits, including neutral trademark hosting, foundation support, marketing and events resources, governance guidance, and strategic funding opportunities.

## Ensuring Long-Term Success and Innovation

By expanding our scope to become an umbrella foundation, the PyTorch Foundation is uniquely positioned to enhance collaboration, innovation, and sustained growth across the entire AI community. Our mission is clear: create a vendor-neutral, open source environment where the best AI and ML tools can thrive, benefiting users, contributors, and industry stakeholders worldwide.

*“PyTorch is absolutely the foundation of the innovation happening in AI today and with projects like Llama, ChatGPT, and hundreds of thousands of open projects built on PyTorch, it has cemented itself as a critical ingredient to the world of AI. This move to create an umbrella foundation enables PyTorch to significantly expand its ecosystem both horizontally and vertically in this new era of agentic systems. I am very excited about this opportunity to take the PyTorch community to the next level!” - Joe Spisak, Product Director for PyTorch at Meta.*

*"PyTorch sits at the very core of AI today. Meanwhile, the depth of the AI stack has grown dramatically—evolving from enabling accelerated compute to powering fully autonomous systems. Broadening the PyTorch Foundation is a key step in keeping the AI revolution open and accessible to all, across the stack and aligned with the principles PyTorch was built on." - Luca Antiga, CTO at Lightning AI.*

We are incredibly optimistic about the opportunities ahead and excited to welcome new projects into our growing family. The PyTorch Foundation remains deeply committed to driving AI innovation forward, and together, we will continue to build the future of open source artificial intelligence.

Stay tuned for more updates, announcements, and opportunities to participate!

‎announcement.html

Lines changed: 4 additions & 2 deletions
@@ -19,11 +19,13 @@ <h1>PyTorch<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Foundation</h1>
  <div class="container">
  <div class="row content">
  <div class="col-md-10 body-side-text">
+ <h2>Accelerating Open Source AI</h2>
  <p class="lead">
- Welcome to the PyTorch Foundation—a vibrant, collaborative hub created for and by the deep learning community. Here, developers, researchers, and industry leaders come together to shape and expand the open source PyTorch framework and ecosystem. Through a network of dedicated contributors to the PyTorch project, the PyTorch Foundation fuels discussion, innovation, and hands-on collaboration across the PyTorch landscape.
+ Welcome to the PyTorch Foundation—a vibrant, community-driven hub for open source AI. Developers, researchers, and industry pioneers collaborate here to advance the PyTorch framework and strengthen the open source AI ecosystem.
  <br />
  <br />
- Community-driven collaboration is at the heart of PyTorch's growth and evolution. From advancing the core framework to building essential tools that power PyTorch at a production scale, your contributions are key to moving this ecosystem forward. As part of the Linux Foundation, the PyTorch community also supports a variety of initiatives: developer training, regional and local events, open source tooling, research, and guides for both newcomers and seasoned contributors—all to make your journey with PyTorch more accessible and rewarding.
+ From cutting-edge development to production-ready tools and libraries, the PyTorch Foundation thrives through transparent collaboration and collective innovation. As part of the Linux Foundation, we host global events, deliver specialized training, support research, and provide resources to accelerate your AI journey.
+ Whether you are contributing code, sharing your expertise, or deploying real-world AI solutions, the PyTorch Foundation actively empowers you to shape the future of accessible and impactful open source AI.
  </p>
  </div>
  </div>