diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml index 908ececa7f86..23a8f5ffbad9 100644 --- a/.github/workflows/build.yml +++ b/.github/workflows/build.yml @@ -8,7 +8,7 @@ on: jobs: tests: - uses: pytorch/test-infra/.github/workflows/linux_job.yml@main + uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main secrets: inherit with: runner: linux.12xlarge diff --git a/.github/workflows/update-quick-start-module.yml b/.github/workflows/update-quick-start-module.yml index 704d6643cbbb..7d070eb7ff8a 100644 --- a/.github/workflows/update-quick-start-module.yml +++ b/.github/workflows/update-quick-start-module.yml @@ -63,7 +63,7 @@ jobs: update-quick-start: needs: [linux-nightly-matrix, windows-nightly-matrix, macos-arm64-nightly-matrix, linux-release-matrix, windows-release-matrix, macos-arm64-release-matrix] - runs-on: "ubuntu-20.04" + runs-on: "ubuntu-latest" environment: pytorchbot-env steps: - name: Checkout pytorch.github.io diff --git a/CNAME b/CNAME index c101f6da020d..583993f7b85f 100644 --- a/CNAME +++ b/CNAME @@ -1 +1 @@ -pytorch.org \ No newline at end of file +docs.pytorch.org diff --git a/_community_blog/bringing-the-pytorch-community-together.md b/_community_blog/bringing-the-pytorch-community-together.md new file mode 100644 index 000000000000..5825c9993e4f --- /dev/null +++ b/_community_blog/bringing-the-pytorch-community-together.md @@ -0,0 +1,8 @@ +--- +title: 'Bringing the PyTorch Community Together' +author: Team PyTorch +ext_url: /blog/bringing-the-pytorch-community-together/ +date: January 22, 2025 +--- + +As we step into a new year, it’s a great moment to reflect on the incredible community events that made 2024 a memorable year for the PyTorch Foundation. Global meetups, events, and conferences brought the community together to learn, connect, and grow. Here’s a quick recap of the year’s highlights and what to expect in 2025. \ No newline at end of file diff --git a/_community_blog/datathon-2025.md b/_community_blog/datathon-2025.md new file mode 100644 index 000000000000..754406c063a0 --- /dev/null +++ b/_community_blog/datathon-2025.md @@ -0,0 +1,8 @@ +--- +title: "Solve Real-Word AI Challenges with PyTorch at Datathon 2025: DataOrbit" +author: "Aakash Senthilnathan" +ext_url: /blog/datathon-2025/ +date: Feb 12, 2025 +--- + +**We’re excited to have PyTorch sponsor [Datathon 2025: DataOrbit](https://dataorbit-2025.devpost.com/)**, a place where students can collaborate with a team to solve problems using real-world datasets! This event, hosted by Data Science UCSB in collaboration with Gaucho Sports Analytics and ACM@UCSB, will take place on **February 22–23rd, 2025 at UC Santa Barbara**, with the incredible opportunity to present your project to a panel of corporate and faculty judges – **including the executive director of Pytorch!** – for a chance to win prizes up to $3000. \ No newline at end of file diff --git a/_community_blog/optimize-llms.md b/_community_blog/optimize-llms.md new file mode 100644 index 000000000000..e0ecb819ac05 --- /dev/null +++ b/_community_blog/optimize-llms.md @@ -0,0 +1,8 @@ +--- +title: "Optimize LLMs for Efficiency & Sustainability" +ext_url: /blog/optimize-llms/ +date: Feb 19, 2025 +author: "Zach Lasiuk, Arm" +--- + +The rapid growth of large language model (LLM) applications is linked to rapid growth in energy demand. According to the International Energy Agency (IEA), data center electricity consumption is projected to roughly double by 2026 primarily driven by AI. 
This is due to the energy-intensive training requirements for massive LLMs – however, the increase in AI Inferencing workloads also plays a role. For example, compared with traditional search queries, a single AI inference can consume about [10x more energy](https://www.weforum.org/stories/2024/07/generative-ai-energy-emissions/). diff --git a/_community_blog/pt-fedora-os-communities.md b/_community_blog/pt-fedora-os-communities.md new file mode 100644 index 000000000000..ec37d275c4a5 --- /dev/null +++ b/_community_blog/pt-fedora-os-communities.md @@ -0,0 +1,9 @@ +--- +title: "Powering AI with PyTorch, Fedora, and Open Source Communities" +author: Sudhir Dharanendraiah +ext_url: /blog/pt-fedora-os-communities/ +date: Mar 7, 2025 +--- + +At [DevConf.IN 2025](https://www.devconf.info/in/) in Pune, I had the opportunity to host a **[PyTorch Meetup](https://pretalx.devconf.info/devconf-in-2025/talk/W3YURM/)** on February 28th. The session, titled "**Powering AI with PyTorch, Fedora, and Open Source Communities**" was aimed at introducing PyTorch to students and professionals, explaining why **PyTorch+Fedora** form an ideal AI development platform. The other key aspect I covered was collaboration between open source communities. + diff --git a/_community_blog/pytorch-at-gtc.md b/_community_blog/pytorch-at-gtc.md new file mode 100644 index 000000000000..da3632fa17fe --- /dev/null +++ b/_community_blog/pytorch-at-gtc.md @@ -0,0 +1,8 @@ +--- +title: "PyTorch at GTC 2025" +author: "Team PyTorch at NVIDIA" +ext_url: /blog/pytorch-at-gtc/ +date: Mar 16, 2025 +--- + +[GTC](https://www.nvidia.com/gtc/) is coming back to San Jose on March 17–21, 2025. Join PyTorch Foundation members Arm, AWS, Google Cloud, IBM, Lightning AI, Meta, Microsoft Azure, Snowflake, and thousands of developers as we celebrate PyTorch. Together learn how AI & accelerated computing are helping humanity solve our most complex challenges. diff --git a/_community_blog/sglang-joins-pytorch.md b/_community_blog/sglang-joins-pytorch.md new file mode 100644 index 000000000000..6a05a4714873 --- /dev/null +++ b/_community_blog/sglang-joins-pytorch.md @@ -0,0 +1,8 @@ +--- +title: "SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine" +author: "SGLang Team" +ext_url: /blog/sglang-joins-pytorch/ +date: Mar 19, 2025 +--- + +We’re thrilled to announce that the SGLang project has been integrated into the PyTorch ecosystem! This integration ensures that SGLang aligns with PyTorch’s standards and practices, providing developers with a reliable and community-supported framework for fast and flexible serving of LLMs. \ No newline at end of file diff --git a/_community_stories/56.md b/_community_stories/56.md new file mode 100644 index 000000000000..8a07059e38db --- /dev/null +++ b/_community_stories/56.md @@ -0,0 +1,8 @@ +--- +title: 'How Intel Uses PyTorch to Empower Generative AI through Intel Arc GPUs' +ext_url: /blog/how-intel-uses-pytorch-to-empower-generative-ai-through-intel-arc-gpus/ +date: Jan 24, 2025 +tags: ["Gaming"] +--- +Intel has long been at the forefront of technological innovation, and its recent venture into Generative AI (GenAI) solutions is no exception. With the rise of AI-powered gaming experiences, Intel sought to deliver an accessible and intuitive GenAI inferencing solution tailored for AI PCs powered by Intel’s latest GPUs. 
By leveraging PyTorch as the backbone for development efforts, Intel successfully launched AI Playground, an open source application that showcases advanced GenAI workloads. + diff --git a/_community_stories/57.md b/_community_stories/57.md new file mode 100644 index 000000000000..7e717dfd000b --- /dev/null +++ b/_community_stories/57.md @@ -0,0 +1,8 @@ +--- +title: 'How IBM Research Uses PyTorch and TerraTorch to Make Geospatial Computer Vision Accessible for Everyone' +ext_url: /blog/how-ibm-uses-pt-terratorch/ +date: May 1, 2025 +tags: ["Computer Vision"] +--- + +Geospatial computer vision is essential for understanding our planet — from monitoring deforestation to tracking urban development and analyzing the impacts of climate change. However, the coding and deep learning skills for applying AI models to satellite imagery and earth observation data has traditionally been a major barrier for many practitioners. diff --git a/_events/ai-programming.md b/_events/ai-programming.md index efca8bbb173b..32379ba8c65e 100644 --- a/_events/ai-programming.md +++ b/_events/ai-programming.md @@ -16,4 +16,4 @@ In this talk, Anton will share how he built an AI agent that ranked #1 in the fi Anton Pidkuiko is a Software Engineer at Meta, Reality Labs in London. He is currently working on applying the power of Large Language Models to Metaverse Avatar product experiences. -[Register now to join the event](/ai-powered-competitive-programming) +[More info on this event.](/ai-powered-competitive-programming) diff --git a/_events/autonomous-language-model-systems.md b/_events/autonomous-language-model-systems.md new file mode 100644 index 000000000000..8532258afef0 --- /dev/null +++ b/_events/autonomous-language-model-systems.md @@ -0,0 +1,23 @@ +--- +category: event +title: "Towards Autonomous Language Model Systems" +date: May 21, 2025 +poster: assets/images/pt-day-cfp.png +--- + + +Towards Autonomous Language Model Systems + + +**Date**: May 21, 2025, 11AM PT / 2PM ET +**Location**: Online + +Language models (LMs) are increasingly used to assist users in day-to-day tasks such as programming (Github Copilot) or search (Google's AI Overviews). But can we build language model systems that are able to autonomously complete entire tasks end-to-end? + +In this talk, Ofir Press will discuss efforts to build autonomous LM systems, focusing on the software engineering domain. Ofir will present SWE-bench, a novel method for measuring AI systems on their abilities to fix real issues in popular software libraries. Ofir will then discuss SWE-agent, a system for solving SWE-bench tasks. + +SWE-bench and SWE-agent are used by many leading AI organizations in academia and industry, including OpenAI, Anthropic, Meta, and Google, and SWE-bench has been downloaded over 2 million times. These projects show that academics on tight budgets can have a substantial impact in steering the research community toward building autonomous systems that can complete challenging tasks. + +Ofir is a postdoc at Princeton University, where they mainly work with Karthik Narasimhan's lab. Ofir previously completed their PhD at the University of Washington in Seattle, where Ofir was advised by Noah Smith. During their PhD, Ofir spent two years at Facebook AI Research Labs on Luke Zettlemoyer's team. 
+ +[Register Now](/autonomous-language-model-systems) diff --git a/_events/ce1.md b/_events/ce1.md new file mode 100644 index 000000000000..94c9e66165d9 --- /dev/null +++ b/_events/ce1.md @@ -0,0 +1,14 @@ +--- +category: event +title: "COLING 2025" +date: Jan 19, 2025 +--- +Community Event + +**Date**: Jan 19 - 25, 2025 + +COLING, the International Conference on Computational Linguistics, is one of the premier conferences for natural language processing and computational linguistics. + +First established in 1965, the biennial COLING conference is held in diverse parts of the globe and attracts participants from both top-ranked research centers and emerging countries. Today, the most important developments in our field are taking place not only in universities and academic research institutes but also in industrial research departments including tech-startups. COLING provides opportunities for all these communities to showcase their exciting discoveries. + +[Learn more about this event](https://coling2025.org/) \ No newline at end of file diff --git a/_events/ce10.md b/_events/ce10.md new file mode 100644 index 000000000000..67d9e00f66f8 --- /dev/null +++ b/_events/ce10.md @@ -0,0 +1,13 @@ +--- +category: event +title: "PyCon 2025" +date: May 14, 2025 +--- +Community Event + +**Date**: May 15 - 22, 2025 +**Location**: Pittsburgh, PA + +At PyCon US 2025, find a program filled with pre-conference tutorials and sponsor presentations, 90+ of our community’s best talks, which includes the Charlas track, brilliant keynote speakers, posters on display, a lively Expo Hall filled with incredible Sponsors’ booths, and famed lightning talks on each main conference day. + +[Learn more about this event](https://us.pycon.org/2025/) diff --git a/_events/ce11.md b/_events/ce11.md new file mode 100644 index 000000000000..7cc0095a96cd --- /dev/null +++ b/_events/ce11.md @@ -0,0 +1,15 @@ +--- +category: event +title: "Gamesbeat Summit 2025" +date: May 19, 2025 +--- +Community Event + +**Date**: May 19 - 20, 2025 +**Location**: Los Angeles, CA + +The gaming industry is on the cusp of a transformative era, driven by innovation, cultural impact, and new economic opportunities. At GamesBeat Summit 2025, explore how creative storytelling, community engagement, and effective business strategies are shaping the future of the gaming industry. + +Delve into the diverse influences—ranging from player experiences to industry collaborations—that are paving the way for the next phase of growth. + +[Learn more about this event](https://gbs.venturebeat.com/) diff --git a/_events/ce12.md b/_events/ce12.md new file mode 100644 index 000000000000..d2ea93af6df7 --- /dev/null +++ b/_events/ce12.md @@ -0,0 +1,13 @@ +--- +category: event +title: "NYC Tech Week" +date: Jun 2, 2025 +--- +Community Event + +**Date**: Jun 2 - 8, 2025 +**Location**: New York City + +Tech Week is a decentralized tech conference presented by a16z. Every Tech Week, hundreds of events take place across the host city - from hackathons to panel events, community meetups and more. Every event is organized individually by startups, companies and VCs.
+ +[Learn more about this event](https://www.tech-week.com/) diff --git a/_events/ce14.md b/_events/ce14.md new file mode 100644 index 000000000000..fcfab07f890f --- /dev/null +++ b/_events/ce14.md @@ -0,0 +1,13 @@ +--- +category: event +title: "Data + AI Summit" +date: Jun 9, 2025 +--- +Community Event + +**Date**: Jun 9 - 12, 2025 +**Location**: San Francisco, CA + +Join 20,000 peers for 700+ sessions, keynotes and training at the world’s largest data, analytics and AI conference. + +[Learn more about this event](https://www.databricks.com/dataaisummit) diff --git a/_events/ce15.md b/_events/ce15.md new file mode 100644 index 000000000000..e85a7403d1e8 --- /dev/null +++ b/_events/ce15.md @@ -0,0 +1,13 @@ +--- +category: event +title: "CVPR 2025" +date: Jun 10, 2025 +--- +Community Event + +**Date**: Jun 10 - 17, 2025 +**Location**: Nashville, TN + +The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR) is the premier annual computer vision event comprising the main conference and several co-located workshops and short courses. With its high quality and low cost, it provides an exceptional value for students, academics and industry researchers. + +[Learn more about this event](https://cvpr.thecvf.com/) diff --git a/_events/ce16.md b/_events/ce16.md new file mode 100644 index 000000000000..eda670bc7191 --- /dev/null +++ b/_events/ce16.md @@ -0,0 +1,13 @@ +--- +category: event +title: "We are Developers Conference" +date: Jul 9, 2025 +--- +Community Event + +**Date**: Jul 9 - 11, 2025 +**Location**: Berlin, Germany + +Join the largest gathering of software innovators, tech leaders, and decision-makers shaping the future of AI-powered technology. + +[Learn more about this event](https://www.wearedevelopers.com/world-congress) diff --git a/_events/ce17.md b/_events/ce17.md new file mode 100644 index 000000000000..ded03e328983 --- /dev/null +++ b/_events/ce17.md @@ -0,0 +1,13 @@ +--- +category: event +title: "ICML 2025" +date: Jul 13, 2025 +--- +Community Event + +**Date**: Jul 13 - 19, 2025 +**Location**: Berlin, Germany + +Forty-Second International Conference on Machine Learning. + +[Learn more about this event](https://icml.cc/) diff --git a/_events/ce18.md b/_events/ce18.md new file mode 100644 index 000000000000..dd61d8531f90 --- /dev/null +++ b/_events/ce18.md @@ -0,0 +1,13 @@ +--- +category: event +title: "SIGGRAPH 2025" +date: Aug 10, 2025 +--- +Community Event + +**Date**: Aug 10 - 14, 2025 +**Location**: Vancouver, B.C. + +[ACM SIGGRAPH](https://www.siggraph.org/) is a special interest group (SIG) devoted to computer graphics (GRAPH) within the [Association for Computing Machinery](https://www.acm.org/) (ACM), the world’s largest educational and scientific computing society devoted to advancing computing as a science and a profession. Its annual conference, first held in 1974, is the premier conference on computer graphics and interactive techniques worldwide. At SIGGRAPH 2025, we boldly look toward the future, imagining how humanity and technology will be increasingly connected and examining how we can create a future that connects our physical and digital worlds for the better. 
+ +[Learn more about this event](https://s2025.siggraph.org/) diff --git a/_events/ce19.md b/_events/ce19.md new file mode 100644 index 000000000000..2e9625dd9a67 --- /dev/null +++ b/_events/ce19.md @@ -0,0 +1,13 @@ +--- +category: event +title: "San Francisco Tech Week" +date: Oct 6, 2025 +--- +Community Event + +**Date**: Oct 6 - 12, 2025 +**Location**: San Francisco + +Tech Week is a decentralized tech conference presented by a16z. Every Tech Week, hundreds of events take place across the host city - from hackathons to panel events, community meetups and more. Every event is organized individually by startups, companies and VCs. + +[Learn more about this event](https://www.tech-week.com/) diff --git a/_events/ce2.md b/_events/ce2.md new file mode 100644 index 000000000000..f0857e44a475 --- /dev/null +++ b/_events/ce2.md @@ -0,0 +1,15 @@ +--- +category: event +title: "Open Source AI Summit" +date: Jan 22, 2025 +--- +Community Event + +**Date**: Jan 22, 2025 +**Location**: Paris, France + +Open Source AI has become a major trend in the industry, with even many digital giants adopting an Open Source approach. While Open Source AI isn't magic, it does offer the potential to address many challenges more effectively than proprietary AI models. + +This first edition of the Paris Open Source AI Summit will bring together global leaders and industry players to address these issues. The summit will aim to establish a common set of ideas, vocabulary and definitions to create a shared understanding of the current state of Open Source AI. + +[Learn more about this event](https://opensourceaisummit.eu/#rec838155366) diff --git a/_events/ce20.md b/_events/ce20.md new file mode 100644 index 000000000000..de0a07092616 --- /dev/null +++ b/_events/ce20.md @@ -0,0 +1,13 @@ +--- +category: event +title: "LA Tech Week" +date: Oct 13, 2025 +--- +Community Event + +**Date**: Oct 13 - 19, 2025 +**Location**: Los Angeles, CA + +Tech Week is a decentralized tech conference presented by a16z. Every Tech Week, hundreds of events take place across the host city - from hackathons to panel events, community meetups and more. Every event is organized individually by startups, companies and VCs. + +[Learn more about this event](https://www.tech-week.com/) diff --git a/_events/ce21.md b/_events/ce21.md new file mode 100644 index 000000000000..c7b0e5dae932 --- /dev/null +++ b/_events/ce21.md @@ -0,0 +1,13 @@ +--- +category: event +title: "ICCV 2025" +date: Oct 20, 2025 +--- +Community Event + +**Date**: Oct 20 - 24, 2025 +**Location**: Honolulu, HI + +International Conference on Computer Vision, ICCV 2025. + +[Learn more about this event](https://iccv.thecvf.com/) diff --git a/_events/ce22.md b/_events/ce22.md new file mode 100644 index 000000000000..07ef894b515a --- /dev/null +++ b/_events/ce22.md @@ -0,0 +1,15 @@ +--- +category: event +title: "Open Source AI Week" +date: Oct 18, 2025 +--- +Community Event + +**Date**: Oct 18 - 26, 2025 +**Location**: San Francisco, CA + +Open Source AI Week is the premier event that brings together the best AI and ML conferences, hackathons, startup showcases, and networking opportunities exploring the intersection of artificial intelligence, machine learning, and open source technology. Taking place between October 18 – 26, 2025 in San Francisco area. 
This week-long celebration is dedicated to fostering innovation, collaboration, and community-driven solutions in the rapidly evolving AI landscape, featuring the PyTorch Conference as the flagship event. + +[Submit your event](https://linuxfoundation.research.net/r/FD6JMH5) to be included in Open Source AI Week, and check back mid-May to see the Open Source AI Week event lineup! + +[Learn more about this event](https://events.linuxfoundation.org/open-source-ai-week/) diff --git a/_events/ce23.md b/_events/ce23.md new file mode 100644 index 000000000000..e06dedf1e645 --- /dev/null +++ b/_events/ce23.md @@ -0,0 +1,13 @@ +--- +category: event +title: "NeurIPS 2025" +date: Dec 7, 2025 +--- +Community Event + +**Date**: Dec 7 - 10, 2025 +**Location**: San Diego, CA + +The Thirty-Ninth Annual Conference on Neural Information Processing Systems. + +[Learn more about this event](https://neurips.cc/) diff --git a/_events/ce24.md b/_events/ce24.md new file mode 100644 index 000000000000..d08216a6e078 --- /dev/null +++ b/_events/ce24.md @@ -0,0 +1,15 @@ +--- +category: event +title: "ECCV 2026" +date: Sep 9, 2025 +--- +Community Event + +**Date**: Sep 9 - 13, 2026 +**Location**: Malmö, Sweden + +ECCV is the official event of the European Computer Vision Association and is held biennially in even-numbered years. Any other event trying to utilize this title is not a sanctioned event. + +The European Conference on Computer Vision (ECCV) is a biennial premier research conference in Computer Vision and Machine Learning, managed by the [European Computer Vision Association (ECVA)](https://www.ecva.net/). It is held in even years and gathers the scientific and industrial communities in these areas. The first ECCV was held in 1990 in Antibes, France, and subsequently organized all over Europe. Paper proceedings are published by [Springer Science+Business Media](https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media). + +[Learn more about this event](https://eccv.ecva.net/) diff --git a/_events/ce25.md b/_events/ce25.md new file mode 100644 index 000000000000..2d9d6d02d568 --- /dev/null +++ b/_events/ce25.md @@ -0,0 +1,11 @@ +--- +category: event +title: "GOSIM AI" +date: May 6, 2025 +--- +Community Event + +**Date**: May 6 - 7, 2025 +**Location**: Paris, France + +[Learn more about this event](https://paris2025.gosim.org/) diff --git a/_events/ce26.md b/_events/ce26.md new file mode 100644 index 000000000000..328b0fd3d870 --- /dev/null +++ b/_events/ce26.md @@ -0,0 +1,13 @@ +--- +category: event +title: "PyTorch ATX Community Meetup" +date: April 30, 2025 +--- +Community Event + +**Date**: April 30, 2025 +**Location**: Austin, TX + +The Triton framework provides a hardware-agnostic way of programming and targeting GPUs. As Triton becomes more widely adopted, understanding how to write, optimize, and troubleshoot Triton kernels will be essential for maximizing GPU efficiency. Join the PyTorch community meetup to learn how Red Hat, Intel, AMD, IBM Research, and the University of Texas are working on developing Triton kernels.
+ +[Learn more about this event](https://meetu.ps/e/NYlm0/qrnF8/i) diff --git a/_events/ce3.md b/_events/ce3.md new file mode 100644 index 000000000000..9a4e195afee3 --- /dev/null +++ b/_events/ce3.md @@ -0,0 +1,15 @@ +--- +category: event +title: "Open Source Forum" +date: Feb 13, 2025 +--- +Community Event + +**Date**: Feb 13, 2025 +**Location**: Los Angeles, CA + +The Academy Software Foundation’s (ASWF) annual Open Source Forum brings together Foundation members and select guests from the motion picture and media industries to collaborate and discuss the future of open source software. + +Open Source Forum 2025 features a new format to better enable open dialogue and interactive discussion. Hosted at Walt Disney Animation Studios in Burbank, CA, the half-day event will kick off with several presentations around the anatomy of a studio, emerging technologies impacting studios, and open source opportunities, followed by a moderated discussion. + +[Learn more about this event](https://events.linuxfoundation.org/aswf-open-source-forum/) diff --git a/_events/ce4.md b/_events/ce4.md new file mode 100644 index 000000000000..1b1063abf142 --- /dev/null +++ b/_events/ce4.md @@ -0,0 +1,13 @@ +--- +category: event +title: "AAAI Conference on AI" +date: Feb 25, 2025 +--- +Community Event + +**Date**: Feb 25 - Mar 4, 2025 +**Location**: Philadelphia, PA + +The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-25 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced. + +[Learn more about this event](https://aaai.org/conference/aaai/) diff --git a/_events/ce5.md b/_events/ce5.md new file mode 100644 index 000000000000..6be2a635a465 --- /dev/null +++ b/_events/ce5.md @@ -0,0 +1,13 @@ +--- +category: event +title: "Nvidia GTC 2025" +date: Mar 17, 2025 +--- +Community Event + +**Date**: Mar 17 - 21, 2025 +**Location**: San Jose, CA + +Nvidia's GTC 2025, a global AI conference for developers, showcased advancements in AI, robotics, and data centers, with key announcements including the Blackwell Ultra AI chip and the Vera Rubin architecture. + +[Learn more about this event](https://www.nvidia.com/gtc/) diff --git a/_events/ce6.md b/_events/ce6.md new file mode 100644 index 000000000000..1a45335fedf1 --- /dev/null +++ b/_events/ce6.md @@ -0,0 +1,15 @@ +--- +category: event +title: "LF Member Summit" +date: Mar 18, 2025 +--- +Community Event + +**Date**: Mar 18 - 20, 2025 +**Location**: Napa, CA + +The Linux Foundation Member Summit is the annual gathering for Linux Foundation member organizations. + +An annual gathering for Linux Foundation members that fosters collaboration, innovation, and partnerships among the leading projects and organizations working to drive digital transformation with open source technologies. It is a must-attend for business and technical leaders looking to advance open source strategy, implementation, and investment in their organizations and learn how to collaboratively manage the largest shared technology investment of our time. 
+ +[Learn more about this event](https://events.linuxfoundation.org/lf-member-summit/) diff --git a/_events/ce7.md b/_events/ce7.md new file mode 100644 index 000000000000..37a87c50453f --- /dev/null +++ b/_events/ce7.md @@ -0,0 +1,15 @@ +--- +category: event +title: "ICLR 2025" +date: Apr 24, 2025 +--- +Community Event + +**Date**: Apr 24 - 28, 2025 +**Location**: Singapore + +The International Conference on Learning Representations (ICLR) is the premier gathering of professionals dedicated to the advancement of the branch of artificial intelligence called representation learning, but generally referred to as deep learning. + +ICLR is globally renowned for presenting and publishing cutting-edge research on all aspects of deep learning used in the fields of artificial intelligence, statistics and data science, as well as important application areas such as machine vision, computational biology, speech recognition, text understanding, gaming, and robotics. + +[Learn more about this event](https://iclr.cc/) diff --git a/_events/ce8.md b/_events/ce8.md new file mode 100644 index 000000000000..13d99e29d4bc --- /dev/null +++ b/_events/ce8.md @@ -0,0 +1,15 @@ +--- +category: event +title: "Dubai AI Festival" +date: Apr 23, 2025 +--- +Community Event + +**Date**: Apr 23 - 24, 2025 +**Location**: Dubai, UAE + +At Dubai AI Festival, attendees will experience the convergence of artificial intelligence, blockchain, XR, decentralised systems, driving the progression of digital economies and technological innovation. + +This dynamic platform is designed to foster collaboration, innovation, and knowledge-sharing among industry leaders, entrepreneurs, and tech enthusiasts from around the world. Join us to engage with the future of technology at Dubai AI Festival. + +[Learn more about this event](https://dubaiaifestival.com/) diff --git a/_events/ce9.md b/_events/ce9.md new file mode 100644 index 000000000000..99bfe5b69ed9 --- /dev/null +++ b/_events/ce9.md @@ -0,0 +1,13 @@ +--- +category: event +title: "MLSys" +date: May 12, 2025 +--- +Community Event + +**Date**: May 12 - 15, 2025 +**Location**: Santa Clara, CA + +The Eighth Annual Conference on Machine Learning and Systems + +[Learn more about this event](https://mlsys.org/) diff --git a/_events/devcon-meetup.md b/_events/devcon-meetup.md new file mode 100644 index 000000000000..a93c10cd4c6b --- /dev/null +++ b/_events/devcon-meetup.md @@ -0,0 +1,10 @@ +--- +category: event +title: "PyTorch Meetup at DevConf.IN 2025" +date: Feb 28, 2025 +--- + +**Date**: Feb 28, 2025 +**Location**: Pune, India + +[Event Blog](https://pytorch.org/blog/pt-fedora-os-communities/) \ No newline at end of file diff --git a/_events/docathon-2025.md b/_events/docathon-2025.md new file mode 100644 index 000000000000..88bc55a52724 --- /dev/null +++ b/_events/docathon-2025.md @@ -0,0 +1,16 @@ +--- +category: event +title: "Docathon 2025" +date: Jun 3, 2025 +--- + +**Date**: June 3-18, 2025 +**Location**: Online + + +PyTorch Docathon + + +The PyTorch Docathon 2025, akin to a hackathon, is an event dedicated to enhancing the quality of the PyTorch documentation with the invaluable assistance of our community. This is an inclusive event designed to be accessible to all levels of expertise, from newcomers to experienced ML/PyTorch users. 
It offers a rewarding experience as participants can see the direct impact of their contributions on the project's usability and accessibility. The Docathon promotes a collaborative environment, allowing participants to work with other contributors and PyTorch maintainers, fostering the exchange of ideas and networking. It also provides a rich learning experience, offering the opportunity to explore PyTorch modules, update docstrings, and test tutorials. + +[RSVP Now](https://community.linuxfoundation.org/events/details/lfhq-pytorch-foundation-presents-pytorch-docathon-june-3rd-18th-2025/) \ No newline at end of file diff --git a/_events/kr-conf.md b/_events/kr-conf.md new file mode 100644 index 000000000000..2acc9671a2a9 --- /dev/null +++ b/_events/kr-conf.md @@ -0,0 +1,12 @@ +--- +category: event +title: "PyTorch KR Conference" +date: March 30, 2025 +--- + +**Date**: March 30, 2025, 13:00 ~ 18:00 +**Location**: Seoul, Republic of Korea + +Hear from speakers from the PyTorch Foundation, Meta, FuriosaAI, Lablup, Nota AI, Rebellions, etc. + +[Event Info](https://event-us.kr/pytorchkr/event/100142) \ No newline at end of file diff --git a/_events/multi-modal-dl-frame.md b/_events/multi-modal-dl-frame.md new file mode 100644 index 000000000000..ed2539f2d0d0 --- /dev/null +++ b/_events/multi-modal-dl-frame.md @@ -0,0 +1,18 @@ +--- +category: event +title: "Multi-Modal Tabular Deep Learning with PyTorch Frame" +date: February 19 +poster: assets/images/multi-modal-dl-frame.png +--- + +**Date**: February 19, 12 pm PST + + +Multi-Modal Tabular Deep Learning with PyTorch Frame + + +In this talk, Akihiro introduced PyTorch Frame, a modular framework for multi-modal tabular deep learning. PyTorch Frame enables seamless integration with the PyTorch ecosystem, including PyTorch Geometric for graph-based message passing across relational data and Hugging Face Transformers for extracting rich text features. The talk also highlights its specialized data structures for efficiently handling sparse features, making PyTorch Frame an essential tool for modern tabular data. + +Akihiro Nitta is a software engineer on the ML team at Kumo.ai and a core contributor to PyTorch Frame and PyTorch Geometric, with prior experience as a maintainer of PyTorch Lightning. + +[Learn more about the event](/multi-modal-dl-frame) diff --git a/_events/pt-26-live-q-a.md b/_events/pt-26-live-q-a.md new file mode 100644 index 000000000000..6838babb7ebe --- /dev/null +++ b/_events/pt-26-live-q-a.md @@ -0,0 +1,20 @@ +--- +category: event +title: "PyTorch 2.6 Live Q&A" +date: February 7, 2025 +poster: assets/images/ai-programming.png +--- + +**Date**: February 7, 10 am PST + + +PyTorch 2.6 Live Q&A + + +Wondering what's new in the recent PyTorch 2.6 release? Do you have questions? Join us for a live Q&A on PyTorch 2.6 with PyTorch Core Maintainer, Nikita Shulga (Meta). + +Nikita is a Software Engineer at Meta where he is, among other things, responsible for PyTorch releases and continuous integration. Nikita is committed to uplifting the developer community and continuously improving PyTorch. He earned his Master’s degree in Applied Mathematics from the Moscow Institute of Physics and Technology (MIPT). + +Bring your PyTorch 2.6 questions for Nikita during this live Q&A session. 
+ +[More info on this event.](/pt-26-live-q-a) diff --git a/_events/pt-27-release-qa.md b/_events/pt-27-release-qa.md new file mode 100644 index 000000000000..d1e75363137e --- /dev/null +++ b/_events/pt-27-release-qa.md @@ -0,0 +1,25 @@ +--- +category: event +title: "PyTorch 2.7 Release Live Q&A" +date: Apr 28, 2025 +poster: assets/images/pt27qa.png +--- + + +PyTorch 2.7 Release Q&A + + +**Date**: April 28, 12 pm PT +**Speakers**: Piotr Bialecki (NVIDIA) and Nikita Shulga (Meta) +**Location**: Online + +Have questions about PyTorch 2.7? Join PyTorch Core Maintainers Piotr Bialecki (NVIDIA) and Nikita Shulga (Meta) for a live Q&A session on Monday, April 28 at 12 PM PST. + +Piotr joined the PyTorch team at NVIDIA in 2019 and currently manages the team. He drives NVIDIA’s effort in maintaining and advancing PyTorch’s CUDA backend and received the PyTorch SUPERHERO award in 2023 for his community contributions, especially in the PyTorch discussion board. As a Core Maintainer, he is also focused on PyTorch’s long-term vision and development. + +Nikita is a Software Engineer at Meta where, among other things, he is responsible for PyTorch releases and continuous integration. Nikita is committed to uplifting the developer community and continuously improving PyTorch. He earned a Master’s degree in Applied Mathematics from the Moscow Institute of Physics and Technology (MIPT). + +Bring your PyTorch 2.7 questions for Piotr & Nikita during this live Q&A session. + +[Learn more about this event](/pt-27-release-qa) + diff --git a/_events/pt-day-china-2025.md b/_events/pt-day-china-2025.md new file mode 100644 index 000000000000..a8cb293c7fb8 --- /dev/null +++ b/_events/pt-day-china-2025.md @@ -0,0 +1,18 @@ +--- +category: event +title: "PyTorch Day China 2025" +date: June 7, 2025 +--- + + +PyTorch Day China 2025 + + +**Date:** June 7, 2025 +**Location:** Beijing, China + +PyTorch Day China 2025, proudly hosted by the PyTorch Foundation, is the premier gathering dedicated to open-source AI and machine learning innovation. Scheduled for June 7th in Beijing, China and co-located with the BAAI Conference, this community-driven event provides an unparalleled platform for PyTorch enthusiasts, machine learning engineers, AI researchers, and industry professionals. + +Immerse yourself in a vibrant day of insightful technical talks, interactive discussions, and engaging poster sessions designed to foster knowledge exchange and collaboration. PyTorch Day China is your gateway to connecting with leading experts and peers in the open-source AI community, offering you unique opportunities to explore cutting-edge advancements and shape the future of deep learning. + +[Read more about the event](https://www.lfasiallc.com/pytorch-day-china/) \ No newline at end of file diff --git a/_events/pt-day-france-2025.md b/_events/pt-day-france-2025.md new file mode 100644 index 000000000000..09b44cb627cd --- /dev/null +++ b/_events/pt-day-france-2025.md @@ -0,0 +1,18 @@ +--- +category: event +title: "PyTorch Day France 2025: Registration Open" +date: May 7, 2025 +poster: assets/images/pt-day-cfp.png +--- + + +PyTorch Day France 2025 + + +**Date**: May 7, 2025 +**Location**: Paris, France + +PyTorch Day France 2025, proudly hosted by the PyTorch Foundation, is the premier gathering dedicated to open-source AI and machine learning innovation. 
Scheduled for 7 May in Paris, France and co-located with the GOSIM AI Paris, this community-driven event provides an unparalleled platform for PyTorch enthusiasts, machine learning engineers, AI researchers, and industry professionals. +Immerse yourself in a vibrant day of insightful technical talks, interactive discussions, and engaging poster sessions designed to foster knowledge exchange and collaboration. PyTorch Day France is your gateway to connecting with leading experts and peers in the open-source AI community, offering you unique opportunities to explore cutting-edge advancements and shape the future of deep learning. + +[Register Now](https://events.linuxfoundation.org/pytorch-day-france/) diff --git a/_events/pt-dinov2-multi-label-plant-species-classification.md b/_events/pt-dinov2-multi-label-plant-species-classification.md new file mode 100644 index 000000000000..f4b7edede489 --- /dev/null +++ b/_events/pt-dinov2-multi-label-plant-species-classification.md @@ -0,0 +1,18 @@ +--- +category: event +title: "Using PyTorch and DINOv2 for Multi-label Plant Species Classification" +date: March 27 +poster: assets/images/pt-dinov2-multi-label-plant-species-classification.png +--- + +**Date**: March 27th, 12 PM PST + + +Using PyTorch and DINOv2 for Multi-label Plant Species Classification + + +Join us for an engaging webinar on our innovative transfer learning approach using self-supervised Vision Transformers (DINOv2) for multi-label plant species classification in the PlantCLEF 2024 challenge. We’ll cover how we efficiently extract feature embeddings from a dataset of 1.4 million images and utilize PyTorch Lightning for model training and Apache Spark for data management. Learn about our image processing techniques, including transforming images into grids of tiles and aggregating predictions to overcome computational challenges. Discover the significant performance improvements achieved and get insights into multi-label image classification. Perfect for PyTorch developers, this session will include a Q&A and access to our complete codebase at [github.com/dsgt-kaggle-clef/plantclef-2024](https://github.com/dsgt-kaggle-clef/plantclef-2024). + +Murilo Gustineli is a Senior AI Software Solutions Engineer at Intel, and is currently pursuing a Master’s in Computer Science at Georgia Tech focusing on machine learning. His work involves creating synthetic datasets, fine-tuning large language models, and training multi-modal models using Intel® Gaudi® Al accelerators as part of the Development Enablement team. He is particularly interested in deep learning, information retrieval, and biodiversity research, aiming to improve species identification and support conservation efforts. + +[Learn more about the event](/pt-dinov2-multi-label-plant-species-classification) diff --git a/_get_started/installation/linux.md b/_get_started/installation/linux.md index 35deaa0cde02..7461e2dfcd26 100644 --- a/_get_started/installation/linux.md +++ b/_get_started/installation/linux.md @@ -40,26 +40,10 @@ If you decide to use APT, you can run the following command to install it: sudo apt install python ``` -> If you use [Anaconda](#anaconda) to install PyTorch, it will install a sandboxed version of Python that will be used for running PyTorch applications. 
- ### Package Manager {: #linux-package-manager} -To install the PyTorch binaries, you will need to use one of two supported package managers: [Anaconda](https://www.anaconda.com/download/#linux) or [pip](https://pypi.org/project/pip/). Anaconda is the recommended package manager as it will provide you all of the PyTorch dependencies in one, sandboxed install, including Python. - -#### Anaconda - -To install Anaconda, you will use the [command-line installer](https://www.anaconda.com/download/#linux). Right-click on the 64-bit installer link, select `Copy Link Location`, and then use the following commands: - -```bash -# The version of Anaconda may be different depending on when you are installing` -curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -sh Miniconda3-latest-Linux-x86_64.sh -# and follow the prompts. The defaults are generally good.` -``` - -> You may have to open a new terminal or re-source your `~/.bashrc `to get access to the `conda` command. - +To install the PyTorch binaries, you will need to use the supported package manager: [pip](https://pypi.org/project/pip/). #### pip *Python 3* @@ -75,24 +59,6 @@ sudo apt install python3-pip ## Installation {: #linux-installation} -### Anaconda -{: #linux-anaconda} - -#### No CUDA/ROCm - -To install PyTorch via Anaconda, and do not have a [CUDA-capable](https://developer.nvidia.com/cuda-zone) or [ROCm-capable](https://rocm.docs.amd.com/) system or do not require CUDA/ROCm (i.e. GPU support), in the above selector, choose OS: Linux, Package: Conda, Language: Python and Compute Platform: CPU. -Then, run the command that is presented to you. - -#### With CUDA - -To install PyTorch via Anaconda, and you do have a [CUDA-capable](https://developer.nvidia.com/cuda-zone) system, in the above selector, choose OS: Linux, Package: Conda and the CUDA version suited to your machine. Often, the latest CUDA version is better. -Then, run the command that is presented to you. - -#### With ROCm - -PyTorch via Anaconda is not supported on ROCm currently. Please use pip instead. - - ### pip {: #linux-pip} @@ -148,7 +114,7 @@ For the majority of PyTorch users, installing from a pre-built binary via a pack ### Prerequisites {: #linux-prerequisites-2} -1. Install [Anaconda](#anaconda) or [Pip](#pip) +1. Install [Pip](#pip) 2. If you need to build PyTorch with GPU support a. for NVIDIA GPUs, install [CUDA](https://developer.nvidia.com/cuda-downloads), if your machine has a [CUDA-enabled GPU](https://developer.nvidia.com/cuda-gpus). b. for AMD GPUs, install [ROCm](https://rocm.docs.amd.com/), if your machine has a [ROCm-enabled GPU](https://rocm.docs.amd.com/) diff --git a/_get_started/installation/mac.md b/_get_started/installation/mac.md index 5294b865a38c..d3803a1ee278 100644 --- a/_get_started/installation/mac.md +++ b/_get_started/installation/mac.md @@ -1,7 +1,7 @@ # Installing on macOS {:.no_toc} -PyTorch can be installed and used on macOS. Depending on your system and GPU capabilities, your experience with PyTorch on a Mac may vary in terms of processing time. +PyTorch can be installed and used on macOS. Depending on your system and GPU capabilities, your experience with PyTorch on macOS may vary in terms of processing time. 
## Prerequisites {: #mac-prerequisites} @@ -14,24 +14,13 @@ PyTorch is supported on macOS 10.15 (Catalina) or above. {: #mac-python} It is recommended that you use Python 3.9 - 3.12. -You can install Python either through the Anaconda -package manager (see [below](#anaconda)), [Homebrew](https://brew.sh/), or +You can install Python either through [Homebrew](https://brew.sh/) or the [Python website](https://www.python.org/downloads/mac-osx/). ### Package Manager {: #mac-package-manager} -To install the PyTorch binaries, you will need to use one of two supported package managers: [pip](https://pypi.org/project/pip/) or [Anaconda](https://www.anaconda.com/download/#macos). -#### Anaconda - -To install Anaconda, you can [download graphical installer](https://www.anaconda.com/download/#macos) or use the command-line installer. If you use the command-line installer, you can right-click on the installer link, select `Copy Link Address`, or use the following commands on Mac computer with Apple silicon: - -```bash -# The version of Anaconda may be different depending on when you are installing` -curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-arm64.sh -sh Miniconda3-latest-MacOSX-arm64.sh -# and follow the prompts. The defaults are generally good.` -``` +To install the PyTorch binaries, you will need to use the supported package manager: [pip](https://pypi.org/project/pip/). #### pip *Python 3* @@ -43,19 +32,10 @@ If you installed Python via Homebrew or the Python website, `pip` was installed ## Installation {: #mac-installation} -### Anaconda -{: #mac-anaconda} - -To install PyTorch via Anaconda, use the following conda command: - -```bash -conda install pytorch torchvision -c pytorch -``` - ### pip {: #mac-pip} -To install PyTorch via pip, use one of the following two commands, depending on your Python version: +To install PyTorch via pip, use the following command, depending on your Python version: ```bash # Python 3.x @@ -91,7 +71,7 @@ For the majority of PyTorch users, installing from a pre-built binary via a pack ### Prerequisites {: #mac-prerequisites-2} -1. [Optional] Install [Anaconda](#anaconda) +1. [Optional] Install [pip](https://pypi.org/project/pip/) 2. Follow the steps described here: [https://github.com/pytorch/pytorch#from-source](https://github.com/pytorch/pytorch#from-source) You can verify the installation as described [above](#mac-verification). diff --git a/_get_started/installation/windows.md b/_get_started/installation/windows.md index 2d0ec5c041f5..8000e9cddbc6 100644 --- a/_get_started/installation/windows.md +++ b/_get_started/installation/windows.md @@ -24,9 +24,6 @@ As it is not installed by default on Windows, there are multiple ways to install * [Chocolatey](https://chocolatey.org/) * [Python website](https://www.python.org/downloads/windows/) -* [Anaconda](#anaconda) - -> If you use Anaconda to install PyTorch, it will install a sandboxed version of Python that will be used for running PyTorch applications. > If you decide to use Chocolatey, and haven't installed Chocolatey yet, ensure that you are running your command prompt as an administrator. 
@@ -39,12 +36,7 @@ choco install python ### Package Manager {: #windows-package-manager} -To install the PyTorch binaries, you will need to use at least one of two supported package managers: [Anaconda](https://www.anaconda.com/download/#windows) and [pip](https://pypi.org/project/pip/). Anaconda is the recommended package manager as it will provide you all of the PyTorch dependencies in one, sandboxed install, including Python and `pip.` - -#### Anaconda - -To install Anaconda, you will use the [64-bit graphical installer](https://www.anaconda.com/download/#windows) for PyTorch 3.x. Click on the installer link and select `Run`. Anaconda will download and the installer prompt will be presented to you. The default options are generally sane. - +To install the PyTorch binaries, you will need to use the supported package manager: [pip](https://pypi.org/project/pip/). #### pip If you installed Python by any of the recommended ways [above](#windows-python), [pip](https://pypi.org/project/pip/) will have already been installed for you. @@ -52,22 +44,6 @@ If you installed Python by any of the recommended ways [above](#windows-python), ## Installation {: #windows-installation} -### Anaconda -{: #windows-anaconda} - -To install PyTorch with Anaconda, you will need to open an Anaconda prompt via `Start | Anaconda3 | Anaconda Prompt`. - -#### No CUDA - -To install PyTorch via Anaconda, and do not have a [CUDA-capable](https://developer.nvidia.com/cuda-zone) system or do not require CUDA, in the above selector, choose OS: Windows, Package: Conda and CUDA: None. -Then, run the command that is presented to you. - -#### With CUDA - -To install PyTorch via Anaconda, and you do have a [CUDA-capable](https://developer.nvidia.com/cuda-zone) system, in the above selector, choose OS: Windows, Package: Conda and the CUDA version suited to your machine. Often, the latest CUDA version is better. -Then, run the command that is presented to you. - - ### pip {: #windows-pip} @@ -126,7 +102,7 @@ For the majority of PyTorch users, installing from a pre-built binary via a pack ### Prerequisites {: #windows-prerequisites-2} -1. Install [Anaconda](#anaconda) +1. Install [pip](https://pypi.org/project/pip/) 2. Install [CUDA](https://developer.nvidia.com/cuda-downloads), if your machine has a [CUDA-enabled GPU](https://developer.nvidia.com/cuda-gpus). 3. If you want to build on Windows, Visual Studio with MSVC toolset, and NVTX are also needed. The exact requirements of those dependencies could be found out [here](https://github.com/pytorch/pytorch#from-source). 4. Follow the steps described here: [https://github.com/pytorch/pytorch#from-source](https://github.com/pytorch/pytorch#from-source) diff --git a/_get_started/mobile.md b/_get_started/mobile.md index 2a640293144c..d709ee61e2f8 100644 --- a/_get_started/mobile.md +++ b/_get_started/mobile.md @@ -1,6 +1,6 @@ --- layout: get_started -title: ExecuTorch +title: PyTorch for Edge permalink: /get-started/executorch/ background-class: get-started-background body-class: get-started @@ -10,11 +10,29 @@ published: true ## Get Started with PyTorch ExecuTorch -
- ExecuTorch Documentation
+PyTorch’s edge-specific library is [ExecuTorch](https://github.com/pytorch/executorch/). It is designed to be lightweight and highly performant, even on devices with constrained hardware such as mobile phones, embedded systems, and microcontrollers. + +ExecuTorch relies heavily on PyTorch core technologies such as [torch.compile](https://pytorch.org/docs/stable/torch.compiler.html) and [torch.export](https://pytorch.org/docs/stable/export.html), so it should feel very familiar to anyone who has used PyTorch in the past. + +### Getting Started +You can get started by following the [general getting started guide](https://pytorch.org/executorch/stable/getting-started.html#) or jump to the specific steps for your target device. + +* [Using ExecuTorch on Android](https://pytorch.org/executorch/stable/using-executorch-android.html) +* [Using ExecuTorch on iOS](https://pytorch.org/executorch/stable/using-executorch-ios.html) +* [Using ExecuTorch with C++](https://pytorch.org/executorch/stable/using-executorch-cpp.html) + +### Hardware Acceleration +ExecuTorch provides out-of-the-box hardware acceleration for a growing number of chip manufacturers. See the following resources to learn more about how to leverage them: + +* [Backend Overview](https://pytorch.org/executorch/stable/backends-overview.html) +* [XNNPACK](https://pytorch.org/executorch/stable/backends-xnnpack.html) +* [Core ML](https://pytorch.org/executorch/stable/backends-coreml.html) +* [MPS](https://pytorch.org/executorch/stable/backends-mps.html) +* [Vulkan](https://pytorch.org/executorch/stable/backends-vulkan.html) +* [ARM Ethos-U](https://pytorch.org/executorch/stable/backends-arm-ethos-u.html) +* [Qualcomm AI Engine](https://pytorch.org/executorch/stable/backends-qualcomm.html) +* [MediaTek](https://pytorch.org/executorch/stable/backends-mediatek.html) +* [Cadence Xtensa](https://pytorch.org/executorch/stable/backends-cadence.html) diff --git a/_get_started/previous-versions.md index d47db4add549..d86ae87de17e 100644 --- a/_get_started/previous-versions.md +++ b/_get_started/previous-versions.md @@ -17,6 +17,82 @@ your convenience.
## Commands for Versions >= 1.0.0 +### v2.6.0 + +#### Wheel + +##### OSX + +``` +pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 +``` + +##### Linux and Windows + +``` +# ROCM 6.1 (Linux only) +pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/rocm6.1 +# ROCM 6.2.4 (Linux only) +pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/rocm6.2.4 +# CUDA 11.8 +pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118 +# CUDA 12.4 +pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124 +# CUDA 12.6 +pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126 +# CPU only +pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cpu +``` + +### v2.5.1 + +#### Conda + +##### OSX + +``` +# conda +conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 -c pytorch +``` + +##### Linux and Windows + +``` +# CUDA 11.8 +conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=11.8 -c pytorch -c nvidia +# CUDA 12.1 +conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.1 -c pytorch -c nvidia +# CUDA 12.4 +conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia +# CPU Only +conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 cpuonly -c pytorch +``` + +#### Wheel + +##### OSX + +``` +pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 +``` + +##### Linux and Windows + +``` +# ROCM 6.1 (Linux only) +pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/rocm6.1 +# ROCM 6.2 (Linux only) +pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/rocm6.2 +# CUDA 11.8 +pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu118 +# CUDA 12.1 +pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121 +# CUDA 12.4 +pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124 +# CPU only +pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cpu +``` + ### v2.5.0 #### Conda diff --git a/_includes/footer.html b/_includes/footer.html index 4e1ada721a59..a74402d61751 100644 --- a/_includes/footer.html +++ b/_includes/footer.html @@ -87,7 +87,6 @@
Resources
- {% include google_pixel.html %} {% include mobile_menu.html %} diff --git a/_includes/head.html b/_includes/head.html index 06be73f6c60f..b86b1e202467 100644 --- a/_includes/head.html +++ b/_includes/head.html @@ -34,7 +34,6 @@ {% if jekyll.environment == 'production' %} - {% include analytics.html %} {% include pixel.html %} {% include twitter_pixel.html %} {% endif %} diff --git a/_includes/header.html b/_includes/header.html index cd3d2370eddd..45c484e1845e 100644 --- a/_includes/header.html +++ b/_includes/header.html @@ -1,6 +1,6 @@
- Join us in Silicon Valley September 18-19 at the 2024 PyTorch Conference. Learn more. + Join us at PyTorch Conference in San Francisco, October 22-23. CFP open now! Learn more.
diff --git a/_includes/main_menu.html b/_includes/main_menu.html index a5f41f2f51f6..46cc727fedf5 100644 --- a/_includes/main_menu.html +++ b/_includes/main_menu.html @@ -26,6 +26,9 @@ Intro to PyTorch - YouTube Series

Master PyTorch basics with our engaging YouTube tutorial series

+ + New to PyTorch Foundation +
@@ -36,10 +39,13 @@ Ecosystem
- + Tools

Learn about the tools and frameworks in the PyTorch Ecosystem

+ + Join the Ecosystem + Community

Join the PyTorch developer community to contribute, learn, and get your questions answered.

diff --git a/_includes/mobile_menu.html b/_includes/mobile_menu.html index e9ce82726284..70e11e57ec2a 100644 --- a/_includes/mobile_menu.html +++ b/_includes/mobile_menu.html @@ -42,13 +42,19 @@
  • Introduction to PyTorch - YouTube Series
  • +
  • + New to PyTorch Foundation +
  • Ecosystem
  • diff --git a/_layouts/blog_detail.html b/_layouts/blog_detail.html index 9b3726de5552..eb80011a163b 100644 --- a/_layouts/blog_detail.html +++ b/_layouts/blog_detail.html @@ -7,7 +7,7 @@
    - Join us in Silicon Valley September 18-19 at the 2024 PyTorch Conference. Learn more. + Join us at PyTorch Conference in San Francisco, October 22-23. CFP open now! Learn more.
    diff --git a/_posts/2019-07-18-pytorch-ecosystem.md b/_posts/2019-07-18-pytorch-ecosystem.md index 7351cbbd9d4f..1be05469bb83 100644 --- a/_posts/2019-07-18-pytorch-ecosystem.md +++ b/_posts/2019-07-18-pytorch-ecosystem.md @@ -41,7 +41,7 @@ When we review project submissions for the PyTorch ecosystem, we take into accou 5. *Ongoing maintenance:* Project authors need to be committed to supporting and maintaining their projects. 6. *Community:* Projects should have (or be on track to building) an active, broad-based community. -If you would like to have your project included in the PyTorch ecosystem and featured on [pytorch.org/ecosystem](http://pytorch.org/ecosystem), please complete the form [here](https://pytorch.org/ecosystem/join). If you've previously submitted a project for consideration and haven't heard back, we promise to get back to you as soon as we can - we've received a lot of submissions! +If you would like to have your project included in the PyTorch ecosystem and featured on [pytorch.org/ecosystem](http://pytorch.org/ecosystem), please complete the form [here](https://github.com/pytorch-fdn/ecosystem). If you've previously submitted a project for consideration and haven't heard back, we promise to get back to you as soon as we can - we've received a lot of submissions! ## PyTorch Hub for reproducible research | New models diff --git a/_posts/2024-05-11-enhancing-deep-learning.md b/_posts/2024-05-11-enhancing-deep-learning.md index fc5af1bc3c57..456ba8b9e658 100644 --- a/_posts/2024-05-11-enhancing-deep-learning.md +++ b/_posts/2024-05-11-enhancing-deep-learning.md @@ -8,7 +8,7 @@ Welcome to the thriving PyTorch ecosystem, where a wealth of tools and libraries Initially, PyTorch aimed to establish a thriving community, enabling developers to access each other's tools, engage in meaningful discussions, and explore the wealth of resources available within the community. -Today, the PyTorch ecosystem has grown to feature over 100 projects tailored to your needs, providing robust support, enhanced speed, and effortless integration with PyTorch. If your project aligns with our mission, we invite you to [submit](https://pytorch.org/ecosystem/join) it and join this dynamic ecosystem. +Today, the PyTorch ecosystem has grown to feature over 100 projects tailored to your needs, providing robust support, enhanced speed, and effortless integration with PyTorch. If your project aligns with our mission, we invite you to [submit](https://github.com/pytorch-fdn/ecosystem) it and join this dynamic ecosystem. New this month, we’ve moved all of our Ecosystem blogs over to our PyTorch.org website to host a space where our community can show off the latest innovations with our users. Read on to hear about the latest projects in the ecosystem! @@ -94,7 +94,7 @@ Our diverse ecosystem tools are instrumental in PyTorch's success.. They provid Leveraging these tools empowers developers and researchers to accelerate their deep learning workflows and unlock new possibilities in the field of AI. -Have a tool that would be a good fit for the [PyTorch Ecosystem](https://pytorch.org/ecosystem/)? If you can answer the below questions, we’d love for you to [submit your tool for review](https://pytorch.org/ecosystem/join). 
+Have a tool that would be a good fit for the [PyTorch Ecosystem](https://pytorch.org/ecosystem/)? If you can answer the below questions, we’d love for you to [submit your tool for review](https://github.com/pytorch-fdn/ecosystem). diff --git a/_posts/2024-08-07-flexattention.md b/_posts/2024-08-07-flexattention.md index 4c34879d33b6..acfc1fc40f01 100644 --- a/_posts/2024-08-07-flexattention.md +++ b/_posts/2024-08-07-flexattention.md @@ -1,7 +1,7 @@ --- layout: blog_detail title: "FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention" -author: "Team PyTorch: Horace He, Driss Guessous, Yanbo Liang, Joy Dong" +author: "Team PyTorch: Driss Guessous, Yanbo Liang, Joy Dong, Horace He" --- ![a cartoon chart flexing his muscles](/assets/images/flexattention/fg1.jpg){:style="width:100%"} @@ -131,7 +131,7 @@ Alibi is similar to relative positional encodings with one exception \- it has a alibi_bias = generate_alibi_bias() # [num_heads] def alibi(score, b, h, q_idx, kv_idx): - bias = alibi_bias[h] * (q_idx - kv_idx) + bias = alibi_bias[h] * (kv_idx - q_idx) return score + bias ``` @@ -218,12 +218,12 @@ def sliding_window_causal(b, h, q_idx, kv_idx): return causal_mask & window_mask # If you want to be cute... -from torch.nn.attention import or_masks +from torch.nn.attention import and_masks def sliding_window(b, h, q_idx, kv_idx) return q_idx - kv_idx <= SLIDING_WINDOW -sliding_window_causal = or_masks(causal_mask, sliding_window) +sliding_window_causal = and_masks(causal_mask, sliding_window) ``` We benchmark it against `F.scaled_dot_product_attention` with a sliding window mask as well as FA2 with a causal mask (as a reference point for performance). Not only are we significantly faster than `F.scaled_dot_product_attention`, we’re *also* significantly faster than FA2 with a causal mask as this mask has significantly more sparsity. @@ -479,4 +479,4 @@ We want to highlight some prior work (and people) that have inspired FlexAttenti - The Jax team's work on SplashAttention - Philippe Tillet and Keren Zhou for helping us with Triton - Ali Hassani for discussions on neighborhood attention -- Everybody who's complained about attention kernels not supporting their favorite attention variant :) \ No newline at end of file +- Everybody who's complained about attention kernels not supporting their favorite attention variant :) diff --git a/_posts/2024-12-23-2024-year-in-review.md b/_posts/2024-12-23-2024-year-in-review.md index f9bae5c6e48c..4b972e0c4c4d 100644 --- a/_posts/2024-12-23-2024-year-in-review.md +++ b/_posts/2024-12-23-2024-year-in-review.md @@ -40,7 +40,7 @@ Throughout the year the PyTorch Team has been working hard to introduce a number We’ve also had a number of strong technical showcases throughout the year to highlight how PyTorch can be used! [TorchTitan](https://arxiv.org/html/2410.06511v1) exhibited what an open source, PyTorch-native distributed training system could look like for training large language models (LLMs). [TorchChat](https://pytorch.org/blog/torchchat-local-llm-inference/) showcased how to seamlessly and performantly run LLMs across laptop, desktop, and mobile devices. 
-As well we were very excited to include [multiple new projects](https://pytorch.org/blog/enhancing-deep-learning/) into the PyTorch ecosystem throughout 2024, including the introduction of [vLLM](https://pytorch.org/blog/vllm-joins-pytorch/) into the PyTorch Ecosystem, a state-of-the-art inference engine, which gives machine learning engineers an easy, fast, and cheap way of serving LLMs. If you are interested in joining the PyTorch Ecosystem, please [join](https://pytorch.org/ecosystem/join)! +As well we were very excited to include [multiple new projects](https://pytorch.org/blog/enhancing-deep-learning/) into the PyTorch ecosystem throughout 2024, including the introduction of [vLLM](https://pytorch.org/blog/vllm-joins-pytorch/) into the PyTorch Ecosystem, a state-of-the-art inference engine, which gives machine learning engineers an easy, fast, and cheap way of serving LLMs. If you are interested in joining the PyTorch Ecosystem, please [join](https://github.com/pytorch-fdn/ecosystem)! ![people at a conference](/assets/images/2024-year-in-review/fg4.jpg){:style="width:100%"} diff --git a/_posts/2025-01-21-accelerating-llm-inference.md b/_posts/2025-01-21-accelerating-llm-inference.md new file mode 100644 index 000000000000..e35c661eb071 --- /dev/null +++ b/_posts/2025-01-21-accelerating-llm-inference.md @@ -0,0 +1,285 @@ +--- +layout: blog_detail +title: "Accelerating LLM Inference with GemLite, TorchAO and SGLang" +author: "Teams at PyTorch, Mobius Labs and SGLang" +--- + +Large Language Models (LLMs) are typically very resource-intensive, requiring significant amounts of memory, compute and power to operate effectively. Quantization provides a solution by reducing weights and activations from 16 bit floats to lower bitrates (e.g., 8 bit, 4 bit, 2 bit), achieving significant speedup and memory savings and also enables support for larger batch sizes. + +Existing solutions for low precision inference work well for small batch sizes, but suffer from following issues: + +* Performance drops when we increase the batch size +* Restrictions on types of quantization, for example, some kernels only support symmetric quantization that could have implications on accuracy of the model at lower bits +* Interplay between quantization, serialization, and tensor parallelism (TP) makes it difficult to load quantized models and requires changes to user models + +To address these challenges, we created an end-to-end, performant, modular and extensible low-precision inference solution integrating the following libraries: + +* [GemLite](https://github.com/mobiusml/gemlite), a Triton kernel library, tackles the performance limitations of large batch sizes and restrictions on the types of quantization +* [TorchAO](https://github.com/pytorch/ao), a PyTorch-native library, provides a streamlined experience for quantization, sparsity, and tensor parallelism (with DTensor) +* [SGLang](https://github.com/sgl-project/sglang), a fast, efficient and hackable serving framework for Large Language Model (LLM) and Vision Language Models (VLM) with extensive model support + +If you’re interested in trying this out in SGLang, please follow these [repro instructions](#repro-instructions). 
For the rest of the blog, we’ll walk through the relevant details for GemLite, TorchAO and SGLang, both in terms of the design of each library and of the integrations that address the problems mentioned above. At the end, we’ll present benchmarking results on the Llama 3.1-8B model across different batch sizes and tensor parallel sizes. + +## 1. Teaser of Results + +Following is a summary of the results on an 8xH100 machine with Llama 3.1-8B for decode. For all experiments, the baseline is the bfloat16 torch.compiled model:
|  | bfloat16 w/ torch.compile | int4 weight only quantization, group size 64 | float8 per row dynamic quantization |
|---|---|---|---|
| Batch size 1, TP size 1 | 131 tokens/sec | 255 tokens/sec (1.95x speedup) | 166 tokens/sec (1.27x speedup) |
| Batch size 32, TP size 1 | 2799 tokens/sec | 3241 tokens/sec (1.16x speedup) | 3586 tokens/sec (1.28x speedup) |
| Batch size 32, TP size 4 | 5575 tokens/sec | 6334 tokens/sec (1.14x speedup) | 6159 tokens/sec (1.10x speedup) |
    + + +Our solution supports NVIDIA GPUs, including H100 and A100, and achieves speedup over the compiled bfloat16 baseline across batch sizes and TP sizes for both int4 weight only (from 1.14x to 1.95x) and float8 dynamic quantization (from 1.10x to 1.28x). Note that quantization may have a small impact on accuracy, which is outside the scope of this blogpost. Our int4 weight-only quantization is compatible with accuracy preserving techniques like HQQ. Please refer to [TorchAO's README](https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md#cuda-backend-1), [this benchmark](https://huggingface.co/mobiuslabsgmbh/Llama-3.1-8b-instruct_4bitgs64_hqq_calib), and [this blog](https://neuralmagic.com/blog/we-ran-over-half-a-million-evaluations-on-quantized-llms-heres-what-we-found/) for more information. + + +## 2. GemLite: Kernel Development + +The kernels were developed as part of GemLite, a project dedicated to optimizing low-bit matrix multiplication kernels. Developed using Triton, GemLite provides highly flexible and performant solutions across various activations, bitrates and hardware. In a nutshell, the kernels offer: + + + +* Support for various activation data types: fp16, int8 and fp8 +* Compatibility: works seamlessly with non-packed (e.g., int8, fp8) and packed formats (e.g., uint4, uint2, uint1) +* Performance Optimization: includes optimized kernels and autotuning tools to achieve high performance across different hardware and batch sizes +* Integration: Compatible with torch.compile and CUDA graphs, ensuring support for advanced features like tensor parallelism + +### Kernel Selection + +Optimizing kernel selection for large language model (LLM) generation requires addressing the distinct needs of different batch sizes. LLM workloads involve a mix of compute-bound and memory-bound iterations: smaller batch sizes are memory-bound, while larger batch sizes become compute-bound. GemLite kernels are designed to adapt to these varying demands, ensuring optimal execution for each scenario. + +In memory-bound scenarios, where data transfer is the limiting factor, the processor often waits for data to be fetched, leading to underutilized computational resources. For batch size = 1, a GEMV kernel performs best, whereas for larger batch sizes, GEMM kernels are more efficient. For batch sizes between 2 and 64, when matrices are "skinny," a GEMM-SPLITK kernel is used to enable better GPU utilization ([arXiv](https://arxiv.org/abs/2402.00025)). + +GemLite includes the following kernels optimized for each of these scenarios: + +### Single Sample Inference + +For single-sample inferences, we use GEMV kernels. However, asymmetric quantization methods require additional metadata, such as scales and zero points, to be loaded for each block. This can lead to increased memory transfer, so careful handling is essential. + +Specifically, for packed data, our experiments indicate that loading scales and zero points only once per two consecutive blocks minimizes redundant operations. Since these blocks share the same metadata, this approach results in: + +* 5–8% end-to-end inference speedup compared to the default GEMV kernel +* 30–40% improvement over the traditional Split-K method + +This new kernel/algorithm, GEMV_REVSPLITK, is available [here](https://github.com/mobiusml/gemlite/blob/master/gemlite/triton_kernels/gemv_revsplitK_A16fWnO16f_int32packing.py). 
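The production kernel is written in Triton (linked above). As a rough, framework-level illustration of the metadata-reuse idea (a hypothetical helper, not GemLite's actual API), the sketch below dequantizes pairs of consecutive blocks while fetching their shared scale and zero point only once:

```
import torch

def dequantize_paired_blocks(q_blocks, scales, zeros):
    """Toy sketch: each pair of consecutive blocks shares one (scale, zero) entry,
    so the metadata is loaded once per pair instead of once per block."""
    out = torch.empty(q_blocks.shape, dtype=torch.float16)
    for pair in range(q_blocks.shape[0] // 2):
        s, z = scales[pair], zeros[pair]          # single metadata load for the pair
        for blk in (2 * pair, 2 * pair + 1):      # reused by both consecutive blocks
            out[blk] = (q_blocks[blk].to(torch.float16) - z) * s
    return out

# Example: 4 blocks of 8 uint4-range values, 2 shared (scale, zero) entries
q = torch.randint(0, 16, (4, 8))
w = dequantize_paired_blocks(q, scales=torch.tensor([0.10, 0.20]), zeros=torch.tensor([8.0, 8.0]))
```

In the real GEMV_REVSPLITK kernel this reuse happens inside the Triton program itself, which is where the 5–8% end-to-end gain quoted above comes from.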
+ +For non-packed data, the [GEMV_SPLITK](https://github.com/mobiusml/gemlite/blob/master/gemlite/triton_kernels/gemv_splitK_A16fWnO16f_int32packing.py) algorithm is employed. This algorithm iterates over the k-dimension to compute the dot product without relying on Triton's tl.dot. + +### Batched Inference + +For moderate batch sizes, we use the GEMM-based Split-K method ([arXiv](https://arxiv.org/abs/2402.00025)), which splits the k-dimension (weight rows) into multiple jobs. The optimal SPLIT_K parameter is found by autotuning values ranging from 1 to 16. Setting SPLIT_K=1 falls back to a plain GEMM kernel, allowing the same kernel code to be used for compute-bound batch sizes starting from 32 and 64, depending on the matrix shape and the device. + +### Maximizing High Performance: Key Implementation Insights + +Various implementation details must be carefully addressed to achieve high performance. Following are some of the key aspects we focused on: + +1. Autotuning for Performance + + + [Autotuning](https://triton-lang.org/main/python-api/generated/triton.autotune.html) is critical for achieving optimal kernel performance. Since this process can be time-intensive, GemLite provides tools to automatically save and load autotuning results for all kernels. This ensures that the autotuning process is performed only once per GPU device, minimizing runtime, reducing repetitive overhead, and maintaining consistent performance across runs. + +2. Ensuring Kernel Correctness + + + Ensuring kernel correctness across different quantization and configuration settings is essential. Triton’s [early configuration pruning](https://triton-lang.org/main/python-api/generated/triton.autotune.html) plays a key role in this process. For example, during Split-K tuning, configurations are selected only if K is divisible by BLOCK_SIZE_K × SPLIT_K, and BLOCK_SIZE_K is further pruned based on the group-size value. This approach ensures both efficiency and correctness in kernel operation. + +3. Overcoming Bit-Unpacking Bottlenecks + + + When deploying on data center-grade GPUs like NVIDIA’s A100 and H100, performance bottlenecks related to bit-unpacking were observed. To mitigate these, various bit-packing configurations were explored, including packing along columns versus rows and experimenting with different bit-packing widths (e.g., 8-bit vs. 32-bit). Notably, transitioning from 32-bit to 8-bit packing delivered performance improvements of up to 18% on the A100 and 6% on the H100. + +4. torch.compile compatibility + + + To ensure seamless compatibility with PyTorch’s torch.compile, kernel calls are wrapped in a [custom_op](https://pytorch.org/tutorials/advanced/python_custom_ops.html). This integration allows advanced features such as pre-hooks and early configuration pruning to function correctly, delivering accurate results without sacrificing performance. While some of these [features](https://github.com/pytorch/pytorch/issues/139059) are not yet fully supported in PyTorch, the custom_op implementation effectively bridges the gap, ensuring smooth integration and high performance. + + +## 3.
TorchAO + +TorchAO is a PyTorch native quantization and sparsity library for both training and inference, featuring simple user APIs to train, quantize and deploy low precision models, and composability with other PyTorch features like distributed inference and torch.compile. + +PyTorch does not support low precision dtypes or different packing formats by default. With Tensor Subclass, we extend PyTorch native Tensor abstractions and model quantization as dtype conversion, while different packing formats for custom kernels are handled through layouts. For example, we support quantized linear operations with int4 weights, packed in a Tensor Core friendly layout, with tinygemm or GemLite kernel implementations. More details can be found [here](https://pytorch.org/ao/stable/contributor_guide.html). + + +![flow diagram](/assets/images/accelerating-llm-inference/fg1.png){:style="width:100%"} + + +Apart from more PyTorch native abstractions for developers, we want to highlight two benefits of this design for modeling users. + +1. [Serialization](https://pytorch.org/ao/stable/serialization.html): Save and load quantized weights into a state_dict just like a floating point model, eliminating the need to transform floating point model to quantized model before the quantized weights are loaded. This reduces friction of distributing and deploying quantized models. + +2. [Composability](#torch-tensor-parallel): Seamless integration with downstream features like tensor parallel, allowing users to focus on modeling without worrying about compatibility with tensor parallel, torch.compile, and other PyTorch features. Since these features are implemented with Tensor level abstraction, users can quantize and do distributed inference with no model changes most of the time. + + +### GemLite Kernel Integration + +To achieve the aforementioned benefits for the GemLite kernel, we integrated GemLite into TorchAO. This integration takes advantage of GemLite’s wide support and flexibility to allow for weight only quantization at 4 and 8 bits, under asymmetric and symmetric quantization schemes, 32 and 8 bit packing sizes, as well as grouped and ungrouped quantization. We enable this integration via the `quantize_` api which can be used alongside the GemLite constructor as follows + + +``` +quantize_(model, gemlite_uintx_weight_only(group_size, bit_width, packing_bitwidth)) +``` + + +The primary difficulty in creating this integration was making sure that the TorchAO composability guarantees were satisfied for the entire breadth of GemLite quantization kernel options. While the primary integration was relatively straight forward, making sure every different quantization type and their associated kernels worked well with tensor parallel was non-trivial. + + +### Torch Tensor Parallel {#torch-tensor-parallel} + +Tensor Parallelism is an effective way to speed up LLM inference. TP shards large matrices of linear or embedding modules onto multiple devices, typically in column-wise or row-wise styles. As the weight matrix gets distributed, computation is decomposed too. For example, the column-wise pattern below enables simultaneous matrix-vector multiply on four devices: + +![equation](/assets/images/accelerating-llm-inference/fg5.jpg){:style="max-width:300px; width:100%; display: block; margin-left: auto; margin-right: auto"} + + +PyTorch implements TP by converting a regular tensor (e.g. 
matrix *A*) into a *DTensor*: + +``` +dtensor = _shard_tensor(mA, device_mesh, (Shard(0),)) +``` + +Since DTensor stores meta information about the sharding, it knows how to reconstruct the full result when needed. Take Transformers’ feedforward module for example, as the down projection and up projection use column-wise and row-wise sharding respectively, DTensor will automatically perform an all-reduce on the ranks’ results as they move into the next operation. Such automation allows model authors to focus on computation without worrying about the communication needed for distributed execution. + +**Tensor Parallel and Quantization Order** + +Since both DTensor and quantization are tensor-level transformations, the application order matters in ensuring a workflow can generally work on different setups. We have two observations: (i) checkpoints are typically saved in quantized formats, to save the quantization overhead before each run; and (ii) TP may run on a different number of devices, depending on resource constraints or service agreements. As such, we first apply quantization to the original tensor, save it to disk depending on whether a reuse is desired. At service launch time, we load the quantized checkpoint and shard the tensors into DTensors on-the-fly as we load them into the model. + +**Tensor Parallel Support in TorchAO** + +Since we quantize the model first then distribute the Tensor, we’ll have `DTensor(QuantizedTensor(weight))`, where `DTensor` means a distributed Tensor class and `QuantizedTensor` means a quantized tensor class in TorchAO. `QuantizedTensor` should support the operators called when constructing a `DTensor`, including slice and view ops. To make sure the overall execution is efficient, the packed weight that’s sliced in the dimension 0 and 1 should match the result of first slice the unpacked weight then pack (pack and slice operation should commute), otherwise the packing format is not compatible with tensor parallelism. + + +## 4. SGLang + +SGLang is a fast serving framework for large language models and vision language models. It is known for its almost [zero-overhead batch scheduler](https://lmsys.org/blog/2024-12-04-sglang-v0-4/) and fast [constrained decoding](https://lmsys.org/blog/2024-02-05-compressed-fsm/). It is mainly implemented in Python, lightweight, and easy to hack. It is also one of the first frameworks to integrate torch.compile. + +**TorchAO integration in SGLang** + +We integrated `quantize_` API for applying a specific type of quantization to model into SGLang that supports int4 weight only quantization (both tinygemm and GemLite version), float8 dynamic quantization and a few other types of quantization so far. Users can enable quantization by adding `--torchao-config` argument to the benchmarking script. The currently enabled options also support tensor parallelism through composition with DTensor that is enabled with `--tp-size` option. + +**Torch Native Tensor Parallel Support in SGLang** + +Existing model definitions in SGLang use special linear modules that are coupled with tensor parallelism style, for example: `MergedColumnParallelLinear`, `QKVParallelLinear` and `RowParallelLinear`. 
To decouple the model definition and tensor parallelization style, we defined a [pytorch native model](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/torch_native_llama.py) that uses plain `nn.Linear` module from PyTorch and rely on PyTorch tensor parallelism APIs for parallelization and torch.compile for speedup. At related module hierarchies, we add a dictionary describing how a submodule should be parallelized. For example, in `class LlamaAttention`, we define: + +``` +_tp_plan = { + "qkv_proj": "Colwise_Sharded", + "o_proj": "Rowwise", +} +``` + +where `"qkv_proj" `and `"o_proj" `are the FQNs of the `wqkv` and `wo` projections, and the values are their TP styles. + +We then define a TP engine in `model_parallel.py`. It searches for `_tp_plan `recursively within the model, and applies the indicated TP styles to the submodules using PyTorch’s [parallelize_module](https://pytorch.org/docs/stable/distributed.tensor.parallel.html#torch.distributed.tensor.parallel.parallelize_module) API. + + +## 5. Results + +The evaluation focused on two popular quantization techniques for H100 machines: int4 weight-only quantization and float8 dynamic quantization. These methods were chosen due to their widespread use in optimizing memory efficiency and computational performance on H100 machines, making them ideal candidates for benchmarking against various workloads. + + + +* **int4 Weight-Only Quantization**: This method significantly reduces memory footprint and accelerates decode for memory-bound workloads, with minimal impact on performance in compute-intensive scenarios like prefill or larger batch sizes. We present results for bf16, GemLite, and tinygemm kernels below, across various batch sizes and tensor parallel configurations +* **float8 Dynamic Quantization**: While offering less memory savings, this method often provides higher accuracy and balanced speedups for both memory-bound and compute-bound tasks. With Hopper-grade hardware and native fp8 support, the efficient cutlass/cuBLAS kernels used by AO contribute to a significant speedup + +The graphs below show the decode tokens/sec for different tp sizes, each graph shows the results across different batch sizes and for different types of quantization: + + + +* BF16 is our bfloat16, torch.compile’d baseline +* tinygemm-4-64 is using `int4_weight_only` quantization in TorchAO, it’s a 4 bit groupwise quantization with group size of 64, using tinygemm kernel +* gemlite-4-64 is using `gemlite_uintx_weight_only `quantization in TorchAO, 4 means 4 bit, and 64 is also the group size, using GemLite kernel +* fp8dq-per_row is using `float8_dynamic_activation_float8_weight` quantization in TorchAO, both activation and weights are quantized with per row scales + +![bar chart](/assets/images/accelerating-llm-inference/fg2.png){:style="width:100%"} + +![bar chart](/assets/images/accelerating-llm-inference/fg3.png){:style="width:100%"} + +![bar chart](/assets/images/accelerating-llm-inference/fg4.png){:style="width:100%"} + + +For int4 weight-only quantization, at batch size 1, the tinygemm kernel achieved the best performance. However, its efficiency declined with increasing batch sizes. Conversely, GemLite effectively bridged this gap, delivering superior performance at larger batch sizes. GemLite also achieved a 9–10x speedup during the prefill phase compared to tinygemm, despite ongoing performance optimizations constrained by Triton. 
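For reference, the int4 configurations above are applied through TorchAO's `quantize_` API. The snippet below is a sketch based on the API names used in this post; exact import paths and argument names may vary across TorchAO versions:

```
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int4_weight_only

model = nn.Sequential(nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()

# tinygemm-4-64: int4 weight-only quantization, group size 64, tinygemm kernel
quantize_(model, int4_weight_only(group_size=64))

# gemlite-4-64 instead uses the GemLite-backed config shown earlier in the post:
# quantize_(model, gemlite_uintx_weight_only(group_size, bit_width, packing_bitwidth))
# with group_size=64 and bit_width=4 (and float16 weights, per the repro instructions below)
```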
+ +Float8 dynamic quantization showed 1.3x speedup over bfloat16 consistently with tensor parallel size 1 across different batch sizes and 1.1x to 1.2x speedup in larger tensor parallel sizes. As the tensor parallel size increases, the overall speedup decreases, which is expected due to the reduction in matmul size. Note that we do expect to get speedup for prefill as well, but since we rely on torch.compile for speedup and prefill compile is not enabled in SGLang yet, we will leave this for future work. + + +### Repro Instructions {#repro-instructions} + +We conducted benchmarks on an 8xH100 machine using GemLite 0.4.1, SGLang built from commit feb2b76, TorchAO nightly 0.8.0.dev20241223+cu124, and PyTorch 2.5.1. The Llama-3.1 Instruct models were chosen as the architecture for evaluation. + +``` +BATCH_SIZE=16 +# Note: gemlite is only compatible with float16 +# while int4wo-64 (tinygemm-4-64 as shown in the graph) and fp8dq-per_row should use bfloat16 +DTYPE=float16 +# int4wo-64, fp8dq-per_tensor +TORCHAO_CONFIG=gemlite-4-64 +TP_SIZE=2 +# Decode performance +python3 -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --json-model-override-args '{"architectures": ["TorchNativeLlamaForCausalLM"]}' --dataset-name random --random-input 1024 --random-output 512 --random-range 1 --num-prompts $BATCH_SIZE --enable-torch-compile --dtype $DTYPE --torchao-config $TORCHAO_CONFIG --tp-size $TP_SIZE + +# Example output +# Benchmark... +# [2024-12-20 12:42:16 TP0] Prefill batch. #new-seq: 2, #new-token: 2046, #cached-token: 4, cache hit rate: \0.06%, token usage: 0.00, #running-req: 0, #queue-req: 0 +# ... +# [2024-12-20 12:45:35 TP0] Decode batch. #running-req: 16, #token: 16763, token usage: 0.01, gen throughput\ (token/s): 2.20, #queue-req: 0 +# [2024-12-20 12:45:38 TP0] Decode batch. #running-req: 16, #token: 24443, token usage: 0.02, gen throughput\ (token/s): 2739.89, #queue-req: 0 + +# We reported the last throughput (token/s) as the performance for decode +``` + +## Conclusion + +With performant and extensible kernels from [GemLite](https://github.com/mobiusml/gemlite), PyTorch native architecture optimization library [TorchAO](https://github.com/pytorch/ao) and high performance inference framework [SGLang](https://github.com/sgl-project/sglang), we showcased fast end-to-end quantized inference for both int4 and float8 across different batch sizes and tensor parallel sizes with simple and composable user APIs to reduce the resource requirement for LLMs. This integration is our first step towards meeting the needs of fast inference across different models, workloads, precisions and hardwares and we are looking forward to continuing advancing the state of the art for end to end mixed and low precision LLM inference. 
+ +Our immediate future work focuses on the following: + + + +* Exploring diverse combinations of weight and activation quantization to strike the best balance between speed and accuracy +* Extending support to additional GPU architectures to broaden accessibility +* Enhancing compatibility with MoE models to address growing demands in scalable inference +* Allow for easy integration of fast custom kernels in TorchAO so that they can be easily leveraged by SGLang and other inference frameworks +* While we didn’t measure accuracy impact in this blogpost, we can develop auto quantization tool in TorchAO to allow users to trade off between performance and accuracy +* Better integration with tensor parallelism in SGLang to support running larger models +* Enable torch.compile for prefill phase in SGLang + +We also invite the community to actively test, provide feedback, and contribute to shaping the future of fast and efficient LLM inference. \ No newline at end of file diff --git a/_posts/2025-01-22-bringing-the-pytorch-community-together.md b/_posts/2025-01-22-bringing-the-pytorch-community-together.md new file mode 100644 index 000000000000..41b8fe2a8562 --- /dev/null +++ b/_posts/2025-01-22-bringing-the-pytorch-community-together.md @@ -0,0 +1,134 @@ +--- +layout: blog_detail +title: "Bringing the PyTorch Community Together" +author: "Team PyTorch" +hidden: true +--- + +As we step into a new year, it’s a great moment to reflect on the incredible community events that made 2024 a memorable year for the PyTorch Foundation. Global meetups, events, and conferences brought the community together to learn, connect, and grow. Here’s a quick recap of the year’s highlights and what to expect in 2025\. + +![PyTorch Seattle Meetup (May 23)](/assets/images/community-events-recap/fg5.jpg){:style="width:100%"} + +**PyTorch Seattle Meetup (May 23\)** + +We hosted a PyTorch Meetup in Seattle in May at the Meta Bellevue Office where Meta, Microsoft, and Google gave technical talks and about 60 attendees participated in discussion and networking. + +**PyTorch Docathon 2024 (June 4-20)** + +The PyTorch Docathon returned for its third edition, spanning over two weeks in June. This unique event focused on improving PyTorch’s documentation with contributions from community members worldwide. Documentation is the backbone of any successful open source project, and PyTorch’s Docathon fostered inclusivity and collaboration, making it easier for new users to adopt the framework and for experienced developers to maximize its potential. The 2024 Docathon resulted in more than 50 merged pull requests and was a testament to the collaborative spirit of the PyTorch community and its commitment to enhancing accessibility and usability. Watch the [PyTorch Docathon Kickoff](https://youtu.be/2D0aej50umA?feature=shared) on YouTube. + +![PyTorch Shanghai Meetup (August 15)](/assets/images/community-events-recap/fg3.png){:style="width:100%"} + +#### **PyTorch Shanghai Meetup (August 15\)** + +In August, the [PyTorch Shanghai Meetup](https://pytorch.org/blog/pytorch-shanghai-notes/) brought together developers, researchers, and enthusiasts in Shanghai, China. This event served as a platform for knowledge sharing, with engaging talks and networking opportunities. Highlights from the agenda included insights into PyTorch’s latest developments, community-led presentations showcasing innovative use cases, and networking sessions fostering collaboration among attendees. 
+ +![PyTorch Conference 2024 (September 18-19)](/assets/images/community-events-recap/fg1.jpg){:style="width:100%"} + +#### **PyTorch Conference 2024 (September 18-19)** + +The PyTorch Conference in San Francisco was undoubtedly one of the year’s most significant events. This two-day gathering brought together top-tier researchers, developers, and academic communities, fostering collaboration and innovation in machine learning. + +![What Made It Special](/assets/images/community-events-recap/fg6.jpeg){:style="width:100%"} + +#### **What Made It Special:** + +* Keynote speeches from industry leaders and PyTorch maintainers. +* In-depth sessions covering PyTorch’s end-to-end machine learning capabilities. +* Hands-on workshops and breakout sessions. +* A vibrant expo area showcasing cutting-edge tools and applications. +* Startup Showcase where early-stage founders pitched their AI startups to a panel of top venture capitalists. +* DL Compiler Mini-Summit that took a deep dive into the advances in deep learning (DL) compilers that are transforming AI workloads. +* Fine-Tuning Mini-Summit that covered everything from memory efficiency, parameter-efficient fine-tuning and quantization to performance at scale and reproducible evaluations. +* Poster Session showcasing innovations in PyTorch, including model optimization, hardware integration, generative AI, quantization, and tools for enhanced performance and usability, with contributions from industry leaders. + +The conference’s focus on fostering collaboration underscored PyTorch’s role as a driving force in the open source ML community. Missed out? You can watch the [PyTorch Conference 2024 Playlist](https://youtube.com/playlist?list=PL_lsbAsL_o2B_znuvm-pDtV_cRhpqZb8l&si=mdoSkqMJYKRlzxlg) to catch any sessions you might have missed. + +![GPU MODE IRL Hackathon (September 21)](/assets/images/community-events-recap/fg4.jpg){:style="width:100%"} + +#### **GPU MODE IRL Hackathon (September 21\)** + +PyTorch sponsored this meetup in person in San Francisco where attendees made friends, watched keynotes, hacked all day, took breaks with afternoon talks, and then hacked all night. We heard about torchao, our new quantization and sparsity library, vLLM which deploys PyTorch models in production, llm.c, and more. Key takeaways included: GPU Mode IRL Hackathon 1st place winner was inspired by PyTorch FlexAttention to improve CUTLASS, NCCL in Triton would help us do distributed programming with a minimal NCCL reimplementation in pure Python, No libtorch pytorch binaries dramatically reduces binary sizes for on device deployments. + +![Consumer AI Edge Hackathon (November 22-23)](/assets/images/community-events-recap/fg8.png){:style="width:100%"} + +#### **Consumer AI Edge Hackathon (November 22-23)** + +The PyTorch team served as mentors and coaches in a Hackathon in Paris, co-sponsored by Hugging Face, Scaleway, and Entrepreneur First, challenging teams to create innovative consumer (B2C) applications leveraging Hugging Face, PyTorch and other open source on-device tools and models. 120+ people across 22 teams hacked for 2 days (and nights\!) building the future of AI-powered on-device solutions based on open source models and tools. Participants created innovative applications, powered by PyTorch, [ExecuTorch](https://github.com/pytorch/executorch/tree/main) and Hugging Face resources, such as an on-device yoga coach, a magical storytelling companion and a Kinect-like experience to mobile phones. 
The PyTorch team is planning similar events in other geographies in 2025 around innovative on-device AI applications. + +![PyTorch Korea User Group Meetup (November 30)](/assets/images/community-events-recap/fg9.png){:style="width:100%"} + +#### **PyTorch Korea User Group Meetup (November 30\)** + +The PyTorch Korea User Group, founded in 2018, is a community dedicated to introducing PyTorch to Korean-speaking users and growing together. The group began by translating PyTorch 0.3 tutorials into Korean and has since supported PyTorch's growth in Korea. The group focuses on three primary activities: + +1. Sharing knowledge for PyTorch learning and application, +2. Sharing insights and experiences in the field of artificial intelligence, and +3. Fostering growth through online and offline networking. + +The PyTorch Korea User Group reaches tens of thousands of Korean AI developers every month. If you're interested in their activities, check out these links: + +* [PyTorch Korea User Group](https://pytorch.kr) +* [PyTorch Korean Tutorials](https://tutorials.pytorch.kr) +* [PyTorch Korean Community](https://discuss.pytorch.kr) +* [GitHub Repository](https://github.com/PyTorchKorea) +* [YouTube Channel](https://youtube.com/@pytorchkr) + +![PyTorch Korea User Group 2025 Events Overview](/assets/images/community-events-recap/fg2.jpeg){:style="width:100%"} + +The PyTorch Korea User Group has planned three major activities for the year: + +1. **PyTorch CoreSIG** + Since December 2024, this weekly online event has been held every Wednesday afternoon. Led by Kim Hong-Seok, CSO of Rebellions (a PyTorch member company), it provides in-depth knowledge and experience regarding PyTorch internals. Approximately 150 Korean developers participate weekly, reflecting growing interest in PyTorch Core development in Korea. +2. **Offline Meetup** + These meetups provide opportunities to share insights and experiences in PyTorch and artificial intelligence, along with networking. Around 3–4 sessions are planned for this year, focusing on key topics in PyTorch and AI. +3. **Online Community Engagement** + This activity involves sharing and discussing various projects and papers in the AI field. For more information, visit: [https://discuss.pytorch.kr](https://discuss.pytorch.kr). + +#### **Open Source AI Night at NeurIPS 2024 (December 10\)** + +The PyTorch Foundation co-hosted a social event at NeurIPS along with The Fin AI and Open Finance Foundation that featured engaging discussions on open source AI and applications in finance. 
+ +![PyTorch Webinars](/assets/images/community-events-recap/fg7.jpeg){:style="width:100%"} + +**PyTorch Webinars** + +Throughout 2024, PyTorch hosted the following virtual webinars: + +Expert Exchanges: + +* [How does batching work on modern CPUs?](https://www.youtube.com/live/HTcnp9NEHGY?feature=shared) +* [DistServe: disaggregating prefill and decoding for goodput-optimized LLM inference](https://www.youtube.com/live/Bh-jlh5vlF0?feature=shared) +* [Efficient Streaming Language Models with Attention Sinks](https://www.youtube.com/live/RnM84Sv9WpA?feature=shared) +* [Adapting open source models with Open-Instruct and Tulu](https://www.youtube.com/live/e1qUJFAo10s?feature=shared) +* [Efficient Generative Models: From Sparse to Distributed Inference](https://www.youtube.com/live/Eqg0VIiWrgM?feature=shared) + +Summer Series: + +* [Using PyTorch for Monocular Depth Estimation Webinar](https://youtu.be/xf2QgioY370?feature=shared) +* [Accelerating LLM family of models on Arm Neoverse based Graviton AWS processors with KleidiAI](https://youtu.be/NeHIhQWewug?feature=shared) +* [torch.compile: The Missing Manual](https://www.youtube.com/live/rew5CSUaIXg?feature=shared) + +Release Live Q&As: + +* [PyTorch 2.4: Live Q&A](https://www.youtube.com/live/ry_QgUIYX1E?feature=shared) +* [PyTorch 2.5 Live Q&A](https://www.youtube.com/live/B3IgXpl4xt4?feature=shared) + +Live Webinars: + +* [PyTorch Documentary Virtual Premiere](https://www.youtube.com/watch?v=EjgTv6aSeqk) +* [Using PyTorch to Help Predict Wildfires](https://www.youtube.com/watch?v=gSC_IHyx0IM) +* [Seismic Data to Subsurface Models with OpenFWI: Training an AI Model with PyTorch](https://www.youtube.com/watch?v=zvk3Rr-OjU0) +* [Dinosaur Bone Hunting with Intel AI](https://www.youtube.com/watch?v=w4JmPkqnD0E) + +Each of these events underscored the importance of collaboration and community engagement in advancing AI research and applications. Thank you to everyone who participated, organized, and supported these events—your contributions make all the difference\! + +--- + +### **Looking Ahead** + +2024 was packed with opportunities to connect, learn, and contribute, and there will be even more ways to connect with the PyTorch community in 2025\. + +Mark your calendar\! The [PyTorch Conference](https://events.linuxfoundation.org/pytorch-conference-2025/) is returning to San Francisco on October 22-23, 2025\. Get ready for an exciting event filled with technical deep dives, exciting announcements, insightful sessions, and enhanced opportunities for community collaboration. + +Stay tuned for more upcoming events and opportunities to get involved by [subscribing to our newsletter](https://pytorch.org/newsletter). 
\ No newline at end of file diff --git a/_posts/2025-01-24-how-intel-uses-pytorch-to-empower-generative-ai-through-intel-arc-gpus.md b/_posts/2025-01-24-how-intel-uses-pytorch-to-empower-generative-ai-through-intel-arc-gpus.md new file mode 100644 index 000000000000..00241593ecf9 --- /dev/null +++ b/_posts/2025-01-24-how-intel-uses-pytorch-to-empower-generative-ai-through-intel-arc-gpus.md @@ -0,0 +1,43 @@ +--- +layout: blog_detail +title: "How Intel Uses PyTorch to Empower Generative AI through Intel Arc GPUs" +author: "Team PyTorch" +--- + +Intel has long been at the forefront of technological innovation, and its recent venture into Generative AI (GenAI) solutions is no exception. With the rise of AI-powered gaming experiences, Intel sought to deliver an accessible and intuitive GenAI inferencing solution tailored for AI PCs powered by Intel’s latest GPUs. By leveraging PyTorch as the backbone for development efforts, Intel successfully launched AI Playground, an open source application that showcases advanced GenAI workloads. + +**The Business Challenge** + +Our goal was to deliver an accessible and intuitive GenAI inferencing solution tailored for AI PCs powered by Intel. We recognized the need to showcase the capabilities of the latest GenAI workloads on our newest line of client GPUs. To address this, we developed a starter application, [AI Playground](https://github.com/intel/ai-playground), which is open source and includes a comprehensive developer reference sample available on GitHub using PyTorch. This application seamlessly integrates image generation, image enhancement, and chatbot functionalities, using retrieval-augmented generation (RAG) features, all within a single, user-friendly installation package. This initiative not only demonstrates the functionality of these AI workloads but also serves as an educational resource for the ecosystem, guiding developers on effectively leveraging the [Intel® Arc™ GPU](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/arc.html) product line for advanced AI applications. This solution leverages Intel® Arc™ Xe Cores and [Xe Matrix Extensions (XMX)](https://www.intel.com/content/www/us/en/support/articles/000091112/graphics.html) for accelerating inferencing. + +![AI Playground](/assets/images/intel-case-study/fg1.png){:style="width:100%"} + +**How Intel Used PyTorch** + +PyTorch is the core AI framework for AI Playground. We extensively leverage PyTorch's eager mode, which aligns perfectly with the dynamic and iterative nature of our generative models. This approach not only enhances our development workflow but also enables us to rapidly prototype and iterate on advanced AI features. By harnessing PyTorch’s powerful capabilities, we have created a robust reference sample that showcases the potential of GenAI on Intel GPUs in one cohesive application. + +**Solving AI Challenges with PyTorch** + +PyTorch has been instrumental in addressing our AI challenges by providing a robust training and inference framework optimized for discrete and integrated Intel Arc GPU product lines. Choosing PyTorch over alternative frameworks or APIs was crucial. Other options would have necessitated additional custom development or one-off solutions, which could have significantly slowed our time to market and limited our feature set. 
With PyTorch, we leveraged its flexibility and ease of use, allowing our team to focus on innovation through experimentation, rather than infrastructure. The integration of [Intel® Extension for PyTorch](https://www.intel.com/content/www/us/en/developer/tools/oneapi/optimization-for-pytorch.html#gs.j6azz7) further enhanced performance by optimizing computational efficiency and enabling seamless scaling on Intel hardware, ensuring that our application ran faster and more efficiently. + +**A Word from Intel** + +*With PyTorch as the backbone of our AI Playground project, we achieved rapid development cycles that significantly accelerated our time to market. This flexibility enabled us to iteratively enhance features and effectively align with the commitments of our hardware launches in 2024\.* + +*\-Bob Duffy, AI Playground Product Manager* + +![PyTorch Case Stidu](/assets/images/intel-case-study/fg2.png){:style="width:100%"} + +**The Benefits of Using PyTorch** + +The biggest benefit of using PyTorch for us is the large PyTorch ecosystem, which connects us with an active and cooperative community of developers. This collaboration has facilitated the seamless deployment of key features from existing open source projects, allowing us to integrate the latest GenAI capabilities into AI Playground. Remarkably, we accomplished this with minimal re-coding, ensuring that these advanced features are readily accessible on Intel Arc GPUs. + +**Learn More** + +For more information about Intel’s AI Playground and collaboration with PyTorch, visit the following links: + +* [PyTorch Optimizations from Intel](https://www.intel.com/content/www/us/en/developer/tools/oneapi/optimization-for-pytorch.html#gs.j8h6mc) +* [AI Playground GitHub](https://github.com/intel/ai-playground) +* [AI Playground](https://intel.com/ai-playground) +* [AI Playground Deep Dive Video](https://youtu.be/cYPZye1MC6U) +* [Intel GPU Support Now Available in PyTorch 2.5](https://pytorch.org/blog/intel-gpu-support-pytorch-2-5/) \ No newline at end of file diff --git a/_posts/2025-01-28-2025-priorities-for-tac.md b/_posts/2025-01-28-2025-priorities-for-tac.md new file mode 100644 index 000000000000..8e55be0b3338 --- /dev/null +++ b/_posts/2025-01-28-2025-priorities-for-tac.md @@ -0,0 +1,25 @@ +--- +layout: blog_detail +title: "2025 Priorities for the PyTorch Technical Advisory Council (TAC)" +author: "Luca Antiga, PyTorch TAC Chair" +--- + +![social share](/assets/images/1738166706211.jpg){:style="max-width:600px; width:100%; display: block; margin-left: auto; margin-right: auto"} + + +[2024 has been a year of incredible growth for PyTorch](https://pytorch.org/blog/2024-year-in-review/). As that continues in 2025, the PyTorch Foundation has made important steps towards evolving the governance of the project under the Linux Foundation’s vendor-neutral umbrella. + +An important piece of governance for PyTorch is represented by the Technical Advisory Council (TAC). The TAC acts as a bridge between the industry, including but not limited to the PyTorch Foundation members, the community, and the PyTorch core development team. + +Operating with transparency and inclusivity, the TAC gathers input, facilitates collaboration, and drives initiatives that enhance the experience for everyone who relies on PyTorch. + +In 2025, the TAC will focus on four key areas: + +1. 
**Build Open, Multi-Cloud Continuous Integration (CI):** Building on the groundwork from 2024, the TAC will oversee the transition to an open, community-driven CI infrastructure. In addition to ensuring the extremely high bar for correctness that PyTorch has, PyTorch’s CI is complex with a high-quality bar including many automated functional and performance daily test runs. In 2025, PyTorch’s CI infrastructure will be fully open sourced and extended to support multiple compute providers, enabling broader contribution and participation to the effort from organizations benefitting from PyTorch. +2. **Support more Accelerators:** The TAC is committed to creating a level playing field for the growing landscape of AI accelerators. By gathering industry players and PyTorch developers, the TAC will facilitate efforts towards third-party device support and provide levels of integration of external CI systems with the main PyTorch CI. This will make it easier for emerging hardware to gain adoption within the PyTorch ecosystem, and for users to experiment with diverse compute options for training and inference. +3. **Create a High-Quality, User-Centric Ecosystem:** A big focus for the TAC in early 2025 is on improving the experience and discoverability of the PyTorch ecosystem. With many projects growing organically, users often face challenges navigating projects of different scope and quality within the rapidly changing AI landscape. To solve this, a newly curated ecosystem landscape tool will be launched soon on the PyTorch website. We will also introduce lightweight, open processes to improve projects and ensure users a predictable, high-quality experience. In many ways, the experience with PyTorch is as good as its ecosystem. +4. **Gather Feedback from Industry and the Community:** PyTorch has widespread adoption across research labs, startups, and enterprises. Striking the right balance between expressiveness and performance across the board is a very challenging task, so the TAC set out to be one of the several ways the Core development team receives signals. During our monthly TAC meetings, we provide the opportunity to PyTorch Foundation members from industry and academia, as well as non-member organizations to present their use case, their challenges and discuss them directly with appropriate members of the Core team. This feedback loop helps prioritize improvements, ensuring the framework stays relevant in a fast-evolving AI landscape. + +By focusing on these priorities, the TAC aims to maintain PyTorch’s position as the leading deep learning framework, while ensuring it remains open, accessible, and responsive to the needs of its diverse community. + +As members of the TAC, we’re extremely excited to contribute to the success of PyTorch and to the impact it’s having in the real world. If you are a PyTorch user or developer, consider [participating in our monthly calls](https://zoom-lfx.platform.linuxfoundation.org/meetings/pytorch?__hstc=132719121.a26416c161ac91bef494ffc19f91a62e.1723036593114.1738082449904.1738088158683.375&__hssc=132719121.1.1738088158683&__hsfp=810579359) (they are open to everyone, and the recordings are available [here](https://lists.pytorch.org/g/tac)). Also, if you develop or maintain a project based on PyTorch, consider contributing it to the new PyTorch ecosystem ([instructions](https://github.com/pytorch-fdn/ecosystem)). 
\ No newline at end of file diff --git a/_posts/2025-01-29-pytorch2-6.md b/_posts/2025-01-29-pytorch2-6.md new file mode 100644 index 000000000000..6ccac080294b --- /dev/null +++ b/_posts/2025-01-29-pytorch2-6.md @@ -0,0 +1,146 @@ +--- +layout: blog_detail +title: "PyTorch 2.6 Release Blog" +--- + +We are excited to announce the release of PyTorch® 2.6 ([release notes](https://github.com/pytorch/pytorch/releases/tag/v2.6.0))! This release features multiple improvements for PT2: `torch.compile` can now be used with Python 3.13; new performance-related knob `torch.compiler.set_stance`; several AOTInductor enhancements. Besides the PT2 improvements, another highlight is FP16 support on X86 CPUs. + +NOTE: Starting with this release we are not going to publish on Conda, please see [[Announcement] Deprecating PyTorch’s official Anaconda channel](https://github.com/pytorch/pytorch/issues/138506) for the details. + +For this release the experimental Linux binaries shipped with CUDA 12.6.3 (as well as Linux Aarch64, Linux ROCm 6.2.4, and Linux XPU binaries) are built with CXX11_ABI=1 and are [using the Manylinux 2.28 build platform](https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581). If you build PyTorch extensions with custom C++ or CUDA extensions, please update these builds to use CXX_ABI=1 as well and report any issues you are seeing. For the next PyTorch 2.7 release we plan to switch all Linux builds to Manylinux 2.28 and CXX11_ABI=1, please see [[RFC] PyTorch next wheel build platform: manylinux-2.28](https://github.com/pytorch/pytorch/issues/123649) for the details and discussion. + +Also in this release as an important security improvement measure we have changed the default value for `weights_only` parameter of `torch.load`. This is a backward compatibility-breaking change, please see [this forum post](https://dev-discuss.pytorch.org/t/bc-breaking-change-torch-load-is-being-flipped-to-use-weights-only-true-by-default-in-the-nightlies-after-137602/2573) for more details. + +This release is composed of 3892 commits from 520 contributors since PyTorch 2.5. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve PyTorch. More information about how to get started with the PyTorch 2-series can be found at our [Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
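As a minimal illustration of the `weights_only` change described above (a sketch; see the linked forum post for the full guidance):

```
import torch

# PyTorch 2.6 flips the default to weights_only=True: only tensors, state_dicts
# and other allowlisted types are unpickled.
state_dict = torch.load("checkpoint.pt")

# Loading arbitrary pickled objects now requires opting out explicitly, and should
# only be done for checkpoints from sources you fully trust.
legacy_obj = torch.load("checkpoint.pt", weights_only=False)
```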
| Beta | Prototype |
| --- | --- |
| torch.compiler.set_stance | Improved PyTorch user experience on Intel GPUs |
| torch.library.triton_op | FlexAttention support on X86 CPU for LLMs |
| torch.compile support for Python 3.13 | Dim.AUTO |
| New packaging APIs for AOTInductor | CUTLASS and CK GEMM/CONV Backends for AOTInductor |
| AOTInductor: minifier | |
| AOTInductor: ABI-compatible mode code generation | |
| FP16 support for X86 CPUs | |
    + + +*To see a full list of public feature submissions click [here](https://docs.google.com/spreadsheets/d/1TzGkWuUMF1yTe88adz1dt2mzbIsZLd3PBasy588VWgk/edit?usp=sharing). + + +## BETA FEATURES + + +### [Beta] torch.compiler.set_stance + +This feature enables the user to specify different behaviors (“stances”) that `torch.compile` can take between different invocations of compiled functions. One of the stances, for example, is + +“eager_on_recompile”, that instructs PyTorch to code eagerly when a recompile is necessary, reusing cached compiled code when possible. + +For more information please refer to the [set_stance documentation](https://pytorch.org/docs/2.6/generated/torch.compiler.set_stance.html#torch.compiler.set_stance) and the [Dynamic Compilation Control with torch.compiler.set_stance](https://pytorch.org/tutorials/recipes/torch_compiler_set_stance_tutorial.html) tutorial. + +### [Beta] torch.library.triton_op + +`torch.library.triton_op` offers a standard way of creating custom operators that are backed by user-defined triton kernels. + +When users turn user-defined triton kernels into custom operators, `torch.library.triton_op` allows `torch.compile` to peek into the implementation, enabling `torch.compile` to optimize the triton kernel inside it. + +For more information please refer to the [triton_op documentation](https://pytorch.org/docs/2.6/library.html#torch.library.triton_op) and the[ Using User-Defined Triton Kernels with torch.compile](https://pytorch.org/tutorials/recipes/torch_compile_user_defined_triton_kernel_tutorial.html) tutorial. + +### [Beta] torch.compile support for Python 3.13 + +`torch.compile` previously only supported Python up to version 3.12. Users can now optimize models with `torch.compile` in Python 3.13. + +### [Beta] New packaging APIs for AOTInductor + +A new package format, “[PT2 archive](https://docs.google.com/document/d/1RQ4cmywilnFUT1VE-4oTGxwXdc8vowCSZsrRgo3wFA8/edit?usp=sharing)”, has been introduced. This essentially contains a zipfile of all the files that need to be used by AOTInductor, and allows users to send everything needed to other environments. There is also functionality to package multiple models into one artifact, and to store additional metadata inside of the package. + +For more details please see the updated [torch.export AOTInductor Tutorial for Python runtime](https://pytorch.org/tutorials/recipes/torch_export_aoti_python.html). + +### [Beta] AOTInductor: minifier + +If a user encounters an error while using AOTInductor APIs, AOTInductor Minifier allows creation of a minimal nn.Module that reproduces the error. + +For more information please see the [AOTInductor Minifier documentation](https://pytorch.org/docs/2.6/torch.compiler_aot_inductor_minifier.html). + +### [Beta] AOTInductor: ABI-compatible mode code generation + +AOTInductor-generated model code has dependency on Pytorch cpp libraries. As Pytorch evolves quickly, it’s important to make sure previously AOTInductor compiled models can continue to run on newer Pytorch versions, i.e. AOTInductor is backward compatible. + +In order to guarantee application binary interface (ABI) backward compatibility, we have carefully defined a set of stable C interfaces in libtorch and make sure AOTInductor generates code that only refers to the specific set of APIs and nothing else in libtorch. 
We will keep the set of C APIs stable across Pytorch versions and thus provide backward compatibility guarantees for AOTInductor-compiled models. + +### [Beta] FP16 support for X86 CPUs (both eager and Inductor modes) + +Float16 datatype is commonly used for reduced memory usage and faster computation in AI inference and training. CPUs like the recently launched [Intel® Xeon® 6 with P-Cores](https://www.intel.com/content/www/us/en/products/details/processors/xeon/xeon6-p-cores.html) support Float16 datatype with native accelerator [AMX](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html). Float16 support on X86 CPUs was introduced in PyTorch 2.5 as a prototype feature, and now it has been further improved for both eager mode and Torch.compile + Inductor mode, making it Beta level feature with both functionality and performance verified with a broad scope of workloads. + + +## PROTOTYPE FEATURES + +### [Prototype] Improved PyTorch user experience on Intel GPUs + +PyTorch user experience on Intel GPUs is further improved with simplified installation steps, Windows release binary distribution and expanded coverage of supported GPU models including the latest Intel® Arc™ B-Series discrete graphics. Application developers and researchers seeking to fine-tune, inference and develop with PyTorch models on [Intel® Core™ Ultra AI PCs ](https://www.intel.com/content/www/us/en/products/docs/processors/core-ultra/ai-pc.html)and [Intel® Arc™ discrete graphics](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/arc.html) will now be able to directly install PyTorch with binary releases for Windows, Linux and Windows Subsystem for Linux 2. + + + +* Simplified Intel GPU software stack setup to enable one-click installation of the torch-xpu PIP wheels to run deep learning workloads in an out of the box fashion, eliminating the complexity of installing and activating Intel GPU development software bundles. +* Windows binary releases for torch core, torchvision and torchaudio have been made available for Intel GPUs, and the supported GPU models have been expanded from Intel® Core™ Ultra Processors with Intel® Arc™ Graphics, [Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics](https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html) and [Intel® Arc™ A-Series Graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/a-series/overview.html) to the latest GPU hardware [Intel® Arc™ B-Series graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/b-series/overview.html). +* Further enhanced coverage of Aten operators on Intel GPUs with SYCL* kernels for smooth eager mode execution, as well as bug fixes and performance optimizations for torch.compile on Intel GPUs. + +For more information regarding Intel GPU support, please refer to [Getting Started Guide](https://pytorch.org/docs/main/notes/get_start_xpu.html). + +### [Prototype] FlexAttention support on X86 CPU for LLMs + +FlexAttention was initially introduced in PyTorch 2.5 to provide optimized implementations for Attention variants with a flexible API. In PyTorch 2.6, X86 CPU support for FlexAttention was added through TorchInductor CPP backend. 
This new feature leverages and extends current CPP template abilities to support broad attention variants (e.x.: PageAttention, which is critical for LLMs inference) based on the existing FlexAttention API, and brings optimized performance on x86 CPUs. With this feature, it’s easy to use FlexAttention API to compose Attention solutions on CPU platforms and achieve good performance. + +### [Prototype] Dim.AUTO + +`Dim.AUTO` allows usage of automatic dynamic shapes with `torch.export`. Users can export with `Dim.AUTO `and “discover” the dynamic behavior of their models, with min/max ranges, relations between dimensions, and static/dynamic behavior being automatically inferred. + +This is a more user-friendly experience compared to the existing named-Dims approach for specifying dynamic shapes, which requires the user to fully understand the dynamic behavior of their models at export time. `Dim.AUTO` allows users to write generic code that isn’t model-dependent, increasing ease-of-use for exporting with dynamic shapes. + +Please see [torch.export tutorial](https://pytorch.org/tutorials/intermediate/torch_export_tutorial.html#constraints-dynamic-shapes) for more information. + +### [Prototype] CUTLASS and CK GEMM/CONV Backends for AOTInductor + +The CUTLASS and CK backend adds kernel choices for GEMM autotuning in Inductor. This is now also available in AOTInductor which can run in C++ runtime environments. A major improvement to the two backends is improved compile-time speed by eliminating redundant kernel binary compilations and dynamic shapes support. \ No newline at end of file diff --git a/_posts/2025-02-05-warp-specialization.md b/_posts/2025-02-05-warp-specialization.md new file mode 100644 index 000000000000..098e1f6261fe --- /dev/null +++ b/_posts/2025-02-05-warp-specialization.md @@ -0,0 +1,112 @@ +--- +layout: blog_detail +title: "Enabling advanced GPU features in PyTorch - Warp Specialization" +author: "Meta and NVIDIA" +--- + +**Meta**: Hongtao Yu, Manman Ren, Bert Maher, Shane Nay +**NVIDIA**: Gustav Zhu, Shuhao Jiang + +Over the past few months, we have been working on enabling advanced GPU features for PyTorch and Triton users through the Triton compiler. One of our key goals has been to introduce warp specialization support on NVIDIA Hopper GPUs. Today, we are thrilled to announce that our efforts have resulted in the rollout of fully automated Triton warp specialization, now available to users in the upcoming release of Triton [3.2](https://github.com/triton-lang/triton/tree/release/3.2.x), which will ship with PyTorch 2.6. PyTorch users can leverage this feature by [implementing user-defined Triton kernels](https://pytorch.org/tutorials/recipes/torch_compile_user_defined_triton_kernel_tutorial.html). This work leveraged an initial implementation of warp specialization in Triton by NVIDIA and we look forward to further development with the community in the future. + +Warp specialization (WS) is a GPU programming technique where warps (a group of 32 threads on NVIDIA GPUs) within a threadblock are assigned distinct roles or tasks. This approach optimizes performance by enabling efficient execution of workloads that require task differentiation or cooperative processing. It enhances kernel performance by leveraging an asynchronous execution model, where different parts of the kernel are managed by separate hardware units. 
Data communication between these units, facilitated via shared memory on the NVIDIA H100, is highly efficient. Compared to a uniform warp approach, warp specialization allows the hardware multitasking warp scheduler to operate more effectively, maximizing resource utilization and overall performance. + +Using GEMM as an example, a typical uniform warp approach on the H100 GPU involves 8 warps per thread block collectively computing a tile of the output tensor. These 8 warps are divided into two warp groups (WG), with each group cooperatively computing half of the tile using efficient warp-group-level MMA (WGMMA) instructions, as illustrated in Figure 1. + + +![Figure 1. GEMM K-loop Body with Uniform Warps](/assets/images/warp-specialization/fg1.jpg){:style="width:100%"} + +Figure 1. GEMM K-loop Body with Uniform Warps + +The implementation is clean, easy to understand, and generally performs well, thanks to an elegant software pipeliner. The pipeliner's purpose is to enhance instruction-level parallelism by executing non-dependent operations on different hardware units. For instance, load operations from the next loop iteration can be executed simultaneously with WGMMA operations in the current iteration. However, this approach relies heavily on the compiler to craft an instruction sequence that ensures load and WGMMA instructions are issued at precisely the right time. While this is relatively straightforward for GEMM, which involves a limited number of operations, it becomes significantly more challenging for more complex kernels, such as flash attention. + +On the other hand, warp specialization addresses programming challenges by separating operations intended to run simultaneously on different hardware units into distinct warps, synchronizing them efficiently using low-cost barriers in shared memory. This allows each warp to have its own instruction sequence, enabling instructions to be issued and executed continuously without being interrupted by other operations, thanks to the multi-way warp scheduler. An illustration of a warp-specialized GEMM can be seen in Figure 2. + + +![Figure 2. GEMM K-loop Body with Specialized Warps](/assets/images/warp-specialization/fg2.jpg){:style="width:100%"} + +Figure 2. GEMM K-loop Body with Specialized Warps + + +## How to enable WS + +To enable warp specialization, users simply need to specify two autotune flags: num_consumer_groups and num_buffers_warp_spec. For example, a warp-specialized GEMM implementation might look as shown below. Users can enable warp specialization by setting a non-zero value for num_consumer_groups, which defines the number of consumer warp groups. There is no corresponding flag to set the number of producer warp groups, as currently only one producer is supported. The num_buffers_warp_spec flag specifies the number of buffers the producer warp group will use to communicate with the consumer warp groups. A working example of a warp-specialized kernel is provided in the persistent GEMM [tutorial](https://github.com/triton-lang/triton/blob/6771065cb3137f7e64454cc047b9b74d577cbf7f/python/tutorials/09-persistent-matmul.py#L620). 
+ +``` +@triton.autotune( + configs=[ + triton.Config( + { + "BLOCK_SIZE_M": 128, + "BLOCK_SIZE_N": 256, + "BLOCK_SIZE_K": 64, + "GROUP_SIZE_M": 8, + }, + num_stages=2, + num_warps=4, + num_consumer_groups=2, + num_buffers_warp_spec=3, + ), + ], + key=["M", "N", "K"], +) +@triton.jit +def matmul_persistent_ws_kernel( + a_ptr, b_ptr, c_ptr, M, N, K, + stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn, + BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr, +): + pid = tl.program_id(axis=0) + num_pid_m = tl.cdiv(M, BLOCK_M) + num_pid_n = tl.cdiv(N, BLOCK_N) + pid_m = pid // num_pid_m + pid_n = pid % num_pid_n + offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M) + offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N) + offs_k = tl.arange(0, BLOCK_K) + a_ptrs = a_ptr + (offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak) + b_ptrs = b_ptr + (offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn) + acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32) + for k in range(0, tl.cdiv(K, BLOCK_K)): + a = tl.load(a_ptrs) + b = tl.load(b_ptrs) + acc += tl.dot(a, b) + a_ptrs += BLOCK_K * stride_ak + b_ptrs += BLOCK_K * stride_bk + c = acc.to(tl.float16) + c_ptrs = c_ptr + stride_cm * offs_m[:, None] + stride_cn * offs_n[None, :] + tl.store(c_ptrs, c) +``` + + +## Under the Hood + +Warp specialization uses a set of hierarchical compiler transformations and IR changes to translate a user's non-warp-specialized kernel into warp-specialized machine code. These include: + + + +* **Task Partitioning**: The entire kernel is automatically divided into asynchronous tasks based on predefined heuristics. The compiler determines how to utilize one producer warp group and a user-specified number of consumer warp groups to execute the kernel. It assigns task IDs to specific anchor operations, which then influence the task assignments for remaining operations through asynchronous task ID propagation and dependency analysis. Since shared memory is the most efficient method for data transfer between warp groups across all supported platforms, the compiler optimizes task partitions to minimize register spills to shared memory, ensuring efficient execution. +* **Data Partitioning for Multiple Consumer Groups**: Efficiently partitioning data among multiple consumer groups is key to optimizing workload distribution. On the H100 GPU, the compiler, by default, attempts to partition the input tensor `A` along the `M` dimension, allowing each consumer group to compute half of the output tensor independently. This strategy, known as [cooperative partitioning](https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md#warp-specialization), maximizes efficiency under most conditions. However, if this split leads to inefficiencies—such as producing a workload smaller than the native WGMMA instruction size—the compiler dynamically adjusts and partitions along the `N` dimension instead. +* **Dataflow Pipelining**: The compiler creates cyclic shared memory buffers to pipeline dataflows across multiple-dimensional loops. Warp-specialized pipelining supports complex control flow. For example, our warp-specialized persistent GEMM kernel uses a doubly-nested loop, allowing the producer to begin fetching data for the next output tile while the consumer is finishing the compute for the prior tile. 
+* **Communication Operations**`: `We introduced four high-level Triton GPU IR (TTGIR) communication operations`—ProducerAcquireOp, ProducerCommitOp, ConsumerWaitOp, `and` ConsumerReleaseOp—`to manage pipelined dataflows. These support both TMA and non-TMA memory operations. +* **Code Partitioning**: Each async task is outlined into its own standalone code region, guarded by warp group ID checks. Control dependencies are duplicated as needed. +* **TTGIR to LLVM/PTX Materialization**: TTGIR communication operations are materialized into corresponding LLVM/PTX barrier operations. + + +## Performance + +The [warp specialization release](https://github.com/triton-lang/triton/pull/5622) introduces a range of Triton compiler transformations that collectively convert user code into warp-specialized kernels. This feature has been applied to several key kernels, including Flash Attention and FP8 row-wise GEMM, resulting in significant performance gains of 10% to 15%. Below, we highlight the latest performance metrics for these high-impact kernels. + + +![bar chart](/assets/images/warp-specialization/fg3.png){:style="width:100%"} + + + + +![bar chart](/assets/images/warp-specialization/fg4.png){:style="width:100%"} + + + +## Future Work + +Looking ahead, we plan to further enhance Triton's warp specialization support by introducing new features such as Ping-Pong scheduling, expanded buffer sharing support, improved transparent handling for TMA, refined partitioning heuristics for upcoming NVIDIA hardware. \ No newline at end of file diff --git a/_posts/2025-02-11-unlocking-pt-2-6-intel.md b/_posts/2025-02-11-unlocking-pt-2-6-intel.md new file mode 100644 index 000000000000..2a0cb363e10f --- /dev/null +++ b/_posts/2025-02-11-unlocking-pt-2-6-intel.md @@ -0,0 +1,75 @@ +--- +layout: blog_detail +title: "Unlocking the Latest Features in PyTorch 2.6 for Intel Platforms" +author: "the Intel PyTorch Team" +--- + +[PyTorch* 2.6](https://pytorch.org/blog/pytorch2-6/) has just been released with a set of exciting new features including torch.compile compatibility with Python 3.13, new security and performance enhancements, and a change in the default parameter for torch.load. PyTorch also announced the deprecation of its official Anaconda channel. + +Among the performance features are three that enhance developer productivity on Intel platforms: + +1. Improved Intel GPU availability +2. FlexAttention optimization on x86 CPU for LLM +3. FP16 on x86 CPU support for eager and Inductor modes + +## Improved Intel GPU Availability + +To provide developers working in artificial intelligence (AI) with better support for Intel GPUs, the PyTorch user experience on these GPUs has been enhanced. This improvement includes simplified installation steps, a Windows* release binary distribution, and expanded coverage of supported GPU models, including the latest Intel® Arc™ B-Series discrete graphics. + +These new features help promote accelerated machine learning workflows within the PyTorch ecosystem, providing a consistent developer experience and support. 
Application developers and researchers seeking to fine-tune, perform inference, and develop with PyTorch models on [Intel® Core™ Ultra AI PCs ](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/arc.html) and [Intel® Arc™ discrete graphics](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/arc.html) will now be able to install PyTorch directly with binary releases for Windows, Linux*, and Windows Subsystem for Linux 2. + +The new features include: + +* Simplified Intel GPU software stack setup to enable one-click installation of the torch-xpu PIP wheels to run deep learning workloads in a ready-to-use fashion, thus eliminating the complexity of installing and activating Intel GPU development software bundles.  +* Windows binary releases for torch core, torchvision and torchaudio have been made available for Intel GPUs, expanding from [Intel® Core™ Ultra Series 2](https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html) with Intel® Arc™ Graphics and [Intel® Arc™ A-Series graphics ](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/a-series/overview.html)to the latest GPU hardware [Intel® Arc™ B-Series graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/b-series/overview.html) support.  +* Further enhanced coverage of Aten operators on Intel GPUs with SYCL* kernels for smooth eager mode execution, as well as bug fixes and performance optimizations for torch.compile on Intel GPUs.  + +Get a tour of new environment setup, PIP wheels installation, and examples on Intel® Client GPUs and Intel® Data Center GPU Max Series in the [Getting Started Guide](https://pytorch.org/docs/main/notes/get_start_xpu.html). + +## FlexAttention Optimization on X86 CPU for LLM + +FlexAttention was first introduced in [PyTorch 2.5](https://pytorch.org/blog/pytorch2-5/), to address the need to support various Attentions or even combinations of them. This PyTorch API leverages torch.compile to generate a fused FlashAttention kernel, which eliminates extra memory allocation and achieves performance comparable to handwritten implementations. + +Previously, FlexAttention was implemented for CUDA* devices based on the Triton backend. Since PyTorch 2.6, X86 CPU support of FlexAttention was added through TorchInductor CPP backend. This new feature leverages and extends current CPP template abilities to support broad attention variants (e.g., PageAttention, which is critical for LLMs inference) based on the existing FlexAttention API, and brings optimized performance on x86 CPUs. With this feature, user can easily use FlexAttention API to compose their Attention solutions on CPU platforms and achieve good performance. + +Typically, FlexAttention is utilized by popular LLM ecosystem projects, such as Hugging Face transformers and vLLM in their LLM related modeling (e.g., PagedAttention) to achieve better out-of-the-box performance. Before the official adoption happens, [this enabling PR](https://github.com/huggingface/transformers/pull/35419) in Hugging Face can help us the performance benefits that FlexAttention can bring on x86 CPU platforms. + +The graph below shows the performance comparison of PyTorch 2.6 (with this feature) and PyTorch 2.5 (without this feature) on typical Llama models. 
For real-time mode (Batch Size = 1), there is about 1.13x-1.42x performance improvement for next token across different input token lengths. As for best throughput under a typical SLA (P99 token latency <=50ms), PyTorch 2.6 achieves more than 7.83x performance than PyTorch 2.5 as PyTorch 2.6 can run at 8 inputs (Batch Size = 8) together and still keep SLA while PyTorch 2.5 can only run 1 input, because FlexAttention based PagedAttention in PyTorch 2.6 provides more efficiency during multiple batch size scenarios. + + +![Figure 1. Performance comparison of PyTorch 2.6 and PyTorch 2.5 on Typical Llama Models](/assets/images/unlocking-pt-2-6-intel.png){:style="width:100%"} + + +**Figure 1. Performance comparison of PyTorch 2.6 and PyTorch 2.5 on Typical Llama Models** + +## FP16 on X86 CPU Support for Eager and Inductor Modes + +Float16 is a commonly used reduced floating-point type that improves performance in neural network inference and training. CPUs like recently launched [Intel® Xeon® 6 with P-Cores](https://www.intel.com/content/www/us/en/products/details/processors/xeon/xeon6-p-cores.html) support Float16 datatype with native accelerator [AMX](https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/advanced-matrix-extensions/overview.html), which highly improves the Float16 performance. Float16 support on x86 CPU was first introduced in PyTorch 2.5 as a prototype feature. Now it has been further improved for both eager mode and Torch.compile + Inductor mode, which is pushed to Beta level for broader adoption. This helps the deployment on the CPU side without the need to modify the model weights when the model is pre-trained with mixed precision of Float16/Float32. On platforms that support AMX Float16 (i.e., the Intel® Xeon® 6 processors with P-cores), Float16 has the same pass rate as Bfloat16 across the typical PyTorch benchmark suites: TorchBench, Hugging Face, and Timms. It also shows good performance comparable to 16 bit datatype Bfloat16. + +## Summary + +In this blog, we discussed three features to enhance developer productivity on Intel platforms in PyTorch 2.6. These three features are designed to improve Intel GPU availability, optimize FlexAttention for x86 CPUs tailored for large language models (LLMs), and support FP16 on x86 CPUs in both eager and Inductor modes. Get [PyTorch 2.6](https://pytorch.org/) and try them for yourself or you can access PyTorch 2.6 on the [Intel® Tiber™ AI Cloud](https://ai.cloud.intel.com/) to take advantage of hosted notebooks that are optimized for Intel hardware and software. + +## Acknowledgements + +The release of PyTorch 2.6 is an exciting milestone for Intel platforms, and it would not have been possible without the deep collaboration and contributions from the community. We extend our heartfelt thanks to [Alban](https://github.com/albanD), [Andrey](https://github.com/atalman), [Bin](https://github.com/desertfire), [Jason](https://github.com/jansel), [Jerry](https://github.com/jerryzh168) and [Nikita](https://github.com/malfet) for sharing their invaluable ideas, meticulously reviewing PRs, and providing insightful feedback on RFCs. Their dedication has driven continuous improvements and pushed the ecosystem forward for Intel platforms. 
+ +## References + +* [FlexAttention in PyTorch](https://pytorch.org/blog/flexattention/) +* [PagedAttention Optimization](https://arxiv.org/abs/2309.06180) +* [Intel® Xeon® 6 with P-Cores](•%09https:/www.intel.com/content/www/us/en/products/details/processors/xeon/xeon6-p-cores.html) + +## Product and Performance Information + +Measurement on AWS EC2 m7i.metal-48xl using: 2x Intel® Xeon® Platinum 8488C, HT On, Turbo On, NUMA 2, Integrated Accelerators Available [used]: DLB [8], DSA [8], IAA[8], QAT[on CPU, 8], Total Memory 512GB (16x32GB DDR5 4800 MT/s [4400 MT/s]), BIOS Amazon EC2 1.0, microcode 0x2b000603, 1x Elastic Network Adapter (ENA) 1x Amazon Elastic Block Store 800G, Ubuntu 24.04.1 LTS 6.8.0-1018-aws Test by Intel on Jan 15th 2025. + +## Notices and Disclaimers + +Performance varies by use, configuration and other factors. Learn more on the Performance Index site. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates.  See backup for configuration details.  No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. + +Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. + +## AI disclaimer: + +AI features may require software purchase, subscription or enablement by a software or platform provider, or may have specific configuration or compatibility requirements. Details at [www.intel.com/AIPC](http://www.intel.com/AIPC). Results may vary. \ No newline at end of file diff --git a/_posts/2025-02-12-datathon-2025.md b/_posts/2025-02-12-datathon-2025.md new file mode 100644 index 000000000000..0f69fd074382 --- /dev/null +++ b/_posts/2025-02-12-datathon-2025.md @@ -0,0 +1,34 @@ +--- +layout: blog_detail +title: "Solve Real-Word AI Challenges with PyTorch at Datathon 2025: DataOrbit" +author: "Aakash Senthilnathan" +hidden: true +--- + +**We’re excited to have PyTorch sponsor [Datathon 2025: DataOrbit](https://dataorbit-2025.devpost.com/)**, a place where students can collaborate with a team to solve problems using real-world datasets! This event, hosted by Data Science UCSB in collaboration with Gaucho Sports Analytics and ACM@UCSB, will take place on **February 22–23rd, 2025 at UC Santa Barbara**, with the incredible opportunity to present your project to a panel of corporate and faculty judges – **including the executive director of Pytorch!** – for a chance to win prizes up to $3000. + + +![logo](/assets/images/datathon-2025.png){:style="max-width:700px; width:100%; display: block; margin-left: auto; margin-right: auto"} + +PyTorch’s versatility and power have made it an essential tool for tackling complex data problems in domains ranging from computer vision and natural language processing to time series analysis. At Datathon 2025: DataOrbit, participants will have the chance to leverage PyTorch’s dynamic framework, ease of use, and robust ecosystem to build innovative solutions. Whether you’re building machine learning models, experimenting with deep learning architectures, or applying PyTorch to solve real-world challenges, workshops and mentors will be available to help you dive deeper into its capabilities and accelerate your project’s success. 
+ +**Register Here:** [tinyurl.com/dataorbit25-reg](http://tinyurl.com/dataorbit25-reg) (Open until February 21st or until capacity is reached) + +Additional information regarding the timeline of events can be found on the registration form. + +About the Datathon + + + +* Open only to undergraduate students in the United States +* In-person events over 36 hours +* Teams sizes of 2-5 people +* 10 different prize tracks +* Workshops and office hours teaching essential data science tools and techniques +* Professional development workshops + networking opportunities with our sponsors +* All meals provided +* A fun time! + +*If you have a group you would like to work with, we require that every member register separately. If you do not have a group, we will have an opportunity at the beginning of the event to participate in an activity to form groups. Unfortunately, at this time we do not provide travel accommodations or lodging for participants.* + +*If you are interested in mentoring students virtually during the course of our datathon, or have any other questions contact us at datascience.ucsb@gmail.com.* \ No newline at end of file diff --git a/_posts/2025-02-19-optimize-llms.md b/_posts/2025-02-19-optimize-llms.md new file mode 100644 index 000000000000..b2dfec99bd0b --- /dev/null +++ b/_posts/2025-02-19-optimize-llms.md @@ -0,0 +1,176 @@ +--- +layout: blog_detail +title: "Optimize LLMs for Efficiency & Sustainability" +hidden: true +author: "Zach Lasiuk, Arm" +--- + +The rapid growth of large language model (LLM) applications is linked to rapid growth in energy demand. According to the International Energy Agency (IEA), data center electricity consumption is projected to roughly double by 2026 primarily driven by AI. This is due to the energy-intensive training requirements for massive LLMs – however, the increase in AI Inferencing workloads also plays a role. For example, compared with traditional search queries, a single AI inference can consume about [10x more energy](https://www.weforum.org/stories/2024/07/generative-ai-energy-emissions/). + +As developers, we directly affect how energy-intensive our AI solution is. There are technical decisions we can take to help make our AI solution more environmentally sustainable. Minimizing compute to deliver LLM solutions is not the only requirement for creating sustainable AI use. For example, systemic changes, such as policy interventions may be needed, but utilizing energy efficient solutions is an important factor and is an impactful intervention we can adopt right away. + +With that said, minimizing your LLM inference cloud compute requirements also leads to reducing your cloud bill and makes your app more energy efficient, creating a win-win situation. In this blog, we will take you through the steps to creating an LLM chatbot by optimizing and deploying a Llama 3.1 model on PyTorch, quantifying the computational efficiency benefits of specific architecture decisions. + + +## What will we evaluate? + +For this blog, our goal is to create an immersive fantasy storytelling app where users enter a fantasy world by chatting with a Generative AI. The first location is the land of Wicked, allowing people to role-play walking around the Emerald City and observe the sights and scenes in real-time. We’ll implement this via a chatbot and a custom system prompt. + +We will be evaluating LLM performance on CPUs. 
You can see the advantages of[ CPU vs GPU inference here](https://www.arm.com/resources/ebook/cpu-inference). In general, leveraging CPUs in the cloud for LLM inference is a great choice for models around 10B parameters or less like the Llama series. + +We will also be using Arm-based CPUs, specifically the AWS Graviton series. Based on studies,[ the Arm-based Graviton3 server can provide 67.6 percent lower workload carbon intensity built in](https://newsroom.arm.com/blog/aws-graviton-decarbonize-compute). While this study was based on a simulation, it is an excellent start to showing the possibilities for minimizing our app’s energy requirements. + +First, you’ll see how to run a simple LLM chatbot on PyTorch, then explore three techniques to optimize your application for computational efficiency: + +1. Model optimization: Utilizing 4-bit quantization and added KleidiAI kernels. +2. Shortcut optimization: Implementing a vector database to handle common queries. +3. Architecture optimization: Adopting a serverless architecture. + +Let’s get started. + + +## Run Llama-3.1 via PyTorch on AWS Graviton4 + +To maximize energy efficiency, we will only use the minimum server resources needed to support this LLM chatbot. For this [Llama-3.1 8-billion parameter model](https://huggingface.co/meta-llama/Llama-3.1-8B), 16 cores, 64GB RAM, and disk space of 50GB is required. We will use the r8g.4xlarge Graviton4 instance running Ubuntu 24.04, as it meets these specifications. + +Spin up this EC2 instance, connect to it, and start installing the requirements: + + +``` + sudo apt-get update + sudo apt install gcc g++ build-essential python3-pip python3-venv google-perftools -y +``` + + +Then install Torchchat, the library developed by the PyTorch team that enables running LLMs across devices: + + +``` + git clone https://github.com/pytorch/torchchat.git + cd torchchat + python3 -m venv .venv + source .venv/bin/activate + ./install/install_requirements.sh +``` + + +Next, install the Llama-3.1-8b model from Hugging Face through the CLI. You will first need to make a Hugging Face access token on your HF account. This will download the 16GB model to your instance, which may take a few minutes: + + +``` + pip install -U "huggingface_hub[cli]" + huggingface-cli login + + python torchchat.py export llama3.1 --output-dso-path exportedModels/llama3.1.so --device cpu --max-seq-length 1024 +``` + + +Now you are ready to run the LLM model, adding a system prompt to be a guiding storyteller in the land of Wicked: + + +``` + LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libtcmalloc.so.4 TORCHINDUCTOR_CPP_WRAPPER=1 TORCHINDUCTOR_FREEZING=1 OMP_NUM_THREADS=16 python torchchat.py generate llama3.1 --device cpu --chat +``` + + +Type ‘y’ to enter a system prompt and enter the following prompt: + + +*You are the guiding storyteller for a fantasy adventure application. Immerse users in the enchanting world of Wicked, guiding them through interactive, real-time experiences in the Emerald City. Describe vivid sights, dynamic scenes, and engage users in storytelling that feels alive and responsive. Allow users to make choices that shape their journey while maintaining the magical tone of the Wicked universe.* + +Then enter your user query: + + +*I walk through the Emerald City gates and look up* + +The output will show on the screen, taking about 7 seconds to generate the first token with less than 1 token per second. 
+ + +![terminal](/assets/images/optimize-llms.png){:style="width:100%"} + + +This example took 245 seconds, or 4 minutes, to generate its complete reply—not very fast. The first optimization we’ll look at will speed up the LLM generation, reducing its computational footprint. + + +### Optimization 1: KleidiAI and Quantization + +Several optimizations are possible from the basic implementation above. The simplest and quickest one t to do is to quantize the model from FP16 to INT4. This approach trades-off some accuracy while cutting the model size from 16Gb to about 4Gb, increasing the inference speed in the process. + +Another common optimization comes in leveraging TorchAO (Torch Architecture Optimization), the PyTorch library that works seamlessly with TorchChat to enhance model performance through various quantization and sparsity methods. + +Lastly, we’ll use Arm KleidiAI optimizations. These are micro-kernels written in assembly that lead to significant performance improvements for LLM inference on Arm CPUs. You can read more about [how KleidiAI kernels work if interested](https://learn.arm.com/learning-paths/cross-platform/kleidiai-explainer/). + +To implement these optimizations, spin up a fresh EC2 instance and follow the instructions [on how to run a Large Language Model (LLM) chatbot with PyTorch](https://learn.arm.com/learning-paths/servers-and-cloud-computing/pytorch-llama/). When ready, run the model and enter the same system prompt and user query as above. You’ll get results that significantly speed up the inference: Less than 1 second to first token, and about 25 tokens per second. + +This cuts the inference time from 245 seconds to about 10 seconds. This results in less power-draw from your server, as it is spending more time idle vs running a power-hungry inference. All else being equal, this is a more carbon-friendly solution than the non-optimized app. The next two approaches go beyond model inference optimization, modifying the solution architectural to further reduce computational load. + + +### Optimization 2: FAISS to match database for common questions + +As stated in the introduction, model inferences are typically more computationally expensive than other search techniques. What if you could automatically respond to common user queries without performing an LLM inference? Using a query/response database is an option to bypass LLM inference and respond efficiently. For this interactive storytelling app, you can imagine common questions about specific characters, the world itself, and rules about what the chatbot is/is not capable of that can have pre-generated answers. + +However, a traditional exact-match database isn’t sufficient as users can phrase the same query in many ways. Asking about the chatbot’s capabilities could all invite the same answer but be phrased differently: + + + +* “What are you capable of?” +* “Tell me what you can do.” +* “How can I interact with you?” + +Implementing semantic search solves this issue by matching a user’s query to the most relevant pre-generated answer by understanding the user’s intent. The [FAISS library](https://github.com/facebookresearch/faiss) is a great option to implement semantic search. + +The computational savings of this approach depends on three factors: + + + +1. Percentage of user queries that can be serviced by semantic search instead of LLM. +2. Computational cost of running the LLM inference. +3. 
Computational cost of running the semantic search. + +With the savings equation being: + + +``` + Computational_savings = (% of queries) * (LLM_cost – search_cost). +``` + + +This type of architecture makes sense in a few situations. One is if your system has common queries with many repeat questions. Another is large-scale systems with hundreds of thousands of incoming queries, where small percentage savings add up to meaningful changes. Lastly, if your LLM inference is very computationally expensive compared to the search cost, particularly with larger parameter models. + +The final optimization approach is transitioning from server to serverless. + + +### Optimization 3: Serverless approach + +Using serverless architectures are popular for many reasons, one being only paying for active compute time, and eliminating costs with idle servers. Idling servers require a non-trivial amount of power to keep on, wasting energy while waiting. + +This cost efficiency translates into being an inherently more environmentally friendly architecture, as it reduces wasteful energy consumption. Further, multiple applications share underlying physical infrastructure, improving resource efficiency. + +To set up your own serverless chatbot, you need to first containerize the quantized Llama-3.1-8b with TorchChat, TorchAO, and Arm KleidiAI optimizations with a python script containing a Lambda entry function `lambda_handler`. One deployment option is to upload your container to AWS ECR and attach the container to your Lambda function. Then set up an API Gateway WebSocket or similar to interact with your Lambda through an API. + +There are two notable limitations to using a serverless architecture to host your LLM, the first being token generation speed. Recall that the server-based approach delivered about 25 tokens/second with KleidiAI optimizations. The serverless approach delivers an order of magnitude slower, which we measured at around about 2.5 tokens/second. This limitation mainly results from Lambda functions deploying onto Graviton2 servers. When deployment moves to CPUs with more SIMD channels, like Graviton3 and Graviton4, the tokens/second should increase over time. Learn more about architecture optimizations introduced in Graviton3 via the [Arm Neoverse-V1 CPU here](https://developer.arm.com/Processors/Neoverse%20V1). + +This slower speed restricts the viable use cases for serverless LLM architectures, but there are certain cases where this can be seen as an advantage. In our use cases of interactive storytelling, slowly revealing information creates a sense of immersion, building anticipation and mimicking real-time narration. Other use cases include: + + + +* Guided meditation apps with slow, relaxing word delivery +* Virtual friend engaging in thoughtful conversation, or a therapeutic conversation. +* Poetry generation or interactive art to slow delivery creating a contemplative aesthetic. + +Users may have a better experience with slower token generation in the right applications. When prioritizing a more sustainable solution, restrictions end up becoming strengths. As an analogy, a common critique of modern movies today is that their overreliance on visual effects leads to fewer compelling storylines vs older movies. The cost restrictions of VFX meant older movies had to craft captivating dialog, leveraging skillful camera angles and character positioning to fully engage viewers. 
Similarly, focusing on sustainable AI architectures can lead to more engaging, immersive experiences when done thoughtfully. + +The second serverless limitation on LLM inferences is the cold-start time of about 50 seconds. If implemented poorly, a user waiting 50 seconds with no alternative will likely leave the app. You can turn this limitation into a feature in our Wicked-based experience with several design tricks: + + + +* Create a “prologue experience” where you guide users through hard-coded questions and answers, priming them for where they will land in Emerald City and collecting input to shape their upcoming experience. +* Make the waiting period a countdown timer, revealing hard-coded text snippets of the story or world-building. A character, like the wizard, could communicate with the user with fragmented lines to build suspense and prime the user into the right mindset. +* Create an audio intro with music from the movie or musical, along with rotating visuals to draw users into the atmosphere of the Wicked world. + + +### Thinking outside the box + +Implementing a sustainability-minded solution architecture includes and goes beyond optimizing your AI inferences. Understand how users will interact with your system, and right-size your implementation accordingly. Always optimizing for fast tokens per second or time to first token will hide opportunities for engaging features. + +With that said, you should be leveraging straightforward optimizations when possible. Using TorchAO and Arm KleidiAI micro-kernels are great ways to speed up your LLM chatbot. By combining creative solution architectures and optimizing where possible, you can build more sustainable LLM-based applications. Happy coding! \ No newline at end of file diff --git a/_posts/2025-02-26-accelerating-generative-ai-segment-anything-2.md b/_posts/2025-02-26-accelerating-generative-ai-segment-anything-2.md new file mode 100644 index 000000000000..87751067df7b --- /dev/null +++ b/_posts/2025-02-26-accelerating-generative-ai-segment-anything-2.md @@ -0,0 +1,1342 @@ +--- +layout: blog_detail +title: "Accelerating Generative AI with PyTorch: Segment Anything 2 - Fast and furious inference with low latency and fast cold starts" +--- + +This post is a follow-up to our [first entry in the multi-series blog focused on how to accelerate generative AI models](https://pytorch.org/blog/accelerating-generative-ai/) with pure, native PyTorch and a focus on latency and elastic scalability. We use torch.compile and torch.export to create highly optimized low latency versions of SAM2 that can be quickly scaled up on new instances. + +By utilizing AOTInductor's (AOTI) ahead-of-time compilation via torch.export, reduced precision, batched prompts and GPU preprocessing we observe up to **13x improvement in p90 execution latency** and **queue times compared to regular eager mode PyTorch**. + +We calculate our final results and demonstrate the improvement in a realistic deployment on auto-scaling cloud infrastructure from [Modal](https://modal.com). + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Task | p50 execution latency, eager float32 (ms) | p50, AOTI float16 (ms / improvement) | p90 execution latency, eager float32 (ms) | p90, AOTI float16 (ms / improvement) |
| --- | --- | --- | --- | --- |
| AMG | 741 | 112 (6.6x) | 1140 | 176 (6.5x) |
| SPS | 98 | 20 (4.9x) | 130 | 28 (4.6x) |
| MPS | 269 | 38 (7.1x) | 714 | 52 (13.7x) |
| Task | p50 queue time, eager float32 (ms) | p50, AOTI float16 (ms / improvement) | p90 queue time, eager float32 (ms) | p90, AOTI float16 (ms / improvement) |
| --- | --- | --- | --- | --- |
| AMG | 201 | 41 (4.9x) | 815 | 327 (2.6x) |
| SPS | 31 | 33 (0.9x) | 441 | 49 (9.0x) |
| MPS | 40 | 37 (1.1x) | 942 | 75 (12.6x) |
    + + + +## The Tasks + +The first post focused on processing a small number of varying prompts (points of interest) per image. These points represented the center points of the ground truth masks. For this post, we'll now focus on a broader set of tasks. Single prompt segmentation (SPS), multi prompt segmentation (MPS), automatic mask generation (AMG) which generates the full set of masks for the input image without a given set of prompts. The first post focused on MPS only. + +![comparison of 3 images](/assets/images/accelerating-generative-ai-2.jpg){:style="width:100%"} + + + +The little star in the image represents a user prompt. For AMG there are no prompts and masks are filtered down heuristically from a dense grid of initial candidate prompts (guesses). For SPS and MPS user prompts are derived from the center points of AMG masks. For SPS we choose the mask with the largest area. + +**Note that SAM2 uses a different backbone than SAM1. In particular, we only consider the largest and most accurate sam2.1_hiera_large backbone for this blog.** + +We aggregate the scripts needed to reproduce the results in [torchao's example folder](https://github.com/pytorch/ao/tree/main/examples/sam2_amg_server) and incrementally upstream the more stable parts of the [changes to the SAM2 model in torchao](https://github.com/pytorch/ao/tree/main/torchao/_models/sam2) to the main [SAM2](https://github.com/facebookresearch/sam2) repository. So if you are interested in taking a look at the cutting-edge variant or would like to contribute experimental features, please don't hesitate to reach out to the torchao repository and team. For the more stable and latest model version, please head on over to SAM2 directly. + + +## Overview + +We categorize the changes presented here into two. **Fast** changes constrain themselves to techniques that are not meant to affect model accuracy. **Furious** changes sacrifice some numerical accuracy for additional speed by making use of approximations such as low-precision data types. + +Approximations may slightly lower precision metrics in favor of significantly improved performance while still passing an end-to-end check based on mean intersection over union (mIoU). + +To measure the performance improvements we processed 1000 images, which were selected at random from the SAM2 validation dataset. We look at the p50 and p90 latency per image. To measure accuracy we consider the mIoU. Most notably for the AMG task we also define a fail count metric. We consider a comparison failed if the **number of masks** differs. This turns out to be a fairly unstable quantity and we can see that the other tasks are not as sensitive to small numeric changes as AMG. + + +## The Setup + +We are running the offline experiments on a regular H100 devserver, which is a fairly beefy and performant machine. + +However, we try to look at these tasks with realistic constraints. In particular, we would like to emulate a server-side inference environment. That means we don't use DataLoader to hide the latency of image preprocessing or decoding routines. + +For the latency calculations we include decoding, segmentation and conversion of masks to a dictionary of run-length encoded masks. Or put differently, we exclude loading the images into in-memory host bytearrays and storing the resulting dictionaries as json files on disk. This is meant to emulate a more realistic setting. 
+ +More concretely, consider the code below for the routines we include in our measurements. For any task `gen_masks` produces a batched bool Tensor bitmask that represents the corresponding object masks. We then compress this bitmask into a run length encoded (rle) format that can be used to transfer back the results from a remote server much more efficiently. + + +``` +image_tensors = decode_img_bytes(...) +masks = gen_masks(image_tensors, ...) +rle_dicts = [rle_dict_from_masks(m) for m in masks] +``` + + + +## Optimizations + + +### ao: eager code optimizations + +The most effective tool for this work is the PyTorch autograd profiler combined with `record_function`. To build this software, we've used the profiler repeatedly to observe the program and confirm the effectiveness of any changes. It's also important to keep in mind that the profiler itself has overhead. The more data you collect, such as stack traces, the more overhead you introduce, which might skew the collected trace. But it is excellent to find synchronization points, space between kernels and GPU kernels that take a long time. + +GPU traces help you understand bottlenecks that are not necessarily easily addressed by compile. We found that AutomaticMaskGeneration in particular is dominated by the data structure used to store the masks and by the routine used to convert the masks to a run-length encoded compressed format. We also found a large part of AMG performance is dominated by the large number of masks created as a single batch. Sometimes candidate masks can be filtered down to fewer candidates earlier in the postprocessing stage by reordering operations. This in turn significantly speeds up the later operations. + +In order to confirm the accuracy of our implementation we first compare without any changes in settings and using float32 precision. We see that mIoU is unchanged and the masks match perfectly when using the exact same settings. This means that these eager mode changes did not affect the accuracy of these tasks. + +AMG + + + + + + + + + + + + + + + + + + + + + + + + +
| | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU / fail count |
| --- | --- | --- | --- | --- |
| Baseline | 864 | 1144 | 4350 | reference |
| AO | 693 | 786 | 4010 | 1 / 0 |
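As a concrete illustration of the profiling workflow described above, a minimal sketch using the PyTorch profiler with `record_function` might look as follows; the `gen_masks` body here is a stand-in rather than the actual SAM2 code:

```
import torch
from torch.profiler import profile, record_function, ProfilerActivity

device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if torch.cuda.is_available() else [])

def gen_masks(image):
    # Stand-in for the real segmentation pipeline; the labels below show up
    # as named ranges in the collected trace.
    with record_function("image_encoder"):
        features = image.mean(dim=0)
    with record_function("mask_decoding"):
        masks = features > 0
    return masks

image = torch.randn(3, 1024, 1024, device=device)
with profile(activities=activities) as prof:
    with record_function("gen_masks"):
        gen_masks(image)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
prof.export_chrome_trace("trace.json")  # inspect in chrome://tracing or Perfetto
```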
    + + + +### ao: batching prompts + +Another lossless performance optimization that we were able to apply is batching the user input prompt calculations. When optimizing for latency at batch size 1 on a server-grade GPU such as an H100 we are often left with a lot of spare memory. We can easily trade off that memory for more performance by processing more points of interest (also called user prompts) at once. Remember that SAM2 is split into two parts: First the backbone (image encoder), second the prediction and decoding of masks based on a set of user prompts / points of interest. It is the second part where we may expect a larger or even varying number of inputs and it is this second part where we apply batching. + +This causes a large increase in memory, but also much better latency. The baseline generates one mask per prompt in a loop. For AMG the baseline processes 64 prompts at once and all that is needed is to change it to 1024, which is the number of candidate prompts generated. For SPS we process one prompt at a time, but it's still included below for completeness. + +AMG + + + + + + + + + + + + + + + + + + + + + + + + +
| | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU / fail count |
| --- | --- | --- | --- | --- |
| Baseline | 864 | 1144 | 4350 | reference |
| AO + batching | 613 | 706 | 33786 | 0.9999995 / 0 |
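For reference, the prompt-batching knob described above is also exposed through SAM2's public automatic mask generation interface. The sketch below assumes the `sam2` package plus sam2.1_hiera_large checkpoint and config paths; the experiments in this post use the torchao fork of this code, so treat it only as an illustration of the batch-size change:

```
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

# Checkpoint and config paths are assumptions; adjust to your local setup.
model = build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt")

# The default decodes 64 candidate prompts per forward pass. Raising
# points_per_batch to 1024 decodes the full 32x32 grid of candidates in one
# batch, trading extra GPU memory for lower latency.
mask_generator = SAM2AutomaticMaskGenerator(model, points_per_side=32, points_per_batch=1024)

image = np.array(Image.open("example.jpg").convert("RGB"))
masks = mask_generator.generate(image)
```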
SPS

| | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU |
| --- | --- | --- | --- | --- |
| Baseline | 116 | 181 | 1337 | reference |
| AO | 110 | 170 | 1339 | 1 |
MPS

| | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU |
| --- | --- | --- | --- | --- |
| Baseline | 276 | 681 | 1337 | reference |
| AO + batching | 126 | 225 | 8021 | 0.9999992 |
    + + +As a technical side note: Most notably to enable batching for MPS, and to avoid a significant manual rewrite of the code base to support multiple prompts at the same time, we used a Tensor subclass we call MapTensor. A MapTensor allows us to pass a batch of N prompts, but have it advertise a batch size of 1. Any operation is then automatically broadcast to the wrapped Tensor and propagated throughout the prediction part of the model. This works because individual prompt predictions are independent of one another. This is very similar to torch.vmap. + + +``` +center_points_torch = to_map_tensor(center_points_torch) +center_points_label_torch = to_map_tensor(center_points_label_torch) +masks, scores, _ = mask_generator.predictor.predict( + point_coords=center_points_torch, + point_labels=center_points_label_torch, + multimask_output=True, + return_logits=False, + return_type="torch", +) +# Unwrapping MapTensor +masks = masks.elems +scores = scores.elems +``` + + + +### fast: fullgraph compilation + +Just as with our first post, we first remove GPU syncs and graph breaks to make use of fullgraph compiled model code with max-autotune kernels where appropriate. After some rewriting, we are able to compile the image encoder and the prediction of masks. + +We run the experiments twice to get a sense of the overhead due to compilation. We run it once in an environment with an empty TORCHINDUCTOR_CACHE_DIR and then again while ingesting the artifacts from the previous run. In particular, auto-tuning can take a long time and happens on the first call in a pristine environment. We call the second run "warm". The first iteration is typically expected to be slow due to various other related initialization processes, but compile increases it significantly, even if an existing cache is used and the same exact shapes are fed again. Having said that, an overhead of a few seconds in a warm environment is often still stomachable on the very first call. + +Most of these drawbacks can be mitigated and compiling causes a significant improvement in latency and reduction in memory. + +AMG + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
|                  | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU / fail count | first iteration (ms) |
|------------------|------------------|------------------|--------------|-------------------|----------------------|
| AO + batching    | 613              | 706              | 33786        | 0.9999995 / 0     | 1125                 |
| + compile (cold) | 423              | 513              | 29349        | skipped           | 404866               |
| + compile (warm) | 439              | 530              | 29349        | 0.994 / 190       | 8544                 |
The number of masks produced per image can vary slightly when using automatic mask segmentation. There is ambiguity in the number of masks per object the model may produce. For example, a car may be subdivided into frames, windows and doors or treated as a whole. When a modification causes the number of masks to change, we consider the comparison failed and we only calculate the mIoU on masks with an exact match. This does not apply to the other tasks. We found that the number of masks generated is very sensitive to small numerical changes. The other tasks use the same code and MPS in particular can help us further verify correctness.

SPS
|                  | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU    | first iteration (ms) |
|------------------|------------------|------------------|--------------|---------|----------------------|
| AO               | 110              | 170              | 1339         | 1       | 562                  |
| + compile (cold) | 102              | 158              | 1343         | skipped | 319954               |
| + compile (warm) | 100              | 160              | 1302         | 0.9999  | 8947                 |
MPS

|                  | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU      | first iteration (ms) |
|------------------|------------------|------------------|--------------|-----------|----------------------|
| AO + batching    | 126              | 225              | 8021         | 0.9999992 | 504                  |
| + compile (cold) | 129              | 215              | 8021         | skipped   | 333308               |
| + compile (warm) | 113              | 213              | 8021         | 0.998     | 8617                 |
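For reference, the general pattern looks roughly like the sketch below. It uses a stand-in module rather than the actual torchao SAM2 code; `fullgraph=True` surfaces graph breaks instead of silently falling back, `mode="max-autotune"` enables kernel auto-tuning, and pointing `TORCHINDUCTOR_CACHE_DIR` at a persistent directory is what makes a later run "warm". The cache path and module here are assumptions for illustration only.

```
import os
import torch
import torch.nn as nn

# Reuse inductor artifacts across runs ("warm" start); the path is an assumed example.
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/tmp/sam2_inductor_cache")

# Stand-in for a heavy component such as the image encoder.
class TinyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(16, 16, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

encoder = TinyEncoder().cuda().eval()
# fullgraph=True fails loudly on graph breaks; max-autotune searches for fast kernels.
encoder = torch.compile(encoder, mode="max-autotune", fullgraph=True)

with torch.inference_mode():
    out = encoder(torch.randn(1, 3, 64, 64, device="cuda"))  # first call triggers compilation
```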
### furious: TF32, float16 and GPU preprocessing

We found that using float16 is the right level of precision for a few significant subcomponents of the model. In particular, the image encoder and mask decoder weights can be converted entirely to float16. We can also use TensorFloat32 precision for the remaining float32 matrix operations. It should be possible to further reduce the precision and we may address this in a future post. We also move image preprocessing such as image normalization onto the GPU with the furious mode. We can't use GPU decoding (nvJPEG) routines, because the numerical differences are too large and the model suffers a significant degradation in mIoU, so image decoding still happens on the CPU.

AMG
|                                | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU / fail count |
|--------------------------------|------------------|------------------|--------------|-------------------|
| AO + batching + compile (warm) | 439              | 530              | 29349        | 0.994 / 190       |
| + furious                      | 165              | 240              | 28335        | 0.978 / 306       |
This causes a significant degradation in mIoU for the AMG task, but doesn't affect the other tasks. After an in-depth investigation, we still chalk this up to numerical instability and reordering of operations. More work is needed to further investigate this and it may not be interesting to run the AMG task in lower precision. The other tasks, however, benefit drastically in latency with minimal changes in mIoU.

SPS

|                     | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU   |
|---------------------|------------------|------------------|--------------|--------|
| AO + compile (warm) | 100              | 160              | 1302         | 0.9999 |
| + furious           | 32               | 63               | 861          | 0.9997 |
MPS

|                                | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU  |
|--------------------------------|------------------|------------------|--------------|-------|
| AO + batching + compile (warm) | 113              | 213              | 8021         | 0.998 |
| + furious                      | 36               | 64               | 4222         | 0.997 |
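The precision settings described above are easy to reproduce in isolation. A minimal sketch on a stand-in module (not the actual SAM2 predictor):

```
import torch
import torch.nn as nn

# Allow TensorFloat32 for the remaining float32 matmuls / convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Stand-in for the image encoder / mask decoder weights converted to float16.
model = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
model = model.to(device="cuda", dtype=torch.float16).eval()

with torch.inference_mode():
    x = torch.randn(8, 256, device="cuda", dtype=torch.float16)
    y = model(x)
```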
    + + + +### AOTInductor's (AOTI) ahead-of-time compilation via torch.export + +When scaling elastically it often is not possible to accommodate long startup times. That means the first iteration cannot be slow, but we must quickly deliver results. This is when torch.compile's current compilation overhead can get in the way. To address this we can use AOTInductor's (AOTI) ahead-of-time compilation via torch.export. AOTI lets us compile the model on a representative input and store the resulting code in a binary that is quick to load and run. + +AOTI via torch.export is a new feature and we currently can't export everything that is compilable. We've been able to export the image encoder for all tasks but have only been able to export the mask prediction for the AMG and SPS tasks due to varying prompts. torch.export also supports dynamic shapes, but we need to invest a bit more time to prepare the code for it. + +AMG: AO + batching + furious + + + + + + + + + + + + + + + + + + + + + + + + + + + +
|                      | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU / fail count | first iteration (ms) |
|----------------------|------------------|------------------|--------------|-------------------|----------------------|
| + compile (warm)     | 165              | 240              | 28335        | 0.978 / 306       | 10341                |
| + load export (cold) | 162              | 233              | 27927        | 0.974 / 308       | 906                  |
SPS: AO + furious

|                      | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU   | first iteration (ms) |
|----------------------|------------------|------------------|--------------|--------|----------------------|
| + compile (warm)     | 32               | 63               | 861          | 0.9997 | 7989                 |
| + load export (cold) | 35               | 66               | 1686         | 0.9997 | 763                  |
Note that loading the exported model significantly increases memory. It likely only increases peak memory utilization: to avoid briefly holding two copies of the weights in memory, model initialization needs to be delayed until after the exported artifact is loaded. This is something we could address, but the memory consumption is nowhere near the limit. We don't see an increase in the other tasks, because AMG and MPS peak memory is dominated by processing batches of masks. One way to reduce that could be to operate on masks in the rle format (or some other sparse format) earlier on, but for now, there is no reason for this given the current memory consumption and focus on latency.

MPS: AO + batching + furious

|                      | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU  | first iteration (ms) |
|----------------------|------------------|------------------|--------------|-------|----------------------|
| + compile (warm)     | 36               | 64               | 4222         | 0.997 | 9626                 |
| + load export (cold) | 43               | 72               | 3813         | 0.997 | 747                  |
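For illustration, the export-and-load flow looks roughly like the sketch below, on a stand-in module. The packaging helpers have shifted slightly across recent PyTorch releases, so treat the exact calls as an assumption rather than the actual torchao scripts:

```
import torch
import torch.nn as nn
from torch._inductor import aoti_compile_and_package, aoti_load_package

class TinyHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(256, 256)

    def forward(self, x):
        return torch.sigmoid(self.proj(x))

model = TinyHead().cuda().eval()
example_args = (torch.randn(4, 256, device="cuda"),)

# Build step, done ahead of time; the resulting .pt2 artifact can be uploaded
# to shared storage.
ep = torch.export.export(model, example_args)
pt2_path = aoti_compile_and_package(ep, package_path="tiny_head.pt2")

# Load step, done by each new replica; fast compared to compiling from scratch.
runner = aoti_load_package(pt2_path)
out = runner(*example_args)
```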
Using export by itself doesn't seem to benefit from extensive warmup and can be run in a pristine new inductor cache directory. But again, we do not evict the CUDA cache or other caches. In the section on Modal, we are running some of these experiments in a pristine environment.

When only processing 1000 images in a new process, using export can really be worth it to save on compile and other cold start overhead.

### bonus: More GPU preprocessing

At this point, the latency is fairly low. In particular, for the SPS and MPS tasks we are processing at around 30ms to 40ms. Let's bring back the pseudo-code from the setup section again.

```
image_tensors = decode_img_bytes(...)
masks = gen_masks(image_tensors, ...)
rle_dicts = [rle_dict_from_masks(m) for m in masks]
```

Further profiling showed that at this point `decode_img_bytes` takes about 10ms. In particular, it uses torchvision's ToTensor transform to convert from a numpy array to a scaled, float32 torch.Tensor. The bytes passed to ToTensor have already been decoded and converted to a numpy ndarray. By slightly rewriting ToTensor, using torchvision's v2 API and moving the decoded uint8 Tensor (which is smaller) to the GPU before scaling, we can shave off another 10ms of latency. Without including `decode_img_bytes` in our analysis we would have missed this opportunity that has real-world impact on server-side inference.

```
import torch
from torchvision.transforms import v2

image_tensor = torch.from_numpy(image_tensor)
image_tensor = image_tensor.permute((2, 0, 1))
image_tensor = image_tensor.cuda()
image_tensor = v2.ToDtype(torch.float32, scale=True)(image_tensor)
```

Note in particular that using pinned memory to perform asynchronous data transfers doesn't apply, since the time it takes to move the Tensor into pinned memory isn't worth the gain in asynchronicity for this data movement. For future work, we might want to explore further improvements here by using more advanced direct memory transfer techniques.

AMG: AO + batching + furious
|                                | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU / fail count | first iteration (ms) |
|--------------------------------|------------------|------------------|--------------|-------------------|----------------------|
| + load export (cold)           | 162              | 233              | 27927        | 0.974 / 308       | 906                  |
| + load export (warm)           | 157              | 230              | 27927        | 0.974 / 308       | 799                  |
| + load export (warm) + preproc | 136              | 208              | 27950        | 0.977 / 311       | 908                  |
SPS: AO + furious

|                                | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU   | first iteration (ms) |
|--------------------------------|------------------|------------------|--------------|--------|----------------------|
| + load export (cold)           | 35               | 66               | 1686         | 0.9997 | 763                  |
| + load export (warm)           | 31               | 63               | 1686         | 0.9997 | 683                  |
| + load export (warm) + preproc | 19               | 25               | 1711         | 0.9997 | 658                  |
MPS: AO + batching + furious

|                                | p50 latency (ms) | p90 latency (ms) | memory (MiB) | mIoU  | first iteration (ms) |
|--------------------------------|------------------|------------------|--------------|-------|----------------------|
| + load export (cold)           | 43               | 72               | 3813         | 0.997 | 747                  |
| + load export (warm)           | 53               | 81               | 3813         | 0.997 | 807                  |
| + load export (warm) + preproc | 31               | 41               | 3837         | 0.997 | 671                  |
This small change has a significant impact on the SPS and MPS tasks.

## Deploying on Modal

Finally, we deployed our optimized inference onto [Modal](https://modal.com), a serverless infrastructure provider, to demonstrate that the benefits of these optimizations can be realized in a more realistic deployment setting.

In particular, compilation and AOTI via torch.export require extra work. In a naïve deployment that work might be added to every single inference execution, adding latency that dwarfs any improvements from a faster model. This is particularly challenging with elastic or autoscaling infrastructure, where replicas of our inference service need to be regularly and automatically created and destroyed.

We share a deployment script in the torchao repository ([cli_on_modal.py](https://github.com/pytorch/ao/tree/main/examples/sam2_amg_server)) to demonstrate one pattern for an elastic deployment. We build the exported models ahead of time and then upload them to [distributed storage](https://modal.com/docs/guide/volumes). Relative to eager execution, this adds a bit of extra work when replicas spin up since they need to read this data over a network, but this is far less costly than compilation or export.

We benchmarked this deployment with a large batch inference workload: sending 1000 images for concurrent processing. The deployment scales up to ten replicas on ten GPUs at peak and scales down to zero GPUs when inactive.

First, let’s look at the execution latencies.
|     | p50 eager float32 (ms) | p50 AOTI float16, Modal (ms / improvement) | p50 AOTI float16, Offline (ms / improvement) | p90 eager float32 (ms) | p90 AOTI float16, Modal (ms / improvement) | p90 AOTI float16, Offline (ms / improvement) |
|-----|------------------------|--------------------------------------------|----------------------------------------------|------------------------|--------------------------------------------|----------------------------------------------|
| AMG | 741                    | 112 (6.6x)                                 | 136 (5.4x)                                   | 1140                   | 176 (6.5x)                                 | 208 (5.5x)                                   |
| SPS | 98                     | 20 (4.9x)                                  | 19 (5.2x)                                    | 130                    | 28 (4.6x)                                  | 25 (5.2x)                                    |
| MPS | 269                    | 38 (7.1x)                                  | 31 (8.7x)                                    | 714                    | 52 (13.7x)                                 | 41 (17.4x)                                   |
We notice that execution latencies on Modal and Offline are fairly close, especially relative to the baseline, indicating that optimizing the deployment offline was a reasonable proxy for optimizing the deployment directly.

In addition to execution latency, our batch workload has queueing time, since there are fewer replicas than there are inputs, and so some inputs have to wait in line.

|     | p50 queue time, eager float32 (ms) | p50 queue time, AOTI float16 (ms / improvement) | p90 queue time, eager float32 (ms) | p90 queue time, AOTI float16 (ms / improvement) |
|-----|------------------------------------|-------------------------------------------------|------------------------------------|-------------------------------------------------|
| AMG | 201                                | 41 (4.9x)                                       | 815                                | 327 (2.6x)                                      |
| SPS | 31                                 | 33 (0.9x)                                       | 441                                | 49 (9.0x)                                       |
| MPS | 40                                 | 37 (1.1x)                                       | 942                                | 75 (12.6x)                                      |
    + + +Even though the queueing system provided by the infrastructure is unchanged, the queue latencies also decrease when we use our optimized model – in the p90 case by a factor of 2 to 12. That’s because when we finish previous inputs faster (from reduced execution latency) we can pull our next inputs sooner (reducing their queueing time). + +If you’re interested in optimizing SAM2 inference or deployments further, don’t hesitate to reach out to us at the [torchao repository](https://github.com/pytorch/ao)! + + +## Conclusions + +We rewrote Meta's original SAM2 in pure PyTorch with little loss of accuracy and a strong focus on latency. We deployed our optimized inference onto [Modal](https://modal.com), a serverless infrastructure provider, to demonstrate that the benefits of these optimizations can be realized in a more realistic deployment setting. + +By utilizing AOTInductor's (AOTI) ahead-of-time compilation via torch.export, reduced precision, batched prompts and GPU preprocessing we observe up to 13x improvement in p90 execution latency and queue times compared to regular eager mode PyTorch. + +With elastic or autoscaling infrastructure, where replicas of our inference service need to be regularly and automatically created and destroyed, a naïve deployment of torch.compile can add work to inference execution that dwarfs any improvements from a faster model. By utilizing AOTInductor's (AOTI) ahead-of-time compilation via torch.export, we are able to upload exported models ahead of time and read this data over a network, which enables us to get the benefits of compilation without significantly increased work. + +For more details on how to reproduce the data in this blog post, [check out the experiments folder of torchao](https://github.com/pytorch/ao/tree/main/examples/sam2_amg_server). Please don't hesitate to contact us or [open an issue](https://github.com/pytorch/ao/issues/new) if you run into any technical issues. \ No newline at end of file diff --git a/_posts/2025-03-04-submit-to-speak.md b/_posts/2025-03-04-submit-to-speak.md new file mode 100644 index 000000000000..89d9907b682d --- /dev/null +++ b/_posts/2025-03-04-submit-to-speak.md @@ -0,0 +1,79 @@ +--- +layout: blog_detail +title: "📣 Submit to Speak at PyTorch Conference + Save on Registration" +--- + +Step into the Future of AI at PyTorch Conference 2025. + + +![banner ad for conference](/assets/images/submit-to-speak/fg1.png){:style="width:100%"} + + +The Call for Proposals for **PyTorch Conference 2025** is officially open! + +**Join us in San Francisco from October 22–23, 2025,** to showcase your expertise and innovations with PyTorch—the industry-leading, open-source machine learning framework powering innovations from bare-metal infrastructure to sophisticated application and agent layers. This is your opportunity to share insights, breakthroughs, and case studies with a global audience of AI and Generative AI practitioners, researchers, and developers. + +![people watching presentation at conference](/assets/images/submit-to-speak/fg2.jpg){:style="width:100%"} + + +Submit your proposals and prepare to engage, learn, and network alongside some of the brightest minds in the AI/ML community. 
We’re seeking sessions, Birds of a Feather discussions, lightning talks, and poster sessions on the following topics: + +* Core PyTorch Framework +* PyTorch on Accelerator Hardware +* PyTorch Ecosystem and Tools +* AI Applications and Use Cases +* AI in Research and Academia +* AI in Industry and Enterprise Applications +* AI Infrastructure and Scalability +* Ethical AI, Governance, and Regulation +* Training, Fine-Tuning, and Alignment +* Inference, Deployment, and Serving +* Performance Measurement and Benchmarking +* Data Engineering and Management for AI +* Generative AI and Large Language Models (LLMs) +* Model Optimization and Efficiency +* Open Source Collaboration, Education and Community Building +* Edge AI and On-Device +* DL Compilers and Kernel Authoring + + +
Learn more and submit your talk by Sunday, June 1, at 11:59 PDT!

**SUBMIT TO SPEAK**
    + + +--- + +![people arriving at conference](/assets/images/submit-to-speak/fg3.jpg){:style="max-width:300px; display: block; float: right;"} + +**Save up to USD$500 with Super Early Bird Pricing!** + +* Reserve your pass by **11:59 PM PDT on March 21** and score Super Early Bird pricing for just **USD$499**. That’s a savings of up to USD$500! +* Student or faculty? Learn more about our **[discounted academic rate](https://events.linuxfoundation.org/pytorch-conference/register/#registration-rates)**. +* Need help covering travel costs? We offer discretionary travel funding for those community members who would otherwise not be able to attend. **[Learn more](https://events.linuxfoundation.org/pytorch-conference/register/#additional-information)**. + +
**REGISTER NOW & SAVE**
    + +--- + + +**Become a Sponsor at PyTorch Conference 2025!** + +Seize your opportunity to influence the future of Generative AI and Machine Learning by sponsoring PyTorch Conference 2025. PyTorch is at the forefront of innovation—empowering rapid experimentation, flexible model development, and efficient deployment into production environments with its powerful, versatile ecosystem of tools and thriving community of dedicated users. + +As a sponsor, you'll gain more than visibility; you'll strategically position your organization at the heart of a vibrant, global AI/ML ecosystem. Connect directly with **3,000+** expert attendees, researchers, engineers, and decision-makers, and actively shape the conversations driving the next generation of AI advancements. + +
**BECOME A SPONSOR**
    + +For more details on CFP submissions, registration, and sponsorship, visit **the** [PyTorch Conference Website](https://events.linuxfoundation.org/pytorch-conference/). \ No newline at end of file diff --git a/_posts/2025-03-05-activation-checkpointing-techniques.md b/_posts/2025-03-05-activation-checkpointing-techniques.md new file mode 100644 index 000000000000..782722e96681 --- /dev/null +++ b/_posts/2025-03-05-activation-checkpointing-techniques.md @@ -0,0 +1,233 @@ +--- +layout: blog_detail +title: "Current and New Activation Checkpointing Techniques in PyTorch" +--- + +As models scale in depth, batch size, and sequence length, etc, activation memory becomes an increasingly significant contributor to the overall memory usage. To help address this, PyTorch provides utilities for [activation checkpointing](https://pytorch.org/docs/stable/checkpoint.html), which reduce the number of saved tensors by recomputing them when needed, trading off memory usage for additional compute. + +In this post, we’ll walk through the basics of what activation memory is, the high-level ideas behind existing activation checkpointing techniques, and also introduce some newer techniques that aim to improve flexibility and provide more optimization/automation out of the box. + +As we look at these techniques, we'll compare how these methods fit into a speed vs. memory trade-off diagram and hopefully provide some insight on how to choose the right strategy for your use case. + +*(If you prefer to jump straight to the new APIs, please skip ahead to the “Selective Activation Checkpoint” and “Memory Budget API” sections below.)* + +![flow diagram](/assets/images/activation-checkpointing-techniques/fg1.png){:style="width:100%"} + + +--- + + +## Activation Memory Basics + +By default, in eager mode (rather than using `torch.compile`), PyTorch’s autograd preserves intermediate activations for backward computation. For example, if you call `sin` on a tensor `x` during the forward pass, autograd must remember `x` to compute `cos(x)` during backward. + + +![flow diagram](/assets/images/activation-checkpointing-techniques/fg2.png){:style="max-width:400px; display: block; margin-left: auto; margin-right: auto"} + + +If this tensor `x` is saved at the beginning of the forward pass, it remains in memory throughout both the forward and backward phases. It can only be cleared after it is used to compute the gradient, which happens at the end of the backward pass (due to the reverse order of execution). + +Thus, as you proceed through the forward pass and perform more and more operations, you accumulate more and more activations, resulting in more and more activation memory until it (typically) reaches its peak at the start of backward (at which point activations can start to get cleared). + +![flow diagram](/assets/images/activation-checkpointing-techniques/fg3.png){:style="width:100%"} + + +*In the diagram above, the orange boxes represent operations, black arrows represent their tensor inputs and outputs. The black arrows that cross over the right represent tensors that autograd saves for backward.* + +A useful way to visually organize this default saving behavior in eager as well as the techniques we're about to introduce is based on how they trade off speed versus memory. 
+ + +![flow diagram](/assets/images/activation-checkpointing-techniques/fg4.png){:style="width:100%"} + + +The ideal place to be on this diagram is the top-left, where you have "high" speed but also low memory usage. + +We begin by putting the default saving behavior on the **top-right** (for reasons we'll explain in more detail as we introduce more points for other techniques). + + +--- + + +## Activation Checkpointing (AC) + +**[Activation checkpointing (AC)](https://pytorch.org/docs/stable/checkpoint.html)** is a popular technique to reduce memory usage in PyTorch. + +During forward, any operations performed inside the AC'd region do not save tensors for backward. (Only the inputs to the function are saved.) During backward, the intermediate activations needed for gradient computation are rematerialized by running the function a second time. + +![flow diagram](/assets/images/activation-checkpointing-techniques/fg5.png){:style="width:100%"} + + +*In the diagram (right), the black box shows where activation checkpointing is applied. Compared to the default eager approach (left), this setup results in fewer tensors being saved (1 versus 3).* + +Applying AC on the right parts of the model has the effect of reducing peak memory, because the intermediate activations are no longer materialized in memory when the memory usage typically peaks (at the beginning of backward). + +On the speed-versus-memory tradeoff diagram, AC is plotted on the **bottom-left.** Relative to eager mode, it reduces the amount of memory saved for backward but comes with an added cost in compute due to recomputation. + +![flow diagram](/assets/images/activation-checkpointing-techniques/fg6.png){:style="width:100%"} + + +Note that AC’s speed–memory tradeoff /can/ be adjusted by selecting which parts of the forward pass to checkpoint and by defining how many checkpoint regions to use. However, implementing these changes may require modifying your model’s structure and can be cumbersome depending on how your code is organized. For the purposes of this diagram, we assume only one region is checkpointed; under this assumption, AC appears as a single point on the tradeoff diagram. + +Also note that “memory” here does not refer to peak memory usage; rather, it indicates the how much memory is saved for backward for a fixed region. + + +--- + + +## torch.compile and min-cut partitioner + +Another notable approach to keep in mind is **torch.compile** (introduced in PyTorch 2.0). Like activation checkpointing, `torch.compile` can also perform some level of recomputation under the hood. Specifically, it traces the forward and backward computations into a single joint graph, which is then processed by a [“min-cut” partitioner](https://dev-discuss.pytorch.org/t/min-cut-optimal-recomputation-i-e-activation-checkpointing-with-aotautograd/467). This partitioner uses a min-cut/max-flow algorithm to split the graph such that it minimizes the number of tensors that need to be saved for backward. + +At first glance, this might sound a lot like what we want for activation memory reduction. However, the reality is more nuanced. By default, the partitioner’s primary goal is to reduce runtime. As a result, it only recomputes certain types of operations—primarily simpler, fusible, and non-compute-intensive ops (like pointwise ops). + +Placing "compile" on the speed-versus-memory tradeoff diagram... 
+ +![flow diagram](/assets/images/activation-checkpointing-techniques/fg7.png){:style="width:100%"} + + +It is to the top-left of the eager non-AC point, as we expect `torch.compile` to improve on both speed and memory. + +On the other hand, relative to activation checkpointing, torch.compile is more conservative about what it recomputes, placing it closer to the top-left on the speed-versus-memory diagram. + + +--- + + +## Selective Activation Checkpoint [NEW!] + +While normal checkpointing recomputes every op in a chosen region, [selective activation checkpointing (SAC)](https://pytorch.org/docs/main/checkpoint.html#torch.utils.checkpoint.create_selective_checkpoint_contexts) is an additional setting on top of activation checkpointing that you can apply to have a more granular control over which operations to recompute. + +This can be useful if you have certain more expensive operations like matmuls which you prefer to avoid recomputing, but still generally want to recompute cheaper operations like pointwise. + +![flow diagram](/assets/images/activation-checkpointing-techniques/fg8.png){:style="width:100%"} + + +*Where plain AC (left) would save a single tensor and then recompute the entire AC'd region, with SAC (right) you can selectively save specific operations (marked red) in the region, so you can avoid recomputing them.* + +To specify what to selectively save, you can specify a policy_fn. To illustrate the additional trade offs you can make with this, we present two simple policy functions. + + +### Policy 1: Not recomputing matmuls: + + +``` +aten = torch.ops.aten +compute_intensive_ops = [ + aten.mm, + aten.bmm, + aten.addmm, +] +def policy_fn(ctx, op, *args, **kwargs): + if op in compute_intensive_ops: + return CheckpointPolicy.MUST_SAVE + else: + return CheckpointPolicy.PREFER_RECOMPUTE +``` + +![flow diagram](/assets/images/activation-checkpointing-techniques/fg9.png){:style="width:100%"} + + +### Policy 2: More aggressively save anything compute intensive + + +``` +# torch/_functorch/partitioners.py +aten = torch.ops.aten +compute_intensive_ops = [ + aten.mm, + aten.convolution, + aten.convolution_backward, + aten.bmm, + aten.addmm, + aten._scaled_dot_product_flash_attention, + aten._scaled_dot_product_efficient_attention, + aten._flash_attention_forward, + aten._efficient_attention_forward, + aten.upsample_bilinear2d, + aten._scaled_mm +] +def policy_fn(ctx, op, *args, **kwargs): + if op in compute_intensive_ops: + return CheckpointPolicy.MUST_SAVE + else: + return CheckpointPolicy.PREFER_RECOMPUTE +``` + +![flow diagram](/assets/images/activation-checkpointing-techniques/fg10.png){:style="width:100%"} + + +On the speed-versus-memory diagram, SAC is plotted as a range of points from closer to AC to closer to Eager, depending on your chosen policy. 
+ + +![flow diagram](/assets/images/activation-checkpointing-techniques/fg11.png){:style="width:100%"} + + +**Try it out!** (Available in 2.5 as a prototype feature; see [docs](https://pytorch.org/docs/main/checkpoint.html#torch.utils.checkpoint.create_selective_checkpoint_contexts) for more info + copy-pastable example) + + +``` +from torch.utils.checkpoint import checkpoint, create_selective_checkpoint_contexts + +# Create a policy function that returns a CheckpointPolicy +def policy_fn(ctx, op, *args, **kwargs): + if op in ops_to_save: + return CheckpointPolicy.MUST_SAVE + else: + return CheckpointPolicy.PREFER_RECOMPUTE + +# Use the context_fn= arg of the existing checkpoint API +out = checkpoint( + fn, *args, + use_reentrant=False, + # Fill in SAC context_fn's policy_fn with functools.partial + context_fn=partial(create_selective_checkpoint_contexts, policy_fn), +) + +``` +--- + + + +## (compile-only) Memory Budget API [NEW!] + +As mentioned previously, any given SAC policy can be represented as a point on a speed-memory tradeoff diagram. Not all policies are created equal, however. The "optimal" policies are the ones that fall on a pareto curve, e.g. for all policies that incur the same memory overhead, this policy is the one that minimizes the amount of required compute. + +For users who are using torch.compile, we offer a **memory budget API** that automatically applies SAC over your compiled region with a pareto-optimal policy given a user-specified "memory budget" between 0 and 1, where a budget of 0 behaves like plain-AC and a budget of 1 behaves like default torch.compile. + + +![flow diagram](/assets/images/activation-checkpointing-techniques/fg12.png){:style="width:100%"} + + +Below are some real results on a transformer model: + +![flow diagram](/assets/images/activation-checkpointing-techniques/fg13.png){:style="width:100%"} + + +We observe a 50% memory reduction by recomputing only pointwise ops, with a steady drop-off as you recompute more and more of your matmuls. Attention is the most expensive, so you tend to want to recompute those last. + +**Try it out!** (Available in 2.4 as an experimental feature; see this [comment block](https://github.com/pytorch/pytorch/blob/68a363548409a3ff17965770304ee5e12fe718d9/torch/_functorch/config.py#L110-L122) for more info) + + +``` +torch._dynamo.config.activation_memory_budget = 0.5 + +out = torch.compile(fn)(inp) +``` + +--- + + + + +## Conclusion + + +![flow diagram](/assets/images/activation-checkpointing-techniques/fg14.png){:style="width:100%"} + + +In summary, activation checkpointing techniques in PyTorch offer a variety of ways to balance memory and compute demands, from simple region-based checkpointing to more selective and automated methods. By choosing the option that best matches your model’s structure and resource constraints, you can achieve significant memory savings with an acceptable trade-off in compute. + + +## Acknowledgements + +We would like to thank Meta's [xformers](https://github.com/facebookresearch/xformers) team including [Francisco Massa](https://github.com/fmassa) for working on the original version of Selective Activation Checkpoint. 
\ No newline at end of file diff --git a/_posts/2025-03-06-peak-performance-minimized-memory.md b/_posts/2025-03-06-peak-performance-minimized-memory.md new file mode 100644 index 000000000000..6271d6412aff --- /dev/null +++ b/_posts/2025-03-06-peak-performance-minimized-memory.md @@ -0,0 +1,152 @@ +--- +layout: blog_detail +title: "Peak Performance, Minimized Memory: Optimizing torchtune’s performance with torch.compile & Liger Kernel" +author: LinkedIn and Meta +--- + +**LinkedIn**: Shivam Sahni, Byron Hsu, Yanning Chen +**Meta**: Ankith Gunapal, Evan Smothers + +This blog explores the integration of a custom triton kernel, Liger Kernel with `torch.compile` to enhance the performance of fine-tuning large language models (LLMs) using torchtune. torchtune, a PyTorch-native library, offers modular building blocks and customizable finetuning recipes which include `torch.compile` support for various LLMs, while Liger Kernel provides optimized Triton kernels to improve training efficiency and reduce memory usage. The integration involves modifying the `TransformerDecoder` module in torchtune to bypass the linear layer computation, allowing the Liger Fused Linear Cross Entropy Loss to handle the forward projection weights. Experiments conducted on an NVIDIA A100 instance demonstrate that `torch.compile` outperforms PyTorch Eager in throughput and memory efficiency, with Liger Kernel further reducing peak memory allocation and enabling larger batch sizes. The results show a 47% reduction in peak memory at batch size 256 and a marginal increase in throughput with `meta-llama/Llama-3.2-1B` , confirming the effectiveness of the integration without affecting the loss curves. + + +## Introduction to torchtune + +torchtune is a PyTorch-native library which has been designed for finetuning LLMs. torchtune provides composable and modular building blocks along with finetuning recipes that can be easily customized for your use case, as will be shown in this blog. \ +torchtune provides: + + + +* PyTorch implementations of popular LLM model architectures from Llama, Gemma, Mistral, Phi, and Qwen model families +* Hackable training recipes for full finetuning, LoRA, QLoRA, DPO, PPO, QAT, knowledge distillation, and more +* Out-of-the-box memory efficiency, performance improvements, and scaling with the latest PyTorch APIs, including `torch.compile` +* YAML configs for easily configuring training, evaluation, quantization or inference recipes +* Built-in support for many popular dataset formats and prompt templates + + +## Introduction to Liger Kernel + +Liger Kernel is an open source library of optimized Triton kernels designed to enhance the efficiency and scalability of training Large Language Models (LLMs). It focuses on kernel-level optimizations such as operation fusing and input chunking, achieving significant improvements in training throughput and GPU memory usage compared to existing implementations like those from HuggingFace. By using a single line of code, Liger Kernel can improve [training throughput by 20% and reduce memory usage by 60%](https://www.linkedin.com/blog/engineering/open-source/liger-kernel-open-source-ecosystem-for-efficient-llm-training). + + +![Fused Linear Cross Entropy](/assets/images/peak-performance-minimized-memory/fg1.png){:style="width:100%"} + +
*Figure 1: Fused Linear Cross Entropy*
    + + +The bulk of LIger Kernel’s performance improvement comes from the Fused Linear Cross Entropy (FLCE) Loss, whose core idea is as follows: + +In LLMs, the vocabulary size has increased significantly, leading to a large logit tensor during cross-entropy (CE) loss computation. This logit tensor consumes excessive memory, causing a bottleneck in training. For example, when training with a batch size of 8 and sequence length of 4096, the 256k vocabulary size results in a 16.8 GB logit tensor. The FLCE kernel breaks down the computation into smaller chunks, reducing memory consumption. + +Here's how it works: + + + +1. Flattens the 3D hidden states into a 2D matrix by collapsing the batch size and sequence length dimensions. +2. Applies the linear projection head sequentially on the chunked hidden states. +3. Computes the partial loss and returns the chunked logits gradient using the Liger CE kernel. +4. Derives the chunked hidden states gradients and accumulates the projection head gradients. + +Torchtune’s recipes provide `torch.compile` support out of the box. It has been shown that utilizing `torch.compile` with FLCE makes [FLCE 2x faster](https://github.com/linkedin/Liger-Kernel/issues/227). + + +## Integrating Liger Kernel with torch.compile & torchtune + +We demonstrate integration of Liger Kernel with `torch.compile` & torchtune by running a full fine-tuning recipe with `meta-llama/Llama-3.2-1B`. To make this integration happen, we have defined a custom full finetuning recipe, the details of the changes are mentioned below. + + +``` +CUDA_VISIBLE_DEVICES=0,1,2,3 tune run --nproc_per_node 4 recipes/full_finetune_distributed.py --config llama3_2/1B_full optimizer=torch.optim.AdamW optimizer.fused=True optimizer_in_bwd=False gradient_accumulation_steps=1 dataset.packed=True compile=True enable_activation_checkpointing=True tokenizer.max_seq_len=512 batch_size=128 +``` + + +One of the inputs to the LCE Kernel is the forward projection weights. torchtune is designed as a modular library with composable blocks. There is a `TransformerDecoder` [block](https://github.com/pytorch/torchtune/blob/main/torchtune/modules/transformer.py#L322) where at the end of the block, we pass the final hidden state through a linear layer to get the final output. Since the linear layer is combined with the CE loss in LCE Kernel, we write a custom `forward` function for `TransformerDecoder` where we skip the computation through the linear layer. 
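To make the chunking idea above concrete, here is a rough, purely illustrative sketch in plain PyTorch. It is not the Liger implementation (which is a fused Triton kernel and also avoids materializing the full logits gradient), but it shows the shape of the computation; the recipe changes below then wire the real `LigerFusedLinearCrossEntropyLoss` into torchtune.

```
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, proj_weight, labels, num_chunks=8):
    # hidden: [batch, seq, hidden_dim] final hidden states (input to the LM head)
    # proj_weight: [vocab_size, hidden_dim] output projection weight
    # labels: [batch, seq] target token ids (padding/ignore_index handling omitted)
    hidden = hidden.reshape(-1, hidden.shape[-1])  # collapse batch and seq dims
    labels = labels.reshape(-1)
    total = torch.zeros((), device=hidden.device, dtype=torch.float32)
    for h_chunk, y_chunk in zip(hidden.chunk(num_chunks), labels.chunk(num_chunks)):
        logits = h_chunk @ proj_weight.T  # only a chunk of the logits is live at a time
        total = total + F.cross_entropy(logits.float(), y_chunk, reduction="sum")
    return total / labels.numel()
```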
+ +In the full finetuning recipe, we override the model's forward method with this custom method + + +``` +import types +from liger_kernel.torchtune.modules.transformers import decoder_forward +self._model.forward = types.MethodType(decoder_forward, self._model) +``` + + +We then pass the model's forward projection weights to calculate the loss with LCE Kernel + + +``` +from liger_kernel.transformers.fused_linear_cross_entropy import ( + LigerFusedLinearCrossEntropyLoss, +) + +# Use LCE loss instead of CE loss +self._loss_fn = LigerFusedLinearCrossEntropyLoss() + +# call torch.compile on the loss function +if self._compile: + training.compile_loss(self._loss_fn, verbose=self._is_rank_zero) + +# pass the model's forward projection weights for loss computation +current_loss = ( + self._loss_fn( + self._model.output.tied_module.weight, + logits, + labels, + ) + * current_num_tokens + ) +``` + + +The complete code and instructions can be found in the [GitHub repo](https://github.com/pytorch-labs/applied-ai/tree/liger_kernel/third_party). + + +## Experiments & Benchmarking Results + +We conduct 3 types of experiments to demonstrate how Liger Kernel integration with `torch.compile` enhances the performance of torchtune. We set up our experiments on an instance running NVIDIA A100. We fine-tune a small LLM `meta-llama/Llama-3.2-1B `with differing batch sizes. We record the throughput in terms of tokens/second and measure the peak memory allocated during finetuning. Since it's a small model, we only use 4 A100 GPUs for the benchmarking. The following are the experiments we conducted: + + + +1. Increase batch_size in powers of 2 with PyTorch eager +2. Increase batch_size in powers of 2 with torch.compile +3. Increase batch_size in powers of 2 with torch.compile & Liger integration + +We notice that with PyTorch Eager, throughput increases with increasing batch_size till we hit OOM at batch_size 256. With `torch.compile`, the throughput is higher than PyTorch Eager for each batch_size. We see that the peak memory allocation reduces drastically with increasing batch_size and more than 50% reduction in peak memory at batch_size 128. This results in `torch.compile` being able to support batch_size 256 and hence, the overall throughput with `torch.compile` being 36% greater than PyTorch Eager. Integrating Liger Kernel with `torch.compile` doesn’t drop the throughput at lower batch_size but with increasing batch_size, we notice that torchtune is consuming less memory compared to torch.compile. At batch_size 256, we see a 47% reduction in peak memory allocation with the Liger kernel. This allows us to use batch_size 512 with `torch.compile` & Liger. We notice that there is a marginal 1-2% increase in throughput compared to `torch.compile` without custom triton kernels. + + +![Plot of tokens/sec per rank vs batch_size](/assets/images/peak-performance-minimized-memory/fg2.png){:style="width:100%"} + +
*Figure 2: Plot of tokens/sec per rank vs batch_size*
    + +![Peak memory allocated vs batch_size](/assets/images/peak-performance-minimized-memory/fg3.png){:style="width:100%;margin-top: 60px;"} + +
*Figure 3: Peak memory allocated vs batch_size*
    + +To rule out any potential functional issues with our integration of Liger Kernel with torchtune, we plot the loss curve against training steps with & without Liger. We see that there is no visible difference in the loss curves. + + +![Plot of loss vs training steps for batch_size=128](/assets/images/peak-performance-minimized-memory/fg4.png){:style="width:100%"} + +
*Figure 4: Plot of loss vs training steps for batch_size=128*
    + + +## Next Steps + + + +* Enable Liger kernels for [DPO loss](https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/chunked_loss/dpo_loss.py#L7) and [distillation loss](https://github.com/linkedin/Liger-Kernel/blob/main/src/liger_kernel/chunked_loss/fused_linear_distillation.py#L9) in torchtune’s recipes for [DPO](https://pytorch.org/torchtune/main/recipes/dpo.html) and [knowledge distillation](https://pytorch.org/blog/llama-into-torchtune/), respectively. +* Support Liger integration in torchtune with [tensor parallel training](https://github.com/pytorch/torchtune/pull/2330). + + +## Acknowledgments + +We thank Hamid Shojanazeri (Meta), Less Wright (Meta), Horace He (Meta) & Gregory Chanan (Meta) for their feedback and support in making this blog post happen. diff --git a/_posts/2025-03-07-pt-fedora-os-communities.md b/_posts/2025-03-07-pt-fedora-os-communities.md new file mode 100644 index 000000000000..77081b55ea04 --- /dev/null +++ b/_posts/2025-03-07-pt-fedora-os-communities.md @@ -0,0 +1,52 @@ +--- +layout: blog_detail +title: "Powering AI with PyTorch, Fedora, and Open Source Communities" +author: Sudhir Dharanendraiah +hidden: true +--- + + +![man speaking at a conference](/assets/images/pt-fedora-os-communities/fg1.jpg){:style="width:100%"} + + +At [DevConf.IN 2025](https://www.devconf.info/in/) in Pune, I had the opportunity to host a **[PyTorch Meetup](https://pretalx.devconf.info/devconf-in-2025/talk/W3YURM/)** on February 28th. The session, titled "**Powering AI with PyTorch, Fedora, and Open Source Communities**" was aimed at introducing PyTorch to students and professionals, explaining why **PyTorch+Fedora** form an ideal AI development platform. The other key aspect I covered was collaboration between open source communities. + + +## Introduction to PyTorch + + +## The Power of Deep Learning made simple + + +With the explosion of GPTs, there is a renowned interest in the field of AI and ML. The myth of developing AI/ML technologies and its applications is rocket science and far-fetched, needs correction. Only open source has the power to demystify this myth and further evolve the technology to make it versatile and developer friendly. Since its inception, PyTorch has evolved and has been a driving force to make AI/ML development extremely simple. I covered the aspects of PyTorch key components, its features and why PyTorch is the best choice as a deep learning framework. + + +![man speaking at a conference](/assets/images/pt-fedora-os-communities/fg2.jpg){:style="width:100%"} + + + +The codewalk through was designed to showcase how easy and simple it is to utilise the power of GPUs, creating a simple neural network and training the model. The code walkthrough was very well received and it was great to hear back from the attendees that they never knew how powerful PyTorch is for deep learning. The real world examples sighted how this powerful framework can be used beyond the common GPTs and has the power to influence across a broad spectrum of applications. 
+ + +## Fedora+PyTorch the Ideal AI/ML Development Platform + +![man speaking at a conference](/assets/images/pt-fedora-os-communities/fg3.jpg){:style="width:100%"} + +![man speaking at a conference](/assets/images/pt-fedora-os-communities/fg4.jpg){:style="width:100%"} + + +One of the highlights of the event was the discussion on Fedora’s role as an AI platform. Fedora’s reliability, flexibility, and strong community support make it an ideal partner for PyTorch, allowing developers to focus on model-building without worrying about infrastructure. The students were intrigued by the idea of contributing to Fedora’s AI/ML ecosystem while building their own projects. Sumantro Mukherjee spoke about the AI policy in Fedora and how one can start contributing to the AI/ML using Fedora as a platform. He highlighted how Fedora is evolving to meet the needs of AI practitioners. The idea that an open-source operating system could provide the perfect foundation for AI research sparked an engaging conversation. + + +## Innovation in Open Source When Communities Come Together + +![charts](/assets/images/pt-fedora-os-communities/fg5.jpg){:style="width:100%"} + +It is important that we learn from history and repeat the good things! When open source communities come together they can create seismic shifts in the industry. To drive this home, I took the audience on a journey through history, revisiting a pivotal moment when Apache and Linux came together, solving common problems and fundamentally reshaping enterprise computing. That moment was not just about technology; it was about collaboration. It was about two powerful communities recognizing that they were stronger together. Today, we stand at the cusp of another such moment - PyTorch and Linux, particularly Fedora, are coming together to shape the future of AI/ML. This is not just an opportunity but a responsibility for contributors, developers, and AI/ML enthusiasts to be part of this movement. + + +## Looking Ahead + +![man speaking at a conference](/assets/images/pt-fedora-os-communities/fg6.jpg){:style="width:100%"} + +One of the best parts of the event was the enthusiasm it generated. Diverse audience, including students, AI enthusiasts, and industry professionals. Notably, Vincent Caldeira (CTO, APAC, Red Hat) and Chris Butler (Senior Principal Chief Architect, Red Hat) were present, reinforcing the growing interest in open-source AI/ML. Many students were eager to explore PyTorch and Fedora, contribute to open-source AI projects, and start their own AI experiments. Industry experts saw the potential for scalable, community-driven AI innovation. The session sparked curiosity and conversations that continued long after the event ended. \ No newline at end of file diff --git a/_posts/2025-03-11-scaling-recommendation-2d-sparse-parallelism.md b/_posts/2025-03-11-scaling-recommendation-2d-sparse-parallelism.md new file mode 100644 index 000000000000..230b3d0337bb --- /dev/null +++ b/_posts/2025-03-11-scaling-recommendation-2d-sparse-parallelism.md @@ -0,0 +1,219 @@ +--- +layout: blog_detail +title: "Scaling Recommendation Systems Training to Thousands of GPUs with 2D Sparse Parallelism" +author: "PyTorch Team at Meta: Chunzhi Yang, Rich Zhu, Zain Huda, Liangbei Xu, Xin Zhang, Jiyan Yang, Dennis van der Staay, Wang Zhou, Jin Fang, Jade Nie, Yuxi Hu" +--- + +At Meta, recommendation systems are the cornerstone of delivering relevant and personalized ads to billions of users globally. 
Through technologies like PyTorch's TorchRec, we've successfully developed solutions that enable model training across hundreds of GPUs. While these systems have served us well, recent research on scaling laws has revealed a compelling opportunity: we can achieve significantly better model performance by training dramatically larger neural networks. + +However, this insight presents us with a new challenge. Our current training infrastructure, though highly optimized for hundreds of GPUs, cannot efficiently scale to the thousands of GPUs needed to train these larger models. The leap from hundreds to thousands of GPUs introduces complex technical challenges, particularly around handling sparse operations in recommendation models. These challenges require fundamentally new approaches to distributed training, which we address with a novel parallelization strategy. + +**To address these issues, we introduced 2D embedding parallel, a novel parallelism strategy that overcomes the sparse scaling challenges inherent in training large recommendation models across thousands of GPUs. This is available today in TorchRec through the DMPCollection API.** This approach combines two complementary parallelization techniques: data parallelism for the sparse components of the model, and model parallelism for the embedding tables, leveraging TorchRec's robust sharding capabilities. By strategically integrating these techniques, we've created a solution that scales to thousands of GPUs and now powers Meta's largest recommendation model training runs. + +**What are the sparse scaling challenges?** + +We identified three key challenges that prevented us from naively scaling our model to thousands of GPUs: + +* **Imbalancing and straggler issue:** with more GPUs it’s harder to achieve balanced sharding, some ranks can have much heavier workload for embedding computations, which can slow down the entire training. +* **Communication across nodes:** As training jobs utilize an increased number of GPUs, the all-to-all communication bandwidth can drop under certain network topologies which can increase communication latency significantly. +* **Memory overhead:** The memory used by input features is often negligible, however, as we use thousands of GPUs, we can introduce larger input features and the memory requirements can become significant. + +With 2D embedding parallel, we can describe our new parallelism scheme like this, in this example we have 2 model replicas (Replica 1: GPU1/GPU3, Replica 2: GPU2/GPU4) + + +![Flow diagram](/assets/images/scaling-recommendation-2d-sparse-parallelism/fg1.png){:style="width:100%"} + +***Figure 1: Layout illustration of 2D Sparse Parallelism*** + +With 2D sparse parallelism we address these challenges, instead of sharding tables across all ranks, we first evenly divide all ranks into several parallel groups: + + + +1. Within each group, we use model parallel for the embedding tables, such as column-wise/row-wise sharding. At scale, for our largest tables, we have also developed a grid sharding, which shards embedding tables on the row and column dimension. +2. Across groups, we do data parallel, such that each rank in a group has its corresponding replica rank in the other groups (replica rank means storing the same embedding table shards). + 1. After each group has completed its own backward pass, we all reduce the embedding table weights across the replicas to keep them synchronized. 
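As a rough illustration of this layout, the two dimensions can be expressed with a `DeviceMesh` (assuming `torch.distributed` is already initialized across the four ranks of Figure 1, e.g. via torchrun); in practice the `DMPCollection` API described below constructs these groups for you:

```
from torch.distributed.device_mesh import DeviceMesh

# 2 sharding groups x 2 replicas for the 4-GPU layout in Figure 1.
# Rows are the sharding groups ([0, 2] and [1, 3]); columns are the replica
# pairs ([0, 1] and [2, 3]) that hold identical embedding shards.
mesh = DeviceMesh(
    "cuda",
    mesh=[[0, 2], [1, 3]],
    mesh_dim_names=("replicate", "shard"),
)

sharding_pg = mesh["shard"].get_group()     # model parallel: embedding table shards
replica_pg = mesh["replicate"].get_group()  # data parallel: weight all-reduce after backward
```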
+ +## Our production solution + +TorchRec is our library to build the sparse part of the recommendation models in native PyTorch. With the traditional API being DistributedModelParallel which applies model parallel to the embedding tables. We introduce a new API alongside it, known as DMPCollection, which serves as the main entry point for enabling 2D parallel on TorchRec models. We designed it to be as easy of a change as applying FSDP/DDP is. + +To understand what DMPCollection does, we have to understand what DistributedModelParallel (DMP) does first: + + + +1. Create embedding tables, known as EmbeddingBagCollection and EmbeddingCollections. +2. Generate a sharding plan with respect to GPU topology, embedding tables, memory available, input data, and more. +3. Wrap model with DMP and the associated sharding plan passed in. +4. DMP initializes and shards the embedding tables in accordance with the sharding plan. +5. On a train step, DMP takes an input batch, communicates it to the appropriate GPUs containing the embedding table shard of interest, looks up the value, and returns it back to the GPU that requested it. This is all done on the global process group, with some exceptions for special sharding (such as table row wise sharding) + +DistributedModelParallel was built for model parallel with many parts working under the assumption of sharding and working around the global world size. We need to change these parts in a way where we can introduce additional dimensions of parallelism without losing the optimizations and feature set of TorchRec. + +DMPCollection changes a few key parts to enable 2D parallel in an extensible way, + + + +* Generate sharding plans for the smaller sharding group once, once passed in we communicate to the appropriate ranks across the global group and remap the ranks to fit the new sharding group ranks. +* Create two new NCCL process groups, known as sharding and replica process groups. The sharding process group is passed into sharding and train step components of TorchRec. The replica process group is used for the weight and optimizer state synchronization, the all reduce call happens over this process group. + * The sub NCCL process groups allow us to efficiently communicate only between the ranks that are relevant for a particular comm. Each rank will have two associated process groups. + +To the user, the change is very simple, while taking away all the complexity around applying the parallelism strategies to the model. + +## How do we create these sharding and replication groups? + +These process groups are one of the keys to DMPCollection’s performant implementation. From our earlier diagram, we showed a simple 2x2 GPU setup, however, at scale, how do we assign which ranks are part of a given sharding group and what are their replica ranks across the sharding groups? + +Consider the following setup with 2 nodes, each with 4 GPUs. The sharding and replication groups under 2D parallel will be, + + + + + + + +
| Sharding Group | Sharding Ranks |
|----------------|----------------|
| 0              | 0, 2, 4, 6     |
| 1              | 1, 3, 5, 7     |

| Replication Group | Replication Ranks |
|-------------------|-------------------|
| 0                 | 0, 1              |
| 1                 | 2, 3              |
| 2                 | 4, 5              |
| 3                 | 6, 7              |
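The same assignment can be reproduced with a few lines of plain Python (illustrative only; DMPCollection derives these groups internally):

```
global_world_size = 8    # T: total trainers (2 hosts x 4 GPUs)
sharding_group_size = 4  # L: ranks each model replica is sharded over
num_groups = global_world_size // sharding_group_size  # G = 2

# Sharding group i takes every G-th rank starting at i: [i, G+i, 2G+i, ...]
sharding_groups = [
    [num_groups * r + i for r in range(sharding_group_size)]
    for i in range(num_groups)
]
# Each replication group is G consecutive ranks that hold identical shards.
replication_groups = [
    list(range(start, start + num_groups))
    for start in range(0, global_world_size, num_groups)
]

print(sharding_groups)     # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(replication_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```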
    + + +We use the following formulation, + + + +1. Divide all trainers into G sharding groups, each with L trainers + 1. Groups, G, is determined by G = T / L, where T is total number of trainers +2. For each group, G, we assigned non-contiguous trainer ranks based on the group it’s in, following, + 2. [i, G+i, 2G+i, ..., (L - 1) G+i], where* i = 0 to G-1* +3. From the groups, G, we can create the replication group, which is every G continuous ranks + 3. (0 to G-1, G to 2* G - 1) each continuous set stores the duplicate embedding table shards. + +This means our sharding groups, G, are of size L, which can be known as the number of ranks to apply model parallel across. This, in turn, gives us replica groups, each of size G, which are the ranks we data parallel across. + +In DMPCollection, we’re able to create these process groups efficiently with the use of DeviceMesh, we create the entire GPU topology in a 2x2 matrix, with each row representing the group of sharding ranks and each column representing the corresponding replica ranks, + +``` +create peer matrix +num_groups = global_world_size // sharding_group_size +for each group_rank in num_groups: + peers = [num_groups * rank + group_rank for rank in range(sharding_group_size)] + add peer to peer matrix + +initalize DeviceMesh with two dimensions (shard, replicate) +slice DeviceMesh on shard for sharding process group +slide DeviceMesh on replicate for replica process group +``` + +With our DeviceMesh approach, should we want to change the topology or provide further flexibility in the future, we can easily extend our creation logic to any form of topologies and even extend for further dimensions of parallelism if needed. + +## Performance of 2D parallel + +Our rank partitioning strategy optimizes communication patterns by strategically placing model replica ranks for each shard within the same compute node. This architecture provides significant performance benefits for the weight synchronization operation. After the backward pass, we perform all-reduce operations to synchronize model weights—which is an expensive process given the large parameter counts we have to communicate and sync—with our setup of placing replicas on the same node we leverage intra node’s high-bandwidth over-relying on slower inter-node bandwidth. + +The effect of this design choice on the other communication collectives generally improves the latencies. The improvement stems from two factors. + + + +1. By sharding the embedding tables over a reduced number of ranks and conducting communications for the model within the smaller group, we achieve a lower all-to-all latency. +2. With the replication in 2D parallel, our embedding lookup latency on a rank reduces, we can reduce the local batch size to 1/Nth of the equivalent global batch size, where N is the number of model replicas. + +A production model trace exemplifies these two factors, here we run the 2D parallel job on 1024 GPUs, with a sharding group size of 256 GPUs. + +![State diagram](/assets/images/scaling-recommendation-2d-sparse-parallelism/fg2.png){:style="width:100%"} + +***Figure 2: Comparing latencies between non 2D parallel and 2D parallel workloads*** + +There are two key levers users have to tune to maximize performance for their workloads: + + + +1. The size of the model sharding group relative to the global world size. The global world size divided by the sharding group size represents the number of model replicas we will have. + 1. 
To maximize performance, users can look to scale up their model up to 8x, this scaling factor maintains the intra-host all reduce. + 1. For further scaling, the all reduce would have to happen over inter host. From our experiments, we did not see an obvious performance regression and in fact note advantages of an inter host all reduce. We can change our sharding and replica topology to inter host all reduce, which can help us introduce fault tolerance strategies should a particular host go down. +2. Frequency of all reduce synchronization, DMPCollection comes with a sync() call, which can be tuned to be called every N training steps, performing a sort of local SGD training. With scale, reducing the frequency of synchronization can bring significant gains to performance. + +## Future Work + +Readers should note that 2D sparse parallel training differs from non-parallelized training because we synchronize the embedding table weights rather than the gradients. This approach is made possible by TorchRec's use of FBGEMM, which provides optimized kernels under the hood. One of FBGEMM's key optimizations is the fusion of the optimizer in the backward pass. Instead of fully materializing the embedding table gradients—which would consume significant memory—they are passed directly to the optimizer update. Attempting to materialize and synchronize these gradients would create substantial overhead, making that approach impractical. + +Our exploration revealed that to achieve training results comparable to the baseline, we synchronize optimizer states on a delayed schedule, with the timing dependent on the number of sharding/replica groups (ie: for Adagrad we update the momentum behind by one sync step). This approach also enables users to implement local SGD or semi-synchronized training strategies, which can achieve convergence and potentially produce better loss curves than the baseline. + +We thank you for reading our post! This is an exciting direction we have come across that we hope to develop further to maximize performance of recommendation systems and push the state of the art. + + \ No newline at end of file diff --git a/_posts/2025-03-13-pytorch-landscape.md b/_posts/2025-03-13-pytorch-landscape.md new file mode 100644 index 000000000000..4cc3687be952 --- /dev/null +++ b/_posts/2025-03-13-pytorch-landscape.md @@ -0,0 +1,44 @@ +--- +layout: blog_detail +title: "Introducing the New PyTorch Landscape: Your Guide to the PyTorch Ecosystem" +--- + +We’re excited to reveal our brand new PyTorch Landscape. The [PyTorch Landscape](https://landscape.pytorch.org/) helps researchers, developers, and organizations easily locate useful, curated, community-built tools that augment the PyTorch core framework. + + +landscape banner + +## What the Landscape Offers + +The Landscape visually organizes projects into three categories—Modeling, Training, and Optimizations—making finding relevant frameworks, libraries, and projects easy. Users can quickly locate curated, valuable tools for a variety of use cases that complement the PyTorch framework. Each tool that is part of the Landscape has been reviewed and vetted by PyTorch project experts. The projects in the Landscape are considered to be mature and healthy and provide valuable capabilities that complement the PyTorch framework in their respective domains. + + +## Explore the AI Landscape + +The **Explore** page presents platforms, tools, and libraries, each with a logo, description, and links to GitHub and further details. 
This categorized, visual approach simplifies discovery and provides quick access to essential technologies. + + +## Guide Page: A Closer Look + +For deeper insights, the **Guide** page expands on each project, highlighting methodologies and trends shaping AI development, from adversarial robustness to self-supervised learning. There are also project statistics provided for each project, including metrics such as number of stars, contributors, commit history, languages used, license, and other valuable metrics that provide an in-depth understanding of the project and how it may be used. + + +## Tracking AI’s Growth: The Stats Page + +The **Stats** page provides insights into AI development trends, tracking repository activity, programming languages, and industry funding data. + +* Repositories: 117 repositories, 20.5k contributors, and 797.2k stars across 815MB of source code. +* Development Trends: Weekly commit activity over the last year. +* Licensing Breakdown: Repositories are categorized by license type. +* Funding & Acquisitions: Insights into investment trends, including funding rounds and acquisitions. + + +## Why Use the PyTorch Landscape? + +Finding useful and high quality open source projects that complement the PyTorch core system can be overwhelming. The PyTorch Landscape offers a clear, accessible way to explore the ecosystem of community-built tools, whether you're researching, building models, or making strategic decisions. + +Stay ahead with the [PyTorch Landscape](https://landscape.pytorch.org/) — your guide to the PyTorch Ecosystem. + +## Want to Contribute a Project to the PyTorch Landscape? + +Have you built a useful open source tool that you would like to share with the PyTorch community? Then help us grow the Ecosystem by contributing your tool! You can find the [instructions to apply here](https://github.com/pytorch-fdn/ecosystem). We welcome all contributions from the community! \ No newline at end of file diff --git a/_posts/2025-03-16-pytorch-at-gtc.md b/_posts/2025-03-16-pytorch-at-gtc.md new file mode 100644 index 000000000000..94be8a113f5f --- /dev/null +++ b/_posts/2025-03-16-pytorch-at-gtc.md @@ -0,0 +1,109 @@ +--- +layout: blog_detail +title: "PyTorch at GTC 2025" +author: "Team PyTorch at NVIDIA" +hidden: true +--- + +[GTC](https://www.nvidia.com/gtc/) is coming back to San Jose on March 17–21, 2025. Join PyTorch Foundation members Arm, AWS, Google Cloud, IBM, Lightning AI, Meta, Microsoft Azure, Snowflake, and thousands of developers as we celebrate PyTorch. Together learn how AI & accelerated computing are helping humanity solve our most complex challenges. + +Join in person with [discounted GTC registration](https://www.nvidia.com/gtc/?ncid=GTC-NVI0K8HVX) for PyTorch Foundation or [watch online](https://register.nvidia.com/flow/nvidia/gtcs25/registration/) with free registration. 
+ + +![book cover](/assets/images/pytorch-at-gtc.jpg){:style="max-width:500px; display: block; margin-left: auto; margin-right: auto"} + + +### [Scaling Open Source AI: From Foundation Models to Ecosystem Success](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1738966749087001K1dG) + +Hear from PyTorch Foundation’s Executive Director Matt White & panelists from UC Berkeley, Meta, NVIDIA, & Sequoia Capital how open source is transforming AI development, bringing together experts from industry, academia, and venture capital to discuss the technical and business aspects of collaborative open source AI development They’ll examine how open source projects like PyTorch, vLLM, Ray, and NVIDIA's NeMo are accelerating AI innovation while creating new opportunities for businesses and researchers. They'll share real-world experiences from PyTorch's development, Berkeley's research initiatives, and successful AI startups. Take away valuable insights into the technical and business aspects of open source AI. – Monday, Mar 17 10:00 AM - 11:00 AM PDT + + +## PyTorch @ GTC + +[The Performance of CUDA with the Flexibility of PyTorch ](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1726155993061001WWZM) +Mark Saroufim, Software Engineer, Meta Platforms + +This talk explores how PyTorch users are also becoming CUDA developers. We'll start with motivating examples from eager, the launch of torch.compile and the more recent trend of kernel zoos. We will share details on how we went about integrating low bit matmuls in torchao and the torch.compile CUTLASS backend. We'll also discuss details on how you can define, build and package your own custom ops in PyTorch so you get the raw performance of CUDA while maintaining the flexibility of PyTorch. + +[Make My PyTorch Model Fast, and Show Me How You Did It](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1727978036338001UVLu) +Thomas Viehmann, Principal Research Engineer, Lightning AI +Luca Antiga, CTO, Lightning AI + +PyTorch is popular in deep learning and LLMs for richness and ease of expressions. To make the most of compute resources, PyTorch models benefit from nontrivial optimizations, but this means losing some of their ease and understandability. Learn how with Thunder, a PyTorch-to-Python compiler focused on usability, understandability, and extensibility, you can optimize and transform (i.e., distribute across many machines) models while • leaving the PyTorch code unchanged • targeting a variety of models without needing to adapt to each of them • understanding each transformation step because the results are presented as simple Python code • accessing powerful extension code for your own optimizations with just one or a few lines of code We'll show how the combination of Thunder transforms and the NVIDIA stack (NVFuser, cuDNN, Apex) delivers optimized performance in training and inference on a variety of models. 
+ +[FlexAttention: The Flexibility of PyTorch With the Performance of FlashAttention](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1726184633014001Jh5G) +Driss Guessous, Machine Learning Engineer, Meta Platforms + +Introducing FlexAttention: a novel PyTorch API that enables custom, user-defined attention mechanisms with performance comparable to state-of-the-art solutions. By leveraging the PyTorch compiler stack, FlexAttention supports dynamic modifications to attention scores within SDPA, achieving both runtime and memory efficiency through kernel fusion with the FlashAttention algorithm. Our benchmarks on A100 GPUs show FlexAttention achieves 90% of FlashAttention2's performance in forward passes and 85% in backward passes. On H100 GPUs, FlexAttention's forward performance averages 85% of FlashAttention3 and is ~25% faster than FlashAttention2, while backward performance averages 76% of FlashAttention3 and is ~3% faster than FlashAttention2. Explore how FlexAttention balances near-state-of-the-art performance with unparalleled flexibility, empowering researchers to rapidly iterate on attention mechanisms without sacrificing efficiency. + +[Keep Your GPUs Going Brrr : Crushing Whitespace in Model Training](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1731693095418001cruA) +Syed Ahmed, Senior Software Engineer, NVIDIA +Alban Desmaison, Research Engineer, Meta +Aidyn Aitzhan, Senior Software Engineer, NVIDIA + +Substantial progress has recently been made on the compute-intensive portions of model training, such as high-performing attention variants. While invaluable, this progress exposes previously hidden bottlenecks in model training, such as redundant copies during collectives and data loading time. We'll present recent improvements in PyTorch achieved through Meta/NVIDIA collaboration to tackle these newly exposed bottlenecks and how practitioners can leverage them. + +[Accelerated Python: The Community and Ecosystem](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1727176757800001qp7T) +Andy Terrel, CUDA Python Product Lead, NVIDIA +Jeremy Tanner, Open Source Programs, NVIDIA +Anshuman Bhat, CUDA Product Management, NVIDIA + +Python is everywhere. Simulation, data science, and Gen AI all depend on it. Unfortunately, the dizzying array of tools leaves a newcomer baffled at where to start. We'll take you on a guided tour of the vibrant community and ecosystem surrounding accelerated Python programming. Explore a variety of tools, libraries, and frameworks that enable efficient computation and performance optimization in Python, including CUDA Python, RAPIDS, Warp, and Legate. We'll also discuss integration points with PyData, PyTorch, and JAX communities. Learn about collaborative efforts within the community, including open source projects and contributions that drive innovation in accelerated computing. We'll discuss best practices for leveraging these frameworks to enhance productivity in developing AI-driven applications and conducting large-scale data analyses. 
+ +[Supercharge large scale AI with Google Cloud AI hypercomputer (Presented by Google Cloud)](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1734571562315001xMKM) +Deepak Patil, Product Manager, Google Cloud +Rajesh Anantharaman, Product Management Lead, ML Software, Google Cloud + +Unlock the potential of your large-scale AI workloads with Google Cloud AI Hypercomputer – a supercomputing architecture designed for maximum performance and efficiency. In this session, we will deep dive into PyTorch and JAX stacks on Google Cloud on NVIDIA GPUs, and showcase capabilities for high performance foundation model building on Google Cloud. + +[Peering Into the Future: What AI and Graph Networks Can Mean for the Future of Financial Analysis](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1739906058885001OxEF) +Siddharth Samsi, Sr. Solutions Architect, NVIDIA +Sudeep Kesh, Chief Innovation Officer, S&P Global + +Artificial Intelligence, agentic systems, and graph neural networks (GNNs) are providing the new frontier to assess, monitor, and estimate opportunities and risks across work portfolios within financial services. Although many of these technologies are still developing, organizations are eager to understand their potential. See how S&P Global and NVIDIA are working together to find practical ways to learn and integrate such capabilities, ranging from forecasting corporate debt issuance to understanding capital markets at a deeper level. We'll show a graph representation of market data using the PyTorch-Geometric library and a dataset of issuances spanning three decades and across financial and non-financial industries. Technical developments include generation of a bipartite graph and link-prediction GNN forecasting. We'll address data preprocessing, pipelines, model training, and how these technologies can broaden capabilities in an increasingly complex world. + +[Unlock Deep Learning Performance on Blackwell With cuDNN](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1727984645671001Y9eq) +Yang Xu (Enterprise Products), DL Software Engineering Manager, NVIDIA + +Since its launch, cuDNN, a library for GPU-accelerating deep learning (DL) primitives, has been powering many AI applications in domains such as conversational AI, recommender systems, and speech recognition, among others. CuDNN remains a core library for DL primitives in popular frameworks such as PyTorch, JAX, Tensorflow, and many more while covering training, fine-tuning, and inference use cases. Even in the rapidly evolving space of Gen AI — be it Llama, Gemma, or mixture-of-experts variants requiring complex DL primitives such as flash attention variants — cuDNN is powering them all. Learn about new/updated APIs of cuDNN pertaining to Blackwell’s microscaling format, and how to program against those APIs. We'll deep dive into leveraging its graph APIs to build some fusion patterns, such as matmul fusion patterns and fused flash attention from state-of-the-art models. 
Understand how new CUDA graph support in cuDNN, not to be mistaken with the cuDNN graph API, could be exploited to avoid rebuilding CUDA graphs, offering an alternative to CUDA graph capture with real-world framework usage. + +[Train and Serve AI Systems Fast With the Lightning AI Open-Source Stack (Presented by Lightning AI)](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1736347047099001au7y) +Luca Antiga, CTO, Lightning AI + +See how the Lightning stack can cover the full life cycle, from data preparation to deployment, with practical examples and particular focus on distributed training and high-performance inference. We'll show examples that focus on new features like support for multi-dimensional parallelism through DTensors, as well as quantization through torchao. + + +## Connect With Experts (Interactive Sessions) + +[Meet the Experts From Deep Learning Framework Teams ](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1728516848639001tO7H) +Eddie Yan, Technical Lead of PyTorch, NVIDIA +Masaki Kozuki, Senior Software Engineer in PyTorch, NVIDIA +Patrick Wang (Enterprise Products), Software Engineer in PyTorch, NVIDIA +Mike Ruberry, Distinguished Engineer in Deep Learning Frameworks, NVIDIA +Rishi Puri, Sr. Deep Learning Engineer and Lead for PyTorch Geometric, NVIDIA + + +## Training Labs + +[Kernel Optimization for AI and Beyond: Unlocking the Power of Nsight Compute ](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1726073884811001C0za) +Felix Schmitt, Sr. System Software Engineer, NVIDIA +Peter Labus, Senior System Software Engineer, NVIDIA + +Learn how to unlock the full potential of NVIDIA GPUs with the powerful profiling and analysis capabilities of Nsight Compute. AI workloads are rapidly increasing the demand for GPU computing, and ensuring that they efficiently utilize all available GPU resources is essential. Nsight Compute is the most powerful tool for understanding kernel execution behavior and performance. Learn how to configure and launch profiles customized for your needs, including advice on profiling accelerated Python applications, AI frameworks like PyTorch, and optimizing Tensor Core utilization essential to modern AI performance. Learn how to debug your kernel and use the expert system built into Nsight Compute, known as “Guided Analysis,” that automatically detects common issues and directs you to the most relevant performance data all the way down to the source code level. + +[Make Retrieval Better: Fine-Tuning an Embedding Model for Domain-Specific RAG](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1725042189130001cmoW) +Gabriel Moreira, Sr. Research Scientist, NVIDIA +Ronay Ak, Sr. Data Scientist, NVIDIA + +LLMs power AI applications like conversational chatbots and content generators, but are constrained by their training data. This might lead to hallucinations in content generation, which requires up-to-date or domain-specific information. Retrieval augmented generation (RAG) addresses this issue by enabling LLMs to access external context without modifying model parameters. 
Embedding or dense retrieval models are a key component of a RAG pipeline for retrieving relevant context to the LLM. However, an embedding model’s effectiveness to capture the unique characteristics of the custom data hinges on the quality and domain relevance of its training data. Fine-tuning embedding models is gaining interest to provide more accurate and relevant responses tailored to users’ specific domain. + +In this lab, you'll learn to generate a synthetic dataset with question-context pairs from a domain-specific corpus, and process the data for fine-tuning. Then, fine-tune a text embedding model using synthetic data and evaluate it. + + +## Poster Presentations + +[Single-View X-Ray 3D Reconstruction Using Neural Back Projection and Frustum Resampling](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1729781473379001KiPD) +Tran Minh Quan, Developer Technologist, NVIDIA + +[Enable Novel Applications in the New AI Area in Medicine: Accelerated Feature Computation for Pathology Slides](https://www.nvidia.com/gtc/session-catalog/?regcode=no-ncid&ncid=no-ncid&tab.catalogallsessionstab=16566177511100015Kus&search=pytorch#/session/1729757102989001KDG4) +Nils Bruenggel, Principal Software Engineer, Roche Diagnostics Int. AG \ No newline at end of file diff --git a/_posts/2025-03-19-pt-day-china-2025-cfp.md b/_posts/2025-03-19-pt-day-china-2025-cfp.md new file mode 100644 index 000000000000..44f98dfd7ee1 --- /dev/null +++ b/_posts/2025-03-19-pt-day-china-2025-cfp.md @@ -0,0 +1,60 @@ +--- +layout: blog_detail +title: "PyTorch Day China 2025 Call for Proposals Open" +--- + +We’re excited to announce the **first-ever [PyTorch Day China](https://www.lfasiallc.com/pytorch-day-china/)**! This new event, hosted by the PyTorch Foundation, will take place on **June 7 in Beijing, China**, bringing together AI practitioners, researchers, and industry professionals to explore the latest advancements in open source AI and machine learning. Co-located with the **BAAI Conference**, PyTorch Day China is a chance to connect with the community, share knowledge, and help shape the future of deep learning. + + +![PyTorch Day China 2025 Call for Proposals Open](/assets/images/pt-day-china-2025-cfp.jpg){:style="max-width:500px; display: block; margin-left: auto; margin-right: auto"} + + +## Why Submit a Proposal? + +PyTorch Day China offers a platform for AI practitioners and researchers to showcase their work, exchange ideas, and connect with others in the community. If you're working on innovative applications, tools, or research in the PyTorch ecosystem, we encourage you to share your expertise. + + +## Topics for Submission: + + + +* AI Applications and Use Cases +* Core PyTorch Framework +* DL Compilers and Kernel Authoring +* Edge AI and On-Device +* Ethical AI, Governance, and Regulation +* Generative AI and Large Language Models (LLMs) with PyTorch +* Open Source Collaboration, Education, and Community Building +* Optimization for Training and Inference +* PyTorch on Accelerator Hardware +* PyTorch Ecosystem and Tools +* PyTorch in Research and Academia +* Performance Measurement and Benchmarking +* Scaling Training and Inference + +**The submission deadline is April 13. 
Submit and learn more here:** [https://www.lfasiallc.com/pytorch-day-china/call-for-proposals-cfp/](https://www.lfasiallc.com/pytorch-day-china/call-for-proposals-cfp/) + + +## Why Attend? + +PyTorch Day China will feature **technical talks, discussions, and poster sessions** that highlight real-world applications and developments in AI and machine learning. Attendees will have the opportunity to learn from experts, contribute to the open source community, and engage with fellow PyTorch users. Registration information will be available in April. + + +## Event Details + +* **Date:** June 7, 2025 +* **Location:** Zhongguancun Exhibition Center, Beijing, China +* **Address:** 索家坟, Hai Dian Qu, Bei Jing Shi, China, 100080 +* **Co-located with:** BAAI Conference + + +## Travel Information + +The venue, **Zhongguancun Exhibition Center**, is approximately **39 km from Beijing International Airport**. More details on travel and accommodation will be available on the **BAAI Conference website** and updated here as they become available. + + +## Have Questions? + +For inquiries, please contact pytorchevents@linuxfoundation.org. + +Submit your proposal by **April 13** and join the conversation shaping the future of PyTorch. \ No newline at end of file diff --git a/_posts/2025-03-19-sglang-joins-pytorch.md b/_posts/2025-03-19-sglang-joins-pytorch.md new file mode 100644 index 000000000000..1334a6b6a52c --- /dev/null +++ b/_posts/2025-03-19-sglang-joins-pytorch.md @@ -0,0 +1,105 @@ +--- +layout: blog_detail +title: "SGLang Joins PyTorch Ecosystem: Efficient LLM Serving Engine" +author: "SGLang Team" +hidden: true +--- + + +![sglang logo](/assets/images/sglang-join-pytorch/fg1.png){:style="max-width:400px; display: block; margin-left: auto; margin-right: auto"} + + +We’re thrilled to announce that the SGLang project has been integrated into the PyTorch ecosystem! This integration ensures that SGLang aligns with PyTorch’s standards and practices, providing developers with a reliable and community-supported framework for fast and flexible serving of LLMs. + +To view the PyTorch Ecosystem, see the [PyTorch Landscape](https://landscape.pytorch.org/) and learn more about how projects can [join the PyTorch Ecosystem](https://github.com/pytorch-fdn/ecosystem). + + +## About SGLang + +SGLang is a fast-serving engine for large language models and vision language models. It makes the interaction with models faster and more controllable by co-designing the backend runtime and frontend language. + +The core features include: + +* Fast Backend Runtime: Provides efficient serving with RadixAttention for prefix caching, zero-overhead CPU scheduler, continuous batching, token attention (paged attention), speculative decoding, tensor parallelism, chunked prefill, structured outputs, and quantization (FP8/INT4/AWQ/GPTQ). +* Flexible Frontend Language: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions. +* Extensive Model Support: Supports a wide range of generative models (Llama, Gemma, Mistral, Qwen, DeepSeek, LLaVA, etc.), embedding models (e5-mistral, gte, mcdse) and reward models (Skywork), with easy extensibility for integrating new models. +* Active Community: SGLang is open source and backed by an active community with industry adoption. + +SGLang is famous for its fast speed. 
It can often significantly outperform other state-of-the-art frameworks in terms of serving throughput and latency. You can learn more about the underlying techniques from the past release blog posts: [v0.2 blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/), [v0.3 blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/), [v0.4 blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/). + +SGLang has been widely adopted by leading industry companies and frontier research labs. For example, xAI uses SGLang to serve its flagship model, [Grok 3](https://grok.com/), which is currently the best model according to the Chatbot Arena leaderboard. Microsoft Azure uses SGLang to serve [DeepSeek R1](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) on AMD GPUs, which is currently the best open source model. + + +## Serving DeepSeek Models + +You can easily launch a Docker container to serve a DeepSeek model with the following command: + +``` +# Pull the latest image +docker pull lmsysorg/sglang:latest + +# Launch a server +docker run --gpus all --shm-size 32g -p 30000:30000 -v ~/.cache/huggingface:/root/.cache/huggingface --ipc=host --network=host --privileged lmsysorg/sglang:latest \ + python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000 +``` + +Then you can query the server with the OpenAI-compatible API + +``` +import openai +client = openai.Client(base_url=f"http://127.0.0.1:30000/v1", api_key="None") + +response = client.chat.completions.create( + model="deepseek-ai/DeepSeek-V3", + messages=[ + {"role": "user", "content": "List 3 countries and their capitals."}, + ], + temperature=0, + max_tokens=64, +) +``` + +The server launch command above works for 8xH200. You can find detailed instructions for other hardware (MI300X, H100, A100, H20, L40S) at https://docs.sglang.ai/references/deepseek.html. + +SGLang integrates DeepSeek-specific optimizations, such as MLA throughput optimizations, MLA-optimized kernels, data-parallel attention, multi-token prediction, and DeepGemm, making it the top choice for serving DeepSeek models by dozens of [companies](https://x.com/lmsysorg/status/1887262321636221412), including AMD, NVIDIA, and many cloud providers. The team is actively working on integrating more optimizations following the 2025 H1 roadmap below. + + +## Serving Llama Models + +Similarly, you can launch the server for a Llama 3.1 text model with: + +``` +python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct +``` + +Or a Llama 3.2 multimodal model with: + +``` +python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-11B-Vision-Instruct --chat-template=llama_3_vision +``` + + +## Roadmap + +This year, the SGLang team will continue to push the boundaries of system efficiency. You can find the roadmap of 2025H1 [here](https://github.com/sgl-project/sglang/issues/4042). The focus is + +- Throughput-oriented large-scale deployment similar to the DeepSeek inference system +- Long context optimizations +- Low latency speculative decoding +- Reinforcement learning training framework integration +- Kernel optimizations + +## Community + +SGLang has been deployed to large-scale production, generating trillions of tokens every day. 
It has an active community with over three hundred contributors on GitHub. It is supported by the following institutions: AMD, Atlas Cloud, Baseten, Cursor, DataCrunch, Etched, Hyperbolic, iFlytek, Jam & Tea Studios, LinkedIn, LMSYS, Meituan, Nebius, Novita AI, NVIDIA, RunPod, Stanford, UC Berkeley, UCLA, xAI, and 01.AI. + + +![logos](/assets/images/sglang-join-pytorch/fg2.png){:style="width:100%;"} + + + +## Conclusion + +We’re excited to welcome SGLang to the PyTorch ecosystem. SGLang accelerates the serving of large language and vision language models. It’s widely adopted by industry, powering the large-scale online serving of frontier models like Grok and DeepSeek. + +We invite you to explore the [SGLang GitHub repo](https://github.com/sgl-project/sglang/tree/main), join the [community on Slack](https://slack.mindee.com/), and reach out to [contact@sglang.ai](mailto:contact@sglang.ai) for inquiries or collaboration opportunities. Together, we can make powerful AI models accessible to everyone. \ No newline at end of file diff --git a/_posts/2025-04-03-pt-day-france-cfp.md b/_posts/2025-04-03-pt-day-france-cfp.md new file mode 100644 index 000000000000..9ed63b302833 --- /dev/null +++ b/_posts/2025-04-03-pt-day-france-cfp.md @@ -0,0 +1,58 @@ +--- +layout: blog_detail +title: "PyTorch Day France 2025: Call For Proposals Open" +--- + +We’re pleased to announce **[PyTorch Day France 2025](https://events.linuxfoundation.org/pytorch-day-france/)**, a dedicated gathering of the PyTorch community held **7 May 2025** in **Paris, France**. Proudly hosted by the **PyTorch Foundation** and co-located with **[GOSIM AI Paris 2025](https://paris2025.gosim.org/)**, this event will bring together developers, researchers, and practitioners driving innovation in open source AI and machine learning. + +Whether you're building cutting-edge models or contributing to the ecosystem, PyTorch Day France is your opportunity to connect, collaborate, and help shape the future of deep learning. + + + +![PT Day CFP](/assets/images/pt-day-cfp.png){:style="max-width:600px; display: block; margin-left: auto; margin-right: auto"} + + +## Why Attend? + +Set in the vibrant atmosphere of STATION F, the world’s largest startup campus, PyTorch Day France will offer a full day of: + +* Insightful Technical Talks +* Interactive Discussions +* Engaging Poster Sessions + +The event is designed to foster open exchange across the PyTorch ecosystem, providing a space to learn from peers, share practical insights, and explore the latest research and applications in AI. + + +## Submit a Proposal + +We are currently accepting proposals for talks. If you have a project, idea, or research story you'd like to share with the PyTorch community, we want to hear from you. + +📩 Email your **talk title and abstract** to [pytorchevents@linuxfoundation.org](mailto:pytorchevents@linuxfoundation.org) for consideration. + + +## Registration + +To register for PyTorch Day France, please visit the **GOSIM AI Paris website**, and use the code PYTORCHFRIEND to receive 25% off. + +👉 [https://paris2025.gosim.org/](https://paris2025.gosim.org/) + +We encourage early registration to secure your spot and ensure access to both PyTorch Day France and the broader GOSIM AI Paris programming. + + +## Venue + +STATION F +5 Parv. Alan Turing, 75013 Paris, France +A landmark of innovation and entrepreneurship in the heart of Paris. 
+ + +## Travel and Accommodations + +Participants are responsible for their own travel and lodging. For those arriving internationally, Paris Charles de Gaulle Airport is approximately 38.4 km from STATION F. Additional information about accommodations and transportation may be available on the [GOSIM AI Paris website](https://paris2025.gosim.org/). + + +## Questions? + +For any inquiries, please contact us at [pytorchevents@linuxfoundation.org](mailto:pytorchevents@linuxfoundation.org). + +We look forward to welcoming the PyTorch community to Paris this May for a day of collaboration, learning, and open source AI innovation. \ No newline at end of file diff --git a/_posts/2025-04-08-accelerating-whisper-arm-w-transformers.md b/_posts/2025-04-08-accelerating-whisper-arm-w-transformers.md new file mode 100644 index 000000000000..10db0cabc270 --- /dev/null +++ b/_posts/2025-04-08-accelerating-whisper-arm-w-transformers.md @@ -0,0 +1,39 @@ +--- +layout: blog_detail +title: "Accelerating Whisper on Arm with PyTorch and Hugging Face Transformers" +author: Pareena Verma, Arm +--- + +Automatic speech recognition (ASR) has revolutionized how we interact with technology, clearing the way for applications like real-time audio transcription, voice assistants, and accessibility tools. OpenAI Whisper is a powerful model for ASR, capable of multilingual speech recognition and translation. + +A new Arm Learning Path is now available that explains how to accelerate Whisper on Arm-based cloud instances using PyTorch and Hugging Face transformers. + +**Why Run Whisper on Arm?** + +Arm processors are popular in cloud infrastructure for their efficiency, performance, and cost-effectiveness. With major cloud providers such as AWS, Azure, and Google Cloud offering Arm-based instances, running machine learning workloads on this architecture is becoming increasingly attractive. + +**What You’ll Learn** + +The [Arm Learning Path](https://learn.arm.com/learning-paths/servers-and-cloud-computing/whisper/) provides a structured approach to setting up and accelerating Whisper on Arm-based cloud instances. Here’s what you cover: + +**1. Set Up Your Environment** + +Before running Whisper, you must set up your development environment. The learning path walks you through setting up an Arm-based cloud instance and installing all dependencies, such as PyTorch, Transformers, and ffmpeg. + +**2. Run Whisper with PyTorch and Hugging Face Transformers** + +Once the environment is ready, you will use the Hugging Face transformer library with PyTorch to load and execute Whisper for speech-to-text conversion. The tutorial provides a step-by-step approach for processing audio files and generating audio transcripts. + +**3. Measure and Evaluate Performance** + +To ensure efficient execution, you learn how to measure transcription speeds and compare different optimization techniques. The guide provides insights into interpreting performance metrics and making informed decisions on your deployment. + +**Try it Yourself** + +Upon completion of this tutorial, you know how to: + +* Deploy Whisper on an Arm-based cloud instance. +* Implement performance optimizations for efficient execution. +* Evaluate transcription speeds and optimize further based on results. + +**Try the live demo today** and see audio transcription in action on Arm: [Whisper on Arm Demo](https://learn.arm.com/learning-paths/servers-and-cloud-computing/whisper/_demo/). 
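For a rough idea of what the transcription step looks like in practice, here is a minimal sketch using the Hugging Face pipeline API with PyTorch. This is illustrative only and not the exact code from the learning path; the model size (openai/whisper-small) and the audio file name are placeholder choices, and ffmpeg needs to be installed for audio decoding:

```python
import torch
from transformers import pipeline

# Load Whisper via the ASR pipeline; device=-1 keeps inference on the CPU,
# which matches an Arm-based cloud instance without a GPU.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # placeholder; choose the size that fits your instance
    torch_dtype=torch.float32,
    device=-1,
)

# Transcribe a local audio file (any format ffmpeg can decode).
result = asr("sample.wav")
print(result["text"])
```

The learning path linked above covers the full setup, including dependency installation and how to time and compare runs.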
\ No newline at end of file diff --git a/_posts/2025-04-23-pytorch-2-7.md b/_posts/2025-04-23-pytorch-2-7.md new file mode 100644 index 000000000000..1f31b9f2e6c3 --- /dev/null +++ b/_posts/2025-04-23-pytorch-2-7.md @@ -0,0 +1,161 @@ +--- +layout: blog_detail +title: "PyTorch 2.7 Release" +--- + +We are excited to announce the release of PyTorch® 2.7 ([release notes](https://github.com/pytorch/pytorch/releases/tag/v2.7.0))! This release features: + +* support for the [NVIDIA Blackwell GPU architecture](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/) and pre-built wheels for [CUDA 12.8](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html) across Linux x86 and arm64 architectures. +* *torch.compile* support for Torch Function Modes which enables users to override any *torch.** operation to implement custom user-defined behavior. +* Mega Cache which allows users to have end-to-end portable caching for torch; +* new features for FlexAttention - LLM first token processing, LLM throughput mode optimization and Flex Attention for Inference. + +This release is composed of 3262 commits from 457 contributors since PyTorch 2.6. We want to sincerely thank our dedicated community for your contributions. As always, we encourage you to try these out and report any issues as we improve 2.7. More information about how to get started with the PyTorch 2-series can be found at our [Getting Started](https://pytorch.org/get-started/pytorch-2.0/) page. + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Beta | Prototype |
| --- | --- |
| Torch.Compile support for Torch Function Modes | NVIDIA Blackwell Architecture Support |
| Mega Cache | PyTorch Native Context Parallel |
|  | Enhancing Intel GPU Acceleration |
|  | FlexAttention LLM first token processing on x86 CPUs |
|  | FlexAttention LLM throughput mode optimization on x86 CPUs |
|  | Foreach Map |
|  | Flex Attention for Inference |
|  | Prologue Fusion Support in Inductor |
    + + +*To see a full list of public feature submissions click [here](https://docs.google.com/spreadsheets/d/1TzGkWuUMF1yTe88adz1dt2mzbIsZLd3PBasy588VWgk/edit?usp=sharing). + + +## BETA FEATURES + + +### [Beta] Torch.Compile support for Torch Function Modes + +This feature enables users to override any *torch.** operation to implement custom user-defined behavior. For example, ops can be rewritten to accommodate a specific backend. This is used in FlexAttention to re-write indexing ops. + +See the [tutorial](https://pytorch.org/tutorials/recipes/torch_compile_torch_function_modes.html) for more information. + + +### [Beta] Mega Cache + +Mega Cache allows users to have end-to-end portable caching for torch. The intended use case is after compiling and executing a model, the user calls *torch.compiler.save_cache_artifacts()* which will return the compiler artifacts in a portable form. Later, potentially on a different machine, the user may call *torch.compiler.load_cache_artifacts()* with these artifacts to pre-populate the torch.compile caches in order to jump-start their cache. + +See the [tutorial](https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html#torch-compile-end-to-end-caching-mega-cache) for more information. + + +## PROTOTYPE FEATURES + + +### [Prototype] NVIDIA Blackwell Architecture Support + +PyTorch 2.7 introduces support for NVIDIA's new Blackwell GPU architecture and ships pre-built wheels for CUDA 12.8. For more details on CUDA 12.8 see [CUDA Toolkit Release](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html). + + + +* Core components and libraries including cuDNN, NCCL, and CUTLASS have been upgraded to ensure compatibility with Blackwell platforms. +* PyTorch 2.7 includes Triton 3.3, which adds support for the Blackwell architecture with torch.compile compatibility. +* To utilize these new features, install PyTorch with CUDA 12.8 using: *pip install torch==2.7.0 --index-url https://download.pytorch.org/whl/cu128* + +More context can also be found [here](https://github.com/pytorch/pytorch/issues/145949). + + +### [Prototype] PyTorch Native Context Parallel + +PyTorch Context Parallel API allows users to create a Python context so that every *torch.nn.functional.scaled_dot_product_attention() *call within will run with context parallelism. Currently, PyTorch Context Parallel supports 3 attention backends: 1. Flash attention; 2. Efficient attention; and 3. cuDNN attention. + +As an example, this is [used within TorchTitan as the Context Parallel solution for LLM training](https://discuss.pytorch.org/t/distributed-w-torchtitan-breaking-barriers-training-long-context-llms-with-1m-sequence-length-in-pytorch-using-context-parallel/215082). + +See [tutorial](https://pytorch.org/tutorials/prototype/context_parallel.html) here. + + +### [Prototype] Enhancing Intel GPU Acceleration + +This latest release introduces enhanced performance optimizations for Intel GPU architectures. These improvements accelerate workloads across various Intel GPUs through the following key enhancements: + + + +* Enable torch.compile on Windows 11 for Intel GPUs, delivering the performance advantages over eager mode as on Linux. 
+* Optimize the performance of PyTorch 2 Export Post Training Quantization (PT2E) on Intel GPU to provide a full graph mode quantization pipelines with enhanced computational efficiency. +* Improve Scaled Dot-Product Attention (SDPA) inference performance with bfloat16 and float16 to accelerate attention-based models on Intel GPUs. +* Enable AOTInuctor and torch.export on Linux to simplify deployment workflows. +* Implement more Aten operators to enhance the continuity of operators execution on Intel GPU and increase the performance on Intel GPU in eager mode. +* Enable profiler on both Windows and Linux to facilitate model performance analysis. +* Expand the Intel GPUs support to [Intel® Core™ Ultra Series 2 with Intel® Arc™ Graphics](https://www.intel.com/content/www/us/en/products/details/processors/core-ultra.html), and [Intel® Arc™ B-Series graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/b-series/overview.html) on both Windows and Linux. + +For more information regarding Intel GPU support, please refer to [Getting Started Guide](https://pytorch.org/docs/main/notes/get_start_xpu.html). + +See also the tutorials [here](https://pytorch.org/tutorials/prototype/inductor_windows.html) and [here](https://pytorch.org/tutorials/prototype/pt2e_quant_xpu_inductor.html). + + +### [Prototype] FlexAttention LLM first token processing on x86 CPUs + +FlexAttention x86 CPU support was first introduced in PyTorch 2.6, offering optimized implementations — such as PageAttention, which is critical for LLM inference—via the TorchInductor C++ backend. In PyTorch 2.7, more attention variants for first token processing of LLMs are supported. With this feature, users can have a smoother experience running FlexAttention on x86 CPUs, replacing specific *scaled_dot_product_attention* operators with a unified FlexAttention API, and benefiting from general support and good performance when using torch.compile. + + +### [Prototype] FlexAttention LLM throughput mode optimization + +The performance of FlexAttention on x86 CPUs for LLM inference throughput scenarios has been further improved by adopting the new C++ micro-GEMM template ability. This addresses the performance bottlenecks for large batch size scenarios present in PyTorch 2.6. With this enhancement, users can transparently benefit from better performance and a smoother experience when using FlexAttention APIs and torch.compile for LLM throughput serving on x86 CPUs. + + +### [Prototype] Foreach Map + +This feature uses torch.compile to allow users to apply any pointwise or user-defined function (e.g. torch.add) to lists of tensors, akin to the existing *torch._foreach_** ops. The main advantage over the existing *torch._foreach_** ops is that any mix of scalars or lists of tensors can be supplied as arguments, and even user-defined python functions can be lifted to apply to lists of tensors. Torch.compile will automatically generate a horizontally fused kernel for optimal performance. + +See [tutorial](https://pytorch.org/tutorials/recipes/foreach_map.html) here. + + +### [Prototype] Flex Attention for Inference + +In release 2.5.0, [FlexAttention](https://pytorch.org/blog/flexattention/)* torch.nn.attention.flex_attention* was introduced for ML researchers who’d like to customize their attention kernels without writing kernel code. 
This update introduces a decoding backend optimized for inference, supporting GQA and PagedAttention, along with feature updates including nested jagged tensor support, performance tuning guides and trainable biases support. + +### [Prototype] Prologue Fusion Support in Inductor + +Prologue fusion optimizes matrix multiplication (matmul) operations by fusing operations that come before the matmul into the matmul kernel itself, improving performance by reducing global memory bandwidth. diff --git a/_posts/2025-04-25-pytorch-2-7-intel-gpus.md b/_posts/2025-04-25-pytorch-2-7-intel-gpus.md new file mode 100644 index 000000000000..7643d20ae51b --- /dev/null +++ b/_posts/2025-04-25-pytorch-2-7-intel-gpus.md @@ -0,0 +1,92 @@ +--- +layout: blog_detail +title: "Accelerate PyTorch 2.7 on Intel® GPUs" +author: the Intel PyTorch Team +--- + +[PyTorch 2.7](https://pytorch.org/blog/pytorch-2-7/) continues to deliver significant functionality and performance enhancements on Intel® GPU architectures to streamline AI workflows. Application developers and researchers seeking to fine-tune, inference and develop PyTorch models on Intel GPUs will now have a consistent user experience across various operating systems, including Windows, Linux and Windows Subsystem for Linux (WSL2). This is made possible through improved installation, eager mode script debugging, a performance profiler, and graph model (torch.compile) deployment. As a result, developers have greater options with a unified GPU programming paradigm for both front-end and back-end development. + +## Incremental improvements of Intel GPU support in PyTorch + +Since PyTorch 2.4, we've made steady improvements to Intel GPU support with each release. With PyTorch 2.7, we are excited to share that we have established a solid foundation to have Intel GPU work in both graph mode (torch.compile) and eager mode on Windows and Linux. This includes a wide range of Intel GPU products, many of which you may already access. We hope these enhancements will unlock more ubiquitous hardware for your AI research and development. + +* Over time, we have expanded Intel GPU Support across Windows and Linux, including these products: + * [Intel® Arc™ A-Series Graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/a-series/overview.html) + * [Intel® Arc™ B-Series Graphics](https://www.intel.com/content/www/us/en/products/docs/discrete-gpus/arc/desktop/b-series/overview.html) + * [Intel® Core™ Ultra Processors with Intel Arc Graphics](https://www.intel.com/content/www/us/en/support/articles/000097599/processors.html) + * [Intel® Core™ Ultra Mobile Processors (Series 2) with Intel Arc Graphics](https://www.intel.com/content/www/us/en/products/docs/processors/core-ultra/core-ultra-series-2-mobile-product-brief.html) + * [Intel® Core™ Ultra Desktop Processors (Series 2) with Intel Arc Graphics](https://www.intel.com/content/www/us/en/products/docs/processors/core-ultra/core-ultra-desktop-processors-series-2-brief.html) + * [Intel® Data Center GPU Max Series](https://www.intel.com/content/www/us/en/products/details/discrete-gpus/data-center-gpu/max-series.html) +* [Simpler installation](https://pytorch.org/docs/2.7/notes/get_start_xpu.html) of torch-xpu PIP wheels and an effortless setup experience. 
+* High ATen operation coverage with SYCL and oneDNN for smooth eager mode support with functionality and performance. +* Notable speedups with torch.compile through default TorchInductor and Triton backend, proved by measurable performance gains with Hugging Face, TIMM, and TorchBench benchmarks. + +Check out the detailed advancements in these related release blogs:[ PyTorch 2.4](https://pytorch.org/blog/intel-gpus-pytorch-2-4/),[ PyTorch 2.5](https://pytorch.org/blog/intel-gpu-support-pytorch-2-5/), and[ PyTorch 2.6](https://pytorch.org/blog/unlocking-pt-2-6-intel/). + + +## What's New in PyTorch 2.7 + +These are the features in PyTorch 2.7 that were added to help accelerate performance on Intel GPUs. + + + +* Improve scaled dot-product attention (SDPA) inference performance with bfloat16 and float16 to accelerate attention-based models on Intel GPUs. +With the new SDPA optimization for Intel GPUs on PyTorch 2.7, Stable Diffusion float16 inference achieved up to 3x gain over PyTorch 2.6 release on Intel® Arc™ B580 Graphics and Intel® Core™ Ultra 7 Processor 258V with Intel® Arc™ Graphics 140V on eager mode. See Figure 1 below. + + +![chart](/assets/images/pytorch-2-7-intel-gpus/fg1.png){:style="width:100%"} + +**Figure 1. PyTorch 2.7 Stable Diffusion Performance Gains Over PyTorch 2.6** + +* Enable torch.compile on Windows 11 for Intel GPUs, delivering the performance advantages over eager mode as on Linux. With this, Intel GPUs became the first accelerator to support torch.compile on Windows. Refer to[ Windows tutorial](https://pytorch.org/tutorials/prototype/inductor_windows.html) for details. +Graph model (torch.compile) is enabled in Windows 11 for the first time across Intel GPUs, delivering the performance advantages over eager mode as on Linux by PyTorch 2.7. The latest performance data was measured on top of PyTorch Dynamo Benchmarking Suite using Intel® Arc™ B580 Graphics on Windows showcase torch.compile speedup ratio over eager mode as shown in Figure 2. Both training and inference achieved similar significant improvements. + + +![chart](/assets/images/pytorch-2-7-intel-gpus/fg2.png){:style="width:100%"} + +**Figure 2. Torch.compile Performance Gains Over Eager Mode on Windows** + + + +* Optimize the performance of PyTorch 2 Export Post Training Quantization (PT2E) on Intel GPU to provide full graph mode quantization pipelines with enhanced computational efficiency. Refer to [PT2E tutorial](https://pytorch.org/tutorials/prototype/pt2e_quant_xpu_inductor.html) for details. +* Enable AOTInductor and torch.export on Linux to simplify deployment workflows. Refer to[ AOTInductor tutorial](https://pytorch.org/docs/main/torch.compiler_aot_inductor.html) for details. +* Enable profiler on both Windows and Linux to facilitate model performance analysis. Refer to the[ PyTorch profiler tutorial](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html#pytorch-profiler) for details. + +Review the [Getting Started on Intel GPU Guide](https://pytorch.org/docs/2.7/notes/get_start_xpu.html) for a tour of the environment setup and a quick start on Intel GPUs. 
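As a rough illustration of this unified programming model, the short sketch below moves a small model to an Intel GPU and runs it both in eager mode and through torch.compile. It assumes a PyTorch 2.7 build with XPU support and a working Intel GPU driver; the model and tensor shapes are arbitrary examples:

```python
import torch
import torch.nn as nn

# Fall back to CPU if no XPU device is visible.
device = "xpu" if torch.xpu.is_available() else "cpu"

model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 10)).to(device)
x = torch.randn(8, 1024, device=device)

# Eager mode works out of the box; torch.compile routes through TorchInductor on XPU.
compiled = torch.compile(model)
with torch.no_grad():
    out = compiled(x)

print(out.shape, out.device)
```

The same script should run on Windows, Linux, or WSL2, provided the XPU-enabled build and driver are present.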
+ + +## Future Work + +Looking ahead, we will continue the Intel GPU upstream efforts in future PyTorch releases to: + +* Attain state-of-the-art PyTorch-native performance to showcase competitive GEMM computational efficiency for torch.compile, and enhance performance for LLM models through FlexAttention and lower precision data types. +* Broaden feature compatibility by delivering distributed XCCL backend support for Intel® Data Center GPU Max Series. +* Expand accelerator support across core PyTorch ecosystem components including torchao, torchtune, and torchtitan. + +Follow along in the [PyTorch Dev Discussion](https://dev-discuss.pytorch.org/t/intel-gpu-cpu-enabling-status-and-feature-plan-2025-h1-update/2913) to learn more about Intel GPU & CPU enabling status and features. As we get further along, we will create tickets on GitHub to document our progress. + + +## Summary + +In this blog, we reviewed the Intel GPU upstream progress starting in PyTorch 2.4 and highlighted the new features of PyTorch 2.7 that accelerate AI workload performance across various Intel GPUs. These new features, especially SDPA on Windows, achieved up to 3x inference (Stable Diffusion, float16) gain over PyTorch 2.6 release on Intel Arc B580 Graphics and Intel Core Ultra 7 Processor 258V with Intel Arc Graphics 140V. Also, torch.compile on Windows delivers similar performance advantages over eager mode on Dynamo benchmarks as on Linux. + + +## Acknowledgments + +We want to thank the following PyTorch maintainers for their technical discussions and insights: [Nikita Shulga](https://github.com/malfet), [Jason Ansel](https://github.com/jansel), [Andrey Talman](https://github.com/atalman), [Alban Desmaison](https://github.com/alband), and [Bin Bao](https://github.com/desertfire). + +We also thank collaborators from PyTorch for their professional support and guidance. + +## Product and Performance Information + +Measurement on Intel Core Ultra 7 258V: 2200 MHz, 8 Core(s), 8 Logical Processor(s) with Intel Arc 140V GPU (16GB), GPU memory 18.0 GB, using Intel Graphics Driver 32.0.101.6647 (WHQL Certified), Windows 11 Pro - 24H2. And Intel Core Ultra 5 245KF: 4200 MHz, 14 Core(s), 14 Logical Processor(s), Intel Arc B580 Graphics, dedicated GPU memory 12.0 GB, shared GPU memory 15.8 GB, using Intel Graphics Driver 32.0.101.6647 (WHQL Certified), Windows 11 Enterprise LTSC - 24H2. Test by Intel on Apr 8th, 2025. + +## Notices and Disclaimers + +Performance varies by use, configuration and other factors. Learn more on the Performance Index site. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates.  See backup for configuration details.  No product or component can be absolutely secure. Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. + +Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others. + +## AI Disclaimer + +AI features may require software purchase, subscription or enablement by a software or platform provider, or may have specific configuration or compatibility requirements. Details at [www.intel.com/AIPC](http://www.intel.com/AIPC). Results may vary. 
\ No newline at end of file
diff --git a/_posts/2025-04-28-accelerating-training-float8-rowwise-crusoe.md b/_posts/2025-04-28-accelerating-training-float8-rowwise-crusoe.md
new file mode 100644
index 000000000000..245688c07605
--- /dev/null
+++ b/_posts/2025-04-28-accelerating-training-float8-rowwise-crusoe.md
@@ -0,0 +1,195 @@
+---
+layout: blog_detail
+title: "Accelerating Large Scale Training and Convergence with PyTorch Float8 Rowwise on Crusoe 2K H200s"
+author: Meta and Crusoe
+---
+
+**Meta**: Less Wright, Hamid Shojanazeri, Vasiliy Kuznetsov, Daniel Vega-Myhre, Gokul Nadathur, Will Constable, Tianyu Liu, Tristan Rice, Driss Guessous, Josh Fromm, Luca Wehrstedt, Jiecao Yu
+**Crusoe**: Ethan Petersen, Martin Cala, Chip Smith
+
+Working with [Crusoe.AI](http://Crusoe.AI), we were provided access to one of their new 2K H200 clusters in Iceland, which enabled us to showcase training accelerations of 34 - 43% at scale by leveraging TorchTitan’s HSDP2 and TorchAO’s new float8 rowwise, with comparable convergence and stability vs BF16.
+
+
+![bar chart](/assets/images/accelerating-training-float8-rowwise-crusoe/fg1.png){:style="width:100%;"}
+
+
+In this post we detail the synergy of H200s with PyTorch’s new Float8 rowwise training with TorchTitan’s FSDP2/HSDP2 and CP at scale.
+
+## Background - what is an H200?
+
+H200s are an ‘enhanced’ H100, offering the exact same compute as an H100, but with two additional improvements.
+
+* Larger global memory, 141GiB HBM3e vs the standard 80GiB HBM3
+* Memory bandwidth is ~43% faster, at 4.8TB/s vs 3.35 TB/s. The faster memory transfer has an outsized effect on training speed, especially for PyTorch’s AsyncTP.
+
+## What is PyTorch Float8 rowwise?
+
+Float8 Rowwise is a finer grained resolution for Float8 vs the previous ‘tensor wise’ Float8. It is designed to ensure finer grained accuracy to support larger workloads that tend to become more sensitive to quantization at scale and as training progresses.
+
+There are two key improvements with Float8 rowwise:
+
+* Each row now maintains its own scaling factor versus a single scaling factor for the entire tensor, thus improving quantization precision. Finer grained scaling per row helps reduce the effect of outliers (extreme values that force the quantization scaling factor to stretch and degrade the precision of the normally distributed values) and thus ensures better precision.
+* The scaling factor itself is now implemented by rounding down to the nearest power of 2. This has been shown to help reduce quantization errors when multiplying/dividing by the scaling factor, as well as ensuring large values remain scaled to the same value in both the forward and backward passes.
+
+Note that other large scale models have been trained using Float8 at 2K scale with a combination of 1x128 groupwise and 128x128 blockwise, with power of 2 scaling factors. They had the same goal of improving Float8’s precision for supporting large scale training.
+
+Thus, Float8 rowwise offers a similar promise to enable Float8 for very large scale training, but we wanted to provide proof of stability and convergence at scale, which training on the Crusoe H200 2K cluster provided initial verification of.
+
+## Showcasing Float8 Rowwise Loss convergence vs BF16 at 1600 and 1920 GPU Scale:
+
+In order to verify comparable loss convergence, we ran two separate runs at both 1920 and then 1600 (1.6k) GPU scale using TorchTitan and Llama3 70B.
The 1.6K GPU runs were set for 2.5k iterations, using TorchTitan’s HSDP2 and Context Parallel to enable 2D parallelism.
+
+The loss convergence tests were run using Titan’s deterministic mode - this mode effectively freezes most potential sources of variation from run to run, and thus helps ensure that the only substantial change is what we want to test, namely the loss convergence and loss curves of BF16 vs Float8 Rowwise.
+
+Note that deterministic mode also slows down training speed because various kernels will not be autotuned to maximize throughput (otherwise we risk using different kernels between runs and introducing variance).
+
+Two runs were completed, one with BF16 and the other with Float8 Rowwise.
+
+Both runs completed their assigned 2.5k iters without issue, showcasing the Crusoe cluster stability, with FP8 completing at exactly 24 hours and BF16 finishing after 31 hours, 19 minutes.
+
+| DType | Time / Iters | Loss |
+| --- | --- | --- |
+| BF16 | 24 hours | 3.15453 |
+| Float8 Rowwise | 24 hours | 2.86386 |
+| BF16 | 31 hours, 19 minutes / 2.5K | 2.88109 |
+| Float8 Rowwise | 24 hours / 2.5K | 2.86386 |
+
+At the 24 hour mark, Float8 completed 2.5K iterations, showcasing the comparative speed up (even in deterministic mode) of float8 training. At the 24 hour mark, Float8 enabled a **+9.21%** relative improvement in loss compared to BF16 for the same 24 hours of large scale training time.
+
+After 31 hours, 19 minutes, the BF16 run finally completed its 2.5k iters.
+
+The final loss numbers:
+BF16 = **2.88109**
+Float8 = **2.86386**
+
+From the loss curves we observed very similar curves at the first and last ⅓, and then a turbulent zone in the middle where both showed similar spikes, but with a slight skew in the relative timing of the spikes.
+
+
+![line chart](/assets/images/accelerating-training-float8-rowwise-crusoe/fg2.png){:style="width:100%;"}
+
+
+As a result, we can see that PyTorch’s Float8 rowwise offers similar convergence but over 33% speedup for the same amount of training time.
+
+## Long Term Training stability with Float8 Rowwise
+
+Beyond showcasing comparable convergence, we also wanted to show longer term training stability with Float8, and thus we launched a 4 day, 15K run at 256 scale.
+
+![line chart](/assets/images/accelerating-training-float8-rowwise-crusoe/fg3.png){:style="width:100%;"}
+
+
+As shown above, Float8 training ran for over 100 hours with no issues, highlighting the long term stability of Float8 Rowwise.
+
+## Determinism in TorchTitan
+
+To verify determinism and to see if the spikiness in the longer runs was from scale, we also ran a smaller experiment comprising two runs of BF16 and one run of Float8 at 256 scale, with HSDP2 only (i.e. without 2D Context Parallel).
+
+In this case both BF16 runs had identical curves and final loss, and we saw a similar spikiness zone for all three runs.
+
+At the 2K iteration mark, both Float8 and BF16 ended at nearly identical points:
+BF16 *2 = **3.28538**
+Float8 rowwise = **3.28203**
+
+![line chart](/assets/images/accelerating-training-float8-rowwise-crusoe/fg4.png){:style="width:100%;"}
+
+
+The above result confirms that neither CP nor scale (2k) is responsible for the spikiness in the loss, as we saw a similar effect at 256 scale as well. The most likely explanation for the loss spikes is the content distribution in the dataset.
+
+For the sake of determinism, the experiments were run with a serialized C4 dataset (not shuffled), meaning the spikes could be from encountering new content within the dataset.
+
+## Net speedups at various Scales with Float8 rowwise:
+
+We performed shorter runs at various GPU scales to understand how Float8 Rowwise would scale in terms of training acceleration as cluster sizes expanded. Doubling in scale from 960 to 1920 GPUs, Float8 continued to deliver impressive training speedups, with a range of over 34-43% gains compared to BF16. We also want to note that when scaling from 1k to 2k GPUs, communication overhead likely kicked in and we observed a 4% hit on throughput with BF16.
+
+![bar chart](/assets/images/accelerating-training-float8-rowwise-crusoe/fg5.png){:style="width:100%;"}
+
+
+As shown in the longer training runs at scale above, Float8 rowwise delivered substantial speedups with equal or even slightly improved loss endpoints, while delivering 34% speedups at 1920 (DeepSeek) scale.
+
+## How can I use Float8 Rowwise in my training?
+
+Float8 Rowwise is available now for you to use in your large scale training.
It is packaged in [TorchAO’s](https://github.com/pytorch/ao) latest builds (0.9 and higher) and integrated into [TorchTitan](https://github.com/pytorch/torchtitan) natively if you want to get up and running quickly. + +To activate Float8 Rowwise in TorchTitan: + +First enable the model converter to hotswap the nn.linears into float8 linear layers in your models .toml file - see line 29: + + +![code](/assets/images/accelerating-training-float8-rowwise-crusoe/fg6.png){:style="max-width:600px; display: block; margin-left: auto; margin-right: auto"} + +Secondly, specify the ‘rowwise’ float8 recipe - see line 72: + + +![code](/assets/images/accelerating-training-float8-rowwise-crusoe/fg7.png){:style="max-width:600px; display: block; margin-left: auto; margin-right: auto"} + + +Note that you have three choices for the ‘recipe_name’: + +* rowwise which is the recommended default, +* tensorwise (the older style float8) and +* rowwise_with_gw_hp. + +The gw_hp rowwise option keeps the gradients to the weights in BF16 precision during the backwards pass, and this can further enhance float8 precision for extremely sensitive workloads. But, it can ironically be a bit more performant than generic rowwise if the majority of the matmul sizes in your model are smaller (with an estimated tipping point at roughly 13-16K dimensions on H100). + +Thus while we recommend rowwise as the default, it may be worth comparing with gw_hp on your model to verify which provides the best performance, with an upside of even greater precision. + +By toggling the model converter on and off with a #, you can directly compare training acceleration between BF16 and Float8 Rowwise to understand the potential speedups for your own training. + +## Future Updates: + +We’ll have an additional update coming showcasing multiple improvements for Pipeline Parallel and Async Distributed Checkpointing so please stay tuned. \ No newline at end of file diff --git a/_posts/2025-04-29-pt-foundation-expands.md b/_posts/2025-04-29-pt-foundation-expands.md new file mode 100644 index 000000000000..a0b0454ae588 --- /dev/null +++ b/_posts/2025-04-29-pt-foundation-expands.md @@ -0,0 +1,50 @@ +--- +layout: blog_detail +title: "PyTorch Foundation Expands to an Umbrella Foundation to Accelerate AI Innovation" +author: Matt White, Executive Director, PyTorch Foundation +--- + +Today, I am thrilled to announce a significant milestone for the PyTorch Foundation: we are expanding our scope to become an umbrella foundation, allowing us to host additional projects. This expansion positions the PyTorch Foundation to foster a broader ecosystem of high-value, trusted, and innovative AI projects that cater to all stages of the AI lifecycle—from training and inference to industry-specific applications. + +## Why Expand? + +Since its inception at the Linux Foundation two and a half years ago, the PyTorch Foundation has rapidly grown, now encompassing over 30 member organizations and 120 vibrant ecosystem projects. PyTorch itself has become the framework of choice for AI researchers, practitioners, and industry leaders worldwide. Our flagship PyTorch Conference has seen attendance multiply sixfold over just two years, reflecting the community’s tremendous enthusiasm and engagement. 
+ +With new initiatives such as PyTorch Day events, global community meetups, the PyTorch Ambassador Program, Open Source Program Office (OSPO) outreach, the Speaker’s Bureau, and our upcoming training and certification programs, we have significantly deepened our community’s expertise and collaboration capabilities. To sustain and accelerate this momentum, the logical next step was to expand the PyTorch Foundation into an umbrella organization. + +## What Does an Umbrella Foundation Mean? + +By transitioning into an umbrella foundation, PyTorch will now host a range of diverse, high-quality AI and ML projects beyond PyTorch Core. These include foundation-hosted projects in two categories: + + +* **Platform Projects**: Domain-agnostic solutions essential across various stages of the AI lifecycle, such as training, inference, model optimization, and deployment as well as agentic systems. +* **Vertical Projects**: Domain-specific projects tailored to particular industries or applications, such as biomedical imaging, protein folding, and geospatial analysis. + +Projects under our umbrella gain immediate access to vendor-neutral governance, enhanced visibility, increased funding opportunities, and robust community engagement and support. + +## Foundation-Hosted vs. Ecosystem Projects + +As we expand, it’s important to clarify the distinction between foundation-hosted and ecosystem projects: + +* **Foundation-Hosted Projects** are projects that fall under the umbrella, they are officially governed and administered under the PyTorch Foundation’s neutral and transparent governance model. Project maintainers continue to oversee their project, and they transfer assets to the Linux Foundation for independent stewardship and adopt an open governance model significantly reducing vendor bias and encouraging broader community contributions and adoption. These projects have greater stability and longevity and integrate with the larger PyTorch community. +* **Ecosystem Projects** remain independently managed but receive recognition and increased visibility by aligning themselves closely with the PyTorch Foundation community standards. These projects meet specific quality and maturity criteria but retain full independence in governance and asset management. + +## How to Join the PyTorch Ecosystem or Become a Foundation-Hosted Project + +We have clearly defined pathways for projects looking to become part of the PyTorch community: + +1. **[Ecosystem Project Status](https://github.com/pytorch-fdn/ecosystem)**: Projects must meet defined criteria, such as active development, comprehensive documentation, CI/CD infrastructure, clear governance, and community engagement. Approved ecosystem projects benefit from increased exposure and official recognition on the [PyTorch Landscape](https://landscape.pytorch.org/). +2. **[Candidate Project Status](https://github.com/pytorch-fdn/foundation-hosted)**: Ecosystem projects aspiring to foundation-hosted status can become candidates by securing sponsorship from a PyTorch Foundation [Technical Advisory Council (TAC)](/tac) voting member. Candidates receive guidance on meeting all necessary governance, technical, and strategic criteria. +3. 
**[Foundation-Hosted Project Status](https://github.com/pytorch-fdn/foundation-hosted)**: Candidate projects demonstrating high maturity, stability, multi-platform support, security best practices, and strategic value to the PyTorch community can be approved by the TAC. These projects gain extensive benefits, including neutral trademark hosting, foundation support, marketing and events resources, governance guidance, and strategic funding opportunities. + +## Ensuring Long-Term Success and Innovation + +By expanding our scope to become an umbrella foundation, the PyTorch Foundation is uniquely positioned to enhance collaboration, innovation, and sustained growth across the entire AI community. Our mission is clear: create a vendor-neutral, open source environment where the best AI and ML tools can thrive, benefiting users, contributors, and industry stakeholders worldwide. + +*“PyTorch is absolutely the foundation of the innovation happening in AI today and with projects like Llama, ChatGPT, and hundreds of thousands of open projects built on PyTorch, it has cemented itself as a critical ingredient to the world of AI. This move to create an umbrella foundation enables PyTorch to significantly expand its ecosystem both horizontally and vertically in this new era of agentic systems. I am very excited about this opportunity to take the PyTorch community to the next level!” - Joe Spisak, Product Director for PyTorch at Meta.* + +*"PyTorch sits at the very core of AI today. Meanwhile, the depth of the AI stack has grown dramatically—evolving from enabling accelerated compute to powering fully autonomous systems. Broadening the PyTorch Foundation is a key step in keeping the AI revolution open and accessible to all, across the stack and aligned with the principles PyTorch was built on." - Luca Antiga, CTO at Lightning AI.* + +We are incredibly optimistic about the opportunities ahead and excited to welcome new projects into our growing family. The PyTorch Foundation remains deeply committed to driving AI innovation forward, and together, we will continue to build the future of open source artificial intelligence. + +Stay tuned for more updates, announcements, and opportunities to participate! \ No newline at end of file diff --git a/_posts/2025-04-30-6x-faster-async-checkpointing.md b/_posts/2025-04-30-6x-faster-async-checkpointing.md new file mode 100644 index 000000000000..12a2f9e1b1de --- /dev/null +++ b/_posts/2025-04-30-6x-faster-async-checkpointing.md @@ -0,0 +1,108 @@ +--- +layout: blog_detail +title: "6x faster Async Checkpointing in PyTorch, using Cached Plans, no GIL contention" +author: Meta and Crusoe +--- + +**Meta**: Less Wright, Meet Vadakkanchery, Saurabh Mishra, Ela Krepska, Hamid Shojanazeri, Pradeep Fernando +**Crusoe**: Ethan Petersen, Martin Cala, Chip Smith + +PyTorch DCP (Distributed Checkpointing) has recently enabled new optimizations in asynchronous checkpointing to reduce GPU utilization drop by minimizing collective overhead and improving overall checkpointing efficiency. + +Using Crusoe’s 2K H200 cluster, with TorchTitan and training a Llama3-70B, we were able to verify these new features deliver substantial speedups at 1856 GPU scale, reducing the background processing time for async DCP checkpoints from ~436 seconds to ~67 seconds. + +This is roughly a 6.5x reduction in background checkpoint processing time, enabling even more total training time to proceed at full training throughput. 
+ +![chart](/assets/images/6x-faster-async-checkpointing/fg1.png){:style="width:100%"} + + +*Fig 1: 1856 training run with high frequency checkpointing. The first checkpoint (drop down in tps) does not have a cached save plan, and the background processing takes far longer than the rest where the cached plan is used.* + + +## Background: What is Asynchronous Checkpointing? + +In a standard checkpointing workflow, GPUs are blocked while the checkpointing data is offloaded from GPU to CPU and then written to storage. After the save to physical media is complete, training can resume. + +Asynchronous checkpointing greatly reduces this downtime by enabling the actual saving to storage to be done via CPU threads, allowing GPU-based training to continue while the checkpoint data is being persisted in parallel. It is used primarily for intermediate/fault tolerant checkpoints as it unblocks the GPUs much faster compared to the synchronous checkpoints. \ +For example, in our large-scale experiment, GPU training was blocked for less than a second (.78 seconds at 1856 scale) while checkpoint data was moved from GPU to CPU (staging). At that point, GPU training immediately continues, which is a substantial training time improvement over traditional checkpointing. For reference, Async Checkpointing is covered in more detail [here](https://pytorch.org/blog/reducing-checkpointing-times/). + + +## Challenges with Asynchronous Checkpointing + +However, the background processing inherent in Asynchronous Checkpointing has additional challenges that result in a temporary reduction of training throughput while the storage phase is being completed. These are highlighted below. + + +### GPU utilization drop from GIL contention: + +The Global Interpreter Lock (GIL) in Python is a mechanism that prevents multiple native threads from executing Python bytecode at the same time. This lock is necessary mainly because CPython's memory management is not thread-safe. + +DCP currently uses background threads for metadata collectives and uploading to storage. Although these expensive steps are done asynchronously, it leads to contention for the GIL with the trainer threads. This causes the GPU utilization (QPS) to suffer significantly and also increases the e2e upload latency. For large-scale checkpoints, the overhead of the CPU parallel processing has a suppressive effect on net GPU training speed since CPUs also drive the training process via GPU kernel launches. + +Please refer to the following figure from our experiments: + +![chart](/assets/images/6x-faster-async-checkpointing/fg2.png){:style="width:100%"} + + +*Fig 2: One can see a sustained drop in training QPS even after staging (i.e. blocking operation to trainer) is complete.* + +The first dip in Figure 2 (marked by the purple line) indicates that staging is complete, and training can continue. However, a second drop is evident (marked by the area between the purple and yellow lines) which is due to trainer thread and checkpointing threads contending for the Python GIL, leading to degraded training QPS until the checkpoint thread completes execution. + + +### Collective communications cost: + +DCP performs multiple collectives today for various reasons: dedupe, global metadata for the checkpoint, resharding, and distributed exception handling. Collectives are costly as these require network I/O and pickling/unpickling of the large metadata being sent across the GPU network. 
These collectives become extremely expensive as the job scale grows, leading to significantly higher e2e latency and potential for collective timeouts. + + +## Solutions + + +### Process based async checkpointing + +DCP now supports async checkpoint save via a background process. This helps avoid the training QPS drop by eliminating the python GIL contention with the trainer threads. Please see Fig 2 for checkpointing via threads and Fig 3 for checkpointing via background process. + + +### Caching of the save plans + +DCP has a clear boundary between the planning and storage I/O steps. SavePlanner in DCP is a stateful component which acts as an access proxy to the state_dict. Planner manages save plans prepared by individual ranks, which carry metadata information necessary to do the write I/O. The planning step involves a collective operation to gather a comprehensive view of the checkpoint on the coordinator rank. The coordinator rank is responsible for de-duplicating parameters/weights to eliminate redundancies, validating the global plan to ensure accuracy and consistency, and creating the global metadata structs. This is followed by a scatter collective where the coordinator rank assigns I/O tasks to each rank. Any transformations done on the plans affect how the storage components finally write the data. + +During the course of a training job, multiple checkpoints are saved. In the majority of these cases, only the checkpoint data changes between different save instances, and thus, the plan remains the same. This presented an opportunity for us to cache the plans, pay the planning cost only on the first save, and then amortize that cost across all the subsequent attempts. Only the updated plans (plans which changed in the next attempt) are sent via collective, thus reducing the collective overhead significantly. + + +## Experiment Results + +**Set up:** 1856 H200 GPUs, Llama3-70B, HSDP2 with TorchTitan + +After deploying both the solutions above, the following are the key results: + +* TPS drop has significantly narrowed, with a peak dip to 372 vs 315 tps, and for a greatly reduced time window (~67 seconds vs ~437 seconds). This time window is now mostly attributed to the blocking for CPU processing. +* Subsequent checkpoint save attempts also continue to be much faster due to very low overhead at the planning stage. E2E latency is thus improved by over 6.5x. This will allow our partners to increase the checkpointing frequency and reduce the lost training progress (i.e. wasted training time). + +If you look at the very first downspike in Figure 1, this drawdown in GPU processing time takes training throughput from 700 down to 320 tps, and suppresses it for roughly 7 minutes (467 seconds). Once the CPUs have finished processing, training continues again at full speed. + +Previously, this ~7 minute suppression would be repeated at *every* checkpoint. However, with the new process-based checkpointing feature, only the first checkpoint has the full drawdown time (mainly due to overhead from daemon process initialization), as all future checkpoints are executed via the background process, mitigating GIL contention with the trainer threads. + +This is visually shown in all the subsequent checkpoints where the average MFU suppression time drops to just over a minute, reflected by the sharp spikes that almost immediately revert to full MFU throughput. 
+ + +![chart](/assets/images/6x-faster-async-checkpointing/fg3.png){:style="width:100%"} + + +*Fig 3: The red box shows the non-cached plan checkpoint, which also includes Checkpoint Background Init process overhead, while the purple box highlights the first checkpoint to run with the cached plan.* + +This means that even large-scale checkpointing, such as shown in Fig 2 at 1856 GPU scale, can be done with ~6x reduced training throughput impact. This enables Asynchronous DCP checkpointing to be run more frequently (thus better rollback protection) while enhancing total training throughput relative to previous Async Checkpointing overhead. + +**Using DCP’s cached checkpointing:** + +This feature is already available as part of the PyTorch nightly builds, and you can test out PyTorch’s Asynchronous DCP checkpointing directly in TorchTitan. Following are the instructions to enable these features: + +* Process-based asynchronous checkpointing: + * Set the **async_checkpointer_type** to AsyncCheckpointerType.PROCESS in the [async_save](https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/state_dict_saver.py#L193) API. (*file*: pytorch/torch/distributed/checkpoint/state_dict_saver.py) +* Save plan caching: + * Set the **enable_plan_caching** flag to true in the [DefaultSavePlanner](https://github.com/pytorch/pytorch/blob/main/torch/distributed/checkpoint/default_planner.py#L78C9-L78C28). (*file*: pytorch/torch/distributed/checkpoint/default_planner.py) + + +## Future work + +DCP will be rolling out additional optimizations to further improve the checkpointing cost. Currently even though the save plans are cached, coordinator rank still prepares the metadata. For larger jobs and models with many tensors, this overhead is non-trivial. In the next iteration, DCP will eliminate the metadata overhead and improve the e2e latency further. DCP will also introduce additional optimizations, such as zero-overhead checkpointing, to enable efficient checkpointing in large-scale jobs. + +Stay tuned! diff --git a/_posts/2025-04-30-flexattention-for-inference.md b/_posts/2025-04-30-flexattention-for-inference.md new file mode 100644 index 000000000000..587aedf2158a --- /dev/null +++ b/_posts/2025-04-30-flexattention-for-inference.md @@ -0,0 +1,380 @@ +--- +layout: blog_detail +title: "FlexAttention Part II: FlexAttention for Inference" +author: Joy Dong, Boyuan Feng, Driss Guessous, Joel Schlosser, Yanbo Liang, Horace He +--- + +## Overview + +In PyTorch 2.5.0 release, we introduced [FlexAttention](https://pytorch.org/blog/flexattention/) `torch.nn.attention.flex_attention` for ML researchers who’d like to customize their attention kernels without writing kernel code. This blog introduces our decoding backend optimized for inference, supporting GQA and PagedAttention, along with feature updates including nested jagged tensor support, performance tuning guides and trainable biases support. + +If you’re looking for an easy way to play around with FlexAttention in your post-training / inference pipeline, PyTorch native post-training library [torchtune](https://github.com/pytorch/torchtune) and inference codebase [gpt-fast](https://github.com/pytorch-labs/gpt-fast) already have FlexAttention integrated. Try it out! + +We are excited to share that our paper on FlexAttention has been accepted for presentation at the MLSys2025 Conference held from May 12-15th in Santa Clara, California. 
+ +Title: **FlexAttention: A Programming Model for Generating Optimized Attention Kernels.** [Poster](https://mlsys.org/virtual/2025/poster/3007) + + +## FlexAttention for Inference + +TL;DR: `torch.compile` lowers `flex_attention` to a fused [FlashDecoding](https://pytorch.org/blog/flash-decoding/) kernel when it runs on a very short query. + +One fused attention kernel does not suit all – especially in long-context LLM inference. + +The decoding phase of LLM inference is an iterative process: tokens are generated one at a time, requiring `N` forward passes to generate an `N`-token sentence. Fortunately, each iteration doesn’t need to recompute self-attention over the full sentence — previously calculated tokens are cached, therefore we only need to attend the newly generated token to the cached context. + + +![chart](/assets/images/flexattention-for-inference/fg1.png){:style="width:100%"} + + +This results in a unique attention pattern where a short query sequence (1 token) attends to a long key-value cache (context length up to 128k). Traditional optimizations for square attention kernels (`q_len ≈ kv_len`) don’t directly apply here. This pattern poses new challenges for GPU memory utilization and occupancy. We build a dedicated FlexDecoding backend optimized for long-context LLM inference incorporating decoding-specific techniques from [FlashDecoding](https://pytorch.org/blog/flash-decoding/). + +FlexDecoding is implemented as an alternative backend for the `torch.nn.attention.flex_attention `operator. `flex_attention` automatically switches to the FlexDecoding backend for its JIT compilation when given a short query and a long KV cache. If the input shape changes significantly, for example transitioning from the prefill phase to decoding, JIT recompilation generates a separate kernel for each scenario. + +``` +flex_attention = torch.compile(flex_attention) + +k_cache = torch.random(B, H, 16384, D) +v_cache = torch.random(B, H, 16384, D) + +... + +# Prefill Phase: query shape = [B, H, 8000, D] +flex_attention(q_prefill, k_cache, v_cache, ...) # Uses FlexAttention backend optimized for prefill & training + +# Decoding Phase: q_last_token shape = [B, H, 1, D] +flex_attention(q_last_token , k_cache, v_cache, ...) # Recompiles with the FlexDecoding backend + +# decode 2 tokens at the same time: q_last_2_tokens shape = [B, H, 2, D] +flex_attention(q_last_2_tokens, k_cache, v_cache, ...) # No recompilation needed! Runs the decoding kernel again. +``` + + +## Working with KV Cache + +One of the key optimizations for efficient inference is maintaining a preallocated KV cache that updates **in place** as new tokens are generated. Instead of enforcing a specific KV cache policy with a dedicated API, FlexDecoding allows users to define and manage the KV cache themselves. + +Similar to FlexAttention, FlexDecoding takes user-defined `mask_mod` and `score_mod` functions. These functions modify attention scores before the softmax operation. + +![chart](/assets/images/flexattention-for-inference/fg2.png){:style="width:100%"} + +``` +score_mod(score, b, h, q_idx, kv_idx) -> tensor # return updated score +``` + +Score is a scalar pytorch tensor that represents the dot product of a query token and a key token. 
The rest of the arguments specify which score is being computed: + + + +* `b` batch index +* `h` attention head index +* `q_idx` token position in query tensor +* `kv_idx` token position in key/value tensor + +In the decoding phase, previously calculated tokens are cached, and only the latest generated token (i-th) is used as the query. A naive causal mask on this one token query looks like this: + +``` +def causal(score, b, h, q_idx, kv_idx): + return torch.where(q_idx >= kv_idx, score, -float("inf")) +``` + + +![chart](/assets/images/flexattention-for-inference/fg3.png){:style="width:100%"} + + +This is problematic: the new token “*saw*” should attend to all previously generated tokens i.e. “*The cat sat on the mat and saw*”, not just the first entry in the kv cache. To correct this, the `score_mod` needs to **offset q_idx** **by i **for accurate decoding. + + +![chart](/assets/images/flexattention-for-inference/fg4.png){:style="width:100%"} + + +Creating a new `score_mod` for each token to accommodate the offset is slow since it means FlexAttention needs to be recompiled every iteration for a different `score_mod`. Instead, + +We define this `offset` as a tensor and increment its value at each iteration: + +``` +offset = torch.tensor(i, "cuda") +def causal_w_offset(score, b, h, q_idx, kv_idx): + return torch.where(q_idx + offset >= kv_idx, score, -float("inf")) + +# Attend the i-th token +flex_attention(..., score_mod=causal_w_offset ) # Compiles the kernel here +... +# Attend the i+1-th token +offset = offset + 1 # Increment offset +flex_attention(..., score_mod=causal_w_offset ) # Doesn't need to recompile! +``` + +Notably, here `offset` becomes a captured tensor and it does not need to recompile if `offset` changes values. + +Manually rewriting your `score_mod` and `mask_mod` for offset handling isn't necessary. We can automate this process with a generic rewriter: + +``` +offset = torch.tensor(i, "cuda") + +def get_score_mod_w_offset(score_mod: _score_mod_signature, _offset: tensor): + def _score_mod(score, b, h, q, kv): + return score_mod(score, b, h, q + _offset, kv) + return _score_mod + +def get_mask_mod_w_offset(mask_mod: _mask_mod_signature, _offset: tensor): + def _mask_mod(b, h, q, kv): + return mask_mod(b, h, q + _offset, kv) + return _mask_mod + +causal_w_offset = get_score_mod_w_offset(causal, offset) +``` + +## BlockMask for Inference + +We can also use BlockMask with inference to leverage mask sparsity. The idea is to precompute the BlockMask once during model setup and use slices of it during decoding + + +### Precomputing BlockMask + +During setup, we create a squared BlockMask for `MAX_SEQ_LEN x MAX_SEQ_LEN`: + +``` +from torch.nn.attention.flex_attention import create_block_mask + +def causal_mask(b, h, q_idx, kv_idx): + return q_idx >= kv_idx + +block_mask = create_block_mask(causal_mask, B=None, H=None, Q_LEN=MAX_SEQ_LEN,KV_LEN=MAX_SEQ_LEN) +``` + +![chart](/assets/images/flexattention-for-inference/fg5.png){:style="width:100%"} + + +### Using BlockMask During Decoding + +For the i-th token, we use a slice of the mask: + +``` +block_offset = i // block_mask.BLOCK_SIZE[0] +block_mask_slice = block_mask[:, :, block_offset] + +# don't forget to use the mask_mod with offset! 
+block_mask_slice.mask_mod = get_mask_mod_w_offset(causal_mask, offset)
+```
+
+![chart](/assets/images/flexattention-for-inference/fg6.png){:style="width:100%"}
+
+
+## Performance
+
+
+![chart](/assets/images/flexattention-for-inference/fg7.png){:style="width:100%"}
+
+The FlexDecoding kernel performs on par with FlashDecoding (FAKV) and significantly outperforms PyTorch scaled_dot_product_attention ([code](https://github.com/pytorch/pytorch/blob/main/benchmarks/transformer/score_mod.py)).
+
+
+![chart](/assets/images/flexattention-for-inference/fg8.png){:style="width:100%"}
+
+FlexDecoding boosts LLaMa3.1-8B serving performance by 1.22x-2.04x, and LLaMa3.1-70B performance by 0.99x - 1.66x compared to SDPA in gpt-fast. ([code](https://github.com/pytorch-labs/gpt-fast))
+
+
+## Paged Attention
+
+[vLLM](https://blog.vllm.ai/2023/06/20/vllm.html) is one of the popular LLM serving engines, powered by efficient memory management from PagedAttention. The existing [PagedAttention](https://github.com/vllm-project/vllm/blob/main/csrc/attention/paged_attention_v2.cu) implementation requires dedicated CUDA kernels and shows limited flexibility in supporting emerging attention variants. In this section, we present a PT2-native PagedAttention implementation that is enabled by flex attention and torch.compile.
+
+PagedAttention scatters the KV cache to reduce memory fragmentation and support higher batch sizes. Without PagedAttention, the KV cache from the same request is stored in contiguous memory, requiring 2 tensors of shape *B x H x KV_LEN x D*. We call this the logical KV cache. Here, KV_LEN is the maximum sequence length over all requests in a batch. Considering Figure 1(a), KV_LEN is 9, so all requests must be padded to 9 tokens, leading to large memory waste. With PagedAttention, we can chunk each request into multiple pages of the same size page_size and scatter these pages into a physical KV cache of shape *1 x H x max_seq_len x D*, where max_seq_len = n_pages x page_size. This avoids padding requests to the same length and saves memory. Specifically, we provide an `assign` API to update the KV cache via index computations:
+
+```
+def assign(
+    batch_idx: torch.Tensor,
+    input_pos: torch.Tensor,
+    k_val: torch.Tensor,
+    v_val: torch.Tensor,
+    k_cache: torch.Tensor,
+    v_cache: torch.Tensor,
+) -> None
+```
+
+Behind this `assign` API is a page table, a tensor mapping the logical KV cache to the physical KV cache:
+
+[batch_idx, logical_page_idx] -> physical_page_idx
+
+`assign` takes `k_val` and `v_val` and scatters them to the physical KV cache, guided by the mapping from the page table.
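+To make the index computation concrete, here is a small, self-contained sketch of how a page table can route one token’s K/V into the shared physical cache. The sizes, the `page_table` contents, and the `assign_sketch` helper are all hypothetical illustrations, not the actual `assign` implementation.
+
+```
+import torch
+
+# Toy sizes for illustration only.
+H, D = 4, 16                # heads, head dim
+page_size, n_pages = 4, 8   # physical cache holds n_pages * page_size slots
+max_seq_len = n_pages * page_size
+
+# Physical KV cache shared by all requests: 1 x H x max_seq_len x D
+k_cache = torch.zeros(1, H, max_seq_len, D)
+v_cache = torch.zeros(1, H, max_seq_len, D)
+
+# Page table: [batch_idx, logical_page_idx] -> physical_page_idx (-1 = unallocated)
+page_table = torch.tensor([[0, 2, -1, -1, -1, -1, -1, -1],
+                           [1, 3, 4, -1, -1, -1, -1, -1]])
+
+def assign_sketch(batch_idx, input_pos, k_val, v_val, k_cache, v_cache):
+    # Translate a logical token position into its physical slot via the page table.
+    logical_page = input_pos // page_size
+    offset_in_page = input_pos % page_size
+    physical_page = page_table[batch_idx, logical_page]
+    physical_pos = physical_page * page_size + offset_in_page
+    # Scatter the new K/V values for this token into the shared physical cache.
+    k_cache[0, :, physical_pos] = k_val
+    v_cache[0, :, physical_pos] = v_val
+
+# Write logical position 4 of request 1: its logical page 1 lives in physical page 3.
+assign_sketch(1, 4, torch.randn(H, D), torch.randn(H, D), k_cache, v_cache)
+```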
In FlexAttention, we exploit BlockMask to identify logical blocks and skip redundant computation. While Paged Attention adds an extra layer of indirect memory access, we can further convert the logical block mask to the physical block mask corresponding to the page table, as illustrated in Figure 2. Our PagedAttention implementation provides a `convert_logical_block_mask` via torch.gather calls: + +``` +def convert_logical_block_mask( + block_mask: BlockMask, + batch_idx: Optional[torch.Tensor] = None, +) -> BlockMask +``` + +![chart](/assets/images/flexattention-for-inference/fg10.png){:style="width:100%"} + + + +**Paged Attention via Block Mask Conversion** + +One remaining question is how to rewrite user-specified `mask_mod` and `score_mod` for PagedAttention. When users specify these modifications, they write with logical indices without the knowledge of the page table maintained at runtime. The following code shows an automated conversion at runtime which is necessary to rewrite user-specified modifications with physical kv indices. The `new_mask_mod` would take the physical_kv_idx and convert it back to the logical_kv_idx and apply user-specified `mask_mod` on the logical_kv_idx for the correct mask. For efficiency, we maintain physical_to_logical as a mapping from physical_kv_block to logical_kv_block to facilitate the conversion. For correctness, we mask out-of-boundary blocks as False with a `torch.where` call. After batching logical KV caches from multiple requests into the same physical KV cache, there are much more physical blocks than the number of logical blocks for each request. Thus, a physical block may not have a corresponding logical block for a specific request during block mask conversion. By masking as False with `torch.where`, we can ensure the correctness that data from different requests do not interfere with each other. Similarly, we can convert the [score_mod](https://github.com/pytorch/pytorch/blob/main/torch/nn/attention/experimental/_paged_attention.py#L308-L338) automatically. + +``` +def get_mask_mod(mask_mod: Optional[_mask_mod_signature]) -> _mask_mod_signature: + if mask_mod is None: + mask_mod = noop_mask + + def new_mask_mod( + b: torch.Tensor, + h: torch.Tensor, + q_idx: torch.Tensor, + physical_kv_idx: torch.Tensor, + ): + physical_kv_block = physical_kv_idx // page_size + physical_kv_offset = physical_kv_idx % page_size + logical_block_idx = physical_to_logical[b, physical_kv_block] + logical_kv_idx = logical_block_idx * page_size + physical_kv_offset + return torch.where( + logical_block_idx >= 0, mask_mod(b, h, q_idx, logical_kv_idx), False + ) + + return new_mask_mod +``` + +Figure 3 demonstrates the latency from Paged Attention ([code](https://github.com/pytorch-labs/attention-gym/blob/main/attn_gym/paged_attention/latency.py)). Overall, there is less than 5% overhead from Flex Attention with Paged Attention, compared with Flex Attention only. We also observe an on-par performance with Flash Attention v2. A [minimal serving example](https://github.com/pytorch-labs/attention-gym/blob/main/attn_gym/paged_attention/throughput.py) further shows that PagedAttention can support 76x higher batch size when evaluating on [OpenOrca dataset](https://huggingface.co/datasets/Open-Orca/OpenOrca) which includes 1M GPT-4 completions and 3.2M GPT-3.5 completions. 
+ + +![chart](/assets/images/flexattention-for-inference/fg11.png){:style="width:100%"} + + +**Paged Attention: Latency under diverse sequence length** + + +## Ragged input sequences with Nested Jagged Tensors (NJTs) + +FlexAttention now supports ragged-sized input sequences through the use of Nested Jagged Tensors (NJTs). NJTs represent ragged-sized sequences by packing sequences into a single “stacked sequence” and maintaining a set of offsets delimiting sequence boundaries for each batch item. + +A block mask can be created for input NJTs through the new `create_nested_block_mask()` API. The returned block mask is compatible with the ragged structure of the given NJT, treating it as a single “stacked sequence” with inter-sequence attention automatically masked out. The mask_mod or score_mod function can be written as usual. + +``` +from torch.nn.attention.flex_attention import create_nested_block_mask, flex_attention + +BATCH = 8 +NUM_HEADS = 8 +D = 16 +device = "cuda" + +# Input NJTs of shape (BATCH, SEQ_LEN*, D) with ragged SEQ_LEN +sequence_lengths = [torch.randint(5, 30, ()).item() for _ in range(BATCH)] +query = torch.nested.nested_tensor([ + torch.randn(seq_len, NUM_HEADS * D, device=device) + for seq_len in sequence_lengths +], layout=torch.jagged) +key = torch.randn_like(query) +value = torch.randn_like(query) + +# View as shape (BATCH, NUM_HEADS, SEQ_LEN*, HEAD_DIM) +query = query.unflatten(-1, [NUM_HEADS, D]).transpose(1, 2) +key = key.unflatten(-1, [NUM_HEADS, D]).transpose(1, 2) +value = value.unflatten(-1, [NUM_HEADS, D]).transpose(1, 2) + +# Simple causal mask +def my_mask_mod(b, h, q_idx, kv_idx): + return q_idx >= kv_idx + +# Construct a block mask using the ragged structure of the +# specified query NJT. Ragged-sized sequences are treated as a single +# "stacked sequence" with inter-sequence attention masked out. +block_mask = create_nested_block_mask(my_mask_mod, 1, 1, query) + +# For cross attention, create_nested_block_mask() also supports a +# rectangular block mask using the ragged structures of both query / key. +#block_mask = create_nested_block_mask(my_mask_mod, 1, 1, query, key) + +output = flex_attention(query, key, value, block_mask=block_mask) +``` + +## Trainable Biases + +FlexAttention now supports trainable parameters in `score_mod functions.` This feature enables users to reference tensors that require gradients within their `score_mod` implementations, with gradients automatically backpropagating through these parameters during training. + + +### Memory-Efficient Gradient Accumulation + +Instead of materializing the full attention scores matrix, FlexAttention uses atomic additions (`tl.atomic_add`) to accumulate gradients. This approach significantly reduces memory usage at the cost of introducing some non-determinism in gradient calculations. + + +### Handling Broadcasted Operations + +Broadcasting operations in the forward pass (e.g., `score + bias[h]`) require special consideration in the backward pass. When broadcasting a tensor across multiple attention scores within a head or other dimensions, we need to reduce these gradients back to the original tensor shape. Rather than materializing the full attention score matrix to perform this reduction, we use atomic operations. While this incurs some runtime overhead, it allows us to maintain memory efficiency by avoiding the materialization of large intermediate tensors. 
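+To make the end-to-end gradient flow concrete, here is a rough sketch of a training step, assuming the prototype trainable-bias support described above; the shapes and names are illustrative only.
+
+```
+import torch
+from torch.nn.attention.flex_attention import flex_attention
+
+flex_attention = torch.compile(flex_attention)
+
+B, H, S, D = 2, 4, 128, 64
+query = torch.randn(B, H, S, D, device="cuda", requires_grad=True)
+key = torch.randn(B, H, S, D, device="cuda", requires_grad=True)
+value = torch.randn(B, H, S, D, device="cuda", requires_grad=True)
+
+# Per-head trainable bias referenced inside score_mod (a single read per tensor).
+bias = torch.randn(H, device="cuda", requires_grad=True)
+
+def score_mod(score, b, h, q_idx, kv_idx):
+    return score + bias[h]
+
+out = flex_attention(query, key, value, score_mod=score_mod)
+out.sum().backward()
+print(bias.grad.shape)  # torch.Size([4]) -- gradients flow back into the bias
+```
+
+Note that, as described above, the backward accumulation for the broadcasted bias uses atomic additions, so the resulting gradients may be slightly non-deterministic.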
+ + +### Current Limitations + +The implementation currently allows only a single read from each input tensor in the `score_mod` function. For example, `bias[q_idx] + bias[kv_idx]` would not be supported as it reads from the same tensor twice. We hope to remove this restriction in the future. + + +### Simple Example: + +``` +bias = torch.randn(num_heads, requires_grad=True) +def score_mod(score, b, h, q_idx, kv_idx): + return score + bias[h] +``` + +## Performance Tuning for FlexAttention + + +### TL;DR + +For optimal performance, compile FlexAttention using `max-autotune`, especially when dealing with complex `score_mods` and `mask_mods`: + +flex_attention = torch.compile(flex_attention, dynamic=True, mode='max-autotune') + + +### What is `max-autotune`? + +`max-autotune` is a `torch.compile` mode in which TorchInductor sweeps many kernel parameters (e.g., tile size, `num_stages`) and selects the best-performing configuration. This process allows kernels to test both successful and failing configurations without issues, and find the best viable configuration. + +While compilation takes longer with `max-autotune`, the optimal configuration is cached for future kernel executions. + +Here’s an example of FlexAttention compiled with `max-autotune`: + +``` +triton_flex_attention_backward_7 0.2528 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'ieee'", GQA_SHARED_HEADS=7, HAS_FULL_BLOCKS=False, IS_DIVISIBLE=False, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, QK_HEAD_DIM=128, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.08838834764831843, SPARSE_KV_BLOCK_SIZE=1073741824, SPARSE_Q_BLOCK_SIZE=1073741824, V_HEAD_DIM=128, num_stages=4, num_warps=4 +``` + +### Why Use `max-autotune` for FlexAttention? + +The amount of shared memory utilized in FlexAttention depends on `score_mod` and `mask_mod` methods. This variability means that the preconfigured default kernel parameters may lead to performance cliffs or even out of shared memory** **errors on certain hardware for some masks/mods. + +For instance, with document masks, default configurations can halve GPU occupancy, reducing performance to ~75% of its potential on some GPUs. To avoid such issues, we strongly recommend enabling `max-autotune`. + + +## Updates and Enhancements + +* Now available as a prototype feature in PyTorch 2.5.0 +* Fixed critical correctness issues, including a bug affecting multiple calls to FlexAttention within the same call to torch.compile + + +## Expanded Architecture Support + +* Arbitrary sequence length support - no longer requires multiples of 128 +* Added native grouped-query attention (GQA) support via `is_gqa=True` +* Enhanced dimension flexibility: + * Different QK and V head dimensions + * Non-power-of-two head dimensions +* Trainable attention biases (prototype) + + +## Under the Hood + +* New fused CPU backend +* Improved TF32 handling for float32 inputs +* Resolved various dynamic shape issues +* Output layout matching query strides + +These updates make FlexAttention more robust and flexible while maintaining its core promise of combining PyTorch's ease of use with FlashAttention's performance benefits. 
\ No newline at end of file diff --git a/_posts/2025-05-01-docathon-2025.md b/_posts/2025-05-01-docathon-2025.md new file mode 100644 index 000000000000..1ad33370e775 --- /dev/null +++ b/_posts/2025-05-01-docathon-2025.md @@ -0,0 +1,54 @@ +--- +layout: blog_detail +title: 'Announcing the PyTorch Docathon 2025' +--- + +![PyTorch Docathon 2025](/assets/images/docathon-2025.png){:style="max-width:600px; display: block; margin-left: auto; margin-right: auto"} + + +We're thrilled to announce the [2025 PyTorch Docathon](https://community.linuxfoundation.org/events/details/lfhq-pytorch-foundation-presents-pytorch-docathon-june-3rd-18th-2025/)! This is a hackathon-style event aimed at enhancing PyTorch documentation with the support of the community. Documentation is a vital component of any technology, and by refining it, we can simplify the onboarding process for new users, help them effectively utilize PyTorch's features, and ultimately speed up the transition from research to production in machine learning. + + +## WHY PARTICIPATE + + +### Low Barrier to Entry + +Unlike many open-source projects that require deep knowledge of the codebase and previous contributions to join hackathon events, the Docathon is tailored for newcomers. While we expect participants to be familiar with Python, and have basic knowledge of PyTorch and machine learning, there are tasks related to website issues that don't even require that level of expertise. + + +### Tangible Results + +A major advantage of the Docathon is witnessing the immediate impact of your contributions. Enhancing documentation significantly boosts a project's usability and accessibility, and you'll be able to observe these improvements directly. Seeing tangible outcomes can also be a strong motivator to continue contributing. + + +### Collaborative Environment + +The Docathon fosters a collaborative atmosphere, offering you the chance to work alongside other contributors and PyTorch maintainers to improve the documentation. This is a fantastic opportunity to learn from peers, exchange ideas, and build connections. + + +### Learning Opportunities + +Even if you're not a PyTorch expert, the Docathon offers a valuable learning experience. You'll have the chance to delve into PyTorch modules, test tutorials on your machine, and explore them in the CI environment. + + +## WHO SHOULD PARTICIPATE + +Whether you’re a seasoned documentation expert or just starting out, we invite everyone to join in the PyTorch docathon to contribute and develop your skills and knowledge to help improve the documentation for everyone! We will have issues labelled by skill level, and the PyTorch Discord will be available for collaboration and help. + + +## EVENT DETAILS + + + +* June 3: Kick-off 10 AM PT +* June 4 - June 15: Submissions and Feedback +* June 16 - June 17: Final Reviews +* June 18: Winner Announcements + +Make sure to [RSVP](https://community.linuxfoundation.org/events/details/lfhq-pytorch-foundation-presents-pytorch-docathon-june-3rd-18th-2025/) to the event so you receive all the notifications and instructions on how to participate. + +Further details about the Docathon will be shared during the Kick-off call on June 3. 
+ + +**Don't forget to register for this year's event: [RSVP now](https://community.linuxfoundation.org/events/details/lfhq-pytorch-foundation-presents-pytorch-docathon-june-3rd-18th-2025/)** \ No newline at end of file diff --git a/_posts/2025-05-01-how-ibm-uses-pt-terratorch.md b/_posts/2025-05-01-how-ibm-uses-pt-terratorch.md new file mode 100644 index 000000000000..db6955023bc0 --- /dev/null +++ b/_posts/2025-05-01-how-ibm-uses-pt-terratorch.md @@ -0,0 +1,90 @@ +--- +layout: blog_detail +title: 'How IBM Research Uses PyTorch and TerraTorch to Make Geospatial Computer Vision Accessible for Everyone' +hidden: true +--- + +Earth Observation-based analytics are becoming essential for understanding our planet — from monitoring deforestation to tracking urban development and analyzing the impacts of climate change. However, the coding and deep learning skills for applying AI models to satellite imagery and earth observation data has traditionally been a major barrier for many practitioners. + +By IBM Research’s launch of TerraTorch 1.0, a PyTorch domain library for fine-tuning of Geospatial Computer Vision Foundation Models, we make geospatial AI not only more accessible but also more practical for the wider PyTorch community. Our goal: simplify the process so that any data scientist, researcher, or enthusiast can build powerful geospatial models with ease and low GPU and data processing requirements. + +![globes](/assets/images/how-ibm-uses-pt-terratorch/fg1.png){:style="width:100%"} + + +**The power of foundation models, even with 75-95% of the input data removed, the models do a fantastic job in reconstruction of the input data - therefore learning the underlying physics of our planet in a deep, latent space** + +## The Business Challenge + +Our goal was to remove the technical barriers that prevent people from working with satellite imagery, weather and climate data at scale. Together with NASA, we’ve developed the Prithvi family of foundation models. Integrating the latest innovations of AI research using the clean API PyTorch provides has facilitated the job. + +We wanted to create a framework that anyone can use to go from raw data to inference ready models in just a few steps. + + +![globes](/assets/images/how-ibm-uses-pt-terratorch/fg2.png){:style="width:100%"} + + +**How a weather and climate foundation model created and fine-tuned on PyTorch is used for weather forecasts** + +## How IBM Research Used PyTorch + +We’ve built TerraTorch on top of PyTorch, leveraging its dynamic ecosystem to integrate: + + + +* PyTorch Lightning for clean, scalable training loops +* TorchGeo for geospatial data handling and transformations (PyTorch transforms) +* For foundation models like the leading generative multimodal foundation model ['Terramind'](https://research.ibm.com/blog/terramind-esa-earth-observation-model), co-developed by IBM and ESA, and [the ‘Prithvi’ family](https://huggingface.co/ibm-nasa-geospatial), co-developed by IBM and NASA, TerraTorch has been used to fine-tune all of the downstream geospatial models for satellite imagery, weather and climate data. It includes the family of fine-tuned models that IBM has released as part of [Granite](https://huggingface.co/collections/ibm-granite/granite-geospatial-models-667dacfed21bdcf60a8bc982). In addition, other interesting foundation models and ecosystem components like Clay, SatMAE, Satlas, DeCur and DOFA are included in TerraTorch. 
+* Powerful and state-of-the-art vision transformers to experiment with modern neural network architectures +* TerraTorch-Iterate build on top of PyTorch, Optuna, MLFlow and Ray Tune for Hyperparameter Optimization (HPO), Neural Architecture Search (NAS) and Foundation Model Benchmarking (GeoBench), where TerraTorch became the reference implementation + + +![flow diagram](/assets/images/how-ibm-uses-pt-terratorch/fg5.png){:style="width:100%"} + +**The fine-tuning and inference process is completely described in a single YAML config file. There, the architectural building blocks of the model (backbone, neck, decoder, head) are defined. The Model Factory assembles the model using the build-in and custom registries. In addition, the Optimizer and Data Modules are created as defined in the config. Finally, everything is passed to the Lightning Trainer, who executes the task.** + + +With PyTorch’s flexibility, we were able to prototype quickly, iterate on model architectures, and deploy pipelines for a range of geospatial applications — from flood and biomass detection to increasing resolution of climate data, where some of our our work became part of the [IBM Granite Geospatial Model Family](https://huggingface.co/collections/ibm-granite/granite-geospatial-models-667dacfed21bdcf60a8bc982). + + +![flow diagram](/assets/images/how-ibm-uses-pt-terratorch/fg3.png){:style="width:100%"} + + +**Architecture of the Prithvi-EO-2.0-600M foundation model which IBM Research developed together with NASA** + +## Solving AI Challenges with PyTorch + +PyTorch helped us to tackle three major challenges: + +* Ease of experimentation: Dynamic computation graphs, automatic differentiation, full abstraction of CUDA and rich visualization tools made it simple to test different models and training strategies. +* Scalability: With DDP, FSDP, PyTorch Lightning and TorchGeo, we could train models on large-scale datasets without worrying about infrastructure. +* Community support: PyTorch - the de-facto standard in AI research - with its active community and excellent documentation made it easy to overcome hurdles and stay up to date with the latest advancements in AI research. + +## A Word from IBM Research + +*"PyTorch gave me the power to turn complex linear algebra and optimization problems into accessible, shareable solutions for the community. 
+## A Word from IBM Research + +*"PyTorch gave me the power to turn complex linear algebra and optimization problems into accessible, shareable solutions for the community. It feels empowering that we’re building and fine-tuning models for anyone curious about understanding our planet through AI."* + +— Romeo Kienzler, AI Research Engineer at IBM Research Zurich, Rueschlikon + + +![quote](/assets/images/how-ibm-uses-pt-terratorch/fg4.png){:style="width:100%"} + + +## The Benefits of Using PyTorch + +Using PyTorch allowed us to: + + + +* Build a reproducible, open-source framework for fine-tuning geospatial foundation models +* Share our work with the community through easy-to-follow notebooks, TerraTorch configuration files, tutorials and model checkpoints on HuggingFace +* Rapidly iterate over foundation model architectures and deploy fine-tuned models for inference, from research to real-world client products + +## Learn More + +For more information about this project and to explore the code, visit: + +* [GitHub Repository](https://github.com/IBM/terratorch) +* [IBM Research: Simplifying Geospatial AI with TerraTorch 1.0](https://research.ibm.com/blog/simplifying-geospatial-ai-with-terra-torch-1-0) +* [TerraTorch PrithviEOv2 example notebooks](https://github.com/IBM/terratorch/tree/main/examples/tutorials/PrithviEOv2) +* [TerraMind example notebooks](https://github.com/IBM/terramind/tree/main/notebooks) +* [Run TerraMind using TerraTorch on Colab](https://colab.research.google.com/github/IBM/terramind/blob/main/notebooks/terramind_v1_base_sen1floods11.ipynb) diff --git a/_posts/2025-05-02-pt-day-france-featured-sessions.md b/_posts/2025-05-02-pt-day-france-featured-sessions.md new file mode 100644 index 000000000000..36bd9bacd37b --- /dev/null +++ b/_posts/2025-05-02-pt-day-france-featured-sessions.md @@ -0,0 +1,49 @@ +--- +layout: blog_detail +title: 'PyTorch Day France Featured Sessions: A Defining Moment for Open Source AI' +--- + +[PyTorch Day France](https://events.linuxfoundation.org/pytorch-day-france/) offers a front-row seat to the future of open source AI. Taking place **7 May at Station F in Paris** and co-located with **[GOSIM AI Paris](https://paris2025.gosim.org/)**, this one-day event will bring together developers, researchers, and industry leaders for a day of technical sessions, real-world insights, and community exchange. + + +## 🌍 A Major Milestone for the PyTorch Foundation + +This event marks the very first **PyTorch Day**, launching a new international series hosted annually in different regions to convene AI researchers, developers, engineers, and enthusiasts. PyTorch Days are designed to spotlight open source AI advancements, foster community collaboration, and provide a forum to learn about active, high-impact AI projects built using PyTorch. + +PyTorch Day France also represents a pivotal moment in the PyTorch Foundation’s journey. With its recent [expansion into an umbrella foundation]( https://pytorch.org/blog/pt-foundation-expands/), PyTorch is now positioned to support a broader ecosystem of trusted, community-driven AI projects across the full AI lifecycle. + +At PyTorch Day France, you’ll hear directly from PyTorch Foundation **Executive Director, Matt White,** about this transition—and get a first look at some exciting announcements.
+ + +## 🎟️ Registration Details + +[Register now](https://www.eventbrite.com/e/gosim-ai-paris-tickets-1265928669729?aff=oddtdtcreator) with code **PYTORCH** for **free access** to the full day of **PyTorch Day France** sessions, **plus** **GOSIM AI Paris**. + +🔗Two events, one registration—double the sessions, double the innovation. \ +[Register here](https://www.eventbrite.com/e/gosim-ai-paris-tickets-1265928669729?aff=oddtdtcreator) + + +## 📅 Featured Sessions + +The day’s agenda includes deep technical dives and applied AI use cases from across the community, including the following talks: + + + +* [Luca Antiga (Lightning AI)](https://sched.co/21nz4) + *Lightning Thunder: Supercharged PyTorch for Modern Hardware* +* [Erwan Gallen & Eldar Kurtic (Red Hat)](https://sched.co/21nyd) + *Scaling LLM Inference with vLLM: Multi‑Accelerator Serving and Quantized LLMs* +* [Pierre Rouanet (Pollen Robotics)](https://sched.co/21nyX) + *Real-World Robotics as the Next Frontier for AI?* +* [Pablo Montalvo (Hugging Face)](https://sched.co/21nzG) + *PyTorch x Transformers: Pythonicity, Autodiff, and Modularity Defining Modern AI* +* [Pedro Ortis (Common Crawl)](https://sched.co/21nym) + *Harnessing Common Crawl for AI and ML Applications* +* [Meriem Bendris (NVIDIA)](https://sched.co/21nys) + *Teaching Mistral to Reason: Post-Training with PyTorch and NVIDIA* +* [Olatunji Ruwase (Snowflake)](https://sched.co/21nyy) + *DeepSpeed – Efficient Training Scalability for Deep Learning Models* + +[View the full schedule](https://pytorchdayfrance2025.sched.com/). + +Whether you’re a contributor, practitioner, or simply curious about what’s ahead, PyTorch Day France is an opportunity to connect with the community and shape what’s next for our ecosystem. diff --git a/_posts/2025-05-02-pt-korea-user-group-recap.md b/_posts/2025-05-02-pt-korea-user-group-recap.md new file mode 100644 index 000000000000..b5a2126271b8 --- /dev/null +++ b/_posts/2025-05-02-pt-korea-user-group-recap.md @@ -0,0 +1,87 @@ +--- +layout: blog_detail +title: 'Recap of the PyTorch Korea User Group Meetup: A Technical Conference with a PyTorch Core Maintainer' +author: 'Jiho Kim, PyTorch Korea User Group' +--- + +At the end of March, the PyTorch Korea User Group hosted a special meetup that brought together prominent speakers for deep discussions on the PyTorch core and its broader ecosystem. With the event more than doubling in size compared to past gatherings, we were able to connect with even more developers and share insights. Huge thanks to [goorm](https://goorm.co/) for sponsoring the fantastic venue! 😄 + + +![people at a conference](/assets/images/pt-korea-user-group-recap/fg1.jpg){:style="width:100%"} + + + +This recap is for those who couldn’t attend in person, as well as for participants who want to revisit the energy and insights of the day. The event featured experts in core PyTorch, AI accelerators, inference optimization, and large language model development. Below is a quick overview of the key sessions that anchored the conference. + + + +## 1️⃣ Jerry Lee | PyTorch Foundation + +Representing the PyTorch Foundation, part of the Linux Foundation, Jaeung provided an overview of how PyTorch is driving core open source technologies forward. 
He shared PyTorch's growth story, the many global projects currently in motion, and the ecosystem’s impressive 20%+ annual growth. The session also covered how the foundation operates, how member organizations are involved, and upcoming plans that are particularly useful for practitioners. + + +![people at a conference](/assets/images/pt-korea-user-group-recap/fg2.jpg){:style="width:100%"} + + +## 2️⃣ Alban Desmaison | PyTorch Roadmap + +Alban shared the design philosophy behind PyTorch and Meta’s official contribution roadmap ([link](https://dev-discuss.pytorch.org/t/meta-pytorch-team-2025-h1-roadmaps/2794)). He provided a deep technical dive into the differences between Eager and Compiled modes, especially breaking down the backend architecture of device Eager execution. Practical tools and improvements were also introduced—such as memory profilers, enhanced custom operator support, and pinned memory optimizations. + + +![people at a conference](/assets/images/pt-korea-user-group-recap/fg3.jpg){:style="width:100%"} + + + + +## 3️⃣ Hongseok Kim | PyTorch on Rebellions AI Accelerators: Status + +Rebellions is building runtime integration for their proprietary NPU architecture, fully aligned with the structural changes in PyTorch 2.0. This talk introduced the performance and scalability of their upcoming chip, their integration strategy with the PyTorch runtime, and challenges in supporting Eager Mode. Hongseok also previewed their roadmap toward releasing these features within the year. + +![people at a conference](/assets/images/pt-korea-user-group-recap/fg4.jpg){:style="width:100%"} + + + +## 4️⃣ Kyujin Cho | Backend.AI: A Unified Platform for All AI Accelerators + +Backend.AI abstracts and integrates various AI accelerators into a unified workflow. As the diversity of accelerator architectures grows, the need for portability and infrastructure unification becomes even more important. This session showcased features across development and operations—from NPU scheduling and resource allocation to monitoring. Backend.AI currently supports accelerators from NVIDIA, Intel, Tenstorrent, Rebellions, and more. + +![people at a conference](/assets/images/pt-korea-user-group-recap/fg5.jpg){:style="width:100%"} + + + +## 5️⃣ Taeho Kim | Optimizing & Deploying Models Across Multiple Chipsets Using NetsPresso + +This talk focused on the challenges of inference in real-world industrial applications of AI models. As new state-of-the-art models emerge rapidly, there’s a growing need for environments that can quickly validate device compatibility—ideally with one-click ease. NetsPresso is actively working on a static graph representation compatible with PyTorch, offering efficient support for model development, optimization, and testing. + + +![people at a conference](/assets/images/pt-korea-user-group-recap/fg6.jpg){:style="width:100%"} + + +## 6️⃣ Jungyeop Lee | The Journey to Reproduce Deepseek-R1 + +Jungyeop took us through his journey of reproducing Deepseek, a large language model—an effort that involved 201 experiments. He shared real-world lessons from training with Korean data, tokenizer modifications, and fine-tuning strategies. His practical insights and next steps were especially valuable for those building or re-implementing large models from scratch. 
+ + +![people at a conference](/assets/images/pt-korea-user-group-recap/fg7.jpg){:style="width:100%"} + + +## 7️⃣ Sol Kim | A journey from TCP architecture to production-level LLMs + +Sol presented an integrated optimization approach to deploying large models using the TCP(Tensor Contraction Processor) architecture, which supports tensor contraction at the hardware level. The talk highlighted optimization techniques built on hardware abstraction layers (HALs) and bottom-up integration strategies with PyTorch—offering a hybrid hardware-software perspective. + + +![people at a conference](/assets/images/pt-korea-user-group-recap/fg8.jpg){:style="width:100%"} + +## 💡 Panel Talk & Q&A 💡 + +The event wrapped up with an engaging panel discussion. Attendees asked sharp questions, and the speakers offered insightful answers. It was a powerful moment that captured the community’s enthusiasm for PyTorch and their hunger for deeper technical understanding. + + +![people at a conference](/assets/images/pt-korea-user-group-recap/fg9.jpg){:style="width:100%"} + + +## Final Thoughts + +Since our first offline meetup in October 2022, the PyTorch Korea User Group has held five major technical conferences. Each event deepens our appreciation for the scale and depth of the PyTorch ecosystem. With perspectives from users, contributors, and ecosystem builders, the stories we share are only growing—and we’re committed to continuing this journey together. + +See you at the next conference—with even more exciting talks to come! 🙌 \ No newline at end of file diff --git a/_sass/events.scss b/_sass/events.scss index 7397707b5df2..18e89c238ca2 100644 --- a/_sass/events.scss +++ b/_sass/events.scss @@ -37,6 +37,16 @@ } } } + .community-event { + margin: 0; + padding: 3px 10px; + border: 1px solid #8c8c8c; + border-radius: 3px; + text-transform: uppercase; + font-size: 14px; + font-weight: 700; + color: #8c8c8c; + } .event-side-nav-container { padding-left: 3rem; ul { diff --git a/_sass/navigation.scss b/_sass/navigation.scss index fd84dc74f890..420978c613c1 100644 --- a/_sass/navigation.scss +++ b/_sass/navigation.scss @@ -2,7 +2,7 @@ height: $mobile_header_height; @include full-nav-menu-desktop { - height: $desktop_header_height; + height: $desktop_header_height - 20px; } align-items: center; @@ -13,6 +13,9 @@ position: fixed; right: 0; top: 0; + @include full-nav-menu-desktop { + top: 32px; + } width: 100%; z-index: 9999; @@ -36,7 +39,7 @@ @include full-nav-menu-desktop { background-color: #CC2F90; color: $white; - display: none; + display: flex; letter-spacing: .34px; justify-content: center; padding: 4px 0; diff --git a/_videos/vid10.md b/_videos/vid10.md new file mode 100644 index 000000000000..faf1c637b5ae --- /dev/null +++ b/_videos/vid10.md @@ -0,0 +1,5 @@ +--- +title: 'Using PyTorch and DINOv2 for Multi-label Plant Species Classification' +youtube_id: rxVg3yrc51s +date: Mar 28, 2025 +--- diff --git a/_videos/vid11.md b/_videos/vid11.md new file mode 100644 index 000000000000..b7720dd02abb --- /dev/null +++ b/_videos/vid11.md @@ -0,0 +1,5 @@ +--- +title: 'PyTorch Expert Exchange – Multi-Modal Tabular Deep Learning with PyTorch Frame' +youtube_id: zPjLHf0X78w +date: Feb 20, 2025 +--- diff --git a/_videos/vid12.md b/_videos/vid12.md new file mode 100644 index 000000000000..f3ba5fc289fa --- /dev/null +++ b/_videos/vid12.md @@ -0,0 +1,5 @@ +--- +title: 'PyTorch 2.6 Release Live Q&A' +youtube_id: 1OopuwTq6oE +date: Feb 8, 2025 +--- diff --git a/_videos/vid13.md b/_videos/vid13.md new file mode 100644 index 
000000000000..747642d8aea4 --- /dev/null +++ b/_videos/vid13.md @@ -0,0 +1,5 @@ +--- +title: 'How does batching work on modern GPUs?' +youtube_id: HTcnp9NEHGY +date: Nov 14, 2024 +--- diff --git a/ai-powered-competitive-programming.html b/ai-powered-competitive-programming.html index efd5defe00bf..5a14b2a7b7a4 100644 --- a/ai-powered-competitive-programming.html +++ b/ai-powered-competitive-programming.html @@ -28,12 +28,12 @@

    AI-Powered Competitive Programming: My HackerCup 2024 Experience

    Location: Online

    - In this talk, Anton will share how he built an AI agent that ranked #1 in the finals of Meta HackerCup 2024 (AI division). Anton developed a workflow that could solve the hardest competitive programming problems quickly and reliably. Anton will walk through how he used state-of-the-art reasoning LLM models, curated RAG, and leveraged cloud infrastructure to safely test and execute solutions at scale. This approach highlights the massive potential of test-time compute scaling and provides insights into AI's future role in programming. + In this talk, Anton shared how he built an AI agent that ranked #1 in the finals of Meta HackerCup 2024 (AI division). Anton developed a workflow that could solve the hardest competitive programming problems quickly and reliably. Anton walked through how he used state-of-the-art reasoning LLMs, curated RAG, and leveraged cloud infrastructure to safely test and execute solutions at scale. This approach highlights the massive potential of test-time compute scaling and provides insights into AI's future role in programming.

    -Anton Pidkuiko is a Software Engineer at Meta, Reality Labs in London. He is currently working on applying the power of Large Language Models to Metaverse Avatar product experiences. + Anton Pidkuiko is a Software Engineer at Meta, Reality Labs in London. He is currently working on applying the power of Large Language Models to Metaverse Avatar product experiences.

    - Register now to attend this event. -
    + Watch the recording now and access Anton's presentation slides here. +

    diff --git a/announcement.html b/announcement.html index 9cc4d3985cfc..90eda81bc8d7 100644 --- a/announcement.html +++ b/announcement.html @@ -19,11 +19,13 @@

    PyTorch
           Foundation

    +

    Accelerating Open Source AI

    - Welcome to the PyTorch Foundation—a vibrant, collaborative hub created for and by the deep learning community. Here, developers, researchers, and industry leaders come together to shape and expand the open source PyTorch framework and ecosystem. Through a network of dedicated contributors to the PyTorch project, the PyTorch Foundation fuels discussion, innovation, and hands-on collaboration across the PyTorch landscape. + Welcome to the PyTorch Foundation—a vibrant, community-driven hub for open source AI. Developers, researchers, and industry pioneers collaborate here to advance the PyTorch framework and strengthen the open source AI ecosystem.

    - Community-driven collaboration is at the heart of PyTorch's growth and evolution. From advancing the core framework to building essential tools that power PyTorch at a production scale, your contributions are key to moving this ecosystem forward. As part of the Linux Foundation, the PyTorch community also supports a variety of initiatives: developer training, regional and local events, open source tooling, research, and guides for both newcomers and seasoned contributors—all to make your journey with PyTorch more accessible and rewarding. + From cutting-edge development to production-ready tools and libraries, the PyTorch Foundation thrives through transparent collaboration and collective innovation. As part of the Linux Foundation, we host global events, deliver specialized training, support research, and provide resources to accelerate your AI journey. + Whether you are contributing code, sharing your expertise, or deploying real-world AI solutions, the PyTorch Foundation actively empowers you to shape the future of accessible and impactful open source AI.

    @@ -46,127 +48,18 @@

    Our Guiding Principles

    -

    Premier Members

    -
    - {% for card in cards %} - {% assign card_title = card.title | split: ' ' %} - - {% endfor %} -
    -
    -
    -
    -
    +

    PyTorch Members

    -
    -
    -
    -
    -

    General Members

    -
    - - - -
    -
    -
    -
    -
    - -
    -
    -
    -
    -

    Associate Members

    - + + + + + + + +
    @@ -179,7 +72,10 @@

    PROM, IIT Rajasthan

    Our Governance

    - The PyTorch Foundation’s Governing Board oversees the Foundation’s activities according to its Guiding Principles and the PyTorch Foundation Charter. + The PyTorch Foundation’s Governing Board oversees the Foundation’s activities according to its Guiding Principles and the PyTorch Foundation Charter. +
    +
    + The PyTorch Foundation Code of Conduct details our commitment to fostering an inclusive, welcoming, and safe environment for everyone involved in the PyTorch Foundation community.

    The technical governance structure for the PyTorch open source project is defined by the PyTorch maintainers and is available on our PyTorch Technical Governance page. diff --git a/assets/HackerCup-AI-PS.pdf b/assets/HackerCup-AI-PS.pdf new file mode 100644 index 000000000000..2841752ef6c5 Binary files /dev/null and b/assets/HackerCup-AI-PS.pdf differ diff --git a/assets/brand-guidelines/PyTorch Foundation Charter.pdf b/assets/brand-guidelines/PyTorch Foundation Charter.pdf deleted file mode 100644 index eeafa7c6cf1c..000000000000 Binary files a/assets/brand-guidelines/PyTorch Foundation Charter.pdf and /dev/null differ diff --git a/assets/images/1738166706211.jpg b/assets/images/1738166706211.jpg new file mode 100644 index 000000000000..80766945d098 Binary files /dev/null and b/assets/images/1738166706211.jpg differ diff --git a/assets/images/6x-faster-async-checkpointing/fg1.png b/assets/images/6x-faster-async-checkpointing/fg1.png new file mode 100644 index 000000000000..4c3d3bf50d02 Binary files /dev/null and b/assets/images/6x-faster-async-checkpointing/fg1.png differ diff --git a/assets/images/6x-faster-async-checkpointing/fg2.png b/assets/images/6x-faster-async-checkpointing/fg2.png new file mode 100644 index 000000000000..1eaddbc43e68 Binary files /dev/null and b/assets/images/6x-faster-async-checkpointing/fg2.png differ diff --git a/assets/images/6x-faster-async-checkpointing/fg3.png b/assets/images/6x-faster-async-checkpointing/fg3.png new file mode 100644 index 000000000000..4c3d3bf50d02 Binary files /dev/null and b/assets/images/6x-faster-async-checkpointing/fg3.png differ diff --git a/assets/images/accelerating-generative-ai-2.jpg b/assets/images/accelerating-generative-ai-2.jpg new file mode 100644 index 000000000000..d2bddef62d8f Binary files /dev/null and b/assets/images/accelerating-generative-ai-2.jpg differ diff --git a/assets/images/accelerating-llm-inference/fg1.png b/assets/images/accelerating-llm-inference/fg1.png new file mode 100644 index 000000000000..68e37dc442f1 Binary files /dev/null and b/assets/images/accelerating-llm-inference/fg1.png differ diff --git a/assets/images/accelerating-llm-inference/fg2.png b/assets/images/accelerating-llm-inference/fg2.png new file mode 100644 index 000000000000..aa33f57d2455 Binary files /dev/null and b/assets/images/accelerating-llm-inference/fg2.png differ diff --git a/assets/images/accelerating-llm-inference/fg3.png b/assets/images/accelerating-llm-inference/fg3.png new file mode 100644 index 000000000000..74192bab2d2c Binary files /dev/null and b/assets/images/accelerating-llm-inference/fg3.png differ diff --git a/assets/images/accelerating-llm-inference/fg4.png b/assets/images/accelerating-llm-inference/fg4.png new file mode 100644 index 000000000000..26ad62e67a21 Binary files /dev/null and b/assets/images/accelerating-llm-inference/fg4.png differ diff --git a/assets/images/accelerating-llm-inference/fg5.jpg b/assets/images/accelerating-llm-inference/fg5.jpg new file mode 100644 index 000000000000..b9fd9e589f25 Binary files /dev/null and b/assets/images/accelerating-llm-inference/fg5.jpg differ diff --git a/assets/images/accelerating-training-float8-rowwise-crusoe/fg1.png b/assets/images/accelerating-training-float8-rowwise-crusoe/fg1.png new file mode 100644 index 000000000000..7dcf02db043e Binary files /dev/null and b/assets/images/accelerating-training-float8-rowwise-crusoe/fg1.png differ diff --git a/assets/images/accelerating-training-float8-rowwise-crusoe/fg2.png 
b/assets/images/accelerating-training-float8-rowwise-crusoe/fg2.png new file mode 100644 index 000000000000..2245f96c5fff Binary files /dev/null and b/assets/images/accelerating-training-float8-rowwise-crusoe/fg2.png differ diff --git a/assets/images/accelerating-training-float8-rowwise-crusoe/fg3.png b/assets/images/accelerating-training-float8-rowwise-crusoe/fg3.png new file mode 100644 index 000000000000..e5797aedd0ca Binary files /dev/null and b/assets/images/accelerating-training-float8-rowwise-crusoe/fg3.png differ diff --git a/assets/images/accelerating-training-float8-rowwise-crusoe/fg4.png b/assets/images/accelerating-training-float8-rowwise-crusoe/fg4.png new file mode 100644 index 000000000000..3adae3b02e6b Binary files /dev/null and b/assets/images/accelerating-training-float8-rowwise-crusoe/fg4.png differ diff --git a/assets/images/accelerating-training-float8-rowwise-crusoe/fg5.png b/assets/images/accelerating-training-float8-rowwise-crusoe/fg5.png new file mode 100644 index 000000000000..7dcf02db043e Binary files /dev/null and b/assets/images/accelerating-training-float8-rowwise-crusoe/fg5.png differ diff --git a/assets/images/accelerating-training-float8-rowwise-crusoe/fg6.png b/assets/images/accelerating-training-float8-rowwise-crusoe/fg6.png new file mode 100644 index 000000000000..9c77b71f5d4f Binary files /dev/null and b/assets/images/accelerating-training-float8-rowwise-crusoe/fg6.png differ diff --git a/assets/images/accelerating-training-float8-rowwise-crusoe/fg7.png b/assets/images/accelerating-training-float8-rowwise-crusoe/fg7.png new file mode 100644 index 000000000000..35695c3de6d0 Binary files /dev/null and b/assets/images/accelerating-training-float8-rowwise-crusoe/fg7.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg1.png b/assets/images/activation-checkpointing-techniques/fg1.png new file mode 100644 index 000000000000..e4805cb40ea6 Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg1.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg10.png b/assets/images/activation-checkpointing-techniques/fg10.png new file mode 100644 index 000000000000..91bd1c909173 Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg10.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg11.png b/assets/images/activation-checkpointing-techniques/fg11.png new file mode 100644 index 000000000000..d4fa91fb677c Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg11.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg12.png b/assets/images/activation-checkpointing-techniques/fg12.png new file mode 100644 index 000000000000..e6c1679433dd Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg12.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg13.png b/assets/images/activation-checkpointing-techniques/fg13.png new file mode 100644 index 000000000000..ea5a5cbe0bf8 Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg13.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg14.png b/assets/images/activation-checkpointing-techniques/fg14.png new file mode 100644 index 000000000000..cc20d543962d Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg14.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg2.png 
b/assets/images/activation-checkpointing-techniques/fg2.png new file mode 100644 index 000000000000..00c20f76c09a Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg2.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg3.png b/assets/images/activation-checkpointing-techniques/fg3.png new file mode 100644 index 000000000000..412639ab92b8 Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg3.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg4.png b/assets/images/activation-checkpointing-techniques/fg4.png new file mode 100644 index 000000000000..5b4af130db49 Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg4.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg5.png b/assets/images/activation-checkpointing-techniques/fg5.png new file mode 100644 index 000000000000..d4cdc3202836 Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg5.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg6.png b/assets/images/activation-checkpointing-techniques/fg6.png new file mode 100644 index 000000000000..919609dbabce Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg6.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg7.png b/assets/images/activation-checkpointing-techniques/fg7.png new file mode 100644 index 000000000000..bbddbd9bf91a Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg7.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg8.png b/assets/images/activation-checkpointing-techniques/fg8.png new file mode 100644 index 000000000000..42b413e2118f Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg8.png differ diff --git a/assets/images/activation-checkpointing-techniques/fg9.png b/assets/images/activation-checkpointing-techniques/fg9.png new file mode 100644 index 000000000000..a4b748ead8e9 Binary files /dev/null and b/assets/images/activation-checkpointing-techniques/fg9.png differ diff --git a/assets/images/autonomous-language-model-systems.png b/assets/images/autonomous-language-model-systems.png new file mode 100644 index 000000000000..06b75fe2c6be Binary files /dev/null and b/assets/images/autonomous-language-model-systems.png differ diff --git a/assets/images/community-events-recap/fg1.jpg b/assets/images/community-events-recap/fg1.jpg new file mode 100644 index 000000000000..629c39a353c4 Binary files /dev/null and b/assets/images/community-events-recap/fg1.jpg differ diff --git a/assets/images/community-events-recap/fg2.jpeg b/assets/images/community-events-recap/fg2.jpeg new file mode 100644 index 000000000000..abba3e473b5f Binary files /dev/null and b/assets/images/community-events-recap/fg2.jpeg differ diff --git a/assets/images/community-events-recap/fg3.png b/assets/images/community-events-recap/fg3.png new file mode 100644 index 000000000000..7b3c8b8c9120 Binary files /dev/null and b/assets/images/community-events-recap/fg3.png differ diff --git a/assets/images/community-events-recap/fg4.jpg b/assets/images/community-events-recap/fg4.jpg new file mode 100644 index 000000000000..b99760295a51 Binary files /dev/null and b/assets/images/community-events-recap/fg4.jpg differ diff --git a/assets/images/community-events-recap/fg5.jpg b/assets/images/community-events-recap/fg5.jpg new file mode 100644 index 000000000000..ab9a95e1d16c Binary 
files /dev/null and b/assets/images/community-events-recap/fg5.jpg differ diff --git a/assets/images/community-events-recap/fg6.jpeg b/assets/images/community-events-recap/fg6.jpeg new file mode 100644 index 000000000000..c6d38c3a5eba Binary files /dev/null and b/assets/images/community-events-recap/fg6.jpeg differ diff --git a/assets/images/community-events-recap/fg7.jpeg b/assets/images/community-events-recap/fg7.jpeg new file mode 100644 index 000000000000..2a52e451e662 Binary files /dev/null and b/assets/images/community-events-recap/fg7.jpeg differ diff --git a/assets/images/community-events-recap/fg8.png b/assets/images/community-events-recap/fg8.png new file mode 100644 index 000000000000..edbd4fb2cdc4 Binary files /dev/null and b/assets/images/community-events-recap/fg8.png differ diff --git a/assets/images/community-events-recap/fg9.png b/assets/images/community-events-recap/fg9.png new file mode 100644 index 000000000000..3acee7a2bea5 Binary files /dev/null and b/assets/images/community-events-recap/fg9.png differ diff --git a/assets/images/datathon-2025.png b/assets/images/datathon-2025.png new file mode 100644 index 000000000000..f2539d3a2692 Binary files /dev/null and b/assets/images/datathon-2025.png differ diff --git a/assets/images/docathon-2025.png b/assets/images/docathon-2025.png new file mode 100644 index 000000000000..aad9c70d1f36 Binary files /dev/null and b/assets/images/docathon-2025.png differ diff --git a/assets/images/executorch-chip-logo.svg b/assets/images/executorch-chip-logo.svg new file mode 100644 index 000000000000..11e5ed60956b --- /dev/null +++ b/assets/images/executorch-chip-logo.svg @@ -0,0 +1,205 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/assets/images/flexattention-for-inference/fg1.png b/assets/images/flexattention-for-inference/fg1.png new file mode 100644 index 000000000000..c42a3bf5717f Binary files /dev/null and b/assets/images/flexattention-for-inference/fg1.png differ diff --git a/assets/images/flexattention-for-inference/fg10.png b/assets/images/flexattention-for-inference/fg10.png new file mode 100644 index 000000000000..70d9e441b97c Binary files /dev/null and b/assets/images/flexattention-for-inference/fg10.png differ diff --git a/assets/images/flexattention-for-inference/fg11.png b/assets/images/flexattention-for-inference/fg11.png new file mode 100644 index 000000000000..94697c426b7e Binary files /dev/null and b/assets/images/flexattention-for-inference/fg11.png differ diff --git a/assets/images/flexattention-for-inference/fg2.png b/assets/images/flexattention-for-inference/fg2.png new file mode 100644 index 000000000000..47ae6ab99d26 Binary files /dev/null and b/assets/images/flexattention-for-inference/fg2.png differ diff --git a/assets/images/flexattention-for-inference/fg3.png b/assets/images/flexattention-for-inference/fg3.png new file mode 100644 index 000000000000..06bc61656d47 Binary files /dev/null and b/assets/images/flexattention-for-inference/fg3.png differ diff --git a/assets/images/flexattention-for-inference/fg4.png b/assets/images/flexattention-for-inference/fg4.png new file mode 100644 index 000000000000..b78a15172977 Binary files /dev/null and b/assets/images/flexattention-for-inference/fg4.png differ diff --git a/assets/images/flexattention-for-inference/fg5.png b/assets/images/flexattention-for-inference/fg5.png new file mode 100644 index 000000000000..dbb7081efe98 Binary files /dev/null and 
b/assets/images/flexattention-for-inference/fg5.png differ diff --git a/assets/images/flexattention-for-inference/fg6.png b/assets/images/flexattention-for-inference/fg6.png new file mode 100644 index 000000000000..d2221e66d982 Binary files /dev/null and b/assets/images/flexattention-for-inference/fg6.png differ diff --git a/assets/images/flexattention-for-inference/fg7.png b/assets/images/flexattention-for-inference/fg7.png new file mode 100644 index 000000000000..6ec36ad490c5 Binary files /dev/null and b/assets/images/flexattention-for-inference/fg7.png differ diff --git a/assets/images/flexattention-for-inference/fg8.png b/assets/images/flexattention-for-inference/fg8.png new file mode 100644 index 000000000000..a6c6a5227db8 Binary files /dev/null and b/assets/images/flexattention-for-inference/fg8.png differ diff --git a/assets/images/flexattention-for-inference/fg9.png b/assets/images/flexattention-for-inference/fg9.png new file mode 100644 index 000000000000..8187641ba4b5 Binary files /dev/null and b/assets/images/flexattention-for-inference/fg9.png differ diff --git a/assets/images/governing-board/Dwarak-Rajagopal.jpg b/assets/images/governing-board/Dwarak-Rajagopal.jpg deleted file mode 100644 index dc8d4a070614..000000000000 Binary files a/assets/images/governing-board/Dwarak-Rajagopal.jpg and /dev/null differ diff --git a/assets/images/governing-board/Lysandre-Debut.jpg b/assets/images/governing-board/Lysandre-Debut.jpg index 0fe262d727e0..40f34aea0b5b 100644 Binary files a/assets/images/governing-board/Lysandre-Debut.jpg and b/assets/images/governing-board/Lysandre-Debut.jpg differ diff --git a/assets/images/governing-board/alex-spinelli.jpeg b/assets/images/governing-board/alex-spinelli.jpeg index b9f9b2613a72..b810247d1558 100644 Binary files a/assets/images/governing-board/alex-spinelli.jpeg and b/assets/images/governing-board/alex-spinelli.jpeg differ diff --git a/assets/images/governing-board/andrew-wafaa.jpg b/assets/images/governing-board/andrew-wafaa.jpg new file mode 100644 index 000000000000..17557e768d94 Binary files /dev/null and b/assets/images/governing-board/andrew-wafaa.jpg differ diff --git a/assets/images/governing-board/damien-sereni.jpeg b/assets/images/governing-board/damien-sereni.jpeg index de1a39c64ff2..d208dde122ef 100644 Binary files a/assets/images/governing-board/damien-sereni.jpeg and b/assets/images/governing-board/damien-sereni.jpeg differ diff --git a/assets/images/governing-board/dwarakrajagopal2.jpg b/assets/images/governing-board/dwarakrajagopal2.jpg new file mode 100644 index 000000000000..9036b956605d Binary files /dev/null and b/assets/images/governing-board/dwarakrajagopal2.jpg differ diff --git a/assets/images/governing-board/joe-spisak.jpg b/assets/images/governing-board/joe-spisak.jpg new file mode 100644 index 000000000000..9a96bc3157a7 Binary files /dev/null and b/assets/images/governing-board/joe-spisak.jpg differ diff --git a/assets/images/governing-board/ricardo-aravena.jpg b/assets/images/governing-board/ricardo-aravena.jpg new file mode 100644 index 000000000000..4c76381a73cf Binary files /dev/null and b/assets/images/governing-board/ricardo-aravena.jpg differ diff --git a/assets/images/governing-board/shauheen-zahirazami.jpg b/assets/images/governing-board/shauheen-zahirazami.jpg new file mode 100644 index 000000000000..ffcc5761bd81 Binary files /dev/null and b/assets/images/governing-board/shauheen-zahirazami.jpg differ diff --git a/assets/images/how-ibm-uses-pt-terratorch/fg1.png 
b/assets/images/how-ibm-uses-pt-terratorch/fg1.png new file mode 100644 index 000000000000..140186a272cf Binary files /dev/null and b/assets/images/how-ibm-uses-pt-terratorch/fg1.png differ diff --git a/assets/images/how-ibm-uses-pt-terratorch/fg2.png b/assets/images/how-ibm-uses-pt-terratorch/fg2.png new file mode 100644 index 000000000000..7a37b893773d Binary files /dev/null and b/assets/images/how-ibm-uses-pt-terratorch/fg2.png differ diff --git a/assets/images/how-ibm-uses-pt-terratorch/fg3.png b/assets/images/how-ibm-uses-pt-terratorch/fg3.png new file mode 100644 index 000000000000..bcbe77ea9eca Binary files /dev/null and b/assets/images/how-ibm-uses-pt-terratorch/fg3.png differ diff --git a/assets/images/how-ibm-uses-pt-terratorch/fg4.png b/assets/images/how-ibm-uses-pt-terratorch/fg4.png new file mode 100644 index 000000000000..798947a41f20 Binary files /dev/null and b/assets/images/how-ibm-uses-pt-terratorch/fg4.png differ diff --git a/assets/images/how-ibm-uses-pt-terratorch/fg5.png b/assets/images/how-ibm-uses-pt-terratorch/fg5.png new file mode 100644 index 000000000000..a8306bf3ed84 Binary files /dev/null and b/assets/images/how-ibm-uses-pt-terratorch/fg5.png differ diff --git a/assets/images/intel-case-study/fg1.png b/assets/images/intel-case-study/fg1.png new file mode 100644 index 000000000000..b2a94fae07fc Binary files /dev/null and b/assets/images/intel-case-study/fg1.png differ diff --git a/assets/images/intel-case-study/fg2.png b/assets/images/intel-case-study/fg2.png new file mode 100644 index 000000000000..26ba47e31e71 Binary files /dev/null and b/assets/images/intel-case-study/fg2.png differ diff --git a/assets/images/landscape.jpg b/assets/images/landscape.jpg new file mode 100644 index 000000000000..b9702fdb895f Binary files /dev/null and b/assets/images/landscape.jpg differ diff --git a/assets/images/members/common-crawl-logo.svg b/assets/images/members/common-crawl-logo.svg new file mode 100644 index 000000000000..2a9efcd9ef62 --- /dev/null +++ b/assets/images/members/common-crawl-logo.svg @@ -0,0 +1,52 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/assets/images/members/dodlf-logo.jpg b/assets/images/members/dodlf-logo.jpg new file mode 100644 index 000000000000..c153de2adb11 Binary files /dev/null and b/assets/images/members/dodlf-logo.jpg differ diff --git a/assets/images/members/iabfu-logo.svg b/assets/images/members/iabfu-logo.svg new file mode 100644 index 000000000000..ac630fa9079e --- /dev/null +++ b/assets/images/members/iabfu-logo.svg @@ -0,0 +1,265 @@ + + + + +LOGO-01 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/assets/images/members/iit-logo.png b/assets/images/members/iit-logo.png new file mode 100644 index 000000000000..1cdf841f2aa2 Binary files /dev/null and b/assets/images/members/iit-logo.png differ diff --git a/assets/images/members/rensselaer-logo.png b/assets/images/members/rensselaer-logo.png new file mode 100644 index 000000000000..cc30e72e6df4 Binary files /dev/null and b/assets/images/members/rensselaer-logo.png differ diff --git a/assets/images/multi-modal-dl-frame.png b/assets/images/multi-modal-dl-frame.png new file mode 100644 index 000000000000..9bb6b68c60c6 Binary files /dev/null and b/assets/images/multi-modal-dl-frame.png differ diff --git a/assets/images/openreg.png b/assets/images/openreg.png new file mode 100644 index 
000000000000..71fab0973309 Binary files /dev/null and b/assets/images/openreg.png differ diff --git a/assets/images/optimize-llms.png b/assets/images/optimize-llms.png new file mode 100644 index 000000000000..ba6e73cf4899 Binary files /dev/null and b/assets/images/optimize-llms.png differ diff --git a/assets/images/peak-performance-minimized-memory/fg1.png b/assets/images/peak-performance-minimized-memory/fg1.png new file mode 100644 index 000000000000..175eadfbe04d Binary files /dev/null and b/assets/images/peak-performance-minimized-memory/fg1.png differ diff --git a/assets/images/peak-performance-minimized-memory/fg2.png b/assets/images/peak-performance-minimized-memory/fg2.png new file mode 100644 index 000000000000..365dfa313c7d Binary files /dev/null and b/assets/images/peak-performance-minimized-memory/fg2.png differ diff --git a/assets/images/peak-performance-minimized-memory/fg3.png b/assets/images/peak-performance-minimized-memory/fg3.png new file mode 100644 index 000000000000..6d28237582f5 Binary files /dev/null and b/assets/images/peak-performance-minimized-memory/fg3.png differ diff --git a/assets/images/peak-performance-minimized-memory/fg4.png b/assets/images/peak-performance-minimized-memory/fg4.png new file mode 100644 index 000000000000..3685c1c81f98 Binary files /dev/null and b/assets/images/peak-performance-minimized-memory/fg4.png differ diff --git a/assets/images/pt-26-live-q-a.png b/assets/images/pt-26-live-q-a.png new file mode 100644 index 000000000000..d0b059cf8511 Binary files /dev/null and b/assets/images/pt-26-live-q-a.png differ diff --git a/assets/images/pt-day-cfp.png b/assets/images/pt-day-cfp.png new file mode 100644 index 000000000000..f8f6a849f3ab Binary files /dev/null and b/assets/images/pt-day-cfp.png differ diff --git a/assets/images/pt-day-china-2025-cfp.jpg b/assets/images/pt-day-china-2025-cfp.jpg new file mode 100644 index 000000000000..d42c377175a5 Binary files /dev/null and b/assets/images/pt-day-china-2025-cfp.jpg differ diff --git a/assets/images/pt-dinov2-multi-label-plant-species-classification.png b/assets/images/pt-dinov2-multi-label-plant-species-classification.png new file mode 100644 index 000000000000..c544f914043a Binary files /dev/null and b/assets/images/pt-dinov2-multi-label-plant-species-classification.png differ diff --git a/assets/images/pt-fedora-os-communities/fg1.jpg b/assets/images/pt-fedora-os-communities/fg1.jpg new file mode 100644 index 000000000000..e9c0de7b24ef Binary files /dev/null and b/assets/images/pt-fedora-os-communities/fg1.jpg differ diff --git a/assets/images/pt-fedora-os-communities/fg2.jpg b/assets/images/pt-fedora-os-communities/fg2.jpg new file mode 100644 index 000000000000..1aa340f71de9 Binary files /dev/null and b/assets/images/pt-fedora-os-communities/fg2.jpg differ diff --git a/assets/images/pt-fedora-os-communities/fg3.jpg b/assets/images/pt-fedora-os-communities/fg3.jpg new file mode 100644 index 000000000000..11ff09aaff08 Binary files /dev/null and b/assets/images/pt-fedora-os-communities/fg3.jpg differ diff --git a/assets/images/pt-fedora-os-communities/fg4.jpg b/assets/images/pt-fedora-os-communities/fg4.jpg new file mode 100644 index 000000000000..008d80e99dd4 Binary files /dev/null and b/assets/images/pt-fedora-os-communities/fg4.jpg differ diff --git a/assets/images/pt-fedora-os-communities/fg5.jpg b/assets/images/pt-fedora-os-communities/fg5.jpg new file mode 100644 index 000000000000..8761774d551b Binary files /dev/null and b/assets/images/pt-fedora-os-communities/fg5.jpg differ diff 
--git a/assets/images/pt-fedora-os-communities/fg6.jpg b/assets/images/pt-fedora-os-communities/fg6.jpg new file mode 100644 index 000000000000..9d06bd98d994 Binary files /dev/null and b/assets/images/pt-fedora-os-communities/fg6.jpg differ diff --git a/assets/images/pt-korea-user-group-recap/fg1.jpg b/assets/images/pt-korea-user-group-recap/fg1.jpg new file mode 100644 index 000000000000..dbd408d6baf8 Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg1.jpg differ diff --git a/assets/images/pt-korea-user-group-recap/fg1.png b/assets/images/pt-korea-user-group-recap/fg1.png new file mode 100644 index 000000000000..02caf444eb0b Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg1.png differ diff --git a/assets/images/pt-korea-user-group-recap/fg2.jpg b/assets/images/pt-korea-user-group-recap/fg2.jpg new file mode 100644 index 000000000000..924780bb6a3c Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg2.jpg differ diff --git a/assets/images/pt-korea-user-group-recap/fg2.png b/assets/images/pt-korea-user-group-recap/fg2.png new file mode 100644 index 000000000000..3153c75f603c Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg2.png differ diff --git a/assets/images/pt-korea-user-group-recap/fg3.jpg b/assets/images/pt-korea-user-group-recap/fg3.jpg new file mode 100644 index 000000000000..b49e18e82f9d Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg3.jpg differ diff --git a/assets/images/pt-korea-user-group-recap/fg3.png b/assets/images/pt-korea-user-group-recap/fg3.png new file mode 100644 index 000000000000..ed9e8ccadb3e Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg3.png differ diff --git a/assets/images/pt-korea-user-group-recap/fg4.jpg b/assets/images/pt-korea-user-group-recap/fg4.jpg new file mode 100644 index 000000000000..b89e21c71d78 Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg4.jpg differ diff --git a/assets/images/pt-korea-user-group-recap/fg4.png b/assets/images/pt-korea-user-group-recap/fg4.png new file mode 100644 index 000000000000..4831610f703d Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg4.png differ diff --git a/assets/images/pt-korea-user-group-recap/fg5.jpg b/assets/images/pt-korea-user-group-recap/fg5.jpg new file mode 100644 index 000000000000..5e41bbd65b32 Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg5.jpg differ diff --git a/assets/images/pt-korea-user-group-recap/fg5.png b/assets/images/pt-korea-user-group-recap/fg5.png new file mode 100644 index 000000000000..02e5b01ac852 Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg5.png differ diff --git a/assets/images/pt-korea-user-group-recap/fg6.jpg b/assets/images/pt-korea-user-group-recap/fg6.jpg new file mode 100644 index 000000000000..ce7c789c2385 Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg6.jpg differ diff --git a/assets/images/pt-korea-user-group-recap/fg6.png b/assets/images/pt-korea-user-group-recap/fg6.png new file mode 100644 index 000000000000..5e5650ab16cf Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg6.png differ diff --git a/assets/images/pt-korea-user-group-recap/fg7.jpg b/assets/images/pt-korea-user-group-recap/fg7.jpg new file mode 100644 index 000000000000..e0b7f94f2a3b Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg7.jpg differ diff --git a/assets/images/pt-korea-user-group-recap/fg7.png 
b/assets/images/pt-korea-user-group-recap/fg7.png new file mode 100644 index 000000000000..3371ea3c6073 Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg7.png differ diff --git a/assets/images/pt-korea-user-group-recap/fg8.jpg b/assets/images/pt-korea-user-group-recap/fg8.jpg new file mode 100644 index 000000000000..8d57c2a30bc4 Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg8.jpg differ diff --git a/assets/images/pt-korea-user-group-recap/fg8.png b/assets/images/pt-korea-user-group-recap/fg8.png new file mode 100644 index 000000000000..4ba3b87290b3 Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg8.png differ diff --git a/assets/images/pt-korea-user-group-recap/fg9.jpg b/assets/images/pt-korea-user-group-recap/fg9.jpg new file mode 100644 index 000000000000..e631c2a29545 Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg9.jpg differ diff --git a/assets/images/pt-korea-user-group-recap/fg9.png b/assets/images/pt-korea-user-group-recap/fg9.png new file mode 100644 index 000000000000..02d9224b8c7d Binary files /dev/null and b/assets/images/pt-korea-user-group-recap/fg9.png differ diff --git a/assets/images/pt27qa.png b/assets/images/pt27qa.png new file mode 100644 index 000000000000..dbc60c8fcd0e Binary files /dev/null and b/assets/images/pt27qa.png differ diff --git a/assets/images/pytorch-2-7-intel-gpus/fg1.png b/assets/images/pytorch-2-7-intel-gpus/fg1.png new file mode 100644 index 000000000000..a0b4ee57da90 Binary files /dev/null and b/assets/images/pytorch-2-7-intel-gpus/fg1.png differ diff --git a/assets/images/pytorch-2-7-intel-gpus/fg2.png b/assets/images/pytorch-2-7-intel-gpus/fg2.png new file mode 100644 index 000000000000..cb39643891c1 Binary files /dev/null and b/assets/images/pytorch-2-7-intel-gpus/fg2.png differ diff --git a/assets/images/pytorch-at-gtc.jpg b/assets/images/pytorch-at-gtc.jpg new file mode 100644 index 000000000000..7380cfb0125a Binary files /dev/null and b/assets/images/pytorch-at-gtc.jpg differ diff --git a/assets/images/scaling-recommendation-2d-sparse-parallelism/fg1.png b/assets/images/scaling-recommendation-2d-sparse-parallelism/fg1.png new file mode 100644 index 000000000000..08674e9efda3 Binary files /dev/null and b/assets/images/scaling-recommendation-2d-sparse-parallelism/fg1.png differ diff --git a/assets/images/scaling-recommendation-2d-sparse-parallelism/fg2.png b/assets/images/scaling-recommendation-2d-sparse-parallelism/fg2.png new file mode 100644 index 000000000000..45b60ca30c15 Binary files /dev/null and b/assets/images/scaling-recommendation-2d-sparse-parallelism/fg2.png differ diff --git a/assets/images/sglang-join-pytorch/fg1.png b/assets/images/sglang-join-pytorch/fg1.png new file mode 100644 index 000000000000..a7838c59ac6e Binary files /dev/null and b/assets/images/sglang-join-pytorch/fg1.png differ diff --git a/assets/images/sglang-join-pytorch/fg2.png b/assets/images/sglang-join-pytorch/fg2.png new file mode 100644 index 000000000000..5e7e3b1d1f0a Binary files /dev/null and b/assets/images/sglang-join-pytorch/fg2.png differ diff --git a/assets/images/submit-to-speak/fg1.png b/assets/images/submit-to-speak/fg1.png new file mode 100644 index 000000000000..37ce1065eddc Binary files /dev/null and b/assets/images/submit-to-speak/fg1.png differ diff --git a/assets/images/submit-to-speak/fg2.jpg b/assets/images/submit-to-speak/fg2.jpg new file mode 100644 index 000000000000..b8e0a6f2cbfc Binary files /dev/null and b/assets/images/submit-to-speak/fg2.jpg 
differ diff --git a/assets/images/submit-to-speak/fg3.jpg b/assets/images/submit-to-speak/fg3.jpg new file mode 100644 index 000000000000..7beba8019952 Binary files /dev/null and b/assets/images/submit-to-speak/fg3.jpg differ diff --git a/assets/images/unlocking-pt-2-6-intel.png b/assets/images/unlocking-pt-2-6-intel.png new file mode 100644 index 000000000000..94d372662a2c Binary files /dev/null and b/assets/images/unlocking-pt-2-6-intel.png differ diff --git a/assets/images/warp-specialization/fg1.jpg b/assets/images/warp-specialization/fg1.jpg new file mode 100644 index 000000000000..cea07c126baa Binary files /dev/null and b/assets/images/warp-specialization/fg1.jpg differ diff --git a/assets/images/warp-specialization/fg2.jpg b/assets/images/warp-specialization/fg2.jpg new file mode 100644 index 000000000000..f5121c1d0469 Binary files /dev/null and b/assets/images/warp-specialization/fg2.jpg differ diff --git a/assets/images/warp-specialization/fg3.png b/assets/images/warp-specialization/fg3.png new file mode 100644 index 000000000000..3f190a06a40c Binary files /dev/null and b/assets/images/warp-specialization/fg3.png differ diff --git a/assets/images/warp-specialization/fg4.png b/assets/images/warp-specialization/fg4.png new file mode 100644 index 000000000000..183441f59f88 Binary files /dev/null and b/assets/images/warp-specialization/fg4.png differ diff --git a/assets/pytorch-foundation-charter-04052023.pdf b/assets/pytorch-foundation-charter-04052023.pdf deleted file mode 100644 index 3a2d110c15bd..000000000000 Binary files a/assets/pytorch-foundation-charter-04052023.pdf and /dev/null differ diff --git a/assets/pytorch-foundation-charter.pdf b/assets/pytorch-foundation-charter.pdf new file mode 100644 index 000000000000..7dac6f5ac972 Binary files /dev/null and b/assets/pytorch-foundation-charter.pdf differ diff --git a/assets/pytorch-frame-expert-exchange.pdf b/assets/pytorch-frame-expert-exchange.pdf new file mode 100644 index 000000000000..6930f03c3ccb Binary files /dev/null and b/assets/pytorch-frame-expert-exchange.pdf differ diff --git a/assets/quick-start-module.js b/assets/quick-start-module.js index 78605fd2786a..33580d87a930 100644 --- a/assets/quick-start-module.js +++ b/assets/quick-start-module.js @@ -11,8 +11,8 @@ var archInfoMap = new Map([ ['accnone', {title: "CPU", platforms: new Set(['linux', 'macos', 'windows'])}] ]); -let version_map={"nightly": {"accnone": ["cpu", ""], "cuda.x": ["cuda", "11.8"], "cuda.y": ["cuda", "12.4"], "cuda.z": ["cuda", "12.6"], "rocm5.x": ["rocm", "6.3"]}, "release": {"accnone": ["cpu", ""], "cuda.x": ["cuda", "11.8"], "cuda.y": ["cuda", "12.1"], "cuda.z": ["cuda", "12.4"], "rocm5.x": ["rocm", "6.2"]}} -let stable_version="Stable (2.5.1)"; +let version_map={"nightly": {"accnone": ["cpu", ""], "cuda.x": ["cuda", "11.8"], "cuda.y": ["cuda", "12.6"], "cuda.z": ["cuda", "12.8"], "rocm5.x": ["rocm", "6.4"]}, "release": {"accnone": ["cpu", ""], "cuda.x": ["cuda", "11.8"], "cuda.y": ["cuda", "12.6"], "cuda.z": ["cuda", "12.8"], "rocm5.x": ["rocm", "6.3"]}} +let stable_version="Stable (2.7.0)"; var default_selected_os = getAnchorSelectedOS() || getDefaultSelectedOS(); var opts = { @@ -267,7 +267,7 @@ $("[data-toggle='cloud-dropdown']").on("click", function(e) { }); function commandMessage(key) { - var object = {"preview,pip,linux,accnone,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,pip,linux,cuda.x,python": "pip3 install --pre torch torchvision 
torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "preview,pip,linux,cuda.y,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124", "preview,pip,linux,cuda.z,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "preview,pip,linux,rocm5.x,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3", "preview,conda,linux,cuda.x,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,conda,linux,cuda.y,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,conda,linux,cuda.z,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,conda,linux,rocm5.x,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,conda,linux,accnone,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,libtorch,linux,accnone,cplusplus": "Download here (Pre-cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip
    Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-cxx11-abi-shared-with-deps-latest.zip", "preview,libtorch,linux,cuda.x,cplusplus": "Download here (Pre-cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/cu118/libtorch-shared-with-deps-latest.zip
    Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/cu118/libtorch-cxx11-abi-shared-with-deps-latest.zip", "preview,libtorch,linux,cuda.y,cplusplus": "Download here (Pre-cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/cu124/libtorch-shared-with-deps-latest.zip
    Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/cu124/libtorch-cxx11-abi-shared-with-deps-latest.zip", "preview,libtorch,linux,cuda.z,cplusplus": "Download here (Pre-cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/cu126/libtorch-shared-with-deps-latest.zip
    Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/cu126/libtorch-cxx11-abi-shared-with-deps-latest.zip", "preview,libtorch,linux,rocm5.x,cplusplus": "Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/rocm6.3/libtorch-cxx11-abi-shared-with-deps-latest.zip", "preview,pip,macos,cuda.x,python": "# CUDA is not available on MacOS, please use default package
    pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,pip,macos,cuda.y,python": "# CUDA is not available on MacOS, please use default package
    pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,pip,macos,cuda.z,python": "# CUDA is not available on MacOS, please use default package
    pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,pip,macos,rocm5.x,python": "# ROCm is not available on MacOS, please use default package
    pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,pip,macos,accnone,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,conda,macos,cuda.x,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,conda,macos,cuda.y,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,conda,macos,cuda.z,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,conda,macos,rocm5.x,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,conda,macos,accnone,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,libtorch,macos,accnone,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-macos-arm64-latest.zip", "preview,libtorch,macos,cuda.x,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-macos-arm64-latest.zip", "preview,libtorch,macos,cuda.y,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-macos-arm64-latest.zip", "preview,libtorch,macos,cuda.z,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-macos-arm64-latest.zip", "preview,libtorch,macos,rocm5.x,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-macos-arm64-latest.zip", "preview,pip,windows,accnone,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,pip,windows,cuda.x,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "preview,pip,windows,cuda.y,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124", "preview,pip,windows,cuda.z,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "preview,pip,windows,rocm5.x,python": "NOTE: ROCm is not available on Windows", "preview,conda,windows,cuda.x,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,conda,windows,cuda.y,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,conda,windows,cuda.z,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,conda,windows,rocm5.x,python": "NOTE: ROCm is not available on Windows", "preview,conda,windows,accnone,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "preview,libtorch,windows,accnone,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-win-shared-with-deps-latest.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-win-shared-with-deps-debug-latest.zip", "preview,libtorch,windows,cuda.x,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/nightly/cu118/libtorch-win-shared-with-deps-latest.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/nightly/cu118/libtorch-win-shared-with-deps-debug-latest.zip", "preview,libtorch,windows,cuda.y,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/nightly/cu124/libtorch-win-shared-with-deps-latest.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/nightly/cu124/libtorch-win-shared-with-deps-debug-latest.zip", "preview,libtorch,windows,cuda.z,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/nightly/cu126/libtorch-win-shared-with-deps-latest.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/nightly/cu126/libtorch-win-shared-with-deps-debug-latest.zip", "preview,libtorch,windows,rocm5.x,cplusplus": "NOTE: ROCm is not available on Windows", "stable,pip,linux,accnone,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu", "stable,pip,linux,cuda.x,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", "stable,pip,linux,cuda.y,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121", "stable,pip,linux,cuda.z,python": "pip3 install torch torchvision torchaudio", "stable,pip,linux,rocm5.x,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2", "stable,conda,linux,cuda.x,python": "conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia", "stable,conda,linux,cuda.y,python": "conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia", "stable,conda,linux,cuda.z,python": "conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia", "stable,conda,linux,rocm5.x,python": "NOTE: Conda packages are not currently available for ROCm, please use pip instead
    ", "stable,conda,linux,accnone,python": "conda install pytorch torchvision torchaudio cpuonly -c pytorch", "stable,libtorch,linux,accnone,cplusplus": "Download here (Pre-cxx11 ABI):
    https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-2.5.1%2Bcpu.zip
    Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Bcpu.zip", "stable,libtorch,linux,cuda.x,cplusplus": "Download here (Pre-cxx11 ABI):
    https://download.pytorch.org/libtorch/cu118/libtorch-shared-with-deps-2.5.1%2Bcu118.zip
    Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Bcu118.zip", "stable,libtorch,linux,cuda.y,cplusplus": "Download here (Pre-cxx11 ABI):
    https://download.pytorch.org/libtorch/cu121/libtorch-shared-with-deps-2.5.1%2Bcu121.zip
    Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/cu121/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Bcu121.zip", "stable,libtorch,linux,cuda.z,cplusplus": "Download here (Pre-cxx11 ABI):
    https://download.pytorch.org/libtorch/cu124/libtorch-shared-with-deps-2.5.1%2Bcu124.zip
    Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/cu124/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Bcu124.zip", "stable,libtorch,linux,rocm5.x,cplusplus": "Download here (Pre-cxx11 ABI):
    https://download.pytorch.org/libtorch/rocm6.2/libtorch-shared-with-deps-2.5.1%2Brocm6.2.zip
    Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/rocm6.2/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Brocm6.2.zip", "stable,pip,macos,cuda.x,python": "# CUDA is not available on MacOS, please use default package
    pip3 install torch torchvision torchaudio", "stable,pip,macos,cuda.y,python": "# CUDA is not available on MacOS, please use default package
    pip3 install torch torchvision torchaudio", "stable,pip,macos,cuda.z,python": "# CUDA is not available on MacOS, please use default package
    pip3 install torch torchvision torchaudio", "stable,pip,macos,rocm5.x,python": "# ROCm is not available on MacOS, please use default package
    pip3 install torch torchvision torchaudio", "stable,pip,macos,accnone,python": "pip3 install torch torchvision torchaudio", "stable,conda,macos,cuda.x,python": "# CUDA is not available on MacOS, please use default package
    conda install pytorch::pytorch torchvision torchaudio -c pytorch", "stable,conda,macos,cuda.y,python": "# CUDA is not available on MacOS, please use default package
    conda install pytorch::pytorch torchvision torchaudio -c pytorch", "stable,conda,macos,cuda.z,python": "# CUDA is not available on MacOS, please use default package
    conda install pytorch::pytorch torchvision torchaudio -c pytorch", "stable,conda,macos,rocm5.x,python": "# ROCm is not available on MacOS, please use default package
    conda install pytorch::pytorch torchvision torchaudio -c pytorch", "stable,conda,macos,accnone,python": "conda install pytorch::pytorch torchvision torchaudio -c pytorch", "stable,libtorch,macos,accnone,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.5.1.zip", "stable,libtorch,macos,cuda.x,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.5.1.zip", "stable,libtorch,macos,cuda.y,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.5.1.zip", "stable,libtorch,macos,cuda.z,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.5.1.zip", "stable,libtorch,macos,rocm5.x,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.5.1.zip", "stable,pip,windows,accnone,python": "pip3 install torch torchvision torchaudio", "stable,pip,windows,cuda.x,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", "stable,pip,windows,cuda.y,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121", "stable,pip,windows,cuda.z,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124", "stable,pip,windows,rocm5.x,python": "NOTE: ROCm is not available on Windows", "stable,conda,windows,cuda.x,python": "conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia", "stable,conda,windows,cuda.y,python": "conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia", "stable,conda,windows,cuda.z,python": "conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia", "stable,conda,windows,rocm5.x,python": "NOTE: ROCm is not available on Windows", "stable,conda,windows,accnone,python": "conda install pytorch torchvision torchaudio cpuonly -c pytorch", "stable,libtorch,windows,accnone,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-2.5.1%2Bcpu.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-debug-2.5.1%2Bcpu.zip", "stable,libtorch,windows,cuda.x,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-2.5.1%2Bcu118.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-debug-2.5.1%2Bcu118.zip", "stable,libtorch,windows,cuda.y,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/cu121/libtorch-win-shared-with-deps-2.5.1%2Bcu121.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/cu121/libtorch-win-shared-with-deps-debug-2.5.1%2Bcu121.zip", "stable,libtorch,windows,cuda.z,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/cu124/libtorch-win-shared-with-deps-2.5.1%2Bcu124.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/cu124/libtorch-win-shared-with-deps-debug-2.5.1%2Bcu124.zip", "stable,libtorch,windows,rocm5.x,cplusplus": "NOTE: ROCm is not available on Windows"}; + var object = {"preview,pip,linux,accnone,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,pip,linux,cuda.x,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "preview,pip,linux,cuda.y,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "preview,pip,linux,cuda.z,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", "preview,pip,linux,rocm5.x,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4", "preview,libtorch,linux,accnone,cplusplus": "Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-cxx11-abi-shared-with-deps-latest.zip", "preview,libtorch,linux,cuda.x,cplusplus": "Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/cu118/libtorch-cxx11-abi-shared-with-deps-latest.zip", "preview,libtorch,linux,cuda.y,cplusplus": "Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/cu126/libtorch-cxx11-abi-shared-with-deps-latest.zip", "preview,libtorch,linux,cuda.z,cplusplus": "Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/cu128/libtorch-cxx11-abi-shared-with-deps-latest.zip", "preview,libtorch,linux,rocm5.x,cplusplus": "Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/nightly/rocm6.4/libtorch-cxx11-abi-shared-with-deps-latest.zip", "preview,pip,macos,cuda.x,python": "# CUDA is not available on MacOS, please use default package
    pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,pip,macos,cuda.y,python": "# CUDA is not available on MacOS, please use default package
    pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,pip,macos,cuda.z,python": "# CUDA is not available on MacOS, please use default package
    pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,pip,macos,rocm5.x,python": "# ROCm is not available on MacOS, please use default package
    pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,pip,macos,accnone,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,libtorch,macos,accnone,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-macos-arm64-latest.zip", "preview,libtorch,macos,cuda.x,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-macos-arm64-latest.zip", "preview,libtorch,macos,cuda.y,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-macos-arm64-latest.zip", "preview,libtorch,macos,cuda.z,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-macos-arm64-latest.zip", "preview,libtorch,macos,rocm5.x,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-macos-arm64-latest.zip", "preview,pip,windows,accnone,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "preview,pip,windows,cuda.x,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "preview,pip,windows,cuda.y,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "preview,pip,windows,cuda.z,python": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", "preview,pip,windows,rocm5.x,python": "NOTE: ROCm is not available on Windows", "preview,libtorch,windows,accnone,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-win-shared-with-deps-latest.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/nightly/cpu/libtorch-win-shared-with-deps-debug-latest.zip", "preview,libtorch,windows,cuda.x,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/nightly/cu118/libtorch-win-shared-with-deps-latest.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/nightly/cu118/libtorch-win-shared-with-deps-debug-latest.zip", "preview,libtorch,windows,cuda.y,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/nightly/cu126/libtorch-win-shared-with-deps-latest.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/nightly/cu126/libtorch-win-shared-with-deps-debug-latest.zip", "preview,libtorch,windows,cuda.z,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/nightly/cu128/libtorch-win-shared-with-deps-latest.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/nightly/cu128/libtorch-win-shared-with-deps-debug-latest.zip", "preview,libtorch,windows,rocm5.x,cplusplus": "NOTE: ROCm is not available on Windows", "stable,pip,linux,accnone,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu", "stable,pip,linux,cuda.x,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", "stable,pip,linux,cuda.y,python": "pip3 install torch torchvision torchaudio", "stable,pip,linux,cuda.z,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", "stable,pip,linux,rocm5.x,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3", "stable,libtorch,linux,accnone,cplusplus": "Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcpu.zip", "stable,libtorch,linux,cuda.x,cplusplus": "Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcu118.zip", "stable,libtorch,linux,cuda.y,cplusplus": "Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/cu126/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcu126.zip", "stable,libtorch,linux,cuda.z,cplusplus": "Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/cu128/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcu128.zip", "stable,libtorch,linux,rocm5.x,cplusplus": "Download here (cxx11 ABI):
    https://download.pytorch.org/libtorch/rocm6.3/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Brocm6.3.zip", "stable,pip,macos,cuda.x,python": "# CUDA is not available on MacOS, please use default package
    pip3 install torch torchvision torchaudio", "stable,pip,macos,cuda.y,python": "# CUDA is not available on MacOS, please use default package
    pip3 install torch torchvision torchaudio", "stable,pip,macos,cuda.z,python": "# CUDA is not available on MacOS, please use default package
    pip3 install torch torchvision torchaudio", "stable,pip,macos,rocm5.x,python": "# ROCm is not available on MacOS, please use default package
    pip3 install torch torchvision torchaudio", "stable,pip,macos,accnone,python": "pip3 install torch torchvision torchaudio", "stable,conda,macos,cuda.x,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "stable,conda,macos,cuda.y,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "stable,conda,macos,cuda.z,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "stable,conda,macos,rocm5.x,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "stable,conda,macos,accnone,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "stable,libtorch,macos,accnone,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.7.0.zip", "stable,libtorch,macos,cuda.x,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.7.0.zip", "stable,libtorch,macos,cuda.y,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.7.0.zip", "stable,libtorch,macos,cuda.z,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.7.0.zip", "stable,libtorch,macos,rocm5.x,cplusplus": "Download arm64 libtorch here (ROCm and CUDA are not supported):
    https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.7.0.zip", "stable,pip,windows,accnone,python": "pip3 install torch torchvision torchaudio", "stable,pip,windows,cuda.x,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", "stable,pip,windows,cuda.y,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126", "stable,pip,windows,cuda.z,python": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", "stable,pip,windows,rocm5.x,python": "NOTE: ROCm is not available on Windows", "stable,conda,windows,cuda.x,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "stable,conda,windows,cuda.y,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "stable,conda,windows,cuda.z,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "stable,conda,windows,rocm5.x,python": "NOTE: ROCm is not available on Windows", "stable,conda,windows,accnone,python": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", "stable,libtorch,windows,accnone,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-2.7.0%2Bcpu.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-debug-2.7.0%2Bcpu.zip", "stable,libtorch,windows,cuda.x,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-2.7.0%2Bcu118.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-debug-2.7.0%2Bcu118.zip", "stable,libtorch,windows,cuda.y,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/cu126/libtorch-win-shared-with-deps-2.7.0%2Bcu126.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/cu126/libtorch-win-shared-with-deps-debug-2.7.0%2Bcu126.zip", "stable,libtorch,windows,cuda.z,cplusplus": "Download here (Release version):
    https://download.pytorch.org/libtorch/cu128/libtorch-win-shared-with-deps-2.7.0%2Bcu128.zip
    Download here (Debug version):
    https://download.pytorch.org/libtorch/cu128/libtorch-win-shared-with-deps-debug-2.7.0%2Bcu128.zip", "stable,libtorch,windows,rocm5.x,cplusplus": "NOTE: ROCm is not available on Windows"}; if (!object.hasOwnProperty(key)) { $("#command").html( diff --git a/autonomous-language-model-systems.html b/autonomous-language-model-systems.html new file mode 100644 index 000000000000..3b065fafb852 --- /dev/null +++ b/autonomous-language-model-systems.html @@ -0,0 +1,46 @@ +--- +layout: default +title: "Towards Autonomous Language Model Systems" +body-class: announcement +background-class: announcement-background +permalink: /autonomous-language-model-systems +--- + +

    +
    +
    +

    PyTorch Webinars

    +
    +
    +
    + +
    +
    +
    +
    + Towards Autonomous Language Model Systems +

    Towards Autonomous Language Model Systems

    +

    + Date: May 21, 2025, 11AM PT / 2PM ET +
    + Speaker: Ofir Press +
    +
+ Language models (LMs) are increasingly used to assist users in day-to-day tasks such as programming (GitHub Copilot) or search (Google's AI Overviews). But can we build language model systems that can autonomously complete entire tasks end-to-end? +

+ +In this talk, Ofir Press will discuss efforts to build autonomous LM systems, focusing on the software engineering domain. Ofir will present SWE-bench, a benchmark for measuring AI systems on their ability to fix real issues in popular software libraries, and will then discuss SWE-agent, a system for solving SWE-bench tasks. +

    + +SWE-bench and SWE-agent are used by many leading AI organizations in academia and industry, including OpenAI, Anthropic, Meta, and Google, and SWE-bench has been downloaded over 2 million times. These projects show that academics on tight budgets can have a substantial impact in steering the research community toward building autonomous systems that can complete challenging tasks. +

+ +Ofir is a postdoc at Princeton University, where they mainly work with Karthik Narasimhan's lab. They completed their PhD at the University of Washington in Seattle, advised by Noah Smith. During their PhD, Ofir spent two years at Facebook AI Research Labs on Luke Zettlemoyer's team. +

    +

    Register now to attend this event

    +
    +

    +
    +
    +
    +
    \ No newline at end of file diff --git a/code-of-conduct.html b/code-of-conduct.html new file mode 100644 index 000000000000..419ba4a38970 --- /dev/null +++ b/code-of-conduct.html @@ -0,0 +1,224 @@ +--- +layout: default +title: PyTorch Foundation Code of Conduct +body-class: announcement +background-class: announcement-background +permalink: /code-of-conduct +--- +{% assign cards = site.board_info %} + +
    +
    +
    +

    PyTorch Foundation
    Code of Conduct

    +
    +
    +
    + +
    +
    +
    +
    + + +

    Our Commitment

    + + +

    + The PyTorch Foundation is committed to fostering an inclusive, welcoming, and safe environment for everyone involved in the PyTorch Foundation community. This commitment extends across all Foundation activities, including but not limited to our technical projects, events, communication channels, and social media presence. We pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation. +

    +

    Scope

    + + +

    + This code of conduct applies to Governing Board meetings, Technical Advisory Council meetings and outreach programs (such as the Ambassador Program) of the PyTorch Foundation and any other activity of the PyTorch Foundation that is not otherwise covered by a code of conduct of either The Linux Foundation or an applicable technical project. +

    +

    PyTorch Foundation Events

    +

    + PyTorch Foundation events that are produced by the Linux Foundation with professional events staff are governed by the Linux Foundation Events Code of Conduct available on the event page, which is designed to be used in conjunction with this PyTorch Foundation Code of Conduct. +

    +

    Technical Projects in the PyTorch Foundation Umbrella

    +

    + Technical projects supported by the PyTorch Foundation are organized as separate projects and each maintains a code of conduct that applies to participants in those projects. +

    +

    Expected Behavior

    + + +

    + Community members are expected to: +

    +
      + +
    • Use welcoming and inclusive language
    • + +
    • Respect differing viewpoints and experiences
    • + +
    • Accept constructive criticism gracefully
    • + +
    • Prioritize what benefits the community as a whole
    • + +
    • Show empathy and kindness toward others
    • + +
    • Be professional and responsible in all interactions
    • + +
    • Follow health and safety requirements at in-person events
    • + +
    • Exercise consideration and respect in speech and actions
    • + +
    • Collaborate with other community members in a constructive manner
    • +
    +

    Unacceptable Behavior

    + + +

    + The following behaviors are considered unacceptable within our community: +

    +

    Harassment and Discrimination

    + + +
      + +
    • Harassment of any kind, whether verbal, physical, or visual
    • + +
    • Discrimination based on protected characteristics
    • + +
    • Sexual harassment or unwelcome sexual attention
    • + +
    • Deliberate intimidation, stalking, or following
    • + +
    • Sustained disruption of talks, events, or online discussions
    • + +
    • Inappropriate physical contact
    • +
    +

    Communication and Content

    + + +
      + +
    • Use of sexualized language or imagery
    • + +
    • Violent or threatening language or imagery
    • + +
    • Trolling, insulting/derogatory comments, or personal attacks
    • + +
    • Public or private harassment
    • + +
    • Publishing others’ private information without permission
    • + +
    • Using Foundation platforms for political campaigning or promotion of political causes that are unrelated to technology
    • + +
    • Other conduct which could reasonably be considered inappropriate in a professional setting
    • +
    +

    Online and Social Media Behavior

    + + +
      + +
    • Harassment or bullying through social media platforms
    • + +
    • Spreading misinformation about the Foundation or its members
    • + +
    • Using Foundation channels for commercial promotion without permission
    • + +
    • Creating multiple accounts to evade moderation
    • + +
    • Impersonating Foundation members or officials
    • +
    +

    Behavior During Investigations

    + + +
      + +
    • Providing knowingly false or misleading information in connection with a Code of Conduct investigation or otherwise intentionally tampering with an investigation.
    • + +
    • Retaliating against a person because they reported an incident or provided information about an incident as a witness.
    • +
    +

    Enforcement

    + + +

    Reporting Violations

    + + +

    + Violations can be reported to conduct@pytorch.org. All reports will be: +

    +
      + +
    • Reviewed promptly and thoroughly
    • + +
    • Treated with strict confidentiality
    • + +
    • Investigated and addressed appropriately
    • + +
    • Documented for future reference
    • +
    +

    Consequences

    + + +

    + Violations may result in: +

    +
      + +
    • Warning to the offending individual
    • + +
    • Temporary or permanent ban from Foundation spaces
    • + +
    • Removal from leadership or contributory roles
    • + +
    • Expulsion from events without refund
    • + +
    • Reporting to appropriate authorities if necessary
    • + +
    • Other consequences
    • +
    +

    Appeals Process

    + + +
      + +
    • Individuals may appeal enforcement decisions
    • + +
    • Appeals must be submitted in writing within 30 days to the PyTorch Foundation via email to conduct@pytorch.org
    • + +
    • Decisions on appeals are final
    • +
    +

    Pre-Event Concerns

    + + +

    + If you have concerns about attending an upcoming event where specific individuals may be present: +

    +
      + +
    • Contact conduct@pytorch.org in advance
    • + +
    • Arrangements can be made for your safety and comfort
    • + +
    • Precautions may include providing security escorts and notifying staff
    • +
    +

    Amendments

    + + +

    + This Code of Conduct may be amended by the PyTorch Foundation as needed. Changes will be communicated to the community, and continued participation in the community indicates agreement to the current version. +

    +

    Questions and Reporting - Contact

    + + +

    + For questions, concerns, or reports: +
    + Email: conduct@pytorch.org +

    +

Acknowledgements

    +

    + This Code of Conduct is adapted from the Contributor Covenant, version 2.0 available here. +

    + + + +
    +
    +
    +
    diff --git a/credits.html b/credits.html index ce48847b9725..d4e3dc24f111 100644 --- a/credits.html +++ b/credits.html @@ -21,9 +21,9 @@

    PyTorch Cloud
           Credit Program<
    -

    We believe providing public, self-service, and automated access to Cloud Infrastructure is crucial for every project to incubate, grow, and succeed. +

    We believe providing public, self-service, and automated access to cloud infrastructure is essential for every project's incubation, growth, and success.

- That’s why PyTorch has created the Cloud Credits program, focussed on the mutual success of projects and participating companies. To date, supporters like AWS have donated cloud credits to showcase direct support of our incubating and graduated projects. In fact, many organizations sponsor PyTorch projects for their business to succeed. + To support this, PyTorch has established a program that enables organizations to contribute either cloud credits or financial donations directly towards maintaining and expanding our Continuous Integration (CI) infrastructure and other foundation-hosted project infrastructure. Organizations like AWS have already contributed cloud credits, demonstrating a clear commitment to the success and sustainability of the PyTorch Foundation's hosted projects. Many organizations continue to sponsor PyTorch projects, recognizing that supporting foundational infrastructure contributes directly to their own business growth and success.

    diff --git a/ecosystem/ecosystem.html b/ecosystem/ecosystem.html index 2fe0e044cdcb..b60f4bc5efd2 100644 --- a/ecosystem/ecosystem.html +++ b/ecosystem/ecosystem.html @@ -4,6 +4,7 @@ permalink: ecosystem/ background-class: ecosystem-background body-class: ecosystem +redirect_to: https://landscape.pytorch.org/ ---
    @@ -14,7 +15,8 @@

    Tap into a rich ecosystem of tools, libraries, and more to support, accelerate, and explore AI development.

    -

    Join the Ecosystem

    +

    Join the Ecosystem

    +
    @@ -64,7 +66,7 @@

    Have a project you want featured?

    -

    Join the PyTorch ecosystem

    +

    Join the PyTorch ecosystem

    diff --git a/ecosystem/join.html b/ecosystem/join.html index 05f246cf4e8c..ce102e84e325 100644 --- a/ecosystem/join.html +++ b/ecosystem/join.html @@ -4,57 +4,5 @@ permalink: ecosystem/join.html body-class: ecosystem background-class: ecosystem-join-background +redirect_to: https://github.com/pytorch-fdn/ecosystem --- - -
    -
    -

    Join The
    Ecosystem

    - -
    -
    - -
    -
    -
    - - - -
    -
    -
    -

    Thank you for being part of the PyTorch community! The PyTorch ecosystem consists of projects, tools, and libraries from a broad set of researchers in academia and industry, application developers, and ML engineers. - The goal of this ecosystem is to support, accelerate, and aid in your exploration with PyTorch. Below are a few questions to think about if you'd like to have your project included as part of the ecosystem:

    -
    - -
    -
    -

    01

    -
    - -
    -

    Can your project be used with or within PyTorch to help augment the user experience, enable new capabilities or speed up training/inference?
    - Examples could include visualization tools, a kernel library or a framework that sits on top to enable research in a particular area such as NLP.

    -
    - -
    - -
    -
    -

    02

    -
    - -
    -

    Is the project ready for broad developer usage?
    - For example, is the project stable, will it be maintained, and is there adequate supporting infrastructure, documentation, and technical support to allow a developer to successfully use it?

    -
    -
    - -

    If you answered 'yes' to both, please fill out the form below with more details, and we'll follow up with next steps. If your tool seems like a good fit, the PyTorch Foundation team will work with you to find a time present your tool to the TAC. -

    -
    - - {% include ecosystem_form.html %} -
    -
    -
    -
    diff --git a/events.html b/events.html index 4c21b04b752f..1f90419bdebe 100644 --- a/events.html +++ b/events.html @@ -23,9 +23,6 @@

    Events

    - @@ -35,7 +32,6 @@

    Events

    {% include live_events.html %} - {% include live_stream.html %} {% include podcast.html %} {% include community_events.html %} diff --git a/executorch.html b/executorch.html index f4b1f932aa85..705b8daad87d 100644 --- a/executorch.html +++ b/executorch.html @@ -22,10 +22,17 @@

    ExecuTorch

    -

    What is ExecuTorch?

    -

    ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch Edge ecosystem and enables efficient deployment of various PyTorch models (vision, speech, Generative AI, and more) to edge devices. Key value propositions of ExecuTorch are:

    -
    +
    +
    +

    What is ExecuTorch?

    +

    ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch Edge ecosystem and enables efficient deployment of various PyTorch models (vision, speech, Generative AI, and more) to edge devices. Key value propositions of ExecuTorch are:

    +
    +
    + ExecuTorch logo +
    +
    +
    Mobile icon @@ -63,6 +70,9 @@

    Explore ExecuTorch

    ExecuTorch on GitHub + + ExecuTorch on Discord +

    Why ExecuTorch?

    diff --git a/governing-board.html b/governing-board.html index 1f68d1eb62c3..588aa3db0d42 100644 --- a/governing-board.html +++ b/governing-board.html @@ -17,7 +17,32 @@

    Governing
           Board

    - +
    +
    + Andrew Wafaa +
    +
    +

    Andrew Wafaa, Arm

    +

Andrew is part of Arm’s Strategy & Ecosystem and Developer platforms leadership team, heading up Arm’s Open Source Office and overseeing Arm’s relationships and engagements with software communities. With over 25 years of deep engagement in open source under a variety of employers, he has held positions across many projects and foundations. He is currently Chair of the Yocto Project, sits on the boards of the PyTorch Foundation, FreeBSD Foundation, and Xen Project, and serves on the TSC/TAC/SC of PyTorch, Zephyr, and the UXL Foundation. He previously served on the boards of the Rust Foundation, OpenUK, and the openSUSE Project. To support his work with these communities, Andrew can often be found at open source events around the world.
    + + + +

    +
    +
    +
    +
    + Ankit Patel +
    +
    +

    Ankit Patel, NVIDIA

    +

    Ankit Patel is a senior director at NVIDIA, leading developer engagement for SDKs, APIs, and tools. He joined NVIDIA in 2011, transitioning from GPU product manager to software product management in virtualization, ray tracing, and AI. Previously, he managed video editing and live production products at Matrox Video and Blackmagic Design. Ankit holds a bachelor's degree in computer science from Concordia University and an MBA from Cornell University.
    + + + +

    +
    +
    Brian Granger @@ -31,82 +56,95 @@

    Brian Granger, Amazon Web Services

    -
    - Niles Burbank + Dwarak Rajagopal
    -

    Niles Burbank, AMD

    -

    Niles Burbank is Director of Product Management at AMD, responsible for a family of data center GPU products. Based in Markham, Canada, he has been involved in the computer hardware industry since 1995 in a range of technical and management roles. Niles shares AMD’s passion for working together with the wider community to deliver the software and hardware tools that will advance AI. He holds an MASc in Electrical Engineering from the University of Toronto.
    - +

    Dwarak Rajagopal, Snowflake

    +

    Dwarak Rajagopal is a distinguished technology leader with over two decades of experience at some of the world’s most innovative companies, including Google, Meta, Uber, Apple, and AMD. As Vice President of AI Engineering at Snowflake, Dwarak oversees the AI and ML engineering organizations, helping drive innovation across Snowflake Cortex AI, Snowflake’s AI Research Team, and open source offerings. He previously served as Senior Engineering Director at Google, where he spearheaded the AI Frameworks and On-Device Machine Learning teams, driving cutting-edge advancements in AI and large language models (LLMs) to enhance efficiency across server and mobile platforms. Earlier in his career, Dwarak led Meta’s PyTorch Core Frameworks team, bridging the gap between AI research and production. His career also includes developing autonomous driving software at Uber, pioneering graphics technology at Apple, and shaping compiler innovations at AMD. +
    +
    + Dwarak is passionate about democratizing AI through open source initiatives and ensuring the ethical advancement of technology. His visionary leadership continues to shape the future of AI and empower global innovation. +
    +

    +
    +
    +
    +
    + Fred Li +
    +
    +

    Fred Li, Huawei

    +

Fred Li, General Manager of Open Source Development at Huawei's Computing Product Line, brings over 20 years of technology experience and has focused on open source since 2015. Serving as the platinum director of the OpenStack Foundation and representing Huawei in the PyTorch Foundation and LF AI & Data, he is a pivotal figure in open source. Fred initiated the openEuler community in 2019 and has nurtured multiple open-source communities with thousands of engineers and millions of users. With expertise in IT product development, project management, and open-source strategy, Fred is committed to global open-source advocacy.
    +

    +
    +
    +
    -
    -
    - Alex Spinelli -
    -
    -

    Alex Spinelli, Arm

    -

    Alex is the Senior Vice President of AI and Developer Technologies at Arm, where he is responsible for helping developers build their best AI experiences and software applications on the most ubiquitous computing platform on the planet, from Cloud to Edge. Prior to Arm, Alex was VP of Product Management for Machine Learning at Google. At Google he was responsible for TensorFlow and the technologies and infrastructures that power Google’s machine learning and AI services. Alex previously led product and engineering for LivePerson, FastCompany’s #1 ranked AI company for 2022, where he built technology to enable enterprises to build their own conversational assistants. He held positions at Amazon, leading the core software platforms for Alexa, and product and engineering for Amazon’s Search. At Thomson Reuters, Alex oversaw product and technology for news and media, including ultra-low latency AI-driven news categorization and synthetic news for financial professionals. Alex started his career as a software engineer in New York City and he has an MBA from Duke University.
    - - - -

    -
    +
    +
    + Kismat Singh
    -
    -
    - Dwarak Rajagopal -
    -
    -

    Dwarak Rajagopal, Google

    -

    Dwarak Rajagopal, a seasoned technology leader, boasts over 20 years of experience with pivotal roles at Google, Meta/Facebook, Uber, Apple, and AMD. Currently a Sr. Engineering Director at Google, he leads AI Frameworks and OnDevice ML teams, driving innovations in AI and LLMs on server/mobile for efficiency. At Meta/Facebook, he led the PyTorch Core Frameworks team, advancing AI research to production. His contributions span self-driving software at Uber, graphics technology at Apple, and compiler development at AMD. Dwarak's dedication to democratizing AI, open-source software, and ethical practices defines his commitment to shaping the future of technology.
    - - - -

    -
    +
    +

    Kismat Singh, Intel

    +

Kismat is VP of Software for the AI Frameworks team at Intel, which focuses on frameworks such as PyTorch, TensorFlow, and other inference/serving engines. Prior to Intel, Kismat worked at NVIDIA for eight years on AI software frameworks and libraries such as TensorRT and deep learning compilers.
    + + + +

    -
    -
    - Fred Li -
    -
    -

    Fred Li, Huawei

    -

    Fred Li, General Manager of Open Source Development at Huawei's Computing Product Line, brings over 20 years of technology and open-source experience since 2015. Serving as the platinum director of OpenStack Foundation and representing Huawei in PyTorch Foundation and LF AI & Data, he's a pivotal figure in open source. Fred initiated the openEuler community in 2019, nurturing multiple open-source communities with thousands of engineers and millions of users. With expertise in IT product development, project management, and open-source strategy, Fred is committed to global open-source advocacy.
    -

    -
    +
    +
    +
    + Luca Antiga
    -
    -
    - Lysandre Debut -
    -
    -

    Lysandre Debut, Hugging Face

    -

    Lysandre is the Head of Open Source at Hugging Face. Lysandre has been at Hugging Face since the company’s pivot to open-source, and was the first engineer to focus entirely on the open-source mission. Now leading the open-source part of the organization, Lysandre remains technically involved by being a core maintainer of the Transformers library.
    - - - -

    -
    +
    +

    Luca Antiga, Lightning AI

    +

    Luca Antiga is the CTO at Lightning AI. He is an early contributor to PyTorch core and co-authored “Deep Learning with PyTorch” (published by Manning). He started his journey as a researcher in Bioengineering, and later co-founded Orobix, a company focused on building and deploying AI in production settings.
    + + + +

    -
    -
    - Kismat Singh -
    -
    -

    Kismat Singh, Intel

    -

    Kismat works as VP S/W of AI Frameworks team at Intel. The team focuses on frameworks like PyTorch, TF and other inference/serving engines. Prior to Intel Kismat worked at Nvidia for 8 focusing on AI S/W frameworks/libs like TensorRT, Deep learning compiler etc.
    - - - -

    -
    +
    +
    +
    + Lysandre Debut +
    +
    +

    Lysandre Debut, Hugging Face

    +

    Lysandre is the Head of Open Source at Hugging Face. Lysandre has been at Hugging Face since the company’s pivot to open-source, and was the first engineer to focus entirely on the open-source mission. Now leading the open-source part of the organization, Lysandre remains technically involved by being a core maintainer of the Transformers library.
    + + + +

    +
    +
    +
    +
    + Niles Burbank +
    +
    +

    Niles Burbank, AMD

    +

    Niles Burbank is Director of Product Management at AMD, responsible for a family of data center GPU products. Based in Markham, Canada, he has been involved in the computer hardware industry since 1995 in a range of technical and management roles. Niles shares AMD’s passion for working together with the wider community to deliver the software and hardware tools that will advance AI. He holds an MASc in Electrical Engineering from the University of Toronto.
    + + + +

    +
    Raghu Ganti @@ -122,31 +160,17 @@

    Raghu Ganti, IBM

    - Luca Antiga -
    -
    -

    Luca Antiga, Lightning AI

    -

    Luca Antiga is the CTO at Lightning AI. He is an early contributor to PyTorch core and co-authored “Deep Learning with PyTorch” (published by Manning). He started his journey as a researcher in Bioengineering, and later co-founded Orobix, a company focused on building and deploying AI in production settings.
    - - - -

    -
    -
    -
    -
    - Damien Sereni + Shauheen Zahirazami
    -

    Damien Sereni, Meta

    -

    Supporting the PyTorch team within Meta. In the past Damien has worked on compilers, static analysis, programming languages, virtual machines and developer tools. Damien is passionate about developer productivity and building strong organizations to support developers.
    - +

    Shauheen Zahirazami, Google

    +

Shauheen Zahirazami has a PhD in control engineering and a BSc in applied mathematics. He currently leads the Cloud TPU Machine Learning teams at Google, which are responsible for ML frameworks and the third-party ecosystem, including the PyTorch teams that develop PyTorch/XLA.
    +

    -
    Steve Wan @@ -161,25 +185,5 @@

    Steve Wan, Microsoft Azure

    -
    -
    - Ankit Patel -
    -
    -

    Ankit Patel, NVIDIA

    -

    Ankit Patel is a senior director at NVIDIA, leading developer engagement for SDKs, APIs, and tools. He joined NVIDIA in 2011, transitioning from GPU product manager to software product management in virtualization, ray tracing, and AI. Previously, he managed video editing and live production products at Matrox Video and Blackmagic Design. Ankit holds a bachelor's degree in computer science from Concordia University and an MBA from Cornell University.
    - - - -

    -
    -
    - - - - - - -
    diff --git a/host-your-project.html b/host-your-project.html new file mode 100644 index 000000000000..9e392fb50f5b --- /dev/null +++ b/host-your-project.html @@ -0,0 +1,37 @@ +--- +layout: default +title: Host Your Project +body-class: announcement +background-class: announcement-background +permalink: /host-your-project +--- +{% assign cards = site.board_info %} + +
    +
    +
    +

    Host Your
           Project

    +
    +
    +
    + +
    +
    +
    +
    +

    Host Your Project with the PyTorch Foundation

    +

    + Have an open source project that advances the PyTorch ecosystem? The PyTorch Foundation offers a neutral, community-driven home for projects that share our commitment to open, accessible deep learning innovation. By hosting your project with us, you’ll gain support from a global network of developers, researchers, and industry leaders—plus access to events, training, and infrastructure that help your project grow and thrive. +
    +
    + Ready to take the next step? Submit your project for consideration and become part of the broader PyTorch ecosystem. +
    +
    + + Host Your Project + +

    +
    +
    +
    +
    diff --git a/index.html b/index.html index 08ef50680c44..3f0e5a4976bf 100644 --- a/index.html +++ b/index.html @@ -1,225 +1,12 @@ ---- -layout: default -background-class: home-page-background -body-class: homepage -display-news-banner: true ---- - -
    -
    -

    PyTorch logo
    Get Started

    - -

    Choose Your Path: Install PyTorch Locally or Launch Instantly on Supported Cloud Platforms

    - - - Get started - -
    -
    - -
    - -
    -
    -
    -

    Blog

    -

    - Stay up-to-date on the latest news and technical topics from the PyTorch Foundation. -

    -

    - Read more -

    -
    -
    -

    PyTorch 2.5

    -

    - Featuring a new CuDNN backend for SDPA, improvements to TorchDynamo, regional compilation of torch.compile, and more. -

    - Learn more -

    -
    -
    -

    Membership Available

    -

    - Become an integral part of the PyTorch Foundation, to build and shape the future of AI. -

    -

    - Join -

    -
    -
    -
    - - -
    - -
    -
    -
    -
    -

    Key Features &
    Capabilities

    - - - See all Features - -
    -
    - -
    - {% assign features = site.features | where: "featured-home", true | sort: 'order' %} - - {% for feature in features %} -
    -
    {{ feature.title }}
    -

    {{ feature.summary-home }}

    -
    - {% endfor %} -
    -
    -
    -
    - - {% include quick_start_module.html %} - -
    -
    -
    -
    -

    Ecosystem

    -
    Feature Projects
    - - - See all Projects - - -
    -
    -

    Explore a rich ecosystem of libraries, tools, and more to support development.

    -
    -
    - -
    - {% assign ecosystems = site.ecosystem | where: "featured-home", true | sort: 'order' %} - {% for item in ecosystems %} - - {% endfor %} -
    - -
    -
    -

    Community

    - -
    -
    -

    Join the PyTorch developer community to contribute, learn, and get your questions answered.

    -
    -
    - -
    - {% assign resources = site.resources | where: "featured-home",true | sort: 'order' %} - - {% for resource in resources %} - {% assign card_title = resource.title | split: ' ' %} - - - -
    -
    -
    - -
    -
    -
    -
    -

    Companies & Universities
    Using PyTorch

    - - - -
    - {% assign case_studies = site.case_studies | where: "featured-home", true | sort: "order" %} - - {% for case_study in case_studies %} -
    - -

    {{ case_study.excerpt }}

    - - Learn More - -
    - {% endfor %} -
    -
    -
    -
    -
    -
    - - - - -
    - -
    - - - - - - - - - - - + + + + + Redirecting... + + + + +

    If you are not redirected automatically, follow this link to the documentation.

    + + \ No newline at end of file diff --git a/join-ecosystem.html b/join-ecosystem.html new file mode 100644 index 000000000000..255cb92eff40 --- /dev/null +++ b/join-ecosystem.html @@ -0,0 +1,65 @@ +--- +layout: default +title: Join the PyTorch Ecosystem +body-class: announcement +background-class: announcement-background +permalink: /join-ecosystem +--- +{% assign cards = site.board_info %} + +
    +
    +
    +

    Join the
           PyTorch Ecosystem

    +
    +
    +
    + +
    +
    +
    +
    +

    + The PyTorch Ecosystem is made up of innovative open source AI projects that extend, integrate with, or build upon PyTorch. If you're developing a project that supports the PyTorch community, you’re welcome to apply for inclusion in the Ecosystem. +

    + Please review the PyTorch Ecosystem review process to ensure that you meet the minimum expectations before applying. +

    +

    Application Process

    +

    + Applying to join the PyTorch Ecosystem is simple: +

    +
      + +
1. Open a new application using the Ecosystem Application GitHub issue form.
2. Complete all required sections.
3. Submit the issue (see the illustrative sketch below). +
    +
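For most applicants the GitHub web form is the right path. Purely as an illustration of what a submission contains, the sketch below opens an application issue programmatically via the GitHub REST API; the repository slug, title, and body fields are placeholders and do not reflect the real form schema.

```python
# Illustrative only: opens a GitHub issue via the REST API.
# The repository slug and field layout are placeholders, not the actual
# Ecosystem Application form schema -- use the web form in practice.
import os
import requests

REPO = "pytorch-fdn/ecosystem"            # placeholder: substitute the real application repo
TOKEN = os.environ["GITHUB_TOKEN"]        # personal access token with permission to open issues

payload = {
    "title": "[Ecosystem Application] my-project",
    "body": (
        "Project URL: https://github.com/example/my-project\n"
        "License: BSD-3-Clause\n"
        "Summary: A library that extends PyTorch for X.\n"
    ),
}

resp = requests.post(
    f"https://api.github.com/repos/{REPO}/issues",
    json=payload,
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    },
    timeout=30,
)
resp.raise_for_status()
print("Opened:", resp.json()["html_url"])
```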

    + Once submitted, your application enters the review pipeline managed by the PyTorch Ecosystem Working Group. +

    +

    Review Workflow

    +

    + All applications are tracked via the PyTorch Ecosystem Project Board. +

    + Applications are reviewed on a first-come, first-served basis. +

    + The Working Group reviews approximately 7–10 projects per session, depending on availability. +

    + Projects are moved to “In Progress” roughly two weeks before a scheduled review session. +

    + During review, Working Group members assess your application based on technical merit, alignment with the PyTorch mission, and community readiness. Questions and requests for clarification may be posted directly in your GitHub issue. +

    +

    Ready to apply?

    +

    + + + Submit your application on GitHub + +

    + +
    +
    +
    +
    diff --git a/join.html b/join.html index 18be388bbe4a..256d5bb64cec 100644 --- a/join.html +++ b/join.html @@ -41,7 +41,7 @@

    Apply for Membership Today

    If your organization is interested in becoming a Member of the PyTorch Foundation, please complete the Member Enrollment form and designate someone with signature authority for your institution for the membership approval process.

- Premier Membership requires approval from the PyTorch Governing Board. After completing the Membership Application form, please fill out the Premier Membership Questionnaire. The Governing Board will review your application within 14 days. + For Premier Membership, you must also complete the Premier Membership Questionnaire after submitting your membership application. The Foundation will review your application after both have been submitted.

    Please note, membership is not required to participate in PyTorch as an individual contributor. You can join the community at any time and collaborate with others interested in the project. diff --git a/multi-modal-dl-frame.html b/multi-modal-dl-frame.html new file mode 100644 index 000000000000..5fc616d414e2 --- /dev/null +++ b/multi-modal-dl-frame.html @@ -0,0 +1,40 @@ +--- +layout: default +title: "Multi-Modal Tabular Deep Learning with PyTorch Frame" +body-class: announcement +background-class: announcement-background +permalink: /multi-modal-dl-frame +--- + +

    +
    +
    +

    PyTorch Webinars

    +
    +
    +
    + +
    +
    +
    +
+ Multi-Modal Tabular Deep Learning with PyTorch Frame +

    Multi-Modal Tabular Deep Learning with PyTorch Frame

    +

    + Date: February 19, 12 pm PST +
    + Speaker: Akihiro Nitta, Software Engineer, Kumo.ai +
    + Link to session video +
    + Download slides +
    +
+ In this talk, Akihiro introduced PyTorch Frame, a modular framework for multi-modal tabular deep learning. PyTorch Frame enables seamless integration with the PyTorch ecosystem, including PyTorch Geometric for graph-based message passing across relational data and Hugging Face Transformers for extracting rich text features. The talk also highlighted its specialized data structures for efficiently handling sparse features, making PyTorch Frame an essential tool for modern tabular data. +
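As a rough illustration of the idea described above (this is not the PyTorch Frame API itself, whose exact interfaces are not reproduced here), the sketch below fuses numerical columns, categorical columns, and pre-extracted text embeddings for a tabular row in plain PyTorch:

```python
# Conceptual sketch of multi-modal tabular fusion in plain PyTorch.
# NOT the torch_frame API; it only illustrates the per-column encoding
# and fusion pattern the talk describes.
import torch
import torch.nn as nn

class TabularFusionModel(nn.Module):
    def __init__(self, num_numerical, cat_cardinalities, text_dim, hidden=64, out=1):
        super().__init__()
        self.num_encoder = nn.Sequential(nn.Linear(num_numerical, hidden), nn.ReLU())
        self.cat_embeddings = nn.ModuleList(
            [nn.Embedding(card, hidden) for card in cat_cardinalities]
        )
        self.text_proj = nn.Linear(text_dim, hidden)  # text_dim = size of text-model embeddings
        fused = hidden * (2 + len(cat_cardinalities))
        self.head = nn.Sequential(nn.Linear(fused, hidden), nn.ReLU(), nn.Linear(hidden, out))

    def forward(self, x_num, x_cat, x_text):
        parts = [self.num_encoder(x_num), self.text_proj(x_text)]
        parts += [emb(x_cat[:, i]) for i, emb in enumerate(self.cat_embeddings)]
        return self.head(torch.cat(parts, dim=-1))

# Toy batch: 4 rows, 3 numerical cols, 2 categorical cols, 384-dim text embeddings.
model = TabularFusionModel(num_numerical=3, cat_cardinalities=[10, 5], text_dim=384)
out = model(torch.randn(4, 3), torch.randint(0, 5, (4, 2)), torch.randn(4, 384))
print(out.shape)  # torch.Size([4, 1])
```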

    + Akihiro Nitta is a software engineer on the ML team at Kumo.ai and a core contributor to PyTorch Frame and PyTorch Geometric, with prior experience as a maintainer of PyTorch Lightning. +

    +
    +
    +
    +
    \ No newline at end of file diff --git a/new.html b/new.html index ec2bedaa5923..65a7707c07ad 100644 --- a/new.html +++ b/new.html @@ -122,6 +122,19 @@

    Get Started

    +
    +
    +
    +
    +

    Get Linux Foundation Help

    +

    + Need to reach Member Projects Support for The Linux Foundation? Visit the LF Service Desk. +

    +
    +
    +
    +
    +
    diff --git a/pt-26-live-q-a.html b/pt-26-live-q-a.html new file mode 100644 index 000000000000..154fe993fca7 --- /dev/null +++ b/pt-26-live-q-a.html @@ -0,0 +1,43 @@ +--- +layout: default +title: "PyTorch 2.6 Live Q&A" +body-class: announcement +background-class: announcement-background +permalink: /pt-26-live-q-a +--- + +
    +
    +
    +

    PyTorch Webinars

    +
    +
    +
    + +
    +
    +
    +
    + PyTorch 2.6 Live Q&A +

    PyTorch 2.6 Live Q&A

    +

    + Date: February 7, 10 am PST +
    + Speaker: Nikita Shulga +
    + Location: Online +
    +
+ Wondering what’s new in the recent PyTorch 2.6 release? Do you have questions? Join us for a live Q&A on PyTorch 2.6 with PyTorch Core Maintainer Nikita Shulga (Meta). +

    + Nikita is a Software Engineer at Meta where he is, among other things, responsible for PyTorch releases and continuous integration. Nikita is committed to uplifting the developer community and continuously improving PyTorch. He earned his Master’s degree in Applied Mathematics from the Moscow Institute of Physics and Technology (MIPT). +

    + Bring your PyTorch 2.6 questions for Nikita during this live Q&A session. +

    + Watch the recording: +

    +

    +
    +
    +
    +
    \ No newline at end of file diff --git a/pt-27-release-qa.html b/pt-27-release-qa.html new file mode 100644 index 000000000000..12a713ab06b2 --- /dev/null +++ b/pt-27-release-qa.html @@ -0,0 +1,46 @@ +--- +layout: default +title: "PyTorch 2.7 Release Live Q&A" +body-class: announcement +background-class: announcement-background +permalink: /pt-27-release-qa +--- + +
    +
    +
    +

    PyTorch Webinars

    +
    +
    +
    + +
    +
    +
    +
    + PyTorch 2.7 Release Live Q&A +

    PyTorch 2.7 Release Live Q&A

    +

    + Date: April 28, 12 pm PT +
    + Speakers: Piotr Bialecki (NVIDIA) and Nikita Shulga (Meta) +
    +
+ Have questions about PyTorch 2.7? Join PyTorch Core Maintainers Piotr Bialecki (NVIDIA) and Nikita Shulga (Meta) for a live Q&A session on Monday, April 28 at 12 PM PT. +

+ +Piotr joined the PyTorch team at NVIDIA in 2019 and currently manages the team. He drives NVIDIA’s effort in maintaining and advancing PyTorch’s CUDA backend and received the PyTorch SUPERHERO award in 2023 for his community contributions, especially on the PyTorch discussion board. As a Core Maintainer, he is also focused on PyTorch’s long-term vision and development. +

    + +Nikita is a Software Engineer at Meta where, among other things, he is responsible for PyTorch releases and continuous integration. Nikita is committed to uplifting the developer community and continuously improving PyTorch. He earned a Master’s degree in Applied Mathematics from the Moscow Institute of Physics and Technology (MIPT). +

    + +Bring your PyTorch 2.7 questions for Piotr & Nikita during this live Q&A session. +

    + Watch the recording: +

    +

    +
    +
    +
    +
    \ No newline at end of file diff --git a/pt-dinov2-multi-label-plant-species-classification.html b/pt-dinov2-multi-label-plant-species-classification.html new file mode 100644 index 000000000000..10b99f213935 --- /dev/null +++ b/pt-dinov2-multi-label-plant-species-classification.html @@ -0,0 +1,39 @@ +--- +layout: default +title: "Using PyTorch and DINOv2 for Multi-label Plant Species Classification" +body-class: announcement +background-class: announcement-background +permalink: /pt-dinov2-multi-label-plant-species-classification +--- + +
    +
    +
    +

    PyTorch Webinars

    +
    +
    +
    + +
    +
    +
    +
    + Using PyTorch and DINOv2 for Multi-label Plant Species Classification +

    Using PyTorch and DINOv2 for Multi-label Plant Species Classification

    +

+ Date: March 27, 12 PM PT +
    + Speaker: Murilo Gustineli +
    +
    + Join us for an engaging webinar on our innovative transfer learning approach using self-supervised Vision Transformers (DINOv2) for multi-label plant species classification in the PlantCLEF 2024 challenge. We’ll cover how we efficiently extract feature embeddings from a dataset of 1.4 million images and utilize PyTorch Lightning for model training and Apache Spark for data management. Learn about our image processing techniques, including transforming images into grids of tiles and aggregating predictions to overcome computational challenges. Discover the significant performance improvements achieved and get insights into multi-label image classification. Perfect for PyTorch developers, this session will include a Q&A and access to our complete codebase at github.com/dsgt-kaggle-clef/plantclef-2024. +
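As a rough sketch of the tiling-and-aggregation idea described above (the actual PlantCLEF pipeline, tile size, and classifier head are not reproduced here; the linear head below is a placeholder), one can pull a DINOv2 backbone from torch.hub, embed each tile, and average per-tile multi-label probabilities:

```python
# Rough sketch: per-tile DINOv2 embeddings + averaged multi-label predictions.
# The classifier head, tile size, and species count are placeholders, not the PlantCLEF setup.
import torch

backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")  # downloads weights
backbone.eval()

head = torch.nn.Linear(384, 100)  # placeholder head: 384-dim ViT-S features -> 100 species

def predict_species(image: torch.Tensor, tile: int = 224) -> torch.Tensor:
    """image: (3, H, W) tensor; returns probabilities averaged over the grid of tiles."""
    _, h, w = image.shape
    tiles = [
        image[:, y:y + tile, x:x + tile]
        for y in range(0, h - tile + 1, tile)
        for x in range(0, w - tile + 1, tile)
    ]
    batch = torch.stack(tiles)                 # (num_tiles, 3, tile, tile)
    with torch.no_grad():
        feats = backbone(batch)                # (num_tiles, 384) CLS embeddings
        probs = torch.sigmoid(head(feats))     # multi-label probabilities per tile
    return probs.mean(dim=0)                   # aggregate predictions across tiles

print(predict_species(torch.rand(3, 448, 448)).shape)  # torch.Size([100])
```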

+ Murilo Gustineli is a Senior AI Software Solutions Engineer at Intel and is currently pursuing a Master’s in Computer Science at Georgia Tech, focusing on machine learning. His work involves creating synthetic datasets, fine-tuning large language models, and training multi-modal models using Intel® Gaudi® AI accelerators as part of the Development Enablement team. He is particularly interested in deep learning, information retrieval, and biodiversity research, aiming to improve species identification and support conservation efforts. Visit Murilo on GitHub. +

    + Watch the recording: +

    +

    +
    +
    +
    +
    \ No newline at end of file diff --git a/published_versions.json b/published_versions.json index 6e847024870e..6e643fdbe663 100644 --- a/published_versions.json +++ b/published_versions.json @@ -1,5 +1,5 @@ { - "latest_stable": "2.5.1", + "latest_stable": "2.7.0", "latest_lts": "lts-1.8.2", "versions": { "preview": { @@ -15,72 +15,46 @@ }, "cuda.y": { "note": null, - "command": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124" + "command": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126" }, "cuda.z": { "note": null, - "command": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126" + "command": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128" }, "rocm5.x": { "note": null, - "command": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3" - } - }, - "conda": { - "cuda.x": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null - }, - "cuda.y": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null - }, - "cuda.z": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null - }, - "rocm5.x": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null - }, - "accnone": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null + "command": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4" } }, "libtorch": { "accnone": { "note": null, "versions": { - "Download here (Pre-cxx11 ABI):": "https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip", "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/nightly/cpu/libtorch-cxx11-abi-shared-with-deps-latest.zip" } }, "cuda.x": { "note": null, "versions": { - "Download here (Pre-cxx11 ABI):": "https://download.pytorch.org/libtorch/nightly/cu118/libtorch-shared-with-deps-latest.zip", "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/nightly/cu118/libtorch-cxx11-abi-shared-with-deps-latest.zip" } }, "cuda.y": { "note": null, "versions": { - "Download here (Pre-cxx11 ABI):": "https://download.pytorch.org/libtorch/nightly/cu124/libtorch-shared-with-deps-latest.zip", - "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/nightly/cu124/libtorch-cxx11-abi-shared-with-deps-latest.zip" + "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/nightly/cu126/libtorch-cxx11-abi-shared-with-deps-latest.zip" } }, "cuda.z": { "note": null, "versions": { - "Download here (Pre-cxx11 ABI):": "https://download.pytorch.org/libtorch/nightly/cu126/libtorch-shared-with-deps-latest.zip", - "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/nightly/cu126/libtorch-cxx11-abi-shared-with-deps-latest.zip" + "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/nightly/cu128/libtorch-cxx11-abi-shared-with-deps-latest.zip" } }, "rocm5.x": { "note": null, "versions": { - "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/nightly/rocm6.3/libtorch-cxx11-abi-shared-with-deps-latest.zip" + "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/nightly/rocm6.4/libtorch-cxx11-abi-shared-with-deps-latest.zip" } } } @@ -112,32 +86,6 @@ "command": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu" } }, - "conda": { - "cuda.x": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null, - "default": true - }, - "cuda.y": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null, - "default": true - }, - "cuda.z": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null, - "default": true - }, - "rocm5.x": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null, - "default": true - }, - "accnone": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null - } - }, "libtorch": { "accnone": { "note": null, @@ -186,38 +134,16 @@ "command": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118" }, "cuda.y": { - "note": null, - "command": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124" - }, - "cuda.z": { "note": null, "command": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126" }, - "rocm5.x": { - "note": "NOTE: ROCm is not available on Windows", - "command": null - } - }, - "conda": { - "cuda.x": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null - }, - "cuda.y": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null - }, "cuda.z": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null + "note": null, + "command": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128" }, "rocm5.x": { "note": "NOTE: ROCm is not available on Windows", "command": null - }, - "accnone": { - "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", - "command": null } }, "libtorch": { @@ -238,15 +164,15 @@ "cuda.y": { "note": null, "versions": { - "Download here (Release version):": "https://download.pytorch.org/libtorch/nightly/cu124/libtorch-win-shared-with-deps-latest.zip", - "Download here (Debug version):": "https://download.pytorch.org/libtorch/nightly/cu124/libtorch-win-shared-with-deps-debug-latest.zip" + "Download here (Release version):": "https://download.pytorch.org/libtorch/nightly/cu126/libtorch-win-shared-with-deps-latest.zip", + "Download here (Debug version):": "https://download.pytorch.org/libtorch/nightly/cu126/libtorch-win-shared-with-deps-debug-latest.zip" } }, "cuda.z": { "note": null, "versions": { - "Download here (Release version):": "https://download.pytorch.org/libtorch/nightly/cu126/libtorch-win-shared-with-deps-latest.zip", - "Download here (Debug version):": "https://download.pytorch.org/libtorch/nightly/cu126/libtorch-win-shared-with-deps-debug-latest.zip" + "Download here (Release version):": "https://download.pytorch.org/libtorch/nightly/cu128/libtorch-win-shared-with-deps-latest.zip", + "Download here (Debug version):": "https://download.pytorch.org/libtorch/nightly/cu128/libtorch-win-shared-with-deps-debug-latest.zip" } }, "rocm5.x": { @@ -5607,6 +5533,488 @@ } } } + }, + "2.6.0": { + "linux": { + "pip": { + "accnone": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu" + }, + "cuda.x": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118" + }, + "cuda.y": { + "note": null, + "command": "pip3 install torch torchvision torchaudio" + }, + "cuda.z": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126" + }, + "rocm5.x": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4" + } + }, + "conda": { + "cuda.x": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + }, + "cuda.y": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + }, + "cuda.z": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + }, + "rocm5.x": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + }, + "accnone": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + } + }, + "libtorch": { + "accnone": { + "note": null, + "versions": { + "Download here (Pre-cxx11 ABI):": "https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-2.6.0%2Bcpu.zip", + "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-2.6.0%2Bcpu.zip" + } + }, + "cuda.x": { + "note": null, + "versions": { + "Download here (Pre-cxx11 ABI):": "https://download.pytorch.org/libtorch/cu118/libtorch-shared-with-deps-2.6.0%2Bcu118.zip", + "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.6.0%2Bcu118.zip" + } + }, + "cuda.y": { + "note": null, + "versions": { + "Download here (Pre-cxx11 ABI):": "https://download.pytorch.org/libtorch/cu124/libtorch-shared-with-deps-2.6.0%2Bcu124.zip", + "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/cu124/libtorch-cxx11-abi-shared-with-deps-2.6.0%2Bcu124.zip" + } + }, + "cuda.z": { + "note": null, + "versions": { + "Download here (Pre-cxx11 ABI):": "https://download.pytorch.org/libtorch/cu126/libtorch-shared-with-deps-2.6.0%2Bcu126.zip", + "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/cu126/libtorch-cxx11-abi-shared-with-deps-2.6.0%2Bcu126.zip" + } + }, + "rocm5.x": { + "note": null, + "versions": { + "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/rocm6.2.4/libtorch-cxx11-abi-shared-with-deps-2.6.0%2Brocm6.2.4.zip" + } + } + } + }, + "macos": { + "pip": { + "cuda.x": { + "note": "# CUDA is not available on MacOS, please use default package", + "command": "pip3 install torch torchvision torchaudio", + "default": true + }, + "cuda.y": { + "note": "# CUDA is not available on MacOS, please use default package", + "command": "pip3 install torch torchvision torchaudio", + "default": true + }, + "cuda.z": { + "note": "# CUDA is not available on MacOS, please use default package", + "command": "pip3 install torch torchvision torchaudio", + "default": true + }, + "rocm5.x": { + "note": "# ROCm is not available on MacOS, please use default package", + "command": "pip3 install torch torchvision torchaudio", + "default": true + }, + "accnone": { + "note": null, + "command": "pip3 install torch torchvision torchaudio" + } + }, + "conda": { + "cuda.x": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null, + "default": true + }, + "cuda.y": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null, + "default": true + }, + "cuda.z": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null, + "default": true + }, + "rocm5.x": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null, + "default": true + }, + "accnone": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + } + }, + "libtorch": { + "accnone": { + "note": null, + "versions": { + "Download arm64 libtorch here (ROCm and CUDA are not supported):": "https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.6.0.zip" + } + }, + "cuda.x": { + "note": null, + "default": true, + "versions": { + "Download arm64 libtorch here (ROCm and CUDA are not supported):": "https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.6.0.zip" + } + }, + "cuda.y": { + "note": null, + "default": true, + "versions": { + "Download arm64 libtorch here (ROCm and CUDA are not supported):": "https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.6.0.zip" + } + }, + "cuda.z": { + "note": null, + "default": true, + "versions": { + "Download arm64 libtorch here (ROCm and CUDA are not supported):": "https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.6.0.zip" + } + }, + "rocm5.x": { + "note": null, + "default": true, + "versions": { + "Download arm64 libtorch here (ROCm and CUDA are not supported):": "https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.6.0.zip" + } + } + } + }, + "windows": { + "pip": { + "accnone": { + "note": null, + "command": "pip3 install torch torchvision torchaudio" + }, + "cuda.x": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118" + }, + "cuda.y": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124" + }, + "cuda.z": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126" + }, + "rocm5.x": { + "note": "NOTE: ROCm is not available on Windows", + "command": null + } + }, + "conda": { + "cuda.x": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + }, + "cuda.y": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + }, + "cuda.z": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + }, + "rocm5.x": { + "note": "NOTE: ROCm is not available on Windows", + "command": null + }, + "accnone": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + } + }, + "libtorch": { + "accnone": { + "note": null, + "versions": { + "Download here (Release version):": "https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-2.6.0%2Bcpu.zip", + "Download here (Debug version):": "https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-debug-2.6.0%2Bcpu.zip" + } + }, + "cuda.x": { + "note": null, + "versions": { + "Download here (Release version):": "https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-2.6.0%2Bcu118.zip", + "Download here (Debug version):": "https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-debug-2.6.0%2Bcu118.zip" + } + }, + "cuda.y": { + "note": null, + "versions": { + "Download here (Release version):": "https://download.pytorch.org/libtorch/cu124/libtorch-win-shared-with-deps-2.6.0%2Bcu124.zip", + "Download here (Debug version):": "https://download.pytorch.org/libtorch/cu124/libtorch-win-shared-with-deps-debug-2.6.0%2Bcu124.zip" + } + }, + "cuda.z": { + "note": null, + "versions": { + "Download here (Release version):": "https://download.pytorch.org/libtorch/cu126/libtorch-win-shared-with-deps-2.6.0%2Bcu126.zip", + "Download here (Debug version):": "https://download.pytorch.org/libtorch/cu126/libtorch-win-shared-with-deps-debug-2.6.0%2Bcu126.zip" + } + }, + "rocm5.x": { + "note": "NOTE: ROCm is not available on Windows", + "versions": null + } + } + } + }, + "2.7.0": { + "linux": { + "pip": { + "accnone": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu" + }, + "cuda.x": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118" + }, + "cuda.y": { + "note": null, + "command": "pip3 install torch torchvision torchaudio" + }, + "cuda.z": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128" + }, + "rocm5.x": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3" + } + }, + "libtorch": { + "accnone": { + "note": null, + "versions": { + "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcpu.zip" + } + }, + "cuda.x": { + "note": null, + "versions": { + "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcu118.zip" + } + }, + "cuda.y": { + "note": null, + "versions": { + "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/cu126/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcu126.zip" + } + }, + "cuda.z": { + "note": null, + "versions": { + "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/cu128/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcu128.zip" + } + }, + "rocm5.x": { + "note": null, + "versions": { + "Download here (cxx11 ABI):": "https://download.pytorch.org/libtorch/rocm6.3/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Brocm6.3.zip" + } + } + } + }, + "macos": { + "pip": { + "cuda.x": { + "note": "# CUDA is not 
available on MacOS, please use default package", + "command": "pip3 install torch torchvision torchaudio", + "default": true + }, + "cuda.y": { + "note": "# CUDA is not available on MacOS, please use default package", + "command": "pip3 install torch torchvision torchaudio", + "default": true + }, + "cuda.z": { + "note": "# CUDA is not available on MacOS, please use default package", + "command": "pip3 install torch torchvision torchaudio", + "default": true + }, + "rocm5.x": { + "note": "# ROCm is not available on MacOS, please use default package", + "command": "pip3 install torch torchvision torchaudio", + "default": true + }, + "accnone": { + "note": null, + "command": "pip3 install torch torchvision torchaudio" + } + }, + "conda": { + "cuda.x": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null, + "default": true + }, + "cuda.y": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null, + "default": true + }, + "cuda.z": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null, + "default": true + }, + "rocm5.x": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null, + "default": true + }, + "accnone": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + } + }, + "libtorch": { + "accnone": { + "note": null, + "versions": { + "Download arm64 libtorch here (ROCm and CUDA are not supported):": "https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.7.0.zip" + } + }, + "cuda.x": { + "note": null, + "default": true, + "versions": { + "Download arm64 libtorch here (ROCm and CUDA are not supported):": "https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.7.0.zip" + } + }, + "cuda.y": { + "note": null, + "default": true, + "versions": { + "Download arm64 libtorch here (ROCm and CUDA are not supported):": "https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.7.0.zip" + } + }, + "cuda.z": { + "note": null, + "default": true, + "versions": { + "Download arm64 libtorch here (ROCm and CUDA are not supported):": "https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.7.0.zip" + } + }, + "rocm5.x": { + "note": null, + "default": true, + "versions": { + "Download arm64 libtorch here (ROCm and CUDA are not supported):": "https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.7.0.zip" + } + } + } + }, + "windows": { + "pip": { + "accnone": { + "note": null, + "command": "pip3 install torch torchvision torchaudio" + }, + "cuda.x": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118" + }, + "cuda.y": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126" + }, + "cuda.z": { + "note": null, + "command": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128" + }, + "rocm5.x": { + "note": "NOTE: ROCm is not available on Windows", + "command": null + } + }, + "conda": { + "cuda.x": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + }, + "cuda.y": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + }, + "cuda.z": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + }, + "rocm5.x": { + "note": "NOTE: ROCm is not available on Windows", + "command": null + }, + "accnone": { + "note": "NOTE: Conda packages are no longer available. Please use pip instead.
    ", + "command": null + } + }, + "libtorch": { + "accnone": { + "note": null, + "versions": { + "Download here (Release version):": "https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-2.7.0%2Bcpu.zip", + "Download here (Debug version):": "https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-debug-2.7.0%2Bcpu.zip" + } + }, + "cuda.x": { + "note": null, + "versions": { + "Download here (Release version):": "https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-2.7.0%2Bcu118.zip", + "Download here (Debug version):": "https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-debug-2.7.0%2Bcu118.zip" + } + }, + "cuda.y": { + "note": null, + "versions": { + "Download here (Release version):": "https://download.pytorch.org/libtorch/cu126/libtorch-win-shared-with-deps-2.7.0%2Bcu126.zip", + "Download here (Debug version):": "https://download.pytorch.org/libtorch/cu126/libtorch-win-shared-with-deps-debug-2.7.0%2Bcu126.zip" + } + }, + "cuda.z": { + "note": null, + "versions": { + "Download here (Release version):": "https://download.pytorch.org/libtorch/cu128/libtorch-win-shared-with-deps-2.7.0%2Bcu128.zip", + "Download here (Debug version):": "https://download.pytorch.org/libtorch/cu128/libtorch-win-shared-with-deps-debug-2.7.0%2Bcu128.zip" + } + }, + "rocm5.x": { + "note": "NOTE: ROCm is not available on Windows", + "versions": null + } + } + } } } } \ No newline at end of file diff --git a/pytorch-domains.html b/pytorch-domains.html index df512b73d3b3..740c2a49fbd1 100644 --- a/pytorch-domains.html +++ b/pytorch-domains.html @@ -102,6 +102,13 @@

    ExecuTorch

    +
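The published_versions.json changes above and the releases.json changes below move the surfaced install commands to the cu126/cu128 and rocm6.3/rocm6.4 wheel indexes and bump the stable version to 2.7.0. Purely as an illustration, a quick way to sanity-check an environment installed from one of those pip commands is the snippet below; it uses only standard torch attributes and makes no assumptions about the site's JSON schema.

```python
# Illustrative sanity check after installing torch from one of the listed index URLs.
import torch

print("torch version:", torch.__version__)        # e.g. "2.7.0+cu128"
print("built with CUDA:", torch.version.cuda)      # None for CPU/ROCm wheels
print("built with ROCm/HIP:", torch.version.hip)   # None for CPU/CUDA wheels
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```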
    diff --git a/releases.json b/releases.json index c9c345f041c8..bbe560c7a46c 100644 --- a/releases.json +++ b/releases.json @@ -13,7 +13,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -28,67 +28,67 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.9", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "manywheel", - "build_name": "manywheel-py3_9-cuda12_4", + "build_name": "manywheel-py3_9-cuda12_6", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.9", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.6", - "desired_cuda": "cu126", - "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "manywheel", - "build_name": "manywheel-py3_9-cuda12_6", + "build_name": "manywheel-py3_9-cuda12_8", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.9", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.2.4", - "desired_cuda": "rocm6.2.4", - "container_image": "pytorch/manylinux2_28-builder:rocm6.2.4", + "gpu_arch_version": "6.3", + "desired_cuda": "rocm6.3", + "container_image": "pytorch/manylinux2_28-builder:rocm6.3", "package_type": "manywheel", - "build_name": "manywheel-py3_9-rocm6_2_4", + "build_name": "manywheel-py3_9-rocm6_3", "validation_runner": "linux.2xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.2.4", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.9", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.3", - "desired_cuda": "rocm6.3", - 
"container_image": "pytorch/manylinux2_28-builder:rocm6.3", + "gpu_arch_version": "6.4", + "desired_cuda": "rocm6.4", + "container_image": "pytorch/manylinux2_28-builder:rocm6.4", "package_type": "manywheel", - "build_name": "manywheel-py3_9-rocm6_3", + "build_name": "manywheel-py3_9-rocm6_4", "validation_runner": "linux.2xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -103,7 +103,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -118,67 +118,67 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.10", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "manywheel", - "build_name": "manywheel-py3_10-cuda12_4", + "build_name": "manywheel-py3_10-cuda12_6", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.10", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.6", - "desired_cuda": "cu126", - "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "manywheel", - "build_name": "manywheel-py3_10-cuda12_6", + "build_name": "manywheel-py3_10-cuda12_8", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.10", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.2.4", - "desired_cuda": "rocm6.2.4", - "container_image": "pytorch/manylinux2_28-builder:rocm6.2.4", + "gpu_arch_version": "6.3", + "desired_cuda": "rocm6.3", + "container_image": "pytorch/manylinux2_28-builder:rocm6.3", "package_type": "manywheel", - "build_name": 
"manywheel-py3_10-rocm6_2_4", + "build_name": "manywheel-py3_10-rocm6_3", "validation_runner": "linux.2xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.2.4", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.10", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.3", - "desired_cuda": "rocm6.3", - "container_image": "pytorch/manylinux2_28-builder:rocm6.3", + "gpu_arch_version": "6.4", + "desired_cuda": "rocm6.4", + "container_image": "pytorch/manylinux2_28-builder:rocm6.4", "package_type": "manywheel", - "build_name": "manywheel-py3_10-rocm6_3", + "build_name": "manywheel-py3_10-rocm6_4", "validation_runner": "linux.2xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -193,7 +193,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -208,67 +208,67 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.11", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "manywheel", - "build_name": "manywheel-py3_11-cuda12_4", + "build_name": "manywheel-py3_11-cuda12_6", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.11", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.6", - "desired_cuda": "cu126", - "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "manywheel", - "build_name": "manywheel-py3_11-cuda12_6", + "build_name": "manywheel-py3_11-cuda12_8", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url 
https://download.pytorch.org/whl/nightly/cu126", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.11", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.2.4", - "desired_cuda": "rocm6.2.4", - "container_image": "pytorch/manylinux2_28-builder:rocm6.2.4", + "gpu_arch_version": "6.3", + "desired_cuda": "rocm6.3", + "container_image": "pytorch/manylinux2_28-builder:rocm6.3", "package_type": "manywheel", - "build_name": "manywheel-py3_11-rocm6_2_4", + "build_name": "manywheel-py3_11-rocm6_3", "validation_runner": "linux.2xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.2.4", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.11", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.3", - "desired_cuda": "rocm6.3", - "container_image": "pytorch/manylinux2_28-builder:rocm6.3", + "gpu_arch_version": "6.4", + "desired_cuda": "rocm6.4", + "container_image": "pytorch/manylinux2_28-builder:rocm6.4", "package_type": "manywheel", - "build_name": "manywheel-py3_11-rocm6_3", + "build_name": "manywheel-py3_11-rocm6_4", "validation_runner": "linux.2xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -283,7 +283,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -298,67 +298,67 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.12", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "manywheel", - "build_name": "manywheel-py3_12-cuda12_4", + "build_name": "manywheel-py3_12-cuda12_6", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "channel": 
"nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.12", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.6", - "desired_cuda": "cu126", - "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "manywheel", - "build_name": "manywheel-py3_12-cuda12_6", + "build_name": "manywheel-py3_12-cuda12_8", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.12", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.2.4", - "desired_cuda": "rocm6.2.4", - "container_image": "pytorch/manylinux2_28-builder:rocm6.2.4", + "gpu_arch_version": "6.3", + "desired_cuda": "rocm6.3", + "container_image": "pytorch/manylinux2_28-builder:rocm6.3", "package_type": "manywheel", - "build_name": "manywheel-py3_12-rocm6_2_4", + "build_name": "manywheel-py3_12-rocm6_3", "validation_runner": "linux.2xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.2.4", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.12", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.3", - "desired_cuda": "rocm6.3", - "container_image": "pytorch/manylinux2_28-builder:rocm6.3", + "gpu_arch_version": "6.4", + "desired_cuda": "rocm6.4", + "container_image": "pytorch/manylinux2_28-builder:rocm6.4", "package_type": "manywheel", - "build_name": "manywheel-py3_12-rocm6_3", + "build_name": "manywheel-py3_12-rocm6_4", "validation_runner": "linux.2xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -373,7 +373,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -388,128 +388,158 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.13", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - 
"container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "manywheel", - "build_name": "manywheel-py3_13-cuda12_4", + "build_name": "manywheel-py3_13-cuda12_6", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.13", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.6", - "desired_cuda": "cu126", - "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "manywheel", - "build_name": "manywheel-py3_13-cuda12_6", + "build_name": "manywheel-py3_13-cuda12_8", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.13", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.2.4", - "desired_cuda": "rocm6.2.4", - "container_image": "pytorch/manylinux2_28-builder:rocm6.2.4", + "gpu_arch_version": "6.3", + "desired_cuda": "rocm6.3", + "container_image": "pytorch/manylinux2_28-builder:rocm6.3", "package_type": "manywheel", - "build_name": "manywheel-py3_13-rocm6_2_4", + "build_name": "manywheel-py3_13-rocm6_3", "validation_runner": "linux.2xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.2.4", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.13", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.3", - "desired_cuda": "rocm6.3", - "container_image": "pytorch/manylinux2_28-builder:rocm6.3", + "gpu_arch_version": "6.4", + "desired_cuda": "rocm6.4", + "container_image": "pytorch/manylinux2_28-builder:rocm6.4", "package_type": "manywheel", - "build_name": "manywheel-py3_13-rocm6_3", + "build_name": "manywheel-py3_13-rocm6_4", "validation_runner": "linux.2xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { + "python_version": "3.13t", "gpu_arch_type": "cpu", 
"gpu_arch_version": "", "desired_cuda": "cpu", - "libtorch_variant": "shared-with-deps", - "libtorch_config": "", - "devtoolset": "pre-cxx11", - "container_image": "pytorch/manylinux-builder:cpu", - "package_type": "libtorch", - "build_name": "libtorch-cpu-shared-with-deps-pre-cxx11", + "container_image": "pytorch/manylinux2_28-builder:cpu", + "package_type": "manywheel", + "build_name": "manywheel-py3_13t-cpu", "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", - "stable_version": "2.5.1" + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false }, { + "python_version": "3.13t", "gpu_arch_type": "cuda", "gpu_arch_version": "11.8", "desired_cuda": "cu118", - "libtorch_variant": "shared-with-deps", - "libtorch_config": "", - "devtoolset": "pre-cxx11", - "container_image": "pytorch/manylinux-builder:cuda11.8", - "package_type": "libtorch", - "build_name": "libtorch-cuda11_8-shared-with-deps-pre-cxx11", - "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/nightly/cu118/libtorch-shared-with-deps-latest.zip", + "container_image": "pytorch/manylinux2_28-builder:cuda11.8", + "package_type": "manywheel", + "build_name": "manywheel-py3_13t-cuda11_8", + "validation_runner": "linux.g5.4xlarge.nvidia.gpu", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "channel": "nightly", - "stable_version": "2.5.1" + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false }, { + "python_version": "3.13t", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "libtorch_variant": "shared-with-deps", - "libtorch_config": "", - "devtoolset": "pre-cxx11", - "container_image": "pytorch/manylinux-builder:cuda12.4", - "package_type": "libtorch", - "build_name": "libtorch-cuda12_4-shared-with-deps-pre-cxx11", - "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/nightly/cu124/libtorch-shared-with-deps-latest.zip", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "package_type": "manywheel", + "build_name": "manywheel-py3_13t-cuda12_6", + "validation_runner": "linux.g5.4xlarge.nvidia.gpu", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "channel": "nightly", - "stable_version": "2.5.1" + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false }, { + "python_version": "3.13t", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.6", - "desired_cuda": "cu126", - "libtorch_variant": "shared-with-deps", - "libtorch_config": "", - "devtoolset": "pre-cxx11", - "container_image": "pytorch/manylinux-builder:cuda12.6", - "package_type": "libtorch", - "build_name": "libtorch-cuda12_6-shared-with-deps-pre-cxx11", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", + "package_type": "manywheel", + "build_name": "manywheel-py3_13t-cuda12_8", + "validation_runner": 
"linux.g5.4xlarge.nvidia.gpu", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", + "channel": "nightly", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "rocm", + "gpu_arch_version": "6.3", + "desired_cuda": "rocm6.3", + "container_image": "pytorch/manylinux2_28-builder:rocm6.3", + "package_type": "manywheel", + "build_name": "manywheel-py3_13t-rocm6_3", + "validation_runner": "linux.2xlarge", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3", + "channel": "nightly", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "rocm", + "gpu_arch_version": "6.4", + "desired_cuda": "rocm6.4", + "container_image": "pytorch/manylinux2_28-builder:rocm6.4", + "package_type": "manywheel", + "build_name": "manywheel-py3_13t-rocm6_4", "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/nightly/cu126/libtorch-shared-with-deps-latest.zip", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4", "channel": "nightly", - "stable_version": "2.5.1" + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false }, { "gpu_arch_type": "cpu", @@ -524,7 +554,7 @@ "validation_runner": "linux.2xlarge", "installation": "https://download.pytorch.org/libtorch/nightly/cpu/libtorch-cxx11-abi-shared-with-deps-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", @@ -539,22 +569,7 @@ "validation_runner": "linux.2xlarge", "installation": "https://download.pytorch.org/libtorch/nightly/cu118/libtorch-cxx11-abi-shared-with-deps-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" - }, - { - "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "libtorch_variant": "shared-with-deps", - "libtorch_config": "", - "devtoolset": "cxx11-abi", - "container_image": "pytorch/libtorch-cxx11-builder:cuda12.4", - "package_type": "libtorch", - "build_name": "libtorch-cuda12_4-shared-with-deps-cxx11-abi", - "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/nightly/cu124/libtorch-cxx11-abi-shared-with-deps-latest.zip", - "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", @@ -569,22 +584,22 @@ "validation_runner": "linux.2xlarge", "installation": "https://download.pytorch.org/libtorch/nightly/cu126/libtorch-cxx11-abi-shared-with-deps-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { - "gpu_arch_type": "rocm", - "gpu_arch_version": "6.2.4", - "desired_cuda": "rocm6.2.4", + "gpu_arch_type": "cuda", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", "libtorch_variant": "shared-with-deps", "libtorch_config": "", "devtoolset": "cxx11-abi", - "container_image": "pytorch/libtorch-cxx11-builder:rocm6.2.4", + "container_image": "pytorch/libtorch-cxx11-builder:cuda12.8", "package_type": "libtorch", - "build_name": 
"libtorch-rocm6_2_4-shared-with-deps-cxx11-abi", + "build_name": "libtorch-cuda12_8-shared-with-deps-cxx11-abi", "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/nightly/rocm6.2.4/libtorch-cxx11-abi-shared-with-deps-latest.zip", + "installation": "https://download.pytorch.org/libtorch/nightly/cu128/libtorch-cxx11-abi-shared-with-deps-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "rocm", @@ -599,7 +614,22 @@ "validation_runner": "linux.2xlarge", "installation": "https://download.pytorch.org/libtorch/nightly/rocm6.3/libtorch-cxx11-abi-shared-with-deps-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" + }, + { + "gpu_arch_type": "rocm", + "gpu_arch_version": "6.4", + "desired_cuda": "rocm6.4", + "libtorch_variant": "shared-with-deps", + "libtorch_config": "", + "devtoolset": "cxx11-abi", + "container_image": "pytorch/libtorch-cxx11-builder:rocm6.4", + "package_type": "libtorch", + "build_name": "libtorch-rocm6_4-shared-with-deps-cxx11-abi", + "validation_runner": "linux.2xlarge", + "installation": "https://download.pytorch.org/libtorch/nightly/rocm6.4/libtorch-cxx11-abi-shared-with-deps-latest.zip", + "channel": "nightly", + "stable_version": "2.7.0" } ], "windows": [ @@ -615,7 +645,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -630,37 +660,37 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.9", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "wheel", - "build_name": "wheel-py3_9-cuda12_4", + "build_name": "wheel-py3_9-cuda12_6", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.9", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.6", - "desired_cuda": "cu126", - "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "wheel", - "build_name": "wheel-py3_9-cuda12_6", + "build_name": "wheel-py3_9-cuda12_8", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", + "installation": "pip3 install --pre 
torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -675,7 +705,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -690,37 +720,37 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.10", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "wheel", - "build_name": "wheel-py3_10-cuda12_4", + "build_name": "wheel-py3_10-cuda12_6", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.10", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.6", - "desired_cuda": "cu126", - "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "wheel", - "build_name": "wheel-py3_10-cuda12_6", + "build_name": "wheel-py3_10-cuda12_8", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -735,7 +765,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -750,37 +780,37 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.11", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", 
"package_type": "wheel", - "build_name": "wheel-py3_11-cuda12_4", + "build_name": "wheel-py3_11-cuda12_6", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.11", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.6", - "desired_cuda": "cu126", - "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "wheel", - "build_name": "wheel-py3_11-cuda12_6", + "build_name": "wheel-py3_11-cuda12_8", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -795,7 +825,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -810,37 +840,37 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.12", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "wheel", - "build_name": "wheel-py3_12-cuda12_4", + "build_name": "wheel-py3_12-cuda12_6", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.12", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.6", - "desired_cuda": "cu126", - "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "wheel", - "build_name": "wheel-py3_12-cuda12_6", + "build_name": "wheel-py3_12-cuda12_8", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url 
https://download.pytorch.org/whl/nightly/cu126", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -855,7 +885,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -870,53 +900,113 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.13", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "wheel", - "build_name": "wheel-py3_13-cuda12_4", + "build_name": "wheel-py3_13-cuda12_6", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.13", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.6", - "desired_cuda": "cu126", - "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "wheel", - "build_name": "wheel-py3_13-cuda12_6", + "build_name": "wheel-py3_13-cuda12_8", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { + "python_version": "3.13t", "gpu_arch_type": "cpu", "gpu_arch_version": "", "desired_cuda": "cpu", - "libtorch_variant": "shared-with-deps", - "libtorch_config": "release", - "devtoolset": "", - "container_image": "", + "container_image": "pytorch/manylinux2_28-builder:cpu", + "package_type": "wheel", + "build_name": "wheel-py3_13t-cpu", + "validation_runner": "windows.4xlarge", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", + "channel": "nightly", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "cuda", + "gpu_arch_version": "11.8", + "desired_cuda": "cu118", + "container_image": 
"pytorch/manylinux2_28-builder:cuda11.8", + "package_type": "wheel", + "build_name": "wheel-py3_13t-cuda11_8", + "validation_runner": "windows.g4dn.xlarge", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118", + "channel": "nightly", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "cuda", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "package_type": "wheel", + "build_name": "wheel-py3_13t-cuda12_6", + "validation_runner": "windows.g4dn.xlarge", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126", + "channel": "nightly", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "cuda", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", + "package_type": "wheel", + "build_name": "wheel-py3_13t-cuda12_8", + "validation_runner": "windows.g4dn.xlarge", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128", + "channel": "nightly", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "gpu_arch_type": "cpu", + "gpu_arch_version": "", + "desired_cuda": "cpu", + "libtorch_variant": "shared-with-deps", + "libtorch_config": "release", + "devtoolset": "", + "container_image": "", "package_type": "libtorch", "build_name": "libtorch-cpu-shared-with-deps-release", "validation_runner": "windows.4xlarge", "installation": "https://download.pytorch.org/libtorch/nightly/cpu/libtorch-win-shared-with-deps-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", @@ -931,37 +1021,37 @@ "validation_runner": "windows.4xlarge", "installation": "https://download.pytorch.org/libtorch/nightly/cu118/libtorch-win-shared-with-deps-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", "libtorch_variant": "shared-with-deps", "libtorch_config": "release", "devtoolset": "", "container_image": "", "package_type": "libtorch", - "build_name": "libtorch-cuda12_4-shared-with-deps-release", + "build_name": "libtorch-cuda12_6-shared-with-deps-release", "validation_runner": "windows.4xlarge", - "installation": "https://download.pytorch.org/libtorch/nightly/cu124/libtorch-win-shared-with-deps-latest.zip", + "installation": "https://download.pytorch.org/libtorch/nightly/cu126/libtorch-win-shared-with-deps-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", - "gpu_arch_version": "12.6", - "desired_cuda": "cu126", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", "libtorch_variant": "shared-with-deps", "libtorch_config": "release", "devtoolset": "", "container_image": "", "package_type": "libtorch", - "build_name": "libtorch-cuda12_6-shared-with-deps-release", + "build_name": 
"libtorch-cuda12_8-shared-with-deps-release", "validation_runner": "windows.4xlarge", - "installation": "https://download.pytorch.org/libtorch/nightly/cu126/libtorch-win-shared-with-deps-latest.zip", + "installation": "https://download.pytorch.org/libtorch/nightly/cu128/libtorch-win-shared-with-deps-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cpu", @@ -976,7 +1066,7 @@ "validation_runner": "windows.4xlarge", "installation": "https://download.pytorch.org/libtorch/nightly/cpu/libtorch-win-shared-with-deps-debug-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", @@ -991,37 +1081,37 @@ "validation_runner": "windows.4xlarge", "installation": "https://download.pytorch.org/libtorch/nightly/cu118/libtorch-win-shared-with-deps-debug-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", "libtorch_variant": "shared-with-deps", "libtorch_config": "debug", "devtoolset": "", "container_image": "", "package_type": "libtorch", - "build_name": "libtorch-cuda12_4-shared-with-deps-debug", + "build_name": "libtorch-cuda12_6-shared-with-deps-debug", "validation_runner": "windows.4xlarge", - "installation": "https://download.pytorch.org/libtorch/nightly/cu124/libtorch-win-shared-with-deps-debug-latest.zip", + "installation": "https://download.pytorch.org/libtorch/nightly/cu126/libtorch-win-shared-with-deps-debug-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", - "gpu_arch_version": "12.6", - "desired_cuda": "cu126", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", "libtorch_variant": "shared-with-deps", "libtorch_config": "debug", "devtoolset": "", "container_image": "", "package_type": "libtorch", - "build_name": "libtorch-cuda12_6-shared-with-deps-debug", + "build_name": "libtorch-cuda12_8-shared-with-deps-debug", "validation_runner": "windows.4xlarge", - "installation": "https://download.pytorch.org/libtorch/nightly/cu126/libtorch-win-shared-with-deps-debug-latest.zip", + "installation": "https://download.pytorch.org/libtorch/nightly/cu128/libtorch-win-shared-with-deps-debug-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" } ], "macos": [ @@ -1037,7 +1127,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1052,7 +1142,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1067,7 +1157,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + 
"stable_version": "2.7.0", "use_split_build": false }, { @@ -1082,7 +1172,7 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1097,7 +1187,22 @@ "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", "channel": "nightly", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "cpu", + "gpu_arch_version": "", + "desired_cuda": "cpu", + "container_image": "pytorch/manylinux2_28-builder:cpu", + "package_type": "wheel", + "build_name": "wheel-py3_13t-cpu", + "validation_runner": "macos-m1-stable", + "installation": "pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu", + "channel": "nightly", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1113,7 +1218,7 @@ "validation_runner": "macos-m1-stable", "installation": "https://download.pytorch.org/libtorch/nightly/cpu/libtorch-macos-arm64-latest.zip", "channel": "nightly", - "stable_version": "2.5.1" + "stable_version": "2.7.0" } ] }, @@ -1131,7 +1236,7 @@ "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1146,67 +1251,67 @@ "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.9", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.1", - "desired_cuda": "cu121", - "container_image": "pytorch/manylinux2_28-builder:cuda12.1", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "manywheel", - "build_name": "manywheel-py3_9-cuda12_1", + "build_name": "manywheel-py3_9-cuda12_6", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121", + "installation": "pip3 install torch torchvision torchaudio", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.9", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "manywheel", - "build_name": "manywheel-py3_9-cuda12_4", + "build_name": "manywheel-py3_9-cuda12_8", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install torch torchvision torchaudio", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", "channel": 
"release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.9", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.1", - "desired_cuda": "rocm6.1", - "container_image": "pytorch/manylinux2_28-builder:rocm6.1", + "gpu_arch_version": "6.2.4", + "desired_cuda": "rocm6.2.4", + "container_image": "pytorch/manylinux2_28-builder:rocm6.2.4", "package_type": "manywheel", - "build_name": "manywheel-py3_9-rocm6_1", + "build_name": "manywheel-py3_9-rocm6_2_4", "validation_runner": "linux.2xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.9", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.2", - "desired_cuda": "rocm6.2", - "container_image": "pytorch/manylinux2_28-builder:rocm6.2", + "gpu_arch_version": "6.3", + "desired_cuda": "rocm6.3", + "container_image": "pytorch/manylinux2_28-builder:rocm6.3", "package_type": "manywheel", - "build_name": "manywheel-py3_9-rocm6_2", + "build_name": "manywheel-py3_9-rocm6_3", "validation_runner": "linux.2xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1221,7 +1326,7 @@ "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1236,67 +1341,67 @@ "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.10", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.1", - "desired_cuda": "cu121", - "container_image": "pytorch/manylinux2_28-builder:cuda12.1", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "manywheel", - "build_name": "manywheel-py3_10-cuda12_1", + "build_name": "manywheel-py3_10-cuda12_6", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121", + "installation": "pip3 install torch torchvision torchaudio", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.10", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", 
"package_type": "manywheel", - "build_name": "manywheel-py3_10-cuda12_4", + "build_name": "manywheel-py3_10-cuda12_8", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install torch torchvision torchaudio", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.10", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.1", - "desired_cuda": "rocm6.1", - "container_image": "pytorch/manylinux2_28-builder:rocm6.1", + "gpu_arch_version": "6.2.4", + "desired_cuda": "rocm6.2.4", + "container_image": "pytorch/manylinux2_28-builder:rocm6.2.4", "package_type": "manywheel", - "build_name": "manywheel-py3_10-rocm6_1", + "build_name": "manywheel-py3_10-rocm6_2_4", "validation_runner": "linux.2xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.10", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.2", - "desired_cuda": "rocm6.2", - "container_image": "pytorch/manylinux2_28-builder:rocm6.2", + "gpu_arch_version": "6.3", + "desired_cuda": "rocm6.3", + "container_image": "pytorch/manylinux2_28-builder:rocm6.3", "package_type": "manywheel", - "build_name": "manywheel-py3_10-rocm6_2", + "build_name": "manywheel-py3_10-rocm6_3", "validation_runner": "linux.2xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1311,7 +1416,7 @@ "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1326,67 +1431,67 @@ "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.11", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.1", - "desired_cuda": "cu121", - "container_image": "pytorch/manylinux2_28-builder:cuda12.1", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "manywheel", - "build_name": "manywheel-py3_11-cuda12_1", + "build_name": "manywheel-py3_11-cuda12_6", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121", + "installation": "pip3 install torch torchvision torchaudio", "channel": "release", "upload_to_base_bucket": "no", 
- "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.11", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "manywheel", - "build_name": "manywheel-py3_11-cuda12_4", + "build_name": "manywheel-py3_11-cuda12_8", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install torch torchvision torchaudio", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.11", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.1", - "desired_cuda": "rocm6.1", - "container_image": "pytorch/manylinux2_28-builder:rocm6.1", + "gpu_arch_version": "6.2.4", + "desired_cuda": "rocm6.2.4", + "container_image": "pytorch/manylinux2_28-builder:rocm6.2.4", "package_type": "manywheel", - "build_name": "manywheel-py3_11-rocm6_1", + "build_name": "manywheel-py3_11-rocm6_2_4", "validation_runner": "linux.2xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.11", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.2", - "desired_cuda": "rocm6.2", - "container_image": "pytorch/manylinux2_28-builder:rocm6.2", + "gpu_arch_version": "6.3", + "desired_cuda": "rocm6.3", + "container_image": "pytorch/manylinux2_28-builder:rocm6.3", "package_type": "manywheel", - "build_name": "manywheel-py3_11-rocm6_2", + "build_name": "manywheel-py3_11-rocm6_3", "validation_runner": "linux.2xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1401,7 +1506,7 @@ "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1416,128 +1521,248 @@ "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.12", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.1", - "desired_cuda": "cu121", - "container_image": "pytorch/manylinux2_28-builder:cuda12.1", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "manywheel", - 
"build_name": "manywheel-py3_12-cuda12_1", + "build_name": "manywheel-py3_12-cuda12_6", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121", + "installation": "pip3 install torch torchvision torchaudio", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.12", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "manywheel", - "build_name": "manywheel-py3_12-cuda12_4", + "build_name": "manywheel-py3_12-cuda12_8", "validation_runner": "linux.g5.4xlarge.nvidia.gpu", - "installation": "pip3 install torch torchvision torchaudio", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.12", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.1", - "desired_cuda": "rocm6.1", - "container_image": "pytorch/manylinux2_28-builder:rocm6.1", + "gpu_arch_version": "6.2.4", + "desired_cuda": "rocm6.2.4", + "container_image": "pytorch/manylinux2_28-builder:rocm6.2.4", "package_type": "manywheel", - "build_name": "manywheel-py3_12-rocm6_1", + "build_name": "manywheel-py3_12-rocm6_2_4", "validation_runner": "linux.2xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.1", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.12", "gpu_arch_type": "rocm", - "gpu_arch_version": "6.2", - "desired_cuda": "rocm6.2", - "container_image": "pytorch/manylinux2_28-builder:rocm6.2", + "gpu_arch_version": "6.3", + "desired_cuda": "rocm6.3", + "container_image": "pytorch/manylinux2_28-builder:rocm6.3", "package_type": "manywheel", - "build_name": "manywheel-py3_12-rocm6_2", + "build_name": "manywheel-py3_12-rocm6_3", "validation_runner": "linux.2xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { + "python_version": "3.13", "gpu_arch_type": "cpu", "gpu_arch_version": "", "desired_cuda": "cpu", - "libtorch_variant": "shared-with-deps", - "libtorch_config": "", - "devtoolset": "pre-cxx11", - "container_image": "pytorch/manylinux-builder:cpu", - "package_type": "libtorch", - "build_name": "libtorch-cpu-shared-with-deps-pre-cxx11", + "container_image": "pytorch/manylinux2_28-builder:cpu", + "package_type": "manywheel", + "build_name": "manywheel-py3_13-cpu", "validation_runner": "linux.2xlarge", - "installation": 
"https://download.pytorch.org/libtorch/cpu/libtorch-shared-with-deps-2.5.1%2Bcpu.zip", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu", "channel": "release", - "stable_version": "2.5.1" + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false }, { + "python_version": "3.13", "gpu_arch_type": "cuda", "gpu_arch_version": "11.8", "desired_cuda": "cu118", - "libtorch_variant": "shared-with-deps", - "libtorch_config": "", - "devtoolset": "pre-cxx11", - "container_image": "pytorch/manylinux-builder:cuda11.8", - "package_type": "libtorch", - "build_name": "libtorch-cuda11_8-shared-with-deps-pre-cxx11", - "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/cu118/libtorch-shared-with-deps-2.5.1%2Bcu118.zip", + "container_image": "pytorch/manylinux2_28-builder:cuda11.8", + "package_type": "manywheel", + "build_name": "manywheel-py3_13-cuda11_8", + "validation_runner": "linux.g5.4xlarge.nvidia.gpu", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", "channel": "release", - "stable_version": "2.5.1" + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false }, { + "python_version": "3.13", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.1", - "desired_cuda": "cu121", - "libtorch_variant": "shared-with-deps", - "libtorch_config": "", - "devtoolset": "pre-cxx11", - "container_image": "pytorch/manylinux-builder:cuda12.1", - "package_type": "libtorch", - "build_name": "libtorch-cuda12_1-shared-with-deps-pre-cxx11", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "package_type": "manywheel", + "build_name": "manywheel-py3_13-cuda12_6", + "validation_runner": "linux.g5.4xlarge.nvidia.gpu", + "installation": "pip3 install torch torchvision torchaudio", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13", + "gpu_arch_type": "cuda", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", + "package_type": "manywheel", + "build_name": "manywheel-py3_13-cuda12_8", + "validation_runner": "linux.g5.4xlarge.nvidia.gpu", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13", + "gpu_arch_type": "rocm", + "gpu_arch_version": "6.2.4", + "desired_cuda": "rocm6.2.4", + "container_image": "pytorch/manylinux2_28-builder:rocm6.2.4", + "package_type": "manywheel", + "build_name": "manywheel-py3_13-rocm6_2_4", + "validation_runner": "linux.2xlarge", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13", + "gpu_arch_type": "rocm", + "gpu_arch_version": "6.3", + "desired_cuda": "rocm6.3", + "container_image": "pytorch/manylinux2_28-builder:rocm6.3", + "package_type": "manywheel", + "build_name": "manywheel-py3_13-rocm6_3", + 
"validation_runner": "linux.2xlarge", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "cpu", + "gpu_arch_version": "", + "desired_cuda": "cpu", + "container_image": "pytorch/manylinux2_28-builder:cpu", + "package_type": "manywheel", + "build_name": "manywheel-py3_13t-cpu", "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/cu121/libtorch-shared-with-deps-2.5.1%2Bcu121.zip", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu", "channel": "release", - "stable_version": "2.5.1" + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false }, { + "python_version": "3.13t", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "libtorch_variant": "shared-with-deps", - "libtorch_config": "", - "devtoolset": "pre-cxx11", - "container_image": "pytorch/manylinux-builder:cuda12.4", - "package_type": "libtorch", - "build_name": "libtorch-cuda12_4-shared-with-deps-pre-cxx11", + "gpu_arch_version": "11.8", + "desired_cuda": "cu118", + "container_image": "pytorch/manylinux2_28-builder:cuda11.8", + "package_type": "manywheel", + "build_name": "manywheel-py3_13t-cuda11_8", + "validation_runner": "linux.g5.4xlarge.nvidia.gpu", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "cuda", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "package_type": "manywheel", + "build_name": "manywheel-py3_13t-cuda12_6", + "validation_runner": "linux.g5.4xlarge.nvidia.gpu", + "installation": "pip3 install torch torchvision torchaudio", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "cuda", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", + "package_type": "manywheel", + "build_name": "manywheel-py3_13t-cuda12_8", + "validation_runner": "linux.g5.4xlarge.nvidia.gpu", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "rocm", + "gpu_arch_version": "6.2.4", + "desired_cuda": "rocm6.2.4", + "container_image": "pytorch/manylinux2_28-builder:rocm6.2.4", + "package_type": "manywheel", + "build_name": "manywheel-py3_13t-rocm6_2_4", "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/cu124/libtorch-shared-with-deps-2.5.1%2Bcu124.zip", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.2.4", "channel": "release", - "stable_version": "2.5.1" + "upload_to_base_bucket": 
"no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "rocm", + "gpu_arch_version": "6.3", + "desired_cuda": "rocm6.3", + "container_image": "pytorch/manylinux2_28-builder:rocm6.3", + "package_type": "manywheel", + "build_name": "manywheel-py3_13t-rocm6_3", + "validation_runner": "linux.2xlarge", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false }, { "gpu_arch_type": "cpu", @@ -1550,9 +1775,9 @@ "package_type": "libtorch", "build_name": "libtorch-cpu-shared-with-deps-cxx11-abi", "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Bcpu.zip", + "installation": "https://download.pytorch.org/libtorch/cpu/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcpu.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", @@ -1565,69 +1790,69 @@ "package_type": "libtorch", "build_name": "libtorch-cuda11_8-shared-with-deps-cxx11-abi", "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Bcu118.zip", + "installation": "https://download.pytorch.org/libtorch/cu118/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcu118.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", - "gpu_arch_version": "12.1", - "desired_cuda": "cu121", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", "libtorch_variant": "shared-with-deps", "libtorch_config": "", "devtoolset": "cxx11-abi", - "container_image": "pytorch/libtorch-cxx11-builder:cuda12.1", + "container_image": "pytorch/libtorch-cxx11-builder:cuda12.6", "package_type": "libtorch", - "build_name": "libtorch-cuda12_1-shared-with-deps-cxx11-abi", + "build_name": "libtorch-cuda12_6-shared-with-deps-cxx11-abi", "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/cu121/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Bcu121.zip", + "installation": "https://download.pytorch.org/libtorch/cu126/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcu126.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", "libtorch_variant": "shared-with-deps", "libtorch_config": "", "devtoolset": "cxx11-abi", - "container_image": "pytorch/libtorch-cxx11-builder:cuda12.4", + "container_image": "pytorch/libtorch-cxx11-builder:cuda12.8", "package_type": "libtorch", - "build_name": "libtorch-cuda12_4-shared-with-deps-cxx11-abi", + "build_name": "libtorch-cuda12_8-shared-with-deps-cxx11-abi", "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/cu124/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Bcu124.zip", + "installation": "https://download.pytorch.org/libtorch/cu128/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Bcu128.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "rocm", - 
"gpu_arch_version": "6.1", - "desired_cuda": "rocm6.1", + "gpu_arch_version": "6.2.4", + "desired_cuda": "rocm6.2.4", "libtorch_variant": "shared-with-deps", "libtorch_config": "", "devtoolset": "cxx11-abi", - "container_image": "pytorch/libtorch-cxx11-builder:rocm6.1", + "container_image": "pytorch/libtorch-cxx11-builder:rocm6.2.4", "package_type": "libtorch", - "build_name": "libtorch-rocm6_1-shared-with-deps-cxx11-abi", + "build_name": "libtorch-rocm6_2_4-shared-with-deps-cxx11-abi", "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/rocm6.1/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Brocm6.1.zip", + "installation": "https://download.pytorch.org/libtorch/rocm6.2.4/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Brocm6.2.4.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "rocm", - "gpu_arch_version": "6.2", - "desired_cuda": "rocm6.2", + "gpu_arch_version": "6.3", + "desired_cuda": "rocm6.3", "libtorch_variant": "shared-with-deps", "libtorch_config": "", "devtoolset": "cxx11-abi", - "container_image": "pytorch/libtorch-cxx11-builder:rocm6.2", + "container_image": "pytorch/libtorch-cxx11-builder:rocm6.3", "package_type": "libtorch", - "build_name": "libtorch-rocm6_2-shared-with-deps-cxx11-abi", + "build_name": "libtorch-rocm6_3-shared-with-deps-cxx11-abi", "validation_runner": "linux.2xlarge", - "installation": "https://download.pytorch.org/libtorch/rocm6.2/libtorch-cxx11-abi-shared-with-deps-2.5.1%2Brocm6.2.zip", + "installation": "https://download.pytorch.org/libtorch/rocm6.3/libtorch-cxx11-abi-shared-with-deps-2.7.0%2Brocm6.3.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" } ], "windows": [ @@ -1643,7 +1868,7 @@ "installation": "pip3 install torch torchvision torchaudio", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1658,37 +1883,37 @@ "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.9", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.1", - "desired_cuda": "cu121", - "container_image": "pytorch/manylinux2_28-builder:cuda12.1", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "wheel", - "build_name": "wheel-py3_9-cuda12_1", + "build_name": "wheel-py3_9-cuda12_6", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.9", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "wheel", - "build_name": 
"wheel-py3_9-cuda12_4", + "build_name": "wheel-py3_9-cuda12_8", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1703,7 +1928,7 @@ "installation": "pip3 install torch torchvision torchaudio", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1718,37 +1943,37 @@ "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.10", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.1", - "desired_cuda": "cu121", - "container_image": "pytorch/manylinux2_28-builder:cuda12.1", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "wheel", - "build_name": "wheel-py3_10-cuda12_1", + "build_name": "wheel-py3_10-cuda12_6", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.10", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "wheel", - "build_name": "wheel-py3_10-cuda12_4", + "build_name": "wheel-py3_10-cuda12_8", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1763,7 +1988,7 @@ "installation": "pip3 install torch torchvision torchaudio", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1778,37 +2003,37 @@ "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.11", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.1", - "desired_cuda": "cu121", - "container_image": "pytorch/manylinux2_28-builder:cuda12.1", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": 
"pytorch/manylinux2_28-builder:cuda12.6", "package_type": "wheel", - "build_name": "wheel-py3_11-cuda12_1", + "build_name": "wheel-py3_11-cuda12_6", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.11", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "wheel", - "build_name": "wheel-py3_11-cuda12_4", + "build_name": "wheel-py3_11-cuda12_8", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1823,7 +2048,7 @@ "installation": "pip3 install torch torchvision torchaudio", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1838,37 +2063,157 @@ "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.12", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.1", - "desired_cuda": "cu121", - "container_image": "pytorch/manylinux2_28-builder:cuda12.1", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", "package_type": "wheel", - "build_name": "wheel-py3_12-cuda12_1", + "build_name": "wheel-py3_12-cuda12_6", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { "python_version": "3.12", "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", - "container_image": "pytorch/manylinux2_28-builder:cuda12.4", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", "package_type": "wheel", - "build_name": "wheel-py3_12-cuda12_4", + "build_name": "wheel-py3_12-cuda12_8", "validation_runner": "windows.g4dn.xlarge", - "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", 
"channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13", + "gpu_arch_type": "cpu", + "gpu_arch_version": "", + "desired_cuda": "cpu", + "container_image": "pytorch/manylinux2_28-builder:cpu", + "package_type": "wheel", + "build_name": "wheel-py3_13-cpu", + "validation_runner": "windows.4xlarge", + "installation": "pip3 install torch torchvision torchaudio", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13", + "gpu_arch_type": "cuda", + "gpu_arch_version": "11.8", + "desired_cuda": "cu118", + "container_image": "pytorch/manylinux2_28-builder:cuda11.8", + "package_type": "wheel", + "build_name": "wheel-py3_13-cuda11_8", + "validation_runner": "windows.g4dn.xlarge", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13", + "gpu_arch_type": "cuda", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "package_type": "wheel", + "build_name": "wheel-py3_13-cuda12_6", + "validation_runner": "windows.g4dn.xlarge", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13", + "gpu_arch_type": "cuda", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", + "package_type": "wheel", + "build_name": "wheel-py3_13-cuda12_8", + "validation_runner": "windows.g4dn.xlarge", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "cpu", + "gpu_arch_version": "", + "desired_cuda": "cpu", + "container_image": "pytorch/manylinux2_28-builder:cpu", + "package_type": "wheel", + "build_name": "wheel-py3_13t-cpu", + "validation_runner": "windows.4xlarge", + "installation": "pip3 install torch torchvision torchaudio", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "cuda", + "gpu_arch_version": "11.8", + "desired_cuda": "cu118", + "container_image": "pytorch/manylinux2_28-builder:cuda11.8", + "package_type": "wheel", + "build_name": "wheel-py3_13t-cuda11_8", + "validation_runner": "windows.g4dn.xlarge", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "cuda", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", + "container_image": "pytorch/manylinux2_28-builder:cuda12.6", + "package_type": "wheel", + "build_name": "wheel-py3_13t-cuda12_6", + "validation_runner": "windows.g4dn.xlarge", + "installation": 
"pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "cuda", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", + "container_image": "pytorch/manylinux2_28-builder:cuda12.8", + "package_type": "wheel", + "build_name": "wheel-py3_13t-cuda12_8", + "validation_runner": "windows.g4dn.xlarge", + "installation": "pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -1882,9 +2227,9 @@ "package_type": "libtorch", "build_name": "libtorch-cpu-shared-with-deps-release", "validation_runner": "windows.4xlarge", - "installation": "https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-2.5.1%2Bcpu.zip", + "installation": "https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-2.7.0%2Bcpu.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", @@ -1897,39 +2242,39 @@ "package_type": "libtorch", "build_name": "libtorch-cuda11_8-shared-with-deps-release", "validation_runner": "windows.4xlarge", - "installation": "https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-2.5.1%2Bcu118.zip", + "installation": "https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-2.7.0%2Bcu118.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", - "gpu_arch_version": "12.1", - "desired_cuda": "cu121", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", "libtorch_variant": "shared-with-deps", "libtorch_config": "release", "devtoolset": "", "container_image": "", "package_type": "libtorch", - "build_name": "libtorch-cuda12_1-shared-with-deps-release", + "build_name": "libtorch-cuda12_6-shared-with-deps-release", "validation_runner": "windows.4xlarge", - "installation": "https://download.pytorch.org/libtorch/cu121/libtorch-win-shared-with-deps-2.5.1%2Bcu121.zip", + "installation": "https://download.pytorch.org/libtorch/cu126/libtorch-win-shared-with-deps-2.7.0%2Bcu126.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", "libtorch_variant": "shared-with-deps", "libtorch_config": "release", "devtoolset": "", "container_image": "", "package_type": "libtorch", - "build_name": "libtorch-cuda12_4-shared-with-deps-release", + "build_name": "libtorch-cuda12_8-shared-with-deps-release", "validation_runner": "windows.4xlarge", - "installation": "https://download.pytorch.org/libtorch/cu124/libtorch-win-shared-with-deps-2.5.1%2Bcu124.zip", + "installation": "https://download.pytorch.org/libtorch/cu128/libtorch-win-shared-with-deps-2.7.0%2Bcu128.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cpu", @@ -1942,9 +2287,9 @@ "package_type": "libtorch", "build_name": "libtorch-cpu-shared-with-deps-debug", "validation_runner": 
"windows.4xlarge", - "installation": "https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-debug-2.5.1%2Bcpu.zip", + "installation": "https://download.pytorch.org/libtorch/cpu/libtorch-win-shared-with-deps-debug-2.7.0%2Bcpu.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", @@ -1957,39 +2302,39 @@ "package_type": "libtorch", "build_name": "libtorch-cuda11_8-shared-with-deps-debug", "validation_runner": "windows.4xlarge", - "installation": "https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-debug-2.5.1%2Bcu118.zip", + "installation": "https://download.pytorch.org/libtorch/cu118/libtorch-win-shared-with-deps-debug-2.7.0%2Bcu118.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", - "gpu_arch_version": "12.1", - "desired_cuda": "cu121", + "gpu_arch_version": "12.6", + "desired_cuda": "cu126", "libtorch_variant": "shared-with-deps", "libtorch_config": "debug", "devtoolset": "", "container_image": "", "package_type": "libtorch", - "build_name": "libtorch-cuda12_1-shared-with-deps-debug", + "build_name": "libtorch-cuda12_6-shared-with-deps-debug", "validation_runner": "windows.4xlarge", - "installation": "https://download.pytorch.org/libtorch/cu121/libtorch-win-shared-with-deps-debug-2.5.1%2Bcu121.zip", + "installation": "https://download.pytorch.org/libtorch/cu126/libtorch-win-shared-with-deps-debug-2.7.0%2Bcu126.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" }, { "gpu_arch_type": "cuda", - "gpu_arch_version": "12.4", - "desired_cuda": "cu124", + "gpu_arch_version": "12.8", + "desired_cuda": "cu128", "libtorch_variant": "shared-with-deps", "libtorch_config": "debug", "devtoolset": "", "container_image": "", "package_type": "libtorch", - "build_name": "libtorch-cuda12_4-shared-with-deps-debug", + "build_name": "libtorch-cuda12_8-shared-with-deps-debug", "validation_runner": "windows.4xlarge", - "installation": "https://download.pytorch.org/libtorch/cu124/libtorch-win-shared-with-deps-debug-2.5.1%2Bcu124.zip", + "installation": "https://download.pytorch.org/libtorch/cu128/libtorch-win-shared-with-deps-debug-2.7.0%2Bcu128.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" } ], "macos": [ @@ -2005,7 +2350,7 @@ "installation": "pip3 install torch torchvision torchaudio", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -2020,7 +2365,7 @@ "installation": "pip3 install torch torchvision torchaudio", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -2035,7 +2380,7 @@ "installation": "pip3 install torch torchvision torchaudio", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -2050,7 +2395,37 @@ "installation": "pip3 install torch torchvision torchaudio", "channel": "release", "upload_to_base_bucket": "no", - "stable_version": "2.5.1", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13", + "gpu_arch_type": "cpu", + "gpu_arch_version": "", + "desired_cuda": "cpu", + 
"container_image": "pytorch/manylinux2_28-builder:cpu", + "package_type": "wheel", + "build_name": "wheel-py3_13-cpu", + "validation_runner": "macos-m1-stable", + "installation": "pip3 install torch torchvision torchaudio", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", + "use_split_build": false + }, + { + "python_version": "3.13t", + "gpu_arch_type": "cpu", + "gpu_arch_version": "", + "desired_cuda": "cpu", + "container_image": "pytorch/manylinux2_28-builder:cpu", + "package_type": "wheel", + "build_name": "wheel-py3_13t-cpu", + "validation_runner": "macos-m1-stable", + "installation": "pip3 install torch torchvision torchaudio", + "channel": "release", + "upload_to_base_bucket": "no", + "stable_version": "2.7.0", "use_split_build": false }, { @@ -2064,9 +2439,9 @@ "package_type": "libtorch", "build_name": "libtorch-cpu-shared-with-deps-cxx11-abi", "validation_runner": "macos-m1-stable", - "installation": "https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.5.1.zip", + "installation": "https://download.pytorch.org/libtorch/cpu/libtorch-macos-arm64-2.7.0.zip", "channel": "release", - "stable_version": "2.5.1" + "stable_version": "2.7.0" } ] } diff --git a/scripts/gen_quick_start_module.py b/scripts/gen_quick_start_module.py index a3ceb1a35bc3..5fd20d79949f 100755 --- a/scripts/gen_quick_start_module.py +++ b/scripts/gen_quick_start_module.py @@ -157,8 +157,10 @@ def update_versions(versions, release_matrix, release_version): if x["libtorch_variant"] == "shared-with-deps" } if instr["versions"] is not None: - for ver in [PRE_CXX11_ABI, CXX11_ABI]: - if gpu_arch_type == "rocm" and ver == PRE_CXX11_ABI: + for ver in [CXX11_ABI, PRE_CXX11_ABI]: + # temporarily remove setting pre-cxx11-abi. For Release 2.7 we + # should remove pre-cxx11-abi completely. + if ver == PRE_CXX11_ABI: continue else: instr["versions"][LIBTORCH_DWNL_INSTR[ver]] = ( diff --git a/tac.html b/tac.html index 10844f1af8da..171e68422fb2 100644 --- a/tac.html +++ b/tac.html @@ -81,6 +81,20 @@

    Milos Puzovic, Arm

    +
    +
    + Ricardo Aravena +
    +
    +

    Ricardo Aravena, Snowflake

    +

    + Ricardo is a seasoned technology leader with over two decades of experience across the enterprise and startup landscape. He works at Snowflake as a Cloud Infrastructure and Open Source Lead, focusing on automating AI/ML infrastructure using cloud-native technologies at scale. A passionate open source advocate, Ricardo also serves as a shadow member of the CNCF Technical Oversight Committee and the CNCF AI Sub-project, where he helps shape the future of AI computing infrastructure. +

    + Throughout his career, Ricardo has held key engineering and leadership roles at major companies such as Rakuten, Cisco, and VMware, as well as at innovative startups including Truera, Branch Metrics, Coupa, HyTrust, Exablox, and SnapLogic. He's committed to community-driven innovation and regularly contributes to industry initiatives that bridge the gap between open source communities and enterprise adoption. +
    +

    +
    +
    Shauheen Zahirazami
@@ -103,6 +117,10 @@

    Shauheen Zahirazami, Google

    Yikun Jiang, Huawei

    +

    Yikun Jiang is a principal software engineer on Huawei's computing open source development team, working on multi-arch and heterogeneous hardware support and on improving projects across the computing area. He has more than 10 years of experience in the field and leads an active, creative R&D team under the principle of “upstream first”, which aims to make diverse computing power ubiquitous.
    +
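
The release-matrix entries bumped above all share the same shape: each OS key maps to a list of build records carrying the accelerator ("desired_cuda"), the package type, and the exact install command. As a minimal sketch, assuming that structure (the file name and helper function below are illustrative only, not the repository's tooling), the command for a given platform could be looked up like this:

# Minimal sketch, assuming the structure shown in the diff above:
# a JSON object whose top-level keys are OS names ("linux", "windows",
# "macos"), each mapping to a list of build entries with fields such as
# "package_type", "desired_cuda", "python_version", and "installation".
# The file name and helper below are illustrative, not the repo's tooling.
import json


def find_install_command(path, os_name, package_type, desired_cuda, python_version=None):
    """Return the 'installation' string of the first matching entry, or None."""
    with open(path) as f:
        matrix = json.load(f)

    for entry in matrix.get(os_name, []):
        if entry.get("package_type") != package_type:
            continue
        if entry.get("desired_cuda") != desired_cuda:
            continue
        if python_version is not None and entry.get("python_version") != python_version:
            continue
        return entry.get("installation")
    return None


if __name__ == "__main__":
    # e.g. the CUDA 12.8 wheel entry for Python 3.13 on Windows added above
    print(find_install_command("release_matrix.json", "windows", "wheel", "cu128", "3.13"))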