
Commit dac6408

Updates to "Introducing torchchat" blog post
Signed-off-by: Chris Abraham <[email protected]>
1 parent 41db0ae commit dac6408

_posts/2024-07-30-torchchat-local-llm-inference.md

Lines changed: 11 additions & 9 deletions
@@ -5,25 +5,24 @@ title: "Introducing torchchat: Accelerating Local LLM Inference on Laptop, Deskt
 
 Today, we’re releasing [torchchat](https://github.com/pytorch/torchchat), a library showcasing how to seamlessly and performantly run Llama 3, 3.1, and other large language models across laptop, desktop, and mobile.
 
-In our previous blog posts, we [showed](https://pytorch.org/blog/accelerating-generative-ai-2/) how to use native PyTorch 2.0 to run LLMs with great performance using CUDA. Torchchat expands on this with more target environments, models and execution modes as well as providing important functions such as export, quantization and export in a way that’s easy to understand.
+In our previous blog posts, we [showed](https://pytorch.org/blog/accelerating-generative-ai-2/) how to use native PyTorch 2 to run LLMs with great performance using CUDA. Torchchat expands on this with more target environments, models and execution modes. Additionally it provides important functions such as export, quantization and eval in a way that’s easy to understand providing an E2E story for those who want to build a local inference solution.
 
 You will find the project organized into three areas:
 
+
 * Python: Torchchat provides a [REST API](https://github.com/pytorch/torchchat?tab=readme-ov-file#server) that is called via a Python CLI or can be accessed via the browser
 * C++: Torchchat produces a desktop-friendly binary using PyTorch's [AOTInductor](https://pytorch-dev-podcast.simplecast.com/episodes/aotinductor) backend
 * Mobile devices: Torchchat uses [ExecuTorch](https://pytorch.org/executorch/stable/index.html) to export a .pte binary file for on-device inference
 
-
 ![torchchat schema](/assets/images/torchchat.png){:style="width:100%"}
 
 
 ## Performance
 
-The following table tracks the performance of torchchat for Llama 3 for a variety of configurations.
-
+The following table tracks the performance of torchchat for Llama 3 for a variety of configurations.
 _Numbers for Llama 3.1 are coming soon._
 
-**Llama 3 8B Instruct on Apple MacBook Pro M1 Max 64GB**
+**Llama 3 8B Instruct on Apple MacBook Pro M1 Max 64GB Laptop**
 
 
 <table class="table table-bordered">
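
The three bullets in the hunk above map onto torchchat CLI entry points. A minimal sketch of the Python path, based on the subcommands documented in the torchchat README (the `llama3` model alias and exact flag spellings are assumptions; check the README for current syntax):

```bash
# Fetch weights (assumes a Hugging Face token with Llama access is configured)
python3 torchchat.py download llama3

# Interactive chat in the terminal
python3 torchchat.py chat llama3

# One-shot generation
python3 torchchat.py generate llama3 --prompt "Explain KV caching in one paragraph"

# Start the REST server from the "Python" bullet, then open the browser UI
python3 torchchat.py server llama3
python3 torchchat.py browser llama3
```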
@@ -98,8 +97,7 @@ _Numbers for Llama 3.1 are coming soon._
 </table>
 
 
-**Llama 3 8B Instruct on Linux x86 and CUDA**
-
+**Llama 3 8B Instruct on Linux x86 and CUDA**
 _Intel(R) Xeon(R) Platinum 8339HC CPU @ 1.80GHz with 180GB Ram + A100 (80GB)_
 
 
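
The surrounding tables track performance across a variety of configurations. A hedged sketch of a CUDA run from the CLI, assuming the `--device`, `--compile`, and `--quantize` flags described in the torchchat README (the inline quantization JSON schema is an assumption; the repo ships ready-made config files):

```bash
# Compiled CUDA run (flag spellings per the torchchat README; verify before use)
python3 torchchat.py generate llama3 \
  --device cuda \
  --compile \
  --prompt "Hello, my name is"

# Same run with 8-bit weight-only quantization via an inline config
# (the {"linear:int8": ...} schema is an assumption; see the repo's quantization docs)
python3 torchchat.py generate llama3 \
  --device cuda \
  --compile \
  --quantize '{"linear:int8": {"groupsize": 256}}' \
  --prompt "Hello, my name is"
```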
@@ -156,6 +154,10 @@ _Intel(R) Xeon(R) Platinum 8339HC CPU @ 1.80GHz with 180GB Ram + A100 (80GB)_
 </table>
 
 
-Torchchat provides exceptional performance for Llama 3 8B on mobile (iPhone and Android). We run Llama 2 7B on Samsung Galaxy S22, and S23, and on iPhone 15 Pro using 4-bit GPTQ and post training quantization (PTQ). Early work on Llama 3 8B support is included in collaboration with ExecuTorch. Many improvements were made to export speed, memory overhead, and runtime speed. Ultimately, though, we’ll be seeing even stronger performance through Core ML, MPS, and HTP in the near future. We are excited!
+**Llama3 8B Instruct on Mobile**
+Torchchat achieves > 8T/s on the Samsung Galaxy S23 and iPhone using 4-bit GPTQ via ExecuTorch.
+
+
+## Conclusion
 
-We encourage you to **[clone the torchchat repo and give it a spin](https://github.com/pytorch/torchchat)**, explore its capabilities, and share your feedback as we continue to empower the PyTorch community to run LLMs locally and on constrained devices. Together, let's unlock the full potential of generative AI and LLMs on any device. Please submit [issues](https://github.com/pytorch/torchat/issues) as you see them as well as in [PyTorch](https://github.com/pytorch/pytorch/issues) plus [ExecuTorch](https://github.com/pytorch/executorch/issues), since we are still iterating quickly. We’re also inviting community contributions across a broad range of areas, from additional models, target hardware support, new quantization schemes, or performance improvements. Happy experimenting!
+We encourage you to **[clone the torchchat repo and give it a spin](https://github.com/pytorch/torchchat)**, explore its capabilities, and share your feedback as we continue to empower the PyTorch community to run LLMs locally and on constrained devices. Together, let's unlock the full potential of generative AI and LLMs on any device. Please submit [issues](https://github.com/pytorch/torchat/issues) as you see them, since we are still iterating quickly. We’re also inviting community contributions across a broad range of areas, from additional models, target hardware support, new quantization schemes, or performance improvements. Happy experimenting!
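
The mobile number added in the hunk above comes from exporting the model through ExecuTorch and running the resulting artifact on device. A hedged sketch of the export step, assuming the `export` subcommand and flags documented in the torchchat README (the quantization config path is a hypothetical stand-in for the repo's shipped mobile config):

```bash
# Export a .pte file for on-device inference via ExecuTorch
# (config/data/mobile.json is a hypothetical path; use the config shipped in the repo)
python3 torchchat.py export llama3 \
  --quantize config/data/mobile.json \
  --output-pte-path llama3.pte

# Smoke-test the exported artifact on the host before deploying to a phone
python3 torchchat.py generate llama3 \
  --pte-path llama3.pte \
  --prompt "Hello"
```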
