With DOCA 2.7 released, explore the new Remote Direct Memory Access (RDMA) functionality controlled from a GPU CUDA kernel with DOCA GPUNetIO, along with a performance comparison against the perftest microbenchmarks. https://nvda.ws/45nGovU
-
Reminder that the deadline for proposals for GPU Zen 3 is December 3rd. “After the tremendous success of the ShaderX, the GPU Pro, and the GPU Zen book series, we are looking for authors for GPU Zen 3. The upcoming book will cover advanced rendering techniques and newer applications for the GPU with any API available. It can include topics on:
- GPU Work Generation techniques
- Geometry Manipulation, Level of Detail, and Compression
- Specific Mobile Devices Techniques
- Image Space Techniques
- Shadows, Lighting and Baking
- 3D Game Engine Design
- Tools
- General Purpose GPU compute
- Machine Learning assisted algorithms
- Real-time Ray Tracing, Path Tracing, Denoising, Sampling, Light Caching
- New Materials, Appearances, and Effects
- Neural graphics, neural representations
- User-generated and AI-assisted content
- Simulation and Procedurals”
#GPU #graphics #gamedeveloper #machinelearning #raytracing #pathtracing #denoising #neural #AI
Call for Authors: GPU Zen 3 https://lnkd.in/gCas94U
GPU Zen
gpuzen.blogspot.com
-
Excited to share the next chapter in our journey with Weaviate 🔍 Continuing our exploration of local vectorization with Weaviate vector databases, I'm happy to share, as promised, a new blog post that compares Weaviate's import performance across different CPU and GPU environments 💻. Dive into the technical nuances 🧠 and discover how GPU acceleration can supercharge your RAG workflows 🔥. Read the full post https://lnkd.in/dz_Z9nTU #aicommunity #rag #artificialintelligence #machinelearning #datascience #dataanalytics #vectordatabase #vectordb #vectorsearch #vectordatabases
Exploring Weaviate’s Import Performance: GPU vs CPU
medium.com
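As a rough illustration of the kind of import loop being benchmarked, here is a minimal sketch using the Weaviate Python client (v4) against a local instance; the "Article" collection, document fields, and embedding size are illustrative assumptions, not the blog's actual setup.

```python
import weaviate

# Stand-in documents with precomputed, locally vectorized embeddings;
# producing these on CPU vs. GPU is what the benchmark compares.
docs = [
    {"title": "Doc 1", "body": "...", "vector": [0.1] * 384},
    {"title": "Doc 2", "body": "...", "vector": [0.2] * 384},
]

client = weaviate.connect_to_local()  # assumes a local Weaviate instance
try:
    articles = client.collections.get("Article")  # assumed collection name
    # Dynamic batching sizes requests automatically, so import throughput
    # is dominated by how fast the embeddings themselves are produced.
    with articles.batch.dynamic() as batch:
        for d in docs:
            batch.add_object(
                properties={"title": d["title"], "body": d["body"]},
                vector=d["vector"],
            )
finally:
    client.close()
```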
-
TL;DR: Imagen in JAX, optimized for GPUs. We have released Imagen in the NVIDIA JAX Toolbox (https://lnkd.in/g37f8jCa), our first multimodal generative model in JAX for creating high-fidelity images from text prompts. Our implementation is optimized for GPUs and is one of the only OSS implementations that supports running a GPU inference server to offload the computation of text embeddings. This keeps training of the core diffusion model highly efficient without needing petabytes of storage for precomputed text embeddings. We provide pre-built containers and push-button scripts to train both the base and super-resolution models from scratch. DeepFloyd & SDXL inference and more coming soon. Stay tuned! https://lnkd.in/g37f8jCa
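A minimal sketch of the offloading pattern described above, assuming a hypothetical HTTP embedding server at EMBED_URL (the endpoint and payload shape are assumptions, not the toolbox's actual API): each training step fetches text embeddings on demand rather than reading precomputed ones from disk.

```python
import jax
import jax.numpy as jnp
import requests

EMBED_URL = "http://localhost:8080/embed"  # hypothetical embedding-server endpoint

def fetch_text_embeddings(prompts):
    # Offload text encoding to a separate GPU inference server instead of
    # precomputing and storing embeddings for the entire dataset.
    resp = requests.post(EMBED_URL, json={"prompts": prompts})
    resp.raise_for_status()
    return jnp.asarray(resp.json()["embeddings"])  # (batch, seq, dim)

@jax.jit
def train_step(scale, images, text_emb):
    # Stand-in for the real denoising loss; shows server-provided
    # embeddings flowing straight into the jitted training step.
    pred = images * scale
    return jnp.mean((pred - text_emb.mean()) ** 2)

emb = fetch_text_embeddings(["a corgi riding a skateboard"])
loss = train_step(jnp.ones(()), jnp.zeros((1, 64, 64, 3)), emb)
```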
-
> A 32GB GPU typically constrained cuGraph to graph sizes up to 500 million edges.

Scaling that linearly works out to roughly 125M edges per 8GB of GPU memory, or about 250M edges at 16GB... https://lnkd.in/gKJaJCyt
Tackling Large Graphs with RAPIDS cuGraph and CUDA Unified Memory on GPUs
medium.com
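For context, the unified-memory approach the article describes can be enabled from Python by configuring RMM with managed memory, letting cuGraph oversubscribe device memory. This is a minimal sketch; the edge-list file and column names are assumptions.

```python
import rmm
import cudf
import cugraph

# Use CUDA Unified (managed) Memory so allocations can exceed physical
# GPU memory and migrate between device and host on demand.
rmm.reinitialize(managed_memory=True)

# "edges.csv" with src/dst columns is an illustrative assumption.
edges = cudf.read_csv("edges.csv", names=["src", "dst"], dtype=["int64", "int64"])

G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

# Analytics over a graph that may be larger than device memory.
scores = cugraph.pagerank(G)
```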
-
Improving DistilBERT Inference Time Using ONNX on CPU and GPU
Accelerating inference time for the DistilBERT language model with ONNX.
I optimized the inference time of the DistilBERT model using PyTorch, PyTorch's JIT, and ONNX across CPU and GPU. After thorough profiling, the ONNX model consistently outperformed both the standard and traced PyTorch models. ONNX Runtime's graph optimizations and hardware acceleration delivered the best inference times, making ONNX the most effective format for deploying DistilBERT on both CPU and GPU. Source code: https://lnkd.in/dzVn7hx5
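A minimal sketch of the export-and-run flow (the checkpoint, opset, and file name are assumptions; see the linked repo for the actual code):

```python
import torch
import onnxruntime as ort
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "distilbert-base-uncased-finetuned-sst-2-english"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

inputs = tokenizer("ONNX speeds up inference.", return_tensors="pt")
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "distilbert.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=17,
)

# ONNX Runtime applies graph optimizations at session creation;
# swap in "CUDAExecutionProvider" to run the same model on GPU.
sess = ort.InferenceSession("distilbert.onnx", providers=["CPUExecutionProvider"])
logits = sess.run(["logits"], {
    "input_ids": inputs["input_ids"].numpy(),
    "attention_mask": inputs["attention_mask"].numpy(),
})[0]
```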
-
🤖 By design, #LLMs are large and require a high number of #GPUs to fine-tune. 🔧 But developers often seek to tailor these #LanguageModels to specific use cases and applications, fine-tuning them for better performance. 🚀 This blog from PyTorch demonstrates how to fine-tune a 7B-parameter model on a typical consumer GPU (NVIDIA T4, 16GB) with #LoRA and tools from the PyTorch and Hugging Face ecosystems, with a complete, reproducible Google Colab notebook. 👉 Check it out:
Finetune LLMs on your own consumer hardware using tools from PyTorch and Hugging Face ecosystem
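A minimal sketch of the LoRA recipe, using the transformers + peft libraries; the base checkpoint, rank, and target modules below are illustrative assumptions, not the blog's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"  # assumed 7B base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit a 16GB T4
    device_map="auto",
)

# LoRA trains small low-rank adapter matrices instead of all 7B weights.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```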
-
Check this out: you can pass packets directly from the network to the GPU and do high-performance packet processing on the GPU with minimal intervention from the CPU.
Realizing the Power of Real-Time Network Processing with NVIDIA DOCA GPUNetIO | NVIDIA Technical Blog
developer.nvidia.com
-
Every day something new and groundbreaking drops; the power of open source, I guess 🔥. A new paper and implementation of an inference engine claims to massively speed up LLMs on consumer-grade CPUs/GPUs by exploiting the fact that not all of an LLM's neurons are used for every input. The paper states that neuron activations follow a power-law distribution: only a small set of hot neurons is used frequently, so those can be moved to the GPU while the cold ones stay on the CPU. https://lnkd.in/g4mS9XeB
GitHub - SJTU-IPADS/PowerInfer: High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
github.com
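To make the idea concrete, here is a toy sketch (not PowerInfer's actual code) of the hot/cold split: rank neurons by profiled activation frequency, keep the frequently firing ones on the GPU, and leave the long tail in CPU memory.

```python
import torch

num_neurons = 4096
activation_counts = torch.randint(0, 1000, (num_neurons,))  # stand-in profiling data

# Power-law-style skew: a small set of "hot" neurons fires most often.
k = int(num_neurons * 0.2)
hot_idx = torch.topk(activation_counts, k).indices

weights = torch.randn(num_neurons, 1024)
hot_weights = weights[hot_idx].to("cuda")  # hot neurons live on the GPU
cold_mask = torch.ones(num_neurons, dtype=torch.bool)
cold_mask[hot_idx] = False
cold_weights = weights[cold_mask]          # cold tail stays in CPU RAM

def forward(x: torch.Tensor) -> torch.Tensor:
    # x: per-neuron inputs on the CPU, shape (num_neurons,)
    hot = x[hot_idx].to("cuda") @ hot_weights  # fast path on the GPU
    cold = x[cold_mask] @ cold_weights          # infrequent path on the CPU
    return hot + cold.to("cuda")

out = forward(torch.randn(num_neurons))
```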
-
Paper: The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU https://lnkd.in/dWxwEbNB [Source code: https://lnkd.in/dVxUvN7e]
The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU
arbook.icg.tugraz.at
-
Excited to share the following coding tutorial on the #NVIDIA Tensor Memory Accelerator! TMA is essential to extracting performance on NVIDIA Hopper™ GPUs, but it's not the easiest feature to learn how to program for. This tutorial aims to change that state of affairs and impart an operational understanding of TMA by walking through a few fully worked-out examples. We cover TMA load, store, store reduce, and load multicast. This is the fruit of a collaboration with Hieu Pham and is part of an ongoing series of CUDA® tutorials with an emphasis on the CUTLASS library. https://lnkd.in/g8MRp8a2
CUTLASS Tutorial: Mastering the NVIDIA® Tensor Memory Accelerator (TMA)
research.colfax-intl.com