HABANA® GAUDI®2 WHITE PAPER
JUNE 2022
I. INTRODUCTION
The main benefit that current customers of our first-generation Gaudi see is the price-performance advantage relative to GPU solutions for popular vision and language models. It enables customers to train more and pay less, and thereby accelerate time-to-market for their model training. With Gaudi2, we are pleased to extend these benefits beyond better price-performance to performance leadership over the leading 7nm GPU shipping today, the Nvidia A100.
Before we address the architecture details, here are key benchmarks for Gaudi2 at the time of publication of this white paper, reflecting the SynapseAI® Software Suite Release 1.5, effective June 16, 2022.
A100-80GB: Measured by Habana on Azure instance Standard_ND96amsr_A100_v4 using a single A100-80GB with TF docker 22.03-tf2-py3 from NGC (Phase-1: seq len=128, BS=312, accu steps=256; Phase-2: seq len=512, BS=40, accu steps=768)
A100-40GB: Measured by Habana on DGX-A100 using a single A100-40GB with TF docker 22.03-tf2-py3 from NGC (Phase-1: seq len=128, BS=64, accu steps=1024; Phase-2: seq len=512, BS=16, accu steps=2048)
V100-32GB: Measured by Habana on p3dn.24xlarge using a single V100-32GB with TF docker 21.12-tf2-py3 from NGC (Phase-1: seq len=128, BS=64, accu steps=1024; Phase-2: seq len=512, BS=8, accu steps=4096)
Gaudi2: Measured by Habana on an HLS-Gaudi2 system using a single Gaudi2 with SynapseAI TF docker 1.5.0 (Phase-1: seq len=128, BS=64, accu steps=1024; Phase-2: seq len=512, BS=16, accu steps=2048)
Habana reported these results on models ported from first-generation Gaudi to Gaudi2, based on the SynapseAI® Software Suite release 1.5. While our OEM partners are building servers for general consumption, Habana's engineers are porting and developing additional deep-learning models, with software releases on a 6-8 week cadence. You can track our progress through our public GitHub and the developer.habana.ai site.
The diagram below shows what is usually observed on GPUs, where GEMM compute and general-purpose core execution time do not overlap:
[Figure: GPU execution timeline showing GEMM and non-GEMM operations executing one after the other, without overlap]
On the Gaudi and Gaudi2 architectures, the MME and TPC compute time overlaps. The GEMM and non-GEMM operations are mostly overlapped, dramatically accelerating the workload.
[Figure: Gaudi execution timeline showing GEMM (MME) and non-GEMM (TPC) operations overlapping in time]
Another big difference between the GPU and Gaudi architectures is the size of the matrix-multiplication accelerator. This may seem minor, but it has a big effect on the overall ability to utilize those accelerators, especially as matrix sizes become smaller. The diagram below compares a single 256x256 matrix accelerator (on the left) to 256 small 16x16 matrix accelerators (on the right); the depth dimension is omitted to simplify the explanation. From a compute perspective the two are equivalent, but from a bandwidth perspective they differ sharply: the left one requires 512 input elements per cycle to keep its compute utilized, whereas the right side requires 8K input elements per cycle, a 16x difference in read-bandwidth requirements toward the first-level memory.
[Figure: a single 256x256 accelerator fed at 256 elements per cycle per operand (left) versus 256 independent 16x16 accelerators fed at 256x16 elements per cycle per operand (right)]
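The read-bandwidth gap quoted above follows directly from the tile geometry; a few lines of Python reproduce the 512 vs. 8K figures (illustrative arithmetic only, the function name is ours and not any Habana API):

def input_elems_per_cycle(tile, count):
    # Each square tile engine consumes one row of `tile` elements from each
    # of its two operands per cycle.
    return count * 2 * tile

print(input_elems_per_cycle(256, 1))   # 512:  one large 256x256 engine
print(input_elems_per_cycle(16, 256))  # 8192: 256 small 16x16 engines
print(input_elems_per_cycle(16, 256) // input_elems_per_cycle(256, 1))  # 16x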
GPUs, which implement the right-hand approach, compensate for this phenomenon by mandating a large reuse factor from their hierarchical caching memory subsystem; as a result, they need a very large matrix-multiplication problem to fully utilize their multipliers. Gaudi2 (and some other dedicated tensor processors), which implement the left-hand approach, can utilize their multipliers easily while leaving plenty of free bandwidth and capacity in their flat memory subsystem for tasks other than matrix multiplication.
Such high utilization on small tensors significantly eases the overlapping of MME and TPC computation: to allow tight overlap, an operation is sliced as described below, which creates smaller tensors for the MME and TPC to operate upon.
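To make the slicing idea concrete, here is a conceptual Python sketch (ours, not Habana code). It splits one GEMM-plus-activation into row slices so that, once slice i has passed the matrix engine, its activation can be processed while slice i+1 is being multiplied. In this sequential simulation the two steps simply alternate; on Gaudi, the MME and TPC would run them concurrently.

import numpy as np

def sliced_gemm_relu(A, B, num_slices=4):
    # Split the output rows into slices small enough for the engines to overlap.
    slices = np.array_split(np.arange(A.shape[0]), num_slices)
    out = np.empty((A.shape[0], B.shape[1]), dtype=A.dtype)
    pending = []  # slices whose "MME" (GEMM) step has finished
    for idx in slices:
        pending.append((idx, A[idx] @ B))      # "MME" step on slice i
        if len(pending) > 1:                   # "TPC" step on slice i-1
            pidx, pres = pending.pop(0)        # (concurrent on real hardware)
            out[pidx] = np.maximum(pres, 0.0)  # ReLU
    pidx, pres = pending.pop(0)                # drain the last slice
    out[pidx] = np.maximum(pres, 0.0)
    return out

A = np.random.randn(256, 128).astype(np.float32)
B = np.random.randn(128, 64).astype(np.float32)
assert np.allclose(sliced_gemm_relu(A, B), np.maximum(A @ B, 0.0), atol=1e-4)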
Gaudi2 integrates Habana's fourth-generation Tensor Processor Core. The TPC is a general-purpose VLIW processor, 256B SIMD wide, that supports FP32, BF16, FP16 and FP8 (both E4M3 and E5M2), in addition to the INT32, INT16 and INT8 data types. As opposed to common DSPs, which require a DMA to fetch operands in and out of a local SRAM, the TPC uses advanced micro-architectural techniques to expose a DMA-free programming model, which significantly eases software development. In addition, the same advanced microarchitecture allows bubble-free execution between kernels, which keeps the TPC effectively 100% utilized on tensor processing even for very short kernels, regardless of where its inputs and outputs reside (SRAM or HBM). Just like the MME, the TPC is also very efficient when working on small tensors.

As deep learning training is usually spread across multiple devices, Gaudi2's Network Interface Controllers (NICs) are an essential component of the overall Habana second-generation training solution. Gaudi's NIC is customized to fit the distribution of a DNN graph between the chips in the network (AKA scale-out). The NIC provides the compute engine with remote direct memory access (RDMA) featuring high bandwidth and low latency over a reliable connection, without any software intervention. To fit common cloud infrastructure, the NIC ports use Ethernet connectivity with an aggregated bandwidth of 2.4 Tb/s, supporting multiple port configurations. The NIC implements the RoCE v2 specification, benefiting from the commonly used Ethernet infrastructure and the reliable, low-latency RDMA of the InfiniBand protocol, while extending RoCE scalability with a flexible time-based congestion-control algorithm that enables linear scalability over thousands of Gaudi systems.

DNN topologies tend to use collective operations extensively, and posting collective operations on multiple ports usually requires significant CPU horsepower. To reduce CPU utilization, a scalable collective offload was introduced on Gaudi2, which helps make Gaudi2's message rate more than an order of magnitude better than the competition's. Gaudi's NICs are also aligned with all the other engines on the chip and can access both local and remote memory in tensor semantics.

To summarize, the Gaudi heterogeneous architecture is unique in that it is highly efficient on small-tensor operations, which is an enabler for overlapping computation and networking communication between the heterogeneous agents, in addition to freeing up significant memory capacity and bandwidth in its memory subsystem.
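As an illustration of how these NIC capabilities surface to software, the sketch below initializes PyTorch distributed communication over the Gaudi NICs using the HCCL backend. The module path and backend name follow Habana's public documentation, but exact names should be verified against your SynapseAI release.

import torch
import torch.distributed as dist
import habana_frameworks.torch.distributed.hccl  # registers the "hccl" backend

# Rank and world size are taken from the launcher's environment variables.
dist.init_process_group(backend="hccl")
t = torch.ones(1024, device="hpu")
dist.all_reduce(t)  # collective executed over the Gaudi RoCE NICs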
Subgraphs with blue nodes are identified and encapsulated. The original graph is modified to replace the subgraphs with their corresponding encapsulated nodes. The framework runtime then executes the modified graph. Per node, a corresponding SynapseAI graph is created and compiled. For performance optimization, the compilation recipe is cached for future use. After allocating memory, the recipe is enqueued for execution on a SynapseAI stream.

SynapseAI supports distributed training with TensorFlow using Horovod and the tf.distribute API with HPUStrategy. Mixed precision execution is available via the tf.keras.mixed_precision API or using Habana's automated mixed-precision conversion. These enable you to run mixed precision training without extensive modifications to existing FP32 model scripts. More details are available in the TensorFlow section on docs.habana.ai.

The SynapseAI PyTorch bridge interfaces between the framework and the SynapseAI software stack to train PyTorch-based deep learning models on Gaudi. We support two modes of execution: (1) Eager mode, which performs operator-by-operator execution as defined in standard PyTorch eager-mode scripts, and (2) Lazy mode, which performs deferred execution of graphs comprising a collection of operators. Lazy mode provides a user experience like Eager mode while enabling high performance on Gaudi. By default, we enable lazy-mode execution. Instead of executing one operator at a time, the SynapseAI bridge internally accumulates the operators in a graph. Execution of the accumulated operators is triggered in a "lazy" manner, only when a tensor value is required by the user. This allows the bridge to construct a graph that gives the SynapseAI graph compiler the opportunity to optimize device execution for the operators.

Mixed precision execution is available via the Habana Mixed Precision (HMP) package. The HMP package automatically modifies the Python operators to add the appropriate cast operations, enabling you to run mixed precision training without extensive modifications to existing FP32 model scripts. The SynapseAI PyTorch bridge supports distributed training using the torch.distributed and torch.nn.parallel.DistributedDataParallel APIs for both data and model parallelism, with distributed communication enabled through the HCCL backend. For more details, check out the PyTorch section on docs.habana.ai.

SynapseAI is also integrated with TensorBoard to enable debugging and profiling of your TensorFlow or PyTorch models. Users interested in low-level focused profiling can refer to the SynapseAI Profiler User Guide on docs.habana.ai.
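To make the lazy-mode description concrete, here is a minimal training-step sketch, assuming the habana_frameworks.torch bridge described above (verify exact module names against your SynapseAI release):

import torch
import habana_frameworks.torch.core as htcore  # SynapseAI PyTorch bridge

device = torch.device("hpu")  # Gaudi is exposed to PyTorch as the "hpu" device
model = torch.nn.Linear(784, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(128, 784, device=device)
y = torch.randint(0, 10, (128,), device=device)

optimizer.zero_grad()
loss = criterion(model(x), y)  # operators are accumulated, not executed yet
loss.backward()
optimizer.step()
htcore.mark_step()  # flushes the accumulated lazy-mode graph to the device
print(loss.item())  # reading a tensor value also triggers execution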
At Habana, designing and developing the hardware for high-performance, efficient DL processors accounts for a relatively small portion of our effort; the majority is dedicated to leveraging that hardware with the right software, tools and support you need to make your workloads and models run efficiently, with accuracy and speed.

In addition to the SynapseAI software suite, which is designed for performance and usability, we have also published a wealth of information and resources to make it easy for you to get started with training on Gaudi processors. The Habana Developer Site is the hub for Habana developers, where you will find the content, guidance, tools, and support needed to easily and flexibly build new AI models, migrate existing ones, and optimize their performance on our AI processors. The Resources section contains a collection of documents, short videos and hands-on Jupyter notebook tutorials to help you get started with running models on Gaudi. And for IT and systems administrators building Gaudi-based systems on premises, we provide guidance on set-up and management of Gaudi servers and computing infrastructure.

The Habana GitHub contains repositories open to the general public, which include setup and install instructions for Habana binaries and docker creation, Jupyter notebook-based tutorials, reference models, a custom TPC kernel example, and more. Our Model-References repository contains 30+ popular TensorFlow and PyTorch models that have been ported to Gaudi, and the Model Performance page provides the latest performance results for these models. The Habana Developer Site also has a searchable catalog of SynapseAI container images and TensorFlow and PyTorch reference models. For more information on future model support, please refer to our SynapseAI model roadmap page. Each model comes with model scripts and instructions on how to run it on Gaudi. We are committed to continuously expanding our model coverage and providing a wide variety of examples for users.
V. MODEL MIGRATION
Below we show the minimum set of changes required to port a TensorFlow Keras
model that does not contain any custom kernels.
import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module  # Habana-specific
load_habana_module()  # registers the Gaudi (HPU) device with TensorFlow

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10),
])

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=128)
model.evaluate(x_test, y_test)
The minimal changes to enable training on the Habana Gaudi device are the first two Habana-specific lines. All you need is to import load_habana_module and then invoke the load_habana_module() function to enable training on Gaudi. With this change, the Gaudi device, referred to as HPU in the framework, is registered in TensorFlow and prioritized for execution over the CPU. When an operator is available for both CPU and HPU, the operator is assigned to the HPU; when it is not supported on Gaudi, it runs on the CPU. For more details on porting your TensorFlow model to Gaudi processors, check out the TensorFlow Migration Guide.
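If you want to confirm where operators land, stock TensorFlow APIs are sufficient; the short sketch below uses only standard tf.debugging and tf.config calls plus the Habana module load shown above:

import tensorflow as tf
from habana_frameworks.tensorflow import load_habana_module

load_habana_module()
tf.debugging.set_log_device_placement(True)  # log the device chosen for each op
print(tf.config.list_logical_devices())      # the list should now include an HPU device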
At Habana, our aim is to meet developers where they are. We have been collaborating with AI software ecosystem partners to enable a seamless user experience with Habana AI processors.
There are two main classes one needs to know: (1) the GaudiTrainer class, which takes care of compiling (in lazy or eager mode) and distributing the model to run on HPUs, as well as performing training and evaluation, and (2) the GaudiConfig class, which configures Habana Mixed Precision and decides whether optimized operators and optimizers should be used. The GaudiTrainer is very similar to the Transformers Trainer, and adapting a script that uses the Trainer to work with Gaudi mostly consists of simply swapping the Trainer class for the GaudiTrainer one. The example below shows how simple it is to get started with training Transformer models on Gaudi. Several popular reference models are available on the Hugging Face Habana page, including bert-base, bert-large, roberta-base, roberta-large, distilbert-base, albert-large and albert-xxlarge.
...
# Loading the GaudiConfig needed by the GaudiTrainer to fine-tune the model on HPUs
gaudi_config = GaudiConfig.from_pretrained(
    training_args.gaudi_config_name,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
)

trainer = GaudiTrainer(
    model=model,
    gaudi_config=gaudi_config,
    # The training arguments differ a bit from the original ones, which is
    # why we use GaudiTrainingArguments
    args=training_args,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)
Getting started with training on Gaudi using PyTorch Lightning is just as easy. All you need is to provide the accelerator="hpu" parameter to the Trainer class and select the number of Gaudi processors by setting the devices parameter. For mixed precision training, import the HPUPrecisionPlugin and set precision=16, as sketched below.
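A minimal sketch, assuming the PyTorch Lightning 1.6-era HPU integration (class and parameter names per the Lightning documentation of that time; verify against your installed version):

import pytorch_lightning as pl
from pytorch_lightning.plugins import HPUPrecisionPlugin

trainer = pl.Trainer(
    accelerator="hpu",                            # run on Habana Gaudi
    devices=8,                                    # number of Gaudi processors
    plugins=[HPUPrecisionPlugin(precision=16)],   # mixed precision training
)
# trainer.fit(model, datamodule=dm)  # model/dm are your own LightningModule/DataModule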
With cnvrg.io, data scientists can deploy more models with drag-and-drop machine learning pipelines. You can easily run and track experiments, and automate your machine learning from research to production using reusable components and a drag-and-drop interface. Getting started with Habana Gaudi on cnvrg.io first requires setting up a Kubernetes cluster for your on-premises Gaudi servers or an Amazon EKS cluster using DL1 EC2 instances; cnvrg.io seamlessly integrates both on-premises and cloud compute resources. The Habana Vault, which hosts the SynapseAI TensorFlow and PyTorch Docker container images, is integrated and available in cnvrg.io Registries. You can bring up a new Jupyter workspace, selecting the appropriate Gaudi compute and Docker image from the cnvrg.io Habana container registry. You can then get started with the Habana reference models by simply adding the repo location in the cnvrg Project Settings Git Integration page. Now you can start a new Experiment in cnvrg.io and begin training your model on Gaudi.
Habana has worked with server, switch, and storage system partners to make it easy for end customers to build AI racks and clusters.
The figure below shows a rack-scale configuration with four Gaudi servers
connected to a single Ethernet switch at the top of the rack. This switch can be
further connected to other racks to form a much larger training pod that can hold
hundreds or thousands of Gaudi processors.
The DDN A3I scalable architecture integrates X12 Gaudi AI servers with DDN AI shared parallel file storage appliances and delivers fully optimized end-to-end AI acceleration on Habana Gaudi AI processors. DDN A3I solutions greatly simplify the deployment of X12 Gaudi AI servers in single-server and multi-server configurations, while also delivering the performance and efficiency needed for maximum saturation of Habana Gaudi AI processors, and high levels of scalability.
This section describes the components integrated in DDN A3I solutions with Supermicro X12 Gaudi AI servers.
FIGURE 14. DDN A3I REFERENCE ARCHITECTURE WITH FOUR X12 GAUDI AI SERVERS
[Figure: four Supermicro X12 Gaudi AI servers; each server has 2 links to the storage & cluster management network switch and 6 links to the Gaudi network switch, and the management switch connects to the DDN AI400X2 appliance over 8 links]
Additionally, the X12 Gaudi AI servers are connected through a network switch
for Gaudi communication. Every X12 Gaudi AI server connects to the Gaudi
network switch via six 400 GbE links.
[Figure fragment: two HL-225H OAMs (OAM6 and OAM7), each with 21 x 100G RoCE links for on-board connectivity and 3 x 100G RoCE links routed to QSFP-DD ports for scale-out]
X. HLS-GAUDI®2 Server
The HLS-Gaudi®2 system is a high-performance deep-learning server incorporating a dual-socket Xeon host subsystem and eight Gaudi2 accelerators, and it supports scale-out through 24 x 100GbE RDMA ports.
[Table fragment: HLS-Gaudi2 features include 2x PCIe switch, HIB, BMC + peripherals, and HLBA-225]