The Golden Age of Compilers

in an era of Hardware/Software co-design

International Conference on
Architectural Support for Programming Languages and
Operating Systems (ASPLOS 2021)

Chris Lattner
SiFive Inc
April 19, 2021
Video Recording
Let’s talk compilers + accelerators
● Classical Compiler Design
● Modular Compiler Infrastructure
● Domain Specific Architectures
● Accelerator Compilers
● Silicon Compilers
● A Golden Age of Compilers
A New Golden Age for Computer Architecture
John L. Hennessy, David A. Patterson; June 2018

[Figure: processor performance over time. RISC era: 2X every 1.5 years (52%/yr)]

[cite] Turing Lecture, Hennessy, Patterson; June 2018 / CACM Feb 2019
HW / SW co-design is the best way to expose parallelism of silicon... and utilize it

[cite] Hennessy, Patterson; June 2018 / CACM Feb 2019


🧐 How do we program these things?
[cite] Hennessy, Patterson; June 2018 / CACM Feb 2019
Hardware is getting harder
Modern compute acceleration platforms are multi-level and explicit:
● Scalar, SIMD/Vector, Multi-core, Multi-package, Multi-rack
● Non-coherent memory subsystems increase efficiency

Heterogeneous compute incorporating domain-specific accelerators


● Standard in high-end SoCs, domain-specific hard blocks in FPGAs

Many accelerator IPs are configurable:


● Optional extensions, tile / core count, memory hierarchy, etc

🤯 How can “normal people” write software for this in the first place?
● … and how can you afford to build generation-specific SW?
Next-Gen compilers and PL are needed!
We need:
● hardware abstraction spanning diverse accelerators
● support for heterogeneous compute platforms
● domain specific languages and programming models
● quality, reliability, and scalability of infrastructure

This opportunity is beckoning a golden age in compiler and PL technology!

Let’s learn from the past, then project into the future 🚀
Classical Compiler Design
C Compilers leading into the early 90s

IBM

⇒ Expensive, not very compatible, inconsistencies abound


● … and didn’t share any code
Also: Came in boxes, with printed manuals, often on floppy disks!
Three Phase Compiler Design

Figure 11.1: Three Major Components of a Three-Phase Compiler

[cite] LLVM @ The Architecture of Open Source Applications


FOSS Enables Collaboration & Reuse

Figure 11.2: Retargetability

One frontend for many backends, one backend for many frontends
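The retargetability idea can be sketched in a few lines. This is a toy illustration, not LLVM: the stack-machine IR, the two frontends, and the two backends are all hypothetical names invented for this example. Every frontend emits the same IR and every backend consumes it, so adding a language or a target costs one new function rather than one per pairing.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """One instruction of a toy stack-machine IR (hypothetical, not LLVM IR)."""
    op: str        # "push" or "add"
    arg: int = 0


# Frontends: source text -> shared IR
def frontend_infix(src: str) -> list[Instr]:
    """Parse sums written as '1 + 2 + 3'."""
    terms = [int(t) for t in src.split("+")]
    ir = [Instr("push", terms[0])]
    for t in terms[1:]:
        ir += [Instr("push", t), Instr("add")]
    return ir


def frontend_rpn(src: str) -> list[Instr]:
    """Parse reverse Polish notation such as '1 2 + 3 +'."""
    return [Instr("add") if tok == "+" else Instr("push", int(tok))
            for tok in src.split()]


# Backends: shared IR -> some target
def backend_interp(ir: list[Instr]) -> int:
    """'Target' 1: evaluate the IR directly."""
    stack: list[int] = []
    for ins in ir:
        if ins.op == "push":
            stack.append(ins.arg)
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack.pop()


def backend_asm(ir: list[Instr]) -> str:
    """'Target' 2: print a made-up assembly listing."""
    return "\n".join(f"PUSH {i.arg}" if i.op == "push" else "ADD" for i in ir)


FRONTENDS = {"infix": frontend_infix, "rpn": frontend_rpn}
BACKENDS = {"interp": backend_interp, "asm": backend_asm}


def compile_with(frontend: str, backend: str, src: str):
    """Any frontend composes with any backend through the shared IR."""
    return BACKENDS[backend](FRONTENDS[frontend](src))
```

Two frontends plus two backends (four functions) cover all four language/target pairings; without the shared IR, each pairing would need its own translator.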
Lessons Learned
Achieved “O(frontend+backends)” scalability of compiler ecosystem

Larger center of gravity concentrated scarce compiler engineering effort


● Enables innovations in languages, frontends and backends

Reduced 💢fragmentation, standardized “C in practice”


● Enabled new business models 💸
● Untied the CPU ISA war from inconsequential impl details
Modular Compiler Infrastructure
Library Based Design

[Figure: LLVM as a set of libraries. Clang Frontend (Parser, Sema, IRGen, Tooling, ...) → Optimizer (LICM, IPCP, DCE, CSE, Dataflow, ...) → CodeGen (RISC-V, SPARC, X86, ...) → Emitter (.o, .s, JIT)]
Key insight: Compilers as libraries, not an app!
● Enable embedding in other applications
● Mix and match components
● No hard coded lowering pipeline
LLVM
Components and interfaces!
Better than monolithic approaches for large scale designs:
- Easier to understand and document components
- Easier to test
- Easier to iterate and replace
- Easier to subset
- Easier to scale the community
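A minimal sketch of what "compilers as libraries" buys you, assuming a toy tuple-based IR (hypothetical, far simpler than LLVM IR) and two hand-written passes. Each pass is an ordinary function from IR to IR, so a client chooses which passes to run and in what order, and each pass can be tested on its own.

```python
def constant_fold(ir):
    """Replace an 'add' of two integer literals with a 'const'."""
    out = []
    for dest, op, args in ir:
        if op == "add" and all(isinstance(a, int) for a in args):
            out.append((dest, "const", (args[0] + args[1],)))
        else:
            out.append((dest, op, args))
    return out


def dead_code_elim(ir, live_out=("ret",)):
    """Drop instructions whose result is never used (one backward sweep)."""
    live = set(live_out)
    kept = []
    for dest, op, args in reversed(ir):
        if dest in live:
            kept.append((dest, op, args))
            # operands that are names (strings) become live in turn
            live.update(a for a in args if isinstance(a, str))
    return kept[::-1]


def run_pipeline(ir, passes):
    """No hard-coded pipeline: the caller supplies the pass list."""
    for p in passes:
        ir = p(ir)
    return ir


program = [
    ("a", "add", (1, 2)),
    ("b", "add", (3, 4)),      # never used
    ("ret", "mul", ("a", "a")),
]
```

Running `run_pipeline(program, [constant_fold, dead_code_elim])` folds `a` and drops `b`; passing only `[dead_code_elim]` subsets the pipeline without touching either pass.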

Lessons Learned
Larger center of gravity concentrated scarce compiler engineering effort
● Enables innovations in languages, frontends and backends

Reduced 💢fragmentation of JIT compilers, standardized CPU codegen


● Enabled new business models
● Databases, graphics shader compilers, GPGPU, EDA HLS tools, …
Scalable community architecture:
● Design methodology / developer policies
● Community policies: inclusion, licensing, extensions etc

Limitations of LLVM
20 years in perspective on LLVM:
● “One size fits all” quickly turns into “one size fits none”
● LLVM is: 👍 CPUs, “just ok” 👈 for SIMT, but 👎 for many accelerators
● … is not great for parallel programming models 💩

Engineering is “pretty good” but could be better:


● Lots of redundancy/reimplementation @ different levels of abstraction
● Deeper discussion @ CGO 2020 talk

Going beyond basic CPUs means going beyond LLVM IR!

Domain Specific Architectures
[cite] Hennessy, Patterson; June 2018 / CACM Feb 2019
It’s happening!
CPU, etc. GPGPU, etc. TPU, NPU, etc. FPGA, CPLD, etc. ASIC

Programmable xPUs Custom Hardware

Specialization

[cite] Applying Circuit IR Compilers and Tools (CIRCT) to ML Applications, Mike Urbach, MLSys Chips And Compilers Symposium 2021
Lots of players! (an incomplete list!)

CPU, etc. GPGPU, etc. TPU, NPU, etc. FPGA, CPLD, etc. ASIC

Programmable xPUs Custom Hardware


How do we compile for this?

⇒ Not very compatible, inconsistent quality and scope


● … and don’t share much code
We’ve seen this before!

IBM
We need some unifying theories!
We need:
● “O(frontend+backends)” scalability of compiler ecosystem
● Larger center of gravity concentrated scarce compiler engineering effort
● Reduced 💢fragmentation:
● Ability to innovate in the programming model
● ... without reinventing the whole stack
Accelerator Compilers
How do accelerators work?
[Diagram: Control Processor / “Sequencer” driving a Parallel Compute Unit]

Control Processor / Sequencer
● Executes commands issued by the host driver app
● Handles booting and other housekeeping
● Diagnostics, security, debug, other functions

Some accelerators may do significantly more!

The ratio of control to parallel compute varies, as do the internal architectures of both
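The split between sequencer and compute unit can be sketched as a command loop. Everything here is a hypothetical model for illustration (the command names `write`, `launch`, `status` and both classes are invented); real accelerators differ widely in how much the control processor does.

```python
from collections import deque


class ParallelComputeUnit:
    """Stand-in for the parallel datapath: one op applied across all lanes."""

    def run(self, op, xs, ys):
        if op == "add":
            return [x + y for x, y in zip(xs, ys)]
        if op == "mul":
            return [x * y for x, y in zip(xs, ys)]
        raise ValueError(f"unsupported op: {op}")


class Sequencer:
    """Stand-in for the control processor: drains commands from the host."""

    def __init__(self, pcu):
        self.pcu = pcu
        self.queue = deque()   # commands submitted by the host driver app
        self.buffers = {}      # device-resident data, keyed by name
        self.log = []          # housekeeping / diagnostics output

    def submit(self, *cmd):
        self.queue.append(cmd)

    def run_all(self):
        while self.queue:
            kind, *args = self.queue.popleft()
            if kind == "write":            # host -> device copy
                name, data = args
                self.buffers[name] = list(data)
            elif kind == "launch":         # hand work to the compute unit
                op, dst, a, b = args
                self.buffers[dst] = self.pcu.run(op, self.buffers[a], self.buffers[b])
            elif kind == "status":         # diagnostics
                self.log.append(f"{len(self.buffers)} buffers resident")
            else:
                raise ValueError(f"unknown command: {kind}")


seq = Sequencer(ParallelComputeUnit())
seq.submit("write", "a", [1, 2, 3])
seq.submit("write", "b", [4, 5, 6])
seq.submit("launch", "add", "c", "a", "b")
seq.submit("status")
seq.run_all()
```

The compute unit knows nothing about commands or buffers; all sequencing, data movement, and diagnostics live in the control processor.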


Add a system interface
[Diagram: a Memory / Bus Interface added alongside the Control Processor / “Sequencer” and the Parallel Compute Unit]

Communicates with other parts of the SoC, or with off-chip resources

Including DDR, HBM, … AMBA, PCI, CXL, etc depending on integration level
“Oops we need some software”
[Diagram: hardware blocks on one side, the software they require on the other]
Hardware: Memory / Bus Interface; Control Processor / “Sequencer”; Parallel Compute Unit
Software: Programming Model + Userspace API; Device Kernels; Control Proc Assembler + Kernel Driver

The SW people are called in after the accelerator is defined to “make it work”
Larger accelerators go multicore/SIMT...
[Diagram: several Control Processor / “Sequencer” + Parallel Compute Unit pairs behind one Memory / Bus Interface]
Hardware: Memory / Bus Interface; multiple Control Processor / “Sequencer” + Parallel Compute Unit pairs
Software: Programming Model + Userspace API; Parallel Device Kernels; Control Proc Assembler + Kernel Driver

Use of more HW area is desired, requiring parallel control logic


Tiling and heterogeneity for generality
[Diagram: tiled groups of Control Processor / “Sequencer” + Parallel Compute Unit pairs behind one Memory / Bus Interface]
Hardware: Memory / Bus Interface; tiles of Control Processor / “Sequencer” + Parallel Compute Unit pairs
Software:
● Programming Model + Userspace API
● Multistream Mgmt / Interop Parallelism
● Memory + Communication Optimization
● Heterogeneous Device + Host fallback
● Parallel Device Kernels
● Control Proc Assembler + Kernel Driver
⇒ Also, hierarchical compute at the board, rack, and datacenter level
Pros & Cons of hand-written kernels
Benefits:
● Easy to get started, ability to get peak performance, hackability

Problem: hand written kernels don’t scale


● Expensive to maintain a library of 100’s to 1000’s of kernels
● Don’t scale to configurable IPs, not even memory hierarchy dimensions
● Don’t scale to device families, or evolving µarch’s over time
● Eventually end up limiting HW design space exploration / evolution

Often addressed with metaprogramming (aka “mini compilers”)
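A sketch of the "mini compiler" pattern, assuming a hypothetical generator `make_tiled_matmul` that stamps out one matmul kernel per tile size from a source template. It shows why this works at first (one template instead of many hand-written variants) and hints at why it turns into a compiler: every new hardware parameter becomes another template hole.

```python
def make_tiled_matmul(tile: int):
    """Generate and load a square matmul kernel specialized to one tile size.

    Returns (kernel_function, generated_source). The template below is a
    stand-in for a real device kernel; only the tile size is parameterized.
    """
    src = f'''
def matmul_tile{tile}(A, B, n):
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, {tile}):
        for j0 in range(0, n, {tile}):
            for k0 in range(0, n, {tile}):
                for i in range(i0, min(i0 + {tile}, n)):
                    for j in range(j0, min(j0 + {tile}, n)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + {tile}, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
'''
    namespace = {}
    exec(src, namespace)   # "compile" the generated kernel
    return namespace[f"matmul_tile{tile}"], src


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
kernel_t1, _ = make_tiled_matmul(1)
kernel_t2, source_t2 = make_tiled_matmul(2)
```

Both generated kernels compute the same product; only the loop blocking differs, which is the part a hardware configuration (for example a scratchpad size) would dictate.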


“DSA Compilers” to the rescue
[Diagram: tiled groups of Control Processor / “Sequencer” + Parallel Compute Unit pairs behind one Memory / Bus Interface, with the software stack alongside]
Hardware: Memory / Bus Interface; tiles of Control Processor / “Sequencer” + Parallel Compute Unit pairs
Software:
● Programming Model + Userspace API
● Accelerator Kernel Compiler
○ Multistream Mgmt / Interop Parallelism
○ Memory + Communication Optimization
○ Heterogeneous Device + Host fallback
○ Kernel Code Generation
● Control Proc Assembler + Kernel Driver
This is hard!
… and we keep reinventing it over and over again
… at the expense of usability and quality
Mostly needless reinvention, not co-design!
Most complexity is in non-differentiated (table stakes) components
Innovate where it matters
… use open standards to accelerate the rest
Industry already standardized the buses
[Diagram: Memory / Bus Interface; Control Processor / “Sequencer”; Parallel Compute Unit]

PCI, HBM, DDR, CXL, AMBA, etc are all standardized


Standardize the Control Processor?
[Diagram: Control Processor / “Sequencer” connected through a Standard Interface to the Parallel Compute Unit]

SW is a bigger problem than HW for accelerators:
● Control processor is bottom of the SW stack

We fool ourselves into building trivial CPUs:
● it can seem fun to design a new solution here
● … except reset, debug, power management, security, etc (the hard parts!)

“Saving a few gates” slows down what matters
● … and hobbles the critical path: software
[cite] Hennessy, Patterson; June 2018 / CACM Feb 2019
Open Industry Standard
● Many implementations available

🧩 Modular and subset-able ISA design:


● Extensibility allows easy addition of heterogeneous units

Scalability allows full spectrum of design points!

[Figure: spectrum of design points, each pairing a RISC-V core (2-Series, 5 Series, 7 Series, 8+ Series) with zero or more Parallel Compute Units: Hard Coded Accelerator, Programmable Accelerator, Heterogeneous Workload Accelerator, General Purpose CPU]
“15,000 gates? I can’t count that low!”
Cliff Young, TPU architect, Google Brain
MLSys 2021 Chips and Compilers Symposium Panel
Standardize your base Software
[Diagram: RISC-V Control Processor connected through a Standard Interface to the Parallel Compute Unit; software layer: RISC-V Compiler + Kernel Drivers]

Write your kernels in C or LLVM IR!
● Use existing code generators
● Use existing simulators
● Step through them in a debugger


The next frontier: DSA Compilers?

“No one size fits all” compiler!

Shape of the problem is the same...
… but the accel details always vary

How do we get reuse?

[Diagram: Accelerator Kernel Compiler (Multistream Mgmt / Interop Parallelism; Memory + Communication Optimization; Heterogeneous Device + Host fallback; Kernel Code Generation) on top of the RISC-V Software Ecosystem]
MLIR: Compiler Infra at the End of Moore’s Law
● Multi-Level Intermediate Representation
● Joined LLVM, follows open library-based philosophy
● 🧩 Modular, extensible, general to many domains
○ Being used for CPU, GPU, TPU, FPGA, HW, quantum, ....
● Easy to learn, great for research
● MLIR + LLVM IR + RISC-V CodeGen = 💝💝

https://mlir.llvm.org
See more (e.g.):
2020 CGO Keynote Talk Slides
2021 CGO Paper
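The "multi-level" idea can be sketched without MLIR itself. The snippet below is not MLIR's API or syntax: the three levels, the op names (`dot`, `for`, `fma`, `zero`), and the lowering functions are assumptions made up for illustration. It shows one operation lowered step by step, with each level still being ordinary data that a pass can inspect or rewrite.

```python
def lower_dot_to_loop(op):
    """High level -> loop level: a dot product becomes an explicit loop."""
    kind, a, b, out, n = op
    assert kind == "dot"
    return [
        ("zero", out),
        ("for", "i", n, [("fma", out, a, b, "i")]),
    ]


def lower_loops_to_scalar(ops):
    """Loop level -> scalar level: fully unroll constant-trip-count loops."""
    flat = []
    for op in ops:
        if op[0] == "for":
            _, var, n, body = op
            for i in range(n):
                # substitute the loop variable with its concrete value
                flat += [tuple(i if x == var else x for x in inner)
                         for inner in body]
        else:
            flat.append(op)
    return flat


def run_scalar(ops, mem):
    """Tiny interpreter for the lowest level, used to check the lowering."""
    for op in ops:
        if op[0] == "zero":
            mem[op[1]] = 0
        elif op[0] == "fma":
            _, out, a, b, i = op
            mem[out] += mem[a][i] * mem[b][i]
    return mem


high = ("dot", "x", "y", "r", 3)
loops = lower_dot_to_loop(high)
scalar = lower_loops_to_scalar(loops)
result = run_scalar(scalar, {"x": [1, 2, 3], "y": [4, 5, 6]})
```

An accelerator with a native dot-product unit could stop at the first level, while a scalar core would take the fully lowered form; the levels above the divergence point are shared.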
RISC-V+MLIR: Uniting an Industry
CPU, etc. GPGPU, etc. TPU, NPU, etc. FPGA, CPLD, etc. ASIC

Programmable xPUs Custom Hardware


What is the benefit of this?
Larger center of gravity concentrated scarce compiler engineering effort
● Enables innovations in programming models and hardware

Achieved “O(frontend+backends)” scalability of compiler ecosystem

Reduced 💢fragmentation, improved 🧩 modularity


● Focus on the differentiated parts of the stack
But… what about hardware?
HW design is fragmented too
CPU, etc. GPGPU, etc. TPU, NPU, etc. FPGA, CPLD, etc. ASIC

??
Programmable xPUs Custom Hardware
Building Parallel Compute Units?
[Diagram: Memory / Bus Interface; RISC-V Processor; Parallel Compute Unit]

Notice how I conveniently omitted how to build the “interesting” part!
Silicon Compilers
Hardware Design is ripe with opportunity
SystemVerilog is industry standard, but:
● Huge, complicated, incompletely implemented
● Is it an IR? or programming language for humans? neither? both?

EDA tools are mature, but not always:


● … innovating rapidly, now that process technology has slowed
● … designed for usability
● … using best practices in SW architecture
● … cost efficient
Open Source tools to the rescue?
Wonderful ecosystem of Open Source tools, but:
● Generally aspiring to be “as good” as proprietary tools
● Fragmented communities, not sharing much code
● Monolithic designs connected by unfortunate standards

nextpnr

Innovation Explosion Underway!
Research is producing new HW design models and abstraction approaches

Magma

Dahlia

A great opportunity to pull PL + type system + compiler tech from SW world...


… held back by poor interop standards and ecosystem 💢fragmentation
See also: ASPLOS LATTE’21 Workshop
CIRCT: Circuit IR for Compilers and Tools
Compiler infrastructure for design and verification

● LLVM incubator project built on MLIR & LLVM
● Composable toolchain for different aspects of hardware design / EDA processes
● 🧩 Modularity, library based design, ecosystem
● High quality, usability, performance

Goals:
● Unite HW design tools community
● “Accelerate” design of the accelerators!
https://circt.llvm.org
CIRCT Ambition / Path Ahead
Support multiple different “hardware design models” in one framework:
● Generators, HLS, atomic transactions, ...
Increase abstraction level in the hardware design IR:
● Integrate modern type system features from the SW world
● Capture more design intent, higher level verification and tools
● Better integrate formal methods into the design flow
Increase quality of the tools themselves:
● Compile time: shrink development cycle time
● Usability: robust location tracking for good error messages

“10x” design and verification, change economics of hardware design


Co-design of HW and SW design
CPU, etc. GPGPU, etc. TPU, NPU, etc. FPGA, CPLD, etc. ASIC

Programmable xPUs Custom Hardware


A Golden Age of Compilers
in an era of Hardware/Software co-design
Compiler/PL tech more important than ever!
The world is evolving fast at the “End of Moore’s Law”
● Changing assumptions, expanding possibilities
HW changes require new programming models and approaches:
● … and is validating well known but sparsely adopted techniques
We need compiler and PL experts to step up!

We’re hiring!
Get involved!
https://mlir.llvm.org/
https://circt.llvm.org/
Too much content, skip this section

Frontiers in Compiler Architecture


Concurrency within the compiler
Parallel for each is not enough
Caching
Why are we rerunning N^2 and NP-complete algorithms from scratch when their inputs aren’t always changing?

Need to design the compiler for this from the beginning
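One way to picture this is a content-addressed cache in front of an expensive step. This is a minimal sketch under stated assumptions: `CachingCompiler` and `slow_compile` are hypothetical names, and the key covers only source text and flags, whereas a real compiler would also have to key on every other input (headers, target, compiler version).

```python
import hashlib


class CachingCompiler:
    """Skip recompilation when the (source, flags) inputs are unchanged."""

    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._cache = {}
        self.misses = 0          # how many times real work was done

    def compile(self, source: str, flags: tuple = ()):
        # Key on the content of the inputs, not on file names or timestamps.
        key = hashlib.sha256(repr((source, flags)).encode()).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._compile(source, flags)
        return self._cache[key]


def slow_compile(source, flags):
    """Stand-in for an expensive pass pipeline."""
    return f"object({source!r}, opt={'-O2' in flags})"


cc = CachingCompiler(slow_compile)
first = cc.compile("int f() { return 1; }", ("-O2",))
again = cc.compile("int f() { return 1; }", ("-O2",))   # served from cache
other = cc.compile("int f() { return 1; }", ())         # different flags: recompiled
```

Bolting this on at whole-file granularity is easy; doing it per function or per pass is what requires designing the compiler around it from the start.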


Distribution
Distributed systems vs compilers. One of the heaviest workloads, why are we doing this??

Build system + compilers dichotomy is terrible.

No distributed system person would ever build things this way.

Many problems are embarrassingly parallel here.
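The simplest form of that parallelism is a map over independent translation units. The sketch below assumes a hypothetical `compile_unit` that stands in for real work by counting tokens, and uses a local thread pool purely for illustration; a distributed build would replace the pool with remote workers while keeping the same shape.

```python
from concurrent.futures import ThreadPoolExecutor


def compile_unit(name_and_source):
    """Stand-in for compiling one translation unit: just count its tokens."""
    name, source = name_and_source
    return name, len(source.split())


def compile_all(units: dict, workers: int = 4) -> dict:
    """Units share no state, so they can be mapped over a worker pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(compile_unit, units.items()))


units = {
    "a.c": "int x ;",
    "b.c": "return 0 ;",
    "c.c": "void f ( ) { }",
}
objects = compile_all(units)
```

This is only the "parallel for each" level; it says nothing about parallelism inside a single unit, which is the harder problem the deck points at.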


Extensible Compilers
Call back into the generator as part of lowering.
Computational Demands of Machine Learning

https://openai.com/blog/ai-and-compute
