Scaling Your Chip Design Flow
Using Google Cloud to accelerate your chip design process
Table of contents
Executive summary
Conclusion
References
Executive Summary
As companies undergo digital transformation, a process in which an organization embraces new technologies to
redesign and redefine its relationships with customers, employees, and partners, cloud-driven infrastructure
modernization stands out. Companies around the world, from a variety of industries, are re-platforming to scale to
customer demand, enable higher internal productivity, or accelerate innovation. Nokia, the European company
well known for the world's first mobile call in 1991 and several noteworthy mobile innovations, recently announced
that it will migrate its entire on-premises infrastructure to the cloud, in keeping with its cloud-first strategy. This strategic
step enables Nokia to expand collaboration and innovation capabilities across its numerous data centers worldwide as it
continues to grow its vast portfolio of 5G, software, mobile, and technology products.
The advancements in personal computing, mobile computing, 5G, and AI/ML are all enabled by semiconductor
electronics. The semiconductor industry has enjoyed steady growth over the years; it is currently valued at hundreds
of billions of dollars and is expected to grow robustly in the coming decades. Technologies such as AI, cloud
computing, 5G, and Advanced Driver Assistance Systems (ADAS) are all expected to contribute to that growth and
its attendant opportunities.
The modern-day chip design process, which accelerated with the invention of the MOSFET¹ over 60 years ago, is where
electronics begins. This process has evolved and benefited significantly from access to powerful
computing infrastructure. Gordon Moore's prediction that the number of transistors on chips doubles every 18 to
24 months held up for a long time. Chip designers were able to keep up with this timeline by retooling and
using faster machines. However, technology nodes (connected to the size of transistors in the library used for
fabricating the chip and referenced in nanometers, or 'nm') are becoming smaller and smaller. For instance, chips
were at 28 nm just a few years ago but are now at 7 nm and decreasing further (Figure 1), driven by the demands of
several hundred industry use cases in mobile, computing, AI/ML, and others. With decreasing device sizes, the
potential for complexity grows exponentially, and the essential process of verifying and validating functionality, chip
operation variability, timing, low-power requirements, and semiconductor manufacturing rules, among other checks,
becomes enormously compute-intensive. Lack of access to high-performance computing capabilities ultimately
delays chip design and production, which can result in significant business losses.
The total cost of designing and manufacturing chips, from the skilled designers to the complex electronic design
automation (EDA) tools to the ultra-expensive manufacturing process, has ensured that a small number of companies
operate in this space, and competition among them is fierce. In this golden age for semiconductors, with
insatiable demand for AI, automotive, cloud, and edge hardware, semiconductor companies that want to succeed
will have to continue to design chips faster than their competitors.
1 Metal-Oxide-Semiconductor Field-Effect Transistor, the predominant device used in digital and analog integrated circuits
This white paper details the constantly growing complexity of the chip
design process, including specific challenges in the workflow, and
illustrates with real-world examples how cloud computing provides a
way to successfully address bottlenecks and design at scale.
Chapter 1
Chip design process
Like any project, a chip design project starts with a requirements specification. Circuits can be digital (operating at a
few defined levels of voltage), analog (operating over a continuous range), or a mixture of both device types. A vast
majority of chips designed are digital. Examples include microprocessors, network processors, and
Application-Specific Integrated Circuits (ASICs).
A simplified version of a typical digital design flow is shown below (Figure 2):
Each of the steps specified in Figure 2 can include several EDA tools. Most EDA tools tend to be from commercial
enterprises, though it is common for chip design companies to have several home-grown utilities for ensuring a
smooth flow between EDA tools in different phases.
"RTL Design & Modeling’, ‘Physical Design’, ‘Physical Verification’ and ‘Fabrication & Packaging’ steps are further
highlighted below, due to their intensive compute needs.
Fabrication and Test involve taking the taped-out² design, processing it to improve manufacturability, and fabricating
the required design on silicon wafers. When it comes to chip manufacturability, it is important to mitigate issues that
impact yield. Yield, in the semiconductor manufacturing context, is the amount of total input transformed into usable
product. Semiconductor yield is an aggregate number that includes several types of yield during various stages of
production. For example, one type of yield ("die yield") refers to the number of die (individual design units) that
continue to the final testing stage. The impact of yield on a chip vendor is well understood by reviewing the classic
yield ramp curve (Figure 4). Yield issues can be process-related, environmental, or electrical, and they are impacted
by feature size, number of metal layers, and wafer size, among a multitude of other factors. Manufacturing
processes include hundreds of steps, with each step producing vast amounts of data. This crucial step in the overall
product design lifecycle requires large-scale compute and the ability to store volumes of diverse data, so that insights
from that data can be used to make process and parameter changes that accelerate yield improvement.
Figure 4: Yield Curve for memory chips, Source: Based on “Yield Learning Curve Models in Semiconductor Manufacturing” by Israel
Tirkel
2 A term from the early days of electronics when the schematic or circuit design was manually written to tape, disk or CD. In recent times, it is
the step where the design team hands off the final chip layout file for fabrication
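The die-yield behavior behind a ramp curve like Figure 4 can be sketched with the classic Poisson yield model, Y = exp(-A · D0), where A is the die area and D0 the defect density. The model choice and the numbers below are illustrative assumptions, not figures from this paper:

```python
import math

def poisson_die_yield(die_area_cm2: float, defect_density_per_cm2: float) -> float:
    """Classic Poisson yield model: Y = exp(-A * D0)."""
    return math.exp(-die_area_cm2 * defect_density_per_cm2)

# Illustrative yield ramp: as a process matures, defect density (D0) drops
# and die yield climbs, mirroring the shape of the ramp curve.
die_area = 1.0  # cm^2, hypothetical die
for d0 in (2.0, 1.0, 0.5, 0.1):  # defects/cm^2, decreasing over the ramp
    print(f"D0={d0:.1f}/cm^2 -> die yield {poisson_die_yield(die_area, d0):.1%}")
```

Larger die at the same defect density yield exponentially worse, which is one reason shrinking nodes and growing die sizes make yield learning so compute- and data-intensive.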
Chapter 2
Design chips faster
The chip design community is focused on enabling optimizations to design chips faster, such as providing reusable
IP (intellectual property) blocks, debug utilities to find root causes faster, or improved EDA algorithms to improve
time to results. Designer productivity solutions have mostly centered on better EDA tools, new tools, or new ways to
improve flows. However, current optimizations are not enough. Each generation and new node requires significantly
more processing and analysis, leading to a growing gap between design complexity and designer productivity.
A typical chip can take anywhere from nine to eighteen months from conception to delivery. This time has stayed
constant over the years, in part due to improved EDA tools and more powerful computers, but also due to the size of
chip design teams, which have grown considerably over the past few years to meet time-to-market needs.
Elasticity:
In a typical on-premises data center, peak demand is often difficult to meet in a cost-effective manner. Often, the
need for additional compute is short-lived, and on-premises data centers are not built to provide infrastructure for
ultra-short periods of time. A more prevalent technique is to plan for the additional compute well in advance
(typically 8 to 12 weeks) and have it available when needed. Manual provisioning is a significant drain on
productivity. Further, once the chip is taped out, on-premises data centers suffer from significant under-utilization.
Access to state-of-the-art compute:
As in all industries, EDA tools constantly try to leverage the benefits of state-of-the-art hardware architectures. In
fact, it is typical for new versions of EDA tools to perform better when a chipset with a larger cache is provided, or
when multi-core capability that allows scaling to several threads is supported. Similarly, several EDA tools perform
better with a high-performance network between shared storage and compute. On-premises data centers have
upgrade cycles that are 3 to 5 years long, often leading to an environment where the best tool performance is not
achieved.
Data-driven insights:
Chip design companies design several generations of a chip, create
similar chips, and can benefit from sharing knowledge acquired when
building new chips. Similarly, fabrication plants operate highly
complex equipment that generates volumes of data. Learning from
data would be vastly beneficial in chip design and manufacturing.
However, this requires a systematic process, in which successful
recipes, tool parameters, bug reports, tool options, and similar metrics
are saved. An infrastructure that can support the logging and analysis
of chip design and manufacturing data is prohibitively expensive to
build, even for a large, multi-site design team, and this prevents design
teams from applying advanced analytics, AI/ML models, and Big Data
techniques consistently and successfully in chip design.
Agility:
The number of skilled engineers needed to design and verify a chip
has been growing, and companies are increasingly looking to leverage
global locations for access to highly skilled engineers and a diverse
employee base. Sometimes, starting a new design center is key to
business execution success. However, only a few design houses with
deep roots in global locations have been successful in such efforts.
For most companies, standing up a new design center in a short
amount of time is daunting.
Chapter 3
How Google Cloud accelerates chip
development
Chances are that you have already used one of the many Google
services (e.g., Search, Gmail, Chrome, YouTube), each with over a
billion users. Google Cloud is built on the same robust global
infrastructure that securely and reliably delivers these services to the
world. We offer a complete platform addressing the compute,
storage, networking, security, and AI/ML needs of a successful
cloud-enhanced EDA strategy.
Compute is one of the most important components of an EDA workflow infrastructure. In an on-prem model,
customers often make decisions to procure compute resources based on peak demand. These capex investments are
slow to add and frequently underutilized and over-provisioned after peak demands are met. Google Cloud offers
unparalleled flexibility in its compute offerings to run EDA workloads in the most optimized way, providing
pre-configured VMs in a variety of shapes as well as custom shapes for the ultimate flexibility.
Google Cloud accelerates EDA workloads with high-performance virtual machines using Google Compute Engine
(GCE). Google Cloud has the industry’s fastest startup time for provisioning virtual machines. These virtual
machines easily integrate with other Google Cloud services such as storage, AI/ML, and analytics. Key benefits
include confidential computing, predictive autoscaling, live migration for VMs, sole-tenant nodes, HPC VM images,
GPU accelerators, custom machine type sizing, per second billing, committed-use discounts (up to 57% savings),
and preemptible VMs (savings up to 80%).
• General purpose (E2, N1, N2, N2D) machines provide a good balance of price and performance. N2
machine types are the second-generation general-purpose machine types, based on Cascade Lake CPUs
with a base frequency of 2.8 GHz and a sustained all-core turbo of 3.4 GHz. Pre-configured shapes are
available from 2 to 80 vCPUs with 0.5 to 8 GB of memory per vCPU. These VMs can be used for some
of the front-end workloads and for tool infrastructure with medium performance requirements.
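As a sketch of how such shapes might be chosen programmatically, the helper below clamps a job's requirements to the vCPU and memory-per-vCPU ranges quoted above; `pick_n2_shape` is a hypothetical utility for illustration, not a Google Cloud API:

```python
def pick_n2_shape(job_threads: int, mem_gb_needed: float) -> tuple[int, float]:
    """Pick a custom N2-style shape for an EDA job, within the limits
    quoted above: 2 to 80 vCPUs and 0.5 to 8 GB of memory per vCPU."""
    vcpus = max(2, min(80, job_threads))
    mem_per_vcpu = mem_gb_needed / vcpus
    # Clamp memory-per-vCPU into the supported band.
    mem_per_vcpu = max(0.5, min(8.0, mem_per_vcpu))
    return vcpus, round(vcpus * mem_per_vcpu, 1)

# A memory-hungry simulation job: 16 threads, 96 GB of RAM.
vcpus, mem_gb = pick_n2_shape(job_threads=16, mem_gb_needed=96)
print(f"custom shape: {vcpus} vCPUs, {mem_gb} GB")  # 16 vCPUs, 96.0 GB
```

Matching the shape to the job, rather than procuring one worst-case machine type, is what makes custom shapes attractive for heterogeneous EDA workloads.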
VM instances can attach either HDD or SSD persistent disks (PDs). Performance scales with size, up to 100k read
and 30k write IOPS, with a maximum sustained throughput of 1.2 GB/s read and 800 MB/s write.
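This scale-with-size behavior can be illustrated with a small calculator. The IOPS caps come from the text above, while the 30 IOPS/GB rate is an illustrative assumption (consult current Persistent Disk documentation for real per-GB values):

```python
def pd_iops(size_gb: float, iops_per_gb: float, cap: int) -> int:
    """Persistent-disk IOPS scale linearly with size until they hit a cap,
    as described above (100k read / 30k write for SSD PDs)."""
    return min(int(size_gb * iops_per_gb), cap)

# Hypothetical per-GB rate of 30 IOPS/GB; caps from the text.
for size in (500, 2000, 5000):
    print(size, "GB ->", pd_iops(size, 30, 100_000), "read IOPS,",
          pd_iops(size, 30, 30_000), "write IOPS")
```

The practical consequence for EDA storage planning: a disk is sometimes sized larger than its capacity needs purely to buy more IOPS, until the cap makes further growth pointless.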
Google Cloud is the first major public cloud to offer a tiered cloud network. With the Premium Tier, the default
network offering, traffic is delivered over Google's well-provisioned, low-latency, highly reliable global network. This
network consists of an extensive global private fiber network with over 100 points of presence (POPs) across the
globe.
The ultimate goal of the network is to maximize bandwidth to Compute Engine VMs. The latest generation of
Andromeda, version 2.2, provides an 18x increase in VM-to-VM bandwidth and an 8x reduction in latency without
introducing any downtime. The maximum network egress bandwidth is raised to 32 Gbps for same-zone VM-to-VM
traffic on all common VM types. Further, Google Cloud users also benefit from improved performance isolation
through the use of hardware offloads. This enables the Compute Engine guest VM to bypass both the hypervisor and
the software switch, and talk directly to the Network Interface Card (NIC).
The SDN layer is unique in that it allows for zero-copy payload transfer from VM memory directly to the NIC, and also
allows for improving performance and efficiency under the hood without requiring the use of SR-IOV or other
specifications that tie a VM to a physical machine for its lifetime.
Cloud Interconnect
Most semiconductor customers require private connectivity from their data centers to the cloud. Cloud Interconnect
extends the on-premises network to Google's network through a highly available, low-latency connection.
Customers have the choice of Dedicated Interconnect or Partner Interconnect. Dedicated Interconnect connects
directly to Google, whereas Partner Interconnect connects to Google through a supported service provider. In both
cases, traffic does not traverse the public internet.
Security is often one of the biggest concerns for anyone moving to the cloud. For semiconductor companies in
particular, the priority is ensuring that design IP (intellectual property) is protected.
From the beginning, Google has worked to make its services both secure and reliable. The underlying infrastructure
doesn't rely on any single technology to make it secure; rather, security is built through progressive layers that deliver
true defense in depth. Starting from the bottom, Google builds its own custom hardware. The same goes for
software, including low-level software and the OS. Further, hardware is designed to include components specifically
for security, such as Titan, a custom security chip. All of this rolls up to custom data center designs, which include
multiple layers of physical and logical protection.
Data is automatically encrypted at rest and in transit by default, and is also distributed for availability and reliability.
Communications over the internet to the public cloud services are all encrypted. Further, since encryption is one of
the most expensive components of packet processing, Andromeda 2.2 utilizes special-purpose network hardware in
the NIC to offload that encryption, freeing the host machine's CPUs and running the guest vCPUs more efficiently.
With Google Cloud, customers have full control and sole ownership of their data. With Access Transparency logs,
Google Cloud is the only major cloud to provide near-real-time logs of Google administrator access on Google
Cloud. Customers can also choose to manage their own encryption keys using Google Cloud's Key Management
Service. Furthermore, with Cloud Data Loss Prevention (DLP), customers can identify, redact, and prevent sensitive
or private information from leaking outside of an organization.
Big Data, AI, and ML are proven technologies for optimizing flows. Chip design and manufacturing flows generate
lots of data, but systematic collection, analysis, and insights are difficult [2]. GCP offers several scalable solutions
for building robust data collection and analysis frameworks, such as Cloud SQL and Cloud Spanner for relational
database services, BigQuery for scalable data warehousing, and Bigtable for wide-column databases.
Almost every design flow uses databases, but most are home-grown efforts built on open-source technologies. Such
solutions are often brittle and have difficulty scaling to multiple sites and increased usage. By migrating their
backend database servers to a managed service like Cloud SQL, users can improve engineer productivity.
AI/ML is also enabling innovative approaches to problem solving in many domains, including semiconductors. In
chip design and manufacturing, ML techniques are being used successfully to achieve better quality of results
faster. ML techniques are also being applied with success in semiconductor wafer testing. Semiconductor
organizations have realized the potential of introducing AI/ML into their flows and have created dedicated teams to
enable it.
Chapter 4
Google Cloud reference architecture for EDA workloads
When it comes to selecting the reference architecture, several options exist. Every user situation is unique, and the
guidelines below are helpful in determining the architecture. First, some basic deployment models are described,
followed by a reference architecture that supports the model.
Burst Model: A determination is made at runtime whether a tool or flow runs in the cloud or on-premises. A similar
setup exists in the cloud and on-premises to enable a seamless runtime environment. Setup example.
Hybrid Model: A subset of projects or flows run entirely in the cloud; other projects stay on-premises. Infrastructure
setup is similar to All-on-Cloud. Setup example.
All-on-Cloud: All infrastructure (compute, storage, networking, license server) is in the cloud, with no dependency on
on-prem infrastructure. Setup example.
• Engineers login as usual, and submit a job to the on-prem job scheduler
• If the job scheduler finds that the on-prem cluster cannot meet the compute requirement, it sends the
request to the cloud via a resource connector
• Job executes on the cloud, and the results are returned to the user for next steps
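The burst flow above can be sketched as a scheduler decision; `route_job` and the slot accounting below are hypothetical placeholders, not a real scheduler or resource-connector API:

```python
def route_job(job_slots: int, free_onprem_slots: int) -> str:
    """Decide where a submitted job runs, following the burst flow above:
    try the on-prem cluster first, burst to the cloud when it cannot fit."""
    if job_slots <= free_onprem_slots:
        return "on-prem"
    # Hypothetical resource connector: provision cloud capacity for the job.
    return "cloud-burst"

# A regression run submitted against a 100-slot on-prem cluster.
free = 100
for job in (40, 50, 30):  # slot requests, in submission order
    target = route_job(job, free)
    if target == "on-prem":
        free -= job
    print(f"job needing {job} slots -> {target}")
```

The key property is that engineers see one submission interface: the routing decision, not the user, absorbs the difference between on-prem and cloud capacity.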
An architecture capable of supporting All-on-Cloud mode is shown below in Figure 8. Further, a complete example
with scripts to demonstrate this mode of operation can be found in this repo.
The architecture shown in Figure 9 can be enhanced by adding Big Data and ML capabilities.
Chapter 5
Making a difference — where it
matters
Sustainability
Google operates the cleanest cloud in the industry. We believe in sustainability and have over a decade of
investment in operating responsibly. We have been carbon neutral since 2007 and have matched 100% of our
electricity consumption with renewable energy purchases since 2017. We recently also committed to fully
decarbonize our electricity supply by 2030 and operate on clean energy, every hour and in every region.
You have many things to consider in the cloud platform you choose: its price, security, openness, and products. We
believe you should consider the environment too. When you choose Google Cloud, your digital footprint is offset
with clean energy, making your compute impact net zero. By moving compute from a self-managed data center or
colocation facility to Google Cloud, the net emissions directly associated with your company's compute and data
storage will be zero.
Ecosystem
The semiconductor chip design and manufacturing ecosystem is large, vibrant, and evolving. At Google Cloud, we
believe in building strong partnerships and nurturing them for long-term success. We have ongoing partnerships
with major EDA vendors, storage and infrastructure vendors, and foundries. We actively participate in industry events
to grow awareness of cloud technologies for chip design.
Cost Optimization
In addition to custom VM shapes, GCP offers two strong mechanisms to optimize EDA infrastructure for cost.
Preemptible VMs
With PVMs, GCP offers the same VMs for a fraction of the cost (up to an 80% discount). A PVM can be stopped
(preempted) by Compute Engine and is available for a maximum of 24 hours. With a fault-tolerant architecture,
PVMs provide a very cost-effective solution for running batch, checkpointed, and high-throughput EDA jobs.
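A back-of-the-envelope comparison shows why this matters. The hourly rate and re-run overhead below are illustrative assumptions, while the up-to-80% discount comes from the text:

```python
def batch_cost(core_hours: float, rate_per_core_hour: float,
               discount: float = 0.0, rerun_overhead: float = 0.0) -> float:
    """Cost of a batch of EDA jobs. `rerun_overhead` models extra core-hours
    spent re-running work lost to preemptions; checkpointing keeps it small."""
    effective_hours = core_hours * (1.0 + rerun_overhead)
    return effective_hours * rate_per_core_hour * (1.0 - discount)

# 10,000 core-hours of regression at an illustrative $0.04/core-hour.
on_demand = batch_cost(10_000, 0.04)
# Preemptible at the up-to-80% discount, assuming 10% re-run overhead.
preemptible = batch_cost(10_000, 0.04, discount=0.80, rerun_overhead=0.10)
print(f"on-demand: ${on_demand:.2f}, preemptible: ${preemptible:.2f}")
```

Even with a modest re-run penalty, preemptible capacity wins decisively for interruptible work, which is why it pairs so well with checkpointed batch regressions.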
Rightsizing Recommendations
GCP offers a built-in mechanism for additional optimization recommendations once you start running your jobs.
Rightsizing insights provide recommendations for updating a VM's shape when the system detects underutilized
VMs. This gives admins an additional mechanism to keep EDA infrastructure costs constantly under control.
Conclusion
Infrastructure modernization is the cornerstone of the digital transformation journey. Semiconductor design and
fabrication is a complex and compute-intensive process with several challenges that impede the ability to design
faster. Google Cloud's solutions in compute, networking, storage, data management, and AI/ML enable the
semiconductor industry to scale its operations by providing secure access to the latest infrastructure when needed,
and to differentiate its flows with ML-based techniques.
References
Articles
1. M. Dalton, D. Schultz, J. Adriaens, A. Arefin, A. Gupta, B. Fahs, D. Rubinstein, E. C. Zermeno, E. Rubow, J. A.
Docauer, J. Alpert, J. Ai, J. Olson, K. DeCabooter, M. de Kruijf, N. Hua, N. Lewis, N. Kasinadhuni, R. Crepaldi,
S. Krishnan, S. Venkata, Y. Richter, U. Naik, and A. Vahdat. Andromeda: Performance, Isolation, and Velocity
at Scale in Cloud Network Virtualization. In NSDI, 2018.
2. S. Obilisetty, "Digital intelligence and chip design," 2018 International Symposium on VLSI Design,
Automation and Test (VLSI-DAT), Hsinchu, 2018, pp. 1-4, doi: 10.1109/VLSI-DAT.2018.8373256.
Links
1. https://www.fierceelectronics.com/electronics/synopsys-google-cloud-team-cloud-based-functional-verification
2. https://cloud.google.com/blog/products/ai-machine-learning/ai-and-machine-learning-improve-manufacturing-visual-inspection-process
3. https://cloud.google.com/blog/products/networking/google-cloud-networking-in-depth-how-andromeda-2-2-enables-high-throughput-vms
4. https://cloud.google.com/solutions/filers-on-compute-engine
5. https://cloud.google.com/compute/docs/nodes/sole-tenant-nodes
6. https://cloud.google.com/blog/products/storage-data-transfer/introducing-lustre-file-system-cloud-deployment-manager-scripts