HPC Module 1

MODULE 1

1. Classes of Computers

1. Personal Mobile Device (PMD)

Personal mobile device (PMD) is the term we apply to a collection of wireless devices with multimedia user interfaces such as cell phones, tablet computers, and so on. Cost is a prime concern given the consumer price for the whole product is a few hundred dollars. Although the emphasis on energy efficiency is frequently driven by the use of batteries, the need to use less expensive packaging (plastic versus ceramic) and the absence of a fan for cooling also limit total power consumption. Responsiveness and predictability are key characteristics for media applications.
A real-time performance requirement means a segment of the application has an absolute maximum
execution time. For example, in playing a video on a PMD, the time to process each video frame is
limited, since the processor must accept and process the next frame shortly. In some applications, a
more nuanced requirement exists: the average time for a particular task is constrained as well as the
number of instances when some maximum time is exceeded. Such approaches—sometimes called
soft real-time—arise when it is possible to occasionally miss the time constraint on an event, as long
as not too many are missed. Real-time performance tends to be highly application dependent. Other
key characteristics in many PMD applications are the need to minimize memory and the need to use
energy efficiently. Energy efficiency is driven by both battery power and heat dissipation. The memory
can be a substantial portion of the system cost, and it is important to optimize memory size in such
cases. The importance of memory size translates to an emphasis on code size, since data size is
dictated by the application.

2. Desktop Computing

The first, and probably still the largest market in dollar terms, is desktop computing. Desktop
computing spans from low-end netbooks that sell for under $300 to high-end, heavily configured
workstations that may sell for $2500. Since 2008, more than half of the desktop computers made each
year have been battery operated laptop computers. Throughout this range in price and capability, the
desktop market tends to be driven to optimize price-performance. This combination of performance
(measured primarily in terms of compute performance and graphics performance) and price of a
system is what matters most to customers in this market, and hence to computer designers. As a
result, the newest, highest-performance microprocessors and cost-reduced microprocessors often
appear first in desktop systems. Desktop computing also tends to be reasonably well characterized in
terms of applications and benchmarking, though the increasing use of Web-centric, interactive applications poses new challenges in performance evaluation.

3. Servers

As the shift to desktop computing occurred in the 1980s, the role of servers grew to provide larger-
scale and more reliable file and computing services. Such servers have become the backbone of large-
scale enterprise computing, replacing the traditional mainframe. Consider the servers running ATM
machines for banks or airline reservation systems. Failure of such server systems is far more
catastrophic than failure of a single desktop, since these servers must operate seven days a week, 24
hours a day. A second key feature of server systems is scalability. Server systems often grow in
response to an increasing demand for the services they support or an increase in functional
requirements. Thus, the ability to scale up the computing capacity, the memory, the storage, and the
I/O bandwidth of a server is crucial. Finally, servers are designed for efficient throughput. That is, the
overall performance of the server—in terms of transactions per minute or Web pages served per
second—is what is crucial. Responsiveness to an individual request remains important, but overall
efficiency and cost-effectiveness, as determined by how many requests can be handled in a unit time,
are the key metrics for most servers.

4. Clusters/Warehouse-Scale Computers

The growth of Software as a Service (SaaS) for applications like search, social networking, video
sharing, multiplayer games, online shopping, and so on has led to the growth of a class of computers
called clusters. Clusters are collections of desktop computers or servers connected by local area
networks to act as a single larger computer. Each node runs its own operating system, and nodes
communicate using a networking protocol. The largest of the clusters are called warehouse-scale
computers (WSCs), in that they are designed so that tens of thousands of servers can act as one.
Supercomputers are related to WSCs in that they are equally expensive, costing hundreds of millions
of dollars, but supercomputers differ by emphasizing floating-point performance and by running large,
communication-intensive batch programs that can run for weeks at a time. This tight coupling leads
to use of much faster internal networks. In contrast, WSCs emphasize interactive applications, large-
scale storage, dependability, and high Internet bandwidth.

5. Embedded Computers

Embedded computers are found in everyday machines; microwaves, washing machines, most printers,
most networking switches, and all cars contain simple embedded microprocessors. The processors in
a PMD are often considered embedded computers, but we are keeping them as a separate category
because PMDs are platforms that can run externally developed software and they share many of the
characteristics of desktop computers. Other embedded devices are more limited in hardware and
software sophistication. We use the ability to run third-party software as the dividing line between
non-embedded and embedded computers. Embedded computers have the widest spread of
processing power and cost. They include 8-bit and 16-bit processors that may cost less than a dime,
32-bit microprocessors that execute 100 million instructions per second and cost under $5, and high-
end processors for network switches that cost $100 and can execute billions of instructions per
second. Although the range of computing power in the embedded computing market is very large,
price is a key factor in the design of computers for this space. Performance requirements do exist, of
course, but the primary goal is often meeting the performance need at a minimum price, rather than
achieving higher performance at a higher price. Most of this book applies to the design, use, and
performance of embedded processors, whether they are off-the-shelf microprocessors or
microprocessor cores that will be assembled with other special-purpose hardware.
2. Classes of Parallelism and Parallel Architectures
Parallelism at multiple levels is now the driving force of computer design across all four classes of
computers, with energy and cost being the primary constraints.

There are basically two kinds of parallelism in applications:

1. Data-Level Parallelism (DLP) arises because there are many data items that can be operated on at
the same time.

2. Task-Level Parallelism (TLP) arises because tasks of work are created that can operate
independently and largely in parallel.

Computer hardware in turn can exploit these two kinds of application parallelism in four major ways:

1. Instruction-Level Parallelism exploits data-level parallelism at modest levels with compiler help using ideas like pipelining and at medium levels using ideas like speculative execution.

2. Vector Architectures and Graphic Processor Units (GPUs) exploit data-level parallelism by applying
a single instruction to a collection of data in parallel.

3. Thread-Level Parallelism exploits either data-level parallelism or task-level parallelism in a tightly coupled hardware model that allows for interaction among parallel threads.

4. Request-Level Parallelism exploits parallelism among largely decoupled tasks specified by the
programmer or the operating system.
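
As a rough illustration of the DLP/TLP distinction, the sketch below applies one operation to many data items (data-level parallelism) and runs two independent tasks concurrently (task-level parallelism). Python is used only as an illustration here, and the two task functions are invented for the example; real hardware exploits these patterns with vector units, GPUs, and multiple cores.

from concurrent.futures import ThreadPoolExecutor

# Data-level parallelism: the same operation applied to many data items.
# A vector unit or GPU could perform these element-wise multiplies in parallel;
# the comprehension only expresses the pattern, not real parallel hardware.
data = [1, 2, 3, 4, 5, 6, 7, 8]
scaled = [2 * x for x in data]

# Task-level parallelism: independent units of work that can run concurrently.
# index_documents and compress_logs are hypothetical, unrelated tasks.
def index_documents():
    return "indexing finished"

def compress_logs():
    return "compression finished"

with ThreadPoolExecutor(max_workers=2) as pool:
    results = [f.result() for f in (pool.submit(index_documents),
                                    pool.submit(compress_logs))]

print(scaled)
print(results)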

Parallel computers are classified into four categories (Flynn's taxonomy):

1. Single instruction stream, single data stream (SISD)—This category is the uniprocessor. The
programmer thinks of it as the standard sequential computer, but it can exploit instruction-level
parallelism. SISD architectures can use ILP techniques such as superscalar and speculative execution.

2. Single instruction stream, multiple data streams (SIMD)—The same instruction is executed by
multiple processors using different data streams. SIMD computers exploit data-level parallelism by
applying the same operations to multiple items of data in parallel. Each processor has its own data
memory (hence the MD of SIMD), but there is a single instruction memory and control processor,
which fetches and dispatches instructions.

3. Multiple instruction streams, single data stream (MISD)—No commercial multiprocessor of this
type has been built to date, but it rounds out this simple classification.

4. Multiple instruction streams, multiple data streams (MIMD)—Each processor fetches its own
instructions and operates on its own data, and it targets task-level parallelism. In general, MIMD is
more flexible than SIMD and thus more generally applicable, but it is inherently more expensive than
SIMD. For example, MIMD computers can also exploit data-level parallelism, although the overhead
is likely to be higher than would be seen in an SIMD computer. This overhead means that grain size
must be sufficiently large to exploit the parallelism efficiently.
3. Defining Computer Architecture
The task the computer designer faces is a complex one: Determine what attributes are important
for a new computer, then design a computer to maximize performance and energy efficiency while
staying within cost, power, and availability constraints. This task has many aspects, including
instruction set design, functional organization, logic design, and implementation. The
implementation may encompass integrated circuit design, packaging, power, and cooling.
Optimizing the design requires familiarity with a very wide range of technologies, from compilers
and operating systems to logic design and packaging. Several years ago, the term computer
architecture often referred only to instruction set design. Other aspects of computer design were
called implementation, often insinuating that implementation is uninteresting or less challenging.

3.1 Instruction Set Architecture: The Myopic View of Computer Architecture

The ISA serves as the boundary between the software and hardware. This quick review of ISA will use
examples from 80x86, ARM, and MIPS to illustrate the seven dimensions of an ISA.

1. Class of ISA—Nearly all ISAs today are classified as general-purpose register architectures, where the operands are either registers or memory locations. The 80x86 has 16 general-purpose registers and 16 that can hold floating-point data, while MIPS has 32 general-purpose and
32 floating-point registers. The two popular versions of this class are register-memory ISAs, such as
the 80x86, which can access memory as part of many instructions, and load-store ISAs, such as ARM
and MIPS, which can access memory only with load or store instructions. All recent ISAs are load-store.

2. Memory addressing—Virtually all desktop and server computers, including the 80x86, ARM, and
MIPS, use byte addressing to access memory operands. Some architectures, like ARM and MIPS,
require that objects must be aligned. An access to an object of size s bytes at byte address A is aligned
if A mod s = 0. The 80x86 does not require alignment, but accesses are generally faster if operands are
aligned.
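
As a quick sketch of the alignment rule (the addresses and access size below are arbitrary example values, not taken from the text):

def is_aligned(address: int, size: int) -> bool:
    """An access of size s bytes at byte address A is aligned if A mod s == 0."""
    return address % size == 0

# A 4-byte (word) access at address 8 is aligned; the same access at address 6 is not.
print(is_aligned(8, 4))   # True
print(is_aligned(6, 4))   # False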

3. Addressing modes—In addition to specifying registers and constant operands, addressing modes
specify the address of a memory object. MIPS addressing modes are Register, Immediate (for
constants), and Displacement, where a constant offset is added to a register to form the memory
address. The 80x86 supports those three plus three variations of displacement: no register (abso-

lute), two registers (based indexed with displacement), and two registers where one register is
multiplied by the size of the operand in bytes (based with scaled index and displacement). It has more
like the last three, minus the displacement field, plus register indirect, indexed, and based with scaled
index. ARM has the three MIPS addressing modes plus PC-relative addressing, the sum of two
registers, and the sum of two registers where one register is multiplied by the size of the operand in
bytes. It also has auto increment and auto decrement addressing, where the calculated address
replaces the contents of one of the registers used in forming the address.
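
The displacement and scaled-index modes amount to simple address arithmetic. The register contents below are invented example values, used only to show how each mode forms an effective address:

# Hypothetical example values.
base = 0x1000        # contents of the base register
index = 3            # contents of the index register
disp = 16            # constant displacement encoded in the instruction
operand_size = 8     # operand size in bytes (e.g., a double word)

# Displacement: a constant offset is added to a register.
ea_displacement = base + disp

# Based indexed with displacement (80x86): two registers plus a displacement.
ea_based_indexed = base + index + disp

# Based with scaled index and displacement: one register is multiplied
# by the operand size in bytes before being added.
ea_scaled = base + index * operand_size + disp

print(hex(ea_displacement), hex(ea_based_indexed), hex(ea_scaled))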

4. Types and sizes of operands—Like most ISAs, 80x86, ARM, and MIPS support operand sizes of 8-bit
(ASCII character), 16-bit (Unicode character or half word), 32-bit (integer or word), 64-bit (double word
or long integer), and IEEE 754 floating point in 32-bit (single precision) and 64-bit (double precision).
The 80x86 also supports 80-bit floating point (extended double precision).
5. Operations—The general categories of operations are data transfer, arithmetic logical, control
(discussed next), and floating point. MIPS is a simple and easy-to-pipeline instruction set architecture,
and it is representative of the RISC architectures being used in 2011. Figure 1.5 summarizes the MIPS
ISA. The 80x86 has a much richer and larger set of operations.

6. Control flow instructions—Virtually all ISAs, including these three, support conditional branches,
unconditional jumps, procedure calls, and returns. All three use PC-relative addressing, where the
branch address is specified by an address field that is added to the PC. There are some small
differences. MIPS conditional branches (BE, BNE, etc.) test the contents of registers, while the 80x86 and ARM branches test condition code bits set as side effects of arithmetic/logic operations.
The ARM and MIPS procedure call places the return address in a register, while the 80x86 call (CALLF)
places the return address on a stack in memory.

7. Encoding an ISA—There are two basic choices on encoding: fixed length and variable length. All
ARM and MIPS instructions are 32 bits long, which simplifies instruction decoding. Figure 1.6 shows
the MIPS instruction formats. The 80x86 encoding is variable length, ranging from 1 to 18 bytes.
Variable length instructions can take less space than fixed-length instructions, so a program compiled
for the 80x86 is usually smaller than the same program compiled for MIPS. For example, the number of registers and the number of addressing modes both
have a significant impact on the size of instructions, as the register field and addressing mode field
can appear many times in a single instruction. (Note that ARM and MIPS later offered extensions to
offer 16-bit length instructions so as to reduce program size, called Thumb or Thumb-2 and MIPS16,
respectively.) The other challenges facing the computer architect beyond ISA design are particularly
acute at the present, when the differences among instruction sets are small and when there are
distinct application areas.

3.2 Genuine Computer Architecture: Designing the Organization and Hardware to Meet
Goals and Functional Requirements

The implementation of a computer has two components: organization and hardware. The term
organization includes the high-level aspects of a computer’s design, such as the memory system, the
memory interconnect, and the design of the internal processor or CPU (central processing unit—
where arithmetic, logic, branching, and data transfer are implemented). The term microarchitecture
is also used instead of organization. For example, two processors with the same instruction set
architectures but different organizations are the AMD Opteron and the Intel Core i7. Both processors
implement the x86 instruction set, but they have very different pipeline and cache organizations. The switch to multiple processors per microprocessor led to the term core also being used for a processor. Instead of saying multiprocessor microprocessor, the term multicore has caught on. Given that virtually all chips have multiple processors, the term central processing unit, or CPU, is fading in
popularity. Hardware refers to the specifics of a computer, including the detailed logic design and the
packaging technology of the computer. Often a line of computers contains computers with identical
instruction set architectures and nearly identical organizations, but they differ in the detailed
hardware implementation. For example, the Intel Core i7 and the Intel Xeon 7560 are nearly identical
but offer different clock rates and different memory systems, making the Xeon 7560 more effective
for server computers.
Dependability
Computers are designed and constructed at different layers of abstraction. We can descend
recursively down through a computer seeing components enlarge themselves to full subsystems until
we run into individual transistors. Although some faults are widespread, like the loss of power, many
can be limited to a single component in a module. Thus, utter failure of a module at one level may be
considered merely a component error in a higher-level module. This distinction is helpful in trying to
find ways to build dependable computers. One difficult question is deciding when a system is
operating properly. This philosophical point became concrete with the popularity of Internet services.
Infrastructure providers started offering service level agreements (SLAs) or service level objectives
(SLOs) to guarantee that their networking or power service would be dependable. For example, they
would pay the customer a penalty if they did not meet an agreement more than some hours per
month. Thus, an SLA could be used to decide whether the system was up or down.

Systems alternate between two states of service with respect to an SLA:

1. Service accomplishment, where the service is delivered as specified

2. Service interruption, where the delivered service is different from the SLA

 Module reliability is a measure of the continuous service accomplishment (or, equivalently, of the time to failure) from a reference initial instant. Hence, the mean time to failure (MTTF) is
a reliability measure. The reciprocal of MTTF is a rate of failures, generally reported as failures
per billion hours of operation, or FIT (for failures in time). Thus, an MTTF of 1,000,000 hours
equals 10^9/10^6 or 1000 FIT. Service interruption is measured as mean time to repair (MTTR).
Mean time between failures (MTBF) is simply the sum of MTTF + MTTR. Although MTBF is
widely used, MTTF is often the more appropriate term. If a collection of modules has
exponentially distributed lifetimes—meaning that the age of a module is not important in
probability of failure—the overall failure rate of the collection is the sum of the failure rates
of the modules.
 Module availability is a measure of the service accomplishment with respect to the alternation
between the two states of accomplishment and interruption. For nonredundant systems with
repair, module availability is
Module availability = MTTF / (MTTF + MTTR)
Note that reliability and availability are now quantifiable metrics, rather than synonyms for
dependability. From these definitions, we can estimate reliability of a system quantitatively if
we make some assumptions about the reliability of components and that failures are
independent.
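
These definitions can be checked with a short calculation. The 1,000,000-hour MTTF matches the figure used above; the MTTR value is an assumed example:

def fit_rate(mttf_hours: float) -> float:
    """FIT: failures per billion (10^9) hours of operation, the reciprocal of MTTF."""
    return 1e9 / mttf_hours

def module_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Module availability = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

mttf = 1_000_000   # hours, as in the text: 10^9 / 10^6 = 1000 FIT
mttr = 24          # hours, assumed repair time for illustration

print(fit_rate(mttf))                    # 1000.0 FIT
print(module_availability(mttf, mttr))   # about 0.99998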

4. Quantitative Principles of Computer Design

4.1 Take Advantage of Parallelism


Taking advantage of parallelism is one of the most important methods for improving
performance. Every chapter in this book has an example of how performance is enhanced
through the exploitation of parallelism. We give three brief examples here, which are
expounded on in later chapters. Our first example is the use of parallelism at the system level.
To improve the throughput performance on a typical server benchmark, such as SPECWeb or
TPC-C, multiple processors and multiple disks can be used. The workload of handling requests
can then be spread among the processors and disks, resulting in improved throughput. Being
able to expand memory and the number of processors and disks is called scalability, and it is
a valuable asset for servers. Spreading of data across many disks for parallel reads and writes
enables data-level parallelism. SPECWeb also relies on request-level parallelism to use many
processors while TPC-C uses thread-level parallelism for faster processing of database queries.
At the level of an individual processor, taking advantage of parallelism among instructions is
critical to achieving high performance. One of the simplest ways to do this is through
pipelining. The basic idea behind pipelining is to overlap instruction execution to reduce the
total time to complete an instruction sequence. A key insight that allows pipelining to work is
that not every instruction depends on its immediate predecessor, so executing the
instructions completely or partially in parallel may be possible.
Pipelining is the best-known example of instruction-level parallelism. Parallelism can also be
exploited at the level of detailed digital design. For example, set-associative caches use
multiple banks of memory that are typically searched in parallel to find a desired item. Modern
ALUs (arithmetic-logical units) use carry-lookahead, which uses parallelism to speed the
process of computing sums from linear to logarithmic in the number of bits per operand.
These are more examples of data-level parallelism.

4.2 Principle of Locality


Important fundamental observations have come from properties of programs. The most
important program property that we regularly exploit is the principle of locality: Programs
tend to reuse data and instructions they have used recently. A widely held rule of thumb is
that a program spends 90% of its execution time in only 10% of the code. An implication of
locality is that we can predict with reasonable accuracy what instructions and data a program
will use in the near future based on its accesses in the recent past. The principle of locality
also applies to data accesses, though not as strongly as to code accesses. Two different types
of locality have been observed. Temporal locality states that recently accessed items are likely
to be accessed in the near future. Spatial locality says that items whose addresses are near
one another tend to be referenced close together in time.

4.3 Focus on the Common Case

Perhaps the most important and pervasive principle of computer design is to focus on the
common case: In making a design trade-off, favor the frequent case over the infrequent case.
This principle applies when determining how to spend resources, since the impact of the
improvement is higher if the occurrence is frequent. Focusing on the common case works for
power as well as for resource allocation and performance. The instruction fetch and decode
unit of a processor may be used much more frequently than a multiplier, so optimize it first.
It works on dependability as well. If a database server has 50 disks for every processor, storage
dependability will dominate system dependability. In addition, the frequent case is often
simpler and can be done faster than the infrequent case. For example, when adding two
numbers in the processor, we can expect overflow to be a rare circumstance and can therefore
improve performance by optimizing the more common case of no overflow. This emphasis
may slow down the case when overflow occurs, but if that is rare then overall performance
will be improved by optimizing for the normal case.
4.4 Amdahl’s Law
The performance gain that can be obtained by improving some portion of a computer can be
calculated using Amdahl’s law. Amdahl’s law states that the performance improvement to be
gained from using some faster mode of execution is limited by the fraction of the time the
faster mode can be used. Amdahl’s law defines the speedup that can be gained by using a
particular feature. What is speedup? Suppose that we can make an enhancement to a
computer that will improve performance when it is used. Speedup is the ratio:

Speedup = (Performance for entire task using the enhancement when possible) / (Performance for entire task without using the enhancement)

Alternatively,

Speedup = (Execution time for entire task without using the enhancement) / (Execution time for entire task using the enhancement when possible)
Speedup tells us how much faster a task will run using the computer with the enhancement
as opposed to the original computer. Amdahl’s law gives us a quick way to find the speedup
from some enhancement, which depends on two factors:
1. The fraction of the computation time in the original computer that can be converted to
take advantage of the enhancement—For example, if 20 seconds of the execution time of
a program that takes 60 seconds in total can use an enhancement, the fraction is 20/60.
This value, which we will call Fraction_enhanced, is always less than or equal to 1.

2. The improvement gained by the enhanced execution mode, that is, how much faster the
task would run if the enhanced mode were used for the entire program—This value is the
time of the original mode over the time of the enhanced mode. If the enhanced mode
takes, say, 2 seconds for a portion of the program, while it is 5 seconds in the original
mode, the improvement is 5/2. We will call this value, which is always greater than 1,
Speedup_enhanced.

The execution time using the original computer with the enhanced mode will be the time spent using the unenhanced portion of the computer plus the time spent using the enhancement:

Execution time_new = Execution time_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

The overall speedup is the ratio of the two execution times:

Speedup_overall = Execution time_old / Execution time_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
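
A minimal sketch of Amdahl's law using the numbers from the example above (20 of 60 seconds can use the enhancement, and the enhanced mode is 5/2 times faster for that portion):

def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

fraction_enhanced = 20 / 60   # 20 of the 60 seconds can use the enhancement
speedup_enhanced = 5 / 2      # the enhanced mode is 2.5 times faster for that portion

print(amdahl_speedup(fraction_enhanced, speedup_enhanced))   # 1.25

As a cross-check, the new execution time is 40 + 20/2.5 = 48 seconds, so the overall speedup is 60/48 = 1.25, matching the formula.
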
5. Basics of Memory Hierarchies

The increasing size and thus importance of the processor–memory performance gap led to the migration of the basics of memory
hierarchy into undergraduate courses in computer architecture, and even to courses in operating
systems and compilers. Thus, we’ll start with a quick review of caches and their operation. The bulk of
the chapter, however, describes more advanced innovations that attack the processor–memory
performance gap. When a word is not found in the cache, the word must be fetched from a lower
level in the hierarchy (which may be another cache or the main memory) and placed in the cache
before continuing. Multiple words, called a block (or line), are moved for efficiency reasons, and
because they are likely to be needed soon due to spatial locality. Each cache block includes a tag to
indicate which memory address it corresponds to.

A key design decision is where blocks (or lines) can be placed in a cache. The most popular scheme is
set associative, where a set is a group of blocks in the cache. A block is first mapped onto a set, and
then the block can be placed anywhere within that set. Finding a block consists of first mapping the
block address to the set and then searching the set—usually in parallel—to find the block. The set is
chosen by the address of the data:

(Block address) MOD (Number of sets in cache)

If there are n blocks in a set, the cache placement is called n-way set associative. The end points of
set associativity have their own names. A direct-mapped cache has just one block per set (so a block
is always placed in the same location), and a fully associative cache has just one set (so a block can
be placed anywhere).
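
A minimal sketch of the set-mapping rule follows; the cache geometry and block address are assumed example values. It also shows how direct-mapped and fully associative caches fall out as the two end points:

def set_index(block_address: int, num_sets: int) -> int:
    """A block maps to set (Block address) MOD (Number of sets in cache)."""
    return block_address % num_sets

total_blocks = 8      # assumed cache size in blocks
block_address = 13    # assumed block address

# 2-way set associative: 8 blocks / 2 ways = 4 sets.
print(set_index(block_address, 4))             # set 1

# Direct-mapped: one block per set, so the number of sets equals the number of blocks.
print(set_index(block_address, total_blocks))  # set 5

# Fully associative: a single set, so every block maps to set 0.
print(set_index(block_address, 1))             # set 0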

Caching data that is only read is easy, since the copy in the cache and memory will be identical.
Caching writes is more difficult; for example, how can the copy in the cache and memory be kept
consistent? There are two main strategies. A write-through cache updates the item in the cache and
writes through to update main memory.

A write-back cache only updates the copy in the cache. When the block is about to be replaced, it is
copied back to memory. Both write strategies can use a write buffer to allow the cache to proceed as
soon as the data are placed in the buffer rather than wait the full latency to write the data into
memory. One measure of the benefits of different cache organizations is miss rate. Miss rate is simply
the fraction of cache accesses that result in a miss—that is, the number of accesses that miss divided
by the number of accesses. To gain insights into the causes of high miss rates, which can inspire better
cache designs, the three Cs model sorts all misses into three simple categories:

 Compulsory—The very first access to a block cannot be in the cache, so the block must be
brought into the cache. Compulsory misses are those that occur even if you had an infinite
sized cache.
 Capacity—If the cache cannot contain all the blocks needed during execution of a program,
capacity misses (in addition to compulsory misses) will occur because of blocks being
discarded and later retrieved.
 Conflict—If the block placement strategy is not fully associative, conflict misses (in addition
to compulsory and capacity misses) will occur because a block may be discarded and later
retrieved if multiple blocks map to its set and accesses to the different blocks are
intermingled.

Miss rate can be a misleading measure for several reasons. Hence, some designers prefer measuring misses per instruction rather than misses per memory reference (miss rate). These two are related:

Misses per instruction = (Miss rate × Memory accesses) / Instruction count = Miss rate × Memory accesses per instruction

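A quick sketch of this relation, using assumed example counts:

# Assumed example counts for one program run.
instructions = 1_000_000
memory_accesses = 1_400_000   # about 1.4 memory references per instruction
cache_misses = 28_000

miss_rate = cache_misses / memory_accesses                    # misses per memory reference
accesses_per_instruction = memory_accesses / instructions
misses_per_instruction = miss_rate * accesses_per_instruction

print(miss_rate)                # 0.02  -> a 2% miss rate
print(misses_per_instruction)   # 0.028 -> 28 misses per 1000 instructions

The six basic cache optimizations that follow target the three components of this cost: miss rate (optimizations 1–3), miss penalty (4–5), and hit time (6).
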
1. Larger block size to reduce miss rate—The simplest way to reduce the miss rate is to take
advantage of spatial locality and increase the block size. Larger blocks reduce compulsory
misses, but they also increase the miss penalty. Because larger blocks lower the number
of tags, they can slightly reduce static power. Larger block sizes can also increase capacity
or conflict misses, especially in smaller caches. Choosing the right block size is a complex
trade-off that depends on the size of cache and the miss penalty.

2. Bigger caches to reduce miss rate—The obvious way to reduce capacity misses is to
increase cache capacity. Drawbacks include potentially longer hit time of the larger cache
memory and higher cost and power. Larger caches increase both static and dynamic
power.

3. Higher associativity to reduce miss rate—Obviously, increasing associativity reduces conflict misses. Greater associativity can come at the cost of increased hit time. As we will
see shortly, associativity also increases power consumption.

4. Multilevel caches to reduce miss penalty—A difficult decision is whether to make the
cache hit time fast, to keep pace with the high clock rate of processors, or to make the
cache large to reduce the gap between the processor accesses and main memory
accesses. Adding another level of cache between the original cache and memory simplifies
the decision. The first-level cache can be small enough to match a fast clock cycle time,
yet the second-level (or third-level) cache can be large enough to capture many accesses
that would go to main memory. The focus on misses in second-level caches leads to larger
blocks, bigger capacity, and higher associativity. Multilevel caches are more power
efficient than a single aggregate cache. If L1 and L2 refer, respectively, to first- and second-level caches, we can redefine the average memory access time:

Average memory access time = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)

A small numeric sketch of this formula appears after this list.

5. Giving priority to read misses over writes to reduce miss penalty—A write buffer is a
good place to implement this optimization. Write buffers create hazards because they
hold the updated value of a location needed on a read miss—that is, a read-after-write
hazard through memory. One solution is to check the contents of the write buffer on a
read miss. If there are no conflicts, and if the memory system is available, sending the
read before the writes reduces the miss penalty. Most processors give reads priority over
writes. This choice has little effect on power consumption.

6. Avoiding address translation during indexing of the cache to reduce hit time—Caches
must cope with the translation of a virtual address from the processor to a physical
address to access memory. A common optimization is to use the page offset—the part
that is identical in both virtual and physical addresses—to index the cache, as described
in Appendix B, page B-38. This virtual index/physical tag method introduces some system
complications and/or limitations on the size and structure of the L1 cache, but the
advantages of removing the translation lookaside buffer (TLB) access from the critical path
outweigh the disadvantages.
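
Returning to optimization 4, here is the small numeric sketch of the two-level average memory access time formula; all cache parameters are assumed example values, expressed in clock cycles:

def amat_two_level(hit_l1: float, miss_rate_l1: float,
                   hit_l2: float, miss_rate_l2: float,
                   miss_penalty_l2: float) -> float:
    """Average memory access time =
       Hit time_L1 + Miss rate_L1 * (Hit time_L2 + Miss rate_L2 * Miss penalty_L2)."""
    return hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * miss_penalty_l2)

# Assumed values: 1-cycle L1 hit, 5% L1 miss rate, 10-cycle L2 hit,
# 20% local L2 miss rate, and a 100-cycle penalty to main memory.
print(amat_two_level(1, 0.05, 10, 0.20, 100))   # 1 + 0.05 * (10 + 20) = 2.5 cycles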
