
Republic of the Philippines

President Ramon Magsaysay State University


(Formerly Ramon Magsaysay Technological University)
Iba, Zambales, Philippines
Tel/Fax No.: (047) 811-1683
College of Communication and Information Technology
BACHELOR OF SCIENCE IN COMPUTER SCIENCE

Memory Models

Members:
Aireen Grace L. Sion
Maria Althea Montejo
Camile Reyes
King Augustine Sarmiento
Jhon Romer Torres

Submitted to:
Mr. Isarael M. Cabasug

Table of Contents

0.0 Memory Models Introduction
1.0 Memory Hardware Technology
1.1 Memory System Architecture
1.2 High-level Memory Architecture
1.3 Basic Issues in Cache Design
1.4 The Cache Coherence Problem
2.0 User-Level Memory Models
2.1 Shared Memory vs. Message Passing
2.2 Implementation of Shared-Memory Models
2.3 Performance of Shared-Memory Models
3.0 Memory Consistency Models
4.0 Implementation and Performance of Memory Consistency Models
4.1 Hardware Implementations
4.2 Software Implementations
5.0 Conclusions and Trends


Introduction

Memory models are the rule-books for computer memory operations, particularly in concurrent or multi-threaded setups. They ensure consistency and order, allowing multiple processors to work together without errors. These models help keep data accurate and accessible, making sure everything in the system runs smoothly.

1.0 Memory Hardware Technology

Programmers ideally want memory systems that are both infinitely large and fast. However, larger memory tends to be slower. Memory systems are measured by size (storage capacity), latency (speed of data access), and bandwidth (data transfer rate). Latency includes access time (time to retrieve data) and cycle time (time between requests). Bandwidth measures data supply per unit time; it is linked to cycle time but also depends on memory width and the number of memory banks.

There are two main types of memory chips: DRAM and SRAM. DRAM needs periodic refreshing and has longer cycle times because reading data also requires writing it back. SRAM does not need refreshing or write-back and has equal cycle and access times, but is less dense and more expensive. SRAM is about 16 times faster than DRAM but holds about 16 times less data.

Besides SRAM and DRAM, hybrid memory options like cached DRAMs combine the best features of both. They integrate a small SRAM buffer with a larger DRAM memory, caching recently-accessed data to speed up average cycle times. Though slower than processor-local caches, cached DRAMs benefit all processors and don't require coherence maintenance. DRAMs use a two-step addressing process, first specifying a row and then a column. If accesses are to the same row, the first step can be skipped, speeding up access time. Bandwidth relates to cycle time, but it can be increased by parallel access or by interleaving memory references across different modules. Interleaving allows faster memory access rates and lets a DRAM bank rewrite its data before being accessed again.
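
To make the interleaving idea concrete, the following is a minimal C++ sketch (our own illustration, not taken from the text) of how a low-order interleaved memory might map an address to a bank and an offset; the word size and bank count are illustrative assumptions.

#include <cstdint>
#include <cstdio>

constexpr uint64_t kWordBytes = 4;  // assumed 32-bit words
constexpr uint64_t kNumBanks  = 8;  // assumed 8-way interleaving

int main() {
    // Low-order interleaving: consecutive words fall in consecutive banks,
    // so a sequential access stream keeps every bank busy and gives each
    // DRAM bank time to rewrite its data before it is addressed again.
    for (uint64_t addr = 0; addr < 16 * kWordBytes; addr += kWordBytes) {
        uint64_t word   = addr / kWordBytes;
        uint64_t bank   = word % kNumBanks;
        uint64_t offset = word / kNumBanks;
        std::printf("address %3llu -> bank %llu, offset %llu\n",
                    (unsigned long long)addr,
                    (unsigned long long)bank,
                    (unsigned long long)offset);
    }
    return 0;
}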

Most memory chips are 1 to 8 bits wide, but systems need larger data chunks, so multiple chips are used in parallel, forming a memory bank. High-density chips minimize cost and physical demands but clash with interleaving goals. For instance, 16 Mbit chips, each supplying 4 bits, require 8 chips to deliver a 32-bit word, making each memory bank 16 Mbytes (8 chips x 16 Mbit). Increasing interleaving by using lower-density DRAM loses DRAM's cost advantage over SRAM. Most microprocessor memory systems use DRAM with processor caches. Vector supercomputers, like the Cray Research C-90 with up to 1024 memory banks, use highly-interleaved SRAM without caches, as their workloads don't benefit much from caches. Supercomputer customers pay extra for SRAM's performance.

1.1. Memory System Architecture

Designing a multiprocessor memory system involves key architectural decisions beyond chip technology. These include organizing memory into modules, determining their physical location relative to processors, structuring the address space, and defining the role of caches. It also involves lower-level cache design considerations, relevant to both uniprocessors and multiprocessors. Additionally, addressing coherence is critical in shared-memory systems where data copies may exist in multiple locations.


1.2. High-level Memory Architecture

Multiprocessor memory system designs rely heavily on the concept of locality. Programs with high locality concentrate their references on a small portion of the address space at any given time, using different portions at different times. Architects leverage this by caching frequently accessed information near the processors, creating the illusion of a memory that is both large and fast.

There are two dimensions of locality of importance on a uniprocessor, and a third of importance on parallel machines:

1. Temporal Locality. If a data item is referenced once, then it is likely to be referenced again in the near future.

2. Spatial Locality. If a data item is referenced, then items with nearby addresses are likely to be referenced in the near future.

3. Processor Locality. If a data item is referenced by a given processor, then that data item (and others at nearby addresses) are likely to be referenced by that same processor in the near future.
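
The first two kinds of locality can be seen in even the simplest code; the following C++ fragment is our own illustration, not from the original text.

#include <cstddef>
#include <vector>

// A simple reduction exhibits both uniprocessor forms of locality:
// 'sum' is reused on every iteration (temporal locality), and the
// elements of 'a' are visited at consecutive addresses (spatial
// locality), so every word of each fetched cache line is used.
double sumArray(const std::vector<double>& a) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i];
    return sum;
}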

The performance of a memory hierarchy relies on how well caches handle requests, minimizing the need to access slow main memory. Effective caches reduce average memory latency and the bandwidth required, making cheaper parts usable. However, if cache hit rates are low, performance drops, so flat memory systems might be better for programs with low locality. Supercomputers, which often use flat memory due to high bandwidth needs, rely on highly-interleaved SRAM and aggressive prefetching. Trends indicate that large, multi-bank caches could improve future supercomputer workloads by addressing bandwidth issues.


Designers must decide where to place main memory, either co-locating it with processors (distributed memory architecture) or separating it across an interconnection network (dance-hall architecture). Distributed memory improves scalability for applications with high processor locality but complicates addressing schemes, requiring node controllers to manage non-local memory requests.

1.3. Basic Issues in Cache Design

To judge cache success in a hierarchical memory system, we use metrics like hit rate, miss rate, and mean cost per reference (MCPR). The hit rate is the percentage of memory requests satisfied by the cache, while the miss rate is the percentage that is not; combined, they total 100%. A cache hit occurs when the desired data is found in the cache and returned; a miss occurs when the data must be fetched from main memory, potentially replacing existing cache data. MCPR measures the average memory access cost, considering both hits and misses, providing a more detailed performance evaluation:

MCPR = HitRate * HitTime + MissRate * MissTime

The actual MCPR depends on both application and architecture factors. Application factors relate to locality properties, while architecture factors include cache size, block size, associativity, tag/index organization, replacement policy, write policy, and coherence mechanism. These parameters impact performance by affecting cache hit/miss costs and ratios. Most parameters are relevant to both multiprocessor and uniprocessor caches.
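
As a worked example with assumed, purely illustrative numbers: if 95% of references hit in a cache with a 2-cycle hit time and misses cost 50 cycles, then

MCPR = 0.95 * 2 + 0.05 * 50 = 1.9 + 2.5 = 4.4 cycles per reference

so even a 5% miss rate more than doubles the average cost of a reference relative to the 2-cycle hit time.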


1.4. The Cache Coherence Problem

The coherence problem is that of ensuring no processor reads outdated data when multiple copies exist. It occurs even in uniprocessors with DMA I/O devices, where it is typically resolved by ad-hoc OS-level mechanisms. In multiprocessors with caches, the problem is more complex. Small-scale multiprocessors address it by snooping on a shared bus: processors broadcast write actions, and the others monitor these transactions to maintain cache consistency.

The main issue with this approach is the assumption that cache hits cost a single cycle, which is often inaccurate; on-chip caches typically take 2 or 3 cycles to respond. Compilers try to mask these extra cycles by scheduling independent operations, but this is not always feasible.

With current technology, buses can transfer up to 1.2 Gbytes/sec, while processors consume data at around 200 Mbytes/sec. This limits the number of processors sharing a bus to 20-30, provided caches handle most requests. Without a fast broadcast mechanism, cache coherence is managed with directory structures, which maintain data consistency but pose scalability issues. Generally, a small number of pointers for each sharable line works effectively. Directory-based CC-NUMA systems store cache line information at the processor holding the main-memory copy, while COMA machines use more dynamic structures. Examples include the Stanford Dash (a Θ(P²) directory) and the MIT Alewife (limited pointers, with software traps on overflow). The Convex Exemplar uses the IEEE SCI standard to implement a distributed directory with linear space overhead.
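
To illustrate the snooping idea, here is a minimal C++ sketch of a write-invalidate state machine for a single cache line. This is our own simplified illustration (an MSI-style protocol); real snoopy protocols add more states and handle write-back and bus arbitration.

// Simplified write-invalidate coherence for one cache line.
enum class LineState { Invalid, Shared, Modified };

struct SnoopyLine {
    LineState state = LineState::Invalid;

    // Actions by the local processor.
    void localRead() {
        if (state == LineState::Invalid)
            state = LineState::Shared;      // miss: fetch a shared copy
    }
    void localWrite() {
        state = LineState::Modified;        // broadcast an invalidate on the bus
    }

    // Reactions to transactions snooped from other processors on the bus.
    void snoopRemoteWrite() {
        state = LineState::Invalid;         // drop our now-stale copy
    }
    void snoopRemoteRead() {
        if (state == LineState::Modified)
            state = LineState::Shared;      // supply the dirty data, demote
    }
};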


2.0. User-Level Memory Models

2.1. Shared Memory vs. Message Passing

Memory models matter at both the hardware and the programmer level, since they influence the effort needed to create effective parallel programs. Hardware memory models are either message-passing or shared-memory, and programmer-level models follow the same division. In shared-memory models, processes access variables directly by their names. In message-passing models, variables are divided among processor instances, and accessing a non-local variable requires exchanging messages. Both models can be presented to the programmer in several ways:

• via library routines linked into programs written in a conventional sequential language;

• via features added to an existing sequential language by means of preprocessors or compiler extensions of various levels of complexity; or

• via special-purpose parallel or distributed programming languages.

Library packages offer portability and simplicity but are limited to subroutine calls and cannot take advantage of compiler-based optimizations. Languages and language extensions can use richer syntax and enable such optimizations. Message-passing models, implemented as subroutine calls, have tended to be more successful than library-based shared-memory models, which must access shared data through pointers returned by library routines. Andrews and Schneider offer a solid introduction to parallel programming, with Andrews' later book adding more detail. Bal, Steiner, and Tanenbaum survey message-passing models, while Cheng's report covers various programming models from the early '90s. The widely-used library-based shared-memory model for Fortran and C developed by Argonne National Laboratory researchers is notable. Shared-memory models are generally seen as easier to use than message-passing models, as evidenced by their market dominance and by efforts to provide shared memory on message-passing hardware. The rest of this chapter focuses on shared-memory models.
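
The contrast between the two models can be made concrete with a short C++ sketch (our own illustration; the names and the mailbox abstraction are invented for the example): in the shared-memory style any thread names the variable directly, while in the message-passing style only the owner touches it and others must send requests.

#include <condition_variable>
#include <mutex>
#include <queue>

// Shared-memory style: any thread can name the variable directly.
int counter = 0;
std::mutex counterLock;

void incrementShared() {
    std::lock_guard<std::mutex> guard(counterLock);
    ++counter;                       // direct access by name
}

// Message-passing style: only the owning process touches the variable;
// other processes send it request messages. A small in-process mailbox
// stands in for a real communication channel here.
class Mailbox {
public:
    void send(int message) {
        {
            std::lock_guard<std::mutex> guard(lock);
            messages.push(message);
        }
        nonEmpty.notify_one();
    }
    int receive() {
        std::unique_lock<std::mutex> guard(lock);
        nonEmpty.wait(guard, [this] { return !messages.empty(); });
        int message = messages.front();
        messages.pop();
        return message;
    }
private:
    std::queue<int> messages;
    std::mutex lock;
    std::condition_variable nonEmpty;
};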

2.2. Implementation of Shared-Memory Models

Any user-level parallel programming model must deal with several issues. These include:

• specifying the computational tasks to be performed and the data structures to be used;

• identifying data dependences among tasks;

• allocating tasks to processes, and scheduling and synchronizing processes on processors, in a way that respects the dependences;

• determining (which copies of) which data should be located at which processors at which points in time, so that processes have the data they need when they need it; and

• arranging for communication among processors to effect the location decisions.

Programming models differ in how much responsibility they place on users, compilers, run-time systems, operating systems, and hardware. The programmer's view of memory is closely tied to parallelization and data placement. Some models, like message-passing and non-cache-coherent systems, make users handle all aspects. For instance, Split-C offers various mechanisms for data management in C programs on multicomputers. Other models, like Sun's LWP and OSF's pthreads, require explicit process management by users but rely on hardware for data placement.


In large-scale scientific computing, where multicomputers are prevalent, there has been significant focus on developing efficient shared-memory programming models. High Performance Fortran (HPF) is a key outcome, developed through collaboration among many research groups. HPF merges Fortran-90 syntax with Fortran-D's data distribution and alignment concepts, letting users specify data placement and parallel loop execution. The compiler follows the "owner computes" rule, assigning computations based on data location. Similar efforts include pC++ and Jade, parallel versions of C++ and C, respectively.
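
A minimal sketch of the "owner computes" rule follows (our own illustration in C++ rather than HPF; the BLOCK distribution, rank, and process count are assumptions for the example): each process updates only the block of the array distributed to it, while in HPF the compiler derives this partitioning from the user's distribution directives.

#include <algorithm>
#include <cstddef>
#include <vector>

// "Owner computes": with a BLOCK distribution of array a over nProcs
// processes, process myRank updates only the elements it owns.
void scaleOwnedBlock(std::vector<double>& a, int myRank, int nProcs) {
    std::size_t n = a.size();
    std::size_t blockSize = (n + nProcs - 1) / nProcs;   // ceiling division
    std::size_t lo = static_cast<std::size_t>(myRank) * blockSize;
    std::size_t hi = std::min(n, lo + blockSize);
    for (std::size_t i = lo; i < hi; ++i)
        a[i] *= 2.0;                 // the owner of a[i] performs the update
}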

2.3. Performance of Shared-Memory Models

Bus-based shared-memory machines lead in general-purpose multiprocessors, while message-passing machines dominate high-performance markets. The challenges in building large-scale shared-memory multiprocessors and the perceived inefficiency of shared-memory models are key reasons for this. However, the inefficiencies in shared memory are believed to result from how the model is used, not from the model itself.

To achieve good performance in shared-memory parallel programs, reducing communication is crucial. Improving locality is a complex, multi-faceted challenge under active research; key aspects include (but are not limited to) optimizing data access patterns, minimizing data movement, and enhancing data placement strategies. Several common sources of communication are listed below.


Mismatch between the program data structures and the coherence units. If data structures are not aligned with the coherence units, accesses to unrelated data structures will cause unnecessary communication.

Relative location of computational tasks and the data they access. Random placement of tasks (as produced by centralized work queues) results in very high communication rates. Programmers and compilers should attempt to place all tasks that access the same data on the same processor, to take advantage of processor locality.

Poor spatial locality. In some cases programs access array data with non-unit strides. In that case a potentially large amount of data fetched into the cache remains untouched. Blocking techniques and algorithmic restructuring can often help alleviate this problem (see the sketch after this list).

Poor temporal locality. Data is invalidated or evicted from a processor's cache before it is touched again. As in the previous case, restructuring the program can improve its access pattern and, as a consequence, its temporal locality.

Conflicts between data structures. Data structures that map into the same cache lines will cause this problem in caches with limited associativity. If these data structures are used in the same computational phase, performance can suffer severe degradation. Skewing or relocating data structures can help solve this problem.

Communication intrinsic in the algorithm. Such communication is necessary for correctness and cannot be removed. However, it may be tolerated if data is prefetched before it is needed, or if the processor can switch to other computation while a miss is being serviced.
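
The stride problem above can be seen in a small C++ sketch (our own illustration; the row-major layout is an assumption of the example): walking down a column of a row-major matrix touches one element per cache line, while walking along a row uses every element of each fetched line.

#include <cstddef>
#include <vector>

double sumColumn(const std::vector<double>& m,
                 std::size_t rows, std::size_t cols, std::size_t c) {
    double sum = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        sum += m[r * cols + c];      // non-unit stride: most of each line unused
    return sum;
}

double sumRow(const std::vector<double>& m,
              std::size_t cols, std::size_t r) {
    double sum = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        sum += m[r * cols + c];      // unit stride: whole lines are used
    return sum;
}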


3.0. Memory Consistency Models

Shared-memory coherence protocols resolve conflicting accesses to the same memory block by different processors. The method of resolving these conflicts defines the memory consistency model, which determines the possible outcomes of parallel reads and writes. Sequential consistency, aiming for behavior similar to that of sequential machines, has been the traditional choice. However, it restricts performance optimizations. By relaxing its constraints, significant performance improvements in parallel programs are possible, though at the cost of a more complex programming model. Consider the following program, executed by two processors (assume X and Y initially hold 0):

Processor 0        Processor 1
X = 10             Y = 10
A = Y              B = X
Print A            Print B

Sequential consistency means all processors must agree on the order of memory events. If two processors write to the same memory location, every processor must concur on which write happened first.

Processor consistency means processors need not agree on the order of reads and writes performed by different processors. A write can even be delayed beyond some of the issuing processor's later reads.


Under sequential consistency, this program will print at least one value of 10, because all processors agree on the order of the writes. Under processor consistency, processors need not agree on the write order, which can lead to different results, including both A and B reading the old values.
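
The same experiment can be expressed in C++ (a sketch we add for illustration; it is not part of the original text). With memory_order_seq_cst the outcome A == 0 && B == 0 is forbidden, while replacing it with memory_order_relaxed, an ordering even weaker than processor consistency, permits that outcome.

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};
int A = 0, B = 0;

void proc0() {
    X.store(10, std::memory_order_seq_cst);   // with seq_cst, every thread
    A = Y.load(std::memory_order_seq_cst);    // agrees on one total order of
}                                             // these four memory events

void proc1() {
    Y.store(10, std::memory_order_seq_cst);
    B = X.load(std::memory_order_seq_cst);
}

int main() {
    std::thread t0(proc0), t1(proc1);
    t0.join();
    t1.join();
    // Under sequential consistency at least one of A, B is 10.
    // With memory_order_relaxed, A == 0 && B == 0 becomes possible.
    std::printf("A = %d, B = %d\n", A, B);
    return 0;
}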

4.0. Implementation and Performance of Memory Consistency Models

The implementation of memory consistency models presents architects with several design decisions:

• Should coherence be implemented in hardware, software, or some combination of the two?

• What should be the size of coherence blocks?

• Should copies of data be updated or invalidated on a write?

• Should the updates/invalidates be performed in an eager or lazy fashion?

While all combinations of answers to the above questions are possible, the choice between hardware and software strongly influences the remaining questions. For this reason we discuss hardware and software implementations of consistency models separately.

4.1. Hardware Implementations

In hardware implementations, the coherence unit is typically a cache line (32-256 bytes). Hardware systems generally use eager protocols, updating or invalidating data immediately when inconsistencies arise; lazy protocols, which delay these actions until the data is needed, are usually too costly for hardware. Invalidation is preferred because it requires less communication, although hybrid protocols that perform some updates may enhance performance. Relaxed consistency models let reads bypass writes, completing before prior writes are done. Non-uniform models allow pipelining and merging of writes between synchronization accesses. Models that differentiate synchronization access types enable pipelining with adjacent data accesses.
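
The distinction between synchronization accesses and ordinary data accesses can be illustrated with C++ release/acquire operations (our own sketch; the variable names are invented): only the synchronization accesses impose cross-thread ordering, leaving data accesses free to be pipelined.

#include <atomic>

int payload = 0;                       // ordinary data access
std::atomic<bool> ready{false};        // synchronization variable

void producer() {
    payload = 42;                                  // data access, freely reorderable
    ready.store(true, std::memory_order_release);  // synchronization: publish
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) // synchronization: wait
        ;                                          // spin until published
    // The release/acquire pair guarantees payload == 42 here.
}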

4.2. Software Implementations

Eager hardware protocols run in parallel with applications, while software protocols compete with the application for processor cycles. Software protocols therefore tend to have higher overhead, which is why most DSM systems use lazy protocols. In software, the decision between updating and invalidating is complex: with large coherence blocks, reacquiring invalidated blocks is costly, while updating the small pieces that actually changed is cheaper. Large blocks can also cause false sharing. To reduce its performance impact, some software systems allow concurrent writes to a page without synchronization, assuming that synchronization protects truly shared data. At synchronization points, the changes made by each processor are merged into a consistent copy, usually by comparing each modified page to an unmodified shadow copy. Relaxed consistency models in software reduce the number of coherence transactions by removing the single-writer restriction, allowing processors to spend more of their time at synchronization points on productive work. This decreases coherence traffic, easing memory and network congestion and speeding up other operations.
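
The shadow-copy comparison can be sketched as follows (our own simplified illustration; in some DSM systems the unmodified copy is called a "twin" and the recorded changes a "diff", and the page size and word granularity here are assumptions):

#include <cstddef>
#include <cstdint>
#include <vector>

struct DiffEntry {
    std::size_t offset;    // word index within the page
    uint32_t newValue;     // value written by this processor
};

// Compare a modified page against its unmodified shadow copy and
// record only the words that changed, so that just those words need
// to be communicated and merged at the next synchronization point.
std::vector<DiffEntry> makeDiff(const uint32_t* page,
                                const uint32_t* twin,
                                std::size_t wordsPerPage) {
    std::vector<DiffEntry> diff;
    for (std::size_t i = 0; i < wordsPerPage; ++i) {
        if (page[i] != twin[i])
            diff.push_back({i, page[i]});   // only modified words are kept
    }
    return diff;
}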

The debate between software and hardware implementations is ongoing. Recent findings indicate that software coherence with lazy relaxed consistency rivals hardware's eager relaxed consistency. Hardware architects also see the flexibility of software as a key performance boost, with some designs integrating programmable protocol engines.


5.0. Conclusions and Trends

How we conceptualize memory influences system design across multiple layers: hardware, architecture, programming models, and memory consistency models. The following principles lay the groundwork for developing memory systems that are cost-effective, efficient, and easy to use.

1. Referential transparency. Shared memory aligns more closely with the sequential programming style most programmers are used to, while message passing can complicate parallel programming. However, message passing can enhance performance in certain scenarios, and many newer machines and systems now support both models to take advantage of their respective benefits.

2. The principle of locality. Locality plays a crucial role at every level of the memory hierarchy: page-mode DRAMs in hardware, hierarchical memory systems in architecture, and NUMA programming models for users. Exploiting locality helps create cost-effective memory systems, and programming with locality principles in mind improves performance.

3. Sequentially consistent ordering of events. Weaker models improve speed but complicate programming. Programmer-centric consistency models, which guarantee sequential consistency to programs that follow specific synchronization rules, offer a promising balance between performance and ease of programming.


Effective memory system design is a major research focus in multiprocessors. Key trends include deeper memory hierarchies, with possible third-level off-chip caches, and the increasing importance of locality as processors grow faster. Current research aims to improve program locality via compiler techniques. Programmable cache controllers may soon allow protocol customization based on program needs. As memory costs rise, optimizing memory utilization may become more important than processor utilization for peak performance.
