Memory Models
Members:
Aireen Grace L. Sion
Maria Althea Montejo
Camile Reyes
King Augustine Sarmiento
Jhon Romer Torres
Submitted to:
Mr. Isarael M. Cabasug
Introduction
Memory models define the rules by which reads and writes to shared data appear to occur in order, allowing multiple processors to work together without errors. These models help keep data accurate and accessible, making sure everything in the system runs smoothly.
Programmers ideally want memory systems that are both infinitely large and fast.
However, larger memory tends to be slower. Memory systems are measured by size
(storage capacity), latency (speed of data access), and bandwidth (data transfer rate).
Latency includes access time (time to retrieve data) and cycle time (time between requests). Bandwidth measures data supply per unit time. It's linked to cycle time, but also depends on how many memory banks can operate in parallel.
There are two main types of memory chips: DRAM and SRAM. DRAM needs
periodic refreshing and has longer cycle times because reading data also requires
writing it back. SRAM does not need refreshing or write-back and has equal cycle and access times, but it is less dense and more expensive. SRAM is about 16 times faster than DRAM.
Besides SRAM and DRAM, hybrid memory options like cached DRAMs combine the best features of both. They integrate a small SRAM buffer with a larger DRAM array on the same chip. Although the on-chip buffer is slower than processor-local caches, cached DRAMs benefit all processors and don't suffer from coherence problems.
A DRAM location is accessed by specifying a row and then a column. If accesses are to the same row, the first step can be skipped, speeding up access time. Bandwidth relates to cycle time, but it can be increased by interleaving accesses across multiple memory modules. Interleaving allows faster memory access rates and lets DRAM rewrite data in one bank while other banks are being accessed.
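To make the idea concrete, here is a small Python sketch of low-order interleaving (the bank count and word size are assumed for illustration, not taken from any particular machine). Consecutive words map to consecutive banks, so a sequential access stream keeps all banks busy at once:

    NUM_BANKS = 8    # assumed number of interleaved banks
    WORD_BYTES = 4   # assumed word size

    def bank_of(addr):
        # Low-order interleaving: consecutive words land in consecutive
        # banks, letting one bank rewrite data while others are accessed.
        word = addr // WORD_BYTES
        return word % NUM_BANKS

    for addr in range(0, 32, 4):
        print("address", addr, "-> bank", bank_of(addr))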
Most memory chips are 1 to 8 bits wide, but systems need larger data chunks, so
multiple chips are used in parallel, forming a memory bank. High-density chips
minimize cost and physical demands but clash with interleaving goals. For instance, 16 Mbit chips, each supplying 4 bits, require 8 chips to deliver a 32-bit word, making the minimum bank size 16 Mbytes. Deep interleaving then demands either an enormous total memory or lower-density chips, at which point DRAM loses its cost benefits compared to SRAM. Most microprocessor memory systems use
DRAM with processor caches. Vector supercomputers, like the Cray Research C-90
with up to 1024 memory banks, use highly interleaved SRAM without caches, as their
workloads don't benefit much from caches. Supercomputer customers pay extra for
SRAM's performance.
Building a complete memory system involves many design decisions beyond chip technology. These include organizing memory into modules, distributing physical addresses across the address space, and defining the role of caches. It also involves lower-level cache design choices, which the following sections discuss.
Programs with high locality use different portions of the address space at different times. Caches exploit this behavior by keeping recently used data close to the processors, creating the illusion of a memory that is both large and fast. If a processor references a data item, then that data item (and others at nearby addresses) are likely to be referenced by that processor again soon. A well-designed cache therefore satisfies most memory requests, minimizing the need to access slow main memory. Effective caches reduce average memory latency and the bandwidth required, making cheaper parts usable. However, if cache hit rates are low, performance drops, so flat memory systems might be better for programs with low locality. Supercomputers, which often use flat memory systems, compensate with aggressive prefetching. Trends indicate that large, multi-bank caches could improve performance for these machines as well.
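A toy simulation shows how strongly hit rates depend on locality. The sketch below assumes a direct-mapped cache of 256 lines, each holding 16 words (made-up parameters), and compares a sequential reference stream against a scattered one:

    import random

    LINE_WORDS = 16
    NUM_LINES = 256

    def hit_rate(refs):
        # Direct-mapped cache: each line index holds at most one tag.
        tags = [None] * NUM_LINES
        hits = 0
        for addr in refs:
            line = addr // LINE_WORDS
            idx = line % NUM_LINES
            if tags[idx] == line:
                hits += 1
            else:
                tags[idx] = line   # miss: fetch the line, evict the old one
        return hits / len(refs)

    sequential = list(range(100_000))                  # high spatial locality
    scattered = [random.randrange(10**7) for _ in range(100_000)]
    print(hit_rate(sequential))   # ~0.94: 15 of every 16 words hit
    print(hit_rate(scattered))    # close to 0: almost every access misses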
Designers must decide where to place main memory, either co-locating it with the processors or keeping it in a central location. Co-location improves scalability for applications with high processor locality but complicates addressing and data placement.
Cache performance is commonly described by three metrics: hit rate, miss rate, and mean cost per reference (MCPR). The hit rate is the percentage
of memory requests satisfied by the cache, while the miss rate is the percentage that
isn't. Combined, they total 100%. A cache hit occurs when the desired data is found in
the cache and returned; a miss occurs when the data must be fetched from the main
memory, potentially replacing existing cache data. MCPR measures the average
memory access cost, considering both hits and misses, providing a detailed
performance evaluation.
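In other words, MCPR weights the cost of each outcome by its frequency. A minimal sketch, using illustrative costs (a 1-cycle hit and a 30-cycle miss are assumptions, not figures from the text):

    def mcpr(hit_rate, hit_cost, miss_cost):
        # Every reference either hits or misses, so the mean cost is a
        # weighted average of the two.
        return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost

    print(mcpr(0.95, 1, 30))   # 2.45 cycles per reference on average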
The actual MCPR depends on both the application and the cache's design parameters: size, line size, associativity, replacement policy, write policy, and coherence mechanism. These parameters impact performance by affecting cache hit/miss costs and ratios. Most parameters are relevant to uniprocessors and multiprocessors alike; the coherence mechanism is specific to multiprocessors.
The coherence problem is that of ensuring no processor reads outdated data when multiple copies exist. It occurs even in uniprocessors with DMA I/O devices, where it is typically resolved by flushing the cache around I/O operations. Bus-based multiprocessors usually rely on snooping protocols, where processors broadcast write actions and others monitor these transactions to invalidate or update their own copies. The main limitation of this approach is its reliance on a shared bus, whose bandwidth is finite.
With current technology, buses can transfer up to 1.2 Gbytes/sec, while processors consume data at around 200 Mbytes/sec. Because caches satisfy most requests before they reach the bus, this supports roughly 20-30 processors per bus, but no more.
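The 20-30 processor figure follows from simple arithmetic. In the sketch below, the bus and per-processor rates come from the text, while the fraction of traffic that escapes the caches is an assumption chosen for illustration:

    bus_bw = 1200.0          # Mbytes/sec the shared bus can carry
    per_proc_demand = 200.0  # Mbytes/sec each processor consumes
    miss_fraction = 0.2      # assumed share of traffic the caches cannot filter

    bus_traffic_per_proc = per_proc_demand * miss_fraction
    print(bus_bw / bus_traffic_per_proc)   # ~30 processors saturate the bus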
Without a fast broadcast mechanism, larger machines turn to directory-based protocols, which maintain consistency but pose scalability issues of their own. Generally, a small number of pointers for each sharable line works effectively. Directory-based CC-NUMA systems store cache line information at the processor holding the line's main memory, while COMA machines use more dynamic structures. Examples include the Stanford Dash (a full-map directory with Θ(P²) total storage) and the MIT Alewife (limited pointers, with software traps on overflow). The Convex Exemplar uses the IEEE SCI standard, which links the sharing caches into distributed lists.
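A small sketch in the spirit of the limited-pointer scheme (the details here are assumptions for illustration, not any machine's actual protocol): the directory tracks a few sharers exactly and falls back to broadcast invalidation once the pointers overflow:

    class LimitedPointerEntry:
        MAX_PTRS = 4   # assumed number of hardware pointers per line

        def __init__(self):
            self.sharers = set()
            self.overflow = False   # a real machine might trap to software here

        def add_sharer(self, proc):
            if self.overflow:
                return              # already treating the line as shared by all
            self.sharers.add(proc)
            if len(self.sharers) > self.MAX_PTRS:
                self.sharers.clear()
                self.overflow = True

        def invalidate_targets(self, num_procs):
            # On a write: invalidate the tracked sharers, or every
            # processor once the exact sharing set has been lost.
            return set(range(num_procs)) if self.overflow else set(self.sharers)

With a handful of pointers, the common case of data shared by a few processors is handled exactly, and only widely shared lines pay the broadcast cost.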
The memory models that systems present to programmers strongly influence the effort needed to create effective parallel programs. Hardware memory models divide into shared-memory and message-passing designs, and user-level models follow this too. In shared-memory models, processes access variables directly by their names; in message-passing models, they must exchange explicit messages. Either style can be supported by any of three programming approaches: a library package for an existing language; an entirely new parallel language; or parallel extensions to an existing language.
Library packages offer portability and simplicity but are limited to subroutine
calls and can't utilize compiler-based optimizations. Languages and extensions can exploit compiler analysis and optimization, but they require new compilers and are less portable. A further drawback of library packages is that programs typically must access shared data through pointers returned by library routines. Andrews and
Schneider offer a solid intro to parallel programming, with Andrews' later book
adding more detail. Bal, Steiner, and Tanenbaum survey message-passing models,
while Cheng's report covers various programming models from the early '90s.
Any user-level parallel programming model must deal with several issues. These include:
• dividing the computation into parallel tasks and deciding which processors should be used;
• determining (which copies of) which data should be located at which processors at which points in time, so that processes have the data they need when they need it;
• deciding how much of this burden falls on the programmer, the compiler, and the runtime system, and providing mechanisms for expressing these decisions.
A model's view of memory is closely tied to parallelization and data placement. Some models,
like message-passing and non-cache-coherent systems, make users handle all aspects.
For instance, Split-C offers various mechanisms for data management in C programs
on multicomputers. Other models, like Sun’s LWP and OSF’s pthreads, require
explicit process management by users but use hardware for data placement.
High Performance Fortran (HPF) is the product of collaboration among many research groups. HPF merges Fortran-90 syntax with
Fortran-D's data distribution and alignment concepts, letting users specify data
placement and parallel loop execution. The compiler follows the "owner computes"
rule, assigning computations based on data location. Similar efforts include pC++ and other data-parallel languages. Their familiar sequential semantics and shared memory models are key reasons for this line of work's appeal. However, the compiler cannot always infer efficient data placement strategies.
Several recurring problems hurt memory performance in parallel programs. Mismatch between the program data structures and the coherence units. If data structures are not aligned with the coherence units, accesses to unrelated data items can cause the same coherence unit to bounce between processors (false sharing); aligning or padding the data structures can help.
Relative location of computational tasks and the data they access. Random placement of tasks (as achieved by centralized work queues) results in very high miss rates; it is better to schedule tasks that access the same data on the same processor to take advantage of processor locality.
Poor spatial locality. In some cases programs access array data with non-unit strides.
In that case a potentially large amount of data fetched in the cache remains untouched.
Blocking techniques and algorithmic restructuring can often help alleviate this
problem.
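The cost of non-unit strides is easy to quantify. Assuming 64-byte cache lines and 8-byte elements (illustrative values, not from the text), this sketch counts the distinct lines touched by an array walk:

    LINE_BYTES = 64
    ELEM_BYTES = 8   # assumed element size (e.g., double precision)

    def lines_touched(n, stride):
        # Number of distinct cache lines pulled in by n accesses with the
        # given stride; the bytes actually used stay the same either way.
        return len({(i * stride * ELEM_BYTES) // LINE_BYTES for i in range(n)})

    print(lines_touched(1000, 1))    # 125 lines, every byte of each line used
    print(lines_touched(1000, 16))   # 1000 lines, 8 useful bytes per 64-byte line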
Poor temporal locality. Data already in the cache may be evicted before it is touched again. As in the previous case, restructuring the program can help keep reuses of the same data close together in time.
Conflicts between data structures. Data structures that map into the same cache
lines will cause this problem in caches with limited associativity. If these data
structures are used in the same computational phase, performance can suffer severe
degradation. Skewing or relocating data structures can help to solve this problem.
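The sketch below illustrates the conflict problem for a hypothetical direct-mapped cache (all sizes assumed). Two arrays whose base addresses differ by a multiple of the cache size collide on every element; one line of padding skews the second array and removes the collisions:

    CACHE_LINES = 1024   # assumed direct-mapped cache geometry
    LINE_BYTES = 64

    def line_index(addr):
        return (addr // LINE_BYTES) % CACHE_LINES

    base_a = 0
    base_b = CACHE_LINES * LINE_BYTES   # collides with base_a element-for-element
    base_b_skewed = base_b + LINE_BYTES # one line of padding shifts the mapping

    for i in (0, 8, 16):
        off = i * 8
        print(line_index(base_a + off),          # array A
              line_index(base_b + off),          # same index as A: conflict
              line_index(base_b_skewed + off))   # shifted index: no conflict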
The order in which memory operations appear to occur defines the memory consistency model, determining the outcomes of parallel reads and writes. Sequential consistency is the most intuitive model, but weaker orderings that still permit correct parallel programs are possible, though this leads to a more complex programming model. Consider the following example, where X and Y initially hold zero:
Processor 0          Processor 1
X = 10               Y = 10
A = Y                B = X
print A              print B
Sequential consistency means all processors must agree on the order of memory events. If two processors write to the same memory location, every processor must observe those writes in the same order. Processor consistency means processors don't need to agree on the order of reads and writes issued by different processors; a write can even be delayed beyond some of the issuing processor's own subsequent reads.
In the example above, sequential consistency guarantees that A and B cannot both be zero when printed, because all processors agree on the write order. Under processor consistency, processors don't need to agree on the write order, which can lead to different results.
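The difference can be checked mechanically. The sketch below enumerates every sequentially consistent interleaving of the two processors' operations (with X and Y starting at zero) and collects the possible (A, B) results; (0, 0) never appears, whereas a processor-consistent machine that delays writes past subsequent reads could produce it:

    from itertools import combinations

    # Each processor's operations in program order.
    P0 = [("write", "X", 10), ("read", "Y", "A")]
    P1 = [("write", "Y", 10), ("read", "X", "B")]

    def run(schedule):
        mem = {"X": 0, "Y": 0}
        regs = {}
        for op, var, arg in schedule:
            if op == "write":
                mem[var] = arg
            else:
                regs[arg] = mem[var]
        return (regs["A"], regs["B"])

    # Choose which 2 of the 4 time slots hold P0's operations; program
    # order within each processor is preserved, as SC requires.
    outcomes = set()
    for slots in combinations(range(4), 2):
        p0, p1 = iter(P0), iter(P1)
        schedule = [next(p0) if i in slots else next(p1) for i in range(4)]
        outcomes.add(run(schedule))

    print(sorted(outcomes))   # (0, 0) is absent under sequential consistency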
Implementing a consistency model raises several design decisions: Should coherence be maintained in hardware, in software, or in some combination of the two? Should the protocol act eagerly or lazily? Should it invalidate or update remote copies, and how large should the coherence blocks be?
While all combinations of answers to the above questions are possible, the choice between hardware and software has a strong influence on the remaining questions.
For this reason we will discuss hardware and software implementation of consistency
models separately.
Hardware implementations typically use small coherence blocks (often 32 to 256 bytes). Hardware systems generally use eager protocols, updating or invalidating data immediately when inconsistencies arise. Lazy protocols, which delay these actions until needed, are usually too costly for hardware. Invalidation is preferred because it requires less communication, although hybrid protocols with occasional updates might help in some cases. Hardware must also decide how strictly to order operations: the strictest models prevent any access from completing before prior writes are done, while non-uniform models allow pipelining and overlapping of memory operations.
Software coherence protocols run on the processors themselves, so the protocols compete with the application for processor cycles. Therefore, software protocols often have higher overhead, and most DSM systems use lazy protocols because of this. In software, the
decision between updating and invalidating is complex; large coherence blocks make
reacquiring invalidated blocks costly, but updating small pieces is cheaper. Large
blocks can also cause false sharing. To reduce the performance impact, some software systems allow multiple concurrent writers: the modifications made at each processor are merged into a consistent copy, usually by comparing modified pages against unmodified twins. Lazy, multiple-writer protocols also reduce coherence traffic, easing memory and network congestion, and speeding up other operations.
Experimental findings indicate that software coherence with lazy relaxed consistency rivals hardware's eager relaxed consistency. Hardware architects also value the flexibility of software, which has led to designs with programmable protocol engines.
Modern memory systems are built on a few key principles, which lay the groundwork for optimizing performance:
1. Shared memory versus message passing - Shared memory matches the style of programming most programmers are used to, while message passing can complicate some scenarios. Many newer machines and systems now support both models to take advantage of each.
2. The principle of locality - Locality plays a crucial role at every level of the system, shaping cache architecture and the NUMA programming models presented to users. Utilizing locality helps programs make effective use of the memory hierarchy.
Key trends include deeper memory hierarchies, possibly with third-level off-chip caches. As processor speeds continue to outpace memory speeds, memory utilization may become more important than processor utilization for peak performance.