
UNIT 2

Memory Hierarchy
• Since fast memory is expensive, a memory hierarchy is organized into several levels—each
smaller, faster, and more expensive per byte than the next lower level.
• The goal is to provide a memory system with cost per byte almost as low as the cheapest level of
memory and speed almost as fast as the fastest level.
• Each level maps addresses from a slower, larger memory to a smaller but faster memory higher in
the hierarchy
• The memory hierarchy is given the responsibility of address checking; hence, protection schemes
for scrutinizing addresses are also part of the memory hierarchy
Memory Hierarchy
• When a word is not found in the cache, the word must be fetched from the memory and placed in
the cache before continuing.
Memory Hierarchy
• Multiple words are moved at a time as a block (line); each block includes a tag that
identifies the corresponding memory address
• Set associative: a set is a group of blocks in the cache
• A block is first mapped to a set, and then the block can be placed anywhere within the
set.
• Finding a block means mapping the block address to a set and then searching the set, usually in parallel
• The set is chosen by the address of the data:
(Block address) MOD (Number of sets in cache)
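A minimal sketch of this mapping, using assumed (hypothetical) parameters for the block size and number of sets:

```python
# Hypothetical parameters, chosen only to illustrate (Block address) MOD (Number of sets).
BLOCK_SIZE = 64   # bytes per block (assumed)
NUM_SETS = 512    # number of sets in the cache (assumed)

def cache_set(byte_address: int) -> int:
    """Map a byte address to the set that may hold its block."""
    block_address = byte_address // BLOCK_SIZE   # strip the block offset
    return block_address % NUM_SETS              # (Block address) MOD (Number of sets)

print(cache_set(0x12345678))  # set index for this address
```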
Memory Hierarchy
• If there are n blocks in a set, the cache is n-way set associative.
• A direct-mapped cache has just one block per set
• A fully associative cache has just one set (a block can be placed anywhere)
• Write-through cache: updates the item in the cache and writes through to update
main memory.
• Write-back cache: updates only the copy in the cache; main memory is updated when the block is replaced.
• Both write strategies can use a write buffer to allow the cache to proceed as soon as
data is placed in the buffer rather than waiting for the write to memory
Memory hierarchy
• Miss rate is simply the fraction of cache accesses that result in a miss: the number of accesses that miss
divided by the total number of accesses.
• Average memory access time = Hit time + Miss rate × Miss penalty
• Hit time is the time to hit in the cache
• Miss penalty is the time to replace the block from memory
• Multilevel caches reduce miss penalty: the first-level cache (L1) can be small enough to match a fast clock
cycle time
• The second-level cache (L2) can be large enough to capture many accesses that would go to main memory
• Average memory access time = Hit time(L1) + Miss rate(L1) × (Hit time(L2) + Miss rate(L2) × Miss
penalty(L2))
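A small worked example of the two-level formula, with assumed latencies and miss rates (not from the slides):

```python
# Assumed numbers, purely to illustrate the two-level AMAT formula.
hit_time_l1 = 1        # cycles
miss_rate_l1 = 0.05
hit_time_l2 = 10       # cycles (L1 miss penalty when L2 hits)
miss_rate_l2 = 0.20    # local miss rate of L2
miss_penalty_l2 = 100  # cycles to main memory

amat = hit_time_l1 + miss_rate_l1 * (hit_time_l2 + miss_rate_l2 * miss_penalty_l2)
print(amat)  # 1 + 0.05 * (10 + 0.20 * 100) = 2.5 cycles
```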
Memory Hierarchy
Six basic cache optimization techniques
1. Larger block size to reduce miss rate
2. Bigger caches to reduce miss rate
3. Higher associativity to reduce miss rate
4. Multilevel caches to reduce miss penalty
5. Giving priority to read misses over writes to reduce miss penalty
6. Avoiding address translation during indexing of the cache to reduce hit
time
Memory Hierarchy
• SRAM technology
• The first letter of SRAM stands for static.
• SRAMs don't need to refresh and so the access time is very close to the cycle time
• SRAMs typically use six transistors per bit to prevent the information from being disturbed when read
• SRAM needs only minimal power to retain the charge in standby mode
• SRAM designs are concerned with speed and capacity, while in DRAM designs the emphasis is on cost
per bit and capacity
• The cycle time of SRAMs is 8-16 times faster than DRAMs, but they are also 8-16 times as expensive
Memory Hierarchy
• DRAM technologies
• One-half of the address is sent first, called the row access strobe (RAS)
• The other half of the address, sent during the column access strobe (CAS), follows it
• These names come from the internal chip organization, since the memory is organized as a
rectangular matrix addressed by rows and columns
• DRAM derives from the property signified by
its first letter, D, for dynamic
• DRAMs use only a single transistor to store a bit
Memory Hierarchy
• DRAM technology
• Reading that bit destroys the information, so it must be restored
• This is one reason the DRAM cycle time is much longer than the access time
• In addition, to prevent loss of information when a bit is not read or written, the bit must be
"refreshed" periodically
• This requirement means that the memory system is occasionally unavailable because it is sending a
signal telling every chip to refresh
• Since the memory matrix in a DRAM is conceptually square, the number of steps in a refresh is
usually the square root of the DRAM capacity
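As a rough illustration under an assumed capacity (not from the slides):

```python
import math

# Assumed capacity for illustration: a 64 Mbit DRAM organized as a square matrix.
capacity_bits = 64 * 2**20
refresh_steps = math.isqrt(capacity_bits)  # roughly the square root of the capacity
print(refresh_steps)  # 8192 row accesses per full refresh
```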
Memory Hierarchy
• DRAMs are commonly sold on small boards called dual inline memory modules (DIMMs).
• DIMMs typically contain 4-16 DRAMs, and they are normally organized to be 8 bytes wide (+
ECC) for desktop systems
Crosscutting Issues: The design of memory hierarchies

• 1. Protection and Instruction set architecture


• 2. Speculative execution and the memory system
• 3. I/O and consistency of cached data
Memory Hierarchy
The three Cs model sorts all misses into three simple categories
• Compulsory—The very first access to a block cannot be in the cache, so the block must be brought
into the cache. Compulsory misses are those that occur even if you had an infinite cache.
• Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity
misses (in addition to compulsory misses) will occur because of blocks being discarded and later
retrieved.
• Conflict—If the block placement strategy is not fully associative, conflict misses (in addition to
compulsory and capacity misses) will occur because a block may be discarded and later retrieved if
conflicting blocks map to its set.
Crosscutting Issues: The design of memory
hierarchies
1. Protection and Instruction set architecture
Ex: the 80x86 instruction POPF loads the flag registers from the top of the stack.
In user mode all the flags are updated except the interrupt enable (IE) flag, but in
system mode IE is changed as well.
A guest OS runs in user mode inside a virtual machine; this is a problem, as the IE
flag is not changed when the guest OS executes POPF.
Three steps were taken to improve the performance of virtual machines under a
Virtual Machine Monitor (VMM):
Crosscutting Issues: The design of memory
hierarchies
a. Reduce the cost of processor virtualization
b. Reduce interrupt overhead cost due to virtualization
c. Reduce interrupt cost by steering interrupts to the proper VM without invoking the
VMM
IBM is still the gold standard of virtual machine technology
Crosscutting Issues: The design of memory
hierarchies
2. Speculative execution and the memory system
• Possibility of generating invalid addresses
• The memory system must identify speculatively executed instructions and
conditionally executed instructions
• Suppress the corresponding exception
• The cache cannot be allowed to stall on a miss
• Such processors must be matched with nonblocking caches
Crosscutting Issues: The design of memory
hierarchies
3. I/O and consistency of cached data
• Data can be found in both memory and the cache
• The processor is in danger of seeing the old or stale copy
• Multiple data copies are a rare event for I/O, but multiple processors will want
to have copies of the same data in several caches.
• I/O cache coherency:
• If input puts the data into the cache, both the I/O system and the processor see the same data
Crosscutting Issues: The design of memory
hierarchies
• This interferes with the processor and can cause the processor to stall for I/O
• It also interferes with the cache by displacing some information with new data that is not
likely to be accessed soon
• So, I/O often occurs directly to main memory, with main memory acting as an input buffer.
• The software solution is to guarantee that no blocks of the input buffer are in the cache.
• A page containing the buffer can be marked as noncacheable
• The hardware solution is to check the I/O addresses against the cache; matching cache entries are
invalidated to avoid stale data.
AMD Opteron Memory Hierarchy
• Opteron-
• Fetches up to three 80x86 instructions per clock cycle
• Translates to RISC like operations
• Issues three of them per clock cycle
• 11 parallel execution units
• 48 bit virtual addresses and 40 bit physical addresses
AMD Opteron Memory Hierarchy
• Step 1: The PC is sent to the instruction cache. It is 64 KB, two-way set associative,
with a 64-byte block size, virtually indexed and physically tagged.
• 2^index = Cache size / (Block size × Set associativity) = 64K / (64 × 2) = 512 =
2^9, so the index is 9 bits
• The page frame of the instruction's address is sent to the instruction
TLB (translation lookaside buffer)
• Step 2: At the same time, the 9-bit index (plus an additional 2 bits) from the virtual
address is sent to the instruction cache.
AMD Opteron Memory Hierarchy
• Steps 3 and 4: The fully associative TLB simultaneously searches all 40 entries to
find a match between the address and a valid PTE (page table entry)
• The TLB checks to see if the PTE demands that this access result in an exception.
• An L1 TLB miss first goes to the L2 TLB, which contains 512 PTEs for 4 KB page sizes
and is four-way set associative. It takes 2 cycles to load the L1 TLB from the L2 TLB. If there is no
exception, the instruction cache access continues.
• Step 5: The index field of the address is sent to both groups of the two-way set
associative instruction cache. The instruction cache tag is 40 − 9 (index) − 6 (block
offset) = 25 bits.
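A small sketch of that address breakdown (64 KB, two-way, 64-byte blocks, 40-bit physical addresses), computing the offset, index, and tag widths:

```python
import math

# Parameters from the Opteron example above.
CACHE_SIZE = 64 * 1024   # bytes
ASSOCIATIVITY = 2
BLOCK_SIZE = 64          # bytes
PHYS_ADDR_BITS = 40

offset_bits = int(math.log2(BLOCK_SIZE))                                  # 6
index_bits = int(math.log2(CACHE_SIZE // (BLOCK_SIZE * ASSOCIATIVITY)))   # 9
tag_bits = PHYS_ADDR_BITS - index_bits - offset_bits                      # 25

print(offset_bits, index_bits, tag_bits)  # 6 9 25
```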
AMD Opteron Memory Hierarchy
• Step 6: The four tags and valid bits are compared to the physical page frame from the
instruction TLB
• On a miss in the instruction cache, the cache controller must check for a synonym (two
different virtual addresses that reference the same physical address). Hence, the
instruction cache tags are examined for synonyms in parallel with the L2 cache tags
during an L2 lookup.
• The Opteron uses redundant snooping tags to check all the synonyms in 1 clock cycle.
• On a miss, the second-level cache tries to fetch the block. The L2 cache is 1 MB, 16-way
set associative with 64-byte blocks.
AMD Opteron Memory Hierarchy
• Step 8: The L2 index is 10 bits, so the 34-bit block address is divided into a 24-bit tag and a 10-bit index.
• Step 9: Once again the index and tag are sent to all 16 groups of the 16-way set associative
L2 cache and compared in parallel.
• Step 10: If one matches and is valid, it returns the block in sequential order, 8 bytes per clock
cycle. The L2 cache also cancels the memory request that the L1 cache sent to the controller.
An L1 instruction cache miss that hits in the L2 cache costs 7 processor clock cycles for the first
word.
The Opteron has an exclusion policy between the L1 caches and the L2 cache: a block is in the L1 or L2
cache but not in both.
AMD Opteron Memory Hierarchy
• Step 11: If the instruction is not found in the secondary cache, the on-chip
memory controller must get the block from main memory. The Opteron
has dual 64-bit memory channels that can act as one 128-bit channel, since
there is only one memory controller and the same address is sent on both
channels.
• Step 12: Each channel supports up to four double data rate (DDR) dual inline
memory modules (DIMMs)
AMD Opteron Memory Hierarchy
• Steps 13 and 14: The Opteron has a prefetch engine associated with the L2 cache. It looks at
patterns of L2 misses in consecutive blocks, either ascending or descending, and then prefetches
the next line into the L2 cache.
• Step 15: Since the second-level cache is a write-back cache, any miss can lead to an old block
being written back to memory. The Opteron places this "victim" block into a victim buffer
• The victim buffer has eight entries, so many victims can be queued before being written back either
to L2 or to memory.
• The memory controller can manage up to 10 simultaneous cache block misses: 8 from the data cache and
2 from the instruction cache.
• New data is loaded into the cache as soon as it arrives.
AMD Opteron Memory Hierarchy: Fallacies and Pitfalls

• Fallacy : Predicting cache performance of one program from another


• Pitfall: Simulating enough instructions to get accurate performance
measures of the memory hierarchy
- predicting performance of a large cache using a small trace
- a program's locality behaviour is not constant over the run of the entire
program
- a program's locality behaviour may vary depending on the input
Fallacies and Pitfalls
• Pitfall: Overemphasizing memory bandwidth in DRAMs (instead of
lowering memory latency)
• Pitfall: Not delivering high memory bandwidth in a cache based system
• Pitfall: Implementing a virtual machine monitor on an instruction set
architecture that wasn’t designed to be virtualizable
Advanced topics in disk storage
• Improvement in disk capacity comes from improvement in areal density
• Cost per gigabyte has dropped at least as fast as areal density has
increased.
• Magnetic disks have been challenged many times for supremacy of
secondary storage
• The access time gap: DRAM latency is about
100,000 times less than that of disk.
Advanced topics in disk storage
• A fast disk in 2006 transfers about 115 MB/sec from the disk media, stores 37 GB, and
costs about $150
• A 2 GB DRAM module costing about $300 in 2006 could transfer 3200 MB/sec
• The DRAM module has about 28 times the bandwidth of the disk
• Its bandwidth per dollar is about 14 times higher
• The closest challenger is flash memory
• It is nonvolatile like disks and has about the same bandwidth as disks, but its latency is 100-1000 times
better than disk.
• It is more power efficient than disks, despite a cost per gigabyte about 50 times higher than disks.
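The rough arithmetic behind the 28× and 14× figures, using the 2006 numbers above:

```python
# 2006 figures from the slides: disk vs. DRAM module.
disk_bw, disk_cost = 115, 150      # MB/sec, dollars
dram_bw, dram_cost = 3200, 300     # MB/sec, dollars

print(dram_bw / disk_bw)                               # ~27.8, i.e. ~28x the bandwidth
print((dram_bw / dram_cost) / (disk_bw / disk_cost))   # ~13.9, i.e. ~14x bandwidth per dollar
```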
Advanced topics in disk storage
• Flash memory bits wear out: they are limited to about 1 million writes
• First, disks started offering higher-level interfaces like ATA (Advanced Technology
Attachment) and SCSI (Small Computer System Interface) once a microprocessor was
included inside the disk.
• Second, disks included buffers to hold the data until the computer was ready to
accept it, and later caches to avoid read accesses.
• These were later joined by a command queue that allowed the disk to decide in what order to
perform the commands to maximize performance while maintaining correct
behaviour
Disk Power
• An ATA disk in 2006 might use 9 watts when idle, 11 watts when reading
or writing, and 13 watts when seeking
• Most of a disk's power is consumed by the disk motor
• Smaller platters, slower rotation, and fewer platters all help reduce disk
motor power.
Disk Power
• SATA (Serial Advanced Technology Attachment, or Serial ATA) drives use the widest platters that fit the form factor and use four or
five of them, but they spin at 7200 RPM and seek relatively slowly to lower power.
• The corresponding Serial Attached SCSI (SAS) drive aims at performance, so it spins at 15,000 RPM and seeks much
faster. To reduce power, the platter is much narrower than the form factor and it has only a single platter.
• The cost per gigabyte is about a factor of five better for the SATA drives
• Conversely, the cost per I/O per second or MB transferred per second is about a factor of five better for the SAS drives
• SAS disks use twice the power of the SATA drives, due to the much faster RPM and seeks.
Advanced topics in Disk Arrays
• Throughput can be increased by having many disk drives and many disk
arms, rather than fewer larger drives
• The drawback is that with more devices, dependability decreases.
• Dependability is improved by adding redundant disks to tolerate faults
• If a single disk fails, then the information is reconstructed from redundant
information.
• Such redundant arrays are known by the acronym RAID (redundant array of
inexpensive disks); some prefer "independent" for the I
Advanced topics in Disk Arrays
RAID 0: no redundancy, nicknamed JBOD (just a bunch of disks)
• Data may be striped (simply spreading the data over multiple disks)
• This level is a measuring stick for the other RAID levels in terms of cost,
performance, and dependability
RAID 1: called mirroring or shadowing; there are two copies of every piece of data.
• Simplest and oldest scheme, but it has the highest cost
• Some array controllers will optimize read performance by allowing the mirrored
disks to act independently, but writes may take longer to complete
Advanced topics in Disk Arrays
RAID 2: applies memory-style error-correcting codes to disks
RAID 3: an extra disk contains the parity information, making it easy to recover from failure. Data is
organized in stripes, with N data blocks and one parity block
RAID 4: aimed at workloads where small accesses dominate. Each sector has its own error checking, safely
increasing read throughput by allowing every disk to perform independent reads.
• Writes would still be slow
• First, the array reads the old data and calculates which bits will change before writing the new data.
• It then reads the old parity, calculates the new parity, and writes it to the check disk
• A small write therefore requires 4 disk accesses, so writes remain slow (see the parity-update sketch below)
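A minimal sketch of that small-write parity update; the block contents are made up, and the XOR-based calculation is the standard RAID 4/5 parity update:

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

# The four accesses of a small write:
old_data = bytes([0x11, 0x22, 0x33, 0x44])    # 1. read old data
old_parity = bytes([0xAA, 0xBB, 0xCC, 0xDD])  # 2. read old parity
new_data = bytes([0x55, 0x66, 0x77, 0x88])

# New parity = old parity XOR old data XOR new data,
# then 3. write new data and 4. write new parity.
new_parity = xor_blocks(old_parity, xor_blocks(old_data, new_data))
print(new_parity.hex())
```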
Advanced topics in Disk Arrays
• RAID 5: the performance flaw for small writes in RAID 4 is that they all must read and write the
same check disk, so it becomes a performance bottleneck
• RAID 5 simply distributes the parity information across all disks in the array, thereby removing the
bottleneck.
• The disk array controller must now calculate which disk holds the parity when it wants to write a
given block, but that can be a simple calculation (see the sketch below).
• It can do the large reads and writes of RAID 3 and the small reads of RAID 4, but it has higher
small-write bandwidth than RAID 4.
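One common rotating-parity placement; the exact rotation convention varies across implementations, so treat this layout as an assumption:

```python
NUM_DISKS = 5  # assumed array size

def parity_disk(stripe: int) -> int:
    """Rotate the parity disk across stripes so no single disk is the write bottleneck."""
    return (NUM_DISKS - 1 - stripe) % NUM_DISKS

for stripe in range(5):
    print(f"stripe {stripe}: parity on disk {parity_disk(stripe)}")
# stripe 0 -> disk 4, stripe 1 -> disk 3, ...; parity visits every disk in turn
```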
Advanced topics in Disk Arrays
RAID 10 versus 01
• Suppose there are 4 disks' worth of data to store and 8 physical disks to use
• 4 pairs of disks, each organized as RAID 1, then striped: called RAID 1+0 or RAID 10
• Or 2 sets of 4 disks, each organized as RAID 0, which are then mirrored: known as RAID 0+1 or RAID 01
RAID 6: beyond a single disk failure
RAID 1 to 5 protect against a single self-identifying failure. If an operator accidentally replaces the
wrong disk during a failure, the array experiences two failures and data will be lost.
So row-diagonal parity, or RAID-DP, uses redundant space based on a parity calculation on a per-
stripe basis. It adds two check blocks per stripe of data.
Putting It All Together: NetApp FAS6000 Filer

• Network Appliance entered the storage market in 1992 with the goal of providing an easy-to-operate
file server running NFS, using its own log-structured file system and a RAID 4 disk array.
• The company later added support for the Windows CIFS file system and a RAID 6 scheme called
row-diagonal parity or RAID-DP
• NetApp also supports iSCSI, which allows SCSI commands to run over a TCP/IP network, thereby
allowing the use of standard networking gear such as Ethernet to connect servers to storage, and
hence greater distance.
• The FAS6000 is a multiprocessor based on the AMD Opteron microprocessor, connected using its
HyperTransport links.
• The FAS6000 comes as either a dual processor (FAS6030) or a quad processor (FAS6070).
Putting It All Together: NetApp FAS6000
Filer
• The FAS6000 connects 8 GB of DDR2700 to each Opteron, yielding 16 GB for the FAS6030 and
32 GB for the FAS6070
• As a filer, the FAS6000 needs a lot of I/O to connect to the disks and to connect to the servers.
The integrated I/O consists of
• 8 Fibre Channel (FC) controllers and ports,
• 6 Gigabit Ethernet links,
• 6 slots for x8 (2 GB/sec) PCI Express cards,
• 3 slots for PCI-X 133 MHz, 64-bit cards,
• plus standard I/O options like IDE, USB, and 32-bit PCI.
Putting It All Together: NetApp FAS6000
Filer
• The 8 Fibre Channel (FC) controllers can each be attached to 6 shelves containing 14 3.5-inch FC
disks. Thus, the maximum number of drives for the integrated I/O is 8 × 6 × 14 = 672 disks.
• Additional FC controllers can be added to the option slots to connect up to 1008 drives, to reduce
the number of drives per FC network so as to reduce contention.
• At 500 GB per FC drive in 2006, if we assume each RAID-DP group has 14 data disks and 2 check
disks, the available data capacity is 294 TB for 672 disks and 441 TB for 1008 disks.
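A quick check of those capacity figures under the stated assumptions (500 GB drives, 14 data + 2 check disks per RAID-DP group):

```python
DRIVE_GB = 500
GROUP = 16           # disks per RAID-DP group (14 data + 2 check, as assumed above)
DATA_PER_GROUP = 14

def usable_tb(total_disks: int) -> float:
    """Usable data capacity in TB, counting only the data disks in each group."""
    return total_disks * (DATA_PER_GROUP / GROUP) * DRIVE_GB / 1000

print(usable_tb(672), usable_tb(1008))  # 294.0 TB and 441.0 TB
```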
• It can also connect to Serial ATA disks via a Fibre Channel to SATA bridge controller
• The six 1-gigabit Ethernet links connect to servers to make the FAS6000 look like a file server if
running NFS or CIFS, or like a block server if running iSCSI.
Putting It All Together: NetApp FAS6000
Filer
• FAS6000 filers can be paired so that if one fails, the other can take over.
• This interconnect also allows each filer to have a copy of the log data in the NVRAM of the other
filer and to keep the clocks of the pair synchronized
• The healthy filer maintains its own network identity and its own primary functions, but it also
assumes the network identity of the failed filer and handles all its data requests via a virtual filer
until an administrator restores the data service to the original state.
Self study
• Designing and Evaluating an I/O System— The Internet Archive Cluster
• Fallacies and Pitfalls
