Information Storage System - Chapter 3
Systems
LIMU
SPRING 2017-2018
Objectives
Explain how disk drive performance is measured and expressed.
Explain how IOPS is measured and used as a performance metric.
IOPS
IOPS is used to measure and express the performance of mechanical disk drives, solid-state drives, and storage arrays.
As a performance metric, IOPS is most useful for measuring random workloads such as database I/O.
So what is an I/O operation? An I/O operation is basically a read or a write operation that a disk or disk array
performs in response to a request from a host (usually a server).
Unfortunately, there are several types of I/O operations. Here are the most important:
Read
Write
Random
Sequential
Cache hit
Cache miss
Read and Write IOPS
Random vs. Sequential IOPS
Disk drives are good at sequential IOPS.
Solid-state drives are good at random IOPS.
Disk drives implement a buffer to reorder queued I/O operations, reducing positional latency for random workloads.
Cache Hits and Cache Misses
Cache-hit IOPS, sometimes referred to as cached IOPS, are IOPS that are satisfied from cache
rather than from disk. Because cache is usually DRAM, I/O operations serviced by cache are like
greased lightning compared to I/O operations that have to be serviced from disk.
When looking at IOPS numbers from large disk-array vendors, it is vital to know whether the
IOPS numbers being quoted are cache hits or misses. Cache-hit IOPS will be massively higher
and more impressive than cache-miss IOPS. In the real world, you will have a mixture of cache
hit and cache miss, especially in read workloads, so read IOPS numbers that are purely cache hit
are useless because they bear no resemblance to the real world.
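To see why the mix of hits and misses matters so much, the short sketch below computes an effective average service time from an assumed cache-hit ratio. The latency figures (a fraction of a millisecond from DRAM cache, several milliseconds from spinning disk) are illustrative assumptions rather than vendor numbers.

# Illustrative sketch: effective I/O service time for a mixed cache-hit/miss workload.
# The latency values are assumptions for illustration, not vendor figures.
CACHE_HIT_LATENCY_MS = 0.2   # assumed service time from DRAM cache
CACHE_MISS_LATENCY_MS = 6.0  # assumed service time from spinning disk

def effective_latency_ms(hit_ratio):
    """Weighted-average service time for a given cache-hit ratio (0.0 to 1.0)."""
    return hit_ratio * CACHE_HIT_LATENCY_MS + (1 - hit_ratio) * CACHE_MISS_LATENCY_MS

for hit_ratio in (1.0, 0.9, 0.5, 0.0):
    print(f"hit ratio {hit_ratio:.0%}: ~{effective_latency_ms(hit_ratio):.1f} ms per I/O")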
Calculating Disk Drive IOPS
A quick and easy way to estimate how many IOPS a disk drive should be able to service is
to use this formula:
1 / (x + y) × 1,000
where x = average seek time (in milliseconds), and y = average rotational latency (in milliseconds)
A quick example: a drive with an average seek time of 3.5 ms and an average rotational
latency of 2 ms gives a value of roughly 181 IOPS:
1 / (3.5 + 2) × 1,000 = 181.81
This works only for spinning disks and not for solid-state media or disk arrays with
large caches.
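Expressed as a quick Python sketch, the same estimate looks like this (the function name is purely illustrative):

def estimate_iops(avg_seek_ms, avg_rotational_latency_ms):
    """Estimate the IOPS of a spinning disk from its average seek time and
    average rotational latency, both in milliseconds."""
    return 1 / (avg_seek_ms + avg_rotational_latency_ms) * 1000

# The drive from the example: 3.5 ms average seek, 2 ms average rotational latency.
print(round(estimate_iops(3.5, 2), 2))  # prints 181.82, the ~181 IOPS figure above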
MBps
Megabytes per second (MBps) is another performance metric, used to express the number of
megabytes per second a disk drive or storage array can transfer. If IOPS is best at measuring
random workloads, MBps is most useful for measuring sequential workloads such as media
streaming or large backup jobs.
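IOPS and MBps are linked by the I/O size: throughput is simply the number of operations per second multiplied by the size of each operation. The sketch below illustrates this with assumed I/O sizes; the numbers are for illustration only.

def throughput_mbps(iops, io_size_kb):
    """Approximate throughput in MBps from an IOPS figure and an I/O size in KB."""
    return iops * io_size_kb / 1024

# Assumed examples: the same IOPS figure with small random I/O vs. large sequential I/O.
print(throughput_mbps(180, 8))     # ~1.4 MBps with 8 KB random I/O
print(throughput_mbps(180, 1024))  # 180 MBps with 1 MB sequential I/O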
Maximum and Sustained Transfer Rates
Maximum transfer rate is another term you’ll hear in reference to disk drive performance. It is
usually used to express the absolute highest possible transfer rate of a disk drive (throughput or
MBps) under optimal conditions. It often refers to burst rates that cannot be sustained by the
drive. A better metric is sustained transfer rate (STR).
STR is the rate at which a disk drive can read or write sequential data spread over multiple tracks.
Because it includes data spread over multiple tracks, it incorporates some fairly realistic
overheads, such as R/W head switching and track switching. This makes it one of the most useful
metrics for measuring disk drive performance.
The following is an example of the differences between maximum transfer rate and sustained
transfer rate from a real-world disk drive spec sheet for a 15K drive:
Maximum: 600 MBps
Sustained: 198 MBps
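To see why the sustained figure is the one to plan around, the quick calculation below compares how long a hypothetical 500 GB sequential backup would take at each of the two rates; the job size is an assumption chosen purely for illustration.

def transfer_time_minutes(data_gb, rate_mbps):
    """Time in minutes to move data_gb gigabytes at rate_mbps megabytes per second."""
    return data_gb * 1024 / rate_mbps / 60

# Hypothetical 500 GB sequential backup job, using the spec-sheet rates above.
print(round(transfer_time_minutes(500, 600), 1))  # ~14.2 minutes at the maximum (burst) rate
print(round(transfer_time_minutes(500, 198), 1))  # ~43.1 minutes at the sustained rate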
Disk Drive Reliability
There is no doubt that disk drives are far more reliable than they used to be. However, disk
drives do fail, and when they do, they tend to fail catastrophically. Because of this, you absolutely
must design and build storage solutions with the ability to cope with disk drive failures.
It is also worthy of note that enterprise-class drives are more reliable than cheaper, consumer-
grade drives. There are a couple of reasons:
Enterprise-class drives are often made to a higher specification with better-quality parts.
Enterprise drives tend to be used in environments that are designed to improve the reliability of
disk drives and other computer equipment. For example, enterprise-class disk drives are often
installed in specially designed cages in high-end disk arrays. These cages and arrays are designed
to minimize vibration, and have optimized air cooling and protected power feeds. They also
reside in data centers that are temperature and humidity controlled and have extremely low
levels of dust and similar problems.
Mean Time between Failure
When talking about reliability of disk drives, a common statistic is mean time between failure, or
MTBF for short. It is an abomination of a statistic that can be hideously complicated and is
widely misunderstood. To illustrate, a popular 2.5-inch 1.2 TB 10K SAS drive states an MTBF value
of 2 million hours! This absolutely does not mean that you should bank on this drive running for
228 years (2 million hours) before failing. What the statistic might tell you is that if you have 228
of these disk drives in your environment, you could expect to have around one failure in the first
year. Aside from that, it is not a useful statistic.
Annualized Failure Rate
Another reliability specification is annualized failure rate (AFR). This attempts to estimate the
likelihood that a disk drive will fail during a year of full use. The same 1.2 TB drive in the
preceding example has an AFR of less than 1 percent.
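The worked sketch below shows where the "around one failure in the first year" estimate comes from, and how an MTBF figure translates into an approximate AFR. The 2-million-hour MTBF and the 228-drive population are the figures from the text; the simple hours-per-year approximation is an assumption for illustration.

HOURS_PER_YEAR = 8760

def expected_failures_per_year(drive_count, mtbf_hours):
    """Expected number of failures per year across a population of drives."""
    return drive_count * HOURS_PER_YEAR / mtbf_hours

def approx_afr_percent(mtbf_hours):
    """Approximate annualized failure rate implied by an MTBF figure."""
    return HOURS_PER_YEAR / mtbf_hours * 100

print(round(expected_failures_per_year(228, 2_000_000), 2))  # ~1.0 failure per year
print(round(approx_afr_percent(2_000_000), 2))               # ~0.44 percent, i.e., under 1 percent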
Solid-State Media
The field of solid-state media is large, exciting, and moving at an amazing pace! New technologies and
innovations are hitting the street at an incredible rate. Although solid-state media, and even flash memory,
have been around for decades, uptake has been held back because of capacity, reliability, and cost issues.
Fortunately for us, these are no longer issues, and solid-state storage is truly taking center stage.
Solid-state media comes in various shapes and sizes. But these are the two that we are most interested in:
Solid-State Drive (SSD): For all intents and purposes, an SSD looks and feels like a hard disk drive (HDD). It
comes in the same shape, size, and color. This means that SSDs come in the familiar 2.5- and 3.5-inch form
factors and sport the same SAS, SATA, and FC interfaces and protocols. You can pretty much drop them into
any system that currently supports disk drives.
PCIe Card/Solid-State Card (SSC): This is solid-state media in the form of a PCI expansion card. In this form,
the solid-state device can be plugged into just about any PCIe slot in any server or storage array. These are
common in servers to provide extremely high performance and low latency to server apps. They are also
increasingly popular in high-end storage arrays for use as level 2 caches and extensions of array caches and
tiering.
Solid-State Media
There are also different types of solid-state media,
such as flash memory, phase-change memory
(PCM), memristor, ferroelectric RAM (FRAM), and
plenty of others. However, we will focus on flash
memory because of its wide adoption in the
industry.
Unlike a mechanical disk drive, solid-state media has
no mechanical parts. Like everything else in
computing, solid-state devices are
silicon/semiconductor based. However, this gives
solid-state devices massively different
characteristics and patterns of behavior from those
of spinning disk drives.
Solid-State Media
The following are some of the important high-level differences:
Solid state is the king of random workloads, especially random reads!
Solid-state devices give more-predictable performance because of the total lack of positional
latency such as seek time.
Solid-state media tends to require less power and cooling than mechanical disk.
Sadly, solid-state media tends to come at a much higher $/TB capital expenditure cost than
mechanical disk.
Solid-state devices also commonly have a small DRAM cache that is used as an acceleration
buffer as well as for storing metadata such as the block directory and media-related statistics.
They often have a small battery or capacitor to flush the contents of the cache in the event of
sudden power loss.
Flash Memory
Flash memory is a form of solid-state storage, meaning that it is semiconductor based, has
no moving parts, and provides persistent storage. The specific type of flash memory we use
is called NAND flash, and it has a few quirks that you need to be aware of. Another form of
flash memory is NOR flash, but this is not commonly used in storage and enterprise tech.
Writing to Flash Memory
Flash memory is made up of flash cells that are grouped into pages. Pages are usually 4K, 8K,
or 16K in size. These pages are then grouped into blocks that are even bigger, often 128K,
256K, or 512K. A brand-new flash device comes with all cells set to 1. Flash cells can be
programmed only from a 1 to a 0. If you want to change the value of the cell back to a 1, you
need to erase it via a block-erase operation. Now for the small print: erasing flash memory
can be done only at the block level! Yes, you read that right. Flash memory can be erased
only at the block level. And as you may remember, blocks can be quite large.
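A minimal sketch of these rules, modeling a single flash block as a handful of pages whose cells can only be programmed from 1 to 0 and can only be set back to 1 by erasing the whole block. The page and block sizes are deliberately tiny and purely illustrative.

class FlashBlock:
    """Toy model of one flash block: pages of cells that start at 1, can only be
    programmed 1 -> 0, and can only be reset to 1 by erasing the whole block."""

    def __init__(self, pages=4, cells_per_page=8):
        self.pages = [[1] * cells_per_page for _ in range(pages)]

    def program_page(self, page_index, data):
        # Programming can only clear bits (1 -> 0); it can never set a 0 back to 1.
        page = self.pages[page_index]
        for i, bit in enumerate(data):
            if bit == 0:
                page[i] = 0
            elif page[i] == 0:
                raise ValueError("cannot program a 0 back to 1 without a block erase")

    def erase(self):
        # Erase works only on the whole block, resetting every cell to 1.
        self.pages = [[1] * len(page) for page in self.pages]

block = FlashBlock()
block.program_page(0, [1, 0, 1, 0, 1, 1, 1, 1])  # fine: only clears bits on a clean page
block.erase()                                     # whole-block reset back to all 1s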
Writing to Flash Memory
Programming/writing to flash means a couple of important things:
The first time you write to a flash drive, write performance will be blazingly fast. Every cell in
the block is preset to 1 and can be individually programmed to 0. There’s nothing more to it. It is
simple and lightning quick, although still not as fast as read performance.
If any part of a flash memory block has already been written to, all subsequent writes to any
part of that block will require a much more complex and time-consuming process. This process
is read/erase/program. Basically, the controller reads the current contents of the block into
cache, erases the entire block, and writes the new contents to the entire block. This is obviously
a much slower process than if you are writing to a clean block. Fortunately, though, this
shouldn’t have to occur often, because in a normal operating state a flash drive will have hidden
blocks in a pre-erased condition that it will redirect writes to. Only when a flash device runs out
of these pre-erased blocks will it have to revert to performing the full read/erase/program cycle.
This condition is referred to as the write cliff.
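A toy sketch of the controller behavior just described: writes are redirected to pre-erased blocks while any remain, and only when that pool runs dry does the controller fall back to the slow read/erase/program cycle. The class and the block counts are illustrative assumptions, not a description of any real controller.

class ToyFlashController:
    """Toy model of flash write handling: use pre-erased blocks while available,
    fall back to the slow read/erase/program cycle when none are left."""

    def __init__(self, pre_erased_blocks=2):
        self.pre_erased_blocks = pre_erased_blocks

    def write_block(self, data):
        if self.pre_erased_blocks > 0:
            # Fast path: a clean, pre-erased block is available, so just program it.
            self.pre_erased_blocks -= 1
            return "fast program to a pre-erased block"
        # Slow path (the write cliff): read the old contents, erase the whole
        # block, then program the merged old and new contents back.
        return "full read/erase/program cycle"

controller = ToyFlashController(pre_erased_blocks=2)
for i in range(4):
    print(f"write {i + 1}: {controller.write_block(b'new data')}")
# The first two writes hit clean blocks; the remaining writes fall off the write cliff.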