Lec02 secondaryStorageDevices
Two major types of secondary storage devices:
1. Direct Access Storage Devices (DASDs)
   Magnetic disks
      Hard disks (high capacity, low cost, fast)
      Floppy disks (low capacity, lower cost, slow)
   Optical disks
      CD-ROM (compact disc, read-only memory)
2. Serial devices
   Magnetic tapes (very fast sequential access)
CENG 351
Storage has major implications for DBMS design!
o READ: transfer data from disk to main memory (RAM).
o WRITE: transfer data from RAM to disk.
Both operations are high-cost relative to in-memory operations, so the database must be planned carefully!

Why not store everything in main memory?
o Costs too much: RAM costs about 100 times as much as the same amount of disk space, so main memory is relatively small.
o Main memory is volatile.

Typical storage hierarchy:
o Main memory (RAM) for currently used data (primary storage).
o Disk for the main database (secondary storage).
o Tapes for archiving older versions of the data (tertiary storage).
Spatial units:
o byte: 8 bits
o kilobyte (KB): 1024 or 2^10 bytes
o megabyte (MB): 1024 kilobytes or 2^20 bytes
o gigabyte (GB): 1024 megabytes or 2^30 bytes
Time units:
o nanosecond (ns): one-billionth (10^-9) of a second
o microsecond (µs): one-millionth (10^-6) of a second
o millisecond (ms): one-thousandth (10^-3) of a second

Primary versus Secondary Storage
Primary storage costs several hundred times as much per unit as secondary storage, but has access times that are 250,000 to 1,000,000 times faster than secondary storage.
Bits of data (0s and 1s) are stored on circular magnetic platters called disks. A disk rotates rapidly (& never stops). A disk head reads and writes bits of data as they pass under the head. Often, several platters are organized into a disk pack (or disk drive).
o data is stored in blocks
o blocks occupy sectors
o sectors are on tracks
o files have names
o files are indefinite in size
o files may be updated (in part or whole)
o directory entries record file data
o file allocation table keeps track of file pieces
[Figure: Surface of disk showing tracks and sectors. Labels: surfaces, tracks, sector.]
A disk contains concentric tracks. Tracks are divided into sectors. A sector is the smallest addressable unit on a disk.
The arm assembly is moved in or out to position a head on a desired track. The tracks under the heads form a cylinder (imaginary!). Only one head reads/writes at any one time.
Block size is a multiple of sector size (which is often fixed).
[Figure: Disk drive components. Labels: disk head, spindle, tracks, sector, platters, arm assembly, arm movement.]
Disk controller: typically embedded in the disk drive, it acts as an interface between the CPU and the disk hardware.
The controller has an internal cache (typically a number of MBs) that it uses to buffer data for read/write requests.
When a program reads a byte from the disk, the operating system locates the surface, track and sector containing that byte, and reads the entire sector into a special area in main memory called a buffer. The bottleneck of a disk access is moving the read/write arm. So it makes sense to store a file in tracks that are below/above each other on different surfaces, rather than in several tracks on the same surface.
All the information on a cylinder can be accessed without moving the read/write arm.
Track capacity = # of sectors/track * bytes/sector
Cylinder capacity = # of tracks/cylinder * track capacity
Drive capacity = # of cylinders * cylinder capacity
Number of cylinders = # of tracks in a surface
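The capacity formulas can be chained in a few lines. The sketch below is only an illustration: the geometry values (63 sectors per track, 512 bytes per sector, 16 tracks per cylinder, 4092 cylinders) are assumed for the example and do not come from the slides.

```python
# Hypothetical drive geometry, chosen only to exercise the formulas.
SECTORS_PER_TRACK = 63
BYTES_PER_SECTOR = 512
TRACKS_PER_CYLINDER = 16   # one track per surface
NUM_CYLINDERS = 4092       # equals the number of tracks on one surface


def track_capacity(sectors_per_track, bytes_per_sector):
    """Track capacity = # of sectors/track * bytes/sector."""
    return sectors_per_track * bytes_per_sector


def cylinder_capacity(tracks_per_cylinder, track_bytes):
    """Cylinder capacity = # of tracks/cylinder * track capacity."""
    return tracks_per_cylinder * track_bytes


def drive_capacity(num_cylinders, cylinder_bytes):
    """Drive capacity = # of cylinders * cylinder capacity."""
    return num_cylinders * cylinder_bytes


track_bytes = track_capacity(SECTORS_PER_TRACK, BYTES_PER_SECTOR)
cylinder_bytes = cylinder_capacity(TRACKS_PER_CYLINDER, track_bytes)
drive_bytes = drive_capacity(NUM_CYLINDERS, cylinder_bytes)
print(track_bytes, cylinder_bytes, drive_bytes)
```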
Usually the file manager, under the operating system, maintains the logical view of a file. The file manager views the file as a series of clusters, each consisting of a number of sectors. The clusters are ordered by their logical order. Files can be seen as a sequence of logical sectors or blocks, which need to be mapped to physical clusters. The file manager uses a file allocation table (FAT) to map the logical sectors of the file to the physical clusters.
If there is a lot of room on a disk, it may be possible to make a file consist entirely of contiguous clusters. Then we say that the file is one extent (very good for sequential processing). If there isn't enough contiguous space available to contain an entire file, the file is divided into two or more noncontiguous parts. Each part is a separate extent.
1) When to use a large cluster size?
2) What about a small cluster size?
Disk tracks may be divided into user-defined blocks rather than into sectors. Blocks can be fixed or variable length. A block is usually organized to hold an integral number of logical records.
Blocking factor = number of records stored in a block.
No internal fragmentation, no record spanning over two blocks.
In a block-addressing scheme, each block of data may be accompanied by one or more subblocks containing extra information about the block: record count, last record key on the block.
Both blocks and sectors require non-data overhead (written during formatting). On sector-addressable disks, this information involves sector address, track address, and condition (usable/defective). Pre-formatting also involves placing gaps and synchronization marks between the sectors. On a block-organized disk, where a block may be of any size, more information is needed, and the programmer should be aware of some of this information to utilize it for better efficiency.
Q) How many records can be stored per track if the blocking factor is 10 or 60?
a) 10: floor(20000/1300) * 10 = 15 * 10 = 150 records
b) 60: floor(20000/6300) * 60 = 3 * 60 = 180 records
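The two answers follow one pattern: only whole blocks fit on a track, so the track size is divided by the block size, rounded down, and then multiplied by the blocking factor. The sketch below makes that explicit. The problem data is not given on the slide; a 20000-byte track, 100-byte records and 300 bytes of per-block overhead are inferred here from the block sizes 1300 and 6300 that appear in the arithmetic, so treat them as assumptions.

```python
def records_per_track(track_bytes, record_bytes, overhead_bytes, blocking_factor):
    """Number of records that fit on one track of a block-organized disk."""
    # Each block carries its records plus the non-data overhead.
    block_bytes = blocking_factor * record_bytes + overhead_bytes
    # Only whole blocks fit, so use integer (floor) division.
    blocks = track_bytes // block_bytes
    return blocks * blocking_factor


# Values inferred from the arithmetic on the slide (assumptions).
TRACK_BYTES = 20000
RECORD_BYTES = 100
OVERHEAD_BYTES = 300

for bf in (10, 60):
    print(bf, records_per_track(TRACK_BYTES, RECORD_BYTES, OVERHEAD_BYTES, bf))
```

A larger blocking factor spends less of the track on per-block overhead, which is why 60 records per block stores more records than 10 per block here.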
Time Component | Action
Seek time | Time to move the read/write arm to the correct cylinder
Rotational delay (or latency) | Time it takes for the disk to rotate so that the desired sector is under the read/write head
Transfer time | Once the read/write head is positioned over the data, the time it takes to transfer the data
Seek time is the time required to move the arm to the correct cylinder. It is the largest cost component. Typically:
o (track-to-track)
o 50 ms maximum (from inside track to outside track)
o 30 ms average (from one random track to another random track)
It is usually impossible to know exactly how many tracks will be traversed in every seek, so we usually try to determine the average seek time (s).
If the starting positions for each access are random, it turns out that the average seek traverses one third of the total number of cylinders.
Why? There are more ways to travel a short distance than a long distance.
Manufacturers' specifications for disk drives often list this figure as the average seek time for the drives. Most hard disks today have s under 10 ms, and high-performance disks have s as low as 7.5 ms.
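The one-third figure can be checked by brute force. The sketch below averages the distance |start - target| over every pair of cylinders, assuming both positions are uniformly random and independent; the cylinder count of 200 is only an example value.

```python
def average_seek_distance(num_cylinders):
    """Average number of cylinders crossed between two uniformly random positions."""
    total = 0
    for start in range(num_cylinders):
        for target in range(num_cylinders):
            total += abs(start - target)
    # There are num_cylinders ** 2 equally likely (start, target) pairs.
    return total / (num_cylinders ** 2)


n = 200  # example cylinder count (assumption)
avg = average_seek_distance(n)
print(avg, avg / n)  # the ratio comes out close to 1/3
```

Short distances dominate because there are many (start, target) pairs one cylinder apart but only two pairs that span the whole disk.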
Seek time depends only on the speed with which the head rack moves, and the number of tracks that the head must move across to reach its target. Given the following (which are constant for a particular disk):
Hs = the time for the I/O head to start moving
Ht = the time for the I/O head to move from one track to the next
Then the time for the head to move n tracks is: Seek(n) = Hs + Ht*n
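As a small illustration of Seek(n) = Hs + Ht*n, the function below evaluates the formula directly. The values used for Hs and Ht (3.0 ms and 0.25 ms) are made up for the example and are not taken from any real drive.

```python
def seek_time(n_tracks, start_ms, per_track_ms):
    """Seek(n) = Hs + Ht * n, with all times in milliseconds."""
    return start_ms + per_track_ms * n_tracks


HS_MS = 3.0    # hypothetical time for the head to start moving
HT_MS = 0.25   # hypothetical time to move from one track to the next

for n in (1, 10, 100):
    print(n, seek_time(n, HS_MS, HT_MS))
```

Because Hs is paid once per seek regardless of distance, short seeks are dominated by the start-up cost and long seeks by the per-track cost.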
a) Calculate the time to read 10 sequential blocks on the same track.
b) Calculate the time to read 10 sequential cylinders, if there are 200 cylinders and 20 surfaces.
Given the same disk,
a) Calculate the time to read 100 blocks randomly.
b) Calculate the time to read 100 blocks sequentially.
We assume that blocks are arranged so that there is no rotational delay in transferring from one track to another within the same cylinder. This is possible if consecutive track beginnings are staggered (like running races on circular race tracks).
We also assume that consecutive blocks are arranged so that when the next block is on an adjacent cylinder, there is no rotational delay after the arm is moved to the new cylinder.
Fast sequential reading: no rotational delay after finding the first block.
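Under these assumptions, the difference between random and sequential reading can be sketched as follows. A random read pays seek, rotational delay and transfer for every block, while a sequential read pays seek and rotational delay once and then only transfer time. The timing values (10 ms average seek, 4 ms average rotational delay, 0.5 ms block transfer) are hypothetical, and the sequential model ignores the short cylinder-to-cylinder seeks of a long file.

```python
def random_read_ms(n_blocks, seek_ms, rotation_ms, transfer_ms):
    """Every block needs its own seek, rotational delay and transfer."""
    return n_blocks * (seek_ms + rotation_ms + transfer_ms)


def sequential_read_ms(n_blocks, seek_ms, rotation_ms, transfer_ms):
    """One seek and one rotational delay to find the first block, then pure transfer."""
    return seek_ms + rotation_ms + n_blocks * transfer_ms


# Hypothetical disk parameters, in milliseconds.
SEEK_MS, ROTATION_MS, TRANSFER_MS = 10.0, 4.0, 0.5

print(random_read_ms(100, SEEK_MS, ROTATION_MS, TRANSFER_MS))
print(sequential_read_ms(100, SEEK_MS, ROTATION_MS, TRANSFER_MS))
```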
Given a file of 30000 records, 1600 bytes each, and block size 2400 bytes, how does record placement affect sequential reading time in the following cases? Discuss.
i) Empty space in blocks (internal fragmentation).
ii) Records overlap block boundaries.
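One way to start the discussion is to count how many blocks each placement needs, since sequential reading time grows with the number of blocks transferred. The sketch below uses the figures from the question; the function names are illustrative only.

```python
def blocks_unspanned(num_records, record_bytes, block_bytes):
    """Case i: records never cross a block boundary, leftover space is wasted."""
    records_per_block = block_bytes // record_bytes
    # Ceiling division: a partly filled last block still counts.
    return (num_records + records_per_block - 1) // records_per_block


def blocks_spanned(num_records, record_bytes, block_bytes):
    """Case ii: records are packed back to back and may overlap block boundaries."""
    total_bytes = num_records * record_bytes
    return (total_bytes + block_bytes - 1) // block_bytes


unspanned = blocks_unspanned(30000, 1600, 2400)
spanned = blocks_spanned(30000, 1600, 2400)
wasted_per_block = 2400 - (2400 // 1600) * 1600
print(unspanned, spanned, wasted_per_block)
```

In case i only one 1600-byte record fits in each 2400-byte block, so a third of every block is empty and more blocks must be read. In case ii the file is smaller, but some records need two blocks to be assembled.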
No direct access, but very fast sequential access. Resistant to different environmental conditions. Easy to transport and store, and cheaper than disk. Tape was once widely used to store application data; nowadays, it is mostly used for backups or archives.
A sequence of bits is stored on magnetic tape. For storage, the tape is wound on a reel. To access the data, the tape is unwound from one reel to another. As the tape passes the head, bits of data are read from or written onto the tape.
[Figure: Tape drive. Labels: reel 1, reel 2, tape, read/write head.]
Compact Disc read-only memory (write once); R/W is also available.
Data is encoded and read optically with a laser.
Can store about 600 MB or more of data.
Digital data is represented as a series of pits and lands:
Pit = a little depression, forming a lower level in the track
Land = the flat part between pits, or the upper levels in the track
Reading a CD is done by shining a laser at the disc and detecting changes in the reflection pattern.
1 = change in height (land to pit or pit to land)
0 = a fixed amount of time between 1s
[Figure: Track profile alternating LAND, PIT, LAND, PIT, LAND.]
Note: we cannot have two 1s in a row! => uses an Eight to Fourteen Modulation (EFM) encoding table. Usually, a pattern of 8 bits is translated to/from a pattern of 14 pits and lands.
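The transition rule can be sketched in a few lines. The example below models the track as a string of equal-time samples, "L" for land and "P" for pit, which is a simplification chosen for illustration; it does not implement the EFM table itself, only the rule that a height change reads as 1 and no change reads as 0.

```python
def channel_bits(surface):
    """Turn a sampled land/pit profile into bits: 1 at each height change, else 0."""
    bits = []
    for previous, current in zip(surface, surface[1:]):
        bits.append(1 if previous != current else 0)
    return bits


def has_adjacent_ones(bits):
    """True if two 1s occur in a row, which the encoding must avoid."""
    return any(a == 1 and b == 1 for a, b in zip(bits, bits[1:]))


track = "LLLPPPPLLLPPP"  # hypothetical sampled profile
bits = channel_bits(track)
print(bits, has_adjacent_ones(bits))
```

A profile such as "LPL" would produce two 1s in a row, which corresponds to a pit only one sample long; avoiding such patterns is the reason the 8-bit data patterns are translated through the EFM table.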
While the reading speed of CD-ROM readers is relatively high, such as 24X (24 times CD audio speed), the speed of writing is much slower, as low as 2X. Note that the speed of the audio is about 150 KB per second.
The DVD (Digital Video Disc or Digital Versatile Disc) technology is based on CD technology with increased storage density. The DVD technology allows a two-sided medium, with a storage capacity of up to 10 GB.
Because of the heritage from CD audio, the data is stored as a single spiral track on the CD-ROM, contrary to the discrete track concept of magnetic hard disks. Thus, the rotation speed is controlled by CLV (Constant Linear Velocity). The rotational speed is highest at the center and slows down towards the outer edge, because the recording density is the same everywhere. Note that with CLV, the linear speed of the spiral passing under the R/W head remains constant.
CLV is the culprit for the poor seek time in CD-ROMs.
The advantage of CLV is that the disk is utilized at its best capacity, as the recording density is the same everywhere.
Note that: since 0s are represented by the length of time between transitions, we must travel at constant linear velocity (CLV) on the tracks.
o Sectors are organized along a spiral.
o Sectors have the same linear length.
o Advantage: takes advantage of all storage space available.
o Disadvantage: has to change rotational speed when seeking (slower towards the outside).
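To see why CLV forces the drive to change rotational speed during a seek, the sketch below computes the rotation rate needed to keep a fixed linear speed at a given radius (rotations per second = v / (2 * pi * r)). The numbers used, a linear velocity of 1.3 m/s and data radii from 25 mm to 58 mm, are approximate values assumed for a 1X audio CD and are not taken from the slides.

```python
import math


def rpm_at_radius(linear_velocity_m_s, radius_m):
    """Rotations per minute needed so the track passes the head at the given linear speed."""
    revolutions_per_second = linear_velocity_m_s / (2 * math.pi * radius_m)
    return revolutions_per_second * 60


LINEAR_VELOCITY = 1.3   # m/s, assumed 1X speed
INNER_RADIUS = 0.025    # m, assumed start of the data area
OUTER_RADIUS = 0.058    # m, assumed end of the data area

inner_rpm = rpm_at_radius(LINEAR_VELOCITY, INNER_RADIUS)
outer_rpm = rpm_at_radius(LINEAR_VELOCITY, OUTER_RADIUS)
print(round(inner_rpm), round(outer_rpm))
```

With these assumed values the disc has to slow to less than half its inner-track rotation rate by the time the head reaches the outer edge, and a seek must wait for the motor to reach the new speed.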
Question: Why is the playing time of an audio CD only about 70 minutes?
Ans.: If the sound frequency is 20 kilohertz, we need to sample at twice that frequency to reconstruct the sound wave. Each sample may take up to 2 bytes.
- An accepted standard allows a sampling speed of 44100 times per second, which requires 88200 bytes per second at 2 bytes per sample. For stereo, this becomes 176400 bytes per second.
- If the capacity is about 600 MB, you can compute the number of minutes of playing time.
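The computation left to the reader can be sketched as below. The data rate follows directly from the figures above; the capacity is taken as 600 * 2^20 bytes, which is an assumption, and the resulting playing time changes with whatever capacity is plugged in.

```python
def audio_bytes_per_second(sample_rate=44100, bytes_per_sample=2, channels=2):
    """Data rate of uncompressed audio: samples/s * bytes/sample * channels."""
    return sample_rate * bytes_per_sample * channels


def playing_minutes(capacity_bytes, bytes_per_second):
    """Minutes of audio that fit in the given capacity."""
    return capacity_bytes / bytes_per_second / 60


rate = audio_bytes_per_second()
capacity = 600 * 2**20  # "about 600 MB", taken here as 600 * 2^20 bytes (assumption)
minutes = playing_minutes(capacity, rate)
print(rate, round(minutes, 1))
```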
1 second of play time is divided up into 75 sectors.
Each sector holds 2 KB.
60 min CD: 60 min * 60 sec/min * 75 sectors/sec = 270,000 sectors = 540,000 KB ~ 540 MB
A sector is addressed by Minute:Second:Sector, e.g. 16:22:34
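The Minute:Second:Sector address maps to a linear sector count with simple arithmetic. The sketch below counts sectors from the start of the disc and ignores any offset a real drive may apply; the function names are illustrative.

```python
SECTORS_PER_SECOND = 75
SECONDS_PER_MINUTE = 60
KB_PER_SECTOR = 2


def sector_number(address):
    """Convert 'Minute:Second:Sector' into a sector count from the start of the disc."""
    minute, second, sector = (int(part) for part in address.split(":"))
    return (minute * SECONDS_PER_MINUTE + second) * SECTORS_PER_SECOND + sector


def capacity_kb(minutes):
    """Data capacity in KB of a disc with the given playing time."""
    sectors = minutes * SECONDS_PER_MINUTE * SECTORS_PER_SECOND
    return sectors * KB_PER_SECTOR


print(sector_number("16:22:34"))
print(capacity_kb(60))
```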
One of the problems faced in using CDs for data storage is acceptance of a common file system, with the following desired design goals:
o Support for a hierarchical directory structure, with access in one or two seeks
o Support for generic file names (as in file*.c) during directory access
If we implement the UNIX file system on a CD-ROM, it will be a catastrophe! The seek time per access is from 500 msec to 1 sec.
In this case, one seek may be necessary per subdirectory. For example, /usr/home/mydir/ceng351/exam1 will require five seeks just to locate the file exam1.
Solution: put the entire directory structure in one file, such that it allows building a left-child right-sibling structure to be able to access any file.
For a small directory structure file, the entire directory structure can be kept in memory all the time, which allows this method to work.
The second approach is to create an index to the file locations by hashing the full path name of each file. This method will not work for generic file or directory searches.
A third method may combine both of the above methods: one can keep the advantage of the Unix-like one-file-per-directory scheme while also building indexes for the subdirectories.
A fourth method treats directories as files as well, and uses a special index that organizes the directories and the files into a hierarchy, where a simple parent index indicates the relationship between all entries.
Rec Number | File or dir name | Parent
0 | Root | (none)
1 | Subdir1 | 0
2 | Subdir11 | 1
3 | Subdir12 | 1
4 | File11 | 1
5 | File | 0
6 | Subdir2 | 0
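A parent index of this kind can be searched entirely in memory, without one seek per subdirectory. The sketch below holds the table as a list of (name, parent record number) pairs; it assumes the six parent values belong to records 1 through 6 and that Root has no parent. The function names are illustrative only.

```python
# (name, parent record number); the list position is the record number.
RECORDS = [
    ("Root", None),
    ("Subdir1", 0),
    ("Subdir11", 1),
    ("Subdir12", 1),
    ("File11", 1),
    ("File", 0),
    ("Subdir2", 0),
]


def find(path):
    """Return the record number for a path such as '/Subdir1/File11', or None."""
    current = 0  # start at Root
    for name in [part for part in path.split("/") if part]:
        for rec, (rec_name, parent) in enumerate(RECORDS):
            if parent == current and rec_name == name:
                current = rec
                break
        else:
            return None  # no child with that name under the current entry
    return current


def full_path(rec):
    """Rebuild the path of a record by following parent links up to Root."""
    parts = []
    while RECORDS[rec][1] is not None:
        parts.append(RECORDS[rec][0])
        rec = RECORDS[rec][1]
    return "/" + "/".join(reversed(parts))


print(find("/Subdir1/File11"), full_path(3))
```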
B+ tree type data structures are appropriate for organizing the files on CD-ROMs. Building once and reading many times makes it possible to aim for 100% utilization of blocks or buckets. Packing the internal nodes so that all of them can be kept in memory during the data fetches is important. Secondary indexes can be formed so that the records are pinned to the indexes on a CD-ROM, as the file will never be reorganized.
This may force the files on the source disks and their copies on the CD-ROM to be organized differently, because of efficiency concerns. It is possible to use hashing on the CD-ROM, except that overflow should either not exist or be minimized. This becomes possible when the address space is kept large. Remember that the files to be put on a CD-ROM are final, so the hashing function can be chosen to perform the best, i.e. with no collisions.
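Because the set of keys is final, a collision-free hash function can be found by search before the disc is written. The sketch below is one simple way to illustrate the idea, not a method prescribed by the slides: it tries successive seeds for a CRC32-based hash until every key lands in its own slot. The path names and the table size of 64 are made up for the example.

```python
import zlib


def slot(key, seed, table_size):
    """Slot of a key under the seeded hash function."""
    return zlib.crc32(f"{seed}:{key}".encode()) % table_size


def find_collision_free_seed(keys, table_size, max_seed=10000):
    """Return the first seed that maps all keys to distinct slots, or None."""
    for seed in range(max_seed):
        used = {slot(key, seed, table_size) for key in keys}
        if len(used) == len(keys):
            return seed
    return None


# Hypothetical final set of file names to be placed on the disc.
keys = ["/docs/a.txt", "/docs/b.txt", "/src/main.c", "/src/util.c", "/readme"]
table_size = 64  # a large address space relative to the number of keys

seed = find_collision_free_seed(keys, table_size)
print(seed, sorted(slot(key, seed, table_size) for key in keys))
```

Keeping the table much larger than the key set makes a collision-free seed easy to find, which matches the remark that the address space should be kept large.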