File Structures: An Object-Oriented Approach with C++

Michael J. Folk, University of Illinois
Bill Zoellick, CAP Ventures
Greg Riccardi, Florida State University

ADDISON-WESLEY
Addison-Wesley is an imprint of Addison Wesley Longman, Inc.
Reading, Massachusetts; Harlow, England; Menlo Park, California; Berkeley, California; Don Mills, Ontario; Sydney; Bonn; Amsterdam; Tokyo; Mexico City

Acquisitions Editor: Susan Hartman
Associate Editor: Katherine Harutunian
Production Editors: Patricia A. O. Unubun / Amy Willcutt
Production Assistant: Brooke D. Albright
Design Editor: Alwyn R. Velasquez
Senior Marketing Manager: Tom Ziolkowski
Interior Design and Composition: Greg Johnson, Art Directions
Cover Designer: Eileen Hoff

Library of Congress Cataloging-in-Publication Data

Folk, Michael J.
File structures: an object-oriented approach with C++ / Michael J. Folk, Bill Zoellick, Greg Riccardi.
p. cm.
Includes bibliographical references and index.
ISBN 0-201-87401-6
1. C++ (Computer program language) 2. File organization (Computer science) I. Zoellick, Bill. II. Riccardi, Greg. III. Title.
QA76.73.C153F65 1998
005.74'1 dc21 97-31670 CIP

Access the latest information about Addison-Wesley titles from our World Wide Web site: http://www.awl.com/cseng

The programs and applications presented in this book have been included for their instructional value. They have been tested with care but are not guaranteed for any purpose. The publisher does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs or applications.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Reprinted with corrections, March 1998.

Copyright © 1998 by Addison Wesley Longman, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America.

To Pauline and Rachel
To Karen, Joshua, and Peter
and
To Ann, Mary, Christina, and Elizabeth

Preface

The first and second editions of File Structures by Michael Folk and Bill Zoellick established a standard for teaching and learning about file structures. The authors helped many students and computing professionals gain familiarity with the tools used to organize files.

This book extends the presentation of file structure design that has been so successful for twelve years with an object-oriented approach to implementing file structures using C++. It demonstrates how the object-oriented approach can be successfully applied to complex implementation problems. It is intended for students in computing classes who have had at least one programming course and for computing professionals who want to improve their skills in using files.

This book shows you how to design and implement efficient file structures that are easy for application programmers to use. All you need is a compiler for C++ or other object-oriented programming language and an operating system. This book provides the conceptual tools that enable you to think through alternative file structure designs that apply to the task at hand.
It also develops the programming skills necessary to produce quality implementations.

The coverage of the C++ language in this book is suitable for readers with a basic knowledge of the language. Readers who have a working familiarity with C++ should have no problem understanding the programming examples. Those who have not programmed in C++ will benefit from access to an introductory textbook. The first programming examples in the book use very simple C++ classes to develop implementations of fundamental file structure tools. One by one, advanced features of C++ appear in the context of implementations of more complex file structure tools. Each feature is fully explained when it is introduced. Readers gain familiarity with inheritance, overloading, virtual methods, and templates and see examples of why these features are so useful to object-oriented programming.

Organization of the Book

The first six chapters of this book give you the tools to design and implement simple file structures from the ground up: simple I/O, methods for transferring objects between memory and files, sequential and direct access, and the characteristics of secondary storage. The last six chapters build on this foundation and introduce you to the most important high-level file structure tools, including indexing, cosequential processing, B-trees, B+ trees, hashing, and extendible hashing.

The book includes extensive discussion of the object-oriented approach to representing information and algorithms and the features of C++ that support this approach. Each of the topics in the text is accompanied by object-oriented representations. The full C++ class definitions and code are included as appendices and are available on the Internet. This code has been developed and tested using Microsoft Visual C++ and the Gnu C++ compilers on a variety of operating systems including Windows 95, Windows NT, Linux, Sun Solaris, and IBM AIX. You can find the programming examples and other materials at the Addison-Wesley Web site: http://www.awl.com/cseng/titles/0-201-87401-6/.

Object-Oriented File Structures

There are two reasons we have added the strong object-oriented programming component to this book. First, it allows us to be more specific, and more helpful, in illustrating the tools of file structure design. For each tool, we give very specific algorithms and explain the options that are available to implementers. We are also able to build full implementations of complex file structure tools that are suitable for solving file design problems. By the time we get to B-tree indexing, for instance, we are able to use previous tools for defining object types, moving data between memory and files, and simple indexing. This makes it possible for the B-tree classes to have simple implementations and for the book to explain the features of B-trees as enhancements of previous tools.

The second purpose of the programming component of the book is to illustrate the proper use of object-oriented methods. Students are often exposed to object-oriented techniques through simple examples. However, it is only in complex systems that the advantages of object-oriented techniques become clear. In this book, we have taken advantage of the orderly presentation of file structure tools to build a complex software system as a sequence of relatively simple design and implementation steps.
Through this approach, students get specific examples of the advantages of object-oriented methods and are able to improve their own programming skills.

A Progressive Presentation of C++

We cover the principles of design and implementation in a progressive fashion. Simple concepts come first and form the foundation for more complex concepts. Simple classes are designed and implemented in the early chapters, then are used extensively for the implementation topics of the later chapters. The most complex file structure tools have simple implementations because they extend the solid foundation of the early chapters.

We also present the features of C++ and the techniques of object-oriented programming in a progressive fashion. The use of C++ begins with the simplest class definitions. Next comes the use of stream classes for input and output. Further examples introduce inheritance, then virtual functions, and finally templates.

Each new feature is introduced and explained in the context of a useful file structure application. Readers see how to apply object-oriented techniques to programming problems and learn firsthand how object-oriented techniques can make complex programming tasks simpler.

Exercises and Programming Problems

The book includes a wealth of new analytical and programming exercises. The programming exercises include extensions and enhancements to the file structure tools and the application of those tools. The tools in the book are working software, but some operations have been left as programming problems. The deletion of records from files, for instance, is discussed in the text but not implemented. Specific programming problems fill in the gaps in the implementations and investigate some of the alternatives that are presented in the text.

An application of information processing is included as a series of programming projects in the exercise sets of appropriate chapters. This application begins in Chapter 1 with the representation of students and course registrations as objects of C++ classes. In Chapter 2, the project asks for simple input and output of these objects. Later projects include implementing files of objects (Chapter 4), indexes to files (Chapter 7), grade reports and transcripts (Chapter 8), B-tree indexes (Chapter 9), and hashed indexes (Chapter 12).

Using the Book as a College Text

The first two editions of File Structures have been used extensively as a text in many colleges and universities. Because the book is quite readable, students typically are expected to read the entire book over the course of a semester. The text covers the basics; class lectures can expand and supplement the material. The professor is free to explore more complex topics and applications, relying on the text to supply the fundamentals.

A word of caution: It is easy to spend too much time on the low-level issues presented in the first six chapters. Move quickly through this material. The relatively large number of pages devoted to these matters is not a reflection of the percentage of the course that should be spent on them. The intent is to provide thorough coverage in the text so the instructor can assign these chapters as background reading, saving precious lecture time for more important topics. It is important to get students involved in the development of file processing software early in the course.
Instructors may choose some combination of file tool implementation problems from the programming exercises and applications of the tools from the programming projects. Each of the programming problems and projects included in the exercises is intended to be of short duration with specific deliverables. Students can be assigned programming problems of one to three weeks in duration. It is typical for one assignment to depend on previous assignments. By conducting a sequence of related software developments, the students finish the semester with extensive experience in object-oriented software development.

A Book for Computing Professionals

We wrote and revised this book with our professional colleagues in mind. The style is conversational; the intent is to provide a book that you can read over a number of evenings, coming away with a good sense of how to approach file structure design problems. Some computing professionals may choose to skip the extensive programming examples and concentrate on the conceptual tools of file structure design. Others may want to use the C++ class definitions and code as the basis for their own implementations of file structure tools.

If you are already familiar with basic file structure design concepts and programming in C++, skim through the first six chapters and begin reading about indexing in Chapter 7. Subsequent chapters introduce you to cosequential processing, B-trees, B+ trees, hashing, and extendible hashing. These are key tools for any practicing programmer who is building file structures. We have tried to present them in a way that is both thorough and readable.

The object-oriented C++ design and the implementation included throughout the book provide an extensive tour of the capabilities of the language and thorough examples of object-oriented design. If you need to build and access file structures similar to the ones in the text, you can use the C++ code as class libraries that you can adapt to your needs. A careful reading of the design and implementation examples can be helpful in enhancing your skills with object-oriented tools. All of the code included in the book is available on the Internet.

If you are not already a serious Unix user, the Unix material in the first eight chapters will give you a feel for why Unix is a powerful environment in which to work with files.

Supplementary Materials

The following supplementary materials are available to assist instructors and students. Links to these supplements are on the book's official World Wide Web page at http://www.awl.com/cseng/titles/0-201-87401-6/.

An Instructors' Guide including answers to exercises will be available. Instructors should contact their Addison-Wesley local sales representative for information on the Guide's availability. Programming examples and code will also be available via anonymous ftp at ftp.aw.com/cseng/authors/riccardi

Acknowledgments

It is a pleasure to acknowledge the outstanding work of Mike Folk and Bill Zoellick. As one who taught from the original work, I am pleased to add my contribution to its evolution.

There are many people I would like to thank for help in preparing this revision of File Structures. The staff of the Computer and Engineering Publishing Group of Addison-Wesley was extremely helpful. Editor Susan Hartman approached me to revise this excellent book and add a C++ programming component. She was responsible for getting all of the complex pieces put together.
Katherine Harutunian, associate editor, was helpful and good-humored during the long and stressful process. The production staff of Patricia Unubun, Brooke Albright, and Amy Willcutt worked with me and were able to get the book finished on time.

I am particularly appreciative of the reviewers: H. K. Dai, Ed Boyno, Mary Ann Robbert, Barbara L. Laguna, Kenneth Cooper, Jr., and Mathew Palakal. Their comments and helpful suggestions showed me many ways to improve the book, especially in the presentation of the programming material.

My greatest debt is to my wife, Ann, and my daughters, Mary, Christina, and Elizabeth, for giving me the time to work on this project. It was their support that allowed me to carry this project to completion.

Greg Riccardi
Tallahassee, Florida
[email protected]
Contents

Preface

Chapter 1  Introduction to the Design and Specification of File Structures
  1.1  The Heart of File Structure Design
  1.2  A Short History of File Structure Design
  1.3  A Conceptual Toolkit: File Structure Literacy
  1.4  An Object-Oriented Toolkit: Making File Structures Usable
  1.5  Using Objects in C++
  Summary. Key Terms. Further Readings. Programming Project.

Chapter 2  Fundamental File Processing Operations
  2.1  Physical Files and Logical Files
  2.2  Opening Files
  2.3  Closing Files
  2.4  Reading and Writing
    2.4.1  Read and Write Functions
    2.4.2  Files with C Streams and C++ Stream Classes
    2.4.3  Programs in C++ to Display the Contents of a File
    2.4.4  Detecting End-of-File
  2.5  Seeking
    2.5.1  Seeking with C Streams
    2.5.2  Seeking with C++ Stream Classes
  2.6  Special Characters in Files
  2.7  The Unix Directory Structure
  2.8  Physical Devices and Logical Files
    2.8.1  Physical Devices as Files
    2.8.2  The Console, the Keyboard, and Standard Error
    2.8.3  I/O Redirection and Pipes
  2.9  File-Related Header Files
  2.10  Unix File System Commands
  Summary. Key Terms. Further Readings. Exercises. Programming Exercises. Programming Project.

Chapter 3  Secondary Storage and System Software
  3.1  Disks
    3.1.1  The Organization of Disks
    3.1.2  Estimating Capacities and Space Needs
    3.1.3  Organizing Tracks by Sector
    3.1.4  Organizing Tracks by Block
    3.1.5  Nondata Overhead
    3.1.6  The Cost of a Disk Access
    3.1.7  Effect of Block Size on Performance: A Unix Example
    3.1.8  Disk as Bottleneck
  3.2  Magnetic Tape
    3.2.1  Types of Tape Systems
    3.2.2  An Example of a High-Performance Tape System
    3.2.3  Organization of Data on Nine-Track Tapes
    3.2.4  Estimating Tape Length Requirements
    3.2.5  Estimating Data Transmission Times
  3.3  Disk versus Tape
  3.4  Introduction to CD-ROM
    3.4.1  A Short History of CD-ROM
    3.4.2  CD-ROM as a File Structure Problem
  3.5  Physical Organization of CD-ROM
    3.5.1  Reading Pits and Lands
    3.5.2  CLV Instead of CAV
    3.5.3  Addressing
    3.5.4  Structure of a Sector
  3.6  CD-ROM Strengths and Weaknesses
    3.6.1  Seek Performance
    3.6.2  Data Transfer Rate
    3.6.3  Storage Capacity
    3.6.4  Read-Only Access
    3.6.5  Asymmetric Writing and Reading
  3.7  Storage as a Hierarchy
  3.8  A Journey of a Byte
    3.8.1  The File Manager
    3.8.2  The I/O Buffer
    3.8.3  The Byte Leaves Memory: The I/O Processor and Disk Controller
  3.9  Buffer Management
    3.9.1  Buffer Bottlenecks
    3.9.2  Buffering Strategies
  3.10  I/O in Unix
    3.10.1  The Kernel
    3.10.2  Linking File Names to Files
    3.10.3  Normal Files, Special Files, and Sockets
    3.10.4  Block I/O
    3.10.5  Device Drivers
    3.10.6  The Kernel and File Systems
    3.10.7  Magnetic Tape and Unix
  Summary. Key Terms. Further Readings. Exercises.

Chapter 4  Fundamental File Structure Concepts
  4.1  Field and Record Organization
    4.1.1  A Stream File
    4.1.2  Field Structures
    4.1.3  Reading a Stream of Fields
    4.1.4  Record Structures
    4.1.5  A Record Structure That Uses a Length Indicator
    4.1.6  Mixing Numbers and Characters: Use of a File Dump
  4.2  Using Classes to Manipulate Buffers
    4.2.1  Buffer Class for Delimited Text Fields
    4.2.2  Extending Class Person with Buffer Operations
    4.2.3  Buffer Classes for Length-Based and Fixed-Length Fields
  4.3  Using Inheritance for Record Buffer Classes
    4.3.1  Inheritance in the C++ Stream Classes
    4.3.2  A Class Hierarchy for Record Buffer Objects
  4.4  Managing Fixed-Length, Fixed-Field Buffers
  4.5  An Object-Oriented Class for Record Files
  Summary. Key Terms. Further Readings. Exercises. Programming Exercises. Programming Project.

Chapter 5  Managing Files of Records
  5.1  Record Access
    5.1.1  Record Keys
    5.1.2  A Sequential Search
    5.1.3  Unix Tools for Sequential Processing
    5.1.4  Direct Access
  5.2  More about Record Structures
    5.2.1  Choosing a Record Structure and Record Length
    5.2.2  Header Records
    5.2.3  Adding Headers to C++ Buffer Classes
  5.3  Encapsulating Record I/O Operations in a Single Class
  5.4  File Access and File Organization
  5.5  Beyond Record Structures
    5.5.1  Abstract Data Models for File Access
    5.5.2  Headers and Self-Describing Files
    5.5.3  Metadata
    5.5.4  Color Raster Images
    5.5.5  Mixing Object Types in One File
    5.5.6  Representation-Independent File Access
    5.5.7  Extensibility
  5.6  Portability and Standardization
    5.6.1  Factors Affecting Portability
    5.6.2  Achieving Portability
  Summary. Key Terms. Further Readings. Exercises. Programming Exercises.

Chapter 6  Organizing Files for Performance
  6.1  Data Compression
    6.1.1  Using a Different Notation
    6.1.2  Suppressing Repeating Sequences
    6.1.3  Assigning Variable-Length Codes
    6.1.4  Irreversible Compression Techniques
    6.1.5  Compression in Unix
  6.2  Reclaiming Space in Files
    6.2.1  Record Deletion and Storage Compaction
    6.2.2  Deleting Fixed-Length Records for Reclaiming Space Dynamically
    6.2.3  Deleting Variable-Length Records
    6.2.4  Storage Fragmentation
    6.2.5  Placement Strategies
  6.3  Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching
    6.3.1  Finding Things in Simple Field and Record Files
    6.3.2  Search by Guessing: Binary Search
    6.3.3  Binary Search versus Sequential Search
    6.3.4  Sorting a Disk File in Memory
    6.3.5  The Limitations of Binary Searching and Internal Sorting
  6.4  Keysorting
    6.4.1  Description of the Method
    6.4.2  Limitations of the Keysort Method
    6.4.3  Another Solution: Why Bother to Write the File Back?
    6.4.4  Pinned Records
  Summary. Key Terms. Further Readings. Exercises. Programming Exercises. Programming Project.

Chapter 7  Indexing
  7.1  What Is an Index?
  7.2  A Simple Index for Entry-Sequenced Files
  7.3  Using Template Classes in C++ for Object I/O
  7.4  Object-Oriented Support for Indexed, Entry-Sequenced Files of Data Objects
    7.4.1  Operations Required to Maintain an Indexed File
    7.4.2  Class TextIndexedFile
    7.4.3  Enhancements to Class TextIndexedFile
  7.5  Indexes That Are Too Large to Hold in Memory
  7.6  Indexing to Provide Access by Multiple Keys
  7.7  Retrieval Using Combinations of Secondary Keys
  7.8  Improving the Secondary Index Structure: Inverted Lists
    7.8.1  A First Attempt at a Solution
    7.8.2  A Better Solution: Linking the List of References
  7.9  Selective Indexes
  7.10  Binding
  Summary. Key Terms. Further Readings. Exercises. Programming and Design Exercises. Programming Project.

Chapter 8  Cosequential Processing and the Sorting of Large Files
  8.1  An Object-Oriented Model for Implementing Cosequential Processes
    8.1.1  Matching Names in Two Lists
    8.1.2  Merging Two Lists
    8.1.3  Summary of the Cosequential Processing Model
  8.2  Application of the Model to a General Ledger Program
    8.2.1  The Problem
    8.2.2  Application of the Model to the Ledger Program
  8.3  Extension of the Model to Include Multiway Merging
    8.3.1  A K-way Merge Algorithm
    8.3.2  A Selection Tree for Merging Large Numbers of Lists
  8.4  A Second Look at Sorting in Memory
    8.4.1  Overlapping Processing and I/O: Heapsort
    8.4.2  Building the Heap While Reading the File
    8.4.3  Sorting While Writing to the File
  8.5  Merging as a Way of Sorting Large Files on Disk
    8.5.1  How Much Time Does a Merge Sort Take?
    8.5.2  Sorting a File That Is Ten Times Larger
    8.5.3  The Cost of Increasing the File Size
    8.5.4  Hardware-Based Improvements
    8.5.5  Decreasing the Number of Seeks Using Multiple-Step Merges
    8.5.6  Increasing Run Lengths Using Replacement Selection
    8.5.7  Replacement Selection Plus Multistep Merging
    8.5.8  Using Two Disk Drives with Replacement Selection
    8.5.9  More Drives? More Processors?
    8.5.10  Effects of Multiprogramming
    8.5.11  A Conceptual Toolkit for External Sorting
  8.6  Sorting Files on Tape
    8.6.1  The Balanced Merge
    8.6.2  The K-way Balanced Merge
    8.6.3  Multiphase Merges
    8.6.4  Tapes versus Disks for External Sorting
  8.7  Sort-Merge Packages
  8.8  Sorting and Cosequential Processing in Unix
    8.8.1  Sorting and Merging in Unix
    8.8.2  Cosequential Processing Utilities in Unix
  Summary. Key Terms. Further Readings. Exercises. Programming Exercises. Programming Project.

Chapter 9  Multilevel Indexing and B-Trees
  9.1  Introduction: The Invention of the B-Tree
  9.2  Statement of the Problem
  9.3  Indexing with Binary Search Trees
    9.3.1  AVL Trees
    9.3.2  Paged Binary Trees
    9.3.3  Problems with Paged Trees
  9.4  Multilevel Indexing, a Better Approach to Tree Indexes
  9.5  B-Trees: Working up from the Bottom
  9.6  Example of Creating a B-Tree
  9.7  An Object-Oriented Representation of B-Trees
    9.7.1  Class BTreeNode: Representing B-Tree Nodes in Memory
    9.7.2  Class BTree: Supporting Files of B-Tree Nodes
  9.8  B-Tree Methods Search, Insert, and Others
    9.8.1  Searching
    9.8.2  Insertion
    9.8.3  Create, Open, and Close
    9.8.4  Testing the B-Tree
  9.9  B-Tree Nomenclature
  9.10  Formal Definition of B-Tree Properties
  9.11  Worst-Case Search Depth
  9.12  Deletion, Merging, and Redistribution
    9.12.1  Redistribution
  9.13  Redistribution During Insertion: A Way to Improve Storage Utilization
  9.14  B* Trees
  9.15  Buffering of Pages: Virtual B-Trees
    9.15.1  LRU Replacement
    9.15.2  Replacement Based on Page Height
    9.15.3  Importance of Virtual B-Trees
  9.16  Variable-Length Records and Keys
  Summary. Key Terms. Further Readings. Exercises. Programming Exercises. Programming Project.

Chapter 10  Indexed Sequential File Access and Prefix B+ Trees
  10.1  Indexed Sequential Access
  10.2  Maintaining a Sequence Set
    10.2.1  The Use of Blocks
    10.2.2  Choice of Block Size
  10.3  Adding a Simple Index to the Sequence Set
  10.4  The Content of the Index: Separators Instead of Keys
  10.5  The Simple Prefix B+ Tree
  10.6  Simple Prefix B+ Tree Maintenance
    10.6.1  Changes Localized to Single Blocks in the Sequence Set
    10.6.2  Changes Involving Multiple Blocks in the Sequence Set
  10.7  Index Set Block Size
  10.8  Internal Structure of Index Set Blocks: A Variable-Order B-Tree
  10.9  Loading a Simple Prefix B+ Tree
  10.10  B+ Trees
  10.11  B-Trees, B+ Trees, and Simple Prefix B+ Trees in Perspective
  Summary. Key Terms. Further Readings. Exercises. Programming Exercises. Programming Project.

Chapter 11  Hashing
  11.1  Introduction
    11.1.1  What Is Hashing?
    11.1.2  Collisions
  11.2  A Simple Hashing Algorithm
  11.3  Hashing Functions and Record Distributions
    11.3.1  Distributing Records among Addresses
    11.3.2  Some Other Hashing Methods
    11.3.3  Predicting the Distribution of Records
    11.3.4  Predicting Collisions for a Full File
  11.4  How Much Extra Memory Should Be Used?
    11.4.1  Packing Density
    11.4.2  Predicting Collisions for Different Packing Densities
  11.5  Collision Resolution by Progressive Overflow
    11.5.1  How Progressive Overflow Works
    11.5.2  Search Length
  11.6  Storing More Than One Record per Address: Buckets
    11.6.1  Effects of Buckets on Performance
    11.6.2  Implementation Issues
  11.7  Making Deletions
    11.7.1  Tombstones for Handling Deletions
    11.7.2  Implications of Tombstones for Insertions
    11.7.3  Effects of Deletions and Additions on Performance
  11.8  Other Collision Resolution Techniques
    11.8.1  Double Hashing
    11.8.2  Chained Progressive Overflow
    11.8.3  Chaining with a Separate Overflow Area
    11.8.4  Scatter Tables: Indexing Revisited
  11.9  Patterns of Record Access
  Summary. Key Terms. Further Readings. Exercises. Programming Exercises.

Chapter 12  Extendible Hashing
  12.1  Introduction
  12.2  How Extendible Hashing Works
    12.2.1  Tries
    12.2.2  Turning the Trie into a Directory
    12.2.3  Splitting to Handle Overflow
  12.3  Implementation
    12.3.1  Creating the Addresses
    12.3.2  Classes for Representing Bucket and Directory Objects
    12.3.3  Bucket and Directory Operations
    12.3.4  Implementation Summary
  12.4  Deletion
    12.4.1  Overview of the Deletion Process
    12.4.2  A Procedure for Finding Buddy Buckets
    12.4.3  Collapsing the Directory
    12.4.4  Implementing the Deletion Operations
    12.4.5  Summary of the Deletion Operation
  12.5  Extendible Hashing Performance
    12.5.1  Space Utilization for Buckets
    12.5.2  Space Utilization for the Directory
  12.6  Alternative Approaches
    12.6.1  Dynamic Hashing
    12.6.2  Linear Hashing
    12.6.3  Approaches to Controlling Splitting
  Summary. Key Terms. Further Readings. Exercises. Programming Exercises. Programming Project.

Appendix A  Designing File Structures for CD-ROM
  A.1  Using This Appendix
  A.2  Tree Structures on CD-ROM
    A.2.1  Design Exercises
    A.2.2  Block Size
    A.2.3  Special Loading Procedures and Other Considerations
    A.2.4  Virtual Trees and Buffering Blocks
    A.2.5  Trees as Secondary Indexes on CD-ROM
  A.3  Hashed Files on CD-ROM
    A.3.1  Design Exercises
    A.3.2  Bucket Size
    A.3.3  How the Size of CD-ROM Helps
    A.3.4  Advantages of CD-ROM's Read-Only Status
  A.4  The CD-ROM File System
    A.4.1  The Problem
    A.4.2  Design Exercise
    A.4.3  A Hybrid Design
  Summary.

Appendix B  ASCII Table

Appendix C  Formatted Output with C++ Stream Classes

Appendix D  Simple File Input/Output Examples
  D.1  Listc.cpp. Program to read and display the contents of a file using C streams
  D.2  Listcpp.cpp. Program to read and display the contents of a file using C++ stream classes
  D.3  Person.h. Definition for class Person, including code for constructor
  D.4  Writestr.cpp. Write Person objects into a stream file
  D.5  Readdel.cpp. Read Person objects with fields delimited by '|'
  D.6  Readvar.cpp. Read variable-length records and break up into Person objects
  D.7  Writeper.cpp. Function to write a person to a text file
  D.8  Readper.cpp. Function to prompt user and read fields of a Person

Appendix E  Classes for Buffer Manipulation
  E.1  Person.h. Definition for class Person
  E.2  Person.cpp. Code for class Person
  E.3  Deltext.h. Definition for class DelimitedTextBuffer
  E.4  Deltext.cpp. Code for class DelimitedTextBuffer
  E.5  Lentext.h. Definition for class LengthTextBuffer
  E.6  Lentext.cpp. Code for class LengthTextBuffer
  E.7  Fixtext.h. Definition for class FixedTextBuffer
  E.8  Fixtext.cpp. Code for class FixedTextBuffer
  E.9  Test.cpp. Test program for all buffer classes

Appendix F  A Class Hierarchy for Buffer Input/Output
  F.1  Person.h. Definition for class Person
  F.2  Person.cpp. Code for class Person
  F.3  Iobuffer.h. Definition for class IOBuffer
  F.4  Iobuffer.cpp. Code for class IOBuffer
  F.5  Varlen.h. Definition for class VariableLengthBuffer
  F.6  Varlen.cpp. Code for class VariableLengthBuffer
  F.7  Delim.h. Definition for class DelimFieldBuffer
  F.8  Delim.cpp. Code for class DelimFieldBuffer
  F.9  Length.h. Definition for class LengthFieldBuffer
  F.10  Length.cpp. Code for class LengthFieldBuffer
  F.11  Fixlen.h. Definition for class FixedLengthBuffer
  F.12  Fixlen.cpp. Code for class FixedLengthBuffer
  F.13  Fixfld.h. Definition for class FixedFieldBuffer
  F.14  Fixfld.cpp. Code for class FixedFieldBuffer
  F.15  Buffile.h. Definition for class BufferFile
  F.16  Buffile.cpp. Code for class BufferFile
  F.17  Recfile.h. Template class RecordFile
  F.18  Test.cpp. Test program for Person and RecordFile including template function

Appendix G  Single-Level Indexing of Records by Key
  G.1  Recordng.h. Definition of class Recording with composite key
  G.2  Recordng.cpp. Code for class Recording
  G.3  Makerec.cpp. Program to create a sample data file of recordings
  G.4  Textind.h. Definition of class TextIndex
  G.5  Textind.cpp. Code for class TextIndex
  G.6  RecFile.h. Template class RecordFile
  G.7  Makeind.cpp. Program to make an index file for a file of recordings
  G.8  Tindbuff.h. Definition of class TextIndexBuffer
  G.9  Tindbuff.cpp. Code for class TextIndexBuffer
  G.10  Indfile.h. Template class TextIndexedFile
  G.11  Strclass.h. Definition of class String
  G.12  Strclass.cpp. Code for class String
  G.13  Simpind.h. Definition of template class SimpleIndex
  G.14  Simpind.tc. Code for template class SimpleIndex

Appendix H  Cosequential Processing
  H.1  Coseq.h. Definition of class CosequentialProcess
  H.2  Strlist.h. Definition of class StringListProcess
  H.3  Strlist.cpp. Code for class StringListProcess
  H.4  Match.cpp. Main program for string matching and merging application
  H.5  Mastrans.h. Definition and code for template class MasterTransactionProcess
  H.6  Ledgpost.h. Definition of class LedgerProcess
  H.7  Ledgpost.cpp. Code for class LedgerProcess
  H.8  Ledger.h. Definition of classes Ledger and Journal
  H.9  Ledger.cpp. Code for classes Ledger and Journal
  H.10  Heapsort.cpp. Code for class Heap and Heapsort

Appendix I  Multilevel Indexing with B-Trees
  I.1  Btnode.h. Definition of template class BTreeNode
  I.2  Btnode.tc. Method bodies for template class BTreeNode
  I.3  Btree.h. Definition of template class BTree
  I.4  Btree.tc. Method bodies for template class BTree
  I.5  Tstbtree.cpp. Program to test B-tree insertion

Appendix J  Extendible Hashing
  J.1  Hash.h. Functions Hash and MakeAddress
  J.2  Hash.cpp. Implementation of functions Hash and MakeAddress
  J.3  Bucket.h. Definition of class Bucket
  J.4  Directory.h. Definition of class Directory
  J.5  Tsthash.cpp. Program to test extendible hashing
  J.6  Directory.cpp. Implementation of class Directory
  J.7  Bucket.cpp. Implementation of class Bucket

BIBLIOGRAPHY

INDEX

File Structures: An Object-Oriented Approach with C++

CHAPTER 1
Introduction to the Design and Specification of File Structures

CHAPTER OBJECTIVES
+ Introduce the primary design issues that characterize file structure design.
+ Survey the history of file structure design, since tracing the developments in file structures teaches us much about how to design our own file structures.
+ Introduce the notions of file structure literacy and of a conceptual toolkit for file structure design.
+ Discuss the need for precise specification of data structures and operations and the development of an object-oriented toolkit that makes file structures easy to use.
+ Introduce classes and overloading in the C++ language.

CHAPTER OUTLINE
1.1 The Heart of File Structure Design
1.2 A Short History of File Structure Design
1.3 A Conceptual Toolkit: File Structure Literacy
1.4 An Object-Oriented Toolkit: Making File Structures Usable
1.5 Using Objects in C++

1.1 The Heart of File Structure Design

Disks are slow. They are also technological marvels: one can pack thousands of megabytes on a disk that fits into a notebook computer. Only a few years ago, disks with that kind of capacity looked like small washing machines. However, relative to other parts of a computer, disks are slow.

How slow? The time it takes to get information back from even relatively slow electronic random access memory (RAM) is about 120 nanoseconds, or 120 billionths of a second. Getting the same information from a typical disk might take 30 milliseconds, or 30 thousandths of a second. To understand the size of this difference, we need an analogy. Assume that memory access is like finding something in the index of this book. Let's say that this local, book-in-hand access takes 20 seconds. Assume that accessing a disk is like sending to a library for the information you cannot find here in this book. Given that our "memory access" takes 20 seconds, how long does the "disk access" to the library take, keeping the ratio the same as that of a real memory access and disk access? The disk access is a quarter of a million times longer than the memory access. This means that getting information back from the library takes 5 million seconds, or almost 58 days. Disks are very slow compared with memory.
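The arithmetic behind the analogy is simple to verify. The short C++ program below is our own illustration rather than anything from the text's code; it scales the 20-second "book lookup" by the disk-to-memory ratio:

    #include <iostream>

    int main() {
        const double ram_access  = 120e-9;  // 120 nanoseconds, in seconds
        const double disk_access = 30e-3;   // 30 milliseconds, in seconds

        double ratio = disk_access / ram_access;          // 250,000
        double library_seconds = 20.0 * ratio;            // scale the 20-second lookup
        double library_days = library_seconds / 86400.0;  // 86,400 seconds per day

        std::cout << "disk/RAM ratio: " << ratio << "\n";             // 250000
        std::cout << "library access: " << library_seconds << " s = " // 5e+06 s,
                  << library_days << " days\n";                       // about 57.9 days
        return 0;
    }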
On the other hand, disks provide enormous capacity at much less cost than memory. They also keep the information stored on them when they are turned off. The tension between a disk's relatively slow access time and its enormous, nonvolatile capacity is the driving force behind file structure design. Good file structure design will give us access to all the capacity without making our applications spend a lot of time waiting for the disk.

A file structure is a combination of representations for data in files and of operations for accessing the data. A file structure allows applications to read, write, and modify data. It might also support finding the data that matches some search criteria or reading through the data in some particular order. An improvement in file structure design may make an application hundreds of times faster. The details of the representation of the data and the implementation of the operations determine the efficiency of the file structure for particular applications.

A tremendous variety in the types of data and in the needs of applications makes file structure design very important. What is best for one situation may be terrible for another.

1.2 A Short History of File Structure Design

Our goal is to show you how to think creatively about file structure design problems. Part of our approach draws on history: after introducing basic principles of design, we devote the last part of this book to studying some of the key developments in file design over the last thirty years. The problems that researchers struggle with reflect the same issues that you confront in addressing any substantial file design problem. Working through the approaches to major file design issues shows you a lot about how to approach new design problems.

The general goals of research and development in file structures can be drawn directly from our library analogy.

+ Ideally, we would like to get the information we need with one access to the disk. In terms of our analogy, we do not want to issue a series of fifty-eight-day requests before we get what we want.
+ If it is impossible to get what we need in one access, we want structures that allow us to find the target information with as few accesses as possible. For example, you may remember from your studies of data structures that a binary search allows us to find a particular record among fifty thousand other records with no more than sixteen comparisons. But having to look sixteen places on a disk before finding what we want takes too much time. We need file structures that allow us to find what we need with only two or three trips to the disk.
+ We want our file structures to group information so we are likely to get everything we need with only one trip to the disk. If we need a client's name, address, phone number, and account balance, we would prefer to get all that information at once, rather than having to look in several places for it.

It is relatively easy to come up with file structure designs that meet these goals when we have files that never change. Designing file structures that maintain these qualities as files change, grow, or shrink when information is added and deleted is much more difficult.

Early work with files presumed that files were on tape, since most files were. Access was sequential, and the cost of access grew in direct proportion to the size of the file. As files grew intolerably large for unaided sequential access and as storage devices such as disk drives became available, indexes were added to files. The indexes made it possible to keep a list of keys and pointers in a smaller file that could be searched more quickly. With the key and pointer, the user had direct access to the large, primary file.

Unfortunately, simple indexes had some of the same sequential flavor as the data files, and as the indexes grew, they too became difficult to manage, especially for dynamic files in which the set of keys changes. Then, in the early 1960s, the idea of applying tree structures emerged. Unfortunately, trees can grow very unevenly as records are added and deleted, resulting in long searches requiring many disk accesses to find a record.

In 1963 researchers developed an elegant, self-adjusting binary tree structure, called an AVL tree, for data in memory. Other researchers began to look for ways to apply AVL trees, or something like them, to files. The problem was that even with a balanced binary tree, dozens of accesses were required to find a record in even moderate-sized files. A method was needed to keep a tree balanced when each node of the tree was not a single record, as in a binary tree, but a file block containing dozens, perhaps even hundreds, of records.

It took nearly ten more years of design work before a solution emerged in the form of the B-tree. Part of the reason finding a solution took so long was that the approach required for file structures was very different from the approach that worked in memory. Whereas AVL trees grow from the top down as records are added, B-trees grow from the bottom up. B-trees provided excellent access performance, but there was a cost: no longer could a file be accessed sequentially with efficiency.
Fortunately, this problem was solved almost immediately by adding a linked list structure at the bottom level of the B-tree. The combination of a B-tree and a sequential linked list is called a B+ tree. Over the next ten years, B-trees and B+ trees became the basis for many commercial file systems, since they provide access times that grow in proportion to log_k N, where N is the number of entries in the file and k is the number of entries indexed in a single block of the B-tree structure. In practical terms, this means that B-trees can guarantee that you can find one file entry among millions of others with only three or four trips to the disk. Further, B-trees guarantee that as you add and delete entries, performance stays about the same.
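To see where "three or four trips" comes from, try sample numbers in the formula. The block size and file size below are our own illustrative assumptions, not figures from the text:

    #include <cmath>
    #include <iostream>

    int main() {
        // Depth of a B-tree search: log_k N = ln N / ln k.
        double k = 512.0;      // entries indexed per block (assumed)
        double N = 1000000.0;  // entries in the file (assumed)
        std::cout << std::log(N) / std::log(k) << "\n";  // about 2.21, so
        return 0;  // finding one entry among a million takes about 3 block reads
    }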
Each particular approach to representing some aspect of a file structure is repre- sented by one or more classes of objects. A major problem in describing the classes that can be used for file structure design is that they are complicated and progressive. New classes are often modifications or extensions of other classes, and the details of the data representations and operations become éver more complex. The most effective strategy for describing these classes is to give specific representa- tions in the simplest fashion. In this text, use the C++ programming language to give precise specifications to the file structure classes. From the first chapter to the last, this allows us to build one class on top of another in a concise and understandable fashion. Using Objects in C++ In an object-oriented information system, data content and behavior are integrated into a single design. The objects of the system are divided into classes of objects with common characteristics. Each class is described by its members, which are either data attributes (data members) or functions (member functions or methods). This book illustrates the principles of object-oriented design through implementations of file structures and file operations as C++ classes, These classes are also an extensive presentation of the features of C++. In this section, we look at some of the features of objects in C++, including class definitions, constructors, public and private sections, and operator overloading. Later chapters show how to make effective use of.inheritance, virtual functions, and templates.Using Objects in C++ 7 ‘An example of a very simple C++ class is Person, as given below. class Person { public: // data members char LastName [11], FirstName [11], Address [16]; char City [16], State [3], ZipCode (10); // method Person (); // default constructor Ui Each Person object has first and last names, address, city, state, and zip code, which are declared as members, just as they would be in a C struct. For an object p of type Person, p. LastName refers to its LastName member. The pub: label specifies that the following members and methods are part of the interface to objects of the class. These members and meth- ods can be freely accessed by any users of Person objects. There are three levels of access to class members: public, private, and protected. The last two restrict access and will be described later in the book. The only significant difference in C++ between struct and class is that for struct members the default access is public, and for class members the default access is private. Each of these member fields is represented by a character array of fixed size. However, the usual style of dealing with character arrays in C++ is to represent the value of the array as a null-delimited, variable-sized string with a maximum length. The number of characters in the represen- tation of a string is one more than the number of characters in the string. The LastName field, for example, is represented by an array of eleven characters and can hold a string of length between 0 and 10. Proper use of strings in C++ is dependent on ensuring that every string variable is initialized before it is used. C++ includes special methods called constructors that are used to provide a guarantee that.every object is properly initialized.! A construc- tor is a method with no return type whose name is the same as the class. Whenever an object is created, a constructor is called. 
The two ways that objects are created in C++ are by the declaration of a variable (automatic creation) and by the execution of a new operation (dynamic creation):

    Person p;                      // automatic creation
    Person * p_ptr = new Person;   // dynamic creation

Execution of either of the object creation statements above includes the execution of the Person constructor. Hence, we are sure that every Person object has been properly initialized before it is used. The code for the Person constructor initializes each member to an empty string by assigning 0 (null) to the first character:

    Person::Person ()
    {   // set each field to an empty string
        LastName [0] = 0; FirstName [0] = 0; Address [0] = 0;
        City [0] = 0; State [0] = 0; ZipCode [0] = 0;
    }

The symbol :: is the scope resolution operator. In this case, it tells us that Person () is a method of class Person. Notice that within the method code, the members can be referenced without the dot (.) operator. Every call on a member function has a pointer to an object as a hidden argument. The implicit argument can be explicitly referred to with the keyword this. Within the method, this->LastName is the same as LastName.

Overloading of symbols in programming languages allows a particular symbol to have more than one meaning. The meaning of each instance of the symbol depends on the context. We are very familiar with overloading of arithmetic operators to have different meanings depending on the operand type. For example, the symbol + is used for both integer and floating point addition. C++ supports the use of overloading by programmers for a wide variety of symbols. We can create new meanings for operator symbols and for named functions. The following class String illustrates extensive use of overloading: there are three constructors, and the operators = and == are overloaded with new meanings:

    class String {
     public:
        String ();                  // default constructor
        String (const String&);    // copy constructor
        String (const char *);     // create from C string
        ~String ();                 // destructor
        String & operator = (const String &);    // assignment
        int operator == (const String &) const;  // equality
        operator char * ()              // conversion to char *
            {return strdup(string);}   // inline body of method
     private:
        char * string;     // represent value as C string
        int MaxLength;
    };

The data members, string and MaxLength, of class String are in the private section of the class. Access to these members is restricted. They can be referenced only from inside the code of methods of the class. Hence, users of String objects cannot directly manipulate these members. A conversion operator (operator char *) has been provided to allow the use of the value of a String object as a C string. The body of this operator is given inline, that is, directly in the class definition. To protect the value of the String from direct manipulation, a copy of the string value is returned. This operator allows a String object to be used as a char *. For example, the following code creates a String object s1
and copies its value to a normal C string:

    String s1 ("abcdefg");  // uses String::String (const char *)
    char str[10];
    strcpy (str, s1);       // uses String::operator char * ()

The new definition of the assignment operator (operator =) replaces the standard meaning, which in C and C++ is to copy the bit pattern of one object to another. For two objects s1 and s2 of class String, s1 = s2 would copy the value of s2.string (a pointer) to s1.string. Hence, s1.string and s2.string would point to the same character array. In essence, s1 and s2 would become aliases. Once the two fields point to the same array, a change in the string value of s1 would also change s2. This is contrary to how we expect variables to behave. The implementation of the assignment operator and an example of its use are:

    String & String::operator = (const String & str)
    {   // code for assignment operator
        strcpy (string, str.string);
        return *this;
    }

    String s1, s2;
    s1 = s2;   // using overloaded assignment

In the assignment s1 = s2, the hidden argument (this) refers to s1, and the explicit argument str refers to s2. The line strcpy (string, str.string); copies the contents of the string member of s2 to the string member of s1. This assignment operator does not create the alias problem that occurs with the standard meaning of assignment.

To complete the class String, we add the copy constructor, which is used whenever a copy of a string is needed, and the equality operator (operator ==), which makes two String objects equal if the array contents are the same. The predefined meaning for these operators performs pointer copy and pointer comparison, respectively. The full specification and implementation of class String are given in Appendix G.
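For concreteness, those two members could be defined along the following lines. This is our sketch of one plausible implementation, not the code of Appendix G; it assumes the value array is heap-allocated and uses strcmp to compare contents:

    #include <string.h>

    // A possible copy constructor: allocate a separate array so the new
    // object never shares (aliases) the source object's character array.
    String::String (const String & str)
    {
        MaxLength = str.MaxLength;
        string = new char [MaxLength + 1];
        strcpy (string, str.string);
    }

    // A possible equality operator: compare array contents, not pointers.
    int String::operator == (const String & str) const
    {
        return strcmp (string, str.string) == 0;
    }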
SUMMARY

The key design problem that shapes file structure design is the relatively large amount of time that is required to get information from a disk. All file structure designs focus on minimizing disk accesses and maximizing the likelihood that the information the user will want is already in memory.

This text begins by introducing the basic concepts and issues associated with file structures. The last half of the book tracks the development of file structure design as it has evolved over the last thirty years. The key problem addressed throughout this evolution has been finding ways to minimize disk accesses for files that keep changing in content and size. Tracking these developments takes us first through work on sequential file access, then through developments in tree-structured access, and finally to relatively recent work on direct access to information in files.

Our experience has been that the study of the principal research and design contributions to file structures, focusing on how the design work uses the same tools in new ways, provides a solid foundation for thinking creatively about new problems in file structure design. The presentation of these tools in an object-oriented design makes them tremendously useful in solving real problems.

Object-oriented programming supports the integration of data content and behavior into a single design. C++ class definitions contain both data and function members and allow programmers to control precisely the manipulation of objects. The use of overloading, constructors, and private members enhances the programmer's ability to control the behavior of objects.

KEY TERMS

AVL tree. A self-adjusting binary tree structure that can guarantee good access times for data in memory.

B-tree. A tree structure that provides fast access to data stored in files. Unlike binary trees, in which the branching factor from a node of the tree is two, the descendants from a node of a B-tree can be a much larger number. We introduce B-trees in Chapter 9.

B+ tree. A variation on the B-tree structure that provides sequential access to the data as well as fast indexed access. We discuss B+ trees at length in Chapter 10.

Class. The specification of the common data attributes (members) and functions (methods) of a collection of objects.

Constructor. A function that initializes an object when it is created. C++ automatically adds a call to a constructor for each operation that creates an object.

Extendible hashing. An approach to hashing that works well with files that over time undergo substantial changes in size.

File structures. The organization of data on secondary storage devices such as disks.

Hashing. An access mechanism that transforms the search key into a storage address, thereby providing very fast access to stored data.

Member. An attribute of an object that is included in a class specification. Members are either data fields or functions (methods).

Method. A function member of an object. Methods are included in class specifications.

Overloaded symbol. An operator or identifier in a program that has more than one meaning. The context of the use of the symbol determines its meaning.

Private. The most restrictive access control level in C++. Private names can be used only by member functions of the class.

Public. The least restrictive access control level in C++. Public names can be used in any function.

Sequential access. Access that takes records in order, looking at the first, then the next, and so on.

FURTHER READINGS

There are many good introductory textbooks on C++ and object-oriented programming, including Berry (1997), Friedman and Kofman (1994), and Sessions (1992). The second edition of Stroustrup's book on C++ (1998) is the standard reference for the language. The third edition of Stroustrup (1997) is a presentation of the Draft Standard for C++ 3.0.

PROGRAMMING PROJECT

This is the first part of an object-oriented programming project that continues throughout the book. Each part extends the project with new file structures. We begin by introducing two classes of data objects. These projects apply the concepts of the book to produce an information system that maintains and processes information about students and courses.

1. Design a class Student. Each object represents information about a single student. Members should be included for identifier, name, address, date of first enrollment, and number of credit hours completed. Methods should be included for initialization (constructors), assignment (overloaded "=" operator), and modifying field values, including a method to increment the number of credit hours. (One possible declaration is sketched after this list.)

2. Design a class CourseRegistration. Each object represents the enrollment of a student in a course. Members should be included for a course identifier, student identifier, number of credit hours, and course grade. Methods should be included as appropriate.

3. Create a list of student and course registration information. This information will be used in subsequent exercises to test and evaluate the capabilities of the programming project.
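As a starting point for item 1, a declaration along the following lines would satisfy the stated requirements. The member names and array sizes here are our own assumptions, offered only as one possible shape, not part of the project specification:

    class Student {
     public:
        Student ();                               // default constructor
        Student & operator = (const Student &);   // overloaded assignment
        void SetName (const char * name);         // modify a field value
        void AddCreditHours (int hours);          // increment credit hours
     private:
        char Identifier [10];       // illustrative sizes throughout
        char Name [31];
        char Address [51];
        char EnrollmentDate [11];   // date of first enrollment
        int CreditHours;            // number of credit hours completed
    };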
‘The next part of the programming project is in Chapter 2.CHAPTER Fundamental File Processing Operations CHAPTER OBJECTIVES Describe the process of linking a logical file within a program to an actual physical file or device. Describe the procedures used to create, open, and close files. Introduce the C++ input and output classes. Explain the use of overloading in C++. Describe the procedures used for reading from and writing to files. Introduce the concept of position within a file and describe procedures for seeking different positions. Provide an introduction to the organization of hierarchical file systems. Present the Unix view of a file and describe Unix file operations and commands based on this view. a22.1 Chapter 2 Fundamental File Processing Operations CHAPTER OUTLINE 2.1 Physical Files and Logical Files 2.2 Opening Files 2.3. Closing Files 2.4 Reading and Writing 2.4.1 Read and Write Functions 2.4.2 Files with C Streams and C++ Stream Classes 2.4.3 Programs in C++ to Display'the Contents of a File 2.4.4 Detecting End-of-File 2.5 Seeking 2.5.1 Seeking with C Streams 2.5.2 Seeking with C++ Stream Classes 2.6 Special Characters in Files 2.7. The Unix Directory Structure 2.8 Physical Devices and Logical Files 2.8.1 Physical Devices as Files 2.8.2 The Console, the Keyboard, and Standard Error 2.8.3 I/O Redirection and Pipes 2.9 File-Related Header Files 2.10 Unix File System Commands Physical Files and Logical Files When we talk about a file on a disk or tape, we refer to a particular collec- tion of bytes stored there. A file, when the word is used in this sense, phys- ically exists. A disk drive might contain hundreds, even thousands, of these physical files. From the standpoint of an application program, the notion of a file is different. To the program, a file is somewhat like a telephone line connect- ed to a telephone network. The program can receive bytes through this phone line or send bytes down it, but it knows nothing about where these bytes come from or where they go. The program knows only about its own end of the phone line. Moreover, even though there may be thousands of physical files on a disk, a single program is usually limited to the use of only about twenty files. The application program relies on the operating system to take care of the details of the telephone switching system, as illustrated in Fig. 2.1. It could be that bytes coming down the line into the program originate from2.2 Opening Files 15 a physical file or that they come from the keyboard or some other input device. Similarly, the bytes the program sends down the line might end up in a file, or they could appear on the terminal screen. Although the program often doesn’t know where bytes are coming from or where they are going, it does know which line it is using. This line is usually referred to as the logical file to distinguish it from the physical files on the disk or tape. Before the program can open a file for use, the operating system must receive instructions about making a hookup between a logical file (for example, a phone line) and some physical file or device. When using oper- ating systems such as IBM’s OS/MVS, these instructions are provided through job control language (JCL). On minicomputers and microcom- puters, more modern operating systems such as Unix, MS-DOS, and VMS provide the instructions within the program. 
For example, in Cobol, the association between a logical file called inp_file and a physical file called myfile.dat is made with the following statement:

   select inp_file assign to "myfile.dat".

This statement asks the operating system to find the physical file named myfile.dat and then to make the hookup by assigning a logical file (phone line) to it. The number identifying the particular phone line that is assigned is returned through the variable inp_file, which is the file's logical name. This logical name is what we use to refer to the file inside the program. Again, the telephone analogy applies: My office phone is connected to six telephone lines. When I receive a call I get an intercom message such as, "You have a call on line three." The receptionist does not say, "You have a call from 918-123-4567." I need to have the call identified logically, not physically.

2.2 Opening Files

Once we have a logical file identifier hooked up to a physical file or device, we need to declare what we intend to do with the file. In general, we have two options: (1) open an existing file, or (2) create a new file, deleting any existing contents in the physical file. Opening a file makes it ready for use by the program. We are positioned at the beginning of the file and are ready to start reading or writing. The file contents are not disturbed by the open statement. Creating a file also opens the file in the sense that it is ready for use after creation. Because a newly created file has no contents, writing is initially the only use that makes sense.

Figure 2.1 The program relies on the operating system to make connections between logical files and physical files and devices. The operating system switchboard can make connections to thousands of files or I/O devices, while the program is limited to approximately twenty phone lines (logical files).

As an example of opening an existing file or creating a new one in C and C++, consider the function open, as defined in header file fcntl.h. Although this function is based on a Unix system function, many C++ implementations for MS-DOS and Windows, including Microsoft Visual C++, also support open and the other parts of fcntl.h. This function takes two required arguments and a third argument that is optional:

   fd = open(filename, flags [, pmode]);

The return value fd and the arguments filename, flags, and pmode have the following meanings:

Argument   Type     Explanation

fd         int      The file descriptor. Using our earlier analogy, this
                    is the phone line (logical file identifier) used to
                    refer to the file within the program. It is an integer.
                    If there is an error in the attempt to open the file,
                    this value is negative.

filename   char *   A character string containing the physical file name.
                    (Later we discuss pathnames that include directory
                    information about the file's location. This argument
                    can be a pathname.)

(continued)

flags      int      The flags argument controls the operation of the open
                    function, determining whether it opens an existing file
                    for reading or writing. It can also be used to indicate
                    that you want to create a new file or open an existing
                    file but delete its contents.
The value of flags is set by performing a bit-wise OR of the following values, among others:

   O_APPEND   Append every write operation to the end of the file.
   O_CREAT    Create and open a file for writing. This has no effect if
              the file already exists.
   O_EXCL     Return an error if O_CREAT is specified and the file exists.
   O_RDONLY   Open a file for reading only.
   O_RDWR     Open a file for reading and writing.
   O_TRUNC    If the file exists, truncate it to a length of zero,
              destroying its contents.
   O_WRONLY   Open a file for writing only.

Some of these flags cannot be used in combination with one another. Consult your documentation for details and for other options.

1. These values are defined in an "include" file packaged with your Unix system or C compiler. The name of the include file is often fcntl.h or file.h, but it can vary from system to system.

pmode      int      If O_CREAT is specified, pmode is required. This integer
                    argument specifies the protection mode for the file. In
                    Unix, the pmode is a three-digit octal number that
                    indicates how the file can be used by the owner (first
                    digit), by members of the owner's group (second digit),
                    and by everyone else (third digit). The first bit of
                    each octal digit indicates read permission, the second
                    write permission, and the third execute permission. So,
                    if pmode is the octal number 0751, the file's owner has
                    read, write, and execute permission for the file; the
                    owner's group has read and execute permission; and
                    everyone else has only execute permission:

                                     rwx rwx rwx
                       pmode = 0751 = 111 101 001
                                     owner group world

Given this description of the open function, we can develop some examples to show how it can be used to open and create files in C. The following function call opens an existing file for reading and writing or creates a new one if necessary. If the file exists, it is opened without change; reading or writing would start at the file's first byte.

   fd = open(filename, O_RDWR | O_CREAT, 0751);

The following call creates a new file for reading and writing. If there is already a file with the name specified in filename, its contents are truncated.

   fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0751);

Finally, here is a call that will create a new file only if there is not already a file with the name specified in filename. If a file with this name exists, it is not opened, and the function returns a negative value to indicate an error.

   fd = open(filename, O_RDWR | O_CREAT | O_EXCL, 0751);

File protection is tied more to the host operating system than to a specific language. For example, implementations of C running on systems that support file protection, such as VAX/VMS, often include extensions to standard C that let you associate a protection status with a file when you create it.

2.3 Closing Files

In terms of our telephone line analogy, closing a file is like hanging up the phone. When you hang up the phone, the phone line is available for taking or placing another call; when you close a file, the logical file name or file descriptor is available for use with another file. Closing a file that has been used for output also ensures that everything has been written to the file. As you will learn in a later chapter, it is more efficient to move data to and from secondary storage in blocks than it is to move data one byte at a time. Consequently, the operating system does not immediately send off the bytes we write but saves them up in a buffer for transfer as a block of data. Closing a file ensures that the buffer for that file has been flushed of data and that everything we have written has been sent to the file.

Files are usually closed automatically by the operating system when a program terminates normally.
Consequently, the execution of a close statement within a program is needed only to protect it against data loss in the event that the program is interrupted and to free up logical filenames for reuse.

Now that you know how to connect and disconnect programs to and from physical files and how to open the files, you are ready to start sending and receiving data.

2.4 Reading and Writing

Reading and writing are fundamental to file processing; they are the actions that make file processing an input/output (I/O) operation. The form of the read and write statements used in different languages varies. Some languages provide very high-level access to reading and writing and automatically take care of details for the programmer. Other languages provide access at a much lower level. Our use of C and C++ allows us to explore some of these differences.2

2.4.1 Read and Write Functions

We begin with reading and writing at a relatively low level. It is useful to have a kind of systems-level understanding of what happens when we send and receive information to and from a file.

A low-level read call requires three pieces of information, expressed here as arguments to a generic Read function:

   Read (Source_file, Destination_addr, Size)

Source_file       The Read call must know where it is to read from. We
                  specify the source by logical file name (phone line)
                  through which data is received. (Remember, before we do
                  any reading, we must have already opened the file so the
                  connection between a logical file and a specific physical
                  file or device exists.)

Destination_addr  Read must know where to place the information it reads
                  from the input file. In this generic function we specify
                  the destination by giving the first address of the memory
                  block where we want to store the data.

Size              Finally, Read must know how much information to bring in
                  from the file. Here the argument is supplied as a byte
                  count.

2. To accentuate the differences and view I/O operations at something close to a systems level, we use the fread and fwrite functions in C rather than the higher-level functions such as fgetc, fgets, and so on.

A Write statement is similar; the only difference is that the data moves in the other direction:

   Write (Destination_file, Source_addr, Size)

Destination_file  The logical file name that is used for sending the data.

Source_addr       Write must know where to find the information it will
                  send. We provide this specification as the first address
                  of the memory block where the data is stored.

Size              The number of bytes to be written must be supplied.

2.4.2 Files with C Streams and C++ Stream Classes

I/O operations in C and C++ are based on the concept of a stream, which can be a file or some other source or consumer of data. There are two different styles for manipulating files in C++. The first uses the standard C functions defined in header file stdio.h. This is often referred to as C streams or C input/output. The second uses the stream classes of header files iostream.h and fstream.h. We refer to this style as C++ stream classes.

The header file stdio.h contains definitions of the types and the operations defined on C streams. The standard input and output of a C program are streams called stdin and stdout, respectively.
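Before going further into C streams, it may help to see the generic Read and Write calls of Section 2.4.1 in concrete form. The short sketch below is not from the original text: it assumes a Unix-like environment in which the read and write system calls (declared in unistd.h on most modern systems) follow exactly the three-argument pattern just described; the file names input.dat and copy.dat are arbitrary examples.

   // copybytes.cpp -- a sketch, assuming a Unix-like system: copy one
   // file to another using the low-level read and write calls, which
   // take (logical file, memory address, byte count) arguments.
   #include <fcntl.h>     // open and its flag values
   #include <unistd.h>    // read, write, close

   int main ()
   {
      char buffer[512];           // memory block for the exchange
      int nbytes;                 // number of bytes actually read
      int in_fd  = open("input.dat", O_RDONLY);
      int out_fd = open("copy.dat", O_WRONLY | O_CREAT | O_TRUNC, 0751);
      if (in_fd < 0 || out_fd < 0) return 1;   // could not open

      // Read(Source_file, Destination_addr, Size) until end-of-file
      while ((nbytes = read(in_fd, buffer, 512)) > 0)
         write(out_fd, buffer, nbytes); // Write(Dest_file, Source_addr, Size)

      close(in_fd);
      close(out_fd);
      return 0;
   }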
Other files can be associated with streams through the use of the fopen function:

   file = fopen(filename, type);

The return value file and the arguments filename and type have the following meanings:

Argument   Type     Explanation

file       FILE *   A pointer to the file descriptor. Type FILE is another
                    name for struct _iobuf. If there is an error in the
                    attempt to open the file, this value is null, and the
                    variable errno is set with the error number.

filename   char *   The file name, just as in the Unix open function.

type       char *   The type argument controls the operation of the open
                    function, much like the flags argument to open. The
                    following values are supported:
                    "r"   Open an existing file for input.
                    "w"   Create a new file, or truncate an existing one,
                          for output.
                    "a"   Create a new file, or append to an existing one,
                          for output.
                    "r+"  Open an existing file for input and output.
                    "w+"  Create a new file, or truncate an existing one,
                          for input and output.
                    "a+"  Create a new file, or append to an existing one,
                          for input and output.

Read and write operations are supported by functions fread, fgetc, fwrite, and fputc. Functions fscanf and fprintf are used for formatted input and output.

Stream classes in C++ support open, close, read, and write operations that are equivalent to those in stdio.h, but the syntax is considerably different. Predefined stream objects cin and cout represent the standard input and standard output files. The main class for access to files, fstream, as defined in header files iostream.h and fstream.h, has two constructors and a wide variety of methods. The following constructors and methods are included in the class:

   fstream ();   // leave the stream unopened
   fstream (char * filename, int mode);
   int open (char * filename, int mode);
   int read (unsigned char * dest_addr, int size);
   int write (unsigned char * source_addr, int size);

The argument filename of the second constructor and the method open are just as we've seen before. These two operations attach the fstream to a file. The value of mode controls the way the file is opened, like the flags and type arguments previously described. The value is set with a bit-wise or of constants defined in class ios. Among the options are ios::in (input), ios::out (output), ios::nocreate (fail if the file does not exist), and ios::noreplace (fail if the file does exist). One additional, nonstandard option, ios::binary, is supported on many systems to specify that a file is binary. On MS-DOS systems, if ios::binary is not specified, the file is treated as a text file. This can have some unintended consequences, as we will see later.

A large number of functions are provided for formatted input and output. The overloading capabilities of C++ are used to make sure that objects are formatted according to their types. The infix operators >> (extraction) and << (insertion) are overloaded for input and output, respectively, as the sketch below illustrates for a user-defined class.
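To make the overloading mechanism concrete, here is a brief sketch, not from the original text, of how a program might overload the insertion operator for a class of its own. The Point class and its members are invented for illustration; the key idea is that operator << returns its ostream argument so that insertions can be chained left to right.

   #include <iostream.h>   // pre-standard header, as used in this text

   class Point
   {
     public:
      Point (int xval, int yval): x(xval), y(yval) {}
      int x, y;
   };

   // overloaded insertion operator for Point objects
   ostream & operator << (ostream & stream, const Point & p)
   {
      stream << "(" << p.x << ", " << p.y << ")";
      return stream;   // returning the stream allows chaining
   }

   int main ()
   {
      Point corner(3, 4);
      cout << "corner is at " << corner << endl; // chained insertions
      return 0;
   }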
The header file iostream.h includes the following overloaded definitions of the insertion operator (and many others):

   ostream& operator<<(char c);
   ostream& operator<<(unsigned char c);
   ostream& operator<<(signed char c);
   ostream& operator<<(const char *s);
   ostream& operator<<(const unsigned char *s);
   ostream& operator<<(const signed char *s);
   ostream& operator<<(const void *p);
   ostream& operator<<(int n);
   ostream& operator<<(unsigned int n);
   ostream& operator<<(long n);
   ostream& operator<<(unsigned long n);

The overloading resolution rules of C++ specify which function is selected for a particular call depending on the types of the actual arguments and the types of the formal parameters. In this case, the insertion function that is used to evaluate an expression depends on the type of the arguments, particularly the right argument. Consider the following statements that include insertions into cout (an object of class ostream):

   int n = 25;
   cout << "Value of n is " << n << endl;

The insertion operators are evaluated left to right, and each one returns its left argument as the result. Hence, the stream cout has first the string "Value of n is" inserted, using the fourth function in the list above, then the decimal value of n, using the eighth function in the list. The last operand is the I/O manipulator endl, which causes an end-of-line to be inserted. The insertion function that is used for << endl is not in the list above. The header file iostream.h includes the definition of endl and the operator that is used for this insertion.

Appendix C includes definitions and examples of many of the formatted input and output operations.

2.4.3 Programs in C++ to Display the Contents of a File

Let's do some reading and writing to see how these functions are used. This first simple file processing program opens a file for input and reads it, character by character, sending each character to the screen after it is read from the file. This program includes the following steps:

1. Display a prompt for the name of the input file.
2. Read the user's response from the keyboard into a variable called filename.
3. Open the file for input.
4. While there are still characters to be read from the input file,
   a. read a character from the file;
   b. write the character to the terminal screen.
5. Close the input file.

Figures 2.2 and 2.3 are C++ implementations of this program using C streams and C++ stream classes, respectively. It is instructive to look at the differences between these implementations. The full implementations of these programs are included in Appendix D.

Steps 1 and 2 of the program involve writing and reading, but in each of the implementations this is accomplished through the usual functions for handling the screen and keyboard. Step 4a, in which we read from the input file, is the first instance of actual file I/O. Note that the fread call using C streams parallels the low-level, generic Read statement we described earlier; in truth, we used the fread function as the model for our low-level Read. The function's first argument gives the address of a character variable used as the destination for the data, the second and third arguments are the element size and the number of elements (in this case the size is 1 byte, and the number of elements is one), and the fourth argument gives a pointer to the file descriptor (the C stream version of a logical file name) as the source for the input.
// listc.cpp
// program using C streams to read characters from a file
// and write them to the terminal screen
#include <stdio.h>

main( )
{
   char ch;
   FILE * file;   // pointer to file descriptor
   char filename[20];

   printf("Enter the name of the file: ");   // Step 1
   gets(filename);                           // Step 2
   file = fopen(filename, "r");              // Step 3
   while (fread(&ch, 1, 1, file) != 0)       // Step 4a
      fwrite(&ch, 1, 1, stdout);             // Step 4b
   fclose(file);                             // Step 5
}

Figure 2.2 The file listing program using C streams (listc.cpp).
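Before turning to the C++ stream version in Figure 2.3, an aside not present in the original listing: the same program can be written with the higher-level character functions mentioned in the footnote to Section 2.4.1. The variant below is a sketch using fgetc and fputc, whose return-value convention (EOF at end of file) differs from fread's byte count.

   // a sketch of the same listing program using fgetc and fputc
   #include <stdio.h>

   int main ()
   {
      int ch;               // int, so EOF can be distinguished from data
      FILE * file;
      char filename[20];

      printf("Enter the name of the file: ");
      gets(filename);
      file = fopen(filename, "r");
      while ((ch = fgetc(file)) != EOF)   // fgetc returns EOF at end
         fputc(ch, stdout);
      fclose(file);
      return 0;
   }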
// listcpp.cpp
// list contents of file using C++ stream classes
#include <fstream.h>

main ()
{
   char ch;
   fstream file;   // declare unattached fstream
   char filename[20];

   cout << "Enter the name of the file: "   // Step 1
        << flush;        // force output
   cin >> filename;                          // Step 2
   file . open(filename, ios::in);           // Step 3
   file . unsetf(ios::skipws);   // include white space in read
   while (1)
   {
      file >> ch;                            // Step 4a
      if (file.fail()) break;
      cout << ch;                            // Step 4b
   }
   file . close();                           // Step 5
}

Figure 2.3 The file listing program using C++ stream classes (listcpp.cpp).

The arguments for the call to operator >> communicate the same information at a higher level. The first argument is the logical file name for the input source. The second argument is the name of a character variable, which is interpreted as the address of the variable. The overloading resolution selects the >> operator whose right argument is a char variable. Hence, the code implies that only a single byte is to be transferred. In the C++ version, the call file.unsetf(ios::skipws) causes operator >> to include white space (blanks, end-of-line, tabs, and so on). The default for formatted read with C++ stream classes is to skip white space.

After a character is read, we write it to standard output in Step 4b. Once again the differences between C streams and C++ stream classes indicate the range of approaches to I/O used in different languages. Everything must be stated explicitly in the fwrite call, using the special assigned file descriptor of stdout to identify the terminal screen as the destination for our writing:

   fwrite(&ch, 1, 1, stdout);

This means: "Write to standard output the contents from memory starting at the address &ch. Write only one element of one byte." Beginning C++ programmers should pay special attention to the use of the & symbol in the fwrite call here. This particular call, as a very low-level call, requires that the programmer provide the starting address in memory of the bytes to be transferred.

Stdout, which stands for "standard output," is a pointer to a struct defined in the file stdio.h, which has been included at the top of the program. The concept of standard output and its counterpart standard input are covered later in Section 2.8 "Physical Devices and Logical Files."

Again the C++ stream code operates at a higher level. The right operand of operator << is a character value. Hence a single byte is transferred to cout:

   cout << ch;

As in the call to operator >>, C++ takes care of finding the address of the bytes; the programmer need specify only the name of the variable ch that is associated with that address.

2.4.4 Detecting End-of-File

The programs in Figs. 2.2 and 2.3 have to know when to end the while loop and stop reading characters. C streams and C++ streams signal the end-of-file condition differently. The function fread returns a value that indicates whether the read succeeded, but an explicit test is required to see if the C++ stream read has failed.

The fread call returns the number of elements read as its value. In this case, if fread returns a value of zero, the program has reached the end of the file. So we construct the while loop to run as long as the fread call finds something to read.

Each C++ stream has a state that can be queried with function calls. Figure 2.3 uses the function fail, which returns true (1) if the previous operation on the stream failed. In this case, file.fail() returns true if the previous read failed because of trying to read past end-of-file. The following statement exits the while loop when end-of-file is encountered:

   if (file.fail()) break;

In some languages, including Ada, a function end_of_file can be used to test for end-of-file.
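One more note on the fail() test, not from the original text: because a C++ stream converts to a false value once an operation has failed, the extraction itself can serve as the loop condition. The following sketch assumes an input file with the arbitrary name myfile and behaves like the Step 4 loop in Figure 2.3.

   // eofloop.cpp -- sketch: folding the end-of-file test into the condition
   #include <fstream.h>   // pre-standard header, as used in this text

   int main ()
   {
      char ch;
      fstream file;
      file . open("myfile", ios::in);  // "myfile" is an arbitrary example
      file . unsetf(ios::skipws);      // include white space in reads
      while (file >> ch)   // the stream tests false after a failed read,
         cout << ch;       //   so no separate call to fail() is needed
      file . close();
      return 0;
   }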
As we read from a file, the operating system keeps track of our location in the file with a read/write pointer. This is necessary: when the next byte is read, the system knows where to get it. The end_of_file function queries the system to see whether the read/write pointer has moved past the last element in the file. If it has, end_of_file returns true; otherwise it returns false. In Ada, it is necessary to call end_of_file before trying to read the next byte. For an empty file, end_of_file immediately returns true, and no bytes can be read.

2.5 Seeking

In the preceding sample programs we read through the file sequentially, reading one byte after another until we reach the end of the file. Every time a byte is read, the operating system moves the read/write pointer ahead, and we are ready to read the next byte.

Sometimes we want to read or write without taking the time to go through every byte sequentially. Perhaps we know that the next piece of information we need is ten thousand bytes away, so we want to jump there. Or perhaps we need to jump to the end of the file so we can add new information there. To satisfy these needs we must be able to control the movement of the read/write pointer.

The action of moving directly to a certain position in a file is often called seeking. A seek requires at least two pieces of information, expressed here as arguments to the generic pseudocode function Seek:

   Seek(Source_file, Offset)

Source_file   The logical file name in which the seek will occur.
Offset        The number of positions in the file the pointer is to be
              moved from the start of the file.

Now, if we want to move directly from the origin to the 373rd position in a file called data, we don't have to move sequentially through the first 372 positions. Instead, we can say

   Seek(data, 373)

2.5.1 Seeking with C Streams

One of the features of Unix that has been incorporated into C streams is the ability to view a file as a potentially very large array of bytes that just happens to be kept on secondary storage. In an array of bytes in memory, we can move to any particular byte using a subscript. The C stream seek function, fseek, provides a similar capability for files. It lets us set the read/write pointer to any byte in a file.

The fseek function has the following form:

   pos = fseek(file, byte_offset, origin)

where the variables have the following meanings:

pos          A long integer value returned by fseek equal to the position
             (in bytes) of the read/write pointer after it has been moved.
file         The file descriptor of the file to which the fseek is to be
             applied.
byte_offset  The number of bytes to move from some origin in the file. The
             byte offset must be specified as a long integer, hence the
             name lseek, for long seek. When appropriate, the byte_offset
             can be negative.
origin       A value that specifies the starting position from which the
             byte_offset is to be taken. The origin can have the value 0,
             1, or 2:3
                0—fseek from the beginning of the file;
                1—fseek from the current position;
                2—fseek from the end of the file.

The following definitions are included in stdio.h to allow symbolic reference to the origin values:

   #define SEEK_SET 0
   #define SEEK_CUR 1
   #define SEEK_END 2

The following program fragment shows how you could use fseek to move to a position that is 373 bytes into a file.
   long pos;
   FILE * file;
   . . .
   pos = fseek(file, 373L, 0);

3. Although the values 0, 1, and 2 are almost always used here, they are not guaranteed to work for all C implementations. Consult your documentation.

2.5.2 Seeking with C++ Stream Classes

Seeking in C++ stream classes is almost exactly the same as it is in C streams. There are two mostly syntactic differences:

■ An object of type fstream has two file pointers: a get pointer for input and a put pointer for output. Two functions are supplied for seeking: seekg, which moves the get pointer, and seekp, which moves the put pointer. It is not guaranteed that the pointers move separately, but they might. We have to be very careful in our use of these seek functions and often call both functions together.

■ The seek operations are methods of the stream classes. Hence the syntax is file.seekg(byte_offset, origin) and file.seekp(byte_offset, origin). The value of origin comes from class ios, which is described in more detail in Chapter 4. The values are ios::beg (beginning of file), ios::cur (current position), and ios::end (end of file).

The following moves both get and put pointers to byte 373:

   file.seekg(373, ios::beg);
   file.seekp(373, ios::beg);

2.6 Special Characters in Files

As you create the file structures described in this text, you may encounter some difficulty with extra, unexpected characters that turn up in your files, with characters that disappear, and with numeric counts that are inserted into your files. Here are some examples of the kinds of things you might encounter:

■ On many computers you may find that a Control-Z (ASCII value of 26) is appended at the end of your files. Some applications use this to indicate end-of-file even if you have not placed it there. This is most likely to happen on MS-DOS systems.

■ Some systems adopt a convention of indicating end-of-line in a text file4 as a pair of characters consisting of a carriage return (CR: ASCII value of 13) and a line feed (LF: ASCII value of 10). Sometimes I/O procedures written for such systems automatically expand single CR characters or LF characters into CR-LF pairs. This unrequested addition of characters can cause a great deal of difficulty. Again, you are most likely to encounter this phenomenon on MS-DOS systems. Using flag "b" in a C file or mode ios::binary in a C++ stream will suppress these changes.

■ Users of larger systems, such as VMS, may find that they have just the opposite problem. Certain file formats under VMS remove carriage return characters from your file without asking you, replacing them with a count of the characters in what the system has perceived as a line of text.

4. When we use the term "text file" in this text, we are referring to a file consisting entirely of characters from a specific standard character set, such as ASCII or EBCDIC. Unless otherwise specified, the ASCII character set will be assumed. Appendix B contains a table that describes the ASCII character set.

These are just a few examples of the kinds of uninvited modifications that record management systems or I/O support packages might make to your files. You will find that they are usually associated with the concepts of a line of text or the end of a file. In general, these modifications to your files are an attempt to make your life easier by doing things for you automatically.
This might, in fact, work out for those who want to do nothing more than store some text in a file. Unfortunately, however, programmers building sophisticated file structures must sometimes spend a lot of time finding ways to disable this automatic assistance so they can have complete control over what they are building. Forewarned is forearmed: readers who encounter these kinds of difficulties as they build the file structures described in this text can take some comfort from the knowledge that the experience they gain in disabling automatic assistance will serve them well, over and over, in the future.

2.7 The Unix Directory Structure

No matter what computer system you have, even if it is a small PC, chances are there are hundreds or even thousands of files you have access to. To provide convenient access to such large numbers of files, your computer has some method for organizing its files. In Unix this is called the file system.

The Unix file system is a tree-structured organization of directories, with the root of the tree signified by the character /. All directories, including the root, can contain two kinds of files: regular files with programs and data, and directories (Fig. 2.4). Since devices such as tape drives are also treated like files in Unix, directories can also contain references to devices, as shown in the dev directory in Fig. 2.4. The file name stored in a Unix directory corresponds to what we call its physical name.

Figure 2.4 Sample Unix directory structure. (The tree is rooted at /, whose directories include usr6, containing mydir with the file addr; dev, containing the device files console, kbd, and TAPE; and usr, with branches such as lib and bin holding programs and libraries like cc, yacc, libc.a, and libm.a.)

Since every file in a Unix system is part of the file system that begins with the root, any file can be uniquely identified by giving its absolute pathname. For instance, the true, unambiguous name of the file "addr" in Fig. 2.4 is /usr6/mydir/addr. (Note that the / is used both to indicate the root directory and to separate directory names from the file name.)

When you issue commands to a Unix system, you do so within a directory, which is called your current directory. A pathname for a file that does not begin with a / describes the location of a file relative to the current directory. Hence, if your current directory in Fig. 2.4 is mydir, addr uniquely identifies the file /usr6/mydir/addr.

The special filename . stands for the current directory, and .. stands for the parent of the current directory. Hence, if your current directory is /usr6/mydir/DF, ../addr refers to the file /usr6/mydir/addr.

2.8 Physical Devices and Logical Files

2.8.1 Physical Devices as Files

One of the most powerful ideas in Unix is reflected in its notion of what a file is. In Unix, a file is a sequence of bytes without any implication of how or where the bytes are stored or where they originate. This simple conceptual view of a file makes it possible to do with very few operations what might require several times as many operations on a different operating system. For example, it is easy to think of a magnetic disk as the source of a file because we are used to the idea of storing such things on disks. But in Unix, devices like the keyboard and the console are also files—in Fig. 2.4, /dev/kbd and /dev/console, respectively. The keyboard produces a sequence of bytes that are sent to the computer when keys are pressed; the console accepts a sequence of bytes and displays their corresponding symbols on a screen.
How can we say that the Unix concept of a file is simple when it allows so many different physical things to be called files? Doesn't this make the situation more complicated, not less so? The trick in Unix is that no matter what physical representation a file may take, the logical view of a Unix file is the same. In its simplest form, a Unix file is represented logically by an integer—the file descriptor. This integer is an index to an array of more complete information about the file. A keyboard, a disk file, and a magnetic tape are all represented by integers. Once the integer that describes a file is identified, a program can access that file. If it knows the logical name of a file, a program can access that file without knowing whether the file comes from a disk, a tape, or a connection to another computer.

Although the above discussion is directed at Unix files, the same capability is available through the stdio functions fopen, fread, and so on. Similar capabilities are present in MS-DOS, Windows, and other operating systems.

2.8.2 The Console, the Keyboard, and Standard Error

We see an example of the duality between devices and files in the listc.cpp program in Fig. 2.2:

   file = fopen(filename, "r");              // Step 3
   while (fread(&ch, 1, 1, file) != 0)       // Step 4a
      fwrite(&ch, 1, 1, stdout);             // Step 4b

The logical file is represented by the value returned by the fopen call. We assign this value to the variable file in Step 3. In Step 4b, we use the value stdout, defined in stdio.h, to identify the console as the file to be written to.

There are two other files that correspond to specific physical devices in most implementations of C streams: the keyboard is called stdin (standard input), and the error file is called stderr (standard error). Hence, stdin is the keyboard on your terminal. The statement

   fread(&ch, 1, 1, stdin);

reads a single character from your terminal. Stderr is an error file which, like stdout, is usually just your console. When your compiler detects an error, it generally writes the error message to this file, which normally means that the error message turns up on your screen. As with stdout, the values stdin and stderr are usually defined in stdio.h.

Steps 1 and 2 of the file listing program also involve reading and writing from stdin or stdout. Since an enormous amount of I/O involves these devices, most programming languages have special functions to perform console input and output—in listc.cpp, the C functions printf and gets are used. Ultimately, however, printf and gets send their output and receive their input through stdout and stdin, respectively. But these statements hide important elements of the I/O process. For our purposes, the second set of read and write statements is more interesting and instructive.

2.8.3 I/O Redirection and Pipes

Suppose you would like to change the file listing program so it writes its output to a regular file rather than to stdout. Or suppose you wanted to use the output of the file listing program as input to another program. Because it is common to want to do both of these, operating systems provide convenient shortcuts for switching between standard I/O (stdin and stdout) and regular file I/O. These shortcuts are called I/O redirection and pipes. I/O redirection lets you specify at execution time alternate files for input or output.
The notations for input and output redirection on the command line in Unix are

   < file   (redirect stdin to "file")
   > file   (redirect stdout to "file")

5. Strictly speaking, I/O redirection and pipes are part of a Unix shell, which is the command interpreter that sits on top of the core Unix operating system, the kernel. For the purpose of this discussion, this distinction is not important.

For example, if the executable file listing program is called "list.exe," we redirect the output from stdout to a file called "myfile" by entering the line

   list.exe > myfile

What if, instead of storing the output from the list program in a file, you wanted to use it immediately in another program to sort the results? Pipes let you do this. The notation for a pipe in Unix and in MS-DOS is |. Hence,

   program1 | program2

means take any stdout output from program1 and use it in place of any stdin input to program2. Because Unix has a special program called sort, which takes its input from stdin, you can sort the output from the list program, without using an intermediate file, by entering

   list | sort

Since sort writes its output to stdout, the sorted listing appears on your terminal screen unless you use additional pipes or redirection to send it elsewhere.

2.9 File-Related Header Files

Unix, like all operating systems, has special names and values that you must use when performing file operations. For example, some C functions return a special value indicating end-of-file (EOF) when you try to read beyond the end of a file. Recall the flags that you use in an open call to indicate whether you want read-only, write-only, or read/write access. Unless we know just where to look, it is often not easy to find where these values are defined. Unix handles the problem by putting such definitions in special header files, which can be found in special directories such as /usr/include.

Header files relevant to the material in this chapter are stdio.h, iostream.h, fstream.h, fcntl.h, and file.h. The C streams are in stdio.h; C++ streams in iostream.h and fstream.h. Many Unix operations are in fcntl.h and file.h. EOF, for instance, is defined on many Unix and MS-DOS systems in stdio.h, as are the file pointers stdin, stdout, and stderr. And the flags O_RDONLY, O_WRONLY, and O_RDWR can usually be found in file.h or possibly in one of the files that it includes. It would be instructive for you to browse through these files as well as others that pique your curiosity.

2.10 Unix File System Commands

Unix provides many commands for manipulating files. We list a few that are relevant to the material in this chapter. Most of them have many options, but the simplest uses of most should be obvious. Consult a Unix manual for more information on how to use them.

   cat filenames         Print the contents of the named text files.
   tail filename         Print the last ten lines of the text file.
   cp file1 file2        Copy file1 to file2.
   mv file1 file2        Move (rename) file1 to file2.
   rm filenames          Remove (delete) the named files.
   chmod mode filename   Change the protection mode on the named files.
   ls                    List the contents of the directory.
   mkdir name            Create a directory with the given name.
   rmdir name            Remove the named directory.

SUMMARY

This chapter introduces the fundamental operations of file systems: Open, Create, Close, Read, Write, and Seek.
Each of these operations involves the creation or use of a link between a physical file stored on a secondary device and a logical file that represents a program’s more abstract view of the same file. When the program describes an operation using the logical file name, the equivalent physical operation gets performed on the corre- sponding physical file. The six operations appear in programming languages in many differ- ent forms. Sometimes they are built-in commands, sometimes they are functions, and sometimes they are direct calls to an operating system. Before we can use a physical file, we must link it to a logical file. In some programming environments, we do this with a statement36 Chapter 2 Fundamental File Processing Operations (select /assign in Cobol) or with instructions outside of the pro- gram (operating system shell scripts). In other languages, the link between the physical file and a logical file is made with open or create. The operations create and open make files ready for reading or writ- ing. Create causes a new physical file to be created. Open operates on an already existing physical file, usually setting the read/write pointer to the beginning of the file. The close operation breaks the link between a logical file and its corresponding physical file. It also makes sure that the file buffer is flushed so everything that was written is actually sent to the file. The I/O operations Read and Write, when viewed at a low systems level, require three items of information: m The logical name of the file to be read from or written to; An address of a memory area to be used for the “inside of the comput- er” part of the exchange; Mm An indication of how much data is to be read or written. These three fundamental elements of the exchange are illustrated in Fig. 2.5. Read and Write are sufficient for moving sequentially through a file to any desired position, but this form of access is often very ineffi- cient. Some languages provide seek operations that let a program move directly to a certain position in a file. C provides direct access by means of the fseek operation. The fseek operation lets us view a file as a kind of large array, giving us a great deal of freedom in deciding how to orga- nizea file. Another useful file operation involves knowing when the end of a file has been reached. End-of-file detection is handled in different ways by different languages. Much effort goes into shielding programmers from having to deal with the physical characteristics of files, but inevitably there are little details about the physical organization of files that programmers must know. When we try to have our program operate on files at a very low level eS Figure 2.5 The exchange between memory and external device.Key Terms 37 (as we do a great deal in this text), we must be on the lookout for little surprises inserted in our file by the operating system or applications. ‘The Unix file system, called the file system, organizes files in a tree structure, with all files and subdirectories expressible by their pathnames, It is possible to navigate around the file system as you work with Unix files. Unix views both physical devices and traditional disk files as files, so, for example, a keyboard (stdin), a console (stdout), and a tape drive are all considered files. This simple conceptual view of files makes it possi- ble in Unix to do with a very few operations what might require many times the operations on a different operating system. 
I/O redirection and pipes are convenient shortcuts provided in Unix for transferring file data between files and standard I/O. Header files in Unix, such as stdio.h, contain special names and values that you must use when performing file operations. It is important to be aware of the most common of these in use on your system. KEY TERMS Access mode. Type of file access allowed. The variety of access modes permitted varies from operating system to operating system. Buffering. When input or output is saved up rather than sent off to its destination immediately, we say that it is buffered. In later chapters, we find that we can dramatically improve the performance of programs that read and write data if we buffer the I/O. Byte offset. The distance, measured in bytes, from the beginning of the file. The first byte in the file has an offset of 0, the second byte has an offset of 1, and so on. Close. A function or system call that breaks the link between a logical file name and the corresponding physical file name. Create. A function or ‘system call that causes a file to be created on secondary storage and may also bind a logical name to the file’s phys- ical name—see Open. A call to create also results in the generation of information used by the system to manage the file, such as time of creation, physical location, and access privileges for anticipated users of the file. End-of-file (EOF). An indicator within a file that the end of the file has occurred, a function that tells if the end of a file has been encountered (end_of_file in Ada), or a system-specific value that is returned by38 Chapter 2 Fundamental File Processing Operations file-processing functions indicating that the end of a file has been encountered in the process of carrying out the function (EOF in Unix). File descriptor. A small, nonnegative integer value returned by a Unix open or creat call that is used as a logical name for the file in later Unix system calls. This value is an index into an array of FILE structs that contain information about open files. The C stream functions use FILE pointers for their file descriptors. File system. The name used in Unix and other operating systems to describe a collection of files and directories organized into a tree- structured hierarchy. Header file. A file that contains definitions and declarations commonly shared among many other files and applications. In C and C++, head- er files are included in other files by means of the “#include” statement (see Figs. 2.2 and 2.3). The header files iostream.h, stdio.h, file.h,and fcnt1.h described in this chapter contain important declarations and definitions used in file processing. W/O redirection. The redirection of a stream of input or output from its normal place. For instance, the operator > can be used to redirect toa file output that would normally be sent to the console. Logical file. The file as seen by the program. The use of logical files allows a program to describe operations to be performed on a file without knowing what physical file will be used. The program may then be used to process any one of a number of different files that share the same structure, Open. A function or system call that makes a file ready for use. It may also bind a logical file name to a physical file. Its arguments include the logical file name and the physical file name and may also include information on how the file is expected to be accessed. Pathname. A character string that describes the location of a file or direc- tory. 
If the pathname starts with a /, then it gives the absolute path- name—the complete path from the root directory to the file. Otherwise it gives the relative pathname—the path relative to the current working directory. Physical file. A file that actually exists on secondary storage. It is the file as. known by the computer operating system and that appears in its file directory. Pipe. A Unix operator specified by the symbol | that carries data from one process to another. The originating process specifies that the data is toFurther Readings 7 39 go to stdout, and the receiving process expects the data from stdin. For example, to send the standard output from a program makedata to the standard input of a program called usedata, use the command makedata | usedata. Protection mode. An indication of how a file can be accessed by various classes of users. In Unix, the protection mode is a three-digit octal number that indicates how the file can be read, written to, and execut- ed by the owner, by members of the owner's group, and by everyone else, Read. A function or system call used to obtain input from a file or device. When viewed at the lowest level, it requires three arguments: (1) a Source file logical name corresponding to an open file; (2) the Destination address for the bytes that are to be read; and (3) the Size or amount of data to be read. Seek. A function or system call that sets the read/write pointer to a speci- fied position in the file. Languages that provide seeking functions allow programs to access specific elements of a file directly, rather than having to read through a file from the beginning (sequentially) each time a specific item is desired. In C, the fseek function provides this capability. . Standard I/O. The source and destination conventionally used for input and output. In Unix, there are three types of standard I/O: standard input (stdin), standard output (stdout), and stderr (standard error). By default stdin is the keyboard, and stdout and stderr are the console screen. I/O redirection and pipes provide ways to over- ride these defaults. Write. A function or system call used to provide output capabilities. When viewed at the lowest level, it requires three arguments: (1) a Destination file name corresponding to an open file; (2) the Source address of the bytes that are to be written; and (3) the Size or amount of the data to be written. FURTHER READINGS Introductory textbooks on C and C++ tend to treat the fundamental file operations only briefly, if at all. This is particularly true with regard to C, since there are higher-levél standard I/O functions in C, such as the read operations fget and fgetc. Some books on C and/or UNIX that do40 Chapter 2 Fundamental File Processing Operations provide treatment of the fundamental file operations are Kernighan and Pike (1984) and Kernighan and Ritchie (1988). These books also provide discussions of higher-level I/O functions that we omitted from our text. An excellent explanation of the input and output classes of C++ is found in Plaugher (1995), which discusses the current (C++ version 2) and proposed draft standard for C++ input and output. As for Unix specifically, as of this writing there are many flavors of Unix including Unix System V from AT&T, the originators of Unix, BSD (Berkeley Software Distribution) Unix from the University of California at Berkeley, and Linux from the Free Software Foundation. Each manufac- turer of Unix workstations has its own operating system. 
There are efforts to standardize on either Posix, the international standard (ISO) Unix, or OSF, the operating system of the Open Software Foundation. All of the versions are close enough that learning about any one will give you a good understanding of Unix generally. However, as you begin to use Unix, you will need reference material on the specific version that you are using. There are many accessible texts, including Sobell (1995), which covers a variety of Unix versions, including Posix, McKusick et al. (1996) on BSD, and Hekman (1997) on Linux.

EXERCISES

1. Look up operations equivalent to Open, Close, Create, Read, Write, and Seek in other high-level languages, such as Ada, Cobol, and Fortran. Compare them with the C streams or C++ stream classes.

2. For the C++ language:
   a. Make a list of the different ways to perform the file operations Create, Open, Close, Read, and Write. Why is there more than one way to do each operation?
   b. How would you use fseek to find the current position in a file?
   c. Show how to change the permissions on a file myfile so the owner has read and write permissions, group members have execute permission, and others have no permission.
   d. What is the difference between pmode and O_RDWR? What pmodes and O_RDWR are available on your system?
   e. In some typical C++ environments, such as Unix and MS-DOS, all of the following represent ways to move data from one place to another:

      scanf         fgetc         read          cat (or type)
      fscanf        gets          <             main (argc, argv)
      getc          fgets         |

      Describe as many of these as you can, and indicate how they might be useful. Which belong to the C++ language, and which belong to the operating system?

3. A couple of years ago a company we know of bought a new Cobol compiler. One difference between the new compiler and the old one was that the new compiler did not automatically close files when execution of a program terminated, whereas the old compiler did. What sorts of problems did this cause when some of the old software was executed after having been recompiled with the new compiler?

4. Consult a C++ reference and describe the values of the io_state of the stream classes in C++. Describe the characteristics of a stream when each of the state bits is set.

5. Design an experiment that uses methods seekg, seekp, tellg, and tellp to determine whether an implementation of C++ supports separate get and put pointers for the stream classes.

6. Look up the Unix command wc. Execute the following in a Unix environment, and explain why it gives the number of files in the directory.

      ls | wc -w

7. Find stdio.h on your system, and find what value is used to indicate end-of-file. Also examine file.h or fcntl.h, and describe in general what its contents are for.

PROGRAMMING EXERCISES

8. Make the listcpp.cpp program of Appendix D work with your compiler on your operating system.

9. Write a program to create a file and store a string in it. Write another program to open the file and read the string.

10. Implement the Unix command tail -n, where n is the number of lines from the end of the file to be copied to stdout.

11. Change the program listcpp.cpp so it reads from cin, rather than a file, and writes to a file, rather than cout. Show how to execute the new version of the program in your programming environment, given that the input is actually in a file called instuff.
12. Write a program to read a series of names, one per line, from standard input, and write out those names spelled in reverse order to standard output. Use I/O redirection and pipes to do the following:
    a. Input a series of names that are typed in from the keyboard, and write them out, reversed, to a file called file1.
    b. Read the names in from file1; then write them out, re-reversed, to a file called file2.
    c. Read the names in from file2, reverse them again, and then sort the resulting list of reversed words using sort.

13. Write a program to read and write objects of class String. Include code that uses the assignment operator and the various constructors for the class. Use a debugger to determine exactly which methods are called for each statement in your program.

PROGRAMMING PROJECT

This is the second part of the programming project begun in Chapter 1. We add methods to read objects from standard input and to write formatted objects to an output stream for the classes of Chapter 1.

14. Add methods to class Student to read student field values from an input stream and to write the fields of an object to an output stream, nicely formatted. You may also want to be able to prompt a user to enter the field values. Use the C++ stream operations to implement these methods. Write a driver program to verify that the class is correctly implemented.

15. Add methods to class CourseRegistration to read course registration field values from an input stream and to write the fields of an object to an output stream, nicely formatted. You may also want to be able to prompt a user to enter the field values. Use the C++ stream operations to implement these methods. Write a driver program to verify that the class is correctly implemented.

The next part of the programming project is in Chapter 4.

CHAPTER
Secondary Storage and System Software

CHAPTER OBJECTIVES

Describe the organization of typical disk drives, including basic units of organization and their relationships.
Identify and describe the factors affecting disk access time, and describe methods for estimating access times and space requirements.
Describe magnetic tapes, give an example of current high-performance tape systems, and investigate the implications of block size on space requirements and transmission speeds.
Introduce the commercially important characteristics of CD-ROM storage.
Examine the performance characteristics of CD-ROM, and see that they are very different from those of magnetic disks.
Describe the directory structure of the CD-ROM file system, and show how it grows from the characteristics of the medium.
Identify fundamental differences between media and criteria that can be used to match the right medium to an application.
Describe in general terms the events that occur when data is transmitted between a program and a secondary storage device.
Introduce concepts and techniques of buffer management.
Illustrate many of the concepts introduced in the chapter, especially system software concepts, in the context of Unix.
CHAPTER OUTLINE

3.1 Disks
    3.1.1 The Organization of Disks
    3.1.2 Estimating Capacities and Space Needs
    3.1.3 Organizing Tracks by Sector
    3.1.4 Organizing Tracks by Block
    3.1.5 Nondata Overhead
    3.1.6 The Cost of a Disk Access
    3.1.7 Effect of Block Size on Performance: A Unix Example
    3.1.8 Disk as Bottleneck
3.2 Magnetic Tape
    3.2.1 Types of Tape Systems
    3.2.2 An Example of a High-Performance Tape System
    3.2.3 Organization of Data on Nine-Track Tapes
    3.2.4 Estimating Tape Length Requirements
    3.2.5 Estimating Data Transmission Times
3.3 Disk versus Tape
3.4 Introduction to CD-ROM
    3.4.1 A Short History of CD-ROM
    3.4.2 CD-ROM as a File Structure Problem
3.5 Physical Organization of CD-ROM
    3.5.1 Reading Pits and Lands
    3.5.2 CLV Instead of CAV
    3.5.3 Addressing
    3.5.4 Structure of a Sector
3.6 CD-ROM Strengths and Weaknesses
    3.6.1 Seek Performance
    3.6.2 Data Transfer Rate
    3.6.3 Storage Capacity
    3.6.4 Read-Only Access
    3.6.5 Asymmetric Writing and Reading
3.7 Storage as a Hierarchy
3.8 A Journey of a Byte
    3.8.1 The File Manager
    3.8.2 The I/O Buffer
    3.8.3 The Byte Leaves Memory: The I/O Processor and Disk Controller
3.9 Buffer Management
    3.9.1 Buffer Bottlenecks
    3.9.2 Buffering Strategies
3.10 I/O in Unix
    3.10.1 The Kernel
    3.10.2 Linking File Names to Files
    3.10.3 Normal Files, Special Files, and Sockets
    3.10.4 Block I/O
    3.10.5 Device Drivers
    3.10.6 The Kernel and File Systems
    3.10.7 Magnetic Tape and Unix

Good design is always responsive to the constraints of the medium and to the environment. This is as true for file structure design as it is for carvings in wood and stone.

Given the ability to create, open, and close files, and to seek, read, and write, we can perform the fundamental operations of file construction. Now we need to look at the nature and limitations of the devices and systems used to store and retrieve files in order to prepare ourselves for file design.

If files were stored just in memory, there would be no separate discipline called file structures. The general study of data structures would give us all the tools we need to build file applications. But secondary storage devices are very different from memory. One difference, as already noted, is that accesses to secondary storage take much more time than do accesses to memory. An even more important difference, measured in terms of design impact, is that not all accesses are equal. Good file structure design uses knowledge of disk and tape performance to arrange data in ways that minimize access costs.

In this chapter we examine the characteristics of secondary storage devices. We focus on the constraints that shape our design work in the chapters that follow. We begin with a look at the major media used in the storage and processing of files: magnetic disks and tapes. We follow this with an overview of the range of other devices and media used for secondary storage. Next, by following the journey of a byte, we take a brief look at the many pieces of hardware and software that become involved when a byte is sent by a program to a file on a disk. Finally, we take a closer look at one of the most important aspects of file management: buffering.

3.1 Disks

Compared with the time it takes to access an item in memory, disk accesses are always expensive. However, not all disk accesses are equally expensive. This has to do with the way a disk drive works.
Disk drives¹ belong to a class of devices known as direct access storage devices (DASDs) because they make it possible to access data directly. DASDs are contrasted with serial devices, the other major class of secondary storage devices. Serial devices use media such as magnetic tape that permit only serial access, which means that a particular data item cannot be read or written until all of the data preceding it on the tape have been read or written in order.

1. When we use the terms disks or disk drives, we are referring to magnetic disk media.

Magnetic disks come in many forms. So-called hard disks offer high capacity and low cost per bit. Hard disks are the most common disk used in everyday file processing. Floppy disks are inexpensive, but they are slow and hold relatively little data. Floppies are good for backing up individual files or other floppies and for transporting small amounts of data. Removable disks use disk cartridges that can be mounted on the same drive at different times, providing a convenient form of backup storage that also makes it possible to access data directly. The Iomega Zip (100 megabytes per cartridge) and Jaz (1 gigabyte per cartridge) have become very popular among PC users.

Nonmagnetic disk media, especially optical discs, are becoming increasingly important for secondary storage. (See Sections 3.4 and 3.5 and Appendix A for a full treatment of optical disc storage and its applications.)

3.1.1 The Organization of Disks

The information stored on a disk is stored on the surface of one or more platters (Fig. 3.1). The arrangement is such that the information is stored in successive tracks on the surface of the disk (Fig. 3.2). Each track is often divided into a number of sectors. A sector is the smallest addressable portion of a disk. When a read statement calls for a particular byte from a disk file, the computer operating system finds the correct surface, track, and sector, reads the entire sector into a special area in memory called a buffer, and then finds the requested byte within that buffer.

Figure 3.1 Schematic illustration of disk drive (platters on a spindle, with read/write heads mounted on a boom).

Figure 3.2 Surface of disk showing tracks and sectors.

Disk drives typically have a number of platters. The tracks that are directly above and below one another form a cylinder (Fig. 3.3). The significance of the cylinder is that all of the information on a single cylinder can be accessed without moving the arm that holds the read/write heads. Moving this arm is called seeking. This arm movement is usually the slowest part of reading information from a disk.

Figure 3.3 Schematic illustration of disk drive viewed as a set of seven cylinders.

3.1.2 Estimating Capacities and Space Needs

Disks range in storage capacity from hundreds of millions to billions of bytes. In a typical disk, each platter has two surfaces, so the number of tracks per cylinder is twice the number of platters. The number of cylinders is the same as the number of tracks on a single surface, and each track has the same capacity. Hence the capacity of the disk is a function of the number of cylinders, the number of tracks per cylinder, and the capacity of a track.

The amount of data that can be held on a track and the number of tracks on a surface depend on how densely bits can be stored on the disk surface. (This in turn depends on the quality of the recording medium and the size of the read/write heads.)
In 1991, an inexpensive, low-density disk held about 4 kilobytes on a track and 35 tracks on a 5¼-inch platter. In 1997, a Western Digital Caviar 850-megabyte disk, one of the smallest disks being manufactured, holds 32 kilobytes per track and 1654 tracks on each surface of a 3½-inch platter. A Seagate Cheetah high-performance 9-gigabyte disk (still 3½-inch platters) can hold about 87 kilobytes on a track and 6526 tracks on a surface. Table 3.1 shows how a variety of disk drives compares in terms of capacity, performance, and cost.

Since a cylinder consists of a group of tracks, a track consists of a group of sectors, and a sector consists of a group of bytes, it is easy to compute track, cylinder, and drive capacities:

    Track capacity    = number of sectors per track × bytes per sector
    Cylinder capacity = number of tracks per cylinder × track capacity
    Drive capacity    = number of cylinders × cylinder capacity

Table 3.1 Specifications of the Disk Drives

                                  Seagate            Western Digital     Western Digital
Characteristic                    Cheetah 9          Caviar AC22100      Caviar AC2850
Capacity                          9000 MB            2100 MB             850 MB
Minimum (track-to-track)
  seek time                       0.78 msec          1 msec              1 msec
Average seek time                 8 msec             12 msec             10 msec
Maximum seek time                 19 msec            22 msec             22 msec
Spindle speed                     10 000 rpm         5200 rpm            4500 rpm
Average rotational delay          3 msec             6 msec              6.6 msec
Maximum transfer rate             6 msec/track, or   12 msec/track, or   13.3 msec/track, or
                                  14 506 bytes/msec  2796 bytes/msec     2419 bytes/msec
Bytes per sector                  512                512                 512
Sectors per track                 170                63                  63
Tracks per cylinder               16                 16                  16
Cylinders                         6526               4092                1654

If we know the number of bytes in a file, we can use these relationships to compute the amount of disk space the file is likely to require. Suppose, for instance, that we want to store a file with fifty thousand fixed-length data records on a "typical" 2.1-gigabyte small computer disk with the following characteristics:

    Number of bytes per sector    = 512
    Number of sectors per track   = 63
    Number of tracks per cylinder = 16
    Number of cylinders           = 4092

How many cylinders does the file require if each data record requires 256 bytes? Since each sector can hold two records, the file requires

    50 000 / 2 = 25 000 sectors

One cylinder can hold

    63 × 16 = 1008 sectors

so the number of cylinders required is approximately

    25 000 / 1008 = 24.8 cylinders

Of course, it may be that a disk drive with 24.8 cylinders of available space does not have 24.8 physically contiguous cylinders available. In this likely case, the file might, in fact, have to be spread out over dozens, perhaps even hundreds, of cylinders.
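These computations are simple enough to script. The short program below (our illustration, not code from the book's class libraries; the file name and variable names are ours, and the drive parameters are the ones just given) reproduces the cylinder estimate:

// cylinders.cpp -- reproduce the cylinder estimate above.
#include <iostream>

int main() {
    const double bytesPerSector    = 512;
    const double sectorsPerTrack   = 63;
    const double tracksPerCylinder = 16;
    const double recordLength      = 256;     // fixed-length records
    const double recordCount       = 50000;

    double recordsPerSector   = bytesPerSector / recordLength;        // 2
    double sectorsNeeded      = recordCount / recordsPerSector;       // 25 000
    double sectorsPerCylinder = sectorsPerTrack * tracksPerCylinder;  // 1008

    std::cout << "Sectors needed:   " << sectorsNeeded << "\n"
              << "Cylinders needed: " << sectorsNeeded / sectorsPerCylinder
              << "\n";                                                // about 24.8
    return 0;
}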
3.1.3 Organizing Tracks by Sector

There are two basic ways to organize data on a disk: by sector and by user-defined block. So far, we have mentioned only sector organizations. In this section we examine sector organizations more closely. In the following section we will look at block organizations.

The Physical Placement of Sectors

There are several views that one can have of the organization of sectors on a track. The simplest view, one that suffices for most users most of the time, is that sectors are adjacent, fixed-sized segments of a track that happen to hold a file (Fig. 3.4a). This is often a perfectly adequate way to view a file logically, but it may not be a good way to store sectors physically.

Figure 3.4 Two views of the organization of sectors on a thirty-two-sector track.

When you want to read a series of sectors that are all in the same track, one right after the other, you often cannot read adjacent sectors. After reading the data, it takes the disk controller a certain amount of time to process the received information before it is ready to accept more. If logically adjacent sectors were placed on the disk so they were also physically adjacent, we would miss the start of the following sector while we were processing the one we had just read in. Consequently, we would be able to read only one sector per revolution of the disk.

I/O system designers have approached this problem by interleaving the sectors: they leave an interval of several physical sectors between logically adjacent sectors. Suppose our disk had an interleaving factor of 5. The assignment of logical sector content to the thirty-two physical sectors in a track is illustrated in Fig. 3.4(b). If you study this figure, you can see that it takes five revolutions to read the entire thirty-two sectors of a track. That is a big improvement over thirty-two revolutions.

In the early 1990s, controller speeds improved so that disks can now offer 1:1 interleaving. This means that successive sectors are physically adjacent, making it possible to read an entire track in a single revolution of the disk.

Clusters

Another view of sector organization, also designed to improve performance, is the view maintained by the part of a computer's operating system that we call the file manager. When a program accesses a file, it is the file manager's job to map the logical parts of the file to their corresponding physical locations. It does this by viewing the file as a series of clusters of sectors. A cluster is a fixed number of contiguous sectors.² Once a given cluster has been found on a disk, all sectors in that cluster can be accessed without requiring an additional seek.

2. It is not always physically contiguous; the degree of physical contiguity is determined by the interleaving factor.

To view a file as a series of clusters and still maintain the sectored view, the file manager ties logical sectors to the physical clusters they belong to by using a file allocation table (FAT). The FAT contains a list of all the clusters in a file, ordered according to the logical order of the sectors they contain. With each cluster entry in the FAT is an entry giving the physical location of the cluster (Fig. 3.5).

Figure 3.5 The file manager determines which cluster in the file has the sector that is to be accessed.

On many systems, the system administrator can decide how many sectors there should be in a cluster. For instance, in the standard physical disk structure used by VAX systems, the system administrator sets the cluster size to be used on a disk when the disk is initialized. The default value is three 512-byte sectors per cluster, but the cluster size may be set to any value between 1 and 65 535 sectors. Since clusters represent physically contiguous groups of sectors, larger clusters will read more sectors without seeking, so the use of large clusters can lead to substantial performance gains when a file is processed sequentially.
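To make the FAT mapping concrete, here is a toy version of the lookup (ours; the structure names, physical locations, and the three-sector cluster size are illustrative assumptions, not an actual file manager's layout):

// fat.cpp -- a toy model of mapping a logical sector to its cluster.
#include <iostream>
#include <vector>

struct FatEntry {
    int clusterNumber;    // logical cluster within the file: 1, 2, 3, ...
    int clusterLocation;  // physical location of that cluster on the disk
};

int main() {
    const int sectorsPerCluster = 3;

    // The part of the FAT pertaining to one file, in logical order.
    std::vector<FatEntry> fat = { {1, 417}, {2, 82}, {3, 290} };

    int logicalSector   = 7;                                    // zero-based
    int logicalCluster  = logicalSector / sectorsPerCluster;    // 2 (third cluster)
    int sectorInCluster = logicalSector % sectorsPerCluster;    // 1

    std::cout << "Logical sector " << logicalSector
              << " is sector " << sectorInCluster
              << " of the cluster stored at physical location "
              << fat[logicalCluster].clusterLocation << "\n";
    return 0;
}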
Extents

Our final view of sector organization represents a further attempt to emphasize physical contiguity of sectors in a file and to minimize seeking even more. (If you are getting the idea that the avoidance of seeking is an important part of file design, you are right.) If there is a lot of free room on a disk, it may be possible to make a file consist entirely of contiguous clusters. When this is the case, we say that the file consists of one extent: all of its sectors, tracks, and (if it is large enough) cylinders form one contiguous whole (Fig. 3.6a). This is a good situation, especially if the file is to be processed sequentially, because it means that the whole file can be accessed with a minimum amount of seeking.

If there is not enough contiguous space available to contain an entire file, the file is divided into two or more noncontiguous parts. Each part is an extent. When new clusters are added to a file, the file manager tries to make them physically contiguous to the previous end of the file, but if space is unavailable, it must add one or more extents (Fig. 3.6b). The most important thing to understand about extents is that as the number of extents in a file increases, the file becomes more spread out on the disk, and the amount of seeking required to process the file increases.

Figure 3.6 File extents (shaded area represents space on disk used by a single file).

Fragmentation

Generally, all sectors on a given drive must contain the same number of bytes. If, for example, the size of a sector is 512 bytes and the size of all records in a file is 300 bytes, there is no convenient fit between records and sectors. There are two ways to deal with this situation: store only one record per sector, or allow records to span sectors so the beginning of a record might be found in one sector and the end of it in another (Fig. 3.7).

Figure 3.7 Alternate record organization within sectors (shaded areas represent data records, and unshaded areas represent unused space).

The first option has the advantage that any record can be retrieved by retrieving just one sector, but it has the disadvantage that it might leave an enormous amount of unused space within each sector. This loss of space within a sector is called internal fragmentation. The second option has the advantage that it loses no space from internal fragmentation, but it has the disadvantage that some records may be retrieved only by accessing two sectors.

Another potential source of internal fragmentation results from the use of clusters. Recall that a cluster is the smallest unit of space that can be allocated for a file. When the number of bytes in a file is not an exact multiple of the cluster size, there will be internal fragmentation in the last extent of the file. For instance, if a cluster consists of three 512-byte sectors, a file containing 1 byte would use up 1536 bytes on the disk; 1535 bytes would be wasted due to internal fragmentation.

Clearly, there are important trade-offs in the use of large cluster sizes. A disk expected to have mainly large files that will often be processed sequentially would usually be given a large cluster size, since internal fragmentation would not be a big problem and the performance gains might be great. A disk holding smaller files or files that are usually accessed only randomly would normally be set up with small clusters.
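The waste is easy to quantify. A short check (ours, using the three-sector cluster from the example; the file sizes tested are arbitrary):

// fragmentation.cpp -- internal fragmentation in the last cluster.
#include <iostream>

int main() {
    const long sectorSize        = 512;
    const long sectorsPerCluster = 3;
    const long clusterSize       = sectorSize * sectorsPerCluster;  // 1536

    for (long fileSize : {1L, 1536L, 1537L}) {
        long clusters = (fileSize + clusterSize - 1) / clusterSize; // round up
        long onDisk   = clusters * clusterSize;
        std::cout << fileSize << "-byte file occupies " << onDisk
                  << " bytes; " << onDisk - fileSize << " bytes wasted\n";
    }
    return 0;
}

A 1-byte file wastes 1535 bytes, as in the text; a 1537-byte file wastes 1535 bytes again, because the second cluster is almost entirely empty.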
3.1.4 Organizing Tracks by Block

Sometimes disk tracks are not divided into sectors, but into integral numbers of user-defined blocks whose sizes can vary. (Note: The word block has a different meaning in the context of the Unix I/O system. See Section 3.10 for details.) When the data on a track is organized by block, this usually means that the amount of data transferred in a single I/O operation can vary depending on the needs of the software designer, not the hardware. Blocks can normally be either fixed or variable in length, depending on the requirements of the file designer and the capabilities of the operating system. As with sectors, blocks are often referred to as physical records. In this context, the physical record is the smallest unit of data that the operating system supports on a particular drive. (Sometimes the word block is used as a synonym for a sector or group of sectors. To avoid confusion, we do not use it in that way here.) Figure 3.8 illustrates the difference between one view of data on a sectored track and that on a blocked track.

Figure 3.8 Sector organization versus block organization.

A block organization does not present the sector-spanning and fragmentation problems of sectors because blocks can vary in size to fit the logical organization of the data. A block is usually organized to hold an integral number of logical records. The term blocking factor is used to indicate the number of records that are to be stored in each block in a file. Hence, if we had a file with 300-byte records, a block-addressing scheme would let us define a block to be some convenient multiple of 300 bytes, depending on the needs of the program. No space would be lost to internal fragmentation, and there would be no need to load two blocks to retrieve one record.

Generally speaking, blocks are superior to sectors when it is desirable to have the physical allocation of space for records correspond to their logical organization. (There are disk drives that allow both sector addressing and block addressing, but we do not describe them here. See Bohl, 1981.)

In block-addressing schemes, each block of data is usually accompanied by one or more subblocks containing extra information about the data block. Typically there is a count subblock that contains (among other things) the number of bytes in the accompanying data block (Fig. 3.9a). There may also be a key subblock containing the key for the last record in the data block (Fig. 3.9b). When key subblocks are used, the disk controller can search a track for a block or record identified by a given key. This means that a program can ask its disk drive to search among all the blocks on a track for a block with a desired key. This approach can result in much more efficient searches than are normally possible with sector-addressable schemes, in which keys generally cannot be interpreted without first loading them into primary memory.

Figure 3.9 Block addressing requires that each physical data block be accompanied by one or more subblocks containing information about its contents.

3.1.5 Nondata Overhead

Both blocks and sectors require that a certain amount of space be taken up on the disk in the form of nondata overhead. Some of the overhead consists of information that is stored on the disk during preformatting, which is done before the disk can be used.

On sector-addressable disks, preformatting involves storing, at the beginning of each sector, information such as sector address, track address, and condition (whether the sector is usable or defective).
Preformatting also involves placing gaps and synchronization marks between fields of information to help the read/write mechanism distinguish between them. This nondata overhead usually is of no concern to the programmer. When the sector size is given for a certain drive, the programmer can assume that this is the amount of actual data that can be stored in a sector.

On a block-organized disk, some of the nondata overhead is invisible to the programmer, but some of it must be accounted for. Since subblocks and interblock gaps have to be provided with every block, there is generally more nondata information provided with blocks than with sectors. Also, since the number and size of blocks can vary from one application to another, the relative amount of space taken up by overhead can vary when block addressing is used. This is illustrated in the following example.

Suppose we have a block-addressable disk drive with 20 000 bytes per track and the amount of space taken up by subblocks and interblock gaps is equivalent to 300 bytes per block. We want to store a file containing 100-byte records on the disk. How many records can be stored per track if the blocking factor is 10? If it is 60?

1. If there are ten 100-byte records per block, each block holds 1000 bytes of data and uses 300 + 1000, or 1300, bytes of track space when overhead is taken into account. The number of blocks that can fit on a 20 000-byte track can be expressed as

       ⌊20 000 / 1300⌋ = ⌊15.38⌋ = 15

   So fifteen blocks, or 150 records, can be stored per track. (Note that we have to take the floor of the result because a block cannot span two tracks.)

2. If there are sixty 100-byte records per block, each block holds 6000 bytes of data and uses 6300 bytes of track space. The number of blocks per track can be expressed as

       ⌊20 000 / 6300⌋ = 3

   So three blocks, or 180 records, can be stored per track.

Clearly, the larger blocking factor can lead to more efficient use of storage. When blocks are larger, fewer blocks are required to hold a file, so there is less space consumed by the 300 bytes of overhead that accompany each block.

Can we conclude from this example that larger blocking factors always lead to more efficient storage? Not necessarily. Since we can put only an integral number of blocks on a track and since tracks are fixed in length, we almost always lose some space at the end of a track. Here we have the internal fragmentation problem again, but this time it applies to fragmentation within a track. The greater the block size, the greater potential amount of internal track fragmentation. What would have happened if we had chosen a blocking factor of 98 in the preceding example? What about 97? (The sketch that follows lets you check.)

The flexibility introduced by the use of blocks, rather than sectors, can save time, since it lets the programmer determine to a large extent how data is to be organized physically on a disk. On the negative side, blocking schemes require the programmer and/or operating system to do the extra work of determining the data organization. Also, the very flexibility introduced by the use of blocking schemes precludes the synchronization of I/O operations with the physical movement of the disk, which sectoring permits. This means that strategies such as sector interleaving cannot be used to improve performance.
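Here is that check (our sketch; it simply folds the arithmetic above into a loop over blocking factors):

// blocking.cpp -- records per track for several blocking factors.
#include <iostream>

int main() {
    const int trackSize  = 20000;  // bytes of usable track space
    const int overhead   = 300;    // subblocks + interblock gap, per block
    const int recordSize = 100;

    for (int blockingFactor : {10, 60, 97, 98}) {
        int blockCost      = blockingFactor * recordSize + overhead;
        int blocksPerTrack = trackSize / blockCost;   // integer division: floor
        std::cout << "Blocking factor " << blockingFactor << ": "
                  << blocksPerTrack << " blocks, "
                  << blocksPerTrack * blockingFactor << " records per track\n";
    }
    return 0;
}

Running it shows the trap in the question above: a blocking factor of 97 packs two blocks (194 records) onto a track, while 98 leaves room for only one block (98 records); internal track fragmentation eats the difference.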
3.1.6 The Cost of a Disk Access

To give you a feel for the factors contributing to the total amount of time needed to access a file on a fixed disk, we calculate some access times. A disk access can be divided into three distinct physical operations, each with its own cost: seek time, rotational delay, and transfer time.

Seek Time

Seek time is the time required to move the access arm to the correct cylinder. The amount of time spent seeking during a disk access depends, of course, on how far the arm has to move. If we are accessing a file sequentially and the file is packed into several consecutive cylinders, seeking needs to be done only after all the tracks on a cylinder have been processed, and then the read/write head needs to move the width of only one track. At the other extreme, if we are alternately accessing sectors from two files that are stored at opposite extremes on a disk (one at the innermost cylinder, one at the outermost cylinder), seeking is very expensive.

Seeking is likely to be more costly in a multiuser environment, where several processes are contending for use of the disk at one time, than in a single-user environment, where disk usage is dedicated to one process. Since seeking can be very costly, system designers often go to great extremes to minimize seeking. In an application that merges three files, for example, it is not unusual to see the three input files stored on three different drives and the output file stored on a fourth drive, so no seeking need be done as I/O operations jump from file to file.

Since it is usually impossible to know exactly how many tracks will be traversed in every seek, we usually try to determine the average seek time required for a particular file operation. If the starting and ending positions for each access are random, it turns out that the average seek traverses one-third of the total number of cylinders that the read/write head ranges over.³ Manufacturers' specifications for disk drives often list this figure as the average seek time for the drive. Most hard disks available today have average seek times of less than 10 milliseconds (msec), and high-performance disks have average seek times as low as 7.5 msec.

3. Derivations of this result, as well as more detailed and refined models, can be found in Wiederhold (1983), Knuth (1998), Teory and Fry (1982), and Salzberg (1988).

Rotational Delay

Rotational delay refers to the time it takes for the disk to rotate so the sector we want is under the read/write head. Hard disks usually rotate at about 5000 rpm, which is one revolution per 12 msec. On average, the rotational delay is half a revolution, or about 6 msec. On floppy disks, which often rotate at only 360 rpm, average rotational delay is a sluggish 83.3 msec.
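The rpm-to-delay conversion is a one-liner worth having around. A quick sketch (ours), checked against the figures just quoted and against Table 3.1:

// rotation.cpp -- average rotational delay from spindle speed.
#include <iostream>

int main() {
    for (double rpm : {10000.0, 5000.0, 360.0}) {
        double revolutionMsec = 60.0 * 1000.0 / rpm;
        std::cout << rpm << " rpm: one revolution every " << revolutionMsec
                  << " msec, average delay " << revolutionMsec / 2.0
                  << " msec\n";
    }
    return 0;
}

The 10 000 rpm line reproduces the Cheetah's 3-msec average rotational delay from Table 3.1.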
As in the case of seeking, these averages apply only when the read/write head moves from some random place on the disk surface to the target track. In many circumstances, rotational delay can be much less than the average. For example, suppose that you have a file that requires two or more tracks, that there are plenty of available tracks on one cylinder, and that you write the file to disk sequentially, with one write call. When the first track is filled, the disk can immediately begin writing to the second track, without any rotational delay. The "beginning" of the second track is effectively staggered by just the amount of time it takes to switch from the read/write head on the first track to the read/write head on the second. Rotational delay, as it were, is virtually nonexistent. Furthermore, when you read the file back, the position of data on the second track ensures that there is no rotational delay in switching from one track to another. Figure 3.10 illustrates this staggered arrangement.

Figure 3.10 When a single file can span several tracks on a cylinder, we can stagger the beginnings of the tracks to avoid rotational delay when moving from track to track during sequential access.

Transfer Time

Once the data we want is under the read/write head, it can be transferred. The transfer time is given by the formula

    Transfer time = (number of bytes transferred / number of bytes on a track) × rotation time

If a drive is sectored, the transfer time for one sector depends on the number of sectors on a track. For example, if there are sixty-three sectors per track, the time required to transfer one sector would be 1/63 of a revolution, or 0.19 msec. The Seagate Cheetah rotates at 10 000 rpm. The transfer time for a single sector (170 sectors per track) is 0.036 msec. This results in a peak transfer rate of more than 14 megabytes per second.
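The formula translates directly into code. A sketch (ours, using the Cheetah's geometry from Table 3.1):

// transfer.cpp -- sector transfer time and peak rate for the Cheetah.
#include <iostream>

int main() {
    const double sectorsPerTrack = 170;
    const double bytesPerSector  = 512;
    const double rpm             = 10000;

    double bytesPerTrack = sectorsPerTrack * bytesPerSector;  // 87 040
    double rotationMsec  = 60.0 * 1000.0 / rpm;               // 6 msec

    std::cout << "One sector: "
              << (bytesPerSector / bytesPerTrack) * rotationMsec
              << " msec\n"                                    // about 0.035
              << "Peak rate:  " << bytesPerTrack / rotationMsec
              << " bytes/msec\n";                             // about 14 507
    return 0;
}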
Some Timing Computations

Let's look at two different file processing situations that show how different types of file access can affect access times. We will compare the time it takes to access a file in sequence with the time it takes to access all of the records in the file randomly. In the former case, we use as much of the file as we can whenever we access it. In the random-access case, we are able to use only one record on each access.

The basis for our calculations is the high-performance Seagate Cheetah 9-gigabyte fixed disk described in Table 3.1. Although it is typical only of a certain class of fixed disk, the observations we draw as we perform these calculations are quite general. The disks used with personal computers are smaller and slower than this disk, but the nature and relative costs of the factors contributing to total access times are essentially the same.

The highest performance for data transfer is achieved when files are in one-track units. Sectors are interleaved with an interleave factor of 1, so data on a given track can be transferred at the stated transfer rate.

Let's suppose that we wish to know how long it will take, using this drive, to read an 8 704 000-byte file that is divided into thirty-four thousand 256-byte records. First we need to know how the file is distributed on the disk. Since each 4096-byte cluster holds sixteen records, the file will be stored as a sequence of 2125 4096-byte clusters occupying one hundred tracks. This means that the disk needs one hundred tracks to hold the entire 8704 kilobytes that we want to read. We assume a situation in which the one hundred tracks are randomly dispersed over the surface of the disk. (This is an extreme situation chosen to dramatize the point we want to make. Still, it is not so extreme that it could not easily occur on a typical overloaded disk that has a large number of small files.)

Now we are ready to calculate the time it would take to read the 8704-kilobyte file from the disk. We first estimate the time it takes to read the file sector by sector in sequence. This process involves the following operations for each track:

    Average seek       8 msec
    Rotational delay   3 msec
    Read one track     6 msec
    Total             17 msec

We want to find and read one hundred tracks, so

    Total time = 100 × 17 msec = 1700 msec = 1.7 seconds

Now let's calculate the time it would take to read in the same thirty-four thousand records using random access instead of sequential access. In other words, rather than being able to read one sector right after another, we assume that we have to access the records in an order that requires jumping from track to track every time we read a new sector. This process involves the following operations for each record:

    Average seek                               8 msec
    Rotational delay                           3 msec
    Read one cluster (1/21.25 × 6 msec)     0.28 msec
    Total                                  11.28 msec

    Total time = 34 000 × 11.28 msec = 383 520 msec = 383.5 seconds

This difference in performance between sequential access and random access is very important. If we can get to the right location on the disk and read a lot of information sequentially, we are clearly much better off than if we have to jump around, seeking every time we need a new record. Remember that seek time is very expensive; when we are performing disk operations, we should try to minimize seeking.
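The two access patterns fold into one small computation (a sketch, ours; the parameters are the Cheetah figures used above):

// timing.cpp -- sequential versus random reads of the 8704 KB file.
#include <iostream>

int main() {
    const double avgSeekMsec      = 8.0;
    const double rotDelayMsec     = 3.0;
    const double trackReadMsec    = 6.0;
    const double tracks           = 100;
    const double records          = 34000;
    const double clustersPerTrack = 170.0 / 8.0;   // 21.25 4096-byte clusters

    double sequentialMsec = tracks * (avgSeekMsec + rotDelayMsec + trackReadMsec);
    double perRecordMsec  = avgSeekMsec + rotDelayMsec
                          + trackReadMsec / clustersPerTrack;
    double randomMsec     = records * perRecordMsec;

    std::cout << "Sequential: " << sequentialMsec / 1000.0 << " seconds\n"
              << "Random:     " << randomMsec / 1000.0 << " seconds\n";
    return 0;
}

The random total printed here is a fraction of a second higher than the hand computation only because the 0.28-msec cluster read was rounded above.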
3.1.7 Effect of Block Size on Performance: A Unix Example

In deciding how best to organize disk storage allocation for several versions of BSD Unix, the Computer Systems Research Group (CSRG) in Berkeley investigated the trade-offs between block size and performance in a Unix environment (Leffler et al., 1989). The results of the research provide an interesting case study involving trade-offs between block size, fragmentation, and access time.

The CSRG research indicated that a minimum block size of 512 bytes, standard at the time on Unix systems, was not very efficient in a typical Unix environment. Files that were several blocks long often were scattered over many cylinders, resulting in frequent seeks and thereby significantly decreasing throughput. The researchers found that doubling the block size to 1024 bytes improved performance by more than a factor of 2. But even with 1024-byte blocks, they found that throughput was only about 4 percent of the theoretical maximum. Eventually, they found that 4096-byte blocks provided the fastest throughput, but this led to large amounts of wasted space due to internal fragmentation. These results are summarized in Table 3.2.

Table 3.2 The amount of wasted space as a function of block size

Space Used (MB)   Percent Waste   Organization
 775.2             0.0            Data only, no separation between files
 807.8             4.2            Data only, each file starts on 512-byte boundary
 828.7             6.9            Data + inodes, 512-byte block Unix file system
 866.5            11.8            Data + inodes, 1024-byte block Unix file system
 948.5            22.4            Data + inodes, 2048-byte block Unix file system
1128.3            45.6            Data + inodes, 4096-byte block Unix file system

From The Design and Implementation of the 4.3BSD Unix Operating System, Leffler et al., p. 198.

To gain the advantages of both the 4096-byte and the 512-byte systems, the Berkeley group implemented a variation of the cluster concept (see Section 3.1.3). In the new implementation, the researchers allocate 4096-byte blocks for files that are big enough to need them; but for smaller files, they allow the large blocks to be divided into one or more fragments. With a fragment size of 512 bytes, as many as eight small files can be stored in one block, greatly reducing internal fragmentation. With the 4096/512 system, wasted space was found to decline to about 12 percent.

3.1.8 Disk as Bottleneck

Disk performance is increasing steadily, even dramatically, but disk speeds still lag far behind local network speeds. A high-performance disk drive with 50 kilobytes per track can transmit at a peak rate of about 5 megabytes per second, and only a fraction of that under normal conditions. High-performance networks, in contrast, can transmit at rates of as much as 100 megabytes per second. The result can often mean that a process is disk bound: the network and the computer's central processing unit (CPU) have to wait inordinate lengths of time for the disk to transmit data.

A number of techniques are used to solve this problem. One is multiprogramming, in which the CPU works on other jobs while waiting for the data to arrive. But if multiprogramming is not available or if the process simply cannot afford to lose so much time waiting for the disk, methods must be found to speed up disk I/O.

One technique now offered on many high-performance systems is called striping. Disk striping involves splitting the parts of a file on several different drives, then letting the separate drives deliver parts of the file to the network simultaneously. Disk striping can be used to put different blocks of the file on different drives or to spread individual blocks onto different drives. Disk striping exemplifies an important concept that we see more and more in system configurations: parallelism. Whenever there is a bottleneck at some point in the system, consider duplicating the source of the bottleneck and configure the system so several of them operate in parallel.

If we put different blocks on different drives, independent processes accessing the same file will not necessarily interfere with each other. This improves the throughput of the system by improving the speed of multiple jobs, but it does not necessarily improve the speed of a single drive. There is a significant possibility of a reduction in seek time, but there is no guarantee.

The speed of single jobs that do large amounts of I/O can be significantly improved by spreading each block onto many drives. This is commonly implemented in RAID (redundant array of independent disks) systems, which are commercially available for most computer systems. For an eight-drive RAID, for example, the controller receives a single block to write and breaks it into eight pieces, each with enough data for a full track. The first piece is written to a particular track of the first disk, the second piece to the same track of the second disk, and so on. The write occurs at a sustained rate of eight times the rate of a single drive. The read operation is similar: the same track is read from each drive, the block is reassembled in cache, and the cache contents are transmitted back through the I/O channel. RAID systems are supported by a large memory cache on the disk controller to support very large blocks.
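To see the geometry of the eight-way split, here is a toy mapping (ours; the piece size is set to one 87 040-byte Cheetah track as a stand-in for "enough data for a full track," and the byte offset is arbitrary):

// striping.cpp -- where one byte of a striped block lands.
#include <iostream>

int main() {
    const long drives    = 8;
    const long pieceSize = 87040;   // bytes per piece: one full track here

    long byteOffset = 300000;       // offset within the big block
    long drive      = (byteOffset / pieceSize) % drives;  // piece i -> drive i
    long onDrive    = byteOffset % pieceSize;             // offset in the piece

    std::cout << "Byte " << byteOffset << " of the block is written to drive "
              << drive << " at track offset " << onDrive << "\n";
    return 0;
}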
Since data can be located in memory without a seek or rotational delay, RAM disks can provide much faster access than mechanical disks. Since memory is normally volatile, the contents of a RAM disk are lost when the computer is turned off. RAM disks are often used in place of floppy disks because they are much faster than floppies and because relatively little memory is needed to simulate a typical floppy disk.3.2 Magnetic Tape 65 A disk cachet is a large block of memory configured to contain pages of data from a disk. A typical disk-caching scheme might use a 256-kilo- byte cache with a disk. When data is requested from secondary memory, the file manager first looks into the disk cache to see if it contains the page with the requested data. If it does, the data can be processed immediately. Otherwise, the file manager reads the page containing the data from disk, replacing some page already in the disk cache. Cache memory can provide substantial improvements in perfor- mance, especially when a program's data access patterns exhibit a high degree of locality. Locality exists in a file when blocks that are accessed in close temporal sequence are stored close to one another on the disk. When a disk cache is used, blocks that are close to one another on the disk are much more likely to belong to the page or pages that are read in with a single read, diminishing the likelihood that extra reads are needed for extra accesses. RAM disks and cache memory are examples of buffering, a very important and frequently used family of I/O techniques. We take a closer look at buffering in Section 3.9. In these three techniques we see once again examples of the need to make trade-offs in file processing. With RAM disks and disk caches, there is tension between the cost/capacity advantages of disk over memory on the one hand, and the speed of memory on the other. Striping provides opportunities to increase throughput enormously, but at the cost of a more complex and sophisticated disk management system. Good file design balances these tensions and costs creatively. Magnetic Tape Magnetic tape units belong to a class of devices that provide no direct accessing facility but can provide very rapid sequential access to data. Tapes are compact, stand up well under different environmental condi- tions, are easy to store and transport, and are less expensive than disks. Many years ago tape systems were widely used to store application data. An application that needed data from a specific tape would issue a request 4. The term cache (as opposed to disk cache) generally refers to a very high-speed block of primary memory that performs the same types of performance-enhancing operations with respect to memory that a disk cache does with respect to secondary memory.66 Chapter 3 Secondary Storage and System Software for the tape, which would be mounted by an operator onto a tape drive. The application could then directly read and write on the tape. The tremendous reduction in the cost of disk systems has changed the way tapes are used. At present, tapes are primarily used as archival storage. That is, data is written to tape to provide low cost storage and then copied to disk whenever it is needed. Tapes are very common as backup devices for PC systems. In high performance and high volume applications, tapes are commonly stored in racks and supported by a robot system that is capable of moving tapes between storage racks and tape drives. 
3.2 Magnetic Tape

Magnetic tape units belong to a class of devices that provide no direct accessing facility but can provide very rapid sequential access to data. Tapes are compact, stand up well under different environmental conditions, are easy to store and transport, and are less expensive than disks.

Many years ago tape systems were widely used to store application data. An application that needed data from a specific tape would issue a request for the tape, which would be mounted by an operator onto a tape drive. The application could then directly read and write on the tape. The tremendous reduction in the cost of disk systems has changed the way tapes are used. At present, tapes are primarily used as archival storage. That is, data is written to tape to provide low-cost storage and then copied to disk whenever it is needed. Tapes are very common as backup devices for PC systems. In high-performance and high-volume applications, tapes are commonly stored in racks and supported by a robot system that is capable of moving tapes between storage racks and tape drives.

3.2.1 Types of Tape Systems

There has been tremendous improvement in tape technology in the past few years. There are now a variety of tape formats with prices ranging from $150 to $150,000 per tape drive. For $150, a PC owner can add a tape backup system, with sophisticated backup software, that is capable of storing 4 gigabytes of data on a single $30 tape. For larger systems, a high-performance tape system could easily store hundreds of terabytes in a tape robot system costing millions of dollars. Table 3.3 shows a comparison of some current tape systems.

In the past, most computer installations had a number of reel-to-reel tape drives and large numbers of racks or cabinets holding tapes. The primary media was one-half inch magnetic tape on 10.5-inch reels with 3600 feet of tape. In the next section we look at the format and data transfer capabilities of these tape systems, which use nine linear tracks and are usually referred to as nine-track tapes.

Table 3.3 Comparison of some current tape systems

Tape Model            Media Format                 Loading      Capacity   Tracks      Transfer Rate
9-track               one-half inch reel           autoload     200 MB     9 linear    1 MB/sec
Digital linear tape   DLT cartridge                robot        35 GB      36 linear   5 MB/sec
HP Colorado T3000     one-quarter inch cartridge   manual       1.6 GB     helical     0.5 MB/sec
StorageTek Redwood    one-half inch cartridge      robot silo   50 GB      helical     10 MB/sec

Newer tape systems are usually based on a tape cartridge medium where the tape and its reels are contained in a box. The tape media formats that are available include 4 mm, 8 mm, VHS, 1/2 inch, and 1/4 inch.

3.2.2 An Example of a High-Performance Tape System

The StorageTek Redwood SD3 is one of the highest-performance tape systems available in 1997. It is usually configured in a silo that contains storage racks, a tape robot, and multiple tape drives. The tapes are 4-by-4-inch cartridges with one-half inch tape. The tapes are formatted with helical tracks. That is, the tracks are at an angle to the linear direction of the tape. The number of individual tracks is related to the length of the tape rather than the width of the tape as in linear tapes. The expected reliable storage time is more than twenty years, and average durability is 1 million head passes.

The performance of the SD3 is achieved with tape capacities of up to 50 gigabytes and a sustained transfer rate of 11 megabytes per second. This transfer rate is necessary to store and retrieve data produced by the newest generation of scientific experimental equipment, including the Hubble telescope, the Earth Observing System (a collection of weather satellites), seismographic instruments, and a variety of particle accelerators.

An important characteristic of a tape silo system is the speed of seeking, rewinding, and loading tapes. The SD3 silo using 50-gigabyte tapes has an average seek time of 53 seconds and can rewind in a maximum of 89 seconds. The load time is only 17 seconds. The time to read or write a full tape is about 75 minutes. Hence, the overhead to rewind, unload, and load is only 3 percent. Another way to look at this is that any tape in the silo can be mounted in under 2 minutes with no operator intervention.

3.2.3 Organization of Data on Nine-Track Tapes

Since tapes are accessed sequentially, there is no need for addresses to identify the locations of data on a tape. On a tape, the logical position of a byte within a file corresponds directly to its physical position relative to the start of the file.
We may envision the surface of a typical tape as a set of parallel tracks, each of which is a sequence of bits. If there are nine tracks (see Fig. 3.11), the nine bits that are at corresponding positions in the nine respective tracks are taken to constitute 1 byte, plus a parity bit. So a byte can be thought of as a one-bit-wide slice of tape. Such a slice is called a frame.

Figure 3.11 Nine-track tape. Frames (one-bit-wide slices across the tracks) are grouped into data blocks separated by interblock gaps.

The parity bit is not part of the data but is used to check the validity of the data. If odd parity is in effect, this bit is set to make the number of 1 bits in the frame odd. Even parity works similarly but is rarely used with tapes.

Frames (bytes) are grouped into data blocks whose size can vary from a few bytes to many kilobytes, depending on the needs of the user. Since tapes are often read one block at a time and since tapes cannot stop or start instantaneously, blocks are separated by interblock gaps, which contain no information and are long enough to permit stopping and starting. When tapes use odd parity, no valid frame can contain all 0 bits, so a large number of consecutive 0 frames is used to fill the interrecord gap.

Tape drives come in many shapes, sizes, and speeds. Performance differences among drives can usually be measured in terms of three quantities:

- Tape density: commonly 800, 1600, or 6250 bits per inch (bpi) per track, but recently as much as 30 000 bpi;
- Tape speed: commonly 30 to 200 inches per second (ips); and
- Size of interblock gap: commonly between 0.3 inch and 0.75 inch.

Note that a 6250-bpi nine-track tape contains 6250 bits per inch per track, and 6250 bytes per inch when the full nine tracks are taken together. Thus in the computations that follow, 6250 bpi is usually taken to mean 6250 bytes of data per inch.
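Parity is simple to compute. A sketch of the odd-parity rule (ours):

// parity.cpp -- the odd-parity bit for a one-byte frame.
#include <bitset>
#include <iostream>

int oddParityBit(unsigned char byte) {
    int ones = std::bitset<8>(byte).count();
    return (ones % 2 == 0) ? 1 : 0;   // make the total count of 1 bits odd
}

int main() {
    unsigned char a    = 0x61;        // ASCII 'a' has three 1 bits
    unsigned char zero = 0x00;        // all-zero data byte
    std::cout << "parity('a')  = " << oddParityBit(a) << "\n"     // 0
              << "parity(0x00) = " << oddParityBit(zero) << "\n"; // 1
    return 0;
}

Note that an all-zero data byte gets a parity bit of 1, which is exactly why, under odd parity, no valid frame can be all zeros and runs of 0 frames can safely fill the gaps.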
3.2.4 Estimating Tape Length Requirements

Suppose we want to store a backup copy of a large mailing-list file with one million 100-byte records. If we want to store the file on a 6250-bpi tape that has an interblock gap of 0.3 inches, how much tape is needed?

To answer this question we first need to determine what takes up space on the tape. There are two primary contributors: interblock gaps and data blocks. For every data block there is an interblock gap. If we let

    b = the physical length of a data block,
    g = the length of an interblock gap, and
    n = the number of data blocks

then the space requirement s for storing the file is

    s = n × (b + g)

We know that g is 0.3 inch, but we do not know what b and n are. In fact, b is whatever we want it to be, and n depends on our choice of b. Suppose we choose each data block to contain one 100-byte record. Then b, the length of each block, is given by

    b = block size (bytes per block) / tape density (bytes per inch)
      = 100 / 6250
      = 0.016 inch

and n, the number of blocks, is 1 million (one per record). The number of records stored in a physical block is called the blocking factor. It has the same meaning it had when it was applied to the use of blocks for disk storage. The blocking factor we have chosen here is 1 because each block has only one record. Hence, the space requirement for the file is

    s = 1 000 000 × (0.016 + 0.3) inch
      = 1 000 000 × 0.316 inch
      = 316 000 inches
      = 26 333 feet

Magnetic tapes range in length from 300 feet to 3600 feet, with 2400 feet being the most common length. Clearly, we need quite a few 2400-foot tapes to store the file. Or do we?

You may have noticed that our choice of block size was not a very smart one from the standpoint of space usage. The interblock gaps in the physical representation of the file take up about nineteen times as much space as the data blocks do. If we were to take a snapshot of our tape, it would look something like this:

    Data | Gap | Data | Gap | Data | Gap | Data ...

Most of the space on the tape is not used! Clearly, we should consider increasing the relative amount of space used for actual data if we want to try to squeeze the file onto one 2400-foot tape. If we increase the blocking factor, we can decrease the number of blocks, which decreases the number of interblock gaps, which in turn decreases the amount of space consumed by interblock gaps. For example, if we increase the blocking factor from 1 to 50, the number of blocks becomes

    n = 1 000 000 / 50 = 20 000

and the space requirement for interblock gaps decreases from 300 000 inches to 6000 inches. The space requirement for the data is of course the same as it was previously. What has changed is the relative amount of space occupied by the gaps, as compared to the data. Now a snapshot of the tape would look much different, with long data blocks separated by comparatively small gaps.
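The whole estimate fits in a few lines of code. A sketch (ours, with the parameters above):

// tapelength.cpp -- feet of tape needed as a function of blocking factor.
#include <iostream>

int main() {
    const double density    = 6250;     // bytes per inch
    const double gap        = 0.3;      // inches per interblock gap
    const double recordSize = 100;      // bytes
    const double records    = 1000000;

    for (double blockingFactor : {1.0, 50.0}) {
        double blockLength = blockingFactor * recordSize / density; // inches
        double blocks      = records / blockingFactor;
        double feet        = blocks * (blockLength + gap) / 12.0;
        std::cout << "Blocking factor " << blockingFactor << ": "
                  << feet << " feet of tape\n";
    }
    return 0;
}

Running it answers the question posed above: with a blocking factor of 50, the file needs only about 1833 feet, so it fits comfortably on a single 2400-foot tape.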