Database

Database book

Uploaded by

Chris
Copyright
© © All Rights Reserved
DATABASE SYSTEMS
An Application-Oriented Approach (Second Edition)
Michael Kifer, Arthur Bernstein, Philip M. Lewis

Database Systems: An Application-Oriented Approach (Second Edition)

Designed for students learning databases for the first time, Database Systems: An Application-Oriented Approach, Introductory Version, Second Edition, presents the principles underlying the design and implementation of databases and their applications. The book consists of nine core chapters, with separate chapters on triggers (Chapter 7) and using SQL in an application (Chapter 8) that recognize the importance of application development in building database systems. Additional chapters (Chapters 11-17) cover database tuning, transaction processing, query processing, object-oriented databases, and XML databases and provide a variety of ways to enrich students' introduction to databases.

Features of the Second Edition

• An application-oriented introduction to database concepts
• SQL updated to the latest standard
• Both Entity-Relationship modeling and the Unified Modeling Language
• Discussions of software-engineering issues related to implementing transactions
• Case studies providing hands-on experience in application design and programming
• In-depth coverage of XML, object-oriented databases, and database tuning

About the Authors

Michael Kifer is a professor in the Department of Computer Science at the State University of New York at Stony Brook. His interests include database systems, knowledge representation, and Web information systems. He has published and edited several books and many articles in these areas, including award-winning works on F-logic and object-oriented database languages.

Arthur Bernstein is also a professor in the Department of Computer Science at the State University of New York at Stony Brook. His research focuses on transaction processing, Web services, and concurrency, and he has published many articles in these areas.
This is Professor Bernstein's second textbook.

Philip Lewis is a leading professor in the Department of Computer Science at the State University of New York at Stony Brook. With interests in database systems, transaction processing, and concurrency, he has published four textbooks and many articles in these areas. Professor Lewis is also the founding editor of the SIAM Journal on Computing.

For more information, please visit www.PearsonEd.com

ISBN 7-04-017817-6

DATABASE SYSTEMS
An Application-Oriented Approach (Second Edition)
Michael Kifer
Arthur Bernstein
Philip M. Lewis

English reprint edition copyright © 2005 by PEARSON EDUCATION ASIA LIMITED and HIGHER EDUCATION PRESS. (Database Systems: An Application-Oriented Approach, Introductory Version, 2e from Pearson Education's edition of the Work)

Database Systems: An Application-Oriented Approach, Introductory Version, 2e by Michael Kifer, Arthur Bernstein, Philip M. Lewis, Copyright © 2005. All Rights Reserved. Published by arrangement with the original publisher, Pearson Education, Inc., publishing as Pearson Education, Inc.

This edition is authorized for sale only in the People's Republic of China (excluding the Special Administrative Regions of Hong Kong and Macau).
ISBN: 0-321-22838-3

For sale and distribution in the People's Republic of China exclusively (except Taiwan, Hong Kong SAR and Macao SAR).

Cataloging-in-publication (CIP) data: Database Systems: An Application-Oriented Approach / Kifer, M., Bernstein, A., Lewis, P. M. 2005.12. ISBN 7-04-017817-6

Publisher contact details (labels illegible in the source): 010-58581118; 800-810-0598; 100011; 010-58581000; http://www.hep.edu.cn; http://www.hep.com.cn; http://www.landraco.com; http://www.landraco.com.cn; http://www.widedu.com
Preface

We are publishing the second edition of our textbook in two versions:

• This version, which consists of introductory material, is appropriate for a first undergraduate or graduate course in databases.
• The second version, which is the complete book, is appropriate for three courses:
    • An introductory undergraduate or graduate course in databases
    • An undergraduate or graduate course in transaction processing for students who have had an introductory course in databases
    • An advanced undergraduate or a first graduate course in databases for students who have had an introductory course in databases

One of our goals was to reduce the size and make this introductory version more affordable to students. Another was to capitalize on our experience in using the first edition of the book to make an even better introductory text.

The chapters in this book are not just a subset of those in the complete book. We believe that instructors of an introductory database course should have the option of enriching an introductory course by including material on object databases and XML, topics that are covered in great detail in several chapters in the complete book. Therefore we have added to the introductory book two new chapters, Chapter 16, Introduction to Object Databases, and Chapter 17, Introduction to XML and Web Data, which contain an appropriately chosen subset of the material in the full version of this book.

To keep the book up-to-date with the rapidly changing technology, we have added a substantial amount of material on UML to a number of chapters and have included a new chapter on Database Tuning, Chapter 12, in both the introductory and complete books.

As with the first edition, our focus is on how to build applications using databases rather than on how to build the database management system itself. We believe that many more students will be implementing applications than will be building DBMSs.
Thus, we include substantial material describing the languages and APIs used by transactions to access a database, such as embedded SQL, ODBC, and JDBC.

Although we cover many practical aspects of database and transaction processing applications, we are primarily concerned with the concepts that underlie these topics rather than with the details of particular commercial systems or applications. Thus we concentrate on the concepts behind the relational and object data models. These concepts will remain the foundation of database processing long after SQL is obsolete.

To enhance students' understanding of the technical material, we have included a case study of a transaction processing application, the Student Registration System, which is carried through the book. While a student registration system can hardly be considered glamorous, it has the unique advantage that all students have interacted with such a system as users. More importantly, it turns out to be a surprisingly rich application, so we can use it to illustrate many of the issues in database design, query processing, and transaction processing.

A unique aspect of the book is a presentation of the software engineering concepts required to implement transaction processing applications, using the Student Registration System as an example. Since the implementations of many information systems fail because of poor project management and inadequate software engineering, we feel that these topics should be an important part of the student's education. Our treatment of software engineering issues is brief, since many students will take a separate course in this subject. However, we believe that they will be better able to understand and apply that material when they see it presented in the context of an information system implementation.
Since the courses that use this text at Stony Brook are not software engineering courses, we do not cover this material in class. Instead, we ask the students to read it and require that they use good software engineering practice in their class projects. We do cover in class those aspects of the Student Registration System that illustrate important issues in databases and transaction processing.

Changes in the Second Edition

The technology underlying database and transaction processing systems is changing so rapidly that we have made a large number of changes and additions to the material of the first edition.

One rapidly advancing technology is the Unified Modeling Language, UML. We added a substantial amount of material on UML in Chapter 4 on database design, in addition to the material on E-R diagrams that was already there. We also added UML to the material on software engineering in Chapters 2, 14, and 15.

A new chapter on Database Tuning, Chapter 12, was added because so much effort in the real world is spent increasing the throughput of database and transaction processing applications.

In addition, material has been added and updated in almost all the chapters. Significant examples of this are the coverage of SQL/XML and RAID technology.

One important area that is not included in this volume is Web Services. Since this is a rapidly developing and interesting application-oriented subject, we have significantly revised the complete version of this text to include material on this topic. In addition to strengthening the book on the subject of XML technology by updating the chapter on XML and Web Data and adding a section on SQL/XML, we have added a new chapter on Web Services that contains material on SOAP, WSDL, BPEL, UDDI, and XML-based transaction processing using WS-Coordination and WS-Transaction.
In the chapter on Security and Internet Commerce, we added a section on XML-based encryption, using XML-Encryption, XML-Signature, WS-Security, and SAML. And in the chapter on Architecture of Transaction Processing Systems, we added material on Web Application Servers and J2EE, which are used to implement the back-end of many Web services.

Organization of the Book

Chapters 1 through 7 should be taught in the order in which they appear in the book. Chapter 8 contains much of the information that students need in order to put the knowledge they acquired in the preceding chapters into practice. However, subsequent chapters do not significantly depend on Chapter 8. Chapters 9 through 12 in Part 3 should be taught sequentially. Chapter 13 in the same part is largely independent. The software engineering chapters in Part 4 utilize the material of the chapters in Parts 2 and 3, but the software engineering chapters can be read in parallel with the database material. Chapters 16 and 17 in the advanced part of the book depend on the first seven chapters in Part 2.

Finally, we note that the sections in this book that are marked with an asterisk (*) are optional and can be omitted if the instructor prefers to do so. Sections marked with the @ icon in the table of contents deal with the case study. Also, exercises that are marked with an asterisk are slightly harder than the rest, and exercises that are marked with two asterisks are even harder.

Supplements

In addition to the text, the following supplementary materials are available to assist instructors:

• Online PowerPoint presentations for all chapters
• Online PowerPoint slides of all figures
• An online solution manual containing solutions for the exercises
• Additional references, notes, errata, homeworks, and exams

For more information on obtaining these supplements, please visit this book's Companion Website at www.aw-bc.com/kifer.
The solutions manual and PowerPoint presentations are available only to instructors through your Addison-Wesley sales representative. To contact your representative, please visit www.aw.com.

Acknowledgments

We would like to thank the reviewers, whose comments and suggestions significantly improved the second edition of the book:

Tran Cao Son, New Mexico State University
Frantisek Franek, McMaster University
Junping Sun, Nova Southeastern University
Philip Cannata, Sun Microsystems
Dehu Qi, Lamar University
Nematollah Shiri, Concordia University
Jian Pei, State University of New York at Buffalo
Jack Wileden, University of Massachusetts Amherst
Sibel Adali, Rensselaer Polytechnic Institute
Roger King, University of Colorado at Boulder
Markus Schneider, University of Florida
Yaron Y. Goland, BEA
Dennis Shasha, New York University
Christelle Scharff, Pace University
Zhiwei Wang, Graduate Programs in Software Engineering, IT, and IS, University of St. Thomas

We would also like to thank the reviewers of the first edition of the book:

Suad Alagic, Wichita University
Catriel Beri, The Hebrew University
Rick Cattel, Sun Microsystems
Jan Chomicki, SUNY Buffalo
Henry A. Etlinger, Rochester Institute of Technology
Leonidas Fegaras, University of Texas at Arlington
Alan Fekete, University of Sydney
Johannes Gehrke, Cornell University
Hershel Gottesman, Consultant
Jiawei Han, Simon Fraser University
Peter Honeyman, University of Michigan
Vijay Kumar, University of Missouri-Kansas City
Jonathan Lazar, Towson University
Dennis McLeod, University of Southern California
Rokia Missaoui, University of Quebec in Montreal
Clifford Neuman, University of Southern California
Fabian Pascal, Consultant
Sudha Ram, University of Arizona
Krithi Ramamritham, University of Massachusetts-Amherst, and IIT Bombay
Andreas Reuter, International University in Germany, Bruchsal
Arijit Sengupta, Georgia State University
Munindar P. Singh, North Carolina State University
Greg Speegle, Baylor University
Junping Sun, Nova Southeastern University
Joe Trubisz, Consultant
Vassilis J. Tsotras, University of California, Riverside
Emilia E. Villarreal, California Polytechnic State University

We would also like to thank the following people who were kind enough to provide us with additional information and answers to our questions: Don Chamberlin, Daniela Florescu, Jim Gray, Pankaj Gupta, Rob Kelly, and C. Mohan.

Two people taught out of beta versions of the book and made useful comments and suggestions: David S. Warren and Radu Grosu. Joe Trubicz not only served as a reviewer when the manuscript was complete but also provided critical comments on early versions of many of the chapters.

A number of students were very helpful in reading and checking the correctness of various parts of the book: Ziyang Duan, Shiyong Lu, Swapnil Patil, Guizhen Yang, and Yan Zhang.

Many thanks to the staff of the Computer Science Department at Stony Brook, and in particular Kathy Germana, who helped make things happen at work.

We would particularly like to thank Matt Goldstein and Maite Suarez-Rivas, our editors at Addison-Wesley, who played an important role in shaping the contents and approach of the book in its early stages and throughout the time we were writing it.

We would also like to thank the various staff members of Addison-Wesley and Windfall Software, who did an excellent job of editing and producing the book: Jeffrey Holcomb, Paul Anagnostopoulos, Elisabeth Beller, Jennifer McClain, and Joe Snowden.

Last, but not least, we would like to thank our wives, Lora, Edie, and Rhoda, who provided much needed support and encouragement while we were writing the book.

Contents

Preface

PART ONE Introduction

1 Overview of Databases and Transactions
  1.1 What Are Databases and Transactions?
  1.2 Features of Modern Database and Transaction Processing Systems
  1.3 Major Players in the Implementation and Support of Database and Transaction Processing Systems
  1.4 Decision Support Systems: OLAP and OLTP

2 The Big Picture
  @ 2.1 Case Study: A Student Registration System
  2.2 Introduction to Relational Databases
  2.3 What Makes a Program a Transaction: The ACID Properties
  Bibliographic Notes
  Exercises

PART TWO Database Management

3 The Relational Data Model
  3.1 What Is a Data Model?
  3.2 The Relational Model
    3.2.1 Basic Concepts
    3.2.2 Integrity Constraints
  3.3 SQL: Data Definition Sublanguage
    3.3.1 Specifying the Relation Type
    3.3.2 The System Catalog
    3.3.3 Key Constraints
    3.3.4 Dealing with Missing Information
    3.3.5 Semantic Constraints
    3.3.6 User-Defined Domains
    3.3.7 Foreign-Key Constraints
    3.3.8 Reactive Constraints
    3.3.9 Database Views
    3.3.10 Modifying Existing Definitions
    3.3.11 SQL-Schemas
    3.3.12 Access Control
  Bibliographic Notes
  Exercises

4 Conceptual Modeling of Databases with Entity-Relationship Diagrams and the Unified Modeling Language
  4.1 Conceptual Modeling with the E-R Approach
  4.2 Entities and Entity Types
  4.3 Relationships and Relationship Types
  4.4 Advanced Features in Conceptual Data Modeling
    4.4.1 Entity Type Hierarchies
    4.4.2 Participation Constraints
    4.4.3 The Part-of Relationship
  4.5 From E-R Diagrams to Relational Database Schemas
    4.5.1 Representation of Entities
    4.5.2 Representation of Relationships
    4.5.3 Representing IsA Hierarchies in the Relational Model
    4.5.4 Representation of Participation Constraints
    4.5.5 Representation of the Part-of Relationship
  4.6 UML: A New Kid on the Block*
    4.6.1 Representing Entities in UML
    4.6.2 Representing Relationships in UML
    4.6.3 Advanced Modeling Concepts in UML
    4.6.4 Translation to SQL
  4.7 A Brokerage Firm Example
    4.7.1 An Entity-Relationship Design
    4.7.2 A UML Design
  @ 4.8 Case Study: A Database Design for the Student Registration System
    4.8.1 The Database Part of the Requirements Document
    4.8.2 The Database Design
  4.9 Limitations of Data Modeling Methodologies
  Bibliographic Notes
  Exercises

5 Relational Algebra and SQL
  5.1 Relational Algebra: Under the Hood of SQL
    5.1.1 Basic Operators
    5.1.2 Derived Operators
  5.2 The Query Sublanguage of SQL
    5.2.1 Simple SQL Queries
    5.2.2 Set Operations
    5.2.3 Nested Queries
    5.2.4 Quantified Predicates
    5.2.5 Aggregation over Data
    5.2.6 Join Expressions in the FROM Clause
    5.2.7 A Simple Query Evaluation Algorithm
    5.2.8 More on Views in SQL
    5.2.9 Materialized Views
    5.2.10 The Null Value Quandary
  5.3 Modifying Relation Instances in SQL
    5.3.1 Inserting Data
    5.3.2 Deleting Data
    5.3.3 Updating Existing Data
    5.3.4 Updates on Views
  Bibliographic Notes
  Exercises

6 Database Design with the Relational Normalization Theory
  6.1 The Problem of Redundancy
  6.2 Decompositions
  6.3 Functional Dependencies
  6.4 Properties of Functional Dependencies
  6.5 Normal Forms
    6.5.1 The Boyce-Codd Normal Form
    6.5.2 The Third Normal Form
  6.6 Properties of Decompositions
    6.6.1 Lossless and Lossy Decompositions
    6.6.2 Dependency-Preserving Decompositions
  6.7 An Algorithm for BCNF Decomposition
  6.8 Synthesis of 3NF Schemas
    6.8.1 Minimal Cover
    6.8.2 3NF Decomposition through Schema Synthesis
    6.8.3 BCNF Decomposition through 3NF Synthesis
  6.9 The Fourth Normal Form
  6.10 Advanced 4NF Design*
    6.10.1 MVDs and Their Properties
    6.10.2 The Difficulty of Designing for 4NF
    6.10.3 A 4NF Decomposition How-To
  6.11 Summary of Normal Form Decomposition
  @ 6.12 Case Study: Schema Refinement for the Student Registration System
  6.13 Tuning Issues: To Decompose or Not to Decompose?
  Bibliographic Notes
  Exercises

7 Triggers and Active Databases
  7.1 What Is a Trigger?
  7.2 Semantic Issues in Trigger Handling
  7.3 Triggers in SQL:1999
  7.4 Avoiding a Chain Reaction
  Bibliographic Notes
  Exercises

8 Using SQL in an Application
  8.1 What Are the Issues Involved?
  8.2 Embedded SQL
    8.2.1 Status Processing
    8.2.2 Sessions, Connections, and Transactions
    8.2.3 Executing Transactions
    8.2.4 Cursors
    8.2.5 Stored Procedures on the Server
  8.3 More on Integrity Constraints
  8.4 Dynamic SQL
    8.4.1 Statement Preparation in Dynamic SQL
    8.4.2 Prepared Statements and the Descriptor Area*
    8.4.3 Cursors
    8.4.4 Stored Procedures on the Server
  8.5 JDBC and SQLJ
    8.5.1 JDBC Basics
    8.5.2 Prepared Statements
    8.5.3 Result Sets and Cursors
    8.5.4 Obtaining Information about a Result Set
    8.5.5 Status Processing
    8.5.6 Executing Transactions
    8.5.7 Stored Procedures on the Server
    8.5.8 An Example
    8.5.9 SQLJ: Statement-Level Interface to Java
  8.6 ODBC*
    8.6.1 Prepared Statements
    8.6.2 Cursors
    8.6.3 Status Processing
    8.6.4 Executing Transactions
    8.6.5 Stored Procedures on the Server
    8.6.6 An Example
  8.7 Comparison
  Bibliographic Notes
  Exercises

PART THREE Optimizing DBMS Performance and Transaction Processing

9 Physical Data Organization and Indexing
  9.1 Disk Organization
    9.1.1 RAID Systems
  9.2 Heap Files
  9.3 Sorted Files
  9.4 Indices
    9.4.1 Clustered versus Unclustered Indices
    9.4.2 Sparse versus Dense Indices
    9.4.3 Search Keys Containing Multiple Attributes
  9.5 Multilevel Indexing
    9.5.1 Index-Sequential Access
    9.5.2 B+ Trees
  9.6 Hash Indexing
    9.6.1 Static Hashing
    9.6.2 Dynamic Hashing Algorithms
  9.7 Special-Purpose Indices
    9.7.1 Bitmap Indices
    9.7.2 Join Indices
  9.8 Tuning Issues: Choosing Indices for an Application
  Bibliographic Notes
  Exercises

10 The Basics of Query Processing
  10.1 Overview of Query Processing
  10.2 External Sorting
  10.3 Computing Projection, Union, and Set Difference
  10.4 Computing Selection
    10.4.1 Selections with Simple Conditions
    10.4.2 Access Paths
    10.4.3 Selections with Complex Conditions
  10.5 Computing Joins
    10.5.1 Computing Joins Using Simple Nested Loops
    10.5.2 Sort-Merge Join
    10.5.3 Hash Join
  10.6 Multirelational Joins*
  10.7 Computing Aggregate Functions
  Bibliographic Notes
  Exercises

11 An Overview of Query Optimization
  11.1 Query Processing Architecture
  11.2 Heuristic Optimization Based on Algebraic Equivalences
  11.3 Estimating the Cost of a Query Execution Plan
  11.4 Estimating the Size of the Output
  11.5 Choosing a Plan
  Bibliographic Notes
  Exercises

12 Database Tuning
  12.1 Disk Caches
    12.1.1 Tuning the Cache
  12.2 Tuning the Schema
    12.2.1 Indices
    12.2.2 Denormalization
    12.2.3 Repeating Groups
    12.2.4 Partitioning
  12.3 Tuning the Data Manipulation Language
  12.4 Tools
  12.5 Managing Physical Resources
  12.6 Influencing the Optimizer
  Bibliographic Notes
  Exercises

13 An Overview of Transaction Processing
  13.1 Isolation
    13.1.1 Serializability
    13.1.2 Two-Phase Locking
    13.1.3 Deadlock
    13.1.4 Locking in Relational Databases
    13.1.5 Isolation Levels
    13.1.6 Lock Granularity and Intention Locks
    13.1.7 Summary
  13.2 Atomicity and Durability
    13.2.1 The Write-Ahead Log
    13.2.2 Recovery from Mass Storage Failure
  13.3 Implementing Distributed Transactions
    13.3.1 Atomicity and Durability: The Two-Phase Commit Protocol
    13.3.2 Global Serializability and Deadlock
    13.3.3 Replication
    13.3.4 Summary
  Bibliographic Notes
  Exercises

PART FOUR Software Engineering Issues and Documentation

14 Requirements and Specifications
  14.1 Software Engineering Methodology
    14.1.1 UML Use Cases
  @ 14.2 The Requirements Document for the Student Registration System
  @ 14.3 Requirements Analysis: New Issues
  @ 14.4 Specifying the Student Registration System
    14.4.1 UML Sequence Diagrams
  @ 14.5 The Specification Document for the Student Registration System: Section II
  14.6 The Next Step in the Software Engineering Process
  Bibliographic Notes
  Exercises

15 Design, Coding, and Testing
  15.1 The Design Process
    15.1.1 Database Design
    15.1.2 Describing the Behavior of Objects with UML State Diagrams
    15.1.3 Structure of the Design Document
    15.1.4 Design Review
  15.2 Test Plan
  15.3 Project Planning
  15.4 Coding
  15.5 Incremental Development
  15.6 The Project Management Plan
  @ 15.7 Design and Code for the Student Registration System
    15.7.1 Completing the Database Design: Integrity Constraints
    15.7.2 Design of the Registration Transaction
    15.7.3 Partial Code for the Registration Transaction
  Bibliographic Notes
  Exercises

PART FIVE Advanced Topics in Databases

16 Introduction to Object Databases
  16.1 Shortcomings of the Relational Data Model
  16.2 The Conceptual Object Data Model
    16.2.1 Objects and Values
    16.2.2 Classes
    16.2.3 Types
    16.2.4 Object-Relational Databases
  16.3 Objects in SQL:1999 and SQL:2003
    16.3.1 Row Types
    16.3.2 User-Defined Types
    16.3.3 Objects
    16.3.4 Querying User-Defined Types
    16.3.5 Updating User-Defined Types
    16.3.6 Reference Types
    16.3.7 Inheritance
    16.3.8 Collection Types
  Bibliographic Notes
  Exercises

17 Introduction to XML and Web Data
  17.1 Semistructured Data
  17.2 Overview of XML
    17.2.1 XML Elements and Database Objects
    17.2.2 XML Attributes
    17.2.3 Namespaces
    17.2.4 Document Type Definitions
    17.2.5 Inadequacy of DTDs as a Data Definition Language
  17.3 XML Schema
    17.3.1 XML Schema and Namespaces
    17.3.2 Simple Types
    17.3.3 Complex Types
    17.3.4 Putting It Together
    17.3.5 Shortcuts: Anonymous Types and Element References
    17.3.6 Integrity Constraints
  17.4 XML Query Languages
    17.4.1 XPath: A Lightweight XML Query Language
    17.4.2 SQL/XML
  Bibliographic Notes
  Exercises

Bibliography
Index

THE INTRODUCTORY PART of the book consists of two chapters. In Chapter 1, we will try to get you excited about the fields of databases and transaction processing by giving you some idea of what the book is all about. In Chapter 2, we will introduce many of the technical concepts underlying the fields of databases and transaction processing, including the SQL language and the ACID properties of transactions. We will expand on these concepts in the rest of the book.

1 Overview of Databases and Transactions

1.1 What Are Databases and Transactions?
During your vacation, you stand at the checkout counter of a department store in Tokyo, hand the clerk your credit card, and wait anxiously for your purchases to be approved. In the few seconds you have to wait, messages are sent around the world to one or more banks and clearinghouses, accessing and updating a number of databases until finally the system approves your purchase. Over 100 million such credit card transactions are processed each day from over 10 million merchants through more than 20 thousand banks. Billions of dollars are involved, and the only record of what happens is stored in the databases on the network. The accuracy, security, and availability of these databases and the correctness and performance characteristics of the transactions that access them are critical to the entire credit card business.

What is a database? A database is a collection of data items related to some enterprise (for example, the depositor account information in a bank). A database might be stored on cards in a Rolodex or on paper in a file cabinet, but we are particularly interested in databases stored as bits and bytes in a computer. Such a database can be centralized on one computer or distributed over several, perhaps widely separated geographically.

An increasing number of enterprises depend on such databases for their very existence. No paper records exist within the enterprise; the only up-to-date record of its current status (for example, the balance of each bank customer's checking account) is stored in its databases. Many enterprises view their databases as their most important asset.

For example, the database of the company that manufactured the airplane on which you flew to Tokyo contains the only record of information about the engineering design, manufacturing processes, and subassembly suppliers involved in producing that plane 10 years ago, together with every test made on it over its lifetime.
If, at some time in the future, a test shows that a turbine blade on one of the plane's jet engines has failed, the company can determine from its database which subcontractor supplied that particular engine, and the subcontractor can determine from its database the date on which that turbine blade was manufactured, the machines and people involved, the source of the materials from which the blade was fabricated, and the results of quality assurance tests made while the blade was being manufactured. In this way it can determine the cause of the failure and increase the quality of future planes. The existence of these detailed historical databases, as well as the ability to search them for information about the fabrication of a specific turbine blade in a specific jet engine on a specific airplane manufactured 10 years ago, gives the airplane manufacturer a significant strategic advantage over any other manufacturer that does not maintain such databases.

In some cases, a database is the major asset of an enterprise: for example, the database of the credit history company that your credit card company consulted when you applied for your card. In other cases, the accuracy of the information in the database is critical for human life: for example, the database in the air traffic control system at the Tokyo airport.

What is a database management system? To make access to them convenient, databases are generally encapsulated within a database management system (DBMS). The DBMS supports a high-level language in which the application programmer describes the database access it wishes to perform. Typically, all database access is classified into two broad categories: queries and updates. A query is a request to retrieve data, and an update is a request to insert, delete, or modify existing data items.
The most commonly used data access language, and the one we study the most in this text, is the Structured Query Language (SQL). Although it is called a query language, updates are also done through SQL. The beauty of SQL lies in its declarative nature: the application programmer need only state what is to be done; the DBMS figures out how to do it efficiently. The DBMS interprets each SQL statement and performs the action it describes. The application programmer need not know the details of how the database is stored, need not formulate the algorithm for performing the access, and need not be concerned about many other aspects of managing the database. Compare this to regular file systems, where the programmer not only has to know the details of the file structure but must also provide the algorithms that search the files to retrieve the desired information.

What is a transaction? Databases frequently store information that describes the current state of an enterprise. For example, a bank's database stores the current balance in each depositor's account. When an event happens in the real world that changes the state of the enterprise, a corresponding change must be made to the information stored in the database. With online DBMSs, these changes are made in real time by programs called transactions, which execute when the real-world event occurs. For example, when a customer deposits money in a bank (an event in the real world), a deposit transaction is executed. Each transaction must be designed so that it maintains the correctness of the relationship between the database state and the real-world enterprise it is modeling. In addition to changing the state of the database, the transaction itself might initiate some events in the real world. For example, a withdraw transaction at an automated teller machine (ATM) initiates the event of dispensing cash, and a transaction that establishes a connection for a
telephone call requires the allocation of resources (bandwidth on a long-distance link) in the telephone company's infrastructure.

Credit card approval is only one example of a transaction that you executed on your vacation in Tokyo. Your flight arrangements involved a transaction with the airline's reservation database, your passage through passport control at the airport involved a transaction with the immigration services database, and your check-in at the hotel involved a transaction with the hotel reservation database. Even the phone call you made from your hotel room to tell your family you had arrived safely involved transactions with the hotel billing database and with a long-distance carrier to arrange billing and to establish the call.

Other examples of transactions you probably execute regularly involve ATM systems, supermarket scanning systems, and university registration and billing systems.

Increasingly, these transactions entail access to distributed databases: multiple databases managed by different DBMSs stored at different geographical locations. Your phone call transaction at the Tokyo hotel is an example.

What is a transaction processing system? A transaction processing system (TPS) includes one or more databases that store the state of an enterprise, the software for managing the transactions that manipulate that state, and the transactions themselves that constitute the application code. In its simplest form the TPS involves a single DBMS that contains the software for managing transactions. More complex systems involve several DBMSs. In this case, transaction management is handled both within the DBMSs and without, by additional code called a TP monitor that coordinates transactions across multiple sites (see Figure 1.1).

FIGURE 1.1 The structure of a transaction processing system.
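The deposit transaction described earlier can be made concrete with a small sketch. It uses Python's standard sqlite3 module rather than any system discussed in the text, and the ACCOUNT table, its columns, and the function names are hypothetical, chosen only for illustration. The point is that the program either records the real-world deposit completely or leaves the database unchanged.

```python
import sqlite3

def open_bank():
    """Create an in-memory database with one hypothetical ACCOUNT table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ACCOUNT (AcctId INTEGER PRIMARY KEY, Balance INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO ACCOUNT VALUES (1, 100)")
    conn.commit()
    return conn

def deposit(conn, acct_id, amount):
    """Deposit transaction: reflect a real-world deposit in the database."""
    if amount <= 0:
        raise ValueError("deposit amount must be positive")
    # 'with conn' commits if the body succeeds and rolls back if it raises.
    with conn:
        cur = conn.execute(
            "UPDATE ACCOUNT SET Balance = Balance + ? WHERE AcctId = ?",
            (amount, acct_id),
        )
        if cur.rowcount != 1:
            raise LookupError("no such account")

def balance(conn, acct_id):
    """Query: return the current balance of one account."""
    row = conn.execute(
        "SELECT Balance FROM ACCOUNT WHERE AcctId = ?", (acct_id,)
    ).fetchone()
    return row[0]

conn = open_bank()
deposit(conn, 1, 50)
print(balance(conn, 1))  # prints 150
```

A production system would add authentication, logging, and concurrency control; the sketch only shows the shape of a transaction as a program that changes the database state when a real-world event occurs.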
The database is at the heart of a transaction processing system because it persists beyond the lifetime of any particular transaction. An increasing number of enterprises depend on such systems for their business. For example, one might say that the credit card transaction processing system is the credit card business.

Our concern in this book is with the technical aspects of databases and the transaction processing systems that use them. Specifically, we are interested in the design and implementation of applications, including the organization of the application database, but we are not concerned with the algorithms and data structures used to implement the underlying DBMS and transaction processing system modules. Nevertheless, we must learn enough about these underlying systems so that we can use them intelligently in an application.

1.2 Features of Modern Database and Transaction Processing Systems

Modern computer and communication technology has led to significant advances in the architecture, design, and use of database and transaction processing systems. Their enhanced functionality has led to important new business opportunities for the enterprises that deploy them and, in turn, implies a number of additional requirements on their operation:

■ High availability. Because the system is online, it must be operational at all times when the enterprise is open for business. In some enterprises, this means that the system must always be available. For example, an airline reservation system might be required to accept requests for flight reservations from ticket offices spread over a large number of time zones, so the system is never shut down. With online systems, failures can result in a disruption of business: if the computer in an airline reservation system is down, reservations cannot be made. The ability to tolerate failures depends on the nature of the enterprise. Clearly a flight control system has considerably less tolerance for failures than a flight reservation system has. VISA claimed in 2002 that its system had been down a total of eight minutes in the previous five years (an uptime of greater than 99.999%). Highly available systems generally involve replication of hardware and software.

■ High reliability. The system must accurately reflect the results of all transactions. This implies not only that transactions must be correctly programmed but also that errors must not be introduced because of concurrent execution of (correctly programmed) transactions or intercommunication of modules while the transaction is executing. Furthermore, large, distributed transaction processing systems include thousands of hardware and software modules, and it is unlikely that all are working correctly. The system must not forget the results of any transaction that has completed despite all but the most catastrophic forms of failure. For example, the database in a banking system must accurately reflect the effect of all the deposits and withdrawals that have completed and cannot lose the results of any such transactions should it subsequently crash.

■ High throughput. Because the enterprise has many customers who must use the transaction processing system, the system must be capable of performing many transactions per second. For example, a credit card approval system might perform thousands of transactions per second during its busiest periods. As we shall see, this requirement implies that individual transactions cannot be executed sequentially but must be executed concurrently, thus significantly complicating the design of the system.

■ Low response time. Because customers might be waiting for a response from it, the system must respond quickly. Response requirements may differ depending on the application. Whereas you might be willing to wait fifteen seconds for an ATM to output cash, you expect a telephone connection to be made in no more than one or two seconds. Furthermore, in some applications, if the response does not occur within a fixed period of time, the transaction will not perform properly. For example, in a factory automation system the transaction might be required to actuate a device before some unit passes a particular position on the conveyor belt. Applications of this type are said to have hard real-time constraints.

■ Long lifetime. Transaction processing systems are complex and not easily replaced. They must be designed in such a way that individual hardware or software modules can be replaced with newer versions (that perform better or have additional functionality) without necessitating major changes to the surrounding system.

■ Security. Many transaction processing systems contain information about the private concerns of individuals (e.g., the items they purchase, their credit card number, the videos they view, and their health and financial records). Because these systems can be accessed by a large number of people from a large number of places (perhaps over the Internet), security is important. Individual users must be authenticated (are they who they claim to be?), users must be allowed to execute only those transactions they are authorized to execute (only a bank teller can execute a transaction to generate a certified check), the information in the database must not be corrupted or read by an attacker, and the information transmitted between the user and the system must not be altered or overheard by an eavesdropper.

1.3 Major Players in the Implementation and Support of Database and Transaction Processing Systems

A transaction processing system, together with its associated databases, can be an immensely complex assemblage of hardware and software, with which many different types of people interact in various roles.
Examining these roles is a useful way of understanding what a transaction processing system is. First consider the people involved in the design and implementation of a transaction processing system:

■ System analyst. The system analyst works with the customer of a proposed application system to develop formal requirements and specifications for it. He or she must understand both the business rules of the enterprise for which the application is being implemented and the database and transaction processing technology underlying the implementation so that the application will meet the customer's needs and execute efficiently. The specifications developed by the system analyst are then refined into the design of the database formats and the individual transactions that will access the database.

■ Database designer. The database designer specifies the structure of the database appropriate for an application. The database contains the information that describes the current state of the real-world application. The structure must support the accesses required by the transactions and allow those accesses to be performed in a timely manner.

■ Application programmer. The application programmer implements the graphical user interface and the individual transactions in the system. He or she must ensure that the transactions maintain the correspondence between the state of the real-world application and the state of the database. Together with the database designer, the application programmer must ensure that the rules governing the workings of the enterprise are enforced. For example, in the Student Registration System, to be discussed in Section 2.1, the number of students enrolled in a course should not exceed the number of seats in the room assigned to the course.

■ Project manager. The project manager is responsible for the successful completion of the implementation project. He or she prepares schedules and budgets, assigns people to tasks, and monitors day-to-day project operation. Project management is surprisingly difficult. According to a widely quoted report of the Standish Group, an Information Technology (IT) consulting group, of the more than eight thousand IT projects the group surveyed, only 16% completed successfully, that is, on time and on budget [Standish 2000]. The primary reason for the failures was almost always poor project management. (Footnote 1: For large companies, the success rate dropped to 9%. For projects that completed late or over budget, the average completion time was 222% of the scheduled time and the average cost was 189% of the budgeted cost. An astonishing 31% of the projects were canceled before they were completed. At the time this book was written, information about this study, called Chaos, could be found in [Standish 2000].)

The people interacting with (as opposed to building) an operational transaction processing system include the following:

■ User. The user causes the execution of individual transactions, usually by interacting through some graphical user interface. The user interface must be appropriate to the capabilities of the intended class of users. As an example, the user interface presented by an ATM is simple enough that an average person can use the system to perform bank deposit and withdraw transactions without any training or instructions except those presented on the screen. By contrast, the interface to an airline reservation system, which is used by reservation clerks or travel agents, requires advanced training. In both cases, however, most of the complexities of the system are hidden from the user.

■ Database administrator. The database administrator is responsible for supporting the database while the system is running. Among his or her concerns are allocating storage space for the database, monitoring and optimizing database performance, and monitoring and controlling database security. In addition, the database administrator might modify the structure of the database to accommodate changes in the enterprise or to handle performance bottlenecks.

■ System administrator. The system administrator is responsible for supporting the system as a whole while it is running. Among the things he or she must keep track of are
    ■ System architecture. What hardware and software modules are connected to the system at any instant, and how are they interconnected?
    ■ Configuration management. What version of each software module exists on each machine?
    ■ System status. What is the health of the system? Which systems and communication links are operational or congested, and what is being done to repair the situation? How is the system currently performing?

Our main interest in this book lies at the application level. Thus, we are particularly concerned with the roles of the system analyst, the application programmer, and the database designer. However, in order for someone working at the application level to take full advantage of the capabilities of the underlying system, he or she must be knowledgeable about the other roles as well.

1.4 Decision Support Systems: OLAP and OLTP

Transaction processing is not the only application domain in which databases play a key role. Another such domain is decision support. While transaction processing is concerned with using a database to maintain an accurate model of some real-world situation, decision support is concerned with using the information in a database to guide management decisions. To illustrate the differences between these two domains, we discuss the roles they might play in the operation of a national supermarket chain.

Transaction processing.
Each local supermarket in a chain maintains a database of the prices and current inventory of all the items it sells. It uses that database (together with a bar code scanner) as part of a transaction processing system at the checkout counters. One transaction in this system might be, "Three cans of Campbell soup and one box of Ritz crackers were purchased; compute the price, print out a receipt, update the balance in the cash drawer, and subtract these items from the store's inventory." The customer expects this transaction to complete in a few seconds.

The main goal of such a transaction processing system is to maintain the correspondence between the database and the real-world situation it is modeling as events occur in the real world. In this case, the event is the customer's purchase, and the real-world situation is the store's inventory and the amount of cash in the cash drawer.

Decision support. The managers of the supermarket chain might want to analyze the data stored in the databases in each store to help them make decisions for the chain as a whole. Such decision support applications are becoming increasingly important as enterprises attempt to turn the data in their databases into information they can use to advance their long-term strategic goals. Decision support applications involve queries to one or more databases, possibly followed by some mathematical analysis of the information returned by the queries. Decision support applications are sometimes called online analytic processing (OLAP), in contrast with the online transaction processing (OLTP) applications we have been discussing.
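To make the contrast concrete, the following sketch places a checkout transaction (OLTP) next to a read-only summary query of the kind a decision support application might issue. It is an illustration under stated assumptions, not the book's design: the INVENTORY, CASH_DRAWER, and SALES tables, the prices (in cents), the store number, and both function names are hypothetical, and everything runs against one small in-memory SQLite database through Python's standard sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE INVENTORY (Item TEXT PRIMARY KEY, Price INTEGER, Quantity INTEGER);
    CREATE TABLE CASH_DRAWER (DrawerId INTEGER PRIMARY KEY, Balance INTEGER);
    CREATE TABLE SALES (Store INTEGER, Item TEXT, Quantity INTEGER, Amount INTEGER);
    INSERT INTO INVENTORY VALUES ('soup', 120, 40), ('crackers', 350, 15);
    INSERT INTO CASH_DRAWER VALUES (1, 10000);
""")

def checkout(conn, store, drawer_id, purchases):
    """OLTP: one short transaction per purchase.

    purchases is a list of (item, quantity) pairs; returns the total in cents.
    """
    total = 0
    with conn:  # all changes commit together, or all are rolled back
        for item, qty in purchases:
            row = conn.execute(
                "SELECT Price, Quantity FROM INVENTORY WHERE Item = ?", (item,)
            ).fetchone()
            if row is None or row[1] < qty:
                raise ValueError(f"cannot sell {qty} of {item}")
            amount = row[0] * qty
            conn.execute(
                "UPDATE INVENTORY SET Quantity = Quantity - ? WHERE Item = ?",
                (qty, item),
            )
            conn.execute(
                "INSERT INTO SALES VALUES (?, ?, ?, ?)", (store, item, qty, amount)
            )
            total += amount
        conn.execute(
            "UPDATE CASH_DRAWER SET Balance = Balance + ? WHERE DrawerId = ?",
            (total, drawer_id),
        )
    return total

def sales_by_item(conn):
    """OLAP-style query: a read-only aggregation over accumulated sales."""
    return conn.execute(
        "SELECT Item, SUM(Quantity), SUM(Amount) FROM SALES "
        "GROUP BY Item ORDER BY Item"
    ).fetchall()

total = checkout(conn, 27, 1, [("soup", 3), ("crackers", 1)])
print(total)                # 710, i.e., 3 * 120 + 350
print(sales_by_item(conn))  # [('crackers', 1, 350), ('soup', 3, 360)]
```

The checkout function touches a few rows and must finish quickly; the summary query reads many rows and changes nothing, which is why a chain might run such queries against a separate copy of the data.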
In some decision support applications, the queries are so simple they can be implemented as transactions in the same local database used for OLTP applications, for example, "Print out a report of the weekly produce sales in Store 27 for the past six months."

In many applications, however, the queries are quite complex and cannot be efficiently executed against the local databases. They take too long to execute (because the database has been optimized for OLTP transactions) and cause the local transactions (for example, the checkout transactions) to execute too slowly.

The supermarket chain therefore maintains a separate database specifically for such complex OLAP queries. The database contains historical information about sales and inventory from all its branches for the past 10 years. This information is extracted from the individual store databases at various times and updated once a day. Such a database is called a data warehouse.

A manager can enter a complex query about the data in the data warehouse, for example, "During the winter months of the last five years, what is the percentage of customers in northeast urban supermarkets who bought crackers at the same time they bought soup?" (Perhaps these items should be placed near each other on the shelves.)

Data warehouses can contain terabytes (10^12 bytes) of data and require special hardware to maintain that data. An OLAP query might be quite difficult to formulate and might require query language concepts more powerful than those needed for OLTP queries. OLAP queries usually do not have severe constraints on execution time and might take several hours to execute. The warehouse database might have been structured to speed up the execution of such queries. The database need be updated only periodically because minute-by-minute correctness is not needed for
the types of queries it supports; satisfactory responses might be obtained even if the database is less than 100% accurate.

Data mining. A manager might also be interested in making a much less structured query about the data in the warehouse database, for example, "Are there any interesting combinations of items bought by customers?" Such queries are called data mining. In contrast with OLAP, in which requests are made to obtain specific information, data mining can be viewed as knowledge discovery: an attempt to extract new knowledge from the data stored in the database.

Data mining queries can be extremely difficult to formulate and might require sophisticated mathematics or techniques from the field of artificial intelligence. A query might require many hours to execute and might involve several interactions with the manager for obtaining additional information or reformulating parts of the query.

One widely repeated but perhaps apocryphal success story of data mining is that a convenience store chain used the above query ("Are there any interesting combinations ...") and found an unexpected correlation. In the early evenings, a high percentage of male customers who bought diapers also bought beer; presumably these customers were fathers who were going to stay home that night with their babies.

2 The Big Picture

2.1 Case Study: A Student Registration System

Your university is interested in implementing a student registration system so that students can register for courses from their home PCs. You have been asked to build a prototype of that system as a project in this course. The registrar has prepared the following preliminary Statement of Objectives for the system.
The objectives of the Student Registration System are to allow students and faculty (as appropriate) to

1. Authenticate themselves as users of the system
2. Register and deregister for courses (offered for the next semester)
3. Obtain reports on a particular student's status
4. Maintain information about students and courses
5. Enter final grades for courses that a student has completed

This brief description is typical of what might be supplied as a starting point for a system implementation project, but it is not specific or detailed enough to serve as the basis for the project's design and coding phases.

We will be developing the student registration scenario throughout this book and will be using it to illustrate the various concepts in databases and transaction processing. Our next step is to meet with the registrar, faculty, and students to expand this brief description into a formal Requirements Document for the system. We will discuss the Requirements Document in Chapter 14, which we expect you to read at appropriate times as you proceed through the rest of the book. In this chapter, we will take a closer look at some of the underlying concepts of databases and transaction processing that are needed for that system.

The following sections provide a brief overview of these concepts. Although we will revisit these concepts in a more detailed fashion in subsequent chapters, an overview will help you see the big picture and will set the stage for better understanding of the following chapters.

2.2 Introduction to Relational Databases

A database is at the heart of most transaction processing systems. At every instant of time, the database must contain an accurate description (often the only one) of the real-world enterprise the transaction processing system is modeling. For example, in the Student Registration System the database is the only source of information about which students have registered for each course.

Relations and tuples.
We are particularly interested in databases that use the relational model [Codd 1970, 1990], in which data is stored in tables. The Student Registration System, for example, might include the STUDENT table, shown in Figure 2.1. A table contains a set of rows. In the figure, each row contains information about one student. Each column of the table describes the student in a particular way. In the example, the columns are Id, Name, Address, and Status. Each column has an associated type, called its domain, from which the value in a particular row for that column is drawn. For example, the domain for Id is integer and the domain for Name is string.

This database model is called "relational" because it is based on the mathematical concept of a relation. A mathematical relation captures the notion that elements of different sets are related to one another. For example, John Doe, an element of the set of all humans, is related to 123 Main St., an element of the set of all addresses, and to 111111111, an element of the set of all Ids. A relation is a set of tuples. Following the example of the table STUDENT, we might define a relation called STUDENT containing the tuple (111111111, John Doe, 123 Main St., Freshman). The STUDENT relation presumably contains a tuple describing every student.

We can view a relation as a predicate. A predicate is a declarative statement that is either true or false depending on the values of its arguments. For example, the predicate "It rained in Detroit on date X" is either true or false depending on the value chosen for the argument X. When we view a relation as a predicate, the arguments of the predicate correspond to the elements of a tuple, and the predicate is defined to be true for arguments a_1, ..., a_n exactly when the tuple (a_1, ..., a_n) is in the relation. For instance, we might define the predicate STUDENT with arguments Id, Name, Address, and Status. Then we can say that the predicate

    STUDENT(111111111, John Doe, 123 Main St., Freshman)

is true, because the tuple (111111111, John Doe, 123 Main St., Freshman) is in the table STUDENT shown in Figure 2.1.

    Id        | Name          | Address        | Status
    111111111 | John Doe      | 123 Main St.   | Freshman
    666666666 | Joseph Public | 666 Hollow Rd. |
    111223344 | Mary Smith    | 1 Lake St.     |
    987654321 | Bart Simpson  | Fox 5 TV       | Senior
    023456789 | Homer Simpson | Fox 5 TV       | Senior
    123454321 | Joe Blow      | 6 Yard Ct.     |

FIGURE 2.1 The table STUDENT. Each row describes a single student.

The correspondence between tables and relations should now be clear: the tuples of a relation correspond to the rows of a table, and the column names of a table are the names of the attributes of the relation. Thus, the rows of the STUDENT table can be viewed as enumerating the set of all 4-tuples (tuples with four attributes of the appropriate types) that satisfy the STUDENT relation (i.e., the Id, Name, Address, and Status of a student).

Operations on tables are mathematically defined. In real applications, tables can become quite large: a STUDENT table for our university would contain over 15 thousand rows, and each row would likely contain much more information about each student than is shown here. In addition to the STUDENT table, the complete database for the Student Registration System at our university would contain a number of other tables, each with a large number of rows, containing information about other aspects of student registration. For example, a TRANSCRIPT table might contain a row for each course that every student has ever taken. Hence, the databases for most applications contain a large amount of information and are generally held in mass storage.

In most applications, the database is under the control of a database management system (DBMS), which is supplied by a commercial vendor. When an application wants to perform an operation on the database, it does so by making a request to the DBMS.
A typical operation might extract some information from the rows of one or more tables, modify some rows, or add or delete rows. For example, when a new student is admitted to the university, a row is added to the STUDENT table.

In addition to the fact that tables in the database can be modeled by mathematical relations, operations on the tables can also be modeled as mathematical operations on the corresponding relations. Thus, a particular unary operation might take a table, T, as an argument and produce a result table containing a subset of the rows of T. For example, an instructor might want to display the roster of students registered for a course. Such a request might involve scanning the TRANSCRIPT table, locating the rows corresponding to the course, and returning them to the application. A particular binary operation might take two tables as arguments and construct a new table containing the union of the rows of the argument tables. A complex query against a database might be equivalent to an expression involving many such relational operations involving many tables.

Because of this mathematical description, relational operations can be precisely defined and their mathematical properties, such as commutativity and associativity, can be proven. As we shall see, this mathematical description has important practical implications. Commercial DBMSs contain a query optimizer module that converts queries into expressions involving relational operations and then uses these mathematical properties to simplify those expressions and thus optimize query execution.

SQL: Basic SELECT statement. An application describes the access that it wants the DBMS to perform on its behalf in a language supported by the DBMS. We are particularly interested in SQL, the most commonly used database language, which provides facilities for accessing a relational database and is supported by almost all commercial DBMSs.
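Before turning to SQL itself, the view of a relation as a set of tuples, and of table operations as operations on such sets, can be sketched in a few lines of Python. The names holds, select, and union are hypothetical helpers written for this illustration, only three of the STUDENT rows are included, and the Senior status of the two Simpson rows is an assumption based on the query results shown later in the chapter.

```python
from typing import Callable, FrozenSet, Tuple

Row = Tuple[str, str, str, str]   # (Id, Name, Address, Status)
Relation = FrozenSet[Row]

# A relation is a set of tuples; three sample rows (statuses partly assumed).
STUDENT: Relation = frozenset({
    ("111111111", "John Doe", "123 Main St.", "Freshman"),
    ("987654321", "Bart Simpson", "Fox 5 TV", "Senior"),
    ("023456789", "Homer Simpson", "Fox 5 TV", "Senior"),
})

def holds(relation: Relation, *args: str) -> bool:
    """Predicate view: true exactly when the tuple of arguments is in the relation."""
    return tuple(args) in relation

def select(relation: Relation, condition: Callable[[Row], bool]) -> Relation:
    """Unary operation: the subset of rows that satisfy the condition."""
    return frozenset(row for row in relation if condition(row))

def union(r: Relation, s: Relation) -> Relation:
    """Binary operation: the rows that occur in either argument."""
    return r | s

def is_senior(row: Row) -> bool:
    return row[3] == "Senior"

def is_freshman(row: Row) -> bool:
    return row[3] == "Freshman"

def lives_at_fox(row: Row) -> bool:
    return row[2] == "Fox 5 TV"

seniors = select(STUDENT, is_senior)
freshmen = select(STUDENT, is_freshman)

print(holds(STUDENT, "111111111", "John Doe", "123 Main St.", "Freshman"))  # True
# Union is commutative: the order of the arguments does not matter.
print(union(seniors, freshmen) == union(freshmen, seniors))                 # True
# Two selections can be applied in either order with the same result.
print(select(select(STUDENT, is_senior), lives_at_fox)
      == select(select(STUDENT, lives_at_fox), is_senior))                  # True
```

Equalities like the last two are the kind of property a query optimizer can rely on when it rearranges an expression into a cheaper but equivalent one.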
The basic structure of the SQL statements for manipulating data is straightforward and easy to understand. Each statement takes one or more tables as arguments and produces a table as a result. For example, to find the name of the student whose Id is 987654321, we might use the statement

    SELECT Name
    FROM   STUDENT                          (2.1)
    WHERE  Id = '987654321'

More precisely, this statement asks the DBMS to extract from the table named in the FROM clause (that is, the table STUDENT) all rows satisfying the condition in the WHERE clause (that is, all rows whose Id column has value 987654321) and then from each such row to delete all columns except those named in the SELECT clause (that is, Name). The resulting rows are placed in a result table produced by the statement. In this case, because Ids are unique, at most one row of STUDENT can satisfy the condition, and so the result of the statement is a table with one column and at most one row.

Thus, the FROM clause identifies the table to be used as input, the WHERE clause identifies the rows of that table from which the answer is to be generated, and the SELECT clause identifies the columns of those rows that are to be output in the result table. The result table generated by this example contains only one column and at most one row. As a somewhat more complex example, the statement

    SELECT Id, Name
    FROM   STUDENT                          (2.2)
    WHERE  Status = 'senior'

returns a result table (shown in Figure 2.2) containing two columns and multiple rows: the Ids and names of all seniors. If we want to produce a table containing all the columns of STUDENT but describing only seniors, we use the statement

    SELECT *
    FROM   STUDENT
    WHERE  Status = 'senior'

987654321 | Bart Simpson
023456789 | Homer Simpson

FIGURE 2.2 The database table returned by the SQL SELECT statement (2.2).

The asterisk is simply shorthand that allows us to avoid listing the names of all the columns of STUDENT.
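The three SELECT statements above can be tried against a small in-memory database. The following is a sketch using Python's standard sqlite3 module, under stated assumptions: SQLite stands in for the DBMS, Id is declared as text so that the leading zero survives and the quoted literals match, Status values are stored in lowercase to match the literals in the queries, and an ORDER BY clause is added only to make the row order predictable.

```python
import sqlite3

# Assumption: an in-memory SQLite database plays the role of the DBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE STUDENT (Id TEXT PRIMARY KEY, Name TEXT, Address TEXT, Status TEXT)"
)
conn.executemany(
    "INSERT INTO STUDENT VALUES (?, ?, ?, ?)",
    [
        ("111111111", "John Doe", "123 Main St.", "freshman"),
        ("987654321", "Bart Simpson", "Fox 5 TV", "senior"),
        ("023456789", "Homer Simpson", "Fox 5 TV", "senior"),
    ],
)

# Statement (2.1): one column, at most one row, because Ids are unique.
q21 = conn.execute("SELECT Name FROM STUDENT WHERE Id = '987654321'").fetchall()

# Statement (2.2): two columns, one row per senior.
q22 = conn.execute(
    "SELECT Id, Name FROM STUDENT WHERE Status = 'senior' ORDER BY Id"
).fetchall()

# The asterisk form: every column of STUDENT, seniors only.
q_all = conn.execute("SELECT * FROM STUDENT WHERE Status = 'senior'").fetchall()

print(q21)
print(q22)
print(len(q_all), "rows with", len(q_all[0]), "columns each")
```

Each result comes back as a list of row tuples, which mirrors the point made in the text that every statement takes tables as input and produces a table as output.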
In some situations the user is interested not in outputting a result table but in information about the result table. An example is the statement

    SELECT COUNT(*)
    FROM   STUDENT
    WHERE  Status = 'senior'

which returns the number of rows in the result table (i.e., the number of seniors). COUNT is referred to as an aggregate function because it produces a value that is a function of all the rows in the result table. Note that when an aggregate is used, the SELECT statement produces a single value instead of a table.

The WHERE clause is the most interesting component of the SELECT statement; it contains a general condition that is evaluated over each row of the table named in the FROM clause. Column values from the row are substituted into the condition, yielding an expression that has either a true or a false value. If the condition evaluates to true, the row is retained for processing by the SELECT clause and then stored in the result table. Hence, the WHERE clause acts as a filter.

Conditions can be much more complex than we have seen so far. A condition can be a Boolean combination of terms. If we want the result table to contain information describing seniors whose Ids are in a particular range, for example, we might use

    WHERE Status = 'senior' AND Id > '888888888'

OR and NOT can also be used. Furthermore, a number of predicates are provided in the language for expressing particular relationships. For example, the IN predicate tests set membership.

    WHERE Status IN ('freshman', 'sophomore')

Additional aggregates and predicates and the full complexity of the WHERE clause are discussed in Chapter 5.

Multi-table SELECT statements. The result table can contain information extracted from several base tables.
Thus, if we have a table TRANSCRIPT with columns StudId, CrsCode, Semester, and Grade, the statement

    SELECT Name, CrsCode, Grade
    FROM   STUDENT, TRANSCRIPT
    WHERE  StudId = Id AND Status = 'senior'

can be used to form a result table in which each row contains the name of a senior, a particular course she took, and the grade she received.

The first thing to note is that the attribute values in the result table come from different base tables: Name comes from STUDENT; CrsCode and Grade come from TRANSCRIPT. As in the previous examples, the FROM clause produces a table whose rows are input to the WHERE clause. In this case the table is the Cartesian product of the tables listed in the FROM clause: a row of this table is the concatenation of a row of STUDENT and a row of TRANSCRIPT. Many of these rows make no sense. For example, Bart Simpson's row in STUDENT is not related to a row in TRANSCRIPT describing a course that Bart did not take. The first conjunct of the WHERE clause ensures that the rows of TRANSCRIPT for a particular student are associated with the appropriate row of STUDENT by matching the Id values of the rows of the two tables. For example, if TRANSCRIPT has a row (987654321, CS305, F1995, C), it will match only Bart Simpson's row in STUDENT, producing the row (Bart Simpson, CS305, C) in the result table.

Query optimization. One very important feature of SQL is that the programmer does not have to specify the algorithm the DBMS should use to satisfy a particular query. For example, tables are frequently defined to include auxiliary data structures, called indices, which make it possible to locate particular rows without using lengthy searches through the entire table. Thus, an index on the Id column of the STUDENT table might contain a list of pairs (Id, pointer) where the pointer points to the row of the table containing the corresponding Id. If such an index were present, the DBMS would automatically use it to find the row that satisfies the query (2.1).
If the table also had an index on the column Status, the DBMS would use that index to find the rows that satisfy the query (2.2). If this second index did not exist, the DBMS would automatically use some other method to satisfy (2.2). For example, it might look at every row in the table in order to locate all rows having the value senior in the Status column. The programmer does not specify what method to use, just the condition the desired result table must satisfy. In addition to selecting appropriate indices to use, the query optimizer uses the properties of the relational operations to further improve the efficiency with which a query can be processed, again without any intervention by the programmer. Nevertheless, programmers should have some understanding of the strategies the DBMS uses to satisfy queries so they can design the database tables, indices, and SQL statements in such a way that they will be executed in an efficient manner consistent with the requirements of the application.

Changing the contents of tables. The following examples illustrate the SQL statements for modifying the contents of a table. The statement

    UPDATE STUDENT
    SET    Status = 'sophomore'
    WHERE  Id = '111111111'

updates the STUDENT table to make John Doe a sophomore. The statement

    INSERT INTO STUDENT (Id, Name, Address, Status)
    VALUES ('999999999', 'Winston Churchill', '10 Downing St', 'senior')

inserts a new row for Winston Churchill in the STUDENT table. The statement

    DELETE FROM STUDENT
    WHERE  Id = '111111111'

deletes the row for John Doe from the STUDENT table. Again, the details of how these operations are to be performed need not be specified by the programmer.

Creating tables and specifying constraints. Before you can store data in a table, the table structure must be created.
For instance, the STUDENT table could have been created with the SQL statement

    CREATE TABLE STUDENT (
        Id       INTEGER,
        Name     CHAR(20),
        Address  CHAR(60),                  (2.3)
        Status   CHAR(10),
        PRIMARY KEY (Id) )

where we have declared the name of each column and the domain (type) of the data that can be stored in that column. We have also declared the Id column to be a primary key to the table, which means that each row of the table must have a unique value in that column and the DBMS will (most probably) automatically construct an index on that column. The DBMS will enforce this uniqueness constraint by not allowing any INSERT or UPDATE statement to produce a row with a value in the Id column that duplicates a value of Id in another row. This requirement is an example of an integrity constraint (sometimes called a consistency constraint): an application-based restriction on the values that can appear as entries in the database. We discuss integrity constraints in more detail in the next section.

We have given simple examples of each statement type to highlight the conceptual simplicity of the basic ideas underlying SQL, but be aware that the complete language has many subtleties. Each statement type has a large number of options that allow very complex queries and updates. For this reason, mastery of SQL requires significant effort. We continue our discussion of relational databases and SQL in Chapter 3.

2.3 What Makes a Program a Transaction: The ACID Properties

In many applications, a database is used to model the state of some real-world enterprise. In such applications, a transaction is a program that interacts with that database so as to maintain the correspondence between the state of the enterprise and the state of the database. In particular, a transaction might update the database to reflect the occurrence of a real-world event that affects the enterprise state. An example is a deposit transaction at a bank.
The event is that the customer gives the teller the cash and a deposit slip. The transaction updates the customer's account information in the database to reflect the deposit.

Transactions, however, are not just ordinary programs. Requirements are placed on them, particularly on the way they are executed, that go beyond what is normally expected of regular programs. These requirements are enforced by the DBMS and the TP monitor.

Consistency. A transaction must access and update the database in such a way that it preserves all database integrity constraints. Every real-world enterprise is organized in accordance with certain rules that restrict the possible states of the enterprise. For example, the number of students registered for a course cannot exceed the number of seats in the room assigned to the course. When such a rule exists, the possible states of the database are similarly restricted.

The restrictions are stated as integrity constraints. The integrity constraint corresponding to the above rule asserts that the value of the database item that records the number of course registrants must not exceed the value of the item that records the room size. Thus, when the registration transaction completes, the database must satisfy this integrity constraint (assuming that the constraint was satisfied when the transaction started).

Although we have not yet designed the database for the Student Registration System, we can make some assumptions about the data that will be stored and postulate some additional integrity constraints:

■ IC0. The database contains the Id of each student. These Ids must be unique.
■ IC1. The database contains a list of prerequisites for each course and, for each student, a list of completed courses. A student cannot register for a course without having taken all prerequisite courses.
■ IC2.
The database contains the maximum number of students allowed to take each course and the number of students who are currently registered for each course. The number of students registered for each course cannot be greater than the maximum number allowed for that course.
■ IC3. It might be possible to determine the number of students registered for (or enrolled in) a particular course from the database in two ways: the number is stored as a count in the information describing the course, and it can be calculated from the information describing each student by counting the number of student records that indicate that the student is registered for (or enrolled in) the course. These two determinations must yield the same result.

In addition to maintaining the integrity constraints, each transaction must update the database in such a way that the new database state reflects the state of the real-world enterprise that it models. If John Doe registers for CS305, but the registration transaction records Mary Smith as the new student in the class, the integrity constraints will be satisfied but the new state will be incorrect. Hence, consistency has two dimensions.

Consistency. The transaction designer can assume that when execution of the transaction is initiated, the database is in a state in which all integrity constraints are satisfied and, in addition, the database correctly models the current state of the enterprise. The designer has the responsibility of ensuring that when execution has completed, the database is once again in a state in which all integrity constraints are satisfied and, in addition, that the new state reflects the transformation described in the transaction's specification (in other words, that the database still correctly models the state of the enterprise).

SQL provides some support for the transaction designer in maintaining consistency.
When the database is being designed, the database designer can specify certain types of integrity constraints and include them within the statements that declare the format of the various tables in the database. The primary key constraint of the SQL statement (2.3) is an example of this. Later, as each transaction is executed, the DBMS automatically checks that each specified constraint is not violated and prevents completion of any transaction that would cause a constraint violation.

Atomicity. In addition to the transaction designer's responsibility for consistency, the TP monitor must provide certain guarantees concerning the manner in which transactions are executed. One such condition is atomicity.

Atomicity. The system must ensure that the transaction either runs to completion or, if it does not complete, has no effect at all (as if it had never been started).

In the Student Registration System, either a student has registered for a course or he has not registered for a course. Partial registration makes no sense and might leave the database in an inconsistent state. For example, as indicated by constraint IC3, two items of information in the database must be updated when a student registers. If a registration transaction were to have a partial execution in which one update completed but the system crashed before the second update could be executed, the resulting database would be inconsistent.

When a transaction has successfully completed, we say that it has committed. If the transaction does not successfully complete, we say that it has aborted and the TP monitor has the responsibility of ensuring that whatever partial changes the transaction has made to the database are undone, or rolled back. Atomic execution means that every transaction either commits or aborts.

Notice that ordinary programs do not necessarily have the property of atomicity.
For example, if the system were to crash while a program that was updating a file was executing, the file could be left in a partially updated state when the system recovered.

Durability. A second requirement of the transaction processing system is that it does not lose information.

Durability. The system must ensure that once the transaction commits, its effects remain in the database even if the computer, or the medium on which the database is stored, subsequently crashes.

For example, if you successfully register for a course, you expect the system to remember that you are registered even if it later crashes. Notice that ordinary programs do not necessarily have the property of durability either. For example, if media failure occurs after a program that has updated a file has completed, the file might be restored to a state that does not include the update.

Isolation. In discussing consistency, we concentrated on the effect of a single transaction. We next examine the effect of executing a set of transactions. We say that a set of transactions is executed sequentially, or serially, if one transaction in the set is executed to completion before another is started. The good news about serial execution is that if all transactions are consistent and the database is initially in a consistent state, serial execution maintains consistency. When the first transaction in the set starts, the database is in a consistent state and, since the transaction is consistent, the database will be consistent when the transaction completes. Because the database is consistent when the second transaction starts, it too will perform correctly and the argument will repeat.

Serial execution is adequate for applications that have modest performance requirements. However, many applications have strict requirements on response time and throughput, and often the only way to meet the requirements is to process transactions concurrently.
Modern computing systems are capable of servicing more than one transaction simultaneously, and we refer to this mode of execution as concurrent. Concurrent execution is appropriate in a transaction processing system serving many users. In this case, there will be many active, partially completed transactions at any given time.

FIGURE 2.3 The database operations output by two transactions in a concurrent schedule might be interleaved in time. (Note that the figure should be interpreted as meaning that op1,1 arrives first at the DBMS, followed by op2,1, etc.)

In concurrent execution, the database operations of different transactions are effectively interleaved in time, a situation shown in Figure 2.3. Transaction T1 alternately computes using its local variables and sends requests to the database system to transfer data between the database and its local variables. The requests are made in the sequence op1,1, op1,2. We refer to that sequence as a transaction schedule. T2 performs its computation in a similar way. Because the execution of the two transactions is not synchronized, the sequence of operations arriving at the database, called a schedule, is an arbitrary merge of the two transaction schedules. The schedule in the figure is op1,1, op2,1, op2,2, op1,2.

When transactions are executed concurrently, the consistency of each transaction is not sufficient to guarantee that the database that exists after both have completed correctly reflects the state of the enterprise. For example, suppose that T1 and T2 are two instances of the registration transaction invoked by two students who want to register for the same course.
A possible schedule of these transactions is shown in Figure 2.4, where time progresses from left to right and the notation r(cur_reg : n) means that a transaction has read the database object cur_reg, which records the number of current registrants, and the value n has been returned. A similar notation is used for w(cur_reg : n). The figure shows only the accesses¹ to cur_reg.

    T1:  r(cur_reg : 29)                     w(cur_reg : 30)
    T2:                   r(cur_reg : 29)                     w(cur_reg : 30)

FIGURE 2.4 A schedule in which two registration transactions are not isolated from each other.

Assume that the maximum number of students allowed to register is 30 and the current number is 29. In its first step, each of the two transactions will read this value and store it in its local variable, and both will decide that there is room in the course. In its second step, each will increment its private copy of the number of current registrants; hence, both will calculate the value 30. In their write operations, both will write that same value, 30, into cur_reg.

Both transactions complete successfully, but the number of current registrants is incorrectly recorded as 30 when it is actually 31 (even though the maximum allowable number is 30). This is an example of what is often referred to as a lost update because one of the increments has been lost. The resulting database does not reflect the real-world state, and integrity constraint IC2 has been violated. By contrast, if the transactions had executed sequentially, T1 would have completed before T2 was allowed to start. Hence, T2 would find the course full and would not register the student.

As this example demonstrates, we must specify some restriction on concurrent execution that is guaranteed to maintain the consistency of the database and the correspondence between the enterprise state and the database state. One such restriction that is obviously sufficient follows.

Isolation.
Even though transactions are executed concurrently, the overall effect of the schedule must be the same as if the transactions had executed serially in some order.

It should be evident that if the transactions are consistent and if the overall effect of a concurrent schedule is the same as that of some serial schedule, the concurrent schedule will maintain consistency. Concurrent schedules that satisfy this condition are called serializable.

As was the case with atomicity and durability, ordinary programs do not necessarily have the property of isolation. For example, if programs that update a common set of files are executed concurrently, updates might be interleaved and produce an outcome that is quite different from that obtained if they had been executed in any serial order. That result might be totally unacceptable.

ACID properties. The features that distinguish transactions from ordinary programs are frequently referred to by the acronym ACID [Haerder and Reuter 1983]:

■ Atomic. Each transaction is executed completely or not at all.
■ Consistent. Each transaction maintains database consistency.
■ Isolated. The concurrent execution of a set of transactions has the same effect as some serial execution of that set.
■ Durable. The effects of committed transactions are permanently recorded in the database.

¹ In a relational database, r and w represent SELECT and UPDATE statements.

When a transaction processing system supports the ACID properties, the database maintains a consistent and up-to-date model of the real world and the transactions supply responses to users that are always correct and up to date.

BIBLIOGRAPHIC NOTES

The relational model for databases was introduced in [Codd 1970, 1990]. The SQL language is described by the various SQL standards, such as [SQL 1992]. The term "ACID" was coined by [Haerder and Reuter 1983], but the individual components of ACID were introduced in earlier papers, for example, [Gray et al.
1976] and [Eswaran et al. 1976].

EXERCISES

2.1 Given the relation MARRIED that consists of tuples of the form (a, b), where a is the husband and b is the wife, the relation BROTHER that has tuples of the form (c, d), where c is the brother of d, and the relation SIBLING, which has tuples of the form (e, f), where e and f are siblings, describe how you would define the relation BROTHER-IN-LAW, where tuples have the form (x, y) with x being the brother-in-law of y.

2.2 Design the following two tables (in addition to that in Figure 2.1) that might be used in the Student Registration System. Note that the same student Id might appear in many rows of each of these tables.
    a. A table implementing the relation COURSESREGISTEREDFOR, relating a student's Id and the identifying numbers of the courses for which she is registered
    b. A table implementing the relation COURSESTAKEN, relating a student's Id, the identifying numbers of the courses he has taken, and the grade received in each course
    Specify the predicate corresponding to each of these tables.

2.3 Write an SQL statement that
    a. Returns the Ids of all seniors in the table STUDENT
    b. Deletes all seniors from STUDENT
    c. Promotes all juniors in the table STUDENT to seniors

2.4 Write an SQL statement that creates the TRANSCRIPT table.

2.5 Using the TRANSCRIPT table, write an SQL statement that
    a. Deregisters the student with Id = 123456789 from the course CS305 for the fall of 2001
    b. Changes to an A the grade assigned to the student with Id = 123456789 for the course CS305 taken in the fall of 2000
    c. Returns the Id of all students who took CS305 in the fall of 2000

2.6 Write an SQL statement that returns the names (not the Ids) of all students who received an A in CS305 in the fall of 2000.

2.7 State whether or not each of the following statements could be an integrity constraint of a checking account database for a banking application. Give reasons for your answers.
    a.
The value stored in the balance column of an account is greater than or equal to $0.
    b. The value stored in the balance column of an account is greater than it was last week at this time.
    c. The value stored in the balance column of an account is $128.32.
    d. The value stored in the balance column of an account is a decimal number with two digits following the decimal point.
    e. The social_security_number column of an account is defined and contains a nine-digit number.
    f. The value stored in the check_credit_in_use column of an account is less than or equal to the value stored in the total_approved_check_credit column. (These columns have their obvious meanings.)

2.8 State five integrity constraints, other than those given in the text, for the database in the Student Registration System.

2.9 Give an example in the Student Registration System where the database satisfies the integrity constraints IC0-IC3 but its state does not reflect the state of the real world.

2.10 State five (possible) integrity constraints for the database in an airline reservation system.

2.11 A reservation transaction in an airline reservation system makes a reservation on a flight, reserves a seat on the plane, issues a ticket, and debits the appropriate credit card account. Assume that one of the integrity constraints of the reservation database is that the number of reservations on each flight does not exceed the number of seats on the plane. (Of course, many airlines purposely over-book and so do not use this integrity constraint.) Explain how transactions running on this system might violate
    a. Atomicity
    b. Consistency
    c. Isolation
    d. Durability

2.12 Describe informally in what ways the following events differ from or are similar to transactions with respect to atomicity and durability.
    a. A telephone call from a pay phone (Consider line busy, no answer, and wrong number situations. When does this transaction "commit?")
    b.
A wedding ceremony (Suppose that the groom refuses to say "I do." When does this transaction "commit?")
    c. The purchase of a house (Suppose that, after a purchase agreement is signed, the buyer is unable to obtain a mortgage. Suppose that the buyer backs out during the closing. Suppose that two years later the buyer does not make the mortgage payments and the bank forecloses.)
    d. A baseball game (Suppose that it rains.)

2.13 Assume that, in addition to storing the grade a student has received in every course he has completed, the system stores the student's cumulative GPA. Describe an integrity constraint that relates this information. Describe how the constraint would be violated if the transaction that records a new grade were not atomic.

2.14 Explain how a lost update could occur if, under the circumstances of the previous problem, two transactions that were recording grades for a particular student (in different courses) were run concurrently.

Now we are ready to begin a more in-depth study of databases.

In Chapter 3, we will discuss how data items are specified in modern database management systems and how they appear to the transactions that use them. In other words, we will learn a few things about data models and data definition languages.

In Chapter 4, we will study conceptual database design, which includes methodologies for organizing data around a set of high-level concepts.

In Chapter 5, we will discuss how transactions access and modify data in a DBMS using data manipulation and query languages, in particular SQL.

In Chapter 6, we will resume the design theme and will talk about the Relational Normalization Theory. This theory provides algorithms and objective measures for improving the quality of database design.

Chapter 7 introduces the mechanism of triggers, a powerful device for maintaining the consistency of databases and for enabling databases to react to external events.
Chapter 8 concludes this part of the book with a discussion of how SQL statements can be executed from within a host language, such as C or Java.

The Relational Data Model

This chapter is an introduction to the relational data model. First we define its main abstract concepts, and then we show how these concepts are embodied in the concrete syntax of SQL. Specifically, this chapter covers the data definition subset of SQL, which is used to specify data structures, constraints, and authorization policies in databases.

3.1 What Is a Data Model?

Data independence. Ultimately, all data is recorded as bytes on a disk. However, as a programmer you know that working with data at this low level of abstraction is quite tedious. Few people are interested in how sectors, tracks, and cylinders are allocated for storing information. Most programmers much prefer to work with data stored in files, which is a more reasonable abstraction for many applications.

From a course on file structures, you might be familiar with a variety of methods for storing data in files. Sequential files are best for applications that access records in the order in which they are stored. Direct access (or random access) files are best when records are accessed in a more or less unpredictable order. Files might have indices, which are auxiliary data structures that enable applications to retrieve records based on the value of a search key. We will discuss various index types in Chapter 9. Files might also consist of fixed-length records or records that have variable lengths.

The details of how data is stored in files belong to the physical level of data modeling. This level is specified using a physical schema, which in the field of databases refers to the syntax that describes the structure of files and indices.

Early data-intensive applications worked directly with the physical schema instead of the higher levels of abstraction provided by a modern DBMS. This choice was made for a number of reasons.
First, commercial database systems were rare and costly. Second, computers were slow, and working directly with the file system offered a performance advantage. Third, most early applications were primitive by today's standards, and building a level of abstraction between those programs and the file system did not seem justified.

A serious drawback of this approach is that changes to the file format at the physical level could have costly repercussions for software maintenance. The "year 2000 problem" was a good example of such repercussions. In the 1960s and 1970s, it was common to write programs in which the data item representing the calendar year was hard-coded as a two-digit number. The rationale was that these programs would be replaced within fifteen to twenty years, so using four digits (or using a data abstraction for the DATE data type) was a waste of precious disk space. The result was that every routine that worked with dates expected to find the year in the two-digit format. Hence, any change to that format implied finding and changing code throughout the application. The consequence of these past decisions was the multibillion-dollar bill presented to the industry in the late 1990s for fixing outdated software.

If a data abstraction, DATE, had been used in those programs, the whole problem could have been avoided. Applications would have viewed years as four-digit numbers, even though they had been physically stored in the database as two-digit numbers. To adjust to the change of millennium, designers could have changed the underlying physical representation of years in the database to four-digit numbers by (1) building a simple program that converted the database by adding "1900" to every existing year field and (2) correspondingly changing the implementations of the appropriate functions within the DATE data type to access the new physical representation.
None of the existing applications would have had to be modified because they could still use the same DATE data abstraction.

When the underlying data structures are subject to change (even infrequently), basing the design of data-intensive applications on a bare file system becomes problematic. Even trivial changes, such as adding or deleting a field in a file, imply that every application that uses this file must be manually updated, recompiled, and retested. Less trivial changes, such as merging two fields or splitting a field into two, might impact the existing applications quite significantly. Accommodating such changes can be labor intensive and error prone. In addition, the data in the original file needs to be converted to the new representation, and without the appropriate tools such conversion can be costly.

Also, the file system offers too low a level of abstraction to support the development of an application that requires frequent and rapid implementation of new queries. For such applications, the conceptual level of data modeling becomes appropriate. The conceptual model hides the details of the physical data representation and instead describes data in terms of higher-level concepts that are closer to the way humans view it. For instance, the conceptual schema (the syntax used to describe the data at the conceptual level) could represent some of the information about students as

    STUDENT (Id: INT, Name: STRING, Address: STRING, Status: STRING)

While this schema might look similar to the way file records are represented, the important point is that the different pieces of information it describes might be physically stored in a different way than that described in the schema. Indeed, these pieces of information might not even reside in the same file (perhaps not even on the same computer!).
The possibility of having separate schemas at the physical and conceptual levels leads to the simple, yet powerful, idea of physical data independence. Instead of working directly with the file system, applications see only the conceptual schema. The DBMS maps data between the conceptual and physical levels automatically. If the physical representation changes, all that needs to be done is to change the mapping between the levels, and all applications that deal exclusively with the conceptual schema will continue to work with the new physical data structures.

The conceptual schema is not the last word in the game of data abstraction. The third level of abstraction is called the external schema (also known as the user or view abstraction level). The external schema is used to customize the conceptual schema to the needs of various classes of users, and it also plays a role in database security (as we will see later). The external schema looks and feels like a conceptual schema, and both are defined in essentially the same way in modern DBMSs. However, while there is a single conceptual schema per database, there might be several external schemas (i.e., views on the conceptual schema), usually one per user category.

For example, to generate proper student billing information, the bursar's office might need to know each student's GPA and status and the total number of credits the student has taken, but not the names of the courses and the grades received. Even though the GPA and total number of credits might not be stored in the database explicitly, the bursar's office can be presented with a view in which these items appear as regular fields (whose values are calculated at run time when the field is accessed), and all fields and relations that are irrelevant to billing are omitted. Similarly, an academic advisor does not need to know anything about billing, so much of this information can be omitted from the advisor's view of the registration system.
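The bursar's view described above can be sketched with an SQL view, driven here from Python's standard sqlite3 module. This is only an illustration under stated assumptions: the Transcript columns Credits and GradePoints, the course codes, and the view name BursarView are hypothetical and are not part of the schemas used in this chapter.

```python
import sqlite3

# In-memory database; all table, column, and view names below are
# illustrative assumptions, not the book's registration schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (Id INTEGER PRIMARY KEY, Name TEXT, Status TEXT);
    CREATE TABLE Transcript (
        StudId INTEGER, CrsCode TEXT, Credits INTEGER, GradePoints REAL);

    INSERT INTO Student VALUES (111111111, 'John Doe', 'Freshman');
    INSERT INTO Transcript VALUES (111111111, 'CS305', 3, 4.0);
    INSERT INTO Transcript VALUES (111111111, 'MAT123', 4, 3.0);

    -- External schema for the bursar: course names and individual grades
    -- are hidden; TotalCredits and GPA are computed when the view is read.
    CREATE VIEW BursarView AS
        SELECT s.Id, s.Status,
               SUM(t.Credits) AS TotalCredits,
               SUM(t.Credits * t.GradePoints) / SUM(t.Credits) AS GPA
        FROM Student s JOIN Transcript t ON t.StudId = s.Id
        GROUP BY s.Id, s.Status;
""")

# The application queries the view exactly as if it were a stored table.
rows = conn.execute(
    "SELECT Id, Status, TotalCredits, GPA FROM BursarView").fetchall()
print(rows)
```

Neither TotalCredits nor GPA is stored anywhere; both are derived each time the view is queried, which is the behavior the bursar example calls for.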
These ideas lead to the principle of conceptual data independence: Applications tailored to the needs of specific user groups can be designed to use the external schemas appropriate for these groups. The mapping between the external and conceptual schemas is the responsibility of the DBMS, so applications are insulated from changes in the conceptual schema as well as from changes in the physical schema. The overall picture is shown in Figure 3.1.

FIGURE 3.1 Levels of data independence.

Data model. A data model consists of a set of concepts and languages for describing

1. Conceptual and external schemas. A schema specifies the structure of the data stored in the database. Schemas are described using a data definition language (DDL).
2. Constraints. A constraint specifies a condition that the data items in the database must satisfy. A constraint specification sublanguage is usually part of the DDL.
3. Operations on data. Operations on database items are described using a data manipulation language (DML). The DML is usually the most important and interesting part of any data model because it is the set of operations that ultimately gives us the high-level data abstraction.

In addition, all commercial systems provide some kind of storage definition language (SDL), which allows the database designer to influence the physical schema (although most systems reserve the final say). The SDL is usually tightly integrated with the DDL. Changes in the physical schema that might occur if the database administrator introduces new SDL statements into a database do not affect the semantics of the applications because physical data independence shields the application from changes at the storage level. Hence, although the performance of an application might change, the results it produces do not.
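The last point can be observed directly. In the sketch below, a CREATE INDEX statement plays the role of an SDL statement (an assumption made for illustration; index creation syntax varies among vendors), and the same query is run before and after the index is added. The index name StudentStatusIdx is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (Id INTEGER, Name TEXT, Status TEXT)")
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [
        (111111111, "John Doe", "Freshman"),
        (987654321, "Bart Simpson", "Senior"),
        (123454321, "Joe Blow", "Junior"),
        (666666666, "Homer Simpson", "Senior"),
    ],
)

# The application's query mentions only the conceptual schema.
QUERY = "SELECT Name FROM Student WHERE Status = 'Senior' ORDER BY Name"

before = conn.execute(QUERY).fetchall()

# A change at the storage level: add an index on Status.
conn.execute("CREATE INDEX StudentStatusIdx ON Student (Status)")

after = conn.execute(QUERY).fetchall()

# The access path may differ, but the answer does not.
print(before == after)
```

The query text is untouched by the storage-level change, so the application needs no modification; only its running time could differ.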
In Sections 3.2 and 3.3, we describe the mother of all data models used by commercial DBMSs, the relational model, and the lingua franca these DBMSs speak, Structured Query Language (SQL). Be aware, however, that despite its name SQL is not just a query language; it is an amalgamation of a DML, a DDL, and an SDL: three for the price of one!

3.2 The Relational Model

The relational data model was proposed in 1970 by E. F. Codd and was considered a major breakthrough at the time. In fact, database research and development in the 1970s and 1980s was largely shaped by the ideas presented in Codd's original work [Codd 1970, 1990]. Even today, most commercial DBMSs are based on the relational model, although they are beginning to acquire object-oriented features, especially due to the increased use of XML-based data.

The main attraction of the relational model is that it is built around a simple and natural mathematical structure: the relation (or table). Relations have a set of powerful, high-level operators, and data manipulation languages are deeply rooted in mathematical logic. This solid mathematical background means that relational expressions (i.e., queries) can be analyzed. Hence, any expression can potentially be transformed (by the DBMS itself) into another, equivalent, expression that can be executed more efficiently, in a process called query optimization. Thus, application programmers need not study the nitty-gritty details of the internals of each database and need not be aware of how query evaluators work. The application programmer can formulate a query in a simple and natural way and leave it to the query optimizer to find an equivalent query that is more efficient to execute.

Nevertheless, query optimizers have limitations that can result in performance penalties for certain classes of complex queries. It is therefore important for both programmers and database designers to understand the heuristics they use.
With this knowledge, programmers can formulate queries that the DBMS can optimize more easily, and database designers can speed up the evaluation of important queries by adding appropriate indices and using other design techniques.

3.2.1 Basic Concepts

The central construct in the relational model is the relation. A relation is two things in one: a schema and an instance of that schema.

Relation instance. A relation instance is nothing more than a table with rows and named columns. When no confusion arises, we refer to relation instances as just “relations.” The rows in a relation are called tuples; they are similar to records in a file, but unlike file records all tuples have the same number of columns (this number is called the arity of the relation), and no two tuples in a relation instance can be the same. In other words, a relational instance is a set of unique tuples. The cardinality of a relation instance is the number of tuples in it.

Figure 3.2 shows one possible instance for the STUDENT relation. The columns in this relation are named, which is the usual convention in the relational model. These named columns are also known as attributes. Because relations are sets of tuples, the order of these tuples is considered immaterial. Similarly, because columns are named, their order in a table is of no importance either. The relations in Figures 3.2 and 3.3 are thus considered to be the same relation.

    Id        | Name          | Address        | Status
    111111111 | John Doe      | 123 Main St.   | Freshman
    666666666 | Joseph Public | 666 Hollow Rd. | Sophomore
    111223344 | Mary Smith    | 1 Lake St.     | Freshman
    987654321 | Bart Simpson  | Fox 5 TV       | Senior
    023456789 | Homer Simpson | Fox 5 TV       | Senior
    123454321 | Joe Blow      | 6 Yard Ct.     | Junior

FIGURE 3.2 Instance of the STUDENT relation.

    Id        | Name          | Status    | Address
    111223344 | Mary Smith    | Freshman  | 1 Lake St.
    987654321 | Bart Simpson  | Senior    | Fox 5 TV
    111111111 | John Doe      | Freshman  | 123 Main St.
    023456789 | Homer Simpson | Senior    | Fox 5 TV
    666666666 | Joseph Public | Sophomore | 666 Hollow Rd.
    123454321 | Joe Blow      | Junior    | 6 Yard Ct.

FIGURE 3.3 STUDENT relation with different order of columns and tuples.

We should note that the terms “tuple,” “attribute,” and “relation” are preferred in relational database theory, while “row,” “column,” and “table” are the terms used in SQL. However, it is common to use these terms interchangeably.

The value of a particular attribute in any row of a relation is drawn from a set called the attribute domain. For example, the Address attribute of the STUDENT relation has as its domain the set of all strings. One important requirement placed on the values in a domain is data atomicity.¹ Data atomicity does not mean that these values are not decomposable. After all, we have seen that the values can be strings of characters, which means that they are decomposable. Rather, data atomicity means that the relational model does not specify any means for looking into the internal structure of the values, so that the values appear indivisible to the relational operators. This atomicity restriction is sometimes seen as a shortcoming of the relational model, and most commercial systems relax it in various ways. Some remove it altogether, which leads to a breed of data models known as object-relational. We will return to the object-relational model in Chapter 16.

¹ The notion of data atomicity should not be confused with the unrelated notion of transaction atomicity, which we discussed in Section 2.3.

Relation schema. A relation schema consists of

1. The name of the relation. Relation names must be unique across the database.
2. The names of the attributes in the relation along with their associated domain names. An attribute is simply the name given to a column in a relation instance. All columns in a relation must be named, and no two columns in the same relation can have the same name.
A domain name is just a name given to some well-defined set of values. In programming languages, domain names are usually called types. Examples are INTEGER, REAL, and STRING.
3. The integrity constraints (IC). Integrity constraints are restrictions on the relational instances of this schema (i.e., restrictions on which tuples can appear in an instance of the relation). An instance of a schema is said to be legal if it satisfies all ICs associated with the schema.

To illustrate, let us revisit the schema that was mentioned before:

    STUDENT(Id:INTEGER, Name:STRING, Address:STRING, Status:STRING)

This schema states that STUDENT relations must have exactly four attributes: Id, Name, Address, and Status with associated domains INTEGER and STRING. As seen from this example, different attributes in the same schema must have distinct names but can share domains.

The domains specify that in STUDENT relations all values in the column Id must belong to the domain INTEGER, while the values in all other columns must belong to the domain STRING. Naturally, we assume that the domain INTEGER consists of all integers and that the domain STRING consists of all character strings. However, schemas can also have user-defined domains, such as SSN or STATUS, that can be constrained to contain precisely the values appropriate for the attributes at hand. For instance, the domain STATUS can be defined to consist just of the symbols “freshman,” “sophomore,” and so forth, and the domain SSN can be defined to contain all (and only) nine- 0 AND StudId < 1000000000 ) )

Restricting the applicable range of attributes is not the only use of the CHECK constraint in the above context. Using the somewhat contrived relation schema below, we can express the constraint that managers must always earn more than their subordinates.
    CREATE TABLE EMPLOYEE (
        Id          INTEGER,
        Name        CHAR(20),
        Salary      INTEGER,
        MngrSalary  INTEGER,
        CHECK ( MngrSalary > Salary ) )

The semantics of the CHECK clause inside the CREATE TABLE statement requires that every tuple in the corresponding relation satisfy all of the conditional expressions associated with all CHECK clauses in the corresponding CREATE TABLE statement. One important consequence of this semantics is that the empty relation (a relation that contains no tuples) always satisfies all CHECK constraints, as there are no tuples to check. This can lead to certain unexpected results. Consider the following syntactically correct schema definition.

    CREATE TABLE EMPLOYEE (
        Id            INTEGER,
        Name          CHAR(20),
        Salary        INTEGER,
        DepartmentId  CHAR(4),
        MngrId        INTEGER,
        CHECK ( 0 < (SELECT COUNT(*) FROM EMPLOYEE) ),
        CHECK ( (SELECT COUNT(*) FROM MANAGER)
                < (SELECT COUNT(*) FROM EMPLOYEE) ) )        (3.3)

Both CHECK clauses involve SELECT statements that count the number of rows in the named relation. Hence, the first CHECK clause presumably says that the EMPLOYEE relation cannot be empty. However natural this constraint may seem to be, it does not achieve its intended goal. Indeed, as we have remarked, this condition is supposed to be satisfied by every tuple in the EMPLOYEE relation, not by the relation itself. Therefore, if the relation is empty, it satisfies every CHECK constraint, even the one that supposedly says that the relation must not be empty!

The second CHECK clause in (3.3) shows that in principle nothing stops us from trying to (mis)use this facility for interrelational constraints. We have assumed that there is a relation, MANAGER, that has a tuple for each manager in the company. The constraint presumably says that there must be more employees than managers, which it in fact does, but only if the EMPLOYEE relation is not empty.

General constraints: ASSERTIONS.
Apart from the subtle bug, the second constraint in (3.3) looks particularly unintuitive because it is symmetric by nature and yet it is asymmetrically hardwired into the table definition of just one of the two relations involved. To overcome this problem, SQL provides one more way to use the CHECK clause: inside the CREATE ASSERTION statement. An assertion is a component of the database schema, like a table, so incorporating the CHECK clause within it puts the constraint in a symmetric relationship with the two tables. Thus, the two constraints can be restated as follows (and this time correctly!):

    CREATE ASSERTION THOUSHALTNOTFIREEVERYONE
        CHECK ( 0 < (SELECT COUNT(*) FROM EMPLOYEE) )

    CREATE ASSERTION WATCHADMINCOSTS
        CHECK ( (SELECT COUNT(*) FROM MANAGER)
                < (SELECT COUNT(*) FROM EMPLOYEE) )

Unlike the CHECK conditions that appear inside a table definition, those in the CREATE ASSERTION statement must be satisfied by the contents of the entire database rather than by individual tuples of a host table. Thus, a database satisfies the first assertion (above) if and only if the number of tuples in the EMPLOYEE relation is greater than zero. Likewise, the second assertion is satisfied whenever the MANAGER relation has fewer tuples than the EMPLOYEE relation has.

For another example of the use of assertions, suppose that the salary information about managers and employees is kept in different relations. We can then state our rule about who should earn more using the following assertion, which literally says that there must not exist an employee who has a boss who earns less. For the sake of this example, we assume that the MANAGER relation has the attributes Id and Salary.

    CREATE ASSERTION THOUSHALTNOTOUTEARNYOURBOSS
        CHECK ( NOT EXISTS
            (SELECT * FROM EMPLOYEE, MANAGER
             WHERE EMPLOYEE.Salary > MANAGER.Salary
               AND EMPLOYEE.MngrId = MANAGER.Id ) )

An interesting question now is, what if, at the time of specifying the constraint THOUSHALTNOTFIREEVERYONE, the EMPLOYEE relation is empty? And what if, at the time of specifying THOUSHALTNOTOUTEARNYOURBOSS, there already is an employee who earns more than the boss? The SQL standard states that if a new constraint is defined and the existing database does not satisfy it, the constraint is rejected. The database designer then has to find out the cause of constraint violation and either amend the constraint or rectify the database.

Our last example is a little more complex.³ It shows how assertions can be used to specify inclusion dependencies that are not foreign-key constraints. More specifically, we express the inclusion dependency (3.1) on page 45 using the assertion statement of SQL.

    CREATE ASSERTION COURSESSHALLNOTBEEMPTY
        CHECK ( NOT EXISTS (
            SELECT * FROM TEACHING
            WHERE NOT EXISTS (
                SELECT * FROM TRANSCRIPT
                WHERE TEACHING.CrsCode = TRANSCRIPT.CrsCode
                  AND TEACHING.Semester = TRANSCRIPT.Semester ) ) )        (3.4)

The CHECK constraint here verifies that there is no tuple in the TEACHING relation (the outer NOT EXISTS statement) for which no matching class exists in the TRANSCRIPT relation (the inner NOT EXISTS statement). A tuple in the TEACHING relation refers to the same class as does a tuple in the TRANSCRIPT relation if in both tuples the CrsCode and Semester components are equal. This test is performed in the innermost WHERE clause.

Different assertions have different maintenance costs (the time required for the DBMS to check that the assertion is satisfied). Generally, intrarelational constraints come cheaper than do interrelational constraints. Among the interrelational constraints, those that are based on keys are easier to enforce than those that are not. Thus, for instance, foreign-key constraints come cheaper than do general inclusion dependencies, such as (3.4).
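The condition of assertion (3.4) can be tried out even on a system that does not accept CREATE ASSERTION, by running it as an ordinary query. The sketch below does this with Python's sqlite3 module; SQLite has no CREATE ASSERTION statement, so the helper function assertion_holds (a hypothetical name) merely evaluates the condition on demand and does not enforce it. The sample rows are made up for illustration.

```python
import sqlite3

# The CHECK condition of (3.4), written as a query that yields 1 or 0.
CONDITION_3_4 = """
    SELECT NOT EXISTS (
        SELECT * FROM Teaching
        WHERE NOT EXISTS (
            SELECT * FROM Transcript
            WHERE Teaching.CrsCode = Transcript.CrsCode
              AND Teaching.Semester = Transcript.Semester))
"""

def assertion_holds(conn):
    """Return True if every Teaching tuple has a matching Transcript tuple."""
    return bool(conn.execute(CONDITION_3_4).fetchone()[0])

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Teaching (ProfId INTEGER, CrsCode TEXT, Semester TEXT);
    CREATE TABLE Transcript (
        StudId INTEGER, CrsCode TEXT, Semester TEXT, Grade TEXT);
    INSERT INTO Teaching VALUES (9406321, 'MGT123', 'F1994');
    INSERT INTO Transcript VALUES (111111111, 'MGT123', 'F1994', 'A');
""")

holds_initially = assertion_holds(conn)

# A course offering in which no student is enrolled violates (3.4).
conn.execute("INSERT INTO Teaching VALUES (9406321, 'CS305', 'F1995')")
holds_after_insert = assertion_holds(conn)

print(holds_initially, holds_after_insert)
```

Because the query must visit TEACHING and probe TRANSCRIPT for each of its tuples, rerunning it after every update gives a feel for why such general inclusion dependencies cost more to maintain than key-based constraints.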
The automatic checking of integrity constraints by a DBMS is one of the more powerful features of SQL. It not only protects the database from errors that might be introduced by untrustworthy users (or sloppy application programmers) but can simplify access to the database as well. For example, a primary key constraint ensures that at most one tuple containing a particular primary key value exists in a table. If a DBMS did not automatically check this constraint, an application program attempting to insert a new tuple or to update the key attributes of an existing tuple would have to scan the table first to ensure that the primary key constraint is maintained.

³ (3.4) involves the use of a nested, correlated subquery. If you do not understand (3.4), plan to come back here after reading Chapter 5.

3.3.6 User-Defined Domains

We have already seen how the CHECK clause lets us limit the range of the attributes in a table. SQL provides an alternative way to enforce such constraints by allowing the user to define appropriate ranges of values, give them domain names, and then use these names in various table definitions. This approach makes the design more modular. We could, for example, create the domain GRADES and use it in the TRANSCRIPT relation instead of using the CHECK constraint directly in the definition of that relation.

    CREATE DOMAIN GRADES CHAR(1)
        CHECK ( VALUE IN ('A', 'B', 'C', 'D', 'F', 'I') )

The only difference between this and the previous constraint (3.2) on page 50, which was directly imposed on the table STUDENT, is that here we use a special keyword, VALUE, instead of the attribute name. We cannot use attribute names here, because the domain is not attached to any particular table. Now we can add

    Grade GRADES

to the definition of STUDENT. The overall effect is the same, but we can use this predefined domain name in several tables without having to repeat the definition.
At a later time, if we need to change this domain definition, the change will automatically propagate to all the tables that use that domain.

A domain is a component of the database schema, like a table or an assertion. Note that, as with assertions, we can use complex queries to define fairly nontrivial domains.

    CREATE DOMAIN UPPERDIVISIONSTUDENT INTEGER
        CHECK ( VALUE IN (SELECT Id FROM STUDENT
                          WHERE Status IN ('senior', 'junior'))
                AND VALUE IS NOT NULL )

The domain UPPERDIVISIONSTUDENT consists of student Ids that belong to students whose status is either senior or junior. In addition, the last clause excludes NULL from that domain. Observe that, in order to verify that the constraint imposed by this domain is satisfied, a query against the database is run. Since such queries might be quite expensive, not every vendor supports the creation of such “virtual” domains.

3.3.7 Foreign-Key Constraints

SQL provides a simple and natural way of specifying foreign keys. The following statement makes CrsCode a foreign key referencing COURSE and makes ProfId a foreign key referencing the PROFESSOR relation.

    CREATE TABLE TEACHING (
        ProfId    INTEGER,
        CrsCode   CHAR(6),
        Semester  CHAR(6),
        PRIMARY KEY (CrsCode, Semester),
        FOREIGN KEY (CrsCode) REFERENCES COURSE,
        FOREIGN KEY (ProfId) REFERENCES PROFESSOR (Id) )

If the names of the referring and the referenced attributes are the same, the referenced attribute can be omitted. The attribute CrsCode above is an example of this situation. If the referenced attribute has a different name than that of the referring attribute, both attributes must be specified. The term PROFESSOR (Id) in the second FOREIGN KEY clause shows how this is done.

It should be noted that, although the SQL standard does not require that the referenced attributes form a primary key (they can form any candidate key), some database vendors impose the primary key restriction.
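The TEACHING definition above can be exercised with Python's sqlite3 module, as in the following sketch. Several details are assumptions specific to this illustration: SQLite checks foreign keys only after PRAGMA foreign_keys = ON has been issued, it resolves a REFERENCES clause without a column list to the primary key of the referenced table, and the COURSE and PROFESSOR definitions and sample rows are minimal stand-ins rather than the schemas used elsewhere in this chapter. The helper try_insert is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite-specific: foreign-key checking is off unless enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Course (CrsCode CHAR(6) PRIMARY KEY, CrsName TEXT);
    CREATE TABLE Professor (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Teaching (
        ProfId   INTEGER,
        CrsCode  CHAR(6),
        Semester CHAR(6),
        PRIMARY KEY (CrsCode, Semester),
        FOREIGN KEY (CrsCode) REFERENCES Course,
        FOREIGN KEY (ProfId) REFERENCES Professor (Id));
    INSERT INTO Course VALUES ('MGT123', 'Market Analysis');
    INSERT INTO Professor VALUES (9406321, 'Taylor');
""")

def try_insert(connection, row):
    """Insert a Teaching tuple; return False if a constraint rejects it."""
    try:
        connection.execute("INSERT INTO Teaching VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

# Both referenced tuples exist, so the insertion is accepted.
accepted = try_insert(conn, (9406321, "MGT123", "F1994"))

# No professor has Id 7007007, so the DBMS refuses the tuple.
dangling = try_insert(conn, (7007007, "MGT123", "S1995"))

print(accepted, dangling)
```

The application never scans PROFESSOR or COURSE itself; the check is carried out by the DBMS as part of processing the INSERT statement.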
In the above example, whenever a TEACHING tuple has a course code in it, the actual course record with this course code must exist in the COURSE relation. Similarly, the professor's Id in a TEACHING tuple must reference an existing tuple in the PROFESSOR relation. The DBMS is expected to enforce these constraints automatically once they are specified. Thus, as part of the procedure for deleting a tuple in the PROFESSOR relation, a check is made to ensure that there is no corresponding tuple in the TEACHING relation.

Foreign keys and nulls. What if, in a particular tuple, the value of an attribute in a foreign key is NULL? Should we insist that there be a corresponding tuple in the referenced relation with a null value in a key attribute? Not a good idea, especially if the referenced key is a primary key. Therefore, SQL relaxes the foreign-key constraint by letting foreign keys have null values. In this case there need not be a corresponding tuple in the referenced relation.

Chicken-and-egg problems. Foreign-key constraints raise other subtle issues too. Consider the table EMPLOYEE defined in (3.3). Suppose that we also have a table that describes departments.

    CREATE TABLE DEPARTMENT (
        DeptId  CHAR(4),
        Name    CHAR(40),
        Budget  INTEGER,
        MngrId  INTEGER,
        FOREIGN KEY (MngrId) REFERENCES EMPLOYEE (Id) )

Now, if we look back at the DepartmentId attribute of the EMPLOYEE table, it is clear that this attribute is intended to represent valid department Ids (i.e., Ids of the departments stored in the DEPARTMENT relation). In other words, the constraint

    FOREIGN KEY (DepartmentId) REFERENCES DEPARTMENT (DeptId)

is in order as part of the CREATE TABLE EMPLOYEE statement. The problem is that either EMPLOYEE or DEPARTMENT has to be defined first. If EMPLOYEE comes first, we cannot have the above foreign-key constraint in the CREATE TABLE EMPLOYEE statement because it refers to the yet-to-be-defined table DEPARTMENT.
If DEPARTMENT is defined before EMPLOYEE, the DBMS will issue an error trying to process the foreign-key constraint in the CREATE TABLE DEPARTMENT statement because this constraint references the yet-to-be-defined table EMPLOYEE. We are facing a chicken-and-egg problem.

The solution is to postpone the introduction of the foreign-key constraint in the first table. That is, if CREATE TABLE EMPLOYEE is executed first, we should not have the FOREIGN KEY clause in it. However, after CREATE TABLE DEPARTMENT has been processed, we can add the desired constraint to EMPLOYEE using the ALTER TABLE directive. This directive will be described in detail later in this section. Here we give only the final result.

    ALTER TABLE EMPLOYEE
        ADD CONSTRAINT EMPDEPTCONSTR
        FOREIGN KEY (DepartmentId) REFERENCES DEPARTMENT (DeptId)

If, after settling this circular reference problem, we now want to start populating the database, we are in for another surprise. Suppose that we want to put the first tuple, (000000007, James Bond, 7000000, B007, 000000000), into the EMPLOYEE relation. Since at this moment the DEPARTMENT table is empty, the foreign-key constraint that prescribes that B007 must refer to a valid tuple in the DEPARTMENT relation is violated.

One solution is to initially replace the DepartmentId component in all tuples in the EMPLOYEE relation with NULL. Then, when DEPARTMENT is populated with appropriate tuples, we can scan the EMPLOYEE relation and replace the null values with valid department Ids. However, this solution is awkward and error-prone. A better solution is to use a transaction and deferred checking of integrity constraints.

In Chapter 2, we pointed out that the intermediate states of the database produced by a transaction might be inconsistent: they might temporarily violate integrity constraints. The only important thing is that constraints must be preserved when the transaction commits.
To accommodate the possibility of temporary constraint violations, SQL allows the programmer to specify the mode of a particular integrity constraint to be either IMMEDIATE, in which case a check is made after each SQL statement that changes the database, or DEFERRED, in which case a check is made only when a transaction commits. Then, to deal with the circular reference problem just described, we can

1. Declare the foreign-key constraints in the two tables, EMPLOYEE and DEPARTMENT, as INITIALLY DEFERRED to set the initial mode of constraint checking.
2. Make the updates that populate these tables part of the same transaction. This will allow the intermediate states to be temporarily inconsistent.
3. Make sure that when all updates are done, the foreign-key constraints are satisfied. Otherwise, the transaction will be aborted when it terminates.

The full details of how transactions are defined in SQL and how they interact with constraints will be discussed in Chapter 8.

3.3.8 Reactive Constraints

When a constraint is violated, the corresponding transaction is typically aborted. However, in some cases, other remedial actions are more appropriate. Foreign-key constraints are one example of this situation.

Suppose that a tuple (007007007, MGT123, F1994) is inserted into the TEACHING relation. Because the table PROFESSOR does not have a professor with the Id 007007007, this insertion violates the foreign-key constraint that requires all non-NULL values in the ProfId field of TEACHING to reference existing professors. In such a case, the semantics of SQL is very simple: the insertion is rejected.

When constraint violation occurs because of deletion of a referenced tuple, SQL offers more choices. Consider the tuple t = (009406321, MGT123, F1994) in the table TEACHING. According to Figure 3.5, t references Professor Taylor in the PROFESSOR relation, and the course Market Analysis in the COURSE relation.
Suppose that Professor Taylor leaves the university. What should happen to t? One solution is to temporarily set the value of ProfId in t to NULL until a replacement lecturer is found. Another solution is to have the attempt to delete Professor Taylor's tuple from the PROFESSOR relation fail, which might reflect the policy that professors are not allowed to leave in the middle of a semester. Finally, if Professor Taylor is the only faculty member capable of teaching the course, we might remove MGT123 from the curriculum altogether. By deleting the referencing tuple, t, the violation of referential integrity is resolved.

These possibilities can be rephrased as reactive constraints. A reactive constraint is a static constraint coupled with a specification of what to do if a certain event happens. For instance, the first alternative above is a constraint that requires that whenever a PROFESSOR tuple is deleted, the field ProfId of all the referencing tuples in TEACHING must be set to NULL. The second alternative is a constraint that asserts that if a referencing tuple exists, the referenced tuple cannot be deleted. The third alternative asserts that all referencing tuples are deleted when the referenced tuple is deleted.

We can specify the appropriate response to an event using triggers, which are statements of the form

    WHENEVER event DO action

Triggers attached to foreign-key constraints. SQL supports a special kind of triggers, which are attached to foreign-key constraints. These triggers are specified as part of the FOREIGN KEY clause using the options ON DELETE and ON UPDATE, which indicate what to do if a referenced tuple is deleted or updated. To illustrate, let us revisit the definition of TEACHING.
    CREATE TABLE TEACHING (
        ProfId    INTEGER,
        CrsCode   CHAR(6),
        Semester  CHAR(6),
        PRIMARY KEY (CrsCode, Semester),
        FOREIGN KEY (ProfId) REFERENCES PROFESSOR (Id)
            ON DELETE NO ACTION
            ON UPDATE CASCADE,
        FOREIGN KEY (CrsCode) REFERENCES COURSE (CrsCode)
            ON DELETE SET NULL
            ON UPDATE CASCADE )

Here we have specified four triggers. One is fired (i.e., executed) whenever a PROFESSOR tuple is deleted, one whenever a PROFESSOR tuple is modified, one when a COURSE tuple is deleted, and one when a COURSE tuple is modified.

The clause ON DELETE NO ACTION means that any attempt to remove a PROFESSOR tuple must be rejected outright if the professor is referenced by a TEACHING tuple. NO ACTION is the default situation when an ON DELETE or ON UPDATE clause is not specified. The clause ON UPDATE CASCADE means that if the Id number of a PROFESSOR tuple is changed, the change must be propagated to all referencing TEACHING tuples (i.e., the new Id must be stored in the referencing tuples). Hence, the same professor is recorded as teaching the course. (Similarly, a specification ON DELETE CASCADE causes the referencing tuple to be deleted.)

ON DELETE SET NULL tells the DBMS that if a COURSE tuple is removed and there is a referencing TEACHING tuple, the referencing attribute, CrsCode, in that tuple must be set to NULL. Alternatively, the designer can specify SET DEFAULT (instead of SET NULL): if CrsCode was defined with a DEFAULT option (e.g., the Status attribute in the STUDENT relation), then it will be reset to its default value if the referenced tuple is deleted; otherwise, it will be set to NULL (which is the default value for the DEFAULT option).

Any combination of DELETE or UPDATE triggers with NO ACTION, CASCADE, or SET NULL/DEFAULT options is allowed in foreign-key triggers.
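The behavior of three of the four triggers specified above can be observed with a short experiment in Python's sqlite3 module. This is a sketch under stated assumptions: SQLite requires PRAGMA foreign_keys = ON before it enforces foreign keys, the PROFESSOR and COURSE definitions and the sample rows are minimal stand-ins, and the PRIMARY KEY clause of TEACHING is left out to keep the sketch focused on the foreign-key actions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite-specific: foreign-key enforcement must be switched on explicitly.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Professor (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Course (CrsCode CHAR(6) PRIMARY KEY, CrsName TEXT);
    CREATE TABLE Teaching (
        ProfId   INTEGER,
        CrsCode  CHAR(6),
        Semester CHAR(6),
        FOREIGN KEY (ProfId) REFERENCES Professor (Id)
            ON DELETE NO ACTION ON UPDATE CASCADE,
        FOREIGN KEY (CrsCode) REFERENCES Course (CrsCode)
            ON DELETE SET NULL ON UPDATE CASCADE);
    INSERT INTO Professor VALUES (9406321, 'Taylor');
    INSERT INTO Course VALUES ('MGT123', 'Market Analysis');
    INSERT INTO Teaching VALUES (9406321, 'MGT123', 'F1994');
""")

# ON DELETE NO ACTION: the referenced professor cannot be removed.
try:
    conn.execute("DELETE FROM Professor WHERE Id = 9406321")
    delete_rejected = False
except sqlite3.IntegrityError:
    delete_rejected = True

# ON UPDATE CASCADE: a new professor Id is propagated to Teaching.
conn.execute("UPDATE Professor SET Id = 1 WHERE Id = 9406321")
prof_id_in_teaching = conn.execute(
    "SELECT ProfId FROM Teaching").fetchone()[0]

# ON DELETE SET NULL: removing the course nulls out the reference.
conn.execute("DELETE FROM Course WHERE CrsCode = 'MGT123'")
crs_code_in_teaching = conn.execute(
    "SELECT CrsCode FROM Teaching").fetchone()[0]

print(delete_rejected, prof_id_in_teaching, crs_code_in_teaching)
```

In each case the repair (or the refusal) is carried out by the DBMS itself as part of the triggering statement; the application issues only the original DELETE or UPDATE.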
The action taken to repair a foreign-key violation in one table, T1, in response to a change in another table, T2 (e.g., delete a row in T1 if a row is deleted in T2), might cause a violation of a foreign-key constraint in T3 that refers to T1. The action specified in T3 controls how that violation is handled. If the entire chain of violations cannot be resolved (e.g., the action specified in T3 is NO ACTION), the initial deletion from T2 is rejected.

General triggers. The ON DELETE/UPDATE triggers are simple and powerful, but they are not powerful enough to capture the wide variety of constraint violations that arise in database applications and are not due to foreign keys. For instance, the referential integrity constraint (3.1) on page 45 is not a foreign-key constraint, and yet the same problems arise here when tuples of the TRANSCRIPT relation are modified or deleted. More importantly, foreign-key triggers cannot even begin to address common needs such as preventing salaries from changing by more than 5% in the same transaction.

To handle these needs, all major database vendors took destiny into their own hands and retrofitted their products with trigger mechanisms. Interestingly, the original design of SQL (before there was an SQL-92 standard) did have relatively powerful triggers. Triggers reappeared in SQL with the SQL:1999 standard, but some vendors have yet to align their offerings with the new standard. We will briefly describe the general trigger mechanism here and leave the details to Chapter 7.

The basic idea behind triggers is simple: whenever a specified event occurs, execute some specified action. Consider the following simple trigger defined using the syntax of SQL:1999. The trigger fires whenever CrsCode or Semester is changed in a tuple in the TRANSCRIPT relation. When the trigger fires and the grade recorded for the course is not NULL, an exception is raised and the changes made by the transaction are rolled back.
Otherwise (if the grade is NULL), we interpret the change as a student dropping one course in favor of another, so the trigger does nothing and the change is allowed to take hold. This trigger is created with the statement

    CREATE TRIGGER CRSCHANGETRIGGER
        AFTER UPDATE OF CrsCode, Semester ON TRANSCRIPT
        WHEN (Grade IS NOT NULL)
        ROLLBACK

This definition is self-explanatory except, perhaps, for the WHEN clause, which acts as a guard, that is, as a precondition that must be satisfied in order for the trigger to fire. If the precondition is true, the statements following WHEN are executed. In our case, the statement aborts the transaction.

In general, many more details might need to be specified in order to define a trigger. For instance, should the action be executed just before the triggering update is applied to the database or after it? Should this action be executed immediately after the event or at some later time? Can a triggered action trigger another action? Moreover, to specify the guard in the WHEN clause, we might need to refer to both the old and the new values of the modified tuples (e.g., to check that salaries have not been changed by more than 5%). We postpone the discussion of these issues until Chapter 7, where many more examples of triggers will be given. In particular, we will discuss how general triggers can be used to maintain inclusion dependencies in the presence of updates (analogous to how ON DELETE and ON UPDATE triggers are used to maintain foreign-key constraints).

3.3.9 Database Views

In Section 3.1, we discussed the three levels of abstraction in databases: the physical level, the conceptual level, and the external level. We have already shown how the conceptual layer is defined in SQL. We now discuss the external (or view) layer of SQL. The physical layer will be discussed in detail in Chapter 9. In SQL, the external schema is defined using the CREATE VIEW statement.
In many respects, a view is like an ordinary table: you can query it, modify it, or control access to it. However, in several important ways a view is not a table. For one thing, the rows of a view are derived from tables (and other views) of the database. Thus, in reality a view repackages information stored elsewhere. Furthermore, the contents of a view do not physically exist in the database. Instead, a recipe for constructing the contents on the fly from other database tables is stored in the system catalog. As will be seen shortly, the view definition is a hybrid of the CREATE TABLE statement and the SELECT statement introduced in Chapter 2. Because of this, views are often called virtual tables.

To illustrate, consider the following view, which tells which professors have taught which students (a professor is said to have taught a student if the student took a course in the semester in which the professor offered it).

    CREATE VIEW PROFSTUD (Prof, Stud) AS
        SELECT TEACHING.ProfId, TRANSCRIPT.StudId
        FROM TRANSCRIPT, TEACHING                              (3.5)
        WHERE TRANSCRIPT.CrsCode = TEACHING.CrsCode
            AND TRANSCRIPT.Semester = TEACHING.Semester

The first line defines the name of the view and its attributes. The rest is just an SQL query that tells how to obtain the contents of the view. These contents, with respect to the database instance of Figure 3.5, are shown in Figure 3.9. To help you understand where the tuples in the view come from, each tuple is annotated with a "justification." (A justification for a tuple (p, s) is a course code together with the semester in which student s took that course from professor p.)

The view PROFSTUD might be part of the external schema that helps the university keep in touch with its alumni, since establishing the relationship between students and professors through courses might be an important and frequent operation in such an application.
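To see the view definition (3.5) at work, the following sketch uses Python with the SQLite engine from the standard library, which is an assumption of this illustration and not a system used in the text. Two stripped-down tables stand in for TEACHING and TRANSCRIPT, and the function name and all inserted rows are hypothetical. The view is queried twice, once before and once after an additional TRANSCRIPT row is inserted, to illustrate that only the recipe is stored and that the contents are recomputed each time the view is accessed.

```python
import sqlite3


def prof_stud_demo():
    """Query the PROFSTUD view before and after a base table changes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        -- Simplified stand-ins for the tables of the running example.
        CREATE TABLE TEACHING (ProfId INTEGER, CrsCode CHAR(6), Semester CHAR(6));
        CREATE TABLE TRANSCRIPT (StudId INTEGER, CrsCode CHAR(6),
                                 Semester CHAR(6), Grade CHAR(1));

        -- The view definition (3.5); no rows are stored for it.
        CREATE VIEW PROFSTUD (Prof, Stud) AS
            SELECT TEACHING.ProfId, TRANSCRIPT.StudId
            FROM TRANSCRIPT, TEACHING
            WHERE TRANSCRIPT.CrsCode = TEACHING.CrsCode
              AND TRANSCRIPT.Semester = TEACHING.Semester;

        -- Hypothetical rows: professor 101 teaches MGT123 in F1994,
        -- and student 1 takes it in that semester.
        INSERT INTO TEACHING VALUES (101, 'MGT123', 'F1994');
        INSERT INTO TRANSCRIPT VALUES (1, 'MGT123', 'F1994', 'A');
    """)

    before = conn.execute(
        "SELECT Prof, Stud FROM PROFSTUD ORDER BY Stud").fetchall()

    # A second student takes the same course; the view itself is not touched.
    conn.execute("INSERT INTO TRANSCRIPT VALUES (2, 'MGT123', 'F1994', 'B')")

    after = conn.execute(
        "SELECT Prof, Stud FROM PROFSTUD ORDER BY Stud").fetchall()
    conn.close()
    return before, after
```

With these rows, the first query yields [(101, 1)] and the second yields [(101, 1), (101, 2)], although nothing was ever inserted into PROFSTUD directly.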
Instead of this relationship being reinvented by every single application, it can be defined once and for all in the form of a view. Once it is defined, all applications can refer to the view as if it were an ordinary table. The rows of the view are constructed at the time it is accessed, so the contents change as the underlying relations are updated by transactions.

    Prof      | Stud      | Justification
    009406321 | 666666666 | MGT123,F1994
    121232343 | 666666666 | EE101,S1991
    900120450 | 666666666 | MAT123,F1997
    855666777 | 987654321 | CS305,F1995
    009406321 | 987654321 | MGT123,F1994
    101202303 | 123454321 | CS315,S1997; CS305,S1996
    900120450 | 123454321 | MAT123,S1996
    121232343 | 023456789 | EE101,F1995
    101202303 | 023456789 | CS305,S1996
    900120450 | 111111111 | MAT123,F1997
    009406321 | 111111111 | MGT123,F1997
    793432188 | 111111111 | MGT123,F1997

FIGURE 3.9 Contents of the view defined by SQL statement (3.5).

In Chapter 5, we will expand our discussion of the view mechanism and show how views can be used to modularize the construction of complex queries. The authorization mechanism is another important use of views. In Section 3.3.12, we will see that views can be treated as ordinary tables for the purpose of granting selective access rights to the information stored in the database.

3.3.10 Modifying Existing Definitions

Although database schemas are not supposed to change frequently, they do evolve. Occasionally, new fields are added to relations or existing fields are dropped; new constraints and domains are created, or old ones become invalid (perhaps because business rules change). Of course, we can always copy the old contents of a relation to a temporary space, erase the old relation and its schema, and then create a new relation schema with the old name. However, this process is tedious and error-prone. To simplify schema maintenance, SQL provides the ALTER statement, which in its simplest form looks like this.
    ALTER TABLE STUDENT
        ADD COLUMN Gpa INTEGER DEFAULT 0

This command adds a new field to the STUDENT relation and initializes the field's value in each tuple to 0. You can also use DROP COLUMN to remove a column from a relation and add or drop constraints. For instance,

    ALTER TABLE STUDENT
        ADD CONSTRAINT GPARANGE CHECK (Gpa >= 0 AND Gpa <= 4)

    ALTER TABLE TEACHING
        ADD CONSTRAINT TEACHKEY UNIQUE (ProfId, Semester, Time)

If the current instance of STUDENT violates the new constraint GPARANGE, or if TEACHING violates TEACHKEY, the newly added constraints are rejected.

In order for a constraint to be "droppable" from a table definition, the constraint must be named at the time when it is defined, an option we have not used until now. We make up for this by naming every constraint in a revised definition of TRANSCRIPT.

    CREATE TABLE TRANSCRIPT (
        StudId    INTEGER,
        CrsCode   CHAR(6),
        Semester  CHAR(6),
        Grade     GRADES,
        CONSTRAINT TRKEY PRIMARY KEY (StudId, CrsCode, Semester),
        CONSTRAINT STUDFK FOREIGN KEY (StudId) REFERENCES STUDENT,
        CONSTRAINT CRSFK FOREIGN KEY (CrsCode) REFERENCES COURSE,
        CONSTRAINT IDRANGE CHECK (StudId > 0 AND StudId < 1000000000) )

Now we can alter the above definition by dropping any one of the specified integrity constraints. For example,

    ALTER TABLE TRANSCRIPT
        DROP CONSTRAINT TRKEY

When a table is no longer needed, its definition can be erased from the catalog. In this case, the schema of the table and its instance are both lost. Previously defined assertions and domains can also be dropped. For example,

    DROP TABLE EMPLOYEE RESTRICT
    DROP ASSERTION THOUSHALTNOTFIREEVERYONE
    DROP DOMAIN GRADES

The DROP TABLE command has two options: RESTRICT and CASCADE. With the RESTRICT option, the DROP statement would refuse to delete a table if it is used in some other definition, such as an integrity constraint. For instance, the constraint
