6.830/6.814 Notes for Lecture 1: Introduction to Database Systems

Carlo A. Curino September 10, 2010

Administrivia

Here follows a bunch of administrative information about the class, the instructors, and the materials. Please keep checking the course website http://db.csail.mit.edu/6.830/ for up-to-date info.

Units: 3-0-9 (H)
When: TR 1:00 - 2:30
Where: 32-155
Instructors: Carlo Curino ([email protected]), Michael Stonebraker ([email protected])
Instructor office hours: by appointment
Teaching Assistants:
- Aubrey Tatarowicz ([email protected]), office hours:
- Eugene Wu ([email protected]), office hours:
TA office hours where: 32-G9 Lounge (right out of the elevators)
Staff mailing list: [email protected] (for general questions)

Overview of the course format. Syllabus online at: http://db.csail.mit.edu/6.830/syllabus.html

Readings... do them! (Sometimes we will post questions that you have to answer by email or prepare to discuss in class.)

These notes are only meant to be a guide to the topics we touch on in class. Future notes are likely to be more terse and schematic, and you are required to read/study the papers and book chapters we mention in class, do the homeworks and labs, etc.

Assignments:
- 4 problem sets
- 3 labs + project for grad students (6.830); 5 labs (Java programming) for undergrads (6.814). You can do the project rather than the last 2 labs if you prefer.

Text books:
- Readings in Database Systems (The Red Book). Online at http://library.books24x7.com.libproxy.mit.edu/toc.asp?site=bbbga&bookid=10757
- Database Management Systems (Ramakrishnan and Gehrke)

Sign up sheet
RSS Feed: http://db.csail.mit.edu/6.830/events.xml
6.814 satisfies the Advanced Undergraduate Subject requirement.

Grading 6.830:
- Assignments (Problem Sets and Labs): 35% total
- 2 Exams: 20% each
- Final Project: 20%
- Class Participation: 5%

Grading 6.814:
- Assignments (Problem Sets and Labs): 50% total
- 2 Exams: 20% each
- Class Participation: 10%

Introduction

READING MATERIAL: Ramakrishnan and Gehrke, Chapter 1

What is a database? A database is a collection of structured data. A database captures an abstract representation of the domain of an application. It is typically organized as records (traditionally, large numbers of them, on disk) and relationships between records.

This class is about database management systems (DBMS): systems for creating, manipulating, and accessing a database. A DBMS is a (usually complex) piece of software that sits in front of a collection of data and mediates applications' accesses to the data, guaranteeing many properties about the data and the accesses.

Why should you care? There are lots of applications that we don't offer classes on at MIT. Why are databases any different?

[Figure 1: What is a database management system? Applications (APP1, APP2) access the DBMS ("a system to create, manipulate, access databases (mediate access to the data)"), which sits in front of the DB ("a collection of structured data").]

- Ubiquity (anywhere from your smartphone to Wikipedia)
- Real world impact: software market (roughly the same size as the OS market, roughly $20B/y). Web sites, big companies, and scientific projects all manage both day-to-day operations and business intelligence + data mining.
- You need to know about databases if you want to be happy!

The goal of a DBMS is to simplify the storing and accessing of data. To this purpose DBMSs provide facilities that serve the most common operations performed on data. The database community has devoted significant effort to formalizing a few key concepts that most applications exploit to manipulate data. This provides a formal ground for us to discuss the application requirements on data storage and access, and to compare ways for the DBMS to meet such requirements. This will provide you with powerful conceptual tools that go beyond the specific topics we tackle in this class, and are of general use for any application that needs to deal with data.

We now proceed with an example that shows how hard it is to do things without a DB. Later we will introduce formal DB concepts and show how much easier things are using a DB.

Mafia Example

Today we cover the user perspective, trying to detail the many reasons we want to use a DBMS rather than organizing and accessing data directly, for example as files. Let us assume I am a Mafia Boss (Note: despite the accent this is not the case, but only hypothetical!) and I want to organize my group of "picciotti" (Sicilian for the criminals/bad guys working for the boss, a.k.a. the soldiers, see Figure 2) to achieve more efficiency in all our operations. I will also need a lot of book-keeping, security/privacy, etc. Note that my organization is very large, so there are quite a lot of things going on at any moment (i.e., many people accessing the database to record or read information).

[Figure 2: Mafia hierarchy.]

I need to store information about:
- people that work for me (soldiers, caporegime, etc.)
- organizations I do business with (police, 'Ndrangheta, politicians)
- completed and open operations:
  - protection rackets
  - arms trafficking
  - drug trafficking
  - loan sharking
  - control of contracting/politics
  - I need to prevent any of my men from being involved in burglary, mugging, kidnapping (too much police attention)
- cover-up operations/businesses
- money laundering and funds tracking
- assignment of soldiers to operations
- etc.

I will need to share some of this information with external organizations I work with, while protecting the rest of it. Therefore I need:
- the boss, underboss and consigliere to be able to access all the data and do any kind of operation (assign soldiers to operations, create or shut down operations, pay cops, check the total state of money movements, etc.)
- the accountants (20 of them) to have access to perform money book-keeping (track money laundering operations, move money from bank to bank, report bribing expenses)
- the soldiers (5000) to report daily misdeeds in a daily log, and to report money expenses and collections
- a semi-public interface accessible by other bosses I collaborate with (search for cops on our books, check areas we already cover, etc.)
[Figure 3: What data to store in my Mafia database. Diagram elements: person, involve, operation, log, collaboration_with, organization, accounts. Attribute labels: name, nickname, phone; log_id, author, title, summary; account-number, false-identity, balance; name, desc, $$, coverup-name; name, boss, rank.]

3.1 An offer you cannot refuse

I make you an offer you cannot refuse: you are hired to create my Mafia Information System. If you get it right you will have money, sexy cars, and a great life. If you get it wrong... well, you don't want to get it wrong. As a first attempt, you think about just using a file system:

1. What to represent: what are the key entities in the real world I need to represent? In how much detail?
2. How to store data: maybe we can just use files: people.txt, organizations.txt, operations.txt, money.txt, daily-log.txt. Each file contains a textual representation of the information, with one item per line.
3. Control access credentials at fine granularity: accountants should know about money movements, but not the names and addresses of our soldiers. Soldiers should know about operations, but not access money information.
4. How to access data: we could write a separate procedural program for each task, opening one or more files, scanning through them and reading/writing information in them.
5. Access patterns and performance: how do we find the shop we haven't collected money from for the longest time (and for at least 1 month)? Scan the huge operations file, sort by time, pick the oldest, measure the elapsed time? (This needs to be timely or they will stop paying, and this gets the boss mad... you surely don't want that. You must also make sure no one else is accessing the file right now.) Tony Schifezza is a mole: we need to find all the operations and people he was involved with or knew about and shut them down... quick... like REAL quick!!!
6. Atomicity: when an accountant moves money from one place to another you need to guarantee that either the money is removed from account A and added to account B, or nothing at all happens... (You do not want to have money vanishing, unless you plan to vanish too!)
7. Consistency: guarantee that the data is always in a valid state (e.g., there are no two operations with the same name).
8. Isolation: multiple soldiers need to add to daily-log.txt at the same time (the risk is that they overwrite each other's work, and someone gets fired for not being productive!!).
9. Durability: in case of a computer crash we need to make sure we don't lose any data, and that data does not get scrambled (e.g., if the system says the payment of a cop went through, we must guarantee that after reboot the operation will be present in the system and completed. The risk is police taking down our operation!).

Using the file system, you realize that most probably you will fail, and that can be very dangerous... Luckily you are enrolled in 6.830/6.814 and you just learned that databases address all of these issues!! You might have a chance!

In fact, you might notice that the issues listed above are already related to the three concepts we mentioned before: 1-3 are problems related to the Data Model, 4-5 are problems related to the Query Language, and 6-9 are problems related to Transactions. So let's try to do the same with a database and get the boss what he needs.
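To make items 4 and 5 concrete, here is a minimal Python sketch of the "one procedural program per question" approach. The record layout (one `shop<TAB>YYYY-MM-DD` collection per line) and the function name `longest_unpaid` are assumptions made for illustration; the notes do not specify a file format. The lines are kept in memory instead of being read from operations.txt so the sketch is self-contained.

```python
from datetime import date

def longest_unpaid(lines, today, min_days=30):
    """Return the shop whose most recent collection is the oldest,
    provided it is at least min_days old; None if no shop qualifies."""
    last_seen = {}
    for line in lines:
        # Hypothetical layout: "shop<TAB>YYYY-MM-DD", one collection per line.
        shop, day = line.rstrip("\n").split("\t")
        when = date.fromisoformat(day)
        if shop not in last_seen or when > last_seen[shop]:
            last_seen[shop] = when
    overdue = [(when, shop) for shop, when in last_seen.items()
               if (today - when).days >= min_days]
    return min(overdue)[1] if overdue else None

# Stand-in for the contents of a collections file.
log_lines = [
    "bakery\t2010-06-01\n",
    "butcher\t2010-08-30\n",
    "bakery\t2010-07-15\n",
    "florist\t2010-05-20\n",
]
result = longest_unpaid(log_lines, today=date(2010, 9, 10))
print(result)
```

Every new question (for example, "everything Tony knew about") would need another program of this kind, and nothing here protects against another process rewriting the file during the scan.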

3.2 More on fundamental concepts

Databases are a microcosm of computer science. Their study covers: languages, theory, operating systems, concurrent programming, user interfaces, optimization, algorithms, artificial intelligence, system design, parallel and distributed systems, statistical techniques, dynamic programming. Some of the key concepts we will investigate are:

Representing Data: We need a consistent, structured way to represent data. This is important for consistency, sharing, and efficiency of access. Database theory gives us the right concepts.
- Data Model: a set of constructs (or a paradigm) to describe the organization of data. For example tables (or more precisely relations), but we could also choose graphs, hierarchies, objects, triples <subject,predicate,object>, etc.
- Conceptual/Logical Schema: a description of a particular collection of data, using a given data model (e.g., the schema of our Mafia database).
- Physical Schema: the physical organization of the data (e.g., data and index files on disk for our Mafia database).

Declarative Querying and Query Processing: a high-level (typically declarative) language to describe operations on data (e.g., queries, updates). The goal is to guarantee data independence (logical and physical), by separating what you want to do with data from how to achieve that (more later).
- High level language for accessing data
- Data Independence (logical and physical)
- Optimization Techniques for efficiently accessing data

Transactions:
- a way to group actions that must happen atomically (all or nothing)
- guaranteed to move the DB content from one consistent state to another
- isolated from parallel execution of other actions/transactions
- recoverable in case of failure (e.g., power goes out)

This provides the application with guarantees about a group of actions even in the presence of concurrency and failures. A transaction is a unit of access and manipulation of data, and it significantly simplifies the work of application developers.

This course covers these concepts, and goes deep into the investigation of how modern DBMSs are designed to achieve all that. We will not cover the more artificial-intelligence / statistical / mining related areas that are also part of database research. Instead, we will explore some of the recent advanced topics in database research; see the class schedule to get an idea of the topics.
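The all-or-nothing behavior of a transaction can be tried out with Python's standard `sqlite3` module. This is only a sketch under stated assumptions: the `accounts` table, its `CHECK (balance >= 0)` rule, and the `transfer` helper are hypothetical names chosen for illustration and are not part of the course material. The transfer credits the destination first on purpose, so that a failing debit leaves a half-done change that the rollback has to undo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: a balance may never go negative.
conn.execute(
    "CREATE TABLE accounts (account_number TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE account_number = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE account_number = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit; the credit was rolled back too.
        return False

def balances(conn):
    return dict(conn.execute("SELECT account_number, balance FROM accounts"))

ok_small = transfer(conn, "A", "B", 30)    # fits within A's balance
ok_large = transfer(conn, "A", "B", 500)   # would overdraw A
print(ok_small, ok_large, balances(conn))
```

After the failed transfer the balances are the same as before it started, which is the guarantee item 6 of the file-system list asks for.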

3.3 Back to our Mafia database

What features of our organization shall we store? How do we want to capture them? Choose a level of abstraction and describe only the relevant details (e.g., I don't care about favorite movies for my soldiers, but I need to store their phone numbers). Let's focus on a subset:
- each person has a real name, nickname, phone number
- each operation has a name, description, economic value, cover-up name
- info about the persons involved in an operation and their role

We could represent this data according to many different data models:
- hierarchies
- objects
- graph
- triples
- etc.

Let's try using an XML hierarchical file:

    <person>
      <name> </name>
      <nickname> </nickname>
      <phone> </phone>
      <operation>
        <op_name> </op_name>
        <description> </description>
        <econ_value> </econ_value>
        <coverup_name> </coverup_name>
      </operation>
    </person>

Operations are duplicated in each person. This might make updates very tricky (inconsistencies) and the representation very verbose and redundant. Alternatively we can organize the data the other way around, with people inside operations, but then we would have people replicated. Another possibility is using a graph structure with people, names, nicknames, phones, operation names, etc. as nodes, and edges to represent the relationships between them. Or we could have objects and methods on them, or triples like <carlo,is a,person>, <carlo,phone,5554348882>, etc.

Different data models are more suited for different problems. They have different expressive power and different strengths depending on what data you want to represent and how you need to access it.

Let's choose the relational data model and represent this problem using tables. Again there are many ways to structure the representation, i.e., different conceptual/logical schemas that could capture the reality we are modeling. For example we can have a single big table with all the info together... again, this is redundant and might slow down all access to the data.

Database design is the art of capturing a set of real world concepts and their relations in the best possible organization in a database. A good representation is shown in Figure 4. It is not redundant and contains all the information we care about.
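The update problem with the nested (hierarchical) layout can be seen in a few lines of Python. This is an illustrative sketch only: the dictionaries stand in for the XML file and for tables, and the names and values are sample data borrowed from Figure 4.

```python
# Hierarchical layout: the operation's details are copied inside every person.
people_nested = [
    {"name": "carlo",
     "operation": {"op_name": "snowflake", "coverup_name": "laundromat"}},
    {"name": "tony",
     "operation": {"op_name": "snowflake", "coverup_name": "laundromat"}},
]

# Someone changes the cover-up in one copy and forgets the other.
people_nested[0]["operation"]["coverup_name"] = "bakery"
nested_coverups = {
    p["operation"]["coverup_name"]
    for p in people_nested
    if p["operation"]["op_name"] == "snowflake"
}

# Table-like layout: each operation is stored once; involvement refers to it by name.
operations = {"snowflake": {"coverup_name": "laundromat"}}
involved = [("carlo", "snowflake"), ("tony", "snowflake")]

operations["snowflake"]["coverup_name"] = "bakery"   # a single place to update
flat_coverups = {operations[op]["coverup_name"] for _, op in involved}

print(nested_coverups)  # two different answers for the same operation
print(flat_coverups)    # one answer
```

In the nested layout the same operation now has two cover-up names depending on which person you read it from; in the non-redundant layout that cannot happen.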

involved
    pers_name   oper_name   role
    carlo       snowflake   chief
    tony        snowflake   sold
    mike        chocolate   chief

person
    name    nickname    phone
    carlo   baffo       123
    mike    lungo       456
    tony    schifezza   789

operation
    title       descr.   econ_val   coverup
    snowflake   ..       $10M       laundromat
    chocolate   ...      $5M        irish pub
    caffe       ...      $2M        irish pub

Figure 4: Simple Logical Schema for a portion of our Mafia database.

What about the physical organization of the data? As a database user you can ignore the problem, thanks to physical independence! As a student of this class you will devote a lot of effort to learning how to best organize data physically to provide great performance when accessing data.
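As a rough illustration, the logical schema of Figure 4 could be declared as follows using Python's `sqlite3` module. The column types, the choice of keys, and the storage of `econ_val` as an integer number of dollars are assumptions; the figure only gives column names and sample rows. The helper names `build_db` and `add_operation` are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE person (
    name     TEXT PRIMARY KEY,
    nickname TEXT,
    phone    TEXT
);
CREATE TABLE operation (
    title    TEXT PRIMARY KEY,      -- assumed key: no two operations share a name
    descr    TEXT,
    econ_val INTEGER,               -- assumed unit: dollars
    coverup  TEXT
);
CREATE TABLE involved (
    pers_name TEXT REFERENCES person(name),
    oper_name TEXT REFERENCES operation(title),
    role      TEXT,
    PRIMARY KEY (pers_name, oper_name)
);
"""

def build_db():
    """Create an in-memory database holding the sample rows of Figure 4."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO person VALUES (?, ?, ?)", [
        ("carlo", "baffo", "123"),
        ("mike", "lungo", "456"),
        ("tony", "schifezza", "789"),
    ])
    conn.executemany("INSERT INTO operation VALUES (?, ?, ?, ?)", [
        ("snowflake", "", 10_000_000, "laundromat"),
        ("chocolate", "", 5_000_000, "irish pub"),
        ("caffe", "", 2_000_000, "irish pub"),
    ])
    conn.executemany("INSERT INTO involved VALUES (?, ?, ?)", [
        ("carlo", "snowflake", "chief"),
        ("tony", "snowflake", "sold"),
        ("mike", "chocolate", "chief"),
    ])
    conn.commit()
    return conn

def add_operation(conn, title, coverup):
    """Return True if the row was stored, False if the key constraint rejected it."""
    try:
        with conn:
            conn.execute("INSERT INTO operation VALUES (?, '', 0, ?)", (title, coverup))
        return True
    except sqlite3.IntegrityError:
        return False

conn = build_db()
duplicate_ok = add_operation(conn, "snowflake", "bakery")  # title already in use
fresh_ok = add_operation(conn, "newop1", "bakery")
print(duplicate_ok, fresh_ok)
```

Declaring `title` as a key lets the DBMS itself enforce the consistency rule from the file-system list (no two operations with the same name), instead of every program having to check it.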

3.4 Accessing the data (transactionally)

As we introduced before, databases provide high-level declarative query languages. The key idea is that you describe what you want to access, rather than how to access it. Let's consider the following operations you want to do on the data, and how we can represent them using the standard relational query language SQL (table and column names follow Figure 4):

Which operations involve Tony Schifezza?

    SELECT oper_name
    FROM involved
    WHERE pers_name = 'tony';

Given the laundromat operation, get the phone numbers of all the people involved in operations using it as a cover-up.

    SELECT p.phone
    FROM person p, operation o, involved i
    WHERE p.name = i.pers_name
      AND i.oper_name = o.title
      AND o.coverup = 'laundromat';

Reassign Tony's operations to Sam and remove Tony from the database (he was the mole).

    BEGIN;
    UPDATE involved SET pers_name = 'sam' WHERE pers_name = 'tony';
    DELETE FROM person WHERE name = 'tony';
    COMMIT;

Create a new operation with Sam Astuto in charge of it.

    BEGIN;
    INSERT INTO operation VALUES ('newop1', '', 0, 'Sam''s bakery');
    INSERT INTO involved VALUES ('sam', 'newop1', 'chief');
    COMMIT;

Let us reconsider the procedural approach. You might organize data into files: one file per table, with one record per line, maybe sorted by one of the fields. Now every different access to the data, i.e., every query, becomes a different program that opens the files, scans them, reads or writes certain fields, and saves the files.
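The queries of this section can be run against the sample rows of Figure 4 with a short Python script. This is a sketch under assumptions: it uses SQLite through the standard `sqlite3` module, takes table and column names from Figure 4, and stores `econ_val` as an integer number of dollars.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (name TEXT, nickname TEXT, phone TEXT);
CREATE TABLE operation (title TEXT, descr TEXT, econ_val INTEGER, coverup TEXT);
CREATE TABLE involved (pers_name TEXT, oper_name TEXT, role TEXT);
INSERT INTO person VALUES
    ('carlo', 'baffo', '123'), ('mike', 'lungo', '456'), ('tony', 'schifezza', '789');
INSERT INTO operation VALUES
    ('snowflake', '', 10000000, 'laundromat'),
    ('chocolate', '', 5000000, 'irish pub'),
    ('caffe', '', 2000000, 'irish pub');
INSERT INTO involved VALUES
    ('carlo', 'snowflake', 'chief'), ('tony', 'snowflake', 'sold'),
    ('mike', 'chocolate', 'chief');
""")

# Which operations involve Tony?
tony_ops = [row[0] for row in conn.execute(
    "SELECT oper_name FROM involved WHERE pers_name = ?", ("tony",))]

# Phone numbers of everyone involved in an operation covered by the laundromat.
phones = sorted(row[0] for row in conn.execute(
    """SELECT p.phone
       FROM person p, operation o, involved i
       WHERE p.name = i.pers_name
         AND i.oper_name = o.title
         AND o.coverup = ?""", ("laundromat",)))

# Reassign Tony's operations to Sam and remove Tony, as one transaction.
with conn:
    conn.execute("UPDATE involved SET pers_name = 'sam' WHERE pers_name = 'tony'")
    conn.execute("DELETE FROM person WHERE name = 'tony'")

sam_ops = [row[0] for row in conn.execute(
    "SELECT oper_name FROM involved WHERE pers_name = 'sam'")]
remaining = sorted(row[0] for row in conn.execute("SELECT name FROM person"))
print(tony_ops, phones, sam_ops, remaining)
```

Note that none of the statements says which table to scan first or how to match rows; that choice is left to the DBMS.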

Extras

The two following concepts have been broadly mentioned but not discussed in detail in class.

Optimization: The goal of a DBMS is to provide a library of sophisticated techniques and strategies to store, access, and update data that also guarantees performance, atomicity, consistency, isolation, and durability. A DBMS automatically compiles the user's declarative queries into an execution plan (i.e., a strategy that applies various steps to compute the user's queries), looks for equivalent but more efficient ways to obtain the same result (query optimization), and executes it; see the example in Figure 5.
BASIC PLAN
    project(p.phone)
      filter(o.coverup="laundromat")
        filter(p.name=i.person)
          filter(i.oper_name=o.name)
            product
              product
                scan(person)
                scan(involved)
              scan(operations)

OPTIMIZED PLAN
    project(p.phone)
      filter(p.name=i.person)
        product
          project(p.name,p.phone)
            scan(person)
          filter(i.oper_name=o.name)
            product
              project(i.oper_name, i.person)
                scan(involved)
              project(o.name)
                lookup(operations, coverup="laundromat")

Figure 5: Two equivalent execution plans, a basic and an optimized one.
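The idea behind Figure 5 can be mimicked in plain Python: two strategies that return the same phone numbers but build very different numbers of intermediate tuples. This is only a toy sketch over the Figure 4 sample rows, not how a real DBMS executes plans; the function names and the tuple-counting are assumptions made for illustration.

```python
from itertools import product

# Sample rows from Figure 4 (econ_val as an assumed integer number of dollars).
person = [("carlo", "baffo", "123"), ("mike", "lungo", "456"), ("tony", "schifezza", "789")]
involved = [("carlo", "snowflake", "chief"), ("tony", "snowflake", "sold"),
            ("mike", "chocolate", "chief")]
operation = [("snowflake", "", 10_000_000, "laundromat"),
             ("chocolate", "", 5_000_000, "irish pub"),
             ("caffe", "", 2_000_000, "irish pub")]

def basic_plan():
    """Build the full product of the three tables, then filter, then project."""
    rows = list(product(person, involved, operation))
    phones = [p[2] for p, i, o in rows
              if o[3] == "laundromat" and p[0] == i[0] and i[1] == o[0]]
    return sorted(phones), len(rows)

def optimized_plan():
    """Select the matching operations first, so the products stay small."""
    titles = [o[0] for o in operation if o[3] == "laundromat"]   # lookup on coverup
    pairs = list(product(involved, titles))                      # involved x matching titles
    names = [i[0] for i, title in pairs if i[1] == title]
    final = list(product(person, names))                         # person x matching names
    phones = [p[2] for p, name in final if p[0] == name]
    return sorted(phones), len(pairs) + len(final)

basic_result, basic_cost = basic_plan()
optimized_result, optimized_cost = optimized_plan()
print(basic_result, basic_cost)
print(optimized_result, optimized_cost)
```

Both functions compute the same answer; the second one simply applies the most selective condition before combining tables, which is the kind of rewrite a query optimizer searches for automatically.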

External schema: A set of views over the logical schema that dictates how users see/access data (e.g., a set of views for the accountants). It is often not physically materialized, but maintained as a view/query on top of the data.

Let us try to show only the coverup names of operations worth $5M or less and the nicknames of all the people involved, using a view (see Figure 6):

    CREATE VIEW nick_cover AS
      SELECT nickname, coverup
      FROM operation o, involved i, person p
      WHERE p.name = i.pers_name
        AND i.oper_name = o.title
        AND o.econ_val <= 5000000;

nick_cover
    nickname    coverup
    baffo       laundromat
    lungo       irish pub
    schifezza   laundromat

Figure 6: Simple External Schema for a portion of our Mafia database.
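A view of this kind can be tried with SQLite through Python's `sqlite3` module. The sketch below makes several assumptions: the view is named `nick_cover` (with an underscore, since a hyphen is not valid in an unquoted SQL identifier), table and column names follow Figure 4, and `econ_val` is stored as an integer number of dollars. On the Figure 4 sample rows the `econ_val` filter keeps only the chocolate operation, so this sketch returns fewer rows than Figure 6 lists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (name TEXT, nickname TEXT, phone TEXT);
CREATE TABLE operation (title TEXT, descr TEXT, econ_val INTEGER, coverup TEXT);
CREATE TABLE involved (pers_name TEXT, oper_name TEXT, role TEXT);
INSERT INTO person VALUES
    ('carlo', 'baffo', '123'), ('mike', 'lungo', '456'), ('tony', 'schifezza', '789');
INSERT INTO operation VALUES
    ('snowflake', '', 10000000, 'laundromat'),
    ('chocolate', '', 5000000, 'irish pub'),
    ('caffe', '', 2000000, 'irish pub');
INSERT INTO involved VALUES
    ('carlo', 'snowflake', 'chief'), ('tony', 'snowflake', 'sold'),
    ('mike', 'chocolate', 'chief');

CREATE VIEW nick_cover AS
    SELECT p.nickname, o.coverup
    FROM operation o, involved i, person p
    WHERE p.name = i.pers_name
      AND i.oper_name = o.title
      AND o.econ_val <= 5000000;
""")

def view_rows(conn):
    """Read the view like an ordinary table."""
    return sorted(conn.execute("SELECT nickname, coverup FROM nick_cover"))

before = view_rows(conn)
# The view is not stored: a change to the base tables shows up on the next read.
conn.execute("INSERT INTO involved VALUES ('tony', 'caffe', 'sold')")
after = view_rows(conn)
print(before)
print(after)
```

A user who is only granted access to the view sees nicknames and cover-up names, and never the real names, phone numbers, or economic values in the underlying tables.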

What's next?

Next week's lessons introduce more formally the relational model (and some of its history) and how to design the schema of a database. After that we will dive into the DBMS internals and study how DBMSs are internally architected to achieve all the functionality we discussed. Later on we will study how to guarantee transactional behavior, and how to scale a DBMS beyond a single node. The last portion of the course is devoted to more esoteric topics from recent advances in database research.
