Unit 1: Introduction: What Is Data?
Ted Codd
Turing Award 1981
Database example for reference
Instances and Schemas
• Similar to types and variables in programming languages
• Logical Schema – the overall logical structure of the database
• Example: The database consists of information about a set of
customers and accounts in a bank and the relationship between
them
• Analogous to type information of a variable in a program
• Physical schema – the overall physical structure of the database
• Instance – the actual content of the database at a particular point in
time
• Analogous to the value of a variable
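The schema/instance distinction can be sketched with Python's built-in sqlite3 module. This is only an illustration: the in-memory database and the sample rows are assumptions made for the example. The table definition plays the role of the type, and the rows present at a given moment play the role of the value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Logical schema: the fixed structure, comparable to a type declaration.
conn.execute(
    "create table instructor "
    "(ID text primary key, name text, dept_name text, salary numeric)"
)
schema = [row[1] for row in conn.execute("pragma table_info(instructor)")]

# Instance: the content at one point in time, comparable to a variable's value.
conn.execute("insert into instructor values ('10101', 'Srinivasan', 'Comp. Sci.', 65000)")
instance_before = conn.execute("select * from instructor").fetchall()

conn.execute("insert into instructor values ('22222', 'Einstein', 'Physics', 95000)")
instance_after = conn.execute("select * from instructor").fetchall()

# The instance changed; the schema did not.
schema_after = [row[1] for row in conn.execute("pragma table_info(instructor)")]
```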
Database Architecture
Transaction management (Centralized/Shared-Memory)
• A transaction is a collection of operations that performs a single logical function in a
database application
• Transaction-management component ensures that the database remains in a consistent
(correct) state despite system failures (e.g., power failures and operating system crashes)
and transaction failures.
• Concurrency-control manager controls the interaction among the concurrent transactions,
to ensure the consistency of the database.
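A minimal sketch of a transaction as a single logical unit, using Python's sqlite3 module. The account table, the balances, and the transfer helper are hypothetical names introduced for this example. A funds transfer either applies both updates or neither, so a transaction failure leaves the database consistent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table account (id text primary key, balance integer check (balance >= 0))"
)
conn.executemany("insert into account values (?, ?)", [("A", 100), ("B", 50)])
conn.commit()


def transfer(src, dst, amount):
    """Move funds as one logical unit: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "update account set balance = balance + ? where id = ?", (amount, dst)
            )
            conn.execute(
                "update account set balance = balance - ? where id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer("A", "B", 30)       # succeeds
failed = transfer("A", "B", 500)  # would overdraw A, so the credit to B is undone too
balances = dict(conn.execute("select id, balance from account"))
```

The check constraint stands in for a consistency rule here; it does not model system crashes or concurrent transactions.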
Database Users
Database Applications
• Two-tier architecture -- the application resides at the client machine, where it invokes
database system functionality at the server machine
• Three-tier architecture -- the client machine acts as a front end and does not contain any
direct database calls.
• The client end communicates with an application server, usually through a forms
interface.
• The application server in turn communicates with a database system to access data.
• Schema definition
• Storage structure and access-method definition
• Schema and physical-organization modification
• Granting of authorization for data access
• Routine maintenance
• Periodically backing up the database
• Ensuring that enough free disk space is available for normal operations, and upgrading
disk space as required
• Monitoring jobs running on the database
History of Database Systems
• 1950s and early 1960s:
• Data processing using magnetic tapes for storage
• Tapes provided only sequential access
• Punched cards for input
• Late 1960s and 1970s:
• Hard disks allowed direct access to data
• Network and hierarchical data models in widespread use
• Ted Codd defines the relational data model
• Would win the ACM Turing Award for this work
• IBM Research begins System R prototype
• UC Berkeley (Michael Stonebraker) begins Ingres prototype
• Oracle releases first commercial relational database
• High-performance (for the era) transaction processing
History of Database Systems (Cont.)
• 2000s
• Big data storage systems
• Google BigTable, Yahoo PNuts, Amazon,
• “NoSQL” systems.
• Big data analysis: beyond SQL
• Map reduce and friends
• 2010s
• SQL reloaded
• SQL front end to Map Reduce systems
• Massively parallel database systems
• Multi-core main-memory databases
• The Explosive Growth of Data: from terabytes to petabytes
• Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
[Figure labels: Pattern Evaluation; Data Mining]
Data Mining in Business Intelligence
[Figure labels: Increasing potential to support business decisions; Decision Making; End User]
KDD Process: A Typical View from ML and Statistics
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
Example of an Instructor Relation
[Figure: the instructor table, with labels marking attributes (or columns) and tuples]
Outline
• Structure of Relational Databases
• Database Schema
• Keys
• Schema Diagrams
• Relational Query Languages
• The Relational Algebra
• Overview of The SQL Query Language
• SQL Data Definition
• Basic Query Structure of SQL Queries
• Additional Basic Operations
• Set Operations
• Null Values
• Aggregate Functions
• Nested Subqueries
• Modification of the Database
Relation Schema and Instance
• A1, A2, …, An are attributes
• R = (A1, A2, …, An ) is a relation schema
Example:
instructor = (ID, name, dept_name, salary)
• A relation instance r defined over schema R is denoted by r (R).
• The current values of a relation are specified by a table
• An element t of relation r is called a tuple and is represented by a row in a table
Attributes
• The set of allowed values for each attribute is called the domain of the attribute
• Attribute values are (normally) required to be atomic; that is, indivisible
• The special value null is a member of every domain. It indicates that the value is “unknown”
• The null value causes complications in the definition of many operations
Database Schema
• Database schema -- is the logical structure of the database.
• Database instance -- is a snapshot of the data in the database at a given instant in time.
• Example:
• schema: instructor (ID, name, dept_name, salary)
• Instance:
• Example: instructor relation with unordered tuples
• Example: {ID} and {ID,name} are both superkeys of instructor.
• Which one?
• Foreign key constraint: Value in one relation must appear in another
• Referencing relation
• Referenced relation
• Example: dept_name in instructor is a foreign key from instructor
referencing department
Schema Diagram for University Database
Relational Query Languages
Relational Algebra
• The select operation selects tuples that satisfy a given predicate.
• Notation: σ_p (r)
• p is called the selection predicate
• Example: select those tuples of the instructor relation where the instructor is in the “Physics” department.
• Query
σ_{dept_name=“Physics”} (instructor)
• The project operation: a unary operation that returns its argument relation, with certain attributes left out.
• Notation:
Π_{A1, A2, A3, …, Ak} (r)
where A1, A2, …, Ak are attribute names and r is a relation name.
• The result is defined as the relation of k columns obtained by erasing the columns that are not listed
• Duplicate rows removed from result, since relations are sets
• Instead of giving the name of a relation as the argument of the projection operation, we give an expression that evaluates to a relation.
• The join operation allows us to combine a select operation and a Cartesian-Product operation into a single operation.
• Consider relations r (R) and s (S)
• Let “theta” be a predicate on attributes in the schema R “union” S. The join operation r ⋈_θ s is defined as follows:
r ⋈_θ s = σ_θ (r × s)
• Result of:
Π_{course_id} (σ_{semester=“Fall” ∧ year=2017} (section))
Π_{course_id} (σ_{semester=“Spring” ∧ year=2018} (section))
• With the assignment operation, a query can be written as a sequential program consisting of a series of assignments followed by an expression whose value is displayed as the result of the query.
• Query 1
σ_{dept_name=“Physics”} (instructor ⋈_{instructor.ID = teaches.ID} teaches)
• Query 2
(σ_{dept_name=“Physics”} (instructor)) ⋈_{instructor.ID = teaches.ID} teaches
• The two queries are not identical; they are, however, equivalent -- they give the same result on any database.
• View definition -- The DDL includes commands for defining views.
• Transaction control – includes commands for specifying the beginning and ending of transactions.
• Embedded SQL and dynamic SQL -- define how SQL statements can be embedded within general-purpose programming languages.
• Authorization – includes commands for specifying access rights to relations and views.
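The equivalence of the two Physics queries can be checked on a small instance with Python's sqlite3 module. This is a sketch under assumptions: the sample rows are invented, and each relational-algebra expression is transcribed into SQL by hand. Agreement on one instance illustrates the claim but does not prove it for every database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table instructor (ID text, name text, dept_name text);
create table teaches (ID text, course_id text);
insert into instructor values ('1', 'Ann', 'Physics'), ('2', 'Bo', 'Music');
insert into teaches values ('1', 'PHY-101'), ('2', 'MU-199'), ('1', 'PHY-201');
""")

# Query 1: join first, then select the Physics tuples.
q1 = conn.execute("""
    select instructor.ID, name, course_id
    from instructor join teaches on instructor.ID = teaches.ID
    where dept_name = 'Physics'
    order by course_id""").fetchall()

# Query 2: select the Physics instructors first, then join.
q2 = conn.execute("""
    select p.ID, name, course_id
    from (select * from instructor where dept_name = 'Physics') as p
         join teaches on p.ID = teaches.ID
    order by course_id""").fetchall()
```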
Data Definition Language
Create Table Construct
• Find the supervisor of “Bob”
• Find the supervisor of the supervisor of “Bob”
• Can you find ALL the supervisors (direct and indirect) of “Bob”?
• SQL includes a string-matching operator for comparisons on character strings. The operator like uses patterns that are described using two special characters:
• percent ( % ). The % character matches any substring.
• underscore ( _ ). The _ character matches any character.
• Find the names of all instructors whose name includes the substring “dar”.
select name
from instructor
where name like '%dar%'
• Match the string “100%”
like '100\%' escape '\'
in the above we use backslash (\) as the escape character.
• SQL supports a variety of string operations such as
• concatenation (using “||”)
• converting from upper to lower case (and vice versa)
• finding string length, extracting substrings, etc.
• List in alphabetic order the names of all instructors
select distinct name
from instructor
order by name
• We may specify desc for descending order or asc for ascending order, for each attribute; ascending order is the default.
• Example: order by name desc
• Can sort on multiple attributes
• Example: order by dept_name, name
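The pattern-matching and ordering rules above can be tried out with Python's sqlite3 module. This is a sketch with assumed data: the promo table and all sample rows are hypothetical. Note that SQLite's like ignores ASCII case by default, which other systems may not do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table instructor (name text, dept_name text);
insert into instructor values
  ('Sundar', 'Music'), ('Darwin', 'Biology'), ('Kim', 'Physics');
create table promo (label text);
insert into promo values ('100%'), ('1000'), ('100 percent');
""")

# % matches any substring, so both names containing "dar" qualify.
dar = conn.execute(
    "select name from instructor where name like '%dar%' order by name"
).fetchall()

# Backslash is declared as the escape character, so % is matched literally.
literal = conn.execute(
    r"select label from promo where label like '100\%' escape '\'"
).fetchall()

# Sort on multiple attributes; the second one is descending.
ordered = conn.execute(
    "select dept_name, name from instructor order by dept_name, name desc"
).fetchall()
```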
Where Clause Predicates
Set Operations
• Find courses that ran in Fall 2017 or in Spring 2018
(select course_id from section where semester = 'Fall' and year = 2017)
union
(select course_id from section where semester = 'Spring' and year = 2018)
• Find courses that ran in Fall 2017 and in Spring 2018
(select course_id from section where semester = 'Fall' and year = 2017)
intersect
(select course_id from section where semester = 'Spring' and year = 2018)
• Find courses that ran in Fall 2017 but not in Spring 2018
(select course_id from section where semester = 'Fall' and year = 2017)
except
(select course_id from section where semester = 'Spring' and year = 2018)
Set Operations (Cont.)
Null Values
• It is possible for tuples to have a null value, denoted by null, for some of their attributes
• null signifies an unknown value or that a value does not exist.
• The result of any arithmetic expression involving null is null
• Example: 5 + null returns null
• The predicate is null can be used to check for null values.
• Example: Find all instructors whose salary is null.
select name
from instructor
where salary is null
• The predicate is not null succeeds if the value on which it is applied is not null.
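The three set operations can be compared side by side with Python's sqlite3 module. This is a sketch: the sample sections and the helper name run are assumptions. SQLite rejects parentheses around the operands of a compound select, so they are dropped here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table section (course_id text, semester text, year integer);
insert into section values
  ('CS-101', 'Fall', 2017), ('CS-101', 'Spring', 2018),
  ('CS-347', 'Fall', 2017), ('FIN-201', 'Spring', 2018);
""")

fall = "select course_id from section where semester = 'Fall' and year = 2017"
spring = "select course_id from section where semester = 'Spring' and year = 2018"


def run(op):
    """Combine the two selects with the given set operator."""
    query = f"{fall} {op} {spring} order by course_id"
    return [row[0] for row in conn.execute(query)]


either = run("union")         # ran in Fall 2017 or in Spring 2018
both = run("intersect")       # ran in both semesters
fall_only = run("except")     # ran in Fall 2017 but not in Spring 2018
```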
Null Values (Cont.)
• SQL treats as unknown the result of any comparison involving a null value (other than predicates is null and is not null).
• Example: 5 < null or null <> null or null = null
• The predicate in a where clause can involve Boolean operations (and, or, not); thus the definitions of the Boolean operations need to be extended to deal with the value unknown.
• and : (true and unknown) = unknown,
(false and unknown) = false,
(unknown and unknown) = unknown
• or: (unknown or true) = true,
(unknown or false) = unknown
(unknown or unknown) = unknown
• Result of where clause predicate is treated as false if it evaluates to unknown
Aggregate Functions Examples
• Find the average salary of instructors in the Computer Science department
• select avg (salary)
from instructor
where dept_name= 'Comp. Sci.';
• Find the total number of instructors who teach a course in the Spring 2018 semester
• select count (distinct ID)
from teaches
where semester = 'Spring' and year = 2018;
• Find the number of tuples in the course relation
• select count (*)
from course;
• Find the names and average salaries of all departments whose average salary is greater than 42000
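A sketch of the unknown truth value and of the last aggregate query, using Python's sqlite3 module. The sample rows are assumptions, and the group by / having form shown is one plausible way to write the query rather than the slide's own text. sqlite3 reports unknown as None, false as 0 and true as 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table instructor (name text, dept_name text, salary numeric);
insert into instructor values
  ('Ann', 'Physics', 90000), ('Bo', 'Physics', 80000),
  ('Cy', 'Music', 40000), ('Di', 'Music', null);
""")

# A comparison with null is unknown; and/or follow the extended truth tables.
unknown_cmp, false_and, true_or = conn.execute(
    "select 5 < null, (1 = 0) and null, (1 = 1) or null"
).fetchone()

# Di's null salary makes the predicate unknown, so the row is filtered out.
low_paid = [row[0] for row in conn.execute(
    "select name from instructor where salary < 50000")]

# Departments whose average salary exceeds 42000 (avg skips null salaries).
well_paid = conn.execute("""
    select dept_name, avg(salary)
    from instructor
    group by dept_name
    having avg(salary) > 42000""").fetchall()
```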
• The exists construct returns the value true if the argument subquery is nonempty.
• exists r ⇔ r ≠ Ø
• not exists r ⇔ r = Ø
• Find all students who have taken all courses offered in the Biology department.
select distinct S.ID, S.name
from student as S
where not exists ( (select course_id
from course
where dept_name = 'Biology')
except
(select T.course_id
from takes as T
where S.ID = T.ID));
Subqueries in the From Clause
With Clause
• The with clause provides a way of defining a temporary relation whose definition is available only to the query in which the with clause occurs.
• Find all departments with the maximum budget
with max_budget (value) as
(select max(budget)
from department)
select department.dept_name
from department, max_budget
where department.budget = max_budget.value;
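The with-clause query can be run as written against a small department table using Python's sqlite3 module. This is a sketch: the three sample departments are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table department (dept_name text, building text, budget numeric);
insert into department values
  ('Physics', 'Watson', 70000),
  ('Finance', 'Painter', 120000),
  ('Music', 'Packard', 80000);
""")

# max_budget exists only for the duration of this one query.
top = conn.execute("""
    with max_budget (value) as
        (select max(budget) from department)
    select department.dept_name
    from department, max_budget
    where department.budget = max_budget.value""").fetchall()
```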
Complex Queries using With Clause
• Find all departments where the total salary is greater than the average of the total salary at all departments
Scalar Subquery
• Scalar subquery is one which is used where a single value is expected
• List all departments along with the number of instructors in each department
select dept_name,
( select count(*)
from instructor
where department.dept_name = instructor.dept_name)
as num_instructors
from department;
• Runtime error if subquery returns more than one result tuple
Modification of the Database
• Deletion of tuples from a given relation.
• Insertion of new tuples into a given relation
• Updating of values in some tuples in a given relation
Deletion
• Delete all instructors
delete from instructor
• Delete all instructors from the Finance department
delete from instructor
where dept_name= 'Finance';
• Delete all tuples in the instructor relation for those instructors associated with a department located in the Watson building.
delete from instructor
where dept_name in (select dept_name
from department
where building = 'Watson');
Deletion (Cont.)
• Delete all instructors whose salary is less than the average salary of instructors
delete from instructor
where salary < (select avg (salary)
from instructor);
• Problem: as we delete tuples from instructor, the average salary changes
• Solution used in SQL:
1. First, compute avg (salary) and find all tuples to delete
2. Next, delete all tuples found above (without recomputing avg or retesting the tuples)
Insertion (Cont.)
• Make each student in the Music department who has earned more than 144 credit hours an instructor in the Music department with a salary of $18,000.
insert into instructor
select ID, name, dept_name, 18000
from student
where dept_name = 'Music' and tot_cred > 144;
• The select from where statement is evaluated fully before any of its results are inserted into the relation. Otherwise queries like
insert into table1 select * from table1
would cause problems
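The two-step evaluation of the delete can be observed with Python's sqlite3 module. This is a sketch with assumed salaries, chosen so that recomputing the average midway would change the outcome.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table instructor (name text, salary numeric);
insert into instructor values ('Ann', 40000), ('Bo', 60000), ('Cy', 80000);
""")

# avg(salary) is 60000 when the statement starts, so only Ann qualifies.
# If the average were recomputed after Ann is removed it would rise to
# 70000 and Bo would be deleted as well.
conn.execute(
    "delete from instructor where salary < (select avg(salary) from instructor)"
)
remaining = [row[0] for row in conn.execute(
    "select name from instructor order by name")]
```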
Insertion
• or equivalently
insert into course (course_id, title, dept_name, credits)
values ('CS-437', 'Database Systems', 'Comp. Sci.', 4);
• Add a new tuple to student with tot_creds set to null
insert into student
values ('3003', 'Green', 'Finance', null);
Updates
• Give a 5% salary raise to those instructors who earn less than 70000
update instructor
set salary = salary * 1.05
where salary < 70000;
• Give a 5% salary raise to instructors whose salary is less than average
update instructor
set salary = salary * 1.05
where salary < (select avg (salary)
from instructor);
Updates (Cont.)
• Increase salaries of instructors whose salary is over $100,000 by 3%, and all others by 5%
• Write two update statements:
update instructor
set salary = salary * 1.03
where salary > 100000;
update instructor
set salary = salary * 1.05
where salary <= 100000;
• The order is important
• Can be done better using the case statement (next slide)
Updates with Scalar Subqueries
• Recompute and update tot_creds value for all students
update student S
set tot_cred = (select sum(credits)
from takes, course
where takes.course_id = course.course_id and
S.ID = takes.ID and
takes.grade <> 'F' and
takes.grade is not null);
• Sets tot_creds to null for students who have not taken any course
• Instead of sum(credits), use:
case
when sum(credits) is not null then sum(credits)
else 0
end
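For the 3% / 5% raise described earlier, a single update with a case expression avoids the ordering problem of the two-statement version. The sketch below uses Python's sqlite3 module; the two sample salaries are assumptions, and integer arithmetic (salary * 103 / 100) stands in for multiplying by 1.03 so that the results are exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table instructor (name text, salary integer);
insert into instructor values ('Ann', 98000), ('Bo', 120000);
""")

# One pass over the table: each row is tested against its original salary,
# so nobody can receive both raises.
conn.execute("""
    update instructor
    set salary = case
                   when salary > 100000 then salary * 103 / 100
                   else salary * 105 / 100
                 end""")
raised = dict(conn.execute("select name, salary from instructor"))
```

With the two-statement version run in the wrong order, Ann's 5% raise would lift her above 100000 and she would then receive the 3% raise as well.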
Entity-Relationship Model
Design Phases
• The initial phase of database design is to characterize fully the data needs of the prospective database users.
• Next, the designer chooses a data model and, by applying the concepts of the chosen data model, translates these requirements into a conceptual schema of the database.
• A fully developed conceptual schema also indicates the functional requirements of the enterprise. In a “specification of functional requirements”, users describe the kinds of operations (or transactions) that will be performed on the data.
• Logical Design – Deciding on the database schema. Database design requires that we find a “good” collection of relation schemas.
• Business decision – What attributes should we record in the database?
• Computer Science decision – What relation schemas should we have and how should the attributes be distributed among the various relation schemas?
• Physical Design – Deciding on the physical layout of the database
Design Approaches
• Entity Relationship Model
• Models an enterprise as a collection of entities and relationships
• Entity: a “thing” or “object” in the enterprise that is distinguishable from other objects
• Described by a set of attributes
• Relationship: an association among several entities
• Represented diagrammatically by an entity-relationship diagram:
• Normalization Theory
• Formalize what designs are bad, and test for them
Outline of the ER Model
ER model -- Database Modeling
Entity Sets -- instructor and student
[Figure labels: instructor_ID, instructor_name, student-ID, student_name]
Note: Some elements in A and B may not be mapped to any elements in the other set
Composite Attributes
Mapping Cardinalities
E-R Diagrams
Entity Sets
Entities can be represented graphically as follows:
• Rectangles represent entity sets.
• Attributes listed inside entity rectangle
• Underline indicates primary key attributes
Relationship Sets with Attributes
Roles
• Entity sets of a relationship need not be distinct
• Each occurrence of an entity set plays a “role” in the relationship
• The labels “course_id” and “prereq_id” are called roles.
One-to-Many Relationship
• one-to-many relationship between an instructor and a student
• an instructor is associated with several (including 0) students via advisor
• a student is associated with at most one instructor via advisor
Cardinality Constraints
• We express cardinality constraints by drawing either a directed line (→), signifying “one,” or an undirected line (—), signifying “many,” between the relationship set and the entity set.
• One-to-one relationship between an instructor and a student:
• A student is associated with at most one instructor via the relationship advisor
• A student is associated with at most one department via stud_dept
Many-to-One Relationships
• In a many-to-one relationship between an instructor and a student,
• an instructor is associated with at most one student via advisor,
• and a student is associated with several (including 0) instructors via advisor
Many-to-Many Relationship
• An instructor is associated with several (possibly 0) students via advisor
• A student is associated with several (possibly 0) instructors via advisor
Notation for Expressing More Complex Constraints
A line may have an associated minimum and maximum cardinality, shown in the form l..h, where l is the minimum and h the maximum cardinality
A minimum value of 1 indicates total participation.
A maximum value of 1 indicates that the entity participates in at most one relationship
A maximum value of * indicates no limit.
Total and Partial Participation
Notation to Express Entity with Complex Attributes
• Most relationship sets are binary
• There are occasions when it is more convenient to represent relationships as non-binary.
• E-R Diagram with a Ternary Relationship
Specialization
• Top-down design process; we designate sub-groupings within an entity set that are distinctive from other entities in the set.
• These sub-groupings become lower-level entity sets that have attributes or participate in relationships that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (e.g., instructor “is a” person).
• Attribute inheritance – a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.
Generalization
• A bottom-up design process – combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• The terms specialization and generalization are used interchangeably.
• Method 1:
• Form a schema for the higher-level entity
• Form a schema for each lower-level entity set, include primary key of higher-level entity set and local attributes
schema      attributes
person      ID, name, street, city
student     ID, tot_cred
employee    ID, salary
Aggregation (Cont.)
• Relationship sets eval_for and proj_guide represent overlapping information
• Every eval_for relationship corresponds to a proj_guide relationship
• However, some proj_guide relationships may not correspond to any eval_for relationships
• So we can’t discard the proj_guide relationship
To represent aggregation, create a schema containing
Primary key of the aggregated relationship,
The primary key of the associated entity set
Any descriptive attributes
In our example:
The schema eval_for is:
eval_for (s_ID, project_id, i_ID, evaluation_id)
Possible guideline is to designate a relationship set to describe an action that occurs between entities
Alternative ER Notations
• Chen, IDEF1X, …
UML
• UML: Unified Modeling Language
• UML has many components to graphically
model different aspects of an entire software
system
• UML Class Diagrams correspond to E-R Diagrams,
but with several differences.
ER vs. UML Class Diagrams
UML Class Diagrams (Cont.)
• Binary relationship sets are represented in UML
by just drawing a line connecting the entity
sets. The relationship set name is written
adjacent to the line.
• The role played by an entity set in a relationship
set may also be specified by writing the role
name on the line, adjacent to the entity set.
• The relationship set name may alternatively be
written in a box, along with attributes of the
relationship set, and the box is connected, using
a dotted line, to the line depicting the
relationship set.
*Note reversal of position in cardinality constraint depiction
End of Chapter 7
Unit 3: Intermediate SQL
• Join operations take two relations and return as a result another relation.
• A join operation is a Cartesian product which requires that tuples in the two relations match (under some condition). It also specifies the attributes that are present in the result of the join
• The join operations are typically used as subquery expressions in the from clause
• Three types of joins:
• Natural join
• Inner join
• Outer join
• The from clause can have multiple relations combined using natural
join:
select A1, A2, … An
from r1 natural join r2 natural join .. natural join rn
where P ;
• course full outer join prereq using (course_id)
• A view is defined with a statement of the form
create view v as <query expression>
where <query expression> is any legal SQL expression. The view name is represented by v.
• Once a view is defined, the view name can be used to refer to the virtual relation that the view generates.
• View definition is not the same as creating a new relation by evaluating the query expression
• Rather, a view definition causes the saving of an expression; the expression is substituted into queries using the view.
View Definition and Use
• A view of instructors without their salary
create view faculty as
select ID, name, dept_name
from instructor
• Find all instructors in the Biology department
select name
from faculty
where dept_name = 'Biology'
• Create a view of department salary totals
create view departments_total_salary(dept_name, total_salary) as
select dept_name, sum (salary)
from instructor
group by dept_name;
Views Defined Using Other Views
• create view physics_fall_2017 as
select course.course_id, sec_id, building, room_number
from course, section
where course.course_id = section.course_id
and course.dept_name = 'Physics'
and section.semester = 'Fall'
and section.year = 2017;
• create view physics_fall_2017_watson as
select course_id, room_number
from physics_fall_2017
where building= 'Watson';
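That a view saves an expression rather than a result can be seen with Python's sqlite3 module. This is a sketch: the two sample instructors are assumptions. Rows added to instructor after the view is created still appear through faculty, and the salary column stays hidden.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table instructor (ID text, name text, dept_name text, salary numeric);
insert into instructor values ('1', 'Ann', 'Biology', 72000);
create view faculty as
    select ID, name, dept_name from instructor;
""")
before = conn.execute(
    "select name from faculty where dept_name = 'Biology'").fetchall()

# The view stores the query, not its result, so later changes to
# instructor show up without redefining the view.
conn.execute("insert into instructor values ('2', 'Bo', 'Biology', 65000)")
after = conn.execute(
    "select name from faculty where dept_name = 'Biology' order by name"
).fetchall()

# Column names exposed by the view; salary is not among them.
columns = [d[0] for d in conn.execute("select * from faculty").description]
```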
• As long as the view definitions are not recursive, this loop will terminate
('30765', 'Green', 'Music', null)
into the instructor relation
• Most SQL implementations allow updates only on simple views
• The from clause has only one database relation.
• The select clause contains only attribute names of the relation, and does not have any expressions, aggregates, or distinct specification.
• Any attribute not listed in the select clause can be set to null
• The query does not have a group by or having clause.
• Integrity constraints guard against accidental damage to the database, by ensuring that authorized changes to the database do not result in a loss of data consistency.
• A checking account must have a balance greater than $10,000.00
• A salary of a bank employee must be at least $4.00 an hour
• A customer must have a (non-null) phone number
Constraints on a Single Relation
• not null
• Declare name and budget to be not null
name varchar(20) not null
budget numeric(12,2) not null
Unique Constraints
• The check (P) clause specifies a predicate P that must be satisfied by every tuple in a relation.
• Example: ensure that semester is one of fall, winter, spring or summer
create table section
(course_id varchar (8),
sec_id varchar (8),
semester varchar (6),
year numeric (4,0),
building varchar (15),
room_number varchar (7),
time_slot_id varchar (4),
primary key (course_id, sec_id, semester, year),
check (semester in ('Fall', 'Winter', 'Spring', 'Summer')))
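The effect of the check clause can be demonstrated with Python's sqlite3 module. This is a sketch on a shortened version of the section table; the inserted rows are assumptions. A tuple that violates the predicate is rejected and the relation is left unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table section (
        course_id varchar(8),
        sec_id varchar(8),
        semester varchar(6),
        year numeric(4,0),
        primary key (course_id, sec_id, semester, year),
        check (semester in ('Fall', 'Winter', 'Spring', 'Summer')))""")

conn.execute("insert into section values ('CS-101', '1', 'Fall', 2017)")
try:
    conn.execute("insert into section values ('CS-101', '1', 'Autumn', 2017)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # 'Autumn' fails the check predicate

count = conn.execute("select count(*) from section").fetchone()[0]
```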
Referential Integrity
• Ensures that a value that appears in one relation for a given set of attributes also appears for a certain set of attributes in another relation.
• Example: If “Biology” is a department name appearing in one of the tuples in the instructor relation, then there exists a tuple in the department relation for “Biology”.
• Let A be a set of attributes. Let R and S be two relations that contain attributes A and where A is the primary key of S. A is said to be a foreign key of R if for any values of A appearing in R these values also appear in S.
Cascading Actions in Referential Integrity
• When a referential-integrity constraint is violated, the normal procedure is to reject the action that caused the violation.
• An alternative, in case of delete or update is to cascade
create table course (
(…
dept_name varchar(20),
foreign key (dept_name) references department
on delete cascade
on update cascade,
. . .)
• Instead of cascade we can use :
• set null,
• set default
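Cascading actions can be tried with Python's sqlite3 module. This is a sketch: the tables are cut down to the columns that matter, and the sample rows are assumptions. SQLite enforces foreign keys only after pragma foreign_keys = on, which is specific to SQLite and not part of the SQL shown in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite-specific switch
conn.executescript("""
create table department (dept_name varchar(20) primary key);
create table course (
    course_id varchar(8) primary key,
    dept_name varchar(20),
    foreign key (dept_name) references department
        on delete cascade
        on update cascade);
insert into department values ('Biology'), ('Music');
insert into course values ('BIO-101', 'Biology'), ('MU-199', 'Music');
""")

# Renaming a department is propagated to the courses that reference it;
# deleting a department deletes its courses.
conn.execute("update department set dept_name = 'Life Sci.' where dept_name = 'Biology'")
conn.execute("delete from department where dept_name = 'Music'")
courses = conn.execute("select course_id, dept_name from course").fetchall()

# Without a matching department the normal procedure applies: reject.
try:
    conn.execute("insert into course values ('PHY-101', 'Physics')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```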
• An assertion is a predicate expressing a condition that we wish the database always to satisfy.
• The following constraints can be expressed using assertions:
• For each tuple in the student relation, the value of the attribute tot_cred must equal the sum of credits of courses that the student has completed successfully.
• An instructor cannot teach in two different classrooms in a semester in the same time slot
• An assertion in SQL takes the form:
create assertion <assertion-name> check (<predicate>);
• Large objects (photos, videos, CAD files, etc.) are stored as a large object:
• blob: binary large object -- object is a large collection of uninterpreted binary data (whose interpretation is left to an application outside of the database system)
• clob: character large object -- object is a large collection of character data
• When a query returns a large object, a pointer is returned rather than the large object itself.
User-Defined Types
Domains
• create domain construct in SQL-92 creates user-defined domain types
create domain person_name char(20) not null
• Types and domains are similar. Domains can have constraints, such as not null, specified on them.
• Example:
create domain degree_level varchar(10)
constraint degree_level_test
check (value in ('Bachelors', 'Masters', 'Doctorate'));
Index Creation
Index Creation Example
• create table student
(ID varchar (5),
name varchar (20) not null,
dept_name varchar (20),
tot_cred numeric (3,0) default 0,
primary key (ID))
• create index studentID_index on student(ID)
• The query:
select *
from student
where ID = '12345'
can be executed by using the index to find the required record, without looking at all records of student
Authorization
• We may assign a user several forms of authorizations on parts of the database.
• Read - allows reading, but not modification of data.
• Insert - allows insertion of new data, but not modification of existing data.
• Update - allows modification, but not deletion of data.
• Delete - allows deletion of data.
• Each of these types of authorizations is called a privilege. We may authorize the user all, none, or a combination of these types of privileges on specified parts of a database, such as a relation or a view.
• Forms of authorization to modify the database schema
• Index - allows creation and deletion of indices.
• Resources - allows creation of new relations.
• Alteration - allows addition or deletion of attributes in a relation.
• Drop - allows deletion of relations.
Authorization Specification in SQL
• The grant statement is used to confer authorization
grant <privilege list> on <relation or view > to <user list>
• <user list> is:
• a user-id
• public, which allows all valid users the privilege granted
• A role (more on this later)
• Example:
• grant select on department to Amit, Satoshi
• Granting a privilege on a view does not imply granting any privileges on the underlying relations.
• The grantor of the privilege must already hold the privilege on the specified item (or be the database administrator).
• select: allows read access to relation, or the ability to query using the view
• Example: grant users U1, U2, and U3 select authorization on the instructor relation:
grant select on instructor to U1, U2, U3
• insert: the ability to insert tuples
• update: the ability to update using the SQL update statement
• delete: the ability to delete tuples.
• all privileges: used as a short form for all the allowable privileges
Revoking Authorization in SQL Roles Example