S2 Data Structures and SQL
APIs
www.crisp-dm.org
Data Structures
A data structure is a scheme for organizing data in the memory of a computer. Some of the more commonly used data structures include lists, arrays, stacks, queues, heaps, trees, and graphs.
Binary Tree
Data Structures
The way in which the data is organized affects the performance of a program for different tasks. Computer programmers decide which data structures to use based on the nature of the data and the processes that need to be performed on that data.
Example: A Queue
A queue is an example of a commonly used simple data structure. A queue has a beginning and an end, called the front and back of the queue. Data enters the queue at one end and leaves at the other. Because of this, data exits the queue in the same order in which it enters the queue, like people in a checkout line at a supermarket.
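The checkout-line behaviour described above can be sketched in a few lines of Python. This is only an illustration: the shopper names are made up, and `collections.deque` is one of several ways to build a queue.

```python
from collections import deque

# A queue: items join at the back and leave from the front (first in, first out).
checkout_line = deque()
for shopper in ["Ann", "Bob", "Cy"]:   # hypothetical names
    checkout_line.append(shopper)      # enter at the back of the queue

served = []
while checkout_line:
    served.append(checkout_line.popleft())  # leave from the front of the queue

print(served)  # shoppers are served in the order they arrived
```

A deque is used rather than a plain list because removing from the front of a list requires shifting every remaining element.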
Lecture Outline
See the Access Database File (Lab1_Database_BBB.accdb) for this lecture on the course website!
Relational Databases
    Tables
    Columns
    Rows
Queries
    Query Design View
    Structured Query Language (SQL View)
        Selecting rows meeting a condition
        Algebraic operations
        Text matching (with LIKE)
        Computing summary figures
        Joining multiple tables
Databases: Overview
Overview of what databases are, and what a database management system is.
Present key concepts and vocabulary.
Demonstrate Microsoft Access and the BBB database.
BBB: Bookbinders Book Club
We are using a reduced version of a real database for class discussion
A full version exists
Rationale
Why do we have databases?
Because data are valuable and essential for the conduct of business
Because data are created in huge volumes, and need to be retained (on disk) and efficiently accessed
Our focus will be on extracting data from databases in useful ways, a.k.a. querying.
Relational Databases
Relational databases
Databases that use a series of logically related two-dimensional tables to store their information
Tables are composed of fields and records, which in turn contain field values
[Figure: the Student table, with columns Last Name, SS#, DOB, and Major, annotated to show a table, a field, a record, and a field value]
Relational Databases
Relational Database
    Tables
        Records
            Fields
                Field values
                    Bytes, bits
[Figure: the Student table, annotated to show a table, a field, a record, and a field value. Column values as listed on the slide:]
Last Name: Smith, Kim, Davis, Pat
SS#: 100201122, 200202222, 300201232, 999132212
DOB: 06/11/84, 1/1/85, 12/31/81, 3/3/88
Major: IS, FIN, MKT, ACC
[Table from the slide: a single unnormalized orders table. Labeled columns: Order#, Date, ISBN, Book Name, Author, Price. The customer ID and address columns carry no surviving header, and the Price values and some addresses did not survive extraction.]
Order# | Date | (customer ID) | (address) | ISBN | Book Name | Author
1 | 9/1/03 | C1001 | | #0465039138 | Code and other laws of cyberspace | Lessig, Lawrence
2 | 9/2/03 | C1004 | | #1573928895 | Digital Copyright: Protecting Intellectual Property on the Internet | Litman, Jessica
3 | 9/3/03 | C1002 | Tisch LC-12, New York | #0072952849 | MIS in the Information Age | Haag, Stephen
4 | 9/4/03 | C1003 | Microsoft Corporation, Redmond | #0738206679 | Linked: The New Science of Networks | Barabasi, Albert-Laszlo
5 | 9/5/03 | C1003 | Microsoft Corporation, Redmond | #0738206083 | Smart Mobs: The Next Social Revolution | Rheingold, Howard
6 | 9/6/03 | C1001 | 1 Amazon Plaza | #0738206083 | Smart Mobs: The Next Social Revolution | Rheingold, Howard
7 | 9/7/03 | C1002 | Tisch LC-12, New York | #1573928895 | Digital Copyright: Protecting Intellectual Property on the Internet | Litman, Jessica
8 | 9/8/03 | C1001 | 1 Amazon Plaza | #0738206083 | Smart Mobs: The Next Social Revolution | Rheingold, Howard
Insertion anomalies
Deletion anomalies
The loss of a piece of information about one object when a piece of information about a different object is deleted
Example: Deleting an order => deleting a customer/book
Update anomalies
A need to change the same piece of information about an object multiple times
Example: Changing Bill Gates's address
Benefits of Normalization
Greater overall database organization
Minimize data redundancies
Data consistency within the database
A more flexible database design
Data can be used more productively
A better handle on database security
Disadvantages of Normalization
Reduced database performance, because the database must locate the requested tables and join their data, which requires additional processing logic
A lot of planning goes into the design of a database
Primary keys
A field (or group of fields in some cases) that uniquely identifies each record in a table
Examples: Customer ID, ISBN, Order#
Foreign keys
A field that is a primary key in one table and appears in a different table (though not as the primary key)
Examples: Customer ID in Orders
Integrity constraints
Rules (most built in) that help ensure the quality of the information
Example: Not NULL
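Primary keys, foreign keys, and a Not NULL rule can be tried out with a short Python sketch. It uses SQLite (via the standard `sqlite3` module) rather than Microsoft Access, and the table layout, column names, and sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

# customer_id is the primary key of customer and a foreign key in orders.
conn.execute("""
    CREATE TABLE customer (
        customer_id TEXT PRIMARY KEY,
        name        TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE orders (
        order_no    INTEGER PRIMARY KEY,
        customer_id TEXT NOT NULL REFERENCES customer(customer_id)
    )""")

conn.execute("INSERT INTO customer VALUES ('C1001', 'Example Customer')")
conn.execute("INSERT INTO orders VALUES (1, 'C1001')")

def rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

duplicate_key = rejected("INSERT INTO customer VALUES ('C1001', 'Someone Else')")
missing_name = rejected("INSERT INTO customer VALUES ('C1002', NULL)")
unknown_customer = rejected("INSERT INTO orders VALUES (2, 'C9999')")
print(duplicate_key, missing_name, unknown_customer)
```

Each of the three rejected statements breaks one rule: a repeated primary key, a NULL in a Not NULL column, and an order that refers to a customer who does not exist.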
ACCDB files
Access 2007/2010 files end in .accdb
Previous versions ended in .mdb
One database = one .accdb file
Access handles one database file at a time.
Each database consists of a number of tables.
In other databases, you might find one file per table, rather than just one file per database.
You will notice a temporary .laccdb file created whenever you open your Access database: this is a lock file that prevents problems when there is concurrent access by multiple users.
The database consists of tables, each table having many rows (a.k.a. records or tuples). The 'Relationships View' of the database shows the tables and their attributes (columns), and the cross-references between tables.
Structure vs Content of a Database Table
To view or edit the structure of a table:
Right-click it, then choose Design View or Table Design
Structure of a Table
Design View of Customers Table (ACCTNUM is Key)
Contents of a Database Table
If in Design View, and you want to view or edit the contents of the table:
Choose Datasheet View from the Design tab
Indexes to improve performance
To improve the performance of your database queries, you should create indexes on any columns that are regularly looked up. Creating an index is easy: simply indicate Indexed as Yes in the settings for the column in Design view.
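Outside the Access Design view, an index is usually created with a CREATE INDEX statement. The sketch below uses SQLite from Python as a stand-in for Access; the index name `idx_customer_state` and the table layout are assumptions, and `EXPLAIN QUERY PLAN` is SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (acctnum INTEGER PRIMARY KEY, familyname TEXT, state TEXT)"
)

# Index the column that queries regularly look up.
conn.execute("CREATE INDEX idx_customer_state ON customer (state)")

# Ask SQLite how it would run a lookup on the indexed column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT acctnum FROM customer WHERE state = 'PA'"
).fetchall()
plan_text = " ".join(row[-1] for row in plan)  # last field is the readable detail
print(plan_text)
```

The printed plan should mention the index by name, which indicates the lookup no longer has to scan every row.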
Key Points
Tables: hold all data in a database
Rows = records
Columns = attributes
Attributes: have a data type (number, text, date, etc.)
Fields: at the intersection of a row and a column
Primary Keys: attributes whose values uniquely refer to any given row in a table
FIRSTNAME  MIDDLENAME
DENNIS     N
SAMUEL     E
EARL       J
DOROTHY
GERALD     C
JOSEPH     L
ROBERT     A
JAMES      M
3. Pick rows using two or more conditions
4. Selecting rows with attributes you do not display
5. Using string matches to pick rows
6. Renaming an attribute (column) name
7. Arithmetic operations in a query
8. Sorting the output (ordering the rows)
9. Mathematical aggregations in a query
Query Wizard
Press the > button to select columns one at a time. Press the >> button to select all columns.
Choose Detail to retrieve individual rows that meet certain criteria. Choose Summary to compute aggregates (e.g. min, max, average).
Name your query, so that you can save it and view or modify it later.
Select the table you want to query, then press the Add button. Press the Done button when finished.
Double-click the columns you want to include in your query, or drag-and-drop them from the table into the grid view at the bottom.
Double-click * to include them all.
The grid view at the bottom shows you whether the column is visible (tick Show row). It also lets you specify filter conditions, using the Criteria row.
Show tables, so you can add more columns to your query results.
Choose whether you want to:
View data (SELECT)
Insert a table (MAKE)
Insert a row (APPEND)
Edit a row (UPDATE)
Produce aggregate reports (CROSSTAB)
Remove rows of data (DELETE)
Notice that the customer's Last Name is not shown, because we unchecked the Show box for Last Name in the Query Design View in the previous slide.
Notice that here we have all customers whose FAMILYNAME contains 'IL' somewhere in the string
Simply type the formula in the Field row. No loops or VBA code are necessary.
Other aggregate functions include Min, Max, and Count. See the Query Design Wizard.
We can see the average Money for all customers in the data set
Solution: Do not show the non-aggregate filter condition column. That will fix the error.
SQL
General query framework
SQL is the language underlying Microsoft Access queries, and is the standard query language used by all major RDBMSs
Editing SQL
Just like editing a recorded VBA macro:
Write a query using Query Design View and then edit it using SQL View
or
Open the SQL View from a new query and write the SQL from scratch
SQL View
Viewing the SQL underlying all your graphical query designs
For any query, you can view the SQL that Microsoft Access wrote for you, by simply choosing View and then SQL View. Editing the SQL directly is quicker and more powerful, so expert users tend to use SQL view, whereas novices tend to use the Graphical Design view.
SQL
Query 1: Picking attributes (columns)
SELECT ACCTNUM, FAMILYNAME, STATE
FROM CUSTOMER

SELECT STATE, ACCTNUM, FAMILYNAME
FROM CUSTOMER
SQL
Query 2: Picking records (rows)
Order of rows in output does not matter, and is not necessarily predictable
SQL
Query 3.1: Picking rows with two conditions - AND
SELECT ACCTNUM, FAMILYNAME, STATE
FROM CUSTOMER
WHERE STATE = 'PA' AND FAMILYNAME = 'MILLER'

SQL is case sensitive within quotes; Microsoft Access is NOT

SELECT acctnum, familyname, state
FROM customer
WHERE state = 'PA' and familyname = 'Miller'

SELECT acctnum, familyname, state
FROM customer
WHERE state = 'PA' AND familyname = 'MILLER'
SQL
Query 3.2: Picking rows with two conditions - AND
Remember that all attributes have a datatype
SELECT ACCTNUM, FAMILYNAME, STATE
FROM CUSTOMER
WHERE STATE = 'PA' AND ACCTNUM = 'MILLER'
Comparing the numeric ACCTNUM with the text 'MILLER' is a datatype mismatch
SQL
Query 3.3: Picking rows with two conditions - OR
SELECT ACCTNUM, FAMILYNAME, STATE
FROM CUSTOMER
WHERE STATE = 'PA' OR FAMILYNAME = 'MILLER'

SELECT ACCTNUM, FAMILYNAME, STATE
FROM CUSTOMER
WHERE STATE = 'PA' AND (FAMILYNAME = 'MILLER' OR FAMILYNAME = 'SMITH')
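To see how AND, OR, and parentheses change which rows come back, here is a small Python sketch that runs the three kinds of condition against SQLite. The four customer rows are invented for illustration and are not from the BBB database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (acctnum INTEGER, familyname TEXT, state TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (1, "MILLER", "PA"),   # hypothetical sample rows
    (2, "MILLER", "NY"),
    (3, "SMITH", "PA"),
    (4, "JONES", "PA"),
])

def accts(where):
    """Return the account numbers of the rows that satisfy the WHERE condition."""
    sql = "SELECT acctnum FROM customer WHERE " + where + " ORDER BY acctnum"
    return [row[0] for row in conn.execute(sql)]

both = accts("state = 'PA' AND familyname = 'MILLER'")
either = accts("state = 'PA' OR familyname = 'MILLER'")
grouped = accts("state = 'PA' AND (familyname = 'MILLER' OR familyname = 'SMITH')")
print(both, either, grouped)
```

AND narrows the result to rows meeting both conditions, OR widens it to rows meeting at least one, and the parentheses make the OR apply only to the family name.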
SQL
Query 4: Picking rows using attributes you do not display
SELECT ACCTNUM, STATE
FROM CUSTOMER
WHERE STATE = 'PA' OR FAMILYNAME = 'MILLER'
SQL
Query 5: Using string matching to pick rows*
SELECT ACCTNUM, FAMILYNAME, STATE
FROM CUSTOMER
WHERE STATE = 'PA' AND FAMILYNAME LIKE '%IL%'

SELECT ACCTNUM, FAMILYNAME, STATE
FROM CUSTOMER
WHERE STATE = 'PA' AND FAMILYNAME LIKE 'IL%'

SELECT ACCTNUM, FAMILYNAME, STATE
FROM CUSTOMER
WHERE STATE = 'PA' AND FAMILYNAME LIKE '_IL'
* Standard SQL notation is shown above. Microsoft Access notation differs: it uses * in place of % and ? in place of _
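The three standard-notation patterns can be compared side by side with a short Python sketch against SQLite, which follows the % and _ convention. The family names are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (familyname TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?)",
    [("MILLER",), ("ILES",), ("GIL",), ("WILSON",), ("SMITH",)],  # hypothetical names
)

def names(pattern):
    """Return the family names matching a LIKE pattern, in alphabetical order."""
    sql = "SELECT familyname FROM customer WHERE familyname LIKE ? ORDER BY familyname"
    return [row[0] for row in conn.execute(sql, (pattern,))]

contains_il = names("%IL%")      # IL anywhere in the name
starts_with_il = names("IL%")    # IL at the start
one_char_then_il = names("_IL")  # exactly one character, then IL
print(contains_il, starts_with_il, one_char_then_il)
```

% stands for any run of characters (including none), while _ stands for exactly one character, so '_IL' only matches three-letter names.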
SQL
Query 6: Renaming an attribute (column) name
SELECT ACCTNUM, FAMILYNAME AS Last_Name, STATE
FROM CUSTOMER
WHERE STATE = 'PA' AND FAMILYNAME LIKE '*IL*'

SELECT ACCTNUM, FAMILYNAME AS [Last Name], STATE
FROM CUSTOMER
WHERE STATE = 'PA' AND FAMILYNAME LIKE '*IL*'

Bracket notation to allow white-space is unique to Microsoft
SQL
Query 7: Arithmetic operations in a query
SELECT ACCTNUM, FAMILYNAME AS Last_Name, MONEY * 0.10 AS Taxed_Expense
FROM CUSTOMER
WHERE FAMILYNAME LIKE '*IL*'

SELECT ACCTNUM, FAMILYNAME AS Last_Name, (MONEY + ACCTNUM)*1.09 AS Random_nonsense
FROM CUSTOMER
WHERE FAMILYNAME LIKE '*IL*'
SQL
Query 8: Sorting the output (ordering the rows)
SELECT ACCTNUM, FAMILYNAME AS Last_Name, MONEY * 1.09 AS Taxed_Expense
FROM CUSTOMER
WHERE FAMILYNAME LIKE '%IL%'
ORDER BY FAMILYNAME ASC

In Microsoft Access, you must use the original column name in your ORDER BY clause.
Use ASC for ascending order or DESC for descending order.

SELECT ACCTNUM, FAMILYNAME AS Last_Name, MONEY * 1.09 AS Taxed_Expense
FROM CUSTOMER
WHERE FAMILYNAME LIKE '%IL%'
ORDER BY Last_Name DESC

If you use the new column name in Access, it won't work (many other database systems do accept it).
SQL
Query 9.1: Mathematical aggregations
Average for all customers
SELECT AVG(MONEY) AS Average_Expense
FROM CUSTOMER
SQL
Query 9.2: Mathematical aggregations
Average for a subset of customers
SQL
Query 9.3: Mathematical aggregations
Average for each group of customers
SELECT Avg(CUSTOMER.[Money]) AS [Average Money], CUSTOMER.State
FROM CUSTOMER
GROUP BY CUSTOMER.State
ORDER BY Avg(CUSTOMER.[Money]) DESC;
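A per-group average like the one above can be reproduced with a small Python sketch against SQLite. The three customer rows and their money values are invented, and the column alias uses an underscore because the bracket notation is specific to Microsoft.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (acctnum INTEGER, state TEXT, money REAL)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (1, "PA", 100.0),   # hypothetical sample rows
    (2, "PA", 200.0),
    (3, "NY", 50.0),
])

# One output row per state, with the average computed inside each group.
by_state = conn.execute("""
    SELECT state, AVG(money) AS average_money
    FROM customer
    GROUP BY state
    ORDER BY AVG(money) DESC
""").fetchall()
print(by_state)
```

Every non-aggregated column in the SELECT list (here, state) also appears in GROUP BY, which is what lets the database decide which rows belong to each average.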
WARNING
Use of * versus % in matching multiple characters in a LIKE expression
Use of ? versus _ in matching a single character in a LIKE expression
Use of [ ] in renaming to allow white space, e.g. CUSTOMER.FAMILYNAME AS [Last Name]
Case sensitivity (Microsoft Access is case insensitive, e.g. State = "PA" vs. State = "pa")
Key Points
Querying is simple
Pick the table
Pick the columns
Pick the rows (using some simple or complex criteria)
Nine (9) simple cases were given as examples.
Be wary of exceptions (e.g. SELECT AVG(MONEY), STATE)
Database Design
Splitting tables

Products
PRODNUM  PRODNAME                   CATNUM
1        ALICE IN WONDERLAND        1
2        PINNOCCHIO                 1
21       SECRETS OF FRENCH COOKING  3
31       CAR MAINTANANCE            4
33       GARDENING                  4

Campaign
CAMPNUM  PRODNUM  PRICE  CAMPDATE   CHANNUM
21       21       15     12/1/1986  13
221      21       15     2/1/1987   3
201      1        15     2/1/1987   4

Category
CATNUM  CATNAME
1       CHILDREN
2       YOUTH
3       COOK
4       DO-IT-YOURSELF
5       REFERENCE
6       ART
7       GEOGRAPHY

Channels
CHANNUM  CHANNAME
3        2 MINUTES SPOT - TV - ESPN (SPORT CHANNEL)
4        2 MINUTES SPOT - TV - MTV (MUSIC CHANNEL)
6        ADVERTISING IN LOCAL NEWSPAPERS
13       "ROCK STARS" - MAGAZINE
Database Design
Splitting tables (cont)

Purchase
ACCTNUM  CAMPNUM  QTY  PURCHDATE
13018    21       1    1/26/1987
19635    21       1    1/9/1987
23361    21       1    1/22/1987
25508    21       1    1/26/1987
27028    21       1    1/20/1987
27259    21       1    1/22/1987
34652    21       1    1/23/1987
39403    21       1    1/18/1987
50670    21       1    1/26/1987
58775    21       1    1/11/1987
17646    221      1    3/8/1987
23088    221      1    3/20/1987
32698    221      1    3/18/1987
44985    221      1    3/16/1987
54690    221      1    3/20/1987
13971    1231     1    1/21/1988
14200    1231     1    1/8/1988
42763    1231     1    1/13/1988
Database Design
Splitting tables
Avoid repetition within individual tables
Separate attributes that are used independently into separate tables
Introduce (foreign) keys to link tables and keep any redundancy minimal/simple
BBB Customers and Purchases
What data do we have about customers?
What data do we have about purchases?
    Who purchased (Acctnum)
    In response to what advertising campaign
    In what quantity
    When did they purchase
Customer
ACCTNUM  FIRSTNAME  STATE  ZIP    GENDER  MONEY
13015    DENNIS     NY     11050  M       164
19635    ROBERT     PA     19148  M       93
23361    HARRIET    DE     19971  F       190

Purchase
ACCTNUM  CAMPNUM  QTY  PURCHDATE
13018    21       1    1/26/1987

Every purchase row placed beside every customer row:
PURC.ACCTNUM  CAMPNUM  QTY  PURCHDATE  CUST.ACCTNUM  FIRSTNAME  STATE  ZIP
13018         21       1    1/26/1987  13015         DENNIS     NY     11050
13018         21       1    1/26/1987  19635         ROBERT     PA     19148
13018         21       1    1/26/1987  23361         HARRIET    DE     19971
19635         21       1    1/9/1987   13015         DENNIS     NY     11050
19635         21       1    1/9/1987   19635         ROBERT     PA     19148
19635         21       1    1/9/1987   23361         HARRIET    DE     19971
23361         21       1    1/22/1987  13015         DENNIS     NY     11050
23361         21       1    1/22/1987  19635         ROBERT     PA     19148
23361         21       1    1/22/1987  23361         HARRIET    DE     19971

Rows where the two account numbers match:
PURC.ACCTNUM  CAMPNUM  QTY  PURCHDATE  CUST.ACCTNUM  FIRSTNAME  STATE  ZIP    GENDER  MONEY
19635         21       1    1/9/1987   19635         ROBERT     PA     19148  M       93
23361         21       1    1/22/1987  23361         HARRIET    DE     19971  F       190
SQL
General JOIN framework: two tables
SELECT <pick attribute(s)>
    in the order you want to see them
    prefix attributes with the table name, e.g. table1.attr1
FROM
WHERE
    You must cross-reference (i.e. join) the related columns in the tables, using table1.columnA = table2.columnA
SQL
Customer responses to campaigns (i.e. customer purchases)
Notice how, unlike VBA, no line-continuation character is used in SQL
SELECT CAMPNUM, QUANTITY, CUSTOMER.ACCTNUM, GENDER
FROM PURCHASE, CUSTOMER
WHERE PURCHASE.ACCTNUM = CUSTOMER.ACCTNUM
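A two-table join of this kind can be run end to end with the Python sketch below, using SQLite instead of Access. It loads three customers and three purchases taken from the slides' sample rows; the quantity column is named QTY here, following the Purchase table listing, which is an assumption about the real column name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (acctnum INTEGER, firstname TEXT, gender TEXT)")
conn.execute("CREATE TABLE purchase (acctnum INTEGER, campnum INTEGER, qty INTEGER)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (13015, "DENNIS", "M"), (19635, "ROBERT", "M"), (23361, "HARRIET", "F"),
])
conn.executemany("INSERT INTO purchase VALUES (?, ?, ?)", [
    (13018, 21, 1), (19635, 21, 1), (23361, 21, 1),
])

# The WHERE clause cross-references the two tables on the shared account number.
joined = conn.execute("""
    SELECT campnum, qty, customer.acctnum, gender
    FROM purchase, customer
    WHERE purchase.acctnum = customer.acctnum
    ORDER BY customer.acctnum
""").fetchall()
print(joined)
```

Only purchases whose account number also appears in the customer table survive the join, so purchase 13018 drops out of this small sample.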
SQL
Customer responses to campaigns (i.e. customer purchases)
If you make a spelling mistake in your query, you'll get an (error) message like the one shown below
SQL
A nonsense query for example purposes
SELECT CAMPNUM, PRICE, ACCTNUM
FROM CAMPAIGN, CUSTOMER
WHERE PRICE > MONEY OR FIRSTNAME LIKE 'ROB%'
Press Shift + Click to select multiple tables, and press the Add button. Press the Close button when done.
Drag and drop your columns into the grid.
In this case there is no need to explicitly say Customer.AcctNum = Purchase.Acctnum, because Microsoft Access is intelligent enough to infer the relationship between the tables automatically.
This result isn't very readable.
Campaign number doesn't mean much: Product name (from the Product table) would be more descriptive.
Account number doesn't mean much: Customer name (from the Customer table) would be more descriptive.
SQL View
Case 1: Key attribute with same name
Microsoft Access generates some pretty nasty looking SQL:
The Customer and Product tables are not directly related, so Microsoft Access is unable to infer the relationship.
To confirm you have the correct number of results, check that you have the same number of rows here as there are in the PURCHASES table. There were 19 purchases, so there should be 19 results here!
Querying more than two tables: Results
What happens if you forget the join / cross-reference?
Notice that you get an impossible number of results if you forget the JOIN in the query, i.e. if you forget to specify which columns cross-reference. There were 19 purchases, so there should be 19 results here!
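The row-count check described above can be tried on a tiny example. The Python sketch below uses SQLite with three customers and three purchases from the slides' sample rows (not the full 19-purchase table) and counts the results with and without the cross-reference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (acctnum INTEGER, firstname TEXT)")
conn.execute("CREATE TABLE purchase (acctnum INTEGER, campnum INTEGER)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(13015, "DENNIS"), (19635, "ROBERT"), (23361, "HARRIET")])
conn.executemany("INSERT INTO purchase VALUES (?, ?)",
                 [(13018, 21), (19635, 21), (23361, 21)])

def count(sql):
    """Run a COUNT(*) query and return the single number it produces."""
    return conn.execute(sql).fetchone()[0]

purchases = count("SELECT COUNT(*) FROM purchase")
no_join = count("SELECT COUNT(*) FROM purchase, customer")  # join condition forgotten
with_join = count("SELECT COUNT(*) FROM purchase, customer "
                  "WHERE purchase.acctnum = customer.acctnum")
print(purchases, no_join, with_join)
```

Without the WHERE clause every purchase is paired with every customer, so the count is the product of the two table sizes. With it, the count can never exceed the number of purchases (here it is lower because one sample purchase has no matching customer).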
To confirm you have the correct number of results, check that you have the same number of rows here as there are in the CAMPAIGN table. There were 3 campaigns, so there should be 3 results here!
Option 2: Combine tables pair-wise
Associative: (A JOIN B) JOIN C = A JOIN (B JOIN C)
Combine in stages. For three or more tables:
A JOIN B
(A JOIN B) JOIN C
((A JOIN B) JOIN C) JOIN D
Key Points
Divide tables for ease in managing data
Query two (or more) tables like one big table
Combine tables: Cartesian Product
Put tables side-by-side: every combination of rows
Look for attributes that link tables (e.g. Acctnum in CUSTOMER and PURCHASES)
Inserting Data
Use the INSERT statement to append new data.
Here are the various parts of the INSERT statement:
INSERT INTO [table-name] ( [column1], [column2], ... )
VALUES ( [value for column1], [value for column2], ... );
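A filled-in INSERT might look like the Python sketch below, which runs against SQLite rather than Access. The customer table layout is an assumption, and the `?` placeholders are how Python's `sqlite3` module passes values safely into the statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (acctnum INTEGER PRIMARY KEY, firstname TEXT, state TEXT)"
)

# INSERT INTO table (columns) VALUES (values); each ? is filled from the tuple.
conn.execute(
    "INSERT INTO customer (acctnum, firstname, state) VALUES (?, ?, ?)",
    (13015, "DENNIS", "NY"),
)
conn.commit()

rows = conn.execute("SELECT acctnum, firstname, state FROM customer").fetchall()
print(rows)
```

Listing the column names explicitly keeps the statement valid even if columns are later added to the table.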
Inserting Data
After you save the INSERT query, it will look like this in the query list on the left
Updating Data
Use the UPDATE statement to edit existing data.
Here are the various parts of the UPDATE statement:
UPDATE [table-name]
SET [column1] = [value], [column2] = [value], ...
WHERE [column] = [value]
DON'T FORGET YOUR WHERE CLAUSE! IF YOU FORGET IT YOU COULD END UP MISTAKENLY UPDATING ALL DATA IN THE TABLE! Access 2007 will give you a warning if you inadvertently update multiple rows, but if you execute the UPDATE statement from Excel 2007, you'll get no such warning!
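The effect of a missing WHERE clause is easy to demonstrate. The Python sketch below uses SQLite with three sample customers; the new state value 'NJ' is invented, and the rollback at the end works only because the unguarded update has not yet been committed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (acctnum INTEGER PRIMARY KEY, firstname TEXT, state TEXT)"
)
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (13015, "DENNIS", "NY"), (19635, "ROBERT", "PA"), (23361, "HARRIET", "DE"),
])
conn.commit()

# With a WHERE clause: only the matching row changes.
targeted = conn.execute(
    "UPDATE customer SET state = 'NJ' WHERE acctnum = 19635"
).rowcount
conn.commit()

# Without a WHERE clause: every row changes.
unguarded = conn.execute("UPDATE customer SET state = 'NJ'").rowcount
conn.rollback()  # undo the unguarded update before it is committed

states = [row[0] for row in conn.execute("SELECT state FROM customer ORDER BY acctnum")]
print(targeted, unguarded, states)
```

Checking the affected-row count before committing is one simple guard when no tool is there to warn you.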
Updating Data
After you save the UPDATE query, it will look like this in the query list on the left
NOTE: It is often bad practice to update data in a database. For example, rather than update a customer's address, you can add an event table, e.g. a table called MOVE with the date the customer moved. That way you always have an audit log of when the customer moved, and you have not lost any data. You might, for instance, later want to count how many times the customer moved in the past 3 years, so you can compute a credit score for them.
Deleting Data
Use the DELETE statement to permanently remove a row of data.
Here are the various parts of the DELETE statement:
DELETE FROM [table-name]
WHERE [column] = [value]
DON'T FORGET YOUR WHERE CLAUSE! IF YOU FORGET IT YOU COULD END UP MISTAKENLY DELETING ALL DATA IN THE TABLE! Access 2007 will give you a warning if you inadvertently delete multiple rows, but if you execute the DELETE statement from Excel 2007, you'll get no such warning, and it will be impossible to recover the data you have deleted!
Deleting Data
After you save the DELETE query, it will look like this in the query list on the left
NOTE: It is generally bad practice to delete data from a database. Rather, you want to add a column and mark the data as inactive (include a column with the date it was marked inactive and a column with the user-ID of the user who marked it as inactive, if necessary). That way you always have an audit log of when data was "deleted", and you can always recover the data later.
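Marking rows inactive instead of deleting them could look like the Python sketch below, run against SQLite. The column names `inactive_date` and `inactive_by`, the user-ID 'jdoe', and the date are all hypothetical choices made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE purchase (
        acctnum       INTEGER,
        campnum       INTEGER,
        qty           INTEGER,
        inactive_date TEXT,   -- hypothetical: NULL while the row is active
        inactive_by   TEXT    -- hypothetical: user-ID of whoever retired the row
    )""")
conn.executemany(
    "INSERT INTO purchase (acctnum, campnum, qty) VALUES (?, ?, ?)",
    [(13018, 21, 1), (19635, 21, 1), (23361, 21, 1)],
)

# "Delete" one purchase by marking it inactive instead of removing it.
conn.execute(
    "UPDATE purchase SET inactive_date = ?, inactive_by = ? WHERE acctnum = ?",
    ("1987-02-01", "jdoe", 13018),
)

active = [row[0] for row in conn.execute(
    "SELECT acctnum FROM purchase WHERE inactive_date IS NULL ORDER BY acctnum")]
total = conn.execute("SELECT COUNT(*) FROM purchase").fetchone()[0]
print(active, total)
```

Everyday queries filter on `inactive_date IS NULL`, while the full history stays in the table for audits or recovery.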
Lab Exercise
Databases
lab exercise ROOM JMHH 380