F15 CS194 Lec 03 Tabular Data
Lecture 3
Manipulating Tabular Data
Extract
Transform
Load
Two views of tables
First view
Key Concept: Structured Data
A data model is a collection of concepts for
describing data.
*Codd, E. F. (1970). "A relational model of data for large shared data banks".
Communications of the ACM 13 (6): 37
Relational Database: Definitions
• Relational database: a set of relations
• Relation: made up of 2 parts:
– Schema: specifies name of relation, plus name and type
of each column, e.g.
Students(sid: string, name: string, login: string, age: integer, gpa: real)
– Instance: the actual data at a given time
  • #rows = cardinality
  • #fields = degree / arity
• A relation is a mathematical object (from set theory)
which is true for certain arguments.
• An instance defines the set of arguments for which the
relation is true (it’s a table not a row).
Ex: Instance of Students Relation
sid name login age gpa
53666 Jones jones@cs 18 3.4
53688 Smith smith@eecs 18 3.2
53650 Smith smith@math 19 3.8
• The relation is true for these tuples and false for others
SQL - A language for Relational DBs*
SELECT *
FROM Students S
WHERE S.age=18
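As a minimal sketch, the query above can be run against the example Students instance using Python's built-in sqlite3 module. The in-memory database and the ORDER BY clause (added only to make the output order predictable) are assumptions of this sketch, not part of the lecture.

```python
import sqlite3

# In-memory database holding the Students instance from the lecture
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?, ?)",
    [
        ("53666", "Jones", "jones@cs", 18, 3.4),
        ("53688", "Smith", "smith@eecs", 18, 3.2),
        ("53650", "Smith", "smith@math", 19, 3.8),
    ],
)

# The lecture's query; ORDER BY is added here only for a stable row order
rows_age_18 = conn.execute(
    "SELECT * FROM Students S WHERE S.age = 18 ORDER BY S.sid"
).fetchall()

for row in rows_age_18:
    print(row)
```

Only the two tuples with age 18 come back; the relation is "false" for the third tuple under this qualification.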
Name           Mortality
Socrates       Mortal
Thor           Immortal
Barney         Mortal
Blarney stone  Non-living
Basic SQL Query
SELECT [DISTINCT] target-list
FROM relation-list
WHERE qualification
Note the previous version of this query (with no join keyword) is an “Implicit join”
SQL Inner Joins
SELECT S.name, E.classid
FROM Students S (INNER) JOIN Enrolled E
ON S.sid=E.sid
Students S:
S.name  S.sid
Jones   11111
Smith   22222
Brown   33333

Enrolled E:
E.sid   E.classid
11111   History105
11111   DataScience194
22222   French150
44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150

Unmatched keys: S.sid 33333 and E.sid 44444 do not appear in the result.
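The inner join can be checked with a small sqlite3 sketch. It runs both the explicit INNER JOIN form and the implicit form (comma-separated relation list plus a WHERE condition) on the two example tables; the in-memory database setup is an assumption of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (name TEXT, sid TEXT);
    CREATE TABLE Enrolled (sid TEXT, classid TEXT);
    INSERT INTO Students VALUES
        ('Jones', '11111'), ('Smith', '22222'), ('Brown', '33333');
    INSERT INTO Enrolled VALUES
        ('11111', 'History105'), ('11111', 'DataScience194'),
        ('22222', 'French150'), ('44444', 'English10');
""")

# Explicit inner join, as written on the slide
explicit = conn.execute(
    "SELECT S.name, E.classid "
    "FROM Students S INNER JOIN Enrolled E ON S.sid = E.sid"
).fetchall()

# Implicit join: no JOIN keyword, the join condition lives in WHERE
implicit = conn.execute(
    "SELECT S.name, E.classid "
    "FROM Students S, Enrolled E WHERE S.sid = E.sid"
).fetchall()

print(sorted(explicit))
print(sorted(implicit))
```

Both forms return the same three rows; Brown (33333) and English10 (44444) have no partner and are left out.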
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid
Students S:
S.name  S.sid
Jones   11111
Smith   22222
Brown   33333

Enrolled E:
E.sid   E.classid
11111   History105
11111   DataScience194
22222   French150
44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
Brown   NULL
SQL Joins
SELECT S.name, E.classid
FROM Students S LEFT OUTER JOIN Enrolled E
ON S.sid=E.sid
Students S:
S.name  S.sid
Jones   11111
Smith   22222
Brown   33333

Enrolled E:
E.sid   E.classid
11111   History105
11111   DataScience194
22222   French150
44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
Brown   NULL
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid
Students S:
S.name  S.sid
Jones   11111
Smith   22222
Brown   33333

Enrolled E:
E.sid   E.classid
11111   History105
11111   DataScience194
22222   French150
44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
NULL    English10
SQL Joins
SELECT S.name, E.classid
FROM Students S RIGHT OUTER JOIN Enrolled E
ON S.sid=E.sid
Students S:
S.name  S.sid
Jones   11111
Smith   22222
Brown   33333

Enrolled E:
E.sid   E.classid
11111   History105
11111   DataScience194
22222   French150
44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
NULL    English10
SQL Joins
SELECT S.name, E.classid
FROM Students S ? JOIN Enrolled E
ON S.sid=E.sid
Students S:
S.name  S.sid
Jones   11111
Smith   22222
Brown   33333

Enrolled E:
E.sid   E.classid
11111   History105
11111   DataScience194
22222   French150
44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
NULL    English10
Brown   NULL
SQL Joins
SELECT S.name, E.classid
FROM Students S FULL OUTER JOIN Enrolled E
ON S.sid=E.sid
Students S:
S.name  S.sid
Jones   11111
Smith   22222
Brown   33333

Enrolled E:
E.sid   E.classid
11111   History105
11111   DataScience194
22222   French150
44444   English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
NULL    English10
Brown   NULL
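The left, right and full outer joins above can be sketched in pandas with the `how` argument of `DataFrame.merge`. The DataFrame and column names below are chosen for this illustration; pandas shows the unmatched side as NaN where SQL shows NULL.

```python
import pandas as pd

students = pd.DataFrame(
    {"name": ["Jones", "Smith", "Brown"], "sid": ["11111", "22222", "33333"]}
)
enrolled = pd.DataFrame(
    {
        "sid": ["11111", "11111", "22222", "44444"],
        "classid": ["History105", "DataScience194", "French150", "English10"],
    }
)

# LEFT OUTER JOIN: keep every student, even without an enrollment
left = students.merge(enrolled, on="sid", how="left")[["name", "classid"]]

# RIGHT OUTER JOIN: keep every enrollment, even without a student
right = students.merge(enrolled, on="sid", how="right")[["name", "classid"]]

# FULL OUTER JOIN: keep unmatched rows from both sides
full = students.merge(enrolled, on="sid", how="outer")[["name", "classid"]]

print(left, right, full, sep="\n\n")
```

The row order of the merged frames may differ from the slides; the set of rows is what matters here.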
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid
Students S:
S.name  S.sid
Jones   11111
Smith   22222
Brown   33333

Enrolled E:
E.sid   E.classid
11111   History105
11111   DataScience194
22222   French150
44444   English10

Result:
S.name  E.classid
Jones   History105
Smith   French150
SQL Joins
SELECT S.name, E.classid
FROM Students S LEFT SEMI JOIN Enrolled E
ON S.sid=E.sid
Students S:
S.name  S.sid
Jones   11111
Smith   22222
Brown   33333

Enrolled E:
E.sid   E.classid
11111   History105
11111   DataScience194
22222   French150
44444   English10

Result:
S.name  E.classid
Jones   History105
Smith   French150
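pandas has no semi-join option in `merge`, so a rough equivalent is a membership filter: keep each Students row whose sid appears at least once in Enrolled. This sketch follows the usual definition of a semi-join and returns only the Students columns, so it does not reproduce the classid column shown in the result above. The DataFrame names are assumptions.

```python
import pandas as pd

students = pd.DataFrame(
    {"name": ["Jones", "Smith", "Brown"], "sid": ["11111", "22222", "33333"]}
)
enrolled = pd.DataFrame(
    {
        "sid": ["11111", "11111", "22222", "44444"],
        "classid": ["History105", "DataScience194", "French150", "English10"],
    }
)

# Each matching student appears once, no matter how many enrollments match
semi = students[students["sid"].isin(enrolled["sid"])]

print(semi)
```

Jones appears only once even though two Enrolled rows match, which is the difference from an inner join.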
What kind of Join is this?
SELECT *
FROM Students S ?? Enrolled E
Students S:
S.name  S.sid
Jones   11111
Smith   22222

Enrolled E:
E.sid   E.classid
11111   History105
11111   DataScience194
22222   French150
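A join with no ON condition, as in the query above, is consistent with a Cartesian product (CROSS JOIN): every Students row is paired with every Enrolled row. A minimal sketch with plain Python tuples, assuming that reading of the slide:

```python
from itertools import product

# (name, sid) and (sid, classid) tuples from the two small tables
students = [("Jones", "11111"), ("Smith", "22222")]
enrolled = [
    ("11111", "History105"),
    ("11111", "DataScience194"),
    ("22222", "French150"),
]

# Cartesian product: concatenate every pair of rows, matching or not
cross = [s + e for s, e in product(students, enrolled)]

for row in cross:
    print(row)
```

With 2 and 3 input rows the product has 6 rows; filtering it on S.sid = E.sid would give back the inner join.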
Reductions and GroupBy
• One of the most common operations on Data Tables is
aggregation or reduction (count, sum, average, min, max,…).
• They provide a means to see high-level patterns in the data,
to make summaries of it etc.
• You need ways of specifying which columns are being
aggregated over, which is the role of a GroupBy operator.
SID  Name   Course    Semester  Grade  GPA
111  Jones  Stat 134  F13       A      4.0
111  Jones  CS 162    F13       B-     2.7
222  Smith  EE 141    S14       B+     3.3
222  Smith  CS162     F14       C+     2.3
222  Smith  CS189     F14       A-     3.7
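A minimal pandas sketch of a GroupBy followed by a reduction on the table above. The grouping columns say what to aggregate over, and the reduction (mean, count) says how. The variable names are assumptions for illustration.

```python
import pandas as pd

grades = pd.DataFrame(
    {
        "SID": [111, 111, 222, 222, 222],
        "Name": ["Jones", "Jones", "Smith", "Smith", "Smith"],
        "Course": ["Stat 134", "CS 162", "EE 141", "CS162", "CS189"],
        "Semester": ["F13", "F13", "S14", "F14", "F14"],
        "Grade": ["A", "B-", "B+", "C+", "A-"],
        "GPA": [4.0, 2.7, 3.3, 2.3, 3.7],
    }
)

# Average grade points per student: group by student, reduce with mean
gpa_by_student = grades.groupby(["SID", "Name"])["GPA"].mean()

# Number of courses per semester: group by semester, reduce with count
courses_per_semester = grades.groupby("Semester")["Course"].count()

print(gpa_by_student)
print(courses_per_semester)
```

The SQL counterpart would be along the lines of SELECT SID, AVG(GPA) ... GROUP BY SID.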
Pandas/Python
• Series: a named, ordered dictionary
– The keys of the dictionary are the indexes
– Built on NumPy’s ndarray
– Values can be of any NumPy data type
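A short sketch of a Series used as a named, ordered dictionary. The values reuse the gpa column of the Students example, keyed by sid; the variable names are assumptions.

```python
import pandas as pd

# Keys (the index) are student ids, values are GPAs, and the Series has a name
gpa = pd.Series([3.4, 3.2, 3.8], index=["53666", "53688", "53650"], name="gpa")

# Dictionary-style lookup by key
smith_gpa = gpa["53688"]

# Unlike a plain dict, the values live in a typed NumPy array
values = gpa.to_numpy()

# Vectorized filtering keeps the keys and their original order
above = gpa[gpa > 3.3]

print(smith_gpa, values.dtype)
print(above)
```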
Operations
• map() functions
• filter (apply predicate to rows)
• sort/group by
• aggregate: sum, count, average, max, min
• Pivot or reshape
• Relational:
– union, intersection, difference, cartesian product (CROSS
JOIN)
– select/filter, project
– join: natural join (INNER JOIN), theta join, semi-join, etc.
– rename
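Several of the operations listed above have direct pandas counterparts. The following is a sketch on the grades table from the GroupBy slides; the mapping from each operation to a specific pandas call is this sketch's choice, and other calls would work as well.

```python
import pandas as pd

grades = pd.DataFrame(
    {
        "Name": ["Jones", "Jones", "Smith", "Smith", "Smith"],
        "Course": ["Stat 134", "CS 162", "EE 141", "CS162", "CS189"],
        "Semester": ["F13", "F13", "S14", "F14", "F14"],
        "Grade": ["A", "B-", "B+", "C+", "A-"],
        "GPA": [4.0, 2.7, 3.3, 2.3, 3.7],
    }
)

# map: apply a function to every value (strip the +/- from the letter grade)
letters = grades["Grade"].map(lambda g: g[0])

# filter: keep the rows that satisfy a predicate
high = grades[grades["GPA"] >= 3.0]

# sort: order rows by a column
by_gpa = grades.sort_values("GPA", ascending=False)

# project and rename: pick columns, then relabel one of them
projected = grades[["Name", "GPA"]].rename(columns={"GPA": "Points"})

# pivot: one row per student, one column per semester, mean GPA in the cells
pivot = grades.pivot_table(
    index="Name", columns="Semester", values="GPA", aggfunc="mean"
)

print(pivot)
```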
Pandas vs SQL
+ Pandas is lightweight and fast.
+ Full SQL expressiveness plus the expressiveness of
Python, especially for function evaluation.
+ Integration with plotting functions like Matplotlib.
Jacobs Update
• Room 310 not ready, but other rooms are. For
the next few (?) weeks, we will meet:
• Mondays in 155 Donner
• Wednesdays in 110/120 Jacobs Hall
Cube
[Figure: data cube with dimensions Name, Classid and Semester; cell contents are Grade, Unit values]
Queries on OLAP cubes
• Once the cube is defined, it’s easy to do aggregate queries by
projecting along one or more axes.
• E.g. to get student GPAs, we project the Grade field onto the
student (Name) axis.
• In fact, such aggregates are precomputed and maintained
automatically in an OLAP cube, so queries are instantaneous.
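Projection along axes can be sketched with a small dense NumPy array standing in for the cube. The axis order (Name, Classid, Semester), the use of NaN for empty cells, and the grade-point data (taken from the GroupBy example) are assumptions of this sketch. It also computes the aggregate on demand, whereas an OLAP system would precompute and maintain it.

```python
import numpy as np

names = ["Jones", "Smith"]
classes = ["Stat 134", "CS 162", "EE 141", "CS 189"]
semesters = ["F13", "S14", "F14"]

# (name, classid, semester, grade points) facts to place in the cube
records = [
    ("Jones", "Stat 134", "F13", 4.0),
    ("Jones", "CS 162", "F13", 2.7),
    ("Smith", "EE 141", "S14", 3.3),
    ("Smith", "CS 162", "F14", 2.3),
    ("Smith", "CS 189", "F14", 3.7),
]

# Cube with axes (Name, Classid, Semester); NaN marks an empty cell
cube = np.full((len(names), len(classes), len(semesters)), np.nan)
for name, classid, semester, points in records:
    cube[names.index(name), classes.index(classid), semesters.index(semester)] = points

# Project onto the Name axis: average over Classid and Semester, skipping NaN
gpa_by_name = np.nanmean(cube, axis=(1, 2))

# Project onto the Semester axis instead
gpa_by_semester = np.nanmean(cube, axis=(0, 1))

print(dict(zip(names, gpa_by_name)))
print(dict(zip(semesters, gpa_by_semester)))
```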
[Figure: cube view with axes Semester and Name]
OLAP
• Slicing: fixing one or more variables
• Dicing: selecting a range of values for one or more variables
OLAP
• Drilling Up/Down: change levels of a hierarchically-indexed variable
• Pivoting: produce a two-axis view for viewing as a spreadsheet.
Outline
• To support real-time querying, OLAP DBs store aggregates
of data values along many dimensions.
• This works best if axes can be tree-structured. E.g. time can
be expressed as a hierarchy:
hour → day → week → month → year
OLAP tradeoffs
• Aggregates increase space and the cost of updates.
• On the other hand, since they are projections of data, or
tree structures, the storage overhead can be small.
• Aggregates are limited, but cover a lot of common cases:
avg, stdev, min, max.
• Operations (slice, dice, pivot, etc.) are conceptually simpler
than SQL, but cover a lot of common cases.
• Good integration with clients, e.g. spreadsheets, for visual
interaction, although there is an underlying query
language (MDX).
Numpy/Matlab and OLAP
• NumPy and MATLAB have an efficient implementation of
nd-arrays for dense data.
• Indices must be integer, but you can implement general
indices using dictionaries from indexval->int.
• Slicing and dicing are available using index ranges:
a[5,1:3,:] etc.
• Roll-down/up involve aggregates along dimensions such as
sum(a[3,4:6,:],2)
• Pivoting involves index permutations (.transpose()) and
aggregation over the other indices.
• Limitation: MATLAB and Numpy currently only support dense
nd-arrays (or sparse 2d arrays).
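The operations in this list can be tried on a small dense array. This is a sketch under assumptions: the array shape, the semester labels and the index dictionary are hypothetical. The roll-up expression is read here as summing over the last axis; in NumPy the integer index 3 removes the first axis, so that axis is axis 1 of the sliced array.

```python
import numpy as np

# Hypothetical dense 3-d cube with shape (6, 8, 4); a[i, j, k] = 32*i + 4*j + k
a = np.arange(6 * 8 * 4).reshape(6, 8, 4)

# General (non-integer) indices via a dictionary from index value to int
sem_index = {"F13": 0, "S14": 1, "F14": 2, "S15": 3}

# Slicing: fix one variable
f14 = a[:, :, sem_index["F14"]]            # shape (6, 8)

# Dicing: fix one variable and take a range of another
dice = a[5, 1:3, :]                        # shape (2, 4)

# Roll-up: aggregate the diced block along its last axis
rollup = a[3, 4:6, :].sum(axis=1)          # shape (2,)

# Pivot: put axis 2 on rows and axis 0 on columns, aggregate over axis 1
pivot = a.transpose(2, 0, 1).sum(axis=2)   # shape (4, 6)

print(f14.shape, dice.shape, rollup, pivot.shape)
```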
What’s Wrong with Tables?
Represented as:
53831 Jones jones@cs 18 3.4
Represented as:
52841 Jones jones@cs NULL NULL NULL NULL NULL NULL
…
NoSQL Storage Systems
Column-Family Stores (Cassandra)
A column-family groups data columns together, and is
analogous to a table (and similar to Pandas DataFrame)
Static column family from Apache Cassandra:
Columns fixed
Key-value stores
• A key-value store is an even simpler approach.
• It implements storage and retrieval of (key,value) pairs.
• I.e. the basic functionality is that of a dictionary:
age["john"] = 25
• But some KV-stores also implement sorting and
indexing with the keys (e.g. LevelDB).
• You can build either column-based or row-based DBs
on top of such KV-stores to optimize performance (e.g.
omitting indices or ACID qualities).
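A toy in-memory sketch of the idea: dictionary-style put/get plus keys kept in sorted order, so that range scans are possible. The class name `SortedKVStore`, its methods, and the extra entries besides "john" are hypothetical, and this is not how LevelDB or any real KV-store is implemented.

```python
import bisect


class SortedKVStore:
    """Toy key-value store that also keeps its keys in sorted order."""

    def __init__(self):
        self._keys = []   # sorted list of keys, used for range scans
        self._data = {}   # key -> value, used for point lookups

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)   # keep the key list sorted
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def range(self, lo, hi):
        """Yield (key, value) pairs with lo <= key < hi, in key order."""
        start = bisect.bisect_left(self._keys, lo)
        end = bisect.bisect_left(self._keys, hi)
        for key in self._keys[start:end]:
            yield key, self._data[key]


age = SortedKVStore()
age.put("john", 25)
age.put("alice", 31)   # hypothetical extra entries for the range scan
age.put("bob", 19)

in_range = list(age.range("a", "c"))
print(age.get("john"), in_range)
```

The sorted key list is what lets a row-based or column-based layout sit on top: rows or columns that share a key prefix end up adjacent and can be read with one range scan.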
Pig
• Started at Yahoo! Research
• Features:
– Expresses sequences of MapReduce jobs
– Data model: nested “bags” of items
• Schema is optional
– Provides relational (SQL) operators
(JOIN, GROUP BY, etc)
– Easy to plug in Java functions
An Example Problem
Suppose you have user data in one file, website data in
another, and you …
Dataflow steps:
• Load Users
• Load Pages
• Filter by age
• Count clicks
• Order by clicks
• Take top 5
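The dataflow steps can be sketched in pandas rather than Pig, purely to make the pipeline concrete. Everything specific here is an assumption: the inline data standing in for the two files, the column names, the age range of 18 to 25, and the join between users and page visits that links the two inputs.

```python
import pandas as pd

# Hypothetical stand-ins for the two input files ("Load Users", "Load Pages")
users = pd.DataFrame(
    {"name": ["ann", "bob", "cat", "dan"], "age": [20, 40, 23, 19]}
)
pages = pd.DataFrame(
    {
        "user": ["ann", "ann", "bob", "cat", "cat", "dan", "dan", "dan"],
        "url": ["a.com", "b.com", "a.com", "a.com", "c.com",
                "b.com", "a.com", "d.com"],
    }
)

# Filter by age (the 18 to 25 range is an assumption)
young = users[(users["age"] >= 18) & (users["age"] <= 25)]

# Assumed step: join the filtered users with their page visits
visits = young.merge(pages, left_on="name", right_on="user")

# Count clicks per page
clicks = visits.groupby("url").size().rename("clicks").reset_index()

# Order by clicks (ties broken by url) and take the top 5
top5 = clicks.sort_values(["clicks", "url"], ascending=[False, True]).head(5)

print(top5)
```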