F15 CS194 Lec 03 Tabular Data

Introduction to Data Science

Lecture 3
Manipulating Tabular Data

Intro. to Data Science Fall 2015

John Canny
including notes from Michael Franklin and others
Outline for this Evening
• Two views of tables:
– SQL/Pandas
– OLAP/Numpy/Matlab
• SQL, NoSQL
– Non-Tabular Structures
Data Science – One Definition
The Big Picture

Extract
Transform
Load

Two views of tables
First view
Key Concept: Structured Data
A data model is a collection of concepts for
describing data.

A schema is a description of a particular
collection of data, using a given data model.
The Relational Model*
• The Relational Model is Ubiquitous:
• MySQL, PostgreSQL, Oracle, DB2, SQLServer, …
• Foundational work done at
• IBM - System R
• UC Berkeley - Ingres
E. F. “Ted” Codd
Turing Award 1981
• Object-oriented concepts have been merged in
• Early work: POSTGRES research project at Berkeley
• Informix, IBM DB2, Oracle 8i

• Also has support for XML (semi-structured data)

*Codd, E. F. (1970). "A relational model of data for large shared data banks".
Communications of the ACM 13 (6): 37
Relational Database: Definitions
• Relational database: a set of relations
• Relation: made up of 2 parts:
Schema : specifies name of relation, plus name and type
of each column
Students(sid: string, name: string, login: string, age:
integer, gpa: real)
Instance : the actual data at a given time
• #rows = cardinality
• #fields = degree / arity
• A relation is a mathematical object (from set theory)
which is true for certain arguments.
• An instance defines the set of arguments for which the
relation is true (it’s a table not a row).
Ex: Instance of Students Relation
sid    name   login       age  gpa
53666  Jones  jones@cs    18   3.4
53688  Smith  smith@eecs  18   3.2
53650  Smith  smith@math  19   3.8

• Cardinality = 3, arity = 5 , all rows distinct

• The relation is true for these tuples and false for others
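
The set-theoretic reading of an instance can be sketched in a few lines of Python. This is only an illustration: the names `schema`, `students` and `holds` are assumptions made for the example, not part of the lecture.

```python
# Hypothetical in-memory model of the Students instance shown above.
schema = ("sid", "name", "login", "age", "gpa")

# An instance is a set of tuples, so all rows are distinct by construction.
students = {
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@eecs", 18, 3.2),
    ("53650", "Smith", "smith@math", 19, 3.8),
}

cardinality = len(students)  # number of rows
arity = len(schema)          # number of fields


def holds(row):
    """The relation is 'true' exactly for the tuples in the instance."""
    return row in students
```

Here `holds(("53666", "Jones", "jones@cs", 18, 3.4))` is true, while any tuple not listed in the instance is false.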
SQL - A language for Relational DBs*

• SQL = Structured Query Language

• Data Definition Language (DDL)
– create, modify, delete relations
– specify constraints
– administer users, security, etc.
• Data Manipulation Language (DML)
– Specify queries to find tuples that satisfy criteria
– add, modify, remove tuples
• The DBMS is responsible for efficient evaluation.

* Developed at IBM by Donald D. Chamberlin and Raymond F. Boyce in the 1970s.
Originally named SEQUEL (Structured English QUEry Language).
Creating Relations in SQL
• Create the Students relation.
– Note: the type (domain) of each field is specified,
and enforced by the DBMS whenever tuples are
added or modified.

CREATE TABLE Students
    (sid CHAR(20),
     name CHAR(20),
     login CHAR(10),
     age INTEGER,
     gpa FLOAT)
Table Creation (continued)

• Another example: the Enrolled table holds
information about courses students take.

CREATE TABLE Enrolled
    (sid CHAR(20),
     cid CHAR(20),
     grade CHAR(2))
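
One way to try the two CREATE TABLE statements is through Python's built-in `sqlite3` module. This is a minimal sketch, assuming an in-memory database; the variable names are illustrative. Note that SQLite uses type affinity and does not strictly enforce declared column types, so it is more permissive than the DBMS behavior described above.

```python
import sqlite3

# In-memory database, an assumption made so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

conn.execute("""CREATE TABLE Students
    (sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT)""")
conn.execute("""CREATE TABLE Enrolled
    (sid CHAR(20), cid CHAR(20), grade CHAR(2))""")

# PRAGMA table_info yields one row per column:
# (cid, name, declared type, notnull, default value, pk)
student_cols = [(row[1], row[2])
                for row in conn.execute("PRAGMA table_info(Students)")]
enrolled_cols = [row[1] for row in conn.execute("PRAGMA table_info(Enrolled)")]
```

`student_cols` then holds the (name, declared type) pairs that make up the schema of Students.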
Adding and Deleting Tuples
• Can insert a single tuple using:
INSERT INTO Students (sid, name, login, age, gpa)
VALUES ('53688', 'Smith', 'smith@ee', 18, 3.2)

• Can delete all tuples satisfying some condition (e.g.,
name = Smith):
DELETE
FROM Students S
WHERE S.name = 'Smith'
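
The insert and delete statements can be exercised as follows. This is a sketch using Python's `sqlite3` module; the second row (Jones) is added only so that something survives the delete, and the range variable `S` is left out of the DELETE to stay portable across SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Students
    (sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa FLOAT)""")

insert = ("INSERT INTO Students (sid, name, login, age, gpa) "
          "VALUES (?, ?, ?, ?, ?)")
conn.execute(insert, ("53688", "Smith", "smith@ee", 18, 3.2))
conn.execute(insert, ("53666", "Jones", "jones@cs", 18, 3.4))
before = conn.execute("SELECT COUNT(*) FROM Students").fetchone()[0]

# Delete every tuple whose name is Smith; rowcount reports how many went.
cur = conn.execute("DELETE FROM Students WHERE name = ?", ("Smith",))
deleted = cur.rowcount
remaining = [row[0] for row in conn.execute("SELECT name FROM Students")]
```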
Queries in SQL
• Single-table queries are straightforward.

• To find all 18 year old students, we can write:

SELECT *
FROM Students S
WHERE S.age=18

• To find just names and logins, replace the first line:

SELECT S.name, S.login
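
Both single-table queries can be run against the Students instance from earlier. In this sketch the ORDER BY clause is an addition (an assumption) so the row order is deterministic; everything else follows the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students "
             "(sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", [
    ("53666", "Jones", "jones@cs", 18, 3.4),
    ("53688", "Smith", "smith@eecs", 18, 3.2),
    ("53650", "Smith", "smith@math", 19, 3.8),
])

# All columns of every 18 year old student.
all_18 = conn.execute(
    "SELECT * FROM Students S WHERE S.age=18 ORDER BY S.sid").fetchall()

# Only names and logins: same query with the first line replaced.
names_logins = conn.execute(
    "SELECT S.name, S.login FROM Students S WHERE S.age=18 ORDER BY S.sid"
).fetchall()
```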
Joins and Inference
• Chaining relations together is the basic inference
method in relational DBs. It produces new
relations (effectively new facts) from the data:
SELECT S.name, M.mortality
FROM Students S, Mortality M
WHERE S.Race=M.Race

S                         M
Name           Race       Race      Mortality
Socrates       Man        Man       Mortal
Thor           God        God       Immortal
Barney         Dinosaur   Dinosaur  Mortal
Blarney stone  Stone      Stone     Non-living

Result:
Name           Mortality
Socrates       Mortal
Thor           Immortal
Barney         Mortal
Blarney stone  Non-living
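
The inference step can be reproduced with `sqlite3`. This is a minimal sketch, assuming the two tables have exactly the columns shown on the slide; `derived` is a name chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (Name TEXT, Race TEXT)")
conn.execute("CREATE TABLE Mortality (Race TEXT, Mortality TEXT)")
conn.executemany("INSERT INTO Students VALUES (?, ?)", [
    ("Socrates", "Man"), ("Thor", "God"),
    ("Barney", "Dinosaur"), ("Blarney stone", "Stone"),
])
conn.executemany("INSERT INTO Mortality VALUES (?, ?)", [
    ("Man", "Mortal"), ("God", "Immortal"),
    ("Dinosaur", "Mortal"), ("Stone", "Non-living"),
])

# Implicit join: chain the two relations on Race to derive new facts.
derived = dict(conn.execute(
    "SELECT S.Name, M.Mortality FROM Students S, Mortality M "
    "WHERE S.Race = M.Race"))
```

Neither input table states that Socrates is mortal; that fact only appears in the joined relation.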
Basic SQL Query
SELECT [DISTINCT] target-list
FROM relation-list
WHERE qualification

• relation-list : A list of relation names
– possibly with a range-variable after each name
• target-list : A list of attributes of tables in relation-list
• qualification : Comparisons combined using AND, OR and NOT.
– Comparisons are Attr op const or Attr1 op Attr2, where op is
one of =, ≠, <, >, ≤, ≥
• DISTINCT: optional keyword indicating that the answer
should not contain duplicates.
– In SQL SELECT, the default is that duplicates are not
eliminated! (Result is called a “multiset”)
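
The multiset default is easy to observe. The sketch below reuses the three Students rows from earlier (two of them named Smith); sorting the results is an addition so the comparison does not depend on row order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid TEXT, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)", [
    ("53666", "Jones", 18), ("53688", "Smith", 18), ("53650", "Smith", 19),
])

# Without DISTINCT the answer is a multiset: Smith appears twice.
with_dups = sorted(r[0] for r in conn.execute("SELECT S.name FROM Students S"))

# With DISTINCT, duplicate rows are eliminated.
no_dups = sorted(
    r[0] for r in conn.execute("SELECT DISTINCT S.name FROM Students S"))
```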
SQL Inner Joins
SELECT S.name, E.classid
FROM Students S (INNER) JOIN Enrolled E
ON S.sid=E.sid

S                  E
S.name  S.sid      E.sid  E.classid
Jones   11111      11111  History105
Smith   22222      11111  DataScience194
Brown   33333      22222  French150
                   44444  English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150

Unmatched keys (S.sid 33333 and E.sid 44444) produce no rows in the result.
Note the previous version of this query (with no join keyword) is an “Implicit join”
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid

S                  E
S.name  S.sid      E.sid  E.classid
Jones   11111      11111  History105
Smith   22222      11111  DataScience194
Brown   33333      22222  French150
                   44444  English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
Brown   NULL

SQL Joins
SELECT S.name, E.classid
FROM Students S LEFT OUTER JOIN Enrolled E
ON S.sid=E.sid
(same tables and result as on the previous slide)
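
The difference between the inner and the left outer join can be checked with `sqlite3`. This is a sketch under one stated assumption: older SQLite builds lack RIGHT and FULL OUTER JOIN, so the right outer join is emulated here by swapping the operands of a LEFT OUTER JOIN. SQL NULL comes back as Python `None`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, sid TEXT)")
conn.execute("CREATE TABLE Enrolled (sid TEXT, classid TEXT)")
conn.executemany("INSERT INTO Students VALUES (?, ?)",
                 [("Jones", "11111"), ("Smith", "22222"), ("Brown", "33333")])
conn.executemany("INSERT INTO Enrolled VALUES (?, ?)",
                 [("11111", "History105"), ("11111", "DataScience194"),
                  ("22222", "French150"), ("44444", "English10")])

inner = set(conn.execute(
    "SELECT S.name, E.classid FROM Students S "
    "INNER JOIN Enrolled E ON S.sid = E.sid"))

# Left outer join: every Students row survives, padded with NULL if unmatched.
left = set(conn.execute(
    "SELECT S.name, E.classid FROM Students S "
    "LEFT OUTER JOIN Enrolled E ON S.sid = E.sid"))

# Emulated right outer join: swap the operands so Enrolled rows all survive.
right_like = set(conn.execute(
    "SELECT S.name, E.classid FROM Enrolled E "
    "LEFT OUTER JOIN Students S ON S.sid = E.sid"))
```

The union of `left` and `right_like` gives the rows a full outer join would return.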
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid

S                  E
S.name  S.sid      E.sid  E.classid
Jones   11111      11111  History105
Smith   22222      11111  DataScience194
Brown   33333      22222  French150
                   44444  English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
NULL    English10

SQL Joins
SELECT S.name, E.classid
FROM Students S RIGHT OUTER JOIN Enrolled E
ON S.sid=E.sid
(same tables and result as on the previous slide)
SQL Joins
SELECT S.name, E.classid
FROM Students S ? JOIN Enrolled E
ON S.sid=E.sid

S                  E
S.name  S.sid      E.sid  E.classid
Jones   11111      11111  History105
Smith   22222      11111  DataScience194
Brown   33333      22222  French150
                   44444  English10

Result:
S.name  E.classid
Jones   History105
Jones   DataScience194
Smith   French150
NULL    English10
Brown   NULL

SQL Joins
SELECT S.name, E.classid
FROM Students S FULL OUTER JOIN Enrolled E
ON S.sid=E.sid
(same tables and result as on the previous slide)
What kind of Join is this?
SELECT S.name, E.classid
FROM Students S ?? Enrolled E
ON S.sid=E.sid

S                  E
S.name  S.sid      E.sid  E.classid
Jones   11111      11111  History105
Smith   22222      11111  DataScience194
Brown   33333      22222  French150
                   44444  English10

Result:
S.name  E.classid
Jones   History105
Smith   French150

SQL Joins
SELECT S.name, E.classid
FROM Students S LEFT SEMI JOIN Enrolled E
ON S.sid=E.sid
(same tables and result as on the previous slide)

Note: a semi join returns each matching row of S at most once, which is
why Jones appears only once. Strictly, only the columns of S are available
in the result. LEFT SEMI JOIN is dialect-specific syntax (e.g. Hive);
standard SQL expresses a semi join with EXISTS or IN.
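
SQLite has no SEMI JOIN keyword, so a semi join is sketched here with a correlated EXISTS subquery, which is one common way to express it in standard SQL. Only columns of Students are selected; the ORDER BY is added for a deterministic result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, sid TEXT)")
conn.execute("CREATE TABLE Enrolled (sid TEXT, classid TEXT)")
conn.executemany("INSERT INTO Students VALUES (?, ?)",
                 [("Jones", "11111"), ("Smith", "22222"), ("Brown", "33333")])
conn.executemany("INSERT INTO Enrolled VALUES (?, ?)",
                 [("11111", "History105"), ("11111", "DataScience194"),
                  ("22222", "French150"), ("44444", "English10")])

# Keep each student that has at least one enrollment, exactly once.
semi = [row[0] for row in conn.execute(
    "SELECT S.name FROM Students S "
    "WHERE EXISTS (SELECT 1 FROM Enrolled E WHERE E.sid = S.sid) "
    "ORDER BY S.name")]
```

Jones is enrolled in two classes but appears once, and Brown (no enrollment) is dropped.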
What kind of Join is this?
SELECT *
FROM Students S ?? Enrolled E

S                  E
S.name  S.sid      E.sid  E.classid
Jones   11111      11111  History105
Smith   22222      11111  DataScience194
                   22222  French150

Result:
S.name  S.sid  E.sid  E.classid
Jones   11111  11111  History105
Jones   11111  11111  DataScience194
Jones   11111  22222  French150
Smith   22222  11111  History105
Smith   22222  11111  DataScience194
Smith   22222  22222  French150

SQL Joins
SELECT *
FROM Students S CROSS JOIN Enrolled E
(same tables and result as on the previous slide)
What kind of Join is this?
SELECT *
FROM Students S, Enrolled E
WHERE S.sid <= E.sid

S                  E
S.name  S.sid      E.sid  E.classid
Jones   11111      11111  History105
Smith   22222      11111  DataScience194
                   22222  French150

Result:
S.name  S.sid  E.sid  E.classid
Jones   11111  11111  History105
Jones   11111  11111  DataScience194
Jones   11111  22222  French150
Smith   22222  22222  French150

Theta Joins
SELECT *
FROM Students S, Enrolled E
WHERE S.sid <= E.sid
(same tables and result as on the previous slide)
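
The cross join and the theta join can be compared on the two-student tables used in these slides. In this sketch the sids are stored as text, which is an assumption; the `<=` comparison still behaves as on the slide because all sids have the same length.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (name TEXT, sid TEXT)")
conn.execute("CREATE TABLE Enrolled (sid TEXT, classid TEXT)")
conn.executemany("INSERT INTO Students VALUES (?, ?)",
                 [("Jones", "11111"), ("Smith", "22222")])
conn.executemany("INSERT INTO Enrolled VALUES (?, ?)",
                 [("11111", "History105"), ("11111", "DataScience194"),
                  ("22222", "French150")])

# Cross join: every pairing of a Students row with an Enrolled row.
cross = conn.execute(
    "SELECT * FROM Students S CROSS JOIN Enrolled E").fetchall()

# Theta join: a cross product filtered by an arbitrary comparison.
theta = set(conn.execute(
    "SELECT * FROM Students S, Enrolled E WHERE S.sid <= E.sid"))
```

The cross join has 2 × 3 = 6 rows; the theta condition keeps 4 of them.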
Recall: Tweet JSON Format
Normalization
Raw Twitter data storage is very inefficient because, e.g., user
records are repeated with every tweet by that user.
Normalization is the process of minimizing data redundancy.
Normalized tables include only a foreign key to the information in
another table for repeated data.
The original table is the result of inner joins between tables.

Tweet id  User id  Location id  Body
11        111      1111         I need a Jamba juice
22        111      1111         Cal Soccer rules
33        111      2222         Why do we procrastinate?
44        222      3333         Close your eyes and push “go”

User.id  Name   Attribs…      Loc.id  Name      Attribs…
111      Jones                1111    Berkeley
222      Smith                2222    Oakland
                              3333    Hayward
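
The claim that the original table is recovered by inner joins can be sketched with `sqlite3`. The table and column names (`Tweets`, `Users`, `Locations`, `tweet_id`, and so on) are assumptions chosen for the example; the slide only shows column headings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tweets "
             "(tweet_id INTEGER, user_id INTEGER, loc_id INTEGER, body TEXT)")
conn.execute("CREATE TABLE Users (user_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE Locations (loc_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO Tweets VALUES (?, ?, ?, ?)", [
    (11, 111, 1111, "I need a Jamba juice"),
    (22, 111, 1111, "Cal Soccer rules"),
    (33, 111, 2222, "Why do we procrastinate?"),
    (44, 222, 3333, "Close your eyes and push go"),
])
conn.executemany("INSERT INTO Users VALUES (?, ?)",
                 [(111, "Jones"), (222, "Smith")])
conn.executemany("INSERT INTO Locations VALUES (?, ?)",
                 [(1111, "Berkeley"), (2222, "Oakland"), (3333, "Hayward")])

# Follow both foreign keys to rebuild the denormalized tweet records.
denormalized = conn.execute(
    "SELECT T.tweet_id, U.name, L.name, T.body "
    "FROM Tweets T "
    "JOIN Users U ON T.user_id = U.user_id "
    "JOIN Locations L ON T.loc_id = L.loc_id "
    "ORDER BY T.tweet_id").fetchall()
```

Each user and location is stored once, yet every tweet row in `denormalized` carries the full name and place again.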
Aggregate Queries
Including reference counts in the lookup tables allows you to
perform aggregate queries on those tables alone:
Average age of users, most popular location,…

Tweet id  User id  Location id  Body
11        111      1111         I need a Jamba Juice
22        111      1111         Cal Soccer rules
33        111      2222         Why do we procrastinate?
44        222      3333         Close your eyes and push “go”

U.id  Name   Count  Attr…      L.id  Name      Count  Attr…
111   Jones  3                 1111  Berkeley  2
222   Smith  1                 2222  Oakland   1
                               3333  Hayward   1
SQL Query Semantics
Semantics of an SQL query are defined in terms of the
following conceptual evaluation strategy:
1. do FROM clause: compute cross-product of tables (e.g.,
Students and Enrolled).
2. do WHERE clause: Check conditions, discard tuples that
fail. (i.e., “selection”).
3. do SELECT clause: Delete unwanted fields. (i.e.,
“projection”).
4. If DISTINCT specified, eliminate duplicate rows.
Probably the least efficient way to compute a query!
– An optimizer will find more efficient strategies to get
the same answer.
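
The four conceptual steps can be followed literally in plain Python. This is a sketch of the naive strategy only, using the Students and Enrolled rows from the join slides and the query SELECT DISTINCT S.name with the condition S.sid = E.sid as an assumed example.

```python
from itertools import product

students = [("Jones", "11111"), ("Smith", "22222"), ("Brown", "33333")]
enrolled = [("11111", "History105"), ("11111", "DataScience194"),
            ("22222", "French150"), ("44444", "English10")]

# 1. FROM: compute the cross-product of the tables.
crossed = list(product(students, enrolled))

# 2. WHERE: keep only the pairs that satisfy the condition (selection).
selected = [(s, e) for s, e in crossed if s[1] == e[0]]

# 3. SELECT: drop unwanted fields, keeping S.name (projection).
projected = [s[0] for s, e in selected]

# 4. DISTINCT: eliminate duplicates while preserving first-seen order.
distinct = list(dict.fromkeys(projected))
```

The cross-product already has 3 × 4 = 12 rows for these tiny tables, which is why a real optimizer avoids materializing it.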
Data Model (Tabular)
• SQLite
– Table: fixed number of named columns of specified type
– 5 storage classes for columns
• NULL
• INTEGER
• REAL
• TEXT
• BLOB
– Data stored on disk in a single file in row-major order
– Operations performed via sqlite3 shell

Reductions and GroupBy
• One of the most common operations on Data Tables is
aggregation or reduction (count, sum, average, min, max,…).
• They provide a means to see high-level patterns in the data,
to make summaries of it etc.
• You need ways of specifying which columns are being
aggregated over, which is the role of a GroupBy operator.

SID  Name   Course    Semester  Grade  GPA
111  Jones  Stat 134  F13       A      4.0
111  Jones  CS 162    F13       B-     2.7
222  Smith  EE 141    S14       B+     3.3
222  Smith  CS162     F14       C+     2.3
222  Smith  CS189     F14       A-     3.7

Reductions and GroupBy
SELECT SID, Name, AVG(GPA)
FROM Students
GROUP BY SID, Name

Result:
SID  Name   GPA
111  Jones  3.35
222  Smith  3.1
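
The aggregation can be reproduced with `sqlite3`. In this sketch the query groups by both SID and Name, which standard SQL expects for non-aggregated columns in the target list, and ORDER BY is added for a deterministic result. Averages are rounded before display because they are floating-point values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (SID INTEGER, Name TEXT, Course TEXT, "
             "Semester TEXT, Grade TEXT, GPA REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?, ?)", [
    (111, "Jones", "Stat 134", "F13", "A", 4.0),
    (111, "Jones", "CS 162", "F13", "B-", 2.7),
    (222, "Smith", "EE 141", "S14", "B+", 3.3),
    (222, "Smith", "CS162", "F14", "C+", 2.3),
    (222, "Smith", "CS189", "F14", "A-", 3.7),
])

# One output row per (SID, Name) group, with the mean GPA of that group.
averages = [(sid, name, round(avg, 2)) for sid, name, avg in conn.execute(
    "SELECT SID, Name, AVG(GPA) FROM Students "
    "GROUP BY SID, Name ORDER BY SID")]
```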
Pandas/Python
• Series: a named, ordered dictionary
– The keys of the dictionary are the indexes
– Built on NumPy’s ndarray
– Values can be any Numpy data type object

• DataFrame: a table with named columns

– Represented as a Dict (col_name -> series)
– Each Series object represents a column

Operations
• map() functions
• filter (apply predicate to rows)
• sort/group by
• aggregate: sum, count, average, max, min
• Pivot or reshape
• Relational:
– union, intersection, difference, Cartesian product (CROSS JOIN)
– select/filter, project
– join: natural join (INNER JOIN), theta join, semi-join, etc.
– rename
Pandas vs SQL
+ Pandas is lightweight and fast.
+ Full SQL expressiveness plus the expressiveness of
Python, especially for function evaluation.
+ Integration with plotting functions like Matplotlib.

- Tables must fit into memory.

- No post-load indexing functionality: indices are built
when a table is created.
- No transactions, journaling, etc.
- Large, complex joins probably slower.

Jacobs Update
• Room 310 not ready, but other rooms are. For
the next few (?) weeks, we will meet:
• Mondays in 155 Donner
• Wednesdays in 110/120 Jacobs Hall

Starting this Weds.

5 min break
The other view of tables: OLAP
• OnLine Analytical Processing
• Conceptually like an n-dimensional spreadsheet (Cube)
• (Discrete) columns become dimensions
• The goal is live interaction with numerical data for business
intelligence
From tables to OLAP cubes
From a table to a cube:

name   classid         Semester  Grade  Units
Jones  History105      F13       3.3    4.0
Jones  DataScience194  S12       4.0    3.0
Jones  French150       F14       3.7    4.0
Smith  History105      S15       2.3    3.0
Smith  DataScience194  F14       2.7    3.0
Smith  French150       F13       3.0    4.0

• name, classid, Semester: variables used as qualifiers
(in where, GroupBy clauses). Normally discrete.
• Grade, Units: variables we want to measure. Normally numeric.
Constructing OLAP cubes

name   classid         Semester  Grade  Units
Jones  History105      F13       3.3    4.0
Jones  DataScience194  S12       4.0    3.0
…      …               …         …      …

[Figure: a cube whose dimensions are Name, Classid and Semester (one
position per semester value); cell contents are Grade, Unit values]
Queries on OLAP cubes
• Once the cube is defined, it's easy to do aggregate queries by
projecting along one or more axes.
• E.g. to get student GPAs, we project the Grade field onto the
student (Name) axis.
• In fact, such aggregates are precomputed and maintained
automatically in an OLAP cube, so these queries need no scan of
the underlying data and return almost immediately.

[Figure: cube with axes Name, Classid and Semester; cell contents are
Grade, Unit values]
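
Projection onto an axis can be sketched with a plain dictionary standing in for the cube. The layout (cells keyed by `(name, classid, semester)`), the `project` helper, and the unweighted mean are all assumptions made for illustration; a real OLAP engine would precompute and maintain these aggregates rather than scan the cells.

```python
from collections import defaultdict

# Hypothetical sparse cube: (name, classid, semester) -> (grade, units)
cube = {
    ("Jones", "History105", "F13"): (3.3, 4.0),
    ("Jones", "DataScience194", "S12"): (4.0, 3.0),
    ("Jones", "French150", "F14"): (3.7, 4.0),
    ("Smith", "History105", "S15"): (2.3, 3.0),
    ("Smith", "DataScience194", "F14"): (2.7, 3.0),
    ("Smith", "French150", "F13"): (3.0, 4.0),
}


def project(cells, axis):
    """Average Grade onto one axis (0=name, 1=classid, 2=semester)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, (grade, _units) in cells.items():
        sums[key[axis]] += grade
        counts[key[axis]] += 1
    return {k: sums[k] / counts[k] for k in sums}


# Project Grade onto the Name axis to get a per-student average.
gpa_by_name = project(cube, 0)

# Slicing: fix one variable (Semester = "F13") and keep the matching cells.
f13_slice = {k: v for k, v in cube.items() if k[2] == "F13"}
```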
OLAP
• Slicing: fixing one or more variables
• Dicing: selecting a range of values for one or more variables
• Drilling Up/Down: change levels of a hierarchically-indexed variable
• Pivoting: produce a two-axis view for viewing as a spreadsheet.
Outline
• To support real-time querying, OLAP DBs store aggregates
of data values along many dimensions.
• This works best if axes can be tree-structured. E.g. time can
be expressed as a hierarchy
hour → day → week → month → year
OLAP tradeoffs
• Aggregates increase space and the cost of updates.
• On the other hand, since they are projections of data, or
tree structures, the storage overhead can be small.
• Aggregates are limited, but cover a lot of common cases:
avg, stdev, min, max.
• Operations (slice, dice, pivot, etc.) are conceptually simpler
than SQL, but cover a lot of common cases.
• Good integration with clients, e.g. spreadsheets, for visual
interaction, although there is an underlying query
language (MDX).
Numpy/Matlab and OLAP
• Numpy and Matlab have an efficient implementation of
nd-arrays for dense data.
• Indices must be integer, but you can implement general
indices using dictionaries from indexval->int.
• Slicing and dicing are available using index ranges:
a[5,1:3,:] etc.
• Roll-down/up involve aggregates along dimensions such as
a[3,4:6,:].sum(axis=0) in Numpy (the integer index 3 drops its
dimension, so the 4:6 range becomes axis 0)
• Pivoting involves index permutations (.transpose()) and
aggregation over the other indices.
• Limitation: MATLAB and Numpy currently only support dense
nd-arrays (or sparse 2d arrays).
What’s Wrong with Tables?

• Too limited in structure?

• Too rigid?
• Too old fashioned?
What’s Wrong with (RDBMS) Tables?
• Indices: Typical RDBMS table storage is mostly indices
– Can't afford this overhead for large datastores
• Transactions:
– Safe state changes require journals etc., and are slow
• Relations:
– Checking relations adds further overhead to updates
• Sparse Data Support:
– RDBMS Tables are very wasteful when data is very sparse
– Very sparse data is common in modern data stores
– RDBMS tables might have dozens of columns, modern data
stores might have many thousands.
RDBMS tables – row based
Table:
sid name login age gpa
53831 Jones jones@cs 18 3.4
53831 Smith smith@ee 18 3.2

Represented as:
53831 Jones jones@cs 18 3.4

53831 Smith smith@ee 18 3.2

Tweet JSON Format
RDBMS tables – row based
Table:
ID name login loc locid LAT LONG ALT State
52841 Jones jones@cs NULL NULL NULL NULL NULL NULL
53831 Smith smith@ee NULL NULL NULL NULL NULL NULL
55541 Brown brown@ee NULL NULL NULL NULL NULL NULL

Represented as:
52841 Jones jones@cs NULL NULL NULL NULL NULL NULL

53831 Smith smith@ee NULL NULL NULL NULL NULL NULL

55541 Brown brown@ee NULL NULL NULL NULL NULL NULL

Column-based store
Table:
ID     name   login     loc     locid  LAT   LONG   ALT   State
52841  Jones  jones@cs  Albany  2341   38.4  122.7  100   CA
53831  Smith  smith@ee  NULL    NULL   NULL  NULL   NULL  NULL
55541  Brown  brown@ee  NULL    NULL   NULL  NULL   NULL  NULL

Represented as column (key-value) stores:

ID     name      ID     login       ID     loc       ID     locid
52841  Jones     52841  jones@cs    52841  Albany    52841  2341
53831  Smith     53831  smith@ee
55541  Brown     55541  brown@ee

ID     LAT       ID     LONG        …
52841  38.4      52841  122.7
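
The space argument for column stores on sparse data can be sketched in plain Python. The dictionaries below are an illustrative model only (not how any particular database lays out storage): the row layout keeps a cell for every column, NULL or not, while each column store keeps only the IDs that actually have a value.

```python
COLUMNS = ["ID", "name", "login", "loc", "locid", "LAT", "LONG", "ALT", "State"]


def make_row(*values):
    """Pad a short record with None (NULL) up to the full column list."""
    padded = list(values) + [None] * (len(COLUMNS) - len(values))
    return dict(zip(COLUMNS, padded))


# Row-based layout: every row stores all nine cells.
rows = [
    make_row(52841, "Jones", "jones@cs", "Albany", 2341, 38.4, 122.7, 100, "CA"),
    make_row(53831, "Smith", "smith@ee"),
    make_row(55541, "Brown", "brown@ee"),
]

# Column-based layout: one ID -> value map per column, skipping NULLs.
columns = {
    col: {r["ID"]: r[col] for r in rows if r[col] is not None}
    for col in COLUMNS if col != "ID"
}

row_cells = sum(len(r) for r in rows)
column_cells = sum(len(store) for store in columns.values())
```

With these three records the row layout holds 27 cells and the column layout 12 key-value pairs, and the gap widens as the table gets sparser.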
NoSQL Storage Systems

Column-Family Stores (Cassandra)
A column-family groups data columns together, and is
analogous to a table (and similar to Pandas DataFrame)
Static column family from Apache Cassandra: columns are fixed.

Dynamic column family (Cassandra): columns can be added to or
removed from a dynamic column family.
CouchDB Data Model (JSON)
• “With CouchDB, no schema is enforced, so new document
types with new meaning can be safely added alongside
the old.”
• A CouchDB document is an object that consists of named
fields. Field values may be:
– strings, numbers, dates,
– ordered lists, associative maps
"Subject": "I like Plankton"
"Author": "Rusty"
"PostedDate": "5/23/2006"
"Tags": ["plankton", "baseball", "decisions"]
"Body": "I decided today that I don't like baseball. I like plankton."

Key-value stores
• A key-value store is an even simpler approach.
• It implements storage and retrieval of (key,value) pairs.
• i.e. Basic functionality is that of a dictionary
age[“john”] = 25.
• But some KV-stores also implement sorting and
indexing with the keys (e.g. LevelDB).
• You can build either column-based or row-based DBs
on top of such KV-stores to optimize performance (e.g.
omitting indices or ACID qualities).
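
A key-value store with sorted keys can be sketched in a few lines. The class `SortedKV` and its `put`, `get` and `scan` methods are hypothetical names for this illustration and do not mirror the API of LevelDB or any other real store; the extra keys besides "john" are made-up sample data.

```python
import bisect


class SortedKV:
    """Toy in-memory key-value store that keeps its keys in sorted order."""

    def __init__(self):
        self._keys = []   # sorted list of keys
        self._data = {}   # key -> value

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def scan(self, lo, hi):
        """Return (key, value) pairs with lo <= key < hi, in key order."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_left(self._keys, hi)
        return [(k, self._data[k]) for k in self._keys[i:j]]


age = SortedKV()
age.put("john", 25)
age.put("alice", 31)
age.put("zoe", 19)
```

`age.get("john")` is the dictionary-style lookup from the slide, while `age.scan("a", "k")` is the kind of range query that sorted keys make possible.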

Pig
• Started at Yahoo! Research
• Features:
– Expresses sequences of MapReduce jobs
– Data model: nested “bags” of items
• Schema is optional
– Provides relational (SQL) operators
(JOIN, GROUP BY, etc)
– Easy to plug in Java functions
An Example Problem
Suppose you have user data in one file, website
data in another, and you need to find the top 5
most visited pages by users aged 18-25.

Dataflow:
1. Load Users, Load Pages
2. Filter by age
3. Join on name
4. Group on url
5. Count clicks
6. Order by clicks
7. Take top 5

Example from http://wiki.apache.org/pig-data/attachments/PigTalksPapers/attachments/ApacheConEurope09.ppt

In MapReduce


In Pig Latin

-- Load user records and keep those aged 18 to 25.
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
-- Load page visits and join them with the filtered users.
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
-- Count clicks per url, then keep the five most visited pages.
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
         COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;

store Top5 into 'top5sites';
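
For comparison, the same dataflow can be sketched in plain Python on a few in-memory records. The sample users and pages are made up for the example, and the join is reduced to a set-membership test, which is only equivalent under the assumption that user names are unique.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'users' and 'pages' files.
users = [("amy", 22), ("bob", 40), ("cat", 19)]
pages = [("amy", "a.com"), ("amy", "b.com"), ("cat", "a.com"),
         ("bob", "c.com"), ("cat", "a.com")]

# Filter by age.
filtered = {name for name, age in users if 18 <= age <= 25}

# Join on name: keep page visits made by a filtered user.
joined = [(user, url) for user, url in pages if user in filtered]

# Group on url and count clicks.
clicks = Counter(url for _user, url in joined)

# Order by clicks (descending) and take the top 5.
top5 = clicks.most_common(5)
```

Each step maps onto one Pig Latin statement, but here everything runs in one process instead of as a sequence of MapReduce jobs.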


Hive
• Developed at Facebook
• Relational database built on Hadoop
– Maintains table schemas
– SQL-like query language (which can also call
Hadoop Streaming scripts)
– Supports table partitioning, complex data types, sampling,
some query optimization
• Used for most Facebook jobs
– Less than 1% of daily jobs at Facebook use
MapReduce directly!!! (SQL – or Pig – wins!)
– Note: Google also has several SQL-like systems in use.
Summary
• Two views of tables:
– SQL/Pandas
– OLAP/Numpy/Matlab
• SQL, NoSQL
– Non-Tabular Structures

Wednesday come to 110/120 Jacobs Hall for Pandas Lab
