Lecture 1 - Introduction
DSCI 551
Wensheng Wu
1
Logistics
• Instructor email: [email protected]
• Office hours:
– 30 minutes after each class
2
Logistics
• Graders
– check out announcements
• Class materials
– Posted on course web site
– https://courses.uscden.net/
3
Piazza
• Discussion forums
– You may post general and homework questions
– Do not post solutions
– Please actively participate in helping others!
– Do not abuse the forum (that is academic misconduct!)
4
Prerequisites
• Programming skills:
– Python (homework, Spark), Java (e.g., for Hadoop
only)
• Scala
5
Prerequisites
• Basic knowledge of algorithms and data structures
– Sorting, hashing, etc. (CS 570), e.g., merge sort
• Merge sort example: 3, 2, 1, 4, 6, 5
=> sorted runs (1, 2, 3) and (4, 5, 6)
=> merge => 1, 2, 3, 4, 5, 6
– Hashing: h(k) => if k is even, send (k, v) to R0; otherwise, send it to R1
• e.g., 3 % 2 = 1 (goes to R1), 2 % 2 = 0 (goes to R0)
– I/O
• e.g., 1TB of data on SSD but only 1GB of main memory => sort in runs, then merge (external sorting); I/O cost matters
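A minimal Python sketch of the run-and-merge idea above (the helper names and run size are illustrative, not from the course materials):

import heapq

def sort_in_runs(data, run_size):
    # split the input into chunks of run_size and sort each chunk (a "run")
    return [sorted(data[i:i + run_size]) for i in range(0, len(data), run_size)]

def merge_runs(runs):
    # k-way merge of already-sorted runs into one sorted list
    return list(heapq.merge(*runs))

data = [3, 2, 1, 4, 6, 5]
runs = sort_in_runs(data, run_size=3)   # [[1, 2, 3], [4, 5, 6]]
print(merge_runs(runs))                 # [1, 2, 3, 4, 5, 6]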
6
notes
• h(x) = x % 2
• h(3) = 1
• h(4) = 0
• h('john') = 0
• h('bill') = 1
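A minimal sketch of hash partitioning with both integer and string keys. Python's built-in hash() is randomized across runs for strings, so a deterministic md5-based hash is used here purely for illustration; the partitions for 'john' and 'bill' may therefore differ from the values above.

import hashlib

def h(key, num_partitions=2):
    # map a key (int or str) to a partition number in [0, num_partitions)
    if isinstance(key, int):
        return key % num_partitions
    digest = hashlib.md5(str(key).encode()).hexdigest()   # deterministic for strings
    return int(digest, 16) % num_partitions

for k in [3, 4, 'john', 'bill']:
    print(k, '->', 'R%d' % h(k))   # e.g., 3 -> R1, 4 -> R0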
7
Textbooks
• Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-
Dusseau. Operating Systems: Three Easy Pieces, 2015
(selected chapters only). Available free at:
http://pages.cs.wisc.edu/~remzi/OSTEP/
8
Additional readings
• Links can be found in Syllabus
– Check out the schedule
9
Grading structure
• See syllabus
10
Grading scale
• [94, 100] = A
• [90, 94) = A-
• [87, 90) = B+
• [83, 87) = B
• [80, 83) = B-
• [77, 80) = C+
• [73, 77) = C
• … (see Syllabus for complete breakdown)
11
Lab tasks
• Four tasks:
– EC2, HDFS, MongoDB, DynamoDB
12
Exams
• 3 exams
13
Calculator
• Bring one to the tests
14
Course project
• Details to be posted
• Done in phases
– Proposal
– Midterm report
– Final report
15
Participation
• Submit a summary of the lecture contents on a weekly basis
16
Late Policy
• No LATE submissions will be accepted
17
Grading Corrections
• All coursework grades are final one week after they are posted, or as stated in the announcement
18
Academic Integrity
• Cheating will NOT be tolerated
19
Now, movie time ☺
• Explain big data:
– https://www.youtube.com/watch?v=7D1CQ_LOizA
• Questions:
– Where does big data come from?
– What characteristics does it have? 3Vs?
– What big data technologies were mentioned?
• Hadoop: HDFS and MapReduce
20
Variety
21
Internet Traffic in 2012
• 4.8 zettabytes = 4.8 billion terabytes
• Zettabyte (1000 exabytes)
• Exabyte
• Petabyte
• Terabyte = 2^40 bytes (storage); 1TB = 1024 (2^10) GB
• Gigabyte = 2^30 bytes (memory)
• Megabyte = 2^20 bytes = 2^10 * 2^10 (128MB is the HDFS block size)
• For scale: main memory ~12GB, SSD ~1TB
• Positional notation: 123 (decimal) = 1*10^2 + 2*10^1 + 3*10^0; 111 (binary) = 1*2^2 + 1*2^1 + 1*2^0 = 7; 111 + 1 (binary) = 1000 (binary) = 8 (decimal); 2^2 = 4, 2^3 = 8, 2^7 = 128
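A quick Python check of the unit and base conversions above (nothing here is course-specific):

# powers of two behind the storage units
KB, MB, GB, TB = 2**10, 2**20, 2**30, 2**40
print(TB // GB)                   # 1024 (1TB = 1024 GB)
print(128 * MB // (4 * KB))       # 32768 (128MB HDFS block vs. 4KB disk block)

# decimal vs. binary positional notation
print(1*10**2 + 2*10**1 + 3*10**0)                # 123
print(int('111', 2), int('111', 2) + 1, bin(8))   # 7 8 0b1000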
23
Major topics
• Storage systems
24
Storage Systems
• Hard disk
• SSD (Solid state drive)
4KB = block size for HDD
128MB = block size in HDFS
128MB/4KB = 32K
25
Internal of hard disk
Actuator
Spindle
Platter
Disk head
26
NAND flash
27
Latencies: read, write, and erase
28
Major topics
• Storage systems
29
File Systems
• Standalone
– Single machine
30
Standalone file systems
• Data structures
– Data blocks
– Metadata blocks (Inodes)
– Bitmap blocks (for space allocation)
• Access paths
– Read a file
– Write a file
31
Inode (index node)
• Each is identified by a number
– The inumber is the low-level name of a file
• Can figure out location of inode from inumber
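A minimal sketch of that calculation in the style of OSTEP; the start address, inode size, and sector size are illustrative constants, not values from the lecture:

def inode_location(inumber, inode_start_addr=12 * 1024, inode_size=256, sector_size=512):
    # byte offset of the inode within the on-disk inode table
    offset = inode_start_addr + inumber * inode_size
    # sector that holds the inode
    sector = offset // sector_size
    return offset, sector

print(inode_location(32))   # (20480, 40) with the assumed constants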
32
Distributed file systems
• Hadoop HDFS (modeled after Google's GFS)
– Data are distributed among data nodes
• Replication
– Automatic creation of replicas (typically 2 or 3 copies of the data)
• Fault-tolerant
– Automatic recovery from node failure
33
HDFS architecture
[Diagram: HDFS architecture with blocks A, B, and C distributed across data nodes]
34
File system image in namenode
35
Directory section
36
Major topics
• Storage systems
37
File Formats
• JSON
38
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
39
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
41
Android app resource file
42
Manifest.xml
43
<bib>
…
<book price="35">
<publisher>Addison-Wesley</publisher>
<author>Serge Abiteboul</author>
<author><first-name>Rick</first-name><last-name>Hull</last-name></author>
<author age="20">Victor Vianu</author>
<title>Foundations of Databases</title>
<year>1995</year>
<price>38.8</price>
</book>
<book price="55">
<publisher>Freeman</publisher>
<author>Jeffrey D. Ullman</author>
<title>Principles of Database and Knowledge Base Systems</title>
<year>1998</year>
</book>
…
</bib>
44
Data Model for XPath
[Tree diagram: document node => book elements => publisher, author, ... children]
/bib/paper/year
Result: empty (there are no papers)
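A minimal Python sketch of evaluating a simple path on the bib document from the earlier slide, using the standard library's ElementTree (its limited XPath support is enough here); ./paper/year matches nothing because there are no paper elements:

import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<bib>
  <book price="35">
    <publisher>Addison-Wesley</publisher>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
</bib>
""")

print(doc.findall('./book/year'))    # one matching element
print(doc.findall('./paper/year'))   # [] -- empty: no paper elements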
46
Major topics
• Storage systems
47
Relational DBMS
• Data models
– ER (E = entity set, R = relationship)
– Relational (redundancy => update anomalies)
• Schema
– describes the structure of data
– including constraints
48
RDBMS
• Query languages
– Relational algebra
– SQL, constraints, views
• Data organization
– Records and blocks
– Index structure: B+-tree (external data structure)
49
RDBMS
• Query execution algorithms
– External sorting
– One-pass algorithms
– Nested-loop join, sorting, hashing-based
– Multiple-pass algorithms
50
RDBMS
• Rigid schema
51
RDBMS
• Hard to scale out
– Horizontal partitioning/sharding possible
– But would need distributed storage & computing
support like Hadoop & MapReduce
52
RDBMS Examples
• MySQL (can be installed in Amazon AWS EC2)
54
Conceptual Modeling
[ER diagram: entity sets Student, Course, and Professor; relationships Takes (with a semester attribute), Advises, and Teaches; attributes include ssn, name, cid, and category]
57
Query Optimization
Goal: translate a declarative SQL query into an imperative query execution plan

select C.name
from Students S, Takes T, Courses C
where S.name="Mary" and cid=cid

[Plan tree: scan Students, Takes, and Courses; join them; filter on name="Mary"; project Courses.name]
59
Topics
• Big data management & analytics
– Cloud data storage (Amazon S3)
– NoSQL
• Google Firebase (real-time database, …)
• MongoDB (shell, mongo)
• Amazon DynamoDB (row store, key-value)
• Cassandra (not required)
– Apache Hadoop & MapReduce
– Apache Spark
60
Cloud data storage
• Amazon S3 (simple storage service)
– Ideal for storing large binary files
– E.g., audio, video, image
– Simple RESTful web service
61
62
Upload a file
63
64
NoSQL
• Not only SQL
• Flexible schemas
– e.g., JSON documents or key-value pairs
– Ideal for managing a mix of structured, semi-
structured, and unstructured data
65
Example NoSQL databases
• MongoDB, Firebase, etc.
– Manage JSON documents
• Amazon DynamoDB
– Row store
– row = item = a collection of key-value pairs
• Neo4J…
66
Key techniques
• Consistent hashing (Cassandra, Dynamo)
– Avoid moving too much data when adding new
machines (scaling out)
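A minimal sketch of consistent hashing (illustrative only; systems like Cassandra and Dynamo add virtual nodes and replication). Keys and nodes are hashed onto the same ring, and each key is stored on the first node clockwise from it, so adding a node only moves the keys between it and its predecessor:

import bisect
import hashlib

def ring_hash(s):
    # deterministic position on a 0..2^32-1 ring (md5 chosen only for illustration)
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        # first node clockwise from the key's position (wrap around at the end)
        idx = bisect.bisect(self.ring, (ring_hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(['node1', 'node2', 'node3'])
print({k: ring.node_for(k) for k in ['alice', 'bob', 'carol']})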
67
Write path in Cassandra
[Diagram: Cassandra write path (numbered steps); writes are append only]
68
Key techniques
• Compaction
– Introduced in Google "Bigtable" paper
– Merge multiple versions of data
– Remove expired or deleted data
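A minimal sketch of the compaction idea (illustrative, not the actual Bigtable/Cassandra format): given several runs of (key, timestamp, value) entries, keep only the newest version of each key and drop tombstones marking deletions:

def compact(runs):
    # keep the newest (largest timestamp) version of each key
    latest = {}
    for run in runs:
        for key, ts, value in run:
            if key not in latest or ts > latest[key][0]:
                latest[key] = (ts, value)
    # drop tombstones (value None marks a deleted key) and sort by key
    return sorted((k, v) for k, (ts, v) in latest.items() if v is not None)

old_run = [('a', 1, 'x'), ('b', 1, 'y')]
new_run = [('a', 2, 'x2'), ('b', 3, None)]   # 'b' was deleted
print(compact([old_run, new_run]))           # [('a', 'x2')]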
69
DynamoDB
• https://console.aws.amazon.com/dynamodb/home?region=us-east-1#gettingStarted:
70
71
Insert items
72
May add new attributes
73
Firebase: a cloud database
74
Firebase
75
Topics
• Big data management & analytics
– Cloud data storage (Amazon S3)
– NoSQL (Amazon DynamoDB, Cassandra,
MongoDB)
– MapReduce
– Apache Hadoop
– Apache Spark
76
Roots in functional programming
• Functional programming languages:
– Python, Lisp (list processor), Scheme, Erlang, Haskell
• Two functions:
– Map: mapping a list => list
– Reduce: reducing a list => value
78
Lambda function
• Anonymous function (not bound to a name)
• list = [1, 2, 3]
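A small example of the idea on this slide: an anonymous function passed directly to map() over the list [1, 2, 3] (the list is named lst here to avoid shadowing the built-in list; the doubling function is just an illustration):

lst = [1, 2, 3]

# lambda: an anonymous function, not bound to a name
print(list(map(lambda x: x * 2, lst)))   # [2, 4, 6]

# equivalent named function, for comparison
def double(x):
    return x * 2

print(list(map(double, lst)))            # [2, 4, 6]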
79
How is reduce() in Python evaluated?
• z = reduce(f, list) where f is add function
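With f being addition, reduce folds the list from the left, so z = reduce(f, [1, 2, 3]) is evaluated as f(f(1, 2), 3):

from functools import reduce   # in Python 3, reduce lives in functools
from operator import add

z = reduce(add, [1, 2, 3])
# evaluation: add(add(1, 2), 3) = add(3, 3) = 6
print(z)   # 6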
80
Hadoop MapReduce
• Map
– <k, v> => list of <k', v'>
• Reduce:
– <k', list of v'> => list of <k'', v''>
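To make the <k, v> signatures concrete, here is a purely conceptual Python sketch of word count in MapReduce terms (the actual Hadoop mapper and reducer on the following slides are written in Java):

def map_fn(key, value):
    # Map: <k, v> => list of <k', v'>; key: line offset, value: line of text
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # Reduce: <k', list of v'> => list of <k'', v''>; key: word, values: its counts
    return [(key, sum(values))]

print(map_fn(0, "hello world hello"))   # [('hello', 1), ('world', 1), ('hello', 1)]
print(reduce_fn('hello', [1, 1]))       # [('hello', 2)]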
81
MapReduce
82
WordCount: mapper
The input key type Object can be replaced with LongWritable
83
WordCount: reducer
Data types of input key-value
Data types of output key-value
A list of values
84
Characteristics of Hadoop
• Acyclic data flow model
– Data loaded from stable storage (e.g., HDFS)
– Processed through a sequence of steps
– Results written to disk
• Batch processing
– No interactions permitted during processing
85
Problems
• Ill-suited for iterative algorithms that require repeated reuse of data
– E.g., machine learning and data mining algorithms
such as k-means, PageRank, logistic regression
86
In-memory MapReduce (Spark)
• Key concepts
– RDD (resilient distributed dataset)
– Transformations
– Actions
87
Apache Spark: history
88
Spark
• Support working sets through RDD
– Enabling reuse & fault-tolerance
89
Spark
• Combine SQL, streaming, and complex
analytics
• We will see DataFrame in Spark too
90
Spark
• Run on Hadoop, Cassandra, HBase, etc.
91
wc.py
from pyspark import SparkContext
from operator import add
sc = SparkContext(appName="dsci551")
lines = sc.textFile('hello.txt')
# counts was not defined in the original listing; the standard word-count chain is assumed
counts = lines.flatMap(lambda line: line.split(' ')) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(add)
output = counts.collect()
for v in output:
    print(v[0], v[1])
92
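Assuming Spark is installed and hello.txt is available in the working directory (or on HDFS), the script can be run with: spark-submit wc.py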
Coming up…
• Task: Setting up an EC2 instance
• Details:
– see posted instructions and come to class!
93
Resources
• Merge sort:
– https://www.interviewbit.com/tutorial/merge-sort-algorithm/
– https://www.youtube.com/watch?v=Nso25TkBsYI
• Hashing
– https://www.tutorialspoint.com/python_data_structure/python_hash_table.htm
– https://www.programiz.com/python-programming/methods/built-in/hash
94