Lecture 1 - Introduction

DSCI 551 is a course taught by Wensheng Wu that covers topics in big data management, including storage systems, file formats, and database management systems. Students are expected to have programming skills in Python and Java, as well as knowledge of algorithms and data structures. The course includes a grading structure, participation requirements, and a strict late policy, with a focus on academic integrity and collaboration through platforms like Piazza.

Uploaded by

Yuan-hsuan Wen

Introduction

DSCI 551
Wensheng Wu

1
Logistics
• Instructor email: [email protected]

• Class meeting times:


– see syllabus

• Office hours:
– 30 minutes after each class

2
Logistics
• Graders
– check out announcements

• Class materials
– Posted on course web site
– https://courses.uscden.net/

• AWS EC2 (elastic compute cloud)

3
Piazza
• Discussion forums
– You may post general and homework questions
– Do not post solutions
– Please actively participate in helping others!
– Do not abuse forum (an academic misconduct!)

• Check frequently for updates

• Check course website on how to access Piazza

4
Prerequisites
• Programming skills:
– Python (homework, Spark), Java (e.g., for Hadoop only)
• Scala

• Unix-like environment & shell commands (ls?)


– E.g., Amazon EC2

5
Prerequisites
• Basic knowledge of algorithms and data structures
– Sorting, hashing, etc. (CS 570), e.g., merge sort
• Input: 3, 2, 1, 4, 6, 5
• Sorted runs: 1, 2, 3 and 4, 5, 6
• Merge the runs => 1, 2, 3, 4, 5, 6
– Hash partitioning: h(k) => if k is even, send (k, v) to R0; otherwise, send to R1
• 3 % 2 = 1
• 2 % 2 = 0
– I/O
• 1TB of data on SSD with 1GB of main memory => sort in runs

• Basic probability and statistics
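
The sorting example on this slide (split the input into sorted runs, then merge the runs) can be sketched in Python. This is only an in-memory illustration: the function names and the run size of 3 are assumptions chosen to match the slide's numbers, and a real external sort would write each run to storage.

```python
import heapq

def make_sorted_runs(values, run_size):
    """Split the input into chunks of run_size items and sort each chunk."""
    return [sorted(values[i:i + run_size])
            for i in range(0, len(values), run_size)]

def merge_runs(runs):
    """Merge already-sorted runs into one sorted list."""
    return list(heapq.merge(*runs))

data = [3, 2, 1, 4, 6, 5]
runs = make_sorted_runs(data, run_size=3)   # run_size is an assumed value
result = merge_runs(runs)
print(runs)    # [[1, 2, 3], [4, 5, 6]]
print(result)  # [1, 2, 3, 4, 5, 6]
```

With 1TB of data and 1GB of memory, each run would hold at most what fits in memory, which is why the data has to be processed in runs.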

6
notes
• h(x) = x % 2

• h(3) = 1
• h(4) = 0

• h('john') = 0
• h('bill') = 1
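
A minimal sketch of the partitioning rule h(x) = x % 2 for integer keys follows. The `partition` helper and the sample pairs are hypothetical. String keys such as 'john' would first have to be turned into an integer; the slide does not say how, so that step is left out here.

```python
def h(k):
    """Partition function from the notes: even keys -> 0, odd keys -> 1."""
    return k % 2

def partition(pairs):
    """Send each (k, v) pair to reducer R0 or R1 according to h(k)."""
    buckets = {0: [], 1: []}
    for k, v in pairs:
        buckets[h(k)].append((k, v))
    return buckets

pairs = [(3, 'a'), (2, 'b'), (4, 'c')]   # hypothetical sample data
buckets = partition(pairs)
print(h(3), h(4))  # 1 0
print(buckets)     # {0: [(2, 'b'), (4, 'c')], 1: [(3, 'a')]}
```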

7
Textbooks
• Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau. Operating Systems: Three Easy Pieces, 2015 (selected chapters only). Available free at:
http://pages.cs.wisc.edu/~remzi/OSTEP/

• Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. Database Systems: The Complete Book (Second Edition), Prentice Hall, 2009. (selected chapters only)
– http://infolab.stanford.edu/~ullman/dscb.html

• See four more books in syllabus

8
Additional readings
• Links can be found in Syllabus
– Check out the schedule

9
Grading structure
• See syllabus

10
Grading scale
• [94, 100] = A
• [90, 94) = A-
• [87, 90) = B+
• [83, 87) = B
• [80, 83) = B-
• [77, 80) = C+
• [73, 77) = C
• … (see Syllabus for complete breakdown)

11
Lab tasks
• Four tasks:
– EC2, HDFS, MongoDB, DynamoDB

12
Exams
• 3 exams

• Closed-notes & book, in-person

13
Calculator
• Bring one to the tests

• If a calculator is needed, we will either announce it or state it on the tests

• Otherwise, no electronic devices are allowed

14
Course project
• Details to be posted

• Done in phases
– Proposal
– Midterm report
– Final report

15
Participation
• submit a summary of lecture contents on a
weekly basis

• see syllabus for details

16
Late Policy
• No LATE submissions will be accepted

• Make-up tests are permitted only when
– You have a medical emergency with a doctor's note, signed and with contact info

• No make-ups for personal matters, scheduling conflicts, etc.

17
Grading Corrections
• All coursework grades become final one week after grades are posted, or as stated in the announcement

• Please submit only reasonable regrading requests
– Unreasonable requests (e.g., simply asking for more points or special treatment) may result in a reduction of your grade

18
Academic Integrity
• Cheating will NOT be tolerated

• All parties involved will receive a grade of F for the course and be reported to SJACS WITHOUT EXCEPTION
– USC Student Judicial Affairs and Community Standards

19
Now, movie time ☺
• Explain big data:
– https://www.youtube.com/watch?v=7D1CQ_LOizA

• Questions:
– Where does big data come from?
– What characteristics does it have? 3Vs?
– What big data technologies were mentioned?
• Hadoop: HDFS and MapReduce

20
Variety

21
Internet Traffic in 2012
• 4.8 zettabytes = 4.8 billion terabytes
• Zettabyte (1000 exabytes)
• Exabyte
• Petabyte
• Terabyte = 2^40 (storage)
– 1TB = 1024 (2^10) GB
– 2^2 = 4, 2^3 = 8, 2^7 = 128
• Gigabyte = 2^30 (memory)
• Megabyte (128MB, HDFS)
– 1MB = 2^20 = 2^10 * 2^10
• Kilobyte = 2^10 (1KB) = 1024B // 2^5 = 32

Side notes:
• Main memory: 12GB; SSD: 1TB
• 123 (decimal) = 1*10^2 + 2*10^1 + 3*10^0
• 111 (binary) = 1*2^2 + 1*2^1 + 1*2^0 = 7
• 111 + 001 (binary) = 1000 (binary) = 8 (decimal)
• 11 (binary) = 1*2^1 + 1*2^0 = 3
• 100 (binary) = 1*2^2 = 4
• 100 (decimal) = 1*10^2
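
The powers of two and the binary arithmetic on this slide can be checked with a few lines of Python. The constant names are chosen here for illustration only.

```python
# Binary (power-of-two) unit sizes in bytes, as used on the slide.
KB = 2 ** 10
MB = 2 ** 20
GB = 2 ** 30
TB = 2 ** 40

print(TB // GB)                        # 1024, i.e. 1TB = 1024 GB
print(MB == KB * KB)                   # True, 2^20 = 2^10 * 2^10
print(int('111', 2))                   # 7
print(bin(0b111 + 0b1))                # 0b1000, i.e. 8 in decimal
print(int('100', 2), int('100', 10))   # 4 100
```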
22
Notes
• main memory volatile
• export data from MySQL into CSV/JSON
format
• apache Hive – HQL
• very structured – relations (data in MySQL)
• semi-structured (JSON/XML)
• unstructured (texts) NLP
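
As a rough illustration of exporting structured rows into CSV and JSON, the sketch below uses only Python's standard library. The in-memory `rows` list stands in for a MySQL result set (no database connection is made), and the column names are assumptions.

```python
import csv
import io
import json

# Hypothetical rows standing in for the result of a MySQL query.
rows = [
    {"ssn": "123-45-6789", "name": "Charles", "category": "undergrad"},
    {"ssn": "234-56-7890", "name": "Dan", "category": "grad"},
]

# Semi-structured output: a JSON array of objects.
json_text = json.dumps(rows, indent=2)

# Tabular output: CSV with a header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["ssn", "name", "category"],
                        lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

print(json_text)
print(csv_text)
```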

23
Major topics
• Storage systems

• File systems & file formats

• Database management systems (RDBMS)


– R = relational

• Big data solution stack

24
Storage Systems
• Hard disk
• SSD (Solid state drive)
4KB = block size for HDD
128MB = block size in HDFS
128MB/4KB = 32K
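
The ratio above can be verified directly; the variable names are illustrative.

```python
HDD_BLOCK = 4 * 2 ** 10       # 4KB block on a hard disk, in bytes
HDFS_BLOCK = 128 * 2 ** 20    # 128MB block in HDFS, in bytes

ratio = HDFS_BLOCK // HDD_BLOCK
print(ratio)                  # 32768
print(ratio == 32 * 2 ** 10)  # True, i.e. 32K disk blocks per HDFS block
```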


25
Internals of a hard disk
[Figure: hard disk internals, labeled actuator, spindle, platter, and disk head]
26
NAND flash

27
Latencies: read, write, and erase

28
Major topics
• Storage systems

• File systems & file formats

• Database management systems

• Big data solution stack

29
File Systems
• Standalone
– Single machine

• Distributed (e.g., Hadoop)


– A number of data servers

30
Standalone file systems
• Data structures
– Data blocks
– Metadata blocks (Inodes)
– Bitmap blocks (for space allocation)

• Access paths
– Read a file
– Write a file

31
Inode (index node)
• Each is identified by a number
– Low-level name of a file: its inumber
• Can figure out location of inode from inumber
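
One way to see how an inode's location follows from its inumber is the small calculation below. The layout values (inode table starting at 12KB, 256-byte inodes, 4KB blocks, 512-byte sectors) are assumptions in the style of the OSTEP textbook's simple file system example, not values given on the slide.

```python
# Assumed on-disk layout (hypothetical values for illustration).
INODE_SIZE = 256                # bytes per inode
INODE_TABLE_START = 12 * 1024   # byte address where the inode table begins
BLOCK_SIZE = 4 * 1024           # bytes per block
SECTOR_SIZE = 512               # bytes per disk sector

def inode_location(inumber):
    """Compute where the inode with the given inumber lives on disk."""
    addr = INODE_TABLE_START + inumber * INODE_SIZE
    return {
        "byte_address": addr,
        "block": addr // BLOCK_SIZE,
        "offset_in_block": addr % BLOCK_SIZE,
        "sector": addr // SECTOR_SIZE,
    }

loc = inode_location(32)
print(loc)
# {'byte_address': 20480, 'block': 5, 'offset_in_block': 0, 'sector': 40}
```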

32
Distributed file systems
• Hadoop HDFS (after GFS)
– Data are distributed among data nodes

• Replication
– Automatic creation of replicas (typically 2 or 3 copies of the data)

• Fault-tolerant
– Automatic recovery from node failure

33
HDFS architecture


34
File system image in namenode

35
Directory section

36
Major topics
• Storage systems

• File systems & file formats

• Database management systems

• Big data solution stack

37
File Formats
• JSON

38
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999

39
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>

</bibliography>

XML describes the content

40


XML usages
• Software configuration files
– E.g., HDFS

• Android app development


– Layout resource files, e.g., activity_main.xml

• Java archive (.jar file)


– Manifest.xml

41
Android app resource file

42
Manifest.xml

43
<bib>

<book price="35">
<publisher>Addison-Wesley</publisher>
<author>Serge Abiteboul</author>
<author><first-name>Rick</first-name><last-name>Hull</last-name></author>
<author age="20">Victor Vianu</author>
<title>Foundations of Databases</title>
<year>1995</year>
<price>38.8</price>
</book>
<book price="55">
<publisher>Freeman</publisher>
<author>Jeffrey D. Ullman</author>
<title>Principles of Database and Knowledge Base Systems</title>
<year>1998</year>
</book>

</bib>
44
Data Model for XPath

Document node

bib The root element

book book

publisher author . . . .

Addison-Wesley Serge Abiteboul


45
XPath: Simple Expressions
/bib/book/year
Result: <year> 1995 </year>
<year> 1998 </year>

/bib/paper/year
Result: empty (there were no papers)
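
The two path expressions above can be tried with Python's built-in `xml.etree.ElementTree`, which supports a limited subset of XPath. Because the parsed root is already the `<bib>` element, the paths are written relative to it (`./book/year` rather than `/bib/book/year`). The document string is a shortened copy of the earlier `<bib>` example.

```python
import xml.etree.ElementTree as ET

# Shortened version of the <bib> document from the earlier slide.
doc = """
<bib>
  <book price="35">
    <publisher>Addison-Wesley</publisher>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
  <book price="55">
    <publisher>Freeman</publisher>
    <title>Principles of Database and Knowledge Base Systems</title>
    <year>1998</year>
  </book>
</bib>
""".strip()

root = ET.fromstring(doc)  # root is the <bib> element

# Equivalent of /bib/book/year, relative to the root element.
years = [y.text for y in root.findall("./book/year")]

# Equivalent of /bib/paper/year: there are no <paper> elements.
papers = root.findall("./paper/year")

print(years)   # ['1995', '1998']
print(papers)  # []
```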

46
Major topics
• Storage systems

• File systems & file formats

• Database management systems

• Big data solution stack

47
Relational DBMS
• Data models
– E/R: entity sets (E) and relationships (R)
– Relational (redundancy => update anomaly)

• Schema
– describes the structure of data
– including constraints

48
RDBMS
• Query languages
– Relational algebra
– SQL, constraints, views

• Data organization
– Records and blocks
– Index structure: B+-tree (external data structure)

49
RDBMS
• Query execution algorithms
– External sorting
– One-pass algorithms
– Nested-loop join, sorting, hashing-based
– Multiple-pass algorithms

50
RDBMS
• Rigid schema

• Strong consistency is the key design goal


– Never read old data
– Suitable for mission-critical applications, e.g.,
banking

• But may suffer from low availability


– ACID vs CAP

51
RDBMS
• Hard to scale out
– Horizontal partitioning/sharding possible
– But would need distributed storage & computing
support like Hadoop & MapReduce

52
RDBMS Examples
• MySQL (can be installed in Amazon AWS EC2)

• Amazon RDS (Relational database as a service)


– DBMS in the cloud
– Database as a service

• Data warehouse on RDBMS


– OLAP
53
Amazon RDS: Database-as-a-service
• MySQL, PostgreSQL, Oracle, SQL Server, etc.

54
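Conceptual Modeling
[E/R diagram: entity sets Student (ssn, name, category), Course (cid, name, semester), and Professor (name, address, field); relationships Takes (Student, Course), Advises (Professor, Student), and Teaches (Professor, Course)]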


55
Schema Design and Implementation
• Tables (relations):

Students:
SSN          Name     Category
123-45-6789  Charles  undergrad
234-56-7890  Dan      grad
…            …        …

Takes:
SSN          CID
123-45-6789  CSE444
123-45-6789  CSE541
234-56-7890  CSE142

Courses:
CID     Name               Semester
CSE444  Databases          fall
CSE541  Operating systems  spring

• Separates the logical view from the physical view of the data.
56
Querying a Database
• Find all courses that "Mary" takes
• S(tructured) Q(uery) L(anguage)
– clause

select C.name
from Students S, Takes T, Courses C
where S.name = "Mary" and
      S.ssn = T.ssn and T.cid = C.cid

• Query processor figures out how to answer the query efficiently.

Side notes:
• Declarative (what)
• Select A's, agg / From R's / Where C's / Group by A's / Having / Order by / Limit ? / Offset ? (pagination)
• Insert, Update, Delete
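
To try the query, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table contents follow the schema slide, except that the student "Mary" and her enrollment are hypothetical rows added so the query returns something.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Students (ssn text, name text, category text);
create table Takes (ssn text, cid text);
create table Courses (cid text, name text, semester text);

insert into Students values
    ('123-45-6789', 'Charles', 'undergrad'),
    ('234-56-7890', 'Dan', 'grad'),
    ('345-67-8901', 'Mary', 'grad');   -- hypothetical row
insert into Takes values
    ('123-45-6789', 'CSE444'),
    ('123-45-6789', 'CSE541'),
    ('234-56-7890', 'CSE142'),
    ('345-67-8901', 'CSE444');         -- hypothetical row
insert into Courses values
    ('CSE444', 'Databases', 'fall'),
    ('CSE541', 'Operating systems', 'spring');
""")

# Same join as on the slide; the name is passed as a parameter.
rows = conn.execute("""
    select C.name
    from Students S, Takes T, Courses C
    where S.name = ? and S.ssn = T.ssn and T.cid = C.cid
""", ("Mary",)).fetchall()

print(rows)  # [('Databases',)]
```

The query only states what is wanted; how the three tables are joined is left to the database engine.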

57
Query Optimization
Goal: Declarative SQL query => Imperative query execution plan

select C.name
from Students S, Takes T, Courses C
where S.name="Mary" and
      S.ssn = T.ssn and T.cid = C.cid

[Figure: plan tree, read bottom-up: filtering name="Mary" on Students; join with Takes on sid=sid; join with Courses on cid=cid; projection on Courses.name]

Plan: tree of Relational Algebra operators, choice of algorithms at each operator
58
Major topics
• Storage systems

• File systems & file formats

• Database management systems

• Big data solution stack

59
Topics
• Big data management & analytics
– Cloud data storage (Amazon S3)
– NoSQL
• Google Firebase (real-time database, …)
• MongoDB (shell, mongo)
• Amazon DynamoDB (row store, key-value)
• Cassandra (not required)
– Apache Hadoop & MapReduce
– Apache Spark

60
Cloud data storage
• Amazon S3 (simple storage service)
– Ideal for storing large binary files
– E.g., audio, video, image
– Simple RESTful web service

• Eventual consistency for high availability

61
62
Upload a file

63
64
NoSQL
• Not only SQL

• Flexible schemas
– e.g., JSON documents or key-value pairs
– Ideal for managing a mix of structured, semi-
structured, and unstructured data

• High availability (CAP)

• Weaker (e.g., eventual) consistency model

65
Example NoSQL databases
• MongoDB, Firebase, etc.
– Manage JSON documents

• Amazon DynamoDB
– Row store
– row = item = a collection of key-value pairs

• Apache Cassandra (not required)


– Wide column store
– Google's Bigtable clone

• Neo4J…

66
Key techniques
• Consistent hashing (Cassandra, Dynamo)
– Avoid moving too much data when adding new
machines (scaling out)

• Efficient writes (for update-heavy apps)


– Append-only
– No overwrites
– Avoid random seek
– But compaction needed later
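
The consistent hashing idea can be sketched as a ring of node positions, where each key belongs to the first node clockwise from the key's own position. This is a simplified illustration, not how Cassandra or Dynamo implement it: the `HashRing` class, the node names, and the use of SHA-256 are all assumptions, and virtual nodes are omitted.

```python
import bisect
import hashlib

def ring_hash(key):
    """Map a string to a position on the ring (stable across runs)."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes=()):
        self._positions = []   # sorted ring positions of the nodes
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        pos = ring_hash(node)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node

    def lookup(self, key):
        """Return the first node clockwise from the key's position."""
        i = bisect.bisect_right(self._positions, ring_hash(key))
        i %= len(self._positions)   # wrap around the ring
        return self._owner[self._positions[i]]

keys = [f"key{i}" for i in range(1000)]          # hypothetical keys
ring = HashRing(["node-a", "node-b", "node-c"])  # hypothetical nodes
before = {k: ring.lookup(k) for k in keys}

ring.add_node("node-d")                          # scale out by one machine
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
# Every key that changed owner moved to the new node; the rest stayed put.
print(len(moved), all(after[k] == "node-d" for k in moved))
```

By contrast, assigning keys with `hash(key) % number_of_nodes` would reassign most keys whenever the number of nodes changes.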

67
Write path in Cassandra

[Figure: write path in Cassandra; writes are append only]

68
Key techniques
• Compaction
– Introduced in Google "Bigtable" paper
– Merge multiple versions of data
– Remove expired or deleted data
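
A toy model of append-only writes followed by compaction is sketched below. The `AppendOnlyStore` class is hypothetical and keeps everything in one in-memory list; real systems such as Bigtable or Cassandra work on sorted files on disk, and expiry (TTL) is not modeled here.

```python
TOMBSTONE = object()  # marker recording that a key was deleted

class AppendOnlyStore:
    def __init__(self):
        self.log = []  # (key, value) records, oldest first

    def put(self, key, value):
        self.log.append((key, value))       # never overwrite in place

    def delete(self, key):
        self.log.append((key, TOMBSTONE))   # a delete is also an append

    def get(self, key):
        for k, v in reversed(self.log):     # the newest record wins
            if k == key:
                return None if v is TOMBSTONE else v
        return None

    def compact(self):
        """Keep only the latest version of each key and drop deleted keys."""
        latest = {}
        for k, v in self.log:
            latest[k] = v                   # later versions replace earlier ones
        self.log = [(k, v) for k, v in latest.items() if v is not TOMBSTONE]

store = AppendOnlyStore()
store.put("a", 1)
store.put("b", 2)
store.put("a", 3)
store.delete("b")

size_before = len(store.log)
store.compact()
print(size_before, store.log)            # 4 [('a', 3)]
print(store.get("a"), store.get("b"))    # 3 None
```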

69
DynamoDB
• https://console.aws.amazon.com/dynamodb/home?region=us-east-1#gettingStarted:

70
71
Insert items

72
May add new attributes

73
Firebase: a cloud database

74
Firebase

75
Topics
• Big data management & analytics
– Cloud data storage (Amazon S3)
– NoSQL (Amazon DynamoDB, Cassandra,
MongoDB)
– MapReduce
– Apache Hadoop
– Apache Spark

76
Roots in functional programming
• Functional programming languages:
– Python, Lisp (list processor), Scheme, Erlang, Haskell

• Two functions:
– Map: mapping a list => list
– Reduce: reducing a list => value

• map() and reduce() in Python


– https://docs.python.org/2/library/functions.html#map
77
map() and reduce() in Python
• list = [1, 2, 3]
• def sqr(x): return x ** 2
• list1 = map(sqr, list)

• def add(x, y): return x + y
• z = reduce(add, list)

What are the values of list1 and z?
In Python 3, reduce() is in the functools module: from functools import reduce

78
Lambda function
• Anonymous function (not bound to a name)

• list = [1, 2, 3]

• list1 = map(lambda x: x ** 2, list)


• z = reduce(lambda x, y: x + y, list)

79
How is reduce() in Python evaluated?
• z = reduce(f, list) where f is add function

• Initially, z (an accumulator) is set to list[0]


• Next, repeat z = add(z, list[i]) for each i > 0
• Return final z

• Example: z = reduce(add, [1, 2, 3])


– i = 0, z = 1; i = 1, z = 3; i = 2, z = 6
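
The evaluation steps above can be written out as a loop and compared with the library function. `my_reduce` is a hypothetical name for this hand-written version, and `nums` is used instead of `list` to avoid shadowing the built-in.

```python
from functools import reduce  # in Python 3, reduce lives in functools

def add(x, y):
    return x + y

def my_reduce(f, items):
    """Hand-written reduce: start from items[0], then fold in the rest."""
    z = items[0]                    # the accumulator
    for i in range(1, len(items)):
        z = f(z, items[i])
    return z

nums = [1, 2, 3]
print(my_reduce(add, nums))               # 6
print(reduce(add, nums))                  # 6
print(list(map(lambda x: x ** 2, nums)))  # [1, 4, 9]
```

In Python 3, `map()` returns a lazy iterator, so it is wrapped in `list()` here to see the values.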

80
Hadoop MapReduce
• Map
– <k, v> => list of <k', v'>

• Reduce:
– <k', list of v'> => list of <k'', v''>

• Write MapReduce programs on Hadoop


– Using Java
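
The map and reduce signatures above can be simulated in plain Python with a word count. This is not Hadoop's Java API: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, the input records are made up, and the grouping step stands in for Hadoop's shuffle.

```python
from collections import defaultdict

def map_fn(key, line):
    """<k, v> => list of <k', v'>: emit (word, 1) for each word in the line."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """<k', list of v'> => list of <k'', v''>: sum the counts for one word."""
    return [(word, sum(counts))]

def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k, v in records:                  # map phase
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)         # group values by intermediate key
    output = []
    for k2 in sorted(groups):             # reduce phase, keys in sorted order
        output.extend(reduce_fn(k2, groups[k2]))
    return output

# Hypothetical input: (byte offset, line of text) pairs.
records = [(0, "hello world"), (12, "hello spark")]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)  # [('hello', 2), ('spark', 1), ('world', 1)]
```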

81
MapReduce

82
WordCount: mapper
Object can be replaced with LongWritable

Data types of input key-value


Data types of output key-value

Key-value pairs with specified data types

83
WordCount: reducer
Data types of input key-value
Data types of output key-value

A list of values

84
Characteristics of Hadoop
• Acyclic data flow model
– Data loaded from stable storage (e.g., HDFS)
– Processed through a sequence of steps
– Results written to disk

• Batch processing
– No interactions permitted during processing

85
Problems
• Ill-suited for iterative algorithms that require repeated reuse of data
– E.g., machine learning and data mining algorithms
such as k-means, PageRank, logistic regression

• Ill-suited for interactive exploration of data


– E.g., OLAP on big data

86
In-memory MapReduce (Spark)
• Key concepts
– RDD (resilient distributed dataset)
– Transformations
– Actions

87
Apache Spark: history

88
Spark
• Support working sets through RDD
– Enabling reuse & fault-tolerance

• 10x faster than Hadoop in iterative jobs

• Interactively explore 39GB with sub-second response time

89
Spark
• Combine SQL, streaming, and complex
analytics
• We will see DataFrame in Spark too

90
Spark
• Run on Hadoop, Cassandra, HBase, etc.

91
wc.py
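from pyspark import SparkContext
from operator import add

sc = SparkContext(appName="dsci551")

# Read the input file as an RDD of lines
lines = sc.textFile('hello.txt')

# Split lines into words, pair each word with 1, then sum the counts per word
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add)

# Bring the (word, count) pairs back to the driver
output = counts.collect()

for v in output:
    print(v[0], v[1])

92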
Coming up…
• Task: Setting up an EC2 instance

• Details:
– see posted instructions and come to class!

93
Resources
• Merge sort:
– https://www.interviewbit.com/tutorial/merge-sort-algorithm/
– https://www.youtube.com/watch?v=Nso25TkBsYI

• Hashing
– https://www.tutorialspoint.com/python_data_structure/python_hash_table.htm
– https://www.programiz.com/python-programming/methods/built-in/hash

94
