Lecture 1 - Introduction
DSCI 551
Wensheng Wu
1
Logistics
• Instructor email: [email protected]
• Office hours:
– 30 minutes after each class
2
Logistics
• Graders
– check out announcements
• Class materials
– Posted on course web site
– https://courses.uscden.net/
3
Piazza
• Discussion forums
– You may post general and homework questions
– Do not post solutions
– Please actively participate in helping others!
– Do not abuse the forum (that is academic misconduct!)
4
Prerequisites
• Programming skills:
– Python (homework, Spark), Java (e.g., for Hadoop
only)
• Scala
5
Prerequisites
• Basic knowledge of algorithms and data structures
– Sorting, hashing, etc. (CS 570), e.g., merge sort
• Merge sort example: 3, 2, 1, 4, 6, 5
=> sorted runs (1, 2, 3) and (4, 5, 6)
=> merge => 1, 2, 3, 4, 5, 6
– Hashing: h(k) => if k is even, send (k, v) to R0; otherwise, send it to R1
• e.g., 3 % 2 = 1 (goes to R1), 2 % 2 = 0 (goes to R0)
– I/O
• e.g., 1TB of data on SSD but only 1GB of main memory => sort in runs, then merge (external sorting); I/O cost matters
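A minimal Python sketch of the run-and-merge idea above (the helper names and run size are illustrative, not from the course materials):

import heapq

def sort_in_runs(data, run_size):
    # split the input into chunks of run_size and sort each chunk (a "run")
    return [sorted(data[i:i + run_size]) for i in range(0, len(data), run_size)]

def merge_runs(runs):
    # k-way merge of already-sorted runs into one sorted list
    return list(heapq.merge(*runs))

data = [3, 2, 1, 4, 6, 5]
runs = sort_in_runs(data, run_size=3)   # [[1, 2, 3], [4, 5, 6]]
print(merge_runs(runs))                 # [1, 2, 3, 4, 5, 6]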
6
notes
• h(x) = x % 2
• h(3) = 1
• h(4) = 0
• h('john') = 0
• h('bill') = 1
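A minimal sketch of hash partitioning with both integer and string keys. Python's built-in hash() is randomized across runs for strings, so a deterministic md5-based hash is used here purely for illustration; the partitions for 'john' and 'bill' may therefore differ from the values above.

import hashlib

def h(key, num_partitions=2):
    # map a key (int or str) to a partition number in [0, num_partitions)
    if isinstance(key, int):
        return key % num_partitions
    digest = hashlib.md5(str(key).encode()).hexdigest()   # deterministic for strings
    return int(digest, 16) % num_partitions

for k in [3, 4, 'john', 'bill']:
    print(k, '->', 'R%d' % h(k))   # e.g., 3 -> R1, 4 -> R0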
7
Textbooks
• Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-
Dusseau. Operating Systems: Three Easy Pieces, 2015
(selected chapters only). Available free at:
http://pages.cs.wisc.edu/~remzi/OSTEP/
8
Additional readings
• Links can be found in Syllabus
– Check out the schedule
9
Grading structure
• See syllabus
10
Grading scale
• [94, 100] = A
• [90, 94) = A-
• [87, 90) = B+
• [83, 87) = B
• [80, 83) = B-
• [77, 80) = C+
• [73, 77) = C
• … (see Syllabus for complete breakdown)
11
Lab tasks
• Four tasks:
– EC2, HDFS, MongoDB, DynamoDB
12
Exams
• 3 exams
13
Calculator
• Bring one to the tests
14
Course project
• Details to be posted
• Done in phases
– Proposal
– Midterm report
– Final report
15
Participation
• Submit a summary of the lecture contents on a weekly basis
16
Late Policy
• No LATE submissions will be accepted
17
Grading Corrections
• All coursework grades are final one week after they are posted, or as stated in the announcement
18
Academic Integrity
• Cheating will NOT be tolerated
19
Now, movie time ☺
• Explain big data:
– https://www.youtube.com/watch?v=7D1CQ_LOizA
• Questions:
– Where does big data come from?
– What characteristics does it have? 3Vs?
– What big data technologies were mentioned?
• Hadoop: HDFS and MapReduce
20
Variety
21
Internet Traffic in 2012
• 4.8 zettabytes = 4.8 billion terabytes
• Zettabyte (1000 exabytes)
• Exabyte
• Petabyte
• Terabyte = 2^40 bytes (storage); 1TB = 1024 (2^10) GB
• Gigabyte = 2^30 bytes (memory)
• Megabyte = 2^20 bytes = 2^10 * 2^10 (128MB is the HDFS block size)
• For scale: main memory ~12GB, SSD ~1TB
• Positional notation: 123 (decimal) = 1*10^2 + 2*10^1 + 3*10^0; 111 (binary) = 1*2^2 + 1*2^1 + 1*2^0 = 7; 111 + 1 (binary) = 1000 (binary) = 8 (decimal); 2^2 = 4, 2^3 = 8, 2^7 = 128
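A quick Python check of the unit and base conversions above (nothing here is course-specific):

# powers of two behind the storage units
KB, MB, GB, TB = 2**10, 2**20, 2**30, 2**40
print(TB // GB)                   # 1024 (1TB = 1024 GB)
print(128 * MB // (4 * KB))       # 32768 (128MB HDFS block vs. 4KB disk block)

# decimal vs. binary positional notation
print(1*10**2 + 2*10**1 + 3*10**0)                # 123
print(int('111', 2), int('111', 2) + 1, bin(8))   # 7 8 0b1000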
23
Major topics
• Storage systems
24
Storage Systems
• Hard disk
• SSD (Solid state drive)
4KB = block size for HDD
128MB = block size in HDFS
128MB/4KB = 32K
25
Internal of hard disk
Actuator
Spindle
Platter
Disk head
26
NAND flash
27
Latencies: read, write, and erase
28
Major topics
• Storage systems
29
File Systems
• Standalone
– Single machine
30
Standalone file systems
• Data structures
– Data blocks
– Metadata blocks (Inodes)
– Bitmap blocks (for space allocation)
• Access paths
– Read a file
– Write a file
31
Inode (index node)
• Each is identified by a number
– The inumber is the low-level name of a file
• Can figure out location of inode from inumber
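A minimal sketch of that calculation in the style of OSTEP; the start address, inode size, and sector size are illustrative constants, not values from the lecture:

def inode_location(inumber, inode_start_addr=12 * 1024, inode_size=256, sector_size=512):
    # byte offset of the inode within the on-disk inode table
    offset = inode_start_addr + inumber * inode_size
    # sector that holds the inode
    sector = offset // sector_size
    return offset, sector

print(inode_location(32))   # (20480, 40) with the assumed constants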
32
Distributed file systems
• Hadoop HDFS (modeled after Google's GFS)
– Data are distributed among data nodes
• Replication
– Automatic creation of replicas (typically 2 or 3 copies of the data)
• Fault-tolerant
– Automatic recovery from node failure
33
HDFS architecture
[Diagram: HDFS architecture with blocks A, B, and C distributed across data nodes]
34
File system image in namenode
35
Directory section
36
Major topics
• Storage systems
37
File Formats
• JSON
38
HTML
<h1> Bibliography </h1>
<p> <i> Foundations of Databases </i>
Abiteboul, Hull, Vianu
<br> Addison Wesley, 1995
<p> <i> Data on the Web </i>
Abiteboul, Buneman, Suciu
<br> Morgan Kaufmann, 1999
39
XML
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
41
Android app resource file
42
Manifest.xml
43
<bib>
…
<book price="35">
<publisher>Addison-Wesley</publisher>
<author>Serge Abiteboul</author>
<author><first-name>Rick</first-name><last-name>Hull</last-name></author>
<author age="20">Victor Vianu</author>
<title>Foundations of Databases</title>
<year>1995</year>
<price>38.8</price>
</book>
<book price="55">
<publisher>Freeman</publisher>
<author>Jeffrey D. Ullman</author>
<title>Principles of Database and Knowledge Base Systems</title>
<year>1998</year>
</book>
…
</bib>
44
Data Model for XPath
[Tree diagram: document node => book elements => publisher, author, ... children]
/bib/paper/year
Result: empty (there are no papers)
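A minimal Python sketch of evaluating a simple path on the bib document from the earlier slide, using the standard library's ElementTree (its limited XPath support is enough here); ./paper/year matches nothing because there are no paper elements:

import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<bib>
  <book price="35">
    <publisher>Addison-Wesley</publisher>
    <title>Foundations of Databases</title>
    <year>1995</year>
  </book>
</bib>
""")

print(doc.findall('./book/year'))    # one matching element
print(doc.findall('./paper/year'))   # [] -- empty: no paper elements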
46
Major topics
• Storage systems
47
Relational DBMS
• Data models
– ER (E = entity set, R = relationship)
– Relational (redundancy => update anomalies)
• Schema
– describes the structure of data
– including constraints
48
RDBMS
• Query languages
– Relational algebra
– SQL, constraints, views
• Data organization
– Records and blocks
– Index structure: B+-tree (external data structure)
49
RDBMS
• Query execution algorithms
– External sorting
– One-pass algorithms
– Nested-loop join, sorting, hashing-based
– Multiple-pass algorithms
50
RDBMS
• Rigid schema
51
RDBMS
• Hard to scale out
– Horizontal partitioning/sharding possible
– But would need distributed storage & computing
support like Hadoop & MapReduce
52
RDBMS Examples
• MySQL (can be installed in Amazon AWS EC2)
54
Conceptual Modeling
[ER diagram: entity sets Student, Course, and Professor; relationships Takes (with a semester attribute), Advises, and Teaches; attributes include ssn, name, cid, and category]
57
Query Optimization
Goal: translate a declarative SQL query into an imperative query execution plan

select C.name
from Students S, Takes T, Courses C
where S.name="Mary" and cid=cid

[Plan tree: scan Students, Takes, and Courses; join them; filter on name="Mary"; project Courses.name]
59
Topics
• Big data management & analytics
– Cloud data storage (Amazon S3)
– NoSQL
• Google Firebase (real-time database, …)
• MongoDB (shell, mongo)
• Amazon DynamoDB (row store, key-value)
• Cassandra (not required)
– Apache Hadoop & MapReduce
– Apache Spark
60
Cloud data storage
• Amazon S3 (simple storage service)
– Ideal for storing large binary files
– E.g., audio, video, image
– Simple RESTful web service
61
62
Upload a file
63
64
NoSQL
• Not only SQL
• Flexible schemas
– e.g., JSON documents or key-value pairs
– Ideal for managing a mix of structured, semi-
structured, and unstructured data
65
Example NoSQL databases
• MongoDB, Firebase, etc.
– Manage JSON documents
• Amazon DynamoDB
– Row store
– row = item = a collection of key-value pairs
• Neo4J…
66
Key techniques
• Consistent hashing (Cassandra, Dynamo)
– Avoid moving too much data when adding new
machines (scaling out)
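A minimal sketch of consistent hashing (illustrative only; systems like Cassandra and Dynamo add virtual nodes and replication). Keys and nodes are hashed onto the same ring, and each key is stored on the first node clockwise from it, so adding a node only moves the keys between it and its predecessor:

import bisect
import hashlib

def ring_hash(s):
    # deterministic position on a 0..2^32-1 ring (md5 chosen only for illustration)
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        # first node clockwise from the key's position (wrap around at the end)
        idx = bisect.bisect(self.ring, (ring_hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(['node1', 'node2', 'node3'])
print({k: ring.node_for(k) for k in ['alice', 'bob', 'carol']})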
67
Write path in Cassandra
[Diagram: Cassandra write path (numbered steps); writes are append only]
68
Key techniques
• Compaction
– Introduced in Google "Bigtable" paper
– Merge multiple versions of data
– Remove expired or deleted data
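A minimal sketch of the compaction idea (illustrative, not the actual Bigtable/Cassandra format): given several runs of (key, timestamp, value) entries, keep only the newest version of each key and drop tombstones marking deletions:

def compact(runs):
    # keep the newest (largest timestamp) version of each key
    latest = {}
    for run in runs:
        for key, ts, value in run:
            if key not in latest or ts > latest[key][0]:
                latest[key] = (ts, value)
    # drop tombstones (value None marks a deleted key) and sort by key
    return sorted((k, v) for k, (ts, v) in latest.items() if v is not None)

old_run = [('a', 1, 'x'), ('b', 1, 'y')]
new_run = [('a', 2, 'x2'), ('b', 3, None)]   # 'b' was deleted
print(compact([old_run, new_run]))           # [('a', 'x2')]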
69
DynamoDB
• https://console.aws.amazon.com/dynamodb/home?region=us-east-1#gettingStarted:
70
71
Insert items
72
May add new attributes
73
Firebase: a cloud database
74
Firebase
75
Topics
• Big data management & analytics
– Cloud data storage (Amazon S3)
– NoSQL (Amazon DynamoDB, Cassandra,
MongoDB)
– MapReduce
– Apache Hadoop
– Apache Spark
76
Roots in functional programming
• Functional programming languages:
– Python, Lisp (list processor), Scheme, Erlang, Haskell
• Two functions:
– Map: mapping a list => list
– Reduce: reducing a list => value
78
Lambda function
• Anonymous function (not bound to a name)
• list = [1, 2, 3]
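A small example of the idea on this slide: an anonymous function passed directly to map() over the list [1, 2, 3] (the list is named lst here to avoid shadowing the built-in list; the doubling function is just an illustration):

lst = [1, 2, 3]

# lambda: an anonymous function, not bound to a name
print(list(map(lambda x: x * 2, lst)))   # [2, 4, 6]

# equivalent named function, for comparison
def double(x):
    return x * 2

print(list(map(double, lst)))            # [2, 4, 6]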
79
How is reduce() in Python evaluated?
• z = reduce(f, list) where f is add function
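With f being addition, reduce folds the list from the left, so z = reduce(f, [1, 2, 3]) is evaluated as f(f(1, 2), 3):

from functools import reduce   # in Python 3, reduce lives in functools
from operator import add

z = reduce(add, [1, 2, 3])
# evaluation: add(add(1, 2), 3) = add(3, 3) = 6
print(z)   # 6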
80
Hadoop MapReduce
• Map
– <k, v> => list of <k', v'>
• Reduce:
– <k', list of v'> => list of <k'', v''>
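To make the <k, v> signatures concrete, here is a purely conceptual Python sketch of word count in MapReduce terms (the actual Hadoop mapper and reducer on the following slides are written in Java):

def map_fn(key, value):
    # Map: <k, v> => list of <k', v'>; key: line offset, value: line of text
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # Reduce: <k', list of v'> => list of <k'', v''>; key: word, values: its counts
    return [(key, sum(values))]

print(map_fn(0, "hello world hello"))   # [('hello', 1), ('world', 1), ('hello', 1)]
print(reduce_fn('hello', [1, 1]))       # [('hello', 2)]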
81
MapReduce
82
WordCount: mapper
The input key type Object can be replaced with LongWritable
83
WordCount: reducer
Data types of input key-value
Data types of output key-value
A list of values
84
Characteristics of Hadoop
• Acyclic data flow model
– Data loaded from stable storage (e.g., HDFS)
– Processed through a sequence of steps
– Results written to disk
• Batch processing
– No interactions permitted during processing
85
Problems
• Ill-suited for iterative algorithms that require repeated reuse of data
– E.g., machine learning and data mining algorithms
such as k-means, PageRank, logistic regression
86
In-memory MapReduce (Spark)
• Key concepts
– RDD (resilient distributed dataset)
– Transformations
– Actions
87
Apache Spark: history
88
Spark
• Support working sets through RDD
– Enabling reuse & fault-tolerance
89
Spark
• Combine SQL, streaming, and complex
analytics
• We will see DataFrame in Spark too
90
Spark
• Run on Hadoop, Cassandra, HBase, etc.
91
wc.py
from pyspark import SparkContext
from operator import add
sc = SparkContext(appName="dsci551")
lines = sc.textFile('hello.txt')
# counts was not defined in the original listing; the standard word-count chain is assumed
counts = lines.flatMap(lambda line: line.split(' ')) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(add)
output = counts.collect()
for v in output:
    print(v[0], v[1])
92
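Assuming Spark is installed and hello.txt is available in the working directory (or on HDFS), the script can be run with: spark-submit wc.py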
Coming up…
• Task: Setting up an EC2 instance
• Details:
– see posted instructions and come to class!
93
Resources
• Merge sort:
– https://www.interviewbit.com/tutorial/merge-sort-algorithm/
– https://www.youtube.com/watch?v=Nso25TkBsYI
• Hashing
– https://www.tutorialspoint.com/python_data_structure/python_hash_table.htm
– https://www.programiz.com/python-programming/methods/built-in/hash
94