Big Data Basics
Outline
• Data Types and File Systems
References:
• Chapter 10, “Principles of Distributed Database Systems” by M. Tamer Özsu and Patrick Valduriez, 4th Ed., ISBN 978-3-030-26253-2
• Chapter 1, “Big Data Fundamentals: Concepts, Drivers & Techniques” by Thomas Erl, Wajid Khattak, and Paul Buhler, 1st Ed., ISBN-10: 0134291077
Dr. M. N. Sadat 2
Image source: https://medium.com/@get_excelsior
5 V’s: Volume
5 V’s: Variety
5 V’s: Veracity
5 V’s: Value
Types of Data
Types of Data
metadata
File Systems and Distributed File Systems
Examples:
● Google File System (GFS)
● Hadoop Distributed File System (HDFS)
● Network File System (NFS)
● Amazon S3
● GlusterFS
● Ceph
Google File System (GFS)
● Characteristics of Google's data-intensive applications:
○ Files are very large, typically several gigabytes, and contain many objects
such as web documents.
○ Workloads consist mainly of read and append operations; random updates
are rare. Reads are either large reads of bulk data (e.g., 1 MB) or small
random reads (e.g., a few KB).
○ Append operations are also large, and many concurrent clients may append
to the same file.
○ Because workloads consist mainly of large read and append operations, high
throughput is more important than low latency.
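The last point can be illustrated with a back-of-the-envelope model: the time for one operation is roughly a fixed latency plus the data size divided by the sustained bandwidth. The sketch below is only an illustration of that reasoning. The 10 ms latency and 100 MB/s bandwidth are assumed values chosen for the example, not GFS measurements, and the function names are hypothetical.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

# Assumed values for illustration only (not measured GFS figures).
LATENCY_S = 0.010            # fixed per-operation delay: 10 ms
BANDWIDTH_BPS = 100 * MB     # sustained throughput: 100 MB/s


def transfer_time_s(size_bytes, latency_s, bandwidth_bps):
    """Simple model: one fixed latency cost plus the time to stream the bytes."""
    return latency_s + size_bytes / bandwidth_bps


def latency_share(size_bytes, latency_s, bandwidth_bps):
    """Fraction of the total operation time that is spent on latency."""
    return latency_s / transfer_time_s(size_bytes, latency_s, bandwidth_bps)


if __name__ == "__main__":
    workloads = [
        ("small random read (4 KB)", 4 * KB),
        ("bulk read (1 MB)", 1 * MB),
        ("scan of a large file (1 GB)", 1 * GB),
    ]
    for label, size in workloads:
        share = latency_share(size, LATENCY_S, BANDWIDTH_BPS)
        print(f"{label}: {share:.1%} of the time is latency")
```

Under these assumptions, latency dominates only the small random read. For the bulk read and the large scan, most or nearly all of the time goes to streaming bytes, so raising throughput helps far more than shaving latency.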
Data Analytics