DSCI 5350 - Lecture 2 PDF
Kashif Saeed
Lecture Outline
Why Organizations need Big Data?
• On-Premises Scenarios
Hadoop with MapReduce as the processing engine (already legacy)
Hadoop with Spark as processing engine
Spark with NoSQL (or other data stores)
• Cloud Scenarios
Persistent Cluster on Cloud
Non-persistent Cluster on Cloud
Managed cluster on Cloud
Big Data Cloud Deployment - concepts
Non-Persistent Cluster
Pay as you go
No investment in hardware, setup, maintenance
Set up the cluster, process the data, and terminate the cluster
The entire process, from setup through processing to termination,
can be automated
Because the cluster is terminated, results should not be stored on
the cluster itself; AWS uses S3 for storage, and GCP uses Cloud
Storage
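The set up, process, terminate cycle can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not a cloud SDK: `EphemeralCluster`, `non_persistent_cluster`, and `external_store` are hypothetical names, and a plain dict stands in for durable object storage such as S3 or Cloud Storage.

```python
from contextlib import contextmanager

# Stand-in for durable object storage (e.g. an S3 or Cloud Storage bucket).
# Using a plain dict here is an assumption made purely for illustration.
external_store = {}


class EphemeralCluster:
    """Hypothetical non-persistent cluster; not a real cloud SDK class."""

    def __init__(self, workers):
        self.workers = workers
        self.running = True
        self.local_disk = {}  # lost when the cluster is terminated

    def process(self, records):
        # Toy "job": count the records handled by the cluster.
        result = {"record_count": len(records), "workers": self.workers}
        self.local_disk["result"] = result
        return result

    def terminate(self):
        self.local_disk.clear()  # local data does not survive termination
        self.running = False


@contextmanager
def non_persistent_cluster(workers):
    cluster = EphemeralCluster(workers)  # 1. set up
    try:
        yield cluster                    # 2. process
    finally:
        cluster.terminate()              # 3. always terminate


def run_job(records):
    with non_persistent_cluster(workers=3) as cluster:
        result = cluster.process(records)
        external_store["jobs/output"] = result  # persist outside the cluster
    return cluster


cluster = run_job(["a", "b", "c", "d"])
print(cluster.running, cluster.local_disk, external_store["jobs/output"])
```

After the job, the cluster's local disk is empty, but the result is still available in the external store, which is why the output is written there before termination.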
What will be covered in this class
Distributed Systems
Challenges with Distributed Systems
• Programming complexity
• The software needs to be sophisticated enough to handle all
scenarios that can go wrong (exception and error handling)
• Centralized Data Storage
• With hundreds of computers reading and writing to the
storage, the read/write speed of the storage becomes the
bottleneck
• Data Processing & Storage at different layers
• With data centralized and processing localized, you have to
bring the data to the processing engine by copying it
• Replicating or moving data to the processing engine is not practical
when dealing with petabytes or exabytes of data
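A rough back-of-the-envelope calculation shows why copying data to the processing engine stops scaling. The sketch below assumes a single 10 Gbit/s link, a decimal petabyte (10^15 bytes), and no protocol overhead; all three are illustrative assumptions, and the function name is hypothetical.

```python
def transfer_time_seconds(num_bytes, link_bits_per_second):
    """Ideal time to move num_bytes over one link, ignoring all overhead."""
    return num_bytes * 8 / link_bits_per_second


PETABYTE = 10**15          # decimal petabyte, in bytes
LINK_10_GBIT = 10 * 10**9  # assumed 10 Gbit/s network link

seconds = transfer_time_seconds(PETABYTE, LINK_10_GBIT)
days = seconds / 86_400
print(f"{seconds:,.0f} s = {days:.1f} days")  # 800,000 s = 9.3 days
```

Even under these ideal conditions, moving one petabyte takes more than nine days, before any processing has started.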
Figure 1: Distributed System
Google’s Solution for Distributed Systems
What is Hadoop
https://www.cloudera.com/about/news-and-blogs/press-releases/2018-10-03-cloudera-and-hortonworks-announce-merger-to-create-worlds-leading-next-generation-data-platform-and-deliver-industrys-first-enterprise-data-cloud.html
Hadoop Architecture
Hadoop Cluster Terminologies
Hadoop Core Components
Hadoop Distributed File System (HDFS)
Image Source: Cloudera Academic Alliance
Hadoop Distributed File System (HDFS)
Processing Framework: MapReduce
Processing Framework: Others
Storing and Retrieving Files
The data itself is NEVER retrieved via the NameNode
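The read path behind this rule can be sketched as a toy model: the client asks the NameNode only for block locations, then reads the block contents directly from the DataNodes. The classes and method names below are hypothetical illustrations, not the Hadoop client API, and replication is ignored for brevity.

```python
class DataNode:
    """Toy DataNode: holds block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Toy NameNode: holds metadata only, never block contents."""

    def __init__(self):
        # file path -> ordered list of (block_id, DataNode name)
        self.block_map = {}

    def get_block_locations(self, path):
        return self.block_map[path]


def read_file(path, name_node, data_nodes):
    # Step 1: ask the NameNode where the blocks live (metadata only).
    locations = name_node.get_block_locations(path)
    # Step 2: fetch each block directly from the DataNode that stores it.
    return b"".join(
        data_nodes[node].read_block(block_id) for block_id, node in locations
    )


dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"hadoop"

nn = NameNode()
nn.block_map["/logs/a.txt"] = [("blk_1", "dn1"), ("blk_2", "dn2")]

data = read_file("/logs/a.txt", nn, {"dn1": dn1, "dn2": dn2})
print(data)  # b'hello hadoop'
```

Because the NameNode only answers metadata lookups, file contents never pass through it, which keeps it from becoming a bandwidth bottleneck.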
Accessing Hadoop System
The Hadoop Ecosystem
and additional tools integrated in CDH
Data Storage: Apache HBase
Apache Sqoop
Streaming Systems: Flume, Kafka, Flink, Others
• Apache Flume
• Distributed service for ingesting streaming data
• Ideally suited for event data from multiple
systems, for example log files
• Covered later in this course
• Kafka
• A high throughput, scalable messaging system
• Distributed, reliable publish-subscribe system
• Integrates with Flume and Spark Streaming
• Use cases: large-scale messaging, log
aggregation, customer activity tracking, etc.
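The publish-subscribe idea can be illustrated with a small in-memory sketch: each topic is an append-only list, and each subscriber keeps its own read position, so independent consumers see the full stream at their own pace. `ToyBroker` and its methods are hypothetical and are not the Kafka client API; a real system adds distribution, persistence, and partitioning.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory publish-subscribe sketch; not the Kafka client API."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic -> append-only list of messages
        self.offsets = defaultdict(int)  # (subscriber, topic) -> next position

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, subscriber, topic):
        """Return messages this subscriber has not yet seen on the topic."""
        start = self.offsets[(subscriber, topic)]
        messages = self.topics[topic][start:]
        self.offsets[(subscriber, topic)] = len(self.topics[topic])
        return messages


broker = ToyBroker()
broker.publish("web-logs", "GET /index.html")
broker.publish("web-logs", "GET /cart")

first_poll = broker.poll("analytics", "web-logs")   # both messages so far
broker.publish("web-logs", "POST /checkout")
second_poll = broker.poll("analytics", "web-logs")  # only the new message
audit_poll = broker.poll("audit", "web-logs")       # a new subscriber sees all three
print(first_poll, second_poll, audit_poll)
```

Keeping the read position per subscriber, rather than deleting messages once delivered, is what lets several downstream systems consume the same log independently.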
Data Processing: Apache Spark
Data Processing: MapReduce
Data Processing: Apache Pig
High Performance SQL: Cloudera Impala
SQL on MapReduce: Apache Hive
User Interface: Hue
Workflow Management: Apache Oozie
Apache Incubator Projects
Cloudera Labs
Cloudera Labs - continued
• Training Material:
• ~/training_materials/dev1 folder on the VM
• Course Data:
• ~/training_materials/data