BDT Unit04
UNIT IV
Technologies and tools for Big Data:
• Zookeeper
• Importing relational data with Sqoop
• Ingesting stream data with Flume
• Basic concepts of Pig, architecture of Pig
• What is Hive, architecture of Hive, Hive commands
• Overview of Apache Spark ecosystem, Spark architecture
UNIT III
Hadoop Ecosystem
Hadoop Ecosystem: Component
• Avro: Avro is an open source project that provides data serialization and data
exchange services for Hadoop: a serialization system for efficient,
cross-language RPC and persistent data storage.
• Features provided by Avro:
• Rich data structures.
• Remote procedure call.
• Compact, fast, binary data format.
• Container file, to store persistent data.
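Avro schemas are written in JSON; a minimal record-schema sketch (the record and field names here are illustrative, not from any particular dataset):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "id",    "type": "int"},
    {"name": "name",  "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

The union type `["null", "string"]` is Avro's idiom for an optional field; the schema travels with the data in Avro container files, which is what makes the format self-describing across languages.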
Hadoop Ecosystem: Component
Hive:
• A distributed data warehouse. Hive manages data stored in HDFS and
provides a query language based on SQL (which the runtime engine
translates into MapReduce jobs) for querying the data.
• Hive performs three main functions: data summarization, query, and analysis.
Hadoop Ecosystem: What Hive can provide
Hadoop Ecosystem:Component
• ZooKeeper: A distributed, highly available coordination service. ZooKeeper
provides primitives such as distributed locks that can be used for building
distributed applications. Zookeeper manages and coordinates a large cluster of
machines.
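These primitives are exposed as znodes in a shared hierarchical namespace. An illustrative zkCli.sh session (it assumes a ZooKeeper server running on localhost; the paths are made up for the example):

```
$ zkCli.sh -server localhost:2181
[zk] create /app_config "v1"       # a znode holding shared configuration
[zk] get /app_config               # any client in the cluster can read it
[zk] create -e /locks/worker-1 ""  # an ephemeral znode, deleted automatically
                                   # if its client dies: the basis of
                                   # distributed locks and leader election
```

Ephemeral znodes are the key building block: because ZooKeeper removes them when the owning session ends, a crashed lock holder can never leave the lock stuck.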
Hadoop Ecosystem: Component
• Sqoop: A tool for efficiently moving data between relational databases and
HDFS. Sqoop imports data from external sources into related Hadoop
ecosystem components such as HDFS, HBase, or Hive, and it also exports data
from Hadoop back to external sources. Sqoop works with relational databases
such as Teradata, Netezza, Oracle, and MySQL.
Hadoop Ecosystem: Component
• Capabilities of Sqoop include:
• Importing individual tables or entire databases into files in HDFS
• Generating Java classes to interact with the imported data
• Importing from SQL databases straight into a Hive data warehouse
Hadoop Ecosystem: Component
• Sqoop:
CREATE TABLE Test(
  id INT NOT NULL PRIMARY KEY,
  msg VARCHAR(32),
  bar INT);
Importing this table yields comma-separated records in HDFS:
0,this is a test,42
1,some more data,100
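A Sqoop import along these lines would write the comma-separated records shown above into HDFS (the JDBC URL, username, and paths are illustrative placeholders):

```shell
sqoop import \
  --connect jdbc:mysql://db.example.com/mydb \
  --username user --password-file /user/hadoop/.pw \
  --table Test \
  --target-dir /data/test
```

By default Sqoop splits the import into parallel map tasks on the primary key, so large tables are pulled by several mappers at once.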
Hadoop Ecosystem: Sqoop Export
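Export is the reverse direction; an illustrative command (connection details are assumptions, not a real deployment) that pushes HDFS files back into the relational table:

```shell
sqoop export \
  --connect jdbc:mysql://db.example.com/mydb \
  --username user --password-file /user/hadoop/.pw \
  --table Test \
  --export-dir /data/test
```

The rows under `--export-dir` must parse into the target table's columns; type mismatches fail the export job.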
Hadoop Ecosystem:
Sqoop Import
Hadoop Ecosystem: Flume Features
• Features of Flume:
• Flume has a flexible design based upon streaming data flows.
• It is fault tolerant and robust, with multiple failover and recovery
mechanisms. Flume offers different levels of reliability, including
best-effort delivery and end-to-end delivery.
• Flume carries data between sources and sinks. This gathering of data can
be either scheduled or event-driven. Flume has its own query processing
engine, which makes it easy to transform each new batch of data before it
is moved to the intended sink.
• Flume can also be used to transport event data including, but not limited
to, network traffic data, data generated by social media websites, and
email messages.
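A Flume agent is wired together in a properties file naming its sources, channels, and sinks. A minimal sketch (the agent name, port, and HDFS path are illustrative) that listens on a TCP port and lands events in HDFS:

```
# agent1: one source, one channel, one sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: receive events over a TCP port
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Sink: write events into HDFS, bucketed by date
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events/%Y-%m-%d
agent1.sinks.sink1.channel = ch1
```

Swapping the memory channel for a file channel is how the configuration moves from best-effort toward end-to-end reliability: buffered events then survive an agent restart.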
Hadoop Ecosystem: Component
• Ambari: A web-based tool for provisioning, managing, and monitoring Apache
Hadoop clusters.
Hadoop Ecosystem: Component
Difference Between Apache Ambari and Apache Zookeeper
• Ambari provisions, manages, and monitors Hadoop clusters for
administrators, while ZooKeeper provides coordination services
(configuration, naming, locks) to the distributed applications themselves.
Hadoop Ecosystem: Component
• Oozie
• It is a workflow scheduler system for managing Apache Hadoop jobs.
• Oozie combines multiple jobs sequentially into one logical unit of work.
• The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its
architecture center, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.
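An Oozie workflow is defined in XML as a graph of actions. A minimal sketch (the workflow name, properties, and paths are illustrative) chaining a single Sqoop action:

```xml
<workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">
  <start to="import"/>
  <action name="import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>import --connect ${jdbcUrl} --table Test --target-dir /data/test</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Import failed</message></kill>
  <end name="end"/>
</workflow-app>
```

Each action declares where control flows on success (`<ok>`) and on failure (`<error>`), which is how Oozie turns independent Hadoop jobs into one logical unit of work.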
Hadoop Ecosystem: Component
• Working of Oozie
HIVE
A Query Language for Hadoop
History of Hive
• SQL:
▪ INSERT INTO TABLE pageid_age_sum
▪ SELECT pageid, age, count(1)
▪ FROM pv_users
▪ GROUP BY pageid, age;
Hive QL – Group By in MapReduce
[Figure: pv_users rows (pageid, age) = (1, 25), (2, 25), (1, 32), (2, 25).
The map phase emits key <pageid, age> with value 1 for each row, e.g.
<1,25> → 1. Shuffle and sort bring equal keys together, and the reduce
phase sums the values, producing count(1) per <pageid, age> group.]
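The dataflow in this figure can be simulated in plain Python (a sketch of the three phases, not Hive's actual execution engine): map emits <pageid, age> → 1, shuffle groups equal keys, and reduce sums the counts.

```python
from collections import defaultdict

# pv_users rows from the example: (pageid, age)
pv_users = [(1, 25), (2, 25), (1, 32), (2, 25)]

def map_phase(rows):
    # Mapper: emit key <pageid, age> with value 1 for every row.
    return [((pageid, age), 1) for pageid, age in rows]

def shuffle_and_sort(pairs):
    # Shuffle: group all values by key; sort keys as Hadoop does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    # Reducer: sum the 1s to get count(1) per <pageid, age> group.
    return {key: sum(values) for key, values in groups.items()}

result = reduce_phase(shuffle_and_sort(map_phase(pv_users)))
print(result)  # {(1, 25): 1, (1, 32): 1, (2, 25): 2}
```

Note how the GROUP BY columns become the MapReduce key: the shuffle does the grouping for free, and the reducer only has to aggregate.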
Hive QL – Group By with Distinct
• SQL:
• SELECT pageid, COUNT(DISTINCT userid)
• FROM page_view GROUP BY pageid;
[Figure: for a page_view table with rows (pageid, userid, time) =
(1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14), (2, 111, 9:08:20),
the result is pageid 1 → count_distinct_userid 2 and pageid 2 →
count_distinct_userid 1.]
Hive QL – Group By with Distinct in MapReduce
[Figure: the map phase emits composite keys <pageid, userid>, e.g. <1,111>,
<1,222>, <2,111>. The shuffle partitions on the pageid prefix of the key,
and the sort orders by the full key, so each reducer sees the userids for
one pageid in sorted order and can count the distinct ones without holding
them all in memory: pageid 1 → 2, pageid 2 → 1.]
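The prefix-partitioned shuffle can also be simulated in plain Python (again a sketch, not Hive's real execution): partition on pageid, sort the full <pageid, userid> keys, then count distinct userids by skipping repeats in the sorted stream.

```python
from collections import defaultdict

# page_view rows from the example: (pageid, userid)
page_view = [(1, 111), (2, 111), (1, 222), (2, 111)]

def map_phase(rows):
    # Mapper: emit the composite key <pageid, userid>; no payload is needed.
    return [((pageid, userid), None) for pageid, userid in rows]

def shuffle_on_prefix(pairs):
    # Partition on the pageid prefix of the key; sorting within a partition
    # places duplicate userids next to each other.
    partitions = defaultdict(list)
    for (pageid, userid), _ in pairs:
        partitions[pageid].append(userid)
    return {pageid: sorted(userids) for pageid, userids in partitions.items()}

def reduce_phase(partitions):
    # Reducer: count distinct userids by skipping repeats in the sorted stream.
    result = {}
    for pageid, userids in partitions.items():
        distinct, previous = 0, None
        for userid in userids:
            if userid != previous:
                distinct += 1
                previous = userid
        result[pageid] = distinct
    return result

counts = reduce_phase(shuffle_on_prefix(map_phase(page_view)))
print(counts)  # {1: 2, 2: 1}
```

Because duplicates arrive adjacently, the reducer needs only the previous userid, which is why the composite-key trick scales to very large groups.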
Hive QL – Order By
[Figure: page_view rows (pageid, userid, time) are mapped to keys that
include the sort columns; the shuffle and sort on those keys deliver the
rows to the reducers in order, so the reduce phase outputs the page_view
rows sorted.]
Features of Hive
• Drop:
drop database office cascade;  -- drops the database together with its tables when it is not empty
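A few more HiveQL commands in the same vein (the database, table, and path names are illustrative):

```sql
-- Create a database and a delimited table inside it
CREATE DATABASE IF NOT EXISTS office;
USE office;
CREATE TABLE employees (id INT, name STRING, dept STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load data already sitting in HDFS, then query it
LOAD DATA INPATH '/data/employees.csv' INTO TABLE employees;
SELECT dept, count(1) FROM employees GROUP BY dept;

-- Remove the database and everything in it
DROP DATABASE office CASCADE;
```

Each statement is translated by Hive's runtime into MapReduce jobs over the files in HDFS, which is why the DDL deals in row formats and paths rather than storage pages.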
Pig

Why Pig?
pig -help      (list Pig's command-line options)
pig            (start Pig's interactive Grunt shell)
grunt>
Demo
For each tuple in 'GroupByCountry', generate the resulting string of the
form 'Name of Country: No. of products sold'.
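A Pig Latin sketch of this step (the sales relation, its fields, and the input path are assumptions for illustration):

```
-- Assumed earlier steps:
sales = LOAD 'sales.csv' USING PigStorage(',')
        AS (country:chararray, product:chararray);
GroupByCountry = GROUP sales BY country;

-- For each group, build 'Name of Country: No. of products sold'
result = FOREACH GroupByCountry
         GENERATE CONCAT((chararray)group,
                  CONCAT(': ', (chararray)COUNT(sales)));
DUMP result;
```

Inside the FOREACH, `group` is the grouping key (the country) and `sales` is the bag of tuples for that country, so `COUNT(sales)` is the per-country tally.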
Word Count Problem using MapReduce
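The classic word count can be simulated in plain Python by spelling out the three MapReduce phases (a teaching sketch, not Hadoop code):

```python
from collections import defaultdict

def map_phase(lines):
    # Mapper: emit (word, 1) for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as Hadoop does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word.
    return {word: sum(values) for word, values in groups.items()}

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

In a real cluster the mapper runs on each input split in parallel and the shuffle moves pairs across the network, but the dataflow is exactly this.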
Overview of Apache Spark Ecosystem
Introduction to Apache Spark
• Apache Spark is a general-purpose cluster computing framework.
• It was introduced by UC Berkeley’s AMP Lab in 2009 as a distributed computing system and
has been maintained by the Apache Software Foundation since 2013.
• Spark is a lightning-fast computing engine designed for faster processing of large volumes
of data.
• Spark supports batch applications, iterative processing, interactive queries, and streaming
data, reducing the burden of managing separate tools for each workload.
• The main feature of Spark is its in-memory processing, which makes computation faster.
• It has its own cluster management system, and it uses Hadoop for storage purposes.
Introduction to Apache Spark
• One of the main features Spark offers for speed is the ability to run
computations in memory, but the system is also more efficient than
MapReduce for complex applications running on disk.
• Apache Spark achieves high performance for both batch and streaming
data, using a state-of-the-art DAG scheduler, a query optimizer, and a
physical execution engine.
Apache Spark Ecosystem (The Spark Stack)
Apache Spark Ecosystem: Component
1. Spark Core:
• Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems,
and more.
• Spark Core is also home to the API that defines resilient distributed datasets (RDDs),
which are Spark’s main programming abstraction.
• RDDs represent a collection of items distributed across many compute nodes that can
be manipulated in parallel.
• Spark Core provides many APIs for building and manipulating these collections.
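The RDD idea can be sketched with a tiny pure-Python stand-in (this is not Spark's API, just an illustration of the model): data is split into partitions, transformations only extend a recorded lineage, and an action finally runs the pipeline over every partition.

```python
class MiniRDD:
    # Illustrative stand-in for an RDD: partitioned data plus a lazily
    # recorded list of transformations (the "lineage").
    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of lists, one per "compute node"
        self.ops = list(ops)          # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return MiniRDD(self.partitions, self.ops + [("map", fn)])

    def filter(self, pred):
        # Transformation: likewise only recorded.
        return MiniRDD(self.partitions, self.ops + [("filter", pred)])

    def collect(self):
        # Action: run the whole recorded pipeline over each partition.
        out = []
        for part in self.partitions:
            data = part
            for kind, fn in self.ops:
                if kind == "map":
                    data = [fn(x) for x in data]
                else:
                    data = [x for x in data if fn(x)]
            out.extend(data)
        return out

rdd = MiniRDD([[1, 2], [3, 4, 5]])  # five items across two partitions
squares = rdd.map(lambda x: x * x).filter(lambda x: x > 4)
print(squares.collect())  # [9, 16, 25]
```

Recording the lineage instead of intermediate results is also how real RDDs recover from failure: a lost partition is simply recomputed from its lineage.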
Apache Spark Ecosystem: Component
2. Spark SQL
• Spark SQL is Spark’s package for working with structured data. It allows querying data via
SQL as well as the Apache Hive variant of SQL—called the Hive Query Language (HQL)
—and it supports many sources of data, including Hive tables, Parquet, and JSON.
• Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL
queries with the programmatic data manipulations supported by RDDs in Python, Java, and
Scala, all within a single application, thus combining SQL with complex analytics.
3. Spark Streaming
• Spark Streaming is a Spark component that enables processing of live streams of data.
Examples of data streams include logfiles generated by production web servers, or queues
of messages containing status updates posted by users of a web service.
• Streaming provides an API for manipulating data streams that closely matches the Spark
Core’s RDD API, making it easy for programmers to learn the project and move between
applications that manipulate data stored in memory, on disk, or arriving in real time.
• Underneath its API, Spark Streaming was designed to provide the same degree of fault
tolerance, throughput, and scalability as Spark Core.
Features of Apache Spark
• Speed: Though Spark follows the MapReduce programming model, it is about 10 times
faster than Hadoop when it comes to big data processing.
• Usability: Spark supports multiple languages, making it easier to work with.
• In-Memory Processing: Unlike Hadoop, Spark keeps intermediate data in memory rather
than moving it in and out of disk between stages.
• Lazy Evaluation: Spark waits until an action is called and then executes the recorded
chain of transformations in the most efficient way possible.
• Fault Tolerance: Spark has improved fault tolerance over Hadoop. Both storage and
computation can tolerate failure by failing over to another node.
Comparison of Different Frameworks
Case Study:
• Google Cloud Platform services:
• Cloud Dataflow
• Claimed to run faster and scale better than comparable data-processing
systems.
• Cloud Save:
• An API that enables an application to save an individual
user’s data in the cloud or elsewhere and use it without requiring
any server-side coding.
• Cloud Debugging
• Makes it easier to sift through lines of code deployed across many
servers in the cloud to identify software bugs.
Case Study:
• Google Cloud Platform services:
• Cloud Tracing
• Provides latency statistics across different groups and provides
analysis reports.
• Cloud Monitoring:
• An intelligent monitoring system. The feature monitors cloud
infrastructure resources, such as disks and virtual machines, as
well as service levels for Google’s services and for more than a
dozen non-Google open source packages.
Case Study:
• Twitter Analytics: Capturing and Analyzing Tweets
https://blogs.ischool.berkeley.edu/i290-abdt-s12/
Hadoop High Level Architecture
Hadoop Cluster
• A small Hadoop cluster includes a single master and multiple worker nodes.
• Master node: NameNode, JobTracker, DataNode, TaskTracker
• Slave node: DataNode, TaskTracker
Active Learning