
Unit-IV
Technologies and tools for Big Data:
• ZooKeeper
• Importing relational data with Sqoop
• Ingesting stream data with Flume
• Basic concepts of Pig, Architecture of Pig
• What is Hive, Architecture of Hive, Hive Commands
• Overview of Apache Spark Ecosystem, Spark Architecture
Hadoop Ecosystem

Hadoop Ecosystem: Component
• Avro: an open source project that provides data serialization and data exchange services for Hadoop; a serialization system for efficient, cross-language RPC and persistent data storage.
• Features provided by Avro:
• Rich data structures.
• Remote procedure call.
• Compact, fast, binary data format.
• Container file, to store persistent data.

• MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.

• HDFS: A distributed file system that runs on large clusters of commodity machines. HDFS supports write-once-read-many semantics on files.
Hadoop Ecosystem: Component
Pig:
• A data flow language and execution environment for exploring and processing very large datasets. Pig runs on HDFS and MapReduce clusters.
• Apache Pig is a high-level language platform for analyzing and querying huge datasets stored in HDFS.
• As a component of the Hadoop ecosystem, Pig uses the Pig Latin language, which is similar to SQL.
• It loads the data, applies the required filters, and dumps the data in the required format. To execute programs, Pig requires a Java runtime environment.

Hadoop Ecosystem: Component
Hive:
• A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
• Hive performs three main functions: data summarization, query, and analysis.

Hadoop Ecosystem: What Hive can provide

Hadoop Ecosystem: Component
• ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives, such as distributed locks, that can be used for building distributed applications. ZooKeeper manages and coordinates a large cluster of machines.
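ZooKeeper exposes such primitives through a filesystem-like tree of znodes. A minimal sketch in the bundled zkCli.sh client (the server address and znode names are illustrative; ephemeral sequential znodes under a shared parent are the usual building block for a distributed lock):

$ zkCli.sh -server localhost:2181
create /locks ""               // persistent parent znode
create -e -s /locks/lock- ""   // ephemeral + sequential child; the lowest sequence number holds the lock
ls /locks                      // contenders can inspect the queue of lock requests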

Hadoop Ecosystem: Component
• Sqoop: A tool for efficiently moving data between relational databases and HDFS. Sqoop imports data from external sources into Hadoop ecosystem components such as HDFS, HBase, or Hive, and also exports data from Hadoop back to external sources. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.

Hadoop Ecosystem: Component
• Capabilities of Sqoop include:
• Importing individual tables or entire databases to files in HDFS
• Generating Java classes to interact with your imported data
• Importing from SQL databases straight into your Hive data warehouse

Hadoop Ecosystem: Component
• Sqoop:

Consider a table in a relational database:

CREATE TABLE Test(
id INT NOT NULL PRIMARY KEY,
msg VARCHAR(32),
bar INT);

Consider also a dataset in HDFS containing records like these:

0,this is a test,42
1,some more data,100

Running sqoop-export --table Test --update-key id --export-dir /path/to/data --connect … will run an export job that executes SQL statements based on the data, like so:

UPDATE Test SET msg='this is a test', bar=42 WHERE id=0;


Hadoop Ecosystem: Sqoop Import

• $ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*

• E.g., imported emp table data will be stored in HDFS as:


1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
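
For reference, an import invocation of the kind that produces those files might look as follows (the JDBC URL, credentials, and database name are illustrative placeholders, not the exact classroom setup):

$ sqoop import \
  --connect jdbc:mysql://localhost/userdb \
  --username root \
  --table emp \
  --m 1 \
  --target-dir /emp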

Hadoop Ecosystem: Sqoop Export

• $ sqoop export (generic-args) (export-args)
• $ sqoop-export (generic-args) (export-args)

E.g., the emp table data to be exported, as stored in HDFS:
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
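
A matching export invocation might look as follows (connection details and table name are illustrative placeholders; the target table must already exist in the database):

$ sqoop export \
  --connect jdbc:mysql://localhost/db \
  --username root \
  --table employee \
  --export-dir /emp/emp_data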

Hadoop Ecosystem: Sqoop Import

• $ sqoop import (generic-args) (import-args)
• $ sqoop-import (generic-args) (import-args)
Hadoop Ecosystem: Flume
• Flume:
• Efficiently collects, aggregates, and moves large amounts of data from their origin back to HDFS; it is a fault-tolerant and reliable mechanism.
• This Hadoop ecosystem component allows data to flow from sources into the Hadoop environment.
• It uses a simple, extensible data model that allows for online analytic applications.
• Using Flume, we can get data from multiple servers into Hadoop immediately.

Hadoop Ecosystem: Flume Features

• Features of Flume:
• Flume has a flexible design based upon streaming data flows.
• It is fault tolerant and robust, with multiple failover and recovery mechanisms. Flume offers different levels of reliability, including 'best-effort delivery' and 'end-to-end delivery'.
• Flume carries data between sources and sinks. This gathering of data can either be scheduled or event-driven. Flume has its own query processing engine, which makes it easy to transform each new batch of data before it is moved to the intended sink.
• Flume can also be used to transport event data including, but not limited to, network traffic data, data generated by social media websites, and email messages.
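
A Flume agent is wired together in a properties file. A minimal single-node sketch (agent, source, channel, and sink names here are illustrative) that forwards events from a netcat source through a memory channel into HDFS:

# example.conf: netcat source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The agent is then started with:

$ flume-ng agent --conf conf --conf-file example.conf --name a1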

Hadoop Ecosystem: Component
• Ambari:
• Another Hadoop ecosystem component; a management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters.
• Features of Ambari:
• Simplified installation, configuration, and management
• Centralized security setup
• Highly extensible and customizable
• Full visibility into cluster health

Hadoop Ecosystem: Component
Difference Between Apache Ambari and Apache ZooKeeper

Hadoop Ecosystem: Component
• Oozie
• A workflow scheduler system for managing Apache Hadoop jobs.
• Oozie combines multiple jobs sequentially into one logical unit of work.
• The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.

Hadoop Ecosystem: Component
• Working of Oozie

HIVE
Query Language for Hadoop
History of Hive

Why Hive?

What is Hive

Architecture of Hive

Data Flow in Hive

Hive Data Modeling

Hive Data Types

Different Modes of Hive

Hive vs. RDBMS


Hive QL – Join

page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14

user:
userid  age  gender
111     25   female
222     32   male

pv_users (join result):
pageid  age
1       25
2       25
1       32

• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Hive QL – Join in MapReduce

Map phase: both tables are scanned and re-keyed on the join column (userid); each value carries a table tag plus the needed column.

page_view → key: userid, value: <1, pageid>   e.g. 111→<1,1>, 111→<1,2>, 222→<1,1>
user      → key: userid, value: <2, age>      e.g. 111→<2,25>, 222→<2,32>

Shuffle and sort: all records with the same userid are grouped together.

Reduce phase: for each userid, the tagged values from the two tables are combined to emit the joined (pageid, age) rows.
Hive QL – Group By

pv_users:
pageid  age
1       25
2       25
1       32
2       25

pageid_age_sum:
pageid  age  count
1       25   1
2       25   2
1       32   1

• SQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By in MapReduce

Map phase: each pv_users row is emitted with the composite key <pageid, age> and the value 1, e.g. <1,25>→1, <2,25>→1, <1,32>→1, <2,25>→1.

Shuffle and sort: equal keys are brought to the same reducer.

Reduce phase: the values for each key are summed: <1,25>→1, <1,32>→1, <2,25>→2.
Hive QL – Group By with Distinct

page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14
2       111     9:08:20

result:
pageid  count_distinct_userid
1       2
2       1

• SQL:
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid;
Hive QL – Group By with Distinct in MapReduce

Map phase: each row is emitted with the composite key <pageid, userid>, e.g. <1,111>, <1,222>, <2,111>, <2,111>.

Shuffle and sort: records are partitioned on pageid but sorted on a prefix of the key, so duplicate <pageid, userid> pairs such as <2,111> arrive adjacent to each other.

Reduce phase: the reducer counts the distinct userids per pageid: pageid 1 → 2, pageid 2 → 1.
Hive QL – Order By

page_view:
pageid  userid  time
2       111     9:08:13
1       111     9:08:01
1       222     9:08:14
2       111     9:08:20

Map phase: the sort columns become the key (e.g. <pageid, userid>) with the remaining column (time) as the value, e.g. <1,111>→9:08:01, <2,111>→9:08:13. Shuffle and sort then deliver the keys to the reducer in order, so the reduce output is the totally ordered table.
Features of Hive





Hive Demo

create database office;          -- creates a database called office
show databases;                  -- shows the created database
drop database office;            -- drops the office database (it is empty)
drop database office cascade;    -- drops the database together with its tables when it is not empty
create database office;          -- recreates the database office
use office;                      -- sets office as the default database
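
The remaining demo slides are screenshots; a typical continuation of the demo inside the office database might look like this (the table name, schema, and file path are illustrative):

create table employee (id int, name string, salary float)
row format delimited fields terminated by ',';                          -- comma-delimited table
load data local inpath '/home/user/employee.csv' into table employee;  -- loads a local file
select * from employee limit 5;                                        -- inspects the first rows
describe employee;                                                     -- shows the table schema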



Pig

Why Pig?

What is Pig

MapReduce vs. Hive vs. Pig

Components of Pig

Pig Architecture

Working of Pig

Pig Latin Data Model

Pig Execution Modes

Use Case: Twitter

Features of Pig


Demo
• $ hdfs dfs -mkdir /input
• $ hdfs dfs -ls /
• $ hdfs dfs -copyFromLocal Sales2009.csv /input/


Pig Demo

$ pig -help          // shows usage
$ pig                // run in default (mapreduce) mode
$ pig -x local       // local mode
$ pig -x mapreduce   // mapreduce mode

Each of these starts the Grunt shell (grunt>), where Pig Latin statements are entered:

grunt> salesTable = LOAD '/SalesJan2009.csv' USING PigStorage(',') AS
(Transaction_date:chararray, Product:chararray, Price:chararray, Payment_Type:chararray,
Name:chararray, City:chararray, State:chararray, Country:chararray,
Account_Created:chararray, Last_Login:chararray, Latitude:chararray, Longitude:chararray);
Demo

For each tuple in 'GroupByCountry', generate the resulting string of the form "Name of Country: No. of products sold"; a sketch of this step follows.
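
A plausible Grunt session for this step, continuing from the salesTable LOAD above (the CONCAT/COUNT formulation is one common way to build the "Country:count" string; treat it as a sketch rather than the exact classroom script):

grunt> GroupByCountry = GROUP salesTable BY Country;
grunt> CountByCountry = FOREACH GroupByCountry GENERATE
       CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));
grunt> STORE CountByCountry INTO 'pig_output_sales' USING PigStorage('\t');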

The stored result can then be read back from HDFS:

$ hdfs dfs -cat pig_output_sales/part-r-00000


Example Pig Commands

-- Load movie ratings data
ratings = LOAD '/path/to/movie_ratings.csv' USING PigStorage(',') AS (user_id:int, movie_id:int, rating:int);

-- Group ratings by movie_id and calculate the average rating for each movie
avg_ratings = FOREACH (GROUP ratings BY movie_id) GENERATE group AS movie_id, AVG(ratings.rating) AS avg_rating;

-- Order movies by average rating
ordered_ratings = ORDER avg_ratings BY avg_rating DESC;

-- Store recommendations into HDFS
STORE ordered_ratings INTO '/path/to/movie_recommendations' USING PigStorage(',');


Example Pig Commands

-- Load customer data
customer_data = LOAD '/path/to/customer_data.csv' USING PigStorage(',') AS (customer_id:int, age:int, income:double, spending_score:int);

-- Segment customers based on spending score (nested bincond expressions)
segmented_customers = FOREACH customer_data GENERATE
    customer_id, age, income, spending_score,
    (spending_score >= 80 ? 'High Spenders' :
        (spending_score >= 50 ? 'Medium Spenders' : 'Low Spenders')) AS segment;

-- Store segmented customers into HDFS
STORE segmented_customers INTO '/path/to/customer_segments' USING PigStorage(',');


Comparison of MapReduce and RDBMS
Word Count Problem using MapReduce

MapReduce real-time use cases:
1. Merging small files into Avro files
2. Merging small files into Sequence files
3. Visits per hour
4. Measuring the PageRank
5. Word search in huge log files (word count as well)
Word Count Problem using MapReduce

The general MapReduce pattern (a runnable example is sketched below):
1. Take a bunch of data.
2. Perform some kind of transformation that converts every datum to another kind of datum.
3. Combine those new data into yet simpler data.
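
The stock Hadoop examples jar ships a word count job, so the pattern can be tried end to end from the shell (the input file and output directory are illustrative):

$ hdfs dfs -mkdir -p /input
$ hdfs dfs -put books.txt /input
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /wc_out
$ hdfs dfs -cat /wc_out/part-r-00000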

Overview of the Apache Spark Ecosystem
Introduction to Apache Spark
• Apache Spark is a general-purpose cluster computing framework.

• It was introduced by UC Berkeley's AMP Lab in 2009 as a distributed computing system, and has been maintained by the Apache Software Foundation from 2013 to date.

• Spark is a lightning-fast computing engine designed for faster processing of large volumes of data.

• It is based on Hadoop's MapReduce model.

• Spark supports batch applications, iterative processing, interactive queries, and streaming data. It reduces the burden of managing separate tools for the respective workloads.

• The main feature of Spark is its in-memory processing, which makes computation faster.

• It has its own cluster management system and uses Hadoop for storage purposes.
Introduction to Apache Spark

• Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.

• One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.

• Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

• Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries.
Apache Spark Ecosystem (The Spark Stack)

• Spark powers a stack of libraries including:
• SQL and DataFrames
• Spark Streaming
• MLlib for machine learning
• GraphX for graph computation
Apache Spark Ecosystem: Component

• The Spark project contains multiple closely integrated components.

• At its core, Spark is a "computational engine" that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster.

1. Spark Core:
• Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more.
• Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark's main programming abstraction.
• RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel.
• Spark Core provides many APIs for building and manipulating these collections.
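
RDDs are easiest to see from the interactive shell. A minimal sketch in the PySpark shell (started with $ pyspark; the HDFS path is an illustrative placeholder), counting words with the RDD API:

>>> rdd = sc.textFile("hdfs:///input/books.txt")      # sc is the shell's SparkContext
>>> counts = (rdd.flatMap(lambda line: line.split())  # one record per word
...              .map(lambda w: (w, 1))               # pair each word with a count of 1
...              .reduceByKey(lambda a, b: a + b))    # sum the counts per word in parallel
>>> counts.take(3)                                    # an action triggers the lazy computation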

Apache Spark Ecosystem: Component

2. Spark SQL
• Spark SQL is Spark’s package for working with structured data. It allows querying data via
SQL as well as the Apache Hive variant of SQL—called the Hive Query Language (HQL)
—and it supports many sources of data, including Hive tables, Parquet, and JSON.
• Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL
queries with the programmatic data manipulations supported by RDDs in Python, Java, and
Scala, all within a single application, thus combining SQL with complex analytics.
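
In the same PySpark shell, a sketch of intermixing the SQL and programmatic interfaces (the JSON file and its schema are illustrative assumptions):

>>> df = spark.read.json("hdfs:///input/people.json")  # spark is the shell's SparkSession
>>> df.createOrReplaceTempView("people")               # expose the DataFrame to SQL
>>> spark.sql("SELECT name FROM people WHERE age > 21").show()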

3. Spark Streaming
• Spark Streaming is a Spark component that enables processing of live streams of data.
Examples of data streams include logfiles generated by production web servers, or queues
of messages containing status updates posted by users of a web service.
• Streaming provides an API for manipulating data streams that closely matches the Spark
Core’s RDD API, making it easy for programmers to learn the project and move between
applications that manipulate data stored in memory, on disk, or arriving in real time.
• Underneath its API, Spark Streaming was designed to provide the same degree of fault
tolerance, throughput, and scalability as Spark Core.
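
A minimal DStream sketch in the same shell, counting words arriving on a socket (host and port are illustrative; this mirrors the network word count example in the Spark documentation):

>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 10)                     # 10-second micro-batches
>>> lines = ssc.socketTextStream("localhost", 9999)
>>> (lines.flatMap(lambda l: l.split())
...       .map(lambda w: (w, 1))
...       .reduceByKey(lambda a, b: a + b)
...       .pprint())                                   # print each batch's word counts
>>> ssc.start()                                        # begin consuming the stream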

Apache Spark Ecosystem: Component

4. MLlib (Machine Learning Library):


• MLlib provides multiple types of machine learning algorithms, including
classification, regression, clustering, and collaborative filtering, as well as supporting
functionality such as model evaluation and data import.
• It also provides some lower-level ML primitives, including a generic gradient descent
optimization algorithm.
• All of these methods are designed to scale out across a cluster.
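
As a sketch of this scale-out style, a toy clustering run with the DataFrame-based KMeans in the PySpark shell (the sample points are illustrative):

>>> from pyspark.ml.clustering import KMeans
>>> from pyspark.ml.linalg import Vectors
>>> pts = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),), (Vectors.dense([9.0, 8.0]),)]
>>> data = spark.createDataFrame(pts, ["features"])
>>> model = KMeans(k=2, seed=1).fit(data)              # training distributes across the cluster
>>> model.clusterCenters()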

5. GraphX (Graph Computation):


• GraphX is a library for manipulating graphs (e.g., a social network's friend graph) and performing graph-parallel computations.
• GraphX also provides various operators for manipulating graphs (e.g., subgraph and
mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle
counting).

Features of Apache Spark
• Speed: Though Spark is based on MapReduce, it is 10 times faster than Hadoop when it comes to big data processing.

• Usability: Spark supports multiple languages, making it easier to work with.

• Sophisticated Analytics: Spark provides complex algorithms for big data analytics and machine learning.

• In-Memory Processing: Unlike Hadoop, Spark doesn't repeatedly move data in and out of the cluster's disks; intermediate results are kept in memory.

• Lazy Evaluation: Spark waits until an action requires a result and then processes the accumulated instructions in the most efficient way possible.

• Fault Tolerance: Spark has better fault tolerance than Hadoop; both storage and computation can tolerate failure by backing up to another node.
Comparison of Different Frameworks
Case Study:
• Google's cloud analytics tools:
• Cloud Dataflow: runs faster and scales better than pretty much any other system.
• Cloud Save: an API that enables an application to save an individual user's data in the cloud or elsewhere and use it without requiring any server-side coding.
• Cloud Debugging: makes it easier to sift through lines of code deployed across many servers in the cloud to identify software bugs.
Case Study:
• Google's cloud analytics tools (continued):
• Cloud Tracing: provides latency statistics across different groups and produces analysis reports.
• Cloud Monitoring: an intelligent monitoring system that watches cloud infrastructure resources, such as disks and virtual machines, as well as service levels for Google's services and for more than a dozen non-Google open source packages.
Case Study:
• Twitter Analytics: Capturing and Analyzing Tweets

https://blogs.ischool.berkeley.edu/i290-abdt-s12/
Hadoop High-Level Architecture
Hadoop Cluster
• A small Hadoop cluster includes a single master and multiple worker nodes.

Master node: NameNode, JobTracker, DataNode, TaskTracker
Slave node: DataNode, TaskTracker
Active Learning

Answer the following:

1. Hadoop Distributed File System (HDFS) is renamed from NDFS.
2. All the core projects of Hadoop were hosted by Yahoo!
3. A large amount of data needs large hardware.
4. IT organizations can handle information growth irrespective of the use of commodity software and hardware.
5. Apache Storm is a stream processing framework.
6. YARN stands for ______.
Difference Between Hadoop and SQL

Hadoop                               SQL
Schema on read                       Schema on write
Data stored as compressed files      Data stored in logical form with interrelated tables
