
Unit-IV
Technologies and tools for Big Data:
• ZooKeeper
• Importing relational data with Sqoop
• Ingesting stream data with Flume
• Basic concepts of Pig, Architecture of Pig
• What is Hive, Architecture of Hive, Hive Commands
• Overview of Apache Spark Ecosystem, Spark Architecture
Hadoop Ecosystem

Hadoop Ecosystem: Component
• Avro: an open source project that provides data serialization and data exchange services for Hadoop; a serialization system for efficient, cross-language RPC and persistent data storage.
• Features provided by Avro:
• Rich data structures.
• Remote procedure call.
• Compact, fast, binary data format.
• Container file, to store persistent data.

• MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.

• HDFS: A distributed file system that runs on large clusters of commodity machines. HDFS supports write-once-read-many semantics on files.
Hadoop Ecosystem: Component
Pig:
• A data flow language and execution environment for exploring and processing very large datasets. Pig runs on HDFS and MapReduce clusters.
• Apache Pig is a high-level language platform for analyzing and querying huge datasets stored in HDFS.
• As a component of the Hadoop ecosystem, Pig uses the Pig Latin language, which is similar to SQL.
• It loads the data, applies the required filters, and dumps the data in the required format. To execute programs, Pig requires a Java runtime environment.

Hadoop Ecosystem: Component
Hive:
• A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
• Hive performs three main functions: data summarization, query, and analysis.

Hadoop Ecosystem: What Hive can provide

Hadoop Ecosystem: Component
• ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives, such as distributed locks, that can be used for building distributed applications. ZooKeeper manages and coordinates a large cluster of machines.
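ZooKeeper exposes such primitives through a filesystem-like tree of znodes. A minimal sketch in the bundled zkCli.sh client (the server address and znode names are illustrative; ephemeral sequential znodes under a shared parent are the usual building block for a distributed lock):

$ zkCli.sh -server localhost:2181
create /locks ""               // persistent parent znode
create -e -s /locks/lock- ""   // ephemeral + sequential child; the lowest sequence number holds the lock
ls /locks                      // contenders can inspect the queue of lock requests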

Hadoop Ecosystem: Component
• Sqoop: A tool for efficiently moving data between relational databases and HDFS. Sqoop imports data from external sources into Hadoop ecosystem components such as HDFS, HBase, or Hive, and also exports data from Hadoop back to external sources. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.

Hadoop Ecosystem: Component
• Capabilities of Sqoop include:
• Importing individual tables or entire databases to files in HDFS
• Generating Java classes to interact with your imported data
• Importing from SQL databases straight into your Hive data warehouse

Hadoop Ecosystem: Component
• Sqoop:

Consider a table in a relational database:

CREATE TABLE Test(
id INT NOT NULL PRIMARY KEY,
msg VARCHAR(32),
bar INT);

Consider also a dataset in HDFS containing records like these:

0,this is a test,42
1,some more data,100

Running sqoop-export --table Test --update-key id --export-dir /path/to/data --connect … will run an export job that executes SQL statements based on the data, like so:

UPDATE Test SET msg='this is a test', bar=42 WHERE id=0;


Hadoop Ecosystem: Sqoop Import

• $ $HADOOP_HOME/bin/hadoop fs -cat /emp/part-m-*

• E.g., imported emp table data will be stored in HDFS as:


1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
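
For reference, an import invocation of the kind that produces those files might look as follows (the JDBC URL, credentials, and database name are illustrative placeholders, not the exact classroom setup):

$ sqoop import \
  --connect jdbc:mysql://localhost/userdb \
  --username root \
  --table emp \
  --m 1 \
  --target-dir /emp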

Hadoop Ecosystem: Sqoop Export

• $ sqoop export (generic-args) (export-args)
• $ sqoop-export (generic-args) (export-args)

E.g., the emp table data to be exported, as stored in HDFS:
1201, gopal, manager, 50000, TP
1202, manisha, preader, 50000, TP
1203, kalil, php dev, 30000, AC
1204, prasanth, php dev, 30000, AC
1205, kranthi, admin, 20000, TP
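
A matching export invocation might look as follows (connection details and table name are illustrative placeholders; the target table must already exist in the database):

$ sqoop export \
  --connect jdbc:mysql://localhost/db \
  --username root \
  --table employee \
  --export-dir /emp/emp_data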

Hadoop Ecosystem: Sqoop Import

• $ sqoop import (generic-args) (import-args)
• $ sqoop-import (generic-args) (import-args)
Hadoop Ecosystem: Flume
• Flume:
• Efficiently collects, aggregates, and moves large amounts of data from their origin back to HDFS; it is a fault-tolerant and reliable mechanism.
• This Hadoop ecosystem component allows data to flow from sources into the Hadoop environment.
• It uses a simple, extensible data model that allows for online analytic applications.
• Using Flume, we can get data from multiple servers into Hadoop immediately.

Hadoop Ecosystem: Flume Features

• Features of Flume:
• Flume has a flexible design based upon streaming data flows.
• It is fault tolerant and robust, with multiple failover and recovery mechanisms. Flume offers different levels of reliability, including 'best-effort delivery' and 'end-to-end delivery'.
• Flume carries data between sources and sinks. This gathering of data can either be scheduled or event-driven. Flume has its own query processing engine, which makes it easy to transform each new batch of data before it is moved to the intended sink.
• Flume can also be used to transport event data including, but not limited to, network traffic data, data generated by social media websites, and email messages.
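
A Flume agent is wired together in a properties file. A minimal single-node sketch (agent, source, channel, and sink names here are illustrative) that forwards events from a netcat source through a memory channel into HDFS:

# example.conf: netcat source -> memory channel -> HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

The agent is then started with:

$ flume-ng agent --conf conf --conf-file example.conf --name a1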

Hadoop Ecosystem: Component
• Ambari:
• Another Hadoop ecosystem component; a management platform for provisioning, managing, monitoring, and securing Apache Hadoop clusters.
• Features of Ambari:
• Simplified installation, configuration, and management
• Centralized security setup
• Highly extensible and customizable
• Full visibility into cluster health

Hadoop Ecosystem: Component
Difference Between Apache Ambari and Apache ZooKeeper

Hadoop Ecosystem: Component
• Oozie
• A workflow scheduler system for managing Apache Hadoop jobs.
• Oozie combines multiple jobs sequentially into one logical unit of work.
• The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.

Hadoop Ecosystem: Component
• Working of Oozie

HIVE
Query Language for Hadoop
History of Hive

Why Hive?

What is Hive

Architecture of Hive

Data Flow in Hive

Hive Data Modeling

Hive Data Types

Different Modes of Hive

Hive vs. RDBMS


Hive QL – Join

page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14

user:
userid  age  gender
111     25   female
222     32   male

pv_users (join result):
pageid  age
1       25
2       25
1       32

• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Hive QL – Join in MapReduce

Map phase: both tables are scanned and re-keyed on the join column (userid); each value carries a table tag plus the needed column.

page_view → key: userid, value: <1, pageid>   e.g. 111→<1,1>, 111→<1,2>, 222→<1,1>
user      → key: userid, value: <2, age>      e.g. 111→<2,25>, 222→<2,32>

Shuffle and sort: all records with the same userid are grouped together.

Reduce phase: for each userid, the tagged values from the two tables are combined to emit the joined (pageid, age) rows.
Hive QL – Group By

pv_users:
pageid  age
1       25
2       25
1       32
2       25

pageid_age_sum:
pageid  age  count
1       25   1
2       25   2
1       32   1

• SQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By in MapReduce

Map phase: each pv_users row is emitted with the composite key <pageid, age> and the value 1, e.g. <1,25>→1, <2,25>→1, <1,32>→1, <2,25>→1.

Shuffle and sort: equal keys are brought to the same reducer.

Reduce phase: the values for each key are summed: <1,25>→1, <1,32>→1, <2,25>→2.
Hive QL – Group By with Distinct

page_view:
pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14
2       111     9:08:20

result:
pageid  count_distinct_userid
1       2
2       1

• SQL:
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid;
Hive QL – Group By with Distinct in MapReduce

Map phase: each row is emitted with the composite key <pageid, userid>, e.g. <1,111>, <1,222>, <2,111>, <2,111>.

Shuffle and sort: records are partitioned on pageid but sorted on a prefix of the key, so duplicate <pageid, userid> pairs such as <2,111> arrive adjacent to each other.

Reduce phase: the reducer counts the distinct userids per pageid: pageid 1 → 2, pageid 2 → 1.
Hive QL – Order By

page_view:
pageid  userid  time
2       111     9:08:13
1       111     9:08:01
1       222     9:08:14
2       111     9:08:20

Map phase: the sort columns become the key (e.g. <pageid, userid>) with the remaining column (time) as the value, e.g. <1,111>→9:08:01, <2,111>→9:08:13. Shuffle and sort then deliver the keys to the reducer in order, so the reduce output is the totally ordered table.
Features of Hive





Hive Demo

create database office;          -- creates a database called office
show databases;                  -- shows the created database
drop database office;            -- drops the office database (it is empty)
drop database office cascade;    -- drops the database together with its tables when it is not empty
create database office;          -- recreates the database office
use office;                      -- sets office as the default database
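
The remaining demo slides are screenshots; a typical continuation of the demo inside the office database might look like this (the table name, schema, and file path are illustrative):

create table employee (id int, name string, salary float)
row format delimited fields terminated by ',';                          -- comma-delimited table
load data local inpath '/home/user/employee.csv' into table employee;  -- loads a local file
select * from employee limit 5;                                        -- inspects the first rows
describe employee;                                                     -- shows the table schema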



Pig

Why Pig?

What is Pig

MapReduce vs. Hive vs. Pig

Components of Pig

Pig Architecture

Working of Pig

Pig Latin Data Model

Pig Execution Modes

Use Case: Twitter

Features of Pig


Demo
• $ hdfs dfs -mkdir /input
• $ hdfs dfs -ls /
• $ hdfs dfs -copyFromLocal Sales2009.csv /input/


Pig Demo

$ pig -help          // shows usage
$ pig                // run in default (mapreduce) mode
$ pig -x local       // local mode
$ pig -x mapreduce   // mapreduce mode

Each of these starts the Grunt shell (grunt>), where Pig Latin statements are entered:

grunt> salesTable = LOAD '/SalesJan2009.csv' USING PigStorage(',') AS
(Transaction_date:chararray, Product:chararray, Price:chararray, Payment_Type:chararray,
Name:chararray, City:chararray, State:chararray, Country:chararray,
Account_Created:chararray, Last_Login:chararray, Latitude:chararray, Longitude:chararray);
Demo

For each tuple in 'GroupByCountry', generate the resulting string of the form "Name of Country: No. of products sold"; a sketch of this step follows.
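
A plausible Grunt session for this step, continuing from the salesTable LOAD above (the CONCAT/COUNT formulation is one common way to build the "Country:count" string; treat it as a sketch rather than the exact classroom script):

grunt> GroupByCountry = GROUP salesTable BY Country;
grunt> CountByCountry = FOREACH GroupByCountry GENERATE
       CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));
grunt> STORE CountByCountry INTO 'pig_output_sales' USING PigStorage('\t');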

The stored result can then be read back from HDFS:

$ hdfs dfs -cat pig_output_sales/part-r-00000


Example Pig Commands

-- Load movie ratings data
ratings = LOAD '/path/to/movie_ratings.csv' USING PigStorage(',') AS (user_id:int, movie_id:int, rating:int);

-- Group ratings by movie_id and calculate the average rating for each movie
avg_ratings = FOREACH (GROUP ratings BY movie_id) GENERATE group AS movie_id, AVG(ratings.rating) AS avg_rating;

-- Order movies by average rating
ordered_ratings = ORDER avg_ratings BY avg_rating DESC;

-- Store recommendations into HDFS
STORE ordered_ratings INTO '/path/to/movie_recommendations' USING PigStorage(',');


Example Pig Commands

-- Load customer data
customer_data = LOAD '/path/to/customer_data.csv' USING PigStorage(',') AS (customer_id:int, age:int, income:double, spending_score:int);

-- Segment customers based on spending score (nested bincond expressions)
segmented_customers = FOREACH customer_data GENERATE
    customer_id, age, income, spending_score,
    (spending_score >= 80 ? 'High Spenders' :
        (spending_score >= 50 ? 'Medium Spenders' : 'Low Spenders')) AS segment;

-- Store segmented customers into HDFS
STORE segmented_customers INTO '/path/to/customer_segments' USING PigStorage(',');


Comparison of MapReduce and RDBMS
Word Count Problem using MapReduce

MapReduce real-time use cases:
1. Merging small files into Avro files
2. Merging small files into Sequence files
3. Visits per hour
4. Measuring the PageRank
5. Word search in huge log files (word count as well)
Word Count Problem using MapReduce

The general MapReduce pattern (a runnable example is sketched below):
1. Take a bunch of data.
2. Perform some kind of transformation that converts every datum to another kind of datum.
3. Combine those new data into yet simpler data.
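
The stock Hadoop examples jar ships a word count job, so the pattern can be tried end to end from the shell (the input file and output directory are illustrative):

$ hdfs dfs -mkdir -p /input
$ hdfs dfs -put books.txt /input
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /wc_out
$ hdfs dfs -cat /wc_out/part-r-00000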

Overview of the Apache Spark Ecosystem
Introduction to Apache Spark
• Apache Spark is a general-purpose cluster computing framework.

• It was introduced by UC Berkeley's AMP Lab in 2009 as a distributed computing system, and has been maintained by the Apache Software Foundation from 2013 to date.

• Spark is a lightning-fast computing engine designed for faster processing of large volumes of data.

• It is based on Hadoop's MapReduce model.

• Spark supports batch applications, iterative processing, interactive queries, and streaming data. It reduces the burden of managing separate tools for the respective workloads.

• The main feature of Spark is its in-memory processing, which makes computation faster.

• It has its own cluster management system and uses Hadoop for storage purposes.
Introduction to Apache Spark

• Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing.

• One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also more efficient than MapReduce for complex applications running on disk.

• Apache Spark achieves high performance for both batch and streaming data, using a state-of-the-art DAG scheduler, a query optimizer, and a physical execution engine.

• Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL, and rich built-in libraries.
Apache Spark Ecosystem (The Spark Stack)

• Spark powers a stack of libraries including:
• SQL and DataFrames
• Spark Streaming
• MLlib for machine learning
• GraphX for graph computation
Apache Spark Ecosystem: Component

• The Spark project contains multiple closely integrated components.

• At its core, Spark is a "computational engine" that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster.

1. Spark Core:
• Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more.
• Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark's main programming abstraction.
• RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel.
• Spark Core provides many APIs for building and manipulating these collections.
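
RDDs are easiest to see from the interactive shell. A minimal sketch in the PySpark shell (started with $ pyspark; the HDFS path is an illustrative placeholder), counting words with the RDD API:

>>> rdd = sc.textFile("hdfs:///input/books.txt")      # sc is the shell's SparkContext
>>> counts = (rdd.flatMap(lambda line: line.split())  # one record per word
...              .map(lambda w: (w, 1))               # pair each word with a count of 1
...              .reduceByKey(lambda a, b: a + b))    # sum the counts per word in parallel
>>> counts.take(3)                                    # an action triggers the lazy computation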

Apache Spark Ecosystem: Component

2. Spark SQL
• Spark SQL is Spark’s package for working with structured data. It allows querying data via
SQL as well as the Apache Hive variant of SQL—called the Hive Query Language (HQL)
—and it supports many sources of data, including Hive tables, Parquet, and JSON.
• Beyond providing a SQL interface to Spark, Spark SQL allows developers to intermix SQL
queries with the programmatic data manipulations supported by RDDs in Python, Java, and
Scala, all within a single application, thus combining SQL with complex analytics.
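
In the same PySpark shell, a sketch of intermixing the SQL and programmatic interfaces (the JSON file and its schema are illustrative assumptions):

>>> df = spark.read.json("hdfs:///input/people.json")  # spark is the shell's SparkSession
>>> df.createOrReplaceTempView("people")               # expose the DataFrame to SQL
>>> spark.sql("SELECT name FROM people WHERE age > 21").show()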

3. Spark Streaming
• Spark Streaming is a Spark component that enables processing of live streams of data.
Examples of data streams include logfiles generated by production web servers, or queues
of messages containing status updates posted by users of a web service.
• Streaming provides an API for manipulating data streams that closely matches the Spark
Core’s RDD API, making it easy for programmers to learn the project and move between
applications that manipulate data stored in memory, on disk, or arriving in real time.
• Underneath its API, Spark Streaming was designed to provide the same degree of fault
tolerance, throughput, and scalability as Spark Core.
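
A minimal DStream sketch in the same shell, counting words arriving on a socket (host and port are illustrative; this mirrors the network word count example in the Spark documentation):

>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 10)                     # 10-second micro-batches
>>> lines = ssc.socketTextStream("localhost", 9999)
>>> (lines.flatMap(lambda l: l.split())
...       .map(lambda w: (w, 1))
...       .reduceByKey(lambda a, b: a + b)
...       .pprint())                                   # print each batch's word counts
>>> ssc.start()                                        # begin consuming the stream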

Apache Spark Ecosystem: Component

4. MLlib (Machine Learning Library):


• MLlib provides multiple types of machine learning algorithms, including
classification, regression, clustering, and collaborative filtering, as well as supporting
functionality such as model evaluation and data import.
• It also provides some lower-level ML primitives, including a generic gradient descent
optimization algorithm.
• All of these methods are designed to scale out across a cluster.
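
As a sketch of this scale-out style, a toy clustering run with the DataFrame-based KMeans in the PySpark shell (the sample points are illustrative):

>>> from pyspark.ml.clustering import KMeans
>>> from pyspark.ml.linalg import Vectors
>>> pts = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),), (Vectors.dense([9.0, 8.0]),)]
>>> data = spark.createDataFrame(pts, ["features"])
>>> model = KMeans(k=2, seed=1).fit(data)              # training distributes across the cluster
>>> model.clusterCenters()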

5. GraphX (Graph Computation):


• GraphX is a library for manipulating graphs (e.g., a social network's friend graph) and performing graph-parallel computations.
• GraphX also provides various operators for manipulating graphs (e.g., subgraph and
mapVertices) and a library of common graph algorithms (e.g., PageRank and triangle
counting).

Features of Apache Spark
• Speed: Though Spark is based on MapReduce, it is 10 times faster than Hadoop when it comes to big data processing.

• Usability: Spark supports multiple languages, making it easier to work with.

• Sophisticated Analytics: Spark provides complex algorithms for big data analytics and machine learning.

• In-Memory Processing: Unlike Hadoop, Spark doesn't repeatedly move data in and out of the cluster's disks; intermediate results are kept in memory.

• Lazy Evaluation: Spark waits until an action requires a result and then processes the accumulated instructions in the most efficient way possible.

• Fault Tolerance: Spark has better fault tolerance than Hadoop; both storage and computation can tolerate failure by backing up to another node.
Comparison of Different Frameworks
Case Study:
• Google's cloud analytics tools:
• Cloud Dataflow: runs faster and scales better than pretty much any other system.
• Cloud Save: an API that enables an application to save an individual user's data in the cloud or elsewhere and use it without requiring any server-side coding.
• Cloud Debugging: makes it easier to sift through lines of code deployed across many servers in the cloud to identify software bugs.
Case Study:
• Google's cloud analytics tools (continued):
• Cloud Tracing: provides latency statistics across different groups and produces analysis reports.
• Cloud Monitoring: an intelligent monitoring system that watches cloud infrastructure resources, such as disks and virtual machines, as well as service levels for Google's services and for more than a dozen non-Google open source packages.
Case Study:
• Twitter Analytics: Capturing and Analyzing Tweets

https://blogs.ischool.berkeley.edu/i290-abdt-s12/
Hadoop High-Level Architecture
Hadoop Cluster
• A small Hadoop cluster includes a single master and multiple worker nodes.

Master node: NameNode, JobTracker, DataNode, TaskTracker
Slave node: DataNode, TaskTracker
Active Learning

Answer the following:

1. Hadoop Distributed File System (HDFS) is renamed from NDFS.
2. All the core projects of Hadoop were hosted by Yahoo!
3. A large amount of data needs large hardware.
4. IT organizations can handle information growth irrespective of the use of commodity software and hardware.
5. Apache Storm is a stream processing framework.
6. YARN stands for ______.
Difference Between Hadoop and SQL

Hadoop                               SQL
Schema on read                       Schema on write
Data stored as compressed files      Data stored in logical form with interrelated tables
