Session 3.1

The document covers the Hadoop ecosystem, including Hive, Pig, Jasper Reports, machine learning with Spark, graph processing with Spark GraphX, stream processing concepts, and Spark Streaming. It outlines the topics covered in each session: data modeling in Hive, the Pig Latin language, connecting to MongoDB and Cassandra with Jasper Reports, machine learning algorithms in Spark MLlib, graph processing in GraphX, and stream processing architectures.


MODULE III HADOOP ECO SYSTEMS

Hadoop Eco systems: Hive – Architecture – data type – File format –


HQL – SerDe – User defined functions – Pig: Features – Anatomy – Pig
on Hadoop - Pig Latin overview – Data types – Running pig – Execution
modes of Pig – HDFS commands – Relational operators – Eval
Functions – Complex data type – Piggy Bank – User defined Functions
– Parameter substitution – Diagnostic operator. Jasper Report:
Introduction – Connecting to MongoDB – Connecting to Cassandra –
Introduction to Big Data machine learning with Spark: Introduction to
Spark MLlib, Linear Regression - Clustering - Collaborative filtering -
Association rule mining – Decision tree using Spark. Introduction to
Graph - Introduction to Spark GraphX, Introduction to Streams
Concepts - Stream Data Model and Architecture Introduction to Spark
Streaming - Kafka -Streaming Ecosystem.
Module 3 – HADOOP ECO SYSTEMS

Session  Topic

3.1   Hadoop Eco systems: Hive – Architecture – data type – File format

3.2   HQL – SerDe – User defined functions

3.3   Pig: Features – Anatomy – Pig on Hadoop – Pig Latin overview – Data
      types – Running Pig – Execution modes of Pig

3.4   HDFS commands – Relational operators – Eval Functions – Complex data
      type – Piggy Bank – User defined Functions – Parameter substitution –
      Diagnostic operator

3.5   Jasper Report: Introduction – Connecting to MongoDB – Connecting to
      Cassandra

3.6   Introduction to Big Data machine learning with Spark: Introduction to
      Spark MLlib, Linear Regression – Clustering – Collaborative filtering

3.7   Association rule mining – Decision tree using Spark

3.8   Introduction to Graph – Introduction to Spark GraphX

3.9   Introduction to Streams Concepts – Stream Data Model and Architecture

3.10  Introduction to Spark Streaming – Kafka – Streaming Ecosystem
MODULE 3 – HADOOP ECO SYSTEMS
Hadoop Eco systems: Hive – Architecture – data type – File format
– HQL – SerDe – User defined functions
Data Analysts with Hadoop
Challenges that Data Analysts faced
• Data Explosion
  - TBs of data generated every day
  Solution – HDFS to store the data and the Hadoop MapReduce
  framework to parallelize its processing
What is the catch?
  - Hadoop MapReduce is Java intensive
  - Thinking in the MapReduce paradigm can get tricky
Hive - A Warehousing Solution Over
a Map-Reduce Framework
… Enter Hive!
What is Hive?
• Hive is a data warehousing tool built on top of Hadoop and used to
  query structured data. Facebook created Hive to manage its
  ever-growing volumes of data. Hive makes use of the following:
  1. HDFS for storage
  2. MapReduce for execution
  3. An RDBMS to store metadata
• Hive is suitable for data warehousing applications that process
  batch jobs over huge, immutable data. Examples: analysis of web
  logs and application logs.
History
• 2007: Hive was born at Facebook to analyze their incoming log data
• 2008: Hive became an Apache Hadoop sub-project

Hive provides HQL (Hive Query Language), which is similar to SQL.
Hive compiles HQL queries into MapReduce jobs and then runs the jobs
on the Hadoop cluster.
Hive provides extensive data types, functions and file formats for
data summarization and analysis.
Hive Key Principles
HiveQL to MapReduce
[Diagram: a data analyst submits SELECT COUNT(1) FROM Sales; to the
Hive framework, which compiles it into a MapReduce job instance over
the Sales Hive table; map tasks emit (rowcount, 1) pairs that are
summed into (rowcount, N).]
Hive Data Model
Data in Hive is organized into:
• Tables
• Partitions
• Buckets
Hive Data Model Contd.
• Tables
- Analogous to relational tables
- Each table has a corresponding directory in HDFS
- Data serialized and stored as files within that directory
- Hive has a built-in default serialization which supports
  compression and lazy deserialization
- Users can specify custom serialization–deserialization schemes
  (SerDes), as in the sketch below
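A minimal sketch of specifying a SerDe at table-creation time (the table
and column names are hypothetical); here Hive's bundled OpenCSVSerde
replaces the default LazySimpleSerDe:

-- Parse comma-separated, quoted text files with the bundled CSV SerDe
CREATE TABLE web_logs (ip STRING, request STRING, status INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
STORED AS TEXTFILE;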
Hive Data Model Contd.
• Partitions
- Each table can be broken into partitions
- Partitions determine distribution of data within subdirectories
Example -
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
So each partition will be split out into different folders like
Sales/country=US/year=2012/month=12
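As a sketch (using the Sales table above), filtering on the partition
columns lets Hive read only the matching subdirectory instead of
scanning the whole table:

-- Only the folder Sales/country=US/year=2012/month=12 is read
SELECT SUM(amount)
FROM Sales
WHERE country = 'US' AND year = 2012 AND month = 12;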
Hierarchy of Hive Partitions

/hivebase/Sales
    /country=US
        /year=2012
            /month=11   (data files)
            /month=12   (data files)
    /country=CANADA
        /year=2012
        /year=2014
        /year=2015
            /month=11   (data files)
Hive Data Model Contd.
• Buckets
- Data in each partition divided into buckets
- Based on a hash function of the column
- H(column) mod NumBuckets = bucket number
- Each bucket is stored as a file in partition directory
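A minimal sketch of declaring buckets (reusing the hypothetical Sales
schema from earlier):

-- Each row lands in bucket hash(sale_id) mod 32, so every partition
-- directory holds 32 bucket files
CREATE TABLE Sales_bucketed (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
CLUSTERED BY (sale_id) INTO 32 BUCKETS;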
Hive Data Units
• Databases: the namespace for tables.
• Tables: sets of records that have a similar schema.
• Partitions: logical separations of data based on the classification of
  the given information according to specific attributes. Once Hive has
  partitioned the data on a specified key, it assembles the records into
  the corresponding folders as the records are inserted.
• Buckets (Clusters): similar to partitions, but a hash function is used
  to segregate the data and determine the bucket into which each record
  should be placed.
• In Hive, a table is stored as a folder, a partition as a sub-directory,
  and a bucket as a file.
Architecture
External Interfaces – CLI, Web UI, JDBC and ODBC programming interfaces

Thrift Server – cross-language service framework

Metastore – metadata about the Hive tables and partitions

Driver – the brain of Hive! Compiler, optimizer and execution engine
Hive Thrift Server

• Framework for cross language services


• Server written in Java
• Support for clients written in different languages
- JDBC(java), ODBC(c++), php, perl, python scripts
Metastore

• System catalog which contains metadata about the Hive tables


• Stored in RDBMS/local fs. HDFS too slow(not optimized for random access)
• Objects of Metastore
 Database - Namespace of tables
 Table - list of columns, types, owner, storage, SerDes
 Partition – Partition specific column, Serdes and storage
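A minimal sketch of inspecting this metadata from the Hive shell (Sales
is the hypothetical table used earlier); these statements answer from
the metastore rather than scanning data:

SHOW DATABASES;              -- namespaces registered in the metastore
SHOW TABLES;                 -- tables in the current database
DESCRIBE FORMATTED Sales;    -- columns, owner, HDFS location, SerDe, storage format
SHOW PARTITIONS Sales;       -- partition values registered for the table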
Hive Driver

• Driver - Maintains the lifecycle of HiveQL statement


• Query Compiler – Compiles HiveQL in a DAG of map reduce tasks
• Executor - Executes the tasks plan generated by the compiler in proper
dependency order. Interacts with the underlying Hadoop instance
Compiler
• Converts HiveQL into a plan for execution
• Plans can include:
  - Metadata operations for DDL statements, e.g. CREATE
  - HDFS operations, e.g. LOAD
• Semantic Analyzer – checks schema information, type checking,
  implicit type conversion, column verification
• Optimizer – finds the best logical plan, e.g. combines multiple
  joins to reduce the number of MapReduce jobs, prunes columns
  early to minimize data transfer
• Physical plan generator – creates the DAG of MapReduce jobs
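As a sketch (against the hypothetical Sales table used earlier), Hive's
EXPLAIN statement prints the stage plan the compiler produces for a query:

-- Shows the DAG of stages (map/reduce stages, fetch stage) for this query
EXPLAIN
SELECT country, SUM(amount)
FROM Sales
WHERE year = 2012
GROUP BY country;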
What Is Hive?
• Developed by Facebook and a top-level Apache project
• A data warehousing infrastructure based on Hadoop
• Immediately makes data on a cluster available to non-Java
programmers via SQL like queries
• Provides HiveQL (HQL), a SQL-like query language
• Interprets HiveQL and generates MapReduce jobs that run on
the cluster
• Enables easy data summarization, ad-hoc reporting and
querying, and analysis of large volumes of data
What Hive Is Not
• Hive, like Hadoop, is designed for batch processing of large
datasets
• Not an OLTP or real-time system
• Latency and throughput are both high compared to a
traditional RDBMS
• Even when dealing with relatively small data (< 100 MB)
Data Hierarchy
• Hive is organised hierarchically into:
• Databases: namespaces that separate tables and other objects
• Tables: homogeneous units of data with the same schema
• Analogous to tables in an RDBMS
• Partitions: determine how the data is stored
• Allow efficient access to subsets of the data
• Buckets/clusters
• For sub-sampling within a partition
• Join optimization
HiveQL
• HiveQL / HQL provides the basic SQL-like operations:
• Select columns using SELECT
• Filter rows using WHERE
• JOIN between tables
• Evaluate aggregates using GROUP BY
• Store query results into another table
• Download results to a local directory (i.e., export from HDFS)
• Manage tables and queries with CREATE, DROP, and ALTER
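A short sketch combining these operations (table and column names are
hypothetical):

-- Filter, join, aggregate, and store the result in another table
CREATE TABLE top_customers AS
SELECT c.name, SUM(o.amount) AS total
FROM orders o
JOIN customers c ON (o.customer_id = c.id)
WHERE o.year = 2012
GROUP BY c.name;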
Primitive Data Types
Type                              Comments
TINYINT, SMALLINT, INT, BIGINT    1-, 2-, 4- and 8-byte integers
BOOLEAN                           TRUE / FALSE
FLOAT, DOUBLE                     Single- and double-precision real numbers
STRING                            Character string
TIMESTAMP                         Unix-epoch offset or datetime string
DECIMAL                           Arbitrary-precision decimal
BINARY                            Opaque; ignore these bytes
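A small sketch (hypothetical table) using several of these types, with
an explicit CAST between them:

CREATE TABLE transactions (
  txn_id   BIGINT,
  amount   DECIMAL(10,2),
  success  BOOLEAN,
  note     STRING,
  created  TIMESTAMP
);

-- Explicit conversion between primitive types
SELECT CAST(amount AS DOUBLE), CAST(txn_id AS STRING)
FROM transactions;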
Complex Data Types
Type    Comments
STRUCT  A collection of elements.
        If S is of type STRUCT {a INT, b INT}: S.a returns element a
MAP     Key-value tuple.
        If M is a map from 'group' to GID: M['group'] returns value of GID
ARRAY   Indexed list.
        If A is an array of elements ['a','b','c']: A[0] returns 'a'
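A minimal sketch (hypothetical table and columns) of declaring and
querying these complex types:

CREATE TABLE employees (
  name          STRING,
  salary        FLOAT,
  subordinates  ARRAY<STRING>,
  deductions    MAP<STRING, FLOAT>,
  address       STRUCT<street:STRING, city:STRING, zip:INT>
);

-- Element access for each complex type
SELECT subordinates[0],      -- first element of the ARRAY
       deductions['tax'],    -- value for key 'tax' in the MAP
       address.city          -- field city of the STRUCT
FROM employees;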
