Session 3.1

The document covers the Hadoop ecosystem, including Hive, Pig, Jasper Reports, machine learning with Spark, graph processing with Spark GraphX, stream processing concepts, and Spark Streaming. It outlines the topics covered in each session: data modeling in Hive, the Pig Latin language, connecting to MongoDB and Cassandra with Jasper Reports, machine learning algorithms in Spark MLlib, graph processing in GraphX, and stream processing architectures.


MODULE III HADOOP ECO SYSTEMS

Hadoop Eco systems: Hive – Architecture – data type – File format –


HQL – SerDe – User defined functions – Pig: Features – Anatomy – Pig
on Hadoop - Pig Latin overview – Data types – Running pig – Execution
modes of Pig – HDFS commands – Relational operators – Eval
Functions – Complex data type – Piggy Bank – User defined Functions
– Parameter substitution – Diagnostic operator. Jasper Report:
Introduction – Connecting to MongoDB – Connecting to Cassandra –
Introduction to Big Data machine learning with Spark: Introduction to
Spark MLlib, Linear Regression - Clustering - Collaborative filtering -
Association rule mining – Decision tree using Spark. Introduction to
Graph - Introduction to Spark GraphX, Introduction to Streams
Concepts - Stream Data Model and Architecture Introduction to Spark
Streaming - Kafka -Streaming Ecosystem.
Module 3 – HADOOP ECO SYSTEMS

Session  Topic

3.1   Hadoop Eco systems: Hive – Architecture – data type – File format

3.2   HQL – SerDe – User defined functions

3.3   Pig: Features – Anatomy – Pig on Hadoop – Pig Latin overview – Data
      types – Running Pig – Execution modes of Pig

3.4   HDFS commands – Relational operators – Eval Functions – Complex data
      type – Piggy Bank – User defined Functions – Parameter substitution –
      Diagnostic operator

3.5   Jasper Report: Introduction – Connecting to MongoDB – Connecting to
      Cassandra

3.6   Introduction to Big Data machine learning with Spark: Introduction to
      Spark MLlib, Linear Regression – Clustering – Collaborative filtering

3.7   Association rule mining – Decision tree using Spark

3.8   Introduction to Graph – Introduction to Spark GraphX

3.9   Introduction to Streams Concepts – Stream Data Model and Architecture

3.10  Introduction to Spark Streaming – Kafka – Streaming Ecosystem
MODULE 3 – HADOOP ECO SYSTEMS
Hadoop Eco systems: Hive – Architecture – data type – File format
– HQL – SerDe – User defined functions
Data Analysts with Hadoop
Challenges that Data Analysts faced
• Data Explosion
  - TBs of data generated every day
  Solution – HDFS to store the data and the Hadoop MapReduce
  framework to parallelize its processing
What is the catch?
  - Hadoop MapReduce is Java intensive
  - Thinking in the MapReduce paradigm can get tricky
Hive - A Warehousing Solution Over
a Map-Reduce Framework
… Enter Hive!
What is Hive?
• Hive is a data warehousing tool built on top of Hadoop and used to
  query structured data. Facebook created Hive to manage its
  ever-growing volumes of data. Hive makes use of the following:
  1. HDFS for storage
  2. MapReduce for execution
  3. An RDBMS to store metadata
• Hive is suitable for data warehousing applications that process
  batch jobs over huge, immutable data. Examples: analysis of web
  logs and application logs.
History
• 2007: Hive was born at Facebook to analyze their incoming log data
• 2008: Hive became an Apache Hadoop sub-project

Hive provides HQL (Hive Query Language), which is similar to SQL.
Hive compiles HQL queries into MapReduce jobs and then runs the jobs
on the Hadoop cluster.
Hive provides extensive data types, functions and file formats for
data summarization and analysis.
Hive Key Principles
HiveQL to MapReduce
[Diagram: a data analyst submits SELECT COUNT(1) FROM Sales; to the
Hive framework, which compiles it into a MapReduce job instance over
the Sales Hive table; map tasks emit (rowcount, 1) pairs that are
summed into (rowcount, N).]
Hive Data Model
Data in Hive is organized into:
• Tables
• Partitions
• Buckets
Hive Data Model Contd.
• Tables
- Analogous to relational tables
- Each table has a corresponding directory in HDFS
- Data serialized and stored as files within that directory
- Hive has a built-in default serialization which supports
  compression and lazy deserialization
- Users can specify custom serialization–deserialization schemes
  (SerDes), as in the sketch below
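A minimal sketch of specifying a SerDe at table-creation time (the table
and column names are hypothetical); here Hive's bundled OpenCSVSerde
replaces the default LazySimpleSerDe:

-- Parse comma-separated, quoted text files with the bundled CSV SerDe
CREATE TABLE web_logs (ip STRING, request STRING, status INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
STORED AS TEXTFILE;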
Hive Data Model Contd.
• Partitions
- Each table can be broken into partitions
- Partitions determine distribution of data within subdirectories
Example -
CREATE TABLE Sales (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
So each partition will be split out into different folders like
Sales/country=US/year=2012/month=12
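As a sketch (using the Sales table above), filtering on the partition
columns lets Hive read only the matching subdirectory instead of
scanning the whole table:

-- Only the folder Sales/country=US/year=2012/month=12 is read
SELECT SUM(amount)
FROM Sales
WHERE country = 'US' AND year = 2012 AND month = 12;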
Hierarchy of Hive Partitions

/hivebase/Sales
    /country=US
        /year=2012
            /month=11   (data files)
            /month=12   (data files)
    /country=CANADA
        /year=2012
        /year=2014
        /year=2015
            /month=11   (data files)
Hive Data Model Contd.
• Buckets
- Data in each partition divided into buckets
- Based on a hash function of the column
- H(column) mod NumBuckets = bucket number
- Each bucket is stored as a file in partition directory
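A minimal sketch of declaring buckets (reusing the hypothetical Sales
schema from earlier):

-- Each row lands in bucket hash(sale_id) mod 32, so every partition
-- directory holds 32 bucket files
CREATE TABLE Sales_bucketed (sale_id INT, amount FLOAT)
PARTITIONED BY (country STRING, year INT, month INT)
CLUSTERED BY (sale_id) INTO 32 BUCKETS;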
Hive Data Units
• Databases: the namespace for tables.
• Tables: sets of records that have a similar schema.
• Partitions: logical separations of data based on the classification of
  the given information according to specific attributes. Once Hive has
  partitioned the data on a specified key, it assembles the records into
  the corresponding folders as the records are inserted.
• Buckets (Clusters): similar to partitions, but a hash function is used
  to segregate the data and determine the bucket into which each record
  should be placed.
• In Hive, a table is stored as a folder, a partition as a sub-directory,
  and a bucket as a file.
Architecture
External Interfaces – CLI, Web UI, JDBC and ODBC programming interfaces

Thrift Server – cross-language service framework

Metastore – metadata about the Hive tables and partitions

Driver – the brain of Hive! Compiler, optimizer and execution engine
Hive Thrift Server

• Framework for cross language services


• Server written in Java
• Support for clients written in different languages
- JDBC(java), ODBC(c++), php, perl, python scripts
Metastore

• System catalog which contains metadata about the Hive tables


• Stored in RDBMS/local fs. HDFS too slow(not optimized for random access)
• Objects of Metastore
 Database - Namespace of tables
 Table - list of columns, types, owner, storage, SerDes
 Partition – Partition specific column, Serdes and storage
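A minimal sketch of inspecting this metadata from the Hive shell (Sales
is the hypothetical table used earlier); these statements answer from
the metastore rather than scanning data:

SHOW DATABASES;              -- namespaces registered in the metastore
SHOW TABLES;                 -- tables in the current database
DESCRIBE FORMATTED Sales;    -- columns, owner, HDFS location, SerDe, storage format
SHOW PARTITIONS Sales;       -- partition values registered for the table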
Hive Driver

• Driver - Maintains the lifecycle of HiveQL statement


• Query Compiler – Compiles HiveQL in a DAG of map reduce tasks
• Executor - Executes the tasks plan generated by the compiler in proper
dependency order. Interacts with the underlying Hadoop instance
Compiler
• Converts HiveQL into a plan for execution
• Plans can include:
  - Metadata operations for DDL statements, e.g. CREATE
  - HDFS operations, e.g. LOAD
• Semantic Analyzer – checks schema information, type checking,
  implicit type conversion, column verification
• Optimizer – finds the best logical plan, e.g. combines multiple
  joins to reduce the number of MapReduce jobs, prunes columns
  early to minimize data transfer
• Physical plan generator – creates the DAG of MapReduce jobs
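As a sketch (against the hypothetical Sales table used earlier), Hive's
EXPLAIN statement prints the stage plan the compiler produces for a query:

-- Shows the DAG of stages (map/reduce stages, fetch stage) for this query
EXPLAIN
SELECT country, SUM(amount)
FROM Sales
WHERE year = 2012
GROUP BY country;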
What Is Hive?
• Developed by Facebook and a top-level Apache project
• A data warehousing infrastructure based on Hadoop
• Immediately makes data on a cluster available to non-Java
programmers via SQL like queries
• Provides HiveQL (HQL), a SQL-like query language
• Interprets HiveQL and generates MapReduce jobs that run on
the cluster
• Enables easy data summarization, ad-hoc reporting and
querying, and analysis of large volumes of data
What Hive Is Not
• Hive, like Hadoop, is designed for batch processing of large
datasets
• Not an OLTP or real-time system
• Latency and throughput are both high compared to a
traditional RDBMS
• Even when dealing with relatively small data (< 100 MB)
Data Hierarchy
• Hive is organised hierarchically into:
• Databases: namespaces that separate tables and other objects
• Tables: homogeneous units of data with the same schema
• Analogous to tables in an RDBMS
• Partitions: determine how the data is stored
• Allow efficient access to subsets of the data
• Buckets/clusters
• For sub-sampling within a partition
• Join optimization
HiveQL
• HiveQL / HQL provides the basic SQL-like operations:
• Select columns using SELECT
• Filter rows using WHERE
• JOIN between tables
• Evaluate aggregates using GROUP BY
• Store query results into another table
• Download results to a local directory (i.e., export from HDFS)
• Manage tables and queries with CREATE, DROP, and ALTER
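A short sketch combining these operations (table and column names are
hypothetical):

-- Filter, join, aggregate, and store the result in another table
CREATE TABLE top_customers AS
SELECT c.name, SUM(o.amount) AS total
FROM orders o
JOIN customers c ON (o.customer_id = c.id)
WHERE o.year = 2012
GROUP BY c.name;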
Primitive Data Types
Type                              Comments
TINYINT, SMALLINT, INT, BIGINT    1-, 2-, 4- and 8-byte integers
BOOLEAN                           TRUE / FALSE
FLOAT, DOUBLE                     Single- and double-precision real numbers
STRING                            Character string
TIMESTAMP                         Unix-epoch offset or datetime string
DECIMAL                           Arbitrary-precision decimal
BINARY                            Opaque; ignore these bytes
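A small sketch (hypothetical table) using several of these types, with
an explicit CAST between them:

CREATE TABLE transactions (
  txn_id   BIGINT,
  amount   DECIMAL(10,2),
  success  BOOLEAN,
  note     STRING,
  created  TIMESTAMP
);

-- Explicit conversion between primitive types
SELECT CAST(amount AS DOUBLE), CAST(txn_id AS STRING)
FROM transactions;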
Complex Data Types
Type    Comments
STRUCT  A collection of elements.
        If S is of type STRUCT {a INT, b INT}: S.a returns element a
MAP     Key-value tuple.
        If M is a map from 'group' to GID: M['group'] returns value of GID
ARRAY   Indexed list.
        If A is an array of elements ['a','b','c']: A[0] returns 'a'
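A minimal sketch (hypothetical table and columns) of declaring and
querying these complex types:

CREATE TABLE employees (
  name          STRING,
  salary        FLOAT,
  subordinates  ARRAY<STRING>,
  deductions    MAP<STRING, FLOAT>,
  address       STRUCT<street:STRING, city:STRING, zip:INT>
);

-- Element access for each complex type
SELECT subordinates[0],      -- first element of the ARRAY
       deductions['tax'],    -- value for key 'tax' in the MAP
       address.city          -- field city of the STRUCT
FROM employees;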
