Unit 5

The document outlines key frameworks in the Hadoop ecosystem for big data applications, including Pig, Hive, HBase, and IBM Big Data tools. Pig is a high-level platform for processing large datasets with an interactive shell and data flow language, while Hive provides SQL-like access to data and converts queries into MapReduce jobs. HBase is a NoSQL database designed for real-time read/write operations, and IBM offers a suite of tools for big data storage, analysis, and visualization.

Hadoop Ecosystem Frameworks: Applications on Big Data

1. Pig
Introduction to Pig

• High-level platform for processing large data sets.


• Originally developed at Yahoo; now an Apache project that runs on Hadoop.

Execution Modes

• Local Mode: Runs on a single machine.


• MapReduce Mode: Distributes processing across Hadoop cluster.

Comparison with Databases

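• Pig does not require a predefined schema (one can optionally be declared at load time), whereas relational databases enforce a fixed schema.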


• Suitable for semi-structured data.

Grunt

• Interactive shell for Pig.


• Supports command execution and script testing.

Pig Latin

• Data flow language.


• Includes constructs like LOAD, FILTER, FOREACH, JOIN.
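The constructs above chain into a step-by-step data flow. As a rough analogy in plain Python (this is not Pig Latin itself; the records, field names, and helper functions are invented for illustration), each function below mirrors one Pig Latin statement:

```python
# Pure-Python analogy of a Pig Latin data flow; all data and names are hypothetical.

def load(rows, schema):
    """LOAD: attach field names to raw tuples."""
    return [dict(zip(schema, row)) for row in rows]

def filter_by(relation, predicate):
    """FILTER: keep only the records that satisfy the predicate."""
    return [rec for rec in relation if predicate(rec)]

def foreach(relation, generate):
    """FOREACH ... GENERATE: project or transform every record."""
    return [generate(rec) for rec in relation]

def join(left, right, key):
    """JOIN: inner equi-join of two relations on a shared field."""
    index = {}
    for rec in right:
        index.setdefault(rec[key], []).append(rec)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

users = load([(1, "ana", 34), (2, "bo", 17), (3, "cy", 52)], ["id", "name", "age"])
orders = load([(1, 9.5), (3, 20.0), (3, 5.0)], ["id", "amount"])

adults = filter_by(users, lambda u: u["age"] >= 18)
joined = join(adults, orders, "id")
result = foreach(joined, lambda r: (r["name"], r["amount"]))
print(result)
```

Each intermediate name (adults, joined, result) plays the role of a Pig relation: one statement produces it and the next consumes it.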

User Defined Functions (UDFs)

• Extend Pig’s capabilities using Java, Python, or other languages.
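A minimal sketch of what a Python UDF might look like. The function name and schema string are hypothetical, and the fallback decorator exists only so the sketch runs without Pig installed; it assumes the `outputSchema` decorator that Pig supplies to Python UDF scripts.

```python
try:
    # Assumption: Pig's helper module for CPython (streaming) UDFs is importable.
    from pig_util import outputSchema
except ImportError:
    # Stand-in so the sketch runs outside Pig; it only records the schema string.
    def outputSchema(schema):
        def wrap(func):
            func.output_schema = schema
            return func
        return wrap

@outputSchema("name:chararray")
def normalize_name(raw):
    """Collapse repeated whitespace and title-case a name; pass nulls through."""
    if raw is None:
        return None
    return " ".join(raw.split()).title()

print(normalize_name("  aDA   lovelace "))
```

In a Pig script, a file like this would typically be made available with a REGISTER statement (for example `REGISTER 'udfs.py' USING jython AS udfs;`, where the file name and alias are placeholders) and then called inside a FOREACH ... GENERATE.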

Data Processing Operators

• Filtering (FILTER), grouping (GROUP), joining (JOIN), sorting (ORDER BY), etc.
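
To make the grouping and sorting operators concrete, here is a small plain-Python analogy (not Pig code; the sales data is made up). The comments name the Pig Latin statement each step stands in for.

```python
from collections import defaultdict

# Hypothetical input relation: (region, amount)
sales = [("north", 120), ("south", 80), ("north", 30), ("east", 200)]

# GROUP sales BY region: collect all records sharing a key into one bag.
grouped = defaultdict(list)
for region, amount in sales:
    grouped[region].append(amount)

# FOREACH grouped GENERATE group, SUM(amount): one output record per group.
totals = [(region, sum(amounts)) for region, amounts in grouped.items()]

# ORDER totals BY total DESC.
ranked = sorted(totals, key=lambda t: t[1], reverse=True)
print(ranked)
```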

2. Hive
Apache Hive Architecture

• Built on Hadoop to provide SQL-like access to data.


• Converts HiveQL queries into MapReduce jobs (newer Hive versions can also run on Tez or Spark).
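
As an illustration of that translation, a query such as `SELECT dept, COUNT(*) FROM emp GROUP BY dept` can be pictured as three phases. The sketch below imitates them in plain Python; the table, column names, and function names are hypothetical, and Hive's real query plans are considerably more involved.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of an "emp" table: (dept, name)
rows = [("eng", "ana"), ("ops", "bo"), ("eng", "cy"), ("eng", "di")]

def map_phase(rows):
    """Map: emit (grouping key, 1) for every input row."""
    for dept, _name in rows:
        yield dept, 1

def shuffle_phase(pairs):
    """Shuffle: sort by key and gather the values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    """Reduce: aggregate each key's values (COUNT(*) becomes a sum of ones)."""
    for key, values in grouped:
        yield key, sum(values)

counts = dict(reduce_phase(shuffle_phase(map_phase(rows))))
print(counts)
```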

Hive Components

• Hive Shell
• Hive Services (Driver, Compiler, Execution Engine)
• Metastore: Stores schema and metadata.

Comparison with Traditional Databases

• Schema-on-read vs schema-on-write.
• Optimized for batch processing, not OLTP.
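
The schema-on-read idea can be sketched in plain Python. This is an analogy under stated assumptions, not Hive code: the raw text, schema, and function names are invented. The raw data is stored untouched and types are applied only when it is read, with unparseable values surfacing as nulls; the schema-on-write path instead rejects bad rows at load time.

```python
import csv
import io

# Hypothetical raw file contents and schema.
RAW = "1,ana,34\n2,bo,not_a_number\n3,cy,52\n"
SCHEMA = [("id", int), ("name", str), ("age", int)]

def cast(value, typ):
    """Return the typed value, or None when the text does not fit the type."""
    try:
        return typ(value)
    except ValueError:
        return None

def read_with_schema(raw_text, schema):
    """Schema-on-read: apply types while scanning; bad fields become None."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(v, typ) for (name, typ), v in zip(schema, fields)}

def write_with_schema(table, fields, schema):
    """Schema-on-write: validate before storing; raises ValueError on bad data."""
    row = {name: typ(v) for (name, typ), v in zip(schema, fields)}
    table.append(row)

rows = list(read_with_schema(RAW, SCHEMA))

table = []
try:
    write_with_schema(table, ["2", "bo", "not_a_number"], SCHEMA)
    rejected = False
except ValueError:
    rejected = True

print(rows[1], rejected)
```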

HiveQL

• SQL-like language to write queries.

Tables and Querying

• Supports managed and external tables.


• SQL-style SELECT queries for data retrieval (HiveQL is close to, but not identical to, standard SQL).
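
The practical difference between managed and external tables shows up when a table is dropped. The toy model below (a hypothetical `ToyMetastore` class working on temporary local directories, not Hive's API) sketches the usual behavior: dropping a managed table removes its data, while dropping an external table removes only the metadata.

```python
import shutil
import tempfile
from pathlib import Path

class ToyMetastore:
    """Hypothetical stand-in for a metastore: maps table names to data locations."""

    def __init__(self):
        self.tables = {}  # table name -> (data directory, is_external)

    def create_table(self, name, location, external=False):
        self.tables[name] = (Path(location), external)

    def drop_table(self, name):
        location, external = self.tables.pop(name)
        if not external:
            shutil.rmtree(location)  # managed table: the data is deleted too

def make_data_dir():
    """Create a temporary directory holding one small data file."""
    directory = Path(tempfile.mkdtemp())
    (directory / "part-00000").write_text("1,ana\n")
    return directory

metastore = ToyMetastore()
managed_dir, external_dir = make_data_dir(), make_data_dir()
metastore.create_table("managed_t", managed_dir)
metastore.create_table("external_t", external_dir, external=True)

metastore.drop_table("managed_t")
metastore.drop_table("external_t")

managed_kept = managed_dir.exists()
external_kept = external_dir.exists()
print(managed_kept, external_kept)

shutil.rmtree(external_dir)  # clean up the temporary directory
```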

User Defined Functions (UDFs)

• Customize operations like filters and transformations.

Advanced Query Features

• Sorting, aggregation, joins, subqueries.


• MapReduce integration.

3. HBase
HBase Concepts

• NoSQL database modeled after Google’s Bigtable.


• Column-family-oriented storage: columns are grouped into families that are stored together on disk.

Clients and Examples

• Java API, REST, Thrift clients.

HBase vs RDBMS

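• HBase offers flexible columns within predefined column families, horizontal scalability, and real-time read/write with atomicity per row; an RDBMS offers a fixed structured schema and full ACID transactions.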

Advanced Usage

• Time-stamped cell versioning, compression, automatic sharding of tables into regions.
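
Time-stamped versioning can be pictured as each cell holding a short, bounded history. The class below is a toy in-memory model, not the HBase client API; the class name, method names, and the "newest version at or before a timestamp" read rule are simplifying assumptions.

```python
class ToyVersionedTable:
    """Toy model: each (row, column) cell keeps its newest max_versions values."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.cells = {}  # (row, "family:qualifier") -> [(timestamp, value)], newest first

    def put(self, row, column, value, ts):
        versions = self.cells.setdefault((row, column), [])
        # A write with an existing timestamp replaces that version.
        versions[:] = [v for v in versions if v[0] != ts]
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)
        del versions[self.max_versions:]  # prune versions beyond the limit

    def get(self, row, column, ts=None):
        """Return the newest value, or the newest one written at or before ts."""
        for version_ts, value in self.cells.get((row, column), []):
            if ts is None or version_ts <= ts:
                return value
        return None

table = ToyVersionedTable(max_versions=2)
table.put("row1", "info:status", "a", ts=100)
table.put("row1", "info:status", "b", ts=200)
table.put("row1", "info:status", "c", ts=300)

latest = table.get("row1", "info:status")
as_of_250 = table.get("row1", "info:status", ts=250)
as_of_150 = table.get("row1", "info:status", ts=150)  # version 100 was pruned
print(latest, as_of_250, as_of_150)
```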

Schema Design

• Design based on access patterns, not normalization.
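
One common way to design for an access pattern is through the row key, since rows are stored sorted by key. The sketch below assumes a hypothetical pattern ("fetch the newest events for one device") and a made-up key layout of device ID plus reversed timestamp, so the newest rows sort first and a short prefix scan returns them. It uses a sorted Python list in place of a real table.

```python
import bisect

MAX_TS = 10**13 - 1  # assumed upper bound for 13-digit millisecond timestamps

def row_key(device_id, ts_millis):
    """Build a key whose lexicographic order puts the newest event first."""
    reversed_ts = MAX_TS - ts_millis
    return f"{device_id}#{reversed_ts:013d}"

def prefix_scan(sorted_keys, prefix, limit):
    """Return up to `limit` keys that start with `prefix`, in key order."""
    start = bisect.bisect_left(sorted_keys, prefix)
    found = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix) or len(found) == limit:
            break
        found.append(key)
    return found

events = [("dev1", 1000), ("dev2", 1500), ("dev1", 3000), ("dev1", 2000)]
keys = sorted(row_key(device, ts) for device, ts in events)

newest_keys = prefix_scan(keys, "dev1#", limit=2)
newest_ts = [MAX_TS - int(key.split("#")[1]) for key in newest_keys]
print(newest_ts)
```

A normalized design would instead need a sort or a join at read time; here the key order already answers the query.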

Advanced Indexing

• Custom secondary indexes via Phoenix or Coprocessors.
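
The idea behind a secondary index is a second lookup structure, keyed by a column value, that is kept in step with every write. The toy class below sketches that bookkeeping in plain Python. It is not the Phoenix or Coprocessor API, which do this work on the server side; the class, column, and row names are hypothetical.

```python
class ToyIndexedTable:
    """Toy table with one secondary index on a chosen column."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.data = {}   # row key -> record (dict of column -> value)
        self.index = {}  # indexed value -> set of row keys

    def put(self, row_key, record):
        old = self.data.get(row_key)
        if old is not None:
            # Remove the stale index entry before writing the new one.
            self.index[old[self.indexed_column]].discard(row_key)
        self.data[row_key] = record
        self.index.setdefault(record[self.indexed_column], set()).add(row_key)

    def find_by(self, value):
        """Look up row keys by indexed value without scanning every row."""
        return sorted(self.index.get(value, ()))

users = ToyIndexedTable(indexed_column="email")
users.put("u1", {"email": "a@example.org"})
users.put("u2", {"email": "b@example.org"})
users.put("u1", {"email": "c@example.org"})  # update moves u1 in the index

print(users.find_by("a@example.org"), users.find_by("c@example.org"))
```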

Zookeeper

• ZooKeeper is a coordination service for distributed systems.


• HBase uses it to track which region servers are alive and to elect the active master.

4. IBM Big Data Tools


IBM Big Data Strategy

• End-to-end platform for big data storage, analysis, and visualization.

InfoSphere

• Platform for information integration and governance.

BigInsights

• Enterprise Hadoop solution with additional tooling.

BigSheets

• Spreadsheet-like interface for analyzing big data.


Big SQL

• SQL engine to query Hadoop data using ANSI-compliant SQL.
