Module 5_data analytics
Unit V
Apache Pig
Apache Pig is a high-level platform designed for analyzing large data sets using a simple
scripting language called Pig Latin. It runs on top of Apache Hadoop and provides an
abstraction over MapReduce, making it easier for developers to work with big data without
writing complex MapReduce programs. Pig is particularly valuable for ETL (Extract,
Transform, Load) operations and data pipeline creation.
Execution Modes
Pig offers two primary execution modes to accommodate different use cases:
1. Local Mode: In this mode, Pig runs on a single machine, making it ideal for testing
and development with smaller datasets. All files are processed from the local file
system.
2. MapReduce Mode: This is the production mode where Pig runs on a Hadoop cluster,
processing data from HDFS (Hadoop Distributed File System).
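The mode is chosen with the -x flag when launching Pig. A minimal sketch, assuming a script named wordcount.pig exists in the working directory (the script name is hypothetical):

```
# Run Pig on a single machine against the local file system (testing/development)
pig -x local wordcount.pig

# Run Pig on the Hadoop cluster, reading input from HDFS (production)
pig -x mapreduce wordcount.pig
```

Omitting -x defaults to MapReduce mode, so local mode must be requested explicitly during development.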
Comparison with Traditional Databases
Unlike traditional RDBMS systems that require structured data and predefined schemas, Pig
offers several advantages:
Schema-on-read flexibility allows data structure to be defined when querying
Native support for complex data types like bags, tuples, and maps
Built-in support for ETL operations and data transformations
Ability to handle semi-structured and unstructured data effectively
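These complex types can be declared directly in a LOAD schema rather than being forced into flat columns. A small illustrative sketch; the file name and field names here are hypothetical:

```
-- Each input line holds a user id, a tuple of coordinates,
-- a bag of visited pages, and a map of profile attributes
users = LOAD 'users.txt' AS (
    id:int,
    location:tuple(lat:double, lon:double),
    pages:bag{t:(url:chararray)},
    profile:map[chararray]
);
```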
Pig Latin and Data Processing Operators
Pig Latin provides a rich set of operators for data manipulation:
LOAD/STORE: For reading and writing data
FILTER: For selecting specific records
GROUP: For aggregating data
JOIN: For combining datasets
FOREACH: For transforming data records
DISTINCT: For removing duplicates
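Chained together, these operators form a typical ETL pipeline. A hedged sketch, assuming a tab-delimited input file sales.txt with product and amount fields (both names are illustrative):

```
sales   = LOAD 'sales.txt' AS (product:chararray, amount:double);
big     = FILTER sales BY amount > 100.0;    -- keep large orders only
grouped = GROUP big BY product;              -- one group per product
totals  = FOREACH grouped GENERATE
              group AS product,
              SUM(big.amount) AS total;      -- aggregate within each group
STORE totals INTO 'totals_out';              -- write results back out
```

Each statement names an intermediate relation, and Pig only builds and runs the underlying MapReduce jobs when a STORE (or DUMP) is reached.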
Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data
summarization, query, and analysis capabilities. It allows SQL developers to write familiar
queries while processing data stored in a distributed environment.
Hive Architecture Components
The Hive architecture consists of several key components:
1. Hive Shell: Command-line interface for executing HiveQL queries
2. Hive Services: Including HiveServer2 for client connections and query processing
3. Hive Metastore: Central repository storing metadata about tables, columns, partitions
HiveQL and Data Operations
HiveQL closely resembles SQL but with additional features for big data processing. Here's an
example of creating and querying a table:
-- Create a table for customer data
CREATE TABLE customers (
    customer_id   INT,
    name          STRING,
    email         STRING,
    purchase_date DATE
)
PARTITIONED BY (country STRING);
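Once populated, the table can be queried with familiar SQL syntax, and filtering on the partition column lets Hive prune partitions instead of scanning the whole table. A sketch; the country code and date used are illustrative values:

```
-- Count recent customers, scanning only the 'IN' partition
SELECT COUNT(*) AS num_customers
FROM customers
WHERE country = 'IN'
  AND purchase_date >= '2023-01-01';
```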