
BIG DATA & ANALYTICS (ELECTIVE)

Unit V

Apache Pig
Apache Pig is a high-level platform designed for analyzing large data sets using a simple
scripting language called Pig Latin. It runs on top of Apache Hadoop and provides an
abstraction over MapReduce, making it easier for developers to work with big data without
writing complex MapReduce programs. Pig is particularly valuable for ETL (Extract,
Transform, Load) operations and data pipeline creation.
Execution Modes
Pig offers two primary execution modes to accommodate different use cases:
1. Local Mode: In this mode, Pig runs on a single machine, making it ideal for testing
and development with smaller datasets. All files are processed from the local file
system.
2. MapReduce Mode: This is the production mode where Pig runs on a Hadoop cluster,
processing data from HDFS (Hadoop Distributed File System).
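For example, the same script can be launched in either mode from the command line; MapReduce mode is the default, and the script name below is only a placeholder:

# Local mode: process files from the local file system (quick testing)
pig -x local sales_report.pig
# MapReduce mode: run on the Hadoop cluster against data in HDFS
pig -x mapreduce sales_report.pig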
Comparison with Traditional Databases
Unlike traditional RDBMSs, which require structured data and predefined schemas, Pig offers several advantages:
• Schema-on-read flexibility: the data structure can be defined at query time
• Native support for complex data types such as bags, tuples, and maps
• Built-in support for ETL operations and data transformations
• Ability to handle semi-structured and unstructured data effectively
Pig Latin and Data Processing Operators
Pig Latin provides a rich set of operators for data manipulation (a short example script follows this list):
• LOAD/STORE: For reading and writing data
• FILTER: For selecting specific records
• GROUP: For aggregating data
• JOIN: For combining datasets
• FOREACH: For transforming data records
• DISTINCT: For removing duplicates
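The following sketch combines several of these operators in a single Pig Latin script; the input path, field names, and delimiter are illustrative assumptions, not taken from the original notes:

-- Load raw sales records (path and schema assumed for illustration)
sales = LOAD '/data/sales.csv' USING PigStorage(',')
        AS (customer_id:int, country:chararray, amount:double);
-- Keep only high-value records
big_sales = FILTER sales BY amount > 100.0;
-- Group by country and compute per-country aggregates
by_country = GROUP big_sales BY country;
totals = FOREACH by_country GENERATE group AS country,
                                     COUNT(big_sales) AS order_count,
                                     SUM(big_sales.amount) AS revenue;
-- Write the result back to HDFS
STORE totals INTO '/data/sales_by_country';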
Apache Hive
Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data
summarization, query, and analysis capabilities. It allows SQL developers to write familiar
queries while processing data stored in a distributed environment.
Hive Architecture Components
The Hive architecture consists of several key components:
1. Hive Shell: Command-line interface for executing HiveQL queries
2. Hive Services: Including HiveServer2 for client connections and query processing
3. Hive Metastore: Central repository storing metadata about tables, columns, partitions
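For instance, a client can connect to HiveServer2 with the Beeline shell; the host and port below are common defaults, assumed here for illustration:

# Connect to HiveServer2 over JDBC and run a quick query
beeline -u jdbc:hive2://localhost:10000 -e "SHOW TABLES;"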
HiveQL and Data Operations
HiveQL closely resembles SQL but adds features for big data processing. Here's an example of creating and querying a partitioned table:
-- Create a partitioned table for customer data
CREATE TABLE customers (
  customer_id INT,
  name STRING,
  email STRING,
  purchase_amount DECIMAL(10,2),
  purchase_date DATE
)
PARTITIONED BY (country STRING);

-- Query to analyze customer purchases by country
SELECT country,
       COUNT(*) AS customer_count,
       AVG(purchase_amount) AS avg_purchase
FROM customers
GROUP BY country
HAVING COUNT(*) > 1000;
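The customers table above is partitioned by country; for very large tables, Hive can additionally bucket the rows within each partition. A sketch of a bucketed variant follows (table layout assumed for illustration):

-- Bucketed variant of the same table: rows are hashed into 16 buckets by customer_id
CREATE TABLE customers_bucketed (
  customer_id INT,
  name STRING,
  email STRING,
  purchase_amount DECIMAL(10,2),
  purchase_date DATE
)
PARTITIONED BY (country STRING)
CLUSTERED BY (customer_id) INTO 16 BUCKETS;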

Comparison with Traditional Databases
Hive differs from traditional databases in several ways:
• Designed for large-scale data processing rather than transaction processing
• Schema-on-read approach allows flexible data handling
• Built-in support for Hadoop ecosystem integration
• Partitioning and bucketing features for optimizing large dataset queries
Apache HBase
HBase is a distributed, scalable, big data store designed for random, real-time read/write
access to large datasets. It's modeled after Google's BigTable and runs on top of HDFS.
Key Concepts
1. Tables: Data is organized into tables
2. Column Families: Columns are grouped into column families
3. Regions: Tables are horizontally split into regions
4. Row Keys: Each row has a unique identifier
Example of HBase data modeling:
# Create a table with two column families
create 'users', 'profile', 'activity'
# Insert data for one row
put 'users', 'user123', 'profile:name', 'John Doe'
put 'users', 'user123', 'activity:last_login', '2025-02-14'
# Retrieve the row
get 'users', 'user123'
HBase vs RDBMS
Key differences include:
• Flexible schema: only column families are fixed up front; columns can vary from row to row
• Automatic sharding: tables are split into regions that are distributed across servers
• Built for horizontal scalability
• Optimized for high-throughput read/write operations
Data Analytics with R and Machine Learning
Introduction to Machine Learning
Machine learning enables systems to learn from data without being explicitly programmed.
It's particularly valuable for discovering patterns and making predictions from large datasets.
Supervised Learning
In supervised learning, algorithms learn from labeled training data, where each example is paired with a known output. Common applications include (a brief R sketch follows this list):
• Classification: assigning records to discrete categories (for example, spam detection)
• Regression: predicting continuous values (for example, price or demand forecasting)
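A minimal sketch in R (not from the original notes), using the base lm() function and the built-in mtcars dataset to learn from labeled examples:

# Fit a linear regression predicting fuel efficiency (mpg) from car weight (wt)
model <- lm(mpg ~ wt, data = mtcars)
summary(model)                                   # inspect the fitted coefficients
predict(model, newdata = data.frame(wt = 3.0))   # predict for a new, unseen example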
Unsupervised Learning
Unsupervised learning finds hidden patterns in unlabeled data. Common techniques include (see the sketch after this list):
• Clustering: Grouping similar data points
• Dimensionality Reduction: Reducing data complexity while preserving important features
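A minimal sketch of both techniques in base R, using the built-in iris measurements as stand-in unlabeled data (an assumption for illustration):

features <- iris[, 1:4]                   # numeric columns only; species labels are ignored
set.seed(42)                              # make the clustering reproducible
# Clustering: group the observations into 3 clusters with k-means
clusters <- kmeans(features, centers = 3)
table(clusters$cluster)                   # sizes of the discovered groups
# Dimensionality reduction: principal component analysis
pca <- prcomp(features, scale. = TRUE)
summary(pca)                              # variance explained by each component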
Collaborative Filtering
Collaborative filtering is used in recommendation systems to predict user preferences based on similarities between users or items. Common approaches include (a short R sketch follows this list):
• User-based: Finding similar users and recommending items they liked
• Item-based: Recommending items similar to those the user already likes

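A minimal item-based sketch in R; the user-item ratings matrix below is made up for illustration, and items are compared by the cosine similarity of their rating columns:

# Rows are users, columns are items; 0 means the item was not rated
ratings <- matrix(c(5, 3, 0,
                    4, 0, 4,
                    1, 1, 5),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("u1", "u2", "u3"), c("i1", "i2", "i3")))
# Cosine similarity between two rating vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# How similar are items i1 and i3, given how users rated them?
cosine(ratings[, "i1"], ratings[, "i3"])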