
801 – BIG DATA

Unit -1
Introduction to Big Data
What is Big Data?
Big Data refers to massive volumes of data that are generated, stored, and
analyzed for insights to improve decision-making. This data can be
structured, semi-structured, or unstructured and is too complex to be
processed using traditional data management tools. Big Data is collected
from multiple sources, including social media, sensors, IoT devices, financial
transactions, healthcare records, and digital applications.

The growth of digital technology has significantly contributed to the explosion of data. Companies, governments, and research institutions use Big Data analytics to enhance productivity, improve customer experiences, optimize business operations, and drive innovation.

Importance of Big Data

The significance of Big Data lies in its ability to uncover hidden patterns,
correlations, and insights that were previously inaccessible due to
computational limitations. Organizations use Big Data to:

 Enhance Business Decisions – Helps businesses understand market trends, customer behavior, and operational performance.

 Improve Healthcare – Used for disease prediction, personalized medicine, and efficient patient care.

 Boost Customer Experience – Enables businesses to deliver personalized recommendations and services.

 Detect Fraud & Security Threats – Helps financial institutions and cybersecurity experts identify fraudulent activities.

 Optimize Supply Chain & Logistics – Enables better inventory management, demand forecasting, and transportation efficiency.

Future of Big Data

The future of Big Data is driven by advancements in AI, cloud computing, edge computing, and blockchain technology. Emerging trends include:

 AI & Machine Learning – Automating data-driven decision-making.

 Edge Computing – Processing data closer to the source to reduce latency.

 Blockchain & Big Data – Enhancing security and transparency in data transactions.

Characteristics of Big Data (5Vs of Big Data)


Big Data is defined by five primary characteristics, known as the 5Vs:

1. Volume (Data Size and Scale)

 The most distinguishing feature of Big Data is its enormous size.

 Data is collected from various sources such as social media, IoT sensors, transaction logs, and online activity.

 Traditional database systems struggle to handle petabytes (PB) and exabytes (EB) of data efficiently.

Example:

 Facebook generates 4+ petabytes of data daily from user activities, posts, comments, and likes.

 A single Boeing 787 aircraft generates 500GB of data per flight from
sensors monitoring engine performance.

2. Velocity (Speed of Data Generation and Processing)

 Big Data is generated at an incredibly high speed and needs to be processed in real-time.

 Businesses must analyze data streams instantly to make timely decisions (e.g., fraud detection, stock market analysis).

Example:

 Stock trading platforms process millions of transactions per second, requiring real-time analytics to detect anomalies.

 Social media platforms such as Twitter generate over 500 million tweets per day, which need to be processed for trend analysis and sentiment detection.

3. Variety (Different Types of Data)

 Data comes in different formats:

o Structured data (organized, relational databases).

o Semi-structured data (XML, JSON, log files).

o Unstructured data (images, videos, social media posts, sensor data).

 Handling and integrating these diverse formats is a challenge.

Example:

 A single e-commerce transaction includes:

o Structured Data: Customer ID, transaction amount, payment details.

o Semi-structured Data: JSON/XML containing order details.

o Unstructured Data: Customer reviews, images, and voice support calls.

4. Veracity (Data Quality and Reliability)

 The accuracy and trustworthiness of data are critical, as poor-quality data can lead to incorrect decisions.

 Issues such as missing values, inconsistencies, and noise must be handled through data cleansing and preprocessing.

Example:
 In fraud detection, financial institutions filter out false positives
(legitimate transactions flagged as fraudulent) by analyzing spending
patterns and customer behavior.

 Fake news on social media platforms requires verification to prevent misinformation.

5. Value (Extracting Meaningful Insights from Data)

 The main purpose of Big Data is to extract useful business insights that
improve decision-making.

 Organizations invest in analytics, AI, and machine learning to derive value from data.

Example:

 Netflix analyzes users' watch history to provide personalized recommendations, improving user engagement and retention.

 Retail stores analyze shopping patterns to offer targeted promotions and optimize inventory.

Types of Big Data


Big Data can be classified into three main categories:

1. Structured Data

 Data that follows a predefined schema and is stored in relational databases.

 Can be easily searched and analyzed using SQL queries.

Examples:

 Customer databases (ID, Name, Age, Address).

 Bank transaction records.

 Inventory management systems.

2. Unstructured Data
 Data that does not have a specific format, making it difficult to store
and analyze using traditional tools.

 Requires advanced AI/ML models for processing.

Examples:

 Social media posts (tweets, Facebook comments).

 Audio and video recordings.

 Satellite images and CCTV footage.

3. Semi-structured Data

 Data that does not follow a strict schema but contains tags, metadata,
or markers to define structure.

 Often stored in NoSQL databases.

Examples:

 XML and JSON files.

 Log files from web servers.

 Emails (structured header + unstructured body).
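The short Python sketch below makes the idea of semi-structured data concrete by reading one hypothetical order record in both JSON and XML using only the standard library; the tags and field names are made up for illustration, not taken from any real system.

```python
import json
import xml.etree.ElementTree as ET

# JSON: semi-structured — fields are tagged, but there is no rigid schema.
order_json = '{"order_id": 101, "items": [{"sku": "A1", "qty": 2}], "note": "gift wrap"}'
order = json.loads(order_json)
print(order["items"][0]["sku"])        # -> A1

# XML: the same record, with tags and attributes marking the structure.
order_xml = "<order id='101'><item sku='A1' qty='2'/></order>"
root = ET.fromstring(order_xml)
print(root.find("item").get("sku"))    # -> A1
```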

Traditional Data vs. Big Data: A Detailed Comparison
As businesses and industries evolve, data management has shifted from
traditional data processing methods to Big Data technologies. Below is a
detailed comparison between Traditional Data and Big Data based on
various factors.

1. Definition

Traditional Data

 Refers to structured and well-organized data stored in relational databases (RDBMS).

 Managed using SQL-based systems like MySQL, Oracle, PostgreSQL.


 Suitable for small to medium-sized datasets that fit into traditional
databases.

Big Data

 Refers to extremely large, complex, and diverse datasets that traditional databases cannot efficiently handle.

 Includes structured, semi-structured, and unstructured data.

 Requires distributed computing frameworks like Hadoop, Apache Spark, and NoSQL databases for processing.

2. Data Volume (Size of Data)

Traditional Data

 Handles limited amounts of data (usually gigabytes to terabytes).

 Designed for small datasets with structured relationships.

Big Data

 Deals with massive datasets (ranging from terabytes to petabytes and beyond).

 Grows exponentially with data generated from IoT, social media, and
real-time applications.

✅ Example:

 A traditional banking system stores customer account details in a relational database.

 Big Data applications analyze customer transactions, fraud detection, and market trends in real time.

3. Data Variety (Types of Data)

Traditional Data

 Mostly structured data with predefined formats (e.g., tables in a database).

 Can be stored and retrieved using SQL queries.

✅ Example:

 Employee records stored in an SQL database (name, age, salary, department).

Big Data

 Can be structured, semi-structured, or unstructured.

 Includes data from social media, sensors, emails, videos, logs, IoT
devices.

 Requires specialized tools like MongoDB (NoSQL), HDFS (Hadoop), and Apache Spark.

✅ Example:

 Customer reviews (text), social media posts (images/videos), and GPS location data.

4. Data Velocity (Speed of Processing)

Traditional Data

 Processes data in batch mode (stored and then analyzed later).

 Does not support real-time analytics.

 Suitable for businesses with slow-changing datasets.

Big Data

 Requires real-time or near real-time processing due to fast data generation.

 Uses Apache Kafka, Apache Spark, and Flink for streaming data
processing.

 Essential for applications like fraud detection, stock trading, and IoT
monitoring.

✅ Example:

 Traditional Approach: A retail store analyzes last month’s sales for future planning.

 Big Data Approach: Analyzing real-time customer purchases to offer instant discounts.

5. Data Storage & Management

Traditional Data

 Stored in centralized databases (RDBMS) like MySQL, PostgreSQL, and Oracle.

 Uses structured schemas to define data formats.

 Cannot efficiently handle unstructured or semi-structured data.

Big Data

 Stored in distributed file systems (HDFS, Google BigTable, Amazon S3).

 Uses NoSQL databases (MongoDB, Cassandra, HBase) to handle various data types.

 Enables horizontal scaling for handling massive data volumes.

✅ Example:

 Traditional: A company's financial records stored in an Oracle database.

 Big Data: Google indexing billions of web pages for search results.

6. Data Processing Techniques

Traditional Data

 Uses SQL-based queries to retrieve structured data.

 Relies on single-server architecture for computations.

 Not optimized for distributed parallel processing.

Big Data

 Uses parallel distributed processing across multiple nodes.

 Processes data using MapReduce (Hadoop), Spark, and Flink.

 Supports advanced machine learning and AI-driven analytics.

✅ Example:
 Traditional: A business runs monthly reports using SQL queries.

 Big Data: Facebook processes millions of user activities in real-time.
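As a minimal sketch of the distributed approach described above, the PySpark snippet below aggregates a hypothetical activity log (an events.csv file with user_id, action, and ts columns is an assumption, not a real dataset); the same code runs on a laptop or, unchanged, on a multi-node cluster where the work is split across executors.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; on a real cluster the same code runs unchanged,
# with the work partitioned across many executor nodes.
spark = SparkSession.builder.appName("activity-report").getOrCreate()

# Hypothetical event log with columns: user_id, action, ts
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# The aggregation is expressed once; Spark partitions the data and
# computes the per-action counts in parallel.
report = events.groupBy("action").agg(F.count("*").alias("events"))
report.show()
```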

7. Scalability

Traditional Data

 Vertical scaling (Scaling Up) – Requires upgrading a single machine’s hardware (CPU, RAM).

 Limited by hardware capacity and performance constraints.

Big Data

 Horizontal scaling (Scaling Out) – Increases capacity by adding multiple machines.

 Distributed computing ensures high availability and fault tolerance.

✅ Example:

 Traditional Approach: Increasing database performance by adding more RAM to a server.

 Big Data Approach: Using a Hadoop cluster with 100+ machines to process data simultaneously.

8. Cost & Infrastructure

Traditional Data

 Requires high-end relational database licenses and expensive infrastructure.

 Involves high maintenance costs for centralized database management.

Big Data

 Uses open-source technologies (Hadoop, Spark, NoSQL) to reduce costs.

 Leverages cloud-based solutions like AWS, Google Cloud, and Azure for
cost-effective scaling.
✅ Example:

 Traditional: A bank purchases Oracle licenses for customer data management.

 Big Data: Netflix uses AWS cloud storage for its massive video
streaming data.

9. Security & Data Privacy

Traditional Data

 Uses authentication & authorization (e.g., role-based access control).

 Easier to manage due to structured data and centralized databases.

Big Data

 Requires advanced security measures due to distributed architecture.

 Involves data encryption, anonymization, and compliance with GDPR & CCPA.

 Security challenges in handling real-time, large-scale transactions.

✅ Example:

 Traditional: Secure access to a company’s HR database using SQL authentication.

 Big Data: Securing global IoT device communications from cyber threats.

10. Use Cases & Applications

Traditional Data Use Cases

1. Banking & Finance: Storing customer transactions in relational databases.

2. HR Systems: Employee payroll and attendance records.

3. Inventory Management: Tracking stock levels using SQL databases.


Big Data Use Cases

1. Healthcare: AI-driven disease prediction and patient data analysis.

2. E-commerce: Personalized product recommendations (Amazon, Flipkart).

3. Social Media: Sentiment analysis of Twitter, Facebook, Instagram.

4. Smart Cities: Traffic pattern analysis and IoT-based energy management.

✅ Example:

 Traditional: A telecom company maintains customer billing records in a relational database.

 Big Data: Analyzing millions of customer calls to detect network failures.

Evolution of Big Data: A Detailed Overview


Big Data has evolved over the decades as technology, computing power, and
data generation capabilities have advanced. The journey from traditional
databases to the era of AI-driven analytics showcases how businesses and
industries have adapted to massive, fast-growing, and diverse
datasets.

1. Pre-Big Data Era (Before 2000s) – The Age of Traditional Data

Key Characteristics:

 Data Volume: Small to moderate (Megabytes to Gigabytes).

 Data Type: Mostly structured data (tables in relational databases).

 Storage & Processing: Traditional Relational Database Management Systems (RDBMS) such as Oracle, MySQL, PostgreSQL, and SQL Server.

 Challenges:
o Could not handle unstructured data (images, videos, emails,
logs).

o Data storage and processing had hardware limitations.

o No real-time analytics; only batch processing was possible.

Example:

 Banks stored customer transaction details in SQL databases.

 Businesses used Excel spreadsheets for data management.

2. Early Big Data (2000s) – The Birth of Distributed Systems

Key Characteristics:

 Data Volume: Increased significantly (Gigabytes to Terabytes).

 Data Type: Semi-structured and unstructured data emerged (emails, XML, JSON, log files).

 Storage & Processing:

o Google developed MapReduce (2004) – a distributed computing framework for large-scale data processing.

o Hadoop (2006) was introduced as an open-source implementation of MapReduce.

o NoSQL databases (MongoDB, Cassandra, HBase) emerged to handle unstructured data.

Major Developments:

 Search Engines: Google revolutionized search with its PageRank algorithm and distributed computing techniques.

 Social Media Rise: Platforms like Facebook, Twitter, and LinkedIn started generating massive amounts of unstructured user data.

 Cloud Storage: Companies like Amazon (AWS S3) and Google Cloud Storage introduced scalable storage solutions.

Challenges:

 Limited real-time processing – Hadoop was batch-based.


 High latency – Not suitable for instant decision-making.

Example:

 Yahoo! used Hadoop to process search engine data.

 Walmart analyzed sales trends using distributed computing.

3. Big Data Boom (2010s) – Real-Time Analytics & AI Integration

Key Characteristics:

 Data Volume: Exploded to Terabytes, Petabytes, and even Exabytes.

 Data Type:

o Structured (SQL databases).

o Semi-structured (JSON, XML, web logs).

o Unstructured (videos, images, sensor data, social media).

 Storage & Processing:

o The Hadoop ecosystem was extended with Apache Spark (2014), which enabled real-time data processing.

o Streaming frameworks like Apache Kafka and Flink were introduced.

o AI and Machine Learning started leveraging Big Data for predictive analytics.

Major Developments:

 Real-Time Applications:

o Fraud detection in banking.

o Personalized recommendations (Netflix, Amazon).

o Self-driving cars using sensor data.

 Cloud & Edge Computing:

o Companies adopted Google Cloud, AWS, and Microsoft Azure for scalable data processing.
o Edge Computing emerged to process data closer to the source
(IoT devices).

 Deep Learning & AI:

o Neural networks (built with frameworks like TensorFlow and PyTorch) leveraged Big Data for speech recognition, image processing, and automation.

Challenges:

 Data privacy concerns (GDPR, CCPA).

 Cybersecurity threats due to large-scale data breaches.

 Complexity in managing multi-cloud environments.

Example:

 Netflix processes user viewing data in real time to recommend shows.

 Tesla’s autonomous cars analyze millions of sensor inputs for navigation.

4. Modern Era (2020s – Present) – AI-Driven Big Data & Quantum Computing

Key Characteristics:

 Data Volume: Massive expansion to Zettabytes.

 Data Type: Multi-source, heterogeneous, streaming data from IoT, blockchain, AI, and 5G networks.

 Storage & Processing:

o Serverless computing allows automatic scaling (AWS Lambda, Google Cloud Functions).

o Quantum computing is being explored for ultra-fast Big Data analytics.

o Federated learning allows AI models to train on decentralized data while limiting privacy risks.

Major Developments:

 Big Data + AI + Blockchain:


o AI predicts diseases from medical Big Data.

o Blockchain ensures data security and traceability.

 5G & IoT Revolution:

o 5G enables real-time streaming and analytics at a massive scale.

o Smart cities use IoT-generated data for traffic, waste, and energy
management.

 Augmented Analytics:

o AI automatically cleans, processes, and interprets data.

o NLP (Natural Language Processing) allows businesses to ask data-related questions in human language.

Challenges:

 Ethical AI – Bias in AI models using Big Data.

 Sustainability – Huge energy consumption for AI and data centers.

 Data Sovereignty – Legal battles over where data is stored and processed.

Example:

 Google’s DeepMind uses AI and Big Data for protein structure predictions.

 Facebook detects fake news using real-time Big Data analytics.

Future of Big Data (Beyond 2030)

Expected Developments:

✅ AI-powered autonomous systems:

 AI-driven data pipelines will fully automate data collection, processing, and insights generation.

✅ Quantum Computing for Big Data:

 Quantum algorithms will analyze exabytes of data in seconds.

✅ DNA Data Storage:


 Storing vast amounts of data in DNA molecules for near-infinite
storage.

✅ AI-Augmented Decision-Making:

 Governments and businesses will rely on AI-driven insights for policymaking, defense, and the economy.

Challenges with Big Data: A Detailed Analysis


Big Data brings immense opportunities, but it also presents several
challenges related to data storage, processing, security, privacy, and
real-time analytics. As organizations deal with high-volume, high-
velocity, and high-variety data, they face significant hurdles in
managing, analyzing, and deriving insights efficiently.

Below is an in-depth exploration of the major challenges associated with Big Data.

1. Data Storage & Scalability Issues

Problem:

 The massive increase in data volume (from Terabytes to Petabytes and beyond) creates a storage bottleneck.

 Traditional relational databases (SQL-based systems) struggle to handle this scale efficiently.

 Storage costs increase as data volume grows, requiring distributed and cloud storage solutions.

Solutions:

✅ Distributed File Systems – Apache Hadoop’s HDFS (Hadoop Distributed File System) stores large-scale data across multiple machines.
✅ Cloud Storage – AWS S3, Google Cloud Storage, and Azure Blob Storage
provide scalable and cost-effective solutions.
✅ Compression Techniques – Reduce storage costs by using efficient
compression algorithms like Snappy, LZ4, or Gzip.

Example:
 Facebook stores petabytes of user data using a combination of
Hadoop, Hive, and cloud-based infrastructure.
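Snappy and LZ4 require third-party libraries, so the minimal sketch below uses Python's built-in gzip module as a stand-in to show how compression shrinks a synthetic log file before it is archived.

```python
import gzip
import os

# Write the same synthetic log data uncompressed and gzip-compressed,
# then compare the on-disk sizes.
line = b"2024-01-01T00:00:00Z user=42 action=view page=/home\n"
with open("events.log", "wb") as f:
    f.write(line * 100_000)
with gzip.open("events.log.gz", "wb") as f:
    f.write(line * 100_000)

print(os.path.getsize("events.log"), "bytes raw")
print(os.path.getsize("events.log.gz"), "bytes gzipped")
```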

2. Data Processing & Real-Time Analytics

Problem:

 Traditional data processing tools (SQL, RDBMS) are too slow to analyze massive datasets.

 Real-time analytics is critical for fraud detection, stock trading, self-driving cars, and other applications.

 High latency in batch processing (Hadoop’s MapReduce) makes real-time decision-making difficult.

Solutions:

✅ In-Memory Processing – Apache Spark and Google’s BigQuery enable faster analytics by processing data in memory instead of on disk.
✅ Streaming Frameworks – Apache Kafka, Flink, and Storm provide
real-time data stream processing.
✅ Edge Computing – Process data closer to the source (IoT devices,
sensors) instead of sending everything to a central data center.

Example:

 Uber processes ride requests in real time using Apache Kafka and
Spark Streaming.
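A minimal sketch of that pattern follows, assuming a Kafka broker on localhost:9092, a made-up topic named "rides", and a Spark build with the spark-sql-kafka connector available; it counts incoming events per minute and prints the running totals.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ride-stream").getOrCreate()

# Read a live stream from a hypothetical Kafka topic of ride events.
rides = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "rides")
         .load())

# Kafka values arrive as bytes; cast to string and count events per minute.
counts = (rides.selectExpr("CAST(value AS STRING) AS event", "timestamp")
          .groupBy(F.window("timestamp", "1 minute"))
          .count())

# Stream the running counts to the console (demo sink only).
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```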

3. Data Integration & Heterogeneity

Problem:

 Big Data comes from multiple sources – IoT sensors, social media,
emails, logs, images, videos, etc.

 Data is structured (SQL databases), semi-structured (JSON, XML), and unstructured (videos, images, social media posts).

 Integrating diverse datasets into a unified platform is complex.

Solutions:
✅ ETL (Extract, Transform, Load) Pipelines – Tools like Apache NiFi,
Talend, and Apache Beam automate data integration.
✅ Data Lakes – Platforms like AWS Lake Formation store raw,
unstructured data efficiently.
✅ Schema-on-Read Approach – Allows flexible querying of diverse data
formats without predefining strict schemas.

Example:

 Netflix integrates data from different sources (user interactions, streaming behavior, network logs) into a unified analytics system.

4. Data Quality & Cleaning

Problem:

 Incomplete, inconsistent, duplicate, and inaccurate data leads to poor decision-making.

 Different data sources may have conflicting formats (e.g., date formats in YYYY-MM-DD vs. MM-DD-YYYY).

 Missing values and noise make data less reliable for analytics and AI
models.

Solutions:

✅ Automated Data Cleaning Tools – OpenRefine, Trifacta, and Apache Griffin help clean and standardize data.
✅ AI & ML for Data Cleaning – AI-powered tools detect anomalies,
inconsistencies, and missing values automatically.
✅ Master Data Management (MDM) – Ensures a single, consistent version
of data across an organization.

Example:

 Healthcare data cleaning ensures that patient records are complete and consistent for accurate diagnostics.
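A minimal pandas sketch of these cleaning steps on a toy set of patient records (the column names and values are invented); note that format="mixed" needs pandas 2.0 or newer.

```python
import pandas as pd

# Toy patient records with the problems described above: mixed date
# formats, an exact duplicate row, and a missing value.
records = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "visit_date": ["2024-01-05", "2024-01-05", "01-07-2024", None],
    "weight_kg":  [70.2, 70.2, None, 81.5],
})

cleaned = records.drop_duplicates()                       # remove duplicates
cleaned["visit_date"] = pd.to_datetime(cleaned["visit_date"],
                                       format="mixed",    # pandas >= 2.0
                                       errors="coerce")   # bad dates -> NaT
# Fill missing weights with the column median as a simple imputation.
cleaned["weight_kg"] = cleaned["weight_kg"].fillna(cleaned["weight_kg"].median())
print(cleaned)
```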

5. Security & Privacy Concerns

Problem:
 Big Data contains sensitive personal, financial, and business
information.

 Cyberattacks, data breaches, and unauthorized access pose major threats.

 Regulatory compliance (GDPR, CCPA) requires organizations to protect user data.

Solutions:

✅ Encryption & Access Control – Data is encrypted using AES-256, and role-based access controls (RBAC) restrict access.
✅ Blockchain for Data Security – Ensures tamper-proof data storage and
audit trails.
✅ Anomaly Detection with AI – AI-driven cybersecurity tools detect
unusual patterns and prevent breaches.

Example:

 Equifax Data Breach (2017): A cyberattack exposed 147 million users’ personal data, highlighting the need for stronger Big Data security.
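A minimal sketch of encrypting one record at rest with AES-256-GCM, using the third-party cryptography package; a real deployment would pull the key from a key management service rather than generating it inline, and the record shown is invented.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Generate a 256-bit key (in practice: fetched from a key management
# service, never hard-coded or stored next to the data).
key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)

record = b'{"customer_id": 42, "card_last4": "1234"}'
nonce = os.urandom(12)                       # must be unique per encryption
ciphertext = aesgcm.encrypt(nonce, record, None)

# Store nonce + ciphertext; decrypt later with the same key and nonce.
plaintext = aesgcm.decrypt(nonce, ciphertext, None)
assert plaintext == record
```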

6. High Infrastructure & Maintenance Costs

Problem:

 Storing and processing petabytes of data requires expensive servers, cloud services, and high-performance computing clusters.

 Maintenance costs rise due to data replication, backup, and redundancy.

Solutions:

✅ Hybrid Cloud Solutions – Use on-premises + cloud storage for cost optimization.
✅ Serverless Computing – AWS Lambda, Google Cloud Functions
dynamically allocate resources as needed.
✅ Data Archiving & Compression – Move infrequently accessed data to
cold storage (cheaper but slower).

Example:
 Google optimizes Big Data costs by using AI-powered workload
scheduling and serverless infrastructure.

7. Ethical Issues & AI Bias in Big Data

Problem:

 AI models trained on biased data make unfair decisions (e.g., biased hiring, racial profiling).

 Privacy invasion – Companies use Big Data to track users without consent.

 Manipulation of public opinion using AI-driven fake news and deepfakes.

Solutions:

✅ Fair AI Algorithms – Ensure diverse training data to remove bias.


✅ Ethical AI Regulations – Enforce AI transparency and explainability.
✅ User Control Over Data – Implement opt-in policies for data collection.

Example:

 Facebook’s AI algorithm was accused of racial bias in advertising placement, leading to stricter AI fairness policies.

8. Data Governance & Compliance

Problem:

 Governments enforce strict data protection laws (GDPR in Europe, CCPA in California).

 Companies must track, store, and process user data legally.

 Failure to comply leads to heavy fines (e.g., GDPR fines up to €20 million).

Solutions:

✅ Data Masking & Tokenization – Hide sensitive user data to protect privacy.
✅ Compliance Audits – Conduct regular checks to ensure legal compliance.
✅ Metadata Management – Properly label and classify sensitive data.
Example:

 Amazon was fined €746 million for GDPR violations in 2021 due
to improper handling of user data.

Technologies Available for Big Data


Big Data technologies help in storing, processing, analyzing, and
visualizing massive amounts of structured and unstructured data.
These technologies are categorized into data storage, processing
frameworks, databases, real-time analytics, machine learning, and
visualization tools.

Below is a detailed classification of Big Data technologies and their applications.

1. Data Storage Technologies

Big Data storage technologies are essential for storing massive volumes
of structured and unstructured data efficiently.

a) Distributed File Systems

Used to store data across multiple nodes to ensure scalability and fault
tolerance.

✅ Hadoop Distributed File System (HDFS) – The backbone of Apache Hadoop, used for storing large-scale datasets across clusters.
✅ Google File System (GFS) – Google’s proprietary distributed file system,
predecessor of HDFS.
✅ Amazon S3 (Simple Storage Service) – A cloud-based object storage
solution for handling large data workloads.

b) Cloud Storage

Used for storing and managing data on remote cloud servers.

✅ AWS S3, Google Cloud Storage, Azure Blob Storage – Cloud storage
services that offer high availability and scalability.
✅ Snowflake – A cloud-based data warehouse optimized for analytics.
✅ MinIO – An open-source alternative to AWS S3 for private cloud storage.

c) Data Warehousing
Optimized for storing structured and semi-structured data for analytical
processing.

✅ Amazon Redshift – A cloud-based data warehouse optimized for complex queries.
✅ Google BigQuery – A serverless data warehouse for large-scale analytics.
✅ Apache Hive – A data warehouse built on top of Hadoop, enabling SQL-like
queries on Big Data.

2. Data Processing Technologies

These frameworks process large datasets efficiently using batch and real-
time processing methods.

a) Batch Processing Frameworks

Process large datasets in batches at scheduled intervals.

✅ Apache Hadoop (MapReduce) – A framework that processes massive datasets using the MapReduce programming model.
✅ Apache Spark – Faster than Hadoop, uses in-memory processing for high-
speed batch analytics.
✅ Apache Flink – Handles both batch and real-time stream processing.

b) Real-Time & Stream Processing Frameworks

Process continuous streams of data from IoT devices, sensors, and social
media.

✅ Apache Kafka – A distributed messaging system used for streaming data pipelines.
✅ Apache Storm – Processes real-time streaming data with low latency.
✅ Apache Flink – Supports event-driven real-time stream processing.
✅ Google Dataflow – A serverless data processing service for batch and
real-time streams.
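A minimal sketch of the two ends of a streaming pipeline using the third-party kafka-python client, assuming a broker on localhost:9092 and a made-up "sensor-readings" topic.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Publish a sensor reading to a hypothetical "sensor-readings" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sensor-readings", {"device": "pump-7", "temp_c": 81.4})
producer.flush()

# A separate process would consume and react to the stream.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)    # e.g. {'device': 'pump-7', 'temp_c': 81.4}
    break
```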

3. Big Data Databases

Big Data requires specialized databases that can handle massive amounts
of structured, semi-structured, and unstructured data.

a) NoSQL Databases
Designed for scalability and high availability, ideal for handling
unstructured data.

✅ MongoDB – A document-oriented NoSQL database for flexible schema storage.
✅ Cassandra – A highly scalable, distributed database used by Facebook,
Netflix, and Apple.
✅ HBase – A NoSQL database that runs on Hadoop, optimized for large
tables.
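A minimal sketch of storing and retrieving a flexible-schema document with the pymongo driver, assuming a local MongoDB instance; the database, collection, and fields are invented for illustration.

```python
from pymongo import MongoClient

# Connect to an assumed local MongoDB instance.
client = MongoClient("mongodb://localhost:27017")
reviews = client["shop"]["reviews"]

# Documents need no predefined schema; fields can vary per record.
reviews.insert_one({"product_id": 42, "stars": 5,
                    "text": "Great!", "tags": ["fast", "cheap"]})
print(reviews.find_one({"product_id": 42}))
```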

b) NewSQL Databases

Combine the benefits of traditional SQL databases with the scalability of NoSQL.

✅ Google Spanner – A globally distributed database that provides strong consistency.
✅ CockroachDB – A fault-tolerant, horizontally scalable SQL database.
✅ MemSQL – A real-time, high-performance distributed database.

c) Graph Databases

Designed for storing complex relationships in social networks, fraud detection, and recommendation systems.

✅ Neo4j – A popular graph database used for social media and fraud
analytics.
✅ Amazon Neptune – A fully managed graph database optimized for deep
link analytics.
✅ ArangoDB – A multi-model NoSQL database that supports graph,
document, and key-value data.

4. Machine Learning & AI for Big Data

AI and Machine Learning technologies analyze Big Data to extract patterns, trends, and predictive insights.

✅ TensorFlow & PyTorch – Deep learning frameworks for training large-scale AI models.
✅ Apache Mahout – A scalable machine learning library built for Hadoop.
✅ MLlib (Apache Spark) – A distributed machine learning library for high-
performance AI workloads.
✅ Google AI Platform – A cloud-based AI service for training and deploying
ML models.
✅ H2O.ai – An open-source AI platform for predictive analytics and deep
learning.

5. Data Visualization & Business Intelligence (BI) Tools

Visualization tools help in interpreting Big Data insights in a more intuitive way.

✅ Tableau – A leading BI tool for interactive data visualization.


✅ Power BI – A Microsoft analytics platform for real-time dashboards.
✅ Apache Superset – An open-source visualization tool for large datasets.
✅ Google Data Studio – A free tool for connecting and visualizing data from
multiple sources.
✅ D3.js – A JavaScript library for creating custom data visualizations.

6. Data Security & Privacy Technologies

Big Data security ensures data confidentiality, integrity, and protection against cyber threats.

✅ Apache Ranger – Security and policy framework for Hadoop and Big Data
environments.
✅ Apache Knox – Provides authentication and access control for Big Data
systems.
✅ GDPR & CCPA Compliance Tools – Tools like BigID and Privacera help
companies comply with privacy laws.
✅ Encryption (AES-256, SSL, TLS) – Ensures data is encrypted during
transmission and storage.
✅ Blockchain for Data Security – Used in fraud detection and tamper-proof
audit trails.

7. Orchestration & Workflow Management

Big Data workflows require tools for scheduling, automating, and monitoring data pipelines.

✅ Apache Airflow – A powerful workflow automation tool for orchestrating ETL jobs.
✅ Apache Oozie – A workflow scheduler designed for Hadoop jobs.
✅ Prefect – A Python-based workflow automation tool for data pipelines.
✅ Luigi – A Python library for building complex batch workflows.
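A minimal sketch of such a pipeline as an Apache Airflow DAG (assuming a recent Airflow 2.x installation); the DAG id, task names, and the two placeholder functions are invented for illustration.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw records from the source system")

def load():
    print("writing cleaned records to the warehouse")

# A daily two-step ETL pipeline; Airflow handles scheduling and retries.
with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task   # run extract before load
```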

Big Data Infrastructure: A Detailed Definition

Big Data infrastructure refers to the collection of hardware, software, networking, and cloud solutions that are specifically designed to store, process, manage, and analyze vast amounts of data. It is the backbone of any system or platform dealing with Big Data and aims to ensure the scalability, reliability, security, and efficiency needed to handle massive volumes of data that traditional infrastructures cannot.

Big Data infrastructure is built to manage the three Vs of Big Data:

 Volume: The amount of data generated.

 Velocity: The speed at which data is generated, processed, and analyzed.

 Variety: The different types of data—structured, semi-structured, and unstructured—that must be handled.

Key Components of Big Data Infrastructure

1. Data Storage Layer


This layer is responsible for the storage and management of large
datasets. Since Big Data often consists of structured, semi-structured,
and unstructured data, storage systems must be distributed and
scalable to handle this large volume of data.

o Distributed File Systems (DFS): These are used to store large datasets across multiple servers. Examples include HDFS (Hadoop Distributed File System) and Google File System (GFS).

o Cloud Storage Solutions: To handle elastic scalability, cloud storage such as AWS S3, Google Cloud Storage, and Azure Blob Storage are commonly used.

o Data Warehouses and Databases: Specialized systems like Amazon Redshift, Google BigQuery, and NoSQL databases (like MongoDB, Cassandra) allow for fast data retrieval and analytical processing.

2. Data Processing Layer


This layer handles the processing of data, whether it's batch
processing (data processed in chunks) or stream processing (real-time
data processing).

o Batch Processing Frameworks: For large-scale data processing, frameworks like Apache Hadoop (MapReduce) and Apache Spark are commonly used.

o Stream Processing Frameworks: Real-time data processing can be managed by systems like Apache Kafka and Apache Flink, which allow for the processing of data as it arrives.

3. Data Management Layer


This layer is responsible for the management of databases and
other forms of data storage.

o SQL Databases (NewSQL): For scenarios requiring consistency and transactions, databases such as Google Spanner and CockroachDB provide relational database features with scalability.

o NoSQL Databases: For handling large amounts of unstructured data, NoSQL systems like MongoDB, Cassandra, and HBase are used.

o Graph Databases: These are used for managing relationship-based data and include systems like Neo4j and Amazon Neptune.

4. Data Integration & Ingestion Layer


This layer focuses on getting data from various sources and
bringing it into the system.

o Data Ingestion Tools: Tools like Apache NiFi, Apache Flume, and Kafka Connect are used to collect, cleanse, and route data from disparate sources into the Big Data system.

o ETL (Extract, Transform, Load): These tools are used to extract data from various sources, transform it into a usable format, and load it into storage systems or databases. Popular ETL tools include Apache Airflow and Talend.
5. Networking Infrastructure
Big Data infrastructures rely heavily on high-bandwidth, low-
latency networking to transfer data quickly and efficiently across
various components of the system.

o High-Speed Networking: Technologies like InfiniBand, 100GbE (Gigabit Ethernet), and 10GbE are used to handle the large data transfer needs.

o Cloud Networking: For cloud-based infrastructure, solutions like AWS Direct Connect and Google Cloud Interconnect provide fast and secure network connections.

6. Security & Governance Layer


Ensuring the security, privacy, and compliance of data is critical.
This layer focuses on protecting the data from unauthorized access
and ensuring that it is stored and processed according to relevant laws
and regulations.

o Access Control: Apache Ranger and Apache Knox provide security by enforcing access policies in Hadoop ecosystems.

o Data Encryption: Sensitive data is encrypted both in transit (using SSL/TLS) and at rest (using AES-256 encryption).

o Compliance: Big Data systems must ensure compliance with industry standards like GDPR, HIPAA, and other data protection laws.

o Data Masking & Anonymization: For privacy, personal and sensitive data can be masked or anonymized.

7. Data Analytics & Machine Learning Layer


Once data is stored and processed, organizations use analytics and
machine learning (ML) tools to derive insights from the data.

o Analytics Tools: Tools like Apache Hive, Apache Presto, and Apache Drill allow for querying and analyzing structured and unstructured data.

o Machine Learning Tools: Platforms like Apache Mahout, MLlib (Apache Spark), and AI frameworks like TensorFlow and PyTorch are used for building and training machine learning models on Big Data.
o Real-Time Analytics: Streaming platforms like Apache Flink
and Apache Storm are used for analyzing real-time data to gain
immediate insights.

8. Data Visualization Layer


Visualizing Big Data insights is crucial for making data-driven
decisions. This layer includes Business Intelligence (BI) tools and
visualization platforms.

o BI Tools: Tools like Tableau, Power BI, and QlikView are used
for creating interactive dashboards and reports.

o Custom Visualization: D3.js and Apache Superset can be used for building custom visualizations for Big Data applications.

Big Data Infrastructure Architecture

The architecture of Big Data infrastructure consists of multiple layers, each handling different aspects of the Big Data lifecycle:

1. Data Collection: Sources include IoT devices, social media, sensors, databases, and logs.

2. Data Ingestion: Collecting data using tools like Kafka, NiFi, or Flume.

3. Storage: Using distributed file systems like HDFS or cloud storage platforms.

4. Processing: Distributed computing using Hadoop, Spark, or Flink.

5. Analysis: Querying with Hive, Presto, and using machine learning models.

6. Visualization: Displaying insights using Tableau, Power BI, or custom dashboards.

Infrastructure Considerations

When designing Big Data infrastructure, organizations must consider:

 Scalability: The ability to handle growing volumes of data by scaling horizontally (adding more machines) or vertically (upgrading hardware).

 Fault Tolerance: Ensuring the system can recover quickly from hardware or software failures. This is achieved through replication and redundancy.

 Performance: The infrastructure must be optimized for both data processing speed and real-time analytics.

 Cost Management: Managing the cost of storing and processing vast amounts of data, especially in cloud environments.

Use of Data Analytics in Big Data

Data analytics plays a critical role in Big Data as it helps organizations make
data-driven decisions by extracting valuable insights from massive
datasets. Big Data analytics involves examining large datasets (often with
complex and varied data types) to uncover hidden patterns, correlations, and
trends. Here are some key uses:

1. Predictive Analytics

 Predicting trends and future outcomes based on historical data, such as predicting customer behavior, stock market trends, or potential risks.

 Applications: Retailers use predictive analytics to forecast demand, manufacturers predict equipment failures, and healthcare systems predict disease outbreaks.

2. Descriptive Analytics

 Summarizing historical data to identify patterns and trends. Descriptive analytics answers "What happened?"

 Applications: Businesses use descriptive analytics to understand past sales trends, consumer behavior, or operational performance.

3. Diagnostic Analytics

 Finding reasons behind specific outcomes or events by analyzing data. Diagnostic analytics answers "Why did it happen?"

 Applications: Identifying the root cause of a problem (e.g., why sales dropped last quarter or why a machine malfunctioned).
4. Real-time Analytics

 Processing data in real time to make instant decisions. Real-time analytics is essential for scenarios that require immediate insights, like monitoring transactions or detecting fraud.

 Applications: Stock market trading, IoT devices, and real-time marketing strategies.

5. Prescriptive Analytics

 Providing recommendations for actions based on data analysis. Prescriptive analytics answers "What should we do?"

 Applications: Optimizing supply chain management, personalized marketing recommendations, or identifying the best course of action in healthcare (e.g., treatment plans).

6. Machine Learning and AI

 Training algorithms on Big Data to recognize patterns, make predictions, and improve over time without human intervention.

 Applications: Personalized product recommendations (e.g., Netflix, Amazon), fraud detection, and autonomous vehicles.

7. Data Visualization

 Displaying data in an easy-to-understand visual format such as charts, graphs, or dashboards. This makes it easier to identify trends and insights.

 Applications: Business Intelligence tools (e.g., Power BI, Tableau) allow executives to track key performance indicators (KPIs) and metrics.
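A minimal visualization sketch with matplotlib, plotting made-up monthly sales figures as a KPI bar chart that could be dropped into a report.

```python
import matplotlib.pyplot as plt

# Made-up monthly sales figures for a simple KPI panel.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 150, 143, 170, 188]   # thousands of USD

plt.figure(figsize=(6, 3))
plt.bar(months, sales, color="steelblue")
plt.title("Monthly Sales (kUSD)")
plt.ylabel("Sales")
plt.tight_layout()
plt.savefig("sales_kpi.png")   # embed this image in a report or dashboard
```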

Desired Properties of a Big Data System

Big Data systems must be equipped with several essential properties to manage vast amounts of data efficiently. Below are the most important characteristics:

1. Scalability
 A Big Data system should be able to handle growing amounts of
data and adapt as the dataset increases in size, speed, and
complexity.

 Horizontal scaling (adding more nodes to a cluster) and vertical scaling (upgrading the existing hardware) are common approaches.

2. Reliability

 Big Data systems need to provide consistent data availability and fault tolerance to ensure that they work even during hardware or software failures.

 Techniques like data replication (duplicating data across multiple machines) and data recovery mechanisms are used to ensure reliability.

3. Flexibility (Variety)

 A Big Data system must be able to handle structured, semi-structured, and unstructured data (e.g., text, images, video, and audio).

 It should support data formats like JSON, XML, CSV, and Parquet,
and be able to process data from different sources (social media,
sensor data, web logs, etc.).

4. Speed (Low Latency)

 Big Data systems should be capable of processing data at high speed, especially when real-time or near-real-time analytics are required.

 For example, stream processing frameworks like Apache Kafka or Apache Flink help achieve low-latency processing for real-time data.

5. High Throughput

 The system should be able to process large volumes of data in a short period. This is achieved by leveraging distributed computing and parallel processing techniques.

 Batch processing (for large datasets) and stream processing (for real-time data) are optimized for high throughput.

6. Data Consistency
 Consistency is crucial for ensuring that all data copies across distributed systems are in sync. Distributed systems must balance the trade-offs described by the CAP theorem (Consistency, Availability, Partition tolerance).

 For high consistency, systems like Google Spanner or Apache HBase are commonly used.

7. Fault Tolerance

 The ability of a system to recover from failures without losing data. A Big Data system must ensure automatic failover, data replication, and the ability to resynchronize after a failure.

 Technologies like the Hadoop Distributed File System (HDFS) ensure fault tolerance by replicating data blocks across nodes, while Apache Spark recovers lost work by recomputing it from lineage information.

8. Security and Privacy

 As Big Data often involves sensitive information, ensuring data security is essential. Security measures should include encryption, user authentication, and role-based access control (RBAC) to protect data from unauthorized access.

 Privacy is also important, with compliance to GDPR, HIPAA, or other data protection laws.

9. Manageability

 A Big Data system should be easy to manage, with monitoring tools, dashboards, and automatic updates.

 It should allow data integration, metadata management, and data lineage tracking to make it easier for data engineers and analysts to work with.

10. Cost-Effectiveness

 Big Data systems should be designed to manage large datasets while keeping costs manageable. Often, organizations leverage cloud infrastructure for elastic scalability and on-demand resource allocation to reduce costs.

 Open-source technologies like Hadoop and Apache Spark are used to minimize software licensing costs.
11. Interoperability

 A Big Data system should be able to integrate with other systems, platforms, and applications. It should allow easy interaction with third-party tools, APIs, and data exchange formats.

 This ensures seamless data transfer between the Big Data system and
other enterprise systems like ERP, CRM, and BI tools.

Unit -2
Introduction to Hadoop
Hadoop is an open-source framework designed to store and process large
amounts of data in a distributed computing environment. Developed by
Apache Software Foundation, Hadoop allows users to process and
analyze massive datasets that cannot be handled by traditional data-
processing systems due to their size or complexity. It is built to scale from a
single server to thousands of machines, providing flexibility and fault
tolerance.

Hadoop is based on a distributed computing model, which breaks down data into smaller chunks and distributes them across a cluster of
commodity hardware, allowing parallel processing. It is widely used for big
data analytics, data warehousing, and data storage tasks, and is
particularly effective when handling unstructured data (e.g., text, images,
videos, etc.).

Core Hadoop Components


Hadoop has several core components, each responsible for a specific
function in the processing of big data. The primary components are:

1. Hadoop Distributed File System (HDFS)

 HDFS is the storage layer of Hadoop, designed to store large volumes of data in a distributed manner across multiple machines in a cluster. It
provides high fault tolerance, scalability, and efficiency in storing
vast amounts of data.
 Key Features:

o Block-level storage: Files are split into fixed-size blocks (usually 128 MB or 256 MB) and distributed across different nodes in the cluster.

o Replication: Each data block is replicated across multiple machines (default is 3 copies) to ensure reliability and data recovery in case of failure.

o Fault tolerance: If a node fails, data can be retrieved from other replicas stored on different nodes.

o High throughput: It is optimized for throughput rather than low latency, making it ideal for batch processing of large datasets.

2. MapReduce

 MapReduce is a programming model and processing framework used for processing large data sets in parallel across a distributed cluster. It
breaks down tasks into smaller sub-tasks and processes them
concurrently across multiple nodes.

 Key Functions:

o Map phase: The input data is divided into smaller chunks, which
are processed by individual mapper tasks. These mappers output
key-value pairs.

o Reduce phase: The output of the mappers is shuffled and sorted based on keys, and the reducers process the data to generate the final output.

o It allows for scalable and parallel processing of data across large clusters, providing a distributed computation model (see the sketch below).
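The classic word-count job can be written as two small Python scripts (shown together below) in the Hadoop Streaming style, where Hadoop pipes data through each script's standard input and output; this is a minimal illustration under that assumption, not a production job.

```python
# --- mapper.py: emit a (word, 1) pair for every word on stdin ---
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# --- reducer.py: Hadoop delivers mapper output sorted by key, so equal
# --- words arrive together and can be summed in a single pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```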

3. YARN (Yet Another Resource Negotiator)

 YARN is the resource management layer in Hadoop. It is responsible for managing and allocating resources in the cluster to
ensure that the applications have enough resources to run.

 Key Features:

o Resource management: It decides how to distribute computational resources (CPU, memory) among the running applications.

o Job scheduling: YARN manages job execution and schedules the tasks across the cluster.

o Fault tolerance: YARN ensures that if a task fails, it will restart and continue from where it left off.

o Multi-tenancy: It supports multiple applications running simultaneously in the same cluster by providing resource isolation.

4. Hadoop Common (Hadoop Core Libraries)

 Hadoop Common consists of the set of shared libraries and utilities required by other Hadoop components. It provides the
necessary tools for Hadoop to run across a distributed system.

 Key Features:

o Java libraries: These libraries provide essential functionality for working with Hadoop components (like HDFS, MapReduce, etc.).

o Configuration files: The configuration files used by all the Hadoop components for setting up properties like file paths, directories, memory allocations, etc.

o Distributed computing support: It provides support for different cluster nodes to communicate, interact, and execute tasks effectively.

Hadoop Ecosystem

The Hadoop Ecosystem consists of various tools and frameworks that enhance the capabilities of Hadoop, making it more versatile and scalable for
different big data processing tasks. These tools and frameworks address the
need for distributed storage, data processing, data management, real-time
analytics, and more.

Core Hadoop Components:

 HDFS (Hadoop Distributed File System): Distributed storage layer.

 MapReduce: Distributed data processing model.


 YARN (Yet Another Resource Negotiator): Resource management
layer.

 Hadoop Common: Common utilities and libraries used by all Hadoop modules.

Additional Components in the Hadoop Ecosystem:

1. Apache Hive:

o A data warehousing and SQL-like query engine built on top of Hadoop.

o It provides an SQL-based interface (HiveQL) to interact with the Hadoop distributed storage (HDFS) for querying and managing data.

o Ideal for data analysts who are familiar with SQL but need to process massive amounts of data.

o Use cases: Querying large datasets, summarizing data, and creating reports.

2. Apache HBase:

o A NoSQL database built on top of HDFS, designed for real-time random read/write access to large datasets.

o Suitable for applications requiring fast access to large amounts of structured data.

3. Apache Pig:

o A high-level data flow scripting language used to process large datasets in Hadoop.

o It is designed to simplify the complexities of writing raw MapReduce code by using the Pig Latin language, which is a simple scripting language.

4. Apache Spark:

o A real-time, in-memory data processing engine that provides a faster and more flexible alternative to MapReduce.

o It supports batch processing, streaming analytics, machine learning, and graph processing.
o Spark can be used for interactive querying and real-time
processing.

5. Apache Kafka:

o A distributed messaging system used for real-time data streaming.

o Kafka allows the collection, storage, and real-time processing of data streams, and is widely used for integrating different components of the Hadoop ecosystem.

6. Apache Zookeeper:

o A coordination service that ensures synchronization and management of distributed systems.

o Zookeeper is used by several Hadoop-related components (e.g., HBase, Kafka) to maintain distributed locks, configuration management, and leader election.

7. Apache Flume:

o A service for collecting, aggregating, and moving large amounts of log data to HDFS or other destinations in a Hadoop cluster.

o Commonly used for streaming log data from multiple sources into HDFS.

8. Apache Sqoop:

o A tool used for importing/exporting data between Hadoop and relational databases like MySQL, Oracle, etc.

o It is used for bulk transfer of structured data to and from HDFS and RDBMS.

9. Apache Oozie:

o A workflow scheduler system to manage Hadoop jobs.

o It allows the scheduling and coordination of complex data processing workflows and jobs in Hadoop (like MapReduce jobs, Hive jobs, etc.).

10. Apache Mahout:


 A machine learning library for creating scalable machine learning
algorithms.

 It leverages Hadoop's MapReduce framework to scale machine learning models across large datasets.

Hive Overview

Apache Hive is a data warehousing system built on top of Hadoop that enables users to perform SQL-like queries on large datasets stored in
HDFS. Hive abstracts the complexity of writing MapReduce code by allowing
users to query data using HiveQL, which is similar to SQL.

Key Features of Hive:

1. SQL-Like Query Language: HiveQL allows users to run SQL-style queries on data stored in HDFS.

2. Scalable Data Warehousing: Hive is designed to scale and handle large datasets, often used for data summarization, queries, and reporting.

3. Extensibility: Hive supports user-defined functions (UDFs) to extend its capabilities.

4. Integration with other Hadoop Ecosystem Tools: Hive can integrate with tools like HBase, Apache Pig, and Apache Spark for a more complete big data solution.

5. Partitioning and Bucketing: It supports partitioning and bucketing of data, improving query performance and organizing large datasets.
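A minimal sketch of running HiveQL-style queries, using PySpark with Hive support enabled as the client (an assumption about the environment; the table and column names are invented); it creates a partitioned table and summarizes it.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark use a Hive metastore and HiveQL-style DDL.
spark = (SparkSession.builder
         .appName("hiveql-demo")
         .enableHiveSupport()
         .getOrCreate())

# Create a partitioned table (partitioning improves query pruning).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id INT,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
""")

# A summarizing, SQL-style query over the table.
spark.sql("""
    SELECT order_date, SUM(amount) AS total_sales
    FROM sales
    GROUP BY order_date
    ORDER BY order_date
""").show()
```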

Hive Physical Architecture

The physical architecture of Hive defines how the system works internally, focusing on how it processes, stores, and retrieves data from
HDFS. Hive interacts with HDFS for storage and leverages MapReduce for
processing queries.

Key Components of Hive Physical Architecture:

1. Hive Client:

o The interface through which users submit HiveQL queries.


o Users can interact with Hive using the command-line interface
(CLI), Web UI, or through JDBC/ODBC connections.

2. Hive Metastore:

o The central repository that stores metadata for Hive tables (schema information, partition details, etc.).

o It maintains the structure of the data, but not the actual data, which is stored in HDFS.

o The metastore is essential for schema management and helps in providing structured access to data stored in HDFS.

3. Hive Execution Engine:

o The component responsible for executing HiveQL queries.

o Hive uses MapReduce as the default execution engine for query processing, but it also supports other engines such as Apache Tez or Apache Spark for faster processing.

o The execution engine converts HiveQL queries into a sequence of MapReduce jobs and executes them on Hadoop.

4. Hive Driver:

o The interface between the Hive client and the execution engine.

o It handles the parsing of the HiveQL queries and interacts with the execution engine.

o It also manages session states, such as user configurations and session variables.

5. Hive Query Compiler:

o The compiler processes the HiveQL queries and generates an abstract syntax tree (AST) from the HiveQL query.

o It ensures that the SQL-like syntax in the query is converted into MapReduce jobs that can be executed on the Hadoop cluster.

6. Hive Optimizer:

o The optimizer works on the generated query plan and applies optimization techniques, such as predicate pushdown, column pruning, and join reordering, to improve query performance.

7. HDFS (Hadoop Distributed File System):

o Hive queries are executed on data stored in HDFS. The actual data files are managed and stored on HDFS, which provides scalability and fault tolerance.

8. Execution Framework:

o As mentioned, the execution framework can be MapReduce, Apache Spark, or Apache Tez (a faster, optimized engine). This framework executes the distributed jobs based on the parsed and compiled Hive queries.

Hive Architecture Flow:

1. User Interaction:

o A user submits a HiveQL query using the Hive CLI, Web UI, or
programmatically through JDBC/ODBC connections.

2. Driver:

o The driver receives the query and forwards it to the query compiler.

3. Compiler:

o The compiler parses the query, checks syntax, and generates an abstract syntax tree (AST) to determine the execution plan.

4. Optimizer:

o The optimizer enhances the execution plan by applying various performance optimization techniques.

5. Execution:

o The execution engine translates the optimized plan into MapReduce jobs (or other execution engines like Spark or Tez), which are then executed on the Hadoop cluster.

6. Result:
o After the query is executed, the result is returned to the user,
either via the command-line interface or the chosen client.

Hadoop Limitations

Despite its immense capabilities in handling large-scale data, Hadoop also has certain limitations that make it unsuitable for all use cases. Below are
some of the key limitations of Hadoop:

1. Complexity:

o Hadoop has a steep learning curve, especially for users unfamiliar with distributed systems or MapReduce programming.

o Setting up and managing Hadoop clusters require skilled personnel and is complex for newcomers.

o The integration of multiple tools in the Hadoop ecosystem also requires expertise.

2. Real-Time Processing:

o Hadoop is primarily designed for batch processing, meaning it is not well-suited for real-time data processing.

o While tools like Apache Spark and Apache Storm can provide
real-time capabilities, Hadoop itself is not optimized for low-
latency processing.

3. I/O Intensive:

o Hadoop is heavily reliant on disk I/O, which can lead to


bottlenecks in data processing.

o Since Hadoop’s processing model is based on MapReduce, the


intermediate data between tasks is written to the disk, resulting
in high disk I/O operations and relatively slower processing
speeds compared to in-memory processing.

4. Lack of ACID Transactions:

o Hadoop does not natively support ACID (Atomicity,


Consistency, Isolation, Durability) transactions, which are
critical for certain types of applications that require data
integrity.
o While there are workarounds, such as using HBase or integrating
with Apache Phoenix to provide transactional capabilities, they
do not provide full ACID compliance in the way traditional
databases do.

5. Limited Support for Advanced Analytics:

o Hadoop itself is primarily a distributed storage and


processing system; it does not inherently offer advanced
analytics capabilities (like machine learning, AI models, or
complex queries).

o This limitation can be addressed with additional tools, such as


Apache Mahout for machine learning or Apache Spark for in-
memory processing, but these tools add complexity to the
ecosystem.

6. Security Issues:

o Although Hadoop provides basic security features, such as


Kerberos authentication, authorization, and data
encryption, it often requires additional third-party security tools
to ensure a robust security model.

o Data privacy and access control can be challenging to


implement properly within a Hadoop ecosystem without
additional configurations.

7. Data Quality and Consistency:

o Since Hadoop is designed for unstructured and semi-structured


data, ensuring data consistency and quality can be challenging.

o Managing data formats, schemas, and data quality issues


becomes more complex, especially in large datasets.

8. Cost of Implementation:

o While Hadoop is often touted as being cheaper than traditional


databases for storing large amounts of data, the cost of
infrastructure, maintenance, and management of Hadoop
clusters can add up, especially as the scale of the deployment
increases.

o Cloud services like Amazon EMR help reduce infrastructure


costs but can still be expensive in the long run.
RDBMS (Relational Database Management System) vs Hadoop

The comparison between RDBMS and Hadoop is a key consideration for


organizations looking to process large-scale data. Both systems have their
strengths and weaknesses depending on the type of workload. Below is a
detailed comparison between RDBMS and Hadoop:

 Data Structure – RDBMS: structured data (tables with a predefined schema). Hadoop: unstructured and semi-structured data (e.g., text, logs, JSON, XML).

 Scalability – RDBMS: limited scalability (vertical scaling). Hadoop: highly scalable (horizontal scaling with commodity hardware).

 Data Integrity – RDBMS: provides ACID (Atomicity, Consistency, Isolation, Durability) properties for data integrity. Hadoop: does not natively support ACID properties.

 Data Processing – RDBMS: primarily transactional and real-time processing (using SQL). Hadoop: primarily batch processing with MapReduce (though real-time processing can be added using additional tools like Apache Spark).

 Performance – RDBMS: performs well for small to medium-sized datasets with indexed queries. Hadoop: performs well for large datasets, but has higher latency for I/O-bound operations due to disk-based processing.

 Query Language – RDBMS: uses SQL for querying. Hadoop: uses HiveQL (a SQL-like query language) or custom MapReduce scripts.

 Data Storage – RDBMS: uses tables for storing data, often on disk. Hadoop: uses HDFS (Hadoop Distributed File System) for distributed storage of large datasets.

 Complexity – RDBMS: relatively easy to set up and manage for small-scale systems. Hadoop: complex to set up and manage, especially for large-scale clusters.

 Cost – RDBMS: typically more expensive due to licensing fees for enterprise-grade solutions (e.g., Oracle, SQL Server). Hadoop: can be cheaper for storing vast amounts of data on commodity hardware, but infrastructure and management costs can add up.

 Concurrency – RDBMS: supports high concurrency and many simultaneous transactions. Hadoop: not optimized for high-concurrency transactional workloads.

 Use Cases – RDBMS: suitable for transactional systems, such as banking or inventory systems where data consistency and real-time access are critical. Hadoop: best suited for big data applications like data warehousing, data lakes, and real-time analytics on massive datasets.

 Flexibility – RDBMS: limited flexibility with data types and schema; schema changes are difficult. Hadoop: highly flexible in terms of data types and can handle various formats like JSON, XML, or plain text.

 Data Consistency – RDBMS: strong consistency ensured by relational constraints. Hadoop: eventual consistency model (in HDFS and MapReduce), which may lead to temporary inconsistency.

 Fault Tolerance – RDBMS: data backup is required for fault tolerance; replication in some RDBMS (e.g., MySQL, PostgreSQL). Hadoop: built-in fault tolerance using HDFS, where data is replicated across nodes in the cluster.

When to Use RDBMS:

 When you need strong data integrity and support for complex
transactions (banking, e-commerce platforms, etc.).

 When the data is structured and you need to work with real-time
queries.
 When the dataset is small to medium-sized (fits within the limits of a
traditional server or database system).

When to Use Hadoop:

 When dealing with big data that cannot be processed or stored


effectively using traditional relational databases.

 When data is unstructured or semi-structured (e.g., logs, social media


posts, sensor data).

 When you need to process large datasets using distributed storage


and parallel processing (e.g., data warehousing, data lakes, and big
data analytics).

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is the primary storage


system of Hadoop. It is designed to store large volumes of data across a
distributed cluster of machines, ensuring reliability and scalability. HDFS is
optimized for high throughput and fault tolerance rather than low-
latency access to small files.

Key Characteristics of HDFS:

1. Distributed Storage:

o HDFS divides large files into fixed-size blocks (typically 128 MB


or 256 MB), which are distributed across the nodes in a Hadoop
cluster.

o Each block is replicated multiple times (default is 3 replicas) across different nodes to ensure fault tolerance (see the arithmetic sketch after this list).

2. Fault Tolerance:

o In the event of node failure, HDFS can recover by accessing


replicas of the data stored on other nodes. This replication
ensures data availability even when hardware fails.

o The NameNode in HDFS tracks the locations of each block in the


cluster, and the DataNodes store the actual blocks of data.

3. Block-based Architecture:
o Files are split into blocks, and each block is stored across
different machines in the cluster. This allows parallel processing
and ensures efficient data access.

o Blocks are typically large, which minimizes the overhead caused


by seeking between blocks and improves read/write throughput.

4. Master-Slave Architecture:

o NameNode (Master): The NameNode manages the filesystem


namespace, keeps track of the metadata of all files (e.g., file
names, block locations), and coordinates the storage of data.

o DataNode (Slave): The DataNodes store the actual data blocks.


These nodes handle read and write requests from clients and
send periodic heartbeat signals to the NameNode to indicate
they are alive.

5. High Throughput:

o HDFS is designed for high throughput access to data, making it


well-suited for batch processing. It is optimized for reading and
writing large files, and not for frequent random reads/writes to
small files.

6. Write-once, Read-many Model:

o HDFS follows a write-once, read-many model, which means


that once a file is written, it cannot be modified. This simplifies
data consistency models and is ideal for applications that append
data or need to process data sequentially.

o If modifications are needed, the data must be rewritten as new


files, and the old versions can be discarded.

7. Scalability:

o HDFS can scale by adding more nodes to the cluster. As the data
volume grows, more DataNodes can be added to store the data,
ensuring that the system remains efficient.

8. Data Locality:

o HDFS tries to move the computation to the data (instead of


moving data to the computation), minimizing network congestion
and optimizing performance.
o This is accomplished by running the MapReduce jobs on the
nodes where the data resides, reducing the amount of data
transfer across the network.
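
To make the block and replication arithmetic from points 1 and 2 above concrete, here is a small sketch (not from the original notes) that computes how many blocks a file occupies and how much raw cluster storage its replicas consume; the file size, block size, and replication factor are illustrative values.

python

import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of HDFS blocks, raw storage in MB) for a single file."""
    blocks = math.ceil(file_size_mb / block_size_mb)  # the last block may be partially filled
    raw_storage_mb = file_size_mb * replication       # every byte is stored `replication` times
    return blocks, raw_storage_mb

# Example: a 1 GB (1024 MB) file with 128 MB blocks and the default replication factor of 3
print(hdfs_footprint(1024))  # -> (8, 3072): 8 blocks, 3072 MB of raw storage across the cluster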

Processing Data with Hadoop (MapReduce)

MapReduce is the computational model that Hadoop uses to process large


datasets in a distributed manner across a Hadoop cluster. It breaks down the
processing of data into two main stages: Map and Reduce.

1. Map Phase:

In the Map phase, the input data (typically stored in HDFS) is processed in
parallel across the nodes of the cluster.

 The data is split into chunks (blocks) and distributed across


DataNodes.

 Each Map task processes a block of data and outputs a set of key-
value pairs.

o Input: Each record in the input is processed by the Mapper


function.

o Output: The output of the Map function is a set of intermediate


key-value pairs.

Example: In a word count program, the input data might be a text file, and
the Mapper reads the text, breaking it down into words (key-value pairs like:
"word": 1).

2. Shuffle and Sort:

After the Map phase, Hadoop automatically performs a Shuffle and Sort
step, which groups the intermediate key-value pairs by key and sorts them.

 Shuffle: Groups all the values associated with the same key together
across all nodes in the cluster.

 Sort: Sorts the intermediate key-value pairs so that the Reducer can
process them efficiently.

3. Reduce Phase:

In the Reduce phase, the Reducer processes each group of key-value


pairs that have been shuffled and sorted.
 The Reducer receives the key and a list of values associated with that
key.

 It then processes the values and returns a final key-value pair.

 In the example of word count, the Reducer would aggregate the word
counts for each word, summing up the counts.

Example: The output could be something like: ("word", 5) indicating that the
word appeared 5 times in the text.

4. Output:

The final output from the Reduce phase is written back to HDFS as a set of
files.

MapReduce in Hadoop: Key Components

 InputFormat: Defines how the input data is split and read. It


determines how the data is divided into manageable chunks for the
Map phase.

 Mapper: The function that processes input data, applies a


transformation, and emits key-value pairs.

 Partitioner: Determines how the intermediate key-value pairs are


distributed across Reducers.

 Combiner: An optional optimization that can perform a local reduce


operation on the output of a Mapper before sending it to the Reducer,
reducing the amount of data transferred between Map and Reduce.

 Reducer: The function that aggregates the key-value pairs produced


by the Mappers and generates the final output.

Hadoop Data Processing Workflow:

1. Data Splitting: The input data is split into smaller chunks (blocks) by
HDFS.

2. Map Task Execution: The Map tasks are distributed across nodes in
the Hadoop cluster, each task processing its chunk of the data and
emitting key-value pairs.
3. Shuffle and Sort: The intermediate data produced by Mappers is
shuffled and sorted by the system to group the values for each key.

4. Reduce Task Execution: Reducers process the grouped key-value


pairs, applying the required computations (e.g., summing the counts).

5. Writing Output: The results of the Reduce phase are written back to
HDFS, where they can be accessed for further analysis.
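
The workflow above can be expressed in a few lines of code. The sketch below is a word-count example written for Hadoop Streaming (which runs scripts that read records from stdin and emit key-value pairs on stdout); it is an illustration under that assumption rather than the native Java MapReduce API, and the file names mapper.py / reducer.py are placeholders.

python

import sys

def mapper():
    # Map phase: emit ("word", 1) for every word on every input line (run as mapper.py).
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase (run as reducer.py): Hadoop Streaming delivers the mapper output
    # already shuffled and sorted by key, so counts for the same word arrive consecutively.
    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word == current_word:
            count += int(value)
        else:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")   # flush the last key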

Benefits of Hadoop for Data Processing:

1. Scalability: Hadoop can process massive datasets by distributing the


data and computation across many nodes in a cluster.

2. Fault Tolerance: HDFS ensures that data is replicated across nodes,


so even if a node fails, data can still be accessed from another replica.

3. Cost-Effectiveness: Hadoop can run on commodity hardware, making


it much more cost-effective compared to traditional data storage and
processing systems.

4. Flexibility: It can process a wide variety of data types (structured,


semi-structured, unstructured) and formats (e.g., JSON, XML, text,
images).

Limitations of MapReduce:

1. Latency: MapReduce can be slower due to its reliance on writing


intermediate data to disk between the Map and Reduce phases.

2. Difficulty in Real-Time Processing: Hadoop is optimized for batch


processing and is not suitable for low-latency, real-time data
processing without additional frameworks like Apache Spark.

3. Complexity: Developing efficient MapReduce jobs can be complex,


particularly for developers unfamiliar with the programming model.

Managing Resources and Applications with Hadoop YARN

YARN (Yet Another Resource Negotiator) is the resource management


layer of Hadoop. It is a critical component that was introduced in Hadoop
2.x to improve the resource management and job scheduling in a Hadoop
cluster. YARN decouples the resource management and job scheduling
functionalities, allowing the system to handle multiple applications
concurrently in a more efficient manner.

Key Components of YARN:

1. ResourceManager (RM):

o The ResourceManager is the master daemon responsible for


managing resources across the cluster.

o It has two main components:

 Scheduler: The Scheduler is responsible for allocating


resources to applications based on user-defined policies
(e.g., fairness, capacity).

 ApplicationManager: The ApplicationManager handles


the lifecycle of applications, ensuring they are started,
executed, and monitored correctly.

2. NodeManager (NM):

o NodeManager is responsible for managing resources on a single


node within the Hadoop cluster.

o It monitors resource usage (memory, CPU) on each node and


reports back to the ResourceManager. It also launches and
manages containers (the execution environment for tasks).

3. ApplicationMaster (AM):

o The ApplicationMaster is responsible for managing the


lifecycle of a specific application. It negotiates resources from the
ResourceManager, coordinates with NodeManagers to execute
tasks, and monitors the application's progress.

o There is one ApplicationMaster per application running in the


cluster.

4. Containers:

o Containers are the execution environments allocated by


NodeManagers on nodes. A container can hold one or more tasks
and has a defined amount of resources (CPU, memory, etc.)
based on the application's requirements.
o The ApplicationMaster requests containers from the
ResourceManager and the NodeManager launches these
containers to run tasks.

How YARN Works:

1. Job Submission:

o When a user submits a job to the cluster, the


ResourceManager first allocates resources for the job by
determining which nodes have available resources.

o The ApplicationMaster for the job is then launched in one of


the containers. The ApplicationMaster is responsible for
managing the job's execution.

2. Resource Allocation:

o The ResourceManager uses the Scheduler to allocate


resources based on policies like capacity, fairness, and priority.

o NodeManagers communicate with the ResourceManager,


reporting available resources on their respective nodes, enabling
the ResourceManager to make informed decisions.

3. Task Execution:

o Once resources are allocated, the ApplicationMaster requests


containers from the NodeManagers. The containers hold the
tasks that are executed.

o NodeManagers monitor the execution of tasks within


containers, ensuring that resource usage is within specified limits
and reporting status back to the ResourceManager.

4. Job Completion:

o The ApplicationMaster monitors the progress of the tasks and,


once all tasks are completed, it signals the job's completion. The
ResourceManager cleans up the resources and updates the job
status.

Advantages of YARN:

 Multi-Tenancy: YARN enables multiple applications (MapReduce,


Spark, Tez, etc.) to run concurrently in the same Hadoop cluster,
allowing for improved resource utilization.
 Resource Isolation: YARN allows for resource isolation, ensuring that
different applications do not interfere with each other and each gets
the necessary resources.

 Scalability: YARN can handle a much larger scale of cluster and


applications than the original Hadoop 1.x version, enabling it to
efficiently manage resources in big clusters.

 Flexibility: YARN supports different types of workloads, including


batch processing, real-time processing, and interactive queries.

MapReduce Programming in Hadoop

MapReduce is a programming model used for processing large data sets


with a distributed algorithm on a Hadoop cluster. It is the core computational
model in the Hadoop ecosystem. MapReduce allows developers to write
distributed applications that process data in parallel on a large number of
nodes.

MapReduce Workflow:

MapReduce programs consist of two primary phases:

1. Map Phase:

o The Map function takes an input key-value pair and produces a


set of intermediate key-value pairs.

o This phase involves splitting the input data into chunks (called
splits), which are then processed in parallel by different Mapper
tasks.

o Each Mapper reads a split, processes the data, and emits


intermediate key-value pairs (e.g., "word" -> 1 for a word count
program).

2. Reduce Phase:

o After the Map phase, the intermediate data is shuffled and sorted
based on the key.

o The Reduce function takes each key and a list of associated


values, processes them, and outputs a final result (e.g., sum of
counts for each word in a word count program).
MapReduce Programming Model:

1. Mapper Function:

o Input: A chunk of data (record) from the input file, represented


as a key-value pair.

o Output: Intermediate key-value pairs.

o Example: In a word count program, the Mapper reads lines of


text, splits them into words, and outputs key-value pairs like
("word": 1).

2. Reducer Function:

o Input: Key-value pairs generated by the Map phase (grouped by


key).

o Output: The final result after reducing the intermediate data. For
a word count example, it would sum the counts of each word.

o Example: In the word count program, the Reducer sums the


counts for each word and outputs a final result, like ("word", 5).

MapReduce Example:

Consider a word count program where the task is to count how often each
word appears in a large text file.

1. Map Phase:

o Input: A text file, with lines like "apple orange banana apple".

o Mapper emits: ("apple", 1), ("orange", 1), ("banana", 1), ("apple",


1).

2. Shuffle and Sort:

o The system groups all the pairs by their keys:

 Key: "apple", Values: [1, 1]

 Key: "orange", Values: [1]

 Key: "banana", Values: [1]

3. Reduce Phase:

o The Reducer processes each key and aggregates the values:


 Key: "apple", Values: [1, 1], Output: "apple", 2

 Key: "orange", Values: [1], Output: "orange", 1

 Key: "banana", Values: [1], Output: "banana", 1

4. Final Output:

o The final output is written to the HDFS:

 "apple", 2

 "orange", 1

 "banana", 1

Writing a MapReduce Program:

To write a MapReduce program, you typically implement three key methods:

1. Mapper:

o Processes input data and outputs intermediate key-value pairs.

2. Reducer:

o Aggregates the intermediate key-value pairs and outputs the


final result.

3. Driver:

o Configures the job, sets up input/output paths, and runs the


MapReduce job.

Unit 3
🐝 1. Introduction to Hive

📌 What is Apache Hive?

Apache Hive is a data warehouse infrastructure built on top of Hadoop


for providing data summarization, query, and analysis. Hive allows users
to read, write, and manage large datasets residing in distributed storage
using SQL-like syntax called HiveQL (HQL).

🎯 Why Hive?
 Writing MapReduce manually is complex — Hive simplifies this.

 Ideal for data analysts and non-programmers to query large-scale


datasets.

 Converts HQL into MapReduce, Tez, or Spark jobs behind the scenes.

 Supports structured data stored in formats like Text, ORC, Parquet,


etc.

2. Hive Architecture

Apache Hive's architecture consists of the following components:

🔷 1. User Interface (UI)

Provides various ways for users to interact:

 CLI: Command Line Interface.

 Web UI: Browsers like Hue.

 ODBC/JDBC Drivers: For connecting external tools (e.g., Tableau, Java


apps).

🔷 2. Driver

It manages the lifecycle of a HiveQL query. It acts like the controller:

 Parser: Validates syntax of the query.

 Planner: Creates an execution plan.

 Optimizer: Optimizes the plan for better performance.

 Executor: Executes the plan.

🔷 3. Compiler

 Converts the HiveQL query into DAG (Directed Acyclic Graph) of


tasks.

 Uses MapReduce, Tez, or Spark as execution engines.

🔷 4. Metastore

 Stores metadata: table names, column types, partitions, etc.

 Typically uses RDBMS like MySQL/PostgreSQL.


 Essential for query planning and optimization.

🔷 5. Execution Engine

 Works with Hadoop/YARN to execute jobs.

 Translates the logical plan into physical plan.

🔷 6. HDFS (Hadoop Distributed File System)

 The storage layer where data is actually stored.

 Hive only reads/writes; does not manage storage directly.

🧮 3. Hive Data Types

Hive data types are categorized into Primitive and Complex types.

➤ Primitive Data Types:

 TINYINT – 1-byte signed integer
 SMALLINT – 2-byte signed integer
 INT – 4-byte signed integer
 BIGINT – 8-byte signed integer
 BOOLEAN – True/False
 FLOAT – 4-byte floating point
 DOUBLE – 8-byte floating point
 DECIMAL – Arbitrary-precision numbers
 STRING – Variable-length string
 VARCHAR – String with a specified maximum length
 CHAR – Fixed-length string
 DATE – Date without time
 TIMESTAMP – Date and time

➤ Complex Data Types:

 ARRAY<T> – Ordered collection of elements
 MAP<K,V> – Key-value pairs
 STRUCT – Collection of elements grouped under one record
 UNIONTYPE – Supports multiple types in a single field

Example:

CREATE TABLE student_info (

name STRING,

marks ARRAY<INT>,

address STRUCT<city:STRING, state:STRING>,

metadata MAP<STRING, STRING>

);

💻 4. Hive Query Language (HQL)

🔸 HiveQL is SQL-like, with some differences:

 Case-insensitive

 Schema-on-read (parses data at query time)

🛠 DDL (Data Definition Language)

Used to define and manage schema.

CREATE TABLE employees (
  id INT,
  name STRING,
  salary FLOAT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

DROP TABLE employees;

DESCRIBE employees;

🧱 DML (Data Manipulation Language)

Used to load or insert data into tables.

LOAD DATA LOCAL INPATH '/user/data/employees.csv' INTO TABLE


employees;

INSERT INTO TABLE employees VALUES (1, 'Alice', 7000.0);

🔍 SELECT Queries

SELECT name, salary FROM employees WHERE salary > 5000;

SELECT department, COUNT(*) FROM employees GROUP BY department;

SELECT * FROM employees ORDER BY salary DESC LIMIT 5;

🗂 Partitioning & Bucketing

Partitioning:

Divides table data based on a column's value.

CREATE TABLE logs (
  id INT,
  log_message STRING
)
PARTITIONED BY (log_date STRING);

Bucketing:

Splits data into fixed number of files (buckets).

CREATE TABLE user_data (
  user_id INT,
  name STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS;

🧰 Joins in Hive:

SELECT a.name, b.salary

FROM dept a

JOIN emp b

ON a.id = b.dept_id;

✅ Summary

 Hive – SQL-like querying for Hadoop; built for scalability
 Architecture – UI → Driver → Compiler → Execution Engine ↔ Metastore & HDFS
 Data Types – Primitive (INT, STRING, ...) and Complex (ARRAY, MAP, STRUCT)
 HiveQL – Similar to SQL: supports DDL, DML, SELECT, JOIN, partitioning, and bucketing

🐷 1. Introduction to Pig

📌 What is Apache Pig?

Apache Pig is a high-level platform for creating MapReduce programs used


with Hadoop. It uses a scripting language called Pig Latin, which
abstracts the complexity of writing raw MapReduce code.

🎯 Key Goals of Pig:

 Simplify the development of big data transformation tasks.

 Allow developers to process large datasets without writing low-level


Java MapReduce code.

✅ Features of Pig:

 Pig Latin: A high-level, procedural scripting language.

 Automatically converts Pig Latin scripts into MapReduce jobs.

 Works with structured, semi-structured, and unstructured data.

 Can process data stored in HDFS, HBase, or Hive.

 Supports UDFs (User Defined Functions) in Java, Python, etc.

🧬 2. Anatomy of Pig

Here’s what a typical Pig environment looks like and how it functions:

🔷 Components of Pig:

 Pig Latin Scripts – Code written by the user to process data
 Pig Compiler – Converts scripts into logical and physical plans
 Execution Engine – Runs the actual tasks (MapReduce/Spark/Tez)
 HDFS – Stores the input and output data

🔁 Execution Flow (Anatomy)

1. Write Pig Script (in Pig Latin)

pig

data = LOAD '/user/data/employees.csv' USING PigStorage(',') AS (id:int,


name:chararray, salary:float);

highEarners = FILTER data BY salary > 5000;

DUMP highEarners;

2. Parse & Semantic Check

o Syntax checking and type resolution.

3. Logical Plan Generation

o Operator-based logical structure is built (LOAD → FILTER →


DUMP).

4. Optimization

o Apply rule-based optimizations (e.g., push filters early).

5. Physical Plan Generation

o Convert into a plan of physical operations.

6. MapReduce Jobs

o Converted into one or more MR jobs and executed on Hadoop.

📘 Modes of Execution:

 Local Mode – Runs on the local file system without Hadoop
 MapReduce Mode – Runs on HDFS using the Hadoop cluster (default)

🧩 3. Pig on Hadoop

Apache Pig is deeply integrated with Hadoop, and it executes scripts as


MapReduce jobs.

🧱 Integration Points:

 HDFS: Pig reads input from and writes output to HDFS.

 YARN: Pig scripts are executed using MapReduce jobs scheduled by


Hadoop YARN.

 MapReduce Engine: Pig converts each step of its script into a


MapReduce job.

Example Pig Latin Script on Hadoop:

pig

-- Load data from HDFS

logs = LOAD '/data/server_logs.txt' USING PigStorage('\t') AS (ip:chararray,


url:chararray);

-- Filter entries

filtered = FILTER logs BY url MATCHES '.*.jpg';

-- Count accesses

grouped = GROUP filtered BY ip;

counts = FOREACH grouped GENERATE group AS ip, COUNT(filtered) AS


access_count;

-- Store output in HDFS


STORE counts INTO '/output/image_access_counts' USING PigStorage(',');

🔄 Pig vs Hive vs MapReduce

 Language – Pig: Pig Latin (procedural); Hive: HiveQL (declarative); MapReduce: Java (low-level)
 Use Case – Pig: data transformation; Hive: data analysis; MapReduce: custom processing
 Ease of Use – Pig: medium; Hive: easy; MapReduce: complex
 Execution Engine – Pig: MapReduce/Tez/Spark; Hive: MapReduce; MapReduce: native

✅ Summary

 Apache Pig – High-level platform for writing MapReduce programs using Pig Latin
 Anatomy of Pig – Scripting → Parsing → Logical Plan → Optimization → Physical Plan → MR Jobs
 Pig on Hadoop – Pig runs on top of Hadoop, using HDFS for storage and MapReduce for execution

✅ Use Case for Pig (In Detail)

🔍 What kind of problems does Pig solve?

Pig is best for analyzing and transforming large datasets. It's especially
useful for:

 ETL jobs (Extract, Transform, Load)

 Data preparation for Machine Learning

 Log file processing

 Ad-hoc data analysis


🔧 Real-world Use Cases:

📊 1. Log Analysis (e.g., Server Logs)

Problem: A company has terabytes of log data from web servers and wants
to find how many requests came from each country.

Solution with Pig:

 Load logs into Pig from HDFS.

 Extract the IP field and map it to locations.

 Group by country and count.

🛒 2. Retail Analytics

Problem: An e-commerce platform wants to analyze the average spend per


customer.

Pig Tasks:

 Load transaction data.

 Group by customer ID.

 Calculate total and average spend.

📈 3. Preprocessing for ML

Problem: Before feeding data into a Machine Learning algorithm, it needs to


be cleaned, normalized, and filtered.

Pig Tasks:

 Remove nulls/duplicates.

 Normalize values.

 Generate features from raw data.

🧪 4. Data Sampling

For data scientists who need only a sample of data for testing or
visualization.

🔄 ETL Processing in Pig (In Detail)


📌 What is ETL?

 Extract: Get data from sources like HDFS, Hive, relational databases.

 Transform: Apply filters, joins, calculations, or clean data.

 Load: Store it back to HDFS, Hive, or another system.

🔃 Pig in ETL

Pig makes ETL processes simpler through Pig Latin – a script-based


language that supports:

 Filtering

 Sorting

 Joining

 Grouping

 Aggregation

🔧 Example ETL Pipeline:

Input (CSV file in HDFS):


1,John,Sales,5600

2,Alice,HR,4300

3,Bob,Sales,7000

🔤 Pig Script:

pig


-- Extract

data = LOAD '/user/hr/employees.csv' USING PigStorage(',')

AS (id:int, name:chararray, dept:chararray, salary:float);


-- Transform

filtered = FILTER data BY salary > 5000;

grouped = GROUP filtered BY dept;

avg_sal = FOREACH grouped GENERATE group AS department,


AVG(filtered.salary) AS average_salary;

-- Load

STORE avg_sal INTO '/user/output/high_earners' USING PigStorage(',');

✅ Output (in HDFS):


Sales,6300.0

🔢 Data Types in Pig (In Detail)

Pig supports both primitive types and complex types.

🔹 Primitive (Scalar) Data Types:

 int (e.g., 100) – 32-bit signed integer
 long (e.g., 10000000000) – 64-bit integer
 float (e.g., 12.5f) – 32-bit floating point
 double (e.g., 13.456) – 64-bit floating point
 chararray (e.g., "John") – String of characters
 bytearray (e.g., binary data) – Raw byte data (uninterpreted)
 boolean (e.g., true) – Boolean value


🔸 Complex Data Types:

1. Tuple

A tuple is a collection of fields (like a row).

pig


(id, name, salary)

(1, 'Alice', 5000)

2. Bag

A bag is a collection of tuples (like a table). Bags can have duplicates.

pig


{(1, 'Alice'), (2, 'Bob'), (1, 'Alice')}

3. Map

A map is a key-value pair.

pig


[name#'John', age#30, dept#'HR']

🧪 Data Type Example in Schema:

pig


student = LOAD 'students.txt' USING PigStorage(',')

AS (roll:int, name:chararray, scores:bag{t:tuple(subject:chararray,


marks:int)});

This defines:

 A roll number (int)

 A name (string)
 A bag of subject-mark pairs

🏃 Running Pig (In Detail)

Modes of Execution:

 Local Mode – For testing on a local machine
 MapReduce Mode – For real execution on a Hadoop cluster

🔹 Starting Pig in Local Mode:

bash


pig -x local

🔹 Starting Pig in Hadoop Mode:

bash


pig

🧾 Running Pig Scripts

1. Interactive Mode (Grunt Shell)

Open shell:

bash


pig

Commands:

pig

grunt> data = LOAD 'file.txt' AS (name:chararray);

grunt> DUMP data;

2. Batch Mode (Script File)

Write script:

-- File: etl_script.pig

data = LOAD '/data/input.csv' USING PigStorage(',') AS (id:int,


name:chararray);

filtered = FILTER data BY id > 5;

DUMP filtered;

Run script:

bash


pig etl_script.pig

3. Embedded Mode (Java)

You can also run Pig from Java using PigServer.

🔚 Summary Table

 Use Case – Data transformation, analysis, and preparation
 ETL – Extract from HDFS, transform via Pig, load the results
 Data Types – Primitive (int, float, chararray), Complex (tuple, bag, map)
 Running Pig – CLI (Grunt), script, local/Hadoop modes

✅ 1. Execution Model of Pig


Apache Pig follows a step-by-step dataflow model using a scripting
language called Pig Latin, which is translated into MapReduce jobs under
the hood.

🔸 Pig Execution Flow:

1. Pig Latin Script

o You write data transformation logic in Pig Latin.

2. Parser

o The script is parsed to check syntax and semantics.

o A logical plan is generated.

3. Optimizer

o The logical plan is optimized (e.g., combining filters, removing


redundant steps).

o Converts into a physical plan.

4. Compiler

o Converts the physical plan into a sequence of MapReduce jobs.

5. Execution

o Jobs are submitted to the Hadoop cluster (or run locally if in local
mode).

o Results are collected and returned.

Execution Modes:

 Local – Runs on a single JVM; good for testing
 MapReduce – Default mode; runs on Hadoop

🔄 2. Operators in Pig
Pig provides relational operators similar to SQL but more flexible.

🔹 Core Pig Operators:

 LOAD – Loads data from HDFS/local, e.g., LOAD 'file.csv'
 STORE – Saves output, e.g., STORE result INTO 'output/'
 DUMP – Prints data to the console, e.g., DUMP A
 FILTER – Filters rows, e.g., FILTER A BY salary > 5000
 FOREACH – Iterates over each row, e.g., FOREACH A GENERATE name, salary
 GROUP – Groups rows by key, e.g., GROUP A BY dept
 JOIN – Joins two datasets, e.g., JOIN A BY id, B BY id
 ORDER – Orders data, e.g., ORDER A BY salary DESC
 DISTINCT – Removes duplicates, e.g., DISTINCT A
 LIMIT – Limits records, e.g., LIMIT A 10
 UNION – Combines datasets, e.g., UNION A, B
 SPLIT – Splits a dataset, e.g., SPLIT A INTO X IF cond, Y IF cond2

3. Functions in Pig

Pig offers a wide range of built-in functions and allows custom UDFs
(User Defined Functions).

🔹 Built-in Functions:

📊 Aggregate Functions:
 COUNT() – Counts rows, e.g., FOREACH G GENERATE COUNT(A)
 SUM() – Total value, e.g., SUM(A.salary)
 AVG() – Average, e.g., AVG(A.salary)
 MIN() – Minimum value, e.g., MIN(A.salary)
 MAX() – Maximum value, e.g., MAX(A.salary)

🔤 String Functions:

 CONCAT() – Combines strings
 STRSPLIT() – Splits a string
 UPPER() – Converts to uppercase
 LOWER() – Converts to lowercase

🔢 Math Functions:

 ABS() – Absolute value
 ROUND() – Rounds a value
 SQRT() – Square root


🔸 UDFs (User Defined Functions)

You can write custom functions in Java, Python, or other languages.

java

package com.example;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        // Upper-case the first field of the input tuple
        return input.get(0).toString().toUpperCase();
    }
}
Register in Pig:

pig


REGISTER 'myfuncs.jar';

DEFINE ToUpper com.example.UpperCase();
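
As noted above, UDFs can also be written in Python and run through Pig's Jython support. The following is a minimal sketch under that assumption; the file, function, and alias names (string_udfs.py, to_upper, pyudf) are made up, and the outputSchema decorator is the one Pig's documentation describes for Jython UDF scripts.

python

# string_udfs.py -- a sketch of a Python (Jython) UDF for Pig.
try:
    outputSchema  # normally injected by Pig's Jython runtime
except NameError:
    def outputSchema(schema):  # no-op fallback so the file also runs outside Pig
        def wrap(func):
            return func
        return wrap

@outputSchema("upper_name:chararray")
def to_upper(name):
    # Guard against null fields, then upper-case the string.
    if name is None:
        return None
    return name.upper()

# Registration and use from a Pig Latin script (shown here as comments):
#   REGISTER 'string_udfs.py' USING jython AS pyudf;
#   result = FOREACH data GENERATE pyudf.to_upper(name);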

🔢 4. Data Types in Pig (Deep Dive)

🟢 Primitive Data Types:

 int (e.g., 25) – 32-bit integer
 long (e.g., 10000000000) – 64-bit integer
 float (e.g., 12.5f) – 32-bit float
 double (e.g., 3.14159) – 64-bit float
 chararray (e.g., 'hello') – String
 bytearray – Binary data
 boolean (e.g., true) – True/False

🔵 Complex Data Types:

 Tuple – Ordered collection of fields, e.g., (1, 'John')
 Bag – Collection of tuples, e.g., {(1,'A'),(2,'B')}
 Map – Key-value pairs, e.g., [name#'John', age#25]

🔸 Example of Nested Data:

pig


students = LOAD 'marks.txt' USING PigStorage(',')

AS (id:int, name:chararray, scores:bag{t:tuple(subject:chararray, mark:int)});

Here:

 scores is a bag of tuples containing subject and mark.

📘 Summary Table:

 Execution Model – Converts Pig Latin → Logical Plan → MapReduce Jobs
 Operators – LOAD, FILTER, JOIN, GROUP, DUMP, FOREACH, ORDER, etc.
 Functions – Built-in (COUNT, AVG, CONCAT) and custom UDFs
 Data Types – Primitive (int, chararray), Complex (tuple, bag, map)

UNIT 4
🔰 Introduction to NoSQL

✅ What is NoSQL?

NoSQL stands for "Not Only SQL". It refers to a category of non-relational


databases designed to handle:

 Large volumes of unstructured or semi-structured data,

 High scalability and performance,

 Flexible schemas (no fixed table structure),

 Real-time or near-real-time data processing.

💡 Key Features:

 Schema-less – No predefined schema; flexible data models
 Scalability – Easily scales out using distributed clusters
 High Availability – Built to handle failures gracefully
 Fast Performance – Optimized for read/write throughput

🧠 Why Use NoSQL?


Traditional RDBMS may struggle with:

 Huge datasets (big data),

 High-speed streaming data (e.g., logs, IoT),

 Complex hierarchical data (like JSON/XML),

 Cloud-based scalability.

Hence, NoSQL fits use cases where relational schemas limit


performance or flexibility.

💼 NoSQL Business Drivers

🔍 Why Businesses Adopt NoSQL:

 Big Data Growth – Data from social media, IoT, logs, and sensors is exploding; NoSQL handles large-scale data better.
 Agility and Speed – Developers need faster development cycles; NoSQL allows dynamic data models.
 Cloud Native Architecture – NoSQL is often designed to run on distributed cloud environments.
 Real-Time Analytics – Applications like fraud detection or recommendation systems need real-time responses.
 Global Scalability – NoSQL systems (like Cassandra) support geo-distributed databases.

🧾 Business Examples:

 Facebook, Twitter – Store and retrieve user-generated content.

 Netflix – Uses Cassandra for global scalability.

 Amazon – Uses DynamoDB to handle millions of transactions per


second.

NoSQL Data Architectural Patterns


NoSQL offers multiple data models and architectural patterns for different
needs:

🔸 1. Key-Value Store

 🧱 Structure: Key → Value (like a hashmap)

 ⚡ Use Case: Session storage, caching

 🧰 Examples: Redis, Riak, DynamoDB

json
"user123": {"name": "John", "age": 25}

🔹 2. Document Store

 📦 Structure: Documents in JSON, BSON, or XML

 📖 Use Case: Content management systems, catalogs

 🧰 Examples: MongoDB, CouchDB, Firebase

json

{
  "id": "123",
  "name": "Alice",
  "address": {
    "city": "NY",
    "zip": "10001"
  }
}
🔺 3. Column Family Store


 📊 Structure: Columns grouped in families (like wide tables)

 💼 Use Case: Analytics, event logs, data warehousing

 🧰 Examples: Apache Cassandra, HBase

RowKey: 101

Name: Alice

Subject: Math

Marks: 95

🔘 4. Graph Store

 🌐 Structure: Nodes (entities) and Edges (relationships)

 🔄 Use Case: Social networks, fraud detection, recommendation engines

 🧰 Examples: Neo4j, ArangoDB

(Alice) --[FRIEND]--> (Bob)

🧩 Summary of Patterns:

 Key-Value – Caching, sessions – Redis, DynamoDB
 Document – CMS, product catalogs – MongoDB, CouchDB
 Columnar – Logs, analytics – Cassandra, HBase
 Graph – Relationships – Neo4j, Amazon Neptune

📌 Architectural Patterns in NoSQL Systems:

 Sharding – Horizontal partitioning; splits data across nodes
 Replication – Copies data to multiple servers for fault tolerance
 CAP Theorem – You can only guarantee 2 of 3: Consistency, Availability, Partition Tolerance
 Eventually Consistent – Common in distributed NoSQL; data becomes consistent over time

🔄 Variations of NoSQL Architectural Patterns

NoSQL databases support various architectural variations to optimize


performance, scalability, and fault tolerance when managing Big Data.

✅ 1. Shared Nothing Architecture

 Every node is independent.

 No single point of failure.

 Best for horizontal scaling.

 Used by: Cassandra, MongoDB, DynamoDB

✅ 2. Sharding (Horizontal Partitioning)

 Data is split across multiple shards (nodes) using a shard key.

 Enables parallel processing of large datasets.

 Example (a small routing sketch follows after this list):

o Shard 1: users with ID 1–1000

o Shard 2: users with ID 1001–2000


✅ 3. Replication

 Copies of data are maintained across multiple servers.

 Provides high availability and fault tolerance.

 Replication factor determines how many copies exist.

✅ 4. MapReduce Pattern

 Batch processing of large datasets.

 Data is divided into chunks and processed in parallel.

 Common in document stores and columnar databases.

✅ 5. Eventual Consistency

 In distributed systems, updates propagate gradually.

 Prioritizes availability and partition tolerance over immediate


consistency.

 Used by systems like Cassandra, DynamoDB.
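
The routing sketch referred to in pattern 2 above: a toy illustration (not from the notes) of how a shard key decides which shard owns a record, shown for both hash-based and range-based sharding.

python

NUM_SHARDS = 4

def hash_shard(user_id: int) -> int:
    # Hash-based sharding: spread keys roughly evenly across shards.
    return hash(user_id) % NUM_SHARDS

def range_shard(user_id: int) -> int:
    # Range-based sharding, matching the example above:
    # IDs 1-1000 -> shard 0, IDs 1001-2000 -> shard 1, and so on.
    return (user_id - 1) // 1000

print(hash_shard(42), range_shard(42))      # range shard 0
print(hash_shard(1500), range_shard(1500))  # range shard 1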

🧠 Use of NoSQL to Manage Big Data

NoSQL databases are optimized for handling:

 Volume – Can handle petabytes of data.

 Velocity – Supports fast insert/read operations.

 Variety – Handles structured, semi-structured, and unstructured data.

📦 Example Use Cases:


Use Case – NoSQL Pattern – Database

 Real-time analytics – Column-oriented – Apache Cassandra
 Product catalogs – Document store – MongoDB
 Social networking – Graph store – Neo4j
 IoT & sensors – Key-value – Redis
 Fraud detection – Graph + Document – ArangoDB, MongoDB

🍃 Introduction to MongoDB

📌 What is MongoDB?

 MongoDB is a document-oriented NoSQL database.

 Stores data in JSON-like documents (BSON format).

 Highly flexible, scalable, and widely used in web and big data apps.

🧱 MongoDB Architecture

 Document – Basic unit of data (like a row)
 Collection – Group of documents (like a table)
 Database – Container for collections
 Replica Set – Group of MongoDB servers for redundancy
 Sharding – Splitting data across multiple machines for scaling
MongoDB Document Example:

json

{
  "_id": "123",
  "name": "John Doe",
  "email": "[email protected]",
  "orders": [
    {"id": 1, "item": "Laptop", "price": 750},
    {"id": 2, "item": "Mouse", "price": 25}
  ]
}
 Nested structures are allowed

 No fixed schema – fields can vary between documents
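
Working with such documents from code is straightforward. The sketch below is illustrative only: it assumes the PyMongo driver and a MongoDB server on localhost:27017, and the database and collection names (shop, users) are made up.

python

from pymongo import MongoClient  # assumes the PyMongo driver

client = MongoClient("mongodb://localhost:27017")  # assumed local MongoDB server
db = client["shop"]

# Insert a schema-less document into the "users" collection.
db.users.insert_one({
    "_id": "123",
    "name": "John Doe",
    "orders": [{"id": 1, "item": "Laptop", "price": 750}],
})

# MQL query: users with at least one order costing more than 500.
for doc in db.users.find({"orders.price": {"$gt": 500}}):
    print(doc["name"])

# Secondary index to speed up lookups on the "name" field.
db.users.create_index("name")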

Key Features:

 Schema-less – No need to define structure in advance
 Query Language – MongoDB Query Language (MQL), JSON-based
 Indexing – Fast search on any field
 Aggregation Framework – Like SQL GROUP BY, but more powerful
 High Availability – Supports replication and automatic failover
 Horizontal Scalability – Built-in sharding support

🚀 Use Cases of MongoDB:

 Real-time analytics

 Product catalogs

 CMS (Content Management Systems)

 IoT platforms

 Social apps

UNIT 5
Mining Social Network Graphs

📌 What is Social Network Mining?

Social network mining is the process of extracting patterns,


relationships, and useful information from social network data. It
involves using graph theory, data mining, and machine learning to
analyze social connections.

🔰 Introduction to Social Network Mining

Social networks represent people or entities as nodes and their


relationships as edges in a graph.

Examples:

 Facebook: Users are nodes, friendships are edges.

 Twitter: Users are nodes, "follows" are directed edges.

 LinkedIn: Professional connections.


🔍 Goals of Social Network Mining:

 Community Detection – Finding groups of closely connected nodes
 Influencer Identification – Finding nodes with high influence (centrality)
 Recommendation Systems – Suggesting friends, content, or products
 Anomaly Detection – Spotting unusual patterns like spam bots or fraud
 Information Propagation – Studying how content or ideas spread

📱 Applications of Social Network Mining

1. Marketing & Advertisement

 Identifying influencers to promote products.

 Viral marketing strategies.

2. Recommendation Engines

 Friend recommendations, product suggestions (like Amazon or Netflix).

3. Fraud & Spam Detection

 Detecting fake accounts or abnormal patterns in communication.

4. Epidemic Modeling

 Studying how diseases or information spread across people.

5. Security & Surveillance

 Monitoring suspicious social interactions in criminal networks.

6. Political & Sentiment Analysis

 Understanding public opinion or political campaigns on social


platforms.
📊 Social Networks as a Graph

Social networks can be modeled using graph data structures:

🧱 Graph Components:

 Nodes (Vertices) – Represent people, accounts, or entities
 Edges (Links) – Represent relationships (friendship, follows, etc.)
 Directed Edge – A → B means A follows B (e.g., Twitter)
 Undirected Edge – A – B means a mutual relationship (e.g., Facebook friendship)
 Weighted Edge – Represents the strength of a connection (number of messages, likes, etc.)

🧠 Example Graph Types:

1. Undirected Graph – Mutual relationships

2. Directed Graph (Digraph) – One-way relationships

📏 Common Graph Metrics:

 Degree Centrality – Number of connections a node has
 Betweenness Centrality – Influence of a node over information flow
 Closeness Centrality – How quickly a node can reach all others
 Density – How interconnected the network is
 Clustering Coefficient – How likely nodes are to form triangles (groups)

🧮 Tools & Libraries:

 NetworkX (Python) – Graph manipulation and analysis

 Gephi – Visualization of social graphs

 Neo4j – Graph database used for social network modeling

 GraphX (Apache Spark) – Scalable graph processing
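
As a concrete illustration of the graph model and the metrics above, the sketch below builds a tiny, made-up friendship graph with NetworkX (the first library listed) and computes the metrics from the table.

python

import networkx as nx  # assumes the NetworkX library

# A tiny, made-up undirected friendship graph: people are nodes, friendships are edges.
G = nx.Graph()
G.add_edges_from([
    ("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Carol"),  # a closed triangle
    ("Carol", "Dave"), ("Dave", "Eve"),                      # Carol bridges to Dave and Eve
])

print(nx.degree_centrality(G))       # number of connections per node (normalized)
print(nx.betweenness_centrality(G))  # influence over shortest-path information flow
print(nx.closeness_centrality(G))    # how quickly a node can reach all others
print(nx.density(G))                 # how interconnected the whole network is
print(nx.clustering(G))              # tendency of each node's neighbours to form triangles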

Types of Social Networks

Social networks can be categorized based on the nature of relationships,


purpose, and structure. Here are the main types:

1. Personal or Ego-Centric Networks

 Focus on a single individual (ego) and their direct connections.

 Example: Facebook profile with friends.

2. Collaboration Networks

 Formed by individuals working together.

 Example: Co-authorship networks in academic research (authors as


nodes, shared papers as edges).

3. Communication Networks

 Nodes are individuals; edges represent communication (calls, emails).

 Example: Email exchange networks within an organization.

4. Information Networks

 Nodes are pieces of content, and edges represent citation or reference.

 Example: Citation network in research papers.

5. Online Social Networks (OSNs)

 Platforms like Facebook, Instagram, Twitter where nodes are users and
edges represent various interactions (likes, comments, follows).
🧩 Clustering of Social Graphs

Clustering in social graphs is the process of grouping users (nodes) who


are more connected to each other than to the rest of the graph.

🔍 Why Cluster?

 Identify communities, interest groups, or social circles

 Useful for:

o Targeted marketing

o Recommendation systems

o Influencer identification

📈 Common Clustering Techniques:

 K-means (after graph embedding) – Clusters nodes based on feature vectors
 Hierarchical Clustering – Builds a tree of clusters
 Spectral Clustering – Uses eigenvalues of the graph Laplacian matrix
 Label Propagation – Spreads labels through the network to form clusters
 Girvan–Newman Algorithm – Removes edges with high betweenness to find communities

🧠 Direct Discovery of Communities in a Social Graph

Community detection is the process of identifying dense subgraphs or


clusters of nodes that are more connected within than outside.

📌 Popular Algorithms:

1. Modularity-Based Detection (Louvain Algorithm)

o Measures how well a graph is partitioned into communities.


o High modularity = strong community structure.

2. Clique Percolation

o Communities are overlapping groups formed by k-cliques.

3. Edge Betweenness (Girvan–Newman)

o Removes “bridge” edges (high betweenness) to split the


network.

4. Label Propagation Algorithm (LPA)

o Labels are propagated iteratively and nodes adopt the most


frequent label among neighbors.

Visual Example:

[Community A] — [Bridge Nodes] — [Community B]

Each community is tightly connected within, but has few connections


outside.
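
These algorithms are available off the shelf. The sketch below (illustrative, not from the notes) runs NetworkX's modularity-based, label propagation, and Girvan–Newman community detection on a small synthetic graph shaped like the picture above: two dense groups joined by a single bridge edge.

python

import networkx as nx
from networkx.algorithms import community  # assumes the NetworkX library

# Two dense, made-up communities joined by one bridge edge.
G = nx.Graph()
G.add_edges_from([("a1", "a2"), ("a1", "a3"), ("a2", "a3")])  # Community A
G.add_edges_from([("b1", "b2"), ("b1", "b3"), ("b2", "b3")])  # Community B
G.add_edge("a1", "b1")                                        # bridge between the communities

# Modularity-based detection (a greedy variant of the Louvain idea).
print(list(community.greedy_modularity_communities(G)))

# Label propagation: nodes repeatedly adopt the most frequent label among neighbours.
print(list(community.label_propagation_communities(G)))

# Girvan-Newman: remove the highest-betweenness (bridge) edges to split the graph.
first_split = next(community.girvan_newman(G))
print([sorted(c) for c in first_split])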

🎯 Introduction to Recommender Systems

Recommender systems are software tools and techniques that provide


suggestions for items (movies, products, people, etc.) that are likely to be
of interest to a user.

📦 Types of Recommender Systems:

 Content-Based Filtering – Recommends items similar to those the user liked in the past, e.g., "You watched Inception → Suggest Interstellar"
 Collaborative Filtering – Recommends items liked by similar users, e.g., "People who liked this also liked..."
 Hybrid Systems – Combines content-based and collaborative filtering, e.g., Netflix, Amazon
 Social Recommenders – Uses data from social networks (friends' likes), e.g., Spotify: "Your friend liked this playlist"

🔧 Algorithms Used:

 Cosine Similarity (see the sketch after this list)

 Matrix Factorization (SVD, ALS)

 Deep Learning (Neural Collaborative Filtering)

 Graph-Based Recommendations (Personalized PageRank)

🧠 Real-Life Examples:

 Amazon – Products based on purchase/view history
 Netflix – Movies based on viewing and ratings
 LinkedIn – People you may know
 YouTube – Videos based on watch history and subscriptions
