Big Data Notes
Unit 1
Introduction to Big Data
What is Big Data?
Big Data refers to massive volumes of data that are generated, stored, and
analyzed for insights to improve decision-making. This data can be
structured, semi-structured, or unstructured and is too complex to be
processed using traditional data management tools. Big Data is collected
from multiple sources, including social media, sensors, IoT devices, financial
transactions, healthcare records, and digital applications.
The significance of Big Data lies in its ability to uncover hidden patterns,
correlations, and insights that were previously inaccessible due to
computational limitations. Organizations use Big Data to improve operations, personalize customer experiences, and detect risks, as the examples below show.
Example:
A single Boeing 787 aircraft generates 500GB of data per flight from sensors monitoring engine performance.
Example:
In fraud detection, financial institutions filter out false positives (legitimate transactions flagged as fraudulent) by analyzing spending patterns and customer behavior.
The main purpose of Big Data is to extract useful business insights that improve decision-making.
Types of Big Data
1. Structured Data
Data that follows a fixed schema and is stored in rows and columns, making it easy to query with SQL.
Examples: Relational database tables, spreadsheets, financial transaction records.
2. Unstructured Data
Data that does not have a specific format, making it difficult to store
and analyze using traditional tools.
Examples: Images, videos, audio files, emails, social media posts.
3. Semi-structured Data
Data that does not follow a strict schema but contains tags, metadata,
or markers to define structure.
Examples: JSON and XML files, email headers, web server logs.
Traditional Data vs. Big Data
Definition: Traditional data is structured data managed in conventional relational databases; Big Data is data whose volume, velocity, or variety exceeds what traditional tools can handle.
Volume: Traditional data grows slowly and fits on a single server or database; Big Data grows exponentially with data generated from IoT, social media, and real-time applications.
Variety: Traditional data is mostly structured with a fixed schema; Big Data includes data from social media, sensors, emails, videos, logs, and IoT devices.
Velocity: Traditional data is processed in periodic batches (e.g., a business runs monthly reports using SQL queries); Big Data uses Apache Kafka, Apache Spark, and Flink for streaming data processing, which is essential for applications like fraud detection, stock trading, and IoT monitoring.
Scale: Big Data systems operate at web scale, e.g., Google indexing billions of web pages for search results.
Scalability and cost: Traditional systems scale up on a single machine; Big Data scales out across clusters and leverages cloud-based solutions like AWS, Google Cloud, and Azure for cost-effective scaling, e.g., Netflix uses AWS cloud storage for its massive video streaming data.
Evolution of Big Data
Big Data has evolved through several stages, each with its own key characteristics, major developments, and challenges:
Early relational database systems could not handle unstructured data (images, videos, emails, logs).
Distributed frameworks such as Hadoop, and later in-memory and streaming engines, made it possible to store and process data across clusters of commodity machines, including in real time.
Today, IoT and AI drive new applications: smart cities use IoT-generated data for traffic, waste, and energy management, and augmented analytics applies AI to automate insight generation.
Expected Developments:
✅ AI-Augmented Decision-Making: analytics systems that recommend or even automate business decisions.
Challenges of Big Data and Their Solutions
1. Storage
Example:
Facebook stores petabytes of user data using a combination of Hadoop, Hive, and cloud-based infrastructure.
2. Real-Time Processing
Example:
Uber processes ride requests in real time using Apache Kafka and Spark Streaming.
3. Variety and Integration
Problem:
Big Data comes from multiple sources – IoT sensors, social media, emails, logs, images, videos, etc.
Solutions:
✅ ETL (Extract, Transform, Load) Pipelines – Tools like Apache NiFi, Talend, and Apache Beam automate data integration.
✅ Data Lakes – Platforms like AWS Lake Formation store raw, unstructured data efficiently.
✅ Schema-on-Read Approach – Allows flexible querying of diverse data formats without predefining strict schemas.
4. Data Quality
Problem:
Missing values and noise make data less reliable for analytics and AI models.
5. Security and Privacy
Problem:
Big Data contains sensitive personal, financial, and business information.
6. Cost
Example:
Google optimizes Big Data costs by using AI-powered workload scheduling and serverless infrastructure.
7. Regulatory Compliance
Example:
Amazon was fined €746 million for GDPR violations in 2021 due to improper handling of user data.
Big Data storage technologies are essential for storing massive volumes
of structured and unstructured data efficiently.
a) Distributed File Systems
Used to store data across multiple nodes to ensure scalability and fault tolerance.
✅ HDFS (Hadoop Distributed File System) – Splits files into blocks and replicates them across the cluster.
b) Cloud Storage
✅ AWS S3, Google Cloud Storage, Azure Blob Storage – Cloud storage
services that offer high availability and scalability.
✅ Snowflake – A cloud-based data warehouse optimized for analytics.
✅ MinIO – An open-source alternative to AWS S3 for private cloud storage.
c) Data Warehousing
Optimized for storing structured and semi-structured data for analytical
processing.
Big Data Processing Frameworks
These frameworks process large datasets efficiently using batch and real-time processing methods.
Stream Processing
Stream-processing engines process continuous streams of data from IoT devices, sensors, and social media.
Big Data requires specialized databases that can handle massive amounts
of structured, semi-structured, and unstructured data.
a) NoSQL Databases
Designed for scalability and high availability, ideal for handling
unstructured data.
b) NewSQL Databases
Combine the horizontal scalability of NoSQL with the ACID transactions of traditional relational databases (e.g., Google Spanner, CockroachDB).
c) Graph Databases
✅ Neo4j – A popular graph database used for social media and fraud
analytics.
✅ Amazon Neptune – A fully managed graph database optimized for deep
link analytics.
✅ ArangoDB – A multi-model NoSQL database that supports graph,
document, and key-value data.
Big Data Security and Governance Tools
✅ Apache Ranger – Security and policy framework for Hadoop and Big Data
environments.
✅ Apache Knox – Provides authentication and access control for Big Data
systems.
✅ GDPR & CCPA Compliance Tools – Tools like BigID and Privacera help
companies comply with privacy laws.
✅ Encryption (AES-256, SSL, TLS) – Ensures data is encrypted during
transmission and storage (a Java sketch follows this list).
✅ Blockchain for Data Security – Used in fraud detection and tamper-proof
audit trails.
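As an illustration of the AES-256 encryption mentioned above, here is a minimal, self-contained Java sketch using the standard javax.crypto API with AES in GCM mode. The key handling is deliberately simplified; in practice the key would come from a key management service.

import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

public class AesExample {
    public static void main(String[] args) throws Exception {
        // Generate a random 256-bit AES key (assumption: key management is out of scope here)
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);
        SecretKey key = keyGen.generateKey();

        // GCM needs a unique 12-byte IV for every encryption
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal("sensitive record".getBytes(StandardCharsets.UTF_8));

        // Decrypt with the same key and IV
        cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        System.out.println(new String(cipher.doFinal(ciphertext), StandardCharsets.UTF_8));
    }
}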
o BI Tools: Tools like Tableau, Power BI, and QlikView are used
for creating interactive dashboards and reports.
Infrastructure Considerations
Data analytics plays a critical role in Big Data as it helps organizations make
data-driven decisions by extracting valuable insights from massive
datasets. Big Data analytics involves examining large datasets (often with
complex and varied data types) to uncover hidden patterns, correlations, and
trends. Here are some key uses:
1. Predictive Analytics
Uses historical data and machine learning models to forecast future outcomes (e.g., demand or churn).
2. Descriptive Analytics
Summarizes what has already happened using reports and dashboards.
3. Diagnostic Analytics
Drills into the data to explain why something happened.
5. Prescriptive Analytics
Recommends the best action to take based on predicted outcomes.
7. Data Visualization
Presents insights through charts, graphs, and interactive dashboards.
1. Scalability
A Big Data system should be able to handle growing amounts of
data and adapt as the dataset increases in size, speed, and
complexity.
2. Reliability
The system should keep operating correctly and remain available even when individual nodes or disks fail.
3. Flexibility (Variety)
It should support data formats like JSON, XML, CSV, and Parquet,
and be able to process data from different sources (social media,
sensor data, web logs, etc.).
5. High Throughput
The system should process very large volumes of data quickly, for both batch and streaming workloads.
6. Data Consistency
Consistency is crucial for ensuring that all data copies across distributed
systems stay in sync. Distributed systems must balance the trade-offs
described by the CAP theorem (Consistency, Availability, Partition tolerance),
typically relying on replication and consensus protocols to keep copies consistent.
7. Fault Tolerance
Replication and automatic recovery should prevent data loss and downtime when nodes fail.
9. Manageability
The system should be easy to deploy, monitor, and maintain.
10. Cost-Effectiveness
Commodity hardware and cloud pricing models should keep the cost per unit of data stored and processed low.
Integration
The system should integrate with other enterprise systems like ERP, CRM, and BI tools to ensure seamless data transfer.
Unit 2
Introduction to Hadoop
Hadoop is an open-source framework designed to store and process large
amounts of data in a distributed computing environment. Developed by
Apache Software Foundation, Hadoop allows users to process and
analyze massive datasets that cannot be handled by traditional data-
processing systems due to their size or complexity. It is built to scale from a
single server to thousands of machines, providing flexibility and fault
tolerance.
Core Components of Hadoop
1. HDFS (Hadoop Distributed File System)
o Stores large files as blocks that are replicated across the nodes of the cluster (described in detail later in this unit).
2. MapReduce
Key Functions:
o Map phase: The input data is divided into smaller chunks, which are processed by individual mapper tasks. These mappers output key-value pairs.
o Reduce phase: The intermediate key-value pairs are grouped by key and aggregated by reducer tasks to produce the final output.
3. YARN (Yet Another Resource Negotiator)
o Allocates cluster resources and schedules tasks for MapReduce and other processing engines (described in detail later in this unit).
Hadoop Ecosystem
1. Apache Hive:
o Ideal for data analysts who are familiar with SQL but need to process massive amounts of data.
2. Apache HBase:
o A distributed, column-oriented NoSQL database that runs on top of HDFS for real-time read/write access.
3. Apache Pig:
o A scripting platform (Pig Latin) for transforming large datasets without writing Java MapReduce code.
4. Apache Spark:
o A fast, in-memory engine for batch, streaming, and machine learning workloads.
5. Apache Kafka:
o A distributed messaging and streaming platform for ingesting real-time event data.
6. Apache Zookeeper:
o A coordination service that manages configuration and synchronization across the cluster.
7. Apache Flume:
o Collects and moves large volumes of log and event data into HDFS.
8. Apache Sqoop:
o Transfers data between relational databases and Hadoop.
9. Apache Oozie:
o A workflow scheduler for managing chains of Hadoop jobs.
Hive Overview
1. Hive Client:
2. Hive Metastore:
o It maintains the structure of the data, but not the actual data,
which is stored in HDFS.
4. Hive Driver:
o Manages the life cycle of a HiveQL query: it creates a session and hands the query to the compiler and execution engine.
6. Hive Optimizer:
o Improves the compiled plan before execution (e.g., predicate pushdown, join reordering).
8. Execution Framework:
o Runs the final plan as MapReduce, Tez, or Spark jobs on the cluster.
1. User Interaction:
o A user submits a HiveQL query using the Hive CLI, Web UI, or
programmatically through JDBC/ODBC connections (see the JDBC sketch after this list).
2. Driver:
o Receives the query, creates a session, and coordinates its compilation and execution.
3. Compiler:
o Parses the query, checks metadata against the Metastore, and produces an execution plan.
4. Optimizer:
o Rewrites the plan to minimize the data read and shuffled.
5. Execution:
o The plan runs as MapReduce, Tez, or Spark jobs on the Hadoop cluster.
6. Result:
o After the query is executed, the result is returned to the user,
either via the command-line interface or the chosen client.
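Step 1 mentions submitting queries programmatically over JDBC/ODBC. Below is a minimal Java sketch using the Hive JDBC driver; the HiveServer2 host and port, the credentials, and the employees table (defined later in these notes) are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver (requires the hive-jdbc jar on the classpath)
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumption: HiveServer2 listens on localhost:10000 with a 'hive' user and no password
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT name, salary FROM employees LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " -> " + rs.getDouble("salary"));
            }
        }
    }
}

Under the hood, this query goes through exactly the Driver, Compiler, Optimizer, and Execution steps described above.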
Hadoop Limitations
1. Complexity:
2. Real-Time Processing:
o While tools like Apache Spark and Apache Storm can provide
real-time capabilities, Hadoop itself is not optimized for low-
latency processing.
3. I/O Intensive:
o MapReduce writes intermediate results to disk, which makes jobs slower than in-memory engines such as Spark.
6. Security Issues:
o Security (e.g., Kerberos) is complex to configure, and data is not encrypted by default.
8. Cost of Implementation:
o Although the software is open source, building and operating a large cluster requires skilled staff and substantial hardware.
When to use a traditional RDBMS instead of Hadoop:
When you need strong data integrity and support for complex
transactions (banking, e-commerce platforms, etc.).
When the data is structured and you need to work with real-time
queries.
When the dataset is small to medium-sized (fits within the limits of a
traditional server or database system).
1. Distributed Storage:
2. Fault Tolerance:
3. Block-based Architecture:
o Files are split into blocks, and each block is stored across
different machines in the cluster. This allows parallel processing
and ensures efficient data access.
4. Master-Slave Architecture:
5. High Throughput:
o HDFS is optimized for large, sequential reads and writes, which gives high aggregate throughput rather than low latency.
7. Scalability:
o HDFS can scale by adding more nodes to the cluster. As the data
volume grows, more DataNodes can be added to store the data,
ensuring that the system remains efficient.
8. Data Locality:
o Computation is scheduled on the nodes that already hold the relevant data blocks, which reduces network transfer (a basic HDFS access sketch follows).
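To make the storage layer concrete, here is a minimal Java sketch that writes and reads a file through the HDFS FileSystem API; the NameNode URI and the file path are assumptions, not values from these notes.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode address is an assumption; normally it is read from core-site.xml
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Write a small file; HDFS transparently splits large files into blocks and replicates them
        Path path = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("hello hdfs\n");
        }

        // Read the file back and copy its contents to stdout
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}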
1. Map Phase:
In the Map phase, the input data (typically stored in HDFS) is processed in
parallel across the nodes of the cluster.
Each Map task processes a block of data and outputs a set of key-
value pairs.
Example: In a word count program, the input data might be a text file, and
the Mapper reads the text, breaking it down into words (key-value pairs like:
"word": 1).
2. Shuffle and Sort Phase:
After the Map phase, Hadoop automatically performs a Shuffle and Sort
step, which groups the intermediate key-value pairs by key and sorts them.
Shuffle: Groups all the values associated with the same key together
across all nodes in the cluster.
Sort: Sorts the intermediate key-value pairs so that the Reducer can
process them efficiently.
3. Reduce Phase:
The Reducer receives each key together with the list of values for that key and aggregates them into the final result.
In the example of word count, the Reducer would aggregate the word
counts for each word, summing up the counts.
Example: The output could be something like: ("word", 5) indicating that the
word appeared 5 times in the text.
4. Output:
The final output from the Reduce phase is written back to HDFS as a set of
files.
1. Data Splitting: The input data is split into smaller chunks (blocks) by
HDFS.
2. Map Task Execution: The Map tasks are distributed across nodes in
the Hadoop cluster, each task processing its chunk of the data and
emitting key-value pairs.
3. Shuffle and Sort: The intermediate data produced by Mappers is
shuffled and sorted by the system to group the values for each key.
4. Reduce Task Execution: The Reducers process each group of key-value pairs and emit the aggregated results.
5. Writing Output: The results of the Reduce phase are written back to
HDFS, where they can be accessed for further analysis.
Limitations of MapReduce:
Intermediate results are written to disk between stages, so iterative algorithms are slow.
It is batch-oriented and not suited to real-time or interactive queries.
All logic must be expressed as map and reduce functions, which is awkward for complex workflows.
1. ResourceManager (RM):
o The master daemon that tracks the cluster's resources and schedules applications.
2. NodeManager (NM):
o Runs on every worker node, launches containers, and reports resource usage to the ResourceManager.
3. ApplicationMaster (AM):
o One per application; it negotiates resources from the RM and coordinates the application's tasks.
4. Containers:
o Bundles of CPU and memory on a node in which individual tasks run.
How a job runs on YARN:
1. Job Submission: The client submits the application to the ResourceManager.
2. Resource Allocation: The RM launches an ApplicationMaster and grants it containers on NodeManagers.
3. Task Execution: The AM starts tasks inside the containers and monitors their progress.
4. Job Completion: The AM reports the final status to the RM and releases the containers.
Advantages of YARN:
Separates resource management from data processing, so engines other than MapReduce (e.g., Spark, Tez) can share the cluster.
Improves cluster utilization and scalability.
MapReduce Workflow:
1. Map Phase:
o This phase involves splitting the input data into chunks (called
splits), which are then processed in parallel by different Mapper
tasks.
2. Reduce Phase:
o After the Map phase, the intermediate data is shuffled and sorted
based on the key.
1. Mapper Function:
o Input: One record from the input split. Output: Intermediate key-value pairs derived from that record.
2. Reducer Function:
o Output: The final result after reducing the intermediate data. For
a word count example, it would sum the counts of each word.
MapReduce Example:
Consider a word count program where the task is to count how often each
word appears in a large text file.
1. Map Phase:
o Input: A text file, with lines like "apple orange banana apple".
o Output: Intermediate pairs such as ("apple", 1), ("orange", 1), ("banana", 1), ("apple", 1).
2. Shuffle and Sort:
o Groups the pairs by word: ("apple", [1, 1]), ("orange", [1]), ("banana", [1]).
3. Reduce Phase:
o Sums the values for each word.
4. Final Output:
"apple", 2
"orange", 1
"banana", 1
A word count job is typically written as three Java classes (a sketch follows):
1. Mapper: Emits ("word", 1) for every word it reads.
2. Reducer: Sums the counts for each word.
3. Driver: Configures the job (input/output paths, Mapper and Reducer classes) and submits it to the cluster.
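The three components above map directly onto code. Below is a compact sketch of the standard Hadoop word count job; the class names and the command-line input/output paths are illustrative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word after the shuffle and sort step
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures and submits the job; input and output paths come from the command line
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Using the reducer as a combiner is an optional optimization: partial sums are computed on each mapper node before the shuffle, which reduces network traffic.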
Unit 3
🐝 1. Introduction to Hive
Hive is a data warehouse system built on top of Hadoop that lets users query data stored in HDFS using an SQL-like language called HiveQL (HQL).
🎯 Why Hive?
Writing MapReduce manually is complex — Hive simplifies this.
Converts HQL into MapReduce, Tez, or Spark jobs behind the scenes.
2. Hive Architecture
🔷 1. User Interface (UI)
Where users submit HiveQL queries (CLI, web UI, or JDBC/ODBC clients).
🔷 2. Driver
Manages the session and the life cycle of a query.
🔷 3. Compiler
Parses the query, checks it against the Metastore, and produces an execution plan.
🔷 4. Metastore
Stores table definitions, schemas, and partition metadata (the data itself stays in HDFS).
🔷 5. Execution Engine
Runs the compiled plan as MapReduce, Tez, or Spark jobs.
Hive data types are categorized into Primitive and Complex types.
Primitive Types:
Type | Description
BOOLEAN | True/False
DECIMAL | Arbitrary-precision numbers
Complex Types:
Type | Description
ARRAY<T> | Ordered collection of elements
MAP<K,V> | Key-value pairs
UNIONTYPE | Supports multiple types in a single field
Example:
-- (table name 'student' assumed; the original example was truncated)
CREATE TABLE student (
  name STRING,
  marks ARRAY<INT>
);
HiveQL keywords are case-insensitive.
CREATE TABLE employees (
  id INT,
  name STRING,
  salary FLOAT
)
STORED AS TEXTFILE;
DESCRIBE employees;
🔍 SELECT Queries
SELECT name, salary FROM employees;
Partitioning:
CREATE TABLE logs (
  log_message STRING
)
PARTITIONED BY (log_date STRING);  -- partition column name assumed
Bucketing:
CREATE TABLE users (
  user_id INT,
  name STRING
)
CLUSTERED BY (user_id) INTO 4 BUCKETS;  -- bucket count assumed
🧰 Joins in Hive:
SELECT a.*, b.*
FROM dept a
JOIN emp b
ON a.id = b.dept_id;
✅ Summary
Architecture | UI → Driver → Compiler → Execution Engine ↔ Metastore & HDFS
Data Types | Primitive (INT, STRING...) and Complex (ARRAY, MAP, STRUCT)
🐷 1. Introduction to Pig
Apache Pig is a high-level data flow platform on top of Hadoop. Scripts are written in a language called Pig Latin and are automatically translated into MapReduce jobs.
✅ Features of Pig:
Requires far less code than hand-written MapReduce.
Handles structured, semi-structured, and unstructured data.
Extensible through user-defined functions (UDFs).
Runs either locally or on a Hadoop cluster.
🧬 2. Anatomy of Pig
Here’s what a typical Pig environment looks like and how it functions:
🔷 Components of Pig:
Component | Description
Pig Latin | The data flow language in which scripts are written
Grunt shell | Interactive shell for running Pig statements
Pig engine | Parses, optimizes, and compiles scripts into MapReduce jobs
Example (file name, schema, and threshold assumed):
emp = LOAD 'employees.csv' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray, salary:float);
highEarners = FILTER emp BY salary > 5000;
DUMP highEarners;
Behind the scenes, the script is parsed, the logical plan is optimized and compiled, and the resulting MapReduce jobs are executed.
📘 Modes of Execution:
Mode | Description
Local | Runs in a single JVM on the local file system.
MapReduce | Default mode; runs on Hadoop.
🧩 3. Pig on Hadoop
🧱 Integration Points:
Pig reads its input from and writes its results to HDFS.
Each script is compiled into MapReduce jobs that run on the Hadoop cluster (for example, a log-analysis script that filters entries and then counts accesses per country).
Pig vs. Hive vs. MapReduce:
Aspect | Hive | Pig | MapReduce
Use Case | Data Analysis | Data Transformation | Custom processing
Execution Engine | MapReduce/Tez/Spark | MapReduce | Native
✅ Summary
Pig on Hadoop | Pig runs on top of Hadoop using HDFS for storage and MapReduce for execution
Pig is best for analyzing and transforming large datasets. It's especially
useful for:
🌐 1. Log Analysis
Problem: A company has terabytes of log data from web servers and wants
to find how many requests came from each country.
🛒 2. Retail Analytics
Pig Tasks:
📈 3. Preprocessing for ML
Pig Tasks:
Remove nulls/duplicates.
Normalize values.
🧪 4. Data Sampling
For data scientists who need only a sample of data for testing or
visualization.
🔃 Pig in ETL
Extract: Get data from sources like HDFS, Hive, relational databases.
Transform: Clean and reshape the data using operations such as:
Filtering
Sorting
Joining
Grouping
Aggregation
Load: Store the results back into HDFS, Hive, or another target system.
Sample input (employees.csv – id, name, department, salary):
1,John,Sales,5600
2,Alice,HR,4300
3,Bob,Sales,7000
🔤 Pig Script:
-- Extract
emp = LOAD 'employees.csv' USING PigStorage(',') AS (id:int, name:chararray, dept:chararray, salary:float);
-- Transform: average salary per department
grouped = GROUP emp BY dept;
avg_sal = FOREACH grouped GENERATE group, AVG(emp.salary);
-- Load
STORE avg_sal INTO 'avg_salary_by_dept' USING PigStorage(',');
Output:
Sales,6300.0
HR,4300.0
Pig data types (examples):
Type | Description | Example
long | 64-bit integer | 10000000000
chararray | String of characters | "John"
1. Tuple
An ordered set of fields, e.g., (John, 25)
2. Bag
A collection of tuples, e.g., {(John, 25), (Alice, 30)}
3. Map
A set of key-value pairs, e.g., [name#'John', age#25]
Example schema (file name assumed):
student = LOAD 'students.txt' AS (name:chararray, marks: bag{t: (subject:chararray, mark:int)});
This defines:
A name (string)
A bag of subject-mark pairs
Modes of Execution:
Local mode (uses the local file system):
pig -x local
MapReduce mode (default, runs on Hadoop):
pig
Open shell:
pig
Commands:
grunt> data = LOAD 'file.txt' AS (name:chararray);
Write script:
-- File: etl_script.pig
data = LOAD 'file.txt' AS (name:chararray);
filtered = FILTER data BY name IS NOT NULL;  -- filter condition assumed; the original script body was lost
DUMP filtered;
Run script:
pig etl_script.pig
🔚 Summary Table
Topic | Details
Running Pig | CLI (Grunt), Script, Local/Hadoop modes
⚙️ How Pig Executes a Script
1. Script Submission
o The user submits a Pig Latin script through the Grunt shell or as a script file.
2. Parser
o Checks the syntax and builds a logical plan.
3. Optimizer
o Optimizes the logical plan (e.g., pushes filters as early as possible).
4. Compiler
o Compiles the optimized plan into a series of MapReduce jobs.
5. Execution
o Jobs are submitted to the Hadoop cluster (or run locally if in local
mode).
Execution Modes:
Mode | Description
Local | Runs in a single JVM using the local file system.
MapReduce | Default mode; runs on Hadoop.
🔄 2. Operators in Pig
Pig provides relational operators similar to SQL but more flexible.
Operator | Description | Example
DISTINCT | Removes duplicates | DISTINCT A
3. Functions in Pig
Pig offers a wide range of built-in functions and allows custom UDFs
(User Defined Functions).
🔹 Built-in Functions:
📊 Aggregate Functions:
Function | Description | Example
MIN() | Minimum value | MIN(A.salary)
MAX() | Maximum value | MAX(A.salary)
🔤 String Functions:
Function | Description
CONCAT() | Combines strings
STRSPLIT() | Splits a string
UPPER() | Converts to uppercase
LOWER() | Converts to lowercase
🔢 Math Functions:
Function | Description
ABS() | Absolute value
ROUND() | Round value
Example of a custom UDF written in Java (class name and jar name assumed):
// UPPER.java – converts the first field of the input tuple to uppercase
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UPPER extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        return input.get(0).toString().toUpperCase();
    }
}
Register in Pig:
REGISTER 'myfuncs.jar';
Type | Description | Example
int | 32-bit integer | 25
long | 64-bit integer | 10000000000
chararray | String | 'hello'
Type | Description | Example
Map | Key-value pairs | [name#'John', age#25]
UNIT 4
🔰 Introduction to NoSQL
✅ What is NoSQL?
NoSQL ("Not Only SQL") databases store and manage data without a fixed relational schema and are designed to scale horizontally across many servers.
💡 Key Features:
Feature | Description
High Availability | Built to handle failures gracefully
Fast Performance | Optimized for read/write throughput
Scalability | Cloud-based scalability
🔸 1. Key-Value Store
Stores each record as a key and an opaque value (like a distributed dictionary), which makes lookups by key extremely fast.
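As a concrete illustration (not from these notes), here is a minimal Java sketch using the Jedis client against Redis, one of the key-value stores listed below; the host, port, key, and value are assumptions.

import redis.clients.jedis.Jedis;

public class KeyValueExample {
    public static void main(String[] args) {
        // Assumes a Redis server on localhost:6379
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // The value is opaque to the store; here it is simply a JSON string
            jedis.set("user:123", "{\"name\":\"Alice\",\"city\":\"NY\"}");
            jedis.expire("user:123", 3600);            // typical cache/session usage: expire after one hour
            System.out.println(jedis.get("user:123")); // fetch the value by its key
        }
    }
}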
🔹 2. Document Store
Stores self-describing documents, usually as JSON:
{
  "id": "123",
  "name": "Alice",
  "address": {
    "city": "NY",
    "zip": "10001"
  }
}
🔸 3. Column-Family (Columnar) Store
Stores rows identified by a row key, with the values grouped into columns:
RowKey: 101
Name: Alice
Subject: Math
Marks: 95
🔘 4. Graph Store
Stores entities as nodes and relationships as edges, e.g., (Alice) -[FRIENDS_WITH]-> (Bob).
🧩 Summary of Patterns:
Type | Best Use Case | Examples
Key-Value | Caching, Session | Redis, DynamoDB
Document | CMS, Products | MongoDB, CouchDB
Columnar | Logs, Analytics | Cassandra, HBase
Graph | Relationships (social, fraud) | Neo4j, Amazon Neptune
Pattern | Description
✅ 4. MapReduce Pattern | Large datasets are processed by distributing map and reduce tasks across the nodes that hold the data.
✅ 5. Eventual Consistency | Replicas are allowed to diverge briefly and converge over time, trading strict consistency for availability.
Use Case | Data Model | Example Databases
Social networking | Graph store | Neo4j
Fraud detection | Graph + Document | ArangoDB, MongoDB
🍃 Introduction to MongoDB
📌 What is MongoDB?
MongoDB is a document-oriented NoSQL database that stores data as JSON-like (BSON) documents. It is highly flexible, scalable, and widely used in web and big data apps.
🧱 MongoDB Architecture
Component | Description
Document | Basic unit of data (like a row)
Collection | Group of documents (like a table)
Replica Set | Group of MongoDB servers for redundancy
Example document:
{
  "_id": "123",
  "email": "[email protected]",
  "orders": [ ... ]
}
Key Features:
Feature | Details
Aggregation Framework | Like SQL GROUP BY, but more powerful
Horizontal Scalability | Built-in sharding support
Use Cases:
Real-time analytics
Product catalogs
IoT platforms
Social apps
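To tie the pieces together, here is a minimal Java sketch using the official MongoDB sync driver; the connection string, database, and collection names are assumptions, and the document loosely mirrors the example shown earlier.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import static com.mongodb.client.model.Filters.eq;

public class MongoExample {
    public static void main(String[] args) {
        // Assumes a local MongoDB instance on the default port
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("shop");          // database name assumed
            MongoCollection<Document> users = db.getCollection("users"); // collection name assumed

            // Insert one document (a nested sub-document works like the "address" example above)
            Document alice = new Document("_id", "123")
                    .append("name", "Alice")
                    .append("address", new Document("city", "NY").append("zip", "10001"));
            users.insertOne(alice);

            // Query the collection by a field value
            Document found = users.find(eq("name", "Alice")).first();
            System.out.println(found == null ? "not found" : found.toJson());
        }
    }
}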
UNIT 5
Mining Social Network Graphs
Examples:
Goal | Description
Recommendation Systems | Suggesting friends, content, or products
Information Propagation | Studying how content or ideas spread
2. Recommendation Engines
4. Epidemic Modeling
🧱 Graph Components:
Component | Explanation
Nodes (Vertices) | Represent people, accounts, or entities
Directed Edge | A → B means A follows B (e.g., Twitter)
Undirected Edge | A – B means mutual relationship (e.g., Facebook friend)
Metric | Meaning
Closeness Centrality | How quickly a node can access others
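To make the metric concrete, here is a small, self-contained Java sketch that computes closeness centrality with a breadth-first search; the friendship edges are made up for illustration.

import java.util.*;

public class ClosenessCentrality {
    // Undirected friendship graph stored as an adjacency list
    static Map<String, List<String>> graph = new HashMap<>();

    static void addEdge(String a, String b) {
        graph.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        graph.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }

    // Closeness of a node = (reachable nodes - 1) / (sum of shortest-path distances to them)
    static double closeness(String start) {
        Map<String, Integer> dist = new HashMap<>();
        dist.put(start, 0);
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {                       // breadth-first search over unweighted edges
            String u = queue.poll();
            for (String v : graph.getOrDefault(u, List.of())) {
                if (!dist.containsKey(v)) {
                    dist.put(v, dist.get(u) + 1);
                    queue.add(v);
                }
            }
        }
        int sum = dist.values().stream().mapToInt(Integer::intValue).sum();
        return sum == 0 ? 0.0 : (double) (dist.size() - 1) / sum;
    }

    public static void main(String[] args) {
        // Hypothetical friendships: Alice - Bob - Carol - Dave
        addEdge("Alice", "Bob");
        addEdge("Bob", "Carol");
        addEdge("Carol", "Dave");
        for (String node : graph.keySet()) {
            System.out.printf("%s -> closeness %.3f%n", node, closeness(node));
        }
    }
}

Nodes in the middle of the chain (Bob, Carol) get higher scores because they can reach everyone else in fewer hops.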
1. Social Networks
Platforms like Facebook, Instagram, Twitter where nodes are users and
edges represent various interactions (likes, comments, follows).
2. Collaboration Networks
3. Communication Networks
4. Information Networks
🧩 Clustering of Social Graphs
🔍 Why Cluster?
Useful for:
o Targeted marketing
o Recommendation systems
o Influencer identification
📌 Popular Algorithms:
2. Clique Percolation
Types of Recommendation Systems:
Type | Description | Example
Hybrid Systems | Combines content-based + collaborative filtering | Netflix, Amazon
Social Recommenders | Uses data from social networks (friends' likes) | Spotify: "Your friend liked this playlist"
🔧 Algorithms Used:
Cosine Similarity
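Cosine similarity compares two users' rating vectors: the closer the result is to 1, the more similar their tastes. A minimal Java sketch with hypothetical ratings:

public class CosineSimilarity {
    // Cosine similarity between two vectors: dot(a, b) / (|a| * |b|)
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical ratings of the same five items by two users (0 = not rated)
        double[] userA = {5, 3, 0, 4, 0};
        double[] userB = {4, 0, 0, 5, 1};
        System.out.println("similarity = " + cosine(userA, userB));
    }
}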
🧠 Real-Life Examples:
Platform | Recommendation
Amazon | Products based on purchase/view history
LinkedIn | People you may know