Big-Data Unit-3
🌍 2. What is Hadoop?
Hadoop is an open-source framework that allows for distributed storage and processing
of big data across many machines.
🧠 1. What is MapReduce?
MapReduce is a programming model used in Hadoop to process and analyze large
datasets in a distributed and parallel manner.
✅ Final output:
apple → 5
banana → 3
🧠 Tip to Remember:
“Map = Break & Tag, Reduce = Group & Count”
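The "Map = Break & Tag, Reduce = Group & Count" idea can be sketched as a toy single-machine simulation in Python (illustrative only — this is not Hadoop's actual API, just the shape of the model):

```python
from collections import defaultdict

def map_phase(lines):
    # Map = Break & Tag: split each line into words, tag each word with a 1
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all tagged values by key (done by the framework in Hadoop)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce = Group & Count: sum the 1s for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["apple banana apple apple", "banana apple banana apple"]
result = reduce_phase(shuffle(map_phase(lines)))
print(result)  # {'apple': 5, 'banana': 3}
```

In real Hadoop the map and reduce functions run in parallel on many machines, and the shuffle happens over the network between them.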
Store the data file into HDFS (Hadoop Distributed File System) using:
hdfs dfs -put inputfile.txt /input
Compile the MapReduce job and package it into a JAR:
javac -classpath `hadoop classpath` MyJob.java
jar cf myjob.jar MyJob*.class
Run the job on the cluster:
hadoop jar myjob.jar MyJob /input /output
View the final output:
hdfs dfs -cat /output/part-r-00000
Component Role
NameNode Master node – manages metadata (file names, locations)
DataNode Worker node – stores actual data blocks
📌 Example – the NameNode's metadata includes:
File name
Block IDs
DataNode locations
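That metadata can be pictured as two lookup tables (a sketch with made-up block IDs and node names — not the NameNode's real data structures):

```python
# Hypothetical snapshot of NameNode metadata:
# file name -> its block IDs, and block ID -> DataNodes holding a replica
file_to_blocks = {
    "/input/inputfile.txt": ["blk_001", "blk_002"],
}
block_to_datanodes = {
    "blk_001": ["datanode1", "datanode2", "datanode3"],  # 3 replicas (default)
    "blk_002": ["datanode2", "datanode3", "datanode4"],
}

def locate(path):
    # What the NameNode answers on a read request:
    # which blocks make up the file, and where each replica lives
    return [(blk, block_to_datanodes[blk]) for blk in file_to_blocks[path]]

print(locate("/input/inputfile.txt"))
```

Note the NameNode only answers *where* the data is; the actual bytes never pass through it.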
🔁 4. Interaction Flow
Client → NameNode (asks where data is)
Client → DataNodes (reads/writes actual data)
DataNodes → NameNode (send heartbeats & block reports)
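The three arrows above can be acted out in a toy simulation (hypothetical class and method names; real HDFS uses RPC calls over the network, not in-process method calls):

```python
class NameNode:
    def __init__(self):
        self.block_locations = {}   # block ID -> set of DataNodes storing it
        self.last_heartbeat = {}    # DataNode -> tick of its last heartbeat

    def heartbeat(self, datanode, tick):
        # DataNode -> NameNode: "I'm alive"
        self.last_heartbeat[datanode] = tick

    def block_report(self, datanode, blocks):
        # DataNode -> NameNode: "here are the blocks I store"
        for blk in blocks:
            self.block_locations.setdefault(blk, set()).add(datanode)

    def where_is(self, blk):
        # Client -> NameNode: "where can I read this block?"
        return sorted(self.block_locations.get(blk, set()))

nn = NameNode()
nn.heartbeat("dn1", tick=0)
nn.heartbeat("dn2", tick=0)
nn.block_report("dn1", ["blk_001"])
nn.block_report("dn2", ["blk_001"])
print(nn.where_is("blk_001"))  # ['dn1', 'dn2']
```

After this lookup, the client contacts dn1 or dn2 directly to read the block — the NameNode never serves the data itself.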
📦 1. What is HDFS?
HDFS (Hadoop Distributed File System) is the storage system of Hadoop, designed to store
very large files reliably across multiple machines.
🧩 3. HDFS Architecture
🔹 Two Main Components:
Component Role
NameNode Master – stores metadata (file info, locations)
DataNodes Workers – store actual data blocks
📁 4. HDFS Concepts
Concept Description
Block Data is split into blocks (default: 128MB)
Replication Each block is copied to multiple DataNodes (default: 3 copies)
Rack Awareness Replicas are stored on different racks for fault tolerance
Heartbeat DataNodes send signals to NameNode to show they’re alive
Block Report DataNodes send list of blocks they store to NameNode
Secondary NameNode Takes periodic snapshots of metadata (not a backup, just helper)
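The Block and Replication rows can be checked with a little arithmetic (a sketch assuming the defaults above: 128 MB block size, replication factor 3):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size: 128 MB
REPLICATION = 3                  # default replication factor

def hdfs_footprint(file_size_bytes):
    # Blocks the file is split into, and total bytes stored across
    # the cluster once every block has its 3 replicas
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return blocks, file_size_bytes * REPLICATION

# A 300 MB file -> 3 blocks (128 + 128 + 44 MB), 900 MB stored in total
blocks, stored = hdfs_footprint(300 * 1024 * 1024)
print(blocks, stored // (1024 * 1024))  # 3 900
```

Note the last block is allowed to be smaller than 128 MB — HDFS does not pad it to the full block size.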