Experiment No.: 1
Aim: Installing Hadoop, Configuring HDFS, and Configuring Hadoop
1. Introduction:
Hadoop is an open-source framework developed by the Apache Software Foundation
for processing and storing massive amounts of data. It uses a distributed storage
system called HDFS (Hadoop Distributed File System) and a processing engine called
MapReduce to handle big data efficiently.
Hadoop's key features include scalability, fault tolerance, and the ability to process
structured, semi-structured, and unstructured data across clusters of commodity
hardware. Its ecosystem includes tools like Hive, Pig, HBase, and Spark, making it a
cornerstone for big data analytics.
Components of Hadoop
Hadoop has four core components:
1. HDFS (Hadoop Distributed File System): Stores large data across distributed
nodes with replication for reliability.
2. YARN (Yet Another Resource Negotiator): Manages cluster resources and task
scheduling.
3. MapReduce: Processes data in parallel with mapping and reducing phases.
4. Hadoop Common: Provides shared libraries and utilities for other modules.
2. Installation of Hadoop
Prerequisites: The following software is required to install Hadoop 3.4.1 on
Windows 10 (64-bit).
1. Download Hadoop 3.4.1
(Link:https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.4.1/
hadoop-3.4.1.tar.gz)
OR
(https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz)
2. Java JDK 1.8.0.zip
(Link: https://www.oracle.com/java/technologies/downloads/#java2)
Set up
1. Check whether Java 1.8.0 is already installed on your system by running "javac
-version".
Configuration
1. Edit core-site.xml: Open C:\Hadoop-3.4.1\etc\hadoop\core-site.xml, paste the XML
below, and save the file.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
2. Create folder "data" under "C:\Hadoop-3.4.1"
Create folder "datanode" under "C:\Hadoop-3.4.1\data"
Create folder "namenode" under "C:\Hadoop-3.4.1\data"
3. Edit hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-3.4.1\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-3.4.1\data\datanode</value>
</property>
</configuration>
4. Edit mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
5. Edit yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
6. Hadoop Configuration
1. Download file Hadoop Configuration.zip
(Link:https://drive.google.com/file/d/
1nCN_jK7EJF2DmPUUxgOggnvJ6k6tksYz/view)
2. Delete the bin folder at C:\Hadoop-3.4.1\bin and replace it with the bin folder
from the downloaded Hadoop Configuration.zip.
7. Format the NameNode (first time only): open cmd and run "hdfs namenode -format".
8. Testing
1. Open cmd, change directory to "C:\Hadoop-3.4.1\sbin", and type "start-dfs.cmd"
to start HDFS.
9. Start Hadoop Services:
a. Use the following commands to start the HDFS and YARN services:
start-dfs.cmd and start-yarn.cmd
10. Verify Hadoop Installation:
a. Access the Hadoop NameNode UI by visiting http://localhost:9870 in a
web browser.
b. Access the ResourceManager UI by visiting http://localhost:8088.
(In Hadoop 2.x the NameNode UI was served on port 50070; Hadoop 3.x uses port 9870.)
So, now you have successfully installed Hadoop
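If you prefer to script the check in step 10, the following is a minimal Python sketch (an illustration only; it assumes Python 3 is available and that the default web UI ports 9870 and 8088 are in use):
import urllib.request

# Probe the two Hadoop web UIs started above (assumed default ports)
for name, url in [("NameNode UI", "http://localhost:9870"),
                  ("ResourceManager UI", "http://localhost:8088")]:
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            print(f"{name} reachable at {url} (HTTP {response.status})")
    except OSError as error:
        print(f"{name} NOT reachable at {url}: {error}")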
3. Experiment Code
• N/A (This experiment is focused on configuration rather than coding).
4. Execution
5. Observations
6. Analysis
7. Conclusion
8. Viva Questions
Experiment No.: 2
Moreover, HDFS is designed to be highly fault-tolerant. The file system replicates
(copies) each piece of data multiple times and distributes the copies to individual
nodes, placing at least one copy on a different server rack than the other copies.
2. Procedure
a) Start HDFS Services:
Run the following command to start the HDFS: start-dfs.sh
b) Access the HDFS UI:
Open the Hadoop NameNode UI in a browser:” http://localhost:9870”
c) Create a Directory in HDFS:
Use the following command to create a new directory:
“hdfs dfs -mkdir /user/<your-username>/example_dir”
d) Upload a File to HDFS:
Upload a local file to HDFS:
“hdfs dfs -put <local-file-path> /user/<your-username>/example_dir/”
e) List Files in HDFS:
View the contents of the HDFS directory:
“hdfs dfs -ls /user/<your-username>/example_dir “
f) Retrieve a File from HDFS:
Download a file from HDFS to the local filesystem:
“hdfs dfs -get /user/<your-username>/example_dir/<file-name> <local-destination-
path>”
g) Delete a File in HDFS:
Remove a file from HDFS:
hdfs dfs -rm /user/<your-username>/example_dir/<file-name>
h) Stop HDFS Services:
Use the following command to stop HDFS services: stop-dfs.sh
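For reference, the same sequence of HDFS operations can be scripted. The sketch below uses Python's subprocess module and assumes the hdfs binary is on the PATH; the username, file name, and directory are placeholders:
import subprocess

username = "hadoop"        # placeholder: replace with your HDFS user name
local_file = "sample.txt"  # placeholder: an existing local file
hdfs_dir = f"/user/{username}/example_dir"

# Create the directory, upload the file, then list the directory contents
for cmd in (["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
            ["hdfs", "dfs", "-put", local_file, hdfs_dir + "/"],
            ["hdfs", "dfs", "-ls", hdfs_dir]):
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)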
3. Experiment Code
N/A (This experiment is focused on file operations via commands).
4. Execution
Perform the HDFS commands step by step as outlined in the process.
Validate each operation by checking the Hadoop NameNode UI and reviewing the
terminal outputs.
5. Observations
Successfully executed HDFS operations such as directory creation, file upload,
and retrieval.
Confirmed replication factor and block distribution for uploaded files through the
HDFS web interface.
6. Analysis
Data Distribution: Files are divided into blocks and spread across multiple
DataNodes.
Fault Tolerance: Replication ensures data is available even if some DataNodes
fail.
Ease of Use: Simple terminal commands and the web UI make managing HDFS
straightforward.
7. Conclusion
Gained hands-on experience with basic HDFS operations using terminal
commands.
Verified data replication and block storage via the NameNode interface.
8. Viva Questions
1. What is the primary function of the NameNode in HDFS?
2. How does HDFS maintain fault tolerance?
3. Describe what happens when a DataNode becomes unavailable.
4. What is the purpose of the hdfs dfs -put command?
9. Multiple Choice Questions (MCQs)
1. What is the default replication factor in HDFS?
a) 2
b) 3
c) 4
d) 1
Answer: b) 3
2. Which component stores metadata in HDFS?
a) DataNode
b) NameNode
c) Secondary NameNode
d) JobTracker
Answer: b) NameNode
10. References
Official Apache Hadoop Documentation: https://hadoop.apache.org/docs/
Hadoop: The Definitive Guide by Tom White
Experiment No.: 3
1. Aim: Running Jobs on Hadoop
2. Theory:
MapReduce is a programming model used to process large datasets in a distributed
environment, often in Hadoop clusters.
Map Function: The Map function takes input data and transforms it into key-value
pairs, which are the intermediate outputs. Each key-value pair is used for further
processing.
Reduce Function: The Reduce function processes the intermediate key-value pairs
generated by the Map function, aggregates them, and provides the final result.
Workflow of MapReduce:
1. Input Splits: Input data is divided into smaller chunks (input splits), which can be
processed in parallel across the cluster.
2. Mapper: The Mapper processes each input split and generates intermediate key-
value pairs.
3. Shuffle and Sort: This phase organizes the intermediate key-value pairs by
grouping them according to their keys.
4. Reducer: The Reducer processes the sorted key-value pairs and produces the final
output, often after aggregation or summarization.
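The four phases can be illustrated on a single machine with plain Python (no Hadoop involved); the sketch below walks the sample lines used later in this experiment through map, shuffle/sort, and reduce:
from itertools import groupby

lines = ["Hadoop is a framework",
         "Hadoop processes large datasets"]

# Map: emit an intermediate (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and Sort: bring pairs with the same key together
mapped.sort(key=lambda pair: pair[0])

# Reduce: aggregate the values for each key
for word, pairs in groupby(mapped, key=lambda pair: pair[0]):
    print(word, sum(count for _, count in pairs))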
3. Procedure
1. Start Hadoop Services:
Run the following commands to start Hadoop's HDFS and YARN
services: start-dfs.sh start-yarn.sh
2. Prepare the Input Data:
Create a sample text file (e.g., input.txt) with some data.
Upload the file to HDFS:
hdfs dfs -put input.txt /user/<your-username>/example_input
3. Write a MapReduce Job:
Use the built-in WordCount example or create a custom job using Java or
Python.
Example: WordCount program to count word frequencies in a dataset.
4. Run the Job:
Execute the MapReduce job:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
wordcount /user/<your-username>/example_input /user/<your-username>/example_output
Sample input (input.txt):
Hadoop is a framework
Hadoop processes large datasets
Distributed systems are scalable
Output from the WordCount job:
Distributed 1
Hadoop 2
a 1
are 1
datasets 1
framework 1
is 1
large 1
processes 1
scalable 1
systems 1
6. Execution
Execute the WordCount job using the steps provided.
Verify the output using hdfs dfs -cat and analyze the logs in the ResourceManager UI.
7. Observations
The MapReduce job processed the input data and generated accurate word counts.
Logs provided detailed insights into the execution phases, including map and reduce
tasks.
8. Analysis
Distributed Processing: MapReduce splits data and processes it across multiple
nodes, improving speed and scalability.
Fault Tolerance: If a node fails, Hadoop reassigns tasks to ensure job completion.
Resource Management: YARN allocates and monitors resources effectively during
job execution.
9. Conclusion
Successfully executed a MapReduce job on Hadoop.
Learned the workflow of MapReduce and verified outputs using HDFS.
10. Viva Questions
1. What is the role of the Mapper in MapReduce?
2. How does the Shuffle and Sort phase optimize job performance?
3. What is the significance of YARN in Hadoop jobs?
4. How can you monitor the progress of a Hadoop job?
11. Multiple Choice Questions (MCQs)
Experiment No.: 4
Aim: Install Zookeeper
1. Theory:
Apache ZooKeeper is a centralized coordination service for distributed applications:
it keeps configuration, naming, and synchronization data in a hierarchical namespace
of ZNodes. A ZooKeeper ensemble contains three kinds of nodes:
Leader: The leader node handles all write requests in the system, ensuring that there
is a single source of truth for updates.
Followers: The follower nodes handle read requests and forward write requests to the
leader. They replicate the leader's changes to maintain consistency.
Observers: These nodes do not participate in leader election and are used primarily
for scaling the system. Observers can handle read requests, but they do not vote
during elections.
Use Cases:
Leader Election: Ensuring only one active leader node to coordinate tasks.
Configuration Management: Storing configuration data in a centralized and
consistent way.
Distributed Locking: Implementing locks in a distributed environment to avoid race
conditions.
Service Discovery: Helping in the discovery of available services in a dynamic
distributed system.
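As an illustration of the configuration-management and locking use cases, here is a minimal Python sketch using the third-party kazoo client; it assumes pip install kazoo has been run and that a ZooKeeper server is listening on localhost:2181 (as configured later in this experiment):
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Configuration management: store and read a value under a ZNode
zk.ensure_path("/app")
if not zk.exists("/app/config"):
    zk.create("/app/config", b"batch_size=100")
data, stat = zk.get("/app/config")
print("Config:", data.decode(), "version:", stat.version)

# Distributed locking: only one client at a time runs this block
lock = zk.Lock("/app/locks/job", "worker-1")
with lock:
    print("This worker currently holds the lock")

zk.stop()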
2. Procedure
1. Download ZooKeeper:
• Visit the Apache ZooKeeper official website.
• Download the latest stable release (e.g., zookeeper-3.x.x.tar.gz).
2. Install ZooKeeper:
• Extract the downloaded package to a desired directory:
tar -xvzf zookeeper-3.x.x.tar.gz -C /usr/local/
cd /usr/local/zookeeper-3.x.x/
• Set environment variables in the .bashrc file:
export ZOOKEEPER_HOME=/usr/local/zookeeper-3.x.x
export PATH=$PATH:$ZOOKEEPER_HOME/bin
• Source the .bashrc file to apply changes:
source ~/.bashrc
3. Configure ZooKeeper:
• Navigate to the conf directory and create a configuration file:
cp conf/zoo_sample.cfg conf/zoo.cfg
• Edit the zoo.cfg file to configure ZooKeeper:
tickTime=2000
dataDir=/usr/local/zookeeper-3.x.x/data
clientPort=2181
• Create the data directory:
mkdir -p /usr/local/zookeeper-3.x.x/data
4. Start ZooKeeper Service:
•Start the ZooKeeper server:
zkServer.sh start
5. Verify Installation:
• Check ZooKeeper status: zkServer.sh status
• Use the ZooKeeper CLI to test functionality: zkCli.sh -server localhost:2181
• Create a ZNode: create /my_node "Hello ZooKeeper"
• Verify the ZNode: get /my_node
6. Stop ZooKeeper Service:
zkServer.sh stop
5. Experiment Code
• Sample commands for ZooKeeper CLI:
create /example "Sample Data"
ls /
set /example "Updated Data"
get /example
delete /example
6. Execution
Execute the commands mentioned above using the ZooKeeper CLI.
Observe the creation, update, retrieval, and deletion of ZNodes.
7. Observations
The ZooKeeper server successfully started and maintained its state.
The CLI commands effectively interacted with the ZooKeeper service.
8. Analysis
Distributed Coordination: ZooKeeper ensures coordination and consistency across
distributed nodes, managing shared data and operations.
Atomicity, Reliability, and Fault-Tolerance: ZooKeeper’s mechanisms ensure data
consistency and fault tolerance, making it suitable for maintaining the state in
distributed systems.
Hierarchical ZNode Structure: The hierarchical structure of ZNodes allows for easy
organization, access, and management of configuration data.
9. Conclusion
Successfully installed and configured ZooKeeper.
Understood its functionality by creating and manipulating ZNodes using the CLI.
10. Viva Questions
1. What is the role of ZooKeeper in a distributed system?
2. Explain the significance of the tickTime parameter in zoo.cfg.
Experiment No.: 5
1. Aim: Pig Installation
2. Theory
Apache Pig is a high-level platform for creating MapReduce programs used with
Hadoop.
It uses a scripting language called Pig Latin, which simplifies complex data
transformations and operations.
Key Features:
Ease of Programming: Simplifies the development of MapReduce programs.
Extensibility: Allows the use of user-defined functions (UDFs) for custom
processing.
Optimization Opportunities: Automatically optimizes scripts for efficient
execution.
Execution Modes:
Local Mode: Runs on a single machine without Hadoop.
MapReduce Mode: Runs on a Hadoop cluster.
3. Procedure
1. Pre-Requisites:
Ensure Java and Hadoop are installed and configured.
Verify Hadoop is running by checking its status.
Install Pig in the same environment as Hadoop.
2. Download Apache Pig:
Visit the official Apache Pig website.
Download the latest stable release (e.g., pig-0.x.x.tar.gz).
3. Install Apache Pig:
Extract the downloaded file to a desired directory:
tar -xvzf pig-0.x.x.tar.gz -C /usr/local/
cd /usr/local/pig-0.x.x/
Set environment variables in the .bashrc file:
export PIG_HOME=/usr/local/pig-0.x.x
export PATH=$PATH:$PIG_HOME/bin
Source the .bashrc file to apply changes:
source ~/.bashrc
4. Configure Pig:
For local mode, no additional configuration is required.
For MapReduce mode, ensure HADOOP_HOME is properly configured in
the environment variables.
5. Verify Installation:
Check Pig version to confirm installation:
pig -version
6. Run Pig in Local Mode:
Start Pig in local mode:
pig -x local
Execute a sample Pig script:
A = LOAD 'data.txt' USING PigStorage(',') AS (id:int, name:chararray,
age:int);
B = FILTER A BY age > 25;
DUMP B;
12. References
Apache Pig Documentation
Experiment No.: 6
1. Aim: Sqoop Installation.
2. Theory: Apache Sqoop is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured data stores such as relational databases.
Key Features:
Import/Export Data: Move data between Hadoop and RDBMS.
Parallel Processing: Supports parallel data transfer for efficiency.
Integration: Works seamlessly with Hive and HBase.
Incremental Load: Import only new or updated data.
Typical Use Cases:
Importing data from a MySQL or PostgreSQL database into HDFS.
Exporting processed data from HDFS back to an RDBMS.
3. Procedure
1. Pre-Requisites:
Ensure Java, Hadoop, and a relational database (e.g., MySQL) are installed and
configured.
Verify Hadoop is running: jps
2. Download Apache Sqoop:
Visit the official Apache Sqoop website.
Download the latest stable release (e.g., sqoop-1.x.x.tar.gz).
3. Install Apache Sqoop:
Extract the downloaded file to a desired directory:
tar -xvzf sqoop-1.x.x.tar.gz -C /usr/local/
cd /usr/local/sqoop-1.x.x/
Set environment variables in the .bashrc file:
“export SQOOP_HOME=/usr/local/sqoop-1.x.x
export PATH=$PATH:$SQOOP_HOME/bin “
Source the .bashrc file to apply changes: “source ~/.bashrc “
4. Configure Sqoop:
Download and place the JDBC driver for your database in the
$SQOOP_HOME/lib directory. For example, for MySQL:
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-x.x.x.tar.gz
tar -xvzf mysql-connector-java-x.x.x.tar.gz
cp mysql-connector-java-x.x.x.jar $SQOOP_HOME/lib/
Ensure the database service (e.g., MySQL) is running.
5. Verify Installation:
Check Sqoop version to confirm installation:
sqoop version
6. Perform Data Import:
Import a table from MySQL into HDFS:
sqoop import \
--connect jdbc:mysql://localhost:3306/db_name \
--username db_user \
--password db_password \
--table table_name \
--target-dir /user/hadoop/table_data \
--m 1
Verify the imported data in HDFS: hdfs dfs -ls /user/hadoop/table_data
7. Perform Data Export:
Export data from HDFS back to MySQL: sqoop export \
--connect jdbc:mysql://localhost:3306/db_name \
--username db_user \
--password db_password \
--table table_name \
--export-dir /user/hadoop/table_data \
--m 1
8. Exit Sqoop:
Sqoop runs as a command-line tool; no special command is required to exit.
4. Experiment Code
Sample Import Command: sqoop import \
--connect jdbc:mysql://localhost:3306/employees_db \
--username root \
--password root_password \
--table employees \
--target-dir /user/hadoop/employees_data \
--m 1
Sample Export Command: sqoop export \
--connect jdbc:mysql://localhost:3306/employees_db \
--username root \
--password root_password \
--table salaries \
--export-dir /user/hadoop/salaries_data \
--m 1
5. Execution
Execute the Sqoop commands in the terminal.
Verify the transferred data in HDFS and the relational database.
6. Observations
Sqoop successfully transferred data between MySQL and HDFS.
Data transfer operations were efficient and error-free.
7. Analysis
Sqoop simplifies data movement between RDBMS and Hadoop.
The tool supports parallel processing for faster data transfer.
8. Conclusion
Successfully installed and configured Apache Sqoop.
Experiment No.: 7
1. Aim: HBase Installation
2. Theory: Apache HBase is an open-source, distributed, and scalable NoSQL
database inspired by Google's Bigtable. It is designed to manage large amounts of
data in a fault-tolerant way, running on top of Hadoop and using HDFS for storage.
Key Features:
Distributed: Scales horizontally by adding machines.
Column-Oriented: Data is stored in columns rather than rows.
Real-Time Access: Fast random read/write operations on large datasets.
Fault-Tolerant: Data replication ensures fault tolerance.
HBase Operations:
Put: Inserts data into tables.
Get: Retrieves data from tables.
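The Put and Get operations can also be issued from Python. A minimal sketch using the third-party happybase client is shown below; it assumes happybase is installed (pip install happybase), that the HBase Thrift server has been started (hbase thrift start), and that the students table created in the procedure below already exists:
import happybase

# Connect to the HBase Thrift server (default port 9090)
connection = happybase.Connection("localhost")
table = connection.table("students")

# Put: insert a cell into the 'personal' column family
table.put(b"row1", {b"personal:name": b"John Doe"})

# Get: read the row back
row = table.row(b"row1")
print(row.get(b"personal:name"))

connection.close()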
3. Procedure
1. Pre-Requisites:
Ensure Hadoop and Java are installed and configured.
Verify that Hadoop is running: jps
2. Download Apache HBase:
Visit the official Apache HBase website and download the latest stable release
(e.g., hbase-2.x.x.tar.gz).
3. Extract HBase Files:
Extract the downloaded file to the desired directory:
tar -xvzf hbase-2.x.x.tar.gz -C /usr/local/
cd /usr/local/hbase-2.x.x/
4. Configure HBase:
Open the hbase-site.xml file located in $HBASE_HOME/conf/. If not present,
create it.
Add the following basic configuration:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
</configuration>
Ensure Hadoop HDFS and Zookeeper services are running before starting
HBase.
Optionally, configure additional properties like hbase.master and
hbase.regionserver.
5. Set HBase Environment Variables:
Add the following to your .bashrc file:
export HBASE_HOME=/usr/local/hbase-2.x.x
export PATH=$PATH:$HBASE_HOME/bin
Apply changes: source ~/.bashrc
6. Start HBase:
Start the HBase master and region server: start-hbase.sh
Verify HBase is running: jps
You should see processes like HMaster and HRegionServer.
7. Verify HBase Installation:
Open the HBase shell: hbase shell
Check if HBase is running: status 'simple'
8. Create an HBase Table:
Create a table students with two column families, personal and grades:
create 'students', 'personal', 'grades'
Verify the table creation:
list
9. Insert Data into HBase:
Insert data into the students table:
put 'students', 'row1', 'personal:name', 'John Doe'
put 'students', 'row1', 'personal:age', '21'
put 'students', 'row1', 'grades:math', 'A'
10. Retrieve Data from HBase:
Retrieve data from the students table:
get 'students', 'row1'
11. Scan the HBase Table:
Scan the entire table:
scan 'students'
12. Stop HBase:
Stop the HBase services:
stop-hbase.sh
4. Experiment Code
Create Table Command:
create 'students', 'personal', 'grades'
5. Execution
Execute the commands in the HBase shell to create tables, insert data, and retrieve
data.
Verify the inserted data and scanned results.
6. Observations
HBase was successfully installed and configured.
Data was successfully inserted into and retrieved from the students table.
7. Analysis
HBase's column-oriented architecture allows efficient storage and retrieval of large
datasets.
Its integration with Hadoop ensures scalability and distributed storage for
managing big data.
8. Conclusion
Successfully installed and configured Apache HBase.
Performed basic operations such as table creation, data insertion, and data retrieval.
9. Viva Questions
1. What is HBase, and how does it differ from relational databases?
2. Explain the architecture of HBase.
3. How does HBase achieve scalability?
4. What is the role of Zookeeper in HBase?
5. Explain the concept of column families in HBase.
11. References
Apache HBase Documentation
HBase: The Definitive Guide by Lars George
Experiment No.: 8
1. Aim: Hadoop streaming
2. Theory: Hadoop Streaming: A utility enabling MapReduce jobs with any language
capable of handling stdin and stdout, such as Python, Perl, or Ruby.
Key Components:
Mapper: Processes each line of input and outputs key-value pairs.
Reducer: Aggregates key-value pairs output by the Mapper to produce the
final result.
3. Procedure
1. Setup and Verification:
Ensure Hadoop is installed, and HDFS and YARN services are running.
Confirm with commands like jps.
2. Input Data Preparation:
Create an input file input.txt containing:
hello world
hello hadoop
hello streaming
hadoop world
3. Mapper Script:
Create a Python file mapper.py for the Mapper function.
mapper.py:
import sys

for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print(f"{word}\t1")
4. Reducer Script:
Create a Python file reducer.py for the Reducer function.
reducer.py:
import sys
from collections import defaultdict

# Aggregate the (word, 1) pairs emitted by the Mapper
word_count = defaultdict(int)
for line in sys.stdin:
    word, count = line.strip().split('\t')
    word_count[word] += int(count)

# Emit the final (word, total count) pairs
for word, count in word_count.items():
    print(f"{word}\t{count}")
5. Upload Data to HDFS:
hadoop fs -put input.txt /user/hadoop/input/
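Before submitting the streaming job, the two scripts can be tested locally by reproducing the pipeline cat input.txt | mapper | sort | reducer. The sketch below does this with Python's subprocess module (it assumes mapper.py, reducer.py, and input.txt are in the current directory and that python3 and sort are available):
import subprocess

with open("input.txt", "rb") as infile:
    mapped = subprocess.run(["python3", "mapper.py"], stdin=infile,
                            capture_output=True, check=True).stdout
# Hadoop's shuffle/sort phase is simulated with the Unix sort command
sorted_pairs = subprocess.run(["sort"], input=mapped,
                              capture_output=True, check=True).stdout
reduced = subprocess.run(["python3", "reducer.py"], input=sorted_pairs,
                         capture_output=True, check=True).stdout
print(reduced.decode())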
5. Experiment Code
Mapper (mapper.py) and Reducer (reducer.py) scripts.
Hadoop Streaming job command.
6. Execution
Run the scripts and commands.
Confirm correct processing by inspecting the output in HDFS.
7. Observations
Successful execution of the MapReduce job with accurate word count results.
Mapper and Reducer functions worked as expected.
8. Analysis
Hadoop Streaming simplifies the integration of non-Java languages with
Hadoop MapReduce.
Enables flexibility for developers using Python or other scripting languages
for big data processing.
9. Conclusion
11. MCQs
1. Hadoop Streaming allows the use of which languages for MapReduce
jobs?
a) Only Java programs
b) Only Python programs
c) Any language that can read from stdin and write to stdout
d) Only shell scripts
Answer: c) Any language that can read from stdin and write to stdout
2. Which command is used to execute a Hadoop Streaming job?
a) hadoop-streaming.jar
b) hadoop jar hadoop-streaming.jar
c) hadoop run streaming.jar
d) hadoop-mapreduce.jar
Answer: b) hadoop jar hadoop-streaming.jar
3. In Hadoop Streaming, the output from the Mapper is:
a) A single line
b) A key-value pair
c) A JSON file
d) A binary file
Answer: b) A key-value pair
4. The primary function of the Reducer in Hadoop Streaming is:
a) To split data into smaller chunks
b) To merge the results from the Mapper
c) To run a MapReduce job
d) To filter data
Answer: b) To merge the results from the Mapper
5. What type of data does Hadoop Streaming primarily process?
a) Binary data only
b) Text data only
c) Structured, semi-structured, and unstructured data
12. References
Hadoop Streaming Documentation
Hadoop: The Definitive Guide by Tom White.
Experiment No.: 9
1. Aim: Creating a Mapper function using Python
2. Theory
The Mapper function in Hadoop MapReduce processes input data and produces
intermediate key-value pairs as output.
This function works line by line, reading each line, splitting it into words, and
outputting each word as a key with an associated count (usually 1).
Python is widely preferred for writing Mapper functions due to its simplicity and an
extensive library ecosystem.
3. Procedure
1. Ensure Hadoop is Installed and Running:
Verify Hadoop installation and running status:
Ensure that the HDFS and YARN daemons are running.
2. Create Input Data:
Create a text file input.txt containing sample data for processing. For
example:
Hadoop is a framework
MapReduce is a programming model
Python is used for writing mapper
Hadoop streaming supports multiple languages
3. Create the Mapper Script:
Write a Python script (mapper.py) that will process each line of input data and
output key-value pairs.
In this case, the script will split each line into words and output a (word, 1)
pair.
Python code for the Mapper (mapper.py):
# mapper.py
import sys

# Read from stdin
for line in sys.stdin:
    # Remove leading/trailing whitespace and split the line into words
    words = line.strip().split()
    # Emit (word, 1) for each word in the line
    for word in words:
        print(f"{word}\t1")
This script reads input from standard input (sys.stdin), processes each line, and
outputs the word followed by the number 1 (indicating its occurrence).
4. Upload Input Data to HDFS:
Use the following command to upload the input file input.txt to HDFS:
hadoop fs -put input.txt /user/hadoop/input/
5. Run the Hadoop Streaming Job:
Use the following command to execute the MapReduce job, specifying the
Python Mapper script:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/hadoop/input/input.txt \
-output /user/hadoop/output/ \
-mapper mapper.py
This command tells Hadoop to use the mapper.py script to process the
data.
6. Verify the Output:
After the job completes, the output will be saved in HDFS
at/user/hadoop/output/.
You can check the output using the following command:
hadoop fs -cat /user/hadoop/output/part-00000
Because only a Mapper is specified (the default IdentityReducer is used), the output
contains one (word, 1) pair per word occurrence, sorted by key, for example:
Hadoop 1
Hadoop 1
MapReduce 1
Python 1
a 1
a 1
for 1
framework 1
is 1
is 1
is 1
languages 1
mapper 1
model 1
multiple 1
programming 1
streaming 1
supports 1
used 1
writing 1
7. Clean Up:
Remove the output directory from HDFS after the job completes:
hadoop fs -rm -r /user/hadoop/output/
4. Experiment Code
Mapper Code (mapper.py):
import sys

# Read from stdin
for line in sys.stdin:
    # Remove leading/trailing whitespace and split the line into words
    words = line.strip().split()
    # Emit (word, 1) for each word in the line
    for word in words:
        print(f"{word}\t1")
Hadoop Streaming Command:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/hadoop/input/input.txt \
-output /user/hadoop/output/ \
-mapper mapper.py
5. Execution
Run the Hadoop Streaming command provided above to execute the Mapper
function.
Check the output stored in HDFS using the hadoop fs -cat command.
6. Observations
The Mapper function processed the input data and emitted key-value pairs for
each word with an occurrence count of 1.
The output file in HDFS contained all words along with their counts.
7. Analysis
The Mapper function in Python efficiently processes text by breaking it into words.
The output can be used in the Reducer phase for further aggregation.
Python's simplicity makes it a reliable choice for developing Mapper functions in
Hadoop Streaming.
8. Conclusion
A Python-based Mapper function was successfully implemented within the
Hadoop MapReduce framework.
The process demonstrated Python's utility for writing effective MapReduce jobs.
9. Viva Questions
1. What is the Mapper’s function in the MapReduce framework?
2. How does Hadoop Streaming facilitate MapReduce jobs with Python?
3. What format does the Mapper produce as its output?
4. How can large datasets be handled effectively using Hadoop Streaming?
5. Name other languages supported by Hadoop Streaming apart from Python.
10. Multiple Choice Questions (MCQs)
3. What is the primary role of the Mapper in Hadoop Streaming?
a) To read input data and generate key-value pairs
b) To sort data
c) To aggregate intermediate results
d) To handle I/O operations
Answer: a) To read input data and generate key-value pairs
4. Which command executes the Python Mapper in Hadoop Streaming?
a) hadoop stream run
b) hadoop jar hadoop-streaming.jar
c) hadoop exec
d) hadoop maprun
Answer: b) hadoop jar hadoop-streaming.jar
11. References
Hadoop Streaming Documentation
Hadoop: The Definitive Guide by Tom White
Experiment No.: 10
1. Aim: Creating a Reducer function using Python
2. Theory
The Reducer function in Hadoop MapReduce receives the output of the Mapper
function, which is sorted and grouped by key. The Reducer processes each group of
values associated with a particular key and performs aggregation (e.g., sum,
average, etc.).
The Reducer function outputs a key-value pair as its final result.
Python is widely used for writing Reducer functions in Hadoop Streaming because
of its simplicity and flexibility.
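Because Hadoop guarantees that the Reducer sees its input sorted by key, consecutive lines with the same key can be grouped directly. The sketch below shows this grouping pattern with itertools.groupby; it is an alternative to the dictionary-based reducer used later in this experiment and assumes the usual word<TAB>count input lines:
import sys
from itertools import groupby

def parse(line):
    word, count = line.rstrip("\n").split("\t")
    return word, int(count)

# Standard input is assumed to be sorted by key, as after Hadoop's shuffle/sort
pairs = (parse(line) for line in sys.stdin if line.strip())
for word, group in groupby(pairs, key=lambda pair: pair[0]):
    print(f"{word}\t{sum(count for _, count in group)}")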
3. Procedure
1. Ensure Hadoop is Installed and Running:
Verify Hadoop installation and running status:
Ensure that the HDFS and YARN daemons are running.
7. Verify the Output:
The tail of the aggregated word-count output (as in Experiment 9) looks like:
mapper 1
streaming 1
multiple 1
languages 1
8. Clean Up:
Remove the output directory from HDFS after the job completes:
hadoop fs -rm -r /user/hadoop/output/
4. Experiment Code
Mapper Code (mapper.py):
import sys

# Read from stdin
for line in sys.stdin:
    # Remove leading/trailing whitespace and split the line into words
    words = line.strip().split()
    # Emit (word, 1) for each word in the line
    for word in words:
        print(f"{word}\t1")
Reducer Code (reducer.py):
import sys
from collections import defaultdict

# Dictionary to hold the aggregate counts for each word
word_count = defaultdict(int)

# Read from stdin
for line in sys.stdin:
    # Remove leading/trailing whitespace
    line = line.strip()
    # Split the line into a key-value pair (word, count)
    word, count = line.split('\t')
    word_count[word] += int(count)

# Emit the aggregated result (word, total count)
for word, count in word_count.items():
    print(f"{word}\t{count}")
Hadoop Streaming Command:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/hadoop/input/input.txt \
-output /user/hadoop/output/ \
-mapper mapper.py \
-reducer reducer.py
5. Execution
• Execute the job by running the Hadoop Streaming command.
• Check the output in HDFS using hadoop fs -cat.
6. Observations
• The Reducer function successfully aggregated the counts for each word and
emitted the total count.
• The final output in HDFS contains each word from the input text file along with
its corresponding count.
7. Analysis
• The Mapper emits key-value pairs of words and counts, which are grouped by the
Reducer.
• The Reducer aggregates these values (sum in this case) and outputs the final word
count for each word in the input.
8. Conclusion
• We successfully created a Reducer function using Python in Hadoop's
MapReduce framework.
• The job processed the input data, aggregated the counts for each word, and
produced the correct output, demonstrating the effective use of Python in Hadoop
Streaming.
9. Viva Questions
1. What is the role of the Reducer in the MapReduce framework?
2. How does the Reducer function process data in Hadoop Streaming?
3. What happens to the key-value pairs before they are passed to the Reducer?
4. What would happen if the word count logic in the Reducer was incorrect?
5. How do you optimize the performance of a Reducer in Hadoop?
10. Multiple Choice Questions (MCQs)
1. In the Reducer function, the input is grouped by:
a) Key
b) Value
c) Both Key and Value
d) None of the above
Answer: a) Key
2. Which Python data structure is used to hold the word counts in the
Reducer? a) List
b) Tuple
c) Dictionary
d) Set
Answer: c) Dictionary
11. References
Hadoop Streaming Documentation:
https://hadoop.apache.org/docs/r1.2.1/streaming.html
"Hadoop: The Definitive Guide" by Tom White
Experiment No.: 11
1. Aim: Python iterators and generators
2. Theory
Iterators:
a. An iterator is an object that implements the __iter__() and __next__() methods.
b. Iterators allow you to loop through a collection (like a list or tuple) without
needing to explicitly use an index.
Generators:
a. A generator is a special type of iterator that is defined using a function and
the yield keyword.
b. A generator function does not return a value all at once, but rather yields
values one at a time as the function is iterated over.
c. Generators are memory efficient as they generate items lazily (only when
required), rather than storing all values in memory at once.
Key Differences:
a. Iterators are typically implemented as classes that define __iter__() and
__next__(), while generators are ordinary functions that use the yield keyword.
b. Generators produce values lazily, one at a time, so they are more memory-
efficient than building the whole sequence in memory.
4. Experiment Code
Iterator Example:
class MyIterator:
    def __init__(self, start, end):
        self.current = start
        self.end = end

    def __iter__(self):
        return self

    def __next__(self):
        if self.current >= self.end:
            raise StopIteration  # No more items to return
        self.current += 1
        return self.current - 1

# Usage of Iterator
my_iter = MyIterator(0, 5)
for num in my_iter:
    print(num)
Generator Example:
def my_generator(start, end):
    while start < end:
        yield start  # Yield one value at a time
        start += 1

# Usage of Generator
for num in my_generator(0, 5):
    print(num)
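To make the memory difference concrete, the short sketch below (illustrative only) compares the size of a fully built list with that of an equivalent generator, and shows that a generator can also be advanced manually with next():
import sys

numbers_list = [n * n for n in range(1_000_000)]  # all values held in memory
numbers_gen = (n * n for n in range(1_000_000))   # values produced on demand

print(sys.getsizeof(numbers_list))  # several megabytes
print(sys.getsizeof(numbers_gen))   # a few hundred bytes at most

g = (n for n in range(3))
print(next(g), next(g), next(g))    # 0 1 2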
5. Execution
• Run both the iterator and generator examples to see how they produce values and
compare their usage.
• Observe the efficiency of the generator when working with larger data sets.
6. Observations
• Both the iterator and the generator produced the same output for the range of
values.
• The generator was more memory efficient, especially when working with larger
data sets.
7. Analysis
• Iterators are suitable for situations where the data is finite and already available
in memory.
• Generators are ideal for cases where large data sets need to be processed lazily,
minimizing memory consumption.
8. Conclusion
• Python's iterators and generators provide an elegant and efficient way to handle
large data sets and sequences.
• While iterators require more boilerplate code, generators simplify the task by
leveraging the yield keyword and are more memory-efficient.
9. Viva Questions
1. How do iterators differ from generators in Python?
2. What role does the yield keyword play in creating generators?
3. Is it possible to retrieve values from a generator using the next() function? If
yes, how?
4. What occurs if you attempt to iterate over a generator after all its values have
been exhausted?
5. In what ways do iterators and generators influence memory usage in Python?
10. Multiple Choice Questions (MCQs)
1. Which statement is correct regarding Python generators?
a) Generators produce all values at once.
b) Generators produce values one at a time, as needed.
c) Generators can always be restarted from the beginning.
d) None of the above.
Answer: b) Generators produce values one at a time, as needed.
2. Which method must be implemented in a custom iterator class?
a) __getitem__()
b) __next__()
c) __call__()
d) __yield__()
Answer: b) __next__()
3. What does the yield keyword do in a generator function?
a) Returns a value and stops the function execution permanently
b) Pauses the function execution and allows it to resume later from where it
left off
c) Creates a list of values
d) Directly prints the values in the function
4. What is the primary benefit of using generators?
a) Faster execution.
b) Reduced memory consumption.
c) Easier debugging.
d) Better code readability.
Answer: b) Reduced memory consumption.
11. References
• Python Official Documentation: https://docs.python.org/
• "Python Programming: An Introduction to Computer Science" by John Zelle
Experiment No.: 12
1. Aim: Twitter data sentimental analysis using Flume and Hive
2. Theory
Apache Flume:
• Apache Flume is a distributed system for efficiently collecting, aggregating, and
moving large amounts of log data.
• It can be used to collect Twitter data (using Twitter's streaming API) and move it
into a distributed data store like HDFS (Hadoop Distributed File System).
Apache Hive:
• Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides
data summarization and SQL-like querying over data stored in HDFS.
3. Procedure
1. Setting Up Apache Flume:
Configure a Flume agent (twitter-source-agent.conf) with a Twitter source, an HDFS
sink, and a memory channel, for example:
twitter-source-agent.sources.TwitterSource.consumerKey = <your-consumer-key>
twitter-source-agent.sources.TwitterSource.consumerSecret = <your-consumer-secret>
twitter-source-agent.sources.TwitterSource.accessToken = <your-access-token>
twitter-source-agent.sources.TwitterSource.accessTokenSecret = <your-access-token-secret>
twitter-source-agent.sources.TwitterSource.keywords = ["#bigdata", "#hadoop"]
twitter-source-agent.sources.TwitterSource.batchSize = 100
twitter-source-agent.sinks.hdfs-sink.type = hdfs
twitter-source-agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:9000/user/flume/twitter/
twitter-source-agent.sinks.hdfs-sink.hdfs.filePrefix = tweet_
twitter-source-agent.channels.memory-channel.type = memory
twitter-source-agent.channels.memory-channel.capacity = 1000
twitter-source-agent.channels.memory-channel.transactionCapacity = 100
Running Flume:
Run the Flume agent using the command:
flume-ng agent --conf ./conf --conf-file twitter-source-agent.conf --name
twitter-source-agent
This will start collecting real-time tweets and store them in HDFS.
2. Setting Up Apache Hive:
• Ensure Apache Hive is installed and configured to run queries over Hadoop.
• Create a Hive table to store the Twitter data from Flume.
Example Hive table creation for storing tweet data:
CREATE EXTERNAL TABLE twitter_data (
  tweet_id STRING,
  user_name STRING,
  tweet_text STRING,
  `timestamp` STRING
)
STORED AS TEXTFILE
LOCATION '/user/flume/twitter/';
This will allow you to run Hive queries over the data collected by Flume.
3. Perform Sentiment Analysis on Twitter Data:
Install NLTK or TextBlob in Python for sentiment analysis.
Example Python code for sentiment analysis (a minimal analyze_sentiment based on
TextBlob's polarity score):
from textblob import TextBlob

def analyze_sentiment(tweet):
    polarity = TextBlob(tweet).sentiment.polarity
    if polarity > 0:
        return "positive"
    elif polarity < 0:
        return "negative"
    return "neutral"

# Sample tweet data
tweets = ["I love Hadoop", "I hate bugs in my code", "Apache Flume is amazing"]

# Apply sentiment analysis
sentiments = [analyze_sentiment(tweet) for tweet in tweets]

# Display the result
for tweet, sentiment in zip(tweets, sentiments):
    print(f"Tweet: {tweet} --> Sentiment: {sentiment}")
6. Execution
• Run the Flume agent to fetch live data from Twitter.
• Store the collected tweets in Hive using the Flume configuration.
• Apply the Python sentiment analysis function to classify each tweet's sentiment.
• Store the sentiment results back in Hive.
7. Observations
• You can observe the data collection in real-time and analyze the sentiment of
each tweet.
• The sentiment of each tweet (positive, negative, or neutral) is stored in the Hive
table, allowing for further analysis.
7. Analysis
• By using Flume, you were able to collect Twitter data efficiently.
• Sentiment analysis was performed on the text of tweets, allowing classification
into categories that are useful for understanding public opinion or reaction to
certain topics.
8. Conclusion
• This experiment demonstrates the integration of Apache Flume, Apache Hive,
and Python for real-time Twitter data collection, storage, and sentiment analysis.
• Flume helps in efficiently collecting data in real time, while Hive provides an
easy way to store and query large datasets.
• Sentiment analysis can be useful in applications such as social media monitoring,
brand analysis, or political sentiment analysis.
9. Viva Questions
1. What is the role of Flume in this experiment?
2. How does Flume help in collecting Twitter data?
3. What is sentiment analysis, and why is it useful?
4. How does Hive help in storing and querying large data sets?
5. Explain how the sentiment analysis function works in Python.
Experiment No.: 13
1. Aim: Business insights of User usage records of data cards
2. Theory
User Usage Records of Data Cards:
• Data cards are mobile broadband devices used for internet access. They typically
come with specific data plans that users can use for browsing, downloading,
streaming, etc.
• User usage records include data such as the amount of data used, time of use,
type of services accessed (streaming, browsing, social media, etc.), and duration
of use.
Business Insights:
• Business insights refer to the process of analyzing raw data to draw conclusions
that inform business decisions. For data card usage records, these insights can
reveal trends like peak usage times, data consumption patterns, and user
demographics.
Customer Segmentation:
• Analyzing the usage records helps segment users based on factors like usage
volume (heavy vs. light users), usage behavior (data usage, browsing habits), and
preferences (types of websites accessed, device usage).
Revenue Optimization:
• Usage data can provide insights into which plans are the most popular, which can
be used to optimize pricing strategies, improve plans, and create new plans that
better meet customer needs.
3. Requirements
Software:
Python (for data analysis and visualization)
Pandas (for data manipulation)
Matplotlib / Seaborn (for data visualization)
SQL (for querying databases, if the usage records are stored in a relational
database)
Jupyter Notebook (for interactive analysis)
Data:
A dataset of user usage records, which could include the following attributes:
• User ID: Unique identifier for each user.
• Date: Date of data usage.
• Total Data Used: Amount of data consumed in MB/GB.
• Usage Time: Duration of data usage.
Service Type: Type of service used (e.g., browsing, streaming).
Data Plan Type: Type of plan the user is on (e.g., prepaid, postpaid).
• Location: User's geographical location (optional).
4. Procedure
1. Data Collection:
• Collect user usage records for a period of time (e.g., last 3 months, 6
months).
• Ensure that the dataset includes the relevant fields like user ID, date, data
used, usage time, and plan type.
2. Data Preprocessing:
• Clean the data by handling missing values, correcting any errors, and
ensuring consistency in units (e.g., data usage in GB or MB).
• Convert timestamps or dates into a consistent format for time-based analysis.
Example of data cleaning in Python:
import pandas as pd

# Load data
data = pd.read_csv("user_usage.csv")

# Convert date column to datetime type
data['Date'] = pd.to_datetime(data['Date'])

# Handle missing values (e.g., filling with the mean)
data['Total Data Used'].fillna(data['Total Data Used'].mean(), inplace=True)
3. Data Analysis:
• Usage Patterns: Analyze peak usage times (e.g., evening, weekend) and the
correlation between data usage and plan types.
• Customer Segmentation: Segment users based on data consumption into
categories like:
Low users (0-1 GB/month)
Medium users (1-5 GB/month)
Heavy users (5+ GB/month)
• Popular Services: Identify the most accessed services (e.g., browsing,
streaming) based on the usage records.
Example of segmentation:
# Segmentation by data usage
bins = [0, 1, 5, 100]
labels = ['Low User', 'Medium User', 'Heavy User']
data['User Category'] = pd.cut(data['Total Data Used'], bins=bins, labels=labels)
4. Business Insights:
• Usage Trends: Identify trends in data usage over time, such as increasing
usage during holidays or weekends.
• Revenue Opportunities: Analyze which data plans are most popular and
correlate them with user demographics to suggest potential improvements or
changes in pricing.
• Customer Retention: Identify heavy users who may need upgraded plans
and offer retention strategies (e.g., discounts, loyalty rewards).
5. Visualization:
Create visualizations to summarize insights, such as the distribution of data
usage across users, the most common data usage times, and the relationship
between usage and plan type.
Example of visualization in Python:
import matplotlib.pyplot as plt
import seaborn as sns
# Plotting the distribution of data usage
plt.figure(figsize=(10,6))
sns.histplot(data['Total Data Used'], bins=20, kde=True)
plt.title('Distribution of Data Usage')
plt.xlabel('Data Used (GB)')
plt.ylabel('Frequency')
plt.show()
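The trend analysis described in step 4 can be sketched in the same way; the snippet below (an illustration only) assumes the data frame loaded and cleaned in step 2 and the column names listed in the sample dataset above:
# Monthly usage trend: total data consumed per calendar month
monthly_usage = (data.set_index("Date")["Total Data Used"]
                     .resample("M").sum())
print(monthly_usage)

# Plan popularity: number of users and average consumption per plan type
plan_summary = (data.groupby("Data Plan Type")
                    .agg(users=("User ID", "nunique"),
                         avg_data_used=("Total Data Used", "mean")))
print(plan_summary)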
5. Experiment Code
• Python Code for Business Insights Analysis:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Analyzing the relationship between user category and total data used
plt.figure(figsize=(10,6))
sns.boxplot(x='User Category', y='Total Data Used', data=data)
plt.title('Data Usage by User Category')
plt.show()
6. Execution
• Load the user data into a Python script and clean the data as required.
• Run the analysis to segment users and find patterns in data usage.
• Visualize the data using plots to identify trends, customer behaviors, and
opportunities.
7. Observations
• You may observe peak usage times for specific user categories (e.g., heavy users
may have higher data consumption on weekends).
• Popular services (such as streaming or browsing) may become apparent.
• You might notice some plans are more commonly associated with higher data
usage.
8. Analysis
• From the analysis, you can understand customer behavior based on data
consumption.
• Insights into high-usage patterns can guide marketing efforts, product
development
(new plans), and customer retention strategies.
9. Conclusion
• This experiment demonstrates how business insights can be derived from user
data card usage records.
• By analyzing data usage patterns, businesses can improve customer
segmentation, optimize revenue strategies, and enhance customer satisfaction
through tailored plans and services.
10. Viva Questions
1. What is the importance of segmenting users based on data usage?
2. How can business insights from data card usage be used to improve customer
retention?
3. What are the benefits of visualizing data usage trends?
4. Explain how customer behavior can be used to optimize data plans.
5. How would you apply this analysis to a real-world business scenario?
11. Multiple Choice Questions (MCQs)
1. What is the first step in analyzing user usage records?
a) Segmenting users
b) Data cleaning
c) Data visualization
d) Customer retention
Answer: b) Data cleaning
2. Which of the following is a possible outcome of analyzing data card usage
records?
a) Identifying peak usage times
b) Increasing data plan prices for all users
c) Ignoring customer feedback
d) Limiting user access to data
Answer: a) Identifying peak usage times
3. How can visualizations help in understanding data usage?
a) By removing irrelevant data
b) By identifying trends and patterns
c) By simplifying data collection
d) By changing the pricing model
Answer: b) By identifying trends and patterns
12. References
• Pandas Documentation: https://pandas.pydata.org/pandas-docs/stable/
• Seaborn Documentation: https://seaborn.pydata.org/
• Matplotlib Documentation: https://matplotlib.org/
Experiment No.: 14
1. Aim: Wiki page ranking with Hadoop
2. Theory
PageRank Algorithm:
• PageRank is a link analysis algorithm used by Google to rank web pages in their
search engine results.
• The algorithm works by counting the number of inbound links to a page and
assigning a rank based on the importance of the linking pages. The higher the
rank of a linking page, the higher the rank transferred to the linked page.
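Before turning to the distributed implementation, the iterative rank computation itself can be sketched in a few lines of plain Python on an assumed three-page graph (the same PageA/PageB/PageC format used for the input data later in this experiment); this is an illustration only, with a conventional damping factor of 0.85:
links = {"PageA": ["PageB", "PageC"],
         "PageB": ["PageC"],
         "PageC": ["PageA"]}
damping = 0.85
ranks = {page: 1.0 for page in links}

for _ in range(20):  # iterate until the ranks stabilise
    new_ranks = {page: 1 - damping for page in links}
    for page, outlinks in links.items():
        share = ranks[page] / len(outlinks)
        for target in outlinks:
            new_ranks[target] += damping * share
    ranks = new_ranks

for page, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(rank, 4))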
Hadoop MapReduce:
• MapReduce is a programming model used to process and generate large datasets
that can be split into independent tasks. It is used in the Hadoop ecosystem to
handle largescale data processing in a distributed environment.
The core components of Hadoop MapReduce:
• Mapper: The mapper reads data, processes it, and outputs intermediate results
(keyvalue pairs).
• Reducer: The reducer aggregates the results and outputs the final computation.
Wiki Page Ranking:
• For Wikipedia page ranking, the data would typically consist of page-to-page
links (a directed graph). Each page will have links to other pages, and the rank of
each page will be computed iteratively based on the number of incoming links
from other pages.
3. Requirements
Software:
• Hadoop (with HDFS and MapReduce)
• Java (for implementing MapReduce)
• Python (optional for any additional scripting/processing)
• Linux/Mac or Windows with WSL (for setting up Hadoop)
Data:
• A sample Wikipedia dataset containing page links. The dataset could be in the
form of a text file where each line contains a page and its outbound links (e.g.,
PageA -> PageB,
PageC).
4. Procedure
1. Set Up Hadoop:
o Install Hadoop and set up a single-node or multi-node cluster.
o Ensure that HDFS (Hadoop Distributed File System) is running, and the
necessary directories are created to store the input and output data.
2. Prepare Input Data:
o The dataset should contain the structure of Wikipedia pages, where each
page links to other pages. This data can be represented in the following
format:
PageA -> PageB, PageC
PageB -> PageC
PageC -> PageA
3. Implement Mapper Function:
The mapper function reads the page links and emits the current page and the
linked pages as key-value pairs. Each linked page is given an initial rank.
Example of a Mapper in Java:
public class PageRankMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("->");
        String page = parts[0].trim();
        String[] linkedPages = parts[1].split(",");
        // Emit the current page with its outbound links
        context.write(new Text(page), new Text("links:" + String.join(",", linkedPages)));
        // Emit each linked page with a share of the current page's rank
        double initialRank = 1.0;
        for (String linkedPage : linkedPages) {
            context.write(new Text(linkedPage.trim()),
                    new Text("rank:" + (initialRank / linkedPages.length)));
        }
    }
}
A driver class configures and submits the job:
public class PageRankDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wiki page rank");
        job.setJarByClass(PageRankDriver.class);
        job.setMapperClass(PageRankMapper.class);
        job.setReducerClass(PageRankReducer.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
6. Execution
• Compile the Java files into a JAR file.
• Upload the input data (Wikipedia links dataset) to HDFS:
hadoop fs -put wiki_data.txt /input
• Run the PageRank MapReduce job:
hadoop jar pagerank.jar PageRankDriver /input /output
• Check the output directory to get the ranked pages.
7. Observations
• The output will show each page and its computed PageRank value.
• Observe how the ranks change with each iteration, and how the pages with more
inbound links gradually rise in rank.
8. Analysis
• After a few iterations, the ranks should stabilize. Pages with the most links from
important pages will be ranked higher.
• This experiment helps demonstrate how Hadoop can be used to implement large-
scale algorithms like PageRank.
9. Conclusion
• This experiment successfully demonstrates how to compute the Wiki PageRank
using Hadoop's MapReduce framework.
• By iteratively computing PageRank, we can rank Wikipedia pages based on their
importance, which can then be used to improve search engine results or content
recommendations.
12. References
• Hadoop Documentation: https://hadoop.apache.org/docs/stable/
MapReduce Programming Guide:
https://hadoop.apache.org/docs/stable/hadoopmapreduce-client-core/mapreduce-
programming-guide.html
Experiment No.: 15
1. Aim: Health care Data Management using Apache Hadoop ecosystem
2. Theory
Apache Hadoop Ecosystem:
Apache Hadoop is a framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is designed to
scale up from a single server to thousands of machines, each offering local computation
and storage.
Key components of the Hadoop ecosystem include:
• HDFS (Hadoop Distributed File System): A distributed file system designed to
store vast amounts of data across a cluster of computers.
• MapReduce: A programming model for processing large data sets in a parallel,
distributed manner.
• YARN (Yet Another Resource Negotiator): Manages resources in a Hadoop cluster.
• Hive: A data warehouse infrastructure built on top of Hadoop for providing data
summarization, querying, and analysis.
• Pig: A high-level platform for creating MapReduce programs used with Hadoop.
• HBase: A NoSQL database that provides real-time read/write access to large datasets.
• Sqoop: A tool for transferring bulk data between Hadoop and relational databases.
• Flume: A service for collecting, aggregating, and moving large amounts of log data.
• Oozie: A workflow scheduler for managing Hadoop jobs.
• Healthcare Data: Healthcare data includes a wide variety of information such as
patient records, treatment histories, diagnostic data, lab results, medical images,
prescriptions, and more. These datasets are typically:
Structured: Tables containing patient demographics, billing information,
lab results, etc.
Unstructured: Medical records in free-text form, images, reports, etc.
Semi-structured: Data formats like XML or JSON that contain structured
information but do not fit neatly into tables.
Challenges in Healthcare Data Management:
• Volume: Healthcare data is growing rapidly, especially with the rise of electronic
health records (EHRs) and medical imaging.
• Variety: Healthcare data comes in various formats (structured, semi-structured, and
unstructured).
• Velocity: Real-time processing of healthcare data, such as monitoring patient vitals
or emergency response systems, is critical.
• Veracity: Healthcare data must be accurate and trustworthy, making its quality
management a significant concern.
3. Requirements
Software:
• Apache Hadoop (including HDFS, MapReduce, YARN)
• Hive (for SQL-like querying)
5. Querying Data with Hive:
• Hive can be used to run SQL-like queries on the healthcare data stored in HDFS;
this is especially useful for structured data, such as patient records (a Python
sketch of this query appears after this procedure).
This is especially useful for structured data, such as patient records.
Example of creating a table and running a query in Hive:
CREATE TABLE patient_records (
  patient_id INT,
  name STRING,
  age INT,
  disease STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA INPATH '/healthcare_data/patient_records' INTO TABLE patient_records;

SELECT * FROM patient_records WHERE disease = 'Diabetes';
6. Data Storage in HBase:
• Use HBase for real-time data storage, particularly for time-series data like
patient vitals or continuous monitoring systems. Example command to insert
data into HBase:
hbase shell
create 'patient_vitals', 'patient_id', 'vitals'
put 'patient_vitals', 'row1', 'patient_id:id', '1'
put 'patient_vitals', 'row1', 'vitals:heart_rate', '80'
7. Data Processing with Pig (optional):
• Use Pig for high-level data processing. Pig scripts are an alternative to
MapReduce and allow you to process data using a simpler language than Java.
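The Hive query from step 5 can also be scripted from Python. The sketch below uses the third-party PyHive client and is an illustration only; it assumes pip install pyhive[hive] has been run and that HiveServer2 is listening on the default port 10000 (host, port, and username are assumptions):
from pyhive import hive

# Connect to HiveServer2 and run the diabetes query from step 5
connection = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = connection.cursor()

cursor.execute("SELECT * FROM patient_records WHERE disease = 'Diabetes'")
for row in cursor.fetchall():
    print(row)

connection.close()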
5. Experiment Code
• Hive queries will provide insights into patient demographics, diseases, and
treatment patterns.
• HBase will allow real-time access to patient vital data.
8. Analysis
• Healthcare data processing on Hadoop allows for scalable, distributed processing
of massive datasets.
• By using HDFS for storage, MapReduce for processing, and Hive for querying,
healthcare institutions can gain valuable insights from large-scale patient data.
• The system is capable of handling structured, semi-structured, and unstructured
healthcare data efficiently.
9. Conclusion
• This experiment demonstrates the power of the Hadoop ecosystem in managing
and processing large healthcare datasets. The combination of HDFS,
MapReduce, Hive, and HBase allows healthcare providers to store, analyze, and
retrieve data efficiently.
• This approach can lead to improved healthcare outcomes through faster data
processing, enhanced analytics, and better decision-making.
10. Viva Questions
1. What are the key components of the Hadoop Ecosystem, and how do they
contribute to healthcare data management?
2. How does HBase differ from HDFS, and when would you use it in healthcare
data management?
3. Explain the role of Hive in querying healthcare data stored in HDFS.
4. What are the advantages of using MapReduce for healthcare data analysis?
5. How would you process and analyze real-time healthcare data streams using
Flume and Hadoop?
11. Multiple Choice Questions (MCQs)
1. What is the primary purpose of using the Apache Hadoop Ecosystem in
healthcare data management?
a) To enhance the quality of medical images
b) To manage and process large-scale healthcare data efficiently
c) To monitor patient vitals in real-time
d) To replace traditional healthcare management systems
Answer: b) To manage and process large-scale healthcare data efficiently
b) MapReduce
c) HDFS
d) Hive
Answer: c) HDFS
3. Which Hadoop component allows healthcare data to be queried using SQL-like
syntax?
a) MapReduce
b) Hive
c) HBase
d) Pig
Answer: b) Hive
4. Which of the following tools in the Hadoop ecosystem is used to transfer bulk
data between Hadoop and relational databases?
a) Sqoop
b) Flume
c) HBase
d) Pig
Answer: a) Sqoop
5. In healthcare data management, which Hadoop tool is useful for processing real-
time data streams, such as patient monitoring data?
a) Sqoop
b) Flume
c) MapReduce
d) HBase
Answer: b) Flume
12. References
• Healthcare Big Data Management and Analytics:
https://www.springer.com/gp/book/9783030323897
• Cloudera Healthcare Solutions:
https://www.cloudera.com/solutions/healthcare.html