Unit 1: Big Data and Hadoop Lab
Experiment No.: 1
Aim: Installing Hadoop, configuring HDFS, and configuring Hadoop
1. Introduction:
Hadoop is an open-source framework developed by the Apache Software Foundation
for processing and storing massive amounts of data. It uses a distributed storage system
called HDFS (Hadoop Distributed File System) and a processing engine called
MapReduce to handle big data efficiently.
Hadoop's key features include scalability, fault tolerance, and the ability to process
structured, semi-structured, and unstructured data across clusters of commodity
hardware. Its ecosystem includes tools like Hive, Pig, HBase, and Spark, making it a
cornerstone for big data analytics.
Components of Hadoop
Hadoop has four core components:
1. HDFS (Hadoop Distributed File System): Stores large data across distributed
nodes with replication for reliability.
2. YARN (Yet Another Resource Negotiator): Manages cluster resources and task
scheduling.
3. MapReduce: Processes data in parallel with mapping and reducing phases.
4. Hadoop Common: Provides shared libraries and utilities for other modules.
2. Installation of Hadoop
Prerequisites: The following software is required to install Hadoop 3.4.1 on
Windows 10 (64-bit):
1. Download Hadoop 3.4.1
(Link:https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-
3.4.1/hadoop-3.4.1.tar.gz)
OR
(https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz)
2. Java JDK 1.8.0.zip
(Link: https://www.oracle.com/java/technologies/downloads/#java2)
Set up
1. Check whether Java 1.8.0 is already installed on your system; run "javac -version"
to check.
Configuration
1. Edit core-site.xml: Open C:\Hadoop-3.4.1\etc\hadoop\core-site.xml, paste the XML
snippet below, and save the file.
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
2. Create folder "data" under "C:\Hadoop-3.4.1"
• Create folder "datanode" under "C:\Hadoop-3.4.1\data"
• Create folder "namenode" under "C:\Hadoop-3.4.1\data"
3. Edit hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-3.4.1\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-3.4.1\data\datanode</value>
</property>
</configuration>
4. Edit mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
5. Edit yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
6. Hadoop Configuration
1. Download file Hadoop Configuration.zip
(Link:https://drive.google.com/file/d/1nCN_jK7EJF2DmPUUxgOggnvJ
6k6tksYz/view)
2. Delete the existing bin folder at C:\Hadoop-3.4.1\bin and replace it with the bin
folder from the downloaded Hadoop Configuration.zip.
7. Testing
1. Open cmd, change directory to "C:\Hadoop-3.4.1\sbin", and run "start-dfs.cmd" to start the HDFS daemons (NameNode and DataNode).
2. Open the NameNode web UI in a browser at http://localhost:9870 (Hadoop 3.x; older 2.x releases use http://localhost:50070).
So, now you have successfully installed Hadoop.
3. Experiment Code
• N/A (This experiment is focused on configuration rather than coding).
4. Execution
5. Observations
6. Analysis
7. Conclusion
8. Viva Questions
Experiment No.: 2
Aim: Working on HDFS
1. Theory: HDFS is a distributed storage system designed to handle large datasets across
a cluster of machines, ensuring high throughput and fault tolerance.
• Components of HDFS:
a) NameNode: Oversees the metadata and structure of the filesystem.
b) DataNode: Handles the storage of data blocks and manages read/write
requests.
• Key Characteristics of HDFS:
• Provides fault tolerance by replicating data blocks.
• Scales efficiently to manage petabytes of data.
• Enables seamless streaming access to the data stored in the filesystem.
2. Procedure
a) Start HDFS Services:
• Run the following command to start the HDFS: start-dfs.sh
b) Access the HDFS UI:
• Open the Hadoop NameNode UI in a browser:” http://localhost:9870”
c) Create a Directory in HDFS:
• Use the following command to create a new directory:
“hdfs dfs -mkdir /user/<your-username>/example_dir”
d) Upload a File to HDFS:
• Upload a local file to HDFS:
“hdfs dfs -put <local-file-path> /user/<your-username>/example_dir/”
e) List Files in HDFS:
• View the contents of the HDFS directory:
“hdfs dfs -ls /user/<your-username>/example_dir “
f) Retrieve a File from HDFS:
• Download a file from HDFS to the local filesystem:
“hdfs dfs -get /user/<your-username>/example_dir/<file-name> <local-destination-
path>”
g) Delete a File in HDFS:
• Remove a file from HDFS:
hdfs dfs -rm /user/<your-username>/example_dir/<file-name>
h) Stop HDFS Services:
• Use the following command to stop HDFS services: stop-dfs.sh
3. Experiment Code
• N/A (this experiment is focused on file operations via terminal commands); an optional Python-scripted version is sketched below.
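Optionally, the same operations can be driven from Python by shelling out to the hdfs client. A minimal sketch, assuming the hdfs command is on the PATH and using a hypothetical username (student) and file name (local_file.txt):

import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' subcommand and return its output (raises on failure)."""
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Hypothetical username and file names; adjust to your environment
hdfs("-mkdir", "-p", "/user/student/example_dir")
hdfs("-put", "local_file.txt", "/user/student/example_dir/")
print(hdfs("-ls", "/user/student/example_dir"))
hdfs("-get", "/user/student/example_dir/local_file.txt", "downloaded.txt")
hdfs("-rm", "/user/student/example_dir/local_file.txt")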
4. Execution
• Perform the HDFS commands step by step as outlined in the process.
• Validate each operation by checking the Hadoop NameNode UI and reviewing the
terminal outputs.
5. Observations
• Successfully executed HDFS operations such as directory creation, file upload, and
retrieval.
• Confirmed replication factor and block distribution for uploaded files through the
HDFS web interface.
6. Analysis
• Data Distribution: Files are divided into blocks and spread across multiple
DataNodes.
• Fault Tolerance: Replication ensures data is available even if some DataNodes fail.
• Ease of Use: Simple terminal commands and the web UI make managing HDFS
straightforward.
7. Conclusion
• Gained hands-on experience with basic HDFS operations using terminal
commands.
• Verified data replication and block storage via the NameNode interface.
8. Viva Questions
1. What is the primary function of the NameNode in HDFS?
2. How does HDFS maintain fault tolerance?
3. Describe what happens when a DataNode becomes unavailable.
4. What is the purpose of the hdfs dfs -put command?
9. Multiple Choice Questions (MCQs)
1. What is the default replication factor in HDFS?
a) 2
b) 3
c) 4
d) 1
Answer: b) 3
2. Which component stores metadata in HDFS?
a) DataNode
b) NameNode
c) Secondary NameNode
d) JobTracker
Answer: b) NameNode
10. References
• Official Apache Hadoop Documentation: https://hadoop.apache.org/docs/
• Hadoop: The Definitive Guide by Tom White
Experiment No.: 3
1. Aim: Running Jobs on Hadoop
2. Theory:
• MapReduce is a programming model used to process large datasets in a distributed
environment, often in Hadoop clusters.
• Map Function: The Map function takes input data and transforms it into key-value
pairs, which are the intermediate outputs. Each key-value pair is used for further
processing.
• Reduce Function: The Reduce function processes the intermediate key-value pairs
generated by the Map function, aggregates them, and provides the final result.
Workflow of MapReduce (a short Python illustration follows these steps):
1. Input Splits: Input data is divided into smaller chunks (input splits), which can be
processed in parallel across the cluster.
2. Mapper: The Mapper processes each input split and generates intermediate key-
value pairs.
3. Shuffle and Sort: This phase organizes the intermediate key-value pairs by
grouping them according to their keys.
4. Reducer: The Reducer processes the sorted key-value pairs and produces the final
output, often after aggregation or summarization.
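The workflow above can be mimicked in a few lines of plain Python for word count; this is only an illustration of the map, shuffle/sort, and reduce phases and does not use Hadoop:

from itertools import groupby
from operator import itemgetter

lines = ["hello world", "hello hadoop"]          # stands in for the input splits

# Map phase: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort phase: order the pairs so equal keys are adjacent
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the counts for each word
for word, pairs in groupby(mapped, key=itemgetter(0)):
    print(word, sum(count for _, count in pairs))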
3. Procedure
1. Start Hadoop Services:
• Run the following commands to start Hadoop's HDFS and YARN
services: start-dfs.sh start-yarn.sh
2. Prepare the Input Data:
• Create a sample text file (e.g., input.txt) with some data.
• Upload the file to HDFS:
hdfs dfs -put input.txt /user/<your-username>/example_input
3. Write a MapReduce Job:
• Use the built-in WordCount example or create a custom job using Java or
Python.
• Example: WordCount program to count word frequencies in a dataset.
4. Run the Job:
• Execute the MapReduce job:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/<your-username>/example_input /user/<your-username>/example_output
• Fault Tolerance: If a node fails, Hadoop reassigns tasks to ensure job completion.
• Resource Management: YARN allocates and monitors resources effectively during
job execution.
9. Conclusion
• Successfully executed a MapReduce job on Hadoop.
• Learned the workflow of MapReduce and verified outputs using HDFS.
10. Viva Questions
1. What is the role of the Mapper in MapReduce?
2. How does the Shuffle and Sort phase optimize job performance?
3. What is the significance of YARN in Hadoop jobs?
4. How can you monitor the progress of a Hadoop job?
11. Multiple Choice Questions (MCQs)
1. What is the primary purpose of MapReduce?
a) Data storage
b) Distributed processing
c) File replication
d) Resource management
Answer: b) Distributed processing
2. Which command is used to upload files to HDFS for a MapReduce job?
a) hdfs dfs -get
b) hdfs dfs -ls
c) hdfs dfs -put
d) hdfs dfs -rm
Answer: c) hdfs dfs -put
3. What is the output format of a typical MapReduce job?
a) Key-value pairs
b) JSON
c) XML
d) CSV
Answer: a) Key-value pairs
4. Where can you monitor the status of a MapReduce job?
a) NameNode UI
b) DataNode logs
c) ResourceManager UI
d) Secondary NameNode logs
Answer: c) ResourceManager UI
12. References
• Apache Hadoop Documentation: https://hadoop.apache.org/docs/
• Hadoop: The Definitive Guide by Tom White
Experiment No.: 4
Aim: Install Zookeeper
1. Theory:
Apache ZooKeeper
• Leader: The leader node handles all write requests in the system, ensuring that there is
a single source of truth for updates.
• Followers: The follower nodes handle read requests and forward write requests to the
leader. They replicate the leader's changes to maintain consistency.
• Observers: These nodes do not participate in leader election and are used primarily for
scaling the system. Observers can handle read requests, but they do not vote during
elections.
Use Cases:
• Leader Election: Ensuring only one active leader node to coordinate tasks.
• Configuration Management: Storing configuration data in a centralized and
consistent way.
• Distributed Locking: Implementing locks in a distributed environment to avoid race
conditions.
• Service Discovery: Helping in the discovery of available services in a dynamic
distributed system.
2. Procedure
1. Download ZooKeeper:
• Visit the Apache ZooKeeper official website.
• Download the latest stable release (e.g., zookeeper-3.x.x.tar.gz).
2. Install ZooKeeper:
• Extract the downloaded package to a desired directory:
tar -xvzf zookeeper-3.x.x.tar.gz -C /usr/local/
cd /usr/local/zookeeper-3.x.x/
• Set environment variables in the .bashrc file:
export ZOOKEEPER_HOME=/usr/local/zookeeper-3.x.x
export PATH=$PATH:$ZOOKEEPER_HOME/bin
3. Configure ZooKeeper:
• Navigate to the conf directory and create a configuration file:
cp conf/zoo_sample.cfg conf/zoo.cfg
• Edit the zoo.cfg file to configure ZooKeeper:
tickTime=2000
dataDir=/usr/local/zookeeper-3.x.x/data
clientPort=2181
• Create the data directory:
mkdir -p /usr/local/zookeeper-3.x.x/data
4. Start ZooKeeper Service:
• Start the ZooKeeper server:
zkServer.sh start
5. Verify Installation:
• Check ZooKeeper status:” zkServer.sh status”
• Use the ZooKeeper CLI to test functionality:” zkCli.sh -server localhost:2181 “
• Create a ZNode: “ create /my_node "Hello ZooKeeper" “
• Verify the ZNode: “get /my_node”
6. Stop ZooKeeper Service:
zkServer.sh stop
5. Experiment Code
• Sample commands for the ZooKeeper CLI (a Python equivalent follows these commands):
create /example "Sample Data"
ls /
set /example "Updated Data"
get /example
delete /example
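The same ZNode operations can also be issued from Python with the kazoo client library, assuming it has been installed (pip install kazoo) and ZooKeeper is listening on localhost:2181; a minimal sketch:

from kazoo.client import KazooClient

# Connect to the local ZooKeeper server
zk = KazooClient(hosts="localhost:2181")
zk.start()

zk.create("/example", b"Sample Data")       # create a ZNode
print(zk.get_children("/"))                 # list ZNodes under the root
zk.set("/example", b"Updated Data")         # update the ZNode's data
data, stat = zk.get("/example")             # read it back
print(data.decode())
zk.delete("/example")                       # remove the ZNode

zk.stop()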
6. Execution
• Execute the commands mentioned above using the ZooKeeper CLI.
• Observe the creation, update, retrieval, and deletion of ZNodes.
7. Observations
• The ZooKeeper server successfully started and maintained its state.
• The CLI commands effectively interacted with the ZooKeeper service.
8. Analysis
Experiment No.: 5
1. Aim: Pig Installation
2. Theory
Apache Pig is a high-level platform for creating MapReduce programs used with
Hadoop.
It uses a scripting language called Pig Latin, which simplifies complex data
transformations and operations.
Key Features:
• Ease of Programming: Simplifies the development of MapReduce programs.
• Extensibility: Allows the use of user-defined functions (UDFs) for custom
processing.
• Optimization Opportunities: Automatically optimizes scripts for efficient
execution.
Execution Modes:
• Local Mode: Runs on a single machine without Hadoop.
• MapReduce Mode: Runs on a Hadoop cluster.
3. Procedure
1. Pre-Requisites:
• Ensure Java and Hadoop are installed and configured.
• Verify Hadoop is running by checking its status.
• Install Pig in the same environment as Hadoop.
2. Download Apache Pig:
• Visit the official Apache Pig website.
• Download the latest stable release (e.g., pig-0.x.x.tar.gz).
3. Install Apache Pig:
• Extract the downloaded file to a desired directory:
tar -xvzf pig-0.x.x.tar.gz -C /usr/local/
cd /usr/local/pig-0.x.x/
• Set environment variables in the .bashrc file:
export PIG_HOME=/usr/local/pig-0.x.x
export PATH=$PATH:$PIG_HOME/bin
• Source the .bashrc file to apply changes:
source ~/.bashrc
4. Configure Pig:
• For local mode, no additional configuration is required.
12. References
• Apache Pig Documentation
• Hadoop: The Definitive Guide by Tom White
Experiment No.: 6
1. Aim: Sqoop Installation.
2. Theory: Apache Sqoop is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured data stores such as relational databases.
Key Features:
• Import/Export Data: Move data between Hadoop and RDBMS.
• Parallel Processing: Supports parallel data transfer for efficiency.
• Integration: Works seamlessly with Hive and HBase.
• Incremental Load: Import only new or updated data.
Typical Use Cases:
• Importing data from a MySQL or PostgreSQL database into HDFS.
• Exporting processed data from HDFS back to an RDBMS.
3. Procedure
1. Pre-Requisites:
• Ensure Java, Hadoop, and a relational database (e.g., MySQL) are installed and
configured.
• Verify Hadoop is running: jps
2. Download Apache Sqoop:
• Visit the official Apache Sqoop website.
• Download the latest stable release (e.g., sqoop-1.x.x.tar.gz).
3. Install Apache Sqoop:
• Extract the downloaded file to a desired directory:
“ tar -xvzf sqoop-1.x.x.tar.gz -C /usr/local/ cd /usr/local/sqoop-1.x.x/ “
• Set environment variables in the .bashrc file:
“export SQOOP_HOME=/usr/local/sqoop-1.x.x
export PATH=$PATH:$SQOOP_HOME/bin “
• Source the .bashrc file to apply changes: “source ~/.bashrc “
4. Configure Sqoop:
• Download and place the JDBC driver for your database in the
$SQOOP_HOME/lib directory. For example, for MySQL:wget
https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-
x.x.x.tar.gz
tar -xvzf mysql-connector-java-x.x.x.tar.gz
cp mysql-connector-java-x.x.x.jar $SQOOP_HOME/lib/
• Ensure the database service (e.g., MySQL) is running.
5. Verify Installation:
--export-dir /user/hadoop/salaries_data \
--m 1
5. Execution
• Execute the Sqoop commands in the terminal.
• Verify the transferred data in HDFS and the relational database.
6. Observations
• Sqoop successfully transferred data between MySQL and HDFS.
• Data transfer operations were efficient and error-free.
7. Analysis
• Sqoop simplifies data movement between RDBMS and Hadoop.
• The tool supports parallel processing for faster data transfer.
8. Conclusion
• Successfully installed and configured Apache Sqoop.
• Performed data import/export operations using Sqoop commands.
9. Viva Questions
1. What is the purpose of Apache Sqoop?
2. Explain the difference between import and export in Sqoop.
3. What role does the JDBC driver play in Sqoop?
4. How can you perform incremental imports in Sqoop?
10.Multiple Choice Questions (MCQs)
1. Apache Sqoop is primarily used for:
a) Real-time data processing
b) Data transfer between RDBMS and Hadoop
c) Monitoring Hadoop cluster health
d) Running machine learning algorithms
Answer: b) Data transfer between RDBMS and Hadoop
2. Which command is used to import data from RDBMS to HDFS?
a) sqoop export
b) sqoop import
c) sqoop transfer
d) sqoop migrate
Answer: b) sqoop import
11.References
• Apache Sqoop Documentation
• Hadoop: The Definitive Guide by Tom White
Experiment No.: 7
1. Aim: HBase Installation
2. Theory: Apache HBase is an open-source, distributed, and scalable NoSQL database
inspired by Google's Bigtable. It is designed to manage large amounts of data in a fault-
tolerant way, running on top of Hadoop and using HDFS for storage.
• Key Features:
• Distributed: Scales horizontally by adding machines.
• Column-Oriented: Data is stored in columns rather than rows.
• Real-Time Access: Fast random read/write operations on large datasets.
• Fault-Tolerant: Data replication ensures fault tolerance.
• HBase Operations:
• Put: Inserts data into tables.
• Get: Retrieves data from tables.
• Scan: Fetches a range of rows.
• Delete: Removes data from tables.
3. Procedure
1. Pre-Requisites:
• Ensure Hadoop and Java are installed and configured.
• Verify that Hadoop is running: jps
2. Download Apache HBase:
• Visit the official Apache HBase website and download the latest stable release
(e.g., hbase-2.x.x.tar.gz).
3. Extract HBase Files:
• Extract the downloaded file to the desired directory:
tar -xvzf hbase-2.x.x.tar.gz -C /usr/local/
cd /usr/local/hbase-2.x.x/
4. Configure HBase:
• Open the hbase-site.xml file located in $HBASE_HOME/conf/. If not present,
create it, and add the following:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
</configuration>
• Ensure Hadoop HDFS and Zookeeper services are running before starting
HBase.
• Optionally, configure additional properties like hbase.master
and hbase.regionserver.
4. Experiment Code
Create Table Command:
create 'students', 'personal', 'grades'
Insert Data Commands:
put 'students', 'row1', 'personal:name', 'John Doe'
put 'students', 'row1', 'personal:age', '21'
put 'students', 'row1', 'grades:math', 'A'
Retrieve Data Command:
get 'students', 'row1'
Scan Table Command:
scan 'students'
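Equivalent operations can be scripted from Python with the happybase client; this is a sketch that assumes happybase is installed and the HBase Thrift server has been started (hbase thrift start) on its default port 9090:

import happybase

# Connect to HBase through the Thrift gateway
connection = happybase.Connection("localhost", port=9090)

# Create the table with two column families (skip if it already exists)
if b"students" not in connection.tables():
    connection.create_table("students", {"personal": {}, "grades": {}})

table = connection.table("students")
table.put(b"row1", {b"personal:name": b"John Doe",
                    b"personal:age": b"21",
                    b"grades:math": b"A"})

print(table.row(b"row1"))                  # get a single row
for key, data in table.scan():             # scan the whole table
    print(key, data)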
5. Execution
• Execute the commands in the HBase shell to create tables, insert data, and retrieve
data.
• Verify the inserted data and scanned results.
6. Observations
• HBase was successfully installed and configured.
• Data was successfully inserted into and retrieved from the 'students' table.
7. Analysis
• HBase's column-oriented architecture allows efficient storage and retrieval of large
datasets.
• Its integration with Hadoop ensures scalability and distributed storage for
managing big data.
8. Conclusion
9. Viva Questions
1. What is HBase, and how does it differ from relational databases?
2. Explain the architecture of HBase.
3. How does HBase achieve scalability?
4. What is the role of Zookeeper in HBase?
5. Explain the concept of column families in HBase.
11.References
• Apache HBase Documentation
• HBase: The Definitive Guide by Lars George
Experiment No.: 8
1. Aim: Hadoop streaming
2. Theory: Hadoop Streaming: A utility enabling MapReduce jobs with any language
capable of handling stdin and stdout, such as Python, Perl, or Ruby.
• Key Components:
• Mapper: Processes each line of input and outputs key-value pairs.
• Reducer: Aggregates key-value pairs output by the Mapper to produce the
final result.
3. Procedure
1. Setup and Verification:
• Ensure Hadoop is installed, and HDFS and YARN services are running.
• Confirm with commands like jps.
2. Input Data Preparation:
• Create an input file input.txt containing:
hello world
hello hadoop
hello streaming
hadoop world
3. Mapper Script:
• Create a Python file mapper.py for the Mapper function.
• mapper.py:
import sys

for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print(f"{word}\t1")
4. Reducer Script:
• Create a Python file reducer.py for the Reducer function.
• reducer.py:
import sys
from collections import defaultdict

# Aggregate the counts for each word emitted by the mapper
counts = defaultdict(int)
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    counts[word] += int(count)

# Emit the final (word, total_count) pairs
for word, total in counts.items():
    print(f"{word}\t{total}")
5. Upload Data to HDFS:
• hadoop fs -put input.txt /user/hadoop/input/
6. Run Hadoop Streaming Job:
• hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/hadoop/input/input.txt \
-output /user/hadoop/output/ \
-mapper "python mapper.py" \
-reducer "python reducer.py" \
-file mapper.py \
-file reducer.py
7. Output Verification:
• Once the job completes, the output will be stored in HDFS at
/user/hadoop/output/. To view the result, use the following command:
hadoop fs -cat /user/hadoop/output/part-*
5. Experiment Code
• Mapper (mapper.py) and Reducer (reducer.py) scripts.
• Hadoop Streaming job command.
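The two scripts can also be checked locally before submitting the job, chained the way Hadoop Streaming runs them (map, sort, reduce); a small sketch assuming input.txt, mapper.py, and reducer.py are in the current directory:

import subprocess, sys

with open("input.txt") as f:
    mapped = subprocess.run([sys.executable, "mapper.py"], stdin=f,
                            capture_output=True, text=True, check=True).stdout

# Hadoop sorts the mapper output by key before it reaches the reducer
shuffled = "".join(sorted(mapped.splitlines(keepends=True)))

reduced = subprocess.run([sys.executable, "reducer.py"], input=shuffled,
                         capture_output=True, text=True, check=True).stdout
print(reduced)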
6. Execution
• Run the scripts and commands.
• Confirm correct processing by inspecting the output in HDFS.
7. Observations
• Successful execution of the MapReduce job with accurate word count results.
• Mapper and Reducer functions worked as expected.
8. Analysis
• Hadoop Streaming simplifies the integration of non-Java languages with
Hadoop MapReduce.
• Enables flexibility for developers using Python or other scripting languages
for big data processing.
9. Conclusion
• Demonstrated Hadoop Streaming with Python to execute a MapReduce word
count job.
• Highlighted the versatility of Hadoop's ecosystem.
11. MCQs
1.Hadoop Streaming allows the use of which languages for MapReduce jobs?
a) Only Java programs
b) Only Python programs
c) Any language that can read from stdin and write to stdout
d) Only shell scripts
Answer: c) Any language that can read from stdin and write to stdout
2.Which command is used to execute a Hadoop Streaming job?
a) hadoop-streaming.jar
b) hadoop jar hadoop-streaming.jar
c) hadoop run streaming.jar
d) hadoop-mapreduce.jar
Answer: b) hadoop jar hadoop-streaming.jar
3.In Hadoop Streaming, the output from the Mapper is:
a) A single line
b) A key-value pair
c) A JSON file
d) A binary file
Answer: b) A key-value pair
4.The primary function of the Reducer in Hadoop Streaming is:
a) To split data into smaller chunks
b) To merge the results from the Mapper
c) To run a MapReduce job
d) To filter data
Answer: b) To merge the results from the Mapper
5.What type of data does Hadoop Streaming primarily process?
a) Binary data only
b) Text data only
c) Structured, semi-structured, and unstructured data
d) Graphical data only
Answer: c) Structured, semi-structured, and unstructured data
12. References
• Hadoop Streaming Documentation
• Hadoop: The Definitive Guide by Tom White.
Experiment No.: 9
1. Aim: Creating Mapper function using python.
2. Theory
• The Mapper function in Hadoop MapReduce processes input data and produces
intermediate key-value pairs as output.
• This function works line by line, reading each line, splitting it into words, and
outputting each word as a key with an associated count (usually 1).
• Python is widely preferred for writing Mapper functions due to its simplicity and an
extensive library ecosystem.
3. Procedure
1. Ensure Hadoop is Installed and Running:
• Verify Hadoop installation and running status:
• Ensure that the HDFS and YARN daemons are running.
2. Create Input Data:
• Create a text file input.txt containing sample data for processing. For
example:
Hadoop is a framework
MapReduce is a programming model
Python is used for writing mapper
Hadoop streaming supports multiple languages
3. Create the Mapper Script:
• Write a Python script (mapper.py) that will process each line of input data and
output key-value pairs.
• In this case, the script will split each line into words and output a (word, 1)
pair.
Python code for Mapper (mapper.py):
# mapper.py
import sys

# Read from stdin
for line in sys.stdin:
    # Remove leading/trailing whitespace and split the line into words
    words = line.strip().split()
    # Emit (word, 1) for each word in the line
    for word in words:
        print(f"{word}\t1")
• This script reads input from standard input (sys.stdin), processes each line, and
outputs the word followed by the number 1 (indicating its occurrence).
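Before running it on Hadoop, the Mapper can be tested locally by feeding it the sample file through stdin; a small sketch assuming mapper.py and input.txt are in the current directory:

import subprocess, sys

# Feed the sample file to mapper.py exactly as Hadoop Streaming would (via stdin)
with open("input.txt") as f:
    result = subprocess.run([sys.executable, "mapper.py"],
                            stdin=f, capture_output=True, text=True, check=True)
print(result.stdout)   # each output line is "<word>\t1"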
7. Analysis
• The Mapper function in Python efficiently processes text by breaking it into words.
• The output can be used in the Reducer phase for further aggregation.
• Python's simplicity makes it a reliable choice for developing Mapper functions in
Hadoop Streaming.
8. Conclusion
• A Python-based Mapper function was successfully implemented within the
Hadoop MapReduce framework.
• The process demonstrated Python's utility for writing effective MapReduce jobs.
9. Viva Questions
1. What is the Mapper’s function in the MapReduce framework?
2. How does Hadoop Streaming facilitate MapReduce jobs with Python?
3. What format does the Mapper produce as its output?
4. How can large datasets be handled effectively using Hadoop Streaming?
11.References
• Hadoop Streaming Documentation
• Hadoop: The Definitive Guide by Tom White
Experiment No.: 10
1. Aim: Creating Reducer function using python
2. Theory
• The Reducer function in Hadoop MapReduce receives the output of the Mapper
function, which is sorted and grouped by key. The Reducer processes each group of
values associated with a particular key and performs aggregation (e.g., sum,
average, etc.).
• The Reducer function outputs a key-value pair as its final result.
• Python is widely used for writing Reducer functions in Hadoop Streaming because
of its simplicity and flexibility.
3. Procedure
1. Ensure Hadoop is Installed and Running:
• Verify Hadoop installation and running status:
Ensure that the HDFS and YARN daemons are running.
2. Create Input Data:
If you don't already have an input file, create a text file input.txt containing
data that the Mapper will process. For example:
Hadoop is a framework
MapReduce is a programming model
Python is used for writing mapper
Hadoop streaming supports multiple languages
3. Create the Mapper Script:
• Write a Python script (mapper.py) that emits key-value pairs (for instance,
words with a count of 1).
Python code for Mapper (mapper.py):
# mapper.py
import sys

# Read from stdin
for line in sys.stdin:
    # Remove leading/trailing whitespace and split the line into words
    words = line.strip().split()
    # Emit (word, 1) for each word in the line
    for word in words:
        print(f"{word}\t1")
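The Reducer script (reducer.py) that pairs with this Mapper can be sketched as follows; it uses a dictionary (collections.defaultdict) to hold the word counts, consistent with the MCQs below:

# reducer.py
import sys
from collections import defaultdict

# Hold the running count for each word in a dictionary
counts = defaultdict(int)

# Read the mapper's (word, 1) pairs from stdin
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    counts[word] += int(count)

# Emit the final (word, total_count) pairs
for word, total in counts.items():
    print(f"{word}\t{total}")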
10. Multiple Choice Questions (MCQs)
1. The Mapper output received by the Reducer is sorted and grouped by:
a) Key
b) Value
c) Both Key and Value
d) None of the above
Answer: a) Key
2. Which Python data structure is used to hold the word counts in the Reducer?
a) List
b) Tuple
c) Dictionary
d) Set
Answer: c) Dictionary
3. The Reducer function is responsible for:
a) Splitting input data into words
b) Aggregating values for a given key
c) Emitting key-value pairs
d) None of the above
Answer: b) Aggregating values for a given key
4. What is the purpose of the collections.defaultdict in the Reducer code?
a) It allows storing values with default initialization
b) It stores data in alphabetical order
c) It limits the number of values stored
d) None of the above
Answer: a) It allows storing values with default initialization
11. References
• Hadoop Streaming Documentation:
https://hadoop.apache.org/docs/r1.2.1/streaming.html
• "Hadoop: The Definitive Guide" by Tom White
Experiment No.: 11
1. Aim: Creating iterators and generators using Python
        if self.current >= self.end:
            raise StopIteration   # no more items to return
        self.current += 1
        return self.current - 1

# Usage
my_iter = MyIterator(0, 5)   # creates an iterator from 0 to 4
for num in my_iter:
    print(num)
# Output: 0 1 2 3 4
2. Creating a Generator:
• A generator is defined by using a function with the yield keyword. When
the generator function is called, it returns a generator object that can be
iterated over.
Example of Custom Generator:
def my_generator(start, end):
    while start < end:
        yield start   # yields a value and suspends function execution
        start += 1    # resumes from the previous yield point

# Usage
for num in my_generator(0, 5):
    print(num)   # Output: 0 1 2 3 4
3. Difference in Memory Usage:
• You can observe that generators are memory-efficient as they yield one
value at a time and do not store the entire list in memory.
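This can be checked directly with sys.getsizeof; a small illustration (the exact byte counts vary by Python version and platform):

import sys

squares_list = [n * n for n in range(1_000_000)]    # materialises every value up front
squares_gen = (n * n for n in range(1_000_000))     # produces values lazily

print(sys.getsizeof(squares_list))   # several megabytes
print(sys.getsizeof(squares_gen))    # only a couple of hundred bytes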
4. Experiment Code
Iterator Example:
class MyIterator:
    def __init__(self, start, end):
        self.current = start
        self.end = end

    def __iter__(self):
        return self

    def __next__(self):
        if self.current >= self.end:
            raise StopIteration   # no more items to return
        self.current += 1
        return self.current - 1

# Usage of Iterator
my_iter = MyIterator(0, 5)
for num in my_iter:
    print(num)
Generator Example:
def my_generator(start, end):
    while start < end:
        yield start   # yield one value at a time
        start += 1

# Usage of Generator
for num in my_generator(0, 5):
    print(num)
5. Execution
• Run both the iterator and generator examples to see how they produce values and
compare their usage.
• Observe the efficiency of the generator when working with larger data sets.
6. Observations
• Both the iterator and the generator produced the same output for the range of
values.
• The generator was more memory efficient, especially when working with larger
data sets.
7. Analysis
• Iterators are suitable for situations where the data is finite and already available
in memory.
• Generators are ideal for cases where large data sets need to be processed lazily,
minimizing memory consumption.
8. Conclusion
• Python's iterators and generators provide an elegant and efficient way to handle
large data sets and sequences.
• While iterators require more boilerplate code, generators simplify the task by
leveraging the yield keyword and are more memory-efficient.
9. Viva Questions
1. How do iterators differ from generators in Python?
2. What role does the yield keyword play in creating generators?
Experiment No.: 12
1. Aim: Twitter data sentimental analysis using Flume and Hive
2. Theory
Apache Flume:
• Apache Flume is a distributed system for efficiently collecting, aggregating, and
moving large amounts of log data.
• It can be used to collect Twitter data (using Twitter's streaming API) and move it
into a distributed data store like HDFS (Hadoop Distributed File System).
Apache Hive:
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for
providing data summarization, querying, and analysis.
• It uses HQL (Hive Query Language), which is similar to SQL, and allows for
easy storage and querying of large datasets in Hadoop.
Sentiment Analysis:
• Sentiment Analysis is the process of determining the emotional tone (positive,
negative, or neutral) behind a piece of text.
• For Twitter data, sentiment analysis can be done using various NLP techniques
like tokenization, part-of-speech tagging, and machine learning models.
3. Requirements
Software:
• Apache Flume
• Apache Hive
• Python (for sentiment analysis)
• Twitter Developer API Access
• Apache Hadoop
• Hadoop Streaming (for running Python on Hadoop)
• NLTK or TextBlob (for sentiment analysis in Python)
Hardware:
• A system with at least 4 GB RAM and 2 CPU cores (for testing purposes).
4. Procedure
1. Setting Up Apache Flume:
• Download and install Apache Flume on the system.
• Create a configuration file for Flume to fetch data from Twitter using the
Twitter streaming API.
• Set the source, channel, and sink configuration for Flume:
• Source: The Twitter source that fetches the real-time tweets.
• Sink and channel: an HDFS sink connected through a memory channel, for example:
twitter-source-agent.sinks.hdfs-sink.type = hdfs
twitter-source-agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:9000/user/flume/twitter/
twitter-source-agent.sinks.hdfs-sink.hdfs.filePrefix = tweet_
twitter-source-agent.channels.memory-channel.type = memory
twitter-source-agent.channels.memory-channel.capacity = 1000
twitter-source-agent.channels.memory-channel.transactionCapacity = 100
Running Flume:
Run the Flume agent using the command:
flume-ng agent --conf ./conf --conf-file twitter-source-agent.conf --name twitter-source-agent
This will start collecting real-time tweets and store them in HDFS.
2. Setting Up Apache Hive:
• Ensure Apache Hive is installed and configured to run queries over Hadoop.
• Create a Hive table to store the Twitter data from Flume.
Example Hive table creation for storing tweet data:
Once sentiment analysis is done on the tweet text, store the sentiment label (positive,
negative, neutral) in a new column in the Hive table.
Example query for adding sentiment to Hive table:
ALTER TABLE twitter_data ADD COLUMNS (sentiment STRING);
Then, you can insert the sentiment data into the table after processing tweets.
5. Experiment Code
Python Code for Sentiment Analysis:
from textblob import TextBlob
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TwitterSentimentAnalysis").getOrCreate()

# Function to analyze sentiment
def analyze_sentiment(tweet):
    blob = TextBlob(tweet)
    polarity = blob.sentiment.polarity
    if polarity > 0:
        return "positive"
    elif polarity < 0:
        return "negative"
    else:
        return "neutral"

# Sample tweet data
tweets = ["I love Hadoop", "I hate bugs in my code", "Apache Flume is amazing"]

# Apply sentiment analysis
sentiments = [analyze_sentiment(tweet) for tweet in tweets]

# Display the result
for tweet, sentiment in zip(tweets, sentiments):
    print(f"Tweet: {tweet} --> Sentiment: {sentiment}")
6. Execution
• Run the Flume agent to fetch live data from Twitter.
• Store the collected tweets in Hive using the Flume configuration.
• Apply the Python sentiment analysis function to classify each tweet's sentiment.
• Store the sentiment results back in Hive.
7. Observations
• You can observe the data collection in real-time and analyze the sentiment of
each tweet.
• The sentiment of each tweet (positive, negative, or neutral) is stored in the Hive
table, allowing for further analysis.
7. Analysis
• By using Flume, you were able to collect Twitter data efficiently.
• Sentiment analysis was performed on the text of tweets, allowing classification
into categories that are useful for understanding public opinion or reaction to
certain topics.
8. Conclusion
• This experiment demonstrates the integration of Apache Flume, Apache Hive,
and Python for real-time Twitter data collection, storage, and sentiment analysis.
• Flume helps in efficiently collecting data in real time, while Hive provides an
easy way to store and query large datasets.
• Sentiment analysis can be useful in applications such as social media monitoring,
brand analysis, or political sentiment analysis.
9. Viva Questions
1. What is the role of Flume in this experiment?
2. How does Flume help in collecting Twitter data?
3. What is sentiment analysis, and why is it useful?
4. How does Hive help in storing and querying large data sets?
5. Explain how the sentiment analysis function works in Python.
10. Multiple Choice Questions (MCQs)
1. Which of the following is used to collect Twitter data in this experiment?
a) Apache Kafka
b) Apache Flume
c) Apache Spark
d) None of the above
Answer: b) Apache Flume
2. What is the function of the TextBlob library in this experiment?
a) Data collection
b) Data storage
c) Sentiment analysis
d) None of the above
Answer: c) Sentiment analysis
3. Which Hadoop component is used to store the collected Twitter data?
a) HBase
b) HDFS
c) MapReduce
d) Hive
Answer: b) HDFS
4. What is the purpose of sentiment.polarity in TextBlob?
a) To calculate the date of the tweet
b) To calculate the overall tone of the text
c) To detect the keywords in the tweet
d) None of the above
Answer: b) To calculate the overall tone of the text
11. References
• Apache Flume documentation: https://flume.apache.org/
• Apache Hive documentation: https://hive.apache.org/
• TextBlob documentation: https://textblob.readthedocs.io/en/dev/
Experiment No.: 13
1. Aim: Business insights of User usage records of data cards
2. Theory
User Usage Records of Data Cards:
• Data cards are mobile broadband devices used for internet access. They typically
come with specific data plans that users can use for browsing, downloading,
streaming, etc.
• User usage records include data such as the amount of data used, time of use, type
of services accessed (streaming, browsing, social media, etc.), and duration of use.
Business Insights:
• Business insights refer to the process of analyzing raw data to draw conclusions
that inform business decisions. For data card usage records, these insights can
reveal trends like peak usage times, data consumption patterns, and user
demographics.
Customer Segmentation:
• Analyzing the usage records helps segment users based on factors like usage
volume (heavy vs. light users), usage behavior (data usage, browsing habits), and
preferences (types of websites accessed, device usage).
Revenue Optimization:
• Usage data can provide insights into which plans are the most popular, which can
be used to optimize pricing strategies, improve plans, and create new plans that
better meet customer needs.
3. Requirements
Software:
• Python (for data analysis and visualization)
• Pandas (for data manipulation)
• Matplotlib / Seaborn (for data visualization)
• SQL (for querying databases, if the usage records are stored in a relational
database)
• Jupyter Notebook (for interactive analysis)
Data:
• A dataset of user usage records, which could include the following attributes:
• User ID: Unique identifier for each user.
• Date: Date of data usage.
• Total Data Used: Amount of data consumed in MB/GB.
5. Experiment Code
• Python Code for Business Insights Analysis:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the usage records (hypothetical file name; adjust to your dataset)
data = pd.read_csv('data_card_usage.csv')

# Analyzing the relationship between user category and total data used
plt.figure(figsize=(10, 6))
sns.boxplot(x='User Category', y='Total Data Used', data=data)
plt.title('Data Usage by User Category')
plt.show()
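Building on the same (hypothetical) DataFrame, a simple usage-based segmentation can be sketched as follows; the 500 MB threshold is an assumed cut-off, not part of the original dataset:

# Average data used per user across all records
avg_usage = data.groupby('User ID')['Total Data Used'].mean()

# Label users as heavy or light based on an assumed threshold (in MB)
segments = avg_usage.apply(lambda mb: 'heavy' if mb > 500 else 'light')
print(segments.value_counts())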
6. Execution
• Load the user data into a Python script and clean the data as required.
• Run the analysis to segment users and find patterns in data usage.
• Visualize the data using plots to identify trends, customer behaviors, and
opportunities.
7. Observations
• You may observe peak usage times for specific user categories (e.g., heavy users
may have higher data consumption on weekends).
• Popular services (such as streaming or browsing) may become apparent.
• You might notice some plans are more commonly associated with higher data
usage.
8. Analysis
• From the analysis, you can understand customer behavior based on data
consumption.
• Insights into high-usage patterns can guide marketing efforts, product
development
(new plans), and customer retention strategies.
9. Conclusion
• This experiment demonstrates how business insights can be derived from user
data card usage records.
12. References
• Pandas Documentation: https://pandas.pydata.org/pandas-docs/stable/
• Seaborn Documentation: https://seaborn.pydata.org/
• Matplotlib Documentation: https://matplotlib.org/
Experiment No.: 14
1. Aim: Wiki page ranking with Hadoop
2. Theory
PageRank Algorithm:
• PageRank is a link analysis algorithm used by Google to rank web pages in their
search engine results.
• The algorithm works by counting the number of inbound links to a page and
assigning a rank based on the importance of the linking pages. The higher the
rank of a linking page, the higher the rank transferred to the linked page.
Hadoop MapReduce:
• MapReduce is a programming model used to process and generate large datasets
that can be split into independent tasks. It is used in the Hadoop ecosystem to
handle largescale data processing in a distributed environment.
The core components of Hadoop MapReduce:
• Mapper: The mapper reads data, processes it, and outputs intermediate results
(keyvalue pairs).
• Reducer: The reducer aggregates the results and outputs the final computation.
Wiki Page Ranking:
• For Wikipedia page ranking, the data would typically consist of page-to-page
links (a directed graph). Each page will have links to other pages, and the rank of
each page will be computed iteratively based on the number of incoming links
from other pages.
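The rank update commonly used (and assumed here) is PR(p) = (1 - d)/N + d * sum over inbound pages q of PR(q)/L(q), where d is the damping factor (0.85, as in the code later), N is the number of pages, and L(q) is the number of outbound links of q. A tiny, purely local Python sketch of this iteration on a toy link graph, meant only to illustrate the math rather than the Hadoop implementation:

# Toy directed link graph: page -> list of pages it links to
links = {"PageA": ["PageB", "PageC"],
         "PageB": ["PageC"],
         "PageC": ["PageA"]}

DAMPING = 0.85
N = len(links)
ranks = {page: 1.0 / N for page in links}            # start with a uniform rank

for _ in range(20):                                  # a few iterations converge on this toy graph
    new_ranks = {page: (1 - DAMPING) / N for page in links}
    for page, outlinks in links.items():
        share = ranks[page] / len(outlinks)          # rank passed along each outbound link
        for target in outlinks:
            new_ranks[target] += DAMPING * share
    ranks = new_ranks

print(ranks)                                         # final PageRank values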
3. Requirements
Software:
• Hadoop (with HDFS and MapReduce)
• Java (for implementing MapReduce)
• Python (optional for any additional scripting/processing)
• Linux/Mac or Windows with WSL (for setting up Hadoop)
Data:
• A sample Wikipedia dataset containing page links. The dataset could be in the
form of a text file where each line contains a page and its outbound links (e.g.,
PageA -> PageB,
PageC).
4. Procedure
1. Set Up Hadoop:
}
4. Implement Reducer Function:
The reducer function will take all incoming data for each page, sum up the ranks of
the linked pages, and compute the new rank based on the PageRank formula.
Example of a Reducer in Java:
public class PageRankReducer extends Reducer<Text, Text, Text,
DoubleWritable> {
private final static DoubleWritable result = new DoubleWritable();
private static final double DAMPING_FACTOR = 0.85;
private static final double INITIAL_RANK = 1.0;
job.setJarByClass(PageRankDriver.class);
job.setMapperClass(PageRankMapper.class);
job.setReducerClass(PageRankReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
6. Execution
• Compile the Java files into a JAR file.
• Upload the input data (Wikipedia links dataset) to HDFS:
hadoop fs -put wiki_data.txt /input
• Run the PageRank MapReduce job:
hadoop jar pagerank.jar PageRankDriver /input /output
• Check the output directory to get the ranked pages.
7. Observations
• The output will show each page and its computed PageRank value.
• Observe how the ranks change with each iteration, and how the pages with more
inbound links gradually rise in rank.
8. Analysis
• After a few iterations, the ranks should stabilize. Pages with the most links from
important pages will be ranked higher.
• This experiment helps demonstrate how Hadoop can be used to implement large-
scale algorithms like PageRank.
9. Conclusion
• This experiment successfully demonstrates how to compute the Wiki PageRank
using Hadoop's MapReduce framework.
• By iteratively computing PageRank, we can rank Wikipedia pages based on their
importance, which can then be used to improve search engine results or content
recommendations.
12. References
• Hadoop Documentation: https://hadoop.apache.org/docs/stable/
MapReduce Programming Guide:
https://hadoop.apache.org/docs/stable/hadoopmapreduce-client-core/mapreduce-
programming-guide.html
Experiment No.: 15
1. Aim: Health care Data Management using Apache Hadoop ecosystem
2. Theory
Apache Hadoop Ecosystem:
Apache Hadoop is a framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is designed to
scale up from a single server to thousands of machines, each offering local computation
and storage.
Key components of the Hadoop ecosystem include:
• HDFS (Hadoop Distributed File System): A distributed file system designed to
store vast amounts of data across a cluster of computers.
• MapReduce: A programming model for processing large data sets in a parallel,
distributed manner.
• YARN (Yet Another Resource Negotiator): Manages resources in a Hadoop cluster.
• Hive: A data warehouse infrastructure built on top of Hadoop for providing data
summarization, querying, and analysis.
• Pig: A high-level platform for creating MapReduce programs used with Hadoop.
• HBase: A NoSQL database that provides real-time read/write access to large datasets.
• Sqoop: A tool for transferring bulk data between Hadoop and relational databases.
• Flume: A service for collecting, aggregating, and moving large amounts of log data.
• Oozie: A workflow scheduler for managing Hadoop jobs.
• Healthcare Data: Healthcare data includes a wide variety of information such as
patient records, treatment histories, diagnostic data, lab results, medical images,
prescriptions, and more. These datasets are typically:
▪ Structured: Tables containing patient demographics, billing information,
lab results, etc.
▪ Unstructured: Medical records in free-text form, images, reports, etc.
▪ Semi-structured: Data formats like XML or JSON that contain structured
information but do not fit neatly into tables.
Challenges in Healthcare Data Management:
• Volume: Healthcare data is growing rapidly, especially with the rise of electronic
health records (EHRs) and medical imaging.
• Variety: Healthcare data comes in various formats (structured, semi-structured, and
unstructured).
• Velocity: Real-time processing of healthcare data, such as monitoring patient vitals
or emergency response systems, is critical.
• Veracity: Healthcare data must be accurate and trustworthy, making its quality
management a significant concern.
3. Requirements
Software:
• Apache Hadoop (including HDFS, MapReduce, YARN)
• Hive (for SQL-like querying)
• Pig (optional for high-level data processing)
• HBase (for storing large datasets)
• Sqoop (optional for importing data from relational databases)
• Oozie (optional for job scheduling)
Data:
• Sample healthcare datasets (e.g., patient records, hospital data, medical images, etc.).
These can be downloaded from public healthcare repositories or simulated for the
experiment.
4. Procedure
1. Set Up Hadoop Cluster:
• Install and configure Apache Hadoop (single-node or multi-node cluster).
• Set up HDFS for distributed storage of healthcare data.
2. Data Ingestion:
Healthcare data can be ingested into HDFS using Sqoop (for importing from
relational databases), Flume (for streaming data), or directly uploading from local
systems.
Example of using Sqoop to import data from a relational database (e.g., MySQL):
sqoop import --connect jdbc:mysql://localhost/healthcare --table patient_records --username user --password pass --target-dir /healthcare_data/patient_records
3. Data Storage in HDFS:
• Upload raw healthcare data files (CSV, JSON, XML) to HDFS.
• Use HDFS to store and manage these datasets across the Hadoop cluster.
Example command to upload data:
hadoop fs -put patient_records.csv /healthcare_data/patient_records
4. Data Processing with MapReduce:
MapReduce jobs can be written to process healthcare data, such as filtering
patient records, analyzing medical histories, or extracting features from
unstructured medical texts.
Example use case: filtering patients who have a particular disease (a Python sketch follows).
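Such a filter could, for instance, be written as a Hadoop Streaming mapper in Python; in this sketch the CSV layout (disease in the fourth column, matching the Java mapper below) and the disease name "diabetes" are assumptions:

# filter_mapper.py -- emit only the records of patients with a given disease
import sys

TARGET_DISEASE = "diabetes"   # assumed filter value

for line in sys.stdin:
    fields = line.strip().split(",")
    # assumes the disease is in the 4th column, as in the Java mapper below
    if len(fields) > 3 and fields[3].strip().lower() == TARGET_DISEASE:
        print(line.strip())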
9. Visualization (optional):
Integrate Hadoop with data visualization tools (like Tableau, QlikView, or Power
BI) to present the processed data visually for easier decision-making and insights.
5. Experiment Code
Sample MapReduce Code for Healthcare Data:
• Mapper (Java):
public class PatientMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String[] fields = value.toString().split(",");
String disease = fields[3]; // Assuming disease is in the 4th column
context.write(new Text(disease), new IntWritable(1));
}
}
• Reducer (Java):
public class PatientReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
12. References
• Healthcare Big Data Management and Analytics:
https://www.springer.com/gp/book/9783030323897
• Cloudera Healthcare Solutions:
https://www.cloudera.com/solutions/healthcare.html