
BIG DATA AND HADOOP BTCS702N

Experiment No.: 1
Aim: Installing Hadoop, Configuring HDFS, Configuring Hadoop
1. Introduction:
Hadoop is an open-source framework developed by the Apache Software Foundation
for processing and storing massive amounts of data. It uses a distributed storage
system called HDFS (Hadoop Distributed File System) and a processing engine called
MapReduce to handle big data efficiently.
Hadoop's key features include scalability, fault tolerance, and the ability to process
structured, semi-structured, and unstructured data across clusters of commodity
hardware. Its ecosystem includes tools like Hive, Pig, HBase, and Spark, making it a
cornerstone for big data analytics.
Components of Hadoop
Hadoop has four core components:
1. HDFS (Hadoop Distributed File System): Stores large data across distributed
nodes with replication for reliability.
2. YARN (Yet Another Resource Negotiator): Manages cluster resources and task
scheduling.
3. MapReduce: Processes data in parallel with mapping and reducing phases.
4. Hadoop Common: Provides shared libraries and utilities for other modules.

2. Installation of Hadoop
Prerequisites: The following software is required to install Hadoop 3.4.1 on
Windows 10 (64-bit):
1. Download Hadoop 3.4.1
(Link: https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz)
OR
(https://dlcdn.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz)
2. Java JDK 1.8.0
(Link: https://www.oracle.com/java/technologies/downloads/#java2)
Set up
1. Check whether Java 1.8.0 is already installed on your system by running "javac -version".

Checking Java Version


2. If Java is not installed on your system, then first install Java under "C:\JAVA".

3. Extract hadoop-3.4.1.tar.gz and place the extracted folder under "C:\Hadoop-3.4.1".
4. Set the HADOOP_HOME environment variable on Windows 10 (see Steps 1, 2, 3 and 4 below).

Setting Hadoop Environment Path


5. Set the JAVA_HOME environment variable on Windows 10 (see Steps 1, 2, 3 and 4 below).
6. Next, add the Hadoop bin directory and the Java bin directory to the PATH variable.

ANKIT PAL 2110DMBCSE10190


BIG DATA AND HADOOP BTCS702N

Setting Hadoop bin Directory Path

Configuration
1. Edit core-site.xml: Edit file C:/Hadoop-3.4.1/etc/hadoop/core-site.xml, paste below
xml paragraph and save this file.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
2. Create folder "data" under "C:\Hadoop-3.4.1"
 Create folder "datanode" under "C:\Hadoop-3.4.1\data"
 Create folder "namenode" under "C:\Hadoop-3.4.1\data"
3. Edit hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\hadoop-3.4.1\data\namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\hadoop-3.4.1\data\datanode</value>
  </property>
</configuration>
4. Edit mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
5. Edit yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
6. Hadoop Configuration
1. Download the file Hadoop Configuration.zip
(Link: https://drive.google.com/file/d/1nCN_jK7EJF2DmPUUxgOggnvJ6k6tksYz/view)
2. Delete the existing bin folder at C:\Hadoop-3.4.1\bin and replace it with the bin
folder from the downloaded Hadoop Configuration.zip.

7. Format the NameNode:
a. Run the following command to format the HDFS
NameNode: hdfs namenode -format

8. Testing
1. Open cmd, change directory to "C:\Hadoop-3.4.1\sbin",
and type "start-dfs.cmd" to start the HDFS daemons.

9. Start Hadoop Services:
a. Use the following commands to start the HDFS and
YARN services: start-dfs.cmd and start-yarn.cmd
10. Verify Hadoop Installation:
a. Access the Hadoop NameNode UI by visiting http://localhost:9870 in a
web browser.
b. Access the ResourceManager UI by visiting http://localhost:8088.
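As an optional cross-check, the two web UIs can also be probed from a short Python script. This is only a convenience sketch (it assumes Python 3 is available on the same machine and that the daemons are already running); visiting the URLs in a browser, as above, is sufficient.

# check_ui.py -- optional sketch: confirm the Hadoop web UIs respond (assumes Python 3)
import urllib.request

for name, url in [("NameNode UI", "http://localhost:9870"),
                  ("ResourceManager UI", "http://localhost:8088")]:
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name} reachable at {url} (HTTP {resp.status})")
    except Exception as exc:
        print(f"{name} NOT reachable at {url}: {exc}")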


NameNode web UI at localhost:9870
So, now you have successfully installed Hadoop

3. Experiment Code
• N/A (This experiment is focused on configuration rather than coding).
4. Execution

 Configuration: After setting up the necessary configuration files, format the
NameNode and initiate the Hadoop services.
 Verification: Confirm the successful installation by accessing the HDFS and
YARN Web UIs at the specified URLs.

5. Observations

 NameNode Web UI: Accessible at http://localhost:9870, this interface provides
detailed information about the Hadoop filesystem.
 ResourceManager Web UI: Available at http://localhost:8088, it displays the
status of YARN resources and applications.

6. Analysis

 Pseudo-Distributed Mode: In this configuration, Hadoop operates in a
pseudo-distributed mode, emulating a distributed environment on a single
machine.
 Data Management: This setup offers insights into how data is stored and
managed within HDFS, and how resources are allocated and monitored
through YARN.

7. Conclusion

 Successful Setup: Hadoop has been successfully installed and configured in
pseudo-distributed mode.
 Configuration Files: Key configuration files have been explored and
appropriately set up.
 Web UI Verification: The setup has been verified through Hadoop’s web
UIs, confirming the system's operational status.

8. Viva Questions

1. What are the prerequisites for installing Hadoop?


2. What is the purpose of the core-site.xml file in Hadoop?

3. How do you start and verify Hadoop services?


4. Why do we need to format the NameNode?
5. Explain the purpose of hdfs-site.xml and its key configurations.

9. Multiple Choice Questions (MCQs)


1. Which of the following is NOT a prerequisite for installing Hadoop?
A) Linux-based operating system
B) Java Development Kit (JDK)
C) SSH with passwordless login
D) Microsoft Excel
Answer: D) Microsoft Excel

2. What is the purpose of the core-site.xml file in Hadoop?


A) Configures HDFS block size
B) Sets the default filesystem URI
C) Specifies replication factor
D) Stores DataNode directories
Answer: B) Sets the default filesystem URI

3. Which command is used to format the NameNode in Hadoop?


A) hadoop format
B) hdfs namenode -format
C) start-dfs.sh
D) start-yarn.sh
Answer: B) hdfs namenode -format

4. What is the default port for the HDFS Web UI?


A) 8080
B) 8088
C) 9870
D) 9000
Answer: C) 9870
10. References
• Apache Hadoop Documentation: https://hadoop.apache.org/docs/
• Hadoop: The Definitive Guide by Tom White

Experiment No.: 2

Aim: Working on HDFS


1. Theory: HDFS is a distributed storage system designed to handle large datasets
across a cluster of machines, ensuring high throughput and fault tolerance.
 Components of HDFS:
a) NameNode: Oversees the metadata and structure of the filesystem.
b) DataNode: Handles the storage of data blocks and manages read/write
requests.
 Key Characteristics of HDFS:
 Provides fault tolerance by replicating data blocks.
 Scales efficiently to manage petabytes of data.
 Enables seamless streaming access to the data stored in the filesystem.

Hadoop Distributed File System (HDFS)


HDFS enables the rapid transfer of data between compute nodes. At its outset, it was
closely coupled with MapReduce, a framework for data processing that filters and
divides up work among the nodes in a cluster, and it organizes and condenses the
results into a cohesive answer to a query. Similarly, when HDFS takes in data, it
breaks the information down into separate blocks and distributes them to different
nodes in a cluster.
With HDFS, data is written on the server once and read and reused numerous times
after that. HDFS has a primary NameNode, which keeps track of where file data is
kept in the cluster.
HDFS also has multiple DataNodes on a commodity hardware cluster—typically one
per node in a cluster. The DataNodes are generally organized within the same rack in
the data center. Data is broken down into separate blocks and distributed among the
various DataNodes for storage. Blocks are also replicated across nodes, enabling
highly efficient parallel processing.
The NameNode knows which DataNode contains which blocks and where the
DataNodes reside within the machine cluster. The NameNode also manages access to
the files, including reads, writes, creates, deletes, and the data block replication across
the DataNodes.
The NameNode operates in conjunction with the DataNodes. As a result, the cluster
can dynamically adapt to server capacity demand in real time by adding or subtracting
nodes as necessary.
The DataNodes are in constant communication with the NameNode to determine if
the DataNodes need to complete specific tasks. Consequently, the NameNode is
always aware of the status of each DataNode. If the NameNode realizes that one
DataNode isn't working properly, it can immediately reassign that DataNode's task to
a different node containing the same data block. DataNodes also communicate with
each other, which enables them to cooperate during normal file operations.


Moreover, the HDFS is designed to be highly fault-tolerant. The file system replicates
—or copies—each piece of data multiple times and distributes the copies to individual
nodes, placing at least one copy on a different server rack than the other copies.
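The block placement and replication described above can be inspected directly once a file is in HDFS. The sketch below is an illustrative example only, assuming Python 3, a running cluster, and the hdfs command on the PATH; the file name sample.txt is a hypothetical placeholder. It simply wraps the same shell commands used in the procedure that follows.

# inspect_blocks.py -- illustrative sketch: upload a file and inspect its blocks and replicas
# Assumes Python 3, a running HDFS, and the 'hdfs' command on the PATH; 'sample.txt' is hypothetical.
import subprocess

def run(cmd):
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["hdfs", "dfs", "-mkdir", "-p", "/user/hadoop/demo"])
run(["hdfs", "dfs", "-put", "-f", "sample.txt", "/user/hadoop/demo/"])
# fsck reports the blocks, their DataNode locations, and the replication factor for the file
run(["hdfs", "fsck", "/user/hadoop/demo/sample.txt", "-files", "-blocks", "-locations"])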


2. Procedure
a) Start HDFS Services:
 Run the following command to start the HDFS: start-dfs.sh
b) Access the HDFS UI:
 Open the Hadoop NameNode UI in a browser: "http://localhost:9870"
c) Create a Directory in HDFS:
 Use the following command to create a new directory:
“hdfs dfs -mkdir /user/<your-username>/example_dir”
d) Upload a File to HDFS:
 Upload a local file to HDFS:
“hdfs dfs -put <local-file-path> /user/<your-username>/example_dir/”
e) List Files in HDFS:
View the contents of the HDFS directory:
“hdfs dfs -ls /user/<your-username>/example_dir “
f) Retrieve a File from HDFS:
 Download a file from HDFS to the local filesystem:
“hdfs dfs -get /user/<your-username>/example_dir/<file-name> <local-destination-
path>”
g) Delete a File in HDFS:
Remove a file from HDFS:
hdfs dfs -rm /user/<your-username>/example_dir/<file-name>
h) Stop HDFS Services:
 Use the following command to stop HDFS services: stop-dfs.sh

3. Experiment Code
N/A (This experiment is focused on file operations via commands).
4. Execution
 Perform the HDFS commands step by step as outlined in the process.
 Validate each operation by checking the Hadoop NameNode UI and reviewing the
terminal outputs.
5. Observations
 Successfully executed HDFS operations such as directory creation, file upload,
and retrieval.
 Confirmed replication factor and block distribution for uploaded files through the
HDFS web interface.
6. Analysis
 Data Distribution: Files are divided into blocks and spread across multiple
DataNodes.
 Fault Tolerance: Replication ensures data is available even if some DataNodes
fail.


 Ease of Use: Simple terminal commands and the web UI make managing HDFS
straightforward.
7. Conclusion
 Gained hands-on experience with basic HDFS operations using terminal
commands.
 Verified data replication and block storage via the NameNode interface.
8. Viva Questions
1. What is the primary function of the NameNode in HDFS?
2. How does HDFS maintain fault tolerance?
3. Describe what happens when a DataNode becomes unavailable.
4. What is the purpose of the hdfs dfs -put command?
9. Multiple Choice Questions (MCQs)
1. What is the default replication factor in HDFS?
a) 2
b) 3
c) 4
d) 1
Answer: b) 3
2. Which component stores metadata in HDFS?
a) DataNode
b) NameNode
c) Secondary NameNode
d) JobTracker
Answer: b) NameNode
10. References
 Official Apache Hadoop Documentation: https://hadoop.apache.org/docs/
 Hadoop: The Definitive Guide by Tom White

Experiment No.: 3
1. Aim: Running Jobs on Hadoop
2. Theory:
 MapReduce is a programming model used to process large datasets in a distributed
environment, often in Hadoop clusters.
 Map Function: The Map function takes input data and transforms it into key-value
pairs, which are the intermediate outputs. Each key-value pair is used for further
processing.
 Reduce Function: The Reduce function processes the intermediate key-value pairs
generated by the Map function, aggregates them, and provides the final result.
Workflow of MapReduce:
1. Input Splits: Input data is divided into smaller chunks (input splits), which can be
processed in parallel across the cluster.


2. Mapper: The Mapper processes each input split and generates intermediate key-
value pairs.
3. Shuffle and Sort: This phase organizes the intermediate key-value pairs by
grouping them according to their keys.
4. Reducer: The Reducer processes the sorted key-value pairs and produces the final
output, often after aggregation or summarization.
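To make these four phases concrete before running them on a cluster, the following small Python sketch imitates the same flow in memory for a word count over the sample dataset used later in this experiment. It is a teaching illustration only, not how Hadoop itself executes the job.

# wordcount_sim.py -- in-memory imitation of the MapReduce phases for word count
from itertools import groupby

lines = ["Hadoop is a framework",
         "Hadoop processes large datasets",
         "Distributed systems are scalable"]

# 1. Input splits: here, each line plays the role of one split
# 2. Mapper: emit (word, 1) for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# 3. Shuffle and sort: order the intermediate pairs so equal keys sit together
mapped.sort(key=lambda kv: kv[0])

# 4. Reducer: sum the values for each key
for word, pairs in groupby(mapped, key=lambda kv: kv[0]):
    print(word, sum(count for _, count in pairs))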

3. Procedure
1. Start Hadoop Services:
 Run the following commands to start Hadoop's HDFS and YARN
services: start-dfs.sh start-yarn.sh
2. Prepare the Input Data:
 Create a sample text file (e.g., input.txt) with some data.
 Upload the file to HDFS:
hdfs dfs -put input.txt /user/<your-username>/example_input
3. Write a MapReduce Job:
 Use the built-in WordCount example or create a custom job using Java or
Python.
 Example: WordCount program to count word frequencies in a dataset.
4. Run the Job:
Execute the MapReduce job:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
wordcount /user/<your-username>/example_input /user/<your-username>/example_output

5. View Job Output:


 List the contents of the output directory in HDFS:
hdfs dfs -ls /user/<your-username>/example_output
 Retrieve and view the results:
hdfs dfs -cat /user/<your-username>/example_output/part-r-00000
6. Analyze Job Logs:
 View the job tracker or ResourceManager UI at http://localhost:8088.
 Check job progress, logs, and statistics.
7. Stop Hadoop Services:
After completing the experiment, stop all services:
i. stop-dfs.sh
ii. stop-yarn.sh
4. Experiment Code
 Sample Dataset:

Hadoop is a framework
Hadoop processes large datasets
Distributed systems are scalable
 Output from WordCount Job:
Distributed 1
Hadoop 2
a 1
are 1
datasets 1
framework 1
is 1
large 1
processes 1
scalable 1
systems 1
6. Execution
 Execute the WordCount job using the steps provided.
 Verify the output using hdfs dfs -cat and analyze the logs in the ResourceManager UI.
7. Observations
 The MapReduce job processed the input data and generated accurate word counts.
 Logs provided detailed insights into the execution phases, including map and reduce
tasks.
8. Analysis
 Distributed Processing: MapReduce splits data and processes it across multiple
nodes, improving speed and scalability.
 Fault Tolerance: If a node fails, Hadoop reassigns tasks to ensure job completion.
 Resource Management: YARN allocates and monitors resources effectively during
job execution.
9. Conclusion
 Successfully executed a MapReduce job on Hadoop.
 Learned the workflow of MapReduce and verified outputs using HDFS.
10. Viva Questions
1. What is the role of the Mapper in MapReduce?
2. How does the Shuffle and Sort phase optimize job performance?
3. What is the significance of YARN in Hadoop jobs?
4. How can you monitor the progress of a Hadoop job?
11. Multiple Choice Questions (MCQs)


1. What is the primary purpose of MapReduce?


a) Data storage
b) Distributed processing
c) File replication
d) Resource management
Answer: b) Distributed processing
2. Which command is used to upload files to HDFS for a MapReduce job?
a) hdfs dfs -get
b) hdfs dfs -ls
c) hdfs dfs -put
d) hdfs dfs -rm
Answer: c) hdfs dfs -put
3. What is the output format of a typical MapReduce job?
a) Key-value pairs
b) JSON
c) XML
d) CSV
Answer: a) Key-value pairs
4. Where can you monitor the status of a MapReduce job?
a) NameNode UI
b) DataNode logs
c) ResourceManager UI
d) Secondary NameNode logs
Answer: c) ResourceManager UI
12. References
 Apache Hadoop Documentation: https://hadoop.apache.org/docs/
 Hadoop: The Definitive Guide by Tom White

Experiment No.: 4
Aim: Install Zookeeper
1. Theory:
Apache ZooKeeper

 Apache ZooKeeper is a centralized service for maintaining configuration
information and naming, providing distributed synchronization, and offering group services.
 It is widely used in distributed systems like Hadoop, Kafka, and other services to
manage state and provide coordination among nodes.
ZooKeeper Architecture

 Leader: The leader node handles all write requests in the system, ensuring that there
is a single source of truth for updates.


 Followers: The follower nodes handle read requests and forward write requests to the
leader. They replicate the leader's changes to maintain consistency.
 Observers: These nodes do not participate in leader election and are used primarily
for scaling the system. Observers can handle read requests, but they do not vote
during elections.
Use Cases:

 Leader Election: Ensuring only one active leader node to coordinate tasks.
 Configuration Management: Storing configuration data in a centralized and
consistent way.
 Distributed Locking: Implementing locks in a distributed environment to avoid race
conditions.
 Service Discovery: Helping in the discovery of available services in a dynamic
distributed system.
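The same ZNode operations performed later with zkCli.sh can also be driven from Python. The sketch below is a hedged example that assumes the third-party kazoo client library (pip install kazoo) and a ZooKeeper server listening on localhost:2181; it is not part of the installation steps themselves.

# znode_demo.py -- hedged sketch using the kazoo client (assumes 'pip install kazoo'
# and a ZooKeeper server on localhost:2181)
from kazoo.client import KazooClient

zk = KazooClient(hosts="localhost:2181")
zk.start()                                      # connect to the ensemble

zk.ensure_path("/demo")                         # create the parent path if missing
zk.create("/demo/my_node", b"Hello ZooKeeper")  # create a ZNode with data
data, stat = zk.get("/demo/my_node")            # read it back
print(data.decode(), "version:", stat.version)

zk.set("/demo/my_node", b"Updated Data")        # update the ZNode
zk.delete("/demo/my_node")                      # clean up
zk.stop()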
2. Procedure
1. Download ZooKeeper:
• Visit the Apache ZooKeeper official website.
• Download the latest stable release (e.g., zookeeper-3.x.x.tar.gz).
2. Install ZooKeeper:
• Extract the downloaded package to a desired directory:
tar -xvzf zookeeper-3.x.x.tar.gz -C /usr/local/
cd /usr/local/zookeeper-3.x.x/
• Set environment variables in the .bashrc file:
export ZOOKEEPER_HOME=/usr/local/zookeeper-3.x.x
export PATH=$PATH:$ZOOKEEPER_HOME/bin
• Source the .bashrc file to apply changes:
source ~/.bashrc
3. Configure ZooKeeper:
• Navigate to the conf directory and create a configuration file:
cp conf/zoo_sample.cfg conf/zoo.cfg
• Edit the zoo.cfg file to configure ZooKeeper:
tickTime=2000
dataDir=/usr/local/zookeeper-3.x.x/data
clientPort=2181
• Create the data directory:
mkdir -p /usr/local/zookeeper-3.x.x/data
4. Start ZooKeeper Service:
•Start the ZooKeeper server:

zkServer.sh start

5. Verify Installation:
• Check ZooKeeper status: "zkServer.sh status"
• Use the ZooKeeper CLI to test functionality: "zkCli.sh -server localhost:2181"
• Create a ZNode: create /my_node "Hello ZooKeeper"
• Verify the ZNode: get /my_node
6. Stop ZooKeeper Service:
zkServer.sh stop
5. Experiment Code
• Sample commands for ZooKeeper CLI:
create /example "Sample Data"
ls /
set /example "Updated Data"
get /example
delete /example

6. Execution
 Execute the commands mentioned above using the ZooKeeper CLI.
 Observe the creation, update, retrieval, and deletion of ZNodes.

7. Observations
 The ZooKeeper server successfully started and maintained its state.
 The CLI commands effectively interacted with the ZooKeeper service.

8. Analysis
 Distributed Coordination: ZooKeeper ensures coordination and consistency across
distributed nodes, managing shared data and operations.
 Atomicity, Reliability, and Fault-Tolerance: ZooKeeper’s mechanisms ensure data
consistency and fault tolerance, making it suitable for maintaining the state in
distributed systems.
 Hierarchical ZNode Structure: The hierarchical structure of ZNodes allows for easy
organization, access, and management of configuration data.
9. Conclusion
 Successfully installed and configured ZooKeeper.
 Understood its functionality by creating and manipulating ZNodes using the CLI.
10. Viva Questions
1. What is the role of ZooKeeper in a distributed system?
2. Explain the significance of the tickTime parameter in zoo.cfg.

3. How does ZooKeeper achieve consistency across distributed nodes?


4. What are the key use cases of ZooKeeper?
11. Multiple Choice Questions (MCQs)
1. Which of the following is a core feature of ZooKeeper?
a) File storage
b) Distributed synchronization
c) Query optimization
d) Data replication
Answer: b) Distributed synchronization
2. Which file contains ZooKeeper's main configuration?
a) zoo.cfg
b) core-site.xml
c) server.properties
d) application.properties
Answer: a) zoo.cfg
3. What is the default port for ZooKeeper client connections?
a) 8080
b) 9000
c) 2181
d) 50070
Answer: c) 2181
12. References
 Apache ZooKeeper Documentation: https://zookeeper.apache.org/
 Hadoop: The Definitive Guide by Tom White

Experiment No.: 5
1. Aim: Pig Installation
2. Theory

Apache Pig is a high-level platform for creating MapReduce programs used with
Hadoop.
It uses a scripting language called Pig Latin, which simplifies complex data
transformations and operations.
Key Features:
 Ease of Programming: Simplifies the development of MapReduce programs.
 Extensibility: Allows the use of user-defined functions (UDFs) for custom
processing.
 Optimization Opportunities: Automatically optimizes scripts for efficient
execution.


Execution Modes:
 Local Mode: Runs on a single machine without Hadoop.
 MapReduce Mode: Runs on a Hadoop cluster.
3. Procedure
1. Pre-Requisites:
 Ensure Java and Hadoop are installed and configured.
 Verify Hadoop is running by checking its status.
 Install Pig in the same environment as Hadoop.
2. Download Apache Pig:
 Visit the official Apache Pig website.
 Download the latest stable release (e.g., pig-0.x.x.tar.gz).
3. Install Apache Pig:
 Extract the downloaded file to a desired directory:
tar -xvzf pig-0.x.x.tar.gz -C /usr/local/
cd /usr/local/pig-0.x.x/
 Set environment variables in the .bashrc file:
export PIG_HOME=/usr/local/pig-0.x.x
export PATH=$PATH:$PIG_HOME/bin
 Source the .bashrc file to apply changes:
source ~/.bashrc
4. Configure Pig:
 For local mode, no additional configuration is required.
 For MapReduce mode, ensure HADOOP_HOME is properly configured in
the environment variables.
5. Verify Installation:
 Check Pig version to confirm installation:
 pig -version
6. Run Pig in Local Mode:
 Start Pig in local mode:
pig -x local
 Execute a sample Pig script:
A = LOAD 'data.txt' USING PigStorage(',') AS (id:int, name:chararray,
age:int);
B = FILTER A BY age > 25;
DUMP B;


7. Run Pig in MapReduce Mode:


 Start Pig in MapReduce mode: pig
 Run the same script, ensuring Hadoop is running.
8. Exit Pig Grunt Shell:
 Exit the Pig shell after execution: quit;
5. Experiment Code
Sample Pig Latin Script:
-- Load data from a text file
data = LOAD 'employees.txt' USING PigStorage(',') AS (id:int, name:chararray,
    salary:float, dept:chararray);

-- Filter employees with salary greater than 50000
high_salary = FILTER data BY salary > 50000;

-- Group employees by department
grouped_data = GROUP high_salary BY dept;

-- Calculate average salary for each department
avg_salary = FOREACH grouped_data GENERATE group AS dept,
    AVG(high_salary.salary) AS avg_salary;

-- Store the result
STORE avg_salary INTO 'output' USING PigStorage(',');
6. Execution
 Execute the script in Pig Grunt Shell.
 Verify the output stored in the specified location (output folder).
7. Observations
 Apache Pig successfully executed the script and stored the output.
 Pig's high-level abstraction simplified complex MapReduce operations.
8. Analysis
 Pig Latin scripts are easy to write and manage compared to traditional
MapReduce.
 The platform efficiently processes large datasets and supports extensibility
through UDFs.
9. Conclusion


 Successfully installed and configured Apache Pig.


 Executed Pig Latin scripts in both local and MapReduce modes to process
data.

10. Viva Questions


1. What are the key differences between Local and MapReduce modes in Pig?
2. Explain the role of Pig Latin in simplifying MapReduce operations.
3. How does Pig handle optimization during script execution?
4. What is the function of LOAD and STORE commands in Pig?

11. Multiple Choice Questions (MCQs)


1. Apache Pig is best described as:
a) A file system
b) A data processing platform
c) A database management system
d) A distributed file system
Answer: b) A data processing platform
2. Which scripting language does Pig use?
a) Python
b) Pig Latin
c) Scala
d) Java
Answer: b) Pig Latin
3. What does the GROUP operation do in Pig?
a) Combines data from multiple files
b) Groups data by a specified field
c) Removes duplicate rows
d) Sorts data in ascending order
Answer: b) Groups data by a specified field
4. How do you execute a Pig script stored in a file?
a) pig run
b) pig -f
c) pig execute
d) pig start
Answer: b) pig -f
5. Which command is used to exit the Pig shell?
a) exit;
b) quit;
c) stop;
d) terminate;
Answer: b) quit;

12. References
 Apache Pig Documentation

 Hadoop: The Definitive Guide by Tom White

Experiment No.: 6
1. Aim: Sqoop Installation.
2. Theory: Apache Sqoop is a tool designed for efficiently transferring bulk data
between Apache Hadoop and structured data stores such as relational databases.
Key Features:
 Import/Export Data: Move data between Hadoop and RDBMS.
 Parallel Processing: Supports parallel data transfer for efficiency.
 Integration: Works seamlessly with Hive and HBase.
 Incremental Load: Import only new or updated data.
Typical Use Cases:
 Importing data from a MySQL or PostgreSQL database into HDFS.
 Exporting processed data from HDFS back to an RDBMS.
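The incremental-load feature listed above uses the same sqoop command-line tool as the full import in the procedure below. The following sketch is a hedged example that assembles and runs such an incremental import from Python; it reuses the employees table from the sample commands later in this experiment, while the numeric id check column and the credentials are placeholder assumptions.

# incremental_import.py -- sketch only: build and run a Sqoop incremental import
# The 'id' check column and the credentials below are placeholder assumptions.
import subprocess

cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://localhost:3306/employees_db",
    "--username", "db_user",
    "--password", "db_password",
    "--table", "employees",
    "--target-dir", "/user/hadoop/employees_data",
    "--incremental", "append",    # fetch only rows newer than --last-value
    "--check-column", "id",
    "--last-value", "100",
    "-m", "1",
]
subprocess.run(cmd, check=True)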

3. Procedure
1. Pre-Requisites:

 Ensure Java, Hadoop, and a relational database (e.g., MySQL) are installed and
configured.
 Verify Hadoop is running: jps
2. Download Apache Sqoop:
 Visit the official Apache Sqoop website.
 Download the latest stable release (e.g., sqoop-1.x.x.tar.gz).
3. Install Apache Sqoop:
 Extract the downloaded file to a desired directory:
tar -xvzf sqoop-1.x.x.tar.gz -C /usr/local/
cd /usr/local/sqoop-1.x.x/
 Set environment variables in the .bashrc file:
“export SQOOP_HOME=/usr/local/sqoop-1.x.x
export PATH=$PATH:$SQOOP_HOME/bin “
 Source the .bashrc file to apply changes: “source ~/.bashrc “
4. Configure Sqoop:
 Download and place the JDBC driver for your database in the
$SQOOP_HOME/lib directory. For example, for MySQL:
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-x.x.x.tar.gz
tar -xvzf mysql-connector-java-x.x.x.tar.gz
cp mysql-connector-java-x.x.x.jar $SQOOP_HOME/lib/
 Ensure the database service (e.g., MySQL) is running.
5. Verify Installation:
 Check Sqoop version to confirm installation:
 sqoop version
6. Perform Data Import:
 Import a table from MySQL into HDFS:
sqoop import \
--connect jdbc:mysql://localhost:3306/db_name \
--username db_user \
--password db_password \
--table table_name \
--target-dir /user/hadoop/table_data \
--m 1
 Verify the imported data in HDFS: hdfs dfs -ls /user/hadoop/table_data
7. Perform Data Export:
 Export data from HDFS back to MySQL: sqoop export \


--connect jdbc:mysql://localhost:3306/db_name \
--username db_user \
--password db_password \
--table table_name \
--export-dir /user/hadoop/table_data \
--m 1
8. Exit Sqoop:
 Sqoop runs as a command-line tool; no special command is required to exit.
4. Experiment Code
 Sample Import Command: sqoop import \
--connect jdbc:mysql://localhost:3306/employees_db \
--username root \
--password root_password \
--table employees \
--target-dir /user/hadoop/employees_data \
--m 1
 Sample Export Command: sqoop export \
--connect jdbc:mysql://localhost:3306/employees_db \
--username root \
--password root_password \
--table salaries \
--export-dir /user/hadoop/salaries_data \
--m 1
5. Execution
 Execute the Sqoop commands in the terminal.
 Verify the transferred data in HDFS and the relational database.
6. Observations
 Sqoop successfully transferred data between MySQL and HDFS.
 Data transfer operations were efficient and error-free.
7. Analysis
 Sqoop simplifies data movement between RDBMS and Hadoop.
 The tool supports parallel processing for faster data transfer.
8. Conclusion
 Successfully installed and configured Apache Sqoop.


 Performed data import/export operations using Sqoop commands.


9. Viva Questions
1. What is the purpose of Apache Sqoop?
2. Explain the difference between import and export in Sqoop.
3. What role does the JDBC driver play in Sqoop?
4. How can you perform incremental imports in Sqoop?
10.Multiple Choice Questions (MCQs)
1. Apache Sqoop is primarily used for:
a) Real-time data processing
b) Data transfer between RDBMS and Hadoop
c) Monitoring Hadoop cluster health
d) Running machine learning algorithms
Answer: b) Data transfer between RDBMS and Hadoop
2. Which command is used to import data from RDBMS to HDFS?
a) sqoop export
b) sqoop import
c) sqoop transfer
d) sqoop migrate
Answer: b) sqoop import
11.References
 Apache Sqoop Documentation
 Hadoop: The Definitive Guide by Tom White

Experiment No.: 7
1. Aim: HBase Installation
2. Theory: Apache HBase is an open-source, distributed, and scalable NoSQL
database inspired by Google's Bigtable. It is designed to manage large amounts of
data in a fault-tolerant way, running on top of Hadoop and using HDFS for storage.

 Key Features:
 Distributed: Scales horizontally by adding machines.
 Column-Oriented: Data is stored in columns rather than rows.
 Real-Time Access: Fast random read/write operations on large datasets.
 Fault-Tolerant: Data replication ensures fault tolerance.

 HBase Operations:
 Put: Inserts data into tables.
 Get: Retrieves data from tables.


 Scan: Fetches a range of rows.


 Delete: Removes data from tables.
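These operations are exercised through the HBase shell in the procedure below. For reference, the same Put/Get/Scan/Delete calls can also be issued from Python; the sketch is a hedged example that assumes the third-party happybase package and an HBase Thrift server running on localhost (for example via hbase thrift start), neither of which is part of this experiment's setup.

# hbase_demo.py -- hedged sketch using happybase (assumes 'pip install happybase'
# and an HBase Thrift server on localhost, e.g. started with 'hbase thrift start')
import happybase

connection = happybase.Connection("localhost")
table = connection.table("students")

# Put: insert values into the 'personal' and 'grades' column families
table.put(b"row1", {b"personal:name": b"John Doe",
                    b"personal:age": b"21",
                    b"grades:math": b"A"})

# Get: read a single row back
print(table.row(b"row1"))

# Scan: iterate over all rows in the table
for key, data in table.scan():
    print(key, data)

# Delete: remove the row again
table.delete(b"row1")
connection.close()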

3. Procedure
1. Pre-Requisites:
 Ensure Hadoop and Java are installed and configured.
 Verify that Hadoop is running: jps
2. Download Apache HBase:
 Visit the official Apache HBase website and download the latest stable release
(e.g., hbase-2.x.x.tar.gz).
3. Extract HBase Files:
 Extract the downloaded file to the desired directory:
tar -xvzf hbase-2.x.x.tar.gz -C /usr/local/
cd /usr/local/hbase-2.x.x/
4. Configure HBase:
 Open the hbase-site.xml file located in $HBASE_HOME/conf/. If not present,
create it.
 Add the following basic configuration:
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
</configuration>
 Ensure Hadoop HDFS and Zookeeper services are running before starting
HBase.
 Optionally, configure additional properties like hbase.master
and hbase.regionserver.
5. Set HBase Environment Variables:
 Add the following to your .bashrc file:
export HBASE_HOME=/usr/local/hbase-2.x.x
export PATH=$PATH:$HBASE_HOME/bin
 Apply changes: source ~/.bashrc
6. Start HBase:
 Start the HBase master and region server: start-hbase.sh
 Verify HBase is running: jps
 You should see processes like HMaster and HRegionServer.
7. Verify HBase Installation:
 Open the HBase shell: hbase shell
 Check if HBase is running: status 'simple'
8. Create an HBase Table:
 Create a table students with two column families: personal and grades:
create 'students', 'personal', 'grades'
 Verify the table creation:
list
9. Insert Data into HBase:
 Insert data into the students table:
put 'students', 'row1', 'personal:name', 'John Doe'
put 'students', 'row1', 'personal:age', '21'
put 'students', 'row1', 'grades:math', 'A'
10. Retrieve Data from HBase:
 Retrieve data from the students table:
get 'students', 'row1'
11. Scan the HBase Table:
 Scan the entire table:
scan 'students'
12. Stop HBase:
 Stop the HBase services:
stop-hbase.sh

4. Experiment Code
Create Table Command:
create 'students', 'personal', 'grades'


Insert Data Commands:


put 'students', 'row1', 'personal:name', 'John Doe'
put 'students', 'row1', 'personal:age', '21'
put 'students', 'row1', 'grades:math', 'A'
Retrieve Data Command:
get 'students', 'row1'
Scan Table Command:
scan 'students'

5. Execution
 Execute the commands in the HBase shell to create tables, insert data, and retrieve
data.
 Verify the inserted data and scanned results.
6. Observations
 HBase was successfully installed and configured.
 Data was successfully inserted into and retrieved from the students table.
7. Analysis
 HBase's column-oriented architecture allows efficient storage and retrieval of large
datasets.
 Its integration with Hadoop ensures scalability and distributed storage for
managing big data.
8. Conclusion
 Successfully installed and configured Apache HBase.
 Performed basic operations such as table creation, data insertion, and data retrieval.

9. Viva Questions
1. What is HBase, and how does it differ from relational databases?
2. Explain the architecture of HBase.
3. How does HBase achieve scalability?
4. What is the role of Zookeeper in HBase?
5. Explain the concept of column families in HBase.

10.Multiple Choice Questions (MCQs)


1. HBase is primarily used for:
a) Storing unstructured data
b) Real-time processing of data
c) Storing large amounts of structured data
d) Running MapReduce jobs
Answer: c) Storing large amounts of structured data


2. In HBase, a column family is:


a) A set of rows
b) A set of columns with a similar type of data
c) A row in a table
d) A type of data format
Answer: b) A set of columns with a similar type of data
3. Which of the following is true about HBase?
a) It is a relational database
b) It is used for batch processing only
c) It is a distributed NoSQL database
d) It does not support real-time access
Answer: c) It is a distributed NoSQL database
4. Which of the following is required for HBase to work?
a) Only HBase
b) HBase and Zookeeper
c) HBase and MySQL
d) HBase and Apache Spark
Answer: b) HBase and Zookeeper

11.References
 Apache HBase Documentation
 HBase: The Definitive Guide by Lars George

Experiment No.: 8
1. Aim: Hadoop streaming
2. Theory: Hadoop Streaming: A utility enabling MapReduce jobs with any language
capable of handling stdin and stdout, such as Python, Perl, or Ruby.

 Key Components:
 Mapper: Processes each line of input and outputs key-value pairs.
 Reducer: Aggregates key-value pairs output by the Mapper to produce the
final result.

3. Procedure
1. Setup and Verification:
 Ensure Hadoop is installed, and HDFS and YARN services are running.
 Confirm with commands like jps.
2. Input Data Preparation:
 Create an input file input.txt containing:
hello world
hello hadoop
hello streaming
hadoop world
3. Mapper Script:
 Create a Python file mapper.py for the Mapper function.
 mapper.py:
import sys

for line in sys.stdin:
    words = line.strip().split()
    for word in words:
        print(f"{word}\t1")
4. Reducer Script:
 Create a Python file reducer.py for the Reducer function.
 reducer.py:
import sys

current_word = None
current_count = 0

# Input arrives sorted by key, so counts for the same word are adjacent
for line in sys.stdin:
    word, count = line.strip().split('\t')
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Emit the final word
if current_word:
    print(f"{current_word}\t{current_count}")
5. Upload Data to HDFS:
 hadoop fs -put input.txt /user/hadoop/input/

6. Run Hadoop Streaming Job:


 hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/hadoop/input/input.txt \
-output /user/hadoop/output/ \
-mapper mapper.py \
-reducer reducer.py
7. Output Verification:
 Once the job completes, the output will be stored in HDFS at
/user/hadoop/output/. To view the result, use the following command:
hadoop fs -cat /user/hadoop/output/part-00000
 Expected Output:
hello 3
hadoop 2
streaming 1
world 2
8. Cleanup:
 Remove the output directory from HDFS after the job completes:
hadoop fs -rm -r /user/hadoop/output/

5. Experiment Code
 Mapper (mapper.py) and Reducer (reducer.py) scripts.
 Hadoop Streaming job command.
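Before submitting the job to the cluster, the mapper and reducer can be sanity-checked locally by piping the sample input through them in the same order Hadoop applies (map, sort, reduce). The sketch below is an optional convenience, assuming Python 3 and that mapper.py, reducer.py, and input.txt sit in the current directory.

# local_test.py -- optional sketch: run the streaming pipeline locally (map -> sort -> reduce)
# Assumes Python 3 and that mapper.py, reducer.py, and input.txt are in the current directory.
import subprocess, sys

with open("input.txt") as f:
    raw = f.read()

mapped = subprocess.run([sys.executable, "mapper.py"], input=raw,
                        capture_output=True, text=True, check=True).stdout
shuffled = "".join(sorted(mapped.splitlines(keepends=True)))  # imitate shuffle/sort
reduced = subprocess.run([sys.executable, "reducer.py"], input=shuffled,
                         capture_output=True, text=True, check=True).stdout
print(reduced)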

6. Execution
 Run the scripts and commands.
 Confirm correct processing by inspecting the output in HDFS.

7. Observations
 Successful execution of the MapReduce job with accurate word count results.
 Mapper and Reducer functions worked as expected.

8. Analysis
 Hadoop Streaming simplifies the integration of non-Java languages with
Hadoop MapReduce.
 Enables flexibility for developers using Python or other scripting languages
for big data processing.

9. Conclusion


 Demonstrated Hadoop Streaming with Python to execute a MapReduce word
count job.
 Highlighted the versatility of Hadoop's ecosystem.

10. Viva Questions


1. What is Hadoop Streaming?
2. How does Hadoop Streaming differ from standard Hadoop MapReduce?
3. Explain the role of stdin and stdout in Hadoop Streaming.
4. What are the advantages and limitations of Hadoop Streaming?
5. How does a Reducer aggregate data in Hadoop Streaming?

11. MCQs
1. Hadoop Streaming allows the use of which languages for MapReduce
jobs?
a) Only Java programs
b) Only Python programs
c) Any language that can read from stdin and write to stdout
d) Only shell scripts
Answer: c) Any language that can read from stdin and write to stdout
2. Which command is used to execute a Hadoop Streaming job?
a) hadoop-streaming.jar
b) hadoop jar hadoop-streaming.jar
c) hadoop run streaming.jar
d) hadoop-mapreduce.jar
Answer: b) hadoop jar hadoop-streaming.jar
3. In Hadoop Streaming, the output from the Mapper is:
a) A single line
b) A key-value pair
c) A JSON file
d) A binary file
Answer: b) A key-value pair
4. The primary function of the Reducer in Hadoop Streaming is:
a) To split data into smaller chunks
b) To merge the results from the Mapper
c) To run a MapReduce job
d) To filter data
Answer: b) To merge the results from the Mapper
5. What type of data does Hadoop Streaming primarily process?
a) Binary data only
b) Text data only
c) Structured, semi-structured, and unstructured data


d) Graphical data only


Answer: c) Structured, semi-structured, and unstructured data

12. References
 Hadoop Streaming Documentation
 Hadoop: The Definitive Guide by Tom White.

Experiment No.: 9
1. Aim: Creating Mapper function using python.
2. Theory
 The Mapper function in Hadoop MapReduce processes input data and produces
intermediate key-value pairs as output.
 This function works line by line, reading each line, splitting it into words, and
outputting each word as a key with an associated count (usually 1).
 Python is widely preferred for writing Mapper functions due to its simplicity and an
extensive library ecosystem.
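As a quick standalone illustration of the theory above (independent of Hadoop), the core of the Mapper is simply "split the line, emit (word, 1)"; for one sample line it produces the following pairs.

# Standalone illustration of the Mapper's emission for a single input line
line = "Hadoop is a framework"
for word in line.strip().split():
    print(f"{word}\t1")
# Emits: Hadoop 1, is 1, a 1, framework 1 (one tab-separated pair per line)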

3. Procedure
1. Ensure Hadoop is Installed and Running:
 Verify Hadoop installation and running status:
 Ensure that the HDFS and YARN daemons are running.
2. Create Input Data:


 Create a text file input.txt containing sample data for processing. For
example:
Hadoop is a framework
MapReduce is a programming model
Python is used for writing mapper
Hadoop streaming supports multiple languages
3. Create the Mapper Script:
 Write a Python script (mapper.py) that will process each line of input data and
output key-value pairs.
 In this case, the script will split each line into words and output a (word, 1)
pair.
Python code for Mapper (mapper.py):
# mapper.py
import sys

# Read from stdin
for line in sys.stdin:
    # Remove leading/trailing whitespace and split the line into words
    words = line.strip().split()
    # Emit (word, 1) for each word in the line
    for word in words:
        print(f"{word}\t1")
 This script reads input from standard input (sys.stdin), processes each line, and
outputs the word followed by the number 1 (indicating its occurrence).
4. Upload Input Data to HDFS:
 Use the following command to upload the input file input.txt to HDFS:
hadoop fs -put input.txt /user/hadoop/input/
5. Run the Hadoop Streaming Job:
 Use the following command to execute the MapReduce job, specifying the
Python Mapper script:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/hadoop/input/input.txt \
-output /user/hadoop/output/ \
-mapper mapper.py


 This command tells Hadoop to use the mapper.py script to process the
data.
6. Verify the Output:
 After the job completes, the output will be saved in HDFS
at/user/hadoop/output/.
 You can check the output using the following command:
hadoop fs -cat /user/hadoop/output/part-00000
 The output will show one (word, 1) pair per word occurrence, sorted by key, such as:
Hadoop 1
Hadoop 1
MapReduce 1
Python 1
a 1
a 1
for 1
framework 1
is 1
is 1
is 1
languages 1
mapper 1
model 1
multiple 1
programming 1
streaming 1
supports 1
used 1
writing 1
7. Clean Up:
Remove the output directory from HDFS after the job completes:
hadoop fs -rm -r /user/hadoop/output/
4. Experiment Code
Mapper Code (mapper.py):
import sys

# Read from stdin
for line in sys.stdin:
    # Remove leading/trailing whitespace and split the line into words
    words = line.strip().split()
    # Emit (word, 1) for each word in the line
    for word in words:
        print(f"{word}\t1")
Hadoop Streaming Command:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/hadoop/input/input.txt \
-output /user/hadoop/output/ \
-mapper mapper.py


5. Execution
 Run the Hadoop Streaming command provided above to execute the Mapper
function.
 Check the output stored in HDFS using the hadoop fs -cat command.
6. Observations
 The Mapper function processed the input data and emitted key-value pairs for
each word with an occurrence count of 1.
 The output file in HDFS contained all words along with their counts.

7. Analysis
 The Mapper function in Python efficiently processes text by breaking it into words.
 The output can be used in the Reducer phase for further aggregation.
 Python's simplicity makes it a reliable choice for developing Mapper functions in
Hadoop Streaming.

8. Conclusion
 A Python-based Mapper function was successfully implemented within the
Hadoop MapReduce framework.
 The process demonstrated Python's utility for writing effective MapReduce jobs.

9. Viva Questions
1. What is the Mapper’s function in the MapReduce framework?
2. How does Hadoop Streaming facilitate MapReduce jobs with Python?
3. What format does the Mapper produce as its output?
4. How can large datasets be handled effectively using Hadoop Streaming?
5. Name other languages supported by Hadoop Streaming apart from Python.

10.Multiple Choice Questions (MCQs)


1. The output of the Mapper function in Hadoop Streaming is:
a) A single text line
b) Key-value pairs
c) JSON-formatted data
d) None of the above
Answer: b) Key-value pairs
2. Which Python function processes input data in the Mapper function?
a) sys.stdin.read()
b) sys.stdin()
c) sys.stdin.readline()
d) None of these
Answer: a) sys.stdin.read()
3. What is the Mapper’s role in the MapReduce workflow?
a) To read input data and generate key-value pairs

b) To sort data
c) To aggregate intermediate results
d) To handle I/O operations
Answer: a) To read input data and generate key-value pairs
4. Which command executes the Python Mapper in Hadoop Streaming?
a) hadoop stream run
b) hadoop jar hadoop-streaming.jar
c) hadoop exec
d) hadoop maprun
Answer: b) hadoop jar hadoop-streaming.jar

11.References
 Hadoop Streaming Documentation
 Hadoop: The Definitive Guide by Tom Whit

Experiment No.: 10
1. Aim: Creating Reducer function using python
2. Theory
 The Reducer function in Hadoop MapReduce receives the output of the Mapper
function, which is sorted and grouped by key. The Reducer processes each group of
values associated with a particular key and performs aggregation (e.g., sum,
average, etc.).
 The Reducer function outputs a key-value pair as its final result.
 Python is widely used for writing Reducer functions in Hadoop Streaming because
of its simplicity and flexibility.
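The "sorted and grouped by key" behaviour described above can be previewed in plain Python with itertools.groupby. This is only an illustration of what the framework hands to the Reducer, not part of the job itself; the sample pairs are made up for the demonstration.

# Illustration: how sorted (word, count) pairs are grouped by key before reduction
from itertools import groupby

# Intermediate pairs as they reach the Reducer: already sorted by key
pairs = [("Hadoop", 1), ("Hadoop", 1), ("is", 1), ("is", 1), ("is", 1), ("streaming", 1)]

for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(word, sum(count for _, count in group))
# Hadoop 2
# is 3
# streaming 1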
3. Procedure
1. Ensure Hadoop is Installed and Running:
Verify Hadoop installation and running status:
Ensure that the HDFS and YARN daemons are running.


2. Create Input Data:


If you don't already have an input file, create a text file input.txt containing
data that the Mapper will process. For example:
Hadoop is a framework
MapReduce is a programming model
Python is used for writing mapper
Hadoop streaming supports multiple languages
3. Create the Mapper Script:
 Write a Python script (mapper.py) that emits key-value pairs (for instance,
words with a count of 1).
Python code for Mapper (mapper.py):
# mapper.py
import sys

# Read from stdin
for line in sys.stdin:
    # Remove leading/trailing whitespace and split the line into words
    words = line.strip().split()
    # Emit (word, 1) for each word in the line
    for word in words:
        print(f"{word}\t1")

4. Create the Reducer Script:


 Write a Python script (reducer.py) that will take the key-value pairs from the
Mapper, aggregate the values (sum them in this case), and emit the final
output.
 The Reducer function will receive input where each key will be grouped
with its corresponding values (counts). It will sum these counts for each
word.
Python code for Reducer (reducer.py):
# reducer.py
import sys
from collections import defaultdict

# Dictionary to hold the aggregate counts for each word
word_count = defaultdict(int)

# Read from stdin
for line in sys.stdin:
    # Remove leading/trailing whitespace
    line = line.strip()
    # Split the line into a key-value pair (word, count)
    word, count = line.split('\t')
    word_count[word] += int(count)

# Emit the aggregated result (word, total count)
for word, count in word_count.items():
    print(f"{word}\t{count}")
5. Upload Input Data to HDFS:
 Upload the input file input.txt to HDFS if it's not already there:
hadoop fs -put input.txt /user/hadoop/input/
6. Run the Hadoop Streaming Job:
 Use the following command to execute the MapReduce job,
specifying both the Mapper and Reducer Python scripts:
Hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/hadoop/input/input.txt \
-output /user/hadoop/output/ \
-mapper mapper.py \
-reducer reducer.py
7. Verify the Output:
 After the job completes, the output will be saved in HDFS at
/user/hadoop/output/. Use the following command to check the output:
hadoop fs -cat /user/hadoop/output/part-00000
 The output should show the word count for each word in the input
file, like:
Hadoop 2
is 3
a 2
framework 1
MapReduce 1
programming 1
model 1
Python 1
used 1
for 1
writing 1
mapper 1
streaming 1
supports 1
multiple 1
languages 1
8. Clean Up:
 Remove the output directory from HDFS after the job completes:
hadoop fs -rm -r /user/hadoop/output/
4. Experiment Code
Mapper Code (mapper.py):
import sys

# Read from stdin
for line in sys.stdin:
    # Remove leading/trailing whitespace and split the line into words
    words = line.strip().split()
    # Emit (word, 1) for each word in the line
    for word in words:
        print(f"{word}\t1")

Reducer Code (reducer.py):
import sys
from collections import defaultdict

# Dictionary to hold the aggregate counts for each word
word_count = defaultdict(int)

# Read from stdin
for line in sys.stdin:
    # Remove leading/trailing whitespace
    line = line.strip()
    # Split the line into a key-value pair (word, count)
    word, count = line.split('\t')
    word_count[word] += int(count)

# Emit the aggregated result (word, total count)
for word, count in word_count.items():
    print(f"{word}\t{count}")
Hadoop Streaming Command:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/hadoop/input/input.txt \
-output /user/hadoop/output/ \
-mapper mapper.py \
-reducer reducer.py
5. Execution
• Execute the job by running the Hadoop Streaming command.
• Check the output in HDFS using hadoop fs -cat.
6. Observations


• The Reducer function successfully aggregated the counts for each word and
emitted the total count.
• The final output in HDFS contains each word from the input text file along with
its corresponding count.
7. Analysis
• The Mapper emits key-value pairs of words and counts, which are grouped by the
Reducer.
• The Reducer aggregates these values (sum in this case) and outputs the final word
count for each word in the input.
8. Conclusion
• We successfully created a Reducer function using Python in Hadoop's
MapReduce framework.
• The job processed the input data, aggregated the counts for each word, and
produced the correct output, demonstrating the effective use of Python in Hadoop
Streaming.
9. Viva Questions
1. What is the role of the Reducer in the MapReduce framework?
2. How does the Reducer function process data in Hadoop Streaming?
3. What happens to the key-value pairs before they are passed to the Reducer?
4. What would happen if the word count logic in the Reducer was incorrect?
5. How do you optimize the performance of a Reducer in Hadoop?
10. Multiple Choice Questions (MCQs)
1. In the Reducer function, the input is grouped by:
a) Key
b) Value
c) Both Key and Value
d) None of the above

Answer: a) Key

2. Which Python data structure is used to hold the word counts in the
Reducer? a) List
b) Tuple
c) Dictionary
d) Set
Answer: c) Dictionary


3. The Reducer function is responsible for:


a) Splitting input data into words
b) Aggregating values for a given key
c) Emitting key-value pairs
d) None of the above
Answer: b) Aggregating values for a given key
4. What is the purpose of the collections.defaultdict in the Reducer code?
a) It allows storing values with default initialization
b) It stores data in alphabetical order
c) It limits the number of values stored
d) None of the above
Answer: a) It allows storing values with default initialization

11. References
 Hadoop Streaming Documentation:
https://hadoop.apache.org/docs/r1.2.1/streaming.html
 "Hadoop: The Definitive Guide" by Tom White

Experiment No.: 11

1. Aim: Python iterator and generators


2. Theory
Iterators:
a. An iterator in Python is any object that implements two methods:
i. __iter__() which returns the iterator object itself.
ii. __next__() which returns the next value from the container, raising a
StopIteration exception when there are no more items.


b. Iterators allow you to loop through a collection (like a list or tuple) without
needing to explicitly use an index.
Generators:
a. A generator is a special type of iterator that is defined using a function and
the yield keyword.
b. A generator function does not return a value all at once, but rather yields
values one at a time as the function is iterated over.
c. Generators are memory efficient as they generate items lazily (only when
required), rather than storing all values in memory at once.
Key Differences:

1. Iterator: Requires an explicit __iter__() and __next__() method.


2. Generator: Defined using a function and the yield keyword, making it
simpler and more memory efficient.
3. Procedure
1. Creating an Iterator:
 An iterator in Python requires defining a class with two key methods:
__iter__() and __next__().
 Example of Custom Iterator:
class MyIterator:
    def __init__(self, start, end):
        self.current = start
        self.end = end

    def __iter__(self):
        return self

    def __next__(self):
        if self.current >= self.end:
            raise StopIteration  # No more items to return
        self.current += 1
        return self.current - 1

# Usage
my_iter = MyIterator(0, 5)  # Creates an iterator from 0 to 4
for num in my_iter:
    print(num)
# Output: 0 1 2 3 4
2. Creating a Generator:
 A generator is defined by using a function with the yield keyword. When
the generator function is called, it returns a generator object that can be
iterated over.
Example of Custom Generator:
def my_generator(start, end):
    while start < end:
        yield start  # Yields value and suspends function execution
        start += 1   # Resumes from the previous yield point

# Usage
for num in my_generator(0, 5):
    print(num)  # Output: 0 1 2 3 4
3. Difference in Memory Usage:
 You can observe that generators are memory-efficient as they yield one
value at a time and do not store the entire list in memory.

 Create a large range using an iterator and a generator to compare memory
consumption.
Iterator Example (Memory inefficient):
# This stores the entire range in memory
numbers = [x for x in range(1000000)]  # Creates a list of 1 million numbers
Generator Example (Memory efficient):
# This generates one value at a time without storing the entire range
def generate_numbers():
    for x in range(1000000):
        yield x
4. Usage in Large Data Sets:
 Iterators: Used when you have a finite, pre-existing collection.
 Generators: Especially useful when working with large data sets, as they
allow processing without loading the entire data set into memory.
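The memory difference claimed above can be made visible with sys.getsizeof, which reports the size of the container object itself: the list holds a million references, while the generator object stays a fixed, tiny size.

# Compare the size of a fully built list with the size of a generator object
import sys

numbers_list = [x for x in range(1000000)]   # materialises one million integers
numbers_gen = (x for x in range(1000000))    # produces values lazily, one at a time

print("list object size (bytes):", sys.getsizeof(numbers_list))
print("generator object size (bytes):", sys.getsizeof(numbers_gen))
# The list object is several megabytes; the generator object is only a few hundred bytes.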


4. Experiment Code
Iterator Example:
class MyIterator:
    def __init__(self, start, end):
        self.current = start
        self.end = end

    def __iter__(self):
        return self

    def __next__(self):
        if self.current >= self.end:
            raise StopIteration  # No more items to return
        self.current += 1
        return self.current - 1

# Usage of Iterator
my_iter = MyIterator(0, 5)
for num in my_iter:
    print(num)

Generator Example:
def my_generator(start, end):
    while start < end:
        yield start  # Yield one value at a time
        start += 1

# Usage of Generator
for num in my_generator(0, 5):
    print(num)

5. Execution


• Run both the iterator and generator examples to see how they produce values and
compare their usage.
• Observe the efficiency of the generator when working with larger data sets.

6. Observations
• Both the iterator and the generator produced the same output for the range of
values.
• The generator was more memory efficient, especially when working with larger
data sets.

7. Analysis
• Iterators are suitable for situations where the data is finite and already available
in memory.
• Generators are ideal for cases where large data sets need to be processed lazily,
minimizing memory consumption.

8. Conclusion
• Python's iterators and generators provide an elegant and efficient way to handle
large data sets and sequences.
• While iterators require more boilerplate code, generators simplify the task by
leveraging the yield keyword and are more memory-efficient.

9. Viva Questions
1. How do iterators differ from generators in Python?
2. What role does the yield keyword play in creating generators?
3. Is it possible to retrieve values from a generator using the next() function? If
yes, how?
4. What occurs if you attempt to iterate over a generator after all its values have
been exhausted?
5. In what ways do iterators and generators influence memory usage in Python?
10. Multiple Choice Questions (MCQs)
1. Which statement is correct regarding Python generators?
a) Generators produce all values at once.
b) Generators produce values one at a time, as needed.
c) Generators can always be restarted from the beginning.
d) None of the above.
Answer: b) Generators produce values one at a time, as needed.
2. Which method must be implemented in a custom iterator class?
a) __getitem__()
b) __next__()
c) __call__()
d) __yield__()
Answer: b) __next__()
3. What does the yield keyword do in a generator function?
a) Returns a value and stops the function execution permanently
b) Pauses the function execution and allows it to resume later from where it
left off
c) Creates a list of values
d) Directly prints the values in the function
Answer: b) Pauses the function execution and allows it to resume later from where it left off
4. What is the primary benefit of using generators?
a) Faster execution.
b) Reduced memory consumption.
c) Easier debugging.
d) Better code readability.
Answer: b) Reduced memory consumption.
11. References
• Python Official Documentation: https://docs.python.org/
• "Python Programming: An Introduction to Computer Science" by John Zelle

Experiment No.: 12
1. Aim: Twitter data sentimental analysis using Flume and Hive
2. Theory
Apache Flume:
• Apache Flume is a distributed system for efficiently collecting, aggregating, and
moving large amounts of log data.
• It can be used to collect Twitter data (using Twitter's streaming API) and move it
into a distributed data store like HDFS (Hadoop Distributed File System).
Apache Hive:
• Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, querying, and analysis.
• It uses HQL (Hive Query Language), which is similar to SQL, and allows for
easy storage and querying of large datasets in Hadoop.
Sentiment Analysis:
• Sentiment Analysis is the process of determining the emotional tone (positive,
negative, or neutral) behind a piece of text.
• For Twitter data, sentiment analysis can be done using various NLP techniques
like tokenization, part-of-speech tagging, and machine learning models.
3. Requirements
Software:
• Apache Flume
• Apache Hive
• Python (for sentiment analysis)
• Twitter Developer API Access
• Apache Hadoop
• Hadoop Streaming (for running Python on Hadoop)
• NLTK or TextBlob (for sentiment analysis in Python)
Hardware:
• A system with at least 4 GB RAM and 2 CPU cores (for testing purposes).
4. Procedure
1. Setting Up Apache Flume:
 Download and install Apache Flume on the system.
 Create a configuration file for Flume to fetch data from Twitter using the
Twitter streaming API.
 Set the source, channel, and sink configuration for Flume:
• Source: The Twitter source that fetches the real-time tweets.
• Channel: Stores tweets temporarily before sending them to the sink.
• Sink: Writes the collected data to a HDFS or Hive.
Flume Configuration Example (Twitter Source):
twitter-source-agent.sources = TwitterSource
twitter-source-agent.channels = memory-channel
twitter-source-agent.sinks = hdfs-sink

twitter-source-agent.sources.TwitterSource.type = org.apache.flume.source.twitter.TwitterSource
twitter-source-agent.sources.TwitterSource.consumerKey = <your-consumer-key>
twitter-source-agent.sources.TwitterSource.consumerSecret = <your-consumer-secret>
twitter-source-agent.sources.TwitterSource.accessToken = <your-access-token>
twitter-source-agent.sources.TwitterSource.accessTokenSecret = <your-access-token-secret>
twitter-source-agent.sources.TwitterSource.keywords = ["#bigdata", "#hadoop"]
twitter-source-agent.sources.TwitterSource.batchSize = 100

twitter-source-agent.sinks.hdfs-sink.type = hdfs
twitter-source-agent.sinks.hdfs-sink.hdfs.path = hdfs://localhost:9000/user/flume/twitter/
twitter-source-agent.sinks.hdfs-sink.hdfs.filePrefix = tweet_

twitter-source-agent.channels.memory-channel.type = memory
twitter-source-agent.channels.memory-channel.capacity = 1000
twitter-source-agent.channels.memory-channel.transactionCapacity = 100

# Wire the source and sink to the channel (required for the agent to start)
twitter-source-agent.sources.TwitterSource.channels = memory-channel
twitter-source-agent.sinks.hdfs-sink.channel = memory-channel

Running Flume:
Run the Flume agent using the command:
flume-ng agent --conf ./conf --conf-file twitter-source-agent.conf --name twitter-source-agent
This will start collecting real-time tweets and store them in HDFS.
2. Setting Up Apache Hive:
• Ensure Apache Hive is installed and configured to run queries over Hadoop.
• Create a Hive table to store the Twitter data from Flume.
Example Hive table creation for storing tweet data:
CREATE EXTERNAL TABLE twitter_data (
  tweet_id STRING,
  user_name STRING,
  tweet_text STRING,
  `timestamp` STRING
)
STORED AS TEXTFILE
LOCATION '/user/flume/twitter/';
This will allow you to run Hive queries over the data collected by Flume.
3. Perform Sentiment Analysis on Twitter Data:
Install NLTK or TextBlob in Python for sentiment analysis.
Example Python code for sentiment analysis using TextBlob:
from textblob import TextBlob

# Sample tweet text
tweet_text = "I love using Hadoop for big data processing!"

# Create a TextBlob object
blob = TextBlob(tweet_text)

# Get the sentiment polarity (ranges from -1 to 1)
sentiment = blob.sentiment.polarity
if sentiment > 0:
    print("Positive Sentiment")
elif sentiment < 0:
    print("Negative Sentiment")
else:
    print("Neutral Sentiment")
4. Store Sentiment Analysis Results in Hive:
Once sentiment analysis is done on the tweet text, store the sentiment label (positive,
negative, neutral) in a new column in the Hive table.
Example query for adding sentiment to Hive table:
ALTER TABLE twitter_data ADD COLUMNS (sentiment STRING);
Then, you can insert the sentiment data into the table after processing tweets.
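One possible way to carry out this last step is sketched below using PySpark with Hive support enabled. The results table name (twitter_data_with_sentiment) and the assumption that the raw tweets are already queryable through the twitter_data table are illustrative choices, not part of the original setup:

from textblob import TextBlob
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Assumes Spark is configured with Hive support and can see the metastore
spark = SparkSession.builder \
    .appName("TwitterSentimentToHive") \
    .enableHiveSupport() \
    .getOrCreate()

def label_sentiment(text):
    # Guard against NULL tweet text coming from the raw table
    if text is None:
        return "neutral"
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

sentiment_udf = udf(label_sentiment, StringType())

# Read the raw tweets collected by Flume (table and column names are assumptions)
tweets = spark.table("twitter_data")
labelled = tweets.withColumn("sentiment", sentiment_udf(tweets["tweet_text"]))

# Write the labelled rows to a results table for later Hive queries
labelled.write.mode("overwrite").saveAsTable("twitter_data_with_sentiment")

Once the labelled table exists, the sentiment distribution can be explored with ordinary Hive queries.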
5. Experiment Code
Python Code for Sentiment Analysis:
from textblob import TextBlob
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TwitterSentimentAnalysis").getOrCreate()

# Function to analyze sentiment
def analyze_sentiment(tweet):
    blob = TextBlob(tweet)
    polarity = blob.sentiment.polarity
    if polarity > 0:
        return "positive"
    elif polarity < 0:
        return "negative"
    else:
        return "neutral"

# Sample tweet data
tweets = ["I love Hadoop", "I hate bugs in my code", "Apache Flume is amazing"]

# Apply sentiment analysis
sentiments = [analyze_sentiment(tweet) for tweet in tweets]

# Display the result
for tweet, sentiment in zip(tweets, sentiments):
    print(f"Tweet: {tweet} --> Sentiment: {sentiment}")
6. Execution
• Run the Flume agent to fetch live data from Twitter.
• Store the collected tweets in Hive using the Flume configuration.
• Apply the Python sentiment analysis function to classify each tweet's sentiment.
• Store the sentiment results back in Hive.
7. Observations
• You can observe the data collection in real-time and analyze the sentiment of
each tweet.
• The sentiment of each tweet (positive, negative, or neutral) is stored in the Hive
table, allowing for further analysis.
8. Analysis
• By using Flume, you were able to collect Twitter data efficiently.
• Sentiment analysis was performed on the text of tweets, allowing classification
into categories that are useful for understanding public opinion or reaction to
certain topics.
9. Conclusion
• This experiment demonstrates the integration of Apache Flume, Apache Hive,
and Python for real-time Twitter data collection, storage, and sentiment analysis.
• Flume helps in efficiently collecting data in real time, while Hive provides an
easy way to store and query large datasets.
• Sentiment analysis can be useful in applications such as social media monitoring,
brand analysis, or political sentiment analysis.
10. Viva Questions
1. What is the role of Flume in this experiment?
2. How does Flume help in collecting Twitter data?
3. What is sentiment analysis, and why is it useful?
4. How does Hive help in storing and querying large data sets?
5. Explain how the sentiment analysis function works in Python.
11. Multiple Choice Questions (MCQs)


1. Which of the following is used to collect Twitter data in this experiment?
a) Apache Kafka
b) Apache Flume
c) Apache Spark
d) None of the above
Answer: b) Apache Flume
2. What is the function of the TextBlob library in this experiment?
a) Data collection
b) Data storage
c) Sentiment analysis
d) None of the above
Answer: c) Sentiment analysis
3. Which Hadoop component is used to store the collected Twitter data?
a) HBase
b) HDFS
c) MapReduce
d) Hive
Answer: b) HDFS
4. What is the purpose of sentiment.polarity in TextBlob?
a) To calculate the date of the tweet
b) To calculate the overall tone of the text
c) To detect the keywords in the tweet
d) None of the above
Answer: b) To calculate the overall tone of the text
12. References
• Apache Flume documentation: https://flume.apache.org/
• Apache Hive documentation: https://hive.apache.org/
• TextBlob documentation: https://textblob.readthedocs.io/en/dev/

Experiment No.: 13
1. Aim: Business insights of User usage records of data cards
2. Theory
User Usage Records of Data Cards:
• Data cards are mobile broadband devices used for internet access. They typically
come with specific data plans that users can use for browsing, downloading,
streaming, etc.
• User usage records include data such as the amount of data used, time of use,
type of services accessed (streaming, browsing, social media, etc.), and duration
of use.
Business Insights:
• Business insights refer to the process of analyzing raw data to draw conclusions
that inform business decisions. For data card usage records, these insights can
reveal trends like peak usage times, data consumption patterns, and user
demographics.
Customer Segmentation:
• Analyzing the usage records helps segment users based on factors like usage
volume (heavy vs. light users), usage behavior (data usage, browsing habits), and
preferences (types of websites accessed, device usage).
Revenue Optimization:
• Usage data can provide insights into which plans are the most popular, which can
be used to optimize pricing strategies, improve plans, and create new plans that
better meet customer needs.
3. Requirements
Software:
 Python (for data analysis and visualization)
 Pandas (for data manipulation)
 Matplotlib / Seaborn (for data visualization)
 SQL (for querying databases, if the usage records are stored in a relational
database)
 Jupyter Notebook (for interactive analysis)
Data:
 A dataset of user usage records, which could include the following attributes:
• User ID: Unique identifier for each user.
• Date: Date of data usage.
• Total Data Used: Amount of data consumed in MB/GB.
• Usage Time: Duration of data usage.
 Service Type: Type of service used (e.g., browsing, streaming).
 Data Plan Type: Type of plan the user is on (e.g., prepaid, postpaid).
• Location: User's geographical location (optional).
4. Procedure
1. Data Collection:
• Collect user usage records for a period of time (e.g., last 3 months, 6
months).
• Ensure that the dataset includes the relevant fields like user ID, date, data
used, usage time, and plan type.
2. Data Preprocessing:
• Clean the data by handling missing values, correcting any errors, and
ensuring consistency in units (e.g., data usage in GB or MB).
• Convert timestamps or dates into a consistent format for time-based analysis.
Example of data cleaning in Python:
import pandas as pd

# Load data
data = pd.read_csv("user_usage.csv")

# Convert date column to datetime type
data['Date'] = pd.to_datetime(data['Date'])

# Handle missing values (e.g., filling with the mean)
data['Total Data Used'].fillna(data['Total Data Used'].mean(), inplace=True)
3. Data Analysis:
• Usage Patterns: Analyze peak usage times (e.g., evening, weekend) and the
correlation between data usage and plan types.
• Customer Segmentation: Segment users based on data consumption into
categories like:
 Low users (0-1 GB/month)
 Medium users (1-5 GB/month)
 Heavy users (5+ GB/month)
• Popular Services: Identify the most accessed services (e.g., browsing, streaming) based on the usage records.
Example of segmentation:
# Segmentation by data usage
bins = [0, 1, 5, 100]
labels = ['Low User', 'Medium User', 'Heavy User']
data['User Category'] = pd.cut(data['Total Data Used'], bins=bins, labels=labels)
4. Business Insights:
• Usage Trends: Identify trends in data usage over time, such as increasing
usage during holidays or weekends.
• Revenue Opportunities: Analyze which data plans are most popular and
correlate them with user demographics to suggest potential improvements or
changes in pricing.
• Customer Retention: Identify heavy users who may need upgraded plans
and offer retention strategies (e.g., discounts, loyalty rewards).
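A short pandas sketch of the kind of aggregation behind these insights is given below. It assumes the cleaned DataFrame from the preprocessing step and the 'User ID' and 'Data Plan Type' columns listed in the requirements (exact column names may differ in a real dataset):

# Peak-usage and plan-popularity summaries
data['Weekday'] = data['Date'].dt.day_name()

# Average data used per weekday highlights peak usage days
peak_usage = data.groupby('Weekday')['Total Data Used'].mean().sort_values(ascending=False)
print(peak_usage)

# Number of users and average consumption per plan hints at revenue opportunities
plan_summary = data.groupby('Data Plan Type').agg(
    users=('User ID', 'nunique'),
    avg_usage=('Total Data Used', 'mean'),
)
print(plan_summary)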
5. Visualization:
Create visualizations to summarize insights, such as the distribution of data
usage across users, the most common data usage times, and the relationship
between usage and plan type.
Example of visualization in Python:
import matplotlib.pyplot as plt
import seaborn as sns
# Plotting the distribution of data usage
plt.figure(figsize=(10,6))
sns.histplot(data['Total Data Used'], bins=20, kde=True)
plt.title('Distribution of Data Usage')
plt.xlabel('Data Used (GB)')
plt.ylabel('Frequency')
plt.show()

5. Experiment Code
• Python Code for Business Insights Analysis:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load user usage data
data = pd.read_csv("user_usage.csv")

# Data cleaning: converting date and handling missing values
data['Date'] = pd.to_datetime(data['Date'])
data['Total Data Used'].fillna(data['Total Data Used'].mean(), inplace=True)

# Segmenting users by data usage
bins = [0, 1, 5, 100]
labels = ['Low User', 'Medium User', 'Heavy User']
data['User Category'] = pd.cut(data['Total Data Used'], bins=bins, labels=labels)

# Plotting the distribution of data usage
plt.figure(figsize=(10,6))
sns.histplot(data['Total Data Used'], bins=20, kde=True)
plt.title('Distribution of Data Usage')
plt.xlabel('Data Used (GB)')
plt.ylabel('Frequency')
plt.show()

# Analyzing the relationship between user category and total data used
plt.figure(figsize=(10,6))
sns.boxplot(x='User Category', y='Total Data Used', data=data)
plt.title('Data Usage by User Category')
plt.show()

6. Execution
• Load the user data into a Python script and clean the data as required.
• Run the analysis to segment users and find patterns in data usage.
• Visualize the data using plots to identify trends, customer behaviors, and
opportunities.
7. Observations
• You may observe peak usage times for specific user categories (e.g., heavy users
may have higher data consumption on weekends).
• Popular services (such as streaming or browsing) may become apparent.
• You might notice some plans are more commonly associated with higher data
usage.
8. Analysis
• From the analysis, you can understand customer behavior based on data
consumption.
• Insights into high-usage patterns can guide marketing efforts, product
development
(new plans), and customer retention strategies.

9. Conclusion
• This experiment demonstrates how business insights can be derived from user
data card usage records.
• By analyzing data usage patterns, businesses can improve customer
segmentation, optimize revenue strategies, and enhance customer satisfaction
through tailored plans and services.
10. Viva Questions
1. What is the importance of segmenting users based on data usage?
2. How can business insights from data card usage be used to improve customer
retention?
3. What are the benefits of visualizing data usage trends?
4. Explain how customer behavior can be used to optimize data plans.
5. How would you apply this analysis to a real-world business scenario?
11. Multiple Choice Questions (MCQs)
1. What is the first step in analyzing user usage records?
a) Segmenting users
b) Data cleaning
c) Data visualization
d) Customer retention
Answer: b) Data cleaning
2. Which of the following is a possible outcome of analyzing data card usage
records?
a) Identifying peak usage times
b) Increasing data plan prices for all users
c) Ignoring customer feedback
d) Limiting user access to data
Answer: a) Identifying peak usage times
3. How can visualizations help in understanding data usage?
a) By removing irrelevant data
b) By identifying trends and patterns
c) By simplifying data collection
d) By changing the pricing model
Answer: b) By identifying trends and patterns

12. References
• Pandas Documentation: https://pandas.pydata.org/pandas-docs/stable/
• Seaborn Documentation: https://seaborn.pydata.org/
• Matplotlib Documentation: https://matplotlib.org/

Experiment No.: 14
1. Aim: Wiki page ranking with Hadoop
2. Theory
PageRank Algorithm:
• PageRank is a link analysis algorithm used by Google to rank web pages in their
search engine results.
• The algorithm works by counting the number of inbound links to a page and
assigning a rank based on the importance of the linking pages. The higher the
rank of a linking page, the higher the rank transferred to the linked page.
Hadoop MapReduce:
• MapReduce is a programming model used to process and generate large datasets
that can be split into independent tasks. It is used in the Hadoop ecosystem to
handle large-scale data processing in a distributed environment.
The core components of Hadoop MapReduce:
• Mapper: The mapper reads data, processes it, and outputs intermediate results
(key-value pairs).
• Reducer: The reducer aggregates the results and outputs the final computation.
Wiki Page Ranking:
• For Wikipedia page ranking, the data would typically consist of page-to-page
links (a directed graph). Each page will have links to other pages, and the rank of
each page will be computed iteratively based on the number of incoming links
from other pages.
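Before distributing the computation with MapReduce, the update rule can be illustrated with a small in-memory sketch. The example below uses the same three-page graph that appears later in the procedure and assumes a damping factor of 0.85, matching the reducer code:

# Minimal, in-memory illustration of the iterative PageRank update
links = {"PageA": ["PageB", "PageC"], "PageB": ["PageC"], "PageC": ["PageA"]}
ranks = {page: 1.0 for page in links}
d = 0.85  # damping factor

for _ in range(20):  # iterate until the ranks roughly stabilise
    new_ranks = {page: (1 - d) for page in links}
    for page, outlinks in links.items():
        share = ranks[page] / len(outlinks)   # each page splits its rank among its out-links
        for target in outlinks:
            new_ranks[target] += d * share
    ranks = new_ranks

print(ranks)

After a few iterations PageC, which receives links from both other pages, ends up with the highest rank.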
3. Requirements
Software:
• Hadoop (with HDFS and MapReduce)
• Java (for implementing MapReduce)
• Python (optional for any additional scripting/processing)
• Linux/Mac or Windows with WSL (for setting up Hadoop)
Data:
• A sample Wikipedia dataset containing page links. The dataset could be in the
form of a text file where each line contains a page and its outbound links (e.g.,
PageA -> PageB, PageC).

4. Procedure
1. Set Up Hadoop:
o Install Hadoop and set up a single-node or multi-node cluster.
o Ensure that HDFS (Hadoop Distributed File System) is running, and the
necessary directories are created to store the input and output data.
2. Prepare Input Data:
o The dataset should contain the structure of Wikipedia pages, where each
page links to other pages. This data can be represented in the following
format:
PageA -> PageB, PageC
PageB -> PageC
PageC -> PageA
3. Implement Mapper Function:
The mapper function reads the page links and emits the current page and the
linked pages as key-value pairs. Each linked page is given an initial rank.
Example of a Mapper in Java:
public class PageRankMapper extends Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split("->");
        String page = parts[0].trim();
        String[] linkedPages = parts[1].split(",");

        // Emit the current page with its outbound links
        context.write(new Text(page), new Text("links:" + String.join(",", linkedPages)));

        // Emit each linked page with an initial rank contribution
        double initialRank = 1.0;  // Initial rank for each linked page
        for (String linkedPage : linkedPages) {
            context.write(new Text(linkedPage.trim()), new Text("rank:" + initialRank));
        }
    }
}
4. Implement Reducer Function:
The reducer function will take all incoming data for each page, sum up the ranks
of the linked pages, and compute the new rank based on the PageRank formula.
Example of a Reducer in Java:


public class PageRankReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    private final static DoubleWritable result = new DoubleWritable();
    private static final double DAMPING_FACTOR = 0.85;
    private static final double INITIAL_RANK = 1.0;

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double newRank = 0.0;
        List<String> linkedPages = new ArrayList<>();

        // Process the input values for the page
        for (Text val : values) {
            String value = val.toString();
            if (value.startsWith("links:")) {
                // Collect the links
                String[] pages = value.substring(6).split(",");
                linkedPages.addAll(Arrays.asList(pages));
            } else if (value.startsWith("rank:")) {
                // Sum the ranks for each link
                newRank += Double.parseDouble(value.substring(5));
            }
        }

        // Calculate the new rank based on the PageRank formula
        newRank = (1 - DAMPING_FACTOR) + DAMPING_FACTOR * newRank / linkedPages.size();

        // Emit the page and its new rank
        result.set(newRank);
        context.write(key, result);
    }
}
5. Running the Job on Hadoop:


• Compile the Java program into a JAR file.
• Run the MapReduce job on Hadoop to compute the PageRank for all pages.
Example command to run the job:
hadoop jar pagerank.jar PageRankDriver input_dir output_dir
6. Iterative Computation:
PageRank is calculated iteratively, so the MapReduce job should run multiple
times until the ranks converge (i.e., the changes in rank are below a threshold).
This can be done by running the MapReduce job several times, each time using
the output of the previous run as the input.
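Because each pass depends on the previous pass's output, the repetition can be scripted. The sketch below is one simple way to do this from Python; the jar name, driver class, HDFS paths, and fixed iteration count are assumptions, and a production driver would instead check for convergence between passes:

import subprocess

NUM_ITERATIONS = 10                      # assumed fixed number of passes
input_dir = "/pagerank/iter0"            # assumed location of the initial graph data

for i in range(1, NUM_ITERATIONS + 1):
    output_dir = f"/pagerank/iter{i}"
    # Each pass re-runs the same MapReduce job with the previous output as input
    subprocess.run(
        ["hadoop", "jar", "pagerank.jar", "PageRankDriver", input_dir, output_dir],
        check=True,
    )
    input_dir = output_dir               # feed this pass's output into the next pass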
5. Experiment Code
Java Code Example for PageRank Implementation:
• Mapper:
public class PageRankMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Mapper implementation as shown above
}
• Reducer:
public class PageRankReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    // Reducer implementation as shown above
}
• Driver Class (to configure and run the job):
public class PageRankDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "PageRank");

        job.setJarByClass(PageRankDriver.class);
        job.setMapperClass(PageRankMapper.class);
        job.setReducerClass(PageRankReducer.class);

        // Map output types differ from the final output types, so set them explicitly
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

6. Execution
• Compile the Java files into a JAR file.
• Upload the input data (Wikipedia links dataset) to HDFS:
hadoop fs -put wiki_data.txt /input
• Run the PageRank MapReduce job:
hadoop jar pagerank.jar PageRankDriver /input /output
• Check the output directory to get the ranked pages.
7. Observations
• The output will show each page and its computed PageRank value.
• Observe how the ranks change with each iteration, and how the pages with more
inbound links gradually rise in rank.
8. Analysis
• After a few iterations, the ranks should stabilize. Pages with the most links from
important pages will be ranked higher.
• This experiment helps demonstrate how Hadoop can be used to implement large-
scale algorithms like PageRank.
9. Conclusion
• This experiment successfully demonstrates how to compute the Wiki PageRank
using Hadoop's MapReduce framework.
• By iteratively computing PageRank, we can rank Wikipedia pages based on their
importance, which can then be used to improve search engine results or content
recommendations.

10. Viva Questions


1. What is the PageRank algorithm, and how does it work?
2. Why is the damping factor used in the PageRank algorithm?
3. What is the significance of using Hadoop MapReduce for large-scale data
processing?
4. How can you optimize the PageRank algorithm for faster convergence?
5. What challenges might you face when running PageRank on a very large dataset?

11. Multiple Choice Questions (MCQs)


1. What is the primary purpose of the PageRank algorithm?
a) To sort web pages by keywords
b) To rank web pages based on their importance
c) To classify web pages into categories
d) To retrieve web pages based on user queries
Answer: b) To rank web pages based on their importance
2. Which Hadoop component is used for distributing large datasets across
nodes? a) HDFS
b) MapReduce
c) YARN
d) Pig
Answer: a) HDFS
3. What is the main advantage of using MapReduce in PageRank computation?
a) It reduces the overall complexity of the algorithm
b) It allows for distributed processing of large datasets
c) It simplifies the implementation of the algorithm
d) None of the above
Answer: b) It allows for distributed processing of large datasets

12. References
• Hadoop Documentation: https://hadoop.apache.org/docs/stable/
• MapReduce Programming Guide: https://hadoop.apache.org/docs/stable/hadoopmapreduce-client-core/mapreduce-programming-guide.html

Experiment No.: 15
1. Aim: Health care Data Management using Apache Hadoop ecosystem
2. Theory
Apache Hadoop Ecosystem:
Apache Hadoop is a framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is designed to
scale up from a single server to thousands of machines, each offering local computation
and storage.
Key components of the Hadoop ecosystem include:
• HDFS (Hadoop Distributed File System): A distributed file system designed to
store vast amounts of data across a cluster of computers.
• MapReduce: A programming model for processing large data sets in a parallel,
distributed manner.
• YARN (Yet Another Resource Negotiator): Manages resources in a Hadoop cluster.
• Hive: A data warehouse infrastructure built on top of Hadoop for providing data
summarization, querying, and analysis.
• Pig: A high-level platform for creating MapReduce programs used with Hadoop.
• HBase: A NoSQL database that provides real-time read/write access to large datasets.
• Sqoop: A tool for transferring bulk data between Hadoop and relational databases.
• Flume: A service for collecting, aggregating, and moving large amounts of log data.
• Oozie: A workflow scheduler for managing Hadoop jobs.
• Healthcare Data: Healthcare data includes a wide variety of information such as
patient records, treatment histories, diagnostic data, lab results, medical images,
prescriptions, and more. These datasets are typically:
 Structured: Tables containing patient demographics, billing information,
lab results, etc.
 Unstructured: Medical records in free-text form, images, reports, etc.
 Semi-structured: Data formats like XML or JSON that contain structured
information but do not fit neatly into tables.
Challenges in Healthcare Data Management:
• Volume: Healthcare data is growing rapidly, especially with the rise of electronic
health records (EHRs) and medical imaging.
• Variety: Healthcare data comes in various formats (structured, semi-structured, and
unstructured).
• Velocity: Real-time processing of healthcare data, such as monitoring patient vitals
or emergency response systems, is critical.
• Veracity: Healthcare data must be accurate and trustworthy, making its quality
management a significant concern.
3. Requirements
Software:
• Apache Hadoop (including HDFS, MapReduce, YARN)
• Hive (for SQL-like querying)
• Pig (optional for high-level data processing)


• HBase (for storing large datasets)
• Sqoop (optional for importing data from relational databases)
• Oozie (optional for job scheduling)
Data:
• Sample healthcare datasets (e.g., patient records, hospital data, medical images, etc.).
These can be downloaded from public healthcare repositories or simulated for the
experiment.
4. Procedure
1. Set Up Hadoop Cluster:
• Install and configure Apache Hadoop (single-node or multi-node cluster).
• Set up HDFS for distributed storage of healthcare data.
2. Data Ingestion:
Healthcare data can be ingested into HDFS using Sqoop (for importing from
relational databases), Flume (for streaming data), or directly uploading from local
systems.
Example of using Sqoop to import data from a relational database (e.g., MySQL):
sqoop import --connect jdbc:mysql://localhost/healthcare --table patient_records --username user --password pass --target-dir /healthcare_data/patient_records
3. Data Storage in HDFS:
• Upload raw healthcare data files (CSV, JSON, XML) to HDFS.
• Use HDFS to store and manage these datasets across the Hadoop cluster.
Example command to upload data:
hadoop fs -put patient_records.csv /healthcare_data/patient_records
4. Data Processing with MapReduce:
MapReduce jobs can be written to process healthcare data, such as filtering
patient records, analyzing medical histories, or extracting features from
unstructured medical texts.
Example use case: Filtering patients who have a particular disease.
• Implement a Mapper function to read patient records and output disease-
related information.
• Implement a Reducer function to aggregate and filter data based on disease
types.
5. Data Querying with Hive:
• Hive can be used to run SQL-like queries on the healthcare data stored in
HDFS.
This is especially useful for structured data, such as patient records.
Example of creating a table and running a query in Hive:
CREATE TABLE patient_records (patient_id INT, name STRING, age INT, disease STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/healthcare_data/patient_records' INTO TABLE patient_records;
SELECT * FROM patient_records WHERE disease = 'Diabetes';
6. Data Storage in HBase:
• Use HBase for real-time data storage, particularly for time-series data like
patient vitals or continuous monitoring systems. Example command to insert
data into HBase:
hbase shell
create 'patient_vitals', 'patient_id', 'vitals'
put 'patient_vitals', 'row1', 'patient_id:id', '1'
put 'patient_vitals', 'row1', 'vitals:heart_rate', '80'
7. Data Processing with Pig (optional):
• Use Pig for high-level data processing. Pig scripts are an alternative to
MapReduce and allow you to process data using a simpler language than Java.

Example of a Pig script:


patients = LOAD '/healthcare_data/patient_records' USING PigStorage(',')
    AS (patient_id:int, name:chararray, age:int, disease:chararray);
diabetes_patients = FILTER patients BY disease == 'Diabetes';
DUMP diabetes_patients;
8. Data Analysis:
Perform healthcare analytics by applying various algorithms or queries. For example,
you could perform trend analysis on patient vitals, clustering of patients with similar
diseases, or even predictive analytics on patient outcomes.
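As one illustration of this step, the PySpark sketch below computes disease frequency and average patient age directly over the records stored in HDFS. The HDFS path and column layout follow the earlier Hive example and are assumptions about the dataset:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("HealthcareTrends").getOrCreate()

# Read the CSV records uploaded to HDFS earlier (path and schema are assumptions)
patients = spark.read.csv(
    "hdfs://localhost:9000/healthcare_data/patient_records",
    schema="patient_id INT, name STRING, age INT, disease STRING",
)

# Disease frequency and average patient age per disease
summary = (
    patients.groupBy("disease")
    .agg(F.count(F.lit(1)).alias("patient_count"), F.avg("age").alias("avg_age"))
    .orderBy(F.desc("patient_count"))
)
summary.show()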
9. Visualization (optional):
Integrate Hadoop with data visualization tools (like Tableau, QlikView, or Power
BI) to present the processed data visually for easier decision-making and insights.

5. Experiment Code

Sample MapReduce Code for Healthcare Data:


• Mapper (Java):
public class PatientMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        String disease = fields[3];  // Assuming disease is in the 4th column
        context.write(new Text(disease), new IntWritable(1));
    }
}
• Reducer (Java):
public class PatientReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable val : values) {
            count += val.get();
        }
        result.set(count);
        context.write(key, result);
    }
}
6. Execution
• Compile the MapReduce job and run it on your Hadoop cluster.
• Upload the data to HDFS, run MapReduce jobs, and store results in HDFS or
HBase for real-time access.
Example command to run the job:
hadoop jar healthcare_job.jar PatientMapper PatientReducer /healthcare_data/input /healthcare_data/output
7. Observations
• The MapReduce job should process the healthcare data and output disease-
specific statistics, such as the number of patients with each disease.

• Hive queries will provide insights into patient demographics, diseases, and
treatment patterns.
• HBase will allow real-time access to patient vital data.
8. Analysis
• Healthcare data processing on Hadoop allows for scalable, distributed processing
of massive datasets.
• By using HDFS for storage, MapReduce for processing, and Hive for querying,
healthcare institutions can gain valuable insights from large-scale patient data.
• The system is capable of handling structured, semi-structured, and unstructured
healthcare data efficiently.
9. Conclusion
• This experiment demonstrates the power of the Hadoop ecosystem in managing
and processing large healthcare datasets. The combination of HDFS,
MapReduce, Hive, and HBase allows healthcare providers to store, analyze, and
retrieve data efficiently.
• This approach can lead to improved healthcare outcomes through faster data
processing, enhanced analytics, and better decision-making.
10. Viva Questions
1. What are the key components of the Hadoop Ecosystem, and how do they
contribute to healthcare data management?
2. How does HBase differ from HDFS, and when would you use it in healthcare
data management?
3. Explain the role of Hive in querying healthcare data stored in HDFS.
4. What are the advantages of using MapReduce for healthcare data analysis?
5. How would you process and analyze real-time healthcare data streams using
Flume and Hadoop?
11. Multiple Choice Questions (MCQs)
1. What is the primary purpose of using the Apache Hadoop Ecosystem in
healthcare data management?
a) To enhance the quality of medical images
b) To manage and process large-scale healthcare data efficiently
c) To monitor patient vitals in real-time
d) To replace traditional healthcare management systems
Answer: b) To manage and process large-scale healthcare data efficiently

2. Which component of the Hadoop Ecosystem is primarily responsible for storing large datasets in a distributed manner?
a) YARN
b) MapReduce
c) HDFS
d) Hive
Answer: c) HDFS
3. Which Hadoop component allows healthcare data to be queried using SQL-like syntax?
a) MapReduce
b) Hive
c) HBase
d) Pig
Answer: b) Hive
4. Which of the following tools in the Hadoop ecosystem is used to transfer bulk data between Hadoop and relational databases?
a) Sqoop
b) Flume
c) HBase
d) Pig
Answer: a) Sqoop
5. In healthcare data management, which Hadoop tool is useful for processing real-time data streams, such as patient monitoring data?
a) Sqoop
b) Flume
c) MapReduce
d) HBase
Answer: b) Flume

12. References
• Healthcare Big Data Management and Analytics:
https://www.springer.com/gp/book/9783030323897
• Cloudera Healthcare Solutions:
https://www.cloudera.com/solutions/healthcare.html