Lab Manual Big Data
1. To draw and explain Hadoop architecture and ecosystem with the help of a case
study.
2. Perform setting up and installing single node Hadoop in a Windows
environment.
3. To implement the following file management tasks in Hadoop System (HDFS):
Adding files and directories, retrieving files, Deleting files
4. Create a database ‘STD’ and make a collection (e.g. "student" with fields 'No.,
Stu_Name, Enroll., Branch, Contact, e-mail, Score') using MongoDB. Perform
various operations in the following experiments.
5. Insert multiple records (at least 10) into the created student collection.
6. Execute the following queries on the collection created.
a. Display data in proper format.
b. Update the contact information of a specific student.
c. Add a new field remark to the document with the name 'REM'.
d. Add a new record (no 11, stu_name XYZ, Enroll 00101, branch VB, e-mail
[email protected], Contact 098675345) without using an insert statement.
7. Create an employee table in MongoDB with 4 departments and 25 employees
equally divided along with one manager. The following fields should be added:
Employee_ID, Dept_ID, First_Name, Last_Name, Salary (range between 20K-60K).
Now run the following queries:
a. Find all the employees of a particular department where salary lies < 40K.
b. Find the highest salary for each department and fetch the name of such
employees.
c. Find all the employees who are on a lesser salary than 30k; increase their salary
by 10% and display the results.
8. To design and implement a social network graph of 50 nodes and edges
between nodes using networkx library in Python.
9. Design and plot an asymmetric social network (socio graph) of 5 nodes (A,
B, C, D, and E) such that A is directed to B, B is directed to D, D is directed to A,
and D is directed to C.
10. Consider the above scenario (No. 09) and plot a weighted asymmetric graph,
the weight range is between 20 to 50.
11. Implement betweenness measure between nodes across the social network.
(Assume the social network of 10 nodes)
1. To draw and explain Hadoop architecture and ecosystem with the help of a case
study.
Creating a visual representation of Hadoop architecture and explaining the Hadoop ecosystem
can be quite complex, but I'll provide a simplified textual explanation of Hadoop's architecture
and ecosystem, followed by a hypothetical case study.
Hadoop Architecture:
Hadoop is designed to process and store large volumes of data in a distributed and fault-tolerant
manner. Its core components include:
1. HDFS (Hadoop Distributed File System): HDFS is the storage component of Hadoop. It
divides data into blocks (typically 128MB or 256MB each) and stores multiple copies of
these blocks across different nodes in a cluster for redundancy. HDFS ensures data
reliability and fault tolerance.
2. YARN (Yet Another Resource Negotiator): YARN is Hadoop's resource management and
job scheduling system. It manages resources and schedules tasks for data processing.
3. MapReduce: MapReduce is Hadoop's batch processing framework. It processes data in
parallel across the cluster in two phases, map and reduce.
4. Hadoop Common: This contains the shared utilities, libraries, and APIs needed by the
other Hadoop modules.
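For illustration, the block size and replication factor described above are set in HDFS's
configuration file hdfs-site.xml; a fragment showing the common defaults (128 MB blocks,
three replicas) looks like this:
<configuration>
<property>
<name>dfs.blocksize</name>
<value>134217728</value> <!-- 128 MB -->
</property>
<property>
<name>dfs.replication</name>
<value>3</value> <!-- three copies of each block -->
</property>
</configuration>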
Hadoop Ecosystem:
Hadoop's ecosystem consists of various tools and frameworks that extend Hadoop's capabilities,
making it suitable for a wide range of data processing tasks. Some key components of the
Hadoop ecosystem include:
1. Hive: Hive is a data warehousing and SQL-like query language for Hadoop. It allows
users to query and analyze data using HiveQL.
2. Pig: Pig is a platform for analyzing large data sets. It provides a high-level scripting
language, Pig Latin, for data analysis.
3. HBase: HBase is a NoSQL database that provides real-time read and write access to large
datasets. It's suitable for applications requiring low-latency data access.
4. Sqoop: Sqoop is used for transferring data between Hadoop and relational databases. It
simplifies the data import/export process.
5. Flume: Flume is a service for collecting, aggregating, and moving large amounts of log
data to HDFS. It's commonly used for ingesting data from various sources.
6. Oozie: Oozie is a workflow scheduler for Hadoop jobs. It allows you to define, schedule,
and manage data workflows in Hadoop.
7. ZooKeeper: ZooKeeper is a distributed coordination service used for maintaining
configuration information, naming, providing distributed synchronization, and providing
group services.
Case Study: XYZ Retail. XYZ Retail is a hypothetical retailer that struggled to store and
analyze fast-growing data from its web logs, point-of-sale systems, and social media channels.
Hadoop Solution: XYZ Retail decided to implement Hadoop as a solution to these data
challenges. Here's how they used Hadoop:
1. Data Ingestion: They used Apache Flume to collect data from various sources, including
web logs, point-of-sale systems, and social media.
2. Data Storage: Data was stored in HDFS, which provided a reliable and scalable storage
solution.
3. Data Processing: They used Hadoop MapReduce and Hive to process and analyze the
data. MapReduce helped extract and transform data, while Hive allowed analysts to run
SQL-like queries.
4. Real-time Data: To handle real-time data, they used HBase, which provided low-latency
access to data.
5. Data Integration: Sqoop was used to move data between Hadoop and their existing
relational databases.
6. Workflow Automation: Oozie was employed to schedule and manage data workflows,
ensuring that jobs ran at the right time and in the correct sequence.
7. Data Visualization: For data visualization and reporting, they integrated Hadoop with a
BI tool like Tableau or Power BI.
Results: XYZ Retail was able to gain valuable insights into customer behavior, optimize
inventory management, and make data-driven decisions. Their data processing became more
efficient, and they were able to handle both batch and real-time data effectively.
This hypothetical case study illustrates how a company like XYZ Retail can leverage the Hadoop
ecosystem to address data challenges and drive business improvements.
2. Perform setting up and installing single node Hadoop in a Windows
environment.
Prerequisites:
Before setting up and installing single node Hadoop in a Windows environment, ensure
you have the following prerequisites:
Java 8: Download and install the Java 8 Development Kit (JDK) from
https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html.
7-Zip: Download and install 7-Zip, a file archiver, from https://www.7-zip.org/a/.
Steps:
1. Download Hadoop: Download the latest stable version of Hadoop from the Apache Hadoop
website: https://hadoop.apache.org/releases.html.
2. Extract Hadoop: Extract the downloaded Hadoop archive to a suitable location, for example,
C:\hadoop.
3. Configure Environment Variables:
I. Open System Properties by searching for it in the Start menu.
II. On the Advanced tab, click Environment Variables.
III. Create the system variables JAVA_HOME (pointing to the JDK directory, e.g.
C:\Java\jdk1.8.0_261) and HADOOP_HOME (pointing to C:\hadoop).
IV. Under System variables, select the Path variable and click Edit.
V. Add the following paths to the Variable value field, separated by semicolons:
C:\hadoop\bin
C:\Java\jdk1.8.0_261\bin
VI. Click OK to save the changes.
4. Configure Hadoop Configuration Files:
1. Open the following configuration files in a text editor:
C:\hadoop\etc\hadoop\core-site.xml
C:\hadoop\etc\hadoop\hdfs-site.xml
C:\hadoop\etc\hadoop\mapred-site.xml
C:\hadoop\etc\hadoop\yarn-site.xml
2. Edit the configuration properties as needed. For a single-node cluster, a minimal setup
points the default filesystem at localhost and sets the replication factor to 1, as sketched
below.
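A minimal sketch of core-site.xml and hdfs-site.xml for a single-node cluster
(hdfs://localhost:9000 is the conventional default filesystem address; adjust if yours differs):
<!-- core-site.xml: point the default filesystem at the local NameNode -->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
<!-- hdfs-site.xml: a single node can only hold one replica of each block -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>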
5. Format the NameNode:
1. Open a command prompt and navigate to the Hadoop bin directory:
cd C:\hadoop\bin
2. Execute the following command to format the NameNode:
hdfs namenode -format
6. Start the Hadoop Cluster:
1. Execute the following commands from the C:\hadoop\sbin directory to start HDFS and
YARN:
start-dfs.cmd
start-yarn.cmd
7. Verify the Hadoop Cluster:
1. Execute the following command to verify the status of the Hadoop cluster:
jps
2. You should see processes such as the following running:
NameNode
DataNode
ResourceManager
NodeManager
Additional Notes:
To stop the Hadoop cluster, execute the following commands from C:\hadoop\sbin:
stop-dfs.cmd
stop-yarn.cmd
To view Hadoop logs, navigate to the Hadoop logs directory:
C:\hadoop\logs
3. To implement the following file management tasks in Hadoop System (HDFS):
Adding files and directories, retrieving files, Deleting files
To create a directory in HDFS, use the hadoop fs -mkdir command. For example, to create the
directory /user/hadoop/data, use the following command:
hadoop fs -mkdir -p /user/hadoop/data
To add a file to HDFS, use the hadoop fs -put command. For example, to add the file myfile.txt
to the directory /user/hadoop/data, use the following command:
hadoop fs -put myfile.txt /user/hadoop/data
To retrieve a file from HDFS, use the hadoop fs -get command. For example, to retrieve the
file /user/hadoop/data/myfile.txt to the local filesystem, use the following command:
hadoop fs -get /user/hadoop/data/myfile.txt
To delete a file from HDFS, use the hadoop fs -rm command. For example, to delete the file
/user/hadoop/data/myfile.txt, use the following command:
hadoop fs -rm /user/hadoop/data/myfile.txt
4. Create a database ‘STD’ and make a collection (e.g. "student" with fields 'No.,
Stu_Name, Enroll., Branch, Contact, e-mail, Score') using MongoDB. Perform
various operations in the following experiments.
To create a database named "STD" and a collection named "student" with the specified fields in
MongoDB, you can follow these steps. MongoDB is a NoSQL database, and you can interact
with it using a MongoDB client or command-line tools. Below, I'll provide instructions for
creating the database and collection using the MongoDB shell, which is a command-line
interface for MongoDB.
Here are the steps to create the database 'STD', create the collection 'student' with the fields
'No., Stu_Name, Enroll., Branch, Contact, e-mail, Score', and insert a sample document:
use STD
db.createCollection("student")
db.student.insertOne({
"No": 1,
"Stu_Name": “rahul shrivastava",
"Enroll": "0827CS201194",
"Branch": "Computer Science",
"Contact": "1234567890",
"e-mail": "[email protected]",
"Score": 95
})
5. Insert multiple records (at least 10) into the created student collection.
db.student.insertMany([
{
"No": 1,
"Stu_Name": “rahul shrivastava",
"Enroll": "0827CS201194",
"Branch": "Computer Science",
"Contact": "1234567890",
"e-mail": "[email protected]",
"Score": 95
},
{
"No": 2,
"Stu_Name": "Akshay Keswani",
"Enroll": "0827CS201022",
"e-mail":
"[email protected]",
"Score": 88
},
{
"No": 3,
"Stu_Name": "Alokit Sharma",
"Enroll": "0827CS201023",
"Branch": "Computer Science",
"Contact": "1234567891",
"e-mail": "[email protected]",
"Score": 75
},
{
"No": 4,
"Stu_Name": "Aditya Sharma",
"Enroll": "0827CS201016",
"Branch": "Computer Science",
"Contact": "1234567891",
"e-mail":
"[email protected]",
pg. 10
"Score": 92 },
{
"No": 5,
"Stu_Name": "Akshat Singh Gour",
"Enroll": "0827CS201020",
"Branch": "Computer Science",
"Contact": "1234567891",
"e-mail": "[email protected]",
"Score": 80
},
{
"No": 6,
"Stu_Name": "Aayush Gupta",
"Enroll": "0827CS201006",
"e-mail": "[email protected]",
"Score": 87
},
{
"No": 7,
"Stu_Name": "Amit Kumar Yadav",
"Enroll": "0827CS201031",
"Branch": "Computer Science",
"Contact": "1234567891",
"e-mail": "[email protected]",
"Score": 78
},
{
"No": 8,
"Stu_Name": "Aryan Tapkire",
"Enroll": "0827CS201044",
"Branch": "Computer Science",
"Contact": "1234567891",
"e-mail": "[email protected]",
"Score": 91
},
{
"No": 9,
"Stu_Name": "Devesh Sharma",
"Enroll": "0827CS201068",
"Branch": "Computer Science",
"Contact": "1234567891",
"e-mail": "[email protected]",
"Score": 85
},
{
"No": 10,
"Stu_Name": "Asit Joshi",
"Enroll": "0827CS201042",
"Contact": "1234567891",
"e-mail": "[email protected]",
"Score": 89
}
])
6. Execute the following queries on the collection created.
a. Display data in proper format.
b. Update the contact information of a specific student.
c. Add a new field remark to the document with the name 'REM'.
d. Add a new record (no 11, stu_name XYZ, Enroll 00101, branch VB, e-mail
[email protected], Contact 098675345) without using an insert statement.
a. Display data in proper format:
db.student.find().pretty()
This query will retrieve all documents from the 'student' collection and display them in a
formatted manner.
b. Update the contact information of a specific student:
db.student.updateOne(
{ "Stu_Name": "Rahul Shrivastava" },
{ $set: { "Contact": "9999999999" } }
)
This query will update the contact information for the student named 'Rahul Shrivastava', who
exists in the collection. It uses the $set operator to change the value of the 'Contact' field to
'9999999999'.
c. Add a new field remark to the document with the name 'REM':
db.student.updateOne(
{ "Stu_Name": "REM" },
{$set: { "remark": "This is a remark" }}
)
This query will add a new field named 'remark' to the document for the student named 'REM'. It
uses the $set operator to set the value of the 'remark' field to 'This is a remark'.
d. Add a new record with no 11, stu_name XYZ, Enroll 00101, branch VB, e-mail xyz@xyz,
Contact 098675345 without using an insert statement:
db.student.updateOne(
{ "No": 11 },
{ $set: {
"Stu_Name": "XYZ",
"Enroll": "00101",
"Branch": "VB",
"e-mail": "xyz@xyz",
"Contact": "098675345"
}},
{ upsert: true }
)
This query adds the new student without an insert statement by using updateOne with the
upsert option: because no document with "No": 11 exists, MongoDB creates one and applies
the fields given in $set. (The $push operator would not work here, since it only appends
values to arrays.)
7. Create an employee table in MongoDB with 4 departments and 25 employees
equally divided along with one manager. The following fields should be added:
Employee_ID, Dept_ID, First_Name, Last_Name, Salary (range between
20K-60K). Now run the following queries:
a. Find all the employees of a particular department where salary lies < 40K.
b. Find the highest salary for each department and fetch the name of such
employees.
c. Find all the employees who are on a lesser salary than 30k; increase their salary
by 10% and display the results.
db.employee.insertMany([
{ Employee_ID: 1, Dept_ID: 1, First_Name: "John", Last_Name: "Doe", Salary: 50000 },
{ Employee_ID: 2, Dept_ID: 1, First_Name: "Jane", Last_Name: "Smith", Salary: 48000 },
{ Employee_ID: 3, Dept_ID: 1, First_Name: "Bob", Last_Name: "Johnson", Salary: 35000 },
// ... employees 4 through 25, divided equally across Dept_ID 1 to 4, follow the same pattern ...
{ Employee_ID: 26, Dept_ID: 4, First_Name: "Manager", Last_Name: "Smith", Salary: 60000 }
])
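Typing all 26 documents by hand is tedious. As a sketch for the mongo/mongosh shell (the
names and the random salary spread are illustrative placeholders), the collection can instead be
generated with a loop:
// Generate employees 1-25, spread across departments 1-4,
// with random salaries between 20,000 and 60,000
var docs = [];
for (var i = 1; i <= 25; i++) {
docs.push({
Employee_ID: i,
Dept_ID: ((i - 1) % 4) + 1,
First_Name: "First" + i,
Last_Name: "Last" + i,
Salary: 20000 + Math.floor(Math.random() * 40001)
});
}
// Add one manager as the 26th document
docs.push({ Employee_ID: 26, Dept_ID: 4, First_Name: "Manager", Last_Name: "Smith", Salary: 60000 });
db.employee.insertMany(docs);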
2. Run the Specified Queries:
a. Find all employees of a particular department where salary is less than 40K:
For example, to find employees in Department 1 with a salary less than 40K:
db.employee.find({ Dept_ID: 1, Salary: { $lt: 40000 } })
b. Find the highest salary for each department and fetch the names of such employees:
db.employee.aggregate([
{ $group: {
_id: "$Dept_ID",
maxSalary: { $max: "$Salary" }
}}
])
To fetch the names of employees with the highest salary in each department, you'll need to use a
more complex aggregation query.
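One way to write that query is sketched below; it sorts by salary first, so $first picks the top
earner in each department (if two employees tie, only one of them is returned):
db.employee.aggregate([
// Sort by salary descending so the highest-paid employee comes first
{ $sort: { Salary: -1 } },
// Group by department, keeping the first (highest) salary and that employee's name
{ $group: {
_id: "$Dept_ID",
maxSalary: { $first: "$Salary" },
First_Name: { $first: "$First_Name" },
Last_Name: { $first: "$Last_Name" }
}}
])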
c. Find all employees who are on a salary less than 30K, increase their salary by 10%, and
display the results:
To find employees with a salary less than 30K and increase their salary by 10%:
db.employee.find({ Salary: { $lt: 30000 } }).forEach(function(employee) {
employee.Salary *= 1.10; // increase the salary by 10%
db.employee.updateOne(
{ _id: employee._id },
{ $set: { Salary: employee.Salary } }
);
printjson(employee); // display the updated document
})
This code finds every employee with a salary below 30K, raises the salary by 10% with
updateOne (the older db.employee.save() helper is deprecated in current shells), and prints
each updated document.
8. To design and implement a social network graph of 50 nodes and edges between
nodes using the networkx library in Python.
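A minimal sketch using the networkx and matplotlib libraries (node styling and the choice of
random edge generation are illustrative):
import random
import networkx as nx
import matplotlib.pyplot as plt

# Build an undirected graph with 50 nodes numbered 0 to 49
G = nx.Graph()
G.add_nodes_from(range(50))

# Add 50 edges between randomly selected pairs of distinct nodes
while G.number_of_edges() < 50:
    u, v = random.sample(range(50), 2)
    G.add_edge(u, v)

# Draw the graph with node labels and display it
nx.draw(G, with_labels=True, node_color="skyblue", node_size=300, font_size=8)
plt.show()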
This code will create a graph with 50 nodes and 50 edges. The nodes will be numbered from 0 to
49, and the edges will be between randomly selected pairs of nodes. The graph will be drawn to
the screen using the NetworkX draw function.
Figure: Social network graph of 50 nodes and edges
As you can see, the code successfully creates a social network graph of 50 nodes and edges. The
graph is drawn to the screen using the NetworkX draw function.
9. Design and plot an asymmetric social network (socio graph) of 5 nodes (A, B,
C, D, and E) such that A is directed to B, B is directed to D, D is directed to A, and
D is directed to C.
Here's the code for the asymmetric social network you described:
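A minimal sketch using networkx and matplotlib (styling choices are illustrative):
import networkx as nx
import matplotlib.pyplot as plt

# Directed graph with 5 nodes; E remains isolated
G = nx.DiGraph()
G.add_nodes_from(["A", "B", "C", "D", "E"])

# Directed edges: A->B, B->D, D->A, D->C
G.add_edges_from([("A", "B"), ("B", "D"), ("D", "A"), ("D", "C")])

# Draw with arrowheads to show edge direction
nx.draw(G, with_labels=True, node_color="lightgreen", node_size=800, arrows=True)
plt.show()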
This code will create a directed graph with 5 nodes and 4 directed edges. The nodes will be
labeled A, B, C, D, and E, and the directed edges will be from A to B, B to D, D to A, and D to C.
The graph will be drawn to the screen using the NetworkX draw function.
In this sociograph:
Node A is connected to B, forming a directed edge from A to B.
Node B is connected to D, forming a directed edge from B to D.
Node D is connected to A and C, forming directed edges from D to A and from D to C.
Node E is not connected to any of the other nodes in the network.
This visual representation should help you understand the asymmetric social network with the
specified connections.
10. Consider the above scenario (No. 09) and plot a weighted asymmetric graph,
the weight range is between 20 to 50.
To create a weighted asymmetric graph with the specified connections and edge weights
between 20 and 50, you can use NetworkX in Python. Here's how you can design and plot the
weighted graph:
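A minimal sketch extending the experiment 9 graph (weights are drawn uniformly at random
from 20 to 50; layout and styling are illustrative):
import random
import networkx as nx
import matplotlib.pyplot as plt

# Directed graph with the edges from experiment 9
G = nx.DiGraph()
for u, v in [("A", "B"), ("B", "D"), ("D", "A"), ("D", "C")]:
    G.add_edge(u, v, weight=random.randint(20, 50))
G.add_node("E")  # E remains isolated

# Draw the graph and label each edge with its weight
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=800, arrows=True)
labels = nx.get_edge_attributes(G, "weight")
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)
plt.show()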
This code creates a directed graph with 5 nodes (A, B, C, D, and E) and the specified directed
edges with random edge weights between 20 and 50. The resulting plot will visualize the
weighted asymmetric social network graph with labeled edge weights.
11. Implement betweenness measure between nodes across the social network.
(Assume the social network of 10 nodes)
To calculate the betweenness centrality measure between nodes in a social network using Python,
you can utilize the NetworkX library. Below is an example that builds a small social network of
10 nodes and calculates the betweenness centrality of each node.
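A minimal sketch (the edge list below is an assumed example network; any connected 10-node
graph works):
import networkx as nx

# Build an example social network of 10 nodes
G = nx.Graph()
G.add_edges_from([
    (1, 2), (1, 3), (2, 3), (3, 4), (4, 5),
    (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (4, 7)
])

# Compute betweenness centrality for every node
centrality = nx.betweenness_centrality(G)
for node, value in sorted(centrality.items()):
    print("Node", node, ":", round(value, 4))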
In this code, we create a small example social network with 10 nodes and edges. The
nx.betweenness_centrality function is used to calculate the betweenness centrality for each node.
The result is printed, showing the betweenness centrality values for each node.
In this output, each node is listed, and its corresponding betweenness centrality value is
displayed. The betweenness centrality values provide information about the importance of each
node in the network regarding the flow of information or interactions. Nodes with higher
betweenness centrality values can be considered more influential in connecting different parts of
the network.