Lab Manual Big Data


List of Experiments:

1. To draw and explain Hadoop architecture and ecosystem with the help of a case
study.
2. Perform setting up and installing single node Hadoop in a Windows
environment.
3. To implement the following file management tasks in Hadoop System (HDFS):
Adding files and directories, retrieving files, Deleting files
4. Create a database ‘STD’ and make a collection (e.g. "student" with fields 'No.,
Stu_Name, Enroll., Branch, Contact, e-mail, Score') using MongoDB. Perform
various operations in the following experiments.
5. Insert multiple records (at least 10) into the created student collection.
6. Execute the following queries on the collection created.
a. Display data in proper format.
b. Update the contact information of a specific student.
c. Add a new field remark to the document with the name 'REM'.
d. Add a new field as no 11, stu_name XYZ, Enroll 00101, branch VB, e-mail
[email protected] Contact 098675345 without using insert statement.
7. Create an employee collection in MongoDB with 4 departments and 25 employees
equally divided, along with one manager. The following fields should be added:
Employee_ID, Dept_ID, First_Name, Last_Name, Salary (range between 20K and
60K). Now run the following queries a. Find all the employees of a particular
department where salary lies < 40K.
b. Find the highest salary for each department and fetch the name of such
employees.
c. Find all the employees who are on a lesser salary than 30k; increase their salary
by 10% and display the results.
8. To design and implement a social network graph of 50 nodes and edges
between nodes using networkx library in Python.
9. Design and plot an asymmetric social network (socio graph) of 5 nodes (A,
B, C, D, and E) such that A is directed to B, B is directed to D, D is directed to A,
and D is directed to C.
10. Consider the above scenario (No. 09) and plot a weighted asymmetric graph,
the weight range is between 20 to 50.

11. Implement betweenness measure between nodes across the social network.
(Assume the social network of 10 nodes)
1. To draw and explain Hadoop architecture and ecosystem with the help of a case
study.

Creating a full visual representation of the Hadoop architecture and ecosystem can be quite
complex, so I'll provide a simplified textual explanation of Hadoop's architecture and ecosystem,
followed by a hypothetical case study.

Hadoop Architecture:
Hadoop is designed to process and store large volumes of data in a distributed and fault-tolerant
manner. Its core components include:

1. HDFS (Hadoop Distributed File System): HDFS is the storage component of Hadoop. It
divides data into blocks (typically 128MB or 256MB each) and stores multiple copies of
these blocks across different nodes in a cluster for redundancy. HDFS ensures data
reliability and fault tolerance.

2. YARN (Yet Another Resource Negotiator): YARN is Hadoop's resource management and
job scheduling system. It manages resources and schedules tasks for data processing.

3. MapReduce: MapReduce is a programming model and processing engine for distributed
data processing. It consists of Map tasks (data processing) and Reduce tasks (aggregation
and summarization).

4. Hadoop Common: This contains the shared utilities, libraries, and APIs needed by the
other Hadoop modules.

Hadoop Ecosystem:
Hadoop's ecosystem consists of various tools and frameworks that extend Hadoop's capabilities,
making it suitable for a wide range of data processing tasks. Some key components of the
Hadoop ecosystem include:

1. Hive: Hive is a data warehousing and SQL-like query language for Hadoop. It allows
users to query and analyze data using HiveQL.
2. Pig: Pig is a platform for analyzing large data sets. It provides a high-level scripting
language, Pig Latin, for data analysis.
3. HBase: HBase is a NoSQL database that provides real-time read and write access to large
datasets. It's suitable for applications requiring low-latency data access.
4. Sqoop: Sqoop is used for transferring data between Hadoop and relational databases. It
simplifies the data import/export process.
5. Flume: Flume is a service for collecting, aggregating, and moving large amounts of log
data to HDFS. It's commonly used for ingesting data from various sources.
6. Oozie: Oozie is a workflow scheduler for Hadoop jobs. It allows you to define, schedule,
and manage data workflows in Hadoop.
7. ZooKeeper: ZooKeeper is a distributed coordination service used for maintaining
configuration information, naming, providing distributed synchronization, and providing
group services.

Hadoop Case Study: XYZ Retail
Challenges: XYZ Retail is a large retail company with a wealth of data from its online and
offline stores. They needed a way to analyze customer behavior, optimize inventory
management, and gain insights into sales trends. The existing systems were struggling to handle
the volume and variety of data.

Hadoop Solution: XYZ Retail decided to implement Hadoop as a solution to their data
challenges. Here's how they used Hadoop:

1. Data Ingestion: They used Apache Flume to collect data from various sources, including
web logs, point-of-sale systems, and social media.

2. Data Storage: Data was stored in HDFS, which provided a reliable and scalable storage
solution.

3. Data Processing: They used Hadoop MapReduce and Hive to process and analyze the
data. MapReduce helped extract and transform data, while Hive allowed analysts to run
SQL-like queries.

4. Real-time Data: To handle real-time data, they used HBase, which provided low-latency
access to data.

5. Data Integration: Sqoop was used to move data between Hadoop and their existing
relational databases.

6. Workflow Automation: Oozie was employed to schedule and manage data workflows,
ensuring that jobs ran at the right time and in the correct sequence.

7. Data Visualization: For data visualization and reporting, they integrated Hadoop with a
BI tool like Tableau or Power BI.

Results: XYZ Retail was able to gain valuable insights into customer behavior, optimize
inventory management, and make data-driven decisions. Their data processing became more
efficient, and they were able to handle both batch and real-time data effectively.
This hypothetical case study illustrates how a company like XYZ Retail can leverage the Hadoop
ecosystem to address data challenges and drive business improvements.

2. Perform setting up and installing single node Hadoop in a Windows
environment.

Prerequisites:
Before setting up and installing single node Hadoop in a Windows environment, ensure
you have the following prerequisites:
 Java 8: Download and install the Java 8 Development Kit (JDK) from
https://www.oracle.com/java/technologies/javase/javase8-archive-downloads.html.
 7-Zip: Download and install 7-Zip, a file archiver, from https://www.7-zip.org/a/.
Steps:
1. Download Hadoop: Download the latest stable version of Hadoop from the Apache Hadoop
website: https://hadoop.apache.org/releases.html.
2. Extract Hadoop: Extract the downloaded Hadoop archive to a suitable location, for example,
C:\hadoop.
3. Configure Environment Variables:
I. Open System Properties by searching for it in the Start menu.

II. Click on Advanced system settings.

III. In the Advanced tab, click the Environment Variables button.

IV. Under System Variables, select Path and click Edit.

V. Add the following paths to the Variable value field, separated by semicolons:
 C:\hadoop\bin
 C:\Java\jdk1.8.0_261\bin
VI. Also create two new system variables: JAVA_HOME pointing to the JDK folder (for
example, C:\Java\jdk1.8.0_261) and HADOOP_HOME pointing to C:\hadoop.
VII. Click OK to save the changes.
4. Configure Hadoop Configuration Files:
1. Open the following configuration files in a text editor:
 C:\hadoop\etc\hadoop\core-site.xml
 C:\hadoop\etc\hadoop\hdfs-site.xml
 C:\hadoop\etc\hadoop\mapred-site.xml
 C:\hadoop\etc\hadoop\yarn-site.xml
2. Edit the configuration properties as needed. For a single-node cluster, at a minimum set
fs.defaultFS to hdfs://localhost:9000 in core-site.xml and dfs.replication to 1 in hdfs-site.xml.
5. Format the NameNode:
1. Open a command prompt and navigate to the Hadoop bin directory:
 cd C:\hadoop\bin
2. Execute the following command to format the NameNode:
 hdfs namenode -format

6. Start the Hadoop Cluster:
1. Navigate to the Hadoop sbin directory (C:\hadoop\sbin) and execute the following
commands to start the Hadoop cluster:
 start-dfs.cmd
 start-yarn.cmd
7. Verify the Hadoop Cluster:
1. Execute the following command to verify the status of the Hadoop cluster:
 jps
2. You should see the following processes running (in addition to the Jps entry itself):
 NameNode
 DataNode
 ResourceManager
 NodeManager
Additional Notes:
 To stop the Hadoop cluster, execute the following commands from C:\hadoop\sbin:
 stop-dfs.cmd
 stop-yarn.cmd
 To view Hadoop logs, navigate to the Hadoop logs directory:
 C:\hadoop\logs

3. To implement the following file management tasks in Hadoop System (HDFS):


Adding files and directories, retrieving files, Deleting files
Here are the commands to implement the following file management tasks in the Hadoop Distributed
File System (HDFS):

Adding files and directories:

To add a file to HDFS, use the hadoop fs -put command. For example, to add the file myfile.txt
to the directory /user/hadoop/data, use the following command:

 hadoop fs -put myfile.txt /user/hadoop/data


To add a directory to HDFS, use the hadoop fs -mkdir command. For example, to create the
directory /user/hadoop/newdir, use the following command:

 hadoop fs -mkdir /user/hadoop/newdir


Retrieving files:

To retrieve a file from HDFS, use the hadoop fs -get command. For example, to retrieve the
file /user/hadoop/data/myfile.txt to the local filesystem, use the following command:

 hadoop fs -get /user/hadoop/data/myfile.txt myfile.txt


Deleting files:

To delete a file from HDFS, use the hadoop fs -rm command. For example, to delete the file
/user/hadoop/data/myfile.txt, use the following command:

 hadoop fs -rm /user/hadoop/data/myfile.txt


To delete a directory from HDFS, use the hadoop fs -rm -r command (the older hadoop fs -rmr
form is deprecated). For example, to delete the directory /user/hadoop/newdir, use the following
command:

 hadoop fs -rm -r /user/hadoop/newdir

4. Create a database ‘STD’ and make a collection (e.g. "student" with fields 'No.,
Stu_Name, Enroll., Branch, Contact, e-mail, Score') using MongoDB. Perform
various operations in the following experiments.
To create a database named "STD" and a collection named "student" with the specified fields in
MongoDB, you can follow these steps. MongoDB is a NoSQL database, and you can interact
with it using a MongoDB client or command-line tools. Below, I'll provide instructions for
creating the database and collection using the MongoDB shell, which is a command-line
interface for MongoDB.

Launch MongoDB Shell:


● Make sure you have MongoDB installed and the MongoDB server running.
● Open a terminal or command prompt.
● Start the MongoDB shell by running the mongo command (or mongosh in newer MongoDB versions).
Switch to the 'STD' Database:
● To create a database, you first need to switch to it. In this case, you want to create
a database called "STD."
● Run the following command in the MongoDB shell:

use STD

Here are the steps to create a database 'STD' and make a collection (e.g. "student" with fields
'No., Stu_Name, Enroll., Branch, Contact, e-mail, Score') using MongoDB. Perform various
operations in the following experiments:

1. Create a database 'STD' and make a collection "student"

use STD
db.createCollection("student")

db.student.insertOne({
"No": 1,
"Stu_Name": “rahul shrivastava",
"Enroll": "0827CS201194",
"Branch": "Computer Science",
"Contact": "1234567890",
"e-mail": "[email protected]",
"Score": 95
})


5. Insert multiple records (at least 10) into the created student collection.

db.student.insertMany([
{
"No": 1,
"Stu_Name": “rahul shrivastava",
"Enroll": "0827CS201194",
"Branch": "Computer Science",
"Contact": "1234567890",
"e-mail": "[email protected]",
"Score": 95
},
{
"No": 2,
"Stu_Name": "Akshay Keswani",
"Enroll": "0827CS201022",
"e-mail":
"[email protected]",
"Score": 88
},
{
"No": 3,
"Stu_Name": "Alokit Sharma",
"Enroll": "0827CS201023",
"Branch": "Computer Science",
"Contact": "1234567891",
"e-mail": "[email protected]",
"Score": 75
},
{
"No": 4,
"Stu_Name": "Aditya Sharma",
"Enroll": "0827CS201016",
"Branch": "Computer Science",
"Contact": "1234567891",
"e-mail":
"[email protected]",

"Score": 92 },
{
"No": 5,
"Stu_Name": "Akshat Singh Gour",
"Enroll": "0827CS201020",
"Branch": "Computer Science",
"Contact": "1234567891",
"e-mail": "[email protected]",
"Score": 80
},
{
"No": 6,
"Stu_Name": "Aayush Gupta",
"Enroll": "0827CS201006",
"e-mail": "[email protected]",
"Score": 87
},
{
"No": 7,
"Stu_Name": "Amit Kumar Yadav",
"Enroll": "0827CS201031",
"Branch": "Computer Science",
"Contact": "1234567891",
"e-mail": "[email protected]",
"Score": 78
},
{
"No": 8,
"Stu_Name": "Aryan Tapkire",
"Enroll": "0827CS201044",
"Branch": "Computer Science",
"Contact": "1234567891",
"e-mail": "[email protected]",
"Score": 91
},

{
"No": 9,
"Stu_Name": "Devesh Sharma",
"Enroll": "0827CS201068",
"Branch": "Computer Science",
"Contact": "1234567891",
"e-mail": "[email protected]",
"Score": 85
},
{
"No": 10,
"Stu_Name": "Asit Joshi",
"Enroll": "0827CS201042",
"Contact": "1234567891",
"e-mail": "[email protected]",
"Score": 89
}
])

6. Execute the following queries on the collection created.
a. Display data in proper format.
b. Update the contact information of a specific student.
c. Add a new field remark to the document with the name 'REM'.
d. Add a new field as no 11, stu_name XYZ, Enroll 00101, branch VB, e-mail
[email protected] Contact 098675345 without using insert statement.

a. Display data in proper format:

 db.student.find().pretty()
This query will retrieve all documents from the 'student' collection and display them in a
formatted manner.

b. Update the contact information of a specific student:

 db.student.updateOne(
{
"Stu_Name": "Bhumi Pandey"
},
{
$set: { "Contact": "9999999999" }
})
This query will update the contact information for the student named 'Bhumi Pandey'. It uses the
$set operator to modify the value of the 'Contact' field to '9999999999'.

c. Add a new field remark to the document with the name 'REM':

 db.student.updateOne(
{ "Stu_Name": "REM" },
{$set: { "remark": "This is a remark" }}
)
This query will add a new field named 'remark' to the document for the student named 'REM'. It
uses the $set operator to set the value of the 'remark' field to 'This is a remark'.

d. Add a new field as no 11, stu_name XYZ, Enroll 00101, branch VB, e-mail xyz@xyz contact
098675345 without using insert statement:

 db.student.updateOne(
{ "No": 11 },
{ $set: {
"Stu_Name": "XYZ",
"Enroll": "00101",
"Branch": "VB",
"e-mail": "xyz@xyz",
"Contact": "098675345" }
},
{ upsert: true }
)

This query adds the new student without using an insert statement by relying on the upsert option
of updateOne: because no document matches "No": 11, MongoDB creates a new document from
the query filter combined with the fields supplied in $set.

7. Create an employee collection in MongoDB with 4 departments and 25 employees
divided equally among them, along with one manager. The following fields should be added:
Employee_ID, Dept_ID, First_Name, Last_Name, Salary (range between
20K and 60K). Now run the following queries
a. Find all the employees of a particular department where salary lies < 40K.
b. Find the highest salary for each department and fetch the name of such
employees.
c. Find all the employees who are on a lesser salary than 30k; increase their salary
by 10% and display the results.

1. Create the "employee" Collection:

 db.employee.insertMany([
{ Employee_ID: 1, Dept_ID: 1, First_Name: "John", Last_Name: "Doe", Salary: 50000 },
{ Employee_ID: 2, Dept_ID: 1, First_Name: "Jane", Last_Name: "Smith", Salary: 48000 },
{ Employee_ID: 3, Dept_ID: 1, First_Name: "Bob", Last_Name: "Johnson", Salary: 35000 },
// ... documents for Employee_ID 4 to 25 follow the same pattern, spread across Dept_ID 1 to 4 ...
{ Employee_ID: 26, Dept_ID: 4, First_Name: "Manager", Last_Name: "Smith", Salary: 60000 }
])

2. Run the Specified Queries:
a. Find all employees of a particular department where salary is less than 40K:

For example, to find employees in Department 1 with a salary less than 40K:
 db.employee.find({ Dept_ID: 1, Salary: { $lt: 40000 } })

b. Find the highest salary for each department and fetch the names of such employees:

To find the highest salary for each department:

db.employee.aggregate([
{ $group: {
_id: "$Dept_ID",
maxSalary: { $max: "$Salary" }
} }
])
To also fetch the names of the employees who earn that highest salary, extend the pipeline, for
example by adding a $sort stage on Salary (descending) before the $group stage and capturing
First_Name and Last_Name with the $first accumulator.

c. Find all employees who are on a salary less than 30K, increase their salary by 10%, and
display the results:

To find employees with a salary less than 30K and increase their salary by 10%:

db.employee.find(
{
Salary: {
$lt: 30000
}
}).forEach(function(employee) {
employee.Salary *= 1.10; // Increase the salary by 10%
db.employee.save(employee); // Write the updated document back (use replaceOne in newer shells)
})
This code will find employees with a salary less than 30K and update their salary to 10% higher.

8. To design and implement a social network graph of 50 nodes and edges between
nodes using the NetworkX library in Python.

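A minimal sketch of the kind of script this experiment calls for is shown below; it assumes the
networkx and matplotlib packages are installed, and the node color, size, and drawing options are
arbitrary choices.

# Sketch: random social network of 50 nodes and 50 edges
import random
import networkx as nx
import matplotlib.pyplot as plt

# Create an undirected graph with 50 nodes numbered 0 to 49
G = nx.Graph()
G.add_nodes_from(range(50))

# Add 50 edges between randomly selected pairs of distinct nodes
while G.number_of_edges() < 50:
    u, v = random.sample(range(50), 2)
    G.add_edge(u, v)

# Draw the graph to the screen using the NetworkX draw function
nx.draw(G, with_labels=True, node_size=300, node_color="skyblue", font_size=8)
plt.show()
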
This code will create a graph with 50 nodes and 50 edges. The nodes will be numbered from 0 to
49, and the edges will be between randomly selected pairs of nodes. The graph will be drawn to
the screen using the NetworkX draw function.

Here is an example of the output of the code:

Figure: Social network graph of 50 nodes and edges
As you can see, the code successfully creates a social network graph of 50 nodes and edges. The
graph is drawn to the screen using the NetworkX draw function.

9. Design and plot an asymmetric social network (socio graph) of 5 nodes (A, B,
C, D, and E) such that A is directed to B, B is directed to D, D is directed to A, and
D is directed to C.

Here's the code for the asymmetric social network you described:

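A minimal sketch of such a script, assuming the networkx and matplotlib packages are installed
(the layout seed and drawing options are arbitrary choices), is given below.

import networkx as nx
import matplotlib.pyplot as plt

# Directed graph for the asymmetric social network
G = nx.DiGraph()
G.add_nodes_from(["A", "B", "C", "D", "E"])  # E remains isolated
G.add_edges_from([("A", "B"), ("B", "D"), ("D", "A"), ("D", "C")])

# Draw the graph with arrows showing the direction of each edge
pos = nx.spring_layout(G, seed=42)  # fixed seed so the layout is repeatable
nx.draw(G, pos, with_labels=True, node_size=800, node_color="lightgreen", arrows=True)
plt.show()
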
This code will create a directed graph with 5 nodes and 4 directed edges. The nodes will be
labeled A, B, C, D, and E, and the directed edges will be from A to B, B to D, D to A, and D to C.
The graph will be drawn to the screen using the NetworkX draw function.

Here is an example of the output of the code:

In this sociograph:
 Node A is connected to B, forming a directed edge from A to B.
 Node B is connected to D, forming a directed edge from B to D.
 Node D is connected to A and C, forming directed edges from D to A and from D to C.
 Node E is not connected to any of the other nodes in the network.
This visual representation should help you understand the asymmetric social network with the
specified connections.

10. Consider the above scenario (No. 09) and plot a weighted asymmetric graph,
the weight range is between 20 to 50.
To create a weighted asymmetric graph with the specified connections and edge weights
between 20 and 50, you can use NetworkX in Python. Here's how you can design and plot the
weighted graph:

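A possible sketch, again assuming the networkx and matplotlib packages are installed and
assigning each edge a random integer weight between 20 and 50, is given below.

import random
import networkx as nx
import matplotlib.pyplot as plt

# Same directed edges as experiment 9, now with random weights in the range 20-50
G = nx.DiGraph()
for u, v in [("A", "B"), ("B", "D"), ("D", "A"), ("D", "C")]:
    G.add_edge(u, v, weight=random.randint(20, 50))
G.add_node("E")  # isolated node

pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_size=800, node_color="orange", arrows=True)

# Label each edge with its weight
edge_labels = nx.get_edge_attributes(G, "weight")
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
plt.show()
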
This code creates a directed graph with 5 nodes (A, B, C, D, and E) and the specified directed
edges with random edge weights between 20 and 50. The resulting plot will visualize the
weighted asymmetric social network graph with labeled edge weights.

11. Implement betweenness measure between nodes across the social network.
(Assume the social network of 10 nodes)

To calculate the betweenness centrality measure between nodes in a social network using Python,
you can utilize the NetworkX library. Below is an example of how to calculate the betweenness
centrality between nodes in a social network with 10 nodes.

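A minimal sketch is shown below; it assumes the networkx package is installed, and the edge list
is only an illustrative assumption (any 10-node network can be substituted).

import networkx as nx

# Example social network of 10 nodes (0-9); the edge list is an assumed example
G = nx.Graph()
G.add_edges_from([
    (0, 1), (0, 2), (1, 2), (1, 3), (2, 4),
    (3, 4), (3, 5), (4, 6), (5, 7), (6, 7),
    (7, 8), (8, 9),
])

# Betweenness centrality: the fraction of shortest paths that pass through each node
centrality = nx.betweenness_centrality(G)
for node, value in sorted(centrality.items()):
    print(f"Node {node}: betweenness centrality = {value:.4f}")
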
In this code, we create a small example social network with 10 nodes and edges. The
nx.betweenness_centrality function is used to calculate the betweenness centrality for each node.
The result is printed, showing the betweenness centrality values for each node.

In this output, each node is listed, and its corresponding betweenness centrality value is
displayed. The betweenness centrality values provide information about the importance of each
node in the network regarding the flow of information or interactions. Nodes with higher
betweenness centrality values can be considered more influential in connecting different parts of
the network.
