
PRACTICAL JOURNAL OF BIG DATA

B.Tech: 4th Year

Department of Computer Science and Engineering

Name of the Student : Utkarsh Goyal


Branch & Section : CS-4
Roll No. : 0827CS211252
Year : 4th (7th Semester)

Department of Computer Science and Engineering

AITR, Indore

July - Dec 2024


ACROPOLIS INSTITUTE OF TECHNOLOGY & RESEARCH, INDORE

Index

S.No.   Date of Exp.   Name of the Experiment   Date of Submission   Sign of the Faculty

1. To draw and explain Hadoop architecture and ecosystem with the help of a case study.

2. Perform setting up and installing single node Hadoop in a Windows environment.

3. To implement the following file management tasks in Hadoop System (HDFS): adding files and directories, retrieving files, deleting files.

4. Create a database 'STD' and make a collection (e.g. "student" with fields 'No., Stu_Name, Enrol., Branch, Contact, e-mail, Score') using MongoDB. Perform various operations in the following experiments.

5. Insert multiple records (at least 10) into the created student collection.

6. Execute the following queries on the collection created: display data in proper format; update the contact information of a specific student; add a new field remark to the document with the name 'REM'; add a new field as No. 11, Stu_Name XYZ, Enrol 00101, Branch VB, e-mail [email protected], Contact 098675345 without using an insert statement.

7. Create an employee table in MongoDB with 4 departments and 25 employees equally divided along with one manager. The following fields should be added: Employee_ID, Dept_ID, First_Name, Last_Name, Salary (range between 20K-60K). Run the following queries: find all the employees of a particular department where the salary is less than 40K; find the highest salary for each department and fetch the names of such employees; find all the employees who are on a lesser salary than 30K, increase their salary by 10%, and display the results.

8. To design and implement a social network graph of 30 nodes and edges between nodes using the networkx library in Python.

9. Design and plot an asymmetric social network (socio graph) of 5 nodes (A, B, C, D, and E) such that A is directed to B, B is directed to D, D is directed to A, and D is directed to C.

10. Consider the scenario of Experiment No. 09 and plot a weighted asymmetric graph; the weight range is between 20 and 50.

11. Implement betweenness measure between nodes across the social network. (Assume a social network of 10 nodes.)
Name of Student: Utkarsh Goyal Class: CS Year: 4
Enrolment No: 0827CS211252 Section: 4 Semester: 7
Date of Experiment: Date of Submission:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Lab Experiment No. 01


Objective: To draw and explain Hadoop architecture and ecosystem with the help of a case
study.

Theory:
Hadoop is an open-source framework that enables distributed storage and processing of large
datasets across clusters of computers using simple programming models. It is highly scalable,
fault-tolerant, and capable of handling vast amounts of data in a reliable manner. The Hadoop
ecosystem comprises several modules, tools, and technologies that interact with each other
to enable efficient data processing and management.

1. Hadoop Architecture

Hadoop has a master-slave architecture where the tasks are distributed across nodes in a
cluster. There are two key components in the core of Hadoop: HDFS (Hadoop Distributed File
System) and MapReduce.

Hadoop Distributed File System (HDFS)

HDFS is the storage layer of Hadoop. It is designed to store large datasets by distributing them
across multiple machines. It uses a block storage method where files are divided into fixed-
size blocks (default 128 MB or 64 MB) and stored across different nodes. HDFS has two key
nodes:

a. NameNode (Master Node): This manages the metadata of HDFS and keeps track of
where data blocks are stored.
b. DataNode (Slave Nodes): These store the actual data blocks and serve read/write
requests from the client.

MapReduce

MapReduce is the processing layer of Hadoop, which enables distributed computation on
large datasets. The process involves two main steps:



a. Map Step: The input data is divided into smaller sub-tasks, which are processed in
parallel by different nodes.
b. Reduce Step: The results from the Map step are aggregated and combined to produce
the final output.

The MapReduce architecture consists of:

a. JobTracker (Master Node): Assigns and monitors tasks distributed across the cluster.
b. TaskTracker (Slave Nodes): Executes the tasks as assigned by the JobTracker.

YARN (Yet Another Resource Negotiator)

YARN is the resource management layer that separates resource management from the
processing tasks. It consists of:

a. ResourceManager (Master Node): Manages the allocation of resources across the cluster.
b. NodeManager (Slave Nodes): Monitors resource usage on individual nodes.



2. Hadoop Ecosystem Components

The Hadoop ecosystem extends beyond HDFS and MapReduce to include several other tools
and technologies that enable advanced data storage, analysis, and management.

a. Hive: A data warehouse infrastructure that provides an SQL-like interface to query data
stored in HDFS.
b. Pig: A high-level scripting platform for processing large datasets, providing data
transformation through Pig Latin scripts.
c. HBase: A NoSQL database built on top of HDFS, which supports real-time read/write
access to large datasets.
d. Sqoop: A tool used to transfer data between Hadoop and relational databases (like
MySQL, Oracle).
e. Flume: A service for collecting and moving large amounts of log data into HDFS.
f. Zookeeper: Provides distributed coordination services to maintain configuration
information and synchronize between distributed systems.
g. Oozie: A workflow scheduler for managing Hadoop jobs.

3. Case Study: Uber’s Use of Hadoop for Data Analytics and Enhanced User
Insights

Background

Uber, a leading ride-hailing platform, connects millions of riders with drivers in real-time.
Uber collects vast amounts of data from users, drivers, and trips on a global scale. This data
needs to be processed and analyzed to optimize pricing, improve user experience, and ensure
efficient ride matching. Given the scale of operations and the volume of real-time data, Uber
turned to Hadoop to create a scalable, fault-tolerant, and efficient data infrastructure for
handling its vast datasets.

Business Problem

Uber faced several data challenges:

a. Data Volume: Uber deals with enormous volumes of data from millions of rides, user
ratings, location data, and driver behaviour.
b. Real-Time Insights: Uber requires real-time data processing for features like dynamic
pricing, ride matching, and surge pricing.
c. Scalability and Fault Tolerance: Uber needs infrastructure that can seamlessly scale to
accommodate growth and ensure data accessibility even with occasional system
failures.



Solution: Hadoop Ecosystem

To address these challenges, Uber implemented the Hadoop ecosystem, integrating various
components to build a scalable, fault-tolerant, and efficient data infrastructure capable of
handling massive data volumes. Here’s how they used it:

Step-by-Step Solution Using Hadoop


Step 1: Data Ingestion

a. Data Collection from Diverse Sources: Uber collects data from different sources,
including trip data (pickup/dropoff times and locations), GPS data from drivers, user
ratings, and ride requests.
b. Apache Kafka: Uber uses Apache Kafka to collect real-time data streams, such as ride
requests, driver availability, and location updates. Kafka allows Uber to efficiently
process events as they happen, ensuring a near-instant response for pricing and
matching algorithms.
c. Apache Flume: Uber uses Flume to aggregate log data from various systems, including
server logs, application metrics, and other operational data, and ingests them into
Hadoop for further processing.

Step 2: Data Storage in HDFS

a. Hadoop Distributed File System (HDFS): Uber leverages HDFS to store large volumes
of structured and unstructured data, such as ride logs, GPS data, and user profiles.
Data is distributed across multiple nodes, providing both storage scalability and fault
tolerance.
b. Scalability with HDFS: As Uber’s data grows, HDFS allows them to seamlessly add new
storage nodes without disrupting the existing infrastructure. This scalability ensures
Uber can handle the increasing amount of ride data generated as the platform grows
globally.

Step 3: Data Processing with MapReduce

a. Batch Processing with MapReduce: Uber uses MapReduce for processing and
analyzing large datasets. For example:

 Map Function: Uber’s system may filter out trip data by parameters such as
location, time of day, and driver performance.
 Reduce Function: The results are aggregated to generate insights such as peak
demand times, popular locations for pickups, and user behavior trends.

Step 4: Data Querying with Apache Hive

a. Apache Hive: Uber utilizes Hive for SQL-like querying to enable business analysts to
access and analyze data in a more user-friendly manner. Hive helps Uber quickly
generate reports on trip demand, user demographics, and driver behavior.



Step 5: Real-Time Analytics with HBase

a. Apache HBase: For real-time data processing, Uber uses HBase to store and access
data with low latency. HBase is ideal for providing real-time access to user and trip
data, such as:
 Real-Time Price Calculation: HBase supports the dynamic pricing engine,
allowing Uber to apply surge pricing and calculate ride costs in real-time based
on demand, location, and available drivers.
 Real-Time Matching: HBase is also used to store the latest driver availability
and location, enabling Uber’s algorithms to match riders and drivers instantly.

Step 6: Data Transformation with Apache Pig

a. Apache Pig: Uber uses Apache Pig for complex data transformation tasks such as
filtering, aggregating, and transforming raw ride data into usable formats for analysis.
For example:

 Ride Segmentation: Pig scripts segment rides based on different criteria such
as ride duration, geographical location, or fare categories, helping Uber
optimize its pricing models and identify areas with high demand.

Step 7: Workflow Management with Apache Oozie

a. Apache Oozie: Uber uses Oozie for scheduling and managing data processing
workflows. For instance, Uber automates the running of daily reports, batch
processing of trip data, and machine learning model training using Oozie’s workflow
management system.

Step 8: Resource Management with YARN

a. YARN (Yet Another Resource Negotiator): YARN manages resources across the
Hadoop clusters at Uber. It ensures that Hadoop’s processing power is optimally
allocated to various tasks, allowing for efficient parallel processing of large datasets.

Benefits and Results

a. Improved Ride Matching and User Experience


 Uber uses data from HBase to instantly match riders with nearby drivers. Real-
time processing ensures that users experience minimal waiting times,
contributing to greater user satisfaction.
b. Dynamic Pricing and Revenue Optimization
 By analyzing ride data in real-time through Kafka and HBase, Uber can
implement dynamic pricing models that adjust based on factors such as
demand, weather, traffic, and location. This helps maximize revenue during
peak hours and ensures competitive pricing during low-demand periods.
c. Scalability and Cost Efficiency




 The Hadoop ecosystem enables Uber to scale its infrastructure as the number
of rides grows globally. HDFS allows Uber to manage petabytes of data without
incurring excessive costs for upgrading infrastructure, while YARN optimizes
resource allocation to prevent waste.
d. Real-Time Insights for Drivers and Riders
 With HBase and Kafka, Uber ensures that drivers and riders receive real-time
insights, such as the nearest available drivers, optimal routes, and live pricing
information. This improves overall ride experience and satisfaction.

Conclusion

Uber’s use of the Hadoop ecosystem demonstrates the framework's effectiveness in
managing vast amounts of data and deriving real-time insights. By leveraging HDFS for
scalable storage, MapReduce for data processing, HBase for real-time analytics, and Kafka for
data ingestion, Uber has built a robust and efficient data infrastructure that drives its
platform's operations. These tools help Uber optimize ride matching, pricing, and user
engagement, while also providing a fault-tolerant and scalable solution capable of handling
the company’s rapid growth.



Name of Student: Utkarsh Goyal Class: CS Year: 4
Enrolment No: 0827CS211252 Section: 4 Semester: 7
Date of Experiment: Date of Submission:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Lab Experiment No. 02


Objective: Perform setting up and installing single node Hadoop in a Windows environment.
Prerequisites
I. Java Installation: Ensure that you have the Java Development Kit (JDK) installed.
Hadoop requires Java 8 or higher.
A. Download JDK from Oracle's website or OpenJDK.
B. Install it and set the JAVA_HOME environment variable.
II. Windows Subsystem for Linux (WSL) (optional): For better compatibility, you can
install Hadoop on WSL. However, if you prefer to set it up natively, you can proceed
without it.

Steps to Install Hadoop on Windows


1. Download Hadoop

I. Go to the Apache Hadoop releases page.


II. Download the binary release (e.g., hadoop-3.x.x.tar.gz).

2. Install Hadoop

I. Extract Hadoop: Use a tool like WinRAR or 7-Zip to extract the downloaded Hadoop
archive to a directory (e.g., C:\hadoop).
II. Set Environment Variables:
A. Open the Environment Variables settings (Control Panel > System > Advanced
system settings > Environment Variables).
B. Add the following variables:
1. HADOOP_HOME: Set to the path where Hadoop is extracted (e.g.,
C:\hadoop).
2. PATH: Add %HADOOP_HOME%\bin to the existing PATH variable.
3. JAVA_HOME: Set to the path where JDK is installed (e.g., C:\Program
Files\Java\jdk1.8.0_xx).



3. Configure Hadoop
I. Edit core-site.xml:
A. Navigate to %HADOOP_HOME%\etc\hadoop\core-site.xml.
B. Add the following configuration:
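A typical single-node core-site.xml looks like the following (the port 9000 for fs.defaultFS is an assumed, commonly used value):

<configuration>
   <!-- Default file system URI used by HDFS clients -->
   <property>
      <name>fs.defaultFS</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>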

II. Edit hdfs-site.xml:


A. Navigate to %HADOOP_HOME%\etc\hadoop\hdfs-site.xml.
B. Add the following configuration:
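A typical single-node hdfs-site.xml sets the replication factor to 1 and points the NameNode and DataNode storage at local folders (the C:/hadoop/data paths below are assumed and can be any local directories):

<configuration>
   <!-- Single-node cluster: keep only one copy of each block -->
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <!-- Local directories for NameNode metadata and DataNode blocks -->
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///C:/hadoop/data/namenode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///C:/hadoop/data/datanode</value>
   </property>
</configuration>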

III. Edit mapred-site.xml:


A. If mapred-site.xml does not exist, create a new file named mapred-site.xml in
the %HADOOP_HOME%\etc\hadoop directory.
B. Add the following configuration:
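For a single-node installation, mapred-site.xml usually only needs to tell MapReduce to run on YARN:

<configuration>
   <!-- Run MapReduce jobs on the YARN resource manager -->
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>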



IV. Edit yarn-site.xml:
A. Navigate to %HADOOP_HOME%\etc\hadoop\yarn-site.xml.
B. Add the following configuration:
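A minimal yarn-site.xml for a single node enables the shuffle service that MapReduce jobs need:

<configuration>
   <!-- Auxiliary shuffle service used by MapReduce -->
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>
</configuration>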

4. Format the Namenode


Open Command Prompt as Administrator and run the following command to format the
namenode:
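With the configuration above in place, the standard format command is:

hdfs namenode -format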

5. Start Hadoop
A. Open Command Prompt as Administrator.
B. Navigate to the Hadoop bin directory:

C. Start the HDFS and YARN services:
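Assuming the environment variables from step 2 are set, the start scripts (which ship in the sbin folder of the Hadoop installation) can be run as:

cd %HADOOP_HOME%\sbin
start-dfs.cmd
start-yarn.cmd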

6. Access Hadoop
To access the Hadoop web interface, open your web browser and go to:
A. HDFS: http://localhost:9870
B. YARN: http://localhost:8088



Name of Student: Utkarsh Goyal Class: CS Year: 4
Enrolment No: 0827CS211252 Section: 4 Semester: 7
Date of Experiment: Date of Submission:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Lab Experiment No. 03


Objective: To implement the following file management tasks in Hadoop System
(HDFS): Adding files and directories, retrieving files, Deleting files
Prerequisites:
Ensure you have your Hadoop cluster running and HDFS started. You can check this by running:
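Assuming the Hadoop binaries are on the PATH, a quick check is:

hdfs dfs -ls /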

This command should list the root directory of HDFS.

1. Adding Files and Directories

Add Files to HDFS


To upload files from your local file system to HDFS, use the following command:
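For example, to copy a local file into an HDFS directory (both paths here are examples):

hdfs dfs -put /home/user/sample.txt /user/hadoop/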

Add Directories to HDFS


To create a new directory in HDFS, use the following command:
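For example, to create a directory (the path is an example):

hdfs dfs -mkdir -p /user/hadoop/newdir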

2. Retrieving Files from HDFS


To download files from HDFS to your local file system, use the following command:
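For example, to copy a file from HDFS back to the local file system (both paths are examples):

hdfs dfs -get /user/hadoop/sample.txt /home/user/downloads/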



3. Deleting Files and Directories

Delete Files from HDFS


To delete a file in HDFS, use the following command:
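For example (the path is an example):

hdfs dfs -rm /user/hadoop/sample.txt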

Delete Directories from HDFS


To delete a directory in HDFS, use the following command. If the directory is not empty, you
need to use the -r (recursive) option.
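For example, to remove a directory and everything inside it (the path is an example):

hdfs dfs -rm -r /user/hadoop/newdir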



Name of Student: Utkarsh Goyal Class: CS Year: 4
Enrolment No: 0827CS211252 Section: 4 Semester: 7
Date of Experiment: Date of Submission:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Lab Experiment No. 04


Objective: Create a database ‘STD’ and make a collection (e.g. "student" with fields 'No.,
Stu_Name, Enrol., Branch, Contact, e-mail, Score') using MongoDB. Perform various
operations in the following experiments.

In MongoDB, databases and collections are created dynamically when data is inserted into
them. We can perform the following tasks to set up a database called STD, create a student
collection with specific fields, and then demonstrate common operations like inserting,
updating, querying, and deleting documents. Here's how to proceed:

1. Create Database and Collection


a. Switch/Create Database: Switch to the STD database. If it doesn't exist, MongoDB
will create it automatically when you insert data.
Input:
use STD
Output:

b. Insert a Document into the student Collection: Now, let's insert a document with
the fields No., Stu_Name, Enrol., Branch, Contact, e-mail, and Score.

Input:
db.student.insertOne({
No: 1,
Stu_Name: "Ghasiram Pondu",
Enrol: "DOGNO1",
Branch: "Computer Science",
Contact: "666666",
email: "[email protected]",
Score: 6 })



Output:

2. Basic Operations

Now that you have data in the student collection, let's perform various operations.

a. Find All Documents: To retrieve all documents in the collection:

Input:
db.student.find().pretty()

Output:



b. Find Specific Document (Query by Name):

Input:
db.student.find({ Stu_Name: "Ghasiram Pondu" })
Output:

c. Update a Document:

Input:
db.student.updateOne(
{ Stu_Name: "Ghasiram Pondu" },
{ $set: { Score: 666666} }
)
Output:

d. Delete a Document:

Input:
db.student.deleteOne({ Stu_Name: "Amit Sharma" })



Output:

e. Add Multiple Documents:

Input:
db.student.insertMany([
{No: 3, Stu_Name: "Lana Rhoades", Enrol: "PAWN1", Branch: "Biology", Contact:
"6969696969", email: "[email protected]", Score: 69 },
{No: 4, Stu_Name: "Lexi Luna", Enrol: "PAWN2", Branch: "Mechanical Engineering",
Contact: "6969696969", email: "[email protected]", Score: 69 },
{No: 5, Stu_Name: "Angela White", Enrol: "PAWN3", Branch: "Plumbing", Contact:
"6969696969", email: "[email protected]", Score: 69 } ])



Output:



Name of Student: Utkarsh Goyal Class: CS Year: 4
Enrolment No: 0827CS211252 Section: 4 Semester: 7
Date of Experiment: Date of Submission:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Lab Experiment No. 05


Objective: Insert multiple records (at least 10) into the created student collection.

To insert multiple records into the student collection in MongoDB, you can use the
insertMany() command. This command allows you to insert several documents (records) in
one operation. Below is an example of how you can insert 12 student records into the student
collection within the STD database.

Input:
db.student.insertMany([
{No: 1, Stu_Name: "John Doe", Enrol: "DOGNO1", Branch: "Computer Science", Contact:
"1234567890", email: "[email protected]", Score: 85},
{No: 2, Stu_Name: "Jane Smith", Enrol: "DOGNO2", Branch: "Electrical Engineering", Contact:
"2345678901", email: "[email protected]", Score: 90},
{No: 3, Stu_Name: "Alice Johnson", Enrol: "DOGNO3", Branch: "Mechanical Engineering",
Contact: "3456789012", email: "[email protected]", Score: 92},
{No: 4, Stu_Name: "Bob Brown", Enrol: "DOGNO4", Branch: "Civil Engineering", Contact:
"4567890123", email: "[email protected]", Score: 78},
{No: 5, Stu_Name: "Charlie Davis", Enrol: "DOGNO5", Branch: "Electronics Engineering",
Contact: "5678901234", email: "[email protected]", Score: 88},
{No: 6, Stu_Name: "David Evans", Enrol: "DOGNO6", Branch: "Chemical Engineering", Contact:
"6789012345", email: "[email protected]", Score: 79},
{No: 7, Stu_Name: "Eve White", Enrol: "DOGNO7", Branch: "Information Technology",
Contact: "7890123456", email: "[email protected]", Score: 93},
{No: 8, Stu_Name: "Frank Harris", Enrol: "DOGNO8", Branch: "Software Engineering", Contact:
"8901234567", email: "[email protected]", Score: 80},
{No: 9, Stu_Name: "Grace Martin", Enrol: "DOGNO9", Branch: "Biomedical Engineering",
Contact: "9012345678", email: "[email protected]", Score: 85},
{No: 10, Stu_Name: "Henry Lee", Enrol: "DOGNO10", Branch: "Environmental Engineering",
Contact: "0123456789", email: "[email protected]", Score: 91},



{No: 11, Stu_Name: "Isabel Clark", Enrol: "DOGNO11", Branch: "Agricultural Engineering",
Contact: "1234567890", email: "[email protected]", Score: 77},
{No: 12, Stu_Name: "James Taylor", Enrol: "DOGNO12", Branch: "Aerospace Engineering",
Contact: "2345678901", email: "[email protected]", Score: 89}])

Output:

Name of Student: Utkarsh Goyal Class: CS Year: 4
Enrolment No: 0827CS211252 Section: 4 Semester: 7
Date of Experiment: Date of Submission:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Lab Experiment No. 06


Objective: Execute the following queries on the collection created.
a. Display data in proper format.
b. Update the contact information of a specific student.
c. Add a new field remark to the document with the name 'REM'.
d. Add a new field as No. 11, Stu_Name XYZ, Enrol 00101, Branch VB, e-mail [email protected],
Contact 098675345 without using an insert statement.

1. Display Data in Proper Format

To display the data stored in the student collection in a neatly formatted way, we can use the
find() command with a projection to show specific fields.



Input:
db.student.find().pretty()

Output:



2. Update the Contact Information of a Specific Student

Let’s say we want to update the contact information for the student named Priya Desai. Use the
following command:

Input:
db.student.updateOne(
{ Stu_Name: "Priya Desai" },
{ $set: { Contact: "0000000000" } }
)

Output:

3. Add a New Field Remark to the Document with the Name 'REM'

You can add a new field remark with the value "REM" to a specific student's document using
the following command:

Input:
db.student.updateOne(
{ Stu_Name: "Amit Kumar" },

{ $set: { remark: "REM" } }


)



Output:



4. Add a New Field Without Using the Insert Statement
Input:
db.student.updateOne(
{ Stu_Name: "XYZ" },
{
$set: {
No: 11,
Stu_Name: "XYZ",
Enrol: "00101",
Branch: "VB",
email: "[email protected]",
Contact: "098675345"
}
},
{ upsert: true }
)

Output:



Name of Student: Utkarsh Goyal Class: CS Year: 4
Enrolment No: 0827CS211252 Section: 4 Semester: 7
Date of Experiment: Date of Submission:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Lab Experiment No. 07


Objective: Create an employee table in MongoDB with 4 departments and 25 employees
equally divided along with one manager. The following fields should be added: Employee_ID,
Dept_ID, First_Name, Last_Name, Salary (range between 20K-60K). Now run the following
queries:
a. Find all the employees of a particular department where the salary lies < 40K.
b. Find the highest salary for each department and fetch the name of such employees.
c. Find all the employees who are on a lesser salary than 30k; increase their salary by
10% and display the results.

Step 1: Create the employees Collection in MongoDB


Input:
db.employee.insertMany([
{ "Employee_ID": 1, "Dept_ID": 1, "First_Name": "John", "Last_Name": "Doe", "Salary":
45000, "Role": "Manager" },
{ "Employee_ID": 2, "Dept_ID": 1, "First_Name": "Jane", "Last_Name": "Smith", "Salary":
38000, "Role": "Employee" },
{ "Employee_ID": 3, "Dept_ID": 1, "First_Name": "Michael", "Last_Name": "Johnson",
"Salary": 42000, "Role": "Employee" },
{ "Employee_ID": 4, "Dept_ID": 1, "First_Name": "Emily", "Last_Name": "Williams", "Salary":
39000, "Role": "Employee" },
{ "Employee_ID": 5, "Dept_ID": 1, "First_Name": "David", "Last_Name": "Brown", "Salary":
46000, "Role": "Employee" },
{ "Employee_ID": 6, "Dept_ID": 1, "First_Name": "Sarah", "Last_Name": "Jones", "Salary":
48000, "Role": "Employee" },
{ "Employee_ID": 7, "Dept_ID": 2, "First_Name": "Chris", "Last_Name": "Davis", "Salary":
50000, "Role": "Manager" },
{ "Employee_ID": 8, "Dept_ID": 2, "First_Name": "Patricia", "Last_Name": "Miller", "Salary":
38000, "Role": "Employee" },
{ "Employee_ID": 9, "Dept_ID": 2, "First_Name": "Robert", "Last_Name": "Garcia", "Salary":
44000, "Role": "Employee" },
{ "Employee_ID": 10, "Dept_ID": 2, "First_Name": "Linda", "Last_Name": "Martinez",
"Salary": 39000, "Role": "Employee" },



{ "Employee_ID": 11, "Dept_ID": 2, "First_Name": "James", "Last_Name": "Hernandez",
"Salary": 47000, "Role": "Employee" },
{ "Employee_ID": 12, "Dept_ID": 2, "First_Name": "Pat", "Last_Name": "Lopez", "Salary":
46000, "Role": "Employee" },
{ "Employee_ID": 13, "Dept_ID": 3, "First_Name": "William", "Last_Name": "White", "Salary":
54000, "Role": "Manager" },
{ "Employee_ID": 14, "Dept_ID": 3, "First_Name": "Elizabeth", "Last_Name": "Clark",
"Salary": 42000, "Role": "Employee" },
{ "Employee_ID": 15, "Dept_ID": 3, "First_Name": "Thomas", "Last_Name": "Lewis", "Salary":
47000, "Role": "Employee" },
{ "Employee_ID": 16, "Dept_ID": 3, "First_Name": "Barbara", "Last_Name": "Young",
"Salary": 44000, "Role": "Employee" },
{ "Employee_ID": 17, "Dept_ID": 3, "First_Name": "Charles", "Last_Name": "Walker",
"Salary": 50000, "Role": "Employee" },
{ "Employee_ID": 18, "Dept_ID": 3, "First_Name": "Mary", "Last_Name": "Hall", "Salary":
48000, "Role": "Employee" },
{ "Employee_ID": 19, "Dept_ID": 4, "First_Name": "Joseph", "Last_Name": "Allen", "Salary":
55000, "Role": "Manager" },
{ "Employee_ID": 20, "Dept_ID": 4, "First_Name": "Nancy", "Last_Name": "King", "Salary":
39000, "Role": "Employee" },
{ "Employee_ID": 21, "Dept_ID": 4, "First_Name": "Mark", "Last_Name": "Scott", "Salary":
46000, "Role": "Employee" },
{ "Employee_ID": 22, "Dept_ID": 4, "First_Name": "Susan", "Last_Name": "Green", "Salary":
48000, "Role": "Employee" },
{ "Employee_ID": 23, "Dept_ID": 4, "First_Name": "Steven", "Last_Name": "Adams", "Salary":
50000, "Role": "Employee" },
{ "Employee_ID": 24, "Dept_ID": 4, "First_Name": "Karen", "Last_Name": "Baker", "Salary":
46000, "Role": "Employee" }
])



Output:

a. Find all employees in a department where the salary is less than 40K

This query retrieves all employees from a specific department (e.g., Dept_ID = 2) where the
salary is less than 40K:

Input:
db.employee.find({
Dept_ID: 2,
Salary: { $lt: 40000 }
});

Output:



b. Find the highest salary for each department and fetch the name of such
employees

This aggregation pipeline finds the highest salary in each department and fetches the names
of those employees:

Input:
db.employee.aggregate([
{
$group: {
_id: "$Dept_ID",
maxSalary: { $max: "$Salary" }
}
},
{
$lookup: {
from: "employee",
let: { dept: "$_id", maxSal: "$maxSalary" },
pipeline: [
{ $match: { $expr: { $and: [
{ $eq: ["$Dept_ID", "$$dept"] },
{ $eq: ["$Salary", "$$maxSal"] }
] } } }
],
as: "highestSalaryEmployees"
}
},
{
$unwind: "$highestSalaryEmployees"
},
{
$project: {
Department: "$_id",
Employee_Name: {
$concat: [
"$highestSalaryEmployees.First_Name",
" ",
"$highestSalaryEmployees.Last_Name"
]
},
Salary: "$highestSalaryEmployees.Salary"
}
}
]);



Output:



c. Find all employees with a salary less than 30K, increase their salary by 10%,
and display the results

 Update salaries by 10% for employees earning less than 30K:

Input:

db.employee.updateMany(
{ Salary: { $lt: 30000 } },
[ { $set: { Salary: { $multiply: ["$Salary", 1.1] } } } ]
);

Output:



 Display updated results for those employees:

Input:
db.employee.find({ Salary: { $lt: 33000 } });

Output:



Name of Student: Utkarsh Goyal Class: CS Year: 4
Enrolment No: 0827CS211252 Section: 4 Semester: 7
Date of Experiment: Date of Submission:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Lab Experiment No. 08


Objective: To design and implement a social network graph of 30 nodes and edges between
nodes using networkx library in Python.

Social Network Graph Theory


In social network analysis, a social network graph represents individuals (people,
organizations, or entities) as nodes and their relationships as edges between those nodes.
Social network graphs help us visualize and understand patterns in social structures,
communities, and interactions.

Key Concepts in Social Network Graphs


a. Nodes (Vertices): In a social network, each node represents an individual, a group, or
an entity. In this case, we have 30 nodes, each representing a unique individual in the
network.
b. Edges (Links): Edges indicate a relationship or interaction between nodes. Edges can
be directed or undirected:
I. Directed Edges: Show relationships with a direction, like following on social
media (node A follows node B).
II. Undirected Edges: Show mutual relationships, like friendships or connections
in a contact list.
c. Degree: The degree of a node represents the number of edges connected to it:
I. In-Degree: The number of incoming edges for a node (useful for understanding
popularity).
II. Out-Degree: The number of outgoing edges from a node (useful for
understanding influence).
d. Weighted Edges: In some social network graphs, edges can have weights, representing
the strength or frequency of interactions between two nodes. For example, two
friends who chat frequently might have a higher weight on their edge.
e. Path: A path is a sequence of edges connecting a sequence of nodes. Paths can reveal
indirect relationships between two nodes (e.g., a friend of a friend).
f. Clustering: In social networks, clusters or communities are groups of nodes with dense
connections within the group. Clustering helps identify sub-groups, like close friends
or professional circles.



g. Centrality: Centrality measures identify the most important or influential nodes in a
network. Common centrality metrics include:
I. Degree Centrality: Counts the number of connections each node has.
II. Betweenness Centrality: Measures how often a node appears on the shortest
paths between other nodes.
III. Closeness Centrality: Reflects how close a node is to all other nodes in the
network.

Code:
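A minimal sketch that follows the steps described in the explanation below (node colours, sizes, and the random edge selection are arbitrary choices):

import random
import networkx as nx
import matplotlib.pyplot as plt

# Create an empty undirected graph
G = nx.Graph()

# Add 30 nodes labelled Person_1 ... Person_30
nodes = [f"Person_{i}" for i in range(1, 31)]
G.add_nodes_from(nodes)

# Give each node between 1 and 5 random edges to other nodes (no self-loops)
for node in nodes:
    for _ in range(random.randint(1, 5)):
        target = random.choice(nodes)
        if target != node:
            G.add_edge(node, target)

# Position nodes with a force-directed (spring) layout and draw the graph
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_size=500, node_color="skyblue", font_size=8)
plt.show()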

Explanation of the Code:


a. Create an empty graph: We initialize an empty graph using nx.Graph().
b. Add nodes: We add 30 nodes to the graph, each representing a person. The nodes are
labeled as Person_1, Person_2, ..., Person_30.
c. Add random edges: We create random relationships between the nodes by adding
edges. Each node will have between 1 and 5 edges to other nodes (excluding itself).
The random.choice() function ensures that the edges connect different nodes.
d. Visualize the graph:
I. nx.spring_layout(G) is used to determine the positions of nodes. This layout
uses a force-directed algorithm to position nodes in a visually appealing way.
II. nx.draw() is used to plot the graph with labels, node sizes, and colors. You can
customize the appearance with various parameters like node_size, node_color,
and font_size.
e. Display the graph: The graph is displayed using plt.show(), which will open a window
with the network visualization.



Output:



Name of Student: Utkarsh Goyal Class: CS Year: 4
Enrolment No: 0827CS211252 Section: 4 Semester: 7
Date of Experiment: Date of Submission:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Lab Experiment No. 09


Objective: Design and plot an asymmetric social network (socio graph) of 5 nodes (A, B, C,
D, and E) such that A is directed to B, B is directed to D, D is directed to A, and D is directed to
C.

Steps to Implement the Asymmetric Social Network Graph


a. Define Nodes: Nodes represent entities or individuals in the social network. Here, we
have nodes: A, B, C, D, and E.
b. Define Directed Edges: The directed edges represent interactions or relationships
between these nodes:
I. A is directed to B
II. B is directed to D
III. D is directed to A
IV. D is directed to C
c. Visualize: Use matplotlib to display the graph and label the nodes and edges
accordingly.

Code:
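A minimal sketch that matches the explanation below (node colour, size, and font values are arbitrary):

import networkx as nx
import matplotlib.pyplot as plt

# Directed graph: edges carry a direction
G = nx.DiGraph()

# Add the five nodes explicitly
G.add_nodes_from(["A", "B", "C", "D", "E"])

# Directed edges as given: A -> B, B -> D, D -> A, D -> C
G.add_edges_from([("A", "B"), ("B", "D"), ("D", "A"), ("D", "C")])

# Plot with a spring layout; arrowsize makes the edge directions visible
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color="lightgreen", node_size=800,
        font_size=12, edge_color="gray", arrowsize=20)
plt.show()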

Explanation of the Code


a. Graph Creation:



I. nx.DiGraph() is used to create a directed graph, meaning the edges have a
direction (from one node to another).
b. Adding Nodes:
I. We explicitly add 5 nodes: "A", "B", "C", "D", and "E".
c. Adding Directed Edges:
I. The directed edges are added with add_edges_from(). These edges are based
on the directions provided in the question.
d. Plotting the Graph:
I. We use nx.spring_layout(G) for positioning the nodes in a way that spreads
them out nicely.
II. The nx.draw() function is used to draw the graph with options for labels, node
colors, font sizes, and edge colors.
III. The arrowsize=20 ensures the arrows representing directed edges are visible.

Output:



Name of Student: Utkarsh Goyal Class: CS Year: 4
Enrolment No: 0827CS211252 Section: 4 Semester: 7
Date of Experiment: Date of Submission:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Lab Experiment No. 10


Objective: Consider the above scenario (No. 09) and plot a weighted asymmetric graph; the
weight range is between 20 and 50.

To create a weighted asymmetric graph based on the previous scenario, we will assign weights
to the directed edges between the nodes A, B, C, D, and E. The weights will be randomly
selected within the range of 20 to 50.

Steps to Create and Plot the Weighted Graph:


a. Define the Nodes: Use the same nodes as before (A, B, C, D, and E).
b. Define the Edges with Weights: Assign random weights to the directed edges.
c. Plot the Graph: Visualize the graph with weights displayed on the edges

Code:
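A minimal sketch matching the steps below (node colours and sizes are arbitrary; the weights are drawn at random between 20 and 50):

import random
import networkx as nx
import matplotlib.pyplot as plt

G = nx.DiGraph()
G.add_nodes_from(["A", "B", "C", "D", "E"])

# Assign a random weight between 20 and 50 to each directed edge
edges = [("A", "B"), ("B", "D"), ("D", "A"), ("D", "C")]
weighted_edges = [(u, v, random.randint(20, 50)) for u, v in edges]
G.add_weighted_edges_from(weighted_edges)

# Draw the graph and label each edge with its weight
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=800,
        font_size=12, arrowsize=20)
edge_labels = nx.get_edge_attributes(G, "weight")
nx.draw_networkx_edge_labels(G, pos, edge_labels=edge_labels)
plt.show()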



Explanation:
a. Adding Weighted Edges:
I. We use random.randint(20, 50) to generate a random weight between 20 and
50 for each edge.
II. The add_weighted_edges_from() method adds edges with weights.
b. Edge Labels:
I. nx.get_edge_attributes(G, 'weight') retrieves the weights associated with each
edge.
II. nx.draw_networkx_edge_labels() is used to display the weights next to the
edges in the plot.
c. Graph Visualization:
I. The graph is drawn using the same spring layout (nx.spring_layout(G)) and the
edges are drawn with directed arrows.
II. The edge labels are drawn on the graph showing the weight of each edge.

Random Weights Example


Here's an example of how the weights might look if you run the code:

● A → B: 35
● B → D: 46
● D → A: 37
● D → C: 21

Output:



Name of Student: Utkarsh Goyal Class: CS Year: 4
Enrolment No: 0827CS211252 Section: 4 Semester: 7
Date of Experiment: Date of Submission:
Remarks by faculty: Grade:
Signature of student: Signature of Faculty:

Lab Experiment No. 11


Objective: Implement betweenness measure between nodes across the social network.
(Assume the social network of 10 nodes)

To implement the betweenness measure between nodes in a social network with 10 nodes,
we can use the NetworkX library in Python. Betweenness centrality is a measure of a node's
centrality in a graph based on the shortest paths that pass through that node.

Steps to Calculate Betweenness Centrality:


a. Define the Nodes: Create a social network with 10 nodes.
b. Define the Edges: Create directed edges to establish relationships between the nodes.
c. Calculate Betweenness Centrality: Use NetworkX to compute the betweenness
centrality for each node.
d. Display the Results: Print the betweenness centrality values for each node.



Code
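A minimal sketch matching the explanation below; the edge list is an assumed example network, not a prescribed one:

import networkx as nx
import matplotlib.pyplot as plt

# Undirected social network of 10 nodes (A ... J)
G = nx.Graph()
G.add_nodes_from(["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"])

# Relationships between nodes (this edge set is an assumed example)
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E"),
    ("E", "F"), ("F", "G"), ("G", "H"), ("H", "I"), ("I", "J"), ("E", "H")
])

# Betweenness centrality: fraction of shortest paths passing through each node
betweenness = nx.betweenness_centrality(G)
for node, value in betweenness.items():
    print(f"{node}: {value:.4f}")

# Plot the network with a spring (force-directed) layout
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color="orange", node_size=700, font_size=10)
plt.show()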

Explanation:

a. Creating the Graph:

I. We create an undirected graph (nx.Graph()), but you could also use a directed
graph (nx.DiGraph()) if you want to represent directional relationships.
II. Nodes (A, B, C, ..., J) are added to the graph.
III. Edges between nodes represent the relationships (e.g., ('A', 'B') represents a
relationship between node A and node B).

b. Betweenness Centrality:

I. The betweenness centrality for all nodes is computed using the
nx.betweenness_centrality() function. This function calculates the number of
shortest paths that pass through each node, normalized by the total number
of possible paths.
II. Betweenness centrality is printed for each node, with higher values indicating
that the node is more crucial in connecting the network.

c. Visualization:

I. The graph is plotted using a spring layout (force-directed layout) where nodes
are placed based on attractive and repulsive forces.
II. The graph is displayed with labels for each node.



Output:

