Big Data Lab File
AITR, Indore
Index
S.No. | Date of Exp. | Name of the Experiment | Date of Submission | Sign of the Faculty
1. To draw and explain Hadoop architecture and ecosystem with the help of a case study.
Theory:
Hadoop is an open-source framework that enables distributed storage and processing of large
datasets across clusters of computers using simple programming models. It is highly scalable,
fault-tolerant, and capable of handling vast amounts of data in a reliable manner. The Hadoop
ecosystem comprises several modules, tools, and technologies that interact with each other
to enable efficient data processing and management.
1. Hadoop Architecture
Hadoop has a master-slave architecture where the tasks are distributed across nodes in a
cluster. There are two key components in the core of Hadoop: HDFS (Hadoop Distributed File
System) and MapReduce.
HDFS is the storage layer of Hadoop. It is designed to store large datasets by distributing them
across multiple machines. It uses a block storage method where files are divided into fixed-
size blocks (default 128 MB or 64 MB) and stored across different nodes. HDFS has two key
nodes:
a. NameNode (Master Node): This manages the metadata of HDFS and keeps track of
where data blocks are stored.
b. DataNode (Slave Nodes): These store the actual data blocks and serve read/write
requests from the client.
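As a quick illustration of the block storage described above, the short Python sketch below computes how a file would be split into 128 MB blocks (the file size is illustrative):
import math

BLOCK_SIZE_MB = 128      # HDFS default block size
file_size_mb = 300       # illustrative file size

# Number of blocks and the size of each block the file would occupy in HDFS
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
block_sizes = [min(BLOCK_SIZE_MB, file_size_mb - i * BLOCK_SIZE_MB) for i in range(num_blocks)]
print(num_blocks, block_sizes)   # 3 [128, 128, 44]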
MapReduce is the processing layer of Hadoop. In the classic (Hadoop 1.x) architecture it has two key components:
a. JobTracker (Master Node): Assigns and monitors tasks distributed across the cluster.
b. TaskTracker (Slave Nodes): Executes the tasks as assigned by the JobTracker.
YARN is the resource management layer that separates resource management from the processing tasks. It consists of:
a. ResourceManager (Master Node): Allocates cluster resources among all running applications.
b. NodeManager (Slave Nodes): Runs on every worker node, launches containers, and reports resource usage back to the ResourceManager.
2. Hadoop Ecosystem
The Hadoop ecosystem extends beyond HDFS and MapReduce to include several other tools and technologies that enable advanced data storage, analysis, and management.
a. Hive: A data warehouse infrastructure that provides an SQL-like interface to query data
stored in HDFS.
b. Pig: A high-level scripting platform for processing large datasets, providing data
transformation through Pig Latin scripts.
c. HBase: A NoSQL database built on top of HDFS, which supports real-time read/write
access to large datasets.
d. Sqoop: A tool used to transfer data between Hadoop and relational databases (like
MySQL, Oracle).
e. Flume: A service for collecting and moving large amounts of log data into HDFS.
f. Zookeeper: Provides distributed coordination services to maintain configuration
information and synchronize between distributed systems.
g. Oozie: A workflow scheduler for managing Hadoop jobs.
3. Case Study: Uber’s Use of Hadoop for Data Analytics and Enhanced User Insights
Background
Uber, a leading ride-hailing platform, connects millions of riders with drivers in real-time.
Uber collects vast amounts of data from users, drivers, and trips on a global scale. This data
needs to be processed and analyzed to optimize pricing, improve user experience, and ensure
efficient ride matching. Given the scale of operations and the volume of real-time data, Uber
turned to Hadoop to create a scalable, fault-tolerant, and efficient data infrastructure for
handling its vast datasets.
Business Problem
a. Data Volume: Uber deals with enormous volumes of data from millions of rides, user ratings, location data, and driver behavior.
b. Real-Time Insights: Uber requires real-time data processing for features like dynamic pricing, ride matching, and surge pricing.
c. Scalability and Fault Tolerance: Uber needs infrastructure that can seamlessly scale to
accommodate growth and ensure data accessibility even with occasional system
failures.
To address these challenges, Uber implemented the Hadoop ecosystem, integrating various
components to build a scalable, fault-tolerant, and efficient data infrastructure capable of
handling massive data volumes. Here’s how they used it:
a. Data Collection from Diverse Sources: Uber collects data from different sources,
including trip data (pickup/dropoff times and locations), GPS data from drivers, user
ratings, and ride requests.
b. Apache Kafka: Uber uses Apache Kafka to collect real-time data streams, such as ride
requests, driver availability, and location updates. Kafka allows Uber to efficiently
process events as they happen, ensuring a near-instant response for pricing and
matching algorithms.
c. Apache Flume: Uber uses Flume to aggregate log data from various systems, including
server logs, application metrics, and other operational data, and ingests them into
Hadoop for further processing.
d. Hadoop Distributed File System (HDFS): Uber leverages HDFS to store large volumes
of structured and unstructured data, such as ride logs, GPS data, and user profiles.
Data is distributed across multiple nodes, providing both storage scalability and fault
tolerance.
e. Scalability with HDFS: As Uber’s data grows, HDFS allows them to seamlessly add new
storage nodes without disrupting the existing infrastructure. This scalability ensures
Uber can handle the increasing amount of ride data generated as the platform grows
globally.
f. Batch Processing with MapReduce: Uber uses MapReduce for processing and analyzing large datasets. For example (a simplified sketch follows this item):
Map Function: Uber’s system may filter trip data by parameters such as location, time of day, and driver performance.
Reduce Function: The results are aggregated to generate insights such as peak demand times, popular pickup locations, and user behavior trends.
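A simplified, self-contained Python sketch of this map/reduce logic (the trip-record layout and values are illustrative, not Uber’s actual data):
from collections import defaultdict

# Illustrative trip records: (trip_id, pickup_location, hour_of_day)
trips = [
    ("t1", "Airport", 8),
    ("t2", "Downtown", 9),
    ("t3", "Airport", 18),
]

# Map phase: emit a (pickup_location, 1) pair for every trip record
mapped = [(pickup_location, 1) for _, pickup_location, _ in trips]

# Reduce phase: aggregate the counts per pickup location
counts = defaultdict(int)
for pickup_location, one in mapped:
    counts[pickup_location] += one

print(dict(counts))   # {'Airport': 2, 'Downtown': 1}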
g. Apache Hive: Uber utilizes Hive for SQL-like querying to enable business analysts to
access and analyze data in a more user-friendly manner. Hive helps Uber quickly
generate reports on trip demand, user demographics, and driver behavior.
h. Apache HBase: For real-time data processing, Uber uses HBase to store and access
data with low latency. HBase is ideal for providing real-time access to user and trip
data, such as:
Real-Time Price Calculation: HBase supports the dynamic pricing engine, allowing Uber to apply surge pricing and calculate ride costs in real time based on demand, location, and available drivers.
Real-Time Matching: HBase also stores the latest driver availability and location, enabling Uber’s algorithms to match riders and drivers instantly.
i. Apache Pig: Uber uses Apache Pig for complex data transformation tasks such as
filtering, aggregating, and transforming raw ride data into usable formats for analysis.
For example:
Ride Segmentation: Pig scripts segment rides based on different criteria such
as ride duration, geographical location, or fare categories, helping Uber
optimize its pricing models and identify areas with high demand.
j. Apache Oozie: Uber uses Oozie for scheduling and managing data processing
workflows. For instance, Uber automates the running of daily reports, batch
processing of trip data, and machine learning model training using Oozie’s workflow
management system.
k. YARN (Yet Another Resource Negotiator): YARN manages resources across the
Hadoop clusters at Uber. It ensures that Hadoop’s processing power is optimally
allocated to various tasks, allowing for efficient parallel processing of large datasets.
Conclusion
By combining Kafka and Flume for data ingestion, HDFS for distributed storage, MapReduce, Hive, Pig, and HBase for batch and real-time processing, and Oozie and YARN for workflow and resource management, Uber built a scalable, fault-tolerant data infrastructure that supports dynamic pricing, ride matching, and large-scale analytics.
2. To install Hadoop on Windows.
I. Extract Hadoop: Use a tool like WinRAR or 7-Zip to extract the downloaded Hadoop
archive to a directory (e.g., C:\hadoop).
II. Set Environment Variables:
A. Open the Environment Variables settings (Control Panel > System > Advanced
system settings > Environment Variables).
B. Add the following variables:
1. HADOOP_HOME: Set to the path where Hadoop is extracted (e.g.,
C:\hadoop).
2. PATH: Add %HADOOP_HOME%\bin to the existing PATH variable.
3. JAVA_HOME: Set to the path where JDK is installed (e.g., C:\Program
Files\Java\jdk1.8.0_xx).
5. Start Hadoop
A. Open Command Prompt as Administrator.
B. Navigate to the Hadoop sbin directory and start the HDFS and YARN daemons:
cd %HADOOP_HOME%\sbin
start-dfs.cmd
start-yarn.cmd
6. Access Hadoop
To access the Hadoop web interface, open your web browser and go to:
A. HDFS: http://localhost:9870
B. YARN: http://localhost:8088
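Once the daemons are running, you can optionally verify the installation from a new Command Prompt using standard Hadoop commands (the /demo directory name is just an example):
hadoop version
jps
hdfs dfs -mkdir /demo
hdfs dfs -ls /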
In MongoDB, databases and collections are created dynamically when data is inserted into
them. We can perform the following tasks to set up a database called STD, create a student
collection with specific fields, and then demonstrate common operations like inserting,
updating, querying, and deleting documents. Here's how to proceed:
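a. Create and Switch to the STD Database: In the mongo shell, switching to a database creates it lazily once data is inserted into it.
Input:
use STD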
b. Insert a Document into the student Collection: Now, let's insert a document with the fields No, Stu_Name, Enrol, Branch, Contact, email, and Score.
Input:
db.student.insertOne({
No: 1,
Stu_Name: "Ghasiram Pondu",
Enrol: "DOGNO1",
Branch: "Computer Science",
Contact: "666666",
email: "[email protected]",
Score: 6 })
2. Basic Operations
Now that you have data in the student collection, let's perform various operations.
a. Display All Documents:
Input:
db.student.find().pretty()
Output:
b. Query a Document by Name:
Input:
db.student.find({ Stu_Name: "Ghasiram Pondu" })
Output:
c. Update a Document:
Input:
db.student.updateOne(
{ Stu_Name: "Ghasiram Pondu" },
{ $set: { Score: 666666} }
)
Output:
d. Delete a Document:
Input:
db.student.deleteOne({ Stu_Name: "Amit Sharma" })
e. Insert Multiple Documents:
Input:
db.student.insertMany([
{No: 3, Stu_Name: "Lana Rhoades", Enrol: "PAWN1", Branch: "Biology", Contact:
"6969696969", email: "[email protected]", Score: 69 },
{No: 4, Stu_Name: "Lexi Luna", Enrol: "PAWN2", Branch: "Mechanical Engineering",
Contact: "6969696969", email: "[email protected]", Score: 69 },
{No: 5, Stu_Name: "Angela White", Enrol: "PAWN3", Branch: "Plumbing", Contact:
"6969696969", email: "[email protected]", Score: 69 } ])
To insert multiple records into the student collection in MongoDB, you can use the
insertMany() command. This command allows you to insert several documents (records) in
one operation. Below is an example of how you can insert 12 student records into the student
collection within the STD database.
Input:
db.student.insertMany([
{No: 1, Stu_Name: "John Doe", Enrol: "DOGNO1", Branch: "Computer Science", Contact:
"1234567890", email: "[email protected]", Score: 85},
{No: 2, Stu_Name: "Jane Smith", Enrol: "DOGNO2", Branch: "Electrical Engineering", Contact:
"2345678901", email: "[email protected]", Score: 90},
{No: 3, Stu_Name: "Alice Johnson", Enrol: "DOGNO3", Branch: "Mechanical Engineering",
Contact: "3456789012", email: "[email protected]", Score: 92},
{No: 4, Stu_Name: "Bob Brown", Enrol: "DOGNO4", Branch: "Civil Engineering", Contact:
"4567890123", email: "[email protected]", Score: 78},
{No: 5, Stu_Name: "Charlie Davis", Enrol: "DOGNO5", Branch: "Electronics Engineering",
Contact: "5678901234", email: "[email protected]", Score: 88},
{No: 6, Stu_Name: "David Evans", Enrol: "DOGNO6", Branch: "Chemical Engineering", Contact:
"6789012345", email: "[email protected]", Score: 79},
{No: 7, Stu_Name: "Eve White", Enrol: "DOGNO7", Branch: "Information Technology",
Contact: "7890123456", email: "[email protected]", Score: 93},
{No: 8, Stu_Name: "Frank Harris", Enrol: "DOGNO8", Branch: "Software Engineering", Contact:
"8901234567", email: "[email protected]", Score: 80},
{No: 9, Stu_Name: "Grace Martin", Enrol: "DOGNO9", Branch: "Biomedical Engineering",
Contact: "9012345678", email: "[email protected]", Score: 85},
{No: 10, Stu_Name: "Henry Lee", Enrol: "DOGNO10", Branch: "Environmental Engineering",
Contact: "0123456789", email: "[email protected]", Score: 91},
Output:
To display the data stored in the student collection in a neatly formatted way, we can use the
find() command with a projection to show specific fields.
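For example, the command below shows a few fields for every student (the chosen fields are illustrative):
Input:
db.student.find({}, { _id: 0, No: 1, Stu_Name: 1, Branch: 1, Score: 1 }).pretty()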
Output:
Let’s say we want to update the contact information for the student named Tanay. Use the following command (the new contact number shown is just an example):
Input:
db.student.updateOne(
{ Stu_Name: "Tanay" },
{ $set: { Contact: "9876501234" } }
)
Output:
3. Add a New Field Remark to the Document with the Name "REM"
You can add a new field called Remark for the student whose Stu_Name is "REM" using the $set operator (the remark text shown is just an example):
Input:
db.student.updateOne(
{ Stu_Name: "REM" },
{ $set: { Remark: "Good progress" } }
)
Output:
4. Upsert a New Student Document
If no document matches the filter, updateOne() with { upsert: true } inserts a new one. The following command inserts a student named "XYZ" when no matching document exists:
Input:
db.student.updateOne(
{ Stu_Name: "XYZ" },
{
$set: {
No: 11,
Stu_Name: "XYZ",
Enrol: "00101",
Branch: "VB",
email: "[email protected]",
Contact: "098675345"
}
},
{ upsert: true }
)
Output:
The following queries assume an employee collection with fields such as First_Name, Last_Name, Dept_ID, and Salary. This query retrieves all employees from a specific department (e.g., Dept_ID = "D2") whose salary is less than 40,000:
Input:
db.employee.find({
Dept_ID: "D2",
Salary: { $lt: 40000 }
});
Output:
This aggregation pipeline finds the highest salary in each department and fetches the names of those employees (a $match stage is included after the $lookup so that only employees from the same department are kept, since the join is on salary alone):
Input:
db.employee.aggregate([
{
$group: {
_id: "$Dept_ID",
maxSalary: { $max: "$Salary" }
}
},
{
$lookup: {
from: "employee",
localField: "maxSalary",
foreignField: "Salary",
as: "highestSalaryEmployees"
}
},
{
$unwind: "$highestSalaryEmployees"
},
{
$match: {
$expr: { $eq: ["$highestSalaryEmployees.Dept_ID", "$_id"] }
}
},
{
$project: {
Department: "$_id",
Employee_Name: {
$concat: [
"$highestSalaryEmployees.First_Name",
" ",
"$highestSalaryEmployees.Last_Name"
]
},
Salary: "$highestSalaryEmployees.Salary"
}
}
]);
This command updates several documents at once with updateMany(); for example, the command below flags every employee in department D2 earning less than 40,000 (the Status field and its value are illustrative):
Input:
db.employee.updateMany(
{ Dept_ID: "D2", Salary: { $lt: 40000 } },
{ $set: { Status: "Salary under review" } }
);
Output:
This query lists all employees whose salary is less than 33,000:
Input:
db.employee.find({ Salary: { $lt: 33000 } });
Output:
Code:
Output:
To create a weighted asymmetric graph based on the previous scenario, we will assign weights
to the directed edges between the nodes A, B, C, D, and E. The weights will be randomly
selected within the range of 20 to 50.
Code:
● A → B: 35
● B → D: 46
● D → A: 37
● D → C: 21
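A minimal NetworkX sketch of this weighted asymmetric graph, using the edge weights listed above (node E is added but left isolated, since no edges for it are listed):
import networkx as nx
import matplotlib.pyplot as plt

# Weighted asymmetric (directed) graph on nodes A-E
G = nx.DiGraph()
G.add_nodes_from(["A", "B", "C", "D", "E"])

# Directed edges with the weights listed above (they could also be drawn with random.randint(20, 50))
G.add_weighted_edges_from([("A", "B", 35), ("B", "D", 46), ("D", "A", 37), ("D", "C", 21)])

# Draw the graph with edge-weight labels
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=1200, arrows=True)
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, "weight"))
plt.show()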
Output:
To implement the betweenness measure between nodes in a social network with 10 nodes,
we can use the NetworkX library in Python. Betweenness centrality is a measure of a node's
centrality in a graph based on the shortest paths that pass through that node.
Explanation:
a. Graph Construction:
I. We create an undirected graph (nx.Graph()), but you could also use a directed
graph (nx.DiGraph()) if you want to represent directional relationships.
II. Nodes (A, B, C, ..., J) are added to the graph.
III. Edges between nodes represent the relationships (e.g., ('A', 'B') represents a
relationship between node A and node B).
b. Betweenness Centrality: It is computed with nx.betweenness_centrality(G), which returns, for each node, the fraction of all shortest paths in the graph that pass through it.
c. Visualization:
I. The graph is plotted using a spring layout (force-directed layout) where nodes
are placed based on attractive and repulsive forces.
II. The graph is displayed with labels for each node.
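A minimal sketch of the approach described above, assuming an illustrative set of relationships among the 10 nodes (the actual edge list is not specified, so the edges below are only an example):
import networkx as nx
import matplotlib.pyplot as plt

# Undirected social network with 10 nodes (A-J); the edges are illustrative
G = nx.Graph()
G.add_nodes_from("ABCDEFGHIJ")
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "E"),
    ("D", "E"), ("D", "F"), ("E", "G"), ("F", "H"), ("G", "H"),
    ("H", "I"), ("I", "J"),
])

# Betweenness centrality: fraction of all shortest paths that pass through each node
betweenness = nx.betweenness_centrality(G)
for node, score in sorted(betweenness.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}: {score:.3f}")

# Visualization with a spring (force-directed) layout
pos = nx.spring_layout(G, seed=1)
nx.draw(G, pos, with_labels=True, node_color="lightgreen", node_size=1000)
plt.show()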