Big Data Lab File
AITR, Indore
Index
S.No. | Date of Exp. | Name of the Experiment | Date of Submission | Sign of the Faculty
1. To draw and explain Hadoop architecture and ecosystem with the help of a case study.
Theory:
Hadoop is an open-source framework that enables distributed storage and processing of large
datasets across clusters of computers using simple programming models. It is highly scalable,
fault-tolerant, and capable of handling vast amounts of data in a reliable manner. The Hadoop
ecosystem comprises several modules, tools, and technologies that interact with each other
to enable efficient data processing and management.
1. Hadoop Architecture
Hadoop has a master-slave architecture where the tasks are distributed across nodes in a
cluster. There are two key components in the core of Hadoop: HDFS (Hadoop Distributed File
System) and MapReduce.
HDFS is the storage layer of Hadoop. It is designed to store large datasets by distributing them
across multiple machines. It uses a block storage method where files are divided into fixed-
size blocks (default 128 MB or 64 MB) and stored across different nodes. HDFS has two key
nodes:
a. NameNode (Master Node): This manages the metadata of HDFS and keeps track of
where data blocks are stored.
b. DataNode (Slave Nodes): These store the actual data blocks and serve read/write
requests from the client.
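As a quick illustration of the block storage described above, the short Python sketch below computes how a file would be split into 128 MB blocks (the file size is illustrative):
import math

BLOCK_SIZE_MB = 128      # HDFS default block size
file_size_mb = 300       # illustrative file size

# Number of blocks and the size of each block the file would occupy in HDFS
num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
block_sizes = [min(BLOCK_SIZE_MB, file_size_mb - i * BLOCK_SIZE_MB) for i in range(num_blocks)]
print(num_blocks, block_sizes)   # 3 [128, 128, 44]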
MapReduce is the processing layer of Hadoop. In the classic (Hadoop 1.x) architecture it has two key components:
a. JobTracker (Master Node): Assigns and monitors tasks distributed across the cluster.
b. TaskTracker (Slave Nodes): Executes the tasks as assigned by the JobTracker.
YARN is the resource management layer that separates resource management from the processing tasks. It consists of:
a. ResourceManager (Master Node): Allocates cluster resources among all running applications.
b. NodeManager (Slave Nodes): Runs on every worker node, launches containers, and reports resource usage back to the ResourceManager.
2. Hadoop Ecosystem
The Hadoop ecosystem extends beyond HDFS and MapReduce to include several other tools and technologies that enable advanced data storage, analysis, and management.
a. Hive: A data warehouse infrastructure that provides an SQL-like interface to query data
stored in HDFS.
b. Pig: A high-level scripting platform for processing large datasets, providing data
transformation through Pig Latin scripts.
c. HBase: A NoSQL database built on top of HDFS, which supports real-time read/write
access to large datasets.
d. Sqoop: A tool used to transfer data between Hadoop and relational databases (like
MySQL, Oracle).
e. Flume: A service for collecting and moving large amounts of log data into HDFS.
f. Zookeeper: Provides distributed coordination services to maintain configuration
information and synchronize between distributed systems.
g. Oozie: A workflow scheduler for managing Hadoop jobs.
3. Case Study: Uber’s Use of Hadoop for Data Analytics and Enhanced User Insights
Background
Uber, a leading ride-hailing platform, connects millions of riders with drivers in real-time.
Uber collects vast amounts of data from users, drivers, and trips on a global scale. This data
needs to be processed and analyzed to optimize pricing, improve user experience, and ensure
efficient ride matching. Given the scale of operations and the volume of real-time data, Uber
turned to Hadoop to create a scalable, fault-tolerant, and efficient data infrastructure for
handling its vast datasets.
Business Problem
a. Data Volume: Uber deals with enormous volumes of data from millions of rides, user ratings, location data, and driver behavior.
b. Real-Time Insights: Uber requires real-time data processing for features like dynamic pricing, ride matching, and surge pricing.
c. Scalability and Fault Tolerance: Uber needs infrastructure that can seamlessly scale to
accommodate growth and ensure data accessibility even with occasional system
failures.
To address these challenges, Uber implemented the Hadoop ecosystem, integrating various
components to build a scalable, fault-tolerant, and efficient data infrastructure capable of
handling massive data volumes. Here’s how they used it:
a. Data Collection from Diverse Sources: Uber collects data from different sources,
including trip data (pickup/dropoff times and locations), GPS data from drivers, user
ratings, and ride requests.
b. Apache Kafka: Uber uses Apache Kafka to collect real-time data streams, such as ride
requests, driver availability, and location updates. Kafka allows Uber to efficiently
process events as they happen, ensuring a near-instant response for pricing and
matching algorithms.
c. Apache Flume: Uber uses Flume to aggregate log data from various systems, including
server logs, application metrics, and other operational data, and ingests them into
Hadoop for further processing.
d. Hadoop Distributed File System (HDFS): Uber leverages HDFS to store large volumes
of structured and unstructured data, such as ride logs, GPS data, and user profiles.
Data is distributed across multiple nodes, providing both storage scalability and fault
tolerance.
e. Scalability with HDFS: As Uber’s data grows, HDFS allows them to seamlessly add new
storage nodes without disrupting the existing infrastructure. This scalability ensures
Uber can handle the increasing amount of ride data generated as the platform grows
globally.
f. Batch Processing with MapReduce: Uber uses MapReduce for processing and analyzing large datasets. For example (a simplified sketch follows this item):
Map Function: Uber’s system may filter trip data by parameters such as location, time of day, and driver performance.
Reduce Function: The results are aggregated to generate insights such as peak demand times, popular pickup locations, and user behavior trends.
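A simplified, self-contained Python sketch of this map/reduce logic (the trip-record layout and values are illustrative, not Uber’s actual data):
from collections import defaultdict

# Illustrative trip records: (trip_id, pickup_location, hour_of_day)
trips = [
    ("t1", "Airport", 8),
    ("t2", "Downtown", 9),
    ("t3", "Airport", 18),
]

# Map phase: emit a (pickup_location, 1) pair for every trip record
mapped = [(pickup_location, 1) for _, pickup_location, _ in trips]

# Reduce phase: aggregate the counts per pickup location
counts = defaultdict(int)
for pickup_location, one in mapped:
    counts[pickup_location] += one

print(dict(counts))   # {'Airport': 2, 'Downtown': 1}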
g. Apache Hive: Uber utilizes Hive for SQL-like querying to enable business analysts to
access and analyze data in a more user-friendly manner. Hive helps Uber quickly
generate reports on trip demand, user demographics, and driver behavior.
h. Apache HBase: For real-time data processing, Uber uses HBase to store and access
data with low latency. HBase is ideal for providing real-time access to user and trip
data, such as:
Real-Time Price Calculation: HBase supports the dynamic pricing engine, allowing Uber to apply surge pricing and calculate ride costs in real time based on demand, location, and available drivers.
Real-Time Matching: HBase also stores the latest driver availability and location, enabling Uber’s algorithms to match riders and drivers instantly.
i. Apache Pig: Uber uses Apache Pig for complex data transformation tasks such as
filtering, aggregating, and transforming raw ride data into usable formats for analysis.
For example:
Ride Segmentation: Pig scripts segment rides based on different criteria such
as ride duration, geographical location, or fare categories, helping Uber
optimize its pricing models and identify areas with high demand.
j. Apache Oozie: Uber uses Oozie for scheduling and managing data processing
workflows. For instance, Uber automates the running of daily reports, batch
processing of trip data, and machine learning model training using Oozie’s workflow
management system.
k. YARN (Yet Another Resource Negotiator): YARN manages resources across the
Hadoop clusters at Uber. It ensures that Hadoop’s processing power is optimally
allocated to various tasks, allowing for efficient parallel processing of large datasets.
Conclusion
By combining Kafka and Flume for data ingestion, HDFS for distributed storage, MapReduce, Hive, Pig, and HBase for batch and real-time processing, and Oozie and YARN for workflow and resource management, Uber built a scalable, fault-tolerant data infrastructure that supports dynamic pricing, ride matching, and large-scale analytics.
2. To install Hadoop on Windows.
I. Extract Hadoop: Use a tool like WinRAR or 7-Zip to extract the downloaded Hadoop
archive to a directory (e.g., C:\hadoop).
II. Set Environment Variables:
A. Open the Environment Variables settings (Control Panel > System > Advanced
system settings > Environment Variables).
B. Add the following variables:
1. HADOOP_HOME: Set to the path where Hadoop is extracted (e.g.,
C:\hadoop).
2. PATH: Add %HADOOP_HOME%\bin to the existing PATH variable.
3. JAVA_HOME: Set to the path where JDK is installed (e.g., C:\Program
Files\Java\jdk1.8.0_xx).
5. Start Hadoop
A. Open Command Prompt as Administrator.
B. Navigate to the Hadoop sbin directory and start the HDFS and YARN daemons:
cd %HADOOP_HOME%\sbin
start-dfs.cmd
start-yarn.cmd
6. Access Hadoop
To access the Hadoop web interface, open your web browser and go to:
A. HDFS: http://localhost:9870
B. YARN: http://localhost:8088
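Once the daemons are running, you can optionally verify the installation from a new Command Prompt using standard Hadoop commands (the /demo directory name is just an example):
hadoop version
jps
hdfs dfs -mkdir /demo
hdfs dfs -ls /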
In MongoDB, databases and collections are created dynamically when data is inserted into
them. We can perform the following tasks to set up a database called STD, create a student
collection with specific fields, and then demonstrate common operations like inserting,
updating, querying, and deleting documents. Here's how to proceed:
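a. Create and Switch to the STD Database: In the mongo shell, switching to a database creates it lazily once data is inserted into it.
Input:
use STD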
b. Insert a Document into the student Collection: Now, let's insert a document with the fields No, Stu_Name, Enrol, Branch, Contact, email, and Score.
Input:
db.student.insertOne({
No: 1,
Stu_Name: "Ghasiram Pondu",
Enrol: "DOGNO1",
Branch: "Computer Science",
Contact: "666666",
email: "[email protected]",
Score: 6 })
2. Basic Operations
Now that you have data in the student collection, let's perform various operations.
a. Display All Documents:
Input:
db.student.find().pretty()
Output:
b. Query a Document by Name:
Input:
db.student.find({ Stu_Name: "Ghasiram Pondu" })
Output:
c. Update a Document:
Input:
db.student.updateOne(
{ Stu_Name: "Ghasiram Pondu" },
{ $set: { Score: 666666} }
)
Output:
d. Delete a Document:
Input:
db.student.deleteOne({ Stu_Name: "Amit Sharma" })
e. Insert Multiple Documents:
Input:
db.student.insertMany([
{No: 3, Stu_Name: "Lana Rhoades", Enrol: "PAWN1", Branch: "Biology", Contact:
"6969696969", email: "[email protected]", Score: 69 },
{No: 4, Stu_Name: "Lexi Luna", Enrol: "PAWN2", Branch: "Mechanical Engineering",
Contact: "6969696969", email: "[email protected]", Score: 69 },
{No: 5, Stu_Name: "Angela White", Enrol: "PAWN3", Branch: "Plumbing", Contact:
"6969696969", email: "[email protected]", Score: 69 } ])
To insert multiple records into the student collection in MongoDB, you can use the
insertMany() command. This command allows you to insert several documents (records) in
one operation. Below is an example of how you can insert 12 student records into the student
collection within the STD database.
Input:
db.student.insertMany([
{No: 1, Stu_Name: "John Doe", Enrol: "DOGNO1", Branch: "Computer Science", Contact:
"1234567890", email: "[email protected]", Score: 85},
{No: 2, Stu_Name: "Jane Smith", Enrol: "DOGNO2", Branch: "Electrical Engineering", Contact:
"2345678901", email: "[email protected]", Score: 90},
{No: 3, Stu_Name: "Alice Johnson", Enrol: "DOGNO3", Branch: "Mechanical Engineering",
Contact: "3456789012", email: "[email protected]", Score: 92},
{No: 4, Stu_Name: "Bob Brown", Enrol: "DOGNO4", Branch: "Civil Engineering", Contact:
"4567890123", email: "[email protected]", Score: 78},
{No: 5, Stu_Name: "Charlie Davis", Enrol: "DOGNO5", Branch: "Electronics Engineering",
Contact: "5678901234", email: "[email protected]", Score: 88},
{No: 6, Stu_Name: "David Evans", Enrol: "DOGNO6", Branch: "Chemical Engineering", Contact:
"6789012345", email: "[email protected]", Score: 79},
{No: 7, Stu_Name: "Eve White", Enrol: "DOGNO7", Branch: "Information Technology",
Contact: "7890123456", email: "[email protected]", Score: 93},
{No: 8, Stu_Name: "Frank Harris", Enrol: "DOGNO8", Branch: "Software Engineering", Contact:
"8901234567", email: "[email protected]", Score: 80},
{No: 9, Stu_Name: "Grace Martin", Enrol: "DOGNO9", Branch: "Biomedical Engineering",
Contact: "9012345678", email: "[email protected]", Score: 85},
{No: 10, Stu_Name: "Henry Lee", Enrol: "DOGNO10", Branch: "Environmental Engineering",
Contact: "0123456789", email: "[email protected]", Score: 91},
Output:
To display the data stored in the student collection in a neatly formatted way, we can use the
find() command with a projection to show specific fields.
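For example, the command below shows a few fields for every student (the chosen fields are illustrative):
Input:
db.student.find({}, { _id: 0, No: 1, Stu_Name: 1, Branch: 1, Score: 1 }).pretty()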
Output:
Let’s say we want to update the contact information for the student named Tanay. Use the following command (the new contact number shown is just an example):
Input:
db.student.updateOne(
{ Stu_Name: "Tanay" },
{ $set: { Contact: "9876501234" } }
)
Output:
3. Add a New Field Remark to the Document with the Name "REM"
You can add a new field called Remark for the student whose Stu_Name is "REM" using the $set operator (the remark text shown is just an example):
Input:
db.student.updateOne(
{ Stu_Name: "REM" },
{ $set: { Remark: "Good progress" } }
)
Output:
4. Upsert a New Student Document
If no document matches the filter, updateOne() with { upsert: true } inserts a new one. The following command inserts a student named "XYZ" when no matching document exists:
Input:
db.student.updateOne(
{ Stu_Name: "XYZ" },
{
$set: {
No: 11,
Stu_Name: "XYZ",
Enrol: "00101",
Branch: "VB",
email: "[email protected]",
Contact: "098675345"
}
},
{ upsert: true }
)
Output:
The following queries assume an employee collection with fields such as First_Name, Last_Name, Dept_ID, and Salary. This query retrieves all employees from a specific department (e.g., Dept_ID = "D2") whose salary is less than 40,000:
Input:
db.employee.find({
Dept_ID: "D2",
Salary: { $lt: 40000 }
});
Output:
This aggregation pipeline finds the highest salary in each department and fetches the names of those employees (a $match stage is included after the $lookup so that only employees from the same department are kept, since the join is on salary alone):
Input:
db.employee.aggregate([
{
$group: {
_id: "$Dept_ID",
maxSalary: { $max: "$Salary" }
}
},
{
$lookup: {
from: "employee",
localField: "maxSalary",
foreignField: "Salary",
as: "highestSalaryEmployees"
}
},
{
$unwind: "$highestSalaryEmployees"
},
{
$match: {
$expr: { $eq: ["$highestSalaryEmployees.Dept_ID", "$_id"] }
}
},
{
$project: {
Department: "$_id",
Employee_Name: {
$concat: [
"$highestSalaryEmployees.First_Name",
" ",
"$highestSalaryEmployees.Last_Name"
]
},
Salary: "$highestSalaryEmployees.Salary"
}
}
]);
This command updates several documents at once with updateMany(); for example, the command below flags every employee in department D2 earning less than 40,000 (the Status field and its value are illustrative):
Input:
db.employee.updateMany(
{ Dept_ID: "D2", Salary: { $lt: 40000 } },
{ $set: { Status: "Salary under review" } }
);
Output:
This query lists all employees whose salary is less than 33,000:
Input:
db.employee.find({ Salary: { $lt: 33000 } });
Output:
Code:
Output:
To create a weighted asymmetric graph based on the previous scenario, we will assign weights
to the directed edges between the nodes A, B, C, D, and E. The weights will be randomly
selected within the range of 20 to 50.
Code:
● A → B: 35
● B → D: 46
● D → A: 37
● D → C: 21
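A minimal NetworkX sketch of this weighted asymmetric graph, using the edge weights listed above (node E is added but left isolated, since no edges for it are listed):
import networkx as nx
import matplotlib.pyplot as plt

# Weighted asymmetric (directed) graph on nodes A-E
G = nx.DiGraph()
G.add_nodes_from(["A", "B", "C", "D", "E"])

# Directed edges with the weights listed above (they could also be drawn with random.randint(20, 50))
G.add_weighted_edges_from([("A", "B", 35), ("B", "D", 46), ("D", "A", 37), ("D", "C", 21)])

# Draw the graph with edge-weight labels
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=1200, arrows=True)
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, "weight"))
plt.show()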
Output:
To implement the betweenness measure between nodes in a social network with 10 nodes,
we can use the NetworkX library in Python. Betweenness centrality is a measure of a node's
centrality in a graph based on the shortest paths that pass through that node.
Explanation:
a. Graph Construction:
I. We create an undirected graph (nx.Graph()), but you could also use a directed
graph (nx.DiGraph()) if you want to represent directional relationships.
II. Nodes (A, B, C, ..., J) are added to the graph.
III. Edges between nodes represent the relationships (e.g., ('A', 'B') represents a
relationship between node A and node B).
b. Betweenness Centrality: It is computed with nx.betweenness_centrality(G), which returns, for each node, the fraction of all shortest paths in the graph that pass through it.
c. Visualization:
I. The graph is plotted using a spring layout (force-directed layout) where nodes
are placed based on attractive and repulsive forces.
II. The graph is displayed with labels for each node.
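A minimal sketch of the approach described above, assuming an illustrative set of relationships among the 10 nodes (the actual edge list is not specified, so the edges below are only an example):
import networkx as nx
import matplotlib.pyplot as plt

# Undirected social network with 10 nodes (A-J); the edges are illustrative
G = nx.Graph()
G.add_nodes_from("ABCDEFGHIJ")
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "E"),
    ("D", "E"), ("D", "F"), ("E", "G"), ("F", "H"), ("G", "H"),
    ("H", "I"), ("I", "J"),
])

# Betweenness centrality: fraction of all shortest paths that pass through each node
betweenness = nx.betweenness_centrality(G)
for node, score in sorted(betweenness.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}: {score:.3f}")

# Visualization with a spring (force-directed) layout
pos = nx.spring_layout(G, seed=1)
nx.draw(G, pos, with_labels=True, node_color="lightgreen", node_size=1000)
plt.show()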