Big Data Journal
VIVEKANAND EDUCATION SOCIETY’S
INSTITUTE OF TECHNOLOGY
CERTIFICATE
INDEX
THEORY:
What is Hadoop?
Hadoop consists of several key components, which can be grouped into two main
categories: the storage component (HDFS) and the processing component (MapReduce).
Over time, the Hadoop ecosystem has expanded to include additional components for
resource management, data querying, and more.
HDFS Architecture
HDFS is an open-source component of the Apache Software Foundation that manages data storage. Scalability, availability, and replication are its key features. NameNodes, secondary NameNodes, DataNodes, checkpoint nodes, backup nodes, and blocks all make up the architecture of HDFS. HDFS is fault-tolerant because its data is replicated, and files are distributed across the cluster machines by the NameNode and DataNodes. Note the difference between Apache Hadoop and Apache HBase: HBase is a non-relational database that runs on top of Hadoop, whereas Hadoop itself (through HDFS) is a non-relational data store.
3. Secondary NameNode:
○ Activated when NameNode needs to perform a checkpoint to manage disk
space.
○ Merges FsImage and EditLogs periodically to create a new, consistent
FsImage.
○ Stores transaction logs in a single location for easier access and replication
across the cluster.
○ Helps in recovering from a failed FsImage and ensures data can be backed up
and restored.
4. Checkpoint Node:
○ Creates checkpoints by merging FsImage and EditLogs at regular intervals.
○ Provides a consistent image of the file system to the NameNode.
○ Ensures that the directory structure remains consistent with the NameNode.
5. Backup Node:
○ Ensures high availability of data by maintaining a backup of the active
NameNode’s data.
○ Can be promoted to active in case the NameNode fails.
○ Works with replica sets of data for recovery, rather than relying on individual
nodes.
6. Blocks:
○ Data is split into blocks, typically 64 MB to 128 MB in size (128 MB is the default in recent Hadoop versions).
○ Blocks are replicated across multiple DataNodes to ensure fault tolerance.
○ HDFS scales by adding more DataNodes, which automatically take on additional blocks of data.
○ Block replication ensures that even if one DataNode fails, data can still be
recovered from other nodes.
HDFS is able to survive machine crashes and recover from data corruption. HDFS operates on the principle of replication: in the event of a failure, it can continue operating as long as replicas are available. Data is duplicated and stored on different machines in the HDFS cluster, with a replica of every block kept on three DataNodes by default. The NameNode maintains these copies across DataNodes: it keeps track of which blocks are under- or over-replicated and subsequently adds or deletes copies accordingly.
Write Operation
When a client writes a file, the file is first split into blocks, and the client asks the NameNode where each block should be stored. Splitting a file into blocks is what enables the NameNode to optimize its storage capacity, and it also improves fault tolerance and availability. The client then streams each block to the first DataNode in a pipeline: DataNode 1 receives a block and passes it on to DataNode 2, and so on, until all DataNodes in the pipeline have received the data. As DataNodes receive blocks, they acknowledge them and report the block locations back to the NameNode, which records where every block lives so that the file can later be reconstructed. After the last block is written, the DataNodes notify the NameNode that the job is complete, and the complete file becomes available to the application.
Read Operation
To read a file, the client first contacts the NameNode. The NameNode holds only metadata, not file contents, so it replies with the list of DataNodes that hold replicas of each block. The client then reads each block directly from the nearest available DataNode; if a replica is unavailable or corrupt, the client falls back to another DataNode that holds a copy. Finally, the blocks are reassembled on the client side into the original file.
1. It is a highly scalable data storage system, which makes it ideal for data-intensive applications like Hadoop workloads and streaming analytics. Another major benefit of Hadoop is that it is easy to set up, which makes it approachable even for non-technical users.
2. It is very easy to implement, yet very robust. There is a lot of flexibility you get with Hadoop, and it is a fast and reliable file system.
3. This makes Hadoop a great fit for a wide range of data applications, the most common being analytics: you can use Hadoop to process large amounts of data quickly and then analyze it to find trends or make recommendations. The most common type of application that uses Hadoop analytics is data crunching.
4. The cluster can grow in two ways. If many clients need to store data on HDFS, you can scale the cluster horizontally by simply adding more nodes. To scale the cluster vertically, you increase the capacity of the existing nodes. Either way, the enlarged cluster can serve more clients.
5. Storage can be set up as a centralized database, distributed across a cluster of commodity personal computers, or a combination of both. A common setup for this type of virtualization is to create a virtual machine on each of the servers.
6. Automatic data replication can be accomplished with a variety of technologies, including RAID, Hadoop, and database replication. Logging data and monitoring it for anomalies can also help to detect and respond to hardware and software failures.
CODE:
HADOOP COMMANDS
1) Hadoop Version:
Description:
This command prints the version of Hadoop installed on the system.
Command:
hadoop version
2) Make Directory:
Description:
This command creates the directory in HDFS if it does not already exist.
Note: If the directory already exists in HDFS, we will get an error message saying that the file already exists.
Command:
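hadoop fs -mkdir /path/directory_name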
3) Listing Directories:
Description:
This command lists the files and directories at the given path in HDFS.
Command:
hadoop fs -ls /
4) copyFromLocal or put:
Description:
The Hadoop fs shell command put is similar to the copyFromLocal, which copies files or
directory from the local filesystem to the destination in the Hadoop file system.
Command:
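hadoop fs -put /local/path/file.txt /hdfs/destination/path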
5) count:
Description:
The Hadoop fs shell command count counts the number of files, directories, and bytes
under the paths that match the specified file pattern.
Options:
-q – shows quotas (a quota is the hard limit on the number of names and the amount of space used for individual directories)
-u – it limits output to show quotas and usage only
-h – shows sizes in a human-readable format
-v – shows header line
Command:
hadoop fs -count /
6) cat
Description:
The cat command reads the file in HDFS and displays the content of the file on console or
stdout.
Command:
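hadoop fs -cat /path/file.txt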
7) touchz
Description:
The touchz command creates a file in HDFS with a file size of 0 bytes. Here, directory is the name of the directory in which we will create the file, and filename is the name of the new file we are going to create.
Command:
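hadoop fs -touchz /path/directory_name/filename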
8) stat:
Description:
The Hadoop fs shell command stat prints the statistics about the file or directory in the
specified format.
Formats:
%n – file name
%o – block size
%r – replication
%y – modification date
Command:
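hadoop fs -stat "%n %o %r %y" /path/file.txt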
9) checksum
Description:
The Hadoop fs shell command checksum returns the checksum information of a file.
Command:
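hadoop fs -checksum /path/file.txt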
10) usage
Description:
The Hadoop fs shell command usage returns the help for an individual command.
Command:
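hadoop fs -usage mkdir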
11) help
Description:
The Hadoop fs shell command help shows help for all the commands or the specified
command.
Command:
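hadoop fs -help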
13) cp
Description:
The cp command copies a file from one directory to another directory within the HDFS.
Command:
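hadoop fs -cp /source/path /destination/path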
14) mv
Description:
The HDFS mv command moves the files or directories from the source to a destination
within HDFS.
Command:
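hadoop fs -mv /source/path /destination/path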
15) copyToLocal or get:
Description:
This command copies files or directories from HDFS to the local filesystem.
Command:
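hadoop fs -get /hdfs/path /local/path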
16) rmr or rm -r:
Description:
This command recursively deletes a directory and its contents from HDFS. The -rm
command with the -r flag can also be used for this purpose.
Command:
hadoop fs -rm -r /path/directory_name
or
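hadoop fs -rmr /path/directory_name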
17) du:
Description:
This command displays the disk usage of files and directories in HDFS.
Command:
hadoop fs -du /path
18) dus:
Description:
This command provides a summary of disk usage for files and directories in HDFS, typically
aggregating the sizes.
Command:
hadoop fs -dus /path
19) moveFromLocal:
Description:
This command moves files or directories from the local filesystem to HDFS. It is similar to
copyFromLocal, but the source file is deleted after the move.
Command:
hadoop fs -moveFromLocal /local/path /hdfs/path
CONCLUSION:
In conclusion, this practical exercise on HDFS architecture and basic commands has
provided a foundational understanding of how Hadoop's distributed file system manages
and interacts with large-scale data. By mastering commands such as mkdir, put, get, and du,
users can effectively create, manage, and monitor files within HDFS. This knowledge is
essential for leveraging HDFS’s capabilities in real-world big data environments, ensuring
efficient data storage and retrieval.
THEORY:
What is MapReduce?
A MapReduce job is usually composed of three steps (even though it is generalized as the combination of Map and Reduce operations/functions). The MapReduce operations are:
● Map: The input data is first split into smaller blocks. The Hadoop framework then
decides how many mappers to use, based on the size of the data to be processed
and the memory block available on each mapper server. Each block is then assigned
to a mapper for processing. Each ‘worker’ node applies the map function to the
local data, and writes the output to temporary storage. The primary (master) node
ensures that only a single copy of the redundant input data is processed.
● Shuffle, combine and partition: worker nodes redistribute data based on the
output keys (produced by the map function), such that all data belonging to one key
is located on the same worker node. As an optional step, the combiner (a reducer) can run individually on each mapper server to reduce the data on each mapper even further, shrinking the data footprint and making shuffling and sorting easier. Partitioning (not optional) is the process that decides how the data is presented to the reducer and assigns it to a particular reducer.
● Reduce: A reducer cannot start while a mapper is still in progress. Worker nodes process each group of intermediate <key,value> pairs in parallel to produce <key,value> pairs as output. All the map output values that share the same key are assigned to a single reducer, which then aggregates the values for that key. Unlike the map function, which is mandatory because it filters and sorts the initial data, the reduce function is optional.
Architecture
1. Master Node:
● Role: Manages and coordinates the MapReduce jobs. It assigns map and reduce tasks to worker nodes and monitors their progress.
● Responsibilities: Includes scheduling tasks, handling failures, and managing resource allocation.
2. Worker Nodes:
● Role: Execute the map and reduce tasks as assigned by the master node.
● Responsibilities: Each worker node processes data locally to minimize data transfer overhead, performs the map and reduce operations, and reports progress back to the master node.
3. Input Splits:
● Definition: The input data is divided into smaller chunks called splits. Each split is
processed independently by a map task.
● Purpose: Helps in parallel processing and efficient utilization of resources.
4. Data Locality:
● Definition: Map tasks are scheduled, whenever possible, on the nodes that already store their input split, so that computation moves to the data.
● Purpose: Minimizes network transfer overhead and speeds up job execution.
MapReduce Workflow
1. Data Input: Input data is read and split into smaller chunks.
2. Map Tasks: Each map task processes its chunk and emits intermediate key-value
pairs.
3. Shuffle and Sort: The intermediate data is shuffled and sorted by key.
4. Reduce Tasks: Each reduce task processes the sorted data and generates the final
output.
5. Output Storage: The final results are written to an output file or database.
Advantages
● Scalability: Easily scales to handle large datasets by distributing tasks across many
nodes.
● Fault Tolerance: Automatically recovers from node failures by reassigning tasks.
● Simplicity: Provides a simple abstraction for parallel processing without requiring
detailed knowledge of the underlying infrastructure.
Applications
● Data Analysis: Used for large-scale data analysis, such as log processing, web
indexing, and data mining.
● Search Engines: Helps in indexing large volumes of web data.
● Machine Learning: Facilitates distributed training of machine learning models.
Challenges
CODE/ STEPS:
WordCountHeetika.java
package heetika_57;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
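The mapper, reducer, and driver bodies appear as screenshots in the journal; a minimal sketch of the standard Hadoop WordCount implementation consistent with the imports above is shown below (the class and method bodies follow the stock Hadoop example and are assumed rather than copied from the screenshots):

public class WordCountHeetika {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      // Tokenize each input line and emit (word, 1) for every token.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      // Sum all counts emitted for this word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountHeetika.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner reduces map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input path from CLI
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path from CLI
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}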
OUTPUT:
This MapReduce program performs matrix multiplication in two stages: transforming and
multiplying matrices. It involves custom input/output formats and multiple MapReduce
jobs. This process demonstrates the use of MapReduce for complex data transformations
and aggregations.
○ Element: Represents an individual matrix element with a tag indicating the matrix
(M or N), index, and value.
○ Pair: Represents a matrix coordinate, used as a key for the final output.
○ Mapper (MapMN): Reads intermediate data and outputs key-value pairs for final
aggregation.
○ Reducer (ReduceMN): Aggregates values for each key, producing the final matrix
multiplication result.
Execution Flow:
1. Job 1: Transforms matrix data into a format suitable for multiplication and stores
intermediate results.
2. Job 2: Aggregates intermediate results to produce the final matrix multiplication output.
CODE/STEPS:
● Open Eclipse and create a new Java project. Name your project and click Next.
● In the Java project, create a new Java file and paste the code provided:
MatrixMultiplicationHeetika.java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.ReflectionUtils;
class Element implements Writable {
int tag;
int index;
double value;
Element() {
tag = 0;
index = 0;
value = 0.0;
}
@Override
public void readFields(DataInput input) throws IOException {
tag = input.readInt();
index = input.readInt();
value = input.readDouble();
}
@Override
public void write(DataOutput output) throws IOException {
output.writeInt(tag);
output.writeInt(index);
output.writeDouble(value);
}
}
class Pair implements WritableComparable<Pair> {
int i;
int j;
Pair() {
i = 0;
j = 0;
}
Pair(int i, int j) {
this.i = i;
this.j = j;
}
@Override
public void readFields(DataInput input) throws IOException {
i = input.readInt();
j = input.readInt();
}
@Override
public void write(DataOutput output) throws IOException {
output.writeInt(i);
output.writeInt(j);
}
@Override
public int compareTo(Pair compare) {
if (i > compare.i) {
return 1;
} else if (i < compare.i) {
return -1;
} else {
if (j > compare.j) {
return 1;
} else if (j < compare.j) {
return -1;
}
}
return 0;
}
@Override
public String toString() {
return i + " " + j + " ";
}
}
}
}
MultipleInputs.addInputPath(job1, MPath,
TextInputFormat.class, MatrixMapperM.class);
MultipleInputs.addInputPath(job1, NPath,
TextInputFormat.class, MatrixMapperN.class);
job1.setReducerClass(ReducerMN.class);
job1.setMapOutputKeyClass(IntWritable.class);
job1.setMapOutputValueClass(Element.class);
job1.setOutputKeyClass(Pair.class);
job1.setOutputValueClass(DoubleWritable.class);
job1.setOutputFormatClass(TextOutputFormat.class);
FileOutputFormat.setOutputPath(job1, intermediatePath);
job1.waitForCompletion(true);
Create a file named Manifest.txt in the src folder. The content of Manifest.txt should specify
the entry point of your application. For example:
Main-Class: MatrixMultiplicationHeetika
cd /home/cloudera/workspace/MatrixMultiplication
cd src
➔ -cp $(hadoop classpath): Sets the classpath to include all necessary Hadoop
libraries.
● To package your compiled classes and the manifest file into a JAR file, use the
following command:
● Now that you have your JAR file, you can run it on Hadoop to execute your
MapReduce tasks. Execute the Hadoop Jobs (Both should run successfully):
OUTPUT:
Final Output:
CONCLUSION:
MapReduce simplifies large-scale data processing by breaking down complex tasks into
smaller, parallelizable units. Its design emphasizes scalability, fault tolerance, and efficiency,
making it suitable for handling vast amounts of data across distributed computing
environments. The combination of the map and reduce phases, along with a robust
architecture for resource management and fault tolerance, makes MapReduce a powerful
tool for data-intensive applications.
Name of Student : Heetika Mahesh Vaity
● Insert Document
● Query Document
● Indexing
● INSERT DOCUMENT
● QUERY DOCUMENT
● INDEXING
THEORY:
MongoDB Overview
MongoDB Architecture
1. Documents: The basic unit of data in MongoDB is the document, which is stored in a
binary JSON format (BSON). A document is a set of key-value pairs, where the values
can include arrays and other documents, enabling complex data structures.
2. Collections: Documents are grouped into collections, which are analogous to tables
in relational databases. Collections do not enforce a schema, so documents within a
collection can have different fields and structures.
3. Replica Sets: A replica set is a group of MongoDB servers that maintain the same
data set. A replica set includes:
○ Primary: The primary server receives all write operations. It replicates the
data to secondary servers.
○ Secondaries: Secondary servers replicate data from the primary. They can
serve read operations, depending on the configuration, and can be promoted
to primary if the current primary fails.
○ Arbiter: An arbiter is a member of the replica set that does not hold data but
participates in elections to determine the primary.
4. Sharding: Sharding is the process of distributing data across multiple machines. In
MongoDB, sharding is achieved by:
○ Shard: A shard is a subset of the data, distributed across servers.
○ Config Servers: Config servers store the metadata and configuration settings
for the cluster.
○ Query Router (mongos): The query router routes client requests to the
appropriate shard based on the shard key, which determines how data is
distributed across shards.
5. MongoDB Deployment: MongoDB can be deployed in various configurations:
○ Single Server: Suitable for development and testing.
○ Replica Set: Provides high availability and redundancy.
○ Sharded Cluster: Provides scalability by distributing data across multiple
shards.
CRUD operations
CRUD operations in MongoDB refer to the four basic operations you can perform on the
data stored in a MongoDB database: Create, Read, Update, and Delete. Here’s how you can
perform these operations using the MongoDB shell or a MongoDB driver.
1. Create (Insert)
● Insert a Single Document:
db.collectionName.insertOne({
name: "John Doe",
age: 30,
address: "123 Main St"
});
● Insert Multiple Documents:
db.collectionName.insertMany([
  { name: "Jane Doe", age: 25, address: "456 Oak Ave" },
  { name: "Sam Smith", age: 28, address: "789 Pine Rd" }
]);
2. Read (Query)
● Find All Documents in a Collection:
db.collectionName.find();
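● Find Documents Matching a Condition:
db.collectionName.find({ age: { $gt: 25 } });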
3. Update
● Update a Single Document:
db.collectionName.updateOne(
{ name: "John Doe" }, // Filter
{ $set: { age: 31 } } // Update operation
);
● Replace a Document:
db.collectionName.replaceOne(
{ name: "John Doe" }, // Filter
{ name: "John Doe", age: 31, address: "123 Main St" } // New document
);
4. Delete
● Delete a Single Document:
db.collectionName.deleteOne({ name: "John Doe" });
1. Visit the MongoDB Download Center and select the latest version of MongoDB
Community Server.
5. Follow the installation prompts. During the installation, select the option to install
MongoDB as a Service (recommended).
7. Installation complete
8. Run MongoDB: Open a command prompt and type mongod to start the MongoDB
server.
Open another command prompt and type mongosh to connect to the server.
3. Create collection
MongoDB collections are created automatically (implicitly) when you insert data.
Explicit Creation:
> db.createCollection("students");
4. Insert Document
> db.students.insertMany([
  { "name" : "Heetika", "age" : 20, "major" : "IT", "gpa" : 9.8 },
  { "name" : "Tanish", "age" : 22, "major" : "CS", "gpa" : 8.0 },
  { "name" : "John", "major" : "Commerce", "gpa" : 8.5 },
  { "name" : "Pragati", "major" : "CS", "gpa" : 9.5 }
]);
5. Query Document
>db.students.find()
6. Delete Document
> db.students.remove({"name" : "John"})
> db.students.find().pretty()
7. Update Document
> db.students.update({'name': 'Tanish'}, {$set: {'major' :
"Chemistry"}})
8. Indexing document
>db.students.createIndex({"gpa":1})
This creates an index on the gpa field. { gpa: 1 }: The 1 indicates that the index will be
created in ascending order. If you wanted a descending order, you would use -1.
> db.students.getIndexes()
> db.students.dropIndexes()
CONCLUSION:
In this practical, we successfully performed various CRUD operations using MongoDB. We explicitly
created a collection and inserted documents into it. We also practiced querying documents with
specific criteria and enhanced query performance by creating indexes. These operations
demonstrated the flexibility and efficiency of MongoDB as a NoSQL database, highlighting its
suitability for handling large datasets with complex, unstructured data.
Name of Student : Heetika Mahesh Vaity
THEORY:
Introduction to Hive
Apache Hive is a data warehousing and SQL-like query language tool that is built on top of
Apache Hadoop. It facilitates reading, writing, and managing large datasets stored in
distributed storage using SQL. Hive abstracts the complexity of Hadoop's underlying system
by providing a user-friendly interface that allows users to write queries in HiveQL, which is
similar to SQL. This makes it accessible to users who are familiar with traditional relational
databases.
Hive Architecture
● Metastore:
○ The metastore stores metadata about the databases, tables, partitions,
columns, and data types. This metadata is critical for Hive’s operation and
allows it to understand how to interpret and query the data.
● Driver:
○ The driver manages the lifecycle of a HiveQL statement, including parsing,
compiling, optimizing, and executing queries. It also interacts with the query
compiler and execution engine.
● Compiler:
○ The compiler translates HiveQL statements into a directed acyclic graph
(DAG) of MapReduce jobs, Tez tasks, or Spark jobs, depending on the
execution engine being used.
● Execution Engine:
○ The execution engine executes the tasks generated by the compiler and
interacts with Hadoop's underlying data processing engines, such as
MapReduce, Tez, or Spark.
Applications of Hive
● Batch Processing:
○ Hive is well-suited for batch processing jobs where large datasets need to be
processed in bulk. It is not designed for low-latency querying, making it more
suitable for reporting and analysis tasks.
● Data ETL (Extract, Transform, Load):
○ Hive can be used for data transformation and loading tasks. It can process
raw data from various sources, transform it into a desired format, and store it
in tables for analysis.
● Log Analysis:
○ Hive is commonly used to analyze large volumes of log data, such as web
server logs or application logs, to extract meaningful insights and patterns.
● Data Warehousing:
○ Hive is ideal for building data warehouses in big data environments, enabling
businesses to store and analyze large amounts of structured and
semi-structured data.
Limitations of Hive
● High Latency:
○ Hive is designed for batch processing and is not optimized for real-time
queries. Queries can take a long time to execute, especially on large datasets.
● Complexity of Joins:
○ Joining large tables in Hive can be resource-intensive and time-consuming, as
it relies on MapReduce or other distributed computing frameworks.
● Limited Transactional Support:
○ While Hive does support ACID transactions, it is limited in comparison to
traditional RDBMS, making it less suitable for applications requiring complex
transactions.
Hive Databases:
Hive supports the concept of databases, which are logical containers for organizing tables,
views, and other database objects. Creating a database helps segregate data logically and
avoid name collisions between objects.
Syntax:
CREATE DATABASE database_name;
Hive Tables:
Tables in Hive are similar to tables in a relational database and can be created using the
CREATE TABLE statement. Each table is associated with a directory in HDFS, and the data
is stored in files within that directory.
Syntax:
CREATE TABLE table_name (
column1 datatype,
column2 datatype,
...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 'delimiter'
STORED AS file_format;
Example: CREATE TABLE employee (id INT, name STRING, age INT)
STORED AS TEXTFILE;
Tables in Hive can be either Managed Tables (where Hive manages the lifecycle of the table
and data) or External Tables (where the data is stored externally and Hive only manages
the metadata).
Hive Partition:
Partitioning is a technique in Hive used to divide a table into smaller parts based on the
value of one or more columns. Each partition corresponds to a unique value or a
combination of values, and Hive creates separate directories for each partition in HDFS.
Partitioning improves query performance by allowing queries to scan only the relevant
partitions instead of the entire table.
Syntax:
CREATE TABLE table_name (
column1 datatype,
column2 datatype,
...
)
PARTITIONED BY (partition_column datatype)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY 'delimiter'
STORED AS file_format;
Adding Partitions:
ALTER TABLE table_name ADD PARTITION (partition_column='value')
LOCATION 'hdfs_path';
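Queries that filter on the partition column then scan only the matching partitions, for example:
SELECT * FROM table_name WHERE partition_column = 'value';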
Advantages of Partitioning:
Hive provides a variety of built-in functions that are essential for data processing and
transformation. These functions are categorized into:
● Aggregate Functions:
○ Examples: MAX(), COUNT(), SUM(), AVG()
○ Usage: SELECT COUNT(*) FROM employee;
● String Functions:
○ Examples: UPPER(), LOWER(), CONCAT(), SUBSTRING()
○ Usage: SELECT UPPER(name) FROM employee;
● Mathematical Functions:
○ Examples: ROUND(), CEIL(), FLOOR(), ABS()
○ Usage: SELECT ROUND(salary, 2) FROM employee;
● Date Functions:
○ Examples: CURRENT_DATE(), YEAR(), MONTH()
○ Usage: SELECT YEAR(hire_date) FROM employee;
● Conditional Functions:
○ Examples: IF(), CASE...WHEN, COALESCE()
○ Usage: SELECT IF(age > 30, 'Senior', 'Junior') FROM
employee;
Hive Operators:
Hive Views:
A view in Hive is a logical, virtual table that is derived from a query. Unlike tables, views do
not store data themselves; instead, they store the SQL query used to generate the data.
Views are useful for simplifying complex queries, enforcing security (by limiting access to
specific columns or rows), and abstracting underlying table structures.
Syntax:
CREATE VIEW view_name AS SELECT columns FROM table_name WHERE
conditions;
Querying a View:
SELECT * FROM view_name;
Hive Index:
Indexing in Hive is used to improve the speed of query operations on large datasets.
Indexes are created on specific columns of a table to allow faster data retrieval.
Hive supports several types of indexes, including Compact Indexes and Bitmap Indexes.
However, indexing in Hive is less commonly used compared to traditional databases due to
its performance overhead, and it's often considered based on specific use cases.
Syntax:
CREATE INDEX index_name
ON TABLE table_name (column_name)
AS 'index_type'
WITH DEFERRED REBUILD;
Show Index:
SHOW INDEX ON table_name;
Dropping an Index:
DROP INDEX index_name ON table_name;
● Show databases
● Show tables
● Use the LOAD DATA command to load data from the local file system into the
Hive table.
○ LOCAL: Indicates that the file is located on the local filesystem of the machine where
Hive is running.
○ INPATH 'employee.csv': Specifies the path to the file to be loaded. Since it’s a local
file, the path should be relative to the machine where the Hive CLI or Beeline is being
executed.
○ INTO TABLE employee57: Specifies the Hive table into which the data should be
loaded.
● Describe table
● Max() -
● sum() -
● limit
● Operators
HCATALOG:
HCatalog is a table and storage management layer that sits on top of Apache Hive and
provides a unified interface for accessing data stored in various formats in Hadoop. It is
designed to facilitate interoperability between different data processing tools in the
Hadoop ecosystem, such as Pig, MapReduce, and Hive, by providing a consistent view of
data stored in HDFS (Hadoop Distributed File System).
● Create Table
● Show tables
● Describe table
● Drop table
JOIN OPERATION:
● INNER JOIN
Hive Partition:
● Creating a table partitioned by department values
Hive View:
● Creating and Querying a View
Hive Index:
● Create index
● Drop index
CONCLUSION:
Hive simplifies the process of managing and querying large datasets in a Hadoop
ecosystem. Its SQL-like language, scalability, and integration with Hadoop make it a
powerful tool for big data analytics. However, its high latency and limitations in complex
queries mean it is best suited for batch processing and data warehousing tasks rather than
real-time analytics.
Name of Student : Heetika Mahesh Vaity
Title of LAB Assignment : To create a Pig Data Model, Read and Store
Data and Perform following Pig Operations,
1. Pig Latin Basic
2. Pig Data Types,
3. Download the data
4. Create your Script
5. Save and Execute the Script
6. Pig Operations : Diagnostic Operators, Grouping and Joining, Combining &
Splitting, Filtering, Sorting
AIM: TO CREATE A PIG DATA MODEL, READ AND STORE DATA AND
PERFORM FOLLOWING PIG OPERATIONS:
1. PIG LATIN BASIC
2. PIG DATA TYPES,
3. DOWNLOAD THE DATA
4. CREATE YOUR SCRIPT
5. SAVE AND EXECUTE THE SCRIPT
6. PIG OPERATIONS : DIAGNOSTIC OPERATORS, GROUPING AND JOINING,
COMBINING & SPLITTING, FILTERING, SORTING
THEORY:
Apache Pig is a high-level platform for processing and analyzing large datasets. It provides
an abstraction over the complexities of MapReduce, allowing users to write data
transformation scripts using Pig Latin, a high-level language that makes data processing
simpler and more intuitive. Pig is often used in scenarios where large-scale data analysis is
required, such as in ETL (Extract, Transform, Load) processes, data preparation for
machine learning, and log data analysis.
The Pig Data Model represents the structure of data in Apache Pig. It consists of the
following elements:
1. Atom: The simplest data type in Pig, representing a single value, such as a number or
a string. Examples include int, long, float, double, chararray, and bytearray.
2. Tuple: A record that consists of a sequence of fields. Each field can be of any data
type, including another tuple or a bag.
3. Bag: An unordered collection of tuples. Bags can contain duplicate tuples and are
used to represent relations in Pig.
4. Map: A set of key-value pairs where keys are chararray and values can be any data
type. Maps are useful for semi-structured data like JSON.
Pig Latin is the language used to write scripts for data processing in Apache Pig. It supports
a wide range of operations, including loading data, filtering, grouping, joining, and more. Pig
Latin scripts are made up of a series of statements that describe a sequence of
transformations to be applied to the data.
Data in Pig is typically loaded from external storage, such as HDFS, using the LOAD
statement. The LOAD statement allows you to specify the file path, the storage format, and
the schema of the data. After processing, the transformed data can be stored back into
external storage using the STORE statement.
Example:
data = LOAD 'data.csv' USING PigStorage(',') AS (id:int, name:chararray, age:int);
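After processing, results can be written back to storage with STORE, for example (the output directory name is a placeholder):
STORE data INTO 'output_dir' USING PigStorage(',');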
Pig Operations
1. Diagnostic Operators:
○ DESCRIBE: Displays the schema of a relation.
○ DUMP: Outputs the content of a relation to the console.
2. Grouping and Joining:
○ Grouping: Groups records based on a common field.
○ Joining: Combines records from two or more relations based on a common
key.
3. Combining & Splitting:
○ Union: Combines the contents of two or more relations into one.
○ Split: Divides a relation into multiple parts based on specified conditions.
4. Filtering:
○ Filters data based on a condition, allowing only the records that meet the
condition to pass through.
5. Sorting:
○ Sorts the records in a relation based on one or more fields.
CODE/OUTPUT:
● INNER JOIN
grunt> InnerJoin= JOIN student_info by city, InternData by city;
grunt> dump InnerJoin;
● LEFT JOIN
grunt> LeftJoin= JOIN student_info by city LEFT, InternData by city;
grunt> dump LeftJoin;
● RIGHT JOIN
grunt> RightJoin= JOIN student_info by city RIGHT OUTER, InternData by city;
grunt> dump RightJoin;
● FULL JOIN
grunt> FullJoin= JOIN student_info by city FULL OUTER, InternData by city;
grunt> dump FullJoin;
● ORDER BY
grunt> B5= ORDER student_info by id desc;
grunt> dump B5;
● LIMIT
grunt> C = LIMIT B 3;
grunt> dump C;
● UNION
grunt> C = LIMIT B 3;
grunt> C1 = LIMIT B 5;
grunt> UnionData= union C, C1;
grunt> dump UnionData;
● FILTER
grunt> filter_city = FILTER student_info by city == 'Mumbai';
grunt> dump filter_city;
● DISTINCT
grunt> distinct_cities= Distinct all_city;
grunt> dump distinct_cities;
CONCLUSION:
Apache Pig provides a powerful yet simple way to process large datasets. By using Pig Latin,
users can perform complex data transformations without needing to write complex code.
The Pig Data Model allows for flexible and scalable data handling, making it suitable for a
wide range of data processing tasks. Understanding and applying the various operations
available in Pig enables efficient data analysis and processing in a distributed environment.
Name of Student : Heetika Mahesh Vaity
THEORY:
Apache Spark is an open-source, distributed computing system designed for large-scale
data processing. Developed at UC Berkeley’s AMPLab, Spark extends the MapReduce model
to provide faster and more comprehensive data processing capabilities. Its in-memory
computation and high-level APIs make it ideal for big data applications, offering a unified
engine that supports a wide range of data analytics tasks.
1. Speed:
○ Spark can process data up to 100 times faster than Hadoop MapReduce,
primarily due to its in-memory processing capability.
○ It reduces the need for repeated disk read/writes by keeping intermediate
data in memory whenever possible.
2. Ease of Use:
○ Spark provides high-level APIs in several programming languages, including
Java, Python, Scala, and R.
○ The DataFrame API offers the ease of SQL-like querying, while RDDs
(Resilient Distributed Datasets) enable functional programming constructs.
3. Unified Engine:
○ Spark offers a unified platform that supports various big data processing
tasks such as batch processing, stream processing, machine learning, and
graph processing.
○ Its core components include Spark SQL, Spark Streaming, MLlib (for machine
learning), and GraphX (for graph processing).
4. Fault Tolerance:
○ Spark achieves fault tolerance through lineage, meaning that lost data can be
recomputed based on its transformation history.
○ Spark’s RDDs maintain the transformations applied to data, so even if nodes
fail, the system can rebuild the lost partitions automatically.
1. Spark Core:
○ The Spark Core is responsible for basic data processing, including memory
management, task scheduling, fault recovery, and interacting with storage
systems like HDFS, S3, and others.
○ RDDs form the core abstraction, enabling fault-tolerant, distributed data
storage.
2. Spark SQL:
○ Spark SQL allows users to run SQL queries and interact with structured data
through DataFrames.
○ It also supports a variety of data sources such as Hive, Parquet, ORC, and
JSON.
3. Spark Streaming:
○ Spark Streaming enables real-time data processing and analytics by breaking
incoming data into mini-batches and processing them using Spark’s core API.
4. MLlib:
○ MLlib is Spark’s library for scalable machine learning algorithms, including
classification, regression, clustering, and collaborative filtering.
5. GraphX:
○ GraphX is the component that allows for the processing and analysis of
large-scale graph data, offering tools for graph-parallel computations.
1. Data Processing: Spark is used for ETL tasks, large-scale data processing, and
real-time data analytics in industries like finance, healthcare, and e-commerce.
2. Machine Learning: Spark’s MLlib library is used for training machine learning
models at scale.
3. Graph Processing: GraphX enables large-scale graph analytics for social networks
and other graph-based data.
CODE/OUTPUT:
Open the Cloudera terminal, change directory to the Desktop, create a text file, and put it into HDFS.
● [cloudera@quickstart ~]$ gedit heetika.txt
● scala> sc.appName
● scala> heetikaData.count()
Change to uppercase:
● scala> val newHeetikaData = heetikaData.map(line => line.toUpperCase)
Filter file:
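(The filter command itself appears as a screenshot; with a hypothetical predicate it would resemble:)
● scala> val filteredData = heetikaData.filter(line => line.contains("Heetika"))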
List of numbers:
● scala> val heetDataNum = sc.parallelize(List(10,20,30))
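(The defining command for heetMapFunc appears as a screenshot; assuming a simple doubling map, it would resemble:)
● scala> val heetMapFunc = heetDataNum.map(x => x * 2)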
● scala> heetMapFunc.collect()
Sequence of names:
● scala> val name = Seq("Heetika", "Vaity")
● scala> name.flatMap(_.toLowerCase)
● scala> heetData.collect()
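(The defining commands appear as screenshots; assuming the usual word-count pattern, they would resemble:)
● scala> val map_heetData = heetData.map(word => (word, 1))
● scala> val reduce_heetData = map_heetData.reduceByKey(_ + _)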
● scala> map_heetData.collect()
● scala> reduce_heetData.collect()
Note: Write the code → press Ctrl+S → press Enter. Also give proper indentation.
Code to collect all ranks (note: this command may take a few minutes to run completely)
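(The script itself appears as a screenshot; a minimal sketch of the classic Spark PageRank, assuming an input file of "source target" link pairs — the file name below is hypothetical — would resemble:)
>>> from operator import add
>>> lines = sc.textFile("page_links.txt")  # one "source target" pair per line
>>> links = lines.map(lambda l: tuple(l.split()[:2])).distinct().groupByKey().cache()
>>> ranks = links.mapValues(lambda _: 1.0)  # every page starts with rank 1.0
>>> for _ in range(10):  # fixed number of iterations
...     contribs = links.join(ranks).flatMap(
...         lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
...     ranks = contribs.reduceByKey(add).mapValues(lambda s: 0.15 + 0.85 * s)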
>>> ranks.take(5)
>>> ranks.saveAsTextFile('page_ranks_output_57B')
CONCLUSION:
Apache Spark is a versatile and powerful platform that simplifies big data processing by
providing a unified engine with in-memory computation capabilities. Its ability to handle
diverse workloads (batch, streaming, machine learning, and graph processing) on a single
platform makes it a go-to solution for large-scale data analytics.
Name of Student : Heetika Mahesh Vaity
THEORY:
Description:
● Power BI is a powerful business analytics tool developed by Microsoft that allows
users to visualize data, create reports, and share insights across their organizations.
● It integrates with various data sources and provides features for creating interactive
dashboards, custom reports, and analytics.
● It helps organizations make informed, data-driven decisions by connecting to various data sources, transforming and modeling the data, and presenting insights in a visually appealing and easy-to-understand format.
● It allows users to create custom calculations and measures within their reports.
7. Power BI Dashboard
● A single-page visual interface that consolidates visualizations from different reports
into one view.
● Dashboards offer high-level insights with links to deeper reports for detailed
analysis.
8. Power BI Embedded
● An API service that allows developers to integrate Power BI reports and dashboards
into custom applications.
9. Row-Level Security (RLS)
● A feature that allows restricting data access for users based on roles, ensuring that
sensitive information is protected.
10. Sharing and Collaboration
● Power BI enables sharing reports within an organization or externally through
Power BI service and publishing to the web.
● It also integrates with Microsoft Teams for collaboration.
Use Cases:
● Business Analytics: Visualize sales trends, monitor key performance indicators
(KPIs), and track business growth.
● Finance: Generate reports on financial performance, budgeting, and forecasting.
● Marketing: Analyze campaign effectiveness, customer engagement, and website
traffic data.
● Supply Chain: Track inventory, optimize logistics, and monitor supplier
performance.
● Human Resources: Visualize employee data, analyze retention, and monitor
productivity metrics.
Power BI Desktop is a free application that allows users to create reports and dashboards.
Steps:
● Visit the official Power BI download page.
After installation, you'll need to configure Power BI Desktop to connect to data sources and
set up workspaces for educational purposes.
Steps:
● Launch Power BI Desktop.
CONCLUSION:
Power BI transforms raw data into actionable insights through visually appealing reports
and dashboards, allowing organizations to make better data-driven decisions.
Name of Student : Heetika Mahesh Vaity
THEORY:
Power BI
Power BI is a business analytics tool developed by Microsoft that allows users to visualize
data, share insights, and create interactive reports and dashboards. It is widely used by
organizations for data-driven decision-making, enabling users to turn raw data into
meaningful insights. Power BI offers various services and components that cater to
different levels of users—from data analysts and business professionals to developers.
1. Data Connectivity:
○ Power BI can connect to a wide range of data sources, including Excel, SQL
Server, Azure, Oracle, web APIs, and many more. This allows users to
aggregate and analyze data from different systems in one place.
2. Interactive Visualizations:
○ Power BI offers a variety of built-in visualizations (such as charts, graphs,
tables, and maps) to help users create interactive and dynamic reports. Users
can also use custom visuals from Power BI’s marketplace.
○ Visuals are highly customizable, allowing users to change colors, labels, axes,
and more to match their analytical needs.
3. Power Query Editor:
○ This is where users can perform data cleaning and transformation tasks. It
allows for tasks such as removing duplicates, renaming columns, splitting
columns, and combining data from multiple sources. Power Query makes
preprocessing data user-friendly with a simple, step-by-step process.
4. Data Analysis Expressions (DAX):
○ DAX is a formula language used in Power BI for creating custom calculations,
measures, and columns. It is similar to Excel formulas but is more powerful
and optimized for relational data.
○ With DAX, users can perform complex calculations, such as calculating
growth rates, creating running totals, or filtering data for specific time
periods.
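For instance, a measure for total sales could be written as follows (the Sales table and its column names are hypothetical):
Total Sales = SUMX(Sales, Sales[Quantity] * Sales[Unit Price])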
5. Power BI Desktop:
○ Power BI Desktop is a free application for PCs that enables users to connect
to data, transform it, and build reports with visuals. It is the primary tool for
designing and creating reports before they are published and shared.
6. Power BI Service (Cloud):
○ Power BI Service is a cloud-based platform where users can publish, share,
and collaborate on reports and dashboards. Users can access their data and
reports from anywhere via the web or mobile devices.
○ With the Power BI Service, users can set up automatic data refreshes,
ensuring that their reports always show up-to-date information.
7. Power BI Mobile:
○ Power BI has mobile apps available for iOS, Android, and Windows devices.
This allows users to access their reports and dashboards on the go, ensuring
they can monitor key metrics from anywhere.
8. Power BI Pro:
○ Power BI Pro is a paid subscription that adds collaboration features to Power
BI. It allows users to share reports and dashboards with others, collaborate in
workspaces, and distribute content across the organization.
9. Power BI Premium:
○ This subscription provides additional capabilities, including larger storage,
higher data refresh rates, and the ability to share reports with users who
don’t have a Power BI Pro license. It also includes features like AI
capabilities and Paginated Reports for highly detailed report layouts.
10. Real-Time Analytics:
○ Power BI allows for real-time data updates, where data streams from sources
like IoT devices or live databases can be used to provide up-to-the-minute
insights on dashboards.
11. Security and Data Governance:
○ Power BI includes security features such as row-level security (RLS), which
restricts access to specific data based on user roles. This ensures that
sensitive information is only visible to authorized users.
○ Integration with Azure Active Directory (AAD) allows organizations to
control access and apply security measures.
12. Natural Language Querying (Q&A):
○ Power BI offers a Q&A feature where users can ask questions in plain
language (e.g., "What were the total sales last year?"), and Power BI will
generate a relevant visual or answer.
13. AI and Machine Learning:
Data Preprocessing in Power BI involves a series of activities that transform raw data into a
clean, organized, and usable format, ready for analysis. It is a critical step to ensure data
quality and integrity, which improves the accuracy and reliability of insights generated by
reports or dashboards. Power BI provides several built-in tools and features, primarily in
Power Query Editor, that enable users to perform various preprocessing tasks.
1. Data Importing: Power BI supports data import from multiple sources such as Excel,
SQL databases, web services, CSV files, and more. During the import phase, it's
important to connect to the correct data source and understand the structure of the
data.
2. Data Cleaning:
○ Removing Duplicates: Identifying and eliminating duplicate rows to prevent
redundant records.
○ Handling Missing Values: Replacing missing values with a default value,
mean, or interpolating them.
○ Removing Unwanted Columns: Dropping unnecessary columns that do not
contribute to the analysis.
3. Data Transformation:
○ Data Type Conversion: Ensuring that all columns have the appropriate data
types (e.g., dates, numbers, text) for correct processing.
○ Splitting and Merging Columns: Splitting columns into multiple parts (e.g.,
full name into first and last name) or merging columns into one (e.g.,
combining city and state into a single location column).
○ Filtering Data: Removing unnecessary rows based on conditions to reduce
noise and focus on relevant data.
4. Data Aggregation:
○ Summarizing data (e.g., total sales by month, average transaction value) to
create meaningful metrics.
○ Grouping data by categories (e.g., region, product category) to identify trends
and patterns.
5. Data Normalization:
○ Standardizing data values to ensure consistency, especially when dealing
with data from multiple sources (e.g., date formats, currency symbols).
6. Creating Calculated Columns and Measures:
○ Power BI allows users to create custom calculations using DAX (Data Analysis
Expressions), which can generate new columns or measures based on
existing data (e.g., profit margin, sales growth rate).
7. Data Merging and Joining:
○ Combining multiple datasets through joins (inner, outer, left, right) or merges
to create a single, comprehensive data table for analysis.
8. Data Validation:
○ Verifying the accuracy and consistency of the data by checking for outliers,
invalid data entries, or inconsistencies in formats and categories.
DATASET:
Link: https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales
Context: The growth of supermarkets in the most populated cities is increasing, and market competition is high. The dataset contains the historical sales of a supermarket company, recorded in 3 different branches over 3 months. Predictive data analytics methods are easy to apply to this dataset.
Attribute information
CODE/STEPS:
1. Importing Data
a. Data Sources: Power BI allows you to import data from various sources like
Excel, SQL, Web, etc.
b. Steps:
i. Open Power BI Desktop.
ii. Go to the Home tab.
Removed Invoice-ID
Before:
After:
● Change Data Types: Change the data types (e.g., Text, Number, Date).
We can change the data type of specific columns if needed. Here we change Unit price to Fixed decimal number.
Steps:
i. Select a column with text data.
ii. Use Transform Tab options such as:
● Format Column:
This option allows you to apply transformations to text data, such as converting text to
uppercase, lowercase, or capitalizing each word.
Go to Transform > Format
Before:
After:
● Replace Values: Replace specific values (e.g., replacing "null" with blank).
Right-click the column → choose Replace Values.
Before:
After:
After:
Steps:
i. Select a numeric column in the Power Query Editor.
ii. Use Transform Tab options:
1. Statistics: Calculate statistics like mean, median, standard
deviation, etc.
Before:
After:
After:
After:
Example:
● If Total > 5000, then “High”
● If Total > 1000, then “Medium”
● Else, label it as “Low”.
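In the Power Query formula language (M), the equivalent custom-column expression would be (assuming the column is named Total):
if [Total] > 5000 then "High" else if [Total] > 1000 then "Medium" else "Low"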
Before:
After:
4. Creating Dashboards
A dashboard is a collection of charts and visuals that give insights at a glance.
Steps:
a. Combine different visualizations (charts, tables, maps) into one page.
b. Arrange them in a layout that tells a clear story.
c. Use Slicers or Filters to allow dynamic changes to the dashboard.
Dashboard
Sales Performance Overview: The sales have shown a steady trend with significant
contributions from the Fashion Accessories and Food and Beverages categories. Payment
preferences are diverse, but Cash leads slightly over E-wallet and Credit card.
Interestingly, there is a relatively balanced split between Members and Non-Members,
with Non-Members contributing marginally higher to total sales. Sales spikes are noticeable
on certain days of the month, suggesting possible promotional activities or market demand
shifts.
Chart Insights:
Key Insights:
➢ The dominance of Fashion Accessories and Food and Beverages highlights strong
customer demand for these categories, making them focal points for future sales
strategies.
➢ The daily sales fluctuations suggest that certain days may benefit from targeted
promotions to smooth out sales dips, while peak days may reveal opportunities for
deeper engagement.
➢ Non-Members slightly outperform Members, indicating potential to boost
member-driven sales through exclusive offers or loyalty programs.
CONCLUSION:
Data preprocessing in Power BI is a fundamental step that ensures the data is in a clean,
organized, and ready-to-use format for effective analysis. Proper preprocessing
activities—such as cleaning, transforming, aggregating, and validating data—lead to higher
data quality and more accurate insights. Power BI's Power Query Editor provides a robust
environment for these tasks, allowing users to perform complex data transformations
without writing code. This functionality ensures that even non-technical users can
preprocess data efficiently, enabling them to generate meaningful reports and dashboards
that drive data-driven decision-making.
Name of Student : Heetika Mahesh Vaity
THEORY:
Power BI is a powerful business intelligence tool that enables users to extract, transform,
and analyze data from various sources. Learning how to handle tables and queries is crucial
for making data-driven decisions, and it involves a few key concepts and operations.
1. Tables in Power BI
Operations on Tables:
● Loading Tables: Power BI can import data from different sources. You use the Get
Data feature to connect to data sources and load tables.
● Relationships between Tables: When working with multiple tables, relationships
define how data is related. This allows Power BI to perform calculations and
aggregations across multiple tables. For example, linking a Customer ID from a
Sales table to a Customers table lets you analyze sales based on customer
attributes like gender or city.
● Table Formats and Data Types: Tables in Power BI consist of columns, each having
a specific data type such as Text, Number, or Date. Correct formatting ensures
accurate calculations and visualizations.
Power Query Editor:
● Purpose: Power Query Editor is used to clean, transform, and reshape data before loading it into Power BI. It allows users to write queries without knowing SQL or other programming languages.
● Key Functions:
○ Transformations: These include renaming columns, filtering data, changing
data types, and removing duplicate rows.
○ Combining Data: Power Query enables merging or appending data from
different tables or queries.
● Merging Queries: This operation is similar to a SQL JOIN. It combines data from two tables based on a common column (such as Invoice ID or Customer ID). This is useful when you want to combine related information stored in different tables; for example, you might merge a Sales table with a Products table to analyze product-level sales (see the pandas sketch after this list).
● Appending Queries: This operation stacks tables vertically, increasing the number
of rows. It’s used when you have data split across multiple tables with the same
structure. For example, if sales data is stored in separate tables for different
branches, you can append them to create a single dataset for analysis.
● Adding Custom Columns: Power BI allows users to add calculated columns using
expressions in DAX (Data Analysis Expressions). For instance, you can create a new
column for the total sales amount by multiplying Quantity by Unit Price.
● Removing and Reordering Columns: Unnecessary columns can be removed to
streamline the data model and improve performance. Reordering helps in
organizing columns logically.
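These operations map naturally onto DataFrame operations. Here is a minimal pandas analogy of merging, appending, and adding a custom column (the table and column names are illustrative, not the exact dataset schema):

import pandas as pd

sales = pd.DataFrame({"ProductID": [1, 2], "Quantity": [3, 5], "UnitPrice": [10.0, 4.5]})
products = pd.DataFrame({"ProductID": [1, 2], "ProductLine": ["Classic", "Vintage"]})
branch_b = pd.DataFrame({"ProductID": [3], "Quantity": [2], "UnitPrice": [7.0]})

# Merging Queries ~ SQL JOIN: combine columns from two tables on a common key.
merged = sales.merge(products, on="ProductID", how="inner")

# Appending Queries ~ SQL UNION: stack rows of tables with the same structure.
appended = pd.concat([sales, branch_b], ignore_index=True)

# Adding a custom column, e.g. Total = Quantity * Unit Price.
merged["Total"] = merged["Quantity"] * merged["UnitPrice"]
print(merged, appended, sep="\n\n")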
6. Performance Considerations
● Efficient Queries: Using filters and transformations smartly reduces the amount of
data being loaded, which can improve the performance of reports and dashboards.
For example, instead of loading entire datasets, apply filters to load only relevant
data (e.g., filtering for data from a specific year or branch).
● Query Folding: Power BI attempts to push as many transformations as possible
back to the data source (especially in relational databases), ensuring that operations
are performed at the source level, which enhances performance.
● After transforming data in Power Query, it is loaded into the Power BI Data Model.
Data can be refreshed to reflect updates in the source data without redoing the
transformations.
● You can set scheduled refreshes in Power BI Service to automatically update the data
at defined intervals.
Dataset Link:
https://github.com/Ayushi0214/Datasets/blob/main/classic_models_dataset.zip
CODE/ STEPS:
Load Dataset:
Rename the table here and do the same for the remaining datasets:
➢ In the dialog box, select the Products table and the matching columns, then select the type of join (here: Inner Join).
➢ Calculating margin from Buy Price and Quantity by creating a custom column
➢ Calculating Profit
Append Queries: This operation is similar to a SQL UNION. It stacks rows from two or
more tables on top of each other. Append is used when the structure of both tables is the
same, but the data might represent different periods or entities (e.g., sales data from
different branches or time periods).
2. Column Formats
Column formatting refers to defining how data is displayed and processed. Different types
of data require different formats to ensure proper analysis:
● Text Format: Used for categorical data like Invoice ID, Customer Type, or Branch.
Text data is treated as labels and is not suitable for calculations.
● Numeric Format: Used for values that require mathematical operations. Examples
include Quantity, Unit Price, Total, Gross Income, etc.
● Date Format: For time-related data like Date and Time, which allows the creation of
time-based analyses like trends and comparisons.
● Proper formatting ensures correct data aggregation and calculations in reports and
visualizations.
➢ Result:
3. Creating a Table
Creating a table in Power BI can be done by importing data or manually entering data.
Manually created tables are often used for reference purposes, like a lookup table
containing Product Line or Branch details. This helps in organizing and categorizing data,
making it easier to perform analyses like product-line-wise or branch-wise sales
performance.
Pivoting: Pivoting transforms data from rows into columns, allowing you to summarize and
aggregate the data in a different structure. For example, pivoting sales data by Branch
might transform rows of Branch A, B, and C into separate columns, making it easier to
compare sales performance across branches.
Unpivoting: Unpivoting is the opposite of pivoting. It converts columns into rows, which
can help when analyzing multiple variables that were originally separate. For example,
unpivoting Sales by year (where each year is a separate column) would make all sales
records appear as rows, making time-based analysis easier.
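As a concrete (if simplified) illustration, the pandas equivalents of these two operations look like this; the Branch/Year sales table is invented for the example:

import pandas as pd

# Invented branch-level sales data for illustration.
sales = pd.DataFrame({
    "Branch": ["A", "B", "A", "B"],
    "Year": [2023, 2023, 2024, 2024],
    "Sales": [100, 150, 120, 170],
})

# Pivoting: Branch values in rows become separate columns.
pivoted = sales.pivot(index="Year", columns="Branch", values="Sales")

# Unpivoting: the per-branch columns melt back into rows.
unpivoted = pivoted.reset_index().melt(id_vars="Year", value_name="Sales")
print(pivoted, unpivoted, sep="\n\n")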
➢ Suppose we have data like this and we are asked to convert the table to a vertical layout:
Data Model: A data model in Power BI defines how different tables relate to each other.
This is essential for combining data from multiple tables into meaningful reports. The data
model is created by establishing relationships between tables based on shared columns
(called keys).
➢ Go to Model View
➢ Since the Orders and Customers tables share the common column “CustomerNumber”, we can drag this column from Orders onto the Customers table to create the relationship.
Relationships: Power BI enables you to create and manage relationships between tables. Relationships define how data in different tables connects, allowing Power BI to perform calculations and aggregations across them. Common relationship types are described below.
➢ This is a one-to-many relationship between the Customers and Orders tables, as one customer can have multiple orders.
● Cardinality: Refers to the nature of the relationship between two tables. Power BI
supports three types of cardinality:
○ One-to-Many: One value in a column corresponds to many values in another
column. This is the most common relationship type.
○ Many-to-One: Reverse of one-to-many, where multiple records in one table
are linked to a single record in another.
○ Many-to-Many: Both tables can have multiple matching records, commonly
used when there is no single unique key.
● Cross-Filter Direction: This determines how filters applied to one table affect the
related table. There are two types:
○ Single Direction: Filters flow in one direction only. This is useful when one table is dependent on another (e.g., filtering products by product category).
○ Both Directions: Filters flow both ways between the two tables, which supports more complex models but should be used carefully to avoid ambiguity.
CONCLUSION:
Handling tables and queries in Power BI involves various tasks like loading tables from data
sources, transforming data using Power Query, combining and merging tables, and
managing relationships between them. These operations are crucial for building an
optimized and accurate data model, which serves as the foundation for generating insights
and building reports.
Name of Student : Heetika Mahesh Vaity
THEORY:
Data visualization is the graphical representation of data to make it easier for users to
understand trends, patterns, and insights. Effective visualization simplifies complex data
sets into charts, graphs, and maps, allowing for better decision-making. Power BI is a
powerful tool for creating such visualizations with a user-friendly interface and numerous
features that enable users to analyze data interactively.
2. Power BI Overview
Power BI is a business analytics tool developed by Microsoft for creating visual reports,
dashboards, and data models. It allows users to connect to various data sources, transform
raw data, and visualize it in a meaningful way.
Before visualizing data, it must be clean and well-organized. Power BI offers tools for:
● Data Importation: Power BI can connect to multiple data sources like Excel, SQL
databases, Azure, APIs, and more.
● Data Transformation (Power Query): Users can clean, filter, reshape, and
aggregate data using Power Query, which is built into Power BI. Operations include
removing duplicates, merging tables, changing data types, and more.
Each chart has its own use cases based on the nature of the data and the insights you want
to derive.
5. Creating Dashboards
● Slicers and Filters: Allow users to filter data directly on the dashboard, improving
interactivity.
● Drill-through and Drill-down: These enable users to explore data at different
levels of granularity.
● Bookmarks and Selections: Help in navigating between different views within a
report.
● Custom Visuals: Power BI allows the import of third-party visuals to meet specific
needs.
DAX is the formula language in Power BI used to perform calculations and create custom measures. Understanding DAX is crucial for advanced data modeling and creating meaningful KPIs. It enables operations like aggregations, filtering, and time-intelligence calculations.
Best Practices for Visualization:
● Focus on Simplicity: Avoid clutter and unnecessary visuals; focus on key metrics.
● Use Color Wisely: Use consistent and meaningful colors for categories and trends.
Avoid overuse of color.
● Ensure Readability: Use appropriate font sizes, legends, and labels for readability.
● Context and Annotations: Provide context for your data with titles, annotations, or
tooltips.
● Interactivity and Exploration: Allow users to explore data interactively via slicers,
filters, and tooltips.
Once the dashboard is built, you can share it with others via the Power BI service. Power BI
also supports real-time collaboration, where users can comment on dashboards, create
alerts, and schedule data refreshes.
Some common use cases for dashboards and visualizations in Power BI include:
● Sales and Financial Analysis: Visualizing sales trends, profit margins, and
forecasting.
● Customer Segmentation: Understanding customer behavior and segmenting
markets.
● Supply Chain Management: Tracking inventory levels, logistics, and supplier
performance.
● HR Analytics: Visualizing employee performance, attrition rates, and recruitment
data.
DATASET USED:
Link: https://www.kaggle.com/datasets/lainguyn123/student-performance-factors
Description:
This dataset provides a comprehensive overview of various factors affecting student performance in
exams. It includes information on study habits, attendance, parental involvement, and other aspects
influencing academic success.
STEPS:
DATA PREPROCESSING:
1. Data Cleaning
● Load your Dataset
● Remove Duplicates: Ensure there are no duplicate rows in your dataset. In Power
Query, go to Home → Remove Duplicates.
● Handle Missing Values: Check for missing values and decide on an approach:
○ Replace missing numeric values with a mean/median.
○ For categorical fields (e.g., Parental_Involvement), replace missing
values with the mode or a default value (e.g., "Unknown").
○ In Power Query, select Transform → Replace Values to handle missing values. A pandas sketch of the same logic follows.
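A minimal pandas sketch of that replacement logic (column names follow the dataset; the sample values are invented):

import pandas as pd

df = pd.DataFrame({
    "Hours_Studied": [10, None, 25],
    "Parental_Involvement": ["High", None, "Low"],
})

# Numeric field: replace missing values with the median (mean works similarly).
df["Hours_Studied"] = df["Hours_Studied"].fillna(df["Hours_Studied"].median())

# Categorical field: replace missing values with the mode or a default label.
df["Parental_Involvement"] = df["Parental_Involvement"].fillna("Unknown")
print(df)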
2. Data Transformation
● Convert Data Types: Ensure each field has the correct data type:
○ Numeric fields (e.g., Hours_Studied, Exam_Score) should be set to
Whole Number or Decimal Number.
○ Categorical fields (e.g., Parental_Involvement, Motivation_Level)
should be set as Text or Categorical.
1. Introduction to Visuals
● From the "Visualizations" pane, select a visual (e.g., Bar Chart, Pie Chart).
● Drag dataset fields into the "Values" and "Axis" sections.
2. Visualization Charts
● Select your chart type from the "Visualizations" pane (e.g., bar, line, pie).
● Drag relevant fields into the chart.
Line Charts: Plot Hours_Studied against Exam_Score over time (if applicable).
Pie Charts: Represent the proportion of students with High, Medium, and Low
Parental_Involvement.
3. Filtering Options
7. KPI Visuals
Average of Previous Scores: You can calculate the average of all the previous scores as a
reference point.
// Target: average of all previous scores, used as the KPI reference point
Target_Score = AVERAGE('StudentPerformanceFactors'[Previous_Scores])
// Value tracked by the KPI visual: the average exam score
Avg_Exam_Score = AVERAGE('StudentPerformanceFactors'[Exam_Score])
1. Value: Select Average of Exam_Score from the dropdown in the "Value" field.
2. Trend Axis: Select Previous_Scores, which is useful for showing trends based on historical data.
3. Target: Create a calculated measure for the target (Target_Score above) and use it here.
9. TreeMap
● Select a visual.
● In the "Format" pane, under "Data colors," modify the color for each value category.
13. AI Visuals
Dashboard:
CONCLUSION:
In this practical, we focused on data visualization techniques using Power BI to analyze the
student performance dataset. Various visuals, including KPI, were used to track Exam_Score
trends and compare them against targets like the average of previous scores. The use of
charts, slicers, and filters provided insights into student performance patterns, making it
easier to interpret data and identify areas for improvement.
Name of Student : Heetika Mahesh Vaity
ASSIGNMENT 1
1. Objective
1. Data Preparation: Cleaning and preprocessing data to handle missing values and
normalize features.
2. Feature Selection: Choosing the relevant features for clustering.
3. Choosing a Clustering Algorithm: Selecting the appropriate clustering technique
based on the data characteristics.
4. Determining the Number of Clusters: Methods like the elbow method or silhouette analysis can help find the optimal number of clusters (see the sketch after this list).
5. Cluster Validation: Assessing the quality and validity of the clusters formed, often
using internal metrics (like cohesion and separation) or external validation against
known labels.
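To make step 4 concrete, here is a short sketch using scikit-learn (assumed to be installed; the data is synthetic) that prints the inertia used by the elbow method and the silhouette score for a range of K:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated 2-D blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Inertia always drops as K grows; the "elbow" hints at a good K.
    # Silhouette near 1 indicates compact, well-separated clusters.
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))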
4. Applications
5. Challenges
● Choosing the Right Algorithm: Different algorithms may yield different results.
● Scalability: Some methods can struggle with large datasets.
● Interpreting Results: Clusters may not always be easily interpretable or
meaningful.
Cluster analysis is a powerful tool for discovering patterns and structures in data, making it
widely used in various fields such as marketing, biology, finance, and social sciences.
Similarity measures are used in clustering and other data analysis techniques to quantify
how similar two data points are. The choice of similarity measure can significantly affect
the clustering results. Here are some common similarity measures, along with their
characteristics and applications:
1. Euclidean Distance
● Formula: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
● Description: Measures the straight-line distance between two points in space.
2. Manhattan Distance
● Formula: $d(x, y) = \sum_i |x_i - y_i|$
● Description: Measures the distance between two points along the axes (like a grid). It sums the absolute differences of their coordinates.
● Use Cases: Useful in scenarios where the data is arranged in a grid-like structure or when you want to reduce the impact of outliers.
3. Minkowski Distance
● Formula: $d(x, y) = \left( \sum_i |x_i - y_i|^m \right)^{1/m}$
● Description: A generalization of Euclidean distance (m = 2) and Manhattan distance (m = 1).
● Use Cases: Flexible and can be tailored to specific needs by adjusting m.
4. Cosine Similarity
● Formula: $\text{sim}(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
● Description: Measures the cosine of the angle between two non-zero vectors in an inner product space. It ranges from -1 to 1.
● Use Cases: Particularly useful in text mining and natural language processing, where documents can be represented as vectors.
5. Pearson Correlation
● Formula: $r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}}$
● Description: Measures the linear correlation between two variables, ranging from -1 (perfectly negatively correlated) to 1 (perfectly positively correlated).
● Use Cases: Often used in clustering when the goal is to identify relationships
between features.
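To ground these definitions, the sketch below computes each measure for two example vectors using NumPy and SciPy (assumed to be installed):

import numpy as np
from scipy.spatial.distance import cityblock, cosine, euclidean, minkowski
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # y = 2x, so the angle between them is zero

print("Euclidean:", euclidean(x, y))           # sqrt of summed squared differences
print("Manhattan:", cityblock(x, y))           # sum of absolute differences
print("Minkowski (m=3):", minkowski(x, y, p=3))
print("Cosine similarity:", 1 - cosine(x, y))  # SciPy returns cosine *distance*
print("Pearson r:", pearsonr(x, y)[0])         # linear correlation in [-1, 1]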
The K-means algorithm is a popular clustering technique used to partition a dataset into K
distinct clusters based on feature similarity. It is widely used for its simplicity and efficiency
in handling large datasets. The objective of K-means is to minimize the variance within each
cluster while maximizing the variance between different clusters. This is typically done by
minimizing the sum of squared distances between each data point and the centroid of its
assigned cluster.
1. Initialization:
○ Choose the number of clusters K.
○ Randomly initialize K centroids, which can be selected from the dataset or randomly generated within the feature space.
2. Assignment Step:
○ For each data point in the dataset, calculate the distance between the point
and each centroid (using Euclidean distance or another similarity measure).
○ Assign each data point to the cluster associated with the nearest centroid.
3. Update Step:
○ Recalculate the centroid of each cluster. The new centroid is the mean of all data points assigned to that cluster: $C_k = \frac{1}{N_k} \sum_{x_i \in \text{Cluster}_k} x_i$, where N_k is the number of points in cluster k.
4. Convergence Check:
○ Repeat the assignment and update steps until the centroids no longer change
significantly, or until a maximum number of iterations is reached. This
indicates that the algorithm has converged.
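The four steps translate almost line-for-line into NumPy. The following is a minimal sketch on synthetic 2-D data, not a production implementation (it does not, for example, handle clusters that become empty):

import numpy as np

def kmeans(X, K, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K random data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # 2. Assignment: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its assigned points.
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 4. Convergence check: stop when the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2)) for c in (0, 5)])
centroids, labels = kmeans(X, K=2)
print(centroids)  # should land near (0, 0) and (5, 5)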
Limitations of K-Means:
● Choosing K: The number of clusters K must be specified in advance, which may not always be known.
● Sensitivity to Initialization: The final results can depend on the initial choice of
centroids. Different initializations may lead to different clustering results.
● Assumption of Spherical Clusters: K-means assumes clusters are spherical and
equally sized, which may not always be the case in real-world data.
● Outliers: The algorithm is sensitive to outliers, as they can disproportionately
influence the position of centroids.
ASSIGNMENT 2
Big Graphs refer to graph data structures that are large in scale, often containing millions
or billions of nodes (vertices) and edges. These graphs typically represent complex
networks, such as social networks, web pages, biological networks, or transportation
systems, where the relationships between entities are vast and interconnected.
PageRank is an algorithm developed by Larry Page and Sergey Brin (founders of Google)
to rank web pages in search engine results. It evaluates the importance of each webpage
based on the number and quality of links pointing to it. Pages that are linked to by many
other important pages are considered more important themselves.
● Concept: PageRank is based on the idea that if a webpage is linked to by many other
pages, especially by important pages, it is likely to be more relevant and
authoritative. Links from higher-quality pages carry more weight.
● Web as a Graph: The web is represented as a graph where web pages are nodes and
hyperlinks between them are edges. PageRank uses the structure of this graph to
determine the ranking of pages.
● Intuition: PageRank can be thought of as a way of measuring the likelihood that a
random web surfer, who randomly follows links, will land on a particular page. Pages
that have more incoming links are more likely to be visited.
The PageRank algorithm follows a simple iterative process to assign a rank (or score) to
each page based on the link structure of the web. Here are the key steps:
● Assign an initial PageRank value to every page in the network. Typically, every page
is given an equal value at the start. If there are N pages, each page could start with a
rank of 1/N.
● For each page, its PageRank is distributed evenly across the pages it links to.
● Formula: $PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$, where M(p_i) is the set of pages linking to p_i, L(p_j) is the number of outbound links on page p_j, d is the damping factor, and N is the total number of pages.
● For each iteration, update the PageRank of each page based on the contributions
from the pages that link to it (using the formula above). The process continues
iteratively until the PageRank values converge, meaning they stop changing
significantly between iterations.
● Some pages (dead ends or dangling nodes) may have no outbound links. These are
handled by redistributing their PageRank equally among all pages in the graph, or in
some implementations, they're simply ignored.
● The random surfer model is introduced through the damping factor d. In each
iteration, a portion (1 - d) of the total PageRank is evenly distributed among all
pages to simulate the chance that a random user may jump to any page at random
instead of following a link.
● Repeat the process of distributing PageRank and updating values over multiple
iterations until the algorithm converges. Convergence means that the PageRank
values between iterations change very little, indicating that the system has reached a
steady state.
● Once the algorithm converges, each page will have a final PageRank value, which
represents its importance relative to other pages. Pages with higher PageRank
values are considered more important and will appear higher in search engine
rankings.
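Putting the steps together, here is a compact NumPy sketch of iterative PageRank with damping on a tiny, made-up four-page graph; dangling nodes are handled by spreading their rank uniformly, as described above:

import numpy as np

# Hypothetical web graph: page index -> pages it links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
N, d = 4, 0.85

pr = np.full(N, 1.0 / N)           # initialization: every page starts at 1/N
for _ in range(100):
    new = np.full(N, (1 - d) / N)  # random-surfer (teleport) share
    for page, outs in links.items():
        if outs:
            for q in outs:         # rank is split evenly over outbound links
                new[q] += d * pr[page] / len(outs)
        else:                      # dangling node: redistribute to all pages
            new += d * pr[page] / N
    if np.abs(new - pr).sum() < 1e-10:  # convergence: values barely change
        break
    pr = new
print(pr, pr.sum())  # ranks sum to 1; page 2 ends up most important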