
Big Data Notes Unit 2

Topic
History of Hadoop
Apache Hadoop
Hadoop Distributed File System (HDFS)
Components of Hadoop
Data format
Analyzing data with Hadoop
Scaling out
Hadoop streaming
Hadoop pipes
Hadoop Eco System
Map Reduce framework and basics
How Map Reduce works
Developing a Map Reduce application
Unit tests with MR unit
Test data and local tests
Anatomy of a Map Reduce job run
Failures
Job scheduling
Shuffle and sort
Task execution
Map Reduce types
Input formats
Output formats
Map Reduce features, Real-world Map Reduce
History of Hadoop
Hadoop, an open-source framework, emerged from the need to
handle big data—massive datasets that traditional systems struggled to
manage. Here’s how it all began:
1. Origins: In 2002, Doug Cutting and Mike Cafarella embarked on
the Apache Nutch project, aiming to build a web search engine
capable of indexing a billion pages. However, they faced two major
challenges:
o Storage: Storing such vast amounts of data was expensive
using traditional relational databases.
o Processing: Efficiently processing this data required a novel
approach.
2. Google’s Influence: In 2003, Cutting and Cafarella stumbled upon
Google’s research papers:
o Google File System (GFS): Described a distributed file
system for storing large datasets.
o MapReduce: Introduced a programming model for
processing these datasets.
3. Open-Source Implementation: Inspired by Google’s techniques,
Cutting and Cafarella decided to implement them as open-source
tools within the Apache Nutch project. They focused on:
o HDFS (Hadoop Distributed File System): A storage solution
akin to GFS.
o MapReduce: A way to process data efficiently.
4. Birth of Hadoop: By 2006, Hadoop was born—a framework
combining distributed storage (HDFS) and large-scale data
processing (MapReduce). It aimed to democratize big data
handling.

What is Hadoop

Hadoop is an open-source framework from Apache used to store, process and analyze data that is very large in volume. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up simply by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files will be broken into blocks and stored on nodes in the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and cluster management.
3. Map Reduce: This is a framework that helps Java programs perform parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop modules.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce
engine and the HDFS (Hadoop Distributed File System). The MapReduce
engine can be MapReduce/MR1 or YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes the JobTracker and NameNode, whereas each slave node includes a DataNode and TaskTracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It uses a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.
Both the NameNode and DataNodes are capable of running on commodity machines. HDFS is developed in Java, so any machine that supports Java can easily run the NameNode and DataNode software.
NameNode

o It is the single master server that exists in the HDFS cluster.
o As it is a single node, it can become a single point of failure.
o It manages the file system namespace by executing operations such as opening, renaming and closing files.
o It simplifies the architecture of the system.
DataNode

o The HDFS cluster contains multiple DataNodes.
o Each DataNode contains multiple data blocks.
o These data blocks are used to store data.
o It is the responsibility of a DataNode to serve read and write requests from the file system's clients.
o It performs block creation, deletion, and replication upon instruction from the NameNode.
Job Tracker

o The role of the JobTracker is to accept MapReduce jobs from clients and process the data by using the NameNode.
o In response, the NameNode provides metadata to the JobTracker.
Task Tracker

o It works as a slave node for the JobTracker.
o It receives tasks and code from the JobTracker and applies that code to the file. This process can also be called a Mapper.
MapReduce Layer
MapReduce comes into play when the client application submits a MapReduce job to the JobTracker. In response, the JobTracker sends the request to the appropriate TaskTrackers. Sometimes a TaskTracker fails or times out; in that case, that part of the job is rescheduled.
Advantages of Hadoop

o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the tools that process the data are often on the same servers, thus reducing processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost Effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if one node is down or some other network failure happens, Hadoop uses another copy of the data. Normally, data is replicated three times, but the replication factor is configurable (see the sketch below).
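As a hedged illustration of the configurable replication factor, the sketch below uses the standard HDFS Java API to change the replication of a single file; the file path and the factor of 2 are invented for the example, and the cluster-wide default would normally come from the dfs.replication property in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // connect to the default file system (HDFS)
        Path file = new Path("/user/test/data.txt");     // example path, as used later in this unit
        boolean ok = fs.setReplication(file, (short) 2); // ask HDFS to keep 2 copies instead of the default 3
        System.out.println("Replication change accepted: " + ok);
        fs.close();
    }
}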
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.

Let's focus on the history of Hadoop in the following steps: -

o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch, an open-source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. Storing that data would have been very costly, which became a serious problem for the project and one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, he introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo was running two clusters of 1,000 machines.
o In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster in 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.

Components of Hadoop
Hadoop comprises three core components:
1. Hadoop Distributed File System (HDFS):
o Purpose: Stores data across a cluster of commodity
hardware.
o Key Features:
▪ Distributed: Splits files into blocks and distributes
them across nodes.
▪ Fault-Tolerant: Replicates data for resilience.
▪ Scalable: Handles petabytes of data.

What is HDFS?

Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability in the face of failures and high availability to parallel applications.
It is cost effective as it uses commodity hardware. It involves the concepts of blocks, data nodes and the name node.

Where to use HDFS

o Very Large Files: Files should be hundreds of megabytes, gigabytes or more in size.
o Streaming Data Access: The time to read the whole data set is more important than the latency in reading the first record. HDFS is built on a write-once, read-many-times pattern.
o Commodity Hardware: It works on low-cost hardware.

Where not to use HDFS

o Low-Latency Data Access: Applications that require very fast access to the first record should not use HDFS, as it gives importance to reading the whole data set rather than to the time taken to fetch the first record.
o Lots of Small Files: The name node holds the metadata of files in memory, and if there are many small files this consumes a large amount of the name node's memory, which is not feasible.
o Multiple Writes: It should not be used when we have to write to the same file multiple times; HDFS follows a write-once model.

HDFS Concepts

1. Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike in an ordinary file system, if a file in HDFS is smaller than the block size it does not occupy the full block's size; for example, a 5 MB file stored in HDFS with a 128 MB block size takes only 5 MB of space. The HDFS block size is large just to minimize the cost of seeks.
2. Name Node: HDFS works in a master-worker pattern where the name node acts as master. The name node is the controller and manager of HDFS, as it knows the status and metadata of all the files in HDFS; the metadata consists of file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to it. Moreover, the HDFS cluster is accessed by multiple clients concurrently, so all this information is handled by a single machine. File system operations like opening, closing and renaming are executed by it.
3. Data Node: Data nodes store and retrieve blocks when they are told to, by the client or the name node. They report back to the name node periodically with the list of blocks that they are storing. The data node, being commodity hardware, also does the work of block creation, deletion and replication as instructed by the name node.
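To make the block/name node/data node interplay concrete, here is a small, hedged sketch using the public HDFS Java API: it asks the name node for the block locations of one file and prints which data nodes hold each block. The file path is only an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);                              // talks to the name node
        FileStatus status = fs.getFileStatus(new Path("/user/test/data.txt")); // example file
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());      // block metadata from the name node
        for (BlockLocation b : blocks) {
            // Each block is stored on several data nodes (one per replica).
            System.out.println("offset " + b.getOffset() + " length " + b.getLength()
                    + " hosts " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}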

(Figures omitted: HDFS DataNode and NameNode architecture, HDFS read path, HDFS write path.)

Since all the metadata is stored in the name node, it is very important. If it fails, the file system cannot be used, as there would be no way of knowing how to reconstruct files from the blocks present on the data nodes. To overcome this, the concept of the secondary name node arises.

Secondary Name Node: It is a separate physical machine that acts as a helper to the name node. It performs periodic checkpoints: it communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and loss of data.

Starting HDFS

HDFS should be formatted initially and then started in distributed mode. The commands are given below.

To format: $ hadoop namenode -format

To start: $ start-dfs.sh

HDFS Basic File Operations

1. Putting data into HDFS from the local file system
o First create a folder in HDFS where data can be put from the local file system.

$ hadoop fs -mkdir /user/test

o Copy the file "data.txt" from the local folder /usr/home/Desktop to the HDFS folder /user/test.

$ hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test

o Display the contents of the HDFS folder.

$ hadoop fs -ls /user/test

2. Copying data from HDFS to the local file system
o $ hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt
3. Compare the files and see that both are the same
o $ md5 /usr/bin/data_copy.txt /usr/home/Desktop/data.txt
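For reference, the same put/get operations can also be performed through the HDFS Java API. The following is a hedged sketch that assumes the same example paths as the commands above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Equivalent of: hadoop fs -copyFromLocal /usr/home/Desktop/data.txt /user/test
        fs.copyFromLocalFile(new Path("/usr/home/Desktop/data.txt"), new Path("/user/test/data.txt"));
        // Equivalent of: hadoop fs -copyToLocal /user/test/data.txt /usr/bin/data_copy.txt
        fs.copyToLocalFile(new Path("/user/test/data.txt"), new Path("/usr/bin/data_copy.txt"));
        fs.close();
    }
}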

Recursive deleting

o hadoop fs -rmr <arg>

Example:

o hadoop fs -rmr /user/sonoo/

HDFS Other commands

The following conventions are used in the commands below:

"<path>" means any file or directory name.

"<path>..." means one or more file or directory names.

"<file>" means any filename.

"<src>" and "<dest>" are path names in a directed operation.

"<localSrc>" and "<localDest>" are paths as above, but on the local file system.

o put <localSrc><dest>
Copies the file or directory from the local file system identified by
localSrc to dest within the DFS.

o copyFromLocal <localSrc><dest>

Identical to -put


o moveFromLocal <localSrc><dest>

Copies the file or directory from the local file system identified by
localSrc to dest within HDFS, and then deletes the local copy on
success.

o get [-crc] <src><localDest>

Copies the file or directory in HDFS identified by src to the local file
system path identified by localDest.

o cat <filename>

Displays the contents of filename on stdout.

o moveToLocal <src><localDest>

Works like -get, but deletes the HDFS copy on success.

o setrep [-R] [-w] rep <path>

Sets the target replication factor for files identified by path to rep. (The
actual replication factor will move toward the target over time)

o touchz <path>

Creates a file at path containing the current time as a timestamp. Fails if a file already exists at path, unless the file is already size 0.

o test -[ezd] <path>

Returns 1 if path exists, has zero length, or is a directory; otherwise returns 0.


o stat [format] <path>

Prints information about path. Format is a string which accepts file size
in blocks (%b), filename (%n), block size (%o), replication (%r), and
modification date (%y, %Y).

HDFS Features and Goals

The Hadoop Distributed File System (HDFS) is a distributed file system. It is a core part of Hadoop and is used for data storage. It is designed to run on commodity hardware.

Unlike other distributed file systems, HDFS is highly fault-tolerant and can be deployed on low-cost hardware. It can easily handle applications that work with large data sets.

Let's see some of the important features and goals of HDFS.

Features of HDFS

o Highly Scalable - HDFS is highly scalable, as it can scale to hundreds of nodes in a single cluster.
o Replication - Due to unfavourable conditions, a node containing data may be lost. To overcome such problems, HDFS always maintains copies of the data on different machines.
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is so fault-tolerant that if any machine fails, another machine containing a copy of that data automatically takes over.
o Distributed data storage - This is one of the most important features of HDFS and makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.

Goals of HDFS

o Handling hardware failure - HDFS consists of many server machines. If any machine fails, the goal of HDFS is to recover from the failure quickly.
o Streaming data access - Applications that run on HDFS require streaming access to their data sets, rather than the general-purpose access provided by ordinary file systems.
o Coherence model - Applications that run on HDFS follow the write-once-read-many approach. So a file, once created, need not be changed; however, it can be appended to and truncated.

2. Hadoop MapReduce:
o Purpose: Processes data in parallel across the cluster.
o How It Works:
▪ Map Phase: Breaks down data into key-value pairs.
▪ Shuffle and Sort: Organizes intermediate data.
▪ Reduce Phase: Aggregates results.

Map Reduce in Hadoop

One of the three components of Hadoop is MapReduce. The first component of Hadoop, the Hadoop Distributed File System (HDFS), is responsible for storing files. The second component, MapReduce, is responsible for processing them.

MapReduce has two main tasks, divided phase-wise: in the first phase Map is used, and in the next phase Reduce is used.
3. Yet Another Resource Negotiator (YARN):
o Purpose: Manages resources in the cluster.
o Functions:
▪ Resource Allocation: Allocates CPU, memory, and
other resources.
▪ Job Scheduling: Ensures efficient utilization.

What is YARN?

Yet Another Resource Negotiator (YARN) takes Hadoop programming to the next level beyond Java-only MapReduce and opens the cluster to other applications such as HBase and Spark. Different YARN applications can co-exist on the same cluster, so MapReduce, HBase and Spark can all run at the same time, bringing great benefits for manageability and cluster utilization.

Components Of YARN

o Client: For submitting MapReduce jobs.
o Resource Manager: Manages the use of resources across the cluster.
o Node Manager: Launches and monitors the compute containers on machines in the cluster.
o MapReduce Application Master: Coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.

The JobTracker and TaskTracker were used in previous versions of Hadoop and were responsible for handling resources and tracking progress. Hadoop 2.0 introduced the ResourceManager and NodeManager to overcome the shortfalls of the JobTracker and TaskTracker.

Benefits of YARN

o Scalability: MapReduce 1 hits a scalability bottleneck at about 4,000 nodes and 40,000 tasks, whereas YARN is designed for 10,000 nodes and 100,000 tasks.
o Utilization: The Node Manager manages a pool of resources, rather than a fixed number of designated slots, thus increasing utilization.
o Multitenancy: Different versions of MapReduce can run on YARN, which makes the process of upgrading MapReduce more manageable.

Real-World Applications
Hadoop has revolutionized data processing across industries:

• Big Data Analytics: Organizations analyze massive datasets efficiently.
• Log Processing: Hadoop handles logs from servers, applications,
and devices.
• Recommendation Systems: Personalized recommendations (e.g.,
Netflix, Amazon).
• Genomic Research: Analyzing DNA sequences.
• Social Media Analysis: Extracting insights from social platforms.
• Financial Services: Fraud detection, risk assessment, and trading
analytics.

In summary, Hadoop’s journey—from Google’s research papers to an open-source powerhouse—has transformed how we handle big data. Its components work in harmony, enabling scalable storage, parallel processing, and resource management.

Data Formats

• In Hadoop, data formats play a crucial role in how information is stored and processed. Common formats include:
o Text Files: Simple, human-readable files (e.g., CSV, TSV, JSON).
o Sequence Files: Binary files optimized for MapReduce (see the sketch after this list).
o Avro: A compact, schema-based format.
o Parquet: Columnar storage format.
o ORC (Optimized Row Columnar): Another columnar format.
o XML: Used less frequently due to verbosity.
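As a hedged illustration of the Sequence File format mentioned above, the sketch below writes a few key-value records with Hadoop's standard SequenceFile API; the output path is invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/user/test/numbers.seq");   // example output path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 1; i <= 5; i++) {
                // Keys and values are Hadoop Writable types, stored in a binary, splittable format.
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }
    }
}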
Scaling Out

• Hadoop’s strength lies in its ability to scale horizontally. As data grows, you can add more nodes to the cluster:
o Adding Nodes: Increases storage capacity and processing
power.
o Load Balancing: Distributes work evenly across nodes.
o Fault Tolerance: Even if a node fails, others continue
processing.


MapReduce Features and Real-World Applications


MapReduce Framework and Basics

• MapReduce is the heart of Hadoop. It processes large datasets by dividing them into smaller chunks and distributing the work across a cluster. Here are the basics (a minimal WordCount sketch follows this list):
o Map Phase:
▪ Input Data: Divided into key-value pairs.
▪ Mapper Function: Processes each pair independently.
▪ Intermediate Key-Value Pairs: Generated as output.
o Shuffle and Sort:
▪ Organizes intermediate data for efficient grouping.
▪ Ensures that data with the same key ends up on the
same reducer.
o Reduce Phase:
▪ Aggregates results based on keys.
▪ Reducer function processes data and produces final
output.
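The following is a minimal, hedged WordCount sketch — the classic introductory MapReduce example, not something specific to these notes — showing a Mapper that emits (word, 1) pairs and a Reducer that sums them. Class names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map phase: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // intermediate key-value pair
                }
            }
        }
    }

    // Reduce phase: all counts for the same word arrive together; add them up.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));    // final output pair
        }
    }
}

The shuffle and sort between the two phases is handled by the framework itself: every (word, 1) pair with the same word is routed to the same reducer call.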

Developing a MapReduce Application

• To create a MapReduce job (a driver sketch for the WordCount example follows these steps):
1. Write Mapper and Reducer Functions:
▪ Mapper: Extracts relevant information from input data.
▪ Reducer: Combines and summarizes data.
2. Configure Input and Output Formats:
▪ Specify how data is read and written.
3. Submit the Job to Hadoop:
▪ Hadoop takes care of distributing tasks across nodes.
4. Monitor and Analyze Results:
▪ Check logs and output files.
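Here is a hedged driver sketch for the WordCount classes above, covering steps 2 and 3: it configures the input/output formats and submits the job. The class and path names are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);

        job.setInputFormatClass(TextInputFormat.class);    // how input is read
        job.setOutputFormatClass(TextOutputFormat.class);  // how results are written
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/test/input"));    // example paths
        FileOutputFormat.setOutputPath(job, new Path("/user/test/output"));

        // Submit to Hadoop and wait; the framework distributes tasks across nodes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

On a cluster the driver would typically be packaged into a JAR and launched with the hadoop jar command.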

Unit Tests with MRUnit

• MRUnit is a testing framework for MapReduce applications.


• It allows you to write unit tests for your Mapper and Reducer
functions.
• Ensures correctness and helps catch bugs early.
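A hedged MRUnit example for the TokenizerMapper sketched above might look like the following: MRUnit's MapDriver feeds one input record to the mapper in isolation and checks the emitted pairs. The test class name and input text are invented.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TokenizerMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        mapDriver = MapDriver.newMapDriver(new WordCount.TokenizerMapper());
    }

    @Test
    public void emitsOnePairPerWord() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("big data"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .runTest();   // fails if the mapper's output differs from the expectations
    }
}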

Test Data and Local Tests

• Before deploying to a cluster, test your MapReduce job locally:


o Use a small dataset.
o Run the job on your local machine.
o Verify correctness and performance.
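A hedged way to run such a local test is to point the job's Configuration at the local job runner and the local file system before submitting, for example:

import org.apache.hadoop.conf.Configuration;

public class LocalTestConfig {
    // Returns a Configuration that runs MapReduce in-process on local files,
    // which is convenient for trying a job on a small sample dataset.
    public static Configuration localConf() {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local"); // use the local job runner instead of YARN
        conf.set("fs.defaultFS", "file:///");          // read/write the local file system, not HDFS
        return conf;
    }
}

Passing this configuration to Job.getInstance in the driver shown earlier runs the whole job locally on the sample data.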

Anatomy of a MapReduce Job Run

• A typical MapReduce job involves several stages:


1. Job Submission:
▪ Submit your job to Hadoop.
▪ Hadoop schedules it for execution.
2. Job Initialization:
▪ Hadoop sets up necessary resources.
3. Map Phase Execution:
▪ Mappers process input data.
4. Shuffle and Sort:
▪ Data is shuffled and grouped.
5. Reduce Phase Execution:
▪ Reducers aggregate results.
6. Job Completion:
▪ Output is written to HDFS.

Failures and Job Scheduling

• Hadoop handles failures gracefully:


o Task Failures: Retries failed tasks.
o Node Failures: Redistributes work to healthy nodes.
• Job Scheduling:
o Hadoop schedules tasks based on available resources.
o Prioritizes jobs based on their requirements.
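As a hedged illustration of task-failure handling, the retry limits are configurable per job; the property names below are the standard Hadoop 2 names, and the values shown are only examples (4 is also the usual default).

import org.apache.hadoop.conf.Configuration;

public class RetrySettings {
    public static Configuration withRetryLimits() {
        Configuration conf = new Configuration();
        // A failed map or reduce task attempt is retried on another node,
        // up to this many attempts, before the whole job is declared failed.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);
        return conf;
    }
}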

MapReduce Types

• Hadoop supports different types of MapReduce jobs:


o Batch Processing: Analyzing historical data.
o Real-Time Processing: Handling streaming data.
o Iterative Processing: Running multiple MapReduce
iterations.
o Graph Processing: Implementing graph algorithms.

Input and Output Formats

• Input Formats:
o Define how data is read into MapReduce jobs.
o Examples: TextInputFormat, KeyValueTextInputFormat, SequenceFileInputFormat.
• Output Formats:
o Specify how results are written.
o Examples: TextOutputFormat, SequenceFileOutputFormat, AvroOutputFormat.
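Switching formats only means replacing the format classes in the driver sketched earlier; a hedged example using the key-value text input format and the SequenceFile output format:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfig {
    // Reads tab-separated key/value lines and writes binary SequenceFile output.
    public static void configure(Job job) {
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
    }
}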
Real-World MapReduce Applications

• Let’s explore some practical scenarios where MapReduce shines:


o Log Analysis: Processing server logs to extract insights.
o Recommendation Systems: Generating personalized
recommendations.
o Sentiment Analysis: Analyzing social media posts.
o PageRank Algorithm: Used by search engines to rank web
pages.
o Data Cleansing: Identifying and correcting errors in large
datasets.

Remember, Hadoop’s power lies in its ability to handle massive data volumes efficiently. Whether you’re analyzing user behavior, processing sensor data, or running complex algorithms, Hadoop provides the foundation for scalable and distributed computing.
