
Hadoop - High Availability Distributed Object-Oriented Platform

History of Hadoop

• Hadoop, an open-source framework for distributed storage and processing of large datasets, has its roots in the early 2000s. The history of Hadoop is closely tied to the evolution of web search and the challenges posed by managing and analysing vast amounts of data.

Google's MapReduce and GFS (Google File System):

• In 2003 and 2004, Google published two influential papers that laid the foundation for Hadoop: "The Google File System" and "MapReduce: Simplified Data Processing on Large Clusters". Google's MapReduce was a programming model and processing engine designed for large-scale data processing across distributed clusters.

• The Google File System (GFS) introduced the concept of a distributed file system capable of handling large-scale data across multiple nodes.

Creation of Hadoop by Doug Cutting and Mike Cafarella

• Inspired by Google's work, Doug Cutting and Mike Cafarella started the
Apache Nutch project in 2002, an open-source web search engine. As part
of this project, they developed an open-source implementation of
Google's MapReduce and GFS.

• In 2006, Doug Cutting joined Yahoo, and the Nutch MapReduce implementation became the core of what would later be known as Hadoop.

Apache Hadoop Project and Growth

• In 2006, the Apache Hadoop project was officially launched as an open-source project under the Apache Software Foundation.

• The Hadoop project quickly gained popularity due to its ability to handle
massive amounts of data cost-effectively.

Definition of Hadoop
• Hadoop is an open-source software framework that is used for storing and
processing large amounts of data in a distributed computing environment.
It is designed to handle big data and is based on the MapReduce
programming model, which allows for the parallel processing of large
datasets.

• It became an Apache Software Foundation project in 2006, based on white papers published by Google in 2003 and 2004 that described the Google File System (GFS) and the MapReduce programming model.
• Hadoop is an open-source framework based on Java that manages the
storage and processing of large amounts of data for applications. Hadoop
uses distributed storage and parallel processing to handle big data and
analytics jobs, breaking workloads down into smaller workloads that can
be run at the same time.

• Four modules comprise the primary Hadoop framework and work collectively to form the Hadoop ecosystem:

• Hadoop Distributed File System (HDFS): As the primary component of the Hadoop ecosystem, HDFS is a distributed file system in which individual Hadoop nodes operate on data that resides in their local storage. This minimizes network latency, providing high-throughput access to application data.

• Yet Another Resource Negotiator (YARN): YARN is a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications. It performs scheduling and resource allocation across the Hadoop system.

• MapReduce: MapReduce is a programming model for large-scale data processing. In the MapReduce model, subsets of larger datasets and instructions for processing the subsets are dispatched to multiple different nodes, where each subset is processed by a node in parallel with other processing jobs. After processing, the results from the individual subsets are combined into a smaller, more manageable dataset.

• Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by other Hadoop modules.
• Beyond HDFS, YARN, and MapReduce, the entire Hadoop open-source
ecosystem continues to grow and includes many tools and applications to
help collect, store, process, analyse, and manage big data. These include
Apache Pig, Apache Hive, Apache HBase, Apache Spark, Presto, and
Apache Zeppelin.

What are blocks?

The smallest quantity of data that HDFS can read or write is called a block. The default size of HDFS blocks is 128 MB, although this can be changed. HDFS files are divided into block-sized portions and stored as separate units. Unlike a traditional file system, if a file in HDFS is smaller than the block size, it does not take up the entire block; for example, a 5 MB file saved in HDFS with a block size of 128 MB takes up just 5 MB of space. The HDFS block size is large mainly to minimize seek costs.
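
As a hedged illustration, the Java sketch below uses Hadoop's FileSystem API to print a file's block size and the DataNodes holding each block. The path /data/sample.txt is hypothetical, and the code assumes HDFS configuration files (core-site.xml, hdfs-site.xml) are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath, so fs.defaultFS
        // should point at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path used purely for illustration.
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // Block size is a per-file property (128 MB by default unless overridden).
        System.out.println("Block size (bytes): " + status.getBlockSize());
        System.out.println("File length (bytes): " + status.getLen());

        // Each BlockLocation describes one block and the DataNodes holding its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}

A small file produces a single block entry whose length equals the file size, which matches the point above that files smaller than the block size do not occupy a full block.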

MapReduce framework

MapReduce is a Java-based, distributed execution framework within the Apache Hadoop ecosystem. It takes away the complexity of distributed programming by exposing two processing steps that developers implement: 1) Map and 2) Reduce.

In the Map step, data is split between parallel processing tasks, and transformation logic can be applied to each chunk of data. Once the Map step completes, the Reduce phase takes over to aggregate the data produced by the Map step.

In general, MapReduce uses Hadoop Distributed File System (HDFS) for both
input and output.
It is a software framework for easily writing applications that process large amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. Hadoop requires Java Runtime Environment (JRE) 1.6 or a higher version.

The Map Task – This is the first task; it takes input data and converts it into a set of data in which individual elements are broken down into tuples (key/value pairs).

The Reduce Task – This task takes the output from the map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.

Typically, both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. A word-count example is sketched below.
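
As a hedged illustration of these two tasks, the classic word-count example below shows a map task that emits (word, 1) tuples and a reduce task that sums them. The class names WordCount, TokenizerMapper, and IntSumReducer are chosen here for illustration:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: break each input line into (word, 1) tuples.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);          // emit a key/value pair
            }
        }
    }

    // Reduce task: combine the tuples for each word into a single count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}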

MapReduce is the parallel processing engine that allows Hadoop to go through large data sets in relatively short order. A group of systems on which a distributed application runs is called a cluster, and each machine in a cluster is called a node.

It consists of two parts:

1) Job Tracker (Master node)
2) Task Tracker (Slave node)

The Job Tracker (Master Node) is responsible for resource management, tracking resource consumption and availability, scheduling jobs, monitoring tasks, and rescheduling failed tasks.

The Job Tracker is a single point of failure for the Hadoop MapReduce service: if the Job Tracker goes down, all running jobs are halted. The slave node executes tasks as directed by the master and periodically provides task status information to the master.

Name Node and Data Node

The Name Node is the master node in the Apache Hadoop HDFS architecture that maintains and manages the blocks present on the Data Nodes (slave nodes). The Name Node is a highly available server that manages the file system namespace and controls access to files by clients. The Data Nodes are the slave nodes that store the actual data blocks and serve read and write requests from clients, periodically reporting their status back to the Name Node.

How does MapReduce work?


A MapReduce system is usually composed of three steps (even though it's
generalized as the combination of Map and Reduce operations/functions). The
MapReduce operations are:

Map: The input data is first split into smaller blocks. The Hadoop framework
then decides how many mappers to use, based on the size of the data to be
processed and the memory block available on each mapper server. Each block is
then assigned to a mapper for processing. Each ‘worker’ node applies the map
function to the local data, and writes the output to temporary storage. The
primary (master) node ensures that only a single copy of the redundant input
data is processed.

Shuffle, combine and partition: Worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node. As an optional step, the combiner (a local reducer) can run on each mapper server to reduce the data on that mapper even further, shrinking the data footprint and making shuffling and sorting easier. Partitioning (which is not optional) is the process that decides how the data is presented to the reducers and assigns each key to a particular reducer.

Reduce: A reducer cannot start while a mapper is still in progress. Worker nodes process each group of <key,value> pairs in parallel to produce <key,value> pairs as output. All the map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key. Unlike the map function, which is mandatory to filter and sort the initial data, the reduce function is optional. A driver program that wires these phases together is sketched below.
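
Below is a minimal, hedged sketch of a driver that wires the map, combine, and reduce phases together for the hypothetical WordCount classes sketched earlier. The reducer is reused as the combiner, which is valid here because summing counts is associative and commutative; the input and output paths are supplied on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map phase: split the input and emit (word, 1) pairs.
        job.setMapperClass(WordCount.TokenizerMapper.class);

        // Optional combine phase: pre-aggregate counts on each mapper node
        // to shrink the data shuffled across the network.
        job.setCombinerClass(WordCount.IntSumReducer.class);

        // Reduce phase: all values for the same key arrive at one reducer.
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setNumReduceTasks(2);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are passed on the command line, e.g.
        //   hadoop jar wordcount.jar WordCountDriver /input /output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}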

YARN

In simple terms, Hadoop YARN is often described as a newer and much-improved version of MapReduce. However, that is not a completely accurate picture, because YARN is also used for scheduling and for the execution of job sequences. YARN is the resource management layer of Hadoop, on which each job runs against the data as a separate Java application.

Acting as the framework's operating system, YARN allows workloads such as batch processing and interactive data handling to run on a single platform. Going well beyond the capabilities of MapReduce, YARN allows programmers to build interactive and real-time streaming applications.
YARN allows programmers to run as many applications as needed on the same cluster. It provides a secure and stable foundation for the operational management and sharing of system resources for maximum efficiency and flexibility. A small client-side sketch follows.
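
As a hedged illustration of YARN acting as the cluster's resource manager, the Java sketch below uses the YarnClient API to ask the ResourceManager for the applications it is tracking. It assumes a yarn-site.xml on the classpath that points at the ResourceManager:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // YarnConfiguration picks up yarn-site.xml, which must identify the ResourceManager.
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for every application it knows about.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId()
                    + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
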
Features of Hadoop

• It is fault tolerant.
• It is highly available.
• Its programming model is easy to use.
• It has huge, flexible storage.
• It has low cost.

Benefits of Hadoop

Scalability

Hadoop is important as one of the primary tools to store and process huge
amounts of data quickly. It does this by using a distributed computing model
which enables the fast processing of data that can be rapidly scaled by adding
computing nodes.

Flexibility

Hadoop allows for flexibility in data storage, as data does not require pre-processing before being stored. This means that an organization can store as much data as it likes and then utilize it later.

Low cost

As an open-source framework that can run on commodity hardware and has a large ecosystem of tools, Hadoop is a low-cost option for the storage and management of big data.

Resilience

As a distributed computing model, Hadoop allows for fault tolerance and system resilience, meaning that if one of the hardware nodes fails, jobs are redirected to other nodes. Data stored on one Hadoop cluster is replicated across other nodes within the system to fortify against the possibility of hardware or software failure.
Challenges in Hadoop

MapReduce complexity and limitations

As a file-intensive system, MapReduce can be a difficult tool to utilize for complex jobs, such as interactive analytical tasks. MapReduce functions also need to be written in Java and can require a steep learning curve. The MapReduce ecosystem is quite large, with many components for different functions, which can make it difficult to determine what tools to use.

Security

• Data sensitivity and protection can be issues as Hadoop handles such large datasets. An ecosystem of tools for authentication, encryption, auditing, and provisioning has emerged to help developers secure data in Hadoop.

Governance and management

Hadoop does not have many robust tools for data management and governance, nor for data quality and standardization.

Talent gap

Like many areas of programming, Hadoop has an acknowledged talent gap. Finding developers with the combined requisite skills in Java to program MapReduce, operating systems, and hardware can be difficult. In addition, MapReduce has a steep learning curve, making it hard to get new programmers up to speed on its best practices and ecosystem.

What is Apache Hadoop used for?

Analytics and big data

A wide variety of companies and organizations use Hadoop for research, production data processing, and analytics that require processing terabytes or petabytes of big data, storing diverse datasets, and data parallel processing.

Data storage and archiving

As Hadoop enables mass storage on commodity hardware, it is useful as a low-cost storage option for all kinds of data, such as transactions, click streams, or sensor and machine data.
Data lakes

Since Hadoop can help store data without pre-processing, it can be used to complement data lakes, where large amounts of unrefined data are stored.

Risk management

Banks, insurance companies, and other financial services companies use Hadoop to build risk analysis and management models.

Marketing analytics

Marketing departments often use Hadoop to store and analyse customer relationship management (CRM) data.

AI and machine learning

Hadoop ecosystems help with the processing of data and model training
operations for machine learning applications.

In what ways does Hadoop operate in situations such as ingesting, cleansing, and transforming data when working with various data sources and formats?

Hadoop, as part of its broader ecosystem, provides several tools and frameworks to address the challenges associated with data ingestion, cleansing, and transformation, especially when dealing with diverse data sources and formats. Here are some key components and approaches within the Hadoop ecosystem:

Apache Hive:
Data Transformation: Apache Hive provides a data warehousing and SQL-like
query language (HiveQL) for Hadoop. Users can define and execute ETL
(Extract, Transform, Load) processes using Hive queries, making it easier to
transform and structure data stored in Hadoop.

Apache Spark:
Data Transformation: Apache Spark, while not limited to Hadoop, is often used
within the Hadoop ecosystem for data processing. It offers in-memory data
processing and supports complex data transformations. Spark's DataFrame API and SQL capabilities make it a powerful tool for data cleansing and transformation tasks, as sketched below.
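
The following is a hedged Java sketch of this kind of cleansing and transformation with Spark's DataFrame API. The HDFS paths and the column names order_id, customer_name, and amount are hypothetical:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lower;
import static org.apache.spark.sql.functions.trim;

public class CleanseOrders {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cleanse-orders")
                .getOrCreate();

        // Hypothetical CSV of raw orders sitting in HDFS.
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs:///raw/orders.csv");

        // Basic cleansing: drop rows missing a key field, remove duplicates,
        // normalise a free-text column, and filter out invalid amounts.
        Dataset<Row> clean = raw
                .na().drop(new String[]{"order_id"})
                .dropDuplicates(new String[]{"order_id"})
                .withColumn("customer_name", trim(lower(col("customer_name"))))
                .filter(col("amount").gt(0));

        // Write the transformed data back to HDFS in a columnar format.
        clean.write().mode("overwrite").parquet("hdfs:///curated/orders");

        spark.stop();
    }
}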

Apache NiFi:

Data Ingestion: Apache NiFi is a powerful tool for data ingestion. It provides a
web-based user interface for designing data flows, making it easy to collect,
transfer, and process data from various sources to Hadoop clusters.

Data Cleansing: NiFi supports data enrichment, validation, and cleansing through its processors. Users can define custom data transformation rules to clean and enrich the data during the ingestion process.

Apache Flume:

Data Ingestion: Apache Flume is designed for efficient and reliable large-scale
data ingestion into Hadoop. It allows the collection, aggregation, and movement
of log data from various sources to Hadoop Distributed File System (HDFS) or
other data stores.

Data Transformation: Flume can be configured with custom interceptors that enable basic data transformation tasks during the ingestion process.

What are the best practices to optimise Hadoop cluster performance?

1) Configuration parameters

One of the first steps to optimize your Hadoop cluster's performance is to tune
the configuration parameters that control the behavior of the cluster
components, such as the Hadoop Distributed File System (HDFS), the
MapReduce engine, and the YARN resource manager. These parameters can
affect the memory allocation, the parallelism, the replication factor, the block
size, the shuffle and sort operations, and the compression methods. You should
adjust these parameters according to your data characteristics, your cluster size,
and your application requirements. You can use tools like Ambari or Cloudera
Manager to manage and optimize your configuration settings.
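
As a hedged illustration, the Java sketch below sets a few of these parameters programmatically on a job's Configuration. The property names are standard Hadoop keys, but the values are purely illustrative and should be tuned to your own data, cluster size, and application requirements:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobConfig {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();

        // Illustrative values only; the right numbers depend on your workload.
        conf.set("dfs.replication", "2");                          // replication factor for output files
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);         // 256 MB HDFS block size for new files
        conf.setInt("mapreduce.map.memory.mb", 2048);              // memory per map container
        conf.setInt("mapreduce.reduce.memory.mb", 4096);           // memory per reduce container
        conf.setInt("mapreduce.task.io.sort.mb", 512);             // sort buffer used during the shuffle
        conf.setBoolean("mapreduce.map.output.compress", true);    // compress intermediate map output
        conf.set("mapreduce.map.output.compress.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");

        return Job.getInstance(conf, "tuned-job");
    }
}

The same keys can instead be set cluster-wide in hdfs-site.xml, mapred-site.xml, and yarn-site.xml, or managed through tools like Ambari or Cloudera Manager as noted above.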

2) Hardware and software

Another important factor that can influence your Hadoop cluster's performance
is the choice of hardware and software for your nodes. You should consider the
CPU speed, the RAM size, the disk type and capacity, the network bandwidth,
and the operating system of your nodes. You should also ensure that your nodes
are compatible with the Hadoop version and the Java version that you are using.
You should avoid overloading your nodes with too many processes or tasks, and
you should distribute your data evenly across your nodes. You should also
upgrade your hardware and software regularly to take advantage of the latest
features and improvements.

3) Workload balancing

A common challenge that Hadoop cluster administrators face is how to balance the workload among the nodes and avoid bottlenecks or failures. You should design your applications to use the optimal number of mappers and reducers, and to avoid skewness or imbalance in the input or output data. You should also use partitioners, combiners, and custom serializers to reduce the network traffic and the disk I/O; a partitioner sketch follows. You should also schedule your jobs according to their priority, their duration, and their resource consumption. You can use tools like Oozie or Airflow to orchestrate your workflows and dependencies.
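
As one hedged example of the kind of custom partitioner mentioned above, the sketch below routes keys to reducers by their first letter. It is illustrative only (a skewed key distribution would call for a smarter scheme) and would be registered on a job with job.setPartitionerClass(FirstLetterPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: keys starting with the same letter go to the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (k.isEmpty()) {
            return 0;
        }
        // char values are non-negative, so the modulo result is a valid partition index.
        return Character.toLowerCase(k.charAt(0)) % numPartitions;
    }
}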

4) Cluster monitoring

A crucial aspect of optimizing your Hadoop cluster's performance is to monitor its health and performance metrics, such as the CPU utilization, the memory
usage, the disk throughput, the network latency, the garbage collection, and the
error logs. You should also monitor the status and the progress of your jobs,
such as the number of running, completed, failed, or killed tasks, and the
execution time and the counters of each task. You should use tools like Ganglia,
Nagios, or Prometheus to collect and visualize your cluster metrics, and tools
like Flume, Kafka, or Logstash to collect and analyse your cluster logs. You
should also use tools like Spark or Hive to query and process your data
efficiently.

Apache Spark

• It was originally developed at UC (University of California) Berkeley in 2009 to overcome the limitations of Hadoop.

• Apache Spark is a lightning-fast, open-source data-processing engine for machine learning and AI applications, backed by the largest open-source community in big data.

• Apache Spark (Spark) easily handles large-scale data sets and is a fast, general-purpose cluster computing system that is well suited to PySpark.

• It is designed to deliver the computational speed, scalability, and programmability required for big data—specifically for streaming data, graph data, analytics, machine learning, large-scale data processing, and artificial intelligence (AI) applications.

• It scales by distributing processing workflows across large clusters of computers, with built-in parallelism and fault tolerance. It even includes APIs for programming languages that are popular among data analysts and data scientists, including Scala, Java, Python, and R.

Resilient Distributed Dataset (RDD)

• An RDD is Spark's core data abstraction: an immutable, fault-tolerant collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. The features below are built around this abstraction (see the sketch after this list).

• Speed − Spark helps to run an application on a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; Spark stores the intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java,
Scala, or Python. Therefore, you can write applications in different
languages.
• Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It
also supports SQL queries, Streaming data, Machine learning (ML), and
Graph algorithms.
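
Here is a minimal, hedged Java sketch of working with an RDD. It assumes the job is launched with spark-submit, which supplies the master URL, and the data is a small illustrative in-memory list:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-example");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // An RDD built from an in-memory collection, partitioned across the cluster.
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // Transformations are lazy; nothing runs until an action is called.
            JavaRDD<Integer> squares = numbers.map(n -> n * n);

            // reduce() is an action: it triggers execution and returns a result.
            int sumOfSquares = squares.reduce(Integer::sum);
            System.out.println("Sum of squares: " + sumOfSquares);

            // Cache an RDD in memory when it will be reused; keeping intermediate
            // data in memory is where much of Spark's speed advantage comes from.
            squares.cache();
            System.out.println("Count: " + squares.count());
        }
    }
}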

Benefits of Apache Spark

Speed
Because data is organised to scale in-memory processing across distributed
cluster nodes, and because Spark can do processing without having to write data
back to disk storage, it can perform up to 100 times faster than MapReduce on
batch jobs when processing in memory and ten times faster on disk.
Multilingual
Written in Scala, Spark also comes with API connectors for using Java and
Python, as well as an R programming package that allows users to process very
large data sets required by data scientists.
Advanced analytics
Spark comes packaged with several libraries of code to run data analytics
applications. For example, the MLlib has machine learning code for advanced
statistical operations, and the Spark Streaming library enables users to analyse
data in real time.

Increased access to Big Data


Spark separates storage and compute, which allows customers to scale each to
accommodate the performance needs of analytics applications. And it
seamlessly performs batch jobs to move data into a data lake or data warehouse
for advanced analytics.
Ease of use
Due to Spark’s more efficient way of distributing data across nodes and clusters,
it can perform parallel data processing and data abstraction. And its ability to tie
together multiple types of databases and compute data from many types of data
stores allows it to be used across multiple use cases.
Power
Spark can handle a huge volume of data – as much as several petabytes
according to proponents. And it allows users to perform exploratory data
analysis on this petabyte-scale data without needing to down sample.

Spark Ecosystem

• Engine — Spark Core: It is the basic core component of the Spark ecosystem on top of which the entire ecosystem is built. It performs the tasks of scheduling/monitoring and basic I/O functionality.
• Management — Spark cluster can be managed by Hadoop YARN,
Mesos or Spark cluster manager.
• Library — The Spark ecosystem comprises Spark SQL (for running SQL-like queries on RDDs or data from external sources), Spark MLlib (for machine learning), Spark GraphX (for constructing graphs for better visualisation of data), and Spark Streaming (for batch processing and streaming of data in the same application). A Spark SQL sketch is shown after this list.
• Programming can be done in Python, Java, Scala and R
• Storage — Data can be stored in HDFS, S3, local storage and it
supports both SQL and NoSQL databases.
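
As a hedged illustration of the Spark SQL library from this list, the Java sketch below registers a DataFrame as a temporary view and queries it with SQL. The HDFS path and the page column are hypothetical; when spark.master points at YARN, the job runs on the Hadoop cluster and reads directly from HDFS:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-example")
                .getOrCreate();

        // Hypothetical JSON dataset of page visits stored in HDFS.
        Dataset<Row> visits = spark.read().json("hdfs:///logs/visits.json");

        // Register the DataFrame as a temporary view so it can be queried with SQL.
        visits.createOrReplaceTempView("visits");

        Dataset<Row> topPages = spark.sql(
                "SELECT page, COUNT(*) AS hits " +
                "FROM visits GROUP BY page ORDER BY hits DESC LIMIT 10");

        topPages.show();
        spark.stop();
    }
}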
