Hadoop Notes for Students
History of Hadoop
• In 2003 and 2004, Google published two influential papers that laid the foundation for Hadoop: "The Google File System" and "MapReduce: Simplified Data Processing on Large Clusters." Google's MapReduce was a programming model and processing engine designed for large-scale data processing across distributed clusters.
• Doug Cutting and Mike Cafarella had started the Apache Nutch project, an open-source web search engine, in 2002. Inspired by Google's papers, they developed an open-source implementation of Google's MapReduce and GFS as part of this project.
• The Hadoop project quickly gained popularity due to its ability to handle
massive amounts of data cost-effectively.
Definition of Hadoop
• Hadoop is an open-source software framework that is used for storing and
processing large amounts of data in a distributed computing environment.
It is designed to handle big data and is based on the MapReduce
programming model, which allows for the parallel processing of large
datasets.
The smallest unit of data that HDFS can read or write is called a block. The default size of an HDFS block is 128 MB, although this can be changed. HDFS files are divided into block-sized portions and stored as separate units. Unlike blocks in a traditional file system, a file in HDFS that is smaller than the block size does not take up the entire block; for example, a 5 MB file saved in HDFS with a block size of 128 MB takes up just 5 MB of space. The HDFS block size is large mainly to reduce seek costs.
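As a minimal sketch (assuming a reachable HDFS cluster whose configuration files are on the classpath, and a hypothetical file path /data/sample.txt), a Java client can inspect the block size that HDFS would use:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeCheck {
        public static void main(String[] args) throws Exception {
            // Loads core-site.xml / hdfs-site.xml from the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path, used only for illustration
            Path file = new Path("/data/sample.txt");

            // Default block size for this path, e.g. 134217728 bytes = 128 MB
            long blockSize = fs.getDefaultBlockSize(file);
            System.out.println("Default block size: " + blockSize + " bytes");

            fs.close();
        }
    }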
In general, MapReduce uses Hadoop Distributed File System (HDFS) for both
input and output.
It is a software framework for easily writing applications that process large amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. Hadoop requires Java Runtime Environment (JRE) 1.6 or a higher version.
The Map Task – This is the first task; it takes the input data and converts it into a set of data in which individual elements are broken down into tuples (key/value pairs).
The Reduce Task – This task takes the output from the map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
Typically, both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. MapReduce is the parallel processing engine that allows Hadoop to work through large data sets in relatively short order. A group of systems on which a distributed application runs is called a cluster, and each machine in a cluster is called a node.
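To make the map and reduce tasks concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API (class names and command-line paths are illustrative): the map task emits (word, 1) tuples and the reduce task sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map task: break each input line into (word, 1) tuples
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce task: combine the tuples for each word into a single count
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and output HDFS paths are supplied on the command line
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }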
The JobTracker is a single point of failure for the Hadoop MapReduce service (in Hadoop 1.x); if the JobTracker goes down, all running jobs are halted.
A slave node executes the tasks as directed by the master and periodically reports task status information back to the master.
The NameNode is the master node in the Apache Hadoop HDFS architecture; it maintains and manages the blocks present on the DataNodes (slave nodes). The NameNode is a highly available server that manages the file system namespace and controls clients' access to files.
Map: The input data is first split into smaller blocks. The Hadoop framework
then decides how many mappers to use, based on the size of the data to be
processed and the memory block available on each mapper server. Each block is
then assigned to a mapper for processing. Each ‘worker’ node applies the map
function to the local data, and writes the output to temporary storage. The
primary (master) node ensures that only a single copy of the redundant input
data is processed.
Shuffle, combine and partition: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key ends up on the same worker node. As an optional step, a combiner (essentially a local reducer) can run on each mapper server to reduce that mapper's data even further, shrinking the data footprint and making the shuffle and sort easier. Partitioning (which is not optional) is the process that decides how the data is presented to the reducers and assigns each key to a particular reducer, as sketched in the example below.
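As a sketch of how the combine and partition steps plug into a job like the word count above (the partitioning rule here is purely illustrative), Hadoop lets you register an optional combiner and a custom partitioner on the Job object:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Illustrative partitioner: each key is routed to a reducer based on its first character
    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            if (numReduceTasks == 0) {
                return 0;
            }
            String k = key.toString();
            char first = k.isEmpty() ? ' ' : Character.toLowerCase(k.charAt(0));
            return (first & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

    // In the driver (e.g. WordCount.main), the optional combiner and the partitioner are registered:
    //   job.setCombinerClass(IntSumReducer.class);              // pre-aggregates counts on each mapper
    //   job.setPartitionerClass(FirstLetterPartitioner.class);  // decides which reducer receives each key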
YARN
Acting as the framework's operating system, YARN allows workloads such as batch processing and interactive data handling to run on a single platform. Going well beyond the capabilities of MapReduce, YARN allows programmers to build interactive and real-time streaming applications.
YARN allows programmers to run as many applications as needed on the same
cluster. It provides a secure and stable foundation for the operational management
and sharing of system resources for maximum efficiency and flexibility.
Features of Hadoop
• It is fault tolerant.
• It is highly available.
• Its programming model is easy.
• It has huge, flexible storage.
• It has low cost.
Benefits of Hadoop
Scalability
Hadoop is important as one of the primary tools to store and process huge
amounts of data quickly. It does this by using a distributed computing model
which enables the fast processing of data that can be rapidly scaled by adding
computing nodes.
Flexibility
Hadoop allows for flexibility in data storage, as data does not require pre-processing before being stored, which means that an organization can store as much data as it likes and then utilize it later.
Low cost
Resilience
Security
Talent gap
Since Hadoop can help store data without pre-processing, it can be used to complement data lakes, where large amounts of unrefined data are stored.
Risk management
Marketing analytics
The Hadoop ecosystem helps with data processing and model-training operations for machine learning applications.
What are the ways in which Hadoop operates in situations like ingesting, cleansing, and transforming data when working with various data sources and formats?
Apache Hive:
Data Transformation: Apache Hive provides a data warehousing and SQL-like
query language (HiveQL) for Hadoop. Users can define and execute ETL
(Extract, Transform, Load) processes using Hive queries, making it easier to
transform and structure data stored in Hadoop.
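As a minimal sketch of an ETL step driven from Java (assuming a HiveServer2 instance at a hypothetical host and port, a hypothetical raw_logs table, and the Hive JDBC driver on the classpath), a HiveQL transformation can be submitted over JDBC:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveEtlSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical HiveServer2 endpoint and database
            String url = "jdbc:hive2://hive-host:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "user", "");
                 Statement stmt = conn.createStatement()) {
                // Illustrative transform: build a cleaned, structured table from raw data
                stmt.execute(
                    "CREATE TABLE clean_logs AS " +
                    "SELECT lower(trim(user_id)) AS user_id, event_time " +
                    "FROM raw_logs WHERE user_id IS NOT NULL");
            }
        }
    }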
Apache Spark:
Data Transformation: Apache Spark, while not limited to Hadoop, is often used
within the Hadoop ecosystem for data processing. It offers in-memory data
processing and supports complex data transformations. Spark's DataFrame API and SQL capabilities make it a powerful tool for data cleansing and transformation tasks.
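A minimal sketch of a cleansing and transformation step with Spark's Java DataFrame API (the file paths and column names are illustrative assumptions):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.lower;
    import static org.apache.spark.sql.functions.trim;

    public class SparkCleanSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("clean-sketch")
                    .getOrCreate();

            // Hypothetical input: a CSV file in HDFS with a header row
            Dataset<Row> raw = spark.read()
                    .option("header", "true")
                    .csv("hdfs:///data/raw/users.csv");

            // Illustrative cleansing: drop rows with a null id, normalise the name column
            Dataset<Row> cleaned = raw
                    .filter(col("id").isNotNull())
                    .withColumn("name", lower(trim(col("name"))));

            // Write the transformed data back to HDFS as Parquet
            cleaned.write().mode("overwrite").parquet("hdfs:///data/clean/users");

            spark.stop();
        }
    }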
Apache NiFi:
Data Ingestion: Apache NiFi is a powerful tool for data ingestion. It provides a
web-based user interface for designing data flows, making it easy to collect,
transfer, and process data from various sources to Hadoop clusters.
Apache Flume:
Data Ingestion: Apache Flume is designed for efficient and reliable large-scale
data ingestion into Hadoop. It allows the collection, aggregation, and movement
of log data from various sources to Hadoop Distributed File System (HDFS) or
other data stores.
Optimizing Hadoop cluster performance
1) Configuration parameters
One of the first steps to optimize your Hadoop cluster's performance is to tune
the configuration parameters that control the behavior of the cluster
components, such as the Hadoop Distributed File System (HDFS), the
MapReduce engine, and the YARN resource manager. These parameters can
affect the memory allocation, the parallelism, the replication factor, the block
size, the shuffle and sort operations, and the compression methods. You should
adjust these parameters according to your data characteristics, your cluster size,
and your application requirements. You can use tools like Ambari or Cloudera
Manager to manage and optimize your configuration settings.
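For illustration only (the values below are placeholders, not recommendations), a few of these parameters can be overridden per job through Hadoop's Java Configuration API; cluster-wide defaults would normally live in hdfs-site.xml, mapred-site.xml, and yarn-site.xml:

    import org.apache.hadoop.conf.Configuration;

    public class TuningSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // HDFS: block size (256 MB) and replication factor -- placeholder values
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
            conf.setInt("dfs.replication", 3);

            // MapReduce: per-task memory and map-output compression
            conf.setInt("mapreduce.map.memory.mb", 2048);
            conf.setInt("mapreduce.reduce.memory.mb", 4096);
            conf.setBoolean("mapreduce.map.output.compress", true);

            // A job built from this Configuration picks up the overrides,
            // e.g. Job.getInstance(conf, "tuned job")
            System.out.println("Effective block size: " + conf.get("dfs.blocksize"));
        }
    }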
2) Hardware and software selection
Another important factor that can influence your Hadoop cluster's performance
is the choice of hardware and software for your nodes. You should consider the
CPU speed, the RAM size, the disk type and capacity, the network bandwidth,
and the operating system of your nodes. You should also ensure that your nodes
are compatible with the Hadoop version and the Java version that you are using.
You should avoid overloading your nodes with too many processes or tasks, and
you should distribute your data evenly across your nodes. You should also
upgrade your hardware and software regularly to take advantage of the latest
features and improvements.
3) Workload balancing
4) Cluster monitoring
Apache Spark
• It was originally developed at UC Berkeley (University of California, Berkeley) in 2009 to overcome the limitations of Hadoop.
• Apache Spark (Spark) easily handles large-scale data sets and is a fast, general-purpose cluster computing system that is well suited to PySpark.
Speed
Because data is organised to scale for in-memory processing across distributed cluster nodes, and because Spark can process data without having to write it back to disk, Spark can perform up to 100 times faster than MapReduce on batch jobs when processing in memory, and up to ten times faster on disk.
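A brief sketch of what processing without writing back to disk looks like in practice (reusing the hypothetical Parquet path from the earlier example): the data set is cached in executor memory after the first action, so later actions do not re-read it from disk.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CacheSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("cache-sketch").getOrCreate();

            // Hypothetical Parquet data set produced by an earlier job
            Dataset<Row> events = spark.read().parquet("hdfs:///data/clean/users");

            events.cache();                              // keep the data in executor memory
            long total = events.count();                 // first action reads from disk and fills the cache
            long distinct = events.distinct().count();   // second action is served from memory

            System.out.println(total + " rows, " + distinct + " distinct");
            spark.stop();
        }
    }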
Multilingual
Written in Scala, Spark also comes with API connectors for Java and Python, as well as an R programming package that allows users to process the very large data sets required by data scientists.
Advanced analytics
Spark comes packaged with several libraries of code to run data analytics
applications. For example, MLlib provides machine learning code for advanced statistical operations, and the Spark Streaming library enables users to analyse data in real time.
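As a hedged sketch of the real-time side (using Spark Structured Streaming rather than the older DStream API, with a hypothetical socket source on localhost:9999), the following keeps a running count of incoming lines:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class StreamingSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder().appName("streaming-sketch").getOrCreate();

            // Hypothetical source: lines of text arriving on a local socket
            Dataset<Row> lines = spark.readStream()
                    .format("socket")
                    .option("host", "localhost")
                    .option("port", 9999)
                    .load();

            // Count all lines seen so far and print the running total to the console
            StreamingQuery query = lines.groupBy().count()
                    .writeStream()
                    .outputMode("complete")
                    .format("console")
                    .start();

            query.awaitTermination();
        }
    }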
Spark Ecosystem