
Hadoop - High Availability Distributed Object-Oriented Platform

History of Hadoop

• Hadoop, an open-source framework for distributed storage and processing of large datasets, has its roots in the early 2000s. The history of Hadoop is closely tied to the evolution of web search and the challenges posed by managing and analysing vast amounts of data.

Google's MapReduce and GFS (Google File System):

• In 2003 and 2004, Google published two influential papers that laid the foundation for Hadoop: "The Google File System" and "MapReduce: Simplified Data Processing on Large Clusters". Google's MapReduce was a programming model and processing engine designed for large-scale data processing across distributed clusters.

• The Google File System (GFS) introduced the concept of a distributed file system capable of handling large-scale data across multiple nodes.

Creation of Hadoop by Doug Cutting and Mike Cafarella

• Inspired by Google's work, Doug Cutting and Mike Cafarella started the
Apache Nutch project in 2002, an open-source web search engine. As part
of this project, they developed an open-source implementation of
Google's MapReduce and GFS.

• In 2006, Doug Cutting joined Yahoo, and the Nutch MapReduce implementation became the core of what would later be known as Hadoop.

Apache Hadoop Project and Growth

• In 2006, the Apache Hadoop project was officially launched as an open-source project under the Apache Software Foundation.

• The Hadoop project quickly gained popularity due to its ability to handle
massive amounts of data cost-effectively.

Definition of Hadoop
• Hadoop is an open-source software framework that is used for storing and
processing large amounts of data in a distributed computing environment.
It is designed to handle big data and is based on the MapReduce
programming model, which allows for the parallel processing of large
datasets.

• It became an Apache Software Foundation project in 2006, based on white papers published by Google in 2003 and 2004 that described the Google File System (GFS) and the MapReduce programming model.
• Hadoop is an open-source framework based on Java that manages the
storage and processing of large amounts of data for applications. Hadoop
uses distributed storage and parallel processing to handle big data and
analytics jobs, breaking workloads down into smaller workloads that can
be run at the same time.

• Four modules comprise the primary Hadoop framework and work collectively to form the Hadoop ecosystem:

• Hadoop Distributed File System (HDFS): As the primary component of the Hadoop ecosystem, HDFS is a distributed file system in which individual Hadoop nodes operate on data that resides in their local storage. This minimizes network latency, providing high-throughput access to application data.

• Yet Another Resource Negotiator (YARN): YARN is a resource-management platform responsible for managing compute resources in clusters and using them to schedule users’ applications. It performs scheduling and resource allocation across the Hadoop system.

• MapReduce: MapReduce is a programming model for large-scale data processing. In the MapReduce model, subsets of larger datasets and instructions for processing the subsets are dispatched to multiple different nodes, where each subset is processed by a node in parallel with other processing jobs. After processing, the results from the individual subsets are combined into a smaller, more manageable dataset.

• Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by other Hadoop modules.
• Beyond HDFS, YARN, and MapReduce, the entire Hadoop open-source
ecosystem continues to grow and includes many tools and applications to
help collect, store, process, analyse, and manage big data. These include
Apache Pig, Apache Hive, Apache HBase, Apache Spark, Presto, and
Apache Zeppelin.

What are blocks?

The smallest quantity of data that HDFS can read or write is called a block. The default size of HDFS blocks is 128 MB, although this can be changed. HDFS files are divided into block-sized portions and stored as separate units. Unlike a traditional file system, if a file in HDFS is smaller than the block size, it does not take up the entire block; for example, a 5 MB file saved in HDFS with a block size of 128 MB takes up just 5 MB of space. The HDFS block size is large mainly to minimize seek costs.
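
As a hedged illustration, the Java sketch below uses Hadoop's FileSystem API to print a file's block size and the DataNodes holding each block. The path /data/sample.txt is hypothetical, and the code assumes HDFS configuration files (core-site.xml, hdfs-site.xml) are on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml / hdfs-site.xml from the classpath, so fs.defaultFS
        // should point at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path used purely for illustration.
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // Block size is a per-file property (128 MB by default unless overridden).
        System.out.println("Block size (bytes): " + status.getBlockSize());
        System.out.println("File length (bytes): " + status.getLen());

        // Each BlockLocation describes one block and the DataNodes holding its replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}

A small file produces a single block entry whose length equals the file size, which matches the point above that files smaller than the block size do not occupy a full block.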

MapReduce framework

MapReduce is a Java-based, distributed execution framework within the Apache Hadoop ecosystem. It takes away the complexity of distributed programming by exposing two processing steps that developers implement: 1) Map and 2) Reduce.

In the Map step, data is split between parallel processing tasks, and transformation logic can be applied to each chunk of data. Once the Map step completes, the Reduce phase takes over to aggregate the data produced by the Map step.

In general, MapReduce uses Hadoop Distributed File System (HDFS) for both
input and output.
It is a software framework for easily writing applications that process large amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. Hadoop requires Java Runtime Environment (JRE) 1.6 or a higher version.

The Map Task – This is the first task; it takes input data and converts it into a set of data in which individual elements are broken down into tuples (key/value pairs).

The Reduce Task – This task takes the output from the map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.

Typically, both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks. A word-count example is sketched below.
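
As a hedged illustration of these two tasks, the classic word-count example below shows a map task that emits (word, 1) tuples and a reduce task that sums them. The class names WordCount, TokenizerMapper, and IntSumReducer are chosen here for illustration:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map task: break each input line into (word, 1) tuples.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);          // emit a key/value pair
            }
        }
    }

    // Reduce task: combine the tuples for each word into a single count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}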

MapReduce is the parallel processing engine that allows Hadoop to go through large data sets in relatively short order. A group of systems on which a distributed application runs is called a cluster, and each machine in a cluster is called a node.

It consists of two parts:

1) Job Tracker (Master node)
2) Task Tracker (Slave node)

The Job Tracker (Master Node) is responsible for resource management, tracking resource consumption and availability, scheduling jobs, monitoring tasks, and rescheduling failed tasks.

The Job Tracker is a single point of failure for the Hadoop MapReduce service: if the Job Tracker goes down, all running jobs are halted. The slave node executes tasks as directed by the master and periodically provides task status information to the master.

Name Node and Data Node

The Name Node is the master node in the Apache Hadoop HDFS architecture that maintains and manages the blocks present on the Data Nodes (slave nodes). The Name Node is a highly available server that manages the file system namespace and controls access to files by clients. The Data Nodes are the slave nodes that store the actual data blocks and serve read and write requests from clients, periodically reporting their status back to the Name Node.

How does MapReduce work?


A MapReduce system is usually composed of three steps (even though it's
generalized as the combination of Map and Reduce operations/functions). The
MapReduce operations are:

Map: The input data is first split into smaller blocks. The Hadoop framework
then decides how many mappers to use, based on the size of the data to be
processed and the memory block available on each mapper server. Each block is
then assigned to a mapper for processing. Each ‘worker’ node applies the map
function to the local data, and writes the output to temporary storage. The
primary (master) node ensures that only a single copy of the redundant input
data is processed.

Shuffle, combine and partition: Worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node. As an optional step, the combiner (a local reducer) can run on each mapper server to reduce the data on that mapper even further, shrinking the data footprint and making shuffling and sorting easier. Partitioning (which is not optional) is the process that decides how the data is presented to the reducers and assigns each key to a particular reducer.

Reduce: A reducer cannot start while a mapper is still in progress. Worker nodes process each group of <key,value> pairs in parallel to produce <key,value> pairs as output. All the map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key. Unlike the map function, which is mandatory to filter and sort the initial data, the reduce function is optional. A driver program that wires these phases together is sketched below.
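
Below is a minimal, hedged sketch of a driver that wires the map, combine, and reduce phases together for the hypothetical WordCount classes sketched earlier. The reducer is reused as the combiner, which is valid here because summing counts is associative and commutative; the input and output paths are supplied on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Map phase: split the input and emit (word, 1) pairs.
        job.setMapperClass(WordCount.TokenizerMapper.class);

        // Optional combine phase: pre-aggregate counts on each mapper node
        // to shrink the data shuffled across the network.
        job.setCombinerClass(WordCount.IntSumReducer.class);

        // Reduce phase: all values for the same key arrive at one reducer.
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setNumReduceTasks(2);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are passed on the command line, e.g.
        //   hadoop jar wordcount.jar WordCountDriver /input /output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}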

YARN

In simple terms, Hadoop YARN is often described as a newer and much-improved version of MapReduce. However, that is not a completely accurate picture, because YARN is also used for scheduling and for the execution of job sequences. YARN is the resource management layer of Hadoop, on which each job runs against the data as a separate Java application.

Acting as the framework's operating system, YARN allows workloads such as batch processing and interactive data handling to run on a single platform. Going well beyond the capabilities of MapReduce, YARN allows programmers to build interactive and real-time streaming applications.
YARN allows programmers to run as many applications as needed on the same cluster. It provides a secure and stable foundation for the operational management and sharing of system resources for maximum efficiency and flexibility. A small client-side sketch follows.
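
As a hedged illustration of YARN acting as the cluster's resource manager, the Java sketch below uses the YarnClient API to ask the ResourceManager for the applications it is tracking. It assumes a yarn-site.xml on the classpath that points at the ResourceManager:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // YarnConfiguration picks up yarn-site.xml, which must identify the ResourceManager.
        Configuration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for every application it knows about.
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId()
                    + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
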
Features of Hadoop

• It is fault tolerant.
• It is highly available.
• Its programming model is easy to use.
• It has huge, flexible storage.
• It has low cost.

Benefits of Hadoop

Scalability

Hadoop is important as one of the primary tools to store and process huge
amounts of data quickly. It does this by using a distributed computing model
which enables the fast processing of data that can be rapidly scaled by adding
computing nodes.

Flexibility

Hadoop allows for flexibility in data storage, as data does not require pre-processing before being stored. This means that an organization can store as much data as it likes and then utilize it later.

Low cost

As an open-source framework that can run on commodity hardware and has a large ecosystem of tools, Hadoop is a low-cost option for the storage and management of big data.

Resilience

As a distributed computing model, Hadoop allows for fault tolerance and system resilience, meaning that if one of the hardware nodes fails, jobs are redirected to other nodes. Data stored on one Hadoop cluster is replicated across other nodes within the system to fortify against the possibility of hardware or software failure.
Challenges in Hadoop

MapReduce complexity and limitations

As a file-intensive system, MapReduce can be a difficult tool to utilize for complex jobs, such as interactive analytical tasks. MapReduce functions also need to be written in Java and can require a steep learning curve. The MapReduce ecosystem is quite large, with many components for different functions, which can make it difficult to determine what tools to use.

Security

• Data sensitivity and protection can be issues as Hadoop handles such large datasets. An ecosystem of tools for authentication, encryption, auditing, and provisioning has emerged to help developers secure data in Hadoop.

Governance and management

Hadoop does not have many robust tools for data management and governance, nor for data quality and standardization.

Talent gap

Like many areas of programming, Hadoop has an acknowledged talent gap. Finding developers with the combined requisite skills in Java to program MapReduce, operating systems, and hardware can be difficult. In addition, MapReduce has a steep learning curve, making it hard to get new programmers up to speed on its best practices and ecosystem.

What is Apache Hadoop used for?

Analytics and big data

A wide variety of companies and organizations use Hadoop for research, production data processing, and analytics that require processing terabytes or petabytes of big data, storing diverse datasets, and data parallel processing.

Data storage and archiving

As Hadoop enables mass storage on commodity hardware, it is useful as a low-cost storage option for all kinds of data, such as transactions, click streams, or sensor and machine data.
Data lakes

Since Hadoop can help store data without pre-processing, it can be used to complement data lakes, where large amounts of unrefined data are stored.

Risk management

Banks, insurance companies, and other financial services companies use Hadoop to build risk analysis and management models.

Marketing analytics

Marketing departments often use Hadoop to store and analyse customer relationship management (CRM) data.

AI and machine learning

Hadoop ecosystems help with the processing of data and model training
operations for machine learning applications.

In what ways does Hadoop operate in situations such as ingesting, cleansing, and transforming data when working with various data sources and formats?

Hadoop, as part of its broader ecosystem, provides several tools and frameworks to address the challenges associated with data ingestion, cleansing, and transformation, especially when dealing with diverse data sources and formats. Here are some key components and approaches within the Hadoop ecosystem:

Apache Hive:
Data Transformation: Apache Hive provides a data warehousing and SQL-like
query language (HiveQL) for Hadoop. Users can define and execute ETL
(Extract, Transform, Load) processes using Hive queries, making it easier to
transform and structure data stored in Hadoop.

Apache Spark:
Data Transformation: Apache Spark, while not limited to Hadoop, is often used
within the Hadoop ecosystem for data processing. It offers in-memory data
processing and supports complex data transformations. Spark's DataFrame API and SQL capabilities make it a powerful tool for data cleansing and transformation tasks, as sketched below.
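
The following is a hedged Java sketch of this kind of cleansing and transformation with Spark's DataFrame API. The HDFS paths and the column names order_id, customer_name, and amount are hypothetical:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lower;
import static org.apache.spark.sql.functions.trim;

public class CleanseOrders {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cleanse-orders")
                .getOrCreate();

        // Hypothetical CSV of raw orders sitting in HDFS.
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs:///raw/orders.csv");

        // Basic cleansing: drop rows missing a key field, remove duplicates,
        // normalise a free-text column, and filter out invalid amounts.
        Dataset<Row> clean = raw
                .na().drop(new String[]{"order_id"})
                .dropDuplicates(new String[]{"order_id"})
                .withColumn("customer_name", trim(lower(col("customer_name"))))
                .filter(col("amount").gt(0));

        // Write the transformed data back to HDFS in a columnar format.
        clean.write().mode("overwrite").parquet("hdfs:///curated/orders");

        spark.stop();
    }
}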

Apache NiFi:

Data Ingestion: Apache NiFi is a powerful tool for data ingestion. It provides a
web-based user interface for designing data flows, making it easy to collect,
transfer, and process data from various sources to Hadoop clusters.

Data Cleansing: NiFi supports data enrichment, validation, and cleansing through its processors. Users can define custom data transformation rules to clean and enrich the data during the ingestion process.

Apache Flume:

Data Ingestion: Apache Flume is designed for efficient and reliable large-scale
data ingestion into Hadoop. It allows the collection, aggregation, and movement
of log data from various sources to Hadoop Distributed File System (HDFS) or
other data stores.

Data Transformation: Flume can be configured with custom interceptors that enable basic data transformation tasks during the ingestion process.

What are the best practices to optimise Hadoop cluster performance?

1) Configuration parameters

One of the first steps to optimize your Hadoop cluster's performance is to tune
the configuration parameters that control the behavior of the cluster
components, such as the Hadoop Distributed File System (HDFS), the
MapReduce engine, and the YARN resource manager. These parameters can
affect the memory allocation, the parallelism, the replication factor, the block
size, the shuffle and sort operations, and the compression methods. You should
adjust these parameters according to your data characteristics, your cluster size,
and your application requirements. You can use tools like Ambari or Cloudera
Manager to manage and optimize your configuration settings.
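
As a hedged illustration, the Java sketch below sets a few of these parameters programmatically on a job's Configuration. The property names are standard Hadoop keys, but the values are purely illustrative and should be tuned to your own data, cluster size, and application requirements:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobConfig {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();

        // Illustrative values only; the right numbers depend on your workload.
        conf.set("dfs.replication", "2");                          // replication factor for output files
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);         // 256 MB HDFS block size for new files
        conf.setInt("mapreduce.map.memory.mb", 2048);              // memory per map container
        conf.setInt("mapreduce.reduce.memory.mb", 4096);           // memory per reduce container
        conf.setInt("mapreduce.task.io.sort.mb", 512);             // sort buffer used during the shuffle
        conf.setBoolean("mapreduce.map.output.compress", true);    // compress intermediate map output
        conf.set("mapreduce.map.output.compress.codec",
                 "org.apache.hadoop.io.compress.SnappyCodec");

        return Job.getInstance(conf, "tuned-job");
    }
}

The same keys can instead be set cluster-wide in hdfs-site.xml, mapred-site.xml, and yarn-site.xml, or managed through tools like Ambari or Cloudera Manager as noted above.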

2) Hardware and software

Another important factor that can influence your Hadoop cluster's performance
is the choice of hardware and software for your nodes. You should consider the
CPU speed, the RAM size, the disk type and capacity, the network bandwidth,
and the operating system of your nodes. You should also ensure that your nodes
are compatible with the Hadoop version and the Java version that you are using.
You should avoid overloading your nodes with too many processes or tasks, and
you should distribute your data evenly across your nodes. You should also
upgrade your hardware and software regularly to take advantage of the latest
features and improvements.

3) Workload balancing

A common challenge that Hadoop cluster administrators face is how to balance the workload among the nodes and avoid bottlenecks or failures. You should design your applications to use the optimal number of mappers and reducers, and to avoid skewness or imbalance in the input or output data. You should also use partitioners, combiners, and custom serializers to reduce the network traffic and the disk I/O; a partitioner sketch follows. You should also schedule your jobs according to their priority, their duration, and their resource consumption. You can use tools like Oozie or Airflow to orchestrate your workflows and dependencies.
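
As one hedged example of the kind of custom partitioner mentioned above, the sketch below routes keys to reducers by their first letter. It is illustrative only (a skewed key distribution would call for a smarter scheme) and would be registered on a job with job.setPartitionerClass(FirstLetterPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: keys starting with the same letter go to the same reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (k.isEmpty()) {
            return 0;
        }
        // char values are non-negative, so the modulo result is a valid partition index.
        return Character.toLowerCase(k.charAt(0)) % numPartitions;
    }
}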

4) Cluster monitoring

A crucial aspect of optimizing your Hadoop cluster's performance is to monitor its health and performance metrics, such as the CPU utilization, the memory
usage, the disk throughput, the network latency, the garbage collection, and the
error logs. You should also monitor the status and the progress of your jobs,
such as the number of running, completed, failed, or killed tasks, and the
execution time and the counters of each task. You should use tools like Ganglia,
Nagios, or Prometheus to collect and visualize your cluster metrics, and tools
like Flume, Kafka, or Logstash to collect and analyse your cluster logs. You
should also use tools like Spark or Hive to query and process your data
efficiently.

Apache Spark

• It was originally developed at UC (University of California) Berkeley in 2009 to overcome the limitations of Hadoop.

• Apache Spark is a lightning-fast, open-source data-processing engine for machine learning and AI applications, backed by the largest open-source community in big data.

• Apache Spark (Spark) easily handles large-scale data sets and is a fast, general-purpose cluster computing system that is well suited to PySpark.

• It is designed to deliver the computational speed, scalability, and programmability required for big data—specifically for streaming data, graph data, analytics, machine learning, large-scale data processing, and artificial intelligence (AI) applications.

• It scales by distributing processing workflows across large clusters of computers, with built-in parallelism and fault tolerance. It even includes APIs for programming languages that are popular among data analysts and data scientists, including Scala, Java, Python, and R.

Resilient Distributed Dataset (RDD)

• An RDD is Spark's core data abstraction: an immutable, fault-tolerant collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. The features below are built around this abstraction (see the sketch after this list).

• Speed − Spark helps to run an application on a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; Spark stores the intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java,
Scala, or Python. Therefore, you can write applications in different
languages.
• Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It
also supports SQL queries, Streaming data, Machine learning (ML), and
Graph algorithms.
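
Here is a minimal, hedged Java sketch of working with an RDD. It assumes the job is launched with spark-submit, which supplies the master URL, and the data is a small illustrative in-memory list:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-example");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // An RDD built from an in-memory collection, partitioned across the cluster.
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // Transformations are lazy; nothing runs until an action is called.
            JavaRDD<Integer> squares = numbers.map(n -> n * n);

            // reduce() is an action: it triggers execution and returns a result.
            int sumOfSquares = squares.reduce(Integer::sum);
            System.out.println("Sum of squares: " + sumOfSquares);

            // Cache an RDD in memory when it will be reused; keeping intermediate
            // data in memory is where much of Spark's speed advantage comes from.
            squares.cache();
            System.out.println("Count: " + squares.count());
        }
    }
}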

Benefits of Apache Spark

Speed
Because data is organised to scale in-memory processing across distributed
cluster nodes, and because Spark can do processing without having to write data
back to disk storage, it can perform up to 100 times faster than MapReduce on
batch jobs when processing in memory and ten times faster on disk.
Multilingual
Written in Scala, Spark also comes with API connectors for using Java and
Python, as well as an R programming package that allows users to process very
large data sets required by data scientists.
Advanced analytics
Spark comes packaged with several libraries of code to run data analytics
applications. For example, the MLlib has machine learning code for advanced
statistical operations, and the Spark Streaming library enables users to analyse
data in real time.

Increased access to Big Data


Spark separates storage and compute, which allows customers to scale each to
accommodate the performance needs of analytics applications. And it
seamlessly performs batch jobs to move data into a data lake or data warehouse
for advanced analytics.
Ease of use
Due to Spark’s more efficient way of distributing data across nodes and clusters,
it can perform parallel data processing and data abstraction. And its ability to tie
together multiple types of databases and compute data from many types of data
stores allows it to be used across multiple use cases.
Power
Spark can handle a huge volume of data – as much as several petabytes
according to proponents. And it allows users to perform exploratory data
analysis on this petabyte-scale data without needing to down sample.

Spark Ecosystem

• Engine — Spark Core: It is the basic core component of the Spark ecosystem on top of which the entire ecosystem is built. It performs the tasks of scheduling/monitoring and basic I/O functionality.
• Management — Spark cluster can be managed by Hadoop YARN,
Mesos or Spark cluster manager.
• Library — The Spark ecosystem comprises Spark SQL (for running SQL-like queries on RDDs or data from external sources), Spark MLlib (for machine learning), Spark GraphX (for constructing graphs for better visualisation of data), and Spark Streaming (for batch processing and streaming of data in the same application). A Spark SQL sketch is shown after this list.
• Programming can be done in Python, Java, Scala and R
• Storage — Data can be stored in HDFS, S3, local storage and it
supports both SQL and NoSQL databases.
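
As a hedged illustration of the Spark SQL library from this list, the Java sketch below registers a DataFrame as a temporary view and queries it with SQL. The HDFS path and the page column are hypothetical; when spark.master points at YARN, the job runs on the Hadoop cluster and reads directly from HDFS:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-sql-example")
                .getOrCreate();

        // Hypothetical JSON dataset of page visits stored in HDFS.
        Dataset<Row> visits = spark.read().json("hdfs:///logs/visits.json");

        // Register the DataFrame as a temporary view so it can be queried with SQL.
        visits.createOrReplaceTempView("visits");

        Dataset<Row> topPages = spark.sql(
                "SELECT page, COUNT(*) AS hits " +
                "FROM visits GROUP BY page ORDER BY hits DESC LIMIT 10");

        topPages.show();
        spark.stop();
    }
}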
