
UNIT-2

Hadoop:
Introduction to Hadoop:
Apache Hadoop software is an open source framework that allows for the distributed storage and
processing of large datasets across clusters of computers using simple programming models. Its
framework is based on Java programming with some native code in C and shell scripts.
Hadoop is designed to scale up from a single computer to thousands of clustered machines, with
each machine offering local computation and storage. In this way, Hadoop can efficiently store
and process large datasets ranging in size from gigabytes to petabytes. It is designed to handle
big data and is based on the MapReduce programming model, which allows for the parallel
processing of large datasets.
There are mainly two problems with big data: the first is storing such a huge amount of data,
and the second is processing that stored data. A traditional approach such as an RDBMS is not
sufficient because of the heterogeneity of the data, so Hadoop emerged as the solution to both
problems, i.e. storing and processing big data, with some extra capabilities.
Difference between RDBMS and Apache Hadoop:
1. RDBMS: a traditional row-column based database, used for data storage, manipulation and retrieval. Hadoop: an open-source framework used for storing data and running applications or processes concurrently.
2. RDBMS: mostly processes structured data. Hadoop: processes both structured and unstructured data.
3. RDBMS: best suited for an OLTP environment. Hadoop: best suited for big data.
4. RDBMS: less scalable than Hadoop. Hadoop: highly scalable.
5. RDBMS: data normalization is required. Hadoop: data normalization is not required.
6. RDBMS: stores transformed and aggregated data. Hadoop: stores huge volumes of raw data.
7. RDBMS: responds with very low latency. Hadoop: has some latency in response.
8. RDBMS: the data schema is static. Hadoop: the data schema is dynamic.
9. RDBMS: provides high data integrity. Hadoop: provides lower data integrity than an RDBMS.
10. RDBMS: cost applies for licensed software. Hadoop: free of cost, as it is open-source software.
History of Apache Hadoop:
Hadoop was started by Doug Cutting and Mike Cafarella in 2002, when they began working on
the Apache Nutch project. Apache Nutch was an effort to build a search engine system that
could index one billion pages. After a lot of research, they concluded that such a system would
cost around half a million dollars in hardware, along with a monthly running cost of
approximately $30,000, which was very expensive. They realized that their project architecture
would not be capable of handling billions of pages on the web, so they looked for a feasible
solution that could reduce the implementation cost as well as solve the problem of storing and
processing large datasets.
In 2003, they came across a paper published by Google that described the architecture of
Google's distributed file system, GFS (Google File System), for storing large data sets. They
realized that this paper could solve their problem of storing the very large files generated by the
web crawling and indexing processes. But this paper was only half the solution. In 2004, Google
published another paper, on the MapReduce technique, which was the solution for processing
those large datasets; for Doug Cutting and Mike Cafarella, it was the other half of the solution
for the Nutch project. Both techniques (GFS and MapReduce) existed only as papers from
Google; Google did not release its implementations. Doug Cutting knew from his work on
Apache Lucene (a free and open-source information retrieval software library, originally written
in Java by Doug Cutting in 1999) that open source is a great way to spread a technology to more
people. So, together with Mike Cafarella, he started implementing Google's techniques (GFS and
MapReduce) as open source in the Apache Nutch project.
In 2005, Cutting found that Nutch was limited to clusters of only 20 to 40 nodes. He soon
recognized two problems: (a) Nutch would not achieve its potential until it ran reliably on larger
clusters, and (b) that looked impossible with just two people (Doug Cutting and Mike Cafarella).
The engineering task in the Nutch project was much bigger than he had realized, so he started
looking for a company interested in investing in their efforts, and he found Yahoo!, which had a
large team of engineers eager to work on the project. In 2006, Doug Cutting joined Yahoo along
with the Nutch project. He wanted to provide the world with an open-source, reliable, scalable
computing framework, with the help of Yahoo. At Yahoo he first separated the distributed
computing parts from Nutch and formed a new project, Hadoop. (He gave it the name Hadoop
after a yellow toy elephant owned by his son; it was easy to pronounce and was a unique word.)
He then wanted to make Hadoop work well on thousands of nodes, so he continued to build
Hadoop around GFS and MapReduce.
In 2007, Yahoo successfully tested Hadoop on a 1000-node cluster and started using it. In
January 2008, Yahoo released Hadoop as an open-source project to the ASF (Apache Software
Foundation), and in July 2008 the Apache Software Foundation successfully tested a 4000-node
cluster with Hadoop. In 2009, Hadoop was successfully used to sort a petabyte (PB) of data in
less than 17 hours, handling billions of searches and indexing millions of web pages, and Doug
Cutting left Yahoo and joined Cloudera to take on the challenge of spreading Hadoop to other
industries. In December 2011, the Apache Software Foundation released Apache Hadoop
version 1.0. Version 2.0.6 became available in August 2013, and the current major release,
Apache Hadoop version 3.0, came out in December 2017.
Benefits of Hadoop
Scalability
Hadoop is important as one of the primary tools to store and process huge amounts of data
quickly. It does this by using a distributed computing model which enables the fast processing
of data that can be rapidly scaled by adding computing nodes.
Low cost
As an open source framework that can run on commodity hardware and has a large ecosystem
of tools, Hadoop is a low-cost option for the storage and management of big data.
Flexibility
Hadoop allows for flexibility in data storage, since data does not require preprocessing before
being stored. This means an organization can store as much data as it likes and decide how to
use it later.
Resilience
As a distributed computing model, Hadoop provides fault tolerance and system resilience: if one
of the hardware nodes fails, jobs are redirected to other nodes. Data stored on a Hadoop cluster
is replicated across other nodes within the system to guard against the possibility of hardware or
software failure.

Challenges of Hadoop
MapReduce complexity and limitations
As a file-intensive system, MapReduce can be a difficult tool to utilize for complex jobs, such as
interactive analytical tasks. MapReduce functions also need to be written in Java and can require
a steep learning curve. The MapReduce ecosystem is quite large, with many components for
different functions that can make it difficult to determine what tools to use.
Security
Data sensitivity and protection can be issues as Hadoop handles such large datasets. An
ecosystem of tools for authentication, encryption, auditing, and provisioning has emerged to help
developers secure data in Hadoop.
Governance and management
Hadoop does not have many robust tools for data management and governance, nor for data
quality and standardization.
Talent gap
Like many areas of programming, Hadoop has an acknowledged talent gap. Finding developers
with the combined requisite skills in Java to program MapReduce, operating systems, and
hardware can be difficult. In addition, MapReduce has a steep learning curve, making it hard to
get new programmers up to speed on its best practices and ecosystem.

MASTER AND SLAVE NODES


Master and slave nodes form the HDFS cluster. The name node is called the master, and the data
nodes are called the slaves.

The name node is responsible for the workings of the data nodes. It also stores the metadata.
The data nodes read, write, process, and replicate the data. They also send signals, known as
heartbeats, to the name node. These heartbeats show the status of the data node.

Consider that 30 TB of data is loaded into the cluster. The name node distributes it across the
data nodes, and the data is replicated among those data nodes, so that each block ends up on
several machines.
Replication of the data is performed three times by default. It is done this way so that if a
commodity machine fails, you can replace it with a new machine that holds the same data.
HADOOP DISTRIBUTED FILE SYSTEM:
HDFS stands for Hadoop Distributed File System. It is the primary data storage system in
Hadoop applications: a distributed file system whose storage is spread across the machines of
the cluster. In HDFS, data is written once to the cluster and is then read many times according
to need. The goals of HDFS are as follows.
 The ability to recover from hardware failures in a timely manner
 Access to Streaming Data
 Accommodation of Large data sets
 Portability
The Hadoop Distributed File System includes two types of nodes: the Name Node and the Data
Node.
Name Node:
The Name Node is the primary component of HDFS. It maintains the file system namespace.
Actual data is not stored on the Name Node; it stores only metadata, such as file-to-block
mappings, block locations, permissions, and so on.
Data Node:
Data Node follows the instructions given by the Name Node. Data Nodes are also known as
‘slave Nodes’. These nodes store the actual data provided by the client and simply follow the
commands of the Name Node.
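As an illustration of this write-once, read-many model, the following sketch uses the Hadoop Java FileSystem API to write a small file into HDFS, request three replicas of it, and read it back. The file path and replication factor are example values only, and the snippet assumes the cluster address is provided by core-site.xml/hdfs-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath;
        // the Name Node address comes from fs.defaultFS.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/sample.txt");   // illustrative path

        // Write once: the Name Node records the metadata, the Data Nodes store the blocks.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Ask for three replicas of the file's blocks (the HDFS default).
        fs.setReplication(file, (short) 3);

        // Read many times.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
    }
}

Here the Name Node keeps only the metadata for the file, while the Data Nodes hold the actual blocks and their replicas.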
Job Tracker:
The primary function of the Job Tracker is resource management. The Job Tracker determines
the location of the data by communicating with the Name Node. It also finds the best available
Task Tracker and tracks the execution of MapReduce jobs from the client to the slave nodes. In
Hadoop, there is only one instance of the Job Tracker. The Job Tracker monitors the individual
Task Trackers, tracks their status, and oversees the execution of MapReduce in Hadoop.
Task Tracker:
The Task Tracker is the slave daemon in the cluster, which accepts instructions from the Job
Tracker. It runs in its own process and monitors its tasks, capturing their output and exit status.
The Task Tracker carries out the map, shuffle and reduce data operations, and it provides a set
of slots in which the different tasks run. It continuously updates the Job Tracker with its status
and reports the number of slots available in the cluster. If a Task Tracker becomes unresponsive,
the Job Tracker assigns its work to other nodes.
Types of Hadoop File Formats
Hive and Impala tables in HDFS can be created using four different Hadoop file formats:
 Text files
 Sequence File
 Avro data files
 Parquet file format
1. Text files
A text file is the most basic and a human-readable file. It can be read or written in any
programming language and is mostly delimited by comma or tab.
The text file format consumes more space when a numeric value needs to be stored as a string. It
is also difficult to represent binary data such as an image.
2. Sequence File
The SequenceFile format can be used to store an image in binary form. Sequence files store
key-value pairs in a binary container format and are more efficient than a text file. However,
sequence files are not human-readable.
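As a rough sketch of the idea (the file path, key class, and value class below are assumptions made for illustration), a sequence file can be written with Hadoop's Java API as follows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/images.seq");   // illustrative path

        // Key = file name, value = raw bytes (e.g. an image) inside a binary container.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            byte[] imageBytes = {1, 2, 3};               // stand-in for real binary data
            writer.append(new Text("photo-001.jpg"), new BytesWritable(imageBytes));
        }
    }
}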
3. Avro Data Files
The Avro file format has efficient storage due to optimized binary encoding. It is widely
supported both inside and outside the Hadoop ecosystem.
The Avro file format is ideal for long-term storage of important data. It can be read from and
written to in many languages, such as Java, Scala and so on. Schema metadata can be embedded
in the file to ensure that it will always be readable, and schema evolution can accommodate
changes over time. The Avro file format is considered the best choice for general-purpose
storage in Hadoop.
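The following is a minimal, illustrative sketch of writing an Avro data file with the Java API; the record schema and field names are invented for the example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws Exception {
        // The schema is embedded in the file, which is what keeps it readable over time.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Asha");
        user.put("age", 30);

        // Records are written with Avro's optimized binary encoding.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(user);
        }
    }
}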
4. Parquet File Format
Parquet is a columnar format developed by Cloudera and Twitter. It is supported in Spark,
MapReduce, Hive, Pig, Impala, Crunch, and so on. Like Avro, schema metadata is embedded in
the file.
The Parquet file format uses advanced optimizations described in Google's Dremel paper. These
optimizations reduce storage space and increase performance, and some of them rely on
identifying repeated patterns in the data. Parquet works best when records are written in large
batches and when queries read only a subset of the columns.
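As a hedged sketch of one common way to write Parquet files from Java, the example below uses the parquet-avro module (an assumption about the build setup; exact builder methods vary somewhat across Parquet versions) and reuses an Avro schema, so the schema metadata is embedded in the file just as with Avro.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetWriteExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Asha");
        user.put("age", 30);

        // Data is laid out column by column on disk; the schema travels with the file.
        try (ParquetWriter<GenericRecord> writer =
                 AvroParquetWriter.<GenericRecord>builder(new Path("users.parquet"))
                                  .withSchema(schema)
                                  .build()) {
            writer.write(user);
        }
    }
}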
HADOOP ECOSYSTEM
Overview: Apache Hadoop is an open-source framework intended to make interaction with
big data easier. However, for those who are not acquainted with this technology, one question
arises: what is big data? Big data is a term given to data sets that cannot be processed
efficiently with traditional methodology such as an RDBMS. Hadoop has made its place in the
industries and companies that need to work on large data sets which are sensitive and need
efficient handling. Hadoop is a framework that enables the processing of large data sets which
reside in the form of clusters. Being a framework, Hadoop is made up of several modules that
are supported by a large ecosystem of technologies.
Introduction: The Hadoop ecosystem is a platform or suite which provides various services to
solve big data problems. It includes Apache projects and various commercial tools and
solutions. There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and the
Hadoop Common utilities. Most of the other tools or solutions are used to supplement or support
these major elements. All these tools work collectively to provide services such as the ingestion,
analysis, storage and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:

 HDFS: Hadoop Distributed File System


 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing the cluster
 Oozie: Job Scheduling

HDFS:
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes, thereby maintaining
the metadata in the form of log files.
HDFS consists of two core components i.e.
1. Name node
2. Data Node
 The Name Node is the prime node; it contains the metadata (data about data) and requires
comparatively fewer resources than the data nodes, which store the actual data. These data
nodes are commodity hardware in the distributed environment, which undoubtedly makes
Hadoop cost-effective.
 HDFS maintains all the coordination between the clusters and hardware, thus working at
the heart of the system.
YARN:
Yet Another Resource Negotiator: as the name implies, YARN is the component that helps to
manage the resources across the clusters. In short, it performs scheduling and resource
allocation for the Hadoop system.
 It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
 The Resource Manager has the privilege of allocating resources for the applications in the
system, whereas the Node Managers work on the allocation of resources such as CPU,
memory and bandwidth per machine and later acknowledge the Resource Manager. The
Application Manager works as an interface between the Resource Manager and the Node
Managers and performs negotiations as per the requirements of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce makes it possible to carry
over the processing logic and helps to write applications which transform big data sets into a
manageable one.
MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are as follows (a
word-count sketch is given after this list):
1. Map() performs sorting and filtering of the data and thereby organizes it into groups.
Map() generates key-value-pair based results which are later processed by the
Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped
data. In simple terms, Reduce() takes the output generated by Map() as input and
combines those tuples into a smaller set of tuples.
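The following word-count sketch shows the division of labour between Map() and Reduce() using the Hadoop Java MapReduce API. The class names are illustrative, and a separate Job driver (not shown) would set the input and output paths and wire the two classes together.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map(): emits a (word, 1) key-value pair for every word in its input split.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce(): receives all values emitted for one key and aggregates them into a single count.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}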
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
 It is a platform for structuring the data flow and for processing and analyzing huge data sets.
 Pig does the work of executing commands, and in the background all the activities of
MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on Pig Runtime,
just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and hence is a major segment
of the Hadoop ecosystem.
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of
large data sets. Its query language is called HQL (Hive Query Language).
 It is highly scalable, as it allows both real-time (interactive) and batch processing. All the
SQL data types are supported by Hive, which makes query processing easier.
 Similar to other query-processing frameworks, HIVE comes with two
components: the JDBC drivers and the HIVE command line.
 JDBC, along with the ODBC drivers, works on establishing data storage permissions and
connections, whereas the HIVE command line helps in the processing of queries (a JDBC
example is given after this list).
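A minimal sketch of querying Hive through its JDBC driver from Java is shown below. The connection URL, database, table and credentials are assumptions for illustration; HiveServer2 conventionally listens on port 10000, and the Hive JDBC driver jar must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (the driver jar must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 endpoint; host, port, database and credentials are illustrative.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = conn.createStatement();
             // HQL looks like SQL; behind the scenes it is compiled into jobs on the cluster.
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString("word") + " -> " + rs.getLong("cnt"));
            }
        }
    }
}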
Mahout:
Mahout brings machine learning capability to a system or application. Machine learning, as the
name suggests, helps a system to improve itself based on patterns, user/environment interaction,
or algorithms. Mahout provides various libraries and functionalities such as collaborative
filtering, clustering and classification, which are core concepts of machine learning. It allows
algorithms to be invoked as per our need with the help of its own libraries.
Apache Spark:
It is a platform that handles all the compute-intensive tasks, such as batch processing,
interactive or iterative real-time processing, graph processing, and visualization.
 It processes data in memory and hence is faster than MapReduce in terms of optimization.
 Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for
structured data or batch processing; hence both are used together in most companies (a short
word-count example using Spark's Java API is given after this list).
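For comparison with MapReduce, here is a minimal in-memory word count using Spark's Java API. Spark 2.x or later is assumed, and the local master URL and input path are illustrative values only.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // "local[*]" runs Spark inside this JVM; on a cluster this would point at YARN instead.
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/sample.txt"); // illustrative path
            lines.flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())  // split into words
                 .mapToPair(word -> new Tuple2<>(word, 1))                       // (word, 1) pairs
                 .reduceByKey((a, b) -> a + b)                                   // aggregate in memory
                 .collect()
                 .forEach(pair -> System.out.println(pair._1() + " -> " + pair._2()));
        }
    }
}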
Apache HBase:

 It is a NoSQL database which supports all kinds of data and is thus capable of handling
almost anything in a Hadoop database. It provides the capabilities of Google's BigTable and
is therefore able to work on big data sets effectively.
 At times when we need to search for or retrieve a small number of records in a huge
database, the request must be processed within a very short span of time. At such times,
HBase comes in handy, as it gives us a fault-tolerant way of storing sparse data and looking
it up quickly (a small Java client example is given after this list).
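A small, illustrative sketch of this kind of fast point lookup with the HBase Java client API is shown below. It assumes a table named users with a column family info already exists, and that hbase-site.xml on the classpath points at the cluster.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // hbase-site.xml on the classpath tells the client where ZooKeeper and the cluster are.
        try (Connection connection =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "u1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("u1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Point lookup by row key: the fast, small-read case described above.
            Result result = table.get(new Get(Bytes.toBytes("u1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}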
Other Components: Apart from all of these, there are some other components too that carry
out a huge task in order to make Hadoop capable of processing large datasets. They are as
follows:
 Solr, Lucene: These are two services that perform the task of searching and indexing with
the help of Java libraries. Lucene is a Java library that also provides a spell-check
mechanism, and Solr is built on top of Lucene.
 Zookeeper: There was a huge issue with the management, coordination and synchronization
of the resources or components of Hadoop, which often resulted in inconsistency.
Zookeeper overcame all these problems by performing synchronization, inter-component
communication, grouping, and maintenance.
 Oozie: Oozie simply performs the task of a scheduler: it schedules jobs and binds them
together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie
coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially
ordered manner, whereas Oozie coordinator jobs are those that are triggered when some
data or an external stimulus is given to them.
