Unit 3-1
NOTES
HISTORY OF HADOOP
▪ Created by Doug Cutting.
▪ He was the creator of Apache Lucene (a widely used text search library).
▪ Apache Nutch (an open-source web search engine) was started in 2002 as a part
of the Lucene project.
▪ Hadoop has its origins in Apache Nutch; it was split out as a separate project in 2006.
▪ Hadoop is a made-up name.
▪ The name was given by Doug Cutting’s kid to a yellow stuffed elephant.
▪ Projects in the Hadoop ecosystem have names that are unrelated to their function, often
based on elephants or other animals.
▪ 2004:- Nutch developers set about writing an open-source implementation of Google’s
distributed filesystem, called the Nutch Distributed File System (NDFS).
▪ 2004:- Google published the paper that introduced MapReduce.
▪ January 2008:- Hadoop became its own top-level project at Apache.
▪ By that time Hadoop was being used by many other companies like:
✓ Yahoo!
✓ Facebook
✓ New York Times
▪ April 2008:- Hadoop set a world record as the fastest system to sort an entire
terabyte of data.
▪ Running on a 910-node cluster, Hadoop sorted 1 TB in 209 seconds (about 3.5 minutes).
▪ November 2008:- Google’s MapReduce implementation sorted 1 TB in 68 seconds.
▪ April 2009:- Yahoo!’s Hadoop cluster sorted 1 TB in 62 seconds.
▪ Hadoop is now widely used in mainstream enterprises.
▪ Hadoop is used as a general-purpose storage & analysis platform for big data.
▪ Hadoop is supported by established vendors like:
✓ EMC
✓ IBM
✓ Microsoft
✓ Oracle
▪ Specialist Hadoop companies are:
✓ Cloudera
✓ Hortonworks
FEATURES OF HADOOP
1) Storage & processing of Big Data:- Processes Big Data having the 3V (volume, velocity,
variety) characteristics.
2) Open-Source Framework:-
o Open-source access & cloud services enable large data stores.
o Hadoop uses a cluster of multiple inexpensive servers or the cloud.
3) Java & Linux based:-
o Hadoop uses Java interfaces.
o Hadoop’s base is Linux.
o Hadoop also supports its own set of shell commands.
4) Fault-efficient and scalable:-
o The system provides servers with high scalability.
o The system scales by adding new nodes to handle larger amounts of data.
5) Flexible modular design:-
o Simple & modular programming model.
o Hadoop is very helpful in:
✓ Storing
✓ Manipulating
✓ Processing
✓ Analysing Big Data
o Modular functions make the system flexible.
o One can add or replace components with ease.
6) Robust design of HDFS:-
o Execution of big data applications continues even if an individual server or cluster
node fails.
o This is because Hadoop provides backup & data recovery mechanisms.
o Hence HDFS has high reliability.
7) Hardware fault tolerant:-
o A hardware fault does not affect data & application processing.
o If a node goes down, other nodes take over its work.
o This is due to multiple copies of all data blocks, which are replicated automatically.
o By default, 3 copies of each data block are kept.
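▪ For illustration only, the hedged Java sketch below uses the HDFS FileSystem API to read and change the replication factor of a single file; the path and the new factor of 2 are made-up example values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: inspect and change the replication factor of one HDFS file.
public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();    // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);        // client handle to HDFS
        Path file = new Path("/data/sample.txt");    // hypothetical file

        // Current replication factor of this file (3 by default)
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication: " + current);

        // Ask HDFS to keep only 2 copies of this particular file
        fs.setReplication(file, (short) 2);

        fs.close();
    }
}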
NAMENODE
▪ The name node is the commodity hardware that contains the GNU/Linux operating
system and the name node software.
▪ It is software that can be run on commodity hardware.
▪ The system having the name node acts as the master server and it does the following
tasks:
✓ Manages the file system namespace.
✓ Regulates clients’ access to files.
✓ Executes file system operations such as renaming, closing, and opening files
and directories.
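▪ The operations above are namespace operations: a client issues them through the FileSystem API and the name node updates its metadata. A minimal, hedged Java sketch (directory and file names are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Each call below triggers a namespace operation handled by the name node.
public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/reports"));                                        // create a directory
        fs.rename(new Path("/reports/old.csv"), new Path("/reports/new.csv")); // rename a file

        // Listing a directory returns metadata that the name node maintains
        for (FileStatus status : fs.listStatus(new Path("/reports"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}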
DATANODE
▪ The data node is commodity hardware having the GNU/Linux operating system
and the data node software.
▪ For every node (Commodity hardware/System) in a cluster, there will be a data node.
▪ These nodes manage the data storage of their system.
▪ Datanodes perform read-write operations on the file system, as per client requests.
▪ They also perform operations such as block creation, deletion, and replication
according to the instructions of the Namenode.
BLOCK
▪ Generally, the user data is stored in the files of HDFS.
▪ A file in HDFS is divided into one or more segments, which are stored in
individual data nodes.
▪ These file segments are called blocks.
▪ In other words, the minimum amount of data that HDFS can read or write is called a
Block.
▪ The default block size is 64 MB, but it can be increased as needed by changing the
HDFS configuration.
▪ The data stored in these blocks enables running distributed applications, including
analytics, data mining, and OLAP, on the cluster.
▪ Hadoop HDFS features are as follows:
✓ Create, append, delete, rename and attribute-modification functions.
✓ The content of individual files cannot be modified or replaced, but new data
can be appended at the end of a file.
✓ It is suitable for distributed storage and processing.
✓ Hadoop provides a command interface to interact with HDFS.
✓ The built-in servers of the name node and data node help users easily check
the status of the cluster.
✓ Streaming access to file system data.
✓ HDFS provides file permissions and authentication.
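▪ To make the block size and append-only behaviour concrete, here is a hedged Java sketch using the FileSystem API; the file path, the 128 MB block size and the buffer size are illustrative choices, and append must be supported by the cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: create a file with an explicit block size, then append to it.
// Existing content cannot be modified in place; appending is allowed.
public class BlockAndAppend {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/logs/events.log");        // hypothetical path

        long blockSize = 128L * 1024 * 1024;             // 128 MB instead of the default
        short replication = 3;
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(file, true, bufferSize, replication, blockSize);
        out.writeBytes("first record\n");
        out.close();

        // Write-once semantics: existing bytes stay as they are, new data goes at the end
        FSDataOutputStream appendOut = fs.append(file);
        appendOut.writeBytes("appended record\n");
        appendOut.close();

        System.out.println("Block size in use: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}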
COMPONENTS OF HADOOP
▪ Apache Pig:
✓ Software for analysing large data sets, consisting of a high-level language
(similar to SQL) for expressing data analysis programs, coupled with
infrastructure for evaluating these programs.
✓ It contains a compiler that produces sequences of MapReduce programs.
▪ HBase:
✓ A non-relational columnar distributed database designed to run on top of
Hadoop Distributed File system (HDFS).
✓ It is written in Java and modelled after Google’s Bigtable.
✓ HBase is an example of a NoSQL data store.
▪ Hive:
✓ It is a data warehousing application that provides an SQL-like interface and a
relational model.
✓ The Hive infrastructure is built on top of Hadoop and helps in providing data
summarization, query, and analysis.
▪ Cascading:
✓ A software abstraction layer for Hadoop, intended to hide the underlying
complexity of MapReduce jobs.
✓ Cascading allows users to create and execute data processing workflows on
Hadoop clusters using any JVM-based language.
▪ Avro:
✓ A data serialization system and data exchange service.
✓ It is basically used in Apache Hadoop.
✓ These services can be used together as well as independently.
▪ Bigtop:
✓ It is used for packaging and testing the Hadoop ecosystem.
▪ Oozie:
✓ Oozie is a Java-based web application that runs in a Java servlet container.
✓ Oozie uses a database to store the definition of a Workflow, which is a collection
of actions.
✓ It manages Hadoop jobs.
ANALYSING THE DATA WITH HADOOP
▪ Hadoop data analysis technologies can be used to analyse huge data sets, such as
stock data, that are generated very frequently.
▪ The most important Hadoop data analysis technology is MapReduce.
MapReduce
▪ MapReduce is a framework in which we can write applications to process huge
amounts of data, in parallel, on large clusters of commodity hardware in a reliable
manner.
▪ MapReduce is a processing technique and a program model for distributed
computing based on java.
▪ The MapReduce algorithm contains two important tasks, namely Map and Reduce.
▪ Map takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs).
▪ Second, the Reduce task takes the output from a Map as input and
combines those data tuples into a smaller set of tuples.
▪ As the sequence of the name MapReduce implies, the reduce task is always
performed after the map job.
▪ The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes.
▪ Under the MapReduce model, the data processing primitives are called mappers
and reducers.
▪ Once we write an application in the MapReduce form, scaling the application to run
over hundreds, thousands, or even tens of thousands of machines in a cluster is
merely a configuration change.
▪ This simple scalability is what has attracted many programmers to use the
MapReduce model.
Algorithm
▪ A MapReduce program executes in three stages:
1. Map stage
2. Shuffle stage
3. Reduce stage
Map Stage
▪ The map or mapper’s job is to process the input data.
▪ Generally, the input data is in the form of a file or directory and is stored in the
Hadoop file system (HDFS).
▪ The input file is passed to the mapper function line by line.
▪ The mapper processes the data and creates several small chunks of data.
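▪ As a concrete, hedged example of the map stage, the classic word-count mapper below is written against the org.apache.hadoop.mapreduce API; the class name is illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: receives one line of the input split at a time
// and emits a (word, 1) pair for every token in that line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // a small chunk of intermediate data
        }
    }
}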
Reduce Stage
▪ This stage is the combination of the Shuffle stage and the Reduce stage.
▪ The Reducer’s job is to process the data that comes from the mapper.
▪ After processing, it produces a new set of output, which will be stored in the HDFS.
▪ During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
▪ Most of the computing takes place on nodes with data on local disks, which reduces
the network traffic.
▪ After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
▪ The MapReduce framework operates on <key, value> pairs: it views the input to the
job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job.
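▪ Continuing the same hedged word-count sketch, the reducer below sums the counts for each word, and the driver wires the mapper and reducer into a job; the input and output paths are taken from the command line.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word-count reducer: receives each word together with all of its 1s
// (grouped by the shuffle stage) and writes the total to HDFS.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }

    // Driver: submits the job, reusing the WordCountMapper from the earlier sketch.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountReducer.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input stored in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output written to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}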
SCALING OUT
▪ To scale out, we need to store the data in a distributed file system (typically HDFS).
▪ This allows Hadoop to move the MapReduce computation to each machine hosting a
part of the data, using Hadoop’s resource management system, called YARN.
▪ A MapReduce job is a unit of work that the client wants to be performed.
▪ It consists of the input data, the MapReduce program, and configuration information.
▪ Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks
and reduce tasks.
▪ The tasks are scheduled using YARN and run on nodes in the cluster.
▪ If a task fails, it will be automatically rescheduled to run on a different node.
▪ Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits, or just splits.
▪ Hadoop creates one map task for each split, which runs the user-defined map
function for each record in the split.
▪ The output of the reduce is normally stored in HDFS for reliability.
▪ For each HDFS block of the reduce output, the first replica is stored on the local
node, with other replicas being stored on off-rack nodes for reliability.
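▪ As a hedged illustration of the split-to-map-task relationship, the driver fragment below bounds the maximum split size, which indirectly bounds how much data each map task handles; the 64 MB limit is an arbitrary example value.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// One map task is created per input split; capping the split size
// therefore caps how much data each map task processes.
public class SplitTuning {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split tuning sketch");
        // Keep every split at or below 64 MB (illustrative limit only)
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
        // ... mapper, reducer, input and output paths would be set as usual
    }
}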
HADOOP STREAMING
▪ The core ideology of Hadoop is that data processing should be independent of the
language, i.e., it is flexible because the processing programs can be written in any
language.
▪ Hadoop Streaming is the ability of Hadoop to interface with Map & Reduce programs
written in Java or non-Java languages such as Ruby, PHP, C++, Python, etc.
▪ Hadoop streaming uses Unix standard streams as the interface between Hadoop
and our program.
DESIGN OF HDFS
▪ The HDFS is designed for big data processing.
▪ It is a core part of Hadoop, which is used for data storage.
▪ It is designed to run on commodity hardware, i.e., to run HDFS we do not need
specialized hardware; it can be easily installed and run on low-cost commodity
hardware that is readily available in the market.
FEATURES OF HDFS
▪ Distributed: In HDFS, the data is divided into multiple data blocks & stored into
different nodes. This is one of the most important features of HDFS that makes
Hadoop very powerful.
▪ Parallel computation: Data is divided and stored in different nodes; it allows parallel
computation.
▪ Highly scalable: HDFS is highly scalable as it can scale to hundreds of nodes in a single
cluster.
✓ Scaling is of two types: vertical scaling (adding resources to a single node) and
horizontal scaling (adding more nodes to the cluster).
▪ Replication:
o Due to some unfavourable conditions, the node containing the data may
fail.
o To overcome such problems, HDFS always maintains a copy of the data on a
different machine.
▪ Fault tolerance:
o The HDFS is highly fault-tolerant.
o i.e., if any machine fails, another machine containing a copy of that data
automatically becomes active.
o In HDFS, fault tolerance signifies the robustness of the system in the event
of failure.
▪ Streaming data access:
o It follows a write-once/read-many design.
o This means that once data is stored in HDFS, it cannot be altered.
o However, data can be appended at the end.
▪ Portable:
o HDFS is designed in such a way that it can be easily ported from one platform
to another.
SERIALIZATION
▪ Serialization is the process of turning structured objects into a byte stream for
transmission over a network.
▪ Deserialization is the reverse process, in which a byte stream is turned back into a
series of structured objects.
▪ The commonly used data types are:
✓ Boolean
✓ Byte
✓ Short
✓ Int
✓ Float
✓ Long
✓ Double
✓ String
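▪ In Hadoop these types are normally handled through their Writable wrappers (IntWritable, Text, and so on). The hedged sketch below serializes an IntWritable into a byte stream and deserializes it back, purely to show the round trip.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.IntWritable;

// Serialization round trip: structured object -> byte stream -> structured object.
public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        IntWritable original = new IntWritable(42);

        // Serialize: write the object's fields into a byte stream
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize: rebuild an equal object from that byte stream
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println("restored value = " + restored.get());   // prints 42
    }
}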