Big Data Hadoop

Hadoop is an open-source framework for distributed processing of large data sets, consisting of components like HDFS, MapReduce, YARN, and Hadoop Common. HDFS provides high throughput access to application data, while MapReduce is a programming model for parallel data processing. The ecosystem also includes tools like Pig, Hive, and HBase for data processing and management.

1. What is Hadoop?

Answer:
Hadoop is an open-source framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming
models. It is designed to scale up from a single server to thousands of
machines.

2. Explain Hadoop Distributed File System (HDFS).


Answer:
HDFS is a distributed file system that provides high throughput access to
application data. It is designed to store large files across multiple machines and
is optimized for high throughput rather than low latency.

3. What are the components of Hadoop?


Answer:
Hadoop consists of four main components:
 HDFS: Distributed file system
 MapReduce: Computational model
 YARN: Resource management layer
 Hadoop Common: The set of utilities that support other Hadoop
modules

4. What is the role of NameNode in HDFS?


Answer:
NameNode is the master node in HDFS. It manages the metadata and the
directory structure of the file system. It keeps track of where data is stored in
the cluster but does not store the data itself.

5. What is the role of DataNode in HDFS?


Answer:
DataNodes are worker nodes that store the actual data in HDFS. They are
responsible for serving data requests from clients and managing the data
blocks on local disks.

6. What is MapReduce in Hadoop?


Answer:
MapReduce is a programming model for processing large datasets in parallel
across a distributed cluster. It works in two phases: the Map phase (where data
is mapped into key-value pairs) and the Reduce phase (where the mapped data
is aggregated).
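
To make the two phases concrete, here is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API; the class names and file layout are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit (word, 1) for every word in an input line.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // intermediate key-value pair
            }
        }
    }

    // Reduce phase: sum the counts emitted for each distinct word.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // final (word, count)
        }
    }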

7. What is YARN in Hadoop?


Answer:
YARN (Yet Another Resource Negotiator) is the resource management layer of
Hadoop. It manages resources and schedules jobs across the cluster, ensuring
effective resource utilization and job execution.

8. What is a Job Tracker in Hadoop?


Answer:
Job Tracker is a master daemon in Hadoop that is responsible for scheduling
jobs, monitoring task execution, and handling failures. In newer versions of
Hadoop, JobTracker is replaced by YARN.

9. What is a Task Tracker in Hadoop?


Answer:
TaskTracker is a slave daemon that runs on worker nodes in the cluster. It
receives tasks from the Job Tracker and executes them on the local data stored
in HDFS.

10. What is the difference between Hadoop 1.x and Hadoop 2.x?
Answer:
The main difference is that Hadoop 2.x introduces YARN (Yet Another Resource
Negotiator) for better resource management and scalability, while Hadoop 1.x
uses a single JobTracker to manage resources.

11. What is the replication factor in HDFS?


Answer:
Replication factor is the number of copies of each data block that are stored in
HDFS. The default replication factor is 3, meaning each block is stored on three
different DataNodes.
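
As an illustration, the replication factor of an existing file can be read and changed through the Java FileSystem API; the path below is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/sample.txt"); // hypothetical file

            // Read the replication factor currently recorded for the file.
            short current = fs.getFileStatus(file).getReplication();
            System.out.println("Replication factor: " + current);

            // Ask HDFS to keep two copies of this file instead of three.
            fs.setReplication(file, (short) 2);
            fs.close();
        }
    }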

12. How does MapReduce work in Hadoop?


Answer:
In MapReduce, data is processed in two stages:
• Map: The input data is divided into key-value pairs and processed by the Mapper.
• Reduce: The output from the Mapper is aggregated by the Reducer to form the final output (see the driver sketch below).
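
A driver program wires the two stages into a single job; a minimal sketch, assuming the WordCountMapper and WordCountReducer classes from question 6 and input/output paths passed on the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);   // Map stage
            job.setReducerClass(WordCountReducer.class); // Reduce stage
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }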

13. What is the difference between a Map and Reduce task in Hadoop?
Answer:
• Map task: Breaks down input data into key-value pairs.
• Reduce task: Aggregates or processes data based on the key-value pairs produced by the Map task.

14. What is a combiner in Hadoop?


Answer:
A combiner is an optional optimization in MapReduce that performs a local
reduce operation on the map output before it is sent to the Reducer.
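
In the word-count sketch above, the Reducer computes a sum, which is associative and commutative, so the same class can safely serve as the combiner. One extra line in the driver enables it:

    // Pre-aggregate map output locally before it is shuffled to reducers.
    job.setCombinerClass(WordCountReducer.class);

This can greatly reduce the volume of intermediate data sent over the network.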

15. What are the different types of joins supported by Hadoop?


Answer:
Joins are implemented in MapReduce (as map-side or reduce-side joins) and can express the following types:
• Inner Join
• Outer Join
• Left Join
• Right Join
• Cross Join

16. What is the difference between HDFS and traditional file systems?
Answer:
HDFS is designed for distributed storage, offering fault tolerance and scalability,
while traditional file systems are typically limited to a single server with a
higher risk of failure and lower scalability.

17. Explain the concept of “block” in HDFS.


Answer:
A block is the basic unit of storage in HDFS, typically 128 MB or 256 MB in size.
Files in HDFS are split into blocks, and each block is stored across different
nodes in the cluster.
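
The block size can be chosen per file at write time. A minimal sketch, assuming the standard dfs.blocksize property; the path is illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Request 256 MB blocks for files created with this configuration.
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

            FileSystem fs = FileSystem.get(conf);
            FSDataOutputStream out = fs.create(new Path("/data/large-file.dat"));
            out.writeUTF("each file is split into fixed-size blocks");
            out.close();
            fs.close();
        }
    }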

18. How is data processed in Hadoop?


Answer:
Data is processed in parallel across multiple nodes in a Hadoop cluster, utilizing
MapReduce for computation and HDFS for storage.

19. What are the advantages of Hadoop over traditional systems?


Answer:
• Scalability: Easily scales to store and process petabytes of data.
• Fault tolerance: Data is replicated across multiple nodes.
• Cost-effective: Uses commodity hardware for storage.
20. What is Pig in Hadoop?
Answer:
Pig is a high-level platform for processing large data sets. It offers a
simpler scripting language, Pig Latin, whose scripts are compiled into
MapReduce jobs.

21. What is Hive in Hadoop?


Answer:
Hive is a data warehouse system built on top of Hadoop that provides a query
language (HiveQL) similar to SQL for querying and managing large datasets.
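
As an illustration, HiveQL can be submitted from Java through the HiveServer2 JDBC driver; the connection URL, table name, and credentials below are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
        public static void main(String[] args) throws Exception {
            // HiveServer2 endpoint; host, port, and database are placeholders.
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://localhost:10000/default", "user", "");
            Statement stmt = conn.createStatement();

            // HiveQL looks like SQL but is compiled into distributed jobs.
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM sales GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            conn.close();
        }
    }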

22. What is HBase in Hadoop?


Answer:
HBase is a NoSQL database that runs on top of HDFS. It provides random access
to large datasets and is designed to handle structured data.
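
A short sketch of random reads and writes with the HBase Java client; the table "users" and column family "info" are assumed to already exist.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Connection conn =
                    ConnectionFactory.createConnection(HBaseConfiguration.create());
            Table table = conn.getTable(TableName.valueOf("users"));

            // Random write: one cell, addressed by row key.
            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Ada"));
            table.put(put);

            // Random read: fetch the same row back by key.
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

            table.close();
            conn.close();
        }
    }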

23. What is the difference between Hive and HBase?


Answer:
• Hive is used for batch processing and is suitable for querying large datasets with SQL-like queries.
• HBase is used for real-time, random access to data and is designed for high-speed read/write operations.

24. What is a mapper in MapReduce?


Answer:
A mapper is a component in MapReduce that processes input data, performs
computations, and produces intermediate key-value pairs.

25. What is a reducer in MapReduce?


Answer:
A reducer takes the output from the mappers, groups it by key, and performs
an aggregation or other computation to generate the final output.

26. Explain the concept of shuffle and sort in MapReduce.


Answer:
The shuffle and sort phase occurs between the Map and Reduce phases, where
the intermediate output from the Map phase is shuffled and sorted by key to
prepare it for the Reduce phase.

27. What is the role of the ResourceManager in YARN?


Answer:
The ResourceManager is responsible for managing the cluster's resources and
scheduling jobs. It interacts with the NodeManagers on worker nodes to
allocate resources for job execution.

28. What is a NodeManager in YARN?


Answer:
The NodeManager runs on every worker node. It manages that node's resources,
launches and monitors containers, and reports the node's health to the
ResourceManager.

29. What is the function of the Secondary NameNode?


Answer:
The Secondary NameNode periodically merges the edits log with the file
system image to prevent the NameNode from becoming too large.

30. What is an input format in Hadoop?


Answer:
An InputFormat defines how input data is split and read as key-value records
before being passed to the Mapper. Examples include TextInputFormat and
KeyValueTextInputFormat.
31. What is the purpose of the OutputFormat in Hadoop?
Answer:
OutputFormat is responsible for writing the output of a MapReduce job to the
specified location in the distributed file system.
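
Both formats are configured on the Job in the driver. The two lines below, added to a driver like the one in question 12, spell out the defaults explicitly:

    // TextInputFormat hands the Mapper one line of text per record.
    job.setInputFormatClass(
            org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
    // TextOutputFormat writes tab-separated key/value lines to the output path.
    job.setOutputFormatClass(
            org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);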

32. What is the role of the client in Hadoop?


Answer:
The client submits the MapReduce jobs and interacts with the Hadoop cluster
to run and manage jobs. The client communicates with the ResourceManager
and NameNode to access resources.

33. What is a job in Hadoop?


Answer:
A job is a complete unit of work submitted to the cluster, typically a
MapReduce program together with its input and output paths and its
configuration.

34. What are the main differences between Hadoop and Spark?
Answer:
• Hadoop is based on MapReduce, which writes intermediate data to disk between stages, making it slower for iterative workloads.
• Spark computes in memory and is typically faster because it can keep intermediate data cached rather than writing it to disk.

35. What is Apache Kafka?


Answer:
Kafka is a distributed event streaming platform used to build real-time data
pipelines and streaming applications. It is designed for high throughput, low
latency, and fault tolerance.

36. Explain HDFS data block replication.


Answer:
HDFS replicates each data block to multiple DataNodes for fault tolerance. By
default, each block is replicated three times.

37. What is an HDFS pipeline?


Answer:
An HDFS write pipeline is the chain of DataNodes through which a block is
written: the client streams data to the first DataNode, which forwards it to
the second, and so on. Replication therefore happens as part of the write
itself rather than afterwards.

38. What is the difference between structured, unstructured, and semi-structured data?
Answer:
• Structured data: Data that is organized in a predefined format (e.g., relational databases).
• Unstructured data: Data that has no predefined format (e.g., images, videos).
• Semi-structured data: Data that doesn’t have a strict format but has some organizational properties (e.g., JSON, XML).

39. What is Sqoop in Hadoop?


Answer:
Sqoop is a tool designed to efficiently transfer bulk data between Hadoop and
relational databases.

40. What is Flume in Hadoop?


Answer:
Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data into Hadoop.

41. What is Zookeeper in Hadoop?


Answer:
Zookeeper is a centralized service for maintaining configuration information,
naming, and providing distributed synchronization.

42. What is a file system checkpoint in HDFS?


Answer:
A file system checkpoint in HDFS merges the accumulated edit log into a new
file system image (fsimage) on disk. This protects the metadata against loss
and keeps NameNode restarts fast.

43. What is the difference between HDFS and Amazon S3?


Answer:
• HDFS is a distributed file system used with Hadoop clusters, typically for on-premises storage.
• Amazon S3 is a cloud-based object storage service offered by AWS for storing and retrieving any amount of data at any time.

44. What is a key benefit of Hadoop?


Answer:
Hadoop enables distributed storage and parallel processing of large datasets
across clusters of computers, ensuring scalability, fault tolerance, and cost-
effectiveness.

45. Explain the term “data locality” in Hadoop.


Answer:
Data locality means running a task on the node (or at least the rack) where
its input data is stored, which reduces the overhead of moving data across the
network.

46. What is a partitioner in MapReduce?


Answer:
A partitioner determines how the key-value pairs are distributed to different
reducers based on the key.
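
A custom partitioner is a small class; this sketch mirrors the behavior of the default HashPartitioner and would be registered with job.setPartitionerClass(WordPartitioner.class). The class name is illustrative.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mask the sign bit so the partition index is never negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }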

47. What are the types of joins in MapReduce?


Answer:
The types of joins in MapReduce include:
• Inner Join
• Outer Join
• Left Join
• Right Join

48. What are the key features of Hadoop?


Answer:
Key features include:
• Scalability
• Fault tolerance
• Cost-effectiveness
• High throughput
• Open-source

49. What are the limitations of Hadoop?


Answer:
• Not suited to low-latency, real-time processing
• Replication inflates storage requirements
• Complex to deploy and manage

50. What is a big data solution in Hadoop?


Answer:
A big data solution in Hadoop refers to processing and analyzing large amounts
of structured, semi-structured, or unstructured data in a distributed manner
using the Hadoop ecosystem.
