APACHE HADOOP

OVERVIEW OF HADOOP

Hadoop is an open-source framework designed for the distributed storage and processing of large datasets, ranging from gigabytes to petabytes. Developed as part of the Apache Software Foundation, it utilizes a cluster of commodity hardware to manage data efficiently, making it a cornerstone technology in big data applications.

KEY COMPONENTS OF HADOOP

Hadoop consists of several core modules that work together to provide its functionality:

1. Hadoop Distributed File System (HDFS): This is the primary storage component, designed to store large files across multiple machines. HDFS breaks down files into large blocks and distributes them across nodes in a cluster, allowing for high-throughput access and fault tolerance by replicating data across different nodes.

2. MapReduce: This programming model processes data in parallel across the cluster. It breaks down tasks into smaller sub-tasks that can be executed simultaneously, thus leveraging the distributed nature of the framework for efficient data processing.

3. Yet Another Resource Negotiator (YARN): This module manages resources in the cluster, allocating them to various applications running on Hadoop. YARN enhances the scalability and resource utilization of Hadoop by allowing multiple data processing engines to run on a single platform.

4. Hadoop Common: This package contains libraries and utilities needed by other Hadoop modules, providing essential services like file system abstractions and operating system-level functionalities.

[Diagram: the Apache Hadoop 3.x stack, with the Hadoop Distributed File System as the storage layer, YARN as the resource management layer, and MapReduce alongside other processing engines on top]

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

The Hadoop Distributed File System (HDFS) is a key component of the Apache Hadoop ecosystem, specifically designed for storing and managing large datasets across a distributed network of computers. HDFS provides high-throughput access to application data, making it suitable for big data applications such as data warehousing, machine learning, and analytics.

[Diagram: HDFS architecture, with the NameNode holding metadata (name, replicas, e.g. /home/foo/data, 3, ...) and DataNodes on separate racks holding the data blocks]

ARCHITECTURE OF HDFS

HDFS employs a master/slave architecture, consisting of two main types of nodes:

1. NameNode (Master Node):

* The NameNode is the master server that manages the filesystem namespace and regulates access to files by clients. It maintains metadata about the files stored in HDFS, including information about file names, permissions, and the locations of data blocks on DataNodes.
* It does not store the actual data but keeps track of where each block of data is located across the cluster. The metadata is stored in memory for fast access and persisted on disk for reliability.
* Operations such as opening, closing, and renaming files and directories are executed by the NameNode.

2. DataNodes (Slave Nodes):

* DataNodes are responsible for storing the actual data blocks. Each DataNode manages storage attached to the node it runs on and serves read and write requests from clients.
* When a file is stored in HDFS, it is split into blocks (default size typically 128 MB or 256 MB), which are distributed across multiple DataNodes for storage.
* DataNodes communicate with the NameNode by sending periodic heartbeat signals to indicate their status and report on the blocks they store.

DATA STORAGE AND REPLICATION

* Block Structure: HDFS divides files into large blocks. Each block is stored independently across different DataNodes to ensure fault tolerance and high availability.
* Replication: To protect against data loss due to hardware failures, HDFS replicates each block across multiple DataNodes. The default replication factor is three, meaning each block is stored on three different nodes: two on the same rack and one on a different rack. This strategy enhances fault tolerance while optimizing network bandwidth during data writes.

READ AND WRITE OPERATIONS

* Read Operation:
  1. A client initiates a read request by contacting the NameNode to retrieve metadata about the file's blocks and their locations.
  2. The client then communicates directly with the relevant DataNodes to read the blocks in parallel, which improves performance.
* Write Operation:
  1. Similar to reading, a write request begins with the client contacting the NameNode.
  2. The client sends data to one DataNode, which then replicates it to other specified DataNodes based on the replication factor.
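To make these steps concrete, here is a minimal client-side sketch using Hadoop's Java FileSystem API. It is illustrative only: the NameNode address (hdfs://namenode-host:9000) and the path /user/demo/sample.txt are placeholder values, and in a real deployment fs.defaultFS would normally come from core-site.xml rather than being set in code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address (assumption); normally configured in core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/demo/sample.txt");  // illustrative path

            // Write: the client asks the NameNode where to place blocks, then streams the
            // data to the first DataNode, which forwards it along the replication pipeline.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client fetches block locations from the NameNode, then reads the
            // blocks directly from the DataNodes that hold them.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }

            fs.close();
        }
    }

Note that file data never flows through the NameNode in either direction; the client only fetches metadata from it and then talks to DataNodes directly, which is what gives HDFS its high aggregate throughput.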
ADVANTAGES OF HDFS

* Scalability: HDFS can easily scale horizontally by adding more DataNodes to accommodate growing datasets without significant reconfiguration.
* Fault Tolerance: The replication mechanism ensures that even if some nodes fail, data remains accessible from other nodes.
* High Throughput: Designed for high throughput rather than low latency, HDFS efficiently handles large volumes of data.
* Cost-Effectiveness: It can run on commodity hardware, making it an economical choice for organizations dealing with big data.

In summary, HDFS is a robust storage solution tailored for big data applications, providing essential features such as scalability, fault tolerance, and high throughput through its distributed architecture.

MapReduce

+ Apache Hadoop provides a distributed data processing framework for large datasets using a simple programming model called MapReduce.
+ A programming task that is divided into multiple identical subtasks and distributed among multiple machines for processing is called a map task.
+ The results of these map tasks are combined into one or many reduce tasks.
+ Overall, this approach to computing tasks is called the MapReduce approach.
+ The MapReduce programming paradigm forms the heart of the Apache Hadoop framework, and any application deployed on this framework must comply with the MapReduce programming model.
+ Each task is divided into a mapper task, followed by a reducer task.

The following diagram demonstrates how MapReduce uses the divide-and-conquer methodology to solve a complex problem in a simplified way:

[Diagram: MapReduce word-count process, showing the Input, Splitting, Mapping, Shuffling, Reducing, and Final Result stages for the input lines "Deer Car Bear" and "Car Car River", with intermediate (K2, List(V2)) pairs such as Bear, (1, 1)]
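As a concrete sketch of the word-count process in the diagram, the classes below implement the map and reduce phases with Hadoop's Java MapReduce API. The class names are illustrative, and both classes are shown in one file for brevity (so only the first is public).

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit (word, 1) for every word in the input split.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffling, each word arrives with the list of its counts,
    // which are summed into the final total for that word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }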
[Diagram: YARN architecture, showing clients submitting jobs to the Resource Manager, Node Managers reporting node status and hosting containers, Application Masters sending resource requests, and MapReduce status flowing back to the client]

YARN

YARN, which stands for Yet Another Resource Negotiator, is a key component of the Apache Hadoop ecosystem introduced in Hadoop 2.x. It serves as a resource management layer that allows multiple data processing engines to run concurrently, improving the efficiency and scalability of big data applications. Here is an overview of its architecture and components.

YARN ARCHITECTURE

YARN's architecture separates resource management from job scheduling and monitoring, which enhances its flexibility and scalability. The main components of YARN are:

1. RESOURCE MANAGER (RM)

* Role: The Resource Manager is the master daemon responsible for managing resources across the cluster. It allocates resources to various applications based on their requirements.
* Components:
  - Scheduler: This component is responsible for resource allocation among running applications, without monitoring or tracking their status. It operates based on predefined policies, such as the Capacity Scheduler or Fair Scheduler, to ensure efficient resource distribution.
  - Application Manager: This manages the lifecycle of Application Masters, handling job submissions and negotiating the first container for execution.

2. NODE MANAGER (NM)

* Role: Each Node Manager runs on an individual node within the cluster and manages the execution of containers on that node. It monitors resource usage (CPU, memory, disk) and reports this information back to the Resource Manager.
* Responsibilities:
  - Launching and managing containers as directed by the Application Master.
  - Monitoring the health of the node and reporting any issues to the Resource Manager.

3. APPLICATION MASTER (AM)

* Role: The Application Master is a framework-specific process that negotiates resources from the Resource Manager and coordinates with Node Managers to execute tasks.
* Responsibilities:
  - Managing the application's lifecycle, including resource requests, task execution, and fault tolerance.
  - Reporting progress and status back to the Resource Manager.

WORKFLOW IN YARN

1. Job Submission: A client submits a job to the Resource Manager.
2. Resource Allocation: The Resource Manager allocates a container for the Application Master.
3. Application Master Execution: The Application Master requests additional containers from the Resource Manager as needed.
4. Task Execution: Node Managers launch containers based on instructions from the Application Master, executing tasks in parallel.
5. Monitoring and Completion: The Application Master monitors task execution and reports back to the Resource Manager upon completion.
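A minimal sketch of the client side of this workflow (step 1) is shown below. It assumes the word-count Mapper and Reducer sketched earlier and uses illustrative HDFS input and output paths. When mapreduce.framework.name is set to yarn, submitting this job causes the Resource Manager to allocate a container for the MapReduce Application Master, which then requests further containers for the map and reduce tasks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);     // map tasks run in YARN containers
            job.setCombinerClass(WordCountReducer.class);  // optional local aggregation of map output
            job.setReducerClass(WordCountReducer.class);   // reduce tasks run in YARN containers
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Illustrative HDFS paths for job input and output.
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

            // Step 1: submit the job; waitForCompletion blocks while the Application
            // Master reports progress and status back to the Resource Manager.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }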
ADVANTAGES OF YARN

* Scalability: YARN can efficiently manage thousands of nodes in a cluster, allowing for extensive data processing capabilities.
* Multi-tenancy: It supports running multiple processing frameworks (e.g., MapReduce, Spark, Flink) simultaneously on a single cluster.
* Dynamic Resource Management: YARN dynamically allocates resources based on application needs, optimizing cluster utilization.

In summary, YARN enhances Hadoop's capabilities by providing a robust architecture for managing resources and scheduling jobs across distributed systems, making it an essential component for modern big data processing environments.

HADOOP COMMON

Hadoop Common is a critical component of the Apache Hadoop framework, providing essential libraries and utilities that support the other Hadoop modules. It serves as the backbone for the entire Hadoop ecosystem, enabling various functionalities that facilitate distributed data processing and storage.

KEY FEATURES OF HADOOP COMMON

* Java Libraries: Hadoop Common includes a set of Java Archive (JAR) files that contain the necessary libraries for the operation of Hadoop applications. These libraries provide shared functionalities across all modules, ensuring consistency and efficiency in operations.
* File System Abstractions: It offers file system and operating system-level abstractions that allow Hadoop to interact with different types of storage systems. This flexibility enables Hadoop to work with various file systems beyond HDFS, such as Amazon S3 or other Hadoop-compatible file systems.
* MapReduce Engine: While MapReduce is primarily known as a processing model, Hadoop Common includes the necessary components to execute MapReduce jobs. This includes job scheduling and resource management, which are crucial for efficient data processing across a cluster.

IMPORTANCE IN THE HADOOP ECOSYSTEM

Hadoop Common plays a vital role in ensuring that all other components of the Hadoop ecosystem function smoothly. Its functionalities enable:

* Location Awareness: Hadoop applications can utilize information about where data is stored within the cluster, allowing for optimized task execution on nodes that have local access to data. This reduces network traffic and enhances performance.
* Fault Tolerance: The design of Hadoop Common assumes hardware failures are common. It incorporates mechanisms for automatic recovery and redirection of tasks to ensure continuous operation even when individual nodes fail.
* Interoperability: By providing a unified set of libraries and utilities, Hadoop Common allows developers to build applications that can easily integrate with other components of the Hadoop ecosystem, such as YARN (Yet Another Resource Negotiator) and HDFS (Hadoop Distributed File System).

In summary, Hadoop Common is an indispensable part of the Apache Hadoop framework, providing foundational services that support distributed computing and storage capabilities essential for handling large datasets efficiently.
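As a small illustration of the file system abstraction described above, the sketch below lists a directory through the same FileSystem interface for three different storage back ends, selected purely by the URI scheme. The host, bucket, and path names are placeholders, and the s3a example assumes the hadoop-aws connector and its credentials are configured on the cluster.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FileSystemAbstractionExample {
        // List a directory through whichever FileSystem implementation matches the URI scheme.
        static void listDirectory(String uri, Configuration conf) throws Exception {
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            for (FileStatus status : fs.listStatus(new Path(uri))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            listDirectory("hdfs://namenode-host:9000/user/demo", conf);  // HDFS (placeholder host)
            listDirectory("s3a://example-bucket/data", conf);            // Amazon S3 via the s3a connector (assumption)
            listDirectory("file:///tmp/demo", conf);                     // local file system
        }
    }

The application code above does not change when the storage back end changes, which is exactly the portability that Hadoop Common's abstractions are meant to provide.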
+ Apache Hadoop was invented to solve large data problems that no existing system or commercial software could solve.
+ With the help of Apache Hadoop, data that used to be archived on tape backups or lost altogether is now being utilized in the system.
+ This data offers immense opportunities to provide insights into history and to predict the best course of action.
+ Hadoop is targeted at solving problems involving the four Vs (Volume, Variety, Velocity, and Veracity) of data.
+ The following diagram shows key differentiators of why Apache Hadoop is useful for business.

[Diagram: key business differentiators of Apache Hadoop]

Compiled by Aaron Stanislaus Johns
