APACHE HADOOP

OVERVIEW OF HADOOP

Hadoop is an open-source framework designed for the distributed storage and processing of large datasets, ranging from gigabytes to petabytes. Developed as part of the Apache Software Foundation, it utilizes a cluster of commodity hardware to manage data efficiently, making it a cornerstone technology in big data applications.

KEY COMPONENTS OF HADOOP

Hadoop consists of several core modules that work together to provide its functionality:

1. Hadoop Distributed File System (HDFS): This is the primary storage component, designed to store large files across multiple machines. HDFS breaks down files into large blocks and distributes them across nodes in a cluster, allowing for high-throughput access and fault tolerance by replicating data across different nodes.

2. MapReduce: This programming model processes data in parallel across the cluster. It breaks down tasks into smaller sub-tasks that can be executed simultaneously, thus leveraging the distributed nature of the framework for efficient data processing.

3. Yet Another Resource Negotiator (YARN): This module manages resources in the cluster, allocating them to various applications running on Hadoop. YARN enhances the scalability and resource utilization of Hadoop by allowing multiple data processing engines to run on a single platform.

4. Hadoop Common: This package contains libraries and utilities needed by other Hadoop modules, providing essential services like file system abstractions and operating system-level functionalities.

[Diagram: the Apache Hadoop 3.x stack, with the Hadoop Distributed File System as the storage layer, YARN as the resource management layer, and MapReduce alongside other processing engines on top]

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

The Hadoop Distributed File System (HDFS) is a key component of the Apache Hadoop ecosystem, specifically designed for storing and managing large datasets across a distributed network of computers. HDFS provides high-throughput access to application data, making it suitable for big data applications such as data warehousing, machine learning, and analytics.

[Diagram: HDFS architecture, with the NameNode holding metadata (name, replicas, e.g. /home/foo/data, 3, ...) and DataNodes on separate racks holding the data blocks]

ARCHITECTURE OF HDFS

HDFS employs a master/slave architecture, consisting of two main types of nodes:

1. NameNode (Master Node):

* The NameNode is the master server that manages the filesystem namespace and regulates access to files by clients. It maintains metadata about the files stored in HDFS, including information about file names, permissions, and the locations of data blocks on DataNodes.
* It does not store the actual data but keeps track of where each block of data is located across the cluster. The metadata is stored in memory for fast access and persisted on disk for reliability.
* Operations such as opening, closing, and renaming files and directories are executed by the NameNode.

2. DataNodes (Slave Nodes):

* DataNodes are responsible for storing the actual data blocks. Each DataNode manages storage attached to the node it runs on and serves read and write requests from clients.
* When a file is stored in HDFS, it is split into blocks (default size typically 128 MB or 256 MB), which are distributed across multiple DataNodes for storage.
* DataNodes communicate with the NameNode by sending periodic heartbeat signals to indicate their status and report on the blocks they store.

DATA STORAGE AND REPLICATION

* Block Structure: HDFS divides files into large blocks. Each block is stored independently across different DataNodes to ensure fault tolerance and high availability.
* Replication: To protect against data loss due to hardware failures, HDFS replicates each block across multiple DataNodes. The default replication factor is three, meaning each block is stored on three different nodes: two on the same rack and one on a different rack. This strategy enhances fault tolerance while optimizing network bandwidth during data writes.

READ AND WRITE OPERATIONS

* Read Operation:
  1. A client initiates a read request by contacting the NameNode to retrieve metadata about the file's blocks and their locations.
  2. The client then communicates directly with the relevant DataNodes to read the blocks in parallel, which improves performance.
* Write Operation:
  1. Similar to reading, a write request begins with the client contacting the NameNode.
  2. The client sends data to one DataNode, which then replicates it to other specified DataNodes based on the replication factor.
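To make these steps concrete, here is a minimal client-side sketch using Hadoop's Java FileSystem API. It is illustrative only: the NameNode address (hdfs://namenode-host:9000) and the path /user/demo/sample.txt are placeholder values, and in a real deployment fs.defaultFS would normally come from core-site.xml rather than being set in code.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address (assumption); normally configured in core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/user/demo/sample.txt");  // illustrative path

            // Write: the client asks the NameNode where to place blocks, then streams the
            // data to the first DataNode, which forwards it along the replication pipeline.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client fetches block locations from the NameNode, then reads the
            // blocks directly from the DataNodes that hold them.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }

            fs.close();
        }
    }

Note that file data never flows through the NameNode in either direction; the client only fetches metadata from it and then talks to DataNodes directly, which is what gives HDFS its high aggregate throughput.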
ADVANTAGES OF HDFS

* Scalability: HDFS can easily scale horizontally by adding more DataNodes to accommodate growing datasets without significant reconfiguration.
* Fault Tolerance: The replication mechanism ensures that even if some nodes fail, data remains accessible from other nodes.
* High Throughput: Designed for high throughput rather than low latency, HDFS efficiently handles large volumes of data.
* Cost-Effectiveness: It can run on commodity hardware, making it an economical choice for organizations dealing with big data.

In summary, HDFS is a robust storage solution tailored for big data applications, providing essential features such as scalability, fault tolerance, and high throughput through its distributed architecture.

MapReduce

+ Apache Hadoop provides a distributed data processing framework for large datasets using a simple programming model called MapReduce.
+ A programming task that is divided into multiple identical subtasks and distributed among multiple machines for processing is called a map task.
+ The results of these map tasks are combined into one or many reduce tasks.
+ Overall, this approach to computing tasks is called the MapReduce approach.
+ The MapReduce programming paradigm forms the heart of the Apache Hadoop framework, and any application deployed on this framework must comply with the MapReduce programming model.
+ Each task is divided into a mapper task, followed by a reducer task.

The following diagram demonstrates how MapReduce uses the divide-and-conquer methodology to solve a complex problem in a simplified way:

[Diagram: MapReduce word-count process, showing the Input, Splitting, Mapping, Shuffling, Reducing, and Final Result stages for the input lines "Deer Car Bear" and "Car Car River", with intermediate (K2, List(V2)) pairs such as Bear, (1, 1)]
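As a concrete sketch of the word-count process in the diagram, the classes below implement the map and reduce phases with Hadoop's Java MapReduce API. The class names are illustrative, and both classes are shown in one file for brevity (so only the first is public).

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit (word, 1) for every word in the input split.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffling, each word arrives with the list of its counts,
    // which are summed into the final total for that word.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }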
[Diagram: YARN architecture, showing clients submitting jobs to the Resource Manager, Node Managers reporting node status and hosting containers, Application Masters sending resource requests, and MapReduce status flowing back to the client]

YARN

YARN, which stands for Yet Another Resource Negotiator, is a key component of the Apache Hadoop ecosystem introduced in Hadoop 2.x. It serves as a resource management layer that allows multiple data processing engines to run concurrently, improving the efficiency and scalability of big data applications. Here is an overview of its architecture and components.

YARN ARCHITECTURE

YARN's architecture separates resource management from job scheduling and monitoring, which enhances its flexibility and scalability. The main components of YARN are:

1. RESOURCE MANAGER (RM)

* Role: The Resource Manager is the master daemon responsible for managing resources across the cluster. It allocates resources to various applications based on their requirements.
* Components:
  - Scheduler: This component is responsible for resource allocation among running applications, without monitoring or tracking their status. It operates based on predefined policies, such as the Capacity Scheduler or Fair Scheduler, to ensure efficient resource distribution.
  - Application Manager: This manages the lifecycle of Application Masters, handling job submissions and negotiating the first container for execution.

2. NODE MANAGER (NM)

* Role: Each Node Manager runs on an individual node within the cluster and manages the execution of containers on that node. It monitors resource usage (CPU, memory, disk) and reports this information back to the Resource Manager.
* Responsibilities:
  - Launching and managing containers as directed by the Application Master.
  - Monitoring the health of the node and reporting any issues to the Resource Manager.

3. APPLICATION MASTER (AM)

* Role: The Application Master is a framework-specific process that negotiates resources from the Resource Manager and coordinates with Node Managers to execute tasks.
* Responsibilities:
  - Managing the application's lifecycle, including resource requests, task execution, and fault tolerance.
  - Reporting progress and status back to the Resource Manager.

WORKFLOW IN YARN

1. Job Submission: A client submits a job to the Resource Manager.
2. Resource Allocation: The Resource Manager allocates a container for the Application Master.
3. Application Master Execution: The Application Master requests additional containers from the Resource Manager as needed.
4. Task Execution: Node Managers launch containers based on instructions from the Application Master, executing tasks in parallel.
5. Monitoring and Completion: The Application Master monitors task execution and reports back to the Resource Manager upon completion.
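A minimal sketch of the client side of this workflow (step 1) is shown below. It assumes the word-count Mapper and Reducer sketched earlier and uses illustrative HDFS input and output paths. When mapreduce.framework.name is set to yarn, submitting this job causes the Resource Manager to allocate a container for the MapReduce Application Master, which then requests further containers for the map and reduce tasks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);     // map tasks run in YARN containers
            job.setCombinerClass(WordCountReducer.class);  // optional local aggregation of map output
            job.setReducerClass(WordCountReducer.class);   // reduce tasks run in YARN containers
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Illustrative HDFS paths for job input and output.
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

            // Step 1: submit the job; waitForCompletion blocks while the Application
            // Master reports progress and status back to the Resource Manager.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }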
ADVANTAGES OF YARN

* Scalability: YARN can efficiently manage thousands of nodes in a cluster, allowing for extensive data processing capabilities.
* Multi-tenancy: It supports running multiple processing frameworks (e.g., MapReduce, Spark, Flink) simultaneously on a single cluster.
* Dynamic Resource Management: YARN dynamically allocates resources based on application needs, optimizing cluster utilization.

In summary, YARN enhances Hadoop's capabilities by providing a robust architecture for managing resources and scheduling jobs across distributed systems, making it an essential component for modern big data processing environments.

HADOOP COMMON

Hadoop Common is a critical component of the Apache Hadoop framework, providing essential libraries and utilities that support the other Hadoop modules. It serves as the backbone for the entire Hadoop ecosystem, enabling various functionalities that facilitate distributed data processing and storage.

KEY FEATURES OF HADOOP COMMON

* Java Libraries: Hadoop Common includes a set of Java Archive (JAR) files that contain the necessary libraries for the operation of Hadoop applications. These libraries provide shared functionalities across all modules, ensuring consistency and efficiency in operations.
* File System Abstractions: It offers file system and operating system-level abstractions that allow Hadoop to interact with different types of storage systems. This flexibility enables Hadoop to work with various file systems beyond HDFS, such as Amazon S3 or other Hadoop-compatible file systems.
* MapReduce Engine: While MapReduce is primarily known as a processing model, Hadoop Common includes the necessary components to execute MapReduce jobs. This includes job scheduling and resource management, which are crucial for efficient data processing across a cluster.

IMPORTANCE IN THE HADOOP ECOSYSTEM

Hadoop Common plays a vital role in ensuring that all other components of the Hadoop ecosystem function smoothly. Its functionalities enable:

* Location Awareness: Hadoop applications can utilize information about where data is stored within the cluster, allowing for optimized task execution on nodes that have local access to data. This reduces network traffic and enhances performance.
* Fault Tolerance: The design of Hadoop Common assumes hardware failures are common. It incorporates mechanisms for automatic recovery and redirection of tasks to ensure continuous operation even when individual nodes fail.
* Interoperability: By providing a unified set of libraries and utilities, Hadoop Common allows developers to build applications that can easily integrate with other components of the Hadoop ecosystem, such as YARN (Yet Another Resource Negotiator) and HDFS (Hadoop Distributed File System).

In summary, Hadoop Common is an indispensable part of the Apache Hadoop framework, providing foundational services that support distributed computing and storage capabilities essential for handling large datasets efficiently.
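As a small illustration of the file system abstraction described above, the sketch below lists a directory through the same FileSystem interface for three different storage back ends, selected purely by the URI scheme. The host, bucket, and path names are placeholders, and the s3a example assumes the hadoop-aws connector and its credentials are configured on the cluster.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FileSystemAbstractionExample {
        // List a directory through whichever FileSystem implementation matches the URI scheme.
        static void listDirectory(String uri, Configuration conf) throws Exception {
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            for (FileStatus status : fs.listStatus(new Path(uri))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            listDirectory("hdfs://namenode-host:9000/user/demo", conf);  // HDFS (placeholder host)
            listDirectory("s3a://example-bucket/data", conf);            // Amazon S3 via the s3a connector (assumption)
            listDirectory("file:///tmp/demo", conf);                     // local file system
        }
    }

The application code above does not change when the storage back end changes, which is exactly the portability that Hadoop Common's abstractions are meant to provide.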
+ Apache Hadoop was invented to solve large data problems that no existing system or commercial software could solve.
+ With the help of Apache Hadoop, data that used to be archived on tape backups or lost altogether is now being utilized in the system.
+ This data offers immense opportunities to provide insights into history and to predict the best course of action.
+ Hadoop is targeted at solving problems involving the four Vs (Volume, Variety, Velocity, and Veracity) of data.
+ The following diagram shows key differentiators of why Apache Hadoop is useful for business.

[Diagram: key business differentiators of Apache Hadoop]

Compiled by Aaron Stanislaus Johns
