Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It uses HDFS for storage, which partitions and replicates data across nodes for reliability. MapReduce is used for parallel processing, where jobs are split into tasks that are mapped and reduced across nodes.
Hadoop:
HDFS & MapReduce
Traditional Approach
• Traditionally, a single computer is used to store and process big data.
• For storage, programmers take the help of their preferred database vendors, such as Oracle, IBM, etc.
• Limitation: processing a huge and ever-growing amount of data through a single database becomes a bottleneck.
Google's Solution
• Google solved this problem using an algorithm called MapReduce.
• This algorithm divides the task into small parts and assigns them to many computers.
• It then collects the results from those computers, which, when integrated, form the result dataset.
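As a rough illustration of that divide-and-combine idea (plain Java, not Hadoop itself), the sketch below splits a summation task into parts, hands each part to a worker thread standing in for a separate machine, and then integrates the partial results. The class name and the input values are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DivideAndCombine {
    public static void main(String[] args) throws Exception {
        // The task's input, already divided into small parts.
        List<List<Integer>> parts = List.of(
                List.of(1, 2, 3), List.of(4, 5, 6), List.of(7, 8, 9));

        // "Assigns them to many computers": here a thread pool stands in
        // for the cluster, one worker per part.
        ExecutorService workers = Executors.newFixedThreadPool(parts.size());
        List<Future<Integer>> partials = new ArrayList<>();
        for (List<Integer> part : parts) {
            partials.add(workers.submit(
                    () -> part.stream().mapToInt(Integer::intValue).sum()));
        }

        // "Collects the results from them, which, when integrated, form the
        // result dataset": combine the partial sums into one answer.
        int total = 0;
        for (Future<Integer> partial : partials) {
            total += partial.get();
        }
        workers.shutdown();
        System.out.println("total = " + total); // prints: total = 45
    }
}
```

MapReduce applies the same pattern, except that the parts are data blocks stored on different machines and the framework takes care of distribution, fault handling, and the final integration.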
Hadoop
• Using the solution provided by Google, Doug Cutting and his team developed an open-source project called HADOOP.
• Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different nodes.
How Does Hadoop Work?
• It is quite expensive to build bigger servers with heavy configurations.
• As an alternative, we can tie together many single-CPU commodity computers as a single functional distributed system.
• The clustered machines can read the dataset in parallel and provide a much higher throughput.
• Moreover, it is cheaper than one high-end server.
Hadoop runs code across a cluster of computers. This process includes the following core tasks that Hadoop performs:
• Data is initially divided into directories and files.
• Files are divided into uniform-sized blocks of 128 MB or 64 MB (preferably 128 MB).
• These files are then distributed across various cluster nodes for further processing.
• HDFS, sitting on top of the local file system, supervises the processing.
• Blocks are replicated to handle hardware failure.
• The sort that takes place between the map and reduce stages is performed.
• The sorted data is sent to a certain computer.
• Debugging logs are written for each job.
Hadoop Architecture
• At its core, Hadoop has two major layers, namely:
• Processing/Computation layer (MapReduce), and
• Storage layer (Hadoop Distributed File System).
1. Hadoop - MapReduce
• A software framework for distributed processing of large data sets.
• The framework takes care of scheduling tasks, monitoring them, and re-executing any failed tasks.
• It splits the input data set into independent chunks that are processed in a completely parallel manner.
• The MapReduce framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.
MapReduce Architecture
Dataflow in MapReduce
• An input reader
• A Map function
• A partition function
• A compare function
• A Reduce function
• An output writer
A minimal word-count job that exercises this dataflow is sketched below.
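The sketch uses the standard org.apache.hadoop.mapreduce API for the classic word-count example. The class names and input/output paths are illustrative; details such as the combiner are optional choices, not requirements of the framework.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map function: for each input line, emit (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce function: the framework has already sorted and grouped the map
  // output by key, so each call sees one word with all of its counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework splits the input into chunks, runs one map task per chunk, sorts the map output, and feeds it to the reduce tasks, as described above. The job would typically be packaged into a JAR and submitted with something like `hadoop jar wordcount.jar WordCount /input /output` (paths assumed for the example).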
JobTracker
• JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop.
• JobTracker performs the following actions in Hadoop:
• It accepts MapReduce jobs from client applications.
• It talks to the NameNode to determine the data location.
• It locates an available TaskTracker node.
• It submits the work to the chosen TaskTracker node.
TaskTracker
• A TaskTracker node accepts map, reduce, or shuffle operations from a JobTracker.
• It is configured with a set of slots; these indicate the number of tasks it can accept.
• The JobTracker looks for a free slot to assign a job.
• The TaskTracker notifies the JobTracker about the job success status.
• The TaskTracker also sends heartbeat signals to the JobTracker to confirm its availability, and it reports the number of free slots it has available.
2. Hadoop - HDFS Overview
• HDFS holds a very large amount of data and provides easier access.
• To store such huge data, the files are stored across multiple machines.
• These files are stored in a redundant fashion to rescue the system from possible data loss in case of failure.
• HDFS also makes applications available for parallel processing.
Features of HDFS
• It is suitable for distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of the namenode and datanode help users easily check the status of the cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
HDFS Architecture
• HDFS follows the master-slave architecture, and it has the following elements:
• Namenode
• Datanode
• Block
Namenode
• The namenode is the commodity hardware that contains the namenode software.
• The system having the namenode acts as the master server, and it does the following tasks:
• Manages the file system namespace.
• Regulates clients' access to files.
• It also executes file system operations such as renaming, closing, and opening files and directories.
Datanode
• The datanode is commodity hardware having the datanode software.
• For every node (commodity hardware/system) in a cluster, there will be a datanode.
• These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client request.
• They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Block
• Generally, the user data is stored in the files of HDFS.
• A file in the file system is divided into one or more segments and/or stored on individual data nodes.
• These file segments are called blocks.
• In other words, the minimum amount of data that HDFS can read or write is called a block.
• The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
A small client-side sketch that touches files, blocks, and replication follows below.
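The sketch uses Hadoop's FileSystem API to write a file into HDFS and then asks the namenode for its metadata, including the block size and replication factor. The namenode host, port, and paths are placeholders, not values from this document.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    // Point the client at the cluster; host and port are placeholders.
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

    FileSystem fs = FileSystem.get(conf);

    // Create a file. The client asks the namenode where to place the blocks,
    // then streams the data to the chosen datanodes.
    Path file = new Path("/user/demo/hello.txt");
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("Hello, HDFS");
    }

    // Metadata such as block size and replication factor comes back from the
    // namenode (set cluster-wide via dfs.blocksize and dfs.replication).
    FileStatus status = fs.getFileStatus(file);
    System.out.println("block size  = " + status.getBlockSize());
    System.out.println("replication = " + status.getReplication());

    fs.close();
  }
}
```

The command interface mentioned above performs the same operations from a shell, e.g. `hdfs dfs -put` to copy a file in and `hdfs dfs -ls` to list it.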
Roles of Components
Data Replication
• Replica placement: creating replicas on all machines would have a high initialization time.
• An approximate solution: keep only 3 replicas.
• One replica resides on the current node.
• One replica resides in the current rack.
• One replica resides in another rack.
Goals of HDFS
• Fault detection and recovery: since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
• Huge datasets: HDFS should have hundreds of nodes per cluster to manage applications having huge datasets.
• Hardware at data: a requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces the network traffic and increases the throughput.
Thank You