Map Reduce Applications

Syllabus
MapReduce workflows - unit tests with MRUnit - test data and local tests - anatomy of a MapReduce job run - classic MapReduce - YARN - failures in classic MapReduce and YARN - job scheduling - shuffle and sort - task execution - MapReduce types - input formats - output formats.

Contents
4.1 Introduction to MapReduce
4.2 Unit Tests with MRUnit
4.3 Anatomy of MapReduce Job Run
4.4 YARN
4.5 Failures in Classic MapReduce and YARN
4.6 Job Scheduling
4.7 Shuffle and Sort
4.8 Task Execution
4.9 MapReduce Types
4.10 Two Marks Questions with Answers

4.1 Introduction to MapReduce

* MapReduce is a Java-based, distributed programming framework within the Apache Hadoop ecosystem. It takes away the complexity of distributed programming by exposing two processing steps that developers implement: Map and Reduce. In the Mapping step, data is split between parallel processing tasks and transformation logic can be applied to each chunk of data. Once that is complete, the Reduce phase takes over to aggregate the data produced by the Map phase.
* In general, MapReduce uses the Hadoop Distributed File System (HDFS) for both input and output. The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.
* MapReduce is a programming model and software framework first developed by Google. It is intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.

Characteristics of MapReduce :
1. Very large scale data : peta- and exabytes.
2. Write-once, read-many data. It allows for parallelism without mutexes.
3. Map and Reduce are the main operations : simple code.
4. All the map tasks should be completed before the reduce operation starts.
5. Map and reduce operations are typically performed by the same physical processor.
6. The number of map tasks and reduce tasks is configurable.

MapReduce Workflows

* With HDFS, we are able to distribute the data so that it is stored on hundreds of nodes instead of a single large machine. MapReduce provides the framework for highly parallel processing on clusters of commodity hardware.
* Fig. 4.1.1 : MapReduce data processing (data split, map phase, reduce phase, outputs).
* The framework splits the data into smaller chunks that are processed in parallel on a cluster of machines by programs called mappers. The output from the mappers is then consolidated by reducers into the desired result. The share-nothing architecture of mappers and reducers makes them highly parallel.
* Input : This is the input data / file to be processed.
* Split : Hadoop splits the incoming data into smaller pieces called "splits".
* Map : In this step, MapReduce processes each split according to the logic defined in the map() function. Each mapper works on one split at a time. Each mapper is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
* Combine : This is an optional step, used to improve performance by reducing the amount of data transferred across the network. The combiner is essentially the same as the reduce step and is used to aggregate the output of the map() function before it is passed to the subsequent steps.
* Shuffle and Sort : In this step, the outputs from all the mappers are shuffled, sorted to put them in order, and grouped before being sent to the next step.
* Reduce : This step aggregates the outputs of the mappers using the reduce() function. The output of the reducer is sent to the next and final step. Each reducer is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
* Output : Finally, the output of the reduce step is written to a file in HDFS.
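* To make the map and reduce steps concrete, the following is a minimal word-count sketch written against the org.apache.hadoop.mapreduce API. It is an illustrative example rather than code from this text; the class names WordCountMapper and WordCountReducer are our own.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step : emit (word, 1) for every word found in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // one record per word occurrence
        }
    }
}

// Reduce step : sum the counts collected for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));   // (word, total count)
    }
}

* Because addition is associative, the same reducer logic can also be registered as the optional combiner, pre-aggregating counts on the mapper side before the shuffle.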
* Data locality is achieved by MapReduce by working closely with HDFS. When you specify the file system as HDFS for MapReduce, it automatically schedules the mappers on the same nodes where the blocks of data exist.
* MapReduce can get the blocks from HDFS and process them, and the final output from MapReduce can also be stored in HDFS. However, the intermediate files between mappers and reducers are not stored in HDFS; they are kept on the local file systems of the mappers.

Data Flow in the MapReduce Programming Model

* MapReduce is a programming model that facilitates the development and execution of distributed tasks.
* The programmer defines the program logic as two functions :
  a. Map transforms the input into key-value pairs to process.
  b. Reduce aggregates the list of values for each key.
* Fig. 4.1.2 : Data flow in MapReduce (input data, map phase, reduce phase, output data).
* The MapReduce environment takes charge of the distribution aspects : a complex program can be decomposed into Map and Reduce tasks, and higher-level languages (Pig, Hive) help with writing distributed applications.
* MapReduce is a parallel programming model, derived from the functional paradigm, especially dedicated to complex and distributed computations.
* In general, MapReduce processing for most problems is composed of iterations, which may be repeated, of two consecutive stages : the map phase and the reduce phase.
* Map processes the data on the hosts in parallel, whereas reduce aggregates the results. At each iteration, the whole data set is split into chunks, which in turn are used as the input for mappers; each chunk may be processed by only one mapper. Once the data is processed by the mappers, they emit <key, value> pairs to the reduce phase.
* Before the reduce phase, the pairs are sorted and collected according to their key values, so each reducer gets the list of values related to a given key. The consolidated output of the reduce phase is saved into the distributed file system.

Functions of Job Tracker and Task Tracker

Function of job tracker :
* There is a single job tracker that runs on the master node. It is the driver for the map-reduce jobs. Its functions are :
1. Accepts jobs from clients and divides them into tasks.
2. Schedules tasks on worker nodes called task trackers.
3. Keeps heartbeat info from task trackers on worker nodes.
4. Reschedules a task on an alternate task tracker if a worker fails.
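* The job that the job tracker accepts is handed over by a small client-side driver program that configures and submits it. The following is a minimal sketch of such a driver - our own illustration using the newer org.apache.hadoop.mapreduce API and the word-count classes sketched above; input and output paths are assumed to come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");     // job name shown by the framework
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);       // optional combine step
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // input file or directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory, must not exist yet

        // Submits the job and polls its progress until completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

* job.waitForCompletion(true) plays the role of the older JobClient.runJob(conf) call mentioned later in this unit: it submits the job and blocks while reporting progress.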
Function of task tracker :
* A task tracker runs on each worker node. When the data nodes of HDFS are also used for computation, the data nodes become worker nodes as well. The functions of a task tracker are :
a. Takes task assignments from the job tracker.
b. Runs the assigned map and reduce tasks locally on the worker node.
c. Each worker node is configured with the number of mapper and reducer tasks it can take at one time; normally it can take more map tasks than reduce tasks.
d. The task tracker makes a task attempt for each task it executes, and can make multiple attempts before declaring a task as failed.
e. The task tracker maintains a connection with the task attempt, called the umbilical protocol.
f. The task tracker sends a regular heartbeat signal to the job tracker indicating its status, including its available map and reduce task slots.
g. The task tracker runs each task attempt in a separate JVM, so even if a task has bad code due to which it fails, it will not cause the task tracker itself to fail.

Limitations of MapReduce :
1. Cannot control the order in which the maps or reductions are run.
2. For maximum parallelism, the Maps and Reduces must not depend on data generated in the same MapReduce job (i.e. they are stateless).
3. A database with an index will always be faster than a MapReduce job on unindexed data.
4. Reduce operations do not take place until all Maps are complete.
5. There is a general assumption that the output of Reduce is smaller than the input to Map : a large data source is used to generate smaller final values.

4.2 Unit Tests with MRUnit

* MRUnit is a JUnit-based Java library that allows us to unit test Hadoop MapReduce programs. It is a specialized test suite for testing MapReduce jobs. MRUnit removes as much of the Hadoop framework as possible while developing and testing, so the focus is narrowed to the map and reduce code, their inputs and expected outputs.
* With MRUnit, developing and testing MapReduce code can be done entirely in the IDE, and these tests take fractions of a second to run.
* MRUnit is built on top of the popular JUnit testing framework and mocks the Hadoop objects, so the user only needs to focus on the map and reduce logic. MRUnit supports testing Mappers and Reducers separately as well as testing MapReduce computations as a whole.
* To get started, download MRUnit. After extracting the download, look in the mrunit-0.9.0-incubating/lib directory; in there we should see the following :
  mrunit-0.9.0-incubating-hadoop1.jar
  mrunit-0.9.0-incubating-hadoop2.jar
* The mrunit-0.9.0-incubating-hadoop1.jar is for MapReduce version 1 of Hadoop, and mrunit-0.9.0-incubating-hadoop2.jar is for working with the new version of Hadoop's MapReduce.
* Given a MapReduce job that writes to an HBase table called MyTest, which has one column family called CF, the reducer of such a job could look like the following :

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class MyReducer extends TableReducer<Text, Text, ImmutableBytesWritable> {
    public static final byte[] CF = "CF".getBytes();
    public static final byte[] QUALIFIER = "CQ-1".getBytes();

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Bunch of processing to extract the data to be inserted; in our case, let's say
        // we simply append all the records we receive from the mapper for this particular
        // key and insert one record into HBase.
        StringBuffer data = new StringBuffer();
        Put put = new Put(Bytes.toBytes(key.toString()));
        for (Text val : values) {
            data = data.append(val);
        }
        put.add(CF, QUALIFIER, Bytes.toBytes(data.toString()));
        // Write to HBase.
        context.write(new ImmutableBytesWritable(Bytes.toBytes(key.toString())), put);
    }
}

* To test this code, the first step is to add the MRUnit dependency to the Maven POM file :

<dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>1.0.0</version>
    <scope>test</scope>
</dependency>

* Next, use the ReduceDriver provided by MRUnit to drive the reducer :

import static org.junit.Assert.assertEquals;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Before;
import org.junit.Test;

public class MyReducerTest {
    ReduceDriver<Text, Text, ImmutableBytesWritable, Writable> reduceDriver;
    byte[] CF = "CF".getBytes();
    byte[] QUALIFIER = "CQ-1".getBytes();

    @Before
    public void setUp() {
        MyReducer reducer = new MyReducer();
        reduceDriver = ReduceDriver.newReduceDriver(reducer);
    }

    @Test
    public void testHBaseInsert() throws IOException {
        // "RowKey-1" is an example row key; the original value is illegible in this text.
        String strKey = "RowKey-1", strValue = "DATA", strValue1 = "DATA", strValue2 = "DATA";
        List<Text> list = new ArrayList<Text>();
        list.add(new Text(strValue));
        list.add(new Text(strValue1));
        list.add(new Text(strValue2));
        // Since all the reducer does is append the records the mapper sends it,
        // we should get the following back.
        String expectedOutput = strValue + strValue1 + strValue2;
        // Set up the input, mimicking what the mapper would have passed to the reducer, and run the test.
        reduceDriver.withInput(new Text(strKey), list);
        // Run the reducer and get its output.
        List<Pair<ImmutableBytesWritable, Writable>> result = reduceDriver.run();
        // Extract the key from the result and verify it.
        assertEquals(Bytes.toString(result.get(0).getFirst().get()), strKey);
        // Extract the value for CF/QUALIFIER and verify it.
        Put a = (Put) result.get(0).getSecond();
        String c = Bytes.toString(a.get(CF, QUALIFIER).get(0).getValue());
        assertEquals(expectedOutput, c);
    }
}

* The MRUnit test verifies that the output is as expected, that the Put inserted into HBase has the correct value, and that the ColumnFamily and ColumnQualifier have the correct values. MRUnit also includes a MapDriver to test mapping jobs (a sketch follows at the end of this section), and we can use MRUnit to test other operations, including reading from HBase, processing data, or writing to HDFS.

* To unit test MapReduce jobs :
1. Create a new test class in the existing project.
2. Add the mrunit jar file to the build path.
3. Declare the drivers.
4. Write a method for initializations and environment setup.
5. Write a method to test the mapper.
6. Write a method to test the reducer.
7. Write a method to test the whole MapReduce job.
8. Run the test.

* How to test Java MapReduce jobs in Hadoop ? The development activities are :
Step 1 : Develop the MapReduce code.
Step 2 : Unit test the MapReduce code using the MRUnit framework.
Step 3 : Create a jar file for the MapReduce code.
The testing activities then consist of running the jar file with a data file provided as input and checking the output file created on HDFS.
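* As a complement to the ReduceDriver example above, the sketch below shows what a MapDriver-based test of the word-count mapper from section 4.1 could look like. This is our own illustration, not code from the text; the class and method names are assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // Wrap the mapper under test in an MRUnit driver; no cluster is needed.
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void emitsOneCountPerWordOccurrence() throws IOException {
        mapDriver.withInput(new LongWritable(0), new Text("big data big"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .runTest();    // fails if the actual output differs from the expected pairs
    }
}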
4.3 Anatomy of MapReduce Job Run

* We can run a MapReduce job with a single line of code : JobClient.runJob(conf). The execution process behind this single line of code is shown below.
* Fig. 4.3.1 : How Hadoop runs a MapReduce job (client JVM on the client node, resource manager, node managers and the MapTask or ReduceTask containers; step 3 copies the job resources).
* There are five independent entities :
1. The client, which submits the MapReduce job.
2. The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
3. The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
4. The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager and managed by the node managers.
5. The distributed file system, which is used for sharing job files between the other entities.

Job submission :
1. The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it.
2. Having submitted the job, waitForCompletion() polls the job's progress once per second and reports the progress to the console if it has changed since the last report.
3. When the job completes successfully, the job counters are displayed; otherwise, the error that caused the job to fail is logged to the console.

The job submission process implemented by JobSubmitter does the following :
1. Asks the resource manager for a new application ID, used for the MapReduce job ID.
2. Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
3. Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
4. Copies the resources needed to run the job, including the job JAR file, the configuration file and the computed input splits, to the shared filesystem in a directory named after the job ID.
5. Submits the job by calling submitApplication() on the resource manager.

Job initialization :
1. When the resource manager receives the call to its submitApplication() method, it hands off the request to the YARN scheduler.
2. The scheduler allocates a container, and the resource manager then launches the application master's process there, under the node manager's management.
3. The application master for MapReduce jobs is a Java application whose main class is MRAppMaster.
4. It initializes the job by creating a number of bookkeeping objects to keep track of the job's progress, as it will receive progress and completion reports from the tasks.
5. It retrieves the input splits computed in the client from the shared filesystem.
6. It creates a map task object for each split, as well as a number of reduce task objects determined by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job).

Task assignment :
1. If the job does not qualify for running as an uber task, the application master requests containers for all the map and reduce tasks in the job from the resource manager.
2. Requests for map tasks are made first and with a higher priority than those for reduce tasks, since all the map tasks must complete before the sort phase of the reduce can start.
3. Requests for reduce tasks are not made until 5 % of the map tasks have completed.
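* An "uber" task is a small job that the application master chooses to run inside its own JVM rather than requesting new containers. Whether a job qualifies is a configuration matter; the following mapred-site.xml sketch shows the properties that typically govern this in Hadoop 2. Treat the exact names and values as assumptions to be checked against your Hadoop version, not as a prescription from this text.

<configuration>
  <!-- Allow small jobs to run "uberized" inside the application master's JVM. -->
  <property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>true</value>
  </property>
  <!-- A job qualifies only if it is small on all of the following dimensions. -->
  <property>
    <name>mapreduce.job.ubertask.maxmaps</name>
    <value>9</value>
  </property>
  <property>
    <name>mapreduce.job.ubertask.maxreduces</name>
    <value>1</value>
  </property>
  <property>
    <name>mapreduce.job.ubertask.maxbytes</name>
    <value>134217728</value>  <!-- assumed value, roughly one HDFS block -->
  </property>
</configuration>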
Task execution :
1. Once a task has been assigned resources for a container on a particular node by the resource manager's scheduler, the application master starts the container by contacting the node manager.
2. The task is executed by a Java application whose main class is YarnChild. Before it can run the task, it localizes the resources that the task needs, including the job configuration and JAR file and any files from the distributed cache.
3. Finally, it runs the map or reduce task.

Streaming :
* Streaming runs special map and reduce tasks for the purpose of launching the user-supplied executable and communicating with it. The Streaming task communicates with the process using standard input and output streams.
* During execution of the task, the Java process passes input key-value pairs to the external process, which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the Java process.
* From the node manager's point of view, it is as if the child process ran the map or reduce code itself.

Progress and status updates :
* MapReduce jobs are long-running batch jobs, taking anything from tens of seconds to hours to run. A job and each of its tasks have a status, which includes such things as the state of the job or task, the progress of maps and reduces, the values of the job's counters, and a status message or description.
* When a task is running, it keeps track of its progress. For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it is a little more complex, but the system can still estimate the proportion of the reduce input processed.

Job completion :
* When the application master receives a notification that the last task for a job is complete, it changes the status for the job to successful. Then, when the Job polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from waitForCompletion().
* Finally, on job completion, the application master and the task containers clean up their working state and the OutputCommitter's commitJob() method is called.
* Job information is archived by the job history server to enable later interrogation by users if desired.

4.4 YARN

* YARN stands for Yet Another Resource Negotiator. It is the next-generation computing framework in Apache Hadoop, with support for programming paradigms besides MapReduce. It is a large-scale distributed operating system for big data applications.
* It has two major responsibilities :
1. Management of cluster resources such as compute, network and memory.
2. Scheduling and monitoring of jobs.

Why is YARN used ?
a) YARN in Hadoop efficiently and dynamically allocates all cluster resources, compared to previous versions, which helps in better cluster utilization.
b) Clusters in YARN in Hadoop can now run streaming data processing and interactive queries in parallel with MapReduce batch jobs.
c) It can now handle several processing methods and can support a wider range of applications.
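* For orientation, running MapReduce on YARN is itself a matter of configuration. A minimal sketch of the two relevant files on a Hadoop 2 cluster is shown below; the property names are the stock ones, but the values (in particular the hostname) are assumptions that depend on the actual cluster.

<!-- mapred-site.xml : tell the MapReduce client to submit jobs to YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml : where the resource manager runs, and the shuffle service for node managers -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master.example.com</value>  <!-- assumed hostname -->
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>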
* Fig. 4.4.1 : YARN architecture. The architecture consists of several components - the Resource Manager, the Node Managers and the Application Masters - between which job submissions, node status, resource requests and MapReduce status flow.
* The YARN architecture consists of the following main components :
a) Resource manager : Runs on a master daemon and manages the resource allocation in the cluster.
b) Node manager : Node managers run on the slave daemons and are responsible for the execution of tasks on every single data node.
c) Application master : Manages the user job lifecycle and the resource needs of individual applications. It works along with the node manager and monitors the execution of tasks.
d) Container : A package of resources, including RAM, CPU, network, HDD and so on, on a single node.
* The resource manager and the node managers form the data-computation framework. The resource manager is the ultimate authority that arbitrates resources among all the applications in the system. The node manager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting the same to the resource manager / scheduler.
* YARN containers are managed by a container launch context, the Container Life-Cycle (CLC). This record contains a map of environment variables, dependencies stored in remotely accessible storage, security tokens, the payload for node manager services and the command necessary to create the process.

Application workflow in YARN :
a) Client submits an application.
b) Resource manager allocates a container to start the application manager.
c) Application manager registers with the resource manager.
d) Application manager asks for containers from the resource manager.
e) Application manager notifies the node manager to launch the containers.
f) Application code is executed in the container.
g) Client contacts the resource manager / application manager to monitor the application's status.
h) Application manager unregisters with the resource manager.

Architecture features of YARN :
* YARN has become popular for the following reasons :
a) With the scalability of the resource manager of the YARN architecture, Hadoop may manage thousands of nodes and clusters.
b) YARN compatibility with Hadoop 1.0 is maintained by not affecting map-reduce applications.
c) Dynamic utilization of clusters in Hadoop is facilitated by YARN, which gives better cluster utilization.
d) Multi-tenancy enables an organization to gain the benefits of multiple engines.

Merits and Demerits of YARN
Merits :
* Scalability : YARN is designed for a large number of nodes.
* Utilization : The node manager manages a pool of resources rather than a fixed number of designated slots, thus increasing the utilization.
* Multitenancy : Different versions of MapReduce can run on YARN, which makes the process of upgrading MapReduce more manageable.
* In Hadoop 1.0, the JobTracker is the only point of availability.

Difference between YARN and MapReduce
* YARN is used in Hadoop version 2; classic MapReduce is used in Hadoop version 1.
* YARN has a name node, data node, secondary name node, resource manager and node manager; classic MapReduce has a name node, data node, secondary name node, job tracker and task tracker.
* Hadoop 2, based on the YARN architecture, has the concept of multiple masters and slaves; classic MapReduce has a single-master, multiple-slave architecture.
* The default block size of a data node is 128 MB in Hadoop 2 (YARN), compared to 64 MB in Hadoop 1.
* YARN is more isolated and scalable; classic MapReduce is less scalable than YARN.
* Classic MapReduce provided static allocation of resources for designated work and supported only its own batch-processing applications.
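* To illustrate step (g) of the application workflow above - a client asking the resource manager about running applications - the following small sketch uses the YarnClient API from Hadoop 2. It is our own illustrative example, not code from this text.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());   // reads yarn-site.xml from the classpath
        yarnClient.start();
        // Ask the resource manager for the applications it currently knows about.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + "  "
                    + report.getName() + "  "
                    + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}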
4.5 Failures in Classic MapReduce and YARN

1. Failures in classic MapReduce
* Classic MapReduce supports three types of failures :
a) Running task failure
b) Tasktracker failure
c) Jobtracker failure

Task failure
* Child task failure : This happens when user code in the map or reduce task throws a runtime exception. If this happens, the child JVM reports the error back to its parent tasktracker before it exits. The error ultimately makes it into the user logs. The tasktracker marks the task attempt as failed, freeing up a slot to run another task.
* Streaming task failure : If the streaming process exits with a nonzero exit code, it is marked as failed. This behavior is governed by the stream.non.zero.exit.is.failure property.
* Hanging tasks : The tasktracker notices that it hasn't received a progress update for a while and proceeds to mark the task as failed. The child JVM process is automatically killed after this period.

Tasktracker failure
* If a tasktracker fails by crashing or running very slowly, it will stop sending heartbeats to the jobtracker. The jobtracker will notice a tasktracker that has stopped sending heartbeats and remove it from its pool of tasktrackers to schedule tasks on.
* The jobtracker arranges for map tasks that were run and completed successfully on that tasktracker to be rerun if they belong to incomplete jobs, since their intermediate output, residing on the failed tasktracker's local filesystem, may not be accessible to the reduce task. Any tasks in progress are also rescheduled.
* A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has not failed. If more than four tasks from the same job fail on a particular tasktracker, the jobtracker records this as a fault.

Jobtracker failure
* This is the most serious failure mode. It is a single point of failure : Hadoop has no mechanism for dealing with failure of the jobtracker.
* YARN is used to overcome this situation.
* After restarting a jobtracker, any jobs that were running at the time it was stopped need to be re-submitted.

2. Failures in YARN
* Task failure : Failure of the running task is similar to the classic case. Runtime exceptions and sudden exits of the JVM are propagated back to the application master and the task attempt is marked as failed.
* Node manager failure : If a node manager fails, it stops sending heartbeats to the resource manager and the node manager is removed from the resource manager's pool of available nodes. Any task or application master running on the failed node manager is recovered using the usual recovery mechanisms.
* Resource manager failure : Failure of the resource manager is serious, since without it neither jobs nor task containers can be launched. The resource manager was designed from the outset to be able to recover from crashes, by using a checkpointing mechanism to save its state to persistent storage, although at the time of writing the latest release did not have a complete implementation.

4.6 Job Scheduling

* The Hadoop schedulers are designed for better utilization of resources and performance enhancement. Requirements (and challenges) regarding job scheduling in Hadoop are as follows :
* Energy efficiency : To perform operations on large amounts of data, a large amount of energy is required in data centers, which increases the overall cost. The minimization of energy in data centers is a big challenge in Hadoop.
* Load balancing : The Map and Reduce stages are linked with the partition stage. By default, the data is equally portioned by partition algorithms, which handle the system imbalance in case skewed data is encountered. As only the key is considered in processing and not the data size, a load balancing problem occurs.
* Mapping scheme : A mapping scheme that helps minimize communication cost is required.
* Automation and configuration : This helps in the deployment of a Hadoop cluster by setting all the parameters. For proper configuration, both the hardware and the amount of work should be known at the time of deployment. During configuration, a small mistake could cause inefficient execution of the job, leading to performance degradation. To overcome such issues, new techniques and algorithms are needed that allow the settings to be made efficiently.
* Fairness : Fairness refers to the fairness of scheduling algorithms; it indicates how fairly the scheduling algorithms divide the resources among users.
* Data locality : The distance between the task node and the input node is known as locality. The data transfer rate depends on the locality : the data transfer time will be short if the computational node is near the input node.
* Synchronization : The process of transferring the intermediate output data of the map process as the input data to the reduce process is known as synchronization.

First In First Out (FIFO) Scheduler
* The default scheduling policy used in Hadoop is First In First Out. More preference is given to the applications coming in first than to those coming later. It places the applications in a queue and executes them in the order of their submission (first in, first out).
* Here, irrespective of size and priority, the request of the first application in the queue is allocated resources first; only once the first application's request is satisfied is the next application in the queue served.
* Advantages of the FIFO scheduler :
1. It is simple to understand and doesn't need any configuration.
2. Jobs are executed in the order of their submission.
* Disadvantages of the FIFO scheduler :
1. It is not suitable for shared clusters.
2. It does not take into account the balance of resource allocation between long and short applications.

Fair Scheduler
* The fair scheduler was developed at Facebook. It aims to give every user a fair share of the cluster capacity over time. If a single job is running, it gets all of the cluster. As more jobs are submitted, free task slots are given to the jobs in such a way as to give each user a fair share of the cluster.
* The main idea behind the fair scheduler is to allocate an equal share of resources to each job. It creates groups of jobs, called pools, based on configurable attributes like the user name.
* The fair scheduler ensures fairness in the sharing of resources between pools. The pools also control job configurable properties, and the configurable properties determine the pool in which a job is placed.
* All users have their own pools with a minimum share assigned to each. Minimum share means that a small part of the total number of slots is always guaranteed to a pool.
* By default, there is a fair allocation of resources among the pools with respect to the MapReduce task slots. If any pools are free, i.e. they are not being used, their idle slots are used by the other pools.
* If the same user or the same pool sends too many jobs, the fair scheduler can limit these jobs by marking them as not runnable. If there is only a single job running at a given time, it can use the entire cluster.
* Advantages of the fair scheduler :
1. It makes a fair and dynamic resource reallocation.
2. It provides a faster response to small jobs than to large jobs.
3. It has the ability to fix the number of concurrently running jobs from each user and pool.
* Disadvantage of the fair scheduler : it limits the number of concurrently running jobs in each pool under fair scheduling.

Capacity Scheduler
* Yahoo developed the capacity scheduler. The main objective of this scheduler is to maximize the utilization of resources and throughput in a cluster environment.
* This scheduling algorithm provides fair management of computational resources among a large number of users. It uses queues, and each queue is assigned to an organization after the resources have been divided among these queues. To keep control over the queues, a security mechanism is built in to ensure that each organization can access only one of the queues; it can never access the queues of another organization.
* This scheduler guarantees minimum capacity by placing limits on the running tasks and jobs from a single queue. When new jobs arrive in a queue, the resources are assigned back to the previous queue after completion of the currently running jobs.
* The capacity scheduler allows job scheduling based on priority in an organization's queue.
* Advantages of the capacity scheduler :
1. The capacity scheduling policy maximizes utilization of resources and throughput in a cluster environment.
2. This scheduler guarantees the reuse of the unused capacity of the jobs within queues.
3. It supports the features of hierarchical queues, elasticity and operability.
4. It can allocate and control memory based on the available hardware resources.
* Disadvantages of the capacity scheduler :
1. It is the most complex among the schedulers.
2. With a queue and single-user jobs, it has some limitations in ensuring stability and fairness.

Difference between Fair and Capacity Scheduler
* The fair scheduler was developed at Facebook; Yahoo developed the capacity scheduler.
* The fair scheduler assigns an equal amount of resources to all running jobs; the capacity scheduler assigns resources based on the capacity required by the organization.
* The fair scheduler cannot support hierarchical XML configuration; the capacity scheduler supports it.
* The fair scheduler is not complex; the capacity scheduler is complex compared with the other schedulers.
* The main idea behind the fair scheduler is to allocate an equal share of resources to each job; the main objective of the capacity scheduler is to maximize the utilization of resources and throughput in a cluster environment.

4.7 Shuffle and Sort

* MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers the map outputs to the reducers as inputs is called the shuffle.
* When we run a MapReduce job and the mappers start producing output, a lot of processing is done internally by the Hadoop framework before the reducers get their input. The Hadoop framework also guarantees that the map output is sorted by keys. This whole internal processing of sorting the map output and transferring it to the reducers is known as the shuffle phase.
* Fig. 4.7.1 : Shuffle and sort (data on HDFS, input partitions, map, shuffling, sort in parallel, reduce, output partitions, data on HDFS).
* The shuffle is a phase which happens between each Map and Reduce phase. Map and Reduce handle data organised into key-value pairs. Once the mappers are done with their calculations, the results of each mapper are sorted by key in so-called buffers.
* Once the buffers fill up, the data spills onto the machines' local disks. It is then pushed to the reducers (the shuffle phase), so that records associated with the same key end up in the same reducer.
* This shuffling happens as soon as each mapper finishes, in order not to over-flood the network, which could happen if we waited for all of the mappers to finish.
* Depending on the setup, reducers may start running before all mappers have finished their jobs. When data gets to the reducers it is sorted once again. Results are then written to HDFS or any other filesystem in use.
* In order to reduce the amount of data shuffled through the network, we may define a combiner function, which does the same calculations as the reducer but on the mappers' side, before the data is transferred.
* The role of the combiner is simply to pre-calculate the data so that less is sent over the network. However, there is no way of controlling how many times, or whether at all, the combiner will actually be used - Hadoop decides on that.
* Hadoop has a default shuffle and sort mechanism based on alphabetical sorting and hash shuffling of the keys. However, a custom mechanism can be implemented by overriding the following classes :
1. Partitioner - determines how the data will be shuffled, i.e. to which reducer each record goes.
2. RawComparator - responsible for data sorting on the mapper side.
3. RawComparator - handles the data grouping on the reducer side.
* The tasks done internally by the Hadoop framework within the shuffle phase are as follows :
1. Data from the mappers is partitioned as per the number of reducers.
2. Data is also sorted by keys within a partition.
3. Output from the maps is written to disk as many temporary files.
4. Once the map task is finished, all the files written to disk are merged to create a single file.
5. Data from a particular partition is transferred to the reducer that is supposed to process that partition.
6. If the data for a reducer exceeds the memory limit, it is copied to disk.
7. Once a reducer has got its portion of data from all the mappers, the data is again merged, while still maintaining the sort order of keys, to create the reduce task input.

4.8 Task Execution

* The MapReduce model breaks jobs into tasks and runs the tasks in parallel to make the overall job execution time smaller than it would otherwise be if the tasks ran sequentially.
* In Hadoop, speculative execution is a process that takes place during the slower execution of a task at a node. In this process, the master node starts executing another instance of the same task on another node; the task which finishes first is accepted, and the execution of the other is stopped by killing it.

Speculative execution in Hadoop
* Speculative execution in Hadoop MapReduce is an option to run a duplicate map or reduce task for the same input data on an alternative node. This is done so that any slow-running task doesn't slow down the whole job.
* A MapReduce job is dominated by its slowest task. MapReduce attempts to locate slow tasks, called stragglers. If a straggler is discovered, a redundant (speculative) task is run that will, optimistically, commit before the corresponding straggler.
* Speculative execution is turned on by default for both map and reduce tasks. It can be enabled or disabled independently for map tasks and reduce tasks, on a cluster-wide basis or on a per-job basis. The properties for speculative execution are set in mapred-site.xml.

Advantages of speculative execution :
* In large-scale clusters with thousands of nodes, many Hadoop jobs run at the same time, and problems such as failures of individual servers are common. It is therefore beneficial to run duplicate tasks in case one server fails.

4.9 MapReduce Types

* The map and reduce functions in Hadoop MapReduce have the following general form :
  map : (K1, V1) -> list(K2, V2)
  reduce : (K2, list(V2)) -> list(K3, V3)
* In general, the map input key and value types (K1 and V1) are different from the map output types (K2 and V2). However, the reduce input must have the same types as the map output, although the reduce output types may be different again (K3 and V3).

Input Formats and Output Formats
1. Input formats :
* Hadoop can process many different types of data formats, from flat text files to databases.
* An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record, a key-value pair.
* Splits and records are logical. There is nothing that requires them to be tied to files. In a database context, a split might correspond to a range of rows from a table and a record to a row in that range.
* Input splits are represented by the Java interface InputSplit. An InputSplit has a length in bytes and a set of storage locations, which are just hostname strings.
* An InputFormat is responsible for creating the input splits and dividing them into records; the Mapper obtains its records through the context, which is public and customizable.
* The FileInputFormat and DBInputFormat classes are derived from InputFormat. FileInputFormat is further specialized, with classes that, for example, combine small files or prevent file splitting.

FileInputFormat
* FileInputFormat is the base class for all implementations of InputFormat that use files as their data source. It provides an API to define which files are included as the input to a job and an implementation for generating splits for the input files. The job of dividing splits into records is performed by subclasses.

Small files and CombineFileInputFormat
* Hadoop works better with a small number of large files than with a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the files are very small and there are a lot of them, each map task will process very little input and there will be a lot of map tasks (one per file), each of which imposes extra bookkeeping overhead.

2. Text input :
* TextInputFormat : This is the default InputFormat. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators (e.g. newline or carriage return), and is packaged as a Text object.
* A file is broken into splits at byte, not line, boundaries. Splits are processed independently.

Relationship between input splits and HDFS blocks :
* The logical records that FileInputFormats define usually do not fit neatly into HDFS blocks. For example, a TextInputFormat's logical records are lines, which will cross HDFS block boundaries more often than not.
* This has no bearing on the functioning of the program : lines are not missed or broken.
* It does mean that some remote reads may need to be performed. The slight overhead this causes is not normally significant.
* Fig. 4.9.1 : Logical records and HDFS blocks for TextInputFormat. A single file is broken into lines, and the line boundaries do not correspond with the HDFS block boundaries. Splits honor logical record boundaries, in this case lines, so the first split contains line 5 even though it spans the first and second block; the second split starts at line 6.

Binary input :
* Hadoop MapReduce is not restricted to processing textual data; it has support for binary formats too.
* SequenceFileInputFormat : Hadoop's sequence file format stores sequences of binary key-value pairs. Sequence files are splittable, they support compression as a part of the format, and they can store arbitrary types.
* SequenceFileAsBinaryInputFormat : This is a variant of SequenceFileInputFormat that retrieves the sequence file's keys and values as opaque binary objects. They are encapsulated as BytesWritable objects, and the application is free to interpret the underlying byte array.

4.10 Two Marks Questions with Answers

Q.1 Define MapReduce.
Ans. : MapReduce is a programming model and software framework first developed by Google, intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.

Q.2 List the characteristics of MapReduce.
Ans. : Characteristics of MapReduce :
1. Very large scale data : peta- and exabytes.
2. Write-once, read-many data. It allows for parallelism without mutexes.
3. Map and Reduce are the main operations : simple code.
4. All the map tasks should be completed before the reduce operation starts.
5. Map and reduce operations are typically performed by the same physical processor.
6. The number of map tasks and reduce tasks is configurable.
7. Operations are provisioned near the data.
8. Commodity hardware and storage.

Q.3 What are the major responsibilities of YARN ?
Ans. : It has two major responsibilities :
* Management of cluster resources such as compute, network and memory.
* Scheduling and monitoring of jobs.

Q.4 Why is YARN used ?
Ans. :
a) YARN in Hadoop efficiently and dynamically allocates all cluster resources, compared to previous versions, which helps in better cluster utilization.
b) Clusters in YARN in Hadoop can now run streaming data processing and interactive queries in parallel with MapReduce batch jobs.
c) It can now handle several processing methods and can support a wider range of applications.

Q.5 What is the fair scheduler ?
Ans. : The fair scheduler aims to give every user a fair share of the cluster capacity over time. If a single job is running, it gets all of the cluster. As more jobs are submitted, free task slots are given to the jobs in such a way as to give each user a fair share of the cluster.

Q.6 List the failures of MapReduce.
Ans. : MapReduce supports three types of failures : running task failure, tasktracker failure and jobtracker failure.

Q.7 Explain First In First Out (FIFO) scheduling.
Ans. : The FIFO scheduling policy gives more preference to the jobs coming in earlier than to those coming in later. When new jobs arrive, the JobTracker pulls the earliest job first from the queue.

Q.8 Why does Hadoop work better with a small number of large files ?
Ans. : Hadoop works better with a small number of large files than with a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the files are very small and there are a lot of them, each map task will process very little input and there will be a lot of map tasks (one per file), each of which imposes extra bookkeeping overhead.

Q.9 What is TextInputFormat ?
Ans. : TextInputFormat is the default InputFormat. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators (e.g. newline or carriage return), and is packaged as a Text object. A file is broken into splits at byte, not line, boundaries, and splits are processed independently.

Q.10 What is node manager failure in YARN ?
Ans. : If a node manager fails, it will stop sending heartbeats to the resource manager and the node manager will be removed from the resource manager's pool of available nodes. Any task or application master running on the failed node manager will be recovered using the usual recovery mechanisms.
