Algorithms and Techniques to Optimize the Performance of Hadoop Clusters, Reduce Data Processing Time and Minimize Resource Wastage

Arunakumari B N, Dept. of CSE, BMS Institute of Technology and Management, Bengaluru, Karnataka - 560064, arunakumaribn@bmsit.in
Ullas B K, Dept. of CSE, BMS Institute of Technology and Management, Bengaluru, Karnataka - 560064, 1by20cs209@bmsit.in
Sujith S R, Dept. of CSE, BMS Institute of Technology and Management, Bengaluru, Karnataka - 560064, 1by20cs197@bmsit.in

Abstract: This study investigates several methods and approaches for enhancing Hadoop cluster performance. Within these distributed computing environments, the main goals are to cut down on resource wastage and data processing time. The paper explores ways to make Hadoop clusters more efficient, mainly by optimizing processes and raising overall system performance. The goal of the project is to address major issues in big data processing by developing a more resource- and time-efficient Hadoop cluster through the application of sophisticated algorithms and optimization approaches.

I. INTRODUCTION

"Big data" refers to collections of extraordinarily large data sets that exceed the capabilities of conventional database management systems. Big data has become prominent in information technology because large amounts of new, complicated data are created every day. Data is produced by social media platforms, mobile devices, sensors, online transactions, and other sources. As data volume, velocity, and variety increase, processing runs into hurdles and complexity: large amounts of data become challenging to manage, process, match, and relate. Because of its huge size and lack of structure, a new platform is required for processing, storing, and transferring the data, and it is with such platforms that large volumes of data can be processed and evaluated.

The MapReduce programming paradigm is popular across these frameworks. Using the Map and Reduce primitives of MapReduce, users can design parallel programs without worrying about the finer details of parallelism, including data distribution, load balancing, and fault tolerance. MapReduce enables continuous processing of large amounts of data. Map and Reduce are the two parts of MapReduce. The first phase of this parallel computing process assigns map tasks to various nodes, which process the input data. Subsequently, the final outputs are produced by applying the reduce function to the combined map outputs.
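To make the two phases concrete, here is the canonical Hadoop word-count job, essentially the standard example from the Hadoop documentation rather than code from this paper: the map phase tokenizes each input line and emits (word, 1) pairs, and the reduce phase, also reused as a combiner, sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner pre-aggregates map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}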
Big Data Processing

Raw data by itself does not benefit an organization. Data processing is the act of gathering unprocessed data and turning it into usable information; in an enterprise, a team of data scientists and engineers usually carries it out step by step. The raw data is gathered, filtered, sorted, processed, analyzed, and stored, and is finally presented in an understandable form. Organizations need data processing to improve their business strategy and gain a competitive advantage. Once data is transformed into usable formats such as documents, graphs, and charts, all personnel in the organization can comprehend and utilize it.

Fig: Data processing cycle

Minimizing Resource Wastage in Hadoop

In big data processing, resource allocation plays a critical role in system performance. In the context of cloud computing, the resource demands of different applications can differ greatly. As a result, when certain resource capacities on a single server are depleted while others remain available, a resource gap appears, and the issue becomes more noticeable as the variation across computing resources grows. Prior resource-allocation algorithms gave this scenario little thought: on a server with a variety of resources, allocating resources this way can waste a large share of the available but under-utilized capacity. To reduce resource wastage, this work considers the minimizing resource gap (MRG) resource-allocation algorithm for heterogeneous resources, which tracks the variation in resource utilization across all servers. Experimental results demonstrated the effectiveness of the MRG technique, increasing system utilization and lowering the total completion time on heterogeneous servers in cloud computing by as much as 24.7%.
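The text does not reproduce MRG's pseudocode, so the following is only a minimal sketch of the resource-gap idea under assumed two-dimensional demands (CPU and memory): each incoming task is placed on the feasible server whose per-resource utilizations end up most even, i.e. whose gap between the most- and least-used dimension is smallest. The Server class and the greedy placement loop are illustrative inventions, not the cited authors' implementation.

import java.util.Arrays;
import java.util.List;

public class ResourceGapScheduler {

  /** A server's capacity and current usage per resource dimension, e.g. {CPU cores, memory GB}. */
  static class Server {
    final double[] capacity;
    final double[] used;
    Server(double[] capacity) {
      this.capacity = capacity;
      this.used = new double[capacity.length];
    }
  }

  /** Resource gap (max - min utilization) if demand were added; infinite if it does not fit. */
  static double gapAfterPlacement(Server s, double[] demand) {
    double min = Double.POSITIVE_INFINITY, max = 0.0;
    for (int r = 0; r < s.capacity.length; r++) {
      double util = (s.used[r] + demand[r]) / s.capacity[r];
      if (util > 1.0) return Double.POSITIVE_INFINITY; // over capacity on this dimension
      min = Math.min(min, util);
      max = Math.max(max, util);
    }
    return max - min;
  }

  /** Greedy placement: pick the feasible server with the smallest post-placement gap. */
  static Server place(List<Server> servers, double[] demand) {
    Server best = null;
    double bestGap = Double.POSITIVE_INFINITY;
    for (Server s : servers) {
      double gap = gapAfterPlacement(s, demand);
      if (gap < bestGap) {
        bestGap = gap;
        best = s;
      }
    }
    if (best != null) {
      for (int r = 0; r < demand.length; r++) best.used[r] += demand[r]; // commit the allocation
    }
    return best;
  }

  public static void main(String[] args) {
    // Two heterogeneous servers: one CPU-rich, one memory-rich.
    List<Server> servers = Arrays.asList(
        new Server(new double[]{32, 64}),
        new Server(new double[]{8, 256}));
    // A memory-heavy task lands on the memory-rich server, keeping utilization even.
    Server chosen = place(servers, new double[]{2, 48});
    System.out.println("placed on server with capacity " + Arrays.toString(chosen.capacity));
  }
}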
II. MOTIVATION

The motivation for streamlining Hadoop cluster execution, reducing data processing time, and limiting resource wastage comes from the imperative to improve operational effectiveness. Users aim to speed up data analysis, lower processing costs, and use resources effectively, guaranteeing timely insights and improved overall performance when handling extensive datasets.

III. LITERATURE SURVEY

A literature survey on algorithms and techniques to optimize the performance of Hadoop clusters, reduce data processing time, and minimize resource wastage involves reviewing relevant research papers, articles, and publications in the field of big data and distributed computing. The key topics and areas of focus for such a survey are outlined below.

Hadoop Architecture and Basics: Understand the foundational concepts of Hadoop, including the Hadoop Distributed File System (HDFS) and MapReduce. Explore key components of Hadoop, such as the NameNode, DataNode, and TaskTracker.

Challenges in Hadoop Performance: Identify common challenges faced in Hadoop clusters, including issues related to data locality, task scheduling, and resource utilization.

Task Scheduling Algorithms: Investigate the task scheduling algorithms used in Hadoop to allocate resources efficiently and reduce processing time. Examples include the Fair Scheduler, Capacity Scheduler, and Delay Scheduling (a minimal scheduler-selection sketch follows this survey).

Data Locality Optimization: Examine techniques for optimizing data locality to minimize data transfer across the network. Explore solutions such as speculative execution and data replication strategies.

Parallel Processing and Computation Models: Review research on parallel processing models and algorithms that enhance parallelism in Hadoop clusters, and investigate alternative computation models beyond MapReduce, such as Spark.

Resource Management and Optimization: Explore resource management techniques and algorithms to allocate and utilize cluster resources efficiently, including YARN (Yet Another Resource Negotiator).

Dynamic Resource Allocation: Study dynamic resource allocation strategies that adjust resources based on workload fluctuations, including auto-scaling mechanisms and dynamic resource provisioning.

Distributed Caching: Investigate the use of distributed caching to minimize redundant computation and improve data access times.

Compression Techniques: Explore compression algorithms and techniques that reduce storage requirements and speed up data processing.

Machine Learning for Optimization: Investigate the application of machine learning algorithms for predicting resource usage and optimizing cluster performance.

Fault Tolerance and Reliability: Review methods for enhancing fault tolerance and reliability in Hadoop clusters, such as checkpointing and recovery mechanisms.

Benchmarking and Performance Evaluation: Examine research on benchmarking methodologies and performance evaluation metrics for assessing the effectiveness of optimization techniques.

A study of existing techniques in big data analytics with multiple clusters for heterogeneous data management: [1] Since there is no established distance between the two classification data centers, data stored in digital format is taken into account while calculating distance measurements. [2] Considers the series of data-intensive applications running on a Hadoop cluster; its data-residency scheme automatically balances the total amount of data stored on every node to achieve improved data processing performance. [3] Considers an enhanced dynamic slot allocation approach for Hadoop that keeps the features of the slot-based model. [4] Offered a variety of solutions to the different issues raised by the Distributed File System in Hadoop's Mapper- and Reducer-based framework. [5] Described a useful method for obtaining and storing unstructured data: a sizable application that retrieves publicly accessible data from Twitter, stores it in a cluster, and then uses the state-transfer method to access the data from HBase for analysis. In reference [6], the cluster centers of the categorical features are represented using a value-function approach that takes the mode values into consideration; as a result, the cluster core consists of just one feature, and inaccurate clustering stems from erroneous value allocation. By modifying the cost function, reference [7] presented the global cost function of the grouping dataset. In [8], the covariance probability approach takes the cluster-setup properties into account when dividing the data points. [9] Discusses the strategy of identifying the top data item and using a new distance calculation to name a cluster core; we employ the entropy derived from the alternate Manhattan distance. Reference [10] compares Word Count's execution times under various scenarios: a solitary node is configured to carry out the MapReduce word count, and the study examines the impact of adjusting the number of reduce jobs and the file size on execution time. Grid-based environments can aid with some advanced cluster handling [11], and the cluster can be improved by implementing effective fault tolerance techniques [12] and integrating an adaptive scheduling strategy.
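The survey items on task scheduling and resource management come down to one pluggable setting in YARN. As a rough illustration, not from the paper itself, the scheduler implementation is chosen through the yarn.resourcemanager.scheduler.class property, normally set in yarn-site.xml; the snippet below shows the same choice programmatically for a test configuration.

import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SchedulerSelection {
  public static void main(String[] args) {
    YarnConfiguration conf = new YarnConfiguration();

    // Capacity Scheduler (the default in stock Hadoop releases): queues with
    // guaranteed capacities, suited to multi-tenant clusters.
    conf.set(YarnConfiguration.RM_SCHEDULER,
        "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler");

    // Fair Scheduler alternative: running applications converge toward an even
    // share of cluster resources over time.
    // conf.set(YarnConfiguration.RM_SCHEDULER,
    //     "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");

    System.out.println("Scheduler: " + conf.get(YarnConfiguration.RM_SCHEDULER));
  }
}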
IV. PROPOSED METHODOLOGY

The analysis of current methods and the review of the literature show that the majority of existing work aims to increase big data processing speed or capacity. How well a processing environment can be utilized and monitored is one of the main topics omitted. The goal of the proposed research project is to use Hadoop clusters to model a high-performance environment. High-performance environments generally demand supercomputers or highly engineered systems to support them; the goal here is instead to set up a robust computing environment using a cluster of minimally configured machines. Since all of the machines have nominal processing capability, making effective use of them and keeping an eye on them is crucial.

The research environment is shaped by a distributed platform consisting of several Hadoop clusters, each having a DataNode, NameNode, JobTracker, and secondary NameNode, which together form the foundation of the distributed system. This raises a question: how can performance be optimized in a distributed setting with several data nodes? In a distributed system, there is typically a high probability of performance degradation, so a node monitor is required to stabilize the environment. The second goal of the proposed research is accomplished with the aid of ZooKeeper, which serves as a watchdog over the znodes that are part of the Hadoop clusters. As a result, the Hadoop cluster is constantly checked for performance problems.
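The paper does not show how the ZooKeeper watchdog is wired up, so the following is a hedged sketch using the standard ZooKeeper client API: each cluster node is assumed to maintain an ephemeral znode (the /hadoop-nodes/... path and the session timeout are invented for illustration), and the monitor registers a watch that fires when a node's znode disappears.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class NodeWatchdog implements Watcher {
  private final ZooKeeper zk;

  public NodeWatchdog(String connectString) throws Exception {
    // 15-second session timeout is an assumed value for illustration.
    this.zk = new ZooKeeper(connectString, 15000, this);
  }

  /** Register a watch on a node's (assumed ephemeral) znode. */
  public void watchNode(String path) throws Exception {
    zk.exists(path, true); // true => this object receives the watch event
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.NodeDeleted) {
      // Ephemeral znode vanished: the node's session died, so flag it.
      System.out.println("Node lost: " + event.getPath());
      // ... trigger alerting or re-scheduling here ...
    }
    try {
      if (event.getPath() != null) {
        zk.exists(event.getPath(), true); // ZooKeeper watches are one-shot: re-register
      }
    } catch (Exception ignored) {
    }
  }

  public static void main(String[] args) throws Exception {
    NodeWatchdog watchdog = new NodeWatchdog("localhost:2181");
    watchdog.watchNode("/hadoop-nodes/node1");
    Thread.sleep(Long.MAX_VALUE); // keep the process alive to receive events
  }
}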
Fig 1: Architecture of the proposed platform

Performance cannot be optimized or improved by the mere presence of a monitor; a robust scheduling procedure that cooperates with the resource negotiator and job tracker is also needed. YARN enters the scene at this point, and an algorithm runs on top of the YARN scheduler. The algorithm receives inputs such as the number of jobs inside a cluster, the time a job takes to execute, the number of jobs not yet completed, the number of Maps and Reduces, the energy of a cluster, and so on. The algorithm derives an optimization factor for each parameter and, with the monitor's assistance, tunes the distributed cluster system. The system's overall performance can then be measured in terms of shorter execution and waiting periods as well as higher accuracy, throughput, and storage efficiency; thus, going forward, the proposed platform may be referred to as optimized.

The fundamental component of the proposed system's structural design, shown in Fig. 1, is a distributed computing environment that handles data processing. Through an interface designed for the cluster, the user submits jobs and data and provides inputs; the interface is programmed using Java and the Hadoop jar packages. The distributed cluster, the central component of the architecture, is configured with the fewest number of machines clustered together, and the Hadoop architecture is imposed on the system. As a result, the cluster is made up of several nodes in a master-slave arrangement, with a NameNode configured on the master; new data nodes can then be added to the slaves. Once the Hadoop service is up and running, jobs can be carried out within the Hadoop cluster and managed by the Hadoop master. The crucial duties of performance optimization and monitoring begin after the Hadoop cluster is configured: as shown in Fig. 1, ZooKeeper is responsible for monitoring, and the YARN manager is in charge of scheduling and performance optimization (a sketch of such a performance factor follows below).
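The paper names the inputs to the optimization factor but not the formula, so the following is a purely hypothetical scoring sketch: backlog, runtime, CPU, and memory figures from the monitor are normalized and combined with assumed weights into a single factor per cluster. A real system would calibrate the weights and normalization targets against measurements.

public class PerformanceFactor {

  /**
   * Inputs of the kind the text lists: jobs in the cluster, average runtime,
   * pending jobs, plus monitored CPU and memory utilization (both 0..1).
   * Weights and the 10-minute runtime target are assumptions for illustration.
   */
  public static double score(int runningJobs, int pendingJobs,
                             double avgJobRuntimeSec,
                             double cpuUtilization,
                             double memUtilization) {
    double backlogPenalty = 0.3 * pendingJobs / Math.max(1, runningJobs + pendingJobs);
    double runtimePenalty = 0.3 * Math.min(1.0, avgJobRuntimeSec / 600.0);
    double loadPenalty    = 0.2 * cpuUtilization + 0.2 * memUtilization;
    return 1.0 - (backlogPenalty + runtimePenalty + loadPenalty); // higher is better
  }

  public static void main(String[] args) {
    // A lightly loaded cluster scores higher than a saturated one.
    System.out.println(score(4, 1, 120, 0.45, 0.50)); // approx. 0.69
    System.out.println(score(8, 6, 540, 0.95, 0.90)); // approx. 0.23
  }
}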
V. RESULT AND ANALYSIS WITH GRAPHS

The optimization-algorithm-based research idea is put into practice on a Hadoop cluster platform, which allows researchers to run jobs at different loads with flexibility. As stated in the job queue description, jobs are assigned by formula. Here, it becomes necessary to assess the effectiveness of various tasks by comparing the schedulers included in YARN (Capacity, Fair, and FIFO) in the appropriate way. Next, cluster load is computed using the running time of the work. Every time, the average cluster performance is noted and variations are recorded. The three foundations for the performance metrics are load, memory usage, and CPU utilization. To carry out the necessary performance assessment and oversight, the research uses Ganglia, which monitors cluster activity. For the YARN algorithm to optimize the cluster, it is necessary to monitor the cluster's performance: every variation in CPU cycles, memory, and cluster load is tracked, and all of this information is used as input by the algorithm to determine the performance factor. The assigned tasks are further divided into progressive and non-progressive categories, and considerable resource optimization is carried out.

Fig 2: Cluster live dashboard

Following the introduction of the algorithm and performance monitor, there was a significant change in the system load. The cluster's jobs operate at a faster pace because of job profiling and performance-factor calculations, which also reduce server load.

VI. CONCLUSIONS

Performance on the Hadoop cluster has increased significantly with the proposed method. The integrated use of Hadoop, ZooKeeper, and Ganglia allows the system to conduct multiple types of jobs simultaneously and efficiently. The resource manager's observations of load and resource use are the foundation for the recommended method and approach. A comparison of YARN schedulers for various work types has been conducted. The work profiling and load optimization have resulted in a significant boost in the efficiency of the cluster; as a result, CPU and memory consumption are declining for the recommended cluster while overall performance improves. Future experiments on the scalability and performance of the proposed system on a variety of tasks might be carried out in a heterogeneous distributed environment.

REFERENCES

[1] Dean J & Ghemawat S, "MapReduce: Simplified Data Processing on Large Clusters", Commun. ACM, Vol. 51, No. 1, (2008).
[2] Arasanal RM & Rumani DU, "Improving MapReduce performance through complexity and performance based data placement in heterogeneous Hadoop clusters", Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), (2013), pp. 115-125.
[3] Deepa T, Ashok M, Subramanian K & Prabhu A, "Dynamic Slot Allocation for MapReduce Cluster", International Journal of Control Theory and Applications, (2017).
[4] Nandimath J, Banerjee E, Patil A, Kakade P, Vaidya S & Chaturvedi D, "Big data analysis using Apache Hadoop", IEEE 14th International Conference on Information Reuse & Integration (IRI), (2013), pp. 700-703.
[5] Das TK & Kumar PM, "BIG Data Analytics: A Framework for Unstructured Data Analysis", Int. J. Eng. Sci. Technol., Vol. 5, No. 1, (2013), pp. 152-156.
[6] Jain AK, Murty MN & Flynn PJ, "Data clustering: a review", ACM Comput. Surv., Vol. 31, No. 3, (1999), pp. 264-323.
[7] Ahmad A & Dey L, "A k-mean clustering algorithm for mixed numeric and categorical data", Data Knowl. Eng., Vol. 63, No. 2, (2007), pp. 503-527.
[8] Velmurugan T, "Evaluation of k-Medoid and Fuzzy C-Means clustering algorithms for clustering telecommunication data", Int. Conf. Emerg. Trends Sci. Eng. Technol., (2012), pp. 115-120.
[9] Kim M & Ramakrishna RS, "Projected clustering for categorical datasets", Pattern Recognit. Lett., Vol. 27, No. 12, (2006), pp. 1405-1417.
[10] Subramanian K & Prabhu A, "Simplified Data Analysis of Big Data in MapReduce", (2017).
[11] Gokuldev S, Rao A & Karthik, "An EMT scheduling approach with optimum load balancing in computational grid", Int. J. Appl. Eng. Res., (2016), pp. 5753-5757.
[12] Gokuldev S & Radhakrishnan R, "An adaptive job scheduling with efficient fault tolerance strategy in computational grid", Int. J. BD Technol., Vol. 6, No. 4, (2014), pp. 1783-1786.
