
International Journal of Advanced and Innovative Research (2278-7844), Volume 6, Issue 11

Efficient Algorithm for Big Data Application


Santhiya R, Revathi M, Madanachitran R

Assistant Professor, Department of Computer Science and Engineering, Paavai Engineering College, Namakkal

ABSTRACT:
Data mining applications play an important role in IT firms, where energy wastage is a major problem. Growth in workload and computation leads to high energy cost. MapReduce is a model developed for processing and storing large volumes of data at the same time. EMRSA is an algorithm that provides reliable energy use and a reduction in map tasks; priority-based scheduling is applied to improve utilization, and system performance improves as the number of maps is reduced.

Keywords: Big Data, EMRSA, MapReduce, Incremental processing.

1 INTRODUCTION
Big data – both structured and unstructured – overwhelms a business on a day-to-day basis. What matters is what organizations do with the data. Big data can be analysed for insights that lead to better decisions and strategic business moves. The major application areas include finance, banking, education, e-commerce and so on.

A MapReduce program is composed of a map procedure that performs filtering and sorting and a reduce procedure that performs a summary operation. It is used to gather data according to the request. To process big data, proper scheduling is required to attain greater performance. Scheduling is the procedure of assigning jobs to available resources in a manner that diminishes starvation and maximizes resource utilization.

Big data is constantly evolving and is now exploited in many applications. With the arrival of new technologies, devices and social networking sites, the amount of data produced by humans is rapidly increasing every year. All this data is meaningful and useful when processed, but much of it is ignored. Big data essentially means a data set so large that it cannot be processed using traditional computing technology. Big data is not just large data; it is a complete subject covering a variety of tools, techniques and frameworks. As new data and updates are gathered, the input of a big data mining algorithm changes incrementally and the previously computed result becomes out-of-date.

EMRSA METHODOLOGY:

In the current situation, energy waste is a serious problem for many IT companies: growing workloads incur high energy costs. The main purpose of this work is to reduce energy costs through an efficient MapReduce concept. To optimize the mining results, MapReduce is evaluated using a one-step algorithm and a three-step iterative algorithm with various calculations, measuring both mining efficiency and energy.

Fig.1 Structure of Big Data

The proposed processing approach is called the Energy MapReduce Scheduling Algorithm (EMRSA). EMRSA saves energy by using fewer map tasks: priority-based scheduling allocates tasks according to their needs and resource utilization, and with fewer maps the work of each resource is reduced, so system performance improves. The experimental results compare EMRSA with the variety of algorithms covered in this paper.

2. RELATED WORKS

2.1 HADOOP:
Hadoop is an open-source framework that allows users to store and process big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale from a single server to thousands of machines, each machine contributing local computation and storage [33]. Because of its distributed file system, it can run applications that involve thousands of nodes containing terabytes of data [48]. The failure of a single node does not cause a damaging system failure.

2.2 ENERGY & PERFORMANCE MODELS FOR MAPREDUCE

The user creates energy and performance models for the MapReduce framework, which are used to forecast the energy

©2017 IJAIR. All Rights Reserved


http://ijairjournal.com/

used and the performance of jobs under various Hadoop configuration settings. The idea is to apply multivariate regression modelling to data collected from the energy profiling of Hadoop MapReduce in order to generate these models. The parameters included in the model are obtained by performing a fractional factorial analysis of the energy-profiling results, using the maximum and minimum possible values of all the parameters mentioned above. Stochastic Markov chain models of the MapReduce systems are then built and verified to estimate performance and energy, making use of the data collected from the energy profile.

2.3 ENERGY MAP REDUCE SCHEDULING ALGORITHM (EMRSA)

This involves input files with the .arff extension, that is, the attribute relation file format (ARFF). A Hadoop plug-in is applied in the Eclipse environment. Hadoop is a flexible, available architecture for large-scale computation and data processing on a network of commodity hardware. Eclipse is an integrated development environment (IDE) that contains a base workspace and an extensible plug-in system for customizing the environment. In this paper the Hadoop plug-in is installed by including the jar files in Eclipse, which allocates a virtual memory of 1 GB.

2.4 ENERGY EFFICIENT CLASSIFICATION METHOD

The results are evaluated using the following two classification methods:
(I) Support Vector Machine (SVM)
(II) Naïve Bayesian.

(I) Support Vector Machine (SVM):
The support vector machine is a supervised classification method used for:
• Classification and regression (binary and multi-class problems)
• Anomaly detection (one-class problems)
Given labelled training examples, the SVM training algorithm builds a non-probabilistic binary linear classifier that assigns new examples to one of the two categories. An SVM model represents the examples as points in space, mapped so that the separate categories are divided by a gap that is as wide as possible. Support vector machines have been developed as robust tools for noisy, complex classification and regression domains. Two important features of support vector machines are generalization theory, which leads to a principled method for choosing a hypothesis, and kernel functions, which introduce non-linearity into the hypothesis space without explicitly requiring a nonlinear algorithm.

2.4.1 SUPPORT VECTORS

Fig 2. Dimensional Hyper Plane

A black line separating the two class clouds lies in the middle of the channel between them. In 2D the separator is a line, in 3D it is a plane, and in four or more dimensions it is a hyperplane. Mathematically, the separator can be found by taking two critical members, one from each class. These points are called support vectors; they are the key points that define the channel. The separator is the perpendicular bisector of the straight line connecting these two support vectors. This is the central concept of the support vector machine.

Since SVM rests on a consistent statistical and mathematical basis for generalization and optimization theory, it is not classified as "just another algorithm". In addition, it has proved superior to existing techniques on various real-world problems. Although SVM does not solve every problem, the kernel method and the maximum-margin method continue to be improved and, as they are adopted by the data mining community, have become an important tool in the data miner's toolkit.

2.4.2 NAÏVE BAYESIAN
The Naive Bayesian classifier is based on Bayes' theorem with an independence assumption between predictors. The Naive Bayesian model is easy to build, requires no complicated iterative parameter estimation, and is particularly useful for very large data sets. Despite its simplicity, the Naive Bayesian classifier often performs surprisingly well and is widely used because it frequently outperforms more sophisticated classification methods.
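As a rough, self-contained illustration of this classifier (a sketch, not the authors' implementation), the frequency-table approach with add-one (Laplace) smoothing can be written in plain Python. The toy weather-style rows below are assumed for demonstration only:

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Build class priors and per-attribute frequency tables
    from categorical training rows."""
    class_counts = Counter(labels)
    # freq[attr_index][class][value] -> observed count
    freq = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values seen per attribute
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            freq[i][c][v] += 1
            values[i].add(v)
    return class_counts, freq, values

def predict(row, class_counts, freq, values):
    """Return the class maximizing P(c) * prod_i P(x_i | c),
    with add-one smoothing to avoid the zero-frequency problem."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c)
        for i, v in enumerate(row):
            # Laplace estimator: add 1 to every value-class count
            score *= (freq[i][c][v] + 1) / (n_c + len(values[i]))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy "Play Golf" style data (attributes: Outlook, Windy) -- illustrative only
rows = [("sunny", "false"), ("sunny", "true"), ("overcast", "false"),
        ("rain", "false"), ("rain", "true"), ("overcast", "true")]
labels = ["no", "no", "yes", "yes", "no", "yes"]

model = train_naive_bayes(rows, labels)
print(predict(("overcast", "false"), *model))  # prints: yes
```

Because only counting and multiplication are involved, training is a single pass over the data, which is why the text notes that the method suits very large data sets.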


Bayes' theorem provides a way to calculate the posterior probability P(c|x) from P(c), P(x), and P(x|c). The Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. This assumption is called class conditional independence.

Formula 1. Conditional Probability

• P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood: the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.

Example:
First, the posterior probability is computed for each attribute against the target by building a frequency table. The frequency table is then converted into a likelihood table, and finally the posterior probability of each class is calculated using the Naive Bayesian expression. The class with the highest posterior probability is the result of the prediction.

Table 1. Conditional Probability

Zero-frequency problem:
If an attribute value (Outlook = Overcast) never occurs with some class value (Play Golf = no), add 1 to the count of every attribute value-class combination (the Laplace estimator). Numerical predictor variables must be converted to categorical variables (binning) before the frequency table is created. Another option is to use the distribution of the numerical variable to obtain a good estimate of the likelihood. For example, one common approach is to assume a normal distribution for numeric variables. The probability density function of a normal distribution is defined by two parameters (mean and standard deviation).

Formula 2. Normal Distribution

2.5.1 MAP REDUCE ARCHITECTURE

Fig.3 Implementation for Processing and Generating Dataset

MapReduce is a programming model and an associated implementation for processing and generating large datasets using parallel distributed algorithms on clusters. MapReduce is the core of Hadoop. This programming paradigm enables huge scalability across hundreds or even thousands of servers in a Hadoop cluster. The first job is a map job, which takes a set of data and converts each element into another data set broken down into individual tuples (key/value pairs). The reduce job accepts the output of the map as its input and combines the data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce job always runs after the map job.

2.5.2 SCHEDULING

Fig. 4 Scheduling Process

Process scheduling is a fundamental part of a multiprogramming operating system. Such an operating system allows multiple processes to be loaded into executable memory at one time, and the loaded processes share the CPU using time multiplexing. In priority scheduling the basic idea is simple: a priority is assigned to each process, and the highest-priority process is executed first. Equal-priority processes are


scheduled in FCFS order. The shortest-job-first (SJF) algorithm is a special case of the general priority scheduling algorithm.

5. CONCLUSION:
This paper described the support vector machine and Naive Bayes classification methods for effective analysis of data mining results, together with a set of efficient techniques for repeated iterative computation. In a real-time experiment, the described classification methods and EMRSA significantly reduce the time taken to refresh large-scale data mining results, compared with naive re-computation in plain MapReduce, while maintaining consistently efficient energy use.

REFERENCES
[1] S. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inform. Theory, vol. 28, no. 2, pp. 129–137, Mar. 1982.
[2] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in Proc. 20th Int. Conf. Very Large Data Bases, 1994, pp. 487–499.
[3] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Comput. Netw. ISDN Syst., vol. 30, no. 1–7, pp. 107–117, Apr. 1998.
[4] J. Dean and S. Ghemawat, "Mapreduce: Simplified data processing on large clusters," in Proc. 6th Conf. Symp. Oper. Syst. Des. Implementation, 2004, p. 10.
[5] R. Power and J. Li, "Piccolo: Building fast, distributed programs with partitioned tables," in Proc. 9th USENIX Conf. Oper. Syst. Des. Implementation, 2010, pp. 1–14.
[6] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: A system for large-scale graph processing," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010, pp. 135–146.
[7] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, "Haloop: Efficient iterative data processing on large clusters," in Proc. VLDB Endowment, 2010, vol. 3, no. 1–2, pp. 285–296.
[8] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, "Twister: A runtime for iterative mapreduce," in Proc. 19th ACM Symp. High Performance Distributed Comput., 2010, pp. 810–818.
[9] D. Peng and F. Dabek, "Large-scale incremental processing using distributed transactions and notifications," in Proc. 9th USENIX Conf. Oper. Syst. Des. Implementation, 2010, pp. 1–15.
[10] D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum, "Stateful bulk processing for incremental analytics," in Proc. 1st ACM Symp. Cloud Comput., 2010, pp. 51–62.
[11] J. Cho and H. Garcia-Molina, "The evolution of the web and implications for an incremental crawler," in Proc. 26th Int. Conf. Very Large Data Bases, 2000, pp. 200–209.
[12] C. Olston and M. Najork, "Web crawling," Found. Trends Inform. Retrieval, vol. 4, no. 3, pp. 175–246, 2010.
[13] P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and R. Pasquin, "Incoop: Mapreduce for incremental computations," in Proc. 2nd ACM Symp. Cloud Comput., 2011, pp. 7:1–7:14.
[14] Y. Zhang, Q. Gao, L. Gao, and C. Wang, "Priter: A distributed framework for prioritized iterative computations," in Proc. 2nd ACM Symp. Cloud Comput., 2011, pp. 13:1–13:14.
[15] T. Jörg, R. Parvizi, H. Yong, and S. Dessloch, "Incremental recomputations in mapreduce," in Proc. 3rd Int. Workshop Cloud Data Manage., 2011, pp. 7–14.
[16] Y. Zhang, Q. Gao, L. Gao, and C. Wang, "imapreduce: A distributed computing framework for iterative computation," J. Grid Comput., vol. 10, no. 1, pp. 47–68, 2012.
[17] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proc. 9th USENIX Conf. Netw. Syst. Des. Implementation, 2012, p. 2.
[18] S. R. Mihaylov, Z. G. Ives, and S. Guha, "Rex: Recursive, delta-based data-centric computation," in Proc. VLDB Endowment, 2012, vol. 5, no. 11, pp. 1280–1291.
[19] Y. Zhang, Q. Gao, L. Gao, and C. Wang, "Accelerate large-scale iterative computation through asynchronous accumulative updates," in Proc. 3rd Workshop Sci. Cloud Comput. Date, 2012, pp. 13–22.
[20] C. Yan, X. Yang, Z. Yu, M. Li, and X. Li, "IncMR: Incremental data processing based on mapreduce," in Proc. IEEE 5th Int. Conf. Cloud Comput., 2012, pp. 534–541.
[21] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, "Distributed graphlab: A framework for machine learning and data mining in the cloud," in Proc. VLDB Endowment, 2012, vol. 5, no. 8, pp. 716–727.
[22] S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl, "Spinning fast iterative data flows," in Proc. VLDB Endowment, 2012, vol. 5, no. 11, pp. 1268–1279.
[23] D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi, "Naiad: A timely dataflow system," in Proc. 24th ACM Symp. Oper. Syst. Principles, 2013, pp. 439–455.
[24] U. Kang, C. Tsourakakis, and C. Faloutsos, "Pegasus: A peta-scale graph mining system implementation and observations," in Proc. IEEE Int. Conf. Data Mining, 2009, pp. 229–238.

