
January - February 2020

ISSN: 0193 - 4120 Page No. 1224 - 1231

Applicational Achievement of K-Means Algorithm among Apache Spark and MapReduce

1Dr. E. Laxmi Lydia, 2G. Sandhya, 3Hima Bindu Gogineni, 4Guvvu Pavani Latha, 5N. Sharmili

1Professor, Department of Computer Science and Engineering, Vignan's Institute of Information Technology (A), Visakhapatnam, Andhra Pradesh, [email protected]
2Assistant Professor, Vignan's Institute of Engineering for Women
3Assistant Professor, Department of Computer Applications, Vignan's Institute of Information and Technology
4Assistant Professor, Department of CSE, Vignan's Institute of Engineering for Women
5Associate Professor, Department of Computer Science and Engineering, Gayatri Vidya Parishad College of Engineering for Women, Visakhapatnam, Andhra Pradesh, India

Article Info
Volume 82
Page Number: 1224 - 1231
Publication Issue: January-February 2020
Article History
Article Received: 14 March 2019
Revised: 27 May 2019
Accepted: 16 October 2019
Publication: 06 January 2020

Abstract
Tremendous volumes of data all around the globe have made exploration and analysis an enthusiastic subject in computer science and have raised the prominence of information. With the blast of incoming data through online networking, exploration in big organizations to gain more access to intelligent research has become a great demand. MapReduce and its variants have been very worthwhile in accomplishing enormous-scale reports with robust applications on specialty clusters. However, a substantial quantity of these particular schemes is assembled over a non-cyclic data flow and is not suitable for some other influential applications. MapReduce introduced a rigid architecture that evaluates each job in a straightforward approach; its major steps, map, shuffle, and reduce, are allowed to change, synchronize, and combine the outputs collected from every node in the cluster. To overcome the manual and slow aspects of such a system, this paper proposes Apache Spark as a manipulating framework to split tremendous information. Apache Spark is the prime contender for "successor to MapReduce". Like the broadly significant MapReduce engine, Spark has been designed to run distinct additional workloads and to perform in that space with a greatly accelerated, speed-adapted framework. This paper examines the conflict between these two systems through execution exploration, considering their data computation on a specified machine with the clustering process (K-Means) and asserting different criteria: the speed-up of the system, the energy consumption of the system, and the scheduling delay of the job compared with current systems.

Keywords: Spark, MapReduce, Hadoop, Big Data

1. INTRODUCTION

Well-known cluster computing has broadly directed its process to data-parallel computations. These clusters are executed with uncertainty in systems that accordingly produce locality-receptive programs, detect faults in components or any failures during execution, and distribute loads through load balancing in clustering. MapReduce prompts this design, during which, in machines like Dryad, data streams are sorted after merging by MapReduce. The facilitator of Big Data [5] and cloud computing [6] granted cloud storage in distributed systems [9].

These systems accomplish their scalability and fault tolerance by giving a programming model where the client makes non-cyclic data-flow graphs to pass input data through an arrangement of operators. This permits the hidden framework to oversee scheduling and to respond to faults without client mediation. While this data-flow programming model is useful for a large class of applications, there are applications that cannot

Published by: The Mattingley Publishing Co., Inc. 1224



communicate proficiently as non-cyclic data flows. In this paper, we concentrate on one such class of applications: those that reuse a working arrangement of data over numerous parallel operations [21-36]. This incorporates two use cases where we have seen Hadoop users report that MapReduce is lacking:

Iterative field: In this field, a function is applied more than once to the same dataset, as in many usual machine learning algorithms that use gradient descent to upgrade a parameter. With MapReduce and Dryad, a huge performance penalty is incurred for every iteration of communication.

Intelligent logical analytics through the Hadoop ecosystem: Large datasets are queried through SQL interfaces, with exploratory queries run on Hadoop through tools such as Pig [13] and Hive [11]. In a flawless globe, any end-user would have the possibility of transferring the dataset into memory from disparate machines and examining it more than once. Be that as it may, with Hadoop, every query acquires huge latency because it runs as a separate MapReduce job and reads data from disk.

In this paper, the advanced computational Spark helps to create a new cluster-computing framework that maintains scalability and fault-tolerance characteristics while supporting applications with working settings comparable to MapReduce.

The underlying concept behind Spark [19] is that partitioned items are read quantitatively among a large provision of systems for the re-establishment of lost segments. Memory expressly stores RDD information crosswise through clients over machines and changes it in various MapReduce-like coordinated procedures. RDDs manage fault tolerance through an intention of extraction: if any portion of an RDD is missing, the RDD process keeps adequate details about how it was drawn from divergent RDDs, so as to acquire the scope to recompute only that package. Although RDDs are not a regularly shared memory abstraction, they express a sweet spot between articulation from one context, extensibility and authenticity, and an appropriate mixture of applications was identified.

Data-processing Spark is implemented in Scala [15], a high-level programming language running on the Java Virtual Machine, and a DryadLINQ-like functional programming interface is provided. Besides, Spark can be handled rationally from a transformed translation of the Scala interpreter, which also allows the client to represent RDDs, functional operations, and volatile defined data, to classify the data based on classes, and to apply them to correlate the operations on a cluster. This produces the accredited framework on Spark concerning the appropriate process for large-scale datasets on a cluster.

Although the managing control evidence, Spark is still a model that implements empowering connections with the framework. It is observed that on a continual machine learning assignment Hadoop was overtaken 10x by Spark, which can be employed rationally to penetrate a 39 GB dataset with sub-second intermission.

1.1 HADOOP OVER SPARK

One of the most advanced data-processing technologies for a long time for Big Data [1] is Hadoop, achieved by being the key to generating clarifying outcomes for transforming substantial datasets.

MapReduce is an extraordinary solution for one-pass reckoning, yet not severely worthwhile for multi-pass reckoning and methods. Each stage in the data-processing task mechanism includes two phases, a map state and a reduce state, and each utilization case must be changed over into MapReduce order to impact this outcome. The outcome statistics of every task must be stored in the distributed cited design before the next step of the process can start. Accordingly, this technique has a predisposition to be tolerable for the reason of likeness. Furthermore, Hadoop solutions typically integrate groups that are demanding to create and inspect. Additionally, it demands the fusion of fewer machines for assorted big data [2] systems (a stream of data processing in machine learning such as Mahout).
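The multi-pass cost described above is easy to see for K-Means itself: every iteration is one full map pass (assign each point to its nearest center) plus one reduce pass (recompute each center as a mean), and plain MapReduce re-reads the input from disk on every such iteration, while Spark keeps it cached in memory. A minimal sketch in plain Python (an illustration only, not Hadoop or Spark code; the sample points and centers are hypothetical):

```python
from collections import defaultdict
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans_iteration(points, centers):
    """One K-Means iteration = one full map + reduce pass over the dataset."""
    # Map phase: group each point under the index of its nearest center
    groups = defaultdict(list)
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        groups[nearest].append(p)
    # Reduce phase: recompute each center as the mean of its assigned points
    new_centers = list(centers)
    for i, pts in groups.items():
        new_centers[i] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return new_centers

# Every call re-scans `points`; under MapReduce they would also be re-read
# from disk each time, while Spark caches them in memory between iterations.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(0.0, 0.0), (10.0, 10.0)]
for _ in range(5):
    centers = kmeans_iteration(points, centers)
```

The loop body is the part that a non-cyclic MapReduce data flow forces into a fresh job per iteration, which is exactly the multi-pass weakness discussed above.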




Achievement of bringing out the required outcomes from complicated data: MapReduce [8] tasks are ordered together during development, and the implemented tasks are performed in series. The tasks that are allowed to execute have high latency, and no new task is allowed to execute until the previous task is accomplished completely. Most complicated problems are resolved using Spark through its directed acyclic graph (DAG) design and multi-step data pipelines. Moreover, since it depends on in-memory data sharing over DAGs, remarkably disparate tasks can perform with identical data.

To gain improved and additional benefits in Spark [17], it allows tasks to run on the currently executed Hadoop distributed file system (HDFS) [15]. It delivers assistance to set up Spark [18] operations in a present Hadoop v1 cluster (including spark-inside-MapReduce (SIMR)) or rather a Hadoop v2 YARN cluster, or on the contrary a proportional open-source computer cluster such as Apache Mesos. A special attractive view of Spark grabbed its attainment as an alternative to Hadoop MapReduce [14] as conflicting with a replacement to Hadoop. It is not recommended to succeed Hadoop, but it relatively offers well-ranging, well-adjusted responses for supervising characteristic big data stretches and essentiality. Figure 1 professes the opposition between Hadoop and Spark.

Fig. 1: Contrast between Spark and MapReduce
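The per-stage materialization contrast can be felt with chained lazy generators in plain Python (an analogy only, not Spark's actual DAG scheduler; the data here is hypothetical):

```python
# Lazy, chained transformations: no intermediate result is written out
# between stages, loosely mimicking how a DAG scheduler pipelines
# map-like steps into a single pass over the data.
records = range(10)                         # stage 0: data source
squared = (x * x for x in records)          # stage 1: map
evens = (x for x in squared if x % 2 == 0)  # stage 2: filter
total = sum(evens)                          # action: triggers the pipeline
```

Each stage above only describes work; computation happens once, when the final action consumes the chain, which is the property that spares a multi-step pipeline the per-stage disk writes of serial MapReduce jobs.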

1.2 ARCHITECTURE OF APACHE SPARK

There are three major standard factors in the Spark architecture: Data Storage, Programming Interface, and Resource Management.

The first factor of Apache Spark is data storage; it handles the HDFS structure for information accumulation outcomes. Apache Spark includes its work progress through Hadoop, which allows it to work with Cassandra, the Hadoop Distributed File System [16], HBase, and so on.




The second factor of Apache Spark [20] is the Programming Interface, which provides Spark-based applications the means to design API applications and make use of standard programming languages such as Python, Java, and Scala through the API interface.

The third factor of Apache Spark is Resource Management: Spark runs as a standalone server or, in some prominent cases, on Mesos or YARN, which follows a distributed computing framework [3].

Fig. 2: Components of the Spark architecture design.

1.3 SPARK SPECIFICATION

The data specified in Spark uses the RDD (Resilient Distributed Dataset), which approves data computations on the cluster at distinct nodes. It helps to estimate the damaged data through fault tolerance under node failures. Data is distributed over manifold nodes. RDD implements execution in parallel [12] (up to 10,000 times faster than a usual MapReduce program). This process routinely preserves the data in memory and retains the existence of iterative algorithms related to machine learning [7].

The behavior of both conventional MapReduce and Directed Acyclic Graph processing machines is uncertain on particular applications that keep relying on the acyclic data stream, consist of stable storage, and also have evolutionary analyzing of data with the unmistakable task.

The effective speed in flash grants us to perform stream refinement with comprehensive info data and govern massive chunks of data on the glide. Similarly, this can also be promoted to take advantage of online machine learning. This process appoints the prerequisite use cases of repeated scrutiny that develop to be an essentially ubiquitous cause in the enterprise.

Inadequate use of multi-pass applications in MapReduce needs low-latency data allocation over the diversified parallel process. These analytical applications are very primitive, and hold:

 Iterative algorithms, encompassing abundant machine learning methods and graph procedures like PageRank.
 Iterative data mining, which consigns the client data into RAM in addition to the cluster and inspects it more than once.
 Streaming applications with a gradual change in time that provide a quantity of accumulated state.

2. IMPLEMENTATION

2.1 APPROACH TO K-MEANS CLUSTERING

A straightforward, transparent K-Means clustering algorithm is used for clustering analysis. The primary intention is to select an elite partitioning of n entities into k cluster categories, on satisfying the distance




condition among the categorized members and its interrelated centroids, prototypical of the category, is lessened. The cluster center is estimated as the mean value over all objects within the cluster. The methodology of the designed algorithm is as follows:

Step 1: Create n clusters, each and every object residing in its own cluster. Every cluster is labelled by allocating a number.

Step 2: Consider and estimate the distance among cluster objects, defined as D(u,v), where u and v are the objects (at u,v = 1,2,...,n). Assume the square matrix for calculating distance as D = (D(u,v)). In case vectors are shown on behalf of objects, Euclidean distance is applied.

Step 3: Later, assess the closest coincident match of clusters r and s, such that the distance D(u,v) is lowest betwixt all the pair-wise intervals.

Step 4: Now fuse u and v into a unique cluster c and find the inter-cluster distance D(c,k) for each actual cluster k != u,v. The obtained distances are observed, and the rows and columns equivalent to the old clusters u and v are eliminated from the defined matrix D, as u and v do not prevail further. Finally, concatenate a new row and column in matrix D related to cluster c.

Step 5: Periodically perform the process from Step 3, a total of n-1 times, until only a single cluster is left.

3. COMPARISON

The interrelationship between Apache Spark and MapReduce designated to reach a decision has been carried out by testing and handling all systems on a dataset that authorizes the user to perform clustering by applying the K-means estimation.

3.1 DATASET SPECIFICATION

This paper includes the healthcare_sample_datasets dataset with a size of 3.13 MB gathered from recent years. The dataset stores the patient identification number (Patient_ID), Patient_Name, Patient_DOB, and other values with information about particular records. The following table represents the data records and is testified in Table 1:

Table 1: Patient records representation in Healthcare_sample_datasets

Patient_ID            int
Patient_Name          chararray
Patient_DOB           chararray
Patient_PhoneNumber   chararray
Patient_emailAddress  chararray
Patient_SSN           chararray
Patient_Gender        chararray
Patient_Disease       chararray
Patient_weight        float

Sample Record Values

211 Fa1 5478 [email protected] 11 M Diabetes 72
212 Fa2 5478 [email protected] 11 F PCOS 64
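The merge-based procedure of Steps 1-5 can be sketched in plain Python (a minimal illustration of the steps as written, not the paper's Spark MLlib or Mahout code; the sample points and the centroid-based choice of D(c,k) are assumptions for the sketch):

```python
from math import dist  # Euclidean distance, as in Step 2 (Python 3.8+)

def centroid(cluster):
    """Prototype of a category: the mean over all objects in the cluster."""
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

def merge_clustering(points):
    """Steps 1-5: start with one cluster per object, repeatedly fuse the
    closest pair until a single cluster remains (n-1 merges in total)."""
    clusters = [[p] for p in points]          # Step 1: n singleton clusters
    merges = []
    while len(clusters) > 1:
        # Steps 2-3: find the pair (u, v) with the smallest distance D(u,v)
        best = None
        for u in range(len(clusters)):
            for v in range(u + 1, len(clusters)):
                d = dist(centroid(clusters[u]), centroid(clusters[v]))
                if best is None or d < best[0]:
                    best = (d, u, v)
        d, u, v = best
        # Step 4: fuse u and v into cluster c; drop their rows/columns and
        # append a new entry for c
        c = clusters[u] + clusters[v]
        clusters = [clusters[i] for i in range(len(clusters)) if i not in (u, v)]
        clusters.append(c)
        merges.append((d, c))                 # Step 5: repeat n-1 times
    return clusters[0], merges
```

For four points, for example, the loop performs exactly three merges before a single cluster containing all objects remains.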

3.2 DATASET PERFORMANCE EVALUATION AND EXPLANATION

Samples from the healthcare_sample_datasets dataset apply the K-means algorithm. This has resulted in the following outputs, described in Table 2 below using comparison. To acquire a mixed evaluation, we examined 64MB on a single node, 3.13MB on a single node, and 3.13MB on two nodes, and supervised the efficiency based on the conditions and the timing for clustering as per the necessity, employing the K-Means procedure. Following are the specifications of the systems that run Spark and MapReduce:

 Memory size of 4GB RAM
 Operating system: Linux Ubuntu
 Hard drive with 500 GB

Observing the obtained results, Apache Spark has gained high speed in terms of time. It is identified that

depending on the dataset size, Spark is 3 times faster than MapReduce. Despite the minor inconstancy in this product, the K-means algorithm performs randomly and will not influence large quantities.

Table 2: Output for K-Means using Spark (MLlib)

Size of the Dataset   Number of Nodes   Executed Time (s)
64 MB                 1                 18
3.13 MB               1                 149

Table 3: Output for K-Means using MapReduce (Mahout)

Size of the Dataset   Number of Nodes   Executed Time (s)
64 MB                 1                 44
3.13 MB               1                 291
3.13 MB               2                 163

For the assessment of the number of considered nodes and the effective performance of Spark and MapReduce, metrics like scheduling delay, speed-up, and energy consumption are measured for each cluster.

3.2.1 Analysis of Scheduling Delay using Spark vs. MapReduce

Fig. 3: Cluster analysis over scheduling delay

Figure 3 represents the cluster analysis over scheduling delay with respect to Spark and MapReduce in Hadoop. Spark exhibits a better scheduling length in contrast with MapReduce.

3.2.2 Analysis of Speed-up using Spark vs. MapReduce

The speed-up is defined as the ratio of the sequential completion time of the schedule to the total length of the schedule obtained. Figure 4 represents the cluster analysis over speed-up with respect to the Spark and MapReduce approaches; the corresponding value continuously increases with the number of clusters.

Fig. 4: Cluster analysis over Speedup

3.2.3 Analysis of Energy Consumption using Spark vs. MapReduce

Figure 5 represents the cluster analysis over energy consumption with respect to the Spark and MapReduce approaches. It was identified that Spark consumed less energy when compared to MapReduce as the cluster resource continuously increased.

Fig. 5: Cluster analysis over Energy consumption
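The ratios behind these comparisons can be checked directly from the executed times reported in Tables 2 and 3 (a small worked example over the reported numbers only):

```python
# Executed times in seconds, as reported in Tables 2 and 3
spark = {("64MB", 1): 18, ("3.13MB", 1): 149}
mapreduce = {("64MB", 1): 44, ("3.13MB", 1): 291, ("3.13MB", 2): 163}

# Spark vs. MapReduce on the same dataset size and node count
ratio_64mb = mapreduce[("64MB", 1)] / spark[("64MB", 1)]       # ~2.4x
ratio_313mb = mapreduce[("3.13MB", 1)] / spark[("3.13MB", 1)]  # ~2.0x

# Speed-up from adding a second node under MapReduce, per the definition
# above: sequential completion time / schedule length obtained
mr_speedup_2nodes = mapreduce[("3.13MB", 1)] / mapreduce[("3.13MB", 2)]  # ~1.8x
```

These single-machine ratios vary with dataset size, which is why the measured advantage of Spark depends on the workload examined.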




Conclusion

This paper provides an audit of a pair of structures and evaluates the different particular parameters that follow from an enforcement review, taking advantage of the K-means estimation. The overall outcome of this exploration establishes that Spark is a remarkably solid contender and commands beyond ambiguity to manage and modify through the resort as part of in-memory preparation. Observations were carried out on Spark's capacity to function on organized manipulations, streaming, and machine learning over an unchanging group, and business is catching a quick look at the instant rate of reception of Spark. Spark has provided efficient solutions to countless cases among much other processing that includes Big Data preparation.

References
1. Bayu Prabowo Sutjiatmo, Afian Erwinsyah, E. Laxmi Lydia, K. Shankar, Phong Thanh Nguyen, Wahidah Hashim, Andino Maseleno, "Empowering the Internet of Things (IoT) through Big Data", International Journal of Engineering and Advanced Technology (IJEAT), Vol. 8, pp. 938-942, August 2019.
2. Muruganantham A., Phong Thanh Nguyen, E. Laxmi Lydia, K. Shankar, Wahidah Hashim, Andino Maseleno, "Big Data Analytics and intelligence: A perspective for Healthcare", International Journal of Engineering and Advanced Technology, Vol. 8, pp. 861-864, 2019.
3. Chen, Z., Xu, G., Mahalingam, V., Ge, L., Nguyen, J., Yu, W., Lu, C. "A cloud computing based network monitoring and threat detection system for critical infrastructures", Big Data Research, Vol. 3, pp. 10-23, 2016.
4. Celli, F., Cumbo, F., Weitschek, E. "Classification of large DNA Methylation datasets for identifying cancer drivers", Big Data Research, Vol. 13, pp. 21-28, 2018.
5. Subbu, K. P., Vasilakos, A. V. "Big Data for Context-Aware Computing - Perspectives and Challenges", Big Data Research, Vol. 10, pp. 33-43, 2017.
6. Milani, B. A., Navimipour, N. J. "A Systematic literature review of the data replication techniques in the cloud environments", Big Data Research, Vol. 10, pp. 1-7, 2017.
7. Elshawi, R., Sakr, S., Talia, D., Trunfio, P. "Big Data systems meet machine learning challenges: towards Big Data science as a service", Big Data Research, Vol. 14, pp. 1-11, 2018.
8. Apiletti, D., Baralis, E., Cerquitelli, T., Garza, P., Pulvirenti, F., Michiardi, P. "A parallel MapReduce algorithm to efficiently support itemset mining on high dimensional data", Big Data Research, Vol. 10, 2017.
9. K. Pavan Kumar, "An integrated health care system using IoT", International Journal of Recent Technology and Engineering, Vol. 7, ISSN: 2277-3878, 2019.
10. Dr. B. Premamayudu, Leela Priya, "New reliability routing path for detects malicious", Ingénierie des Systèmes d'Information, Vol. 24(2), 2019.
11. Apache Hive. http://hadoop.apache.org/hive; Scala programming language. http://www.scala-lang.org.
12. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. "Pig Latin: a not-so-foreign language for data processing".
13. Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. "DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language". In OSDI '08, San Diego, CA, 2008.
14. J. Dean and S. Ghemawat. "MapReduce: Simplified data processing on large clusters". Commun. ACM, 51(1):107-113, 2008.
15. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. "Dryad: Distributed data-parallel programs from sequential building blocks". In EuroSys 2007, pp. 59-72, 2007.
16. B. Nitzberg and V. Lo, "Distributed shared memory: a survey of issues and algorithms", Computer, 24(8):52-60, Aug 1991.
17. Elhoseny, M., Bian, G. B., Lakshmanaprabu, S. K., Shankar, K., Singh, A. K., & Wu, W. (2019). Effective features to classify ovarian cancer data in internet of medical things. Computer Networks, 159, 147-156.
18. Shankar, K., Elhoseny, M., Perumal, E., Ilayaraja, M., & Kumar, K. S. (2019). An Efficient Image Encryption Scheme Based on Signcryption Technique with Adaptive Elephant Herding Optimization. In Cybersecurity and Secure Information Systems (pp. 31-42). Springer, Cham.
19. Elhoseny, M., & Shankar, K. (2020). Energy efficient optimal routing for communication in VANETs via clustering model. In Emerging Technologies for Connected Internet of Vehicles and Intelligent Transportation System Networks (pp. 1-14). Springer, Cham.
20. Elhoseny, M., Shankar, K., & Uthayakumar, J. Intelligent Diagnostic Prediction and Classification System for Chronic Kidney Disease, Nature




Scientific Reports, July 2019. DOI: https://doi.org/10.1038/s41598-019-46074-2.
21. Dutta, A. K., Elhoseny, M., Dahiya, V., & Shankar, K. (2019). An efficient hierarchical clustering protocol for multihop Internet of vehicles communication. Transactions on Emerging Telecommunications Technologies.
22. Elhoseny, M., & Shankar, K. (2019). Optimal bilateral filter and Convolutional Neural Network based denoising method of medical image measurements. Measurement, 143, 125-135.
23. Murugan, B. S., Elhoseny, M., Shankar, K., & Uthayakumar, J. (2019). Region-based scalable smart system for anomaly detection in pedestrian walkways. Computers & Electrical Engineering, 75, 146-160.
24. Famila, S., Jawahar, A., Sariga, A., & Shankar, K. (2019). Improved artificial bee colony optimization based clustering algorithm for SMART sensor environments. Peer-to-Peer Networking and Applications, 1-9.
25. Lakshmanaprabu, S. K., Shankar, K., Rani, S. S., Abdulhay, E., Arunkumar, N., Ramirez, G., & Uthayakumar, J. (2019). An effect of big data technology with ant colony optimization based routing in vehicular ad hoc networks: Towards smart cities. Journal of Cleaner Production, 217, 584-593.
26. Maheswari, P. U., Manickam, P., Kumar, K. S., Maseleno, A., & Shankar, K. Bat optimization algorithm with fuzzy based PIT sharing (BF-PIT) algorithm for Named Data Networking (NDN). Journal of Intelligent & Fuzzy Systems, (Preprint), 1-8.
27. Shankar, K., Ilayaraja, M., & Kumar, K. S. (2018). Technological Solutions for Health Care Protection and Services Through Internet Of Things (IoT). International Journal of Pure and Applied Mathematics, 118(7), 277-283.
28. Lakshmanaprabu, S. K., Shankar, K., Ilayaraja, M., Nasir, A. W., Vijayakumar, V., & Chilamkurti, N. (2019). Random forest for big data classification in the internet of things using optimal features. International Journal of Machine Learning and Cybernetics, 1-10.
29. Sankhwar, S., Gupta, D., Ramya, K. C., Rani, S. S., Shankar, K., & Lakshmanaprabu, S. K. (2016). Improved grey wolf optimization-based feature subset selection with fuzzy neural classifier for financial crisis prediction. Soft Computing, 1-10.
30. Iswanto, I., Lydia, E. L., Shankar, K., Nguyen, P. T., Hashim, W., & Maseleno, A. (2019). Identifying diseases and diagnosis using machine learning. International Journal of Engineering and Advanced Technology, 8(6 Special Issue 2), 978-981.
31. Lakshmanaprabu, S. K., Mohanty, S. N., Krishnamoorthy, S., Uthayakumar, J., & Shankar, K. (2019). Online clinical decision support system using optimal deep neural networks. Applied Soft Computing, 81, 105487.
32. Shankar, K., Lakshmanaprabu, S. K., Khanna, A., Tanwar, S., Rodrigues, J. J., & Roy, N. R. (2019). Alzheimer detection using Group Grey Wolf Optimization based features with convolutional classifier. Computers & Electrical Engineering, 77, 230-243.

