Cost-Aware Big Data Processing Across-2


MINIMISATION OF CLOUD COST

ACROSS VARIOUS SERVERS

GUIDE
Mr.M.Krishnaraj B.E, M.Tech.,
Assistant Professor
Department of Information Technology,
Panimalar Institute of Technology.

TEAM MEMBERS: [Batch No:11]


SHEIK ABDULLAH K 211517205101
KRISHNA RAJ S 211517205058
SUNIL KUMAR S M 211517205109
ABSTRACT
• The traditional centralized approach of moving all data to a
single cluster is inefficient or infeasible because of limitations
such as the scarcity of wide-area bandwidth and the low-latency
requirement of data processing.
• Processing big data across geo-distributed datacenters has
continued to gain popularity in recent years.
• The challenge is to balance high performance against low
cost, including bandwidth cost and storage cost.
• We formulate this complex cost optimization problem for data
movement, resource provisioning and reducer selection as a
joint stochastic integer nonlinear optimization problem that
minimizes the five cost factors simultaneously.
INTRODUCTION
• Large-scale cloud organizations are deploying datacenters and
“edge” clusters globally to give their users low-latency
access to their services.
• Analyzing the geo-distributed data gathered across these sites
is an important workload.
• The widely used approach is to aggregate all the datasets at a
central site before executing the queries.
• MapReduce is a distributed programming model for
processing large-scale datasets in parallel, and it has proven
effective in many existing applications.
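To make the MapReduce model concrete, the following is a minimal single-JVM sketch of its map and reduce phases, using word counting as the example job. It is an illustration only and does not use the Hadoop API; the class name `WordCountSketch` and its methods are hypothetical, and a real job would run each phase on many machines.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: imitates the map and reduce phases inside one JVM.
final class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle and reduce phase: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Runs the map phase on every line, then reduces all emitted pairs.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        return reduce(emitted);
    }
}
```

In a geo-distributed setting, the map calls would run where the data lives and only the emitted pairs would cross datacenter boundaries, which is why the placement of the reduce step matters for cost.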
LITERATURE SURVEY

S.NO: 1
TITLE: Apache Spark Implementation of Whale Optimization Algorithm
YEAR: 2020
AUTHORS: Maryam Aljame, Imtiaz Ahmad
DESCRIPTION: Population-based meta-heuristic algorithms are among the dominant algorithms used to solve challenging real-world problems in diverse fields.
DEMERITS: The performance of the proposed algorithm is compared with state-of-the-art algorithms using statistical measures such as Mean Absolute Error, Standard Deviation, Root Mean Squared Error and t-value.

S.NO: 2
TITLE: MapReduce: Data Processing on Large Clusters, Applications and Implementations
YEAR: 2020
AUTHOR: Bharath R
DESCRIPTION: MapReduce is a fault-tolerant, simple, and scalable framework for data processing that enables its users to process massive amounts of data.
DEMERITS: To provide an abstraction layer between fault tolerance, data distribution and other parallel systems tasks.

S.NO: 3
TITLE: Fast Minimization of Fixed Polarity Reed-Muller Expressions
YEAR: 2019
AUTHORS: Zhenxue He, Limin Xiao
DESCRIPTION: A fast minimization algorithm (FMA) for fixed polarity Reed-Muller (FPRM) expressions. The main idea behind the FMA is to search for the minimum FPRM with the fewest products by using the proposed binary differential evolution algorithm.
DEMERITS: Efficient ternary-encoded DE algorithm to minimize the mixed polarity RM expressions.

S.NO: 4
TITLE: Global analytics in the face of bandwidth and regulatory constraints
YEAR: 2015
AUTHORS: Ashish Vulimiri, Carlo Curino
DESCRIPTION: Pulling all data to a central location is problematic at large data scales due to expensive transoceanic links, and may be rendered impossible by emerging regulatory constraints.
DEMERITS: To demonstrate the flexibility of the system, the authors implemented two function-specific optimizations.

S.NO: 5
TITLE: End-to-End Optimization for Geo-Distributed MapReduce
YEAR: 2016
AUTHORS: Benjamin Heintz, Abhishek Chandra
DESCRIPTION: Explores applying MapReduce to geo-distributed data over geo-distributed computation resources.
DEMERITS: The authors present a model-driven optimization framework, as well as cross-phase optimization algorithms suitable for a real-world MapReduce implementation.
EXISTING SYSTEM
• Global-scale organizations produce large volumes of data
across geographically distributed data centers. Querying
and analyzing such data as a whole introduces new
research issues at the intersection of networks and
databases.
• Today's systems that compute SQL analytics over
geographically distributed data operate by pulling all data
to a central location.
• This is problematic at large data scales due to expensive
transoceanic links, and may be rendered impossible by
emerging regulatory constraints.
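As a rough illustration of why pulling all data to one location is expensive, the sketch below computes the transfer cost of centralizing at each candidate site and picks the cheapest one. The class `CentralizationCostSketch`, the per-site data sizes, and the per-GB prices are all assumptions made for this example; they are not figures from the project.

```java
// Illustrative only: sizes are in GB and prices are hypothetical per-GB transfer prices.
final class CentralizationCostSketch {

    // Cost of pulling every site's data to the target site.
    // The target site keeps its own data in place, so it pays nothing for it.
    static double costToCentralize(double[] dataGb, double[] pricePerGb, int target) {
        double cost = 0.0;
        for (int site = 0; site < dataGb.length; site++) {
            if (site != target) {
                cost += dataGb[site] * pricePerGb[site];
            }
        }
        return cost;
    }

    // Index of the site where centralizing costs the least (first one on ties).
    static int cheapestTarget(double[] dataGb, double[] pricePerGb) {
        int best = 0;
        for (int target = 1; target < dataGb.length; target++) {
            if (costToCentralize(dataGb, pricePerGb, target)
                    < costToCentralize(dataGb, pricePerGb, best)) {
                best = target;
            }
        }
        return best;
    }
}
```

Even the cheapest target still pays for moving every other site's data, and that cost grows with data volume, which is the motivation for processing data where it is generated.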
DISADVANTAGE
• The problem consists in orchestrating query execution
across data centers to minimize bandwidth use, which a
centralized approach does not address.
• It relies on network-centric mechanisms designed for a
wide-area setting, such as pseudo-distributed
execution.
PROPOSED SYSTEM
• Big data is generated at high velocity and in high volume from
geographically dispersed sources.
• Big data processing across geographically distributed datacenters is
becoming an attractive and cost-effective strategy for many big data
companies and organizations.
• A joint approach to data movement, resource provisioning and reducer
selection is developed with the goal of cost minimization.
• We balance five types of cost (bandwidth cost, storage cost, computing
cost, migration cost, and latency cost) between the two MapReduce
phases across datacenters.
• This complex cost optimization problem is formulated as a joint
stochastic integer nonlinear optimization problem that minimizes the
five cost factors simultaneously.
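The five cost factors can be pictured as one total per time slot. The sketch below adds them up and averages the totals over a sequence of slots. It assumes that all five costs are already expressed in the same monetary unit and that time is divided into discrete slots; the class `SlotCostSketch` is hypothetical and does not reproduce the project's actual formulation or its constraints.

```java
import java.util.List;

// Illustrative only: one time slot's cost, split into the five factors named in the slides.
final class SlotCostSketch {
    final double bandwidth;
    final double storage;
    final double computing;
    final double migration;
    final double latency;

    SlotCostSketch(double bandwidth, double storage, double computing,
                   double migration, double latency) {
        this.bandwidth = bandwidth;
        this.storage = storage;
        this.computing = computing;
        this.migration = migration;
        this.latency = latency;
    }

    // Total cost of this slot: the plain sum of the five factors (assumed same unit).
    double total() {
        return bandwidth + storage + computing + migration + latency;
    }

    // Average total cost per slot over the observed slots; 0 if there are none.
    static double timeAverage(List<SlotCostSketch> slots) {
        if (slots.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (SlotCostSketch slot : slots) {
            sum += slot.total();
        }
        return sum / slots.size();
    }
}
```

Summing the factors into one number is what lets a single optimization trade one cost against another, for example paying more for bandwidth in order to lower latency cost.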
ADVANTAGE
• We transform the original problem into three
independent subproblems.
• The online algorithm MiniBDP minimizes the long-
term time-average operation cost.
• The approach has widespread application prospects in
globally serving companies, since analyzing
geographically dispersed datasets is an efficient
way to support their marketing decisions.
MODULES
Module split-up:
• Admin login
• User login
• Cloud manage
MODULE 1:ADMIN LOGIN
• The admin needs to log in to the cloud application
with the admin's password.
• The admin then registers a new user with the
user's details, and a mail is sent to the
registered mail address.
• Only the admin is able to create a new user.
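The registration step can be sketched as follows. This is a minimal illustration under stated assumptions: the class `UserRegistrySketch`, the `Role` enum, and the 12-character temporary password are hypothetical, users are kept in memory, and the actual mail delivery and password storage are left out.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: in-memory registry where only an admin may create users.
final class UserRegistrySketch {
    enum Role { ADMIN, USER }

    // Characters that are easy to tell apart when read from a mail.
    private static final String ALPHABET =
            "ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnpqrstuvwxyz23456789";

    private final SecureRandom random = new SecureRandom();
    private final Map<String, String> emailByUsername = new HashMap<>();

    // Registers a user and returns the temporary password that would be mailed.
    String register(Role caller, String username, String email) {
        if (caller != Role.ADMIN) {
            throw new SecurityException("only the admin can create users");
        }
        if (emailByUsername.containsKey(username)) {
            throw new IllegalArgumentException("username already registered");
        }
        emailByUsername.put(username, email);
        return generatePassword(12);
    }

    boolean isRegistered(String username) {
        return emailByUsername.containsKey(username);
    }

    // Builds a random password from ALPHABET using a cryptographic random source.
    private String generatePassword(int length) {
        StringBuilder password = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            password.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return password.toString();
    }
}
```

`SecureRandom` is used instead of `java.util.Random` because the generated value serves as a credential.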
MODULE 2:USER LOGIN
• The user needs to log in with the username and
password sent by the admin to the user's mail
address.
• Only a valid username and password can be used
to log in to the cloud application.
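One common way to check a password at login is to store only a salted hash and compare hashes. The sketch below uses PBKDF2 from the standard Java library for this. It is an assumption about how the check could be done, not a description of the project's code; the class `LoginSketch`, the iteration count, and the key length are illustrative choices.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

// Illustrative only: salted password hashing and verification.
final class LoginSketch {
    private static final int ITERATIONS = 100_000; // assumed work factor
    private static final int KEY_BITS = 256;

    // Creates a fresh random salt to store next to the hash.
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    // Derives a hash from the password and salt with PBKDF2 (HMAC-SHA256).
    static byte[] hash(char[] password, byte[] salt) {
        PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS);
        try {
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(spec)
                    .getEncoded();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("PBKDF2 is not available", e);
        } finally {
            spec.clearPassword();
        }
    }

    // Returns true only if the attempt hashes to the stored value.
    static boolean verify(char[] attempt, byte[] salt, byte[] storedHash) {
        return MessageDigest.isEqual(hash(attempt, salt), storedHash);
    }
}
```

At registration the application would store the salt and the hash; at login it would call `verify` with the typed password.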
MODULE 3:CLOUD MANAGE
• The user uploads the documents that
need to be shared.
• The application asks for confirmation; after the file is
uploaded successfully, a link is immediately
generated and shown in an alert box.
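Link generation after an upload can be sketched as mapping a random token to the stored file. The class `ShareLinkSketch`, the `/share/` path, and the in-memory map are hypothetical; the sketch does not store file contents and does not show the alert box.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative only: issues a share link per uploaded file and resolves it again.
final class ShareLinkSketch {
    private final String baseUrl;
    private final Map<String, String> fileNameByToken = new HashMap<>();

    ShareLinkSketch(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    // Records the upload under a random token and returns the link to show the user.
    String upload(String fileName) {
        String token = UUID.randomUUID().toString();
        fileNameByToken.put(token, fileName);
        return baseUrl + "/share/" + token;
    }

    // Returns the file name behind a link, or null if the token is unknown.
    String resolve(String link) {
        String token = link.substring(link.lastIndexOf('/') + 1);
        return fileNameByToken.get(token);
    }
}
```

A random token keeps links hard to guess, so knowing one link does not reveal another user's file.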
SOFTWARE REQUIREMENTS
Windows 7 and above
JAVA

HARDWARE REQUIREMENTS
Hard Disk : 80GB and Above
RAM : 4GB and Above
Processor : Pentium IV and Above
TECHNOLOGY USED
• J2EE
• Cloud computing
• Framework: Apache
ARCHITECTURE
DATA-FLOW DIAGRAM
LEVEL 0
DATA-FLOW DIAGRAM
LEVEL 1
DATA-FLOW DIAGRAM
LEVEL 2
USE CASE DIAGRAM
COLLABORATION DIAGRAM
ACTIVITY DIAGRAM
Coding Part
ADMIN LOGIN
USER CREATION
USER LOGIN
UPLOAD DATA
LINK GENERATED
LOGOUT
CONCLUSION
• The proposed approach is expected to have
widespread application prospects in globally
serving companies, since analyzing geographically
dispersed datasets is an efficient way to support their
marketing decisions.
• Because the subproblems in the MiniBDP algorithm
have analytical or efficient solutions that allow
the algorithm to run in an online manner, the
proposed approach can be easily implemented in a
real system to reduce the operation cost.
Future Work
• Deploying the proposed algorithm in real
systems such as Amazon EC2.
• Extending cost minimization to cover data replication,
which adds the cost of replicating data
across datacenters.
• Extending the original model to support other
types of jobs, such as astronomical image
processing.
REFERENCE
1. “Square Kilometre Array,” http://www.skatelescope.org/
2. A. Vulimiri, C. Curino, B. Godfrey, T. Jungblut, J. Padhye, and G. Varghese, “Global analytics in the face
of bandwidth and regulatory constraints,” in Proceedings of the USENIX NSDI’15, 2015.
3. J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of
the ACM, vol. 51, no. 1, pp. 107–113, 2008.
4. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with
working sets,” in Proceedings of the USENIX HotCloud’10, 2010.
5. E. E. Schadt, M. D. Linderman, J. Sorenson, L. Lee, and G. P. Nolan, “Computational solutions to large-
scale data management and analysis,” Nature Reviews Genetics, vol. 11, no. 9, pp. 647–657, 2010.
6. M. Cardosa, C. Wang, A. Nangia et al., “Exploring MapReduce efficiency with highly-distributed data,” in
Proceedings of the Second International Workshop on MapReduce and its Applications, 2011.
7. L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. C. M. Lau, “Moving big data to the cloud: An online cost-
minimizing approach,” IEEE Journal on Selected Areas in Communications, vol. 31, pp. 2710–2721, 2013.
8. W. Yang, X. Liu, L. Zhang, and L. T. Yang, “Big data real-time processing based on Storm,” in Proceedings
of the IEEE TrustCom’13, 2013.
9. Y. Zhang, S. Chen, Q. Wang, and G. Yu, “i2MapReduce: Incremental MapReduce for mining evolving big
data,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, pp. 1906–1919, 2015.
10. D. Lee, J. S. Kim, and S. Maeng, “Large-scale incremental processing with MapReduce,” Future
Generation Computer Systems, vol. 36, no. 7, pp. 66–79, 2014.
11. B. Heintz, A. Chandra, R. K. Sitaraman, and J. Weissman, “End-to-end optimization for geo-distributed
MapReduce,” IEEE Transactions on Cloud Computing, 2014.
12. C. Jayalath, J. Stephen, and P. Eugster, “From the cloud to the atmosphere: Running MapReduce across
data centers,” IEEE Transactions on Computers, vol. 63, no. 1, pp. 74–87, 2014.
13. P. Li, S. Guo, S. Yu, and W. Zhuang, “Cross-cloud MapReduce for big data,” IEEE Transactions on
Cloud Computing, 2015, DOI: 10.1109/TCC.2015.2474385.
14. A. Sfrent and F. Pop, “Asymptotic scheduling for many task computing in big data platforms,”
Information Sciences, vol. 319, pp. 71–91, 2015.
15. L. Zhang, Z. Li, C. Wu, and M. Chen, “Online algorithms for uploading deferrable big data to the
cloud,” in Proceedings of the IEEE INFOCOM, 2014, pp. 2022–2030.
16. Q. Zhang, L. Liu, A. Singh et al., “Improving Hadoop service provisioning in a geographically
distributed cloud,” in Proceedings of IEEE Cloud’14, 2014.
17. 2017. [Online]. Available: http://www.datacenterknowledge.com/archives/2008/11/18/where-amazons-
data-centers-are-located/
THE END
