Optimization Approach For Data Placement in Cloud Computing: Preliminary Study
presented by
Outline
1. Overview
2. Survey
3. Summary of Survey
4. System Model and Assumptions
5. Problem Formulation
6. GAMS (Introduction)
7. Experimental Results
8. Conclusion
Overview
Data placement = data locality management in cloud computing
Cloud computing is an efficient solution for the data placement problem
Overview (Continued)
Key factors in data placement:
o Size of data
o Monetary cost for data placement/computation
o Resource reliability
o Network bandwidth
o Geographical data movement
Overview (Continued)
In this study, we:
propose ODPA (Optimal Data Placement Algorithm) to store data across cloud providers so as to minimize cost and delay
investigate a number of constraints (i.e., demand, processing) of the data placement problem
use GAMS to solve the data placement problem and obtain optimal solutions
perform numerical studies and experiments to evaluate our models
Survey
A Data Placement Strategy in Scientific Cloud Workflows
Data Storage Placement in Sensor Networks
Load Balancing and Data Placement for Multi-tiered Database Systems
New Algorithms for Planning Bulk Transfer via Internet and Shipping Networks
A Data Placement Strategy in Scientific Cloud Workflows [1/4]
scientists need to analyse terabytes of data, either from existing data resources or collected from physical devices
to store these data effectively, scientists must intelligently select data centers
problems
o moving the data becomes a challenge
o data movement also incurs cost
Continued [2/4]
goal: minimize data movement
generate test workflows to run on SwinDeW-C
two types of data:
o existing data
o generated data
Continued [3/4]
a k-means clustering algorithm is used to cluster data sets into data centers
o build-time stage: cluster the existing data sets into k data centers as the initial partitions
o run-time stage: cluster each generated data set to one of the k data centers based on its dependencies
Continued [4/4]
limitations
o does not consider the structure of data
o it is not practical to calculate the data sets' dependencies and assign them to data centers at the build-time stage
o it is very hard to predict when a certain dataset will be generated in a dynamic cloud environment
o it is impractical and inefficient to reserve storage for generated data at the build-time stage
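The two-stage strategy above can be sketched in Python. This is a minimal, self-contained k-means over "dependency vectors"; the vectors, the distance measure, and the value of k are illustrative assumptions, not the paper's actual data model:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two dependency vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(c) / len(pts) for c in zip(*pts)]

def kmeans(points, k, iters=20, seed=0):
    """Build-time stage: cluster existing data sets into k data centers."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: dist2(centroids[c], p))
            clusters[i].append(p)
        centroids = [mean(cl) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

def assign_generated(dataset_vec, centroids):
    """Run-time stage: send a newly generated data set to the
    nearest of the k data centers."""
    return min(range(len(centroids)),
               key=lambda c: dist2(centroids[c], dataset_vec))
```

The run-time stage only compares a new data set against the k centroids, which is what makes incremental placement cheap relative to re-clustering everything.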
Data Storage Placement in Sensor Networks [1/6]
storage node placement problem (i.e., how to store and search the collected data)
goal: minimize cost
problems
o each sensor is equipped with only limited memory or storage space
o since sensors are battery operated, stored data will be lost once the battery is depleted
o searching for data of interest in a widely scattered network is a hard problem
Continued [3/6]
collected data can be transmitted to the sink and stored there for future information retrieval
problems
o large amounts of data cannot be transmitted effectively from the sensor network to the sink
o transmissions take long routes, consuming much energy and quickly depleting sensor battery power
Continued [4/6]
fixed tree model: assumes the sensor network is organized into a tree rooted at the sink
dynamic tree model: the optimal communication tree is constructed after the storage nodes are deployed
o each sensor selects a storage node in its proximity for its data storage, so as to minimize energy cost
Continued [5/6]
in sensor networks, query is the most important application
if data are stored at the sink, queries can be answered with no transmission cost, but accumulating data at the sink is very costly
Continued [6/6]
the communication tree may be broken due to link failure; when building the tree, only stable links are chosen
in reality, storage nodes may not be deployed in a precise way; stochastic analysis is used to evaluate the performance of random deployment of storage nodes in both models
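The dynamic tree model's storage selection can be sketched as follows. This assumes energy cost is proportional to Euclidean distance and uses hypothetical node coordinates; the paper's actual energy model is more detailed:

```python
import math

def nearest_storage(sensor, storage_nodes):
    """Return (index, cost) of the cheapest storage node for a sensor,
    taking energy cost as the Euclidean distance to the node."""
    costs = [math.dist(sensor, s) for s in storage_nodes]
    best = min(range(len(storage_nodes)), key=costs.__getitem__)
    return best, costs[best]

def total_energy(sensors, storage_nodes):
    """Network-wide energy cost when every sensor forwards its data
    to the storage node in its proximity."""
    return sum(nearest_storage(s, storage_nodes)[1] for s in sensors)
```

In this simplified view, placing storage nodes well means minimizing `total_energy` over candidate deployments.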
Load Balancing and Data Placement for Multi-tiered Database Systems [1/4]
MQT (Materialized Query Table) is an auxiliary table with precomputed data
MQTA (Materialized Query Table Advisor) is often used to recommend and create MQTs
the MQTA is placed at the backend database server; recommending and creating MQTs at the frontend database server improves the response time of a query workload
Continued [2/4]
problem
o placing all or many MQTs at the frontend database server cannot improve the response time of the workload
o the MQTA alone cannot be used for placement decisions
extend the MQTA functionality with a DPA (Data Placement Advisor) and load-balancing strategies for automatic recommendation and placement of MQTs
WebSphere II is used as the frontend database server
o statistics about remote data sources are collected and maintained in WebSphere II for later use by the query optimizer
Continued [3/4]
the DPA takes user-specified preferences as input in order to cache MQTs at the frontend database server
it also considers information output by the MQTA, which provides MQT dependency information
limitation
o ignores the construction cost of an MQT; the cost of constructing an MQT is usually higher than its benefit
New Algorithms for Planning Bulk Transfer via Internet and Shipping Networks [1/4]
data from many source sites need to be transferred to a single sink (e.g., AWS, Google data centers) for processing
problem
o planning group-based, deadline-constrained data transfer through the Internet and by shipping storage devices via courier companies
Continued [2/4]
trade-offs between Internet transfer and shipping transfer:
o Internet transfer can be cheap and fast for small datasets, but very slow and expensive for large datasets
o shipping transfer can be cheap and fast for large datasets, but expensive for small datasets
the costs and latencies of both options are taken into account to make the optimal choice between shipping and Internet transfer
Continued [3/4]
build the Pandora (People and Networks Moving Data Around) planning system
o inputs: the dataset sizes at source sites; the interconnectivity between sources and sink (bandwidth, cost, and latency for both Internet and shipping links); and a latency deadline that bounds the total time taken for the transfer
the inputs are formulated, via integer programming, into a data transfer problem
Continued [4/4]
objective: minimize the total dollar cost c(f) while satisfying the latency deadline T
use a Mixed Integer Program (MIP) solver
use real data from FedEx and PlanetLab
show that their transfer-planning algorithms satisfy deadlines while simultaneously minimizing dollar cost
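The cost-versus-deadline trade-off can be illustrated with a toy planner. This is a brute-force stand-in for Pandora's actual MIP formulation, with hypothetical per-source options of the form (link name, dollar cost, latency in hours):

```python
from itertools import product

def plan_transfers(options_per_source, deadline):
    """Pick one link (Internet or shipping) per source to minimize
    total dollar cost, subject to every source finishing within the
    latency deadline. Returns (total cost, chosen link per source),
    or None if no plan meets the deadline."""
    best = None
    for choice in product(*options_per_source):
        if max(lat for _, _, lat in choice) > deadline:
            continue  # this combination misses the deadline
        cost = sum(c for _, c, _ in choice)
        if best is None or cost < best[0]:
            best = (cost, [name for name, _, _ in choice])
    return best

# Hypothetical sources: a small dataset (Internet is cheap) and a
# large dataset (shipping is cheap but has fixed courier latency).
sources = [
    [("internet", 5.0, 2.0), ("shipping", 40.0, 24.0)],
    [("internet", 300.0, 80.0), ("shipping", 60.0, 24.0)],
]
```

With a 48-hour deadline this chooses Internet for the small dataset and shipping for the large one; tightening the deadline below the courier latency makes the large dataset infeasible, which mirrors why Pandora must weigh both cost and latency per link.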
Summary of Survey
reviewed papers related to data placement problems in a variety of models
the surveyed works did not consider the structure of data and assume homogeneity
System Model and Assumptions

(system model figure: multiple cloud providers)
Problem Formulation
two optimization models are formulated:
o Minimize Cost
o Minimize Delay
GAMS (Introduction)
GAMS stands for General Algebraic Modeling System
we use GAMS version 23.7.3
GAMS provides an algebraic modeling syntax, so we do not need to implement the optimization algorithms ourselves
models can be solved on different types of computers
Experiment 1 (Continued)
Results of Brute Force Search (delay optimization)
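The brute-force search used as a baseline can be sketched as follows. The content sizes and provider bandwidths are hypothetical, and summing per-content delays (rather than, say, taking the maximum) is a simplifying assumption:

```python
from itertools import product

def brute_force_delay(sizes_mb, bandwidth_mbps):
    """Enumerate every assignment of contents to providers and
    return the placement minimizing total transfer delay, where a
    content's delay = its size / its provider's bandwidth."""
    providers = range(len(bandwidth_mbps))
    best_delay, best_plan = float("inf"), None
    for plan in product(providers, repeat=len(sizes_mb)):
        delay = sum(sizes_mb[i] / bandwidth_mbps[p]
                    for i, p in enumerate(plan))
        if delay < best_delay:
            best_delay, best_plan = delay, plan
    return best_plan, best_delay
```

The search space grows as providers^contents, which is exactly why a brute-force baseline only works at small scale and a solver such as GAMS is used for the full models.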
Experiment 2
Objective values per configuration:

Configuration   Cost ($)    Delay (MB/sec)
1               94.3990     15.7768
2               65.8896     8.8820
3               44.9550     7.6150
4               294.3855    43.2565
5               101.1281    16.7685
6               216.2001    35.1447
7               116.1138    23.4208
8               217.5314    49.6789
9               114.4319    23.4466
10              86.7977     13.7720
Experiment 2 (Continued)
Experiment 3
Vary the size of the contents in the optimization models
Experiment 3 (Continued)
Vary the network bandwidth in the optimization models
Experiment 3 (Continued)
Vary the processing power required for each content in the optimization models
Experiment 3 (Continued)
Change maximum processing power offered by each
Conclusion
extensively surveyed related work on data placement
proposed two optimization models (cost and delay)
performed experiments
limitations and future work:
o optimization approaches under uncertainty: stochastic programming and Markov game theory
o revision of the system model and assumptions