A Hybrid Filtering Approach for Storage Optimization in Main-Memory Cloud Database
Egyptian Informatics Journal (2015), Cairo University
FULL-LENGTH ARTICLE
a Department of Information Systems, Faculty of Computers and Information, Cairo University, Egypt
b Faculty of Computer Science, MSA University, Cairo, Egypt
KEYWORDS: Cloud computing; Cloud storage; Main-memory database; Hot/cold data; Cold data management

Abstract: Enterprises and cloud service providers face a dramatic increase in the amount of data stored in private and public clouds. Data storage costs are therefore growing rapidly, because a single high-performance storage tier is used for storing all cloud data. There is considerable potential to reduce cloud costs by classifying data into active (hot) and inactive (cold). In main-memory database research, recent works focus on approaches to identify hot/cold data. Most of these approaches track tuple accesses to identify hot/cold tuples. In contrast, we introduce a novel Hybrid Filtering Approach (HFA) that tracks accesses to both tuples and columns in main-memory databases. Our objective is to enhance performance along three dimensions: storage space, query elapsed time and CPU time. To validate the effectiveness of our approach, we realized a concrete implementation on Hekaton, SQL Server's memory-optimized engine, using the well-known TPC-H benchmark. Experimental results show that the proposed HFA outperforms the Hekaton approach in all performance dimensions. Specifically, HFA reduces the storage space by 44–96% on average, the query elapsed time by 25–93% on average and the CPU time by 31–97% on average compared to the traditional database approach.
© 2015 Production and hosting by Elsevier B.V. on behalf of Faculty of Computers and Information, Cairo University. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
http://dx.doi.org/10.1016/j.eij.2015.06.007
1110-8665 © 2015 Production and hosting by Elsevier B.V. on behalf of Faculty of Computers and Information, Cairo University.
330 G.M. Afify et al.
time when querying the data, which provides faster and more predictable performance than disk [1].

Recent growth in main-memory sizes has prompted huge increases in the prevalence of database systems that keep the entire database in memory. Nonetheless, main-memory is still a scarce resource and expensive compared to disk [2]. A major goal of recent research works is to improve main-memory storage optimization: the more memory that is freed, the larger the databases that can be kept in memory, which improves both performance and cost efficiency. The objective is to separate the data into active (hot) and inactive (cold) data. The hot data remain in main-memory and the cold data are moved to a cheaper cold store [3]. The main difference among existing techniques is the level of granularity at which the data is accessed and classified as hot or cold: in some databases this is at the tuple level and in others at the page level.

In the same context, cloud storage becomes more expensive because charges for "GB transferred" over the network vary with the amount of data transferred each month, possibly with large and unpredictable variations. Moreover, extra hidden fees, such as connection fees, maintenance charges, and data access charges, can add up quickly [4]. Therefore, the concept of multi-temperature cloud storage (hot, cold) was developed to improve the economics of storing these enormous amounts of data. Frequently accessed (hot) data is kept on fast, high-performance storage, while inactive (cold) data is archived onto lower-cost storage [5].

To the best of the authors' knowledge, this is the first initiative to propose a Hybrid Filtering Approach (HFA) that horizontally filters the database by hot tuples and then vertically filters it by defining hot attributes, with the aim of storage optimization (reducing storage space) in a main-memory cloud database. Moreover, we prove its efficiency compared to the traditional approach using a standard benchmark.

The contributions of this paper can be summarized as follows:

1. A comprehensive analysis of existing main-memory databases that focus on hot/cold data management.
2. An introduction to the proposed approach, explained through a detailed case study.
3. An evaluation of the effectiveness of the proposed approach using a standard benchmark.

The remainder of this paper is organized as follows. Section 2 surveys the recent related work. Section 3 introduces the proposed hybrid filtering approach. Section 4 presents a detailed case study to illustrate the workflow of the proposed approach. Section 5 reports the experimental evaluation of the proposed approach. Finally, Section 6 concludes the paper.

2. Related work

Recent development in hardware has led to rapidly dropping market prices of main-memory in the past years. This development made it economically feasible to use main-memory as the primary data store of a DBMS, which is the main characteristic of a main-memory DBMS. Recent research works focus on main-memory DBMS storage.

Commercial systems include Oracle's TimesTen [6], IBM's solidDB [7], and VoltDB [8]. On the other hand, research systems are HYRISE [9], H-Store [10], HyPer [11] and MonetDB [12]. These systems are suitable for databases that are smaller than the amount of physically available memory; if memory is exceeded, performance problems follow. This capacity limitation of main-memory DBMSs has been addressed by a number of recent works.

SAP HANA [13] is a columnar in-memory DBMS suitable for both OLTP and Online Analytical Processing (OLAP) workloads. It offers an approach to handle data aging [14]. Hot data refers to columns that are loaded into main-memory and can be accessed by the DBMS. Cold data is not loaded into main memory but is stored in the disk-based persistence layer. HANA uses the Least Recently Used (LRU) technique to distinguish between hot and cold data.

Oracle Database 12c In-Memory Option [15] is based on a dual-format data store, suitable for response-time-critical OLTP applications as well as analytical applications for real-time decision-making. The Oracle in-memory column store uses the LRU technique to identify hot/cold data.

HyPer is a main-memory hybrid OLTP and OLAP system [11]. It has a compaction-based approach used to handle hot and cold data [16]. In this approach, the authors use the capabilities of modern server systems to track data accesses. The data, stored in a columnar layout, is partitioned horizontally, and each partition is categorized by its access frequency. Data in the (rarely accessed) frozen category is still kept in memory but compressed and stored in huge pages to better utilize main memory. HyPer performs hot/cold data classification at the Virtual Machine (VM) page level.

In [17], the authors proposed a simple and low-overhead technique that enables a main-memory database to efficiently migrate cold data to secondary storage by relying on the Operating System (OS)'s virtual memory paging mechanism. Hot pages are pinned in memory, while cold pages are moved out by the OS to cold storage.

In [18], the authors implemented hot and cold separation in the main-memory database H-Store. The authors call this approach "Anti-Caching" to underline that hot data is no longer cached in main-memory; rather, cold data is evicted to secondary storage. To trace accesses to tuples, tuples are stored in an LRU chain per table.

A comparable approach is presented in Hekaton [19], SQL Server's memory-optimized OLTP engine, which manages hot and cold tuples. In Hekaton, the primary copy of the database is entirely stored in main-memory. Hot tuples remain in main-memory while cold ones are moved to cold secondary storage [20].

Table 1 summarizes the comparison between hot/cold data management approaches in main-memory databases. We observe that SAP HANA [14] vertically filters the data in a columnar layout, which is a different context from the row layout employed in our HFA approach. Oracle 12c dual-format [15] stores the primary copy of the data on disk and then uses the concept of hybrid filtering in its approach; however, it applies horizontal filtering (HF) and vertical filtering (VF) on disk, and then moves the hot data into the in-memory column store. In contrast, we apply the hybrid filtering approach to data resident in main-memory. HyPer [16,17] performs cold/hot data classification at the VM page level, which is different from our scope. It is shown in [18] that it is best to make the classification at the same level of granularity at which the data is accessed, which is at the
tuple-level. Compared to Anti-Caching [18], which uses the LRU technique to horizontally filter the database, our approach uses the "datetime" key filtering method. Finally, Hekaton [19] is the work closest to our approach, as it uses the same horizontal filtering methodology by hot tuples, using the application pattern "datetime" key to split the data [21]. Therefore, we chose to build on their work and extend their architecture in order to implement our HFA.

Our novel Hybrid Filtering Approach (HFA) is based on a row-store main-memory database. Our primary copy of the data is entirely stored in main-memory. First, HFA horizontally filters the data by hot tuples. Then, it vertically filters the data by hot columns.

3. Proposed hybrid filtering approach

Our proposed approach is composed of two phases, as shown in Fig. 1. In the first phase, the offline analysis, we classify the hot and cold attributes: hot attributes remain in main memory and cold ones are moved to cheaper secondary storage. In the second phase, the online analysis, the system interacts with users: the user enters a query and receives a response. In this paper, we focus on the offline analysis phase. Comprehensive details on the online analysis phase and the query profiling process will be addressed in a separate publication.
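The first of these two filtering steps, horizontally splitting tuples on an application "datetime" key, can be sketched as follows. This is a minimal Python illustration under our own assumptions: the table layout, column names and cutoff date are hypothetical and not taken from the paper.

```python
from datetime import date

def horizontal_filter(rows, date_key, cutoff):
    """Split tuples into hot (on/after the cutoff) and cold (before the
    cutoff) using an application-level datetime key."""
    hot = [r for r in rows if r[date_key] >= cutoff]
    cold = [r for r in rows if r[date_key] < cutoff]
    return hot, cold

# Hypothetical orders: recent tuples stay in memory, old ones go cold.
orders = [
    {"order_id": 1, "order_date": date(2013, 6, 10)},
    {"order_id": 2, "order_date": date(2015, 3, 5)},
]
hot, cold = horizontal_filter(orders, "order_date", date(2015, 1, 1))
# hot holds order 2 (recent); cold holds order 1 (candidate for cold storage)
```

The cutoff itself would come from the application business logic (e.g. "orders older than one year are cold"), which is why this style of filtering needs no per-tuple access tracking.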
3.1. Phase 1: offline analysis

The offline phase is composed of three modules. Periodically, we run the offline analysis to define the hot and cold attributes in the log files and update the hot/cold attributes list. The duration is predefined by the system administrator according to one of two factors: either by time (i.e., number of months) or by database workload (i.e., number of queries).

3.1.1. Horizontal filtering

Similar to recent research work, the primary copy of the database resides in main-memory and is horizontally filtered at tuple-level granularity into hot and cold tuples. The hot tuples remain in main-memory and the cold ones are migrated to cold secondary storage. In HFA, we use a horizontal filtering approach that depends mainly on the application business logic. Thus, we use the filtering pattern "datetime" key to split the data into hot/cold tuples [21].

3.1.2. Frequent attributes identification

In this module, we developed a novel technique to identify the hot/cold attributes. We analyze the queries stored in the log files to compute the frequency of occurrence of each attribute. The hot (most frequent) attributes are those whose frequency is greater than or equal to a pre-specified threshold; they are stored in the hot-attributes list. Conversely, an attribute is cold if its frequency is less than the pre-specified threshold; cold attributes are stored in the cold-attributes list.

3.1.3. Vertical filtering

Vertical filtering of a table T splits it into two or more tables (sub-tables), each of which contains a subset of the attributes in T. Since many queries access only a small subset of the attributes in a table, vertical filtering can reduce the amount of data that needs to be scanned to answer the query. According to the hot-attributes list, the database in main-memory is vertically filtered at attribute-level granularity into hot and cold attributes. The hot attributes remain in main-memory, while the cold ones are migrated to cold secondary storage.

Table 4  Query log file.

Q_ID   Table name   Attributes
101    Items        Item_ID, Brand, Description, Price
101    Customers    Name, Phone
102    Customers    Name, Phone
102    Items        Item_ID, Brand, Description, Price
102    Employee     Name, Phone
103    Items        Item_ID, Brand, Description
103    Customers    Name
104    Items        Item_ID, Brand, Price, Cost
104    Customers    Name
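The frequent-attributes identification and vertical filtering modules can be sketched together in Python. The query log below contains the Items entries of Table 4, and the threshold of 3 matches the one used in the case study; the concrete row values and the choice to keep the key column on both sides are our own illustrative assumptions.

```python
from collections import Counter

def classify_attributes(log, threshold):
    """Count attribute occurrences across the query log; attributes whose
    frequency is >= threshold are hot, the rest are cold."""
    freq = Counter(attr for _qid, _table, attrs in log for attr in attrs)
    hot = {a for a, n in freq.items() if n >= threshold}
    return hot, set(freq) - hot

def vertical_filter(rows, hot_attrs, key):
    """Project each tuple onto hot and cold attribute subsets; the key
    attribute is kept on both sides so the halves can be rejoined."""
    hot_rows = [{a: v for a, v in r.items() if a in hot_attrs or a == key}
                for r in rows]
    cold_rows = [{a: v for a, v in r.items() if a not in hot_attrs or a == key}
                 for r in rows]
    return hot_rows, cold_rows

# Items entries of the Table 4 query log: (Q_ID, table, accessed attributes).
log = [
    (101, "Items", ["Item_ID", "Brand", "Description", "Price"]),
    (102, "Items", ["Item_ID", "Brand", "Description", "Price"]),
    (103, "Items", ["Item_ID", "Brand", "Description"]),
    (104, "Items", ["Item_ID", "Brand", "Price", "Cost"]),
]
hot_attrs, cold_attrs = classify_attributes(log, threshold=3)
# hot_attrs = {Item_ID, Brand, Description, Price}; cold_attrs = {Cost}

items = [{"Item_ID": 1, "Brand": "A", "Description": "d1",
          "Price": 9.5, "Cost": 4.0}]
hot_tab, cold_tab = vertical_filter(items, hot_attrs, key="Item_ID")
# hot_tab keeps the hot attributes in memory;
# cold_tab (Item_ID, Cost) is migrated to cold storage
```

Rerunning `classify_attributes` on a later log snapshot is exactly the periodic refresh of the hot/cold attributes list that the offline phase performs.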
3.2. Phase 2: online analysis

The online phase is composed of three modules:

1. Query parsing: This module receives the user query and parses it to identify the requested tables and attributes.
2. Query storage: This module stores the user query in the Log files.
3. Query execution: This module executes the query and returns the results to the user. The algorithm of the query execution is demonstrated using pseudo code in Algorithm 1.

4. Case study

In this section, a detailed case study is presented in order to demonstrate the proposed HFA workflow. Table 2 shows the Items table, which consists of 9 attributes.

CREATE VIEW V1
AS SELECT Cost, Weight, Taxable
FROM dbo.Cold_Table;
GO

Second, we run the view to verify its contents:

SELECT * FROM V1;
GO

These attributes' frequencies will be incremented such that (Cost = 3, Weight = 1, Taxable = 1). The Cost attribute frequency is equal to the threshold (Lines 11–14), so it will be added to the hot-attributes list. Thus, the updated hot-attributes list = [Item_ID, Brand, Description, Price, Cost] and the cold-attributes list = [Weight, Shape, Taxable, Size, UPC]. Finally, the query is stored in the Log files.

5. Experimental evaluation of HFA approach

In order to systematically validate the effectiveness of our HFA approach, we have implemented it and the Hekaton
5.2. Workload
Figure 4  Storage space improvements (a) for the ORDERS table and (b) for the LINEITEM table (storage space improvement (%) vs. hot rows (%); series: Hekaton and HFA-2/4/6/8 hot columns in (a); Hekaton and HFA-3/7/11/15 hot columns in (b)).
76% compared to the original ORDERS table. In Fig. 4(b), the HFA approach has a storage improvement of 47–94% on average and the Hekaton approach of 25–75% on average compared to the original LINEITEM table.

5.4.2. Query elapsed time dimension

In this experiment, we investigate the query elapsed time of the proposed HFA compared to Hekaton in a main-memory database. As shown in Fig. 5, the query elapsed time of all approaches increases with an increasing number of hot rows. Our HFA outperforms Hekaton in all cases of vertical filtering except for HFA-8 hot columns when hot rows are below 50%.

From Fig. 5(a), it can be noted that the best elapsed time value for Hekaton is worse than the best value for our proposed HFA approach using HFA-2, HFA-4 and HFA-6. In Fig. 5(b), the best elapsed time value for Hekaton is worse than the best value for our proposed HFA approach using HFA-3, HFA-7 and HFA-11 hot columns.

As shown in Fig. 6(a), our HFA outperforms Hekaton: the HFA approach has an elapsed time improvement of 25–90% on average and the Hekaton approach of 12–74% on average compared to the original ORDERS table. In Fig. 6(b), the HFA approach has an elapsed time improvement of 45–93% on average and the Hekaton approach of 40–81% on average compared to the original LINEITEM table.
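The improvement percentages quoted throughout this section are relative reductions against the unfiltered baseline table. The exact formula is not restated in this excerpt, so the standard definition below is our assumption:

```python
def improvement_pct(baseline, value):
    """Relative improvement of `value` over `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline

# A query that takes 100 s on the original table and 25 s under HFA
# corresponds to a 75% elapsed-time improvement.
```

The same formula applies to the storage space and CPU time dimensions, with the original (fully resident) table as the baseline in each case.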
Figure 5  Query elapsed time (a) for the ORDERS table and (b) for the LINEITEM table (elapsed time (s) vs. hot rows (%); series: Hekaton and HFA-2/4/6/8 hot columns in (a); Hekaton and HFA-3/7/11/15 hot columns in (b)).
Figure 6  Elapsed time improvements (a) for the ORDERS table and (b) for the LINEITEM table (elapsed time improvement (%) vs. hot rows (%); series as in Fig. 5).

Figure 7  CPU time (a) for the ORDERS table and (b) for the LINEITEM table (CPU time (s) vs. hot rows (%); series as in Fig. 5).

Figure 8  CPU time improvements (a) for the ORDERS table and (b) for the LINEITEM table (CPU time improvement (%) vs. hot rows (%); series as in Fig. 5).
5.4.3. CPU time dimension

In this experiment, we investigate the CPU time of the proposed HFA compared to Hekaton in a main-memory database. As shown in Fig. 7, the CPU time of all approaches increases with an increasing number of hot rows. Our HFA outperforms Hekaton in all cases of vertical filtering except for HFA-15 hot columns when hot rows are below 50%.

From Fig. 7(a), it can be noted that the best CPU time value for Hekaton is worse than the best value for our proposed HFA approach using HFA-2, HFA-4 and HFA-6. In Fig. 7(b), the best CPU time value for Hekaton is worse than the best value for our proposed HFA approach using HFA-3, HFA-7 and HFA-11 hot columns.

As shown in Fig. 8(a), our HFA outperforms Hekaton: the HFA approach has a CPU time improvement of 31–97% on average and the Hekaton approach of 12–62% on average compared to the original ORDERS table. In Fig. 8(b), the HFA approach has a CPU time improvement of 60–96% on average and the Hekaton approach of 41–83% on average compared to the original LINEITEM table.