Storage Characteristics of Call Data Records in Column Store Databases
DAVID M WALKER, DATA MANAGEMENT & WAREHOUSING
OVERVIEW
This presentation gives a brief overview of the storage characteristics of Call Data Records in Column Store Databases. It discusses:
What are Call Data Records (CDRs)?
What is a Column Store Database?
How efficient is a column store database for storing CDRs and other (similar) machine-generated data?
It does not:
Examine performance in any detail
Compare column store to traditional row-based databases
Jan 2012 2012 Data Management & Warehousing 2
The example data we are using has 2 Datetime fields, 11 Char fields, 10 Numeric fields, 33 Integer fields and 25 Varchar fields (81 fields in total), which is a fairly typical mix for this type of machine-generated data. In the source file these are all held as ASCII text.
Column store databases store the values in columns and then hold a mapping to form the record. This is transparent to the user, who queries a table with SQL in exactly the same way as they would in a row-based database.
COLUMN STORAGE
[Figure: values stored once per column. First Name values: David, Helen, Sheila; further name values: Jones, Walker; Gender values: Female, Male]
Note: To the user this appears as a conventional row-based table that can be queried with standard SQL; only the underlying storage is different.
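As a rough illustration of the idea, the sketch below stores each column as a list of distinct values plus one small integer code per row, and rebuilds a record on request. It is a toy model only, not how Sybase IQ or any other product is implemented; the class names, column names and the pairing of names into rows are assumptions made for the example.

```python
from typing import Dict, List


class DictionaryColumn:
    """One column held as distinct values plus one integer code per row."""

    def __init__(self) -> None:
        self.values: List[str] = []      # each distinct value is stored once
        self.index: Dict[str, int] = {}  # value -> position in self.values
        self.codes: List[int] = []       # per-row pointer into self.values

    def append(self, value: str) -> None:
        code = self.index.get(value)
        if code is None:
            # First time this value is seen: add it to the value list.
            code = len(self.values)
            self.values.append(value)
            self.index[value] = code
        self.codes.append(code)

    def get(self, row: int) -> str:
        return self.values[self.codes[row]]


class ColumnTable:
    """A table made of columns; records are reassembled on demand."""

    def __init__(self, column_names: List[str]) -> None:
        self.columns = {name: DictionaryColumn() for name in column_names}
        self.row_count = 0

    def insert(self, record: Dict[str, str]) -> None:
        for name, column in self.columns.items():
            column.append(record[name])
        self.row_count += 1

    def row(self, n: int) -> Dict[str, str]:
        # The "mapping to form the record": look up row n in every column.
        return {name: col.get(n) for name, col in self.columns.items()}


# Illustrative rows only; the pairing of names is an assumption.
table = ColumnTable(["first_name", "last_name", "gender"])
table.insert({"first_name": "David", "last_name": "Walker", "gender": "Male"})
table.insert({"first_name": "Helen", "last_name": "Walker", "gender": "Female"})
table.insert({"first_name": "Sheila", "last_name": "Jones", "gender": "Female"})

print(table.row(1))
print(table.columns["gender"].values, table.columns["gender"].codes)
```

Repeated values such as a gender or a surname cost one small code per row rather than a full copy of the text, which is one reason repetitive machine-generated data tends to compress well in this layout.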
Consequently, column store databases are not efficient for OLTP-type applications. However, they are very efficient for DWH/BI/archive-type applications because the data is bulk loaded rather than inserted row by row, is not frequently updated and is used in large set-based queries.
Sybase has had a column store database called IQ since 1996, and it is one of the most established of the 25 or so currently listed on Wikipedia
The server was running CentOS 5.7 x64, a Red Hat Linux derivative
The hardware consisted of:
Intel Xeon Quad-Core X3363
16GB Memory
Adaptec 5405 RAID Controller with 2x 1TB 7200rpm Hard Disk (RAID1)
The database was built on file systems rather than raw devices
Total hardware cost was less than US$3000
Software licences were provided on evaluation
A PRODUCTION ENVIRONMENT?
What is needed to make this a production environment depends on the volume of data per month, the number of months of data to be held and the type of CDR
The biggest performance driver would be more disk spindles: adding more (faster) drives or using solid state disks. This would improve performance as well as add capacity
e.g. 16 1TB drives in a RAID10 configuration would provide around 7.75TB of space and store 75 billion of these CDRs
Using raw devices instead of file systems would also improve performance
Moving from 1 to 2 or 4 quad-core CPUs
Adding another 16GB of memory
Insert into the main CDR table from the DQ view CDR_CONVERT over the CDR_LOAD table
Record the size of the CDR table in kilobytes
Truncate the CDR_LOAD table
Compress the source file with gzip -9 (maximum compression, longest execution)
Record the size of the .gz file in bytes
Move the compressed .gz file to an archive directory
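The file-handling steps at the end of this process (compress, record the size, archive) can be sketched with the Python standard library. This is a minimal sketch under stated assumptions, not the script used in the test: the function name, file names and directory layout are hypothetical, and the database steps are left out because they depend on the database client in use.

```python
import gzip
import shutil
from pathlib import Path


def compress_and_archive(source: Path, archive_dir: Path) -> int:
    """Compress `source` at gzip level 9, move the .gz file to `archive_dir`
    and return the compressed size in bytes."""
    compressed = source.with_name(source.name + ".gz")

    # compresslevel=9 corresponds to `gzip -9`: best compression, slowest.
    with open(source, "rb") as raw, gzip.open(compressed, "wb", compresslevel=9) as gz:
        shutil.copyfileobj(raw, gz)

    size_bytes = compressed.stat().st_size  # the size to be recorded

    # The gzip command replaces the original file, so remove it here too.
    source.unlink()

    archive_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(compressed), str(archive_dir / compressed.name))
    return size_bytes
```

A call such as `compress_and_archive(Path("work/cdr_0001.txt"), Path("archive"))` (hypothetical paths) leaves `archive/cdr_0001.txt.gz` behind and returns its size for logging.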
RESULTS
12,902 files were loaded with zero data quality errors
435,583,388 CDRs
236.50 GB of raw files
Loading: 33 hours, 22 minutes, 12 seconds
Indexing: 2 hours, 13 minutes, 9 seconds
27.48 GB of un-indexed storage in the database
8.6:1 Compression Ratio
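The ratio follows directly from the two storage figures reported above. The short calculation below reproduces it and also derives an approximate per-record size; the per-record figures assume a gigabyte of 1024^3 bytes, which the slides do not state.

```python
raw_gb = 236.50          # size of the raw files, in gigabytes
stored_gb = 27.48        # un-indexed storage in the database, in gigabytes
cdr_count = 435_583_388  # number of CDRs loaded

ratio = raw_gb / stored_gb

# Assumption: 1 gigabyte = 1024**3 bytes.
bytes_per_gb = 1024 ** 3
raw_bytes_per_cdr = raw_gb * bytes_per_gb / cdr_count
stored_bytes_per_cdr = stored_gb * bytes_per_gb / cdr_count

print(f"Compression ratio: {ratio:.1f}:1")
print(f"Raw bytes per CDR: {raw_bytes_per_cdr:.0f}")
print(f"Stored bytes per CDR (un-indexed): {stored_bytes_per_cdr:.0f}")
```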
ADDING INDEXES
By default the table has no indexes
This is the same in most databases
The total space used was still 5.7 times smaller than the space used by the raw files
These indexes would significantly improve query performance
However, not all the indexes would be required in a production system, as not all fields would be actively queried, and this would reduce the space used
LOAD PERFORMANCE
The average file had 33,760 records
The ETL to load an average file took 11 seconds
2 seconds to copy to the working directory and decompress
3 seconds to import into the CDR_LOAD table
3 seconds to copy from the CDR_CONVERT table to the CDRS table
2 seconds to gzip -9 and archive
1 second for logging and truncating tables
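As a quick check on these timings, the step durations can be summed and turned into an approximate throughput for an average file. The figures are the ones quoted above; the dictionary keys are just descriptive labels.

```python
# Seconds per step for an average file, as quoted above.
step_seconds = {
    "copy and decompress": 2,
    "import into CDR_LOAD": 3,
    "copy into CDRS": 3,
    "gzip -9 and archive": 2,
    "logging and truncating": 1,
}
records_per_file = 33_760

total_seconds = sum(step_seconds.values())
records_per_second = records_per_file / total_seconds

print(f"Total per file: {total_seconds} seconds")
print(f"Throughput: about {records_per_second:.0f} records per second")
```

The steps add up to the 11 seconds quoted, which works out at roughly 3,000 records per second end to end on this hardware.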
OBSERVATIONS (1)
The results were approximately in the middle of our expectations and of our previous experience with similar data sets, where the raw data has been compressed between 5 and 10 times
Even low-end hardware gives acceptable load performance, suitable for archive functionality, but production-scale hardware is needed for BI/DWH
OBSERVATIONS (2)
Some database tuning techniques are needed for truly massive data sets, but they can be designed in from the outset at low cost (e.g. which indexes/index types)
It is worth considering putting each month (or some other similar date-based partition) in a separate table for systems management purposes, as it makes it easy to remove the data at the end of the archiving period
Smaller reference tables added to the schema would have little/no compression, but they are also very small and therefore would not contribute greatly to the space used
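One way to picture the table-per-month suggestion is a naming scheme that makes expired months easy to identify, so that a whole table can be dropped rather than deleting rows. The sketch below is illustrative only: the `CDR_YYYY_MM` naming pattern, the function names and the retention rule are assumptions, not something taken from the test system.

```python
from datetime import date
from typing import List


def month_table(d: date) -> str:
    """Name of the (hypothetical) monthly table that holds CDRs for date d."""
    return f"CDR_{d.year:04d}_{d.month:02d}"


def expired_tables(tables: List[str], today: date, keep_months: int) -> List[str]:
    """Return the monthly tables that fall outside the retention window.

    The current month and the `keep_months` months before it are retained.
    """
    # Count months as year * 12 + (month - 1) so that subtraction
    # crosses year boundaries correctly.
    cutoff_index = today.year * 12 + (today.month - 1) - keep_months
    cutoff = f"CDR_{cutoff_index // 12:04d}_{cutoff_index % 12 + 1:02d}"
    # Zero-padded names sort chronologically, so a string comparison is enough.
    return sorted(t for t in tables if t < cutoff)


existing = ["CDR_2011_08", "CDR_2011_09", "CDR_2011_10", "CDR_2012_01"]
to_drop = expired_tables(existing, today=date(2012, 1, 15), keep_months=3)
print(to_drop)
```

Each name returned could then be removed with a single drop statement, which is typically far cheaper than a large delete against one big table.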
ALTERNATIVE SCENARIOS
This presentation uses information gathered on specific data used for a specific purpose by a client
Companies may wonder how their data would work in both storage and performance terms
Vendors may also wonder how their technologies compare in both storage and performance terms
If you are interested in finding out, please contact us with these or any other Data Warehousing/Business Intelligence enquiries
CONTACT US
Data Management & Warehousing
Website: http://www.datamgmt.com
Telephone: +44 (0) 118 321 5930
David Walker
E-Mail: [email protected]
Telephone: +44 (0) 7990 594 372
Skype: datamgmt
White Papers: http://scribd.com/davidmwalker
ABOUT US
Data Management & Warehousing is a UK-based consultancy that has been delivering successful business intelligence and data warehousing solutions since 1995. Our consultants have worked with major corporations around the world, including in the US, Europe, Africa and the Middle East. We have worked in many industry sectors such as telcos, manufacturing, retail, financial and transport. We provide governance and project management as well as expertise in the leading technologies.
THANK YOU
2012 - DATA MANAGEMENT & WAREHOUSING HTTP://WWW.DATAMGMT.COM