Using Machine Learning To Characterize Database Workloads
Databases have been helping us manage our data for decades. Like much of the technology that we work with on a daily
basis, we may begin to take them for granted and miss the opportunities to examine our use of them—and especially
their cost.
For example, Intel stores much of its vast volume of manufacturing data in a massively parallel processing (MPP)
relational database management system (RDBMS). To keep data management costs under control, Intel IT decided to
evaluate our current MPP RDBMS against alternative solutions. Before we could do that, we needed to better understand
our database workloads and define a benchmark that is a good representation of those workloads. We knew that
thousands of manufacturing engineers queried the data, and we knew how much data was being ingested into the
system. However, we needed more details.
“How many concurrent users are there for each kind of query?”
Imagine that you’ve decided to open a beauty salon in your hometown. You want to build a facility that can meet today’s
demand for services as well as accommodate business growth. You should estimate how many people will be in the shop
at the peak time, so you know how many stations to set up. You need to decide what services you will offer. How many
people you can serve depends on three factors: 1) the speed at which the beauticians work; 2) how many beauticians are
working; and 3) what services the customer wants (just a trim, or a manicure, a hair coloring and a massage, for
example). The “workload” in this case is a function of what the customers want and how many customers there are. But
that also varies over time. Perhaps there are periods of time when a lot of customers just want trims. During other
periods (say, before Valentine’s Day), both trims and hair coloring are in demand, and yet at other times a massage might
be almost the only demand (say, people using all those massage gift cards they just got on Valentine’s Day). It may even
be seemingly random, unrelated to any calendar event. If you get more customers at a peak time and you don’t have
enough stations or qualified beauticians, people will have to wait, and some may deem it too crowded and walk away.
So now let’s return to the database. For our MPP RDBMS, the “services” are the different types of interactions between
the database and the engineers (consumption) and the systems that are sending data (ingestion). Ingestion consists of
standard extraction-transformation-loading (ETL), critical path ETL, bulk loads, and within-DB insert/update/delete
requests (both large and small). Consumption consists of reports and queries—some run as batch jobs, some ad hoc.
At the outset of our workload characterization, we wanted to identify the kinds of database “services” that were being
performed. We knew that, like a trim versus a full service in the beauty salon example, SQL requests could be very simple
or very complex or somewhere in between. What we didn’t know was how to generalize a large variety of these requests
into something more manageable without missing something important. Rather than trusting our gut feel, we wanted to
be methodical about it. We took a novel approach to developing a full understanding of the SQL requests: we decided to
apply Machine Learning (ML) techniques, including k-means clustering and Classification and Regression Trees (CART).
In our beauty salon example, we might use k-means clustering and CART to analyze customers and identify groups with
similarities such as “just hair services,” “hair and nail services,” and “just nail services.”
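To make that concrete, here is a minimal sketch of how per-request resource metrics could be clustered with k-means using scikit-learn. The file name and column names (cpu_time, peak_thread_io, run_time) are placeholders for illustration rather than the actual fields in our database logs, and the choice of k would come from standard diagnostics such as the elbow method or silhouette scores.

```python
# Minimal sketch: cluster individual SQL/ETL requests by resource usage.
# File and column names are illustrative assumptions, not the real log schema.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

requests = pd.read_csv("request_metrics.csv")  # hypothetical export of the database logs
features = requests[["cpu_time", "peak_thread_io", "run_time"]]

# Scale the features so CPU seconds and I/O bytes contribute comparably.
scaled = StandardScaler().fit_transform(features)

# k is assumed here; in practice it would be chosen from elbow/silhouette diagnostics.
kmeans = KMeans(n_clusters=7, n_init=10, random_state=0)
requests["cluster"] = kmeans.fit_predict(scaled)

# Summarize each cluster to see what kind of "service" it represents.
print(requests.groupby("cluster")[["cpu_time", "run_time"]].describe())
```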
For our database, our k-means clustering and CART efforts revealed that ETL requests fell into seven clusters (predicted by CPU time, highest thread I/O, and running time), while SQL requests fell into six clusters (based on CPU time).
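A shallow CART fit on the cluster labels is one way to see which metrics "predict" each cluster, as in the ETL finding above. The sketch below uses the same assumed column names and a hypothetical file that already carries the cluster labels from the previous step; the real analysis may have used different tooling and features.

```python
# Sketch: fit a shallow CART on the k-means labels to see which metrics
# separate the clusters. Column and file names are illustrative assumptions.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

labeled = pd.read_csv("request_metrics_with_clusters.csv")  # hypothetical output of the clustering step
X = labeled[["cpu_time", "peak_thread_io", "run_time"]]
y = labeled["cluster"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The printed rules show which features (CPU time, thread I/O, run time) split the clusters.
print(export_text(tree, feature_names=list(X.columns)))
print(dict(zip(X.columns, tree.feature_importances_)))
```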
Once we had our groupings, we could take the next step, which was to characterize various peak periods. The goal was
to identify something equivalent to “regular,” “just before Valentine’s” and “just after Valentine’s” workload types—but
without really knowing upfront about any “Valentine’s Day” events. We started by generating counts of requests per group per hour from months of historical database logs. Next, we used k-means clustering again, this time to create clusters of one-hour slots that were similar to each other with respect to their counts of requests per group. Finally, from each cluster we picked the few one-hour slots with the highest overall CPU utilization to create sample workloads.
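Putting the peak-period step together, a minimal sketch of that pipeline might look like the following. The log export, column names (ts, group, cpu_seconds), the number of hour clusters, and the "three busiest hours per cluster" cutoff are all illustrative assumptions rather than our actual parameters.

```python
# Minimal sketch of the peak-period characterization, not our production code.
# Column names, cluster count, and the top-3 cutoff are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

log = pd.read_csv("request_log.csv", parse_dates=["ts"])  # hypothetical log export
log["hour"] = log["ts"].dt.floor("H")

# One row per one-hour slot, one column per request group, values are request counts.
counts = pd.crosstab(log["hour"], log["group"])

# Cluster the one-hour slots by their mix of request groups.
km = KMeans(n_clusters=5, n_init=10, random_state=0)
slots = counts.copy()
slots["cluster"] = km.fit_predict(StandardScaler().fit_transform(counts))

# Total CPU seconds per hour, used to pick the busiest examples from each cluster.
slots["cpu_seconds"] = log.groupby("hour")["cpu_seconds"].sum()

# Keep the three highest-CPU hours in each cluster as candidate sample workloads.
samples = (slots.sort_values("cpu_seconds", ascending=False)
                .groupby("cluster")
                .head(3))
print(samples[["cluster", "cpu_seconds"]])
```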
The best thing about this process was that it was driven by data and reliable ML-based insights. (This is not the case with
my post-Valentine’s massages-only conjecture, because I didn’t have any gift cards.) The workload characterization was
essential to benchmarking the cost and performance of our existing MPP RDBMS and several alternatives. You can read
the IT@Intel white paper, “Minimizing Manufacturing Data Management Costs,” for a full discussion of how we created a
custom benchmark and then conducted several proofs of concept with vendors to run the benchmark.