1. THERE’S MORE TO BIG DATA THAN “BIG”
The “Big” in Big Data applies to much more than simply the volume of data. There is a threshold above which data
becomes truly Big Data, but that threshold is constantly moving as technology improves. With current
technologies, Big Data seems an appropriate term as one begins dealing with data-analysis scenarios that process
hundreds of terabytes. This is even more true when petabytes become the practical unit of measure. Note the
qualifying phrase, “data analysis scenarios that process.” A physical data center that hosts an exabyte of data is
not necessarily dealing with Big Data. But if you must analyze an exabyte of data to answer a given question, then
you are far into the realm of Big Data.
The point is that a large amount of data becomes Big Data only when you must analyze that data as a set. If you are
simply storing 20 years’ worth of nightly system backups so that you can someday reference what a modest-sized
data set looked like 12 years ago, then you don’t have a Big Data scenario on your hands; you simply have a big
storage situation. Big Data is about the analysis of truly large sets of data. If pressed to pick a single, simple metric
for Big Data, record quantity would probably be the most accurate choice. But as you'll see, there are more dimensions to Big
Data than either sheer volume or record quantity.
If Big Data were all about running traditional SELECT queries against bigger and bigger row quantities and sizes,
then we could simply build bigger clusters of relational databases. When you talk to data scientists about Big Data,
the primary idea that you come away with is the difference in analytical methods compared to traditional
relational-database queries. Big Data is about finding the compound relationships between many records of varied
information types. With traditional relational databases, the relationships are predefined in terms of discrete
entities with primary and foreign keys and views that join data along those linked keys. Each time you encounter a
new entity type, you must add a new table
and define its relationship to all the existing
tables. Such encounters are often more complicated, requiring you to refactor a table into two or more new tables.
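To make that contrast concrete, here is a minimal sketch, in Python using the standard-library sqlite3 module, of the predefined-relationship model described above. The table and column names (customers, orders, devices) are hypothetical examples, not drawn from any particular system: a view joins records along primary and foreign keys, and encountering a new entity type means adding a new table and wiring it to the existing schema before its data can be queried.

# A minimal sketch of predefined relationships in a relational schema.
# Table names (customers, orders, devices) are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total       REAL NOT NULL
    );
    -- The relationship is fixed in advance; queries join along the linked keys.
    CREATE VIEW customer_orders AS
        SELECT c.name, o.order_id, o.total
        FROM customers c
        JOIN orders o ON o.customer_id = c.customer_id;
""")

# A new entity type means altering the schema: a new table must be created
# and related to the existing ones before any of its data can be analyzed.
con.executescript("""
    CREATE TABLE devices (
        device_id   INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        model       TEXT
    );
""")

The schema, not the data, encodes what questions can be asked; analyzing many records of varied, loosely structured information with this model forces continual redesign rather than simply bigger SELECT queries.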
The final measure of magnitude that helps to define Big Data is velocity, or the rate at which new data must be
stored. (For a second, more significant aspect to velocity, see section 2, “The Real-Time Requirement for BDSA.”)
Certainly, not all Big Data scenarios include high velocity. Analysis of data that is collected over a multi-decade