44 Recognizing Your Data Types: Structured and Unstructured Data
Your raw data may consist of presentations, individual text files, images,
audio and video files, and e-mails — for openers.
The sheer amount of this data can be overwhelming. If you categorize it,
however, you create the core of any predictive analytics effort. The more
you learn about your data, the better able you are to analyze and use it.
You can start by getting a good working knowledge of your data types — in
particular, structured versus unstructured data, and streamed versus static
data. The upcoming sections give you a closer look at these data types.
It’s also worth noting that search engine platforms provide readily available
tools for indexing data and making it searchable.
Unstructured data does not completely lack structure — you just have to ferret
it out. Even the text inside digital files still has some structure associated with
it, often showing up in the metadata — for example, document titles, dates the
files were last modified, and their authors’ names. The same applies to
e-mails: The contents may be unstructured, but structured data is associated
with them, such as the date and time they were sent, the names of their
senders and recipients, and whether they contain attachments.
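To make the e-mail case concrete, here is a minimal sketch that separates the structured fields of a message from its free-form body. It uses only Python's standard `email` package. The sample message, the addresses, and the function name `extract_structured_fields` are invented for illustration and are not taken from any real system.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message used purely for illustration.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Hi, there!
Date: Mon, 03 Mar 2014 09:15:00 +0000

Can we move the product review to Thursday?
"""


def extract_structured_fields(raw: str) -> dict:
    """Split an e-mail into structured header fields and an unstructured body."""
    msg = message_from_string(raw)
    # A part counts as an attachment if its Content-Disposition says so.
    has_attachment = any(
        part.get_content_disposition() == "attachment" for part in msg.walk()
    )
    return {
        "sender": msg["From"],                          # structured
        "recipient": msg["To"],                         # structured
        "subject": msg["Subject"],                      # structured, not always informative
        "sent_at": parsedate_to_datetime(msg["Date"]),  # structured
        "has_attachment": has_attachment,               # structured
        "body": msg.get_payload(),                      # unstructured free text
    }


fields = extract_structured_fields(RAW_EMAIL)
```

Everything except `body` can be sorted, filtered, and counted right away. The body still needs text analysis before it tells you anything, and the subject line here says nothing about the actual request.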
The idea here is that you can still find some order to work with while you’re
going through all that “unstructured data.” Of course, you may have to do
some digging. The content of a thread of 25 e-mails shooting back and forth
between two people may wander away from the subject line of the
original e-mail, even if the subject line stays the same. Additionally, the
subject line of that first e-mail may not accurately reflect even
its own content. (For example, the subject line may say
something as unhelpful as “Hi, there!”)
The dividing line between the two data types isn’t always clear. You can
usually find some attributes of unstructured data that can be
considered structured data. Whether that structure reflects the
content of the data, or is useful in data analysis, is often unclear.
For that matter, structured data can hold unstructured data within it. In a
web form, for example, users may be asked to give feedback on a product
by choosing an answer from multiple choices — but also presented with
a comment box where they can provide additional feedback. The answers
from multiple choices are structured; the comment field is unstructured
because of its free-form nature. Such cases are best understood as a mix
of structured and unstructured data. Most data is a composite of both.
For a successful predictive analytics project, both your structured and
unstructured data must be combined in a logical format that can be analyzed.
The two main models for analyzing streamed data are as follows:
✓ Examine only the newest data points and make a decision about the state
of the model and its next move. This approach is incremental — essentially
building up a picture of the data as it arrives.
✓ Evaluate the entire dataset, or a subset of it, to make a decision each
time new data points arrive. This approach is inclusive of more data
points in the analysis — what constitutes the “entire” dataset changes
every time new data is added.
Depending on the nature of your business and the anticipated impact of the
decision, one model may be preferable to the other.
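The difference between the two models can be sketched with a simple average. In the sketch below, the first class updates its estimate from each new data point alone, and the second re-evaluates a subset of the data (the most recent points) every time something arrives. The class names, the window size, and the sample stream are assumptions chosen for illustration; a real streaming model would track something richer than a mean.

```python
from collections import deque
from statistics import mean


class IncrementalMean:
    """Model 1: look only at the newest point and update the running state."""

    def __init__(self) -> None:
        self.count = 0
        self.value = 0.0

    def update(self, x: float) -> float:
        self.count += 1
        # Nudge the running mean toward the new point; no history is stored.
        self.value += (x - self.value) / self.count
        return self.value


class WindowedMean:
    """Model 2: re-evaluate a subset of the dataset each time data arrives."""

    def __init__(self, window_size: int) -> None:
        # The deque silently drops the oldest point once it is full.
        self.window = deque(maxlen=window_size)

    def update(self, x: float) -> float:
        self.window.append(x)
        return mean(self.window)  # recomputed from the stored points


stream = [10.0, 12.0, 11.0, 30.0, 31.0]  # hypothetical sensor readings
incremental = IncrementalMean()
windowed = WindowedMean(window_size=3)

incremental_results = [incremental.update(x) for x in stream]
windowed_results = [windowed.update(x) for x in stream]
```

The incremental version needs almost no memory, but old readings keep influencing its estimate. The windowed version reacts faster to the jump at the end of the stream, at the cost of storing and rescanning data on every arrival.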
Clearly, analyzing streamed data differs from analyzing static data. Analyzing
a mix of both data types can be even more challenging.
✓ Items bought
✓ Methods of payment
✓ Whether the purchased items were on sale
48 Part I: Getting Started with Predictive Analytics
Other types of data can be collected from customers with their co-operation:
In addition, the type of data that a business collects from its operations can
provide information about its customers. Common examples include the
amount of time that customers spend on company websites, as well as
customers’ browsing histories. All that data combined can be analyzed to
answer some important questions:
The first step toward answering these questions (and many others) is to
collect and use all customer-related operations data for a comprehensive
analysis. The data types that make up such data can intersect and could
be described and/or grouped differently for the purposes of analysis.
When companies put out surveys that ask their customers for feedback and
their thoughts about the companies’ lines of business and products, the
collected data is an example of attitudinal data.
Behavioral data
Behavioral data derives from what customers do when they interact with the
business; it consists mainly of data from sales transactions. Behavioral data
tends to be more reliable than attitudinal data because it represents what
actually happened.
Businesses know, for example, what products are selling, who is buying them,
and how customers are paying for them.
Combining both attitudinal and behavioral data can make your predictive
analytics models more accurate by helping you define the segments of your
customer base, offer a more personalized customer experience, and identify
the drivers behind the business.
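As a rough illustration of how the two kinds of data can be combined, the sketch below joins survey scores (attitudinal) with purchase totals (behavioral) by customer ID and assigns each customer to a segment. The customer IDs, amounts, cutoffs, and segment names are all hypothetical; in practice you would choose them from your own data.

```python
from collections import defaultdict

# Attitudinal data: satisfaction scores (1-10) from a hypothetical survey.
survey_scores = {"c1": 9, "c2": 3, "c3": 8}

# Behavioral data: (customer ID, purchase amount) from hypothetical sales records.
purchases = [("c1", 120.0), ("c2", 150.0), ("c1", 80.0), ("c4", 40.0)]


def build_profiles(scores: dict, sales: list) -> dict:
    """Join both data sources on customer ID."""
    spend = defaultdict(float)
    for customer_id, amount in sales:
        spend[customer_id] += amount
    return {
        customer_id: {
            "satisfaction": scores.get(customer_id),   # None if no survey answer
            "total_spend": spend.get(customer_id, 0.0),
        }
        for customer_id in scores.keys() | spend.keys()
    }


def segment(profile: dict, spend_cutoff: float = 100.0, score_cutoff: int = 7) -> str:
    """Assign a segment label; the cutoffs are arbitrary illustrative values."""
    score = profile["satisfaction"]
    if score is None:
        return "no survey response"
    big_spender = profile["total_spend"] >= spend_cutoff
    satisfied = score >= score_cutoff
    if big_spender and satisfied:
        return "loyal"
    if big_spender:
        return "at risk"
    if satisfied:
        return "growth potential"
    return "low engagement"


profiles = build_profiles(survey_scores, purchases)
segments = {cid: segment(p) for cid, p in profiles.items()}
```

Neither source alone would flag customer `c2`, who spends heavily but reports low satisfaction. That kind of signal only appears once the two data types sit side by side.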
Demographic data
Demographic data comprises information including age, race, marital status,
education level, employment status, household income, and location. You
can get demographic data from the U.S. Census Bureau, other government
agencies, or commercial entities.
The more data you have about your customers, the better you can identify
specific demographic and market trends and understand
how they may affect your business. Tracking demographic
trends enables you to adjust to the changes and better market to, attract,
and serve those segments.
Combining both types of analysis empowers your business and enables you
to expand your understanding, insight, and awareness of your business and
your customers. It makes your decision process smarter and, in turn,
more profitable.
Data-driven analytics
If you’re basing your analysis purely on existing data, you can use internal
data — accumulated by your company over the years — or external data
(often purchased from a source outside your company) that is relevant to
your line of business.
To make sense of that data, you can employ data-mining tools to overcome
both its complexity and size; reveal some patterns you were not aware of;
uncover some associations and links within your data; and use your findings
to generate new categorizations, new insights and new understanding.
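One simple way to surface associations you weren't aware of is to count how often items appear together. The sketch below does that for a handful of made-up shopping baskets. The transactions and the function name `pair_counts` are assumptions for illustration, and dedicated data-mining tools go well beyond this kind of raw counting.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds the items bought together.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def pair_counts(baskets: list) -> Counter:
    """Count how many baskets contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("bread", "butter") and ("butter", "bread") the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(transactions)
top_pair, top_count = counts.most_common(1)[0]
```

Here `top_pair` comes out as bread and butter, which share three of the four baskets. On real data, the pairs that rise to the top are the candidate links to investigate further.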
Data-driven analysis can even reveal a gem or two that can radically improve
your business — all of which gives this approach an element of surprise that
feeds on curiosity and builds anticipation.
Data-driven analysis is best suited for large datasets because it’s hard
for human beings to wrap their minds around huge amounts of data.
Data-mining tools and visualization techniques help us get a closer look
and cut the overwhelming mass of data down to size. Keep these general
principles in mind:
✓ The more complete your data is, the better the outcome of data-driven
analytics. If you have extensive data that holds key information about the
variables you’re measuring and spans an extended period of time,
you’re likely to discover something new about your business.
✓ Data-driven analytics is neutral: No prior knowledge about the
data is necessary, and you’re not pursuing a specific goal; you’re
analyzing the data to see what it reveals.
✓ The nature of this analysis is broad; it does not concern itself with
a specific search or the validation of a preconceived idea. This approach
to analytics can be viewed as broad, open-ended data mining.