44 Recognizing Your Data Types: Structured and Unstructured Data

This document discusses different types of data that are important to consider for predictive analytics projects. It describes structured versus unstructured data, with structured data being well-organized and easy for computers to analyze, while unstructured data is free-form and requires more preprocessing. It also discusses static versus streamed data, with static data being self-contained and streamed data changing continuously in real-time. The document provides examples and comparisons of these different data types to help categorize data sources.

44 Part I: Getting Started with Predictive Analytics

Recognizing Your Data Types


If your company is like most others, you’ve gathered a large amount of data
through the years — simply as a result of operating a business. Some of
this data can be found in your databases; some may be scattered across
hard drives on your company’s computers or in its online content.

Your raw data may consist of presentations, individual text files, images,
audio and video files, and e-mails — for openers.

The sheer amount of this data can be overwhelming. If you categorize it,
however, you create the core of any predictive analytics effort. The more
you learn about your data, the better able you are to analyze and use it.
You can start by getting a good working knowledge of your data types — in
particular, structured versus unstructured data, and streamed versus static
data. The upcoming sections give you a closer look at these data types.

Structured and unstructured data


Data contained in databases, documents, e-mails, and other data files can be
categorized either as structured or unstructured data.

Structured data is well organized, follows a consistent order, is relatively
easy to search and query, and can be readily accessed and understood by a
person or a computer program.

A classic example of structured data is an Excel spreadsheet with labeled
columns. Such structured data is consistent; column headers (usually
brief, accurate descriptions of the content in each column) tell you
exactly what kind of content to expect. In a column labeled e-mail address,
for example, you can count on finding a list of (no surprise here) e-mail
addresses. Such overt consistency makes structured data amenable to
automated data management.

Structured data is usually stored according to a well-defined schema, as in
a database. It’s usually tabular, with columns and rows that clearly define
its attributes.
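To make the point about querying concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are illustrative assumptions, not data from this chapter.

```python
import sqlite3

# Hypothetical contact table; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, email_address TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?)",
    [
        ("Ana", "ana@example.com", "Boston"),
        ("Raj", "raj@example.com", "Austin"),
        ("Mei", "mei@example.com", "Boston"),
    ],
)


def emails_in_city(connection, city):
    """Return the e-mail addresses of all contacts in one city, sorted."""
    rows = connection.execute(
        "SELECT email_address FROM contacts WHERE city = ? ORDER BY email_address",
        (city,),
    )
    return [row[0] for row in rows]


boston_emails = emails_in_city(conn, "Boston")
```

Because every row has the same labeled columns, a single query can answer the question with no text parsing at all.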

Unstructured data, on the other hand, tends to be free-form, non-tabular,
dispersed, and not easily retrievable; such data requires deliberate
intervention to make sense of it. Miscellaneous e-mails, documents,
web pages, and files (whether text, audio, and/or video) in scattered
locations are examples of unstructured data.

It’s hard to categorize the content of unstructured data. It tends to be
mostly text, it’s usually created in a hodgepodge of free-form styles, and
finding any attributes you can use to describe or group it is no small task.

Chapter 3: Exploring Your Data Types and Associated Techniques 45

The content of unstructured data is hard to work with or make sense of
programmatically. Computer programs cannot readily analyze or generate
reports on such data without preprocessing, because it lacks structure,
has no underlying dominant characteristic, and its individual items
have no common ground.

In general, there’s a higher percentage of unstructured data than structured
data in the world. Unstructured data requires more work to make it useful,
so it gets more attention and thus tends to consume more time. No wonder
the promise of a processing capability that can swiftly make sense of huge
bodies of unstructured data is a major selling point for predictive analytics.

Don’t underestimate the importance of structured data and the power it
brings to your analysis. It’s far more efficient to analyze structured data
than to analyze unstructured data. Unstructured data can also be costly
to preprocess for analysis as you’re building a predictive analytics
project. The selection of relevant data, its cleansing, and subsequent
transformations can be lengthy and tedious. The newly organized data
that results from those necessary preprocessing steps can then be
used in a predictive analytics model. The wholesale transformation
of unstructured data, however, may have to wait until you have your
predictive analytics model up and running.

Data mining and text analytics are two approaches to structuring text
documents, linking their contents, grouping and summarizing their
data, and uncovering patterns in that data. Both disciplines provide
a rich framework of algorithms and techniques to mine the text
scattered across a sea of documents.

It’s also worth noting that search engine platforms provide readily available
tools for indexing data and making it searchable.
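As a rough illustration of what indexing means, the sketch below builds a tiny inverted index (a map from each word to the documents that contain it) in plain Python. It is a toy under stated assumptions: the file names and texts are invented, and real search platforms add ranking, stemming, and much more.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents that contain every query word."""
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


# Hypothetical free-form documents scattered across a company's drives.
docs = {
    "memo.txt": "Quarterly sales rose in the Boston region.",
    "email_17.txt": "Can we discuss the Boston sales forecast?",
    "notes.txt": "Forecast meeting moved to Friday.",
}
index = build_index(docs)
boston_sales = search(index, "boston sales")
```

Once the index exists, free-form text becomes searchable by keyword, which is a first step toward grouping and summarizing it.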

Table 3-1 compares structured and unstructured data.

Table 3-1 Characteristics of Structured and Unstructured Data

Characteristics   Structured                 Unstructured
Association       Organized                  Scattered and dispersed
Appearance        Formally defined           Free-form
Accessibility     Easy to access and query   Hard to access and query
Availability      Percentagewise lower       Percentagewise higher
Analysis          Efficient to analyze       Additional preprocessing is needed

Unstructured data does not completely lack structure; you just have to ferret
it out. Even the text inside digital files still has some structure associated
with it, often showing up in the metadata: for example, document titles, dates
the files were last modified, and their authors’ names. The same thing applies
to e-mails: The contents may be unstructured, but structured data is associated
with them, such as the date and time they were sent, the names of their
senders and recipients, and whether they contain attachments.
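The following sketch shows one way to pull the structured fields out of an e-mail with Python's standard email package, leaving the free-form body as is. The message text is invented for illustration, and treating any multipart message as "has an attachment" is a deliberate simplification.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical raw message; every name and value here is made up.
RAW_EMAIL = """\
From: Ana <ana@example.com>
To: Raj <raj@example.com>
Subject: Hi, there!
Date: Mon, 03 Mar 2014 09:15:00 +0000

Raj, the Q1 forecast looks low. Can we revisit the pricing model?
"""


def extract_fields(raw):
    """Split one raw e-mail into structured header fields and a free-form body."""
    msg = message_from_string(raw)
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "subject": msg["Subject"],
        "sent": parsedate_to_datetime(msg["Date"]),
        # Simplifying assumption: multipart is used as a proxy for attachments.
        "has_attachment": msg.is_multipart(),
        "body": msg.get_payload().strip(),
    }


record = extract_fields(RAW_EMAIL)
```

The header fields drop straight into table columns, while the body still needs text analytics before it can be grouped or summarized.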

The idea here is that you can still find some order you can use while you’re
going through all that “unstructured data.” Of course, you may have to do
some digging. The content of a thread of 25 e-mails shooting back and forth
between two correspondents may wander away from the subject line of the
original e-mail, even if the subject line stays the same. Additionally, the
very first subject line in that e-mail thread may not accurately reflect even
the content of that very first e-mail. (For example, the subject line may say
something as unhelpful as “Hi, there!”)

The separation line between the two data types isn’t always clear. In general,
you can find some attributes of unstructured data that can be
considered structured data. Whether that structure reflects the
content of that data, or is useful in data analysis, is unclear at best.
For that matter, structured data can hold unstructured data within it. In a
web form, for example, users may be asked to give feedback on a product
by choosing an answer from multiple choices — but also presented with
a comment box where they can provide additional feedback. The answers
from multiple choices are structured; the comment field is unstructured
because of its free-form nature. Such cases are best understood as a mix
of structured and unstructured data. Most data is a composite of both.

Technically speaking, there will always be some exceptions in defining data
categories; the lines between the two can be blurry. But the idea is to make
a useful distinction between structured and unstructured data, and that is
almost always possible.

For a successful predictive analytics project, both your structured and
unstructured data must be combined in a logical format that can be analyzed.

Static and streamed data


Data can also be identified as streamed, static, or a mix of the two. Streamed
data changes continuously; examples include the constant stream of
Facebook updates, tweets on Twitter, and the constantly changing
stock prices while the market is still open.

Streamed data is continuously changing; static data is self-contained and
enclosed. The problems associated with static data include gaps, outliers,
or incorrect data, all of which may require some cleansing, preparation,
and preprocessing before you can use static data for an analysis.
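As one possible illustration of such cleansing, the sketch below fills gaps with the median and flags outliers using the median absolute deviation (MAD). Both choices, the cutoff factor of 3, and the sales figures are assumptions made for this example; other strategies may suit your data better.

```python
from statistics import median


def clean_series(values, outlier_factor=3.0):
    """Fill gaps (None) with the median and flag values far from it.

    A value is flagged when its distance from the median exceeds
    outlier_factor times the median absolute deviation (MAD).
    """
    present = [v for v in values if v is not None]
    mid = median(present)
    mad = median(abs(v - mid) for v in present)
    filled = [mid if v is None else v for v in values]
    outliers = [v for v in present if mad and abs(v - mid) > outlier_factor * mad]
    return filled, outliers


# Hypothetical daily sales with one gap and one suspicious entry.
daily_sales = [102, 98, None, 101, 97, 950, 100]
filled, outliers = clean_series(daily_sales)
```

The function only flags outliers and leaves them in place, since deciding whether a value such as 950 is an error or a genuine spike usually takes business knowledge.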

Streamed data brings problems of its own. Volume can be a problem; the
sheer amount of non-stop data constantly arriving can be overwhelming. And
the faster the data is streaming in, the harder it is for the analysis to
keep up.

The two main models for analyzing streamed data are as follows:

✓ Examine only the newest data points and make a decision about the state
of the model and its next move. This approach is incremental — essentially
building up a picture of the data as it arrives.
✓ Evaluate the entire dataset, or a subset of it, to make a decision each
time new data points arrive. This approach is inclusive of more data
points in the analysis — what constitutes the “entire” dataset changes
every time new data is added.

Depending on the nature of your business and the anticipated impact of the
decision, one model may be preferable to the other.
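The two models can be contrasted with a small sketch. Here the "decision" is reduced to computing an average price, which is an assumption chosen for brevity; the class names and the window size are likewise hypothetical.

```python
from collections import deque


class RunningMean:
    """Incremental model: update the estimate from each new point only."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


class WindowMean:
    """Re-evaluation model: recompute from a stored subset on every arrival."""

    def __init__(self, size):
        self.window = deque(maxlen=size)  # keeps only the most recent points

    def update(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)


prices = [10.0, 12.0, 11.0, 15.0]
running, windowed = RunningMean(), WindowMean(size=2)
running_means = [running.update(p) for p in prices]
window_means = [windowed.update(p) for p in prices]
```

The incremental version stores almost nothing and does constant work per point, while the windowed version pays more per arrival but can react faster to recent changes.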

Some business domains, such as the analysis of environmental, market, or
intelligence data, prize new data that arrives in real time. All this data must
be analyzed as it’s being streamed, and interpreted not only correctly but
right away. Based on the newly available information, the model redraws the
whole internal representation of the outside world. Doing so provides you
with the most up-to-date basis for a decision you may need to make and act
upon quickly.

For example, a predictive analytics model may process a stock price as a
data feed, even while the data is rapidly changing, analyze the data in the
context of immediate market conditions existing in real time, and then decide
whether to trade a particular stock.

Clearly, analyzing streamed data differs from analyzing static data. Analyzing
a mix of both data types can be even more challenging.

Identifying Data Categories


As a result of doing business, companies have gathered masses of data about
their business and customers, often referred to as business intelligence. To
help you develop categories for your data, what follows is a general rundown
of the types of data that are considered business intelligence:

Behavioral data derives from transactions, and can be collected
automatically:

✓ Items bought
✓ Methods of payment
✓ Whether the purchased items were on sale
✓ The purchasers’ access information:
   • Address
   • Phone number
   • E-mail address
All of us have provided such data when making a purchase online (or even
when buying at a store or over the phone).

Other types of data can be collected from customers with their co-operation:

✓ Data provided by customers when they fill out surveys
✓ Customers’ collected answers to polls via questionnaires
✓ Information collected from customers who make direct contact with
companies
   • In a physical store
   • Over the phone
   • Through the company website

In addition, the type of data that a business collects from its operations can
provide information about its customers. Common examples include the
amount of time that customers spend on company websites, as well as
customers’ browsing histories. All that data combined can be analyzed to
answer some important questions:

✓ How can your business improve the customer experience?
✓ How can you retain existing customers and attract new ones?
✓ What would your customer base like to buy next?
✓ What purchases can you recommend to particular customers?

The first step toward answering these questions (and many others) is to
collect and use all customer-related operations data for a comprehensive
analysis. The data types that make up such data can intersect and could
be described and/or grouped differently for the purposes of analysis.

Some companies collect these types of data by giving customers personalized
experiences. For example, when a business provides its customers with
the tools they need to build personalized websites, it not only empowers
customers (and enriches their experience of dealing with the company), it
also allows the company to learn from a direct expression of its customers’
wants and needs: the websites they create.

Attitudinal data
Any information that can shed light on how customers think or feel is
considered attitudinal data.

When companies put out surveys that ask their customers for feedback and
thoughts about their lines of business and products, the collected
data is an example of attitudinal data.

Attitudinal data has a direct impact on the type of marketing campaign
a company can launch. It helps shape and target the message of that
campaign. Attitudinal data can help make both the message and
the products more relevant to the customers’ needs and wants,
allowing the business to serve existing customers better and attract
prospective ones.

The limitation of attitudinal data is a certain imperfection: Not everyone
answers survey questions objectively, and not everyone provides all the
relevant details that shaped their thinking at the time of the survey.

Behavioral data
Behavioral data derives from what customers do when they interact with the
business; it consists mainly of data from sales transactions. Behavioral data
tends to be more reliable than attitudinal data because it represents what
actually happened.

Businesses know, for example, what products are selling, who is buying them,
and how customers are paying for them.

Behavioral data is a by-product of normal operations, so it is available to
a company at no extra cost. Attitudinal data, on the other hand, requires
conducting surveys or commissioning market research to get insights
into the minds of the customers.

Attitudinal data is analyzed to understand why customers behave the way
they do; it details their views of your company. Behavioral data tells
you what is happening and records customers’ real actions. Attitudinal
data provides insight into motivations; behavioral data provides the
who-did-what, the overall context that led to customers’ particular
reactions. Your analysis should include groups for both types of data;
they are complementary.

Combining both attitudinal and behavioral data can make your predictive
analytics models more accurate by helping you define the segments of your
customer base, offer a more personalized customer experience, and identify
the drivers behind the business.
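As a hedged sketch of how the two kinds of data might be combined for segmentation, the example below joins purchase totals (behavioral) with survey scores (attitudinal) by customer ID. The records, cutoffs, and segment labels are all invented for illustration and are not taken from this chapter.

```python
# Hypothetical records keyed by customer ID; all names and values are made up.
purchases = {  # behavioral: what customers actually did
    "c1": [120.0, 80.0],
    "c2": [15.0],
    "c3": [200.0, 150.0, 90.0],
}
survey_scores = {  # attitudinal: satisfaction on a 1-5 scale
    "c1": 2,
    "c2": 5,
    "c3": 4,
}


def segment(customer_id, spend_cutoff=100.0, score_cutoff=3):
    """Label a customer using total spend and survey score together."""
    total = sum(purchases.get(customer_id, []))
    score = survey_scores.get(customer_id)
    if score is None:
        return "no survey"
    high_spend = total >= spend_cutoff
    satisfied = score >= score_cutoff
    if high_spend and satisfied:
        return "loyal"
    if high_spend:
        return "at risk"
    if satisfied:
        return "growth potential"
    return "low priority"


segments = {cid: segment(cid) for cid in purchases}
```

Neither source alone separates customer c1 (a big spender who is unhappy) from c3 (a big spender who is satisfied); the combination does.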

Table 3-2 compares attitudinal and behavioral data.

Table 3-2 Comparing Attitudinal and Behavioral Data

Characteristics   Attitudinal              Behavioral
Data Source       Customers’ thoughts      Customers’ actions
Data Means        Collected from surveys   Collected from transactions
Data Type         Subjective               Objective
Data Cost         May cost extra           No extra cost

Demographic data
Demographic data comprises information such as age, race, marital status,
education level, employment status, household income, and location. You
can get demographic data from the U.S. Census Bureau, other government
agencies, or commercial entities.

The more data you have about your customers, the better the insight you’ll
have into identifying specific demographic and market trends as well as
how they may affect your business. Measuring the pulse of the demographic
trends will enable you to adjust to the changes and better market to, attract,
and serve those segments.

Different segments of the population are interested in different products.

Small businesses catering to specific locations should pay attention to
the demographic changes in those locations. All of us have witnessed
populations changing over time in certain neighborhoods. Businesses
must be aware of such changes; they may affect business significantly.

Demographic data, when combined with behavioral and attitudinal data,
allows marketers to paint an accurate picture of their current and
potential customers, so they can increase satisfaction, retention,
and acquisition.

Generating Predictive Analytics


There are two ways to go about generating or implementing predictive analytics:
purely on the basis of your data (with no prior knowledge of what you’re
after) or with a proposed business goal that the data may or may not support.
You don’t have to choose one or the other; the two approaches can be
complementary. Each has its advantages and disadvantages.

Whether you’re coming up with hypotheses to test, analyzing the results that
come out of your data analysis (and making sense of them), or starting to
examine your data with no prior assumptions of what you may find, the goal
of your analysis is always the same: to decide whether to act on what you
find. You have an active role in implementing the process needed for either
approach to predictive analytics. Both approaches have their limitations;
keep risk management in mind as you cross-examine their results. Ask
yourself which approach promises good results while remaining relatively
safe.

Combining both types of analysis empowers your business and enables you
to expand your understanding, insight, and awareness of your business and
your customers. It makes your decision process smarter and subsequently
more profitable.

Data-driven analytics
If you’re basing your analysis purely on existing data, you can use internal
data — accumulated by your company over the years — or external data
(often purchased from a source outside your company) that is relevant to
your line of business.

To make sense of that data, you can employ data-mining tools to overcome
both its complexity and size; reveal some patterns you were not aware of;
uncover some associations and links within your data; and use your findings
to generate new categorizations, new insights and new understanding.
Data-driven analysis can even reveal a gem or two that can radically improve
your business — all of which gives this approach an element of surprise that
feeds on curiosity and builds anticipation.

Data-driven analysis is best suited for large datasets because it’s hard
for human beings to wrap their minds around huge amounts of data.
Data-mining tools and visualization techniques help us get a closer look
and cut the overwhelming mass of data down to size. Keep these general
principles in mind:

✓ The more complete your data is, the better the outcome of data-driven
analytics. If you have extensive data that holds key information about
the variables you’re measuring and spans an extended period of time,
you’re likely to discover something new about your business.
✓ Data-driven analytics is neutral: no prior knowledge about the data is
necessary, and you’re not after a specific goal but analyzing the data
for its own sake.
✓ The nature of this analysis is broad; it does not concern itself with
a specific search or with validating a preconceived idea. This approach
to analytics can be viewed as broad, somewhat random data mining.
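To give a feel for this kind of open-ended exploration, the sketch below counts which pairs of items show up together in purchase baskets without starting from any hypothesis. It is a simplified stand-in for the association techniques that data-mining tools provide; the baskets and the minimum count are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_count=2):
    """Count how often each pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical transaction data.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["coffee", "milk"],
    ["bread", "jam"],
]
pairs = frequent_pairs(baskets)
```

Nothing in the code says which items to look for; any pattern that surfaces comes from the data alone.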
