
5.1. - Structured and Unstructured Data

The document discusses the differences between structured and unstructured data. Structured data resides in databases and has a defined structure, making it easier to analyze. Unstructured data includes various file formats like text, audio, video and has no predefined structure, making it more challenging to analyze. New tools are emerging that use machine learning to better analyze unstructured data.

Uploaded by

Dave Chapelle

Structured data vs. unstructured data: structured data consists of clearly defined
data types whose pattern makes them easily searchable, while unstructured data
(“everything else”) consists of data that is usually not as easily searchable,
including formats like audio, video, and social media postings.

Unstructured data vs. structured data does not denote any real conflict between the
two. Customers select one or the other not based on their data structure, but on the
applications that use them: relational databases for structured, and most any other type
of application for unstructured data.


However, there is a growing tension between the ease of analysis on structured data
versus more challenging analysis on unstructured data. Structured data analytics is a
mature process and technology. Unstructured data analytics is a nascent industry with
a lot of new investment into R&D, but is not a mature technology. The structured data
vs. unstructured data issue within corporations is deciding if they should invest in
analytics for unstructured data, and if it is possible to aggregate the two into better
business intelligence.


What is Structured Data?


Structured data usually resides in relational databases (RDBMS). Fields store
length-delineated data such as phone numbers, Social Security numbers, or ZIP codes. Even text
strings of variable length like names are contained in records, making it a simple matter
to search. Data may be human- or machine-generated as long as the data is created
within an RDBMS structure. This format is eminently searchable both with human
generated queries and via algorithms using type of data and field names, such as
alphabetical or numeric, currency or date.

Common relational database applications with structured data include airline
reservation systems, inventory control, sales transactions, and ATM activity. Structured
Query Language (SQL) enables queries on this type of structured data within relational
databases.

Some relational database applications, such as customer relationship management (CRM)
systems, do store or point to unstructured data. The integration can be awkward at best
since memo fields do not lend themselves to traditional database queries. Still, most of
the CRM data is structured.
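
As a minimal sketch of why fixed, typed fields are easy to query, the following Python snippet builds a small in-memory relational table and runs one SQL query against it. The table name `sales`, its columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, customer TEXT, zip TEXT, amount REAL, sold_on TEXT)"
)
conn.executemany(
    "INSERT INTO sales (customer, zip, amount, sold_on) VALUES (?, ?, ?, ?)",
    [
        ("Ada", "94105", 120.50, "2020-01-03"),
        ("Grace", "10001", 75.00, "2020-01-04"),
        ("Ada", "94105", 30.25, "2020-02-11"),
    ],
)

# Because every field has a name and a type, filtering and
# aggregating the data takes a single declarative query.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales "
    "WHERE sold_on >= ? GROUP BY customer ORDER BY customer",
    ("2020-01-01",),
).fetchall()
print(totals)
conn.close()
```

A free-text memo field in the same table could still be stored, but a query like the one above could not group or sum anything inside it.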

What is Unstructured Data?


Unstructured data is essentially everything else. Unstructured data has internal
structure but is not structured via pre-defined data models or schema. It may be textual
or non-textual, and human- or machine-generated. It may also be stored within a
non-relational (NoSQL) database.

Typical human-generated unstructured data includes:

● Text files: Word processing, spreadsheets, presentations, email, logs.


● Email: Email has some internal structure thanks to its metadata, and we
sometimes refer to it as semi-structured. However, its message field is
unstructured and traditional analytics tools cannot parse it.
● Social Media: Data from Facebook, Twitter, LinkedIn.
● Website: YouTube, Instagram, photo sharing sites.
● Mobile data: Text messages, locations.
● Communications: Chat, IM, phone recordings, collaboration software.
● Media: MP3, digital photos, audio and video files.
● Business applications: MS Office documents, productivity applications.
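
The point about email being semi-structured can be sketched with Python's standard `email` package: the header fields parse into named values, while the body comes back as one undifferentiated string. The message below is a hypothetical example, not taken from any real data set.

```python
from email import message_from_string

# Hypothetical message, for illustration only.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Q3 inventory report\n"
    "Date: Mon, 06 Jan 2020 09:30:00 +0000\n"
    "\n"
    "Hi Bob, the warehouse count looks off again. Can we talk tomorrow?\n"
)

msg = message_from_string(raw)

# Structured part: named header fields can be read and filtered directly.
headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}

# Unstructured part: the body is free text with no field boundaries.
body = msg.get_payload()

print(headers["Subject"])
print(body.strip())
```

Sorting or filtering by sender, date, or subject needs nothing more than the headers; understanding what the body says requires text analytics.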

Typical machine-generated unstructured data includes:

● Satellite imagery: Weather data, land forms, military movements.


● Scientific data: Oil and gas exploration, space exploration, seismic imagery,
atmospheric data.
● Digital surveillance: Surveillance photos and video.
● Sensor data: Traffic, weather, oceanographic sensors.
The most inclusive Big Data analysis makes use of both structured and unstructured
data.

Structured vs. Unstructured Data: What’s the Difference?

Besides the obvious difference between storing in a relational database and storing
outside of one, the biggest difference is the ease of analyzing structured data vs.
unstructured data. Mature analytics tools exist for structured data, but analytics tools
for mining unstructured data are nascent and developing.

Users can run simple content searches across textual unstructured data. But its lack of
orderly internal structure defeats the purpose of traditional data mining tools, and the
enterprise gets little value from potentially valuable data sources like rich media,
network or weblogs, customer interactions, and social media data. Even though
unstructured data analytics tools are in the marketplace, no one vendor or toolset is
the clear winner. And many customers are reluctant to invest in analytics tools with
uncertain development roadmaps.

On top of this, there is simply much more unstructured data than structured.
Unstructured data makes up 80% or more of enterprise data, and is growing at a rate
of 55% to 65% per year. And without the tools to analyze this massive data,
organizations are leaving vast amounts of valuable data on the business intelligence
table.

Structured data is traditionally easier for Big Data applications to digest, yet today's data
analytics solutions are making great strides in handling unstructured data.

How Semi-Structured Data Fits with Structured and Unstructured Data

Semi-structured data maintains internal tags and markings that identify separate data
elements, which enables information grouping and hierarchies. Both documents and
databases can be semi-structured. This type of data only represents about 5-10% of the
structured/semi-structured/unstructured data pie, but has critical business usage
cases.

Email is a very common example of a semi-structured data type. Although more
advanced analysis tools are necessary for thread tracking, near-dedupe, and concept
searching, email’s native metadata enables classification and keyword searching
without any additional tools.

Email is a huge use case, but most semi-structured development centers on easing data
transport issues. Sharing sensor data is a growing use case, as are Web-based data
sharing and transport: electronic data interchange (EDI), many social media platforms,
document markup languages, and NoSQL databases.

Examples of Semi-structured Data

● Markup language XML: This is a semi-structured document language. XML is a
set of document encoding rules that defines a human- and machine-readable
format. (Although saying that XML is human-readable doesn’t pack a big punch:
anyone trying to read an XML document has better things to do with their time.)
Its value is that its tag-driven structure is highly flexible, and coders can adapt it
to universalize data structure, storage, and transport on the Web.
● Open standard JSON (JavaScript Object Notation): JSON is another semi-structured
data interchange format. JavaScript is in the name, but the format is
language-independent, and most programming languages can read and write it.
Its structure consists of name/value pairs
(or object, hash table, etc.) and an ordered value list (or array, sequence, list).
Since the structure is interchangeable among languages, JSON excels at
transmitting data between web applications and servers.
● NoSQL: Semi-structured data is also an important element of many NoSQL (“not
only SQL”) databases. NoSQL databases differ from relational databases
because they do not separate the organization (schema) from the data. This
makes NoSQL a better choice to store information that does not easily fit into the
record and table format, such as text with varying lengths. It also allows for
easier data exchange between databases. Some newer NoSQL databases like
MongoDB and Couchbase also incorporate semi-structured documents by
natively storing them in the JSON format.
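
The two JSON building blocks described above, name/value pairs and ordered lists, can be illustrated with a short Python sketch using the standard `json` module. The `profile` record and its field names are hypothetical.

```python
import json

# Hypothetical profile record, for illustration only.
profile = {
    "name": "Jane Doe",               # name/value pair
    "title": "Data Engineer",
    "skills": ["SQL", "Python"],      # ordered value list (array)
    "location": {"city": "Austin"},   # nested object
}

text = json.dumps(profile)    # serialize to a JSON string for transport
restored = json.loads(text)   # parse it back on the receiving side

print(text)
print(restored["skills"][0])
```

The tags (the names) travel with the data, so the receiver needs no separate schema to find `skills` or `location`, and a second record is free to carry different fields.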

In big data environments, NoSQL does not require admins to separate operational and
analytics databases into separate deployments. NoSQL is the operational database and
hosts native analytics tools for business intelligence. In Hadoop environments, NoSQL
databases ingest and manage incoming data and serve up analytic results.

These databases are common in big data infrastructure and real-time Web applications
like LinkedIn. On LinkedIn, hundreds of millions of business users freely share job titles,
locations, skills, and more; and LinkedIn captures the massive data in a semi-structured
format. When job seeking users create a search, LinkedIn matches the query to its
massive semi-structured data stores, cross-references data to hiring trends, and shares
the resulting recommendations with job seekers. The same process operates with sales
and marketing queries in premium LinkedIn services like Sales Navigator. Amazon also
bases its reader recommendations on semi-structured databases.

Structured vs. Unstructured Data: Next Gen Tools are Game Changers

New tools are available to analyze unstructured data, particularly given specific use
case parameters. Most of these tools are based on machine learning. Structured data
analytics can use machine learning as well, but the massive volume and many different
types of unstructured data require it.

A few years ago, analysts using keywords and key phrases could search unstructured
data and get a decent idea of what the data involved. eDiscovery was (and is) a prime
example of this approach. However, unstructured data has grown so dramatically that
users need to employ analytics that not only work at compute speeds, but also
automatically learn from their activity and user decisions. Natural Language Processing
(NLP), pattern sensing and classification, and text-mining algorithms are all common
examples, as are document relevance analytics, sentiment analysis, and filter-driven
Web harvesting. Unstructured data analytics with machine-learning intelligence allows
organizations to:

● Analyze digital communications for compliance. Failed compliance can cost
companies millions of dollars in fees, litigation, and lost business. Pattern
recognition and email threading analysis software searches massive amounts of
email and chat data for potential noncompliance. A recent example is
Volkswagen, which might have avoided huge fines and reputational hits
by using analytics to monitor communications for suspicious messages.
● Track high-volume customer conversations in social media. Text analytics and
sentiment analysis let analysts review positive and negative results of
marketing campaigns, or even identify online threats. This level of analytics is far
more sophisticated than simple keyword search, which can only report basics like how
often posters mentioned the company name during a new campaign. New
analytics also include context: was the mention positive or negative? Were
posters reacting to each other? What was the tone of reactions to executive
announcements? The automotive industry, for example, is heavily involved in
analyzing social media, since car buyers often turn to other posters to gauge
their car buying experience. Analysts use a combination of text mining and
sentiment analysis to track auto-related user posts on Twitter and Facebook.
● Gain new marketing intelligence. Machine-learning analytics tools quickly work
on massive amounts of documents to analyze customer behavior. A major
magazine publisher applied text mining to hundreds of thousands of articles,
analyzing each separate publication by the popularity of major subtopics. Then
they extended analytics across all their content properties to see which overall
topics got the most attention by customer demographic. The analytics ran
across hundreds of thousands of pieces of content across all publications, and
cross-referenced hot topic results by segments. The result was a rich education
on which topics were most interesting to distinct customers, and which
marketing messages resonated most strongly with them.
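
The gap between simple keyword search and sentiment analysis can be sketched in a few lines of Python. This is a toy illustration only: the word lists, the brand name "Acme", and the sample posts are all made up, and production tools rely on trained machine-learning models rather than a fixed lexicon.

```python
import re

# Toy word lists, assumed for illustration; real tools use trained models.
POSITIVE = {"love", "great", "reliable"}
NEGATIVE = {"hate", "terrible", "broken"}

def tokens(post):
    """Lowercase a post and split it into word tokens."""
    return re.findall(r"[a-z']+", post.lower())

def mentions(posts, keyword):
    """Simple keyword search: count the posts that mention the keyword."""
    return sum(keyword in tokens(p) for p in posts)

def sentiment(post):
    """Lexicon score: number of positive words minus number of negative words."""
    words = tokens(post)
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Hypothetical posts about a made-up brand.
posts = [
    "I love my new Acme sedan, great mileage",
    "Acme service was terrible and the radio is broken",
    "Thinking about buying an Acme",
]

print(mentions(posts, "acme"))
print([sentiment(p) for p in posts])
```

The keyword count treats all three posts the same, while even this crude score separates a happy customer from an unhappy one.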

New Info

Structured vs Unstructured Data: 5 Key Differences

1. Structured data is clearly defined and searchable types of data, while
unstructured data is usually stored in its native format.
2. Structured data is quantitative, while unstructured data is qualitative.
3. Structured data is often stored in data warehouses, while unstructured
data is stored in data lakes.
4. Structured data is easy to search and analyze, while unstructured data
requires more work to process and understand.
5. Structured data exists in predefined formats, while unstructured data is in a
variety of formats.
Data is fundamental to business decisions. A company's ability to gather the right
data, interpret it, and act on those insights is often what will determine its level of
success. But the amount of data accessible to companies is ever increasing, as
are the different kinds of data available. Business data comes in a wide variety of
formats, from strictly formed relational databases to your last tweet. All of this
data, in all its different formats, can be divided into two main categories:
structured data and unstructured data.
In this article, we'll take a closer look at these concepts and the differences
between them.

What is Structured Data?


The term structured data refers to data that resides in a fixed field within a file or
record. Structured data is typically stored in a relational database (RDBMS). It
can consist of numbers and text, and sourcing can happen automatically or
manually, as long as it's within an RDBMS structure. It depends on the creation
of a data model, defining what types of data to include and how to store and
process it.
The standard query language for structured data is SQL (Structured Query
Language). Developed by IBM in the 1970s, SQL is used to define, query, and
update data in relational databases.
Typical examples of structured data are names, addresses, credit card numbers,
geolocation, and so on.

What is Unstructured Data?


Unstructured data is more or less all the data that is not structured. Even though
unstructured data may have a native, internal structure, it's not structured in a
predefined way. There is no data model; the data is stored in its native format.
Typical examples of unstructured data are rich media, text, social media activity,
surveillance imagery, and so on.
The amount of unstructured data is much larger than that of structured data.
Unstructured data makes up a whopping 80% or more of all enterprise data, and
the percentage keeps growing. This means that companies not taking
unstructured data into account are missing out on a lot of valuable business
intelligence.

What is Semistructured Data?


Semistructured data is a third category that falls somewhere between the other
two. It's a type of structured data that does not fit into the formal structure of a
relational database. But while not matching the description of structured data
entirely, it still employs tagging systems or other markers, separating different
elements and enabling search. Sometimes, this is referred to as data with a self-
describing structure.
A typical example of semistructured data is smartphone photos. Every photo
taken with a smartphone contains unstructured image content as well as the
tagged time, location, and other identifiable (and structured) information. Semi-
structured data formats include JSON, CSV, and XML file types.

Structured vs Unstructured Data: 5 Key Differences


1) Defined vs Undefined Data
Structured data is clearly defined types of data in a structure, while unstructured
data is usually stored in its native format. Structured data lives in rows and
columns and it can be mapped into pre-defined fields. Unlike structured data,
which is organized and easy to access in relational databases, unstructured data
does not have a predefined data model.

2) Qualitative vs Quantitative Data


Structured data is often quantitative data, meaning it usually consists of hard
numbers or things that can be counted. Methods for analysis include regression
(to predict relationships between variables); classification (to estimate
probability); and clustering of data (based on different attributes).
Unstructured data, on the other hand, is often categorized as qualitative data,
and cannot be processed and analyzed using conventional tools and methods. In
a business context, qualitative data can, for example, come from customer
surveys, interviews, and social media interactions. Extracting insights from
qualitative data requires advanced analytics techniques like data mining and data
stacking.
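
To make the regression method mentioned above for quantitative data concrete, here is a minimal ordinary least squares sketch in plain Python. The variable names and the four data points are hypothetical; the point is only that numeric, structured columns can be fed straight into such a calculation.

```python
# Hypothetical monthly figures, for illustration only.
ad_spend = [1.0, 2.0, 3.0, 4.0]   # e.g. thousands of dollars
revenue = [3.0, 5.0, 7.0, 9.0]    # e.g. thousands of dollars

def fit_line(xs, ys):
    """Ordinary least squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(ad_spend, revenue)
print(slope, intercept)
```

Qualitative inputs such as survey comments would first have to be turned into numbers (for example counts or scores) before a method like this could be applied.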

3) Storage in Data Warehouses vs Data Lakes


Structured data is often stored in data warehouses, while unstructured data is
stored in data lakes. A data warehouse is the endpoint for the data’s journey
through an ETL pipeline. A data lake, on the other hand, is a sort of almost
limitless repository where data is stored in its original format or after undergoing
a basic “cleaning” process.
Both have the potential for cloud-use. Structured data requires less storage
space, while unstructured data requires more. For example, even a tiny image
takes up more space than many pages of text.
As for databases, structured data is usually stored in a relational database
(RDBMS), while the best fit for unstructured data instead is so-called non-
relational, or NoSQL databases.

4) Ease of Analysis
One of the most significant differences between structured and unstructured data
is how well it lends itself to analysis. Structured data is easy to search, both for
humans and for algorithms. Unstructured data, on the other hand, is intrinsically
more difficult to search and requires processing to become understandable. It's
challenging to deconstruct since it lacks a predefined data model and hence
doesn't fit in relational databases.
While there is a wide array of sophisticated analytics tools for structured data,
most analytics tools for mining and arranging unstructured data are still in the
developing phase. The lack of predefined structure makes data mining tricky, and
developing best practices on how to handle data sources like rich media, blogs,
social media data, and customer communication is a challenge.

5) Predefined Format vs Variety of Formats


The most common format for structured data is text and numbers. Structured
data has been defined beforehand in a data model.
Unstructured data, on the other hand, comes in a variety of shapes and sizes. It
can consist of everything from audio, video, and imagery to email and sensor
data. There is no data model for the unstructured data; it is stored natively or in a
data lake that doesn't require any transformation.

New Info

What is Structured Data?

Last Updated: 15-04-2019


Structured data is data that conforms to a data model, has a well-defined
structure, follows a consistent order, and can be easily accessed and
used by a person or a computer program.

Structured data is usually stored in well-defined schemas such as
databases. It is generally tabular, with columns and rows that clearly define
its attributes.

SQL (Structured Query Language) is often used to manage structured data
stored in databases.

Characteristics of Structured Data:

● Data conforms to a data model and has an easily identifiable structure
● Data is stored in the form of rows and columns
● Example: Database
● Data is well organised, so the definition, format, and meaning of the data are
explicitly known
● Data resides in fixed fields within a record or file
● Similar entities are grouped together to form relations or classes
● Entities in the same group have the same attributes
● Easy to access and query, so data can be easily used by other
programs
● Data elements are addressable, so they are efficient to analyse and process

Sources of Structured Data:

● SQL Databases
● Spreadsheets such as Excel
● OLTP Systems
● Online forms
● Sensors such as GPS or RFID tags
● Network and Web server logs
● Medical devices

Advantages of Structured Data:

● Structured data has a well-defined structure that helps in easy
storage and access of data
● Data can be indexed based on text strings as well as attributes. This
makes search operations hassle-free
● Data mining is easy, i.e. knowledge can be easily extracted from data
● Operations such as updating and deleting are easy due to the well-structured
form of data
● Business intelligence operations such as data warehousing can be
easily undertaken
● Easily scalable when the volume of data grows
● Ensuring the security of data is easy

What is Unstructured Data?


Last Updated: 15-04-2019

Unstructured data is data that does not conform to a data model and has no
easily identifiable structure, so it cannot easily be used by a computer program.

Unstructured data is not organised in a pre-defined manner or does not have a
pre-defined data model, thus it is not a good fit for a mainstream relational database.

Characteristics of Unstructured Data:

● Data neither conforms to a data model nor has any structure.

● Data cannot be stored in the form of rows and columns as in databases

● Data does not follow any semantics or rules

● Data lacks any particular format or sequence

● Data has no easily identifiable structure

● Due to the lack of identifiable structure, it cannot easily be used by computer programs

Sources of Unstructured Data:

● Web pages

● Images (JPEG, GIF, PNG, etc.)

● Videos

● Memos

● Reports

● Word documents and PowerPoint presentations

● Surveys

Advantages of Unstructured Data:

● It supports data that lacks a proper format or sequence

● The data is not constrained by a fixed schema

● Very Flexible due to absence of schema.

● Data is portable

● It is very scalable

● It can deal easily with the heterogeneity of sources.


● This type of data has a variety of business intelligence and analytics applications.

Disadvantages Of Unstructured data:

● It is difficult to store and manage unstructured data due to lack of schema
and structure

● Indexing the data is difficult and error-prone due to unclear structure and the
lack of pre-defined attributes. As a result, search results are not very accurate.

● Ensuring the security of data is a difficult task.

Problems faced in storing unstructured data:

● It requires a lot of storage space to store unstructured data.

● It is difficult to store videos, images, audios, etc.

● Due to unclear structure, operations like update, delete, and search are very difficult.

● Storage cost is high as compared to structured data

● Indexing the unstructured data is difficult

Possible solution for storing Unstructured data:

● Unstructured data can be converted to more easily manageable formats

● A content-addressable storage (CAS) system can be used to store unstructured
data. It stores data based on its metadata and assigns a unique name to
every object stored in it. The object is retrieved based on its content, not its
location.

● Unstructured data can be stored in XML format.

● Unstructured data can be stored in RDBMS which supports BLOBs
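
The content-addressable storage idea listed above can be sketched as a toy in-memory store. Using a SHA-256 digest of the bytes as the unique object name is an assumption made here for illustration (it is one common way to derive a name from content); the class and method names are hypothetical, and a real CAS system would also track metadata and persist objects to disk.

```python
import hashlib

class ContentStore:
    """Toy in-memory content-addressable store (illustrative only)."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        """Store the bytes and return their SHA-256 digest as the object name."""
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = data   # identical content maps to the same key
        return key

    def get(self, key: str) -> bytes:
        """Retrieve the bytes by their content-derived name."""
        return self._objects[key]

store = ContentStore()
key_a = store.put(b"scanned memo, page 1")
key_b = store.put(b"scanned memo, page 1")   # duplicate content
print(key_a == key_b)
print(store.get(key_a))
```

Because the name depends only on the content, the same file stored twice is kept once, and no directory path is needed to find it again.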

Extracting information from unstructured Data:

Unstructured data does not have any predefined structure, so it cannot easily be
interpreted by conventional algorithms. It is also difficult to tag and index
unstructured data, so extracting information from it is a tough job. Here are
possible solutions:

● Taxonomies or classification of data help in organising data in a hierarchical
structure, which makes the search process easier.

● Data can be stored in a virtual repository and be automatically tagged, for
example with Documentum.

● Use of application platforms like XOLAP. XOLAP helps in extracting
information from e-mails and XML-based documents

● Use of various data mining tools


What is Semi-structured data?
Last Updated: 15-04-2019

Semi-structured data is data that does not conform to a data model but has
some structure. It lacks a fixed or rigid schema. It is data that does not reside in a
relational database but has some organisational properties that make it easier to
analyse. With some processing, it can be stored in a relational database.

Characteristics of semi-structured Data:

● Data does not conform to a data model but has some structure.

● Data cannot be stored in the form of rows and columns as in databases

● Semi-structured data contains tags and elements (metadata) which are used
to group data and describe how the data is stored

● Similar entities are grouped together and organised in a hierarchy

● Entities in the same group may or may not have the same attributes or
properties

● Does not contain sufficient metadata, which makes automation and
management of data difficult

● Size and type of the same attributes in a group may differ

● Due to the lack of a well-defined structure, it cannot easily be used by computer
programs

Sources of semi-structured Data:

● E-mails
● XML and other markup languages

● Binary executables

● TCP/IP packets

● Zipped files

● Integration of data from different sources

● Web pages

Advantages of Semi-structured Data:

● The data is not constrained by a fixed schema

● Flexible, i.e. the schema can be easily changed.

● Data is portable

● It is possible to view structured data as semi-structured data

● It supports users who cannot express their needs in SQL

● It can deal easily with the heterogeneity of sources.

Disadvantages of Semi-structured data

● The lack of a fixed, rigid schema makes the data difficult to store

● Interpreting the relationship between data is difficult as there is no

separation of the schema and the data.

● Queries are less efficient as compared to structured data.

Problems faced in storing semi-structured data


● Data usually has an irregular and partial structure. Some sources have an
implicit structure of data, which makes it difficult to interpret the
relationship between data.

● Schema and data are usually tightly coupled, i.e. they are not only linked
together but are also dependent on each other. The same query may update
both schema and data, with the schema being updated frequently.

● The distinction between schema and data is very uncertain or unclear. This
complicates the design of the data structure

● Storage cost is high as compared to structured data

Possible solution for storing semi-structured data

● Data can be stored in a DBMS specially designed to store semi-structured data

● XML is widely used to store and exchange semi-structured data. It allows its
user to define tags and attributes to store the data in hierarchical form.
Schema and data are not tightly coupled in XML.

● Object Exchange Model (OEM) can be used to store and exchange semi-structured
data. OEM structures data in the form of a graph.

● An RDBMS can be used to store the data by mapping the data to a relational
schema and then mapping it to a table
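
The last option above, mapping semi-structured data onto a relational schema, can be sketched with Python's standard `xml.etree.ElementTree` module. The XML document and its tag names are hypothetical, and deriving the column list from the union of all tags is just one simple mapping strategy assumed for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document, for illustration only.
doc = """
<contacts>
  <contact><name>Ada</name><email>ada@example.com</email></contact>
  <contact><name>Grace</name><phone>555-0100</phone></contact>
</contacts>
"""

root = ET.fromstring(doc)

# Each record is a dict of tag -> text; records may carry different tags.
records = [{child.tag: child.text for child in c} for c in root.findall("contact")]

# Derive a fixed column list (the relational schema) from all tags seen.
columns = sorted({tag for r in records for tag in r})

# Map every record onto the same columns, using None for missing values.
rows = [tuple(r.get(col) for col in columns) for r in records]

print(columns)
print(rows)
```

The `None` entries show the cost of the mapping: attributes that only some entities have become mostly empty columns once the data is forced into a table.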

Extracting information from semi-structured Data

Semi-structured data has varying structure because of the heterogeneity of its sources.
Sometimes it does not contain any structure at all. This makes it difficult to tag and
index, so extracting information from it is a tough job. Here are possible solutions:

● Graph-based models (e.g. OEM) can be used to index semi-structured data

● The data modelling technique in OEM allows the data to be stored in a graph-based
model. The data in a graph-based model is easier to search and index.

● XML allows data to be arranged in hierarchical order, which enables the data
to be indexed and searched

● Use of various data mining tools
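
A graph-based model of the kind listed above can be loosely sketched as nodes connected by labelled edges, with a query expressed as a path of labels. This is an illustrative simplification and not an implementation of OEM itself; the node ids, labels, and helper functions are all hypothetical.

```python
# Hypothetical object graph, for illustration only.
# A node is either a dict mapping edge labels to child node ids,
# or an atomic value (a leaf).
graph = {
    "root": {"member": ["m1", "m2"]},
    "m1": {"name": ["n1"], "email": ["e1"]},
    "m2": {"name": ["n2"]},
    "n1": "Ada",
    "e1": "ada@example.com",
    "n2": "Grace",
}

def follow(node_ids, label):
    """Return ids of all children reached from node_ids via edges with this label."""
    out = []
    for nid in node_ids:
        node = graph[nid]
        if isinstance(node, dict):
            out.extend(node.get(label, []))
    return out

def query(path):
    """Evaluate a label path such as ['member', 'name'], starting at the root."""
    ids = ["root"]
    for label in path:
        ids = follow(ids, label)
    return [graph[i] for i in ids]

print(query(["member", "name"]))
print(query(["member", "email"]))
```

The second query returns a result only for the member that has an `email` edge, so entities with differing attributes can be searched without a fixed schema.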

Difference between Structured, Semi-structured and Unstructured data
Last Updated: 18-08-2020

Big Data is characterised by huge volume, high velocity, and an extensible variety of
data. That data falls into three types: structured data, semi-structured data, and
unstructured data.

1. Structured data –

Structured data is data whose elements are addressable for effective
analysis. It has been organized into a formatted repository that is typically a
database. It covers all data that can be stored in an SQL database in a
table with rows and columns. Such data has relational keys and can easily be
mapped into pre-designed fields. Today, it is the most commonly processed kind
of data and the simplest way to manage information. Example:
Relational data.

2. Semi-Structured data –

Semi-structured data is information that does not reside in a relational
database but has some organizational properties that make it easier
to analyze. With some processing, it can be stored in a relational database
(this could be very hard for some kinds of semi-structured data), but
semi-structured formats exist to ease space. Example: XML data.

3. Unstructured data –

Unstructured data is data that is not organized in a predefined manner
or does not have a predefined data model, thus it is not a good fit for a
mainstream relational database. There are alternative platforms for storing
and managing unstructured data. It is increasingly prevalent in
IT systems and is used by organizations in a variety of business intelligence
and analytics applications. Example: Word, PDF, text, media logs.

Differences between Structured, Semi-structured and Unstructured data:


| Properties | Structured data | Semi-structured data | Unstructured data |
|---|---|---|---|
| Technology | It is based on relational database table | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data |
| Transaction management | Matured transaction and various concurrency techniques | Transaction is adapted from DBMS, not matured | No transaction management and no concurrency |
| Version management | Versioning over tuples, rows, tables | Versioning over tuples or graph is possible | Versioned as a whole |
| Flexibility | It is schema dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is more flexible and there is absence of schema |
| Scalability | It is very difficult to scale DB schema | Its scaling is simpler than structured data | It is more scalable |
| Robustness | Very robust | New technology, not very spread | - |
| Query performance | Structured query allow complex joining | Queries over anonymous nodes are possible | Only textual queries are possible |
