5.1. - Structured and Unstructured Data
Structured data vs. unstructured data: structured data consists of clearly defined
data types whose pattern makes them easily searchable, while unstructured data
(“everything else”) consists of data that is usually not as easily searchable,
including formats like audio, video, and social media postings.
The contrast between unstructured and structured data does not denote any real
conflict between the two. Customers select one or the other not based on their data
structure, but on the applications that use them: relational databases for structured
data, and almost any other type of application for unstructured data.
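As a minimal sketch of this contrast, the snippet below filters a structured table by typed columns and then tries to find the same facts in free text. The table name, columns, and sample note are made up for illustration and do not come from any particular product.

```python
import sqlite3

# Structured: every value sits in a named, typed column (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("Alice", 120.0), ("Bob", 35.5), ("Alice", 60.0)],
)

# The query can rely on the column types: sum the amounts for one customer.
total_alice = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE customer = ?", ("Alice",)
).fetchone()[0]

# Unstructured: similar facts buried in free text (hypothetical note).
note = "Alice called about her two orders; Bob said his was only thirty-five fifty."

# A plain substring test finds the name, but not the amounts or their meaning.
mentions_alice = "Alice" in note

print(total_alice, mentions_alice)
```

The relational query answers a precise question in one statement, while the text check can only say that a name appears somewhere in the note.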
However, there is a growing tension between the ease of analysis on structured data
versus more challenging analysis on unstructured data. Structured data analytics is a
mature process and technology. Unstructured data analytics is a nascent industry with
a lot of new investment into R&D, but is not a mature technology. For corporations,
the structured vs. unstructured data question is whether they should invest in
analytics for unstructured data, and whether it is possible to aggregate the two into
better business intelligence.
Users can run simple content searches across textual unstructured data. But its lack of
orderly internal structure defeats the purpose of traditional data mining tools, and the
enterprise gets little value from potentially valuable data sources like rich media,
network or weblogs, customer interactions, and social media data. Even though
unstructured data analytics tools are in the marketplace, no one vendor or toolset is
a clear winner, and many customers are reluctant to invest in analytics tools with
uncertain development roadmaps.
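The "simple content search" mentioned above can be sketched as a small keyword index. This is only an illustration: the documents, ids, and function names are hypothetical, and real search tools do considerably more.

```python
import re
from collections import defaultdict

# Hypothetical documents standing in for textual unstructured data.
documents = {
    "email-1": "The delivery was late and the package was damaged.",
    "chat-7": "Great service, the delivery arrived early!",
    "post-3": "Not happy. My parcel never showed up.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, keyword):
    """Return the sorted ids of documents containing the keyword."""
    return sorted(index.get(keyword.lower(), set()))

index = build_index(documents)
hits = search(index, "delivery")
print(hits)
```

The search for "delivery" returns the email and the chat, but it misses the post that describes the same problem in different words, and it cannot tell a complaint from praise. That gap is what the more advanced analytics tools try to close.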
On top of this, there is simply much more unstructured data than structured.
Unstructured data makes up 80% or more of enterprise data, and is growing at a rate
of 55% to 65% per year. Without the tools to analyze this massive data,
organizations are leaving vast amounts of valuable data on the business intelligence
table.
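To get a feel for what the quoted growth range implies, here is a short worked calculation. It assumes the rate stays constant and compounds yearly, which is a simplification; the function names are illustrative.

```python
import math

def doubling_time_years(annual_growth_rate):
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_rate)

def growth_factor(annual_growth_rate, years):
    """Multiplier on the starting volume after the given number of years."""
    return (1 + annual_growth_rate) ** years

low, high = 0.55, 0.65  # growth range quoted in the text
for rate in (low, high):
    print(f"{rate:.0%}: doubles in {doubling_time_years(rate):.2f} years, "
          f"x{growth_factor(rate, 5):.1f} after 5 years")
```

Under that assumption the volume doubles roughly every 1.4 to 1.6 years, and grows about 9 to 12 times over five years.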
Structured data is traditionally easier for Big Data applications to digest, yet today's data
analytics solutions are making great strides in this area.
Email is a huge use case, but most semi-structured development centers on easing data
transport issues. Sharing sensor data is a growing use case, as are Web-based data
sharing and transport: electronic data interchange (EDI), many social media platforms,
document markup languages, and NoSQL databases.
In big data environments, NoSQL does not require admins to separate operational and
analytics databases into separate deployments. NoSQL is the operational database and
hosts native analytics tools for business intelligence. In Hadoop environments, NoSQL
databases ingest and manage incoming data and serve up analytic results.
These databases are common in big data infrastructure and real-time Web applications
like LinkedIn. On LinkedIn, hundreds of millions of business users freely share job titles,
locations, skills, and more; and LinkedIn captures the massive data in a semi-structured
format. When job seeking users create a search, LinkedIn matches the query to its
massive semi-structured data stores, cross-references data to hiring trends, and shares
the resulting recommendations with job seekers. The same process operates with sales
and marketing queries in premium LinkedIn services like Salesforce. Amazon also
bases its reader recommendations on semi-structured databases.
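The matching step described above can be illustrated with a toy example. This is not how LinkedIn or Amazon implement it; the profile fields, names, and the `match_profiles` helper are assumptions made up for the sketch.

```python
# Hypothetical member profiles: each record carries its own field names,
# and not every record has the same fields (semi-structured).
profiles = [
    {"name": "Ana", "title": "Data Engineer", "skills": ["python", "sql"], "location": "Lisbon"},
    {"name": "Ben", "skills": ["sales", "negotiation"]},
    {"name": "Chen", "title": "Analyst", "skills": ["sql", "excel"]},
]

def match_profiles(records, required_skills, location=None):
    """Return names of records that list every required skill.

    Missing fields are treated as "no value" rather than as an error,
    which is what lets records with different shapes share one store.
    """
    required = {s.lower() for s in required_skills}
    matches = []
    for record in records:
        skills = {s.lower() for s in record.get("skills", [])}
        if not required <= skills:
            continue
        if location is not None and record.get("location") != location:
            continue
        matches.append(record["name"])
    return matches

print(match_profiles(profiles, ["sql"]))
print(match_profiles(profiles, ["sql"], location="Lisbon"))
```

Because each record names its own fields, a query can still filter on them even though no fixed schema forces every profile to have a title or a location.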
A few years ago, analysts using keywords and key phrases could search unstructured
data and get a decent idea of what the data involved. eDiscovery was (and is) a prime
example of this approach. However, unstructured data has grown so dramatically that
users need to employ analytics that not only work at compute speeds, but also
automatically learn from their activity and user decisions. Natural Language Processing
(NLP), pattern sensing and classification, and text-mining algorithms are all common
examples, as are document relevance analytics, sentiment analysis, and filter-driven
Web harvesting. Unstructured data analytics with machine-learning intelligence allows
organizations to:
New Info
4) Ease of Analysis
One of the most significant differences between structured and unstructured data
is how well each lends itself to analysis. Structured data is easy to search, both for
humans and for algorithms. Unstructured data, on the other hand, is intrinsically
more difficult to search and requires processing to become understandable. It is
challenging to deconstruct because it lacks a predefined data model and hence
doesn't fit in relational databases.
While there is a wide array of sophisticated analytics tools for structured data,
most analytics tools for mining and arranging unstructured data are still in the
development phase. The lack of predefined structure makes data mining tricky, and
developing best practices for handling data sources like rich media, blogs,
social media data, and customer communication is a challenge.
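The "processing to become understandable" step can be sketched as pulling fields out of free text with regular expressions. The messages and patterns below are hypothetical, and real customer communication is far messier than two patterns can cover.

```python
import re

# Hypothetical customer messages: free text with no predefined fields.
messages = [
    "Order #1042 arrived broken, please refund $25.99.",
    "Thanks for the quick help!",
    "I was charged $10 twice for order #977.",
]

ORDER_RE = re.compile(r"#(\d+)")
AMOUNT_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")

def extract_fields(text):
    """Pull an order number and an amount out of free text, if present."""
    order = ORDER_RE.search(text)
    amount = AMOUNT_RE.search(text)
    return {
        "order_id": int(order.group(1)) if order else None,
        "amount": float(amount.group(1)) if amount else None,
    }

rows = [extract_fields(m) for m in messages]
for row in rows:
    print(row)
```

Once the fields are extracted, the rows can be handled like structured data, but every new phrasing may need a new pattern, which is why this kind of mining is hard to standardise.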
New Info
● SQL Databases
● Spreadsheets such as Excel
● OLTP Systems
● Online forms
● Sensors such as GPS or RFID tags
● Network and Web server logs
● Medical devices
Unstructured data is data that does not conform to a data model and has no
easily identifiable structure, so it cannot be used by a computer program easily.
Unstructured data is not organised in a pre-defined manner and does not have a
pre-defined data model, thus it is not a good fit for a mainstream relational database.
easily
● Web pages
● Videos
● Memos
● Reports
● Surveys
● Data is portable
● It is very scalable
applications.
and structure
● Indexing the data is difficult and error prone due to unclear structure and the
lack of pre-defined attributes, so search results are not very
accurate.
● Due to unclear structure, operations like update, delete and search are very
difficult.
data.
every object stored in it. The object is retrieved based on its content, not its
location.
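One way to read the retrieval-by-content idea above is a store that names each object by a hash of its bytes. The following is a toy in-memory sketch; the `ContentStore` class and the choice of SHA-256 are assumptions for illustration, not a description of any specific storage product.

```python
import hashlib

class ContentStore:
    """Toy in-memory store that names each object by a hash of its bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        """Store the bytes and return their content-derived key."""
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = data  # identical content maps to the same key
        return key

    def get(self, key: str) -> bytes:
        """Retrieve an object by its content key, not by a path or location."""
        return self._objects[key]

    def __len__(self):
        return len(self._objects)

store = ContentStore()
key_a = store.put(b"quarterly report, final version")
key_b = store.put(b"quarterly report, final version")  # duplicate content
key_c = store.put(b"meeting memo")

print(key_a == key_b, len(store))
```

Because the key is derived from the content, storing the same bytes twice yields one object, and a caller never needs to know where the object physically lives.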
Unstructured data does not have any structure, so it cannot easily be interpreted by
extracting information from it is a tough job. Here are possible solutions:
example Documentum.
documents
Semi-structured data is data that does not conform to a data model but has
some structure. It lacks a fixed or rigid schema. It is data that does not reside in a
relational database but that has some organisational properties that make it easier to
analyse. With some processing, it can be stored in a relational database.
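As a rough sketch of the processing needed to put semi-structured records into a relational database, the example below stores one row per attribute so that records with different fields fit the same table. The table layout, sample records, and `lookup` helper are hypothetical; other mappings are equally possible.

```python
import json
import sqlite3

# Hypothetical semi-structured records: same kind of entity, different fields.
raw = """
[
  {"id": 1, "name": "Ana", "email": "ana@example.com"},
  {"id": 2, "name": "Ben", "phone": "555-0100", "city": "Oslo"}
]
"""
records = json.loads(raw)

conn = sqlite3.connect(":memory:")
# One row per (entity, attribute) pair, so new attributes need no schema change.
conn.execute(
    "CREATE TABLE attributes (entity_id INTEGER, name TEXT, value TEXT, "
    "PRIMARY KEY (entity_id, name))"
)
for record in records:
    entity_id = record["id"]
    for name, value in record.items():
        if name != "id":
            conn.execute(
                "INSERT INTO attributes VALUES (?, ?, ?)",
                (entity_id, name, str(value)),
            )

def lookup(attribute):
    """Return {entity_id: value} for every entity that has the attribute."""
    rows = conn.execute(
        "SELECT entity_id, value FROM attributes WHERE name = ? ORDER BY entity_id",
        (attribute,),
    )
    return dict(rows.fetchall())

print(lookup("name"))
print(lookup("city"))
```

The trade-off is that every value becomes text and queries across several attributes need joins, which is part of why the mapping can be hard for some kinds of semi-structured data.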
● Data does not conform to a data model but has some structure.
● Data cannot be stored in the form of rows and columns as in databases
● Entities in the same group may or may not have the same attributes or
properties
programs easily
● E-mails
● XML and other markup languages
● Binary executables
● TCP/IP packets
● Zipped files
● Web pages
● Data is portable
● It supports users who cannot express their needs in SQL
● Schema and data are usually tightly coupled, i.e., they are not only linked
together but are also dependent on each other. The same query may update
both schema and data, with the schema being updated frequently.
data
● XML is widely used to store and exchange semi-structured data. It allows its
users to define tags and attributes to store the data in hierarchical form.
● Object Exchange Model (OEM) can be used to store and exchange semi-structured data.
● RDBMS can be used to store the data by mapping the data to relational
Sometimes they do not contain any structure at all. This makes them difficult to tag and
index, so extracting information from them is a tough job. Here are possible solutions:
● Graph-based models (e.g., OEM) can be used to index semi-structured data
based model. The data in a graph-based model is easier to search and index.
● XML allows data to be arranged in hierarchical order which enables the data
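The hierarchical arrangement that XML provides can be illustrated with Python's standard `xml.etree.ElementTree` module. The document, tag names, and the `titles_by_author` helper are made up for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag and attribute names are chosen by the author.
xml_text = """
<library>
  <book id="b1" year="2019">
    <title>Data Basics</title>
    <author>Ana</author>
  </book>
  <book id="b2">
    <title>Graphs in Practice</title>
    <author>Ben</author>
    <author>Chen</author>
  </book>
</library>
"""

root = ET.fromstring(xml_text)

def titles_by_author(tree_root, author):
    """Walk the hierarchy and return titles of books listing the author."""
    titles = []
    for book in tree_root.findall("book"):
        authors = [a.text for a in book.findall("author")]
        if author in authors:
            titles.append(book.findtext("title"))
    return titles

# Elements of the same kind may carry different attributes and children.
years = {book.get("id"): book.get("year") for book in root.findall("book")}

print(titles_by_author(root, "Chen"))
print(years)
```

The tags give a program something to navigate and search on, even though the two book entries do not have the same attributes or the same number of authors.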
Big Data includes huge volume, high velocity, and extensible variety of data. The data comes in 3 types:
1. Structured data –
table with rows and columns. They have relational keys and can easily be
mapped into pre-designed fields. Today, those data are mostly processed in
Relational data.
2. Semi-Structured data –
database but that have some organizational properties that make it easier
to analyze. With some processing, you can store them in the relational database
(it could be very hard for some kind of semi-structured data), but Semi-
3. Unstructured data –
or does not have a predefined data model, thus it is not a good fit for a
Transaction management
● Structured data: matured transaction and various concurrency techniques
● Semi-structured data: transaction is adapted from DBMS, not matured
● Unstructured data: no transaction management and no concurrency
Version management