Digital Data Part 1
Digital Data
• Today, data undoubtedly is an invaluable asset of any enterprise (big or small).
Even though professionals work with data all the time, the understanding,
management and analysis of data from heterogeneous sources remains a
serious challenge.
• In this lecture, the various formats of digital data (structured, semi-structured
and unstructured data), data storage mechanisms, data access methods,
management of data, the process of extracting desired information from data,
challenges posed by various formats of data, etc. will be explained.
• Data growth has seen exponential acceleration since the advent of the
computer and Internet.
In fact, the computer and Internet duo has imparted the digital form to data.
Digital data can be classified into three forms:
– Unstructured
– Semi-structured
– Structured
• Usually, data is in the unstructured format, which makes extracting information
from it difficult.
• According to Merrill Lynch, 80–90% of business data is either unstructured or
semi-structured.
• Gartner also estimates that unstructured data constitutes 80% of the whole
enterprise data.
DATA SCIENCE PART 1
between entities of data, such as classes and their objects. Data stored in
databases is an example of structured data.
in a database and does not conform to any data model, i.e. it is difficult to
determine the meaning of the data. It does not follow any rules or semantics.
It can be of any type and is hence unpredictable.
Where does Unstructured Data Come from?
Broadly speaking, anything in a non-database form is unstructured data.
It can be classified into two broad categories:
• Bitmap objects: For example, image, video, or audio files.
• Textual objects: For example, Microsoft Word documents, emails, or Microsoft
Excel spreadsheets.
Although emails are organized in databases such as Microsoft Exchange or Lotus
Notes, the body of the email is essentially raw data, i.e. free-form text without
any structure.
A lot of unstructured data is also noisy text such as chats, emails and SMS
texts.
The language of noisy text differs significantly from the standard form of
language.
A Myth Demystified
• Web pages are said to be unstructured data even though they are defined by
HTML, a markup language which has a rich structure.
• HTML is used mainly for rendering and presentation.
• The tagged elements do not capture the meaning of the data that the HTML
page contains. This makes it difficult to automatically process the information in
the HTML page.
• Another characteristic that makes web pages unstructured data is that they
usually carry links and references to external unstructured content such as
images, XML files, etc.
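The point that HTML tags describe presentation rather than meaning can be
illustrated with a small sketch. It uses Python's standard html.parser module;
the page fragment and the TextExtractor class are hypothetical and invented
for this illustration only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects (enclosing tag, text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags that are currently open
        self.pairs = []   # (tag, text) pairs found so far

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.pairs.append((self.stack[-1], text))


# Hypothetical page fragment, invented for illustration.
page = "<p><b>Asha Rao</b> joined on <i>12 May</i> for <b>42000</b></p>"

extractor = TextExtractor()
extractor.feed(page)
pairs = extractor.pairs
print(pairs)
```

Both the name and the number arrive under the same tag (b), so the markup alone
does not tell a program which one is a person and which one is an amount.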
How to Manage Unstructured Data?
Let us look at a few generic tasks to be performed to enable storage and search
of unstructured data:
Indexing: Let us go back to our understanding of the Relational Database
Management System (RDBMS). In this system, data is indexed to enable faster
search and retrieval. An index is defined on the basis of some value in the data;
it is nothing but an identifier that represents a larger record in the data set.
In the absence of an index, the whole data set/document has to be scanned to
retrieve the desired information.
In the case of unstructured data too, indexing helps in searching and retrieval.
Based on text or some other attributes, e.g. file name, the unstructured data is
indexed.
Indexing in unstructured data is difficult because neither does this data have any
predefined attributes nor does it follow any pattern or naming conventions. Text
can be indexed based on a text string but in case of non-text based files, e.g.
audio/video, etc., indexing depends on file names.
This becomes a hindrance when naming conventions are not being followed.
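As a minimal sketch of text-based indexing, the example below builds a simple
inverted index (word to document names) so that a search does not have to scan
every document. The document names, their contents, and the helper functions
build_index and search are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical documents, invented for illustration.
documents = {
    "memo.txt": "Quarterly sales rose in the north region",
    "mail_17.txt": "Please review the sales forecast for the north",
    "notes.txt": "Team lunch is moved to Friday",
}


def build_index(docs):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search(index, word):
    """Look up a word in the index; no document text is scanned here."""
    return sorted(index.get(word.lower(), set()))


index = build_index(documents)
print(search(index, "sales"))
print(search(index, "friday"))
```

An audio or video file offers no words to index in this way, which is why such
files usually fall back on the file name.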
Tags/Metadata: Using metadata, data in a document, etc. can be tagged. This
enables search and retrieval. But in unstructured data, this is difficult as
little or no metadata is available. The structure of the data has to be
determined, which is very difficult as the data itself has no particular format
and comes from more than one source.
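A small sketch of tag-based retrieval follows. The TaggedFile class, the
find_by_tag helper, and the file names and tags are all hypothetical; the point
is only that a file without metadata cannot be found through tags.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedFile:
    """An unstructured file plus whatever metadata is attached to it."""
    name: str
    tags: dict = field(default_factory=dict)


def find_by_tag(files, key, value):
    """Return the names of files whose metadata has the given key/value pair."""
    return [f.name for f in files if f.tags.get(key) == value]


# Hypothetical files and tags, invented for illustration.
files = [
    TaggedFile("IMG_0042.jpg", {"type": "image", "project": "launch"}),
    TaggedFile("call_03.mp3", {"type": "audio", "project": "launch"}),
    TaggedFile("scan_final.pdf"),  # no metadata available
]

print(find_by_tag(files, "project", "launch"))
print(find_by_tag(files, "type", "audio"))
```

The untagged scan_final.pdf never appears in any result, whatever it contains.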
Classification/Taxonomy: Taxonomy is classifying data on the basis of the
relationships that exist between data. Data can be arranged in groups and
placed in hierarchies based on the taxonomy prevalent in an organization.
However, classifying unstructured data is difficult as identifying relationships
between data is not an easy task. In the absence of any structure, metadata or
schema, identifying accurate relationships and classifying is not easy. Since the
data is unstructured, naming conventions or standards are not consistent across
an organization, thus making it difficult to classify data.
CAS (Content Addressable Storage): It stores data based on its metadata.
It assigns a unique name to every object stored in it. The object is retrieved
based on its content and not its location. It is used extensively to store emails,
etc.
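The idea of retrieving an object by its content rather than its location can be
sketched as below. The notes do not say how the unique name is produced; using a
SHA-256 hash of the content is an assumption made here, and the ContentStore
class is a hypothetical in-memory toy, not a real CAS product.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store kept in memory."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        """Store data and return its address, derived from the content."""
        # Assumption: the unique name is the SHA-256 digest of the content.
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        """Retrieve data by its content-derived address."""
        return self._objects[address]


store = ContentStore()
email = b"Subject: Minutes\n\nPlease find the minutes attached."
addr1 = store.put(email)
addr2 = store.put(email)  # same content, so the same address
print(addr1 == addr2)
```

Because the address depends only on the content, storing the same email twice
keeps a single copy.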
UIMA
UIMA (Unstructured Information Management Architecture) is an open source
platform from IBM which integrates different kinds of analysis engines to provide
a complete solution for knowledge discovery from unstructured data.
In UIMA, the analysis engines enable integration and analysis of unstructured
information and bridge the gap between structured and unstructured data.
UIMA stores information in a structured format. The structured resources can
be mined, searched, and put to other uses. The information obtained from
structured sources is also used for subsequent analysis of unstructured data.
Various analysis engines analyze unstructured data in different ways such as:
– Breaking up of documents into separate words.
– Grouping and classifying according to taxonomy.
– Detecting parts of speech, grammar, and synonyms.
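The idea of chaining analysis engines can be sketched in plain Python. This is
not the UIMA API; the functions tokenize, classify and run_pipeline, the TAXONOMY
table, and the sample sentence are hypothetical and only mimic the first two
kinds of analysis listed above.

```python
import re


def tokenize(doc):
    """Break the document text into separate lowercase words."""
    doc["tokens"] = re.findall(r"[a-z']+", doc["text"].lower())
    return doc


# Hypothetical taxonomy mapping keywords to categories.
TAXONOMY = {"invoice": "finance", "payment": "finance", "server": "it"}


def classify(doc):
    """Assign categories by looking tokens up in the taxonomy."""
    found = {TAXONOMY[t] for t in doc["tokens"] if t in TAXONOMY}
    doc["categories"] = sorted(found)
    return doc


def run_pipeline(text, engines):
    """Pass one document through each engine in order."""
    doc = {"text": text}
    for engine in engines:
        doc = engine(doc)
    return doc


result = run_pipeline(
    "Invoice 88: payment is due after the server migration.",
    [tokenize, classify],
)
print(result["tokens"])
print(result["categories"])
```

Each engine adds its findings to the same document record, so the output of the
whole chain is a structured result that can be searched or mined.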