Digital Data Part 1

The document discusses the significance of digital data, particularly focusing on its various formats: unstructured, semi-structured, and structured data. It highlights the challenges of managing and extracting information from unstructured data, which constitutes a large portion of business data, and outlines methods for indexing, tagging, and classifying this type of data. Additionally, it introduces UIMA, an open-source platform that integrates analysis engines to facilitate the management and analysis of unstructured information.

Uploaded by

Manivannan B

DATA SCIENCE PART 1

Digital Data
• Today, data undoubtedly is an invaluable asset of any enterprise (big or small).
Even though professionals work with data all the time, the understanding,
management and analysis of data from heterogeneous sources remains a
serious challenge.
• In this lecture, the various formats of digital data (structured, semi-structured
and unstructured data), data storage mechanisms, data access methods,
management of data, the process of extracting desired information from data,
challenges posed by various formats of data, etc. will be explained.
• Data growth has seen exponential acceleration since the advent of the
computer and Internet.

In fact, the computer and Internet duo has imparted the digital form to data.
Digital data can be classified into three forms:
– Unstructured
– Semi-structured
– Structured
• Usually, data is in the unstructured format, which makes extracting information
from it difficult.
• According to Merrill Lynch, 80–90% of business data is either unstructured or
semi-structured.
• Gartner also estimates that unstructured data constitutes 80% of the whole
enterprise data.

Data Forms Defined

Unstructured data: This is data which does not conform to a data model or is
not in a form which can be used easily by a computer program. About 80–90% of
an organization's data is in this format; for example, memos, chat rooms,
PowerPoint presentations, images, videos, letters, research reports, white
papers, the body of an email, etc.
Semi-structured data: This is data which does not conform to a data model but
has some structure. However, it is not in a form which can be used easily by a
computer program; for example, emails, XML, markup languages like HTML, etc.
Metadata for this data is available but is not sufficient.
Structured data: This is data which is in an organized form (e.g., in rows and
columns) and can be easily used by a computer program. Relationships exist
between entities of data, such as classes and their objects. Data stored in
databases is an example of structured data.
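
The three forms defined above can be contrasted with a small sketch. The
patient record, the XML element names and the table name below are made up for
illustration; they are not taken from the lecture.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Unstructured: free-form text. A program cannot look up a field by name.
unstructured = "Dr. Stanley wrote that patient 17 responded well to the new drugs."

# Semi-structured: tags give some structure, but no schema is enforced.
semi_structured = ET.fromstring(
    '<visit patient="17"><outcome>improved</outcome></visit>'
)

# Structured: rows and columns with a declared schema, queryable with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE treatment (patient_id INTEGER, outcome TEXT)")
conn.execute("INSERT INTO treatment VALUES (?, ?)", (17, "improved"))
row = conn.execute(
    "SELECT outcome FROM treatment WHERE patient_id = ?", (17,)
).fetchone()

print(row[0])                               # read straight from a column
print(semi_structured.findtext("outcome"))  # read from a tagged element
print("improved" in unstructured)           # the free text never uses this word
```

The same fact is easy to query in the table, reachable with some effort in the
XML, and effectively hidden in the free text unless the text is analyzed.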

Unstructured Data – Getting to Know

• Dr. Ben, Dr. Stanley, and Dr. Mark work at the medical facility of “GoodLife”.
Over the past few days, Dr. Ben and Dr. Stanley had been exchanging long emails
about a particular case of gastro-intestinal problem. Dr. Stanley has chanced
upon a particular combination of drugs that has cured gastro-intestinal
disorders in his patients. He has written an email about this combination of
drugs to Dr. Ben.
• Dr. Mark has a patient in the “GoodLife” emergency unit with quite a similar
case of gastro-intestinal disorder whose cure Dr. Stanley has chanced upon. Dr.
Mark has already tried regular drugs but with no positive results so far. He
quickly searches the organization's database for answers, but with no luck. The
information he wants is tucked away in the email conversation between two other
“GoodLife” doctors, Dr. Ben and Dr. Stanley. Dr. Mark would have accessed the
solution with a few mouse clicks had the storage and analysis of unstructured
data been undertaken by “GoodLife”.
• As is the case at “GoodLife”, 80–85% of data in any organization is
unstructured and is growing at an alarming rate. An enormous amount of knowledge
is buried in this data. In the above scenario, Dr. Stanley's email to Dr. Ben
had not been successfully updated into the medical system because it was in the
unstructured format.
• Unstructured data, thus, is data which cannot be stored in the form of rows
and columns as in a database and does not conform to any data model, i.e. it is
difficult to determine the meaning of the data. It does not follow any rules or
semantics. It can be of any type and is hence unpredictable.
Where does Unstructured Data Come from?
Broadly speaking, anything in a non-database form is unstructured data.
It can be classified into two broad categories:
• Bitmap objects: For example, image, video, or audio files.
• Textual objects: For example, Microsoft Word documents, emails, or Microsoft
Excel spreadsheets.
Although emails, such as those exchanged by Dr. Ben and Dr. Stanley, are
organized in databases such as Microsoft Exchange or Lotus Notes, the body of
an email is essentially raw data, i.e. free-form text without any structure.
A lot of unstructured data is also noisy text such as chats, emails and SMS
texts. The language of noisy text differs significantly from the standard form
of language.
A Myth Demystified
• Web pages are said to be unstructured data even though they are defined by
HTML, a markup language which has a rich structure.
• HTML is used solely for rendering and presentation.
• The tagged elements do not capture the meaning of the data that the HTML
page contains. This makes it difficult to automatically process the information
in the HTML page.
• Another characteristic that makes web pages unstructured data is that they
usually carry links and references to external unstructured content such as
images, XML files, etc.
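
The point about tags not capturing meaning can be seen in a minimal sketch that
uses Python's standard html.parser module. The sample page and the class name
TagAndTextCollector are invented for illustration.

```python
from html.parser import HTMLParser


class TagAndTextCollector(HTMLParser):
    """Collects the tag names and the text fragments found in a page."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


page = "<html><body><h1>Case notes</h1><p><b>Drug A</b> with Drug B</p></body></html>"
collector = TagAndTextCollector()
collector.feed(page)

print(collector.tags)   # names such as h1, p, b describe layout only
print(collector.texts)  # the meaning still sits in plain text
```

Nothing in the tag names says that "Drug A" is a drug or that the page
describes a treatment; a program would still have to analyze the text itself.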
How to Manage Unstructured Data?
Let us look at a few generic tasks to be performed to enable storage and search
of unstructured data:
Indexing: Let us go back to our understanding of the Relational Database
Management System (RDBMS). In this system, data is indexed to enable faster
search and retrieval. An index is defined on the basis of some value in the
data; it is nothing but an identifier that represents the larger record in the
data set. In the absence of an index, the whole data set/document will be
scanned for retrieving the desired information.
In the case of unstructured data too, indexing helps in searching and retrieval.
Based on text or some other attributes, e.g. file name, the unstructured data is
indexed.
Indexing unstructured data is difficult because this data neither has any
predefined attributes nor follows any pattern or naming conventions. Text can be
indexed based on a text string, but in the case of non-text files, e.g.
audio/video, indexing depends on file names. This becomes a hindrance when
naming conventions are not being followed.
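
Indexing text on its words can be sketched as a small inverted index. This is
only one possible approach, not a method prescribed by the lecture; the file
names, sample texts and the helpers build_index and search are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, word):
    """Return the sorted names of documents containing the exact word."""
    return sorted(index.get(word.lower(), set()))


documents = {
    "email_1.txt": "This drug combination cured the gastro-intestinal disorder.",
    "memo_2.txt": "Staff meeting moved to Friday.",
    "email_3.txt": "Regular drugs gave no positive results so far.",
}
index = build_index(documents)

print(search(index, "drug"))
print(search(index, "Friday"))
```

A lookup now touches only the index entry for a word instead of scanning every
document. Note that "drug" and "drugs" are separate entries here; handling
word variants would need additional text analysis.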
Tags/Metadata: Using metadata, data in a document, etc. can be tagged. This
enables search and retrieval. But in unstructured data, this is difficult as
little or no metadata is available. The structure of the data has to be
determined, which is very difficult as the data itself has no particular format
and is coming from more than one source.
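
A rough sketch of tag-based retrieval follows. The TaggedDocument class, the
tag keys ("type", "dept") and the file names are assumptions made for the
example only.

```python
from dataclasses import dataclass, field


@dataclass
class TaggedDocument:
    name: str
    body: str
    tags: dict = field(default_factory=dict)  # metadata attached to the document


def find_by_tag(docs, key, value):
    """Return the names of documents whose metadata has key == value."""
    return [d.name for d in docs if d.tags.get(key) == value]


docs = [
    TaggedDocument("scan_001.png", "<image bytes>", {"type": "x-ray", "dept": "gastro"}),
    TaggedDocument("email_1.txt", "Email about a drug combination", {"type": "email", "dept": "gastro"}),
    TaggedDocument("notes.docx", "Untitled notes"),  # arrived without any metadata
]

print(find_by_tag(docs, "dept", "gastro"))
print([d.name for d in docs if not d.tags])  # documents that cannot be found by tag
```

The untagged document illustrates the difficulty described above: without
metadata it is invisible to any tag-based search.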
Classification/Taxonomy: Taxonomy is classifying data on the basis of the
relationships that exist between data. Data can be arranged in groups and
placed in hierarchies based on the taxonomy prevalent in an organization.
However, classifying unstructured data is difficult as identifying relationships
between data is not an easy task. In the absence of any structure, metadata or
schema, identifying accurate relationships and classifying is not easy. Since
the data is unstructured, naming conventions or standards are not consistent
across an organization, thus making it difficult to classify data.
CAS (Content Addressable Storage): It stores data based on their metadata. It
assigns a unique name to every object stored in it. The object is retrieved
based on its content and not its location. It is used extensively to store
emails, etc.
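
The idea of retrieving an object by its content rather than its location can
be sketched with an in-memory store. Deriving the unique name from a SHA-256
hash of the content is an assumption of this sketch; the lecture does not say
how the name is generated, and the class name is hypothetical.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory store that names each object by a hash of its content."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        # Assumption: the unique name is the SHA-256 digest of the content.
        address = hashlib.sha256(content).hexdigest()
        self._objects[address] = content
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]


store = ContentAddressableStore()
first = store.put(b"Email body: try the new drug combination.")
again = store.put(b"Email body: try the new drug combination.")
other = store.put(b"A different email body.")

print(first == again)    # identical content maps to the same name
print(first == other)    # different content gets a different name
print(store.get(first))
```

Because the name depends only on the content, storing the same email twice
does not create a second copy, which suits archives with many duplicates.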

UIMA
• UIMA (Unstructured Information Management Architecture) is an open-source
platform from IBM which integrates different kinds of analysis engines to provide
a complete solution for knowledge discovery from unstructured data.
• In UIMA, the analysis engines enable integration and analysis of unstructured
information and bridge the gap between structured and unstructured data.
• UIMA stores information in a structured format. The structured resources can
be mined, searched, and put to other uses. The information obtained from
structured sources is also used for subsequent analysis of unstructured data.
• Various analysis engines analyze unstructured data in different ways such as:
– Breaking up of documents into separate words.
– Grouping and classifying according to taxonomy.
– Detecting parts of speech, grammar, and synonyms.
– Detecting events and times.
– Detecting relationships between various elements.
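
The idea of chaining several analysis engines can be sketched in plain Python.
This is not the UIMA API; the functions tokenize, detect_times, classify and
run_pipeline, the keyword list and the sample sentence are all hypothetical.

```python
import re


def tokenize(doc):
    """Engine 1: break the text into separate words."""
    doc["tokens"] = re.findall(r"[A-Za-z]+", doc["text"])
    return doc


def detect_times(doc):
    """Engine 2: find clock times written as HH:MM."""
    doc["times"] = re.findall(r"\b\d{1,2}:\d{2}\b", doc["text"])
    return doc


def classify(doc):
    """Engine 3: assign a category from a tiny, made-up keyword list."""
    medical_terms = {"drug", "patient", "disorder"}
    words = {token.lower() for token in doc["tokens"]}
    doc["category"] = "medical" if words & medical_terms else "general"
    return doc


def run_pipeline(text, engines):
    """Pass one document through each engine in order, accumulating results."""
    doc = {"text": text}
    for engine in engines:
        doc = engine(doc)
    return doc


result = run_pipeline(
    "Patient admitted at 09:30 with a disorder.",
    [tokenize, detect_times, classify],
)
print(result["tokens"])
print(result["times"])
print(result["category"])
```

Each engine adds structured fields to the same document record, so the output
of the pipeline can be stored and searched like structured data.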
