Digital Data
Types of Digital data
Digital data can be classified into three forms:
• Unstructured data: This is data that does not conform to a data model or is not
in a form that can be used easily by a computer program. About 80 to 90% of an
organization's data is in this format; for example, memos, chat room conversations,
PowerPoint presentations, images, videos, letters, research papers, the body of an email, etc.
• Semi-structured data: This is data that does not conform to a data model but
has some structure. However, it is not in a form that can be used easily by a
computer program; for example, XML and other mark-up languages such as HTML.
Metadata for this data is available but is not sufficient.
• Structured data: This is the data which is in an organized form (e.g., in rows and
columns) and can be easily used by a computer program. Relationships exist between
entities of data, such as classes and their objects. Data stored in databases is an
example of structured data.
Formats of Digital Data
Sources of Unstructured Data
Broadly speaking, anything in a non-database form is unstructured data.
Managing Unstructured Data
A few generic tasks must be performed to enable storage and search of unstructured data:
Indexing: Recall how a Relational Database Management System (RDBMS) works. In this system,
data is indexed to enable faster search and retrieval. An index is defined on the basis of some value in the data; it is
simply an identifier that represents a larger record in the data set. In the absence of an index, the whole data set or
document must be scanned to retrieve the desired information. In the case of unstructured data too, indexing helps in
searching and retrieval. Unstructured data is indexed based on text or some other attribute, e.g. file name. Indexing
unstructured data is difficult because the data has no predefined attributes and follows no pattern or
naming convention. Text can be indexed based on a text string, but for non-text files, e.g. audio or video,
indexing depends on file names.
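The idea of indexing text by its strings can be sketched as a small inverted index. This is only an illustration, not a prescribed method: the document names and contents below are hypothetical, and the helper names `build_index` and `search` are invented for this sketch.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, word):
    """Look a word up in the index instead of scanning every document."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical unstructured documents, keyed by file name.
documents = {
    "memo.txt": "Quarterly sales review meeting moved to Friday",
    "chat.log": "Can you send the sales figures before the meeting?",
    "letter.txt": "Thank you for your letter regarding the invoice",
}

index = build_index(documents)
print(search(index, "sales"))    # ['chat.log', 'memo.txt']
print(search(index, "invoice"))  # ['letter.txt']
```

For audio or video files, the same structure could only be built from file names, which is why indexing non-text data is harder.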
Tags/Metadata: Using metadata, data in a document, etc. can be tagged. This enables search and
retrieval. But in unstructured data this is difficult, as little or no metadata is available. The structure of the data has to be
determined, which is very difficult because the data itself has no particular format and comes from more than one source.
Classification/Taxonomy: Taxonomy classifies data on the basis of the relationships that exist between data items. Data can
be arranged in groups and placed in hierarchies based on the taxonomy prevalent in an organization. However, classifying
unstructured data is difficult because, in the absence of any structure, metadata or schema, identifying accurate
relationships is not easy. Moreover, since the data is unstructured, naming conventions or standards are not consistent
across an organization, which makes classification harder still.
CAS (Content Addressable Storage): It stores data based on its metadata. It assigns a unique name to every object
stored in it. The object is retrieved based on its content and not its location. It is used extensively to store emails, etc.
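A minimal sketch of content-based addressing follows. It assumes the unique name is a SHA-256 digest of the object's bytes, which is one common choice and not something stated on the slide. The class `ContentAddressableStore` is a hypothetical in-memory stand-in for a real CAS system.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory store: the address of an object is a hash of its content."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        # Assumption: the unique name is the SHA-256 digest of the content.
        address = hashlib.sha256(content).hexdigest()
        self._objects[address] = content
        return address

    def get(self, address: str) -> bytes:
        # Retrieval uses the content-derived address, not a file path or location.
        return self._objects[address]

    def __len__(self):
        return len(self._objects)


store = ContentAddressableStore()
email = b"Subject: Greetings\n\nHello! How are you?"
first = store.put(email)
second = store.put(email)  # identical content maps to the same address

print(first == second)  # True
print(len(store))       # 1
```

Because identical content yields the same address, storing the same email twice keeps only one copy.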
Challenges to store Unstructured Data
Possible Solutions to Store Unstructured Data
Challenges to extract Information
Solutions to extract Information
Semi-structured Data
• Semi-structured data does not conform to any data model, i.e. it is difficult to determine the meaning of the data,
and the data cannot be stored in rows and columns as in a database. However, semi-structured data has tags and
markers that help to group data and describe how it is stored. These give some metadata, but not enough
for management and automation of the data.
• Similar entities in the data are grouped and organized in a hierarchy. The attributes or the properties within a
group may or may not be the same. For example, two addresses may or may not contain the same number of
properties, as in:
Address 1: <house number><street name><area name><city>
Address 2: <house number><street name><city>
• For example, an e-mail follows a standard format:
To: <Name>
From: <Name>
Subject: <Text>
CC: <Name>
Body: <Text, Graphics, Images etc.>
• The tags give us some metadata, but the body of the e-mail has no format, nor does it convey the
meaning of the data it contains.
• There is a very fine line between unstructured and semi-structured data.
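The address example above can be sketched with JSON records whose fields differ from one record to the next. The field names and values here are hypothetical, and `area_of` is a helper invented for this illustration.

```python
import json

# Two hypothetical address records; the second has no "area_name" field.
raw = """
[
  {"house_number": "12", "street_name": "Park Street",
   "area_name": "Civil Lines", "city": "Delhi"},
  {"house_number": "7", "street_name": "Hill Road", "city": "Mumbai"}
]
"""

addresses = json.loads(raw)


def area_of(address):
    """Return the area name, or a placeholder when the field is absent."""
    return address.get("area_name", "unknown")


for address in addresses:
    # The keys act as tags that describe the data, but the set of keys varies.
    print(sorted(address.keys()), "->", area_of(address))
```

A program reading such records has to handle missing fields itself, since no schema guarantees which properties are present.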
Sources of Semi-structured Data
Managing Semi-structured Data
Some ways in which semi-structured data is managed and stored
Challenges to store Semi-structured Data
Possible Solutions to Store Semi-structured Data
Challenges to extract Semi-structured Data
Solutions to extract Semi-structured Data
XML: to manage Semi-structured Data
XML has no predefined tags
<message>
<to> XYZ </to>
<from> ABC </from>
<subject> Greetings </subject>
<body> Hello! How are you? </body>
</message>
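As a rough illustration of how a program can use these tags, the message above can be parsed with Python's standard `xml.etree.ElementTree` module. The variable names are chosen for this sketch only.

```python
import xml.etree.ElementTree as ET

xml_text = """
<message>
<to> XYZ </to>
<from> ABC </from>
<subject> Greetings </subject>
<body> Hello! How are you? </body>
</message>
""".strip()

root = ET.fromstring(xml_text)

# Each child tag names its own content; the tag set is defined by the author.
fields = {child.tag: child.text.strip() for child in root}

print(root.tag)           # message
print(fields["to"])       # XYZ
print(fields["subject"])  # Greetings
```

The tags tell the program which piece of text is the sender or the subject, but the text inside `<body>` remains free-form.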
Structured Data
• Conforms to a data model
• Data is stored in the form of rows and columns (e.g., relational database)
• Similar entities are grouped
• Definition, format and meaning of data are explicitly known
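A small sketch of these properties, using Python's built-in `sqlite3` module with an in-memory database. The `employee` table, its columns and its rows are hypothetical examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is the data model: names, types and constraints are known up front.
conn.execute(
    "CREATE TABLE employee ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)

# Every row has the same columns, so similar entities are grouped in one table.
conn.executemany(
    "INSERT INTO employee (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Delhi"), (3, "Meera", "Pune")],
)

rows = conn.execute(
    "SELECT name FROM employee WHERE city = ? ORDER BY name", ("Pune",)
).fetchall()
print(rows)  # [('Asha',), ('Meera',)]

conn.close()
```

Because the format and meaning of each column are explicit, a query can select by `city` without inspecting the content of every record.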
Sources of Structured Data
• Spreadsheets
• SQL
Managing Structured Data
Storing Structured Data
Retrieving Structured Data
Difference b/w types of Data
Sr. No. | Key | Structured Data | Semi-structured Data | Unstructured Data
1 | Level of organizing | As the name suggests, structured data is well organized, so its level of organizing is the highest. | Semi-structured data is organized only to some extent and the rest is not organized, so its level of organizing is lower than that of structured data and higher than that of unstructured data. | Unstructured data is not organized at all, so its level of organizing is the lowest.
2 | Means of data organization | Structured data is organized by means of a relational database. | Semi-structured data is partially organized by means of XML/RDF. | Unstructured data is based on simple character and binary data.
3 | Transaction management | Transaction management and data concurrency are present, so structured data is mostly preferred in multitasking processes. | Transaction management is not present by default but is adapted from the DBMS; data concurrency is not present. | Neither transaction management nor concurrency is present.
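The transaction management row can be illustrated for the structured case with a short `sqlite3` sketch. The `account` table, the balances and the `transfer` helper are hypothetical; the point is only that a failed step rolls back the whole transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "A", "B", 30)       # succeeds
failed = transfer(conn, "A", "B", 500)  # violates the CHECK, so it is rolled back

balances = dict(conn.execute("SELECT name, balance FROM account"))
print(ok, failed, balances)
```

Plain files holding unstructured data offer no comparable guarantee; a program would have to implement such all-or-nothing behaviour itself.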