Digital Data

Digital data refers to information stored in a binary format that can be interpreted by machines. It includes ones and zeros that represent complex information like text, audio, and video. Digital data comes in unstructured, semi-structured, and structured forms. Unstructured data does not fit a predefined model and makes extracting information difficult. Semi-structured data has some tags but not enough structure for full automation. Structured data follows a defined schema and stores entities in rows and columns for easy computer processing. Managing and extracting information from unstructured and semi-structured data poses challenges due to the lack of structure.

Uploaded by

Tanish Saajan

Digital Data

Dr. Atul Garg

Digital data

• Digital data is information in a binary format, that is, information converted into a machine-readable digital form that a computer can process.
• Digital data represents other forms of data using machine-readable encodings that various technologies can interpret.
• The most fundamental of these encodings is binary, which stores complex audio, video or text information as a series of binary digits, traditionally written as ones and zeros, or as "on" and "off" values.
• These days, digital data is everywhere. Whenever you send an email, read a
social media post, or take pictures with your digital camera, you are working
with digital data.
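
As a rough illustration of the binary representation described above, the following Python sketch turns a short text into ones and zeros and back. The helper names `to_bits` and `from_bits` and the choice of UTF-8 are assumptions made for this example, not part of the original slides.

```python
def to_bits(text: str, encoding: str = "utf-8") -> str:
    """Return the text as space-separated 8-bit groups of ones and zeros."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


def from_bits(bits: str, encoding: str = "utf-8") -> str:
    """Rebuild the text from the 8-bit groups produced by to_bits."""
    return bytes(int(group, 2) for group in bits.split()).decode(encoding)


print(to_bits("Hi"))             # 01001000 01101001
print(from_bits(to_bits("Hi")))  # Hi
```

The same idea applies to audio and video: only the encoding that maps the content to bytes differs.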

Types of Digital data
Digital data can be classified into three forms:

• Unstructured data: Data that does not conform to a data model or is not in a form that a computer program can use easily. About 80–90% of an organization's data is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, research reports, the body of an email, etc.
• Semi-structured data: Data that does not conform to a data model but has some structure. However, it is not in a form that a computer program can use easily; for example, XML and mark-up languages like HTML. Metadata for this data is available but is not sufficient.
• Structured data: Data that is in an organized form (e.g., in rows and columns) and can be used easily by a computer program. Relationships exist between entities of data, such as classes and their objects. Data stored in databases is an example of structured data.
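
To make the three forms concrete, here is a small Python sketch that holds similar information in each form. All sample values (the note, the order and the two table rows) are invented for illustration only.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Unstructured: free text. A program can only do generic things with it,
# such as searching for a substring.
note = "Met Asha in Pune; she asked for three units by Friday."
mentions_pune = "Pune" in note

# Semi-structured: tags describe the data, but no fixed schema is enforced.
order_xml = "<order><customer>Asha</customer><city>Pune</city><units>3</units></order>"
order = ET.fromstring(order_xml)
units_from_xml = int(order.findtext("units"))

# Structured: rows and columns with the same fields in every row.
table = io.StringIO("customer,city,units\nAsha,Pune,3\nRavi,Delhi,5\n")
rows = list(csv.DictReader(table))
total_units = sum(int(row["units"]) for row in rows)

print(mentions_pune, units_from_xml, total_units)  # True 3 8
```

The further down the example goes, the less guessing the program has to do to find a value.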

Formats of Digital Data

• Usually, data is in the unstructured format, which makes extracting information from it difficult.
• According to Merrill Lynch, 80–90% of business data is
either unstructured or semi-structured.
• Gartner also estimates that unstructured data constitutes
80% of the whole enterprise data.
Characteristics of Un-structured Data

Sources of Un-structured Data
Broadly speaking, anything in a non-database form is unstructured data.

Managing Un-structured Data
A few generic tasks enable storage and search of unstructured data:
• Indexing: Recall how a Relational Database Management System (RDBMS) works: data is indexed to enable faster search and retrieval. An index is defined on some value in the data; it is simply an identifier that stands for a larger record in the data set. Without an index, the whole data set or document has to be scanned to retrieve the desired information. Indexing helps search and retrieval of unstructured data too, where the index is built on text or on some other attribute such as the file name. Indexing unstructured data is difficult, however, because this data has no predefined attributes and follows no pattern or naming convention. Text can be indexed on a text string, but for non-text files such as audio and video, indexing depends on file names.
• Tags/Metadata: Metadata can be used to tag the data in a document, which enables search and retrieval. In unstructured data this is difficult because little or no metadata is available. The structure of the data has to be determined, which is very difficult because the data has no particular format and comes from more than one source.
• Classification/Taxonomy: Taxonomy is classifying data on the basis of the relationships that exist between data. Data can be arranged in groups and placed in hierarchies based on the taxonomy prevalent in an organization. However, classifying unstructured data is difficult because, in the absence of any structure, metadata or schema, identifying accurate relationships between data is not easy. Naming conventions and standards are also not consistent across an organization, which makes classification harder still.
• CAS (Content Addressable Storage): It stores data based on their metadata and assigns a unique name to every object stored in it. The object is retrieved based on its content and not its location. It is used extensively to store emails, etc.
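
Two of the tasks above, text indexing and content-addressable storage, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular product works: `build_index` and `ContentStore` are hypothetical names, the sample documents are invented, and using a SHA-256 hash of the content as the "unique name" is one common choice assumed here.

```python
import hashlib
from collections import defaultdict


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the names of the documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, text in documents.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?")].add(name)
    return index


class ContentStore:
    """Toy content-addressable store: an object's address is a hash of its bytes."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data  # identical content always maps to one entry
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]


docs = {"memo.txt": "Quarterly sales memo.", "mail.txt": "Sales meeting moved to Friday."}
index = build_index(docs)
print(sorted(index["sales"]))  # ['mail.txt', 'memo.txt']

store = ContentStore()
address = store.put(b"Sales meeting moved to Friday.")
print(store.get(address))      # b'Sales meeting moved to Friday.'
```

With the index, a search for "sales" looks up one key instead of scanning every document. With the store, the same content always yields the same address, regardless of where or how often it is saved.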
Challenges to store Un-structured Data

Possible Solutions to Store Un-structured Data

Challenges to extract Information

Solutions to extract Information

XOLAP (extended online analytic processing)

Semi-structured Data

• Semi-structured data does not conform to any data model: it is difficult to determine the meaning of the data, and the data cannot be stored in rows and columns as in a database. It does, however, have tags and markers that help to group data and describe how it is stored. These give some metadata, but not enough for management and automation of the data.
• Similar entities in the data are grouped and organized in a hierarchy. The attributes or properties within a group may or may not be the same. For example, two addresses may or may not contain the same number of properties:
Address 1: <house number><street name><area name><city>
Address 2: <house number><street name><city>
• For example, an e-mail follows a standard format:
To: <Name>
From: <Name>
Subject: <Text>
CC: <Name>
Body: <Text, Graphics, Images etc.>
• The tags give us some metadata, but the body of the e-mail has no format, nor does it convey the meaning of the data it contains.
• There is a very fine line between unstructured and semi-structured data.
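
The e-mail case can be sketched with Python's standard `email` module: the tagged header fields can be read by name, while the body comes back as one undifferentiated block of text. The addresses and message text below are invented for illustration.

```python
from email import message_from_string

raw = (
    "To: xyz@example.com\n"
    "From: abc@example.com\n"
    "Cc: pqr@example.com\n"
    "Subject: Greetings\n"
    "\n"
    "Hello! How are you? The report is attached, call me if anything is unclear.\n"
)

msg = message_from_string(raw)

# The header tags act as metadata and can be looked up by name.
headers = {name: msg[name] for name in ("To", "From", "Cc", "Subject")}

# The body is free text: nothing in it says which part is a greeting,
# a request or a deadline.
body = msg.get_payload()

print(headers["Subject"])  # Greetings
print(body)
```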
Semi-Structured Data

Sources of Semi-structured Data

Managing Semi-structured Data
Some ways in which semi-structured data is managed and stored:

• Schemas: Describe the structure and content of data to some extent. Assign meaning to data, hence allowing automatic search and indexing.
• Graph-based data models: Contain data on the leaves of the graph. Also known as 'schema less'. Used for data exchange among heterogeneous sources.
• XML: Models the data using tags and elements. Schemas are not tightly coupled to data.

Challenges to store Semi-structured Data

Possible Solutions to Store Semi-structured Data

Object Exchange Model

Challenges to extract Semi-structured Data

Solutions to extract Semi-structured Data

Object Exchange Model

XML: to manage Semi-structured Data

XML: Extensible Markup Language

• What is XML? An open-standard markup language written in plain text. It is hardware and software independent.
• Does what? Designed to store and transport data over the Internet.
• How? It allows data to be stored in a hierarchical/nested structure, and it allows the user to define tags to store the data.

XML: to manage Semi-structured Data
XML has no predefined tags

<message>
<to> XYZ </to>
<from> ABC </from>
<subject> Greetings </subject>
<body> Hello! How are you? </body>
</message>

The words in the <> (angle brackets) are user-defined tags.

XML is known as self-describing, as data can exist without a schema and a schema can be added later.
A schema can be described in a DTD or in XML Schema (XSD).
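
As a minimal sketch of how a program reads such a document, the message example above can be parsed with Python's standard `xml.etree.ElementTree` module. No schema is supplied, and the extra `priority` tag added at the end is a hypothetical, user-defined tag used only for illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """
<message>
  <to> XYZ </to>
  <from> ABC </from>
  <subject> Greetings </subject>
  <body> Hello! How are you? </body>
</message>
""".strip()

root = ET.fromstring(xml_text)

# The tags themselves describe the data, so no schema is needed to read it.
fields = {child.tag: child.text.strip() for child in root}
print(fields["to"], fields["subject"])  # XYZ Greetings

# Because tags are user-defined, a new one can be added at any time.
ET.SubElement(root, "priority").text = "high"
print([child.tag for child in root])
# ['to', 'from', 'subject', 'body', 'priority']
```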
Structured Data

• Structured data is organized in semantic chunks (entities)
• Similar entities are grouped together (relations or classes)
• Entities in the same group have the same descriptions (attributes)
• Descriptions for all entities in a group (schema) have the same defined format, have a predefined length, are all present, and follow the same order

Structured Data

Characteristics of structured data:
• Conforms to a data model
• Data is stored in the form of rows and columns (e.g., relational database)
• Similar entities are grouped
• Attributes in a group are the same
• Data resides in fixed fields within a record or file
• Definition, format & meaning of data is explicitly known

Sources of Structured Data

• Databases (e.g., Access)
• Spreadsheets
• SQL
• Online Transaction Processing (OLTP) systems

Managing Structured Data

Fully described datasets

Clearly defined categories and sub-categories

Data neatly placed in rows and columns

Data that goes into the records is regulated by a well-defined structure

Indexing can be easily done either by the DBMS itself or manually
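
The points above can be sketched with Python's built-in `sqlite3` module: a declared schema regulates what goes into each record, and an index is created with a single statement. The `employee` table, its columns and the sample rows are assumptions invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema fixes the fields, their order and their types for every record.
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " dept TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO employee (emp_id, name, dept) VALUES (?, ?, ?)",
    [(1, "Asha", "Sales"), (2, "Ravi", "Finance"), (3, "Meena", "Sales")],
)

# An index created manually on a column that is searched often.
conn.execute("CREATE INDEX idx_employee_dept ON employee (dept)")

sales = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM employee WHERE dept = ? ORDER BY name", ("Sales",)
    )
]
print(sales)  # ['Asha', 'Meena']

# A record that breaks the structure is rejected by the DBMS.
try:
    conn.execute("INSERT INTO employee (emp_id, name) VALUES (4, 'Kiran')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```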

Storing Structured Data

Retrieving Structured Data

Difference b/w types of Data

1. Level of organizing
• Structured data: As the name suggests, this data is well organized, so its level of organization is the highest.
• Semi-structured data: The data is organized only to some extent and the rest is not organized, so its level of organization is lower than that of structured data and higher than that of unstructured data.
• Unstructured data: The data is not organized at all, so its level of organization is the lowest.

2. Means of data organization
• Structured data: Organized by means of a relational database.
• Semi-structured data: Partially organized by means of XML/RDF.
• Unstructured data: Based on simple character and binary data.

3. Transaction management
• Structured data: Transaction management and data concurrency are present, so it is mostly preferred for multitasking processes.
• Semi-structured data: Transaction management is not present by default but is adapted from the DBMS; data concurrency is not present.
• Unstructured data: Neither transaction management nor concurrency is present.

4. Versioning
• Structured data: Versioning is supported by the relational database, so it is done over tuples, rows and tables.
• Semi-structured data: Versioning is done only over tuples or graphs, where possible, because only partial database support is available.
• Unstructured data: Versioning is possible only on the data as a whole, because there is no database support at all.

5. Flexibility and scalability
• Structured data: Based on a relational database, so it is schema dependent, less flexible and less scalable.
• Semi-structured data: More flexible than structured data, but less flexible and scalable than unstructured data.
• Unstructured data: There is no dependency on any database, so it is more flexible and scalable than structured and semi-structured data.

6. Performance
• Structured data: Structured queries can be run that allow complex joins, so performance is the highest of the three.
• Semi-structured data: Only queries over anonymous nodes are possible, so performance is lower than that of structured data but higher than that of unstructured data.
• Unstructured data: Only textual queries are possible, so performance is lower than that of both structured and semi-structured data.

References
• R.N. Prasad and Seema Acharya, “Fundamentals of Business Analytics”,
Wiley India Publishers.
• http://www.punjabiuniversity.ac.in/Pages/Images/elearn/DigitalData.pdf
• https://www.tutorialspoint.com/difference-between-structured-semi-structured-and-unstructured-data
• https://www.michael-gramlich.com/what-is-structured-semi-structured-and-unstructured-data/
• https://www.datamation.com/big-data/structured-vs-unstructured-data/
