Unit I

Types of Digital Data

CO1: Explain about Big Data paradigm

TEXT BOOK:

Seema Acharya, Subhashini Chellappan, Big Data and Analytics, Wiley, 2018

Course Instructor: Dr. R. Muthusami, AP (Selection Grade)/CA
OUTLINE

1. Introduction
2. Structured Data
3. Unstructured Data
4. Semi-Structured Data
5. Difference between Semi-Structured and Structured Data
LO1: Describe the classification of Digital data
Introduction:
• Data growth has accelerated exponentially since the advent of the computer and the internet.
• Definition: Digital data is data stored in a digital format, for example a picture, a document or a video. It is not physical; it exists only in digital form.
• Digital data can be classified into three forms:
• 1. Unstructured Data
• 2. Semi-Structured Data
• 3. Structured Data
SO1: Describe the Structured Data

Sources of structured data
• Databases, e.g. Access
• Spreadsheets
• SQL
• OLTP systems
Characteristics of structured data
• Conforms to a data model
• Data is stored in the form of rows and columns
• Similar entities are grouped
• Attributes in the group are the same
• Data resides in fixed fields within a record or a file
• Definition, format and meaning of data are explicitly known
Ease with Structured Data
• Storage
• Scalability
• Security
• Update and delete
Hassle free structured data
• Retrieving information
• Indexing and searching
• Mining data
• BI operations
Hassle Free Retrieval
• Retrieval of structured data is largely hassle free. The features are as follows:
• Retrieving information: A well-defined structure makes data easy to retrieve.
• Indexing and searching: Data can be indexed not only on a text string but also on other attributes, which enables streamlined search.
• Mining data: Structured data can be easily mined, and knowledge can be extracted from it.
• BI operations: BI works extremely well with structured data. Hence data mining, warehousing, etc. can be easily undertaken.
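
The retrieval and indexing points above can be illustrated with a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table name `employee`, its columns and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; the table, columns and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employee (emp_id, name, dept, salary) VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Sales", 52000.0), (2, "Ravi", "IT", 61000.0), (3, "Meena", "IT", 58000.0)],
)

# Indexes can be built on a numeric attribute as well as on a text attribute.
conn.execute("CREATE INDEX idx_employee_salary ON employee (salary)")
conn.execute("CREATE INDEX idx_employee_dept ON employee (dept)")

# Retrieval relies on the known structure: columns are addressed by name.
it_staff = conn.execute(
    "SELECT name FROM employee WHERE dept = ? AND salary > ? ORDER BY name",
    ("IT", 55000),
).fetchall()
print(it_staff)
```

Because every row has the same named fields, the query can filter on any attribute without inspecting the data itself.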
SO2: Describe the Unstructured & Semi-Structured Data

UNSTRUCTURED DATA
• Unstructured data cannot be stored in rows and columns as in a database and does not conform to any data model, so it is difficult to determine the meaning of the data.
• It does not follow any rules and can be of any type, and is thus unpredictable.
CHARACTERISTICS OF UNSTRUCTURED
DATA
SOURCES OF UNSTRUCTURED DATA
• Web pages, memos, videos (MPEG, etc.), images (JPEG, GIF, etc.), body of an email, Word documents, PowerPoint presentations, chats, reports, white papers, surveys, etc.

Where does unstructured data come from?

Anything in a non-database form is unstructured data. It can be divided into two broad categories:
• Bitmap objects: e.g. image, video or audio files.
• Textual objects: e.g. Microsoft Word documents, emails or MS Excel files.
• A lot of unstructured data is also noisy text such as chats, emails and SMS texts.
MANAGING UNSTRUCTURED DATA

• INDEXING: Data is indexed to enable faster search and retrieval. An index is an identifier, defined on the basis of some value in the data, that represents a larger record in the data set.
• Indexing unstructured data is difficult: text can be indexed on a text string, but for non-text files, e.g. audio/video, indexing depends on file names.
• TAGS/METADATA: Using metadata, data in a document can be tagged. With unstructured data this is difficult, as little or no metadata is available. Also, the data itself has no particular format and comes from more than one source.
• CLASSIFICATION/TAXONOMY: Taxonomy is the classification of data on the basis of the relationships that exist between data. Data can be grouped and placed in hierarchies based on the taxonomy prevalent in a firm.
• In the absence of any structure or metadata, identifying relationships between data is difficult, and naming standards are not consistent across the firm, which makes the data difficult to classify.
• CAS (Content Addressable Storage): It stores data based on their metadata. It assigns a unique name to every object stored in it.
• The object is retrieved based on its content and not its location.
• It is used to store emails, etc.
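
The idea of retrieving an object by its content rather than its location can be sketched as follows. This is a toy in-memory illustration, not how any particular CAS product works: the class name `ContentAddressableStore` is hypothetical, and using a SHA-256 digest of the bytes as the unique name is an assumption made here for the example.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory store keyed by the SHA-256 digest of each object's bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The unique name (address) is derived from the content itself.
        address = hashlib.sha256(data).hexdigest()
        self._objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        # Lookup uses the content-derived address, not a file path or location.
        return self._objects[address]


store = ContentAddressableStore()
addr1 = store.put(b"Subject: meeting at 10am")
addr2 = store.put(b"Subject: meeting at 10am")  # same content, same address
print(addr1 == addr2, store.get(addr1))
```

One consequence of this design is that identical objects are stored only once, which suits archives such as email where the same message may arrive many times.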
CHALLENGES FACED WHILE STORING UNSTRUCTURED DATA
• Storage space: It is difficult to store and manage unstructured data. A lot of space is required to store such data, for example images, videos and audio.
• Scalability: As the data grows, scalability becomes an issue and the cost of storing such data grows.
• Retrieve information: Even if unstructured data is stored, it is difficult to retrieve and recover information from it.
• Security: Ensuring security is difficult due to the varied sources of data, e.g. emails, web pages, etc.
• Update and delete: Updating and deleting unstructured data are very difficult, as retrieval is difficult without a clear structure.
• Indexing and searching: Indexing unstructured data is difficult as the structure is not clear and attributes are not pre-defined.

SOLUTIONS FOR STORING UNSTRUCTURED DATA
• Changing format: Unstructured data may be converted to formats which are easily managed, stored and searched.
• Developing new hardware: New hardware needs to be developed to support unstructured data. It may either complement the existing storage device or be stand-alone for unstructured data.
• Storing in RDBMS/BLOBs (Binary Large Objects): Unstructured data such as video or images does not fit an ordinary typed relational column, but it can be held in the database as a BLOB, and there is no problem storing its metadata, such as the date and time of its creation, the author of the data, etc., in regular columns.
• Storing in XML format: Unstructured data may be stored in XML format, which tries to give some structure to it by using tags and elements.
• CAS (Content Addressable Storage): It organizes files based on their metadata and assigns a unique name to every object stored in it. Used extensively to store emails.
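
As a minimal sketch of the RDBMS/BLOB option above, the following uses Python's built-in sqlite3 module. The table name `media`, its columns and the sample values are hypothetical; the four bytes stand in for a real image file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Metadata goes into ordinary typed columns; the raw bytes go into a BLOB column.
conn.execute(
    "CREATE TABLE media (id INTEGER PRIMARY KEY, author TEXT, created_at TEXT, content BLOB)"
)

image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])  # placeholder bytes, not a real image
conn.execute(
    "INSERT INTO media (author, created_at, content) VALUES (?, ?, ?)",
    ("alice", "2018-01-15T09:30:00", image_bytes),
)

# The metadata can be queried like any structured data; the BLOB is opaque.
row = conn.execute(
    "SELECT author, created_at, length(content) FROM media WHERE author = ?",
    ("alice",),
).fetchone()
print(row)
```

The database can search and filter on the author and creation time, but it cannot interpret what the stored bytes depict.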
CHALLENGES FACED WHILE EXTRACTING INFORMATION FROM STORED UNSTRUCTURED DATA
• Interpretation: Unstructured data is not easily interpreted by conventional search algorithms.
• Classification/Taxonomy: Different naming conventions followed across the firm make it difficult to classify the data.
• Indexing: Designing algorithms to understand the meaning of the documents and then tagging or indexing them accordingly is difficult.
• Deriving meaning: Computer programs cannot automatically derive meaning from unstructured data.
• File formats: The increasing number of file formats makes it difficult to interpret data.
• Tags: As the data grows, it is not possible to put tags manually.
POSSIBLE SOLUTIONS TO THESE CHALLENGES
• Tags: Unstructured data can be stored in a virtual repository and can be automatically tagged. For example, Documentum (document management software) provides this type of solution.
• Text mining: It helps in grouping as well as classifying unstructured data and assists in analysis by considering grammar, context, synonyms, etc.
• Application platforms: Platforms such as XOLAP help extract information from email and XML-based documents.
• Classification/Taxonomy: Taxonomies within the firm can be managed automatically to organize data in hierarchical structures.
• Naming conventions/standards: Following naming conventions across a firm can greatly improve storage, retrieval, indexing and search.
UIMA (Unstructured Information Management Architecture)
• UIMA is an open source platform from IBM which integrates different types of analysis engines to provide a complete solution for knowledge discovery from unstructured data.
• In UIMA, the analysis engines enable integration and analysis of unstructured information and bridge the gap between structured and unstructured data.
• It stores information in a structured format which can then be mined, searched and put to other uses. Documents are analysed in the following ways:
• Breaking up documents into separate words.
• Grouping and classifying according to taxonomy.
• Detecting parts of speech, grammar and synonyms.
• Detecting relationships between various elements.
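
The first two analysis steps listed above (splitting a document into words, then grouping it under a taxonomy) can be sketched in plain Python. This does not use the UIMA API; the `TAXONOMY` mapping, the function names and the sample sentence are all hypothetical, and real analysis engines are far more sophisticated than keyword matching.

```python
import re

# Hypothetical taxonomy: category name -> keywords that signal the category.
TAXONOMY = {
    "finance": {"invoice", "payment", "budget"},
    "hr": {"leave", "salary", "recruitment"},
}


def tokenize(text):
    """Break a document into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text):
    """Return each matching category with the keywords found in the text."""
    tokens = set(tokenize(text))
    return {
        category: sorted(tokens & keywords)
        for category, keywords in TAXONOMY.items()
        if tokens & keywords
    }


doc = "Please approve the invoice; payment is due before the budget review."
tokens = tokenize(doc)
categories = classify(doc)
print(tokens)
print(categories)
```

The output of `classify` is structured (a category with its supporting keywords), which is the sense in which such analysis bridges unstructured and structured data.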

Getting to know semi-structured data
• Only about 10% of the data in any organization is semi-structured.
• Still, it is important to understand, manage and analyze this semi-structured data coming from heterogeneous sources.
• Semi-structured data does not conform to any data model. Also, this data cannot be stored in rows and columns as in a database.
• Semi-structured data has tags and markers which help group the data and describe how the data is stored, but they are not sufficient for management and automation of the data.
• Similar entities are grouped and organized in a hierarchy. The attributes or the properties within a group may or may not be the same.
Semi-structured data:
• Similar entities are grouped
• Does not conform to a data model but contains tags and elements
• Attributes in a group may not be the same
• Cannot be stored in rows and columns as in a database
Sources of semi-structured data:
• XML
• TCP/IP packets
• Zipped files
• Binary executables
• Mark-up languages
• Integration of data from heterogeneous sources
• Characteristics of semi-structured data are summarized below:
• It is organized into semantic entities.
• Similar entities are grouped together.
• Entities in the same group may not have the same attributes.
• The order of attributes is not necessarily important.
• Not all attributes are always required.
• Size of the same attributes in a group may differ.
• Type of the same attributes in a group may differ.

(Semantic: relating to “meaning”, or arising from distinctions between the meanings of different words)
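
These characteristics can be seen in a small set of JSON records. The records below are hypothetical and chosen only to show entities of one group with differing attributes, attribute order and attribute types.

```python
import json

# Three entities of the same group ("book"); the records are illustrative only.
raw = """
[
  {"type": "book", "title": "Big Data and Analytics", "year": 2018},
  {"type": "book", "year": "2018", "title": "Another Title", "authors": ["A", "B"]},
  {"type": "book", "title": "Untitled draft"}
]
"""
books = json.loads(raw)

# Entities in the same group do not all carry the same attributes.
attribute_sets = [sorted(book.keys()) for book in books]

# The same attribute ("year") appears with different types, or not at all.
year_types = {type(book["year"]).__name__ for book in books if "year" in book}

print(attribute_sets)
print(year_types)
```

Nothing outside the data declares which attributes exist, so a program has to inspect each record to learn its shape.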
How to manage semi-structured data?
• Schemas:
• These can be used to describe the structure of the data. Schemas define the constraints on the structure and content of the documents.
• Graph-based data models:
• These can be used to describe data. This is a “schema-less” approach and is also known as “self-describing”, as data is presented in such a way that it explains itself.
• XML:
• This is widely used to store and exchange semi-structured data. Schemas in XML are not tightly coupled to data.
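
A minimal sketch of the self-describing nature of XML, using Python's standard xml.etree.ElementTree module. The `library` and `book` element names and their contents are hypothetical; no schema is supplied, yet the tags are enough to read the data.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the second book omits <year>, which is still valid XML.
xml_text = """
<library>
  <book id="b1">
    <title>Big Data and Analytics</title>
    <year>2018</year>
  </book>
  <book id="b2">
    <title>Untitled draft</title>
  </book>
</library>
""".strip()

root = ET.fromstring(xml_text)

summary = []
for book in root.findall("book"):
    title = book.findtext("title")
    # A missing element is handled with a default instead of failing.
    year = book.findtext("year", default="unknown")
    summary.append((book.get("id"), title, year))

print(summary)
```

The parser needs no external description of the document, which reflects the loose coupling between XML data and its schema.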
How to store semi-structured data?
Challenges faced:
• Storage cost
• RDBMS
• Irregular and partial structure
• Implicit structure
• Evolving schemas
• Distinction between schemas and data
• The possible solutions to the challenges faced in storing semi-structured data are:
• XML
• RDBMS
• Special-purpose DBMS
• OEM (Object Exchange Model)
Modeling Semi-structured Data
• The OEM way:
• The Object Exchange Model (OEM) is a model for storing and exchanging semi-structured data.
• This brings us to how OEM represents data.
• Labeled directed graphs (from the Object Exchange Model):
• OEM structures data as a labeled directed graph. Nodes are objects; labels on the arcs are attribute names.
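
A labeled directed graph of this kind can be sketched with a plain Python dictionary. The object identifiers (such as `&b1`), the `kind`/`edges`/`value` keys and the `follow` helper are hypothetical choices for this illustration, not part of any OEM implementation.

```python
# Nodes are objects; each edge carries a label (the attribute name) and a target id.
oem = {
    "&root": {"kind": "complex", "edges": [("book", "&b1"), ("book", "&b2")]},
    "&b1": {"kind": "complex", "edges": [("title", "&t1"), ("year", "&y1")]},
    "&b2": {"kind": "complex", "edges": [("title", "&t2")]},
    "&t1": {"kind": "atomic", "value": "Big Data and Analytics"},
    "&y1": {"kind": "atomic", "value": 2018},
    "&t2": {"kind": "atomic", "value": "Untitled draft"},
}


def follow(graph, start, path):
    """Follow a sequence of edge labels from `start`; return the values reached."""
    current = [start]
    for label in path:
        next_nodes = []
        for object_id in current:
            for edge_label, target in graph[object_id].get("edges", []):
                if edge_label == label:
                    next_nodes.append(target)
        current = next_nodes
    # Atomic objects yield their value; complex objects yield their id.
    return [graph[object_id].get("value", object_id) for object_id in current]


titles = follow(oem, "&root", ["book", "title"])
years = follow(oem, "&root", ["book", "year"])
print(titles, years)
```

The second book simply has no `year` arc, so the graph accommodates irregular structure without any schema change.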
How to extract information from semi-structured data?
• Data coming from heterogeneous sources has different structures, and it is difficult to tag and index it.
• The challenges faced while extracting information from semi-structured data, and the possible solutions to them, are listed below.
• Challenges faced:
• 1) Flat files
• 2) Heterogeneous sources
• 3) Incomplete/irregular structure
• Possible solutions:
• Indexing
• OEM (Object Exchange Model)
• XML
• Mining tools
XML: A solution for semi-structured data management
• XML is slowly emerging as a standard for exchanging data over the web.
• It enables separation of content and presentation.
• DTDs (Document Type Definitions) provide partial schemas for XML documents.
• XML: eXtensible Markup Language
• What is XML? An open markup language written in plain text. It is hardware and software independent.
• Correspondence between semi-structured data and XML:

Semi-structured data                    XML
Consists of attributes                  Consists of tags
Consists of objects                     Consists of elements
Atomic values are the constituents      CDATA (characters) are used
Difference between semi-structured data and structured data
• Semi-structured data is the same as structured data with one minor exception.
• Semi-structured data requires looking at the data itself to determine its structure, whereas structured data only requires examining the data element name.
• Semi-structured data is one processing step away from structured data.
• When semi-structured data is stored in a structured format, it takes the form of rows and columns, each having a defined format.
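
The "one processing step" can be sketched as flattening irregular records into rows and columns. The records and field names below are hypothetical; the step shown is scanning the data itself to discover the columns, then filling missing attributes with `None`.

```python
# Hypothetical semi-structured records with differing attributes.
records = [
    {"name": "Asha", "email": "asha@example.com"},
    {"name": "Ravi", "phone": "555-0101", "email": "ravi@example.com"},
    {"name": "Meena"},
]

# Step 1: look at the data itself to determine the structure (the column set).
columns = sorted({key for record in records for key in record})

# Step 2: emit one fixed-width row per record; absent attributes become None.
rows = [[record.get(column) for column in columns] for record in records]

print(columns)
for row in rows:
    print(row)
```

Once in this form, every row has the same fields in the same order, so the data can be loaded into a table and treated as structured data.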
Thank You
