Unit 4 DigitalData

The document discusses the classification of digital data into three categories: structured, semi-structured, and unstructured, highlighting their characteristics and examples. It emphasizes the challenges of managing and analyzing data from various sources, particularly the prevalence of unstructured data in organizations. Additionally, it introduces Big Data concepts and Hadoop as tools for handling large volumes of data.

Uploaded by

SAYEEDA KHANUM PATHAN

UNIT – IV:

Types of Digital Data

Introduction to Big Data: Characteristics of Data, Evolution of Big Data and Challenges with Big Data, Big Data, Terminologies used in Big Data Environment.

Introduction to Hadoop: Features of Hadoop, Why Hadoop, RDBMS vs Hadoop, Hadoop Overview, HDFS, Processing Data with Hadoop.

Types of Digital Data
Data:

Data provides the information from which meaningful insights can be derived.

Data: Where does it come from?

Data comes from everywhere:

• We speak

• We move

• Sensors

• Computers

• Documents
Digital Data

• Today, data is undoubtedly an invaluable asset of any enterprise (big or small). Even though professionals work with data all the time, understanding, managing and analysing data from heterogeneous sources remains a serious challenge.

• Data growth has seen exponential acceleration since the advent of the computer and the Internet.

• This unit discusses the various formats of digital data (structured, semi-structured and unstructured), data storage mechanisms, data access methods, management of data, the process of extracting desired information from data, and the challenges posed by the various formats of data.
Classification of Digital Data:
Digital data can be broadly classified into three forms:
– Unstructured
– Semi-structured
– Structured

Unstructured

• This is data which does not conform to a data model or is not in a form which can be used easily by a computer program.

• About 80–90% of an organization's data is unstructured.

Example: images, videos, letters, text, PDFs, social media posts, body of an email, log files, PowerPoint presentations, etc.
Classification of Digital Data contd..

Semi-structured data (self-describing structure)

• This is data which does not conform to a data model but has some structure.
• It is not in a form which can be used easily by a computer program, but it has a self-describing structure: it uses tags to separate semantic elements.
• Metadata for this data is available but is not sufficient.

Example: XML, markup languages like HTML, emails, etc.

Structured data

• This is data which is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.

• Relationships exist between entities of data, such as classes and their objects.

• Data stored in databases is an example of structured data.

Example: Oracle, DB2, MySQL, OLTP (Online Transaction Processing) systems, spreadsheets.
Classification of Digital Data contd..

• Since the 1980s, enterprise data has been stored in RDBMSs, which hold structured data.

• Later, with the Internet connecting the world, data became an integral part of daily transactions.

• Not all of this data was structured; almost 80% of the data generated in any enterprise today is unstructured.

• Roughly 10% of the data falls in each of the structured and semi-structured categories.

[Figure: percent distribution of the three forms of data]

Structured Data
• Data that has a predefined schema or structure is structured data.

• In the context of an RDBMS, data is stored in rows and columns.

• The number of rows/records/tuples in a relation is called the cardinality of the relation.

• The number of columns is referred to as the degree of the relation.
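
The two terms above can be illustrated with a minimal Python sketch. The relation, its column names and its values are made up for illustration; a real RDBMS stores relations very differently.

```python
# A relation modelled as a header (column names) plus a list of rows (tuples).
# Column names and values are hypothetical, chosen only for illustration.
columns = ("emp_id", "name", "department")
rows = [
    (1, "Asha", "Sales"),
    (2, "Ravi", "Finance"),
    (3, "Meena", "Sales"),
    (4, "John", "HR"),
]


def cardinality(relation_rows):
    """Number of rows (tuples) in the relation."""
    return len(relation_rows)


def degree(relation_columns):
    """Number of columns (attributes) in the relation."""
    return len(relation_columns)


print("cardinality:", cardinality(rows))  # 4 rows
print("degree:", degree(columns))         # 3 columns
```

Adding or deleting a row changes the cardinality; adding or dropping a column changes the degree.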

Sources of Structured data

• If the data is structured, an RDBMS can be used to house it: Oracle (Oracle Corp.), DB2 (IBM), Microsoft SQL Server (Microsoft), Greenplum (EMC), Teradata (Teradata), MySQL (open source), PostgreSQL (advanced open source), etc.

• These databases are typically used to hold transaction/operational data generated and collected by day-to-day business activities.

• In other words, the data of On-Line Transaction Processing (OLTP) systems is generally quite structured.
Ease of Working with Structured Data

The ease is with respect to the following:

1. Insert/update/delete: The Data Manipulation Language (DML) operations provide the required ease with data input, storage, access, processing, analysis, etc.

2. Security: Encryption and tokenization solutions are available to secure the information. Organizations are able to retain control and maintain compliance by ensuring that only authorized individuals are able to decrypt and view sensitive information.

3. Indexing: An index is a data structure that speeds up data retrieval operations (primarily the SELECT DML statement) at the cost of additional writes and storage space; the benefits that ensue in search operations are usually worth that cost.

4. Scalability: The storage and processing capabilities of a traditional RDBMS can be easily scaled up by increasing the horsepower of the database server (increasing the primary and secondary or peripheral storage capacity, the processing capacity of the processor, etc.).
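
The DML operations and indexing described above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the `customer` table, its columns and the index name `idx_customer_city` are hypothetical, and SQLite stands in for whichever RDBMS is actually in use.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# DML: insert three rows, update one, delete one.
cur.executemany(
    "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Delhi"), (3, "Meena", "Pune")],
)
cur.execute("UPDATE customer SET city = ? WHERE id = ?", ("Mumbai", 2))
cur.execute("DELETE FROM customer WHERE id = ?", (3,))
conn.commit()

# Indexing: costs extra storage and extra work on writes,
# but lets lookups by city avoid scanning the whole table.
cur.execute("CREATE INDEX idx_customer_city ON customer (city)")

remaining = cur.execute(
    "SELECT id, name, city FROM customer ORDER BY id"
).fetchall()
print(remaining)

# Ask SQLite how it would run a lookup by city; the plan text shows
# whether the planner decided to use the index.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM customer WHERE city = ?", ("Pune",)
).fetchall()
print(plan)
```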
Ease of Working with Structured Data
5. Transaction processing:

An RDBMS supports the Atomicity, Consistency, Isolation, and Durability (ACID) properties of transactions.

• Atomicity: A transaction is atomic, meaning that either it happens in its entirety or it does not happen at all.
• Consistency: The database moves from one consistent state to another consistent state. In other words, if the same piece of information is stored at two or more places, they are in complete agreement.
• Isolation: Resources are allocated to the transaction in such a way that the transaction gets the impression that it is the only transaction happening, in isolation.
• Durability: All changes made to the database by a committed transaction are permanent, and that accounts for the durability of the transaction.
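
Atomicity in particular can be demonstrated with a small sketch using Python's built-in `sqlite3` module. The `account` table, the `transfer` helper and the balances are assumptions made for this example; the point is only that a failed transaction leaves no partial change behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint forbids negative balances (a consistency rule).
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from account `src` to `dst` as one atomic transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit was undone too.
        return False


ok = transfer(conn, 1, 2, 30)       # succeeds: both updates are committed
failed = transfer(conn, 1, 2, 500)  # fails: neither update survives
balances = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
print(ok, failed, balances)
```

In the failed transfer the credit to account 2 is executed first, yet it does not persist, because the whole transaction is rolled back when the debit is rejected.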
Semi-structured Data
• It does not conform to any data model, i.e. it is difficult to determine the meaning of the data, and the data cannot be stored in rows and columns as in a database.

• It uses tags to separate semantic elements, and markers which help to group data and describe how the data is stored. These give some metadata, but not enough for the management and automation of data.
• Similar entities in the data are grouped and organized in a hierarchy.

• There is no separation between the data and the schema.

• In semi-structured data, entities belonging to the same class and grouped together need not necessarily have the same set of attributes.

Example: Two addresses, Address 1 and Address 2, may or may not contain the same number of properties.

• Even if they have the same set of attributes, the order of the attributes may differ, and for all practical purposes the order is not important.

• In an e-mail, for example, the tags give us some metadata, but the body of the e-mail has no format, nor does it convey the meaning of the data it contains.
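
The address example can be sketched in Python with two JSON records. The field names and values below are invented for illustration; they show that two entities of the same class can carry different attributes in a different order.

```python
import json

# Two "address" entities: same class, different attribute sets and order.
# All field names and values are hypothetical.
document = """
[
  {"type": "address", "street": "12 MG Road", "city": "Pune", "pin": "411001"},
  {"type": "address", "city": "Delhi", "landmark": "Near Metro Station",
   "street": "4 Ring Road", "country": "India"}
]
"""

addresses = json.loads(document)
for number, address in enumerate(addresses, start=1):
    print(f"Address {number} attributes: {sorted(address)}")

# Compare the attribute sets of the two records.
shared = set(addresses[0]) & set(addresses[1])
only_first = set(addresses[0]) - set(addresses[1])
only_second = set(addresses[1]) - set(addresses[0])
print("shared:", sorted(shared))
print("only in Address 1:", sorted(only_first))
print("only in Address 2:", sorted(only_second))
```

Each record carries its own attribute names, which is what "no separation between the data and the schema" means in practice.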
Sources of Semi-structured Data

• Amongst the sources of semi-structured data, the front runners are XML and JSON.

1. XML: eXtensible Markup Language (XML) has been hugely popularized by web services developed using the Simple Object Access Protocol (SOAP) principles.

2. JSON: JavaScript Object Notation (JSON) is used to transmit data between a server and a web application. It is widely used by web services built on Representational State Transfer (REST) and by document databases such as MongoDB.
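
As a rough illustration of the two formats, the sketch below parses the same record from XML and from JSON using only Python's standard library. The `employee` record and its fields are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record expressed in both formats.
xml_text = "<employee><name>Asha</name><department>Sales</department></employee>"
json_text = '{"employee": {"name": "Asha", "department": "Sales"}}'

# XML: tags delimit the semantic elements; walk the children of the root.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

# JSON: keys name the values; the parser returns nested dictionaries.
from_json = json.loads(json_text)["employee"]

print(from_xml)
print(from_json)
print("same content:", from_xml == from_json)
```

In both cases the structure travels with the data itself, in tags or keys, rather than in a separate schema.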
Unstructured Data
• It does not conform to a data model or is not in a form which can be used easily by a computer program.

• About 80–90% of an organization's data is in this format.

• Example: memos, chat rooms, PowerPoint presentations, images, videos, letters, research papers, white papers, body of an email, etc.
Dealing with Unstructured data:

The following techniques are used to find patterns in, or to interpret, unstructured data.

Data Mining: Also known as knowledge discovery in databases. Popular mining algorithms include association rule mining, regression analysis and collaborative filtering.

Natural Language Processing: It is related to Human Computer Interaction. It is about enabling computers to understand human or natural language input.

Text Analytics: Text mining is the process of gleaning high-quality and meaningful information from text. It includes tasks such as text categorization, text clustering, sentiment analysis and concept/entity extraction.

Noisy text analytics: The process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis and emails. Such text typically contains spelling mistakes, abbreviations, fillers such as "uh" and "hm", and non-standard words.

Manual tagging with metadata: This is about manually tagging unstructured data with adequate metadata to provide the semantics needed to understand it.

Parts of Speech Tagging: POST is the process of reading text and tagging each word in a sentence as belonging to a particular part of speech such as noun, verb or adjective.

Unstructured Information Management Architecture (UIMA): An open-source platform from IBM used for real-time content analytics.
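
A first step in noisy text analytics can be sketched as below. This is a toy illustration, not a production technique: the filler list, the abbreviation dictionary and the sample chat message are all assumptions, and real systems rely on far richer language resources.

```python
import re
from collections import Counter

# Hypothetical resources for the example: fillers to drop, abbreviations to expand.
FILLERS = {"uh", "hm", "um"}
ABBREVIATIONS = {"pls": "please", "thx": "thanks", "u": "you", "gr8": "great"}


def normalize(text):
    """Lower-case the text, split it into tokens, drop fillers, expand abbreviations."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    cleaned = []
    for token in tokens:
        if token in FILLERS:
            continue
        cleaned.append(ABBREVIATIONS.get(token, token))
    return cleaned


chat = "Uh, thx for the gr8 service! Hm... pls call u back, service was gr8"
tokens = normalize(chat)
counts = Counter(tokens)

print(tokens)
print(counts.most_common(2))
```

Once the text is normalized into tokens and counts, it has a simple structure that downstream steps such as text categorization or sentiment analysis can work with.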
