Unit 4 DigitalData

The document discusses the classification of digital data into three categories: structured, semi-structured, and unstructured, highlighting their characteristics and examples. It emphasizes the challenges of managing and analyzing data from various sources, particularly the prevalence of unstructured data in organizations. Additionally, it introduces Big Data concepts and Hadoop as tools for handling large volumes of data.

UNIT – IV:

Types of Digital Data

Introduction to Big Data: Characteristics of Data, Evolution of Big Data and Challenges with Big Data, Big Data, Terminologies used in Big Data Environment.

Introduction to Hadoop: Features of Hadoop, Why Hadoop, RDBMS vs Hadoop, Hadoop Overview, HDFS, Processing Data with Hadoop.
Types of Digital Data
Data:

Data provides information from which meaningful insights can be derived.

Data: where does it come from?

Data comes from everywhere:

• We speak
• We move
• Sensors
• Computers
• Documents
Digital Data

• Today, data is undoubtedly an invaluable asset of any enterprise (big or small). Even though professionals work with data all the time, the understanding, management and analysis of data from heterogeneous sources remains a serious challenge.

• Data growth has accelerated exponentially since the advent of the computer and the Internet.

• The slides that follow discuss the various formats of digital data (structured, semi-structured and unstructured), data storage mechanisms, data access methods, management of data, the process of extracting desired information from data, and the challenges posed by the various formats of data.
Classification of Digital Data:
Digital data can be broadly classified into three forms:
– Unstructured
– Semi-structured
– Structured

Unstructured

• This is data which does not conform to a data model or is not in a form which can be used easily by a computer program.

• About 80-90% of an organization's data is unstructured.

Example: images, videos, letters, text, PDFs, social media posts, body of an email, log files, PowerPoint presentations, etc.
Classification of Digital Data contd.

Semi-structured data (self-describing structure)

• This is data which does not conform to a data model but has some structure.
• It is not in a form which can be used easily by a computer program. It has a self-describing structure and uses tags to separate semantic elements.
• Metadata for this data is available but is not sufficient.

Example: XML, markup languages like HTML, emails, etc.

Structured data

• This is data which is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.

• Relationships exist between entities of data, such as classes and their objects.

• Data stored in databases is an example of structured data.

Example: Oracle, DB2, MySQL, OLTP (Online Transaction Processing) systems, spreadsheets.
Classification of Digital Data contd.

• Since the 1980s, enterprise data has been stored in RDBMSs, which hold structured data.

• Later, with the Internet connecting the world, data became an integral part of daily transactions.

• Not all of this data was structured; almost 80% of the data generated in any enterprise today is unstructured.

• Roughly 10% of data falls in the structured and semi-structured categories.

Figure: percent distribution of the three forms of data.


Structured Data
• Data that has a predefined schema/structure is structured data.

• In the context of an RDBMS, data is stored in rows and columns.

• The number of rows (records/tuples) in a relation is called the cardinality of the relation.

• The number of columns is referred to as the degree of the relation.
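
As a minimal sketch of cardinality and degree, the relation below is modeled as a plain Python list of tuples. The table, column names and values are made up for illustration; a real RDBMS would report the same numbers through its own catalog or a COUNT query.

```python
# A hypothetical "employee" relation: column names plus a list of rows (tuples).
columns = ("emp_id", "name", "dept")
rows = [
    (1, "Asha", "Sales"),
    (2, "Ravi", "HR"),
    (3, "Meena", "Sales"),
    (4, "John", "IT"),
]


def cardinality(relation_rows):
    """Cardinality: the number of rows (tuples) in the relation."""
    return len(relation_rows)


def degree(relation_columns):
    """Degree: the number of columns (attributes) in the relation."""
    return len(relation_columns)


print("cardinality:", cardinality(rows))
print("degree:", degree(columns))
```

Adding or deleting a row changes the cardinality; the degree changes only when the schema itself changes.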


Sources of Structured data

• If the data is structured, an RDBMS can be used to house it: Oracle (Oracle Corp.), DB2 (IBM), Microsoft SQL Server (Microsoft), Greenplum (EMC), Teradata (Teradata), MySQL (open source), PostgreSQL (advanced open source), etc.

• These databases are typically used to hold the transaction/operational data generated and collected by day-to-day business activities.

• In other words, the data of On-Line Transaction Processing (OLTP) systems is generally quite structured.
Ease of Working with Structured Data

The ease is with respect to the following:

1. Insert/update/delete: The Data Manipulation Language (DML) operations provide the required ease with data input, storage, access, processing, analysis, etc.

2. Security: Encryption and tokenization solutions are available to secure the information. Organizations are able to retain control and maintain compliance by ensuring that only authorized individuals are able to decrypt and view sensitive information.

3. Indexing: An index is a data structure that speeds up data retrieval operations (primarily the SELECT statement) at the cost of additional writes and storage space; the gains in search performance are usually worth that cost.

4. Scalability: The storage and processing capabilities of a traditional RDBMS can be scaled up by increasing the horsepower of the database server (increasing the primary and secondary or peripheral storage capacity, the processing capacity of the processor, etc.).
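
The DML operations and indexing described above can be tried with Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for a full RDBMS, and the orders table, its columns and the index name idx_orders_customer are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# DML: insert three rows, update one, delete one.
cur.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("asha", 120.0), ("ravi", 75.5), ("asha", 30.0)],
)
cur.execute("UPDATE orders SET amount = 80.0 WHERE customer = ?", ("ravi",))
cur.execute("DELETE FROM orders WHERE amount < ?", (50.0,))
conn.commit()

# Indexing: costs extra storage and extra work on writes, speeds up lookups.
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

remaining = cur.execute(
    "SELECT customer, amount FROM orders ORDER BY id"
).fetchall()

# Ask SQLite how it would run a lookup by customer; the last field of each
# plan row is a human-readable description of the step.
plan = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = ?", ("asha",)
).fetchall()
uses_index = any("idx_orders_customer" in row[-1] for row in plan)

print(remaining)
print("lookup uses index:", uses_index)
```

The query plan is a convenient way to confirm that a SELECT actually benefits from an index instead of scanning the whole table.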
Ease of Working with Structured Data
5. Transaction processing:

An RDBMS supports the Atomicity, Consistency, Isolation, and Durability (ACID) properties of transactions.

• Atomicity: A transaction is atomic, meaning that either it happens in its entirety or not at all.
• Consistency: The database moves from one consistent state to another consistent state. In other words, if the same piece of information is stored in two or more places, the copies are in complete agreement.
• Isolation: Resources are allocated to the transaction such that it gets the impression that it is the only transaction happening, in isolation.
• Durability: All changes made to the database by a committed transaction are permanent, and that accounts for the durability of the transaction.
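
Atomicity and consistency can be illustrated with a small money-transfer sketch using Python's sqlite3 module. The accounts table, the transfer helper and the CHECK constraint are assumptions made for this example, not part of the original slides; the point is only that a failed transaction leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"  # consistency rule: no overdraft
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction; True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "alice", "bob", 30)       # both updates are applied
failed = transfer(conn, "alice", "bob", 500)  # debit violates CHECK, credit is undone

print(ok, failed, balances(conn))
```

In the second transfer the credit to bob succeeds first, but the debit from alice breaks the constraint, so the whole transaction is rolled back and bob's balance returns to its previous value.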
Semi-structured Data
• It does not conform to any data model, i.e., it is difficult to determine the meaning of the data, and the data cannot be stored in rows and columns as in a database.

• It uses tags to separate semantic elements, and markers which help to group data and describe how the data is stored. This gives some metadata, but not enough for management and automation of the data.
• Similar entities in the data are grouped and organized in a hierarchy.

• There is no separation between the data and the schema.

• Entities belonging to the same class, and grouped together, need not have the same set of attributes.

Example: two addresses may or may not contain the same number of properties, as in Address 1 and Address 2.

• Even if they have the same set of attributes, the order of the attributes may differ, and for all practical purposes the order is not important.

• In an e-mail, for example, the tags give us some metadata, but the body of the e-mail has no format, nor one that conveys the meaning of the data it contains.
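
The point about entities of the same class carrying different attributes can be sketched with two JSON address records. The field names and values below are invented for illustration, since the original Address 1 and Address 2 figures are not available.

```python
import json

# Two hypothetical address records of the same "class" with different
# attributes, listed in a different order.
raw = """
[
  {"name": "Address 1", "street": "12 MG Road", "city": "Pune", "pin": "411001"},
  {"name": "Address 2", "city": "Mysuru", "landmark": "Near Palace",
   "street": "4 Temple St", "country": "India"}
]
"""
addresses = json.loads(raw)

# Each record describes itself through its keys; there is no separate schema.
all_keys = set().union(*(addr.keys() for addr in addresses))
common_keys = set.intersection(*(set(addr) for addr in addresses))


def to_row(addr, columns):
    """Flatten one record into a fixed column order, None where a key is absent."""
    return tuple(addr.get(col) for col in columns)


columns = sorted(all_keys)
table = [to_row(addr, columns) for addr in addresses]

print("common attributes:", sorted(common_keys))
print("columns:", columns)
for row in table:
    print(row)
```

Forcing the records into rows and columns requires choosing a schema after the fact and padding the gaps with None, which is why such data does not fit a relational table directly.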
Sources of Semi-structured Data

• Amongst the sources of semi-structured data, the front runners are "XML" and "JSON".

1. XML: eXtensible Markup Language (XML) was popularized by web services developed using Simple Object Access Protocol (SOAP) principles.

2. JSON: JavaScript Object Notation (JSON) is used to transmit data between a server and a web application. It is used by web services built on Representational State Transfer (REST) principles and by document stores such as MongoDB.
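
To make the two formats concrete, here is a small sketch that parses the same hypothetical employee record from XML and from JSON using only the Python standard library. The record and its field names are assumptions for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same made-up record in both formats; tags and keys describe the data.
xml_text = "<employee><name>Asha</name><dept>Sales</dept></employee>"
json_text = '{"name": "Asha", "dept": "Sales"}'

# XML: walk the child elements of the root and read tag/text pairs.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

# JSON: a single call maps the text onto native dictionaries and lists.
from_json = json.loads(json_text)

print(from_xml)
print(from_json)
print("same content:", from_xml == from_json)
```

In both cases the structure travels with the data itself, which is what "self-describing" means here.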
Unstructured Data
• It does not conform to a data model or is not in a form which can be used easily by a computer program.

• About 80–90% of an organization's data is in this format.

• Example: memos, chat rooms, PowerPoint presentations, images, videos, letters, research reports, white papers, body of an email, etc.
Dealing with Unstructured data:

The following techniques are used to find patterns in, or interpret, unstructured data.

Data Mining: Knowledge discovery in databases. Popular mining algorithms are association rule mining, regression analysis, and collaborative filtering.

Natural Language Processing: It is related to Human-Computer Interaction. It is about enabling computers to understand human or natural language input.

Text Analytics: Text mining is the process of gleaning high-quality and meaningful information from text. It includes tasks such as text categorization, text clustering, sentiment analysis and concept/entity extraction.

Noisy Text Analytics: The process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis and emails, which contain spelling mistakes, abbreviations, fillers such as "uh" and "hm", and non-standard words.

Manual Tagging with Metadata: This is about manually tagging unstructured data with adequate metadata to provide the semantics needed to understand it.

Part-of-Speech Tagging: POST is the process of reading text and tagging each word in the sentence as belonging to a particular part of speech, such as noun, verb, or adjective.

Unstructured Information Management Architecture (UIMA): Open-source platform from IBM used for real-time content analytics.
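
As a toy sketch of noisy text analytics followed by sentiment analysis, the code below cleans a chat message and scores it against tiny word lists. The filler set, abbreviation map and sentiment lexicons are hand-made assumptions for illustration; real systems rely on much larger resources or trained models.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny resources for the example.
FILLERS = {"uh", "hm", "um"}
ABBREVIATIONS = {"u": "you", "gr8": "great", "pls": "please", "thx": "thanks"}
POSITIVE = {"great", "good", "thanks", "love"}
NEGATIVE = {"bad", "slow", "broken", "hate"}


def normalize(text):
    """Lowercase, tokenize, expand known abbreviations and drop filler words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in FILLERS]


def sentiment(tokens):
    """Label tokens by counting positive words minus negative words."""
    counts = Counter(tokens)
    score = sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


chat = "uh thx, the new app is gr8 but hm login is slow"
tokens = normalize(chat)
print(tokens)
print(sentiment(tokens))
```

The normalization step turns noisy input into clean tokens; the counting step is the simplest possible stand-in for the categorization and sentiment tasks listed under text analytics.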
