
Big Data & Analytics (CSE448) L1

The document discusses the classification of digital data into structured, semi-structured, and unstructured categories, highlighting their characteristics and examples. It emphasizes that structured data is organized and easily processed, while semi-structured data has some structure but is not easily usable, and unstructured data makes up 80-90% of an organization's data, often being difficult to process. The document also outlines various methods for dealing with unstructured data, including data mining and natural language processing.

Uploaded by

lobljl4ct

BIG DATA AND ANALYTICS

MODULE 1 (L1)
Do you know what happens in one minute on the Internet?

• In just one minute, more than 204 million emails are sent.
• Amazon rings up about $83,000 in sales.
• Around 20 million photos are viewed and 3,000 are uploaded on Flickr.
• At least 6 million Facebook pages are viewed around the world.
• More than 61,000 hours of music are played on Pandora, while more than 1.3 million video clips are watched on YouTube.
Classification of Digital Data
Digital data is classified into the following categories:
• Structured data
• Semi-structured data
• Unstructured data
Classification of Digital Data
• Unstructured data:
  • This is data that does not conform to a data model or is not in a form that can be used easily by a computer program.
  • About 80-90% of an organization's data is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, research, white papers, the body of an email, etc.
Classification of Digital Data
• Semi-structured data: This is data that does not conform to a data model but has some structure. However, it is not in a form that can be used easily by a computer program.
  • Examples: emails, XML, markup languages such as HTML, etc. Metadata for this data is available but is not sufficient.
• Structured data: This is data that is in an organized form (e.g., in rows and columns) and can be easily used by a computer program. Relationships exist between entities of data, such as classes and their objects. Data stored in databases is an example of structured data.
Approximate Percentage Distribution of Digital Data
[Figure: approximate percentage distribution of digital data]
Structured Data
• This is data that is in an organized form (e.g., in rows and columns) and can be easily used by a computer program.
• Relationships exist between entities of data, such as classes and their objects.
• Data stored in databases is an example of structured data.
Sources of Structured Data
• If your data is highly structured, you can house it in any of the available RDBMSs:
  • Oracle (Oracle Corp.), DB2 (IBM), Microsoft SQL Server (Microsoft), Greenplum (EMC), Teradata (Teradata), MySQL (open source), PostgreSQL (advanced open source), etc.
• These databases are typically used to hold the transactional/operational data generated and collected by day-to-day business activities. In other words, the data of On-Line Transaction Processing (OLTP) systems is generally quite structured.
Sources of Structured Data
• Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
• Spreadsheets
• OLTP systems
Ease of Working with Structured Data
The ease is with respect to the following:
• Insert/update/delete: The Data Manipulation Language (DML) operations make it easy to input, store, access, process, and analyze data.
• Security: How does one ensure the security of information? Encryption and tokenization solutions are available to safeguard information throughout its lifecycle.
  • Organizations retain control and maintain compliance by ensuring that only authorized individuals are able to decrypt and view sensitive information.
Ease of Working with Structured Data
• Indexing: An index is a data structure that speeds up data retrieval operations (primarily the SELECT statement) at the cost of additional writes and storage space. The gain in search performance is usually worth that cost.
• Scalability: The storage and processing capabilities of a traditional RDBMS can be scaled up by increasing the horsepower of the database server (increasing the primary and secondary or peripheral storage capacity, the processing capacity of the processor, etc.).
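The DML operations and the index described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the "orders" table, its columns, and the sample rows are hypothetical and do not come from the slides.

```python
import sqlite3

# In-memory database; the "orders" table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# INSERT, UPDATE and DELETE are the DML operations named above.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("asha", 120.0), ("ravi", 75.5), ("asha", 42.0)],
)
conn.execute("UPDATE orders SET amount = 80.0 WHERE customer = 'ravi'")
conn.execute("DELETE FROM orders WHERE amount < 50")
conn.commit()

# An index trades extra writes and storage for faster lookups on "customer".
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
rows = conn.execute(
    "SELECT customer, amount FROM orders WHERE customer = ? ORDER BY id",
    ("asha",),
).fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rows, remaining)
```

Running EXPLAIN QUERY PLAN on the SELECT is one way to check whether the database actually uses the index for a given query.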
Ease of Working with Structured Data
• Transaction processing: An RDBMS supports the Atomicity, Consistency, Isolation, and Durability (ACID) properties of transactions.
  • Atomicity: A transaction is atomic, meaning that either it happens in its entirety or none of it happens at all.
  • Consistency: The database moves from one consistent state to another consistent state. In other words, if the same piece of information is stored at two or more places, they are in complete agreement.
  • Isolation: Resources are allocated to the transaction such that it gets the impression that it is the only transaction happening, in isolation.
  • Durability: All changes made to the database during a transaction are permanent, and that accounts for the durability of the transaction.
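Atomicity in particular is easy to see in a small experiment. The sketch below uses Python's sqlite3 module; the "accounts" table, the CHECK constraint, and the transfer() helper are assumptions made for illustration, not something taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: balances may never go negative.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "a", "b", 30)       # both updates are committed
failed = transfer(conn, "a", "b", 500)  # debit violates CHECK; credit is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer credits "b" before the debit fails, yet the final balances show no trace of that credit: the transaction happened in its entirety or not at all.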
Ease with Structured Data
• Insert / Update / Delete
• Security
• Indexing / Searching
• Scalability
• Transaction Processing
Semi-structured Data
• This is data that does not conform to a data model but has some structure.
• However, it is not in a form that can be used easily by a computer program.
• Examples: emails, XML, markup languages such as HTML, etc. Metadata for this data is available but is not sufficient.
Semi-structured Data
It has the following features:
• It does not conform to the data models that one typically associates with relational databases or any other form of data tables.
• It uses tags to segregate semantic elements.
• Tags are also used to enforce hierarchies of records and fields within data.
• There is no separation between the data and the schema.
• The amount of structure used is dictated by the purpose at hand.
• In semi-structured data, entities belonging to the same class and also grouped together need not necessarily have the same set of attributes.
• And if at all, they have the same set of attributes, the
Sources of Semi-structured Data
• Amongst the sources for semi-structured data, the front runners are “XML” and “JSON”.
• XML: eXtensible Markup Language (XML) was widely popularized by web services built on the Simple Object Access Protocol (SOAP).
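A short sketch of how tags carry the structure of an XML document, using Python's standard xml.etree.ElementTree module. The catalog below is hypothetical: the first title is borrowed from the sample JSON document later in this deck, and everything else is invented for illustration. Note that the two records do not share the same set of fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: the second <book> has no <publisher> and adds <edition>.
doc = """
<catalog>
  <book id="9">
    <title>Fundamentals of Business Analytics</title>
    <publisher>Wiley India</publisher>
  </book>
  <book id="10">
    <title>Another Title</title>
    <edition>2</edition>
  </book>
</catalog>
"""

root = ET.fromstring(doc)
# The tags act as the schema: each record lists whichever fields it has.
records = [
    {child.tag: child.text for child in book} for book in root.findall("book")
]
fields = [sorted(record) for record in records]
print(fields)
```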
Sources of Semi-structured Data
• XML (eXtensible Markup Language)
• Other markup languages
• JSON (JavaScript Object Notation)
Characteristics of Semi-structured Data
• Inconsistent structure
• Self-describing (label/value pairs)
• Schema information is often blended with data values
• Data objects may have different attributes not known beforehand
Sources of Semi-structured Data
• JSON: JavaScript Object Notation (JSON) is used to transmit data between a server and a web application.
• JSON was popularized by web services built on Representational State Transfer (REST), an architectural style for creating scalable web services.
• MongoDB (open-source, distributed, NoSQL, document-oriented database) and Couchbase (originally known as Membase; open-source, distributed, NoSQL, document-oriented database) store data natively as JSON-style documents (MongoDB uses BSON, a binary encoding of JSON-like documents).
Sources of Semi-structured Data
An example of HTML is as follows:
<HTML>
<HEAD>
<TITLE>Place your title here</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFF">
<CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"></CENTER>
<HR> <a href="http://bigdatauniversity.com">Link Name</a>
<H1>this is a Header</H1>
<H2>this is a sub Header</H2>
Send me mail at <a href="mailto:[email protected]">
[email protected]</a>.
<P>a new paragraph!
<P><B>a new paragraph!</B>
<BR><B><I>this is a new sentence without a paragraph break, in bold italics.</I></B>
<HR>
</BODY>
</HTML>
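Because the tags give HTML some structure, a program can pull specific elements out of a page even though the page as a whole follows no data model. The sketch below uses Python's standard html.parser module; the LinkCollector class and the page fragment are hypothetical, written in the style of the example above.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical fragment in the style of the page above.
page = (
    "<BODY><H1>this is a Header</H1>"
    '<a href="http://example.com">Link Name</a>'
    "<P>a new paragraph!</BODY>"
)
parser = LinkCollector()
parser.feed(page)
links = parser.links
print(links)
```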
Sources of Semi-structured Data
Sample JSON document
{
  "_id": 9,
  "BookTitle": "Fundamentals of Business Analytics",
  "AuthorName": "Seema Acharya",
  "Publisher": "Wiley India",
  "YearofPublication": "2011"
}
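A minimal sketch of reading such a document with Python's standard json module. The text below is the sample document written as strict JSON (quoted keys, straight quotes); the variable names are illustrative only.

```python
import json

# The sample document, written as strict JSON (quoted keys, straight quotes).
text = """
{
  "_id": 9,
  "BookTitle": "Fundamentals of Business Analytics",
  "AuthorName": "Seema Acharya",
  "Publisher": "Wiley India",
  "YearofPublication": "2011"
}
"""

book = json.loads(text)
# Values keep their JSON types: _id is a number, YearofPublication is a string.
year = int(book["YearofPublication"])
# Serializing and parsing again yields an equal object.
round_trip = json.loads(json.dumps(book))
print(book["BookTitle"], year)
```

The label/value pairs make the document self-describing, but nothing in it says that YearofPublication should be a number; that interpretation is left to the consuming program.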
Unstructured Data
• This is data that does not conform to a data model or is not in a form that can be used easily by a computer program.
• About 80-90% of an organization's data is in this format.
• Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research, white papers, the body of an email, etc.
Sources of Unstructured Data
• Web pages
• Images
• Free-form text
• Audio
• Videos
• Body of email
• Text messages
• Chats
• Social media data
• Word documents
Issues with Terminology: Unstructured Data
• Structure can be implied despite not being formally defined.
• Data with some structure may still be labeled unstructured if the structure doesn’t help with the processing task at hand.
• Data may have some structure or may even be highly structured in ways that are unanticipated or unannounced.
How to Deal with Unstructured Data?

• Today, unstructured data constitutes approximately 80% of the data that is being generated in any enterprise.
Dealing with Unstructured Data
• Data Mining
• Natural Language Processing (NLP)
• Text Analytics
• Noisy Text Analytics
Issues with "Unstructured" Data
• Data mining:
  • First, we deal with large data sets.
  • Second, we use methods at the intersection of artificial intelligence, machine learning, statistics, and database systems to unearth consistent patterns in large data sets and/or systematic relationships between variables.
  • It is the analysis step of the “knowledge discovery in databases” process.


Issues with "Unstructured" Data
A few popular data mining algorithms are as follows:
• Association rule mining:
  • It is also called “market basket analysis” or “affinity analysis”.
  • It is used to determine “What goes with what?”
  • It asks: when you buy a product, what other product are you likely to purchase with it?
  • For example, if you pick up bread from
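The “what goes with what” question is usually answered with two measures, support and confidence. The sketch below computes both for pairs of items in plain Python; the baskets are hypothetical data invented for illustration, and a real analysis would typically use a dedicated algorithm such as Apriori over far larger data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, for illustration only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

item_count = Counter(item for basket in baskets for item in basket)
pair_count = Counter(
    frozenset(pair)
    for basket in baskets
    for pair in combinations(sorted(basket), 2)
)

def support(a, b):
    """Fraction of all baskets that contain both a and b."""
    return pair_count[frozenset((a, b))] / len(baskets)

def confidence(a, b):
    """Of the baskets containing a, the fraction that also contain b."""
    return pair_count[frozenset((a, b))] / item_count[a]

s = support("bread", "butter")     # 2 of the 4 baskets
c = confidence("bread", "butter")  # 2 of the 3 baskets that contain bread
print(s, c)
```

Confidence is directional: in this data every basket with butter also has bread, so confidence("butter", "bread") is higher than confidence("bread", "butter").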
Issues with "Unstructured" Data
• Regression analysis:
  • It helps to predict the relationship between two variables.
  • The variable whose value needs to be predicted is called the dependent variable, and the variables used to predict that value are referred to as the independent variables.
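For one independent variable, the simplest form of regression is a straight line fitted by ordinary least squares. The sketch below is a plain-Python illustration; the data (hours studied versus test score) and the fit_line() helper are hypothetical.

```python
# Hypothetical data: hours studied (independent) and test score (dependent).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [52.0, 54.0, 56.0, 58.0, 60.0]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(xs, ys)
# Predict the dependent variable for a new value of the independent variable.
predicted = intercept + slope * 6.0
print(slope, intercept, predicted)
```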
Issues with "Unstructured" Data
• Collaborative filtering:
  • It is about predicting a user’s preference or preferences based on the preferences of a group of users.
  • For example, take a look at the table on the next slide.
  • We are looking at predicting whether User 4 will prefer to learn using videos or is a textual learner, depending on one or a couple of his or her known preferences.
  • We analyze the preferences of similar user profiles and, on that basis, predict that User 4 will also like to learn using videos and is not a
Issues with "Unstructured" Data
Table: Sample records depicting learners’ preferences for mode of learning
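The idea of predicting User 4's preference from similar users can be sketched in a few lines of Python. The slide's table is not reproduced in this text, so the preference data below is entirely hypothetical, as are the similarity() and predict() helpers; this is one simple user-based scheme, not necessarily the one the slides had in mind.

```python
# Hypothetical preference table (1 = likes, 0 = does not like).
ratings = {
    "User 1": {"classroom": 1, "e-books": 0, "quizzes": 1, "videos": 1},
    "User 2": {"classroom": 1, "e-books": 0, "quizzes": 1, "videos": 1},
    "User 3": {"classroom": 0, "e-books": 1, "quizzes": 0, "videos": 0},
    "User 4": {"classroom": 1, "e-books": 0, "quizzes": 1},  # videos unknown
}

def similarity(a, b):
    """Share of commonly rated items on which two users agree."""
    shared = [item for item in a if item in b]
    if not shared:
        return 0.0
    return sum(a[item] == b[item] for item in shared) / len(shared)

def predict(user, item):
    """Similarity-weighted average of the other users' ratings for item."""
    target = ratings[user]
    numerator = denominator = 0.0
    for other, prefs in ratings.items():
        if other == user or item not in prefs:
            continue
        weight = similarity(target, prefs)
        numerator += weight * prefs[item]
        denominator += weight
    return numerator / denominator if denominator else None

score = predict("User 4", "videos")
print(score)
```

A score near 1 suggests that User 4 is likely to prefer videos, because the users who agree with User 4 on the known preferences also like videos.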
Issues with "Unstructured" Data
• Text analytics or text mining: Compared to the structured data stored in relational databases, text is largely unstructured, amorphous, and difficult to deal with algorithmically.
• Text mining is the process of gleaning high-quality and meaningful information (through the devising of patterns and trends by means of statistical pattern learning) from text.
• It includes tasks such as text categorization,
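An elementary first step in most text mining tasks is turning free-form text into term counts that later steps can work with. The sketch below does this with Python's standard library; the sample sentences, the stopword list, and the term_counts() helper are hypothetical and deliberately tiny.

```python
import re
from collections import Counter

# Hypothetical snippets of free-form text.
docs = [
    "The delivery was late and the package was damaged.",
    "Great delivery, great price!",
]
STOPWORDS = {"the", "and", "was", "a", "an"}

def term_counts(text):
    """Lowercase, split into word tokens, drop stopwords, and count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in STOPWORDS)

counts = [term_counts(doc) for doc in docs]
total = sum(counts, Counter())  # term counts across all documents
print(total.most_common(3))
```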
Issues with "Unstructured" Data
• Natural language processing (NLP): It is related to the area of human-computer interaction. It is about enabling computers to understand human or natural language input.
• Noisy text analytics: It is the process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis, emails, message boards, text messages, etc.
  • The noisy unstructured data usually comprises one or more of the following: Spelling mistakes,
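One way to handle noise such as spelling mistakes and chat abbreviations is to normalize tokens before any further analysis. The sketch below uses Python's standard difflib module for fuzzy matching; the vocabulary, the abbreviation table, the 0.8 cutoff, and the normalize() helper are all assumptions made for illustration.

```python
import difflib
import re

# Hypothetical vocabulary and abbreviation table, for illustration only.
VOCAB = ["meeting", "tomorrow", "please", "confirm", "the"]
ABBREVIATIONS = {"pls": "please", "tmrw": "tomorrow"}

def normalize(message):
    """Expand known abbreviations, then snap near-misses to the vocabulary."""
    cleaned = []
    for token in re.findall(r"[a-z]+", message.lower()):
        token = ABBREVIATIONS.get(token, token)
        match = difflib.get_close_matches(token, VOCAB, n=1, cutoff=0.8)
        cleaned.append(match[0] if match else token)
    return cleaned

cleaned = normalize("Pls confrim the meetng tmrw")
print(cleaned)
```

Tokens with no sufficiently close vocabulary entry are passed through unchanged, so the cutoff controls how aggressively the text is corrected.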
Issues with "Unstructured" Data
• Manual tagging with metadata: This is about manually tagging unstructured data with adequate metadata to provide the semantics needed to understand it.
• Part-of-speech tagging: It is also called POS or POST or grammatical tagging. It is the process of reading text and tagging each word in the sentence as belonging to a particular part of speech such as “noun”, “verb”, “adjective”, etc.
Issues with "Unstructured" Data
• Unstructured Information Management Architecture (UIMA): It is an open source platform from IBM. It is used for real-time content analytics.
• It is about processing text and other unstructured data to find the latent meaning and relevant relationships buried therein. Read up more on UIMA at http://www.ibm.com/developerworks/data/downloads/uima/
Summary
• Structured data: It conforms to a data model. For example, an RDBMS conforms to the relational data model. It has a pre-defined schema.
• Semi-structured data: For this format of data, some metadata is available, but it is insufficient. Semi-structured data has a self-describing structure. There is little or no separation between data and schema.
• Unstructured data: This data is growing by leaps and bounds every day. It has innumerable sources such as human generated (social media data, emails, word documents, presentations, audio and video files that we create
Answer a few quick questions …

• Match the following

Column A                   Column B
NLP                        Content analytics
Text analytics             Text messages
UIMA                       Chats
Noisy unstructured data    Text mining
Data mining                Comprehend human or natural language input
Noisy unstructured data    Uses methods at the intersection of statistics, Artificial Intelligence, machine learning & DBs
IBM                        UIMA
Questions to Answer
• Which category (structured, semi-structured, or unstructured) will you place a web page in?
• Which category (structured, semi-structured, or unstructured) will you place a Word document in?
• State a few examples of human-generated and machine-generated data.
