Big Data & Analytics (CSE448) L1

The document discusses the classification of digital data into structured, semi-structured, and unstructured categories, highlighting their characteristics and examples. It emphasizes that structured data is organized and easily processed, while semi-structured data has some structure but is not easily usable, and unstructured data makes up 80-90% of an organization's data, often being difficult to process. The document also outlines various methods for dealing with unstructured data, including data mining and natural language processing.

Uploaded by

lobljl4ct

BIG DATA AND ANALYTICS

MODULE 1 (L1)
Do you know what happens in one minute on the Internet?
• More than 204 million emails are sent.
• Amazon rings up about $83,000 in sales.
• Around 20 million photos are viewed and 3,000 are uploaded on Flickr.
• At least 6 million Facebook pages are viewed around the world.
• More than 61,000 hours of music are played on Pandora.
• More than 1.3 million video clips are watched on YouTube.
Classification of Digital
Data
Digital data is classified into the
following categories:
 Structured data
 Semi-structured data
 Unstructured data
Classification of Digital Data
 Unstructured data:
 This is data that does not conform to a data model or is not in a form
that can be used easily by a computer program.
 About 80-90% of an organization's data is in this format; for example,
memos, chat rooms, PowerPoint presentations, images, videos, letters,
research, white papers, body of an email, etc.
Classification of Digital Data
 Semi-structured data: This is data that does not conform to a data model
but has some structure. However, it is not in a form that can be used
easily by a computer program.
 Examples: emails, XML, markup languages like HTML, etc. Metadata for this
data is available but is not sufficient.
 Structured data: This is data that is in an organized form (e.g., in rows
and columns) and can be easily used by a computer program. Relationships
exist between entities of data, such as classes and their objects. Data
stored in databases is an example of structured data.
Approximate Percentage Distribution of Digital Data
[Figure: approximate percentage distribution of digital data]
Structured Data
 This is the data which is in an organized
form (e.g., in rows and columns) and can
be easily used by a computer program.
 Relationships exist between entities of
data, such as classes and their objects.
 Data stored in databases is an example of
structured data.
Sources of Structured Data
 If your data is highly structured, you can house it in any of the
available RDBMSs: Oracle (Oracle Corp.), DB2 (IBM), Microsoft SQL Server
(Microsoft), Greenplum (EMC), Teradata (Teradata), MySQL (open source),
PostgreSQL (advanced open source), etc.
 These databases are typically used to hold transaction/operational data
generated and collected by day-to-day business activities. In other words,
the data of On-Line Transaction Processing (OLTP) systems is generally
quite structured.
Sources of Structured Data
Structured data comes from:
 Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
 Spreadsheets
 OLTP systems
Ease of Working with Structured Data
The ease is with respect to the following:
 Insert/update/delete: The Data Manipulation Language (DML) operations
make data input, storage, access, processing, analysis, etc.
straightforward.
 Security: How does one ensure the security of information? Encryption and
tokenization solutions are available to protect information throughout its
lifecycle.
 Organizations are able to retain control and maintain compliance by
ensuring that only authorized individuals are able to decrypt and view
sensitive information.
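The insert/update/delete operations can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in for a full RDBMS; the `customers` table and its rows are hypothetical and are not taken from the slides.

```python
import sqlite3

# In-memory database: a stand-in for an RDBMS, nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# INSERT: add three rows (hypothetical sample data).
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Delhi"), (3, "Meera", "Chennai")],
)

# UPDATE: change one column of one row.
conn.execute("UPDATE customers SET city = ? WHERE id = ?", ("Mumbai", 1))

# DELETE: remove one row.
conn.execute("DELETE FROM customers WHERE id = ?", (2,))
conn.commit()

# Read the result back in a predictable order.
rows = conn.execute("SELECT id, name, city FROM customers ORDER BY id").fetchall()
print(rows)
```

Because every row has the same columns, each operation is a single declarative statement; that regularity is what the slide means by ease of working with structured data.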
Ease of Working with Structured Data
 Indexing: An index is a data structure that speeds up data retrieval
operations (primarily SELECT statements) at the cost of additional writes
and storage space. The faster searches are usually worth that cost.
 Scalability: The storage and processing capabilities of a traditional
RDBMS can be scaled up by increasing the horsepower of the database server
(increasing the primary and secondary or peripheral storage capacity,
processing capacity of the processor, etc.).
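The effect of an index on retrieval can be seen in a small sketch. It assumes SQLite (via Python's sqlite3 module) as the database; the `orders` table, its data, and the index name `idx_orders_customer` are hypothetical. The query plan is inspected before and after the index is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [(f"customer_{i % 100}", float(i)) for i in range(1000)],
)

query = "SELECT * FROM orders WHERE customer = ?"

def plan(conn, sql, params):
    """Return SQLite's query plan as one string (the last column holds the detail text)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

# Without an index, SQLite has to scan the whole table.
plan_before = plan(conn, query, ("customer_7",))

# The index costs extra storage and extra work on every write ...
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# ... but lets the same query search the index instead of scanning.
plan_after = plan(conn, query, ("customer_7",))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, so the sketch prints it rather than relying on a fixed string.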
Ease of Working with Structured Data
 Transaction processing: An RDBMS supports the Atomicity, Consistency,
Isolation, and Durability (ACID) properties of transactions.
 Atomicity: A transaction is atomic: either it happens in its entirety or
not at all.
 Consistency: The database moves from one consistent state to another
consistent state. In other words, if the same piece of information is
stored at two or more places, they are in complete agreement.
 Isolation: Resources are allocated such that each transaction gets the
impression that it is the only transaction running.
 Durability: All changes made to the database by a committed transaction
are permanent.
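Atomicity in particular can be demonstrated with a short sketch. It assumes SQLite through Python's sqlite3 module; the `accounts` table, the balances, and the `transfer` helper are hypothetical. A transfer that would overdraw an account fails halfway through, and the part that had already succeeded is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint rejects any negative balance.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates happen or neither does."""
    try:
        # 'with conn' commits on success and rolls back if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

# Alice has only 100, so the debit violates the CHECK constraint.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The credit to bob executes first and succeeds, yet it does not survive: the failed debit rolls the whole transaction back, which is the "entirety or not at all" behaviour described above.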
Ease with Structured Data
 Input / Update / Delete
 Security
 Indexing / Searching
 Scalability
 Transaction Processing
Semi-structured Data
 This is data that does not conform to a data model but has some
structure.
 However, it is not in a form that can be used easily by a computer
program.
 Examples: emails, XML, markup languages like HTML, etc. Metadata for this
data is available but is not sufficient.
Semi-structured Data
It has the following features:
 It does not conform to the data models that one typically
associates with relational databases or any other form of
data tables.
 It uses tags to segregate semantic elements.
 Tags are also used to enforce hierarchies of records and
fields within data.
 There is no separation between the data and the schema.
 The amount of structure used is dictated by the purpose
at hand.
 In semi-structured data, entities belonging to the same
class and also grouped together need not necessarily
have the same set of attributes.
 And if at all, they have the same set of attributes, the
Sources of Semi-structured Data
 Amongst the sources of semi-structured data, the front runners are “XML”
and “JSON”.
 XML: eXtensible Markup Language (XML) was widely popularized by web
services built on the Simple Object Access Protocol (SOAP).
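A small sketch may help show why XML counts as semi-structured: tags mark up the semantic elements and the hierarchy, yet two records of the same kind need not carry the same fields. The XML snippet below is hypothetical (only the first title and publisher echo the sample used later in the slides); it is parsed with Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: two <book> records with different sets of child elements.
xml_text = """
<books>
  <book id="1">
    <title>Fundamentals of Business Analytics</title>
    <publisher>Wiley India</publisher>
  </book>
  <book id="2">
    <title>A Hypothetical Second Book</title>
    <year>2015</year>
    <author>Jane Doe</author>
  </book>
</books>
"""

root = ET.fromstring(xml_text.strip())

# Each record describes itself: the tag names act as the field names.
records = [{child.tag: child.text for child in book} for book in root.findall("book")]
fields = [sorted(record) for record in records]

for book, names in zip(root.findall("book"), fields):
    print(book.get("id"), names)
```

No schema was declared anywhere; the structure is recovered from the tags that travel with the data.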
Sources of Semi-structured Data
Semi-structured data comes from:
 XML (eXtensible Markup Language)
 Other markup languages
 JSON (JavaScript Object Notation)
Characteristics of Semi-structured Data
 Inconsistent structure
 Self-describing (label/value pairs)
 Schema information is often blended with data values
 Data objects may have different attributes not known beforehand
Sources of Semi-structured Data
 JSON: JavaScript Object Notation (JSON) is used to transmit data between
a server and a web application.
 JSON was popularized by web services built on Representational State
Transfer (REST), an architectural style for creating scalable web
services.
 MongoDB (open-source, distributed, NoSQL, document-oriented database) and
Couchbase (originally known as Membase; open-source, distributed, NoSQL,
document-oriented database) store data as JSON documents (MongoDB uses
BSON, a binary-encoded form of JSON, internally).
Sources of Semi-structured
Data
An example of HTML is as follows:
<HTML>
<HEAD>
<TITLE>Place your title here</TITLE>
</HEAD>
<BODY BGCOLOR="FFFFFF">
<CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"></CENTER>
<HR> <a href="http://bigdatauniversity.com">Link Name</a>
<H1>this is a Header</H1>
<H2>this is a sub Header</H2>
Send me mail at <a href="mailto:[email protected]">
[email protected]</a>.
<P>a new paragraph!
<P><B>a new paragraph!</B>
<BR><B><I>this is a new sentence without a paragraph break, in bold italics.</I></B>
<HR>
</BODY>
</HTML>
Sources of Semi-structured
Data
Sample JSON document
{
"_id": 9,
"BookTitle": "Fundamentals of Business Analytics",
"AuthorName": "Seema Acharya",
"Publisher": "Wiley India",
"YearofPublication": "2011"
}
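To see how such documents behave in practice, here is a minimal sketch using Python's standard json module. The first document mirrors the sample above; the second document and its `Edition` field are hypothetical, added only to show that two documents in the same collection need not share the same attributes.

```python
import json

# First document: the sample from the slide. Second document: hypothetical.
docs_text = """
[
  {"_id": 9, "BookTitle": "Fundamentals of Business Analytics",
   "AuthorName": "Seema Acharya", "Publisher": "Wiley India",
   "YearofPublication": "2011"},
  {"_id": 10, "BookTitle": "A Hypothetical Title", "Edition": 2}
]
"""

docs = json.loads(docs_text)

# The "schema" is whatever keys happen to appear across the documents.
all_fields = sorted(set().union(*(doc.keys() for doc in docs)))

# For each document, list the fields that other documents have but it lacks.
missing = {doc["_id"]: [f for f in all_fields if f not in doc] for doc in docs}

print(all_fields)
print(missing)
```

A relational table would force both rows into one fixed set of columns; here each document simply carries the label/value pairs it has.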
Unstructured Data
 This is data that does not conform to a data model or is not in a form
that can be used easily by a computer program.
 About 80–90% of an organization's data is in this format.
 Examples: memos, chat rooms, PowerPoint presentations, images, videos,
letters, research, white papers, body of an email, etc.
Sources of Unstructured Data
Unstructured data comes from:
 Web pages
 Images
 Free-form text
 Audio
 Videos
 Body of email
 Text messages
 Chats
 Social media data
 Word documents
Issues with Terminology – Unstructured Data
 Structure can be implied despite not being formally defined.
 Data with some structure may still be labeled unstructured if the
structure doesn’t help with the processing task at hand.
 Data may have some structure or may even be highly structured in ways
that are unanticipated or unannounced.
How to Deal with Unstructured Data?
 Today, unstructured data constitutes approximately 80% of the data that
is being generated in any enterprise.
Dealing with Unstructured Data
 Data Mining
 Natural Language Processing (NLP)
 Text Analytics
 Noisy Text Analytics

Issues with "Unstructured" Data
 Data Mining:
 First, we deal with large data sets.
 Second, we use methods at the intersection of artificial intelligence,
machine learning, statistics, and database systems to unearth consistent
patterns in large data sets and/or systematic relationships between
variables.
 It is the analysis step of the “knowledge discovery in databases”
process.

Issues with "Unstructured" Data
A few popular data mining algorithms are as follows:
 Association rule mining:
 It is also called “market basket analysis” or “affinity analysis”.
 It is used to determine “What goes with what?”
 It is about which other product you are likely to purchase when you buy a
given product.
 For example, if you pick up bread from
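The "what goes with what?" question is usually answered with two measures, support and confidence. The sketch below computes both for a rule "bread → butter"; the five shopping baskets are hypothetical and are not taken from the slides.

```python
# Hypothetical shopping baskets (one set of items per transaction).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)

rule_support = support({"bread", "butter"}, baskets)          # 3 of 5 baskets
rule_confidence = confidence({"bread"}, {"butter"}, baskets)  # 3 of the 4 bread baskets

print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Algorithms such as Apriori automate the search for rules whose support and confidence exceed chosen thresholds; this sketch only evaluates one rule by hand.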
Issues with "Unstructured" Data
 Regression analysis:
 It helps to model the relationship between variables so that the value
of one can be predicted from the others.
 The variable whose value needs to be predicted is called the dependent
variable, and the variables which are used to predict the value are
referred to as the independent variables.
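For a single independent variable, the simplest case is a straight-line fit by ordinary least squares. The sketch below is written in plain Python; the variable names and the data (hours studied versus test score) are hypothetical and chosen so that the points lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: independent variable = hours studied,
# dependent variable = test score.
hours = [1, 2, 3, 4, 5]
scores = [52, 54, 56, 58, 60]

slope, intercept = fit_line(hours, scores)
predicted = intercept + slope * 6  # predict the score for 6 hours

print(slope, intercept, predicted)
```

Real data will not sit exactly on a line; the same formulas then give the line that minimizes the squared prediction errors.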
Issues with "Unstructured" Data
 Collaborative filtering:
 It is about predicting a user’s preference or preferences based on the
preferences of a group of users.
 For example, take a look at the table on the next slide.
 We are looking at predicting whether User 4 will prefer to learn using
videos or is a textual learner, depending on one or a couple of his or her
known preferences.
 We analyze the preferences of similar user profiles and, on that basis,
predict that User
4 will also like to learn using videos and is not a
Issues with "Unstructured" Data
Table: Sample records depicting learners’ preferences for mode of learning
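The table itself did not survive in this copy, so the sketch below uses entirely hypothetical preference data to illustrate the idea: find the users whose known preferences resemble User 4's, and let them vote, weighted by similarity, on the preference that is unknown for User 4. The attribute names and the matching-fraction similarity are assumptions made for this illustration.

```python
# Hypothetical preferences (1 = yes, 0 = no); not the table from the slides.
known_users = {
    "User 1": {"learns_with_videos": 1, "likes_quizzes": 1, "reads_long_text": 0},
    "User 2": {"learns_with_videos": 1, "likes_quizzes": 1, "reads_long_text": 0},
    "User 3": {"learns_with_videos": 0, "likes_quizzes": 0, "reads_long_text": 1},
}
# User 4: only a couple of preferences are known.
user_4 = {"likes_quizzes": 1, "reads_long_text": 0}

def similarity(a, b):
    """Fraction of shared attributes on which two users agree."""
    shared = [key for key in a if key in b]
    if not shared:
        return 0.0
    return sum(a[key] == b[key] for key in shared) / len(shared)

def predict(target, others, attribute):
    """Similarity-weighted average of the other users' values for attribute."""
    weighted_sum = 0.0
    total_weight = 0.0
    for prefs in others.values():
        if attribute not in prefs:
            continue
        weight = similarity(target, prefs)
        weighted_sum += weight * prefs[attribute]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None

score = predict(user_4, known_users, "learns_with_videos")
prefers_videos = score is not None and score >= 0.5

print(score, prefers_videos)
```

Users 1 and 2 agree with User 4 on everything that is known, User 3 on nothing, so the prediction follows Users 1 and 2.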
Issues with "Unstructured" Data
 Text Analytics or Text Mining: Compared to the structured data stored in
relational databases, text is largely unstructured, amorphous, and
difficult to deal with algorithmically.
 Text mining is the process of gleaning high-quality and meaningful
information (through devising of patterns and trends by means of
statistical pattern learning) from text.
 It includes tasks such as text categorization,
Issues with "Unstructured" Data
 Natural language processing (NLP): It is related to the area of
human-computer interaction. It is about enabling computers to understand
human or natural language input.
 Noisy text analytics: It is the process of extracting structured or
semi-structured information from noisy unstructured data such as chats,
blogs, wikis, emails, message boards, text messages, etc.
 The noisy unstructured data usually comprises one
or more of the following: Spelling mistakes,
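One way to handle spelling mistakes in noisy text is to map each token to its closest entry in a known vocabulary. The sketch below uses `difflib.get_close_matches` from Python's standard library; the vocabulary, the sample message, the `normalize` helper, and the 0.8 cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical vocabulary of correctly spelled words.
vocabulary = ["receive", "message", "tomorrow", "the"]

def normalize(text, vocabulary, cutoff=0.8):
    """Replace each token with its closest vocabulary word, if one is similar enough."""
    cleaned_tokens = []
    for token in text.lower().split():
        # Returns at most one match whose similarity ratio is >= cutoff.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        cleaned_tokens.append(matches[0] if matches else token)
    return " ".join(cleaned_tokens)

cleaned = normalize("Recieve the mesage tomorow", vocabulary)
print(cleaned)
```

Tokens with no sufficiently close vocabulary entry are left untouched, so unknown words pass through instead of being forced onto a wrong correction.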
Issues with "Unstructured" Data
 Manual tagging with metadata: This is about tagging manually with
adequate metadata to provide the requisite semantics to understand
unstructured data.
 Part-of-speech tagging: It is also called POS or POST or grammatical
tagging. It is the process of reading text and tagging each word in the
sentence as belonging to a particular part of speech such as “noun”,
“verb”, “adjective”, etc.
Issues with "Unstructured" Data
 Unstructured Information Management Architecture (UIMA): It is an open
source platform from IBM. It is used for real-time content analytics.
 It is about processing text and other unstructured data to find latent
meaning and relevant relationships buried therein. Read up more on UIMA at
http://www.ibm.com/developerworks/data/downloads/uima/
Summary
 Structured data: It conforms to a data model. For example, an RDBMS
conforms to the relational data model. It has a pre-defined schema.
 Semi-structured data: For this format of data, some metadata is
available, but it is insufficient. Semi-structured data has a
self-describing structure. There is little or no separation between data
and schema.
 Unstructured data: This data is growing by leaps and bounds. It has
innumerable sources such as human generated (social media data, emails,
word documents, presentations, audio and video files that we create
Answer a few quick questions …

 Match the following

Column A                   Column B
NLP                        Content analytics
Text analytics             Text messages
UIMA                       Chats
Noisy unstructured data    Text mining
Data mining                Comprehend human or natural language input
Noisy unstructured data    Uses methods at the intersection of statistics, Artificial Intelligence, machine learning & DBs
IBM                        UIMA
Questions to Answer
 Which category (structured, semi-structured, or unstructured) will you
place a web page in?
 Which category (structured, semi-structured, or unstructured) will you
place a Word document in?
 State a few examples of human-generated and machine-generated data.
