BIG DATA System: Big Data and Analytics by Seema Acharya and Subhashini Chellappan

The document provides an overview of Big Data and its classification into structured, semi-structured, and unstructured data, detailing their characteristics, sources, and challenges. It also discusses the significance of Big Data, its definition, and contrasts traditional business intelligence with Big Data environments. Additionally, it highlights the importance of data characteristics such as volume, velocity, and variety, along with the challenges faced in managing Big Data.



BIG DATA System

Big Data and Analytics by Seema Acharya and Subhashini Chellappan
Learning Objectives and Learning Outcomes

Learning Objectives

Introduction to digital data and its types
1. Structured data: Sources of structured data, ease with structured data, etc.
2. Semi-structured data: Sources of semi-structured data, characteristics of semi-structured data.
3. Unstructured data: Sources of unstructured data, issues with terminology, dealing with unstructured data.

Learning Outcomes

a) To differentiate between structured, semi-structured and unstructured data.
b) To understand the need to integrate structured, semi-structured and unstructured data.
Agenda

Types of Digital Data

Structured
❖ Sources of structured data
❖ Ease with structured data

Semi-Structured
❖ Sources of semi-structured data

Unstructured
❖ Sources of unstructured data
❖ Issues with terminology
❖ Dealing with unstructured data
Classification of Digital Data
Digital data is classified into the following categories:

Structured data: Data in an organized form (e.g., rows and columns) that a
computer program can use easily. Relationships exist between entities of data,
such as classes and their objects. Data stored in databases is an example of
structured data.

Semi-structured data: Data that does not conform to a data model but has some
structure. However, it is not in a form that a computer program can use
easily. Examples include emails, XML, and markup languages such as HTML.

Unstructured data: Data that does not conform to a data model or is not in a
form that a computer program can use easily. About 80%-90% of an organization's
data is in this format, for example memos, chat rooms, PowerPoint
presentations, images, videos, and letters.
Approximate Distribution of Digital Data

Approximate percentage distribution of digital data

Structured Data

This is data in an organized form (e.g., in rows and columns) that can be
easily used by a computer program.
In structured data, every row in a table has the same set of columns.

Data stored in databases is an example of structured data.
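
The "same set of columns" point can be illustrated with a small sketch using Python's built-in sqlite3 module. The table name, column names, and rows below are hypothetical and chosen only for illustration; they do not come from the slides.

```python
import sqlite3

# Hypothetical table and values, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL, dept TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO employee (emp_id, name, dept) VALUES (?, ?, ?)",
    [(1, "Asha", "Finance"), (2, "Ravi", "IT"), (3, "Meera", "IT")],
)

# The schema is known up front: every row exposes the same columns.
columns = [row[1] for row in conn.execute("PRAGMA table_info(employee)")]

# Because the structure is fixed, a program can query it directly.
it_staff = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM employee WHERE dept = ? ORDER BY emp_id", ("IT",)
    )
]
print(columns)
print(it_staff)
```

Because the schema is declared before any data is inserted, the program never has to inspect individual rows to learn which fields exist.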

Sources of Structured Data

• Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
• Spreadsheets
• OLTP systems
Ease with Structured Data

• Input / Update / Delete
• Security
• Indexing / Searching
• Scalability
• Transaction Processing (ACID properties)
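
As a rough illustration of transaction processing, the sketch below shows atomicity (the "A" in ACID) with Python's sqlite3 module: either both updates of a transfer are applied or neither is. The `account` table, the `transfer` helper, and the amounts are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the CHECK constraint forbids negative balances.
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)       # succeeds
failed = transfer(conn, 1, 2, 500)  # violates the CHECK, so it is rolled back
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, failed, balances)
```

In the failing call the credit to the destination account runs first, yet it is undone when the debit violates the constraint, so no partial transfer is left behind.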
Semi-structured Data

This is data that does not conform to a data model but has some structure.
However, it is not in a form that can be used easily by a computer program.

Examples: emails, XML, markup languages like HTML, etc. Metadata for this data
is available but is not sufficient.
Sources of Semi-structured Data

• XML (eXtensible Markup Language)
• Other markup languages
• JSON (JavaScript Object Notation)
Characteristics of Semi-structured Data

• Inconsistent structure
• Self-describing (label/value pairs)
• Schema information is often blended with data values
• Data objects may have different attributes not known beforehand
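
The characteristics above can be seen in a short JSON sketch. The two records are invented for illustration; the point is only that each object carries its own labels and that the attribute set differs from one object to the next.

```python
import json

# Hypothetical records, invented for illustration.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phones": ["555-0100", "555-0101"], "city": "Pune"}
]
"""
records = json.loads(raw)

# Labels travel with the values, so attributes are discovered at read time.
all_keys = sorted({key for record in records for key in record})
per_record = [sorted(record) for record in records]
print(all_keys)
print(per_record)
```

A program reading this data has to inspect each object to learn which attributes it has, which is what makes it harder to process than a fixed-schema table.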
Unstructured Data

This is data that does not conform to a data model or is not in a form that
can be used easily by a computer program.

About 80–90% of an organization's data is in this format.

Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters,
research reports, white papers, the body of an email, etc.
Sources of Unstructured Data

• Web pages
• Images
• Free-form text
• Audio
• Videos
• Body of email
• Text messages
• Chats
• Social media data
• Word documents
Issues with terminology – Unstructured Data

• Structure can be implied despite not being formally defined.
• Data with some structure may still be labeled unstructured if the structure
  doesn't help with the processing task at hand.
• Data may have some structure or may even be highly structured in ways that
  are unanticipated or unannounced.
Dealing with Unstructured Data

• Data Mining
• Natural Language Processing (NLP)
• Text Analytics
• Noisy Text Analytics

Dealing with Unstructured Data

▪ Data Mining
  • Association Rule Mining
  • Regression Analysis
  • Collaborative Filtering
▪ Text Analysis and Text Mining
▪ Natural Language Processing (NLP)
▪ Noisy Text Analysis
▪ Manual tagging with metadata
▪ Part-of-speech tagging
▪ Unstructured Information Management Architecture (UIMA)
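
To make one of the listed techniques concrete, here is a minimal sketch of association rule mining restricted to item pairs. It assumes that each document has already been reduced to a set of tags (for example by manual tagging); the transactions, the `pair_rules` helper, and the 0.5 support threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: sets of tags already extracted from documents.
transactions = [
    {"invoice", "payment", "overdue"},
    {"invoice", "payment"},
    {"invoice", "refund"},
    {"payment", "overdue"},
]


def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of transactions containing both items
        if support < min_support:
            continue
        # confidence(a -> b) = count(a and b) / count(a)
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules)


rules = pair_rules(transactions)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent}: "
          f"support={support:.2f}, confidence={confidence:.2f}")
```

Real association rule miners such as Apriori also handle larger itemsets and prune candidates; this sketch only shows how support and confidence are computed.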

Answer a few quick questions …
Answer Me

Which category (structured, semi-structured, or unstructured) will you place
a Web page in?

Which category (structured, semi-structured, or unstructured) will you place
a Word document in?

State a few examples of human-generated and machine-generated data.

Summary please…

Ask a few participants of the learning program to summarize the lecture.

Properties | Structured data | Semi-structured data | Unstructured data
Technology | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data
Transaction management | Mature transaction management and various concurrency techniques | Transaction management is adapted from DBMS and is not mature | No transaction management and no concurrency
Version management | Versioning over tuples, rows, and tables | Versioning over tuples or graphs is possible | Versioned as a whole
Flexibility | Schema dependent and less flexible | More flexible than structured data but less flexible than unstructured data | More flexible; there is no schema
Scalability | Very difficult to scale the DB schema | Scaling is simpler than for structured data | More scalable
Robustness | Very robust | New technology, not very widespread | -
Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible
References …
Further Readings

http://data-magnum.com/the-big-deal-about-big-data-whats-inside-structured-unstructured-and-semi-structured-data/

http://www.webopedia.com/TERM/S/structured_data.html

http://en.wikipedia.org/wiki/UIMA
Thank you
Chapter 2

Introduction to Big Data

Learning Objectives and Learning Outcomes

Learning Objectives

Introduction to big data
1. Definition of big data.
2. Challenges of big data.
3. Why big data?
4. Traditional Business Intelligence versus big data.

Learning Outcomes

a) To understand the significance of big data.
b) To understand the other characteristics of data that are not definitional characteristics of big data.
c) To understand the challenges of big data and how to deal with the same.
d) To understand what is new today.
Agenda

Definition of Big Data

❖ Volume
❖ Velocity
❖ Variety
Challenges of Big Data
Other Characteristics of Data Which are Not Definitional Traits of Big Data
Why Big Data?
Traditional Business Intelligence (BI) versus Big Data
❖ A Typical Data Warehouse Environment
❖ A Typical Hadoop Environment
❖ Coexistence of Big Data and Data Warehouse
Characteristics of Data

Data has three characteristics:

1. Composition: deals with the structure of data, that is, the sources of data, the granularity,
the types, and the nature of the data (whether it is static or real-time streaming).

2. Condition: deals with the state of the data, that is, “Can one use this data as is for
analysis?” or “Does it require cleansing for further enhancement and enrichment?”

3. Context: deals with “Where has this data been generated?”, “Why was this data
generated?” and so on.
Definition of Big Data
• High-volume, high-velocity, high-variety
• Cost-effective, innovative forms of information processing
• Enhanced insight and decision making

Big Data is high-volume, high-velocity, and high-variety information assets
that demand cost-effective, innovative forms of information processing for
enhanced insight and decision making.

Source: Gartner IT Glossary
Volume - A Mountain of Data

1 Kilobyte (KB) = 1,000 bytes

1 Megabyte (MB) = 1,000,000 bytes
1 Gigabyte (GB) = 1,000,000,000 bytes
1 Terabyte (TB) = 1,000,000,000,000 bytes
1 Petabyte (PB) = 1,000,000,000,000,000 bytes
1 Exabyte (EB) = 1,000,000,000,000,000,000 bytes
1 Zettabyte (ZB) = 1,000,000,000,000,000,000,000 bytes
1 Yottabyte (YB) = 1,000,000,000,000,000,000,000,000 bytes
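
The table above uses decimal units, where each step is a factor of 1,000. A small sketch of a formatter built on that convention follows; the function name `human_readable` and the sample values are illustrative assumptions.

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes: int) -> str:
    """Format a byte count using the decimal (power-of-1000) units above."""
    exponent = 0
    # Step up one unit at a time while the value still fills the next unit.
    while exponent < len(UNITS) - 1 and num_bytes >= 1000 ** (exponent + 1):
        exponent += 1
    value = num_bytes / 1000 ** exponent
    return f"{value:g} {UNITS[exponent]}"


print(human_readable(999))
print(human_readable(1000))
print(human_readable(2_500_000_000_000_000))
```

Note that many operating systems report sizes in powers of 1,024 instead, so figures from those tools will not match this decimal convention exactly.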
Volume

Where does this data get generated?

1. Typical internal sources:
• Data storage: file systems, SQL, NoSQL (MongoDB, Cassandra).
• Archives: archives of scanned documents, paper archives, customer records,
patient health records, etc.
2. External data sources:
• Public web: Wikipedia, weather, regulatory, census, etc.
3. Both (internal + external):
• Sensor data: car sensors, smart electric meters, office buildings, etc.
• Machine log data: event logs, application logs, business process logs, audit logs, etc.
• Social media: Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
• Business apps: ERP, CRM, HR, Google Docs, and so on.
• Media: audio, video, image, podcast, etc.
• Docs: CSV, Word documents, PDF, XLS, PPT, and so on.
Sources of Big Data
Velocity

Batch → Periodic → Near real time → Real-time processing
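
The two ends of this progression can be contrasted in a minimal sketch: a batch computation waits for the full data set, while a streaming computation updates its result as each value arrives. The function names and the sample readings are hypothetical.

```python
from typing import Iterable, Iterator


def batch_mean(readings: list[float]) -> float:
    """Batch style: wait for the full data set, then compute once."""
    return sum(readings) / len(readings)


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Streaming style: emit an updated mean as each reading arrives."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 60.0]  # hypothetical sensor readings
batch_result = batch_mean(readings)
stream_results = list(running_mean(readings))
print(batch_result)
print(stream_results)
```

The streaming version keeps only a count and a running total, so it can produce an answer at any moment without storing the whole history.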
Variety

Structured data: for example, traditional transaction processing systems and
RDBMS.

Semi-structured data: for example, HyperText Markup Language (HTML) and
eXtensible Markup Language (XML).

Unstructured data: for example, unstructured text documents, audio, video,
email, photos, PDFs, social media, etc.
Other Characteristics of Data Which are Not Definitional Traits of Big Data

• Veracity and Validity: Veracity refers to biases, noise, and abnormality in data.
Validity refers to the accuracy and correctness of the data.

• Volatility: deals with how long the data is valid and how long it should be stored.

• Variability: data flows can be highly inconsistent, with periodic peaks.

Challenges with Big Data
• Capture
• Storage
• Curation
• Analysis
• Transfer
• Visualization
• Privacy Violations
Why Big Data?
More data → More accurate analysis → More confidence in decision making →
Greater operational efficiencies, cost reduction, time reduction, new product
development, optimized offerings, etc.
Traditional Business Intelligence (BI) versus Big Data

1. In a traditional BI environment, all the enterprise’s data is housed in a
central server, whereas in a big data environment data resides in a
distributed file system. The distributed file system scales horizontally
(by scaling in or out), whereas a typical database server scales
vertically.
2. In traditional BI, data is generally analyzed in offline mode, whereas
in big data it is analyzed in both real-time and offline modes.
3. Traditional BI is about structured data, and it is here that data is taken
to the processing functions, whereas big data is about variety, and here the
processing functions are taken to the data.
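
The idea of taking the processing functions to the data (point 3) can be sketched in plain Python, in the spirit of MapReduce. This is a single-process illustration, not Hadoop code: the `partitions` list merely stands in for blocks stored on different nodes, and all names and values are hypothetical.

```python
from collections import Counter
from functools import reduce

# Hypothetical data blocks, standing in for partitions of a distributed file system.
partitions = [
    ["error", "info", "info"],
    ["warn", "error"],
    ["info"],
]


def count_locally(block):
    """Map step: runs where the block lives and returns only a small summary."""
    return Counter(block)


def merge(left, right):
    """Reduce step: combines the per-block summaries into one result."""
    return left + right


local_counts = [count_locally(block) for block in partitions]
totals = reduce(merge, local_counts, Counter())
print(totals)
```

Only the small per-block summaries would need to cross the network in a real cluster, which is why this style scales better than shipping all raw data to one central server.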
A Typical Data Warehouse Environment

Sources: ERP, CRM, Legacy, 3rd party apps → Data Warehouse → Reporting /
Dashboarding, OLAP, Ad hoc querying, Modeling

Co-existence of Big Data and Data Warehouse

Big data sources: Web logs, images and videos, social media (Twitter, Facebook, etc.), docs & PDFs
Hadoop: HDFS, MapReduce
Warehouse side: operational systems, data warehouse, data marts, ODS

What is changing in the realms of Big data

•Competitive Advantage
•Decision Making
•Value of Data
It's time for an activity…
Teams Games Tournaments
Answer Me

Share your understanding of Big Data.

How is a traditional BI environment different from a Big Data environment?

Share your experience as a customer on an e-commerce site. Comment on the
big data that gets created on a typical e-commerce site.
Summary please…

Ask a few participants of the learning program to summarize the lecture.

References …
Further Readings

Big Data for Dummies - Judith Hurwitz, Alan Nugent, Fern Halper,
Marcia Kaufman
http://en.wikipedia.org/wiki/Big_data
http://www.sas.com/en_us/insights/big-data/what-is-big-data.html
https://www.oracle.com/bigdata/
http://bigdatauniversity.com/
THANK YOU

