
BIG DATA System: Big Data and Analytics by Seema Acharya and Subhashini Chellappan

The document provides an overview of Big Data and its classification into structured, semi-structured, and unstructured data, detailing their characteristics, sources, and challenges. It also discusses the significance of Big Data, its definition, and contrasts traditional business intelligence with Big Data environments. Additionally, it highlights the importance of data characteristics such as volume, velocity, and variety, along with the challenges faced in managing Big Data.

Uploaded by

billy973171

BIG DATA System

Big Data and Analytics by Seema Acharya and Subhashini Chellappan

Learning Objectives and Learning Outcomes

Learning Objectives

Introduction to digital data and its types

1. Structured data: Sources of structured data, ease with structured data, etc.
2. Semi-structured data: Sources of semi-structured data, characteristics of semi-structured data.
3. Unstructured data: Sources of unstructured data, issues with terminology, dealing with unstructured data.

Learning Outcomes

a) To differentiate between structured, semi-structured and unstructured data.
b) To understand the need to integrate structured, semi-structured and unstructured data.

Agenda

Types of Digital Data

Structured
❖ Sources of structured data
❖ Ease with structured data

Semi-Structured
❖ Sources of semi-structured data

Unstructured
❖ Sources of unstructured data
❖ Issues with terminology
❖ Dealing with unstructured data

Classification of Digital Data
Digital data is classified into the following categories:

Structured data: Data in an organized form (e.g., rows and columns) that can be easily used by a computer program. Relationships exist between entities of data, such as classes and their objects. Data stored in databases is an example of structured data.

Semi-structured data: Data that does not conform to a data model but has some structure. However, it is not in a form that can be used easily by a computer program. Examples include emails, XML, and markup languages such as HTML.

Unstructured data: Data that does not conform to a data model or is not in a form that can be used easily by a computer program. About 80%-90% of an organization's data is in this format. Examples include memos, chat rooms, PowerPoint presentations, images, videos, and letters.

Approximate Distribution of Digital Data

Figure: Approximate percentage distribution of digital data

Structured Data

This is data in an organized form (e.g., in rows and columns) that can be easily used by a computer program.

In structured data, every row in a table has the same set of columns.

Data stored in databases is an example of structured data.
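
To make the "same set of columns in every row" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The employee table, its columns, and the sample rows are illustrative assumptions, not part of the original slides.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
)
conn.executemany(
    "INSERT INTO employee (emp_id, name, dept) VALUES (?, ?, ?)",
    [(1, "Asha", "Finance"), (2, "Ravi", "Sales"), (3, "Meera", "Sales")],
)

# Every row comes back with the same three columns, in the same order.
cursor = conn.execute("SELECT emp_id, name, dept FROM employee ORDER BY emp_id")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(columns)
for row in rows:
    print(row)
```

Because the schema is fixed up front, a program can rely on each row having exactly these fields without inspecting the data first.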


Sources of Structured Data

• Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
• Spreadsheets
• OLTP systems
Ease with Structured Data

• Input / Update / Delete
• Security
• Indexing / Searching
• Scalability
• Transaction Processing (ACID properties)
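
Several of the points above (update, indexing and searching, transaction processing) can be sketched in a few lines with Python's built-in sqlite3 module. The account table, the index name, and the transfer helper are hypothetical names chosen for illustration; this is one possible sketch, not a reference implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint rejects negative balances (hypothetical business rule).
conn.execute(
    "CREATE TABLE account ("
    "acc_id INTEGER PRIMARY KEY, owner TEXT, balance INTEGER CHECK (balance >= 0))"
)
# An index on owner supports fast lookups by name.
conn.execute("CREATE INDEX idx_account_owner ON account (owner)")
conn.executemany(
    "INSERT INTO account VALUES (?, ?, ?)", [(1, "Asha", 100), (2, "Ravi", 50)]
)
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE acc_id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE acc_id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)       # succeeds
failed = transfer(conn, 2, 1, 500)  # violates the CHECK, so both updates are undone
balances = dict(conn.execute("SELECT acc_id, balance FROM account"))
ravi = conn.execute("SELECT balance FROM account WHERE owner = ?", ("Ravi",)).fetchone()
print(ok, failed, balances, ravi)
```

The second transfer shows atomicity: the credit to account 1 is rolled back together with the rejected debit, so neither balance changes.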
Semi-structured Data

This is data that does not conform to a data model but has some structure. However, it is not in a form that can be used easily by a computer program.

Examples: emails, XML, markup languages such as HTML, etc. Metadata for this data is available but is not sufficient.
Sources of Semi-structured Data

• XML (eXtensible Markup Language)
• Other markup languages
• JSON (JavaScript Object Notation)
Characteristics of Semi-structured Data

• Inconsistent structure
• Self-describing (label/value pairs)
• Schema information is often blended with data values
• Data objects may have different attributes not known beforehand
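
The characteristics above can be seen in a small JSON sample. The records below are invented for illustration; the sketch only shows that each object carries its own labels and that the full attribute set has to be discovered while reading the data.

```python
import json

# Three hypothetical contact records; no two share the same attribute set.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "skills": ["SQL", "Hadoop"]},
  {"name": "Meera", "address": {"city": "Pune"}}
]
"""
records = json.loads(raw)

# The labels travel with the values, so the "schema" is found at read time.
all_keys = sorted({key for record in records for key in record})
print(all_keys)

# Report which attributes each record lacks.
missing_by_name = {
    record["name"]: [key for key in all_keys if key not in record]
    for record in records
}
for name, missing in missing_by_name.items():
    print(name, "lacks", missing)
```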
Unstructured Data

This is data that does not conform to a data model or is not in a form that can be used easily by a computer program.

About 80–90% of an organization's data is in this format.

Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research, white papers, body of an email, etc.
Sources of Unstructured Data

• Web pages
• Images
• Free-form text
• Audio
• Videos
• Body of email
• Text messages
• Chats
• Social media data
• Word documents
Issues with terminology – Unstructured Data

• Structure can be implied despite not being formally defined.
• Data with some structure may still be labeled unstructured if the structure doesn't help with the processing task at hand.
• Data may have some structure or may even be highly structured in ways that are unanticipated or unannounced.
Dealing with Unstructured Data

▪ Data mining
  • Association rule mining
  • Regression analysis
  • Collaborative filtering
▪ Text analytics and text mining
▪ Natural Language Processing (NLP)
▪ Noisy text analytics
▪ Manual tagging with metadata
▪ Part-of-speech tagging
▪ Unstructured Information Management Architecture (UIMA)
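
As a rough illustration of text analytics on noisy text, the sketch below normalizes a few chat-style messages and counts term frequencies. The messages, the NORMALIZE map, and the STOPWORDS set are all made-up assumptions; real noisy-text pipelines are far more involved.

```python
import re
from collections import Counter

# Hypothetical customer messages with chat spellings and punctuation noise.
messages = [
    "Gr8 service!!! delivery was FAST :)",
    "delivery late again... not gr8",
    "Fast delivery, great service",
]

# Tiny hand-written lookup tables, assumed for this example only.
NORMALIZE = {"gr8": "great"}
STOPWORDS = {"was", "again"}


def tokens(text):
    """Lowercase, strip punctuation, drop stopwords, and normalize spellings."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [NORMALIZE.get(word, word) for word in words if word not in STOPWORDS]


counts = Counter(token for message in messages for token in tokens(message))
print(counts.most_common(3))
```

Once free-form text is reduced to counted terms like this, it can be stored and queried much like structured data.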


Answer a few quick questions …
Answer Me

Which category (structured, semi-structured, or unstructured) will you place a Web page in?

Which category (structured, semi-structured, or unstructured) will you place a Word document in?

State a few examples of human-generated and machine-generated data.

Summary please…

Ask a few participants of the learning program to summarize the lecture.


Comparison of Structured, Semi-structured and Unstructured Data

Technology
  Structured data: It is based on relational database tables.
  Semi-structured data: It is based on XML/RDF (Resource Description Framework).
  Unstructured data: It is based on character and binary data.

Transaction management
  Structured data: Matured transaction management and various concurrency techniques.
  Semi-structured data: Transaction management is adapted from DBMS and is not matured.
  Unstructured data: No transaction management and no concurrency.

Version management
  Structured data: Versioning over tuples, rows, and tables.
  Semi-structured data: Versioning over tuples or graphs is possible.
  Unstructured data: Versioned as a whole.

Flexibility
  Structured data: It is schema dependent and less flexible.
  Semi-structured data: It is more flexible than structured data but less flexible than unstructured data.
  Unstructured data: It is more flexible and there is an absence of schema.

Scalability
  Structured data: It is very difficult to scale the DB schema.
  Semi-structured data: Its scaling is simpler than that of structured data.
  Unstructured data: It is more scalable.

Robustness
  Structured data: Very robust.
  Semi-structured data: New technology, not very widespread.
  Unstructured data: -

Query performance
  Structured data: Structured queries allow complex joins.
  Semi-structured data: Queries over anonymous nodes are possible.
  Unstructured data: Only textual queries are possible.
References …
Further Readings

http://data-magnum.com/the-big-deal-about-big-data-whats-inside-structured-unstructured-and-semi-structured-data/

http://www.webopedia.com/TERM/S/structured_data.html

http://en.wikipedia.org/wiki/UIMA
Thank you
Chapter 2

Introduction to Big Data


Learning Objectives and Learning Outcomes

Learning Objectives

Introduction to big data

1. Definition of big data.
2. Challenges of big data.
3. Why big data?
4. Traditional Business Intelligence versus big data.

Learning Outcomes

a) To understand the significance of big data.
b) To understand the other characteristics of data that are not definitional characteristics of big data.
c) To understand the challenges of big data and how to deal with the same.
d) To understand what is new today.
Agenda

Definition of Big Data


❖ Volume
❖ Velocity
❖ Variety
Challenges of Big Data
Other Characteristics of Data Which are Not Definitional Traits of Big Data
Why Big Data?
Traditional Business Intelligence (BI) versus Big Data
❖ A Typical Data Warehouse Environment
❖ A Typical Hadoop Environment
❖ Coexistence of Big Data and Data Warehouse
Characteristics of Data

Data has three characteristics:

1. Composition: deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of the data as to whether it is static or real-time streaming.

2. Condition: deals with the state of the data, that is, "Can one use this data as is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"

3. Context: deals with "Where has this data been generated?", "Why was this data generated?" and so on.
Definition of Big Data

Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

Source: Gartner IT Glossary

High-volume, high-velocity, high-variety → Cost-effective, innovative forms of information processing → Enhanced insight & decision making
Volume - A Mountain of Data

1 Kilobyte (KB) = 1,000 bytes
1 Megabyte (MB) = 1,000,000 bytes
1 Gigabyte (GB) = 1,000,000,000 bytes
1 Terabyte (TB) = 1,000,000,000,000 bytes
1 Petabyte (PB) = 1,000,000,000,000,000 bytes
1 Exabyte (EB) = 1,000,000,000,000,000,000 bytes
1 Zettabyte (ZB) = 1,000,000,000,000,000,000,000 bytes
1 Yottabyte (YB) = 1,000,000,000,000,000,000,000,000 bytes
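
The table above uses decimal units, where each step is a factor of 1,000. A small helper along these lines (the function name human_readable is an assumption for illustration) turns a raw byte count into the nearest unit:

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def human_readable(num_bytes):
    """Format a byte count using decimal (power-of-1000) units."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value drops below 1000,
        # or at the largest unit we know about.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


print(human_readable(1_500_000))
print(human_readable(10**21))
```

Note that operating systems often report sizes in binary units (factors of 1,024) instead, so the same file can show slightly different numbers depending on the tool.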
Volume

Where does this data get generated?

1. Typical internal sources:
• Data storage: File systems, SQL, NoSQL (MongoDB, Cassandra).
• Archives: Archives of scanned documents, paper archives, customer records, patient health records, etc.
2. External data sources:
• Public web: Wikipedia, weather, regulatory, census, etc.
3. Both (internal + external):
• Sensor data: Car sensors, smart electric meters, office buildings, etc.
• Machine log data: Event logs, application logs, business process logs, audit logs, etc.
• Social media: Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
• Business apps: ERP, CRM, HR, Google Docs, and so on.
• Media: Audio, video, image, podcast, etc.
• Docs: CSV, Word documents, PDF, XLS, PPT, and so on.
Sources of Big Data
Velocity

Batch → Periodic → Near real time → Real-time processing
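
The two ends of this spectrum can be contrasted with a small sketch: a batch computation that waits for all the data, and an incremental one that updates its answer per event. The sensor readings and the RunningMean class are hypothetical and stand in for whatever metric a real system would track.

```python
readings = [21.0, 22.5, 19.5, 24.0]  # hypothetical sensor values

# Batch: wait until all data has arrived, then process it in one pass.
batch_mean = sum(readings) / len(readings)


class RunningMean:
    """Real-time style: refresh the mean per event without storing history."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


stream = RunningMean()
for value in readings:
    latest = stream.update(value)  # an up-to-date answer after every event

print(batch_mean, latest)
```

Both approaches reach the same final value; the difference is when an answer becomes available and how much data must be held at once.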
Variety

Structured data: for example, traditional transaction processing systems, RDBMS, etc.

Semi-structured data: for example, HyperText Markup Language (HTML), eXtensible Markup Language (XML).

Unstructured data: for example, unstructured text documents, audio, video, email, photos, PDFs, social media, etc.
Other Characteristics of Data Which are Not Definitional Traits of Big Data

• Veracity and validity: Veracity refers to biases, noise, and abnormality in data. Validity refers to the accuracy and correctness of the data.

• Volatility: Deals with how long the data is valid and how long it should be stored.

• Variability: Data flows can be highly inconsistent, with periodic peaks.


Challenges with Big Data

• Capture
• Storage
• Curation
• Search
• Analysis
• Transfer
• Visualization
• Privacy violations
Why Big Data?

More data → More accurate analysis → More confidence in decision making → Greater operational efficiencies, cost reduction, time reduction, new product development, optimized offerings, etc.
Traditional Business Intelligence (BI) versus Big Data

1. In a traditional BI environment, all the enterprise's data is housed in a central server, whereas in a big data environment data resides in a distributed file system. The distributed file system scales horizontally, by scaling in or out, whereas a typical database server scales vertically.
2. In traditional BI, data is generally analyzed in an offline mode, whereas in big data it is analyzed in both real time and offline mode.
3. Traditional BI is about structured data, and it is here that data is taken to the processing functions, whereas big data is about variety, and here the processing functions are taken to the data.
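
The idea in point 3, taking the processing function to the data, can be sketched in plain Python as a map step that runs per partition followed by a merge of the small partial results. This is only a single-process illustration of the pattern under assumed sample data; it does not use Hadoop or any MapReduce API, and the names partitions, map_partition, and merge are hypothetical.

```python
from collections import Counter
from functools import reduce

# Each inner list stands for a block of a file stored on a different node.
partitions = [
    ["big data is big", "data moves fast"],
    ["big clusters scale out"],
]


def map_partition(lines):
    """Runs where the block lives; only a small summary leaves the node."""
    return Counter(word for line in lines for word in line.split())


def merge(left, right):
    """Combine two partial word counts into one."""
    return left + right


partial_counts = [map_partition(partition) for partition in partitions]
total = reduce(merge, partial_counts, Counter())
print(total.most_common(2))
```

In a real cluster the map step would execute on the machines holding each block, so only the compact partial counts cross the network rather than the raw data.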
A Typical Data Warehouse Environment

Figure: A typical data warehouse environment
• Data sources: ERP, CRM, Legacy, 3rd party apps
• Central store: Data Warehouse
• Uses: Reporting / Dashboarding, OLAP, Ad hoc querying, Modeling


Co-existence of Big Data and Data Warehouse

Figure: Co-existence of big data and data warehouse
• Data sources: Web logs, images and videos, social media (Twitter, Facebook, etc.), docs & PDFs
• Hadoop: HDFS, MapReduce
• Operational systems
• Data warehouse, data marts, ODS



What is changing in the realms of Big data

•Competitive Advantage
•Decision Making
•Value of Data
It's time for an activity…
Teams Games Tournaments
Answer Me

Share your understanding of Big Data.

How is a traditional BI environment different from the Big Data environment?

Share your experience as a customer on an e-commerce site. Comment on the big data that gets created on a typical e-commerce site.
Summary please…

Ask a few participants of the learning program to summarize the lecture.


References …
Further Readings

Big Data For Dummies - Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman
http://en.wikipedia.org/wiki/Big_data
http://www.sas.com/en_us/insights/big-data/what-is-big-data.html
https://www.oracle.com/bigdata/
http://bigdatauniversity.com/
THANK YOU
