BIG DATA System: Big Data and Analytics by Seema Acharya and Subhashini Chellappan
Semi-Structured
❖ Sources of semi-structured data
Unstructured
❖ Sources of unstructured data
❖ Issues with terminology
❖ Dealing with unstructured data
Classification of Digital Data
Digital data is classified into the following categories:
Structured data: Data that is in an organized form (e.g., rows and columns) and can be easily used by a computer program. Relationships exist between entities of data, such as classes and their objects. Data stored in databases is an example of structured data.
Semi-structured data: Data that does not conform to a data model but has some structure. However, it is not in a form that can be used easily by a computer program. Examples include emails, XML, and markup languages such as HTML.
Unstructured data: Data that does not conform to a data model or is not in a form that can be used easily by a computer program. About 80%-90% of an organization's data is in this format. Examples include memos, chat rooms, PowerPoint presentations, images, videos, and letters.
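As a rough illustration of the three categories, the sketch below holds similar information in a structured form (CSV rows with fixed columns), a semi-structured form (JSON records whose fields vary), and an unstructured form (free text). All sample names and values are made up for illustration.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
structured_csv = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: self-describing JSON; fields can vary per record.
semi_structured = json.loads(
    '[{"id": 1, "name": "Asha"}, {"id": 2, "tags": ["vip"]}]'
)

# Unstructured: free text with no data model.
unstructured = "Met Asha in Pune yesterday; she asked about the Delhi order."

cities = [row["city"] for row in rows]                # direct column access
names = [rec.get("name") for rec in semi_structured]  # field may be missing
mentions_pune = "Pune" in unstructured                # plain text search only

print(cities, names, mentions_pune)
```

The further the data moves from a fixed schema, the more the program has to guard against missing fields or fall back to plain text search.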
Approximate Distribution of Digital Data
Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
OLTP Systems
Ease with Structured Data
❖ Input / Update / Delete
❖ Security
❖ Scalability
❖ Transaction Processing (ACID properties)
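The points above (input/update/delete and transaction processing with ACID properties) can be sketched with Python's built-in `sqlite3` module. The `account` table, its columns, and the `transfer` helper are hypothetical names chosen for this example; SQLite stands in for any relational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, owner TEXT, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)

# Input / update / delete on structured rows.
conn.execute("INSERT INTO account (id, owner, balance) VALUES (1, 'A', 100)")
conn.execute("INSERT INTO account (id, owner, balance) VALUES (2, 'B', 50)")
conn.execute("INSERT INTO account (id, owner, balance) VALUES (3, 'C', 0)")
conn.execute("UPDATE account SET owner = 'B2' WHERE id = 2")
conn.execute("DELETE FROM account WHERE id = 3")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst atomically; return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 30)
failed = transfer(conn, 1, 2, 500)  # violates the CHECK, so it is rolled back
balances = dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))
print(ok, failed, balances)
```

In the failed transfer the credit to account 2 is undone together with the rejected debit, which is the atomicity part of ACID.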
Semi-structured Data
Characteristics of Semi-structured Data
❖ Inconsistent structure
❖ Self-describing (label/value pairs)
❖ Often schema information is blended with data values
❖ Data objects may have different attributes not known beforehand
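The characteristics above can be seen in a small JSON example. This is only a sketch with made-up records: each object carries its own labels, and the set of attributes has to be discovered from the data because it is not declared beforehand.

```python
import json

raw = """[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "address": {"city": "Delhi"}},
  {"title": "Note", "body": "no name field here"}
]"""
records = json.loads(raw)

# Labels travel with the values, so the "schema" is discovered from the data.
seen_attributes = set()
for rec in records:
    seen_attributes.update(rec.keys())

# Attributes that every record shares (here: none, the structure is inconsistent).
common = set.intersection(*(set(rec) for rec in records))

print(sorted(seen_attributes))
print(common)
```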
Unstructured Data
❖ Images
❖ Free-form text
❖ Audio
❖ Videos
❖ Body of email
❖ Text messages
❖ Chats
❖ Social media data
❖ Word documents
Issues with terminology – Unstructured Data
Data with some structure may still be labeled unstructured if the structure doesn’t help with the processing task at hand.
Data Mining
▪Data Mining
•Association Rule Mining
•Regression Analysis
•Collaborative Filtering
▪Part-of-speech tagging
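To make association rule mining concrete, here is a minimal sketch of its two basic measures, support and confidence, computed over a toy set of transactions. The items and the helper names `support` and `confidence` are assumptions made for this example, not part of the slides.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


s = support({"bread", "milk"}, transactions)
c = confidence({"bread"}, {"milk"}, transactions)
print(s, c)
```

A rule such as "bread implies milk" is usually kept only when both measures exceed thresholds chosen by the analyst.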
▪ Technology
• Structured data: It is based on relational database tables.
• Semi-structured data: It is based on XML/RDF (Resource Description Framework).
• Unstructured data: It is based on character and binary data.
▪ Transaction management
• Structured data: Matured transaction and various concurrency techniques.
• Semi-structured data: Transaction is adapted from DBMS, not matured.
• Unstructured data: No transaction management and no concurrency.
▪ Query performance
• Structured data: Structured queries allow complex joining.
• Semi-structured data: Queries over anonymous nodes are possible.
• Unstructured data: Only textual queries are possible.
Chapter 2
1. Composition: deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of the data as to whether it is static or real-time streaming.
2. Condition: deals with the state of the data, that is, “Can one use this data as is for analysis?” or “Does it require cleansing for further enhancement and enrichment?”
3. Context: deals with “Where has this data been generated?”, “Why was this data generated?” and so on.
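The "condition" question (can the data be used as is, or does it need cleansing?) can be sketched as a simple check over records. The field names, sample values, and the `condition_issues` helper are hypothetical; a real check would depend on the data set.

```python
records = [
    {"customer": "Asha", "age": "34", "city": "Pune"},
    {"customer": "Ravi", "age": "", "city": "Delhi"},
    {"customer": "  meena ", "age": "abc", "city": None},
]


def condition_issues(record):
    """Return a list of problems that would need cleansing before analysis."""
    issues = []
    for field, value in record.items():
        if value is None or str(value).strip() == "":
            issues.append(f"{field}: missing")
    age = record.get("age")
    if age and not str(age).strip().isdigit():
        issues.append("age: not a number")
    return issues


report = {r["customer"].strip(): condition_issues(r) for r in records}
usable_as_is = [name for name, issues in report.items() if not issues]
print(report)
print(usable_as_is)
```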
Definition of Big Data
Big Data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
❖ High-volume
❖ High-velocity
❖ High-variety
❖ Cost-effective, innovative forms of information processing
❖ Enhanced insight and decision making
• Volatility: deals with how long the data is valid and how long it should be stored.
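The two volatility questions, validity and retention, can be kept as separate checks. The sketch below assumes a hypothetical 7-day validity window and a 30-day retention period; both numbers and the function names are illustrative only.

```python
from datetime import datetime, timedelta


def is_valid(created_at, valid_for, now):
    """True while the data is still within its validity window."""
    return now - created_at <= valid_for


def should_purge(created_at, retain_for, now):
    """True once the data has outlived its retention period."""
    return now - created_at > retain_for


now = datetime(2024, 1, 10)
reading_time = datetime(2024, 1, 1)  # the record is 9 days old

still_valid = is_valid(reading_time, timedelta(days=7), now)
purge = should_purge(reading_time, timedelta(days=30), now)
print(still_valid, purge)
```

Here the record is no longer valid for analysis but is still inside its retention period, so it stays in storage.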
Storage
Curation
Analysis
Transfer
Visualization
Privacy Violations
Why Big Data?
More Data
More Accurate Analysis
[Figure: ERP, CRM, and operational systems; images and videos; social media (Twitter, Facebook, etc.); Hadoop and MapReduce; data warehouse; data marts; OLAP; reporting / dashboarding]
•Competitive Advantage
•Decision Making
•Value of Data
It's time for an activity…
Teams Games Tournaments
Answer Me