Big Data & Analytics (CSE448) L1
MODULE 1 (L1)
Do you know what happens in
one minute on the Internet?
Databases such as
Oracle, DB2,
Teradata, MySQL,
PostgreSQL, etc.
OLTP Systems
Ease of Working with
Structured Data
The ease is with respect to the following:
Insert/update/delete: The Data Manipulation
Language (DML) operations make data input,
storage, access, processing, and analysis
straightforward.
Security: How does one ensure the security of
Input / Update /
Delete
Security
Scalability
Transaction
Processing
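The insert/update/delete ease described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for the databases named earlier; the `employee` table, its columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3


def run_dml_demo():
    """Run INSERT, UPDATE and DELETE on a small in-memory table."""
    conn = sqlite3.connect(":memory:")  # throwaway database, nothing on disk
    cur = conn.cursor()

    # Structured data: the schema is fixed before any row is stored.
    # Table and column names are hypothetical.
    cur.execute(
        "CREATE TABLE employee ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, dept TEXT NOT NULL)"
    )

    # INSERT: add three rows.
    cur.executemany(
        "INSERT INTO employee (id, name, dept) VALUES (?, ?, ?)",
        [(1, "Asha", "Sales"), (2, "Ravi", "HR"), (3, "Meena", "Sales")],
    )

    # UPDATE: move employee 2 to another department.
    cur.execute("UPDATE employee SET dept = ? WHERE id = ?", ("Finance", 2))

    # DELETE: remove employee 3.
    cur.execute("DELETE FROM employee WHERE id = ?", (3,))

    conn.commit()
    rows = cur.execute(
        "SELECT id, name, dept FROM employee ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


rows = run_dml_demo()
print(rows)
```

Because every row follows the same declared schema, each operation is a single declarative statement; this is the kind of ease the slide refers to.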
Semi-structured Data
This is the data which does not conform to
a data model but has some structure.
However, it is not in a form which can be
used easily by a computer program.
Examples: emails, XML, and markup languages
like HTML. Metadata for this data is
available but is not sufficient.
Semi-structured Data
It has the following features:
It does not conform to the data models that one typically
associates with relational databases or any other form of
data tables.
It uses tags to segregate semantic elements.
Tags are also used to enforce hierarchies of records and
fields within data.
There is no separation between the data and the schema.
The amount of structure used is dictated by the purpose
at hand.
In semi-structured data, entities belonging to the same
class and also grouped together need not necessarily
have the same set of attributes.
And if at all, they have the same set of attributes, the
Sources of Semi-structured
Data
Amongst the sources for semi-structured data, the
front runners are “XML” and “JSON”.
XML: eXtensible Markup Language (XML) has been widely
popularized by web services built on the
Simple Object Access Protocol (SOAP).
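As a rough illustration of the features listed earlier (tags or labels carried with the data, and entities of the same class having different attributes), the sketch below parses a small JSON document and a small XML document with Python's standard library. The customer records, field names, and helper function names are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Two records of the same "class" with different attribute sets (made-up data).
JSON_TEXT = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "skills": ["SQL", "Python"]}
]
"""

# The same idea in XML: tags separate semantic elements and nest into a hierarchy.
XML_TEXT = """
<customers>
  <customer id="1">
    <name>Asha</name>
    <email>asha@example.com</email>
  </customer>
  <customer id="2">
    <name>Ravi</name>
    <phone>555-0100</phone>
  </customer>
</customers>
"""


def attribute_sets_from_json(text):
    """Return the sorted label names found in each JSON record."""
    records = json.loads(text)
    return [sorted(record.keys()) for record in records]


def attribute_sets_from_xml(text):
    """Return the sorted child tag names found under each <customer>."""
    root = ET.fromstring(text.strip())
    return [
        sorted(child.tag for child in customer)
        for customer in root.findall("customer")
    ]


json_attrs = attribute_sets_from_json(JSON_TEXT)
xml_attrs = attribute_sets_from_xml(XML_TEXT)
print(json_attrs)
print(xml_attrs)
```

In both formats the labels travel with the values, so the structure has to be discovered by reading the data itself rather than looked up in a separate schema.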
Sources of Semi-structured
Data
Inconsistent Structure
Self-describing
(label/value pairs)
Semi-structured data
Schema information is often
blended with data values
Images
Free-Form
Text
Audios
Unstructured data
Videos
Body of
Email
Text
Messages
Chats
Social
Media data
Word
Document
Issues with terminology –
Unstructured Data
Data Mining
or “affinity analysis”.
It is used to determine “What goes with
what?”
It is about when you buy a product, what
IBM UIMA
Questions to Answer
Which category (structured, semi-
structured, or unstructured) will you
place a Web Page in?
Which category (structured, semi-
structured, or unstructured) will you
place a Word document in?
State a few examples of human
generated and machine-generated
data.