Big Data

The document discusses the classification of digital data into structured, semi-structured, and unstructured data, focusing on the evolution and role of Relational Database Management Systems (RDBMS) in managing structured data. It explains the characteristics and examples of each data type, including the flexibility of semi-structured data and the prevalence of unstructured data in organizations. Additionally, it outlines methods for dealing with unstructured data, such as data mining, text analytics, and natural language processing.

Uploaded by

aritra sarkar

Big Data Analytics and Visualization

Module 1: Classification of Digital Data

Dr. Jisha Francis

Department of Mathematics
School of Advanced Sciences
Vellore Institute of Technology
Vellore Campus, Vellore - 632 014
India

Dr. Jisha Francis Module 1 1 / 18


Classification of Digital Data

In the 1980s, most enterprise data was stored in relational databases, complete with
rows (records/tuples), columns (attributes/fields), primary keys, and foreign keys.
Relational Database Management System (RDBMS)
The software used to store, manage, query, and retrieve data held in a relational
database is called an RDBMS.
The RDBMS provides an interface between users or applications and the database, as
well as administrative functions for managing data storage, access, and performance.

Over time, RDBMSs have matured; the systems available today are more robust,
cost-effective, and efficient.
The data held in an RDBMS is typically structured.

Example of RDBMS

Figure: Example of RDBMS: Oracle

Classification of Digital Data

Digital data is classified into three types:
Structured data
Semi-structured data
Unstructured data

Structured Data

This is data in an organized form (e.g., rows and columns) that can be used easily
by a computer program.
In other words, the data conforms to a pre-defined schema or structure.
Relationships exist between entities of data, such as classes and their objects.
Data stored in databases is an example of structured data.
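
As an illustration of data that conforms to a pre-defined schema, the following minimal sketch uses Python's built-in sqlite3 module. The table names, column names, and sample rows (department, employee, and so on) are hypothetical and do not come from the slides; they only show how a primary key and a foreign key express a relationship between entities.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# The schema is declared before any data is stored (hypothetical tables).
conn.execute("""
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES department(dept_id)
    )
""")

conn.execute("INSERT INTO department VALUES (1, 'Mathematics')")
conn.execute("INSERT INTO employee VALUES (100, 'Asha', 1)")

# The foreign key lets us join the two entities.
rows = conn.execute("""
    SELECT e.name, d.name
    FROM employee AS e
    JOIN department AS d ON e.dept_id = d.dept_id
""").fetchall()

# A row that points at a non-existent department violates the schema.
try:
    conn.execute("INSERT INTO employee VALUES (101, 'Ravi', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rows, rejected)
```

The rejected insert is the point of the sketch: with structured data, the database itself refuses records that do not fit the declared structure.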

Examples of RDBMS

Company        Database
Oracle Corp    Oracle
IBM            DB2
Microsoft      Microsoft SQL Server
EMC            Greenplum
Teradata       Teradata
Open source    MySQL
Open source    PostgreSQL (advanced open source)
Table: Database Products by Companies

Sources of Structured Data

These databases are typically used to hold the transactional/operational data generated
and collected by day-to-day business activities.
The data of Online Transaction Processing (OLTP) systems is generally structured.
OLTP is a type of data processing that consists of executing a number of transactions
that occur concurrently, for example online banking, shopping, order entry, or sending
text messages.
These transactions are traditionally referred to as economic or financial transactions.
They are recorded and secured so that an enterprise can access the information at any
time for accounting or reporting purposes.
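
To make the idea of an OLTP transaction concrete, here is a minimal sketch of a bank transfer using Python's built-in sqlite3 module. The account table, the transfer function, and the amounts are all hypothetical. The sketch assumes a single connection and only shows that the two updates either both take effect or both roll back; a real OLTP system would also handle many concurrent users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: a balance may never go below zero.
conn.execute("""
    CREATE TABLE account (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    )
""")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 500), (2, 100)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from account `src` to `dst` as one transaction.

    Returns True if the transfer committed, False if it was rolled back.
    """
    try:
        # `with conn` commits on success and rolls back on an exception.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, 1, 2, 200)       # enough funds: commits
failed = transfer(conn, 1, 2, 1000)  # overdraft: both updates are undone
balances = dict(conn.execute("SELECT id, balance FROM account"))
print(ok, failed, balances)
```

In the failed transfer the credit to the second account has already been applied when the debit violates the CHECK constraint, so the rollback is what keeps the recorded data consistent.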

Semi-Structured Data

It is also referred to as a self-describing structure.
This is data that does not conform to a data model but has some structure.
However, it is not in a form that can be used easily by a computer program.
Examples: emails, XML, and markup languages such as HTML.

Flexible Schema: Unlike structured data with fixed schemas, semi-structured data allows
for flexibility in the organization of data. It may have a loose or dynamic schema,
enabling data to be added or changed without the need for predefined structures.
Data Contains Tags or Markers: Semi-structured data often includes tags, markers, or
other identifiers that provide a level of organization or hierarchy.
No Clear Data Relationships: While semi-structured data may have some organizational
elements, it lacks the strict relationships and constraints found in structured databases.

Variety of Data Types: Semi-structured data can accommodate various types of data,
including text, images, and nested structures.
Ease of Adaptability: It can easily incorporate new data elements without requiring
modifications to the entire schema.
Examples
XML (eXtensible Markup Language): Uses tags to define elements and attributes.
JSON (JavaScript Object Notation): Uses key-value pairs to represent data objects. It is
used to transmit data between a server and a web application.

Example of Semi-Structured Data

Figure: The figure represents information about a person.
The person has attributes such as first name, last name, and age.
The address is a nested object with street, city, and zipcode.
The contacts attribute is an array containing objects with type (email or phone) and
corresponding values.
The isStudent attribute is a boolean.
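
The original figure is not reproduced here, so the following is only a sketch of what such a record could look like, assuming the figure showed JSON. The key names follow the description above (the exact spelling of keys such as firstName and value is assumed), and every value is invented for illustration. The record is parsed with Python's built-in json module.

```python
import json

# Hypothetical record; the values are made up, not taken from the figure.
person_json = """
{
  "firstName": "Asha",
  "lastName": "Rao",
  "age": 21,
  "address": {
    "street": "12 Main Road",
    "city": "Vellore",
    "zipcode": "632014"
  },
  "contacts": [
    {"type": "email", "value": "asha@example.com"},
    {"type": "phone", "value": "555-0100"}
  ],
  "isStudent": true
}
"""

person = json.loads(person_json)

# Nested object: reach into the address by key.
city = person["address"]["city"]
# Array of objects: collect the type of each contact.
contact_types = [contact["type"] for contact in person["contacts"]]
# JSON true/false becomes a Python bool.
is_student = person["isStudent"]

print(city, contact_types, is_student)
```

The keys travel with the data, which is why semi-structured data is called self-describing: no separate schema is needed to read the record.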

Example of Semi-Structured Data

Figure: The XML data represents information about a person.
Tags such as <person>, <firstName>, <lastName>, etc., represent different data
elements.
Nested elements, like <address> and <contacts>, contain further details.
Attributes, such as type="email" within the <contact> tag, provide additional information.
The content inside the tags represents the values associated with each element.
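
Since the figure itself is missing, the sketch below reconstructs a comparable XML document from the tag names mentioned above and reads it with Python's built-in xml.etree.ElementTree module. The element values, and any tag not named in the slide (such as street, city, zipcode, and age), are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document built from the tags named in the slide.
person_xml = """
<person>
  <firstName>Asha</firstName>
  <lastName>Rao</lastName>
  <age>21</age>
  <address>
    <street>12 Main Road</street>
    <city>Vellore</city>
    <zipcode>632014</zipcode>
  </address>
  <contacts>
    <contact type="email">asha@example.com</contact>
    <contact type="phone">555-0100</contact>
  </contacts>
</person>
""".strip()

root = ET.fromstring(person_xml)

# Element content: the text between the opening and closing tags.
first_name = root.findtext("firstName")
# Nested element: a path walks from <person> into <address>.
city = root.findtext("address/city")
# Attributes: map each contact's type attribute to its text value.
contacts = {c.get("type"): c.text for c in root.iter("contact")}

print(first_name, city, contacts)
```

Note that XML carries every value as text, so a number such as the age would need an explicit conversion (for example int(root.findtext("age"))) before it can be used in arithmetic.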

Unstructured Data

This is data that does not conform to a data model or is not in a form that can be
used easily by a computer program.
About 80-90% of an organization's data is in this format.
Examples: chat rooms, PowerPoint presentations, images, videos, letters, research
documents, the body of an email, etc.

Examples of Unstructured Data

Text Documents: Word documents, PDF files, text files, email messages
Multimedia Files: Images (JPEG, PNG, GIF), audio files (MP3, WAV), video files (MP4, AVI),
presentations (PowerPoint slides)
Social Media Content: Tweets, Facebook posts, Instagram posts, comments and discussions
Web Pages: HTML pages, blogs, articles

Sensor Data: Data from IoT devices, environmental sensor data, geospatial data
Speech Data: Transcriptions of speech, voice recordings
Graphs and Networks: Social network graphs, organizational charts, knowledge graphs
Raw Logs: Server logs, system logs, application logs

A Few Ways to Deal with Unstructured Data

Data Mining: First, we deal with large data sets. Second, we use methods at the
intersection of AI, ML, statistics, and database systems to unearth consistent patterns in
those data sets. It is the analysis step of the "Knowledge Discovery in Databases" process.
A few popular data mining algorithms are as follows:
Association Rule Mining (market basket analysis):
It is used to determine "What goes with what?"
Given that you buy one product, it identifies the other products you are likely to
purchase with it.
Regression Analysis:
It helps to model the relationship between two variables so that one can be predicted
from the other.
Collaborative Filtering:
It predicts a user's preferences based on the preferences of a group of
users.
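
The "what goes with what?" question of association rule mining is usually answered with two measures that the slide does not name: support (the fraction of baskets containing an itemset) and confidence (of the baskets containing the first item, the fraction that also contain the second). The sketch below computes both by brute force over a tiny, invented set of shopping baskets. It is not an implementation of a full algorithm such as Apriori, and the function names and thresholds are hypothetical.

```python
from itertools import permutations

# Invented shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def count(items):
    """Number of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions)


def support(items):
    """Fraction of all baskets that contain `items`."""
    return count(items) / len(transactions)


def confidence(antecedent, consequent):
    """Of the baskets with `antecedent`, the fraction that also hold `consequent`."""
    return count(set(antecedent) | set(consequent)) / count(antecedent)


# Enumerate single-item rules "a -> b" that pass both (arbitrary) thresholds.
all_items = sorted(set().union(*transactions))
rules = [
    (a, b)
    for a, b in permutations(all_items, 2)
    if support({a, b}) >= 0.4 and confidence({a}, {b}) >= 0.7
]

print(support({"bread", "butter"}))       # 3 of 5 baskets
print(confidence({"bread"}, {"butter"}))  # 3 of the 4 bread baskets
print(rules)
```

Here bread and butter appear together in three of the five baskets, and three of the four baskets with bread also contain butter, so the rule "bread goes with butter" passes both thresholds.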

Text analytics or text mining: This involves analyzing and interpreting large volumes of
textual information to discover patterns, trends, and knowledge. It includes tasks such as
text categorization, text clustering, sentiment analysis, etc.
Natural Language Processing: It focuses on enabling computers to understand,
interpret, and generate human language. It involves the interaction between computers
and humans through natural language.
Noisy text analytics: It is the process of extracting structured or semi-structured
information from noisy unstructured data such as chats, blogs, wikis, emails, and text
messages.
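
As a very small illustration of text analytics, the sketch below tokenizes a few invented messages, counts term frequencies, and assigns a sentiment label with a hand-made word list. The word lists, messages, and function names are hypothetical, and a lexicon lookup like this is far simpler than the text categorization or sentiment analysis methods used in practice.

```python
import re
from collections import Counter

# Tiny hand-made sentiment lexicon (an assumption, not a standard resource).
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Invented messages standing in for chats, reviews, or emails.
messages = [
    "Great phone, love the camera!",
    "Terrible battery and poor support.",
    "Delivered on Monday.",
]

labels = [sentiment(m) for m in messages]
term_counts = Counter(token for m in messages for token in tokenize(m))

print(labels)
print(term_counts.most_common(3))
```

The output of such a step (a label per message, a count per term) is structured data, which is how text analytics turns unstructured text into something a database or a data mining algorithm can work with.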
