Chapter 01: Types of Digital Data

This document discusses types of digital data, including unstructured, semi-structured, and structured data. Unstructured data lacks a specific structure and makes up 80-90% of organizational data, including documents, emails, images, and videos. Semi-structured data has some structure but not a rigid format, including XML, HTML, and JSON. Structured data is organized into a relational database with rows and columns and includes data from databases, ERP systems, and data warehouses. The document then provides examples and characteristics of each type of data.

Uploaded by

01fm19mca006

CHAPTER 01: TYPES OF DIGITAL DATA

Data
• The quantities, characters, or symbols on which operations are
performed by a computer, which may be stored and transmitted in the
form of electrical signals and recorded on magnetic, electronic, optical,
or mechanical recording media.
• In general, data is any set of characters that is gathered and translated
for some purpose, usually analysis. It can be any character, including
text and numbers, pictures, sound, or video. If data is not put into
context, it doesn't do anything to a human or computer.
Data
• It is an invaluable asset of any enterprise (big or small).
• Data is present internal to the enterprise and also exists outside
the firewalls of the enterprise.
• Data may be homogeneous or heterogeneous.

Data → Information → Insights
• The need of the hour is to
– understand, manage, and process the data,
– take the data for analysis, and
– draw valuable insights.
Types of Digital Data
• Unstructured
• Semi-structured
• Structured
Unstructured Data
• Unstructured data refers to the data that lacks any
specific form or structure.
• This makes it very difficult and time-consuming to
process and analyze unstructured data.
• Data which does not conform to a data model.
• Computer programs cannot use this data directly.
• About 80-90% of an organization's data is in this format.
Unstructured Data
• Example
– Memos, QR code (Quick Response), Blogs
– Chat rooms, Tweets, Comments, likes, tags
– PPTs, emojis, emoticons (emotion icons)
– Images
– Videos
– Doc files
– Body of email, GPS data, sensor data, etc.
Semi-structured Data
• Data which does not conform to a data model but has
some structure.
• Computer programs cannot use this data easily.
• Example
– emails
– XML
– HTML
– JSON, and so on.
Structured Data
• Data which is in an organized form (In rows & columns).
• Computer programs can use this data easily.
• Relationships exist between entities of data.
• Example
– Data stored in databases
– ERP
– CRM
– DW
– Data Cube
Distribution of digital data (in %)
(by Gartner)
• Unstructured: 80
• Semi-structured: 10
• Structured: 10
Structured Data
• The data that can be processed, stored, and retrieved in a
fixed format. It refers to highly organized information that
can be readily and seamlessly stored and accessed from a
database by simple search algorithms.
• Data that conforms to a pre-defined schema or structure is
known as structured data.
• Conforms to a relational data model.
• Structured data is organized in semantic chunks/entities
with similar entities grouped together to form
relations/tables.
Structured Data
• Data is stored in the form of rows and columns.
• Definition, format, and meaning of data are explicitly
known.
• Descriptions for all entities in a group
– have the same defined format,
– have a predefined length,
– follow the same order.
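A minimal sketch of what "rows and columns with a predefined format" looks like in practice, using Python's built-in sqlite3 module. The table name, column names, and sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; table, columns, and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    " emp_id INTEGER PRIMARY KEY,"
    " name VARCHAR(30) NOT NULL,"   # SQLite records the declared length but does not enforce it
    " dept VARCHAR(20) NOT NULL)"
)
rows = [(1, "Asha", "Finance"), (2, "Ravi", "Sales"), (3, "Meena", "Sales")]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)

# Every row has the same columns in the same order, so a simple
# query is enough to retrieve exactly the data that is needed.
sales = conn.execute(
    "SELECT name FROM employee WHERE dept = ? ORDER BY emp_id", ("Sales",)
).fetchall()
print(sales)
```

Because the schema is declared before any data arrives, the program never has to guess where a value lives or what it means.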
Example
Sources of Structured Data
• Databases
• Excel
• OLTP systems
Database (RDBMS)
• Oracle Corp. – Oracle
• IBM – DB2, IBM-Informix
• Microsoft – SQL Server
• EMC – Greenplum
• Teradata – Teradata
• Open source – MySQL, PostgreSQL
• SQLite
• Sequel Pro
• Amazon Aurora
• SAP SQL Anywhere, SAP IQ (Sybase)
Ease with structured data
• Insert/update/delete
• Security
• Indexing/searching
• Scalability
• Transaction processing (ACID)
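As a hedged illustration of transaction processing with ACID guarantees, the sketch below uses Python's built-in sqlite3 module. The account table, the balances, and the transfer helper are hypothetical; the point is only that a failed update leaves no partial change behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; the credit is undone too.
        return False

ok = transfer(conn, 1, 2, 30)       # enough funds
failed = transfer(conn, 1, 2, 500)  # would overdraw account 1
balances = conn.execute("SELECT balance FROM account ORDER BY id").fetchall()
print(ok, failed, balances)
```

The second transfer credits account 2 before the debit fails, yet the rollback removes that credit as well, which is the atomicity part of ACID.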
Semi-structured data (SSD)
• It is also referred to as having a self-describing structure.
• It is a form of structured data that does not
conform to the formal structure of data models
associated with relational databases or other
forms of data tables.
• It uses metadata and tags to provide semantic
information.
Characteristics of semi-structured data (SSD)

• Does not conform to a data model.
• Cannot be stored in the form of rows and columns
as in a database.
• The tags and elements are used to describe data.
• Attributes in a group may not be the same.
• Similar entities are grouped.
• Size of the same attributes in a group may differ.
• Type of the same attributes in a group may differ.
• Evolving Schema
• Schema and data are tightly coupled.
Example (Names & Emails)
• One way is:
Name: Raju Patil
Email: [email protected], [email protected]
• Another way is:
First Name: Raju
Last Name: Patil
Email: [email protected]
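The two layouts above can be sketched as Python dictionaries to show why programs need extra care with semi-structured data. The e-mail addresses below are hypothetical placeholders (the originals are not shown), and the helper names are invented for this example.

```python
# Two records describing the same kind of entity with different attributes.
# The e-mail addresses are made-up placeholders.
records = [
    {"name": "Raju Patil", "email": ["raju@example.com", "rpatil@example.org"]},
    {"first_name": "Raju", "last_name": "Patil", "email": "raju@example.com"},
]

def full_name(rec):
    """Return a full name whichever attribute layout the record uses."""
    if "name" in rec:
        return rec["name"]
    return f"{rec['first_name']} {rec['last_name']}"

def emails(rec):
    """Return e-mail addresses as a list, whether one or many were stored."""
    value = rec.get("email", [])
    return [value] if isinstance(value, str) else list(value)

names = [full_name(r) for r in records]
email_counts = [len(emails(r)) for r in records]
print(names, email_counts)
```

The attributes, their number, and their types differ between the two records, so the reading code has to inspect each record instead of relying on a fixed schema.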
Sources of SSD
• Email
• XML
• TCP/IP
• Zipped files
• Mark-up languages
• Integration of data from heterogeneous sources.
Example: Email format

To: <Name>
From: <Name>
Subject: <Text>
CC: <Name>
Body: <Text, Graphics, Images, etc.><Name>
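A small sketch of how the email format mixes structure with free text, using Python's standard email package. The addresses, subject, and body below are hypothetical.

```python
from email import message_from_string

# A made-up message: tagged header fields followed by a free-form body.
raw = (
    "To: alice@example.com\n"
    "From: bob@example.com\n"
    "Subject: Weekly report\n"
    "Cc: carol@example.com\n"
    "\n"
    "Free-form text of any length goes here.\n"
)

msg = message_from_string(raw)

# The header fields are labelled, so they can be read by name.
headers = {field: msg[field] for field in ("To", "From", "Subject", "Cc")}

# The body carries no tags; it is returned as one block of text.
body = msg.get_payload()
print(headers["Subject"], "|", body.strip())
```

The headers behave like structured fields, while the body remains unstructured, which is why email is usually classed as semi-structured.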
ABC Healthcare Blood Test Report
Date <> ----

Department <> -----

Patient Name <> Attending Doctor <>

Hemoglobin <> Patient Age <>

content
RBC count <>

WBC count <>

Platelet count <>

Diagnosis <notes>
Conclusion <notes>
XML & JSON
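XML and JSON both describe data with tags or keys that travel with the values. The sketch below parses the same hypothetical record from each format using only the Python standard library; the element names and the e-mail address are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same made-up record expressed in XML and in JSON.
xml_text = "<person><name>Raju Patil</name><email>raju@example.com</email></person>"
json_text = '{"name": "Raju Patil", "email": "raju@example.com"}'

# XML: each child element's tag names the value it holds.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

# JSON: each key names the value it holds.
from_json = json.loads(json_text)

print(from_xml == from_json)
```

In both cases the metadata (tags or keys) is embedded in the data itself, which is what "self-describing" means.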
Integration of data from heterogeneous sources
• User
• Mediator: Uniform access to multiple data sources
• Data sources: RDBMS, OODBMS, structured file, legacy system
Getting to know Unstructured data
• Over the past few days, Dr. Ben and Dr. Stanley
had been exchanging long emails about a
particular case of gastro-intestinal problem.
• The emails describe the procedure practiced by Dr. Stanley
and the combination of drugs that has successfully
cured gastro-intestinal disorders in patients.
• Dr. Mark has a patient in the “GoodLife”
emergency unit with a quite similar case of
gastro-intestinal disorder.
Getting to know Unstructured data
What is Unstructured data
• Unstructured data (or unstructured information)
refers to information that either does not have a
pre-defined data model or is not organized in a
pre-defined manner.
• Unstructured data is a generic label for describing
data that is not contained in a database or some
other type of data structure.
What is Unstructured data
• Unstructured data can be textual or non-textual.  
• Textual unstructured data is generated in media
like email messages, PowerPoint presentations,
Word documents, comments in social media, etc.
• Non-textual unstructured data is generated in
media like  images,  CCTV footage, audio files
and video files.
Unstructured data
• About 80-85% of the data in any organization is
unstructured and is growing at an alarming rate.
• An enormous amount of knowledge is buried in this
data.
• Anything in a non-database form is unstructured
data.
• Two types:
1. Bitmap objects : image, video, or audio files
2. Textual objects : Word documents, emails, PPTs and so on.
Characteristics of Unstructured data

• This data cannot be stored in the form of rows
and columns as in a database and does not
conform to any data model.
• It is difficult to determine the meaning of the
data.
• It does not follow any rule or semantics, i.e., it is not
in any particular format or sequence.
• Not easily usable by a program.
Sources of Unstructured data
• Web pages
• Audio and videos
• Images
• Body of an email
• Word documents
• PPT and reports
• Chats and text messages
• Social media data
• White papers
• Surveys
• SMS
• Free-form text
• Server log files
Web page is unstructured data
• A web page can combine multimedia, images, text, XML,
and database content.
Challenges
• Storage space: A lot of space is required to store
unstructured data (USD).
• Scalability: As the data grows, scalability becomes an
issue and the cost of storing USD increases.
• Retrieving information: It is difficult to retrieve the
required information from USD.
• Security: Ensuring security is difficult due to varied
sources of data. E.g. emails, web pages, etc.
• Indexing & searching: Very difficult and error-prone
as the structure of the USD is not clear.
Challenges
• Interpretation : USD is not easily interpreted by
conventional search algorithms.
• Classification : Different naming conventions
followed across the organization make it difficult to
classify data.
• Deriving meaning : Computer programs cannot
automatically derive meaning or structure from USD.
• File formats : Increasing number of file formats
makes it difficult to interpret data.
Portion of Unstructured data
(Chart: share of USD compared with SD)
Dealing with USD: Possible Solutions
1. Data mining
2. Text mining / Text analytics
3. NLP
4. Noisy text analytics
5. Manual tagging with metadata
6. Part of speech tagging
7. UIMA
Data Mining
• It is the computing process of discovering patterns
in large data sets involving methods at the
intersection of AI, machine learning, deep learning
(DL), statistics, and database systems.
• Popular algorithms:
– Association rule mining (market basket analysis, MBA)
– Regression analysis (Y = mX + c)
– Collaborative filtering
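The regression line Y = mX + c can be fitted with ordinary least squares. The sketch below is a minimal pure-Python version; the function name fit_line and the sample points are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # m = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The fitted line always passes through the point of means.
    c = mean_y - m * mean_x
    return m, c

# Made-up points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
m, c = fit_line(xs, ys)
print(m, c)
```

With real data the points will not lie exactly on a line, and m and c then describe the line that minimizes the squared vertical distances to the points.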
Collaborative filtering
• Collaborative filtering (CF) is a technique used by 
recommender systems.
• Collaborative filtering is the process of filtering for
information or patterns using techniques involving
collaboration among multiple agents, viewpoints, data
sources, etc.
• In collaborative filtering, algorithms are used to make
automatic predictions about a user's interests by
compiling preferences from several users. 
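One common variant, user-based collaborative filtering, can be sketched as follows. This is an illustrative simplification and not the method of any particular website: the users, items, and ratings are made up, and cosine similarity over commonly rated items is just one possible similarity measure.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ann": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2, "C": 5, "D": 4},
    "cat": {"A": 1, "B": 5, "D": 2},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        weight = cosine(ratings[user], their)
        num += weight * their[item]
        den += weight
    return num / den if den else None

pred = predict("ann", "D")
print(round(pred, 2))
```

Ann has not rated item D, so her predicted rating is compiled from Bob and Cat, with more weight given to Bob because his past ratings resemble hers more closely.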
Collaborative filtering
• Collaborative filtering (CF) is a technique commonly used
to build personalized recommendations on the Web.
• Popular websites that make use of the collaborative
filtering technology include Amazon, Netflix, iTunes.
Text analytics or text mining
• It is the process of converting
unstructured text data into meaningful data for
analysis, to measure customer opinions, product
reviews, and feedback, and to perform sentiment
analysis in support of fact-based decision making.
• Uses many linguistic, statistical, and machine
learning techniques such as clustering, pattern
recognition, tagging, association analysis,
predictive analytics, etc.
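As a very rough sketch of turning free text into something measurable, the example below scores reviews against two small word lists. The word lists and reviews are invented for illustration; practical text analytics relies on the statistical and machine learning techniques named above and not on a hand-made lexicon.

```python
import re
from collections import Counter

# Tiny hand-made lexicons; purely illustrative.
POSITIVE = {"good", "great", "fast", "helpful"}
NEGATIVE = {"bad", "slow", "broken", "rude"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    counts = Counter(tokenize(text))
    score = sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

reviews = [
    "Great phone, fast delivery!",
    "Support was rude and the app is slow.",
    "It arrived on Tuesday.",
]
labels = [sentiment(r) for r in reviews]
print(labels)
```

The output is a structured label per review, which can then be counted, charted, or joined with other data.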
Text analytics or text mining
• It helps organizations to find potentially valuable
business insights in corporate documents, customer
emails, call center logs, survey comments, social
network posts, medical records and other sources of
text-based data.
• Text mining capabilities are also being incorporated into
AI chatbots and virtual agents that companies deploy to
provide automated responses to customers as part of
their marketing, sales and customer service operations.
Natural Language Processing (NLP)
• Natural language processing (NLP) is the ability of a
computer program to understand human language
as it is spoken. NLP is a component of artificial
intelligence (AI).
• It is a field of computer science, artificial
intelligence and computational linguistics concerned
with the interactions
between computers and human (natural) languages.
• It is related to the area of Human Computer
Interaction (HCI).
Noisy text analytics
• It is the process of extracting structured or semi-
structured information from noisy unstructured text data
such as  online chat, text messages, emails, message
boards, blogs, wikis, etc.
• The noisy unstructured data comprises one or more of
the following:
– Spelling mistakes,
– Acronyms,
– Non-standard words (HBD, K, GN, GM, VGM, etc.),
– Missing punctuation,
– Missing letters and so on.
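A first cleaning step for noisy text is often to expand non-standard words before further analysis. The sketch below assumes a small lookup table; the expansions are guesses at common chat usage and are not taken from the slides.

```python
# Assumed meanings of the abbreviations; adjust for the data at hand.
EXPANSIONS = {
    "hbd": "happy birthday",
    "k": "okay",
    "gn": "good night",
    "gm": "good morning",
    "vgm": "very good morning",
}

def normalize(text):
    """Lower-case the text, drop edge punctuation, expand known abbreviations."""
    words = []
    for token in text.split():
        core = token.strip(".,!?").lower()
        if core:
            words.append(EXPANSIONS.get(core, core))
    return " ".join(words)

cleaned = normalize("HBD Raju! GN")
print(cleaned)
```

Spelling mistakes and missing letters need more than a lookup table (for example, edit-distance matching), which is outside this sketch.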
Manual tagging with metadata
• It is the process of tagging manually with adequate
metadata to provide the semantics to understand
unstructured data.

Part of Speech Tagging
• It is also called POS tagging, POST, or grammatical
tagging.
• It is the process of reading text and tagging each
word in the sentence as belonging to a particular
part of speech such as “noun”, “verb”, “adjective”,
etc.
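The idea of tagging each word with a part of speech can be shown with a toy lookup tagger. The lexicon below is hypothetical and covers only the sample sentence; real taggers use trained statistical models and context, because many words can belong to more than one part of speech.

```python
# Toy lexicon: word -> part of speech. Illustrative only.
LEXICON = {
    "the": "determiner",
    "doctor": "noun",
    "reads": "verb",
    "long": "adjective",
    "report": "noun",
}

def tag(sentence):
    """Return (word, tag) pairs; words missing from the lexicon get 'unknown'."""
    words = sentence.rstrip(".").split()
    return [(word, LEXICON.get(word.lower(), "unknown")) for word in words]

tagged = tag("The doctor reads the long report.")
print(tagged)
```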

Unstructured Information Management
Architecture (UIMA)

• It is an open source platform from IBM, which
integrates different kinds of analysis engines to
provide a complete solution for knowledge
discovery from USD.
• In UIMA, the analysis engines enable integration
and analysis of unstructured information and
bridge the gap between structured and USD.
Uses of UIMA
• Used to convert unstructured data such as repair logs
and service notes into relational tables. These tables
can then be used by automated tools to detect
maintenance or manufacturing problems.
• Used in medical contexts to analyze clinical notes, such
as the Clinical Text Analysis and Knowledge Extraction
System (Apache cTAKES).
• cTAKES is an open-source Natural Language Processing
(NLP) system that extracts clinical information from
electronic health/medical record free text (users
are free to type whatever they want in any form).
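The first use above, turning repair notes into a relational table, can be imitated in plain Python. This sketch does not use UIMA or its API; it only illustrates the idea with a regular expression and sqlite3, and the note format and contents are invented.

```python
import re
import sqlite3

# Hypothetical free-text repair notes.
notes = [
    "2021-03-04 unit 17: replaced fuel pump after pressure drop",
    "2021-03-09 unit 22: tightened loose belt",
    "2021-03-11 unit 17: replaced fuel pump again",
]

# Pattern assumed for this example: date, unit number, then free text.
pattern = re.compile(r"(\d{4}-\d{2}-\d{2}) unit (\d+): (.+)")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repair (day TEXT, unit INTEGER, note TEXT)")
for line in notes:
    match = pattern.match(line)
    if match:
        day, unit, note = match.groups()
        conn.execute("INSERT INTO repair VALUES (?, ?, ?)", (day, int(unit), note))

# Once the data is tabular, an ordinary query can flag recurring problems.
repeat_units = conn.execute(
    "SELECT unit, COUNT(*) FROM repair GROUP BY unit HAVING COUNT(*) > 1"
).fetchall()
print(repeat_units)
```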
UIMA block diagram
• Analysis: USD is acquired from various sources, subjected
to semantic analysis, and transformed into structured
information.
• Delivery: Structured information reaches users through
structured information access and query and presentation.
Big data
• Big Data refers to huge data sets that are orders of magnitude larger
(volume), more diverse, including structured, semi-structured and
unstructured data (variety) and arriving faster (velocity).
• This flood of data is generated by connected devices, from PCs and
smart phones to sensors such as RFID readers and traffic cams. Plus,
it is heterogeneous and comes in many formats, including text,
document, image, video, and more.
• Making sense of all this data is today’s technological challenge.
What is BIG data
• Big data is a term for data sets that are so large or
complex that traditional data processing application
software is inadequate to deal with them.
• Big Data is a phrase used to mean a massive volume of
both structured and unstructured data so large that it is
difficult to process using traditional database and
software techniques.
• Big data challenges include capturing data, data storage,
data analysis, search, sharing, transfer, visualization,
querying, updating and information privacy.
 
Examples Of 'Big Data'

The New York Stock Exchange generates
about one terabyte of new
trade data per day.
 
Examples Of 'Big Data'

Statistics show that 500+ terabytes of new data get ingested into
the databases of the social media site Facebook every day. This data is
mainly generated through photo and video uploads, message
exchanges, comments, etc.
Examples Of 'Big Data'

A single jet engine can generate 10+ terabytes of data in 30 minutes of
flight time. With many thousands of flights per day, the data generated
reaches many petabytes.
Big Data
• Every day, we create 2.5 quintillion (10^18) bytes of data; 90%
of the data in the world today has been created in the last
two years alone. This data comes from everywhere: sensors
used to gather climate information, posts to social media
sites, digital pictures and videos, purchase transaction
records, cell phone GPS signals, WhatsApp, IoT and so
on.
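To put 2.5 quintillion bytes per day into more familiar units, the short calculation below converts it using decimal (SI) prefixes. The choice of decimal units is an assumption; binary units (2^40 bytes per tebibyte) would give slightly smaller numbers.

```python
# Decimal (SI) units: 1 TB = 10**12 bytes, 1 EB = 10**18 bytes.
BYTES_PER_DAY = 2.5e18
TERABYTE = 10**12
EXABYTE = 10**18
SECONDS_PER_DAY = 24 * 60 * 60

per_day_eb = BYTES_PER_DAY / EXABYTE         # exabytes created per day
per_day_tb = BYTES_PER_DAY / TERABYTE        # terabytes created per day
per_second_tb = per_day_tb / SECONDS_PER_DAY  # terabytes created per second

print(per_day_eb, per_day_tb, round(per_second_tb, 1))
```

In other words, the quoted figure is 2.5 exabytes per day, or roughly 29 terabytes every second.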
Characteristics of Data
• Composition: Deals with structure of data, i.e.,
sources of data, the granularity, the types, nature
of data (Static or real-time).
• Condition: Deals with the state of data, that is,
“Can one use data as it is for analysis?” or “Does it
require cleansing for further enhancement and
enrichment?”.
Characteristics of Data
• Context: Deals with
– Where has this data been generated?
– Why was this data generated?
– How sensitive is this data?
– What are the events associated with this data?
– And so on.
Evolution of Big data
Big data definition- Gartner
• Big data is high-volume, high-velocity, and high-
variety information assets that demand cost
effective, innovative forms of information
processing for enhanced insight and decision
making.
• Cost effective and innovative forms of
information processing: Talks about embracing
new techniques and technologies to capture,
store, process, persevere, integrate and visualize
the big data(3vs).
Definition of Big data by Gartner
• Enhanced insight and decision making: Talks
about deriving deeper, richer, and meaningful
insights and then using these insights to make
faster and better decisions to gain business value
and thus a competitive edge.
Big data formula

DATA → Information → Actionable Intelligence → Better Decisions
→ Enhanced Business Value
Challenges with Big Data
• Capture
• Storage (Solution: Cloud Computing)
• Curation (Management of data + Data retention)
• Search
• Analysis
• Transfer
• Visualization
• Privacy violations
3 Vs
3 V’s of Big data
• The data that is big in Volume, Velocity and
Variety is known as big data.
Sources of big data
• Archives: Archives of scanned documents,
customer correspondence records, patients’
health records, students’ admission records,
students’ assessment records and so on.
• Sensor data: Car sensors, smart electric meters,
office buildings, washing machines, other electronic
appliances and so on.
• Machine log data: Event logs, application logs,
audit logs, server logs, etc.
Sources of big data
• Public web: Wikipedia, Weather, regulatory, census, etc.
• Data storage: File systems, SQL databases, NoSQL
databases (MongoDB, Cassandra) and so on.
• Media: Audio, Video, image, etc.
• Docs: CSV, word docs, PDF, PPT, XLS, etc.
• Business Apps: ERP, CRM, HR, Google Docs, etc.
• Social media: Twitter, blogs, Facebook, LinkedIn,
YouTube, Instagram, etc.
• IoT
Other characteristics of big data
• Veracity and Validity: Refers to the accuracy
(quality) and correctness of the data.
• Volatility: Deals with how long the data is valid
and how long it should be stored (e.g., OTP, Aadhaar
number, password).
• Variability: Data flows can be highly inconsistent
with periodic peaks.
• In total, these make up the 7 Vs of big data.
Why Big data

More data
→ More accurate analysis
→ More confidence in decision making
→ Greater operational efficiency, cost reduction, time
reduction, new product development, optimized
offerings, etc.
Three reasons for leveraging big data

1. Competitive Advantage.
2. Decision making
3. To create new business value out of data.
Typical data warehouse Environment
Typical Hadoop Environment
• It is different from a DW environment.
• Here data sources are web logs, images, audios,
videos, social media, doc files, pdfs, etc.
Hadoop Environment
Big data & DW coexistence
Big data & DW coexistence
