Unit 4 Digital Data
Data provides information from which meaningful insights can be derived.
[Figure: sources of digital data (we speak, we move, sensors, computers, documents)]
• Data growth has seen exponential acceleration since the advent of the
computer and Internet.
Unstructured
This is the data which does not conform to a data model or is not in a form
which can be used easily by a computer program.
Example: images, videos, letters, text, PDFs, social media posts, the body of an
email, log files, PowerPoint presentations, etc.
Classification of Digital Data contd..
Semi-structured data is data which does not conform to a data model but has some
structure.
It is not in a form which can be used easily by a computer program. It has a
self-describing structure and uses tags to separate semantic elements.
Metadata for this data is available but is not sufficient.
Structured data
This is the data which is in an organized form (e.g., in rows and columns) and
can be easily used by a computer program.
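As a minimal sketch of what "rows and columns" means in practice, the following uses Python's built-in sqlite3 module. The `employee` table, its columns, and the sample rows are hypothetical and only serve as an illustration. Because every row follows the same schema, a program can filter on any column directly.

```python
import sqlite3

# In-memory database; the `employee` table and its contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee ("
    "emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL, dept TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO employee (emp_id, name, dept) VALUES (?, ?, ?)",
    [(1, "Asha", "Sales"), (2, "Ravi", "Finance"), (3, "Meera", "Sales")],
)

# Every row has the same columns, so a simple query can filter on any of them.
sales_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM employee WHERE dept = ? ORDER BY emp_id", ("Sales",)
    )
]
print(sales_names)
conn.close()
```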
Since the 1980s, enterprise data has been stored in RDBMSs, which hold structured data.
Later, with the Internet connecting the world, data became an integral part of
daily transactions.
Not all of this data is structured: almost 80% of the data generated in any
enterprise today is unstructured.
Roughly 10% of data falls in the structured and semi-structured categories.
If the data is structured, an RDBMS can be used to house it, e.g., Oracle (Oracle Corp.),
DB2 (IBM), Microsoft SQL Server (Microsoft), Greenplum (EMC), Teradata (Teradata),
MySQL (open source), or PostgreSQL (advanced open source).
These databases are typically used to hold transaction/operational data generated and
collected by day-to-day business activities.
In other words, the data of the On-Line Transaction Processing (OLTP) systems are
generally quite structured.
Ease of Working with Structured Data
3. Indexing: An index is a data structure that speeds up data retrieval operations
(primarily the SELECT DML statement) at the cost of additional writes and storage
space. The gains in search performance usually justify that cost.
4. Scalability: The storage and processing capabilities of the traditional RDBMS can be
easily scaled up by increasing the horsepower of the database server (increasing the
primary and secondary or peripheral storage capacity, processing capacity of the
processor, etc.)
5. Transaction processing:
An RDBMS supports the Atomicity, Consistency, Isolation, and Durability (ACID)
properties of transactions.
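The indexing and transaction-processing points above can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions, not a description of any particular production RDBMS. The `account` table, the index name `idx_account_owner`, and the `transfer` helper are all hypothetical. The transaction part mainly shows atomicity: a failed transfer leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table used only for this illustration.
conn.execute(
    "CREATE TABLE account (acc_id INTEGER PRIMARY KEY, owner TEXT, balance INTEGER)"
)
conn.executemany(
    "INSERT INTO account (acc_id, owner, balance) VALUES (?, ?, ?)",
    [(1, "Asha", 500), (2, "Ravi", 300)],
)
conn.commit()

# Indexing: an index on `owner` lets lookups by owner avoid a full table scan,
# at the cost of extra storage and extra work on every write.
conn.execute("CREATE INDEX idx_account_owner ON account (owner)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT balance FROM account WHERE owner = ?", ("Asha",)
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)  # the plan description is expected to name the index


def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE acc_id = ?",
                (amount, src),
            )
            (src_balance,) = conn.execute(
                "SELECT balance FROM account WHERE acc_id = ?", (src,)
            ).fetchone()
            if src_balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE acc_id = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


ok = transfer(conn, 1, 2, 200)       # succeeds and is committed
failed = transfer(conn, 1, 2, 1000)  # would overdraw, so it is rolled back
balances = dict(conn.execute("SELECT acc_id, balance FROM account"))
print(ok, failed, balances)
```

Using the connection as a context manager keeps the commit and rollback logic in one place, so the debit can never persist without the matching credit.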
• Semi-structured data uses tags and markers to separate semantic elements, to group
data, and to describe how data is stored. This gives some metadata, but not enough for
management and automation of data.
• Similar entities in the data are grouped and organized in a hierarchy.
• In semi-structured data, entities belonging to the same class and also grouped together
need not necessarily have the same set of attributes.
Example: Two addresses, Address 1 and Address 2, may or may not contain the same
number of properties.
• Even if they have the same set of attributes, the order of the attributes
may differ, and for all practical purposes the order is not important.
• In an e-mail, for example, the tags give us some metadata, but the body has no
fixed format, nor does it have a structure that conveys the meaning of the data it contains.
Sources of Semi-structured Data
• Amongst the sources for semi-structured data, the front runners are XML and
JSON.
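The characteristics listed above can be sketched with Python's standard json and xml.etree.ElementTree modules. The two address records and their field names are hypothetical and invented for illustration. The sketch shows entities of the same class carrying different attribute sets, attribute order not mattering, and XML tags acting as a self-describing structure.

```python
import json
import xml.etree.ElementTree as ET

# Two hypothetical address records: same kind of entity, different attributes,
# and the shared attributes appear in a different order.
json_text = """
[
  {"street": "12 Park Road", "city": "Pune", "pin": "411001"},
  {"city": "Mysuru", "landmark": "Near Clock Tower",
   "street": "4 Lake View", "state": "Karnataka"}
]
"""
addresses = json.loads(json_text)

attribute_sets = [set(address) for address in addresses]
shared = attribute_sets[0] & attribute_sets[1]
only_in_second = attribute_sets[1] - attribute_sets[0]
print(sorted(shared))          # attributes both addresses have
print(sorted(only_in_second))  # attributes only the second address has

# The first record again, this time as XML: each tag names the element it
# wraps, so the data carries its own (limited) description.
xml_text = (
    "<address><street>12 Park Road</street>"
    "<city>Pune</city><pin>411001</pin></address>"
)
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}
print(from_xml == addresses[0])  # same content, recovered through the tags
```

Note that nothing in either format says what a `pin` or a `landmark` means or which fields are mandatory. That is the sense in which the metadata is available but not sufficient.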
The following techniques are used to find the patterns in or interpret unstructured data.
Text Analytics: Text mining is the process of gleaning high-quality and meaningful
information from text. It includes tasks such as text categorization, text clustering,
sentiment analysis, and concept/entity extraction.
Manual Tagging with metadata: This is about tagging manually with adequate metadata
to provide the requisite semantics to understand unstructured data.
Part-of-Speech Tagging: POST is the process of reading text and tagging each word in
a sentence with its part of speech, such as noun, verb, or adjective.
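To make two of these techniques concrete, here is a deliberately tiny sketch of sentiment analysis and part-of-speech tagging that uses only dictionary lookups. The word lists, the tag names, and the helper functions are hypothetical. Real text analytics systems rely on trained models or much larger lexicons, so treat this as an illustration of the idea rather than a usable tagger.

```python
import re

# Hypothetical, tiny word lists chosen only for this illustration.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "slow"}
POS_LEXICON = {
    "the": "DET", "service": "NOUN", "delivery": "NOUN",
    "was": "VERB", "is": "VERB", "great": "ADJ", "slow": "ADJ",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Count positive words minus negative words (a crude sentiment signal)."""
    tokens = tokenize(text)
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def pos_tag(text):
    """Tag each token by dictionary lookup; unknown words get 'UNK'."""
    return [(token, POS_LEXICON.get(token, "UNK")) for token in tokenize(text)]


print(sentiment_score("Great product, excellent support"))
print(sentiment_score("Terrible and slow"))
print(pos_tag("The delivery was slow"))
```

A lookup table cannot resolve words whose part of speech depends on context, which is why practical taggers use statistical or neural models instead.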