Unit - I - Types of Digital Data
Dr.H.E.Khodke
What is Data?
• Data is any set of characters that has been gathered and
translated for some purpose, usually analysis.
• It can be any character, including text and numbers, pictures,
sound, or video.
What is Digital Data?
• Digital data are discrete, discontinuous representations of
information or work.
• Digital data is represented in binary form, as sequences of 0s and 1s.
Types of Digital Data
1. Unstructured Data
2. Semi-Structured Data
3. Structured Data
Structured Data
• Refers to any data that resides in a fixed field within a record or file.
• Usually held in relational databases, which support the ACID properties
(atomicity, consistency, isolation, durability).
• Structured data has the advantage of being easily entered, stored,
queried and analyzed.
• Structured data represents only about 5 to 10% of all data.
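As a minimal sketch of fixed fields and transactional behaviour, the example below uses Python's built-in sqlite3 module. The students table, its columns, and the sample rows are hypothetical and chosen only for illustration. The failed transaction shows atomicity, one of the ACID properties: either every statement in it takes effect or none does.

```python
import sqlite3

# Hypothetical table: every record has the same fixed fields.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students ("
    "roll_no INTEGER PRIMARY KEY, name TEXT NOT NULL, marks INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Asha", 82), (2, "Ravi", 67)],
)
conn.commit()

# Atomicity: the second statement violates the primary key, so the
# whole transaction (including the earlier UPDATE) is rolled back.
try:
    with conn:
        conn.execute("UPDATE students SET marks = marks + 5 WHERE roll_no = 1")
        conn.execute("INSERT INTO students VALUES (2, 'Duplicate', 50)")
except sqlite3.IntegrityError:
    pass

rows = conn.execute(
    "SELECT roll_no, name, marks FROM students ORDER BY roll_no"
).fetchall()
print(rows)
```

Because the table has a fixed schema, the final query can name its fields directly, and the marks for roll number 1 are unchanged after the failed transaction.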
Unstructured Data
• Unstructured data is all those things that can't be so readily
classified and fit into a neat box.
• Unstructured data represents around 80% of data.
• Techniques for analyzing it include data mining (association rules,
regression analysis), text mining, and natural language processing (NLP).
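To make the idea of text mining concrete, here is a very small sketch that counts word frequencies in free text using only the Python standard library. The sample sentence and the stop-word list are made up for illustration, and real text mining involves far more than counting words.

```python
import re
from collections import Counter

# Hypothetical piece of unstructured text.
text = "Big data is big. Data moves fast, and big data is varied."

# Tokenize: lower-case the text and keep runs of letters.
tokens = re.findall(r"[a-z]+", text.lower())

# Drop a tiny, illustrative stop-word list before counting.
stop_words = {"is", "and"}
freq = Counter(t for t in tokens if t not in stop_words)

print(freq.most_common(3))
```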
Semi-Structured Data
• Semi-structured data is a cross between the two. It has some structure,
but lacks the strict data model of structured data.
• Semi-structured data is information that doesn’t reside in a
relational database but that does have some organizational
properties that make it easier to analyze.
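JSON is a common example of semi-structured data. The sketch below, with made-up records, shows the two sides of the definition: the field names give the data some organization, but the records do not all share the same fields, so there is no fixed schema.

```python
import json

# Hypothetical records: tagged fields, but no fixed schema.
raw = """
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phones": ["555-0100", "555-0101"]},
  {"name": "Meera", "email": "meera@example.com", "address": {"city": "Pune"}}
]
"""
records = json.loads(raw)

# The tags make the data easy to query...
emails = [r.get("email") for r in records]

# ...but the set of fields varies from record to record.
all_keys = sorted({key for r in records for key in r})

print(emails)
print(all_keys)
```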
Characteristics of Data
• Composition - What are the structure, type, and nature of the
data?
• Condition - Can the data be used as it is, or does it need to be
cleansed?
• Context - Where was this data generated? Why? How sensitive
is it? What events are associated with it?
What is Big Data?
• Collection of data sets so large and complex that it becomes
difficult to process using on-hand database management tools
or traditional data processing applications.
What is Big Data? (Cont.)
• The data is too big, moves too fast, or doesn’t fit the structures
of your database architectures.
• The scale, diversity, and complexity of the data require new
architecture, techniques, algorithms, and analytics to manage it
and extract value and hidden knowledge from it.
• Big data is the realization of greater business intelligence by
storing, processing, and analyzing data that was previously
ignored due to the limitations of traditional data management
technologies.
Why Big Data? What Makes Big Data?
• Key enablers for the growth of “Big Data” are the availability of data
and the following technologies and concepts:
• a. In-memory analytics
• b. In-database processing
• c. Symmetric multi-processor system
• d. Massively parallel processing
• e. Shared nothing architecture
• f. CAP theorem
In-Memory Analytics
• Accessing data from non-volatile storage such as a
hard disk is slow. In-memory analytics addresses this
problem: all the relevant data is stored in random
access memory (RAM), or primary storage, which
eliminates the need to read the data from the hard
disk. The advantages are faster access, rapid
deployment, better insights, and minimal IT
involvement.
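The idea can be sketched in a few lines of Python: read the data from disk once, keep it in RAM, and answer every later question from the in-memory copy. The file name sales.csv, its columns, and the total_for helper are hypothetical, and a production in-memory analytics system is far more elaborate than this.

```python
import csv
import os
import tempfile

# Create a small sample file on disk (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), "sales.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["region", "amount"])
    writer.writerows([["north", 120], ["south", 80], ["north", 50]])

# Read from disk once and keep the rows in RAM.
with open(path, newline="") as f:
    sales = [
        {"region": row["region"], "amount": int(row["amount"])}
        for row in csv.DictReader(f)
    ]

def total_for(region):
    """Answer a query from the in-memory rows, without touching the disk."""
    return sum(row["amount"] for row in sales if row["region"] == region)

north_total = total_for("north")
grand_total = sum(row["amount"] for row in sales)
print(north_total, grand_total)
```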
In-Database Processing
• In-database processing is also called in-database
analytics. It works by fusing data warehouses with
analytical systems. Typically, the data from various
enterprise OLTP systems is cleaned up through the
ETL process and stored in the enterprise data
warehouse or in data marts. In the traditional
approach, the huge data sets are then exported to
analytical programs for complex and extensive
computations. With in-database processing, the
database itself runs the computations, which
removes the export step.
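The difference between exporting data and computing inside the database can be sketched with Python's built-in sqlite3 module. The sales table and its rows are hypothetical. Both approaches give the same totals, but the second moves only the aggregated result out of the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120), ("south", 80), ("north", 50)],
)

# Export approach: pull every row out, then compute in the application.
exported = conn.execute("SELECT region, amount FROM sales").fetchall()
totals_exported = {}
for region, amount in exported:
    totals_exported[region] = totals_exported.get(region, 0) + amount

# In-database approach: the database engine performs the aggregation
# and returns only one row per region.
totals_in_db = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)

print(totals_exported, totals_in_db)
```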
Symmetric Multi-Processor System
• In an SMP system, a single common main memory
is shared by two or more identical processors.
The processors have full access to all I/O devices
and are controlled by a single operating system
instance.
• SMP systems are tightly coupled multiprocessor
systems. Each processor has its own high-speed
memory, called cache memory, and the processors
are connected by a system bus.
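The shared-memory model can be illustrated, loosely, with threads in a single Python process: all workers read and write the same memory, so access to shared data has to be coordinated. This is only an analogy. Threads stand in for processors here, and CPython's global interpreter lock means the threads do not actually run Python code in parallel. The names shared_counter and worker are hypothetical.

```python
import threading

# One piece of memory visible to every worker.
shared_counter = {"value": 0}
lock = threading.Lock()

def worker(increments):
    """Update the shared memory; the lock prevents lost updates."""
    for _ in range(increments):
        with lock:
            shared_counter["value"] += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_counter["value"])
```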
Massively Parallel Processing
• Massively parallel processing (MPP) refers to the
coordinated processing of a program by a number
of processors working in parallel. Each processor
has its own operating system and dedicated memory,
and the processors work on different parts of the
same program. They communicate through a
messaging interface.
• MPP differs from symmetric multiprocessing (SMP)
in that SMP processors share the same operating
system and the same memory. SMP is also referred
to as tightly coupled multiprocessing.
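The message-passing pattern described above can be simulated in a few lines. In this sketch each worker receives its part of the problem as a message, works only on that private copy, and sends its partial result back as another message. Threads and queue.Queue are used purely to keep the example small and portable; a real MPP system runs on separate processors with their own memory. The node function and the chunking scheme are hypothetical.

```python
import queue
import threading

def node(inbox, outbox):
    """A simulated MPP node: receive a chunk, compute, send the result."""
    chunk = inbox.get()      # data arrives as a message
    outbox.put(sum(chunk))   # the result leaves as a message

data = list(range(1, 101))
chunks = [data[i::4] for i in range(4)]  # four parts of the same problem

results = queue.Queue()
workers = []
for chunk in chunks:
    inbox = queue.Queue()
    t = threading.Thread(target=node, args=(inbox, results))
    t.start()
    inbox.put(chunk)
    workers.append(t)

for t in workers:
    t.join()

# The coordinator combines the partial results.
partial_sums = [results.get() for _ in workers]
total = sum(partial_sums)
print(total)
```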
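Shared Nothing Architecture
• The three most common types of architecture for multiprocessor
systems are:
• 1. Shared memory
• 2. Shared disk
• 3. Shared nothing
• In a shared nothing architecture, each node has its own memory and
disk, and the nodes communicate only over the network.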
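A shared nothing design is often paired with partitioning: each key is owned by exactly one node, which keeps that data in its own private storage. The sketch below assumes a simple hash-partitioning scheme. The Node and Cluster classes, the node names, and the keys are all hypothetical. zlib.crc32 is used instead of Python's built-in hash() so that the placement is the same on every run.

```python
import zlib

class Node:
    """A node with private storage that no other node can access."""
    def __init__(self, name):
        self.name = name
        self.storage = {}

class Cluster:
    """Routes each key to the single node that owns it."""
    def __init__(self, node_names):
        self.nodes = [Node(name) for name in node_names]

    def _owner(self, key):
        index = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def put(self, key, value):
        self._owner(key).storage[key] = value

    def get(self, key):
        return self._owner(key).storage.get(key)

cluster = Cluster(["node-0", "node-1", "node-2"])
for i in range(9):
    cluster.put(f"user:{i}", {"id": i})

stored_per_node = [len(node.storage) for node in cluster.nodes]
print(stored_per_node)
```

Because no storage is shared, every key lives on exactly one node and the counts across nodes add up to the number of keys written.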
CAP Theorem
• The CAP theorem states that a distributed system can provide at most two of
the following three guarantees at the same time:
• 1. Consistency implies that every read fetches the last write. All nodes see
the same data at the same time. If there are multiple replicas and an update is
being processed, all users see the update go live at the same time, even if
they are reading from different replicas.
• 2. Availability implies that reads and writes always succeed. It is a
guarantee that every request receives a response about whether it was
successful or failed.
• 3. Partition tolerance implies that the system continues to function when a
network partition occurs. The system continues to operate despite arbitrary
message loss or failure of part of the system.
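The trade-off between consistency and availability during a network partition can be shown with a toy model. The TwoReplicaStore class below is entirely hypothetical: it has two replicas, accepts writes at replica a, and has a flag that simulates a partition between the replicas. In "CP" mode it refuses writes it cannot replicate, which gives up availability. In "AP" mode it accepts them locally, which gives up consistency because replica b becomes stale.

```python
class Replica:
    def __init__(self):
        self.value = None

class TwoReplicaStore:
    """Toy model: writes arrive at replica a, reads are served by replica b."""
    def __init__(self, mode):
        self.mode = mode            # "CP" or "AP" (illustrative labels)
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False    # True simulates a network partition

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                return False        # refuse: the write cannot reach b
            self.a.value = value    # AP: accept locally, b is now stale
            return True
        self.a.value = value
        self.b.value = value        # normal case: replicate to b
        return True

    def read_from_b(self):
        return self.b.value

cp = TwoReplicaStore("CP")
cp.write("v1")
cp.partitioned = True
cp_write_ok = cp.write("v2")        # rejected, replicas stay in agreement

ap = TwoReplicaStore("AP")
ap.write("v1")
ap.partitioned = True
ap_write_ok = ap.write("v2")        # accepted, but b still holds the old value

print(cp_write_ok, cp.read_from_b(), ap_write_ok, ap.read_from_b())
```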
Thank You