Unit 1: Data Visualization Techniques
Basics:
What Is Visualization?
Visualization is the communication of information using graphical
representations. Pictures have been used as a mechanism for communication
since before the formalization of written language. A single picture can contain a
wealth of information, and can be processed much more quickly than a comparable
page of words.
This is because image interpretation is performed in parallel within the human
perceptual system, while the speed of text analysis is limited by the sequential
process of reading. Pictures can also be independent of local language, as a graph or
a map may be understood by a group of people with no common tongue.
Data visualization is a quick, easy way to convey concepts universally, and you
can experiment with different scenarios by making slight adjustments. Common
techniques include:
Infographics
Bubble Clouds
Bullet Graphs
Heat Maps
Fever Charts
Time Series Charts
Line charts. This is one of the most basic and common techniques used. Line charts
display how variables can change over time.
Area charts. This visualization method is a variation of a line chart; it displays multiple
values in a time series -- or a sequence of data collected at consecutive, equally
spaced points in time.
Scatter plots. This technique displays the relationship between two variables. A
scatter plot takes the form of an x- and y-axis with dots to represent data points.
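As a quick illustration of these three techniques, here is a minimal Python sketch using matplotlib; the monthly figures are made up for demonstration.

import matplotlib.pyplot as plt

months = list(range(1, 13))
sales = [12, 14, 13, 17, 19, 22, 25, 24, 21, 18, 15, 13]
costs = [8, 9, 9, 11, 12, 14, 16, 15, 14, 12, 10, 9]

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
ax1.plot(months, sales)              # line chart: change over time
ax1.set_title("Line chart")
ax2.stackplot(months, sales, costs)  # area chart: multiple series over time
ax2.set_title("Area chart")
ax3.scatter(sales, costs)            # scatter plot: relationship of two variables
ax3.set_title("Scatter plot")
plt.tight_layout()
plt.show()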
These user interface components are often input via dialog boxes, but
they could be visual representations of the data to facilitate the selections
required by the user. Visualizations can provide mechanisms for translating
data and tasks into more visual and intuitive formats for users to perform
their tasks.
This means that the data values themselves, or perhaps the attributes of
the data, are used to define graphical objects, such as points, lines, and
shapes; and their attributes, such as size, position, orientation, and color.
There are three different processes or pipelines; all start with data and end
with the user:
1. Computer Graphics
2. Visualization
3. Knowledge Discovery
In the computer graphics pipeline, the color of each polygon depends on the
location(s), color, and intensity of the light source(s), the degree of occlusion
from direct light exposure, and the amount and color of light being reflected
off of other objects onto the polygon; the rendering stage produces
shaded polygons.
Data: In the knowledge discovery (KD) pipeline there is more focus on the data,
as the graphics and visualization pipelines often assume that the data is
already structured to facilitate its display.
The following pseudocode renders a scatter plot of circles. Records are
represented in the scatter plot as circles of varying location, color, and size.
The x- and y-axes represent data from dimension numbers xDim and
yDim, respectively. The color of the circles is derived from dimension number
cDim. The radius of the circles is derived from dimension number rDim, as
well as from the upper and lower bounds for the radius, rMin and rMax.
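The pseudocode itself is not reproduced here; a minimal Python realization of the description might look like the following (the function name render_scatterplot and the sample records are assumptions for illustration).

import matplotlib.pyplot as plt

def render_scatterplot(data, xDim, yDim, cDim, rDim, rMin, rMax):
    xs = [rec[xDim] for rec in data]
    ys = [rec[yDim] for rec in data]
    cs = [rec[cDim] for rec in data]   # circle color derived from dimension cDim
    # Normalize dimension rDim into [rMin, rMax] to obtain each circle's radius.
    lo = min(rec[rDim] for rec in data)
    hi = max(rec[rDim] for rec in data)
    rs = [rMin + (rec[rDim] - lo) / ((hi - lo) or 1) * (rMax - rMin) for rec in data]
    # matplotlib's s parameter is the marker area, so square the radius.
    plt.scatter(xs, ys, c=cs, s=[r * r for r in rs], cmap="viridis")
    plt.show()

# Records with four dimensions (x, y, color value, radius value):
records = [(1, 2, 0.3, 10), (2, 5, 0.9, 40), (3, 3, 0.5, 25), (4, 6, 0.1, 5)]
render_scatterplot(records, xDim=0, yDim=1, cDim=2, rDim=3, rMin=5, rMax=20)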
Std Id    Marks
101       40
102       50
103       60
104       55
105       55
106       70
107       60
108       50
[Figure: "Student Marks" chart plotting Marks (y-axis, 0 to 80) against Student id (x-axis, 100 to 112).]
Data foundation:
Data comes from many sources; it can be gathered from sensors or surveys, or
it can be generated by simulations and computations.
Data can be raw (untreated), or it can be derived from raw data via some
process, such as smoothing, noise removal, scaling, or interpolation. It also can have a
wide range of characteristics and structures.
A typical data set used in visualization consists of a list of n records,
(r1, r2, ..., rn). Each record ri consists of m (one or more) observations or
variables, (v1, v2, ..., vm).
A variable may be classified as either independent or dependent. An independent
variable is one whose value is not controlled or affected by another variable,
such as the time variable in a time-series data set, while a dependent variable
is one whose value is affected by a change in some independent variable.
Types of Data:
In its simplest form, each observation or variable of a data record represents
a single piece of information. We can categorize this information as being
ordinal (numeric) or nominal (nonnumeric). Subcategories of each can be
readily defined.
Beyond these categories, variables can be characterized by three further properties:
An ordering relation, with which the data can be ordered in some fashion. By
definition, ranked nominal variables and all ordinal variables exhibit this relation.
A distance metric, with which the distances can be computed between different
records. This measure is clearly present in all ordinal variables, but is generally
not found in nominal variables.
Existence of an absolute zero, in which variables may have a fixed lowest value.
This is useful for differentiating types of ordinal variables. A variable such as
weight possesses an absolute zero, while bank balance does not. A variable
possesses an absolute zero if it makes sense to apply all four mathematical
operations (+, −, ×, ÷) to it [129].
Data sets have structure, both in terms of the means of representation (syntax)
and the types of interrelationships within a given record and between
records (semantics). The data records can be considered in the following ways:
Scalars
Vectors
Tensors
Vector: Multiple variables within a single record can represent a composite data item.
For example, a point in a two-dimensional flow field might be represented by a pair of
values, such as a displacement in x and y. This pair, and any such composition, is
referred to as a vector.
While each component of a vector might be examined individually, it is most
common to treat the vector as a whole.
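A small sketch of this idea, assuming a made-up 3 x 3 grid of 2D displacement vectors stored with NumPy; the magnitude computation treats each vector as a whole.

import numpy as np

# Each record holds a displacement (dx, dy); the values are illustrative.
flow = np.array([
    [[0.1, 0.0], [0.2, 0.1], [0.1, 0.2]],
    [[0.0, 0.1], [0.1, 0.1], [0.2, 0.2]],
    [[0.1, 0.2], [0.0, 0.2], [0.1, 0.1]],
])

magnitudes = np.linalg.norm(flow, axis=-1)  # operate on whole vectors
print(magnitudes)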
Data in Geometry and Grid Form: Geometric structure can commonly be found in
data sets, especially those from scientific and engineering domains. The simplest
method of incorporating geometric structure in a data set is to have explicit coordinates
for each data record.
It is assumed that some form of grid exists, and the data set is structured such
that successive data records are located at successive locations on the grid.
It would be sufficient to indicate a starting location, orientation, and the step size
horizontally and vertically. There are many different coordinate systems that are used
for grid-structured data, including Cartesian, spherical, and hyperbolic coordinates.
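As a sketch of such an implicit grid, the following fragment derives Cartesian coordinates from an assumed starting location and step sizes instead of storing a coordinate for each record.

import numpy as np

x0, y0 = 0.0, 0.0   # starting location
dx, dy = 0.5, 0.25  # horizontal and vertical step sizes
nx, ny = 4, 3       # number of grid points in each direction

# Coordinates of every grid point, derived rather than stored explicitly.
xs = x0 + dx * np.arange(nx)
ys = y0 + dy * np.arange(ny)
X, Y = np.meshgrid(xs, ys)
print(X)
print(Y)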
Data Preprocessing:
Data preprocessing is a Data Mining method that entails converting raw data into
a format that can be understood. Real-world data is frequently inadequate, inconsistent,
and/or lacking in specific activities or trends, as well as including numerous inaccuracies.
This might result in low-quality data collection and, as a result, low-quality models
based on that data. Preprocessing data is a method of resolving such problems.
Machines do not comprehend free text, image, or video data; instead, they comprehend
1s and 0s.
Data Preprocessing is the step in any Machine Learning process in which the
data is changed, or encoded, to make it easier for the machine to parse it. In other
words, the algorithm can now easily interpret the data’s features.
1. Data Cleaning/Cleansing,
2. Data Integration,
3. Data Transformation,
4. Data Reduction.
1. Data Cleaning:
Data in the real world is frequently incomplete, noisy, and inconsistent. Many bits
of the data may be irrelevant or missing. Data cleaning is carried out to handle this
aspect. Data cleaning methods aim to fill in missing values, smooth out noise while
identifying outliers, and fix data discrepancies. Unclean data can confuse the
model. Therefore, running the data through various data cleaning/cleansing methods is
an important data preprocessing step.
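A minimal cleaning sketch with pandas follows; the table, the fill strategies, and the outlier rule are illustrative assumptions rather than a prescribed recipe.

import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 31, 29, 250],  # None is missing; 250 is an entry error
    "income": [40000, 52000, None, 48000, 51000],
})

df["age"] = df["age"].fillna(df["age"].median())      # fill missing values
df["income"] = df["income"].fillna(df["income"].mean())
df = df[df["age"].between(0, 120)]                    # drop an impossible outlier
df = df.drop_duplicates()                             # remove duplicate records
print(df)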
2. Data Integration:

Data Source-1 ──┐
Data Source-2 ──┼──> Data Integration ──> Unified View
Data Source-n ──┘
Data integration combines data from multiple sources into a coherent data
store; these sources may include multiple databases. How can the data be
matched up? For example, a data analyst finds Customer_ID in one database
and cust_id in another; how can he be sure these two belong to the same
entity? Databases and data warehouses have metadata (data about data),
which helps in avoiding such errors.
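A minimal sketch of this matching step with pandas, reusing the Customer_ID and cust_id names from the example above (the sample rows are made up):

import pandas as pd

db1 = pd.DataFrame({"Customer_ID": [1, 2, 3], "name": ["Ann", "Bo", "Cy"]})
db2 = pd.DataFrame({"cust_id": [1, 2, 4], "balance": [100, 250, 80]})

# Metadata tells us Customer_ID and cust_id describe the same entity,
# so rename one column and merge the sources into a unified view.
unified = db1.merge(db2.rename(columns={"cust_id": "Customer_ID"}),
                    on="Customer_ID", how="outer")
print(unified)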
3. Data Transformation:
This stage is used to convert the data into a format that can be used in the mining
process. This is done in the following ways (a short sketch in code follows the list):
1. Normalization: It is done to scale the data values into a specified range, such
as -1.0 to 1.0 or 0.0 to 1.0.
2. Concept Hierarchy Generation: Using concept hierarchies, low-level or
primitive/raw data is substituted with higher-level concepts in data generalization.
A categorical attribute such as street, for example, can be generalized to
higher-level notions such as city or nation. Similarly, values of a numeric
attribute such as age can be translated to higher-level concepts such as
youthful, middle-aged, or elderly.
3. Smoothing: Smoothing works to remove the noise from the data. Such
techniques include binning, clustering, and regression.
4. Aggregation: Aggregation is the process of applying summary or aggregation
operations on data. Daily sales data, for example, might be combined to calculate
monthly and annual totals.
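A short pandas sketch of three of these transformations, min-max normalization, smoothing by weekly binning, and aggregation of daily sales into monthly totals, on made-up values:

import pandas as pd

sales = pd.DataFrame({
    "day":    pd.date_range("2023-01-01", periods=90, freq="D"),
    "amount": [100 + (i % 7) * 10 for i in range(90)],  # illustrative values
})

# Normalization: rescale amounts into the range 0.0 to 1.0 (min-max).
amt = sales["amount"]
sales["amount_norm"] = (amt - amt.min()) / (amt.max() - amt.min())

# Smoothing by binning: replace each value with the mean of its weekly bin.
weeks = sales["day"].dt.isocalendar().week
sales["amount_smooth"] = sales.groupby(weeks)["amount"].transform("mean")

# Aggregation: combine daily sales into monthly totals.
monthly = sales.resample("MS", on="day")["amount"].sum()
print(monthly)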
4. Data Reduction:
Data mining is a methodology for dealing with large amounts of data, and
analysis becomes more difficult as the volume grows. Data reduction techniques
are employed to address this; the goal is to improve storage efficiency while
lowering data storage and analysis expenses.
1. Dimensionality Reduction: The number of attributes under consideration is
reduced, for example by removing irrelevant or redundant attributes or by
projecting the data into a smaller space.
2. Numerosity Reduction:
Data is replaced or estimated using alternative and smaller data representations
such as parametric models (which store only the model parameters rather than the
actual data, such as Regression and Log-Linear Models) or non-parametric
approaches (e.g. Clustering, Sampling, and the use of histograms).
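A minimal NumPy sketch of the two non-parametric approaches named above, sampling and histograms, on synthetic data:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=100_000)  # the "large" data set

# Sampling: keep a small random subset instead of every record.
sample = rng.choice(data, size=1_000, replace=False)

# Histogram: store only bin edges and counts, not the raw values.
counts, edges = np.histogram(data, bins=20)
print(sample.mean(), counts.sum())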
Preprocessing is not limited to tabular data; text and image data require their own techniques.
There are 4 different types of Image Pre-Processing techniques and they are listed
below.
1. Pixel brightness transformations/ Brightness corrections
2. Geometric Transformations
3. Image Filtering and Segmentation
4. Fourier transform and image restoration
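As an illustration of the first technique in the list, a pixel brightness transformation computes each output pixel from the corresponding input pixel alone; a minimal NumPy sketch of gamma correction on a tiny made-up array:

import numpy as np

image = np.array([[0, 64, 128],
                  [128, 192, 255]], dtype=np.uint8)

gamma = 0.5                   # gamma < 1 brightens, gamma > 1 darkens
normalized = image / 255.0    # map pixel values into [0, 1]
corrected = (255 * normalized ** gamma).astype(np.uint8)
print(corrected)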
Data Sets:
Dimensionality:
The dimensionality of a data set is the number of attributes that the objects in the
data set have. If a data set has a high number of attributes (also called
high dimensionality), it can become difficult to analyse.
Sparsity
For some data sets, such as those with asymmetric features, most attributes of an
object have values of 0; in many cases, fewer than 1% of the entries are non-zero.
Such data is called sparse data, or the data set is said to have sparsity.
Resolution
The patterns in the data depend on the level of resolution. If the resolution is too fine, a
pattern may not be visible or may be buried in noise; if the resolution is too coarse, the
pattern may disappear. For example, variations in atmospheric pressure on a scale of
hours reflect the movement of storms and other weather systems. On a scale of months,
such phenomena are not detectable.
Record Data:
The most basic form of record data has no explicit relationship among records or data
fields, and every record (object) has the same set of attributes. Record data is usually
stored either in flat files or in relational databases.
There are a few variations of Record Data, which have some characteristic properties.
1. Transaction or Market Basket Data: It is a special type of record data, in which
each record contains a set of items. For example, shopping in a supermarket or a
grocery store. For any particular customer, a record will contain a set of items
purchased by the customer in that respective visit to the supermarket or the grocery
store. This type of data is called market basket data. Transaction data is a
collection of sets of items, but it can be viewed as a set of records whose fields are
asymmetric attributes. Most often, the attributes are binary, indicating whether or not
an item was purchased (a sparse encoding of such data is sketched after this list).
2. The Data Matrix: If the data objects in a collection of data all have the same fixed
set of numeric attributes, then the data objects can be thought of as points
(vectors) in a multidimensional space, where each dimension represents a distinct
attribute describing the object. A set of such data objects can be interpreted as an m
x n matrix, where there are m rows, one for each object, and n columns, one for
each attribute. Standard matrix operations can be applied to transform and
manipulate the data. Therefore, the data matrix is the standard data format for most
statistical data.
3. The Sparse Data Matrix: A sparse data matrix (sometimes also called a
document-data matrix) is a special case of a data matrix in which the attributes are
of the same type and are asymmetric; i.e., only non-zero values are important.
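As noted under market basket data above, here is a short sketch contrasting a dense data matrix with a sparse binary matrix, using NumPy and SciPy on made-up rows:

import numpy as np
from scipy.sparse import csr_matrix

# Data matrix: m rows (objects) by n columns (numeric attributes).
data_matrix = np.array([[1.2, 0.5, 3.1],
                        [0.7, 2.2, 1.9]])

# Sparse binary matrix: rows are customers, columns are items (1 = purchased).
basket = csr_matrix(np.array([[1, 0, 0, 1, 0],
                              [0, 1, 0, 0, 0],
                              [1, 1, 0, 1, 0]]))
print(basket.nnz, "non-zero entries out of", np.prod(basket.shape))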
Graph-based Data
1. Data with Relationships among Objects: The relationships among objects
frequently convey important information; in such cases, the data is often
represented as a graph. For example, web pages on the World Wide Web contain
links to one another, and these links capture important relationships.
2. Data with Objects That Are Graphs: If objects have structure, that is, the objects
contain sub objects that have relationships, then such objects are frequently
represented as graphs. For example, the structure of chemical compounds can be
represented by a graph, where the nodes are atoms and the links between nodes
are chemical bonds.
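A tiny sketch of such an object, modeling a water molecule with plain Python dictionaries: atoms are nodes and bonds are edges.

atoms = {0: "O", 1: "H", 2: "H"}  # node id -> atom type
bonds = [(0, 1), (0, 2)]          # edges: the two O-H bonds

for a, b in bonds:
    print(f"{atoms[a]} bonded to {atoms[b]}")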
Ordered Data
For some types of data, the attributes have relationships that involve order in time or
space. It can be segregated into four types:
1. Sequential Data: Also referred to as temporal data, this can be thought of as an
extension of record data, where each record has a time associated with it. Consider
a retail transaction data set that also stores the time at which the transaction took
place.
2. Sequence Data: Sequence data consists of a data set that is a sequence of
individual entities, such as a sequence of words or letters. It is quite similar to
sequential data, except that there are no time stamps; instead, there are positions in
an ordered sequence.
3. Time Series Data: Time series data is a special type of sequential data in which
each record is a time series, i.e., a series of measurements taken over time. For
example, a financial data set might contain objects that are time series of the daily
prices of various stocks.
4. Spatial Data: Some objects have spatial attributes, such as positions or areas, as
well as other types of attributes. An example of spatial data is weather data
(precipitation, temperature, pressure) that is collected for a variety of geographical
locations.
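A small pandas sketch of two of these types, sequential data (time-stamped transactions) and time series data, with made-up rows:

import pandas as pd

# Sequential data: record data extended with a time for each record.
transactions = pd.DataFrame({
    "time":     pd.to_datetime(["2023-01-01 09:00", "2023-01-01 09:05"]),
    "customer": ["C1", "C2"],
    "items":    [["bread", "milk"], ["eggs"]],
})

# Time series data: a series of measurements indexed by time,
# such as daily closing prices of a stock.
prices = pd.Series([101.2, 102.5, 101.8],
                   index=pd.date_range("2023-01-02", periods=3, freq="B"))
print(transactions)
print(prices)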