UNIT 1 DVT
Introduction and Data Foundation: Basics - Relationship between Visualization and Other Fields
Pseudocode Conventions
Data Foundation
Types of Data
Data Pre-processing
Data Sets
Introduction:
What Is Visualization?
It helps you make data more meaningful by bringing it to life visually.
Data Visualization Techniques: There are various data visualization techniques and tools available, and the choice of
technique depends on the nature of the data and the message you want to convey.
Bar charts: Bar charts are used to compare categories of data; vertical and horizontal bar charts are the common types. They display the
data as bars of different heights.
Line charts: Line charts are great for showing trends over time. They connect data points with lines, making it easy to
visualize changes and patterns.
Pie charts: Pie charts display the parts of a whole. They are suitable for showing the composition of a data set.
Scatter plots: These are used to show the relationship between two variables. Each data point is represented by a dot.
Other techniques include heat maps, histograms, area charts, bubble charts, tree maps, and 3D visualizations.
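As a rough illustration, here is a minimal sketch (assuming Python with matplotlib; the values are made up) that draws a bar chart, a line chart, and a scatter plot side by side:

import matplotlib.pyplot as plt

categories = ["A", "B", "C"]              # hypothetical category labels
counts = [12, 7, 15]                      # hypothetical values per category
years = [2019, 2020, 2021, 2022]
revenue = [3.1, 3.8, 2.9, 4.4]            # hypothetical trend over time
x = [1.2, 2.5, 3.1, 4.8, 5.0]             # two related variables for the scatter plot
y = [2.0, 3.9, 3.5, 6.1, 6.4]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(categories, counts)           # bar chart: compare categories
axes[0].set_title("Bar chart")
axes[1].plot(years, revenue, marker="o")  # line chart: trend over time
axes[1].set_title("Line chart")
axes[2].scatter(x, y)                     # scatter plot: relationship between two variables
axes[2].set_title("Scatter plot")
plt.tight_layout()
plt.show()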
Visualization in Everyday Life:
a train and subway map with times used for determining train arrivals and departures;
a weather chart showing the movement of a storm front that might influence your weekend activities;
a graph of stock market activities that might indicate an upswing (or downturn) in the economy;
a plot comparing the effectiveness of your pain killer to that of the leading brand;
a mechanical and civil engineering rotary bridge design and systems analysis;
the study of actuarial data for confirming and guiding quantitative analysis;
Decision making
Comparative analysis
Importance
Data visualization allows business users to gain insight into their vast amounts of data. It helps them recognize
new patterns and errors in the data, and making sense of these patterns lets them focus on areas that indicate
red flags or progress. This process, in turn, drives the business ahead. Because of the way the human brain processes
information, using charts or graphs to visualize large amounts of complex data is easier than poring over spreadsheets
or reports. Data visualization can also identify areas that need attention or improvement.
Historically, visualization was considered a subfield of computer graphics, primarily because visualization uses graphics to display
information via images. Visualization applies graphical techniques to generate visual displays of data; here, graphics
is used as the communication medium. In all visualizations, one can clearly see the use of the graphics primitives
(points, lines, areas, and volumes). While computer graphics can be used to define and generate the displays that are
used to communicate the information, the sources of data and the way users interact with and perceive the data are all
important components to understand when presenting information. A secondary application of computer graphics is
in art and entertainment, with video games, cartoons, advertisements, and movie special effects as typical examples.
Visualization, on the other hand, does not emphasize visual realism as much as the effective communication of
information. Many types of visualizations do not deal with physical objects, and those that do are often
communicating attributes of the objects that would normally not be visible, such as material stress or fluid flow
patterns. Thus, while computer graphics and visualization share many concepts, tools, and techniques, the underlying
models and goals are fundamentally different. Computer graphics provides the tools that display the
visualizations discussed here, including the graphics-programming language and more.
Although during the 1990s and early 2000s the visualization community differentiated between scientific visualization
and information visualization, we do not. Both provide representations of data; however, the data sets are most often
different. Scientific visualization is frequently considered to focus on the visual display of spatial data associated
with scientific processes, such as the bonding of molecules in computational chemistry. Information visualization
examines the development of visual metaphors for non-inherently spatial data, such as the exploration of text-based
document databases.
Visualization is a powerful tool that helps to represent data and information in a graphical format.
It has connections to many fields, including computer science, data analytics, business intelligence, and more.
Data science and analytics: Visualization can be used to explore data, identify patterns and trends, and communicate the results of an
analysis to stakeholders.
Business intelligence: Visualization is a key component of business intelligence (BI). It can be used to create dashboards
and reports that provide insights into business performance.
Engineering: Visualization is used in engineering to design systems. It is used to create models of physical systems, simulate their
behavior, and identify potential problems.
Medicine: Visualization is used in medicine to diagnose diseases, plan surgeries, and track patient progress. It can be used to
create images of the body, such as X-rays, MRI scans, and ultrasounds.
Visualization is also useful in education and art; in education, for example, a teacher can use an animation to explain a process such as photosynthesis.
The visualization process is the sequence of steps involved in creating a visual representation of data.
Data visualization is the practice of translating information into a visual context, such as a map or graph, to make
data easier for the human brain to understand and pull insights from. The main goal of data visualization is to make
it easier to identify patterns, trends, and outliers in large data sets. The first step toward good data visualization is to
identify the problem you're trying to solve. More data isn't always better: what you need is the right data for the
right question, and choosing the best pieces of the puzzle to highlight relies on a solid understanding of what you
want to measure.
• Collect the relevant data from various sources, such as databases, sensors, and surveys.
Clean and prepare the data:
After you’ve identified your purpose and audience, you know what kind of data you need to
summarize. Armed with that knowledge, you can pick an appropriate chart type.
Choose the appropriate visualization tools and techniques based on the nature of your data and your objectives.
Design the visualization:
• Whenever possible, try to make sure important information is communicated in a way that doesn’t rely entirely
on color. This will allow your data to be understood by the broadest possible audience.
• Color is an important component of data visualization and is used extensively as a way to represent information
within a graphic.
Instead of just presenting your data, connect it to a broader context. Make sure it’s easy to interpret the information
quickly and apply it in the context of a particular challenge.
Finally, and perhaps most importantly, make sure the data you’re sharing is actionable. The best visualizations can
turn insights into action by allowing your audience to use data to inform their strategies and business decisions.
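To tie these steps together, here is a minimal, hedged sketch of the process, assuming Python with pandas and matplotlib; the file name sales.csv and the columns month and revenue are hypothetical:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")                       # collect data relevant to the question (hypothetical file)
df = df.dropna(subset=["month", "revenue"])         # clean and prepare: drop incomplete records
ax = df.plot(x="month", y="revenue", kind="line", marker="o")   # a line chart suits a trend over time
ax.set_title("Monthly revenue")                     # design: label the chart so it is easy to interpret
ax.set_ylabel("Revenue")
plt.savefig("monthly_revenue.png", dpi=150)         # share an annotated, actionable view with the audience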
The Visualization Process:
The process of visualizing directs your subconscious to be aware of the end goal you have in mind.
The designer of a new visualization most often begins with an analysis of the type of data available for display and of
the type of information the viewer hopes to extract from or convey with the display.
The data can come from a wide variety of sources and may be simple or complex in structure.
The computer graphics pipeline that produces such displays consists of the following stages:
Modeling. A three-dimensional model, consisting of planar polygons defined by vertices and surface properties, is
generated using a world coordinate system.
Viewing. A virtual camera is defined at a location in world coordinates, along with a direction and orientation
(generally given as vectors). All vertices are transformed into a viewing coordinate system based on the camera
parameters.
Clipping. By specifying the bounds of the desired image (usually given by corner positions on a plane of projection
placed in front of the camera), objects out of view can be removed, and those that are partially visible can be clipped.
Objects may be transformed into normalized viewing coordinates to simplify the clipping process. Clipping can
actually be performed at many different stages of the pipeline.
Hidden surface removal. Polygons facing away from the camera, or those obscured by others, are removed or
clipped. This process may be integrated into the projection process.
Projection. Three-dimensional polygons are projected onto the two-dimensional plane of projection, usually using a
perspective transformation. The results may be in a normalized 2D coordinate system or in device/screen coordinates.
Rendering. The actual color of the pixels associated with a visible polygon depends on a number of factors,
including the material properties being synthesized (base color, texture, surface roughness, shininess), the type(s),
location(s), color, and intensity of the light source(s), the degree of occlusion from direct light exposure, and the
amount and color of light being reflected off of other objects onto the polygon.
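As an illustration of the projection stage only, here is a small sketch (assuming Python with NumPy, a camera at the origin looking down the negative z-axis, and focal length f); it is not a full pipeline implementation:

import numpy as np

def project_perspective(vertices, f=1.0):
    """Project Nx3 viewing-space vertices onto the z = -f image plane."""
    v = np.asarray(vertices, dtype=float)
    x, y, z = v[:, 0], v[:, 1], v[:, 2]
    # Perspective divide: points farther from the camera shrink toward the center.
    return np.column_stack((f * x / -z, f * y / -z))

triangle = [(0.0, 1.0, -2.0), (-1.0, -1.0, -3.0), (1.0, -1.0, -4.0)]  # made-up viewing coordinates
print(project_perspective(triangle))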
The data/information visualization pipeline has some similarities to the graphics pipeline, at least on an abstract level.
The stages of this pipeline are as follows:
Data modelling. The data to be visualized, whether from a file or a database, has to be structured to facilitate its
visualization. The name, type, range, and semantics of each attribute or field of a data record must be available in a
format that ensures rapid access and easy modification
Data selection. Similar to clipping, data selection involves identifying the subset of the data that will potentially be
visualized. This can occur totally under user control or via algorithmic methods, such as cycling through time slices
or automatically detecting features of potential interest to the user.
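A minimal sketch of data modelling and data selection, assuming Python with pandas; the field names are hypothetical:

import pandas as pd

records = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-01", "2021-06-01", "2022-01-01"]),
    "station": ["A", "B", "A"],
    "temperature": [3.2, 18.5, 1.1],
})
records["station"] = records["station"].astype("category")   # data modelling: name, type, and semantics of each field
subset = records[records["timestamp"].dt.year == 2021]       # data selection: one time slice of the data
print(subset)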
The knowledge discovery (also called data mining) field has its own pipeline.Note that the visualization pipeline can
be overlaid on this knowledge discovery (KD) pipeline.
Data: In the KD pipeline there is more focus on the data itself, as the graphics and visualization processes often assume that
the data is already structured to facilitate its display.
Data integration, cleaning, warehousing, and selection: These involve identifying the various data sets that will potentially
be analyzed. Again, the user may participate in this step. This can involve filtering, sampling, subsetting, aggregating,
and other techniques that help curate and manage the data for the data mining step.
Data mining: The heart of the KD pipeline is algorithmically analyzing the data to produce a model.
Pattern evaluation: The resulting model or models must be evaluated to determine their robustness, stability,
precision, and accuracy.
Rendering or visualization: The specific results must be presented to the user. It does not matter whether we think
of this as part of the graphics or visualization pipelines.
Interactive visualization can be used at every step of the KD pipeline. One can think of this as computational
steering.
Pseudocode Conventions:
In our pseudocode, we aim to convey the essence of the algorithms at hand, while leaving out details required for user
interaction, graphics nuances, and data management.
We assume that the following global variables and functions exist in the environment of the pseudocode:
data—The working data table. This data table is assumed to contain only numeric values. In practice, dimensions of
the original data table that contain non-numeric values must be somehow converted to numeric values. When
visualizing a subset of the entire original data table, the working data table is assumed to be the subset.
m—The number of dimensions (columns) in the working data table. Dimensions are typically iterated over using j as the running dimension index.
n—The number of records (rows) in the working data table. Records are typically iterated over using i as the running
record index.
Normalize(record, dimension, min, max)—A function that maps the value for the given record and dimension in the working data table to a value between min
and max, or between zero and one if min and max are not specified.
The normalization is typically linear and local to a single dimension. However, in practice, code must be structured
such that various kinds of normalization could be used (logarithmic or square root, for example) either locally (using
the bounds of the current dimension), globally (using the bounds of all dimensions), or local to the active dimensions
(using the bounds of the dimensions being displayed). Also, in practice, one must accommodate multiple kinds of
normalization within a single visualization. For example, a scatterplot may require a linear normalization for the x-
axis and a logarithmic normalization for the y-axis.
Color(color)—A function that sets the color state of the graphics environment to the specified color (whose type is
assumed to be an integer containing RGB values).
MapColor(record, dimension)—A function that sets the color state of the graphics environment to be the color
derived from applying the global color map to the normalized value of the given record and dimension in the working
data table.
Circle(x, y, radius)—A function that fills a circle centered at the given (x, y)-location, with the given radius, with the
color of the color state of the graphics environment. The plotting space for all visualizations is the unit square. In
practice, this function must map the unit square to a square in pixel coordinates.
Polyline(xs, ys)—A function that draws a polyline (many connected line segments) from the given arrays of x and y
coordinates.
Polygon(xs, ys)—A function that fills the polygon defined by the given arrays of x- and y-coordinates with the color
of the current color state.
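One possible (not definitive) Python realization of these conventions, assuming matplotlib and NumPy, with a random working data table standing in for real data:

import numpy as np
import matplotlib.pyplot as plt

data = np.random.rand(50, 4)          # working data table: numeric values only (stand-in data)
n = data.shape[0]                     # number of records
ax = plt.gca()
ax.set_xlim(0, 1); ax.set_ylim(0, 1)  # the plotting space is the unit square
_color = "black"                      # color state of the "graphics environment"

def Normalize(record, dimension, lo=0.0, hi=1.0):
    col = data[:, dimension]          # linear normalization, local to one dimension
    t = (data[record, dimension] - col.min()) / (col.max() - col.min())
    return lo + t * (hi - lo)

def Color(color):
    global _color                     # set the color state
    _color = color

def Circle(x, y, radius):
    ax.add_patch(plt.Circle((x, y), radius, color=_color))   # filled circle in unit-square coordinates

def Polyline(xs, ys):
    ax.plot(xs, ys, color=_color)     # connected line segments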
The Scatterplot:
A scatterplot is a type of data display that shows the relationship between two numerical variables. Each member of the
data set gets plotted as a point whose (x, y) coordinates relate to its values for the two variables.
A scatter plot is composed of a horizontal axis containing the measured values of one variable (independent variable)
and a vertical axis representing the measurements of the other variable (dependent variable). The purpose of the scatter
plot is to display what happens to one variable when another variable is changed.
A scatter plot is a type of graph.
A scatter plot can be defined as a graph containing bivariate data in the form of plotted points, which allows viewers
to see a correlation between plotted points.
Bivariate data is simply data that has been collected for two different variables.
Scatter plots are sometimes called scatter diagrams. Other chart types that may be more familiar are line graphs, bar
graphs, box-and-whisker plots, or even picture graphs.
EX:
A simple scatter plot can be used to see the difference in outdoor temperatures compared to ice cream sales. The two
variables would be outside temperature and ice cream sales. This data could be collected and organized into a table.
Once the data is organized into a table, it can be turned into ordered pairs. The x-value will always be
the independent variable while the y-value will always be the dependent variable.
(50, 3), (65, 18), (70, 54), (85, 75), (100, 98)
Now that points have been created, they can be plotted to see what the scatter plot looks like. The independent
variable will go along the x-axis and the dependent variable will go along the y-axis.
Scatter plot showing Outside Temperature versus Ice Cream Cone Sales.
Being able to visualize the relationship between bivariate data gives us a lot of information.
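A small sketch that plots the five ordered pairs from the example, assuming Python with matplotlib:

import matplotlib.pyplot as plt

temperature = [50, 65, 70, 85, 100]    # independent variable, placed on the x-axis
sales = [3, 18, 54, 75, 98]            # dependent variable, placed on the y-axis

plt.scatter(temperature, sales)
plt.xlabel("Outside temperature")
plt.ylabel("Ice cream cone sales")
plt.title("Outside Temperature versus Ice Cream Cone Sales")
plt.show()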
Data Foundations:
A data foundation brings the varied streams of data an organization has into a truly unified, interoperable, and
accessible system.
Any business analytics or decision making will depend upon this data foundation, which has become the
organizational source of truth.
Every visualization starts with the data that is to be displayed, so a first step in addressing the design of a visualization is
to examine the characteristics of that data.
Data comes from many sources; it can be gathered from sensors or surveys, or it can be generated by simulations and
computations.
Data can be raw (untreated), or it can be derived from raw data via some process, such as smoothing, noise removal,
scaling, or interpolation. It also can have a wide range of characteristics and structures.
An independent variable iv_i is one whose value is not controlled or affected by another variable, such as the time
variable in a time-series data set.
A dependent variable dv_j is one whose value is affected by a variation in one or more associated independent
variables. Temperature for a region would be considered a dependent variable, as its value could be affected by
variables such as date, time, or location. Thus we can formally represent a record as
record = (iv_1, iv_2, ..., iv_mi, dv_1, dv_2, ..., dv_md),
where mi is the number of independent variables and md is the number of dependent variables in the record.
Types of Data:
1. Ordinal (numeric)
2. Nominal (nonnumeric)
Nominal values fall into three subtypes:
categorical—a value selected from a finite (often short) list of possibilities (e.g., red, blue, green);
ranked—a categorical variable that has an implied ordering (e.g., small, medium, large);
arbitrary—a variable with a potentially infinite range of values with no implied ordering (e.g., addresses).
Another useful characteristic is whether a distance metric exists, with which distances can be computed between different records. This measure is clearly
present in all ordinal variables, but is generally not found in nominal variables.
A related characteristic is the existence of an absolute zero, or fixed lowest value. This is useful for differentiating types of ordinal variables. A variable such as weight possesses an
absolute zero, while bank balance does not. A variable possesses an absolute zero if it makes sense to apply all four
mathematical operations (+, −, ×, ÷) to it [129].
Data sets have structure, both in terms of the means of representation (syntax ), and the types of interrelationships
within a given record and between records (semantics).
Scalar values, such as the cost of an item or the age of an individual, are often the focus for analysis and visualization.
Multiple variables within a single record can represent a composite data item.
For example, a point in a two-dimensional flow field might be represented by a pair of values, such as a displacement
in x and y. This pair, and any such composition, is referred to as a vector.
While each component of a vector might be examined individually, it is most common to treat the vector as a whole.
Scalars and vectors are simple variants on a more general structure known as a tensor.
A tensor is defined by its rank and by the dimensionality of the space within which it is defined. It is generally
represented as an array or matrix.
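A brief sketch of the scalar / vector / tensor distinction using NumPy arrays (the values are illustrative only):

import numpy as np

age = 42.0                               # scalar: a single value
displacement = np.array([0.8, -0.3])     # vector: (dx, dy) at a point in a 2D flow field
stress = np.array([[2.0, 0.5],           # rank-2 tensor in 2D: a 2x2 matrix
                   [0.5, 1.2]])
magnitude = np.linalg.norm(displacement) # vectors are usually treated as a whole, e.g., via their magnitude
print(age, magnitude, stress.shape)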
Geometric structure can commonly be found in data sets, especially those from scientific and engineering domains.
The simplest method of incorporating geometric structure in a data set is to have explicit coordinates for each data
record.
A grid can be used to organize graphic elements in relation to a page, in relation to other graphic elements on the
page, or in relation to other parts of the same graphic element or shape.
A grid is a set of intersecting horizontal and vertical lines defining columns and rows. Elements can be placed onto
the grid within these column and row lines.
Another important form of structure found within many data sets is that of topology. Examples of data sets and their typical structural characteristics include:
MRI (magnetic resonance imaging). Density (scalar), with three spatial attributes, 3D grid connectivity;
CFD (computational fluid dynamics). Three dimensions for displacement, with one temporal and three spatial
attributes, 3D grid connectivity (uniform or nonuniform);
CAD (computer-aided design). Three spatial attributes with edge and polygon connections, and surface properties;
Remote sensing. Multiple channels, with two or three spatial attributes, one temporal attribute, and grid connectivity;
Census. Multiple fields of all types, spatial attributes (e.g., addresses), temporal attribute, and connectivity implied by
similarities in fields;
Social Network. Nodes consisting of multiple fields of all types, with various connectivity attributes that could be
spatial, temporal, or dependent on other attributes, such as belonging to the same group or having some common
computed values.
Data Preprocessing
Data preprocessing is the process of transforming raw data into an understandable format.
Common preprocessing operations covered below include: metadata and statistics, handling missing values and data
cleansing, normalization, segmentation, sampling and interpolation, dimension reduction, mapping nominal dimensions
to numbers, smoothing and filtering, and raster-to-vector conversion.
Viewing raw data also often identifies problems in the data set, such as missing data, or outliers that may be the
result of errors in computation or input.
Depending on the type of data and the visualization techniques to be applied, however, some forms of preprocessing
might be necessary.
Metadata are data that describe other data. Thus, statistical metadata are data that describe statistical data.
Statistical metadata may also describe processes that collect, process, or produce statistical data;
Information regarding a data set of interest (its metadata) and statistical analysis can provide invaluable guidance in
preprocessing the data.
Metadata may provide information that can help in its interpretation, such as the format of individual fields within the
data records.
It may also contain the base reference point from which some of the data fields are measured, the units used in the
measurements, the symbol or number used to indicate a missing value and the resolution at which measurements were
acquired.
One of the realities of analyzing and visualizing “real” data sets is that they often are missing some data entries or
have erroneous entries.
Missing data may be caused by several reasons, including, for example, a malfunctioning sensor, a blank entry on a
survey, or an omission on the part of the person entering the data.
Erroneous data is most often caused by human error and can be difficult to detect.
In either case, the data analyst must choose a strategy for dealing with these common events.
Some of these strategies, specifically those that are commonly used in data visualization, are outlined below
Discard the bad record: This seemingly drastic measure, namely to throw away any data record containing a
missing or erroneous field, is actually one of the most commonly applied, since the quality of the remaining data
entries in that record may be in question.
Assign a sentinel value. Another popular strategy is to have a designated sentinel value for each variable in the data
set that can be assigned when the real value in a record is in question.
Assign the average value. A simple strategy for dealing with bad or missing data is to replace it with the average
value for that variable or dimension.
Assign value based on nearest neighbor. A better approximation for a substitute value is to find the record that has
the highest similarity with the record in question, based on analyzing the differences in all other variables. The basic
idea here is that if record A is missing an entry for variable i, and record B is closer than any other record to A
without considering variable i, then using the value of variable i from record B as a substitute in A is a reasonable
assumption.
Compute a substitute value. Researchers in multivariate statistics have dedicated a significant amount of energy to
developing methods for generating values to replace missing or erroneous data. The process, known as imputation,
seeks to find values that have high statistical confidence.
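The strategies above might be sketched as follows, assuming Python with pandas and scikit-learn (the column names are hypothetical, and KNNImputer merely stands in for a nearest-neighbor substitution):

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [23, np.nan, 41, 35], "income": [48, 52, np.nan, 61]})

discarded = df.dropna()                           # discard the bad record
sentinel = df.fillna(-999)                        # assign a sentinel value
averaged = df.fillna(df.mean(numeric_only=True))  # assign the average value
nearest = pd.DataFrame(KNNImputer(n_neighbors=1).fit_transform(df),
                       columns=df.columns)        # assign value based on nearest neighbor
print(nearest)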
4. Normalization:
Normalization is the process of transforming a data set so that the results satisfy a particular statistical property.
A simple example of this is to transform the range of values a particular variable assumes so that all numbers fall
within the range of 0.0 to 1.0.
Other forms of normalization convert the data such that each dimension has a common mean and standard deviation.
Normalization is a useful operation, since it allows us to compare seemingly unrelated variables. To display data graphically,
we also need to convert the data range to be compatible with the graphical attribute range.
For example, if dmin and dmax are the minimum and maximum values for a particular data variable, we can
normalize the values to the range of 0.0 to 1.0 using the formula
dnormalized = (doriginal − dmin)/(dmax − dmin).
If the data has a highly non-linear distribution, a linear normalization will map most values to the same or close-by
values. In this case, it may be more appropriate to perform a non-linear normalization, such as a square root mapping.
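A small sketch of this normalization, assuming Python with NumPy, including a square-root variant for skewed data:

import numpy as np

def normalize(d, nonlinear=False):
    d = np.asarray(d, dtype=float)
    if nonlinear:
        d = np.sqrt(d - d.min())                   # compress a highly skewed distribution first
    return (d - d.min()) / (d.max() - d.min())     # d_normalized = (d - d_min) / (d_max - d_min)

print(normalize([10, 12, 15, 400]))                # linear: most values crowd near 0
print(normalize([10, 12, 15, 400], True))          # square-root mapping spreads them out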
5. Segmentation:
In data preprocessing, segmentation is the process of dividing data into regions or categories that share a common
classification. (The same term is used in marketing for dividing a target market into groups of potential customers with similar needs and behaviors.)
In many situations, the data can be separated into contiguous regions, where each region corresponds to a particular
classification of the data .For example, an MRI data set might originally have 256 possible values for each data point,
and then be segmented into specific categories, such as bone, muscle, fat, and skin . Simple segmentation can be
performed by just mapping disjoint ranges of the data values to specific categories.
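A minimal sketch of such range-based segmentation, assuming Python with NumPy; the threshold values and tissue labels are purely illustrative:

import numpy as np

values = np.array([12, 80, 150, 230])                 # raw intensities in an 8-bit (0..255) MRI-like data set
bins = [50, 120, 200]                                  # hypothetical boundaries between categories
labels = np.array(["skin", "fat", "muscle", "bone"])
categories = labels[np.digitize(values, bins)]         # map disjoint value ranges to categories
print(categories)                                      # ['skin' 'fat' 'muscle' 'bone']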
6. Sampling and Subsetting:
Often it is necessary to transform a data set with one spatial resolution into another data set with a different spatial
resolution. For example, we might have an image we would like to shrink or expand, or we might have only a small
sampling of data points and wish to fill in values for locations between our samples. In each case, we assume that the
data we possess is a discrete sampling of a continuous phenomenon, and therefore we can predict the values at
another location by examining the actual data nearest to it.
The process of interpolation is a commonly used resampling method in many fields, including visualization.
Some common techniques include the following:
1. Linear interpolation
2. Bilinear interpolation
3. Nonlinear interpolation
Sampling means selecting the group that you will actually collect data from in your research.
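As an illustration of resampling, here is a minimal linear-interpolation sketch assuming Python with NumPy (the sample values are made up):

import numpy as np

x_known = np.array([0.0, 1.0, 2.0, 3.0])       # locations of the discrete samples
y_known = np.array([10.0, 14.0, 11.0, 18.0])   # measured values at those locations
x_new = np.linspace(0.0, 3.0, 13)              # a finer spatial resolution
y_new = np.interp(x_new, x_known, y_known)     # linear interpolation between neighboring samples
print(y_new[:5])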
7. Dimension Reduction:
In situations where the dimensionality of the data exceeds the capabilities of the visualization technique, it is necessary
to investigate ways to reduce the data dimensionality, while at the same time preserving, as much as possible, the
information contained within it. This can be done manually by allowing the user to select the dimensions deemed most
important, or via computational techniques, such as principal component analysis (PCA) [385], multidimensional scaling
(MDS) [259], Kohonen self-organizing maps (SOMs) [248], and Local Linear Embedding (LLE) [350].
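A brief sketch of dimension reduction with PCA, assuming Python with scikit-learn and random stand-in data:

import numpy as np
from sklearn.decomposition import PCA

records = np.random.rand(100, 4)                 # 100 records, 4 dimensions (stand-in data)
reduced = PCA(n_components=2).fit_transform(records)
print(reduced.shape)                             # (100, 2): now displayable as a 2D scatterplot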
In many domains, one or more of the data dimensions consist of nominal values.
We may have several alternative strategies for handling these dimensions within our visualizations,
depending on how many nominal dimensions there are, how many distinct values each variable can take on, and
whether an ordering or distance relation is available or can be derived.
The key is to find a mapping of the data to a graphical entity or attribute that doesn’t introduce artificial relationships
that don’t exist in the data.
One way to display nominal variables using numeric displays is to map the nominal values to numbers, i.e.,
assigning order and spacing to the nominal values.
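A small sketch of such a mapping, assuming Python with pandas; note that the assigned order and spacing are artificial:

import pandas as pd

colors = pd.Series(["red", "blue", "green", "blue"], dtype="category")
codes = colors.cat.codes        # blue=0, green=1, red=2 (alphabetical, hence arbitrary ordering)
print(list(codes))              # [2, 0, 1, 0]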
A typical way to perform smoothing and filtering is through a process known as convolution, which for our purposes can be viewed
as a weighted averaging of the neighbors surrounding a data point.
Mean filtering is a simple method of smoothing and diminishing noise in images by eliminating pixel values that are
unrepresentative of their surroundings.
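A minimal mean-filtering sketch, assuming Python with NumPy and SciPy and a random stand-in image:

import numpy as np
from scipy.ndimage import convolve

image = np.random.rand(64, 64)             # noisy stand-in image
kernel = np.full((3, 3), 1.0 / 9.0)        # equal weights: a weighted average over the 3x3 neighborhood
smoothed = convolve(image, kernel, mode="nearest")
print(image.std(), smoothed.std())         # variation (noise) is diminished after filtering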
In spatial data visualization, our objects can be points or regions, or they can be linear structures, such as a road on a
map.
It is sometimes useful to take a raster-based data set, such as an image, and extract linear structures from it.
The image processing and computer vision fields have developed a wide assortment of techniques for converting
raster images into vertex- and edge-based models [153, 371].
1. Thresholding.
2. Region-growing
3. Boundary-detection.
4. Thinning
Thresholding: Identify one or more values with which to break the data into regions, after which the boundaries can
be traced to generate the edges and vertices.
Region-growing: Starting with seed locations, either selected by a human observer or computed via scanning of the
data, merge pixels into clusters if they are sufficiently similar to any neighboring point that has been assigned to a
cluster associated with one of the seed pixels.
Boundary-detection: Compute a new image from the existing image by convolving the image with a particular
pattern matrix.
Thinning: The convolution process mentioned above can also be used to perform a process called thinning, where the
goal is to reduce wide linear features, such as arteries, to a single pixel in width.
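A minimal thresholding sketch, assuming Python with NumPy and a random stand-in image:

import numpy as np

image = np.random.rand(8, 8)          # grayscale stand-in raster
threshold = 0.5                       # a single value chosen to break the data into regions
regions = image > threshold           # Boolean mask: the two regions whose boundaries could then be traced
print(regions.astype(int))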
Data Sets:
A data set is a structured collection of data points related to a particular subject, often presented in tabular form.
While a visualization may be appreciated without understanding the data being displayed, in general its effectiveness
is enhanced when the user has some context for interpreting what is being shown.
Data sets can hold information such as medical records or insurance records, to be used by a program running on the
system.
Data sets are also used to store information needed by applications or the operating system itself, such as source
programs, macro libraries, or system variables or parameters.